Explorations in Numerical Analysis

James V. Lambers and Amber C. Sumner

December 20, 2016



Copyright by

James V. Lambers and Amber C. Sumner

2016
Preface

This book evolved from lecture notes written by James Lambers and used in undergraduate numeri-
cal analysis courses at the University of California at Irvine, Stanford University and the University
of Southern Mississippi. It is written for a year-long sequence of numerical analysis courses for ei-
ther advanced undergraduate or beginning graduate students. Part II is suitable for a semester-long
first course on numerical linear algebra.
The goal of this book is to introduce students to numerical analysis from both a theoretical
and practical perspective, in such a way that these two perspectives reinforce each other. It is
not assumed that the reader has prior programming experience. As mathematical concepts are
introduced, code is used to illustrate them. As algorithms are developed from these concepts, the
reader is invited to traverse the path from pseudocode to code.
Coding examples throughout the book are written in Matlab. Matlab has been a vital tool
throughout the numerical analysis community since its creation thirty years ago, and its vector- and
matrix-oriented syntax greatly accelerates the prototyping of algorithms compared
to other programming environments.
The authors are indebted to the students in the authors’ MAT 460/560 and 461/561 courses,
taught in 2015-16, who were subjected to an early draft of this book.

J. V. Lambers
A. C. Sumner

Contents

I Preliminaries 1

1 What is Numerical Analysis? 3


1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Systems of Linear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Polynomial Interpolation and Approximation . . . . . . . . . . . . . . . . . . 6
1.1.4 Numerical Differentiation and Integration . . . . . . . . . . . . . . . . . . . . 6
1.1.5 Nonlinear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.6 Initial Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.7 Boundary Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Getting Started with MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 Basic Mathematical Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.2 Obtaining Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.3 Basic Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.4 Storage of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.5 Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.6 Creating Special Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . 12
1.2.7 Transpose Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.8 if Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.9 for Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.10 while Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.11 Function M-files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.12 Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.2.13 Polynomial Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.14 Number Formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2.15 Inline and Anonymous Functions . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2.16 Saving and Loading Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.2.17 Using Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3 How to Use This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.4.1 Sources of Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.4.2 Error Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.3 Forward and Backward Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4.4 Conditioning and Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


1.4.5 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.5 Computer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.5.1 Floating-Point Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.5.2 Issues with Floating-Point Arithmetic . . . . . . . . . . . . . . . . . . . . . . 38
1.5.3 Loss of Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

II Numerical Linear Algebra 43

2 Methods for Systems of Linear Equations 45


2.1 Triangular Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.1.1 Upper Triangular Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.1.2 Diagonal Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.1.3 Lower Triangular Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2.1 Row Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2.2 The LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.2.3 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.3 Estimating and Improving Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.3.1 The Condition Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.3.2 Iterative Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.3.3 Scaling and Equilibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.4 Special Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.4.1 Banded Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.4.2 Symmetric Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.4.3 Symmetric Positive Definite Matrices . . . . . . . . . . . . . . . . . . . . . . . 73
2.5 Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.5.1 Stationary Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2.5.2 Krylov Subspace Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

3 Least Squares Problems 91


3.1 The Full Rank Least Squares Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.1.1 Derivation of the Normal Equations . . . . . . . . . . . . . . . . . . . . . . . 91
3.1.2 Solving the Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.1.3 The Condition Number of AT A . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.2 The QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.2.1 Gram-Schmidt Orthogonalization . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.2.2 Classical Gram-Schmidt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.2.3 Modified Gram-Schmidt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.2.4 Householder Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.2.5 Givens Rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.3 Rank-Deficient Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.3.1 QR with Column Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.3.2 Complete Orthogonal Decomposition . . . . . . . . . . . . . . . . . . . . . . . 113
3.3.3 The Pseudo-Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

3.3.4 Perturbation Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116


3.4 The Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.4.1 Existence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.4.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.4.4 Minimum-norm least squares solution . . . . . . . . . . . . . . . . . . . . . . 120
3.4.5 Closest Orthogonal Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.4.6 Other Low-Rank Approximations . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.5 Least Squares with Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.5.1 Linear Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.5.2 Quadratic Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
3.6 Total Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

4 Eigenvalue Problems 127


4.1 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.1.1 Definitions and Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.1.2 Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.1.3 Perturbation Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.2 Power Iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.2.1 The Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.2.2 Orthogonal Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.2.3 Inverse Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.3 The QR Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.3.1 Hessenberg Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.3.2 Shifted QR Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.3.3 Computation of Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.4 The Symmetric Eigenvalue Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.4.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.4.2 Perturbation Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.4.3 Rayleigh Quotient Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.4.4 The Symmetric QR Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.5 The SVD Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.6 Jacobi Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.6.1 The Jacobi Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.6.2 The 2-by-2 Symmetric Schur Decomposition . . . . . . . . . . . . . . . . . . . 153
4.6.3 The Classical Jacobi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.6.4 The Cyclic-by-Row Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.6.5 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.6.6 Parallel Jacobi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.6.7 Jacobi SVD Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

III Data Fitting and Function Approximation 157

5 Polynomial Interpolation 159


5.1 Existence and Uniqueness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.2 Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.3 Divided Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.3.1 Newton Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.3.2 Computing the Newton Interpolating Polynomial . . . . . . . . . . . . . . . . 167
5.3.3 Equally Spaced Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
5.4 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.4.1 Error Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.4.2 Chebyshev Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.5 Osculatory Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.5.1 Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.5.2 Divided Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.6 Piecewise Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
5.6.1 Piecewise Linear Approximation . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.6.2 Cubic Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

6 Approximation of Functions 193


6.1 Discrete Least Squares Approximations . . . . . . . . . . . . . . . . . . . . . . . . . 193
6.2 Continuous Least Squares Approximation . . . . . . . . . . . . . . . . . . . . . . . . 201
6.2.1 Orthogonal Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.2.2 Construction of Orthogonal Polynomials . . . . . . . . . . . . . . . . . . . . . 204
6.2.3 Legendre Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.2.4 Chebyshev Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.2.5 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
6.2.6 Roots of Orthogonal Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 210
6.3 Rational Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.3.1 Continued Fraction Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.3.2 Chebyshev Rational Approximation . . . . . . . . . . . . . . . . . . . . . . . 215
6.4 Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.4.1 Fourier Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.4.2 The Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.4.3 The Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.4.4 Convergence and Gibbs’ Phenomenon . . . . . . . . . . . . . . . . . . . . . . 223

7 Differentiation and Integration 225


7.1 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
7.1.1 Taylor Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
7.1.2 Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.1.3 Higher-Order Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.1.4 Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.1.5 Differentiation Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.2 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

7.2.1 Quadrature Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232


7.2.2 Interpolatory Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.2.3 Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
7.3 Newton-Cotes Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
7.3.1 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.3.2 Higher-Order Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
7.4 Composite Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.4.1 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.5 Gaussian Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
7.5.1 Direct Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
7.5.2 Orthogonal Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
7.5.3 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.5.4 Other Weight Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
7.5.5 Prescribing Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
7.6 Extrapolation to the Limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
7.6.1 Richardson Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
7.6.2 The Euler-Maclaurin Expansion . . . . . . . . . . . . . . . . . . . . . . . . . 251
7.6.3 Romberg Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
7.7 Adaptive Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
7.8 Multiple Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
7.8.1 Double Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
7.8.2 Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266

IV Nonlinear and Differential Equations 269

8 Zeros of Nonlinear Functions 271


8.1 Nonlinear Equations in One Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
8.1.1 Existence and Uniqueness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
8.1.2 Sensitivity of Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
8.2 The Bisection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
8.3 Fixed-Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
8.3.1 Successive Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
8.3.2 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
8.3.3 Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
8.4 Newton’s Method and the Secant Method . . . . . . . . . . . . . . . . . . . . . . . . 287
8.4.1 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
8.4.2 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
8.4.3 The Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
8.5 Convergence Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
8.6 Systems of Nonlinear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
8.6.1 Fixed-Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
8.6.2 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
8.6.3 Broyden’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305

9 Initial Value Problems 309


9.1 Existence and Uniqueness of Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 311
9.2 One-Step Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
9.2.1 Euler’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
9.2.2 Solving IVPs in Matlab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
9.2.3 Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
9.2.4 Implicit Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
9.3 Multistep Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
9.3.1 Adams Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
9.3.2 Predictor-Corrector Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
9.3.3 Backward Differentiation Formulae . . . . . . . . . . . . . . . . . . . . . . . . 322
9.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
9.4.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
9.4.2 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
9.4.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
9.4.4 Stiff Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
9.5 Adaptive Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
9.5.1 Error Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
9.5.2 Adaptive Time-Stepping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
9.6 Higher-Order Equations and Systems of Differential Equations . . . . . . . . . . . . 338
9.6.1 Systems of First-Order Equations . . . . . . . . . . . . . . . . . . . . . . . . . 338
9.6.2 Higher-Order Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

10 Two-Point Boundary Value Problems 343


10.1 The Shooting Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
10.1.1 Linear Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
10.1.2 Nonlinear Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
10.2 Finite Difference Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
10.2.1 Linear Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
10.2.2 Nonlinear Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
10.3 Collocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
10.4 The Finite Element Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
10.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365

V Appendices 367

A Review of Calculus 369


A.1 Limits and Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
A.1.1 Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
A.1.2 Limits of Functions of Several Variables . . . . . . . . . . . . . . . . . . . . . 371
A.1.3 Limits at Infinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
A.1.4 Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
A.1.5 The Intermediate Value Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 373
A.2 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374

A.2.1 Differentiability and Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . 375


A.3 Extreme Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
A.4 Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
A.5 The Mean Value Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
A.5.1 The Mean Value Theorem for Integrals . . . . . . . . . . . . . . . . . . . . . 379
A.6 Taylor’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380

B Review of Linear Algebra 385


B.1 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
B.2 Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
B.3 Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
B.4 Linear Independence and Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
B.5 Linear Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
B.5.1 The Matrix of a Linear Transformation . . . . . . . . . . . . . . . . . . . . . 389
B.5.2 Matrix-Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
B.5.3 Special Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
B.6 Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
B.7 Other Fundamental Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 393
B.7.1 Vector Space Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
B.7.2 The Transpose of a Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
B.7.3 Inner and Outer Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
B.7.4 Hadamard Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
B.7.5 Kronecker Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
B.7.6 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
B.8 Understanding Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . 396
B.8.1 The Identity Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
B.8.2 The Inverse of a Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
B.9 Triangular and Diagonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
B.10 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
B.11 Vector and Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
B.11.1 Vector Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
B.11.2 Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
B.12 Function Spaces and Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
B.13 Inner Product Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
B.14 Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
B.15 Differentiation of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
List of Figures

1.1 The dotted red curves demonstrate polynomial interpolation (left plot) and least-
squares approximation (right plot) applied to f (x) = 1/(1 + x2 ) (blue solid curve). . 7
1.2 Screen shot of Matlab at startup in Mac OS X . . . . . . . . . . . . . . . . . . . . 10
1.3 Figure for Exercise 1.2.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.1 The function f (x) = 1/(1 + x2 ) (solid curve) cannot be interpolated accurately on
[−5, 5] using a tenth-degree polynomial (dashed curve) with equally-spaced interpo-
lation points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.2 Cubic spline passing through the points (0, 3), (1/2, −4), (1, 5), (2, −6), and (3, 7). . 190

6.1 Data points (xi , yi ) (circles) and least-squares line (solid line) . . . . . . . . . . . . . 196
6.2 Data points (xi , yi ) (circles) and quadratic least-squares fit (solid curve) . . . . . . . 199
6.3 Graphs of f (x) = ex (red dashed curve) and 4th-degree continuous least-squares
polynomial approximation f4 (x) on [0, 5] (blue solid curve) . . . . . . . . . . . . . . 203
6.4 Graph of cos x (solid blue curve) and its continuous least-squares quadratic approx-
imation (red dashed curve) on (−π/2, π/2) . . . . . . . . . . . . . . . . . . . . . . . 208
6.5 (a) Left plot: noisy signal (b) Right plot: discrete Fourier transform . . . . . . . . . 221
6.6 Aliasing effect on noisy signal: coefficients fˆ(ω), for ω outside (−63, 64), are added
to coefficients inside this interval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

7.1 Graph of f (x) = e3x sin 2x on [0, π/4], with quadrature nodes from Example 7.7.2
shown on the graph and on the x-axis. . . . . . . . . . . . . . . . . . . . . . . . . . . 261

8.1 Left plot: Well-conditioned problem of solving f (x) = 0. f 0 (x∗ ) = 24, and an
approximate solution ŷ = f −1 (ε) has small error relative to ε. Right plot: Ill-
conditioned problem of solving f (x) = 0. f 0 (x∗ ) = 0, and ŷ has large error relative
to ε. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
8.2 Illustrations of the Intermediate Value Theorem. Left plot: f (x) = x − cos x has a
unique root on [0, π/2]. Right plot: g(x) = ex cos(x2 ) has multiple roots on [0, π]. . . 274
8.3 Because f (π/4) > 0, f (x) has a root in (0, π/4). . . . . . . . . . . . . . . . . . . . . . 275
8.4 Progress of the Bisection method toward finding a root of f (x) = x − cos x on (0, π/2) 277
8.5 Fixed-point Iteration applied to g(x) = cos x + 2. . . . . . . . . . . . . . . . . . . . . 286
8.6 Approximating a root of f (x) = x − cos x using the tangent line of f (x) at x0 = 1. . 289


8.7 Newton’s Method used to compute the reciprocal of 8 by solving the equation f (x) =
8 − 1/x = 0. When x0 = 0.1, the tangent line of f (x) at (x0 , f (x0 )) crosses the x-axis
at x1 = 0.12, which is close to the exact solution. When x0 = 1, the tangent line
crosses the x-axis at x1 = −6, which causes searching to continue on the wrong
portion of the graph, so the sequence of iterates does not converge to the correct
solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
8.8 Newton’s Method applied to f (x) = x2 − 2. The bold curve is the graph of f . The
initial iterate x0 is chosen to be 1. The tangent line of f (x) at the point (x0 , f (x0 ))
is used to approximate f (x), and it crosses the x-axis at x1 = 1.5, which is much
closer to the exact solution than x0 . Then, the tangent line at (x1 , f (x1 )) is used
to approximate f (x), and it crosses the x-axis at x2 = 1.416̄, which is already very
close to the exact solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

9.1 Solutions of y 0 = −2ty, y(0) = 1 on [0, 1], computed using Euler’s method and the
fourth-order Runge-Kutta method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

10.1 Left plot: exact (solid curve) and approximate (dashed curve with circles) solutions
of the BVP (10.8) computed using finite differences. Right plot: error in the approx-
imate solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
10.2 Exact (solid curve) and approximate (dashed curve with circles) solutions of the
BVP (10.11) from Example 10.2.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
10.3 Exact (blue curve) and approximate (dashed curve) solutions of (10.18), (10.19) from
Example 10.3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
10.4 Piecewise linear basis functions φj (x), as defined in (10.25), for j = 1, 2, 3, 4, with
N =4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
10.5 Exact (solid curve) and approximate (dashed curve) solutions of (10.22), (10.23) with
f (x) = x and N = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
List of Tables

6.1 Data points (xi , yi ), for i = 1, 2, . . . , 10, to be fit by a linear function . . . . . . . . . 195
6.2 Data points (xi , yi ), for i = 1, 2, . . . , 10, to be fit by a quadratic function . . . . . . . 198
6.3 Data points (xi , yi ), for i = 1, 2, . . . , 5, to be fit by an exponential function . . . . . . 200

Part I

Preliminaries

Chapter 1

What is Numerical Analysis?

1.1 Overview
This book provides a comprehensive introduction to the subject of numerical analysis, which is the
study of the design, analysis, and implementation of numerical algorithms for solving mathematical
problems that arise in science and engineering. These numerical algorithms differ from the analytical
methods that are presented in other mathematics courses, in that they rely exclusively on the four
basic arithmetic operations, addition, subtraction, multiplication and division, so that they can be
implemented on a computer.
The goal in numerical analysis is to develop numerical methods that are effective, in terms of
the following criteria:
• A numerical method must be accurate. While this seems like common sense, careful consid-
eration must be given to the notion of accuracy. For a given problem, what level of accuracy
is considered sufficient? As will be discussed in Section 1.4, there are many sources of error.
As such, it is important to question whether it is prudent to expend resources to reduce one
type of error, when another type of error is already more significant. This will be illustrated
in Section 7.1.
• A numerical method must be efficient. Although computing power has been rapidly increasing
in recent decades, this has resulted in expectations of solving larger-scale problems. Therefore,
it is essential that numerical methods produce approximate solutions with as few arithmetic
operations or data movements as possible. Efficiency is not only important in terms of time;
memory is still a finite resource and therefore algorithms must also aim to minimize data
storage needs.
• A numerical method must be robust. A method that is highly accurate and efficient for some
(or even most) problems, but performs poorly on others, is unreliable and therefore not likely
to be used in applications, even if any alternative is not as accurate and efficient. The user
of a numerical method needs to know that the result produced can be trusted.
These criteria should be balanced according to the requirements of the application. For example,
if less accuracy is acceptable, then greater efficiency can be achieved. This can be the case, for
example, if there is so much uncertainty in the underlying mathematical model that there is no
point in obtaining high accuracy.


1.1.1 Error Analysis


Numerical analysis is employed to develop algorithms for solving problems that arise in other areas
of mathematics, such as calculus, linear algebra, or differential equations. Of course, these areas
already include algorithms for solving such problems, but these algorithms are analytical methods.
Examples of analytical methods are:

• Applying the Fundamental Theorem of Calculus to evaluate a definite integral,

• Using Gaussian elimination, with exact arithmetic, to solve a system of linear equations, and

• Using the method of undetermined coefficients to solve an inhomogeneous ordinary differential


equation.

Such analytical methods have the benefit that they yield exact solutions, but the drawback is that
they can only be applied to a limited range of problems. Numerical methods, on the other hand, can
be applied to a much wider range of problems, but only yield approximate solutions. Fortunately, in
many applications, one does not necessarily need very high accuracy, and even when such accuracy
is required, it can still be obtained, if one is willing to expend the extra computational effort (or,
really, have a computer do so).
Because solutions produced by numerical algorithms are not exact, we will begin our exploration
of numerical analysis with one of its most fundamental concepts, which is error analysis. Numerical
algorithms must not only be efficient, but they must also be accurate and robust. In other words,
the solutions they produce are at best approximate solutions because an exact solution cannot
be computed by analytical techniques. Furthermore, these computed solutions should not be too
sensitive to the input data, because if they are, any error in the input can result in a solution that
is essentially useless. Such error can arise from many sources, such as

• neglecting components of a mathematical model or making simplifying assumptions in the


model,

• discretization error, which arises from approximating continuous functions by sets of discrete
data points,

• convergence error, which arises from truncating a sequence of approximations that is meant
to converge to the exact solution, to make computation possible, and

• roundoff error, which is due to the fact that computers represent real numbers approximately,
in a fixed amount of storage in memory.

We will see that in some cases, these errors can be surprisingly large, so one must be careful
when designing and implementing numerical algorithms. Section 1.4 will introduce fundamental
concepts of error analysis that will be used throughout this book, and Section 1.5 will discuss
computer arithmetic and roundoff error in detail.

1.1.2 Systems of Linear Equations


Next, we will learn about how to solve a system of linear equations
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
⋮
an1 x1 + an2 x2 + · · · + ann xn = bn ,
which can be more conveniently written in matrix-vector form
Ax = b,
where A is an n × n matrix, because the system has n equations (corresponding to rows of A) and
n unknowns (corresponding to columns).
To solve a general system with n equations and unknowns, we can use Gaussian elimination
to reduce the system to upper-triangular form, which is easy to solve. In some cases, this process
requires pivoting, which entails interchanging of rows or columns of the matrix A. Gaussian elimi-
nation with pivoting can be used not only to solve a system of equations, but also to compute the
inverse of a matrix, even though this is not normally practical. It can also be used to efficiently
compute the determinant of a matrix.
Gaussian elimination with pivoting can be viewed as a process of factorizing the matrix A.
Specifically, it achieves the decomposition
P A = LU,
where P is a permutation matrix that describes any row interchanges, L is a lower-triangular
matrix, and U is an upper-triangular matrix. This decomposition, called the LU decomposition, is
particularly useful for solving Ax = b when the right-hand side vector b varies. We will see that
for certain special types of matrices, such as those that arise in the normal equations, variations of
the general approach to solving Ax = b can lead to improved efficiency.
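To preview how this looks in Matlab (a hedged sketch with made-up numbers; the built-in lu
function and the backslash operator are discussed in more detail later), the factorization can be
computed once and then reused for several right-hand sides:

>> A=[ 4 3; 6 3 ];
>> [L,U,P]=lu(A);              % produces P*A = L*U
>> b1=[ 10; 12 ]; b2=[ 1; 0 ];
>> x1=U\(L\(P*b1))             % forward substitution, then back substitution
>> x2=U\(L\(P*b2))             % the factorization of A is reused, not recomputed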
Gaussian elimination and related methods are called direct methods for solving Ax = b, because
they compute the exact solution (up to roundoff error, which can be significant in some cases) in
a fixed number of operations that depends on n. However, such methods are often not practical,
especially when A is very large, or when it is sparse, meaning that most of its entries are equal to
zero. Therefore, we also consider iterative methods. Two general classes of iterative methods are:
• stationary iterative methods, which can be viewed as fixed-point iterations, and rely primarily
on splittings of A to obtain a system of equations that can be solved rapidly in each iteration,
and
• non-stationary methods, which tend to rely on matrix-vector multiplication in each iteration
and a judicious choice of search direction and linesearch to compute each iterate from the
previous one.
We will also consider systems of equations, for which the number of equations, m, is greater
than the number of unknowns, n. This is the least-squares problem, which is reduced to a system
with n equations and unknowns,
AT Ax = AT b,

called the normal equations. While this system can be solved directly using methods discussed
above, this can be problematic due to sensitivity to roundoff error. We therefore explore other
approaches based on orthogonalization of the columns of A.
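As a hedged illustration (with made-up data), the normal equations can be formed and solved
directly in Matlab, or the backslash operator can be applied to the rectangular system, in which
case the least squares problem is solved via an orthogonalization (QR) approach:

>> A=[ 1 1; 1 2; 1 3; 1 4 ];
>> b=[ 1.1; 1.9; 3.2; 3.9 ];
>> x1=(A'*A)\(A'*b)            % solve the normal equations
>> x2=A\b                      % backslash: QR-based least squares solution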
Another fundamental problem from linear algebra is the solution of the eigenvalue problem

Ax = λx,

where the scalar λ is called an eigenvalue and the nonzero vector x is called an eigenvector. This
problem has many applications throughout applied mathematics, including the solution of differen-
tial equations and statistics. We will see that the tools developed for efficient and robust solution
of least squares problems are useful for the eigenvalue problem as well.

1.1.3 Polynomial Interpolation and Approximation


Polynomials are among the easiest functions to work with, because it is possible to evaluate them,
as well as perform operations from calculus on them, with great efficiency. For this reason, more
complicated functions, or functions that are represented only by values on a discrete set of points
in their domain, are often approximated by polynomials.
Such an approximation can be computed in various ways. We first consider interpolation,
in which we construct a polynomial that agrees with the given data at selected points. While
interpolation methods are efficient, they must be used carefully, because it is not necessarily true
that a polynomial that agrees with a given function at certain points is a good approximation to
the function elsewhere in its domain.
One remedy for this is to use piecewise polynomial interpolation, in which a low-degree polyno-
mial, typically linear or cubic, is used to approximate data only on a given subdomain, and these
polynomial “pieces” are “glued” together to obtain a piecewise polynomial approximation. This
approach is also efficient, and tends to be more robust than standard polynomial interpolation, but
there are disadvantages, such as the fact that a piecewise polynomial only has very few derivatives.
An alternative to polynomial interpolation, whether piecewise or not, is polynomial approxi-
mation, in which the goal is to find a polynomial that, in some sense, best fits given data. For
example, it is not possible to exactly fit a large number of points with a low-degree polynomial,
but an approximate fit can be more useful than a polynomial that can fit the given data exactly
but still fail to capture the overall behavior of the data. This is illustrated in Figure 1.1.
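As a hedged sketch of this comparison (the interval, degrees, and numbers of points below are
illustrative choices, not necessarily those used to produce Figure 1.1), the built-in functions
polyfit and polyval can compute both an interpolant and a least-squares fit of f (x) = 1/(1 + x2 ):

>> f=@(x) 1./(1+x.^2);
>> xi=linspace(-5,5,11);
>> pint=polyfit(xi,f(xi),10);   % degree-10 interpolant of 11 points
>> % (polyfit may warn that a high-degree fit is badly conditioned)
>> xd=linspace(-5,5,100);
>> pls=polyfit(xd,f(xd),6);     % degree-6 least-squares fit to 100 points
>> x=linspace(-5,5,200);
>> plot(x,f(x),'b-',x,polyval(pint,x),'r:',x,polyval(pls,x),'g--')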

1.1.4 Numerical Differentiation and Integration


It is often necessary to approximate derivatives or integrals of functions that are represented only
by values at a discrete set of points, thus making differentiation or integration rules impossible to
use directly. Even when this is not the case, derivatives or integrals produced by differentiation or
integration rules can often be very complicated functions, making their computation and evaluation
computationally expensive.
While there are many software tools, such as Mathematica or Maple, that can compute deriva-
tives or integrals symbolically using such rules, they are inherently unreliable because they require
detection of patterns in whatever data structure is used to represent the function being differen-
tiated or integrated, and it is not as easy to implement software that performs this kind of task
effectively as it is for a person to learn how to do so through observation and intuition.

Figure 1.1: The dotted red curves demonstrate polynomial interpolation (left plot) and least-squares
approximation (right plot) applied to f (x) = 1/(1 + x2 ) (blue solid curve).

Therefore, it is important to have methods for evaluating derivatives and integrals that are
insensitive to the complexity of the function being acted upon. Numerical techniques for these op-
erations make use of polynomial interpolation by (implicitly) constructing a polynomial interpolant
that fits the given data, and then applying differentiation or integration rules to the polynomial.
We will see that by choosing the method of polynomial approximation judiciously, accurate results
can be obtained with far greater efficiency than one might expect.
As an example, consider the definite integral
    ∫_0^1 1/(x2 − 5x + 6) dx.

Evaluating this integral exactly entails factoring the denominator, which is simple in this case but
not so in general, and then applying partial fraction decomposition to obtain an antiderivative,
which is then evaluated at the limits. Alternatively, simply computing

    (1/12) [f (0) + 4f (1/4) + 2f (1/2) + 4f (3/4) + f (1)],

where f (x) is the integrand, yields an approximation with 0.01% error (that is, the error is 10−4 ).
While the former approach is less tedious to carry out by hand, at least if one has a calculator,
clearly the latter approach is the far more practical use of computational resources.
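For the curious reader, this computation can be checked in a few lines of Matlab (a minimal
sketch; the exact value follows from the partial fraction decomposition mentioned above):

>> f=@(x) 1./(x.^2-5*x+6);
>> approx=(1/12)*(f(0)+4*f(1/4)+2*f(1/2)+4*f(3/4)+f(1))
>> exact=log(4/3)              % the antiderivative is ln|x-3| - ln|x-2|
>> abs(approx-exact)           % approximately 1e-4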

1.1.5 Nonlinear Equations


The vast majority of equations, especially nonlinear equations, cannot be solved using analytical
techniques such as algebraic manipulations or knowledge of trigonometric functions. For example,
while the equations
    x2 − 5x + 6 = 0,    cos x = 1/2
can easily be solved to obtain exact solutions, these slightly different equations
    x2 − 5xex + 6 = 0,    x cos x = 1/2
cannot be solved using analytical methods.
Therefore, iterative methods must instead be used to obtain an approximate solution. We will
study a variety of such methods, which have distinct advantages and disadvantages. For example,
some methods are guaranteed to produce a solution under reasonable assumptions, but they might
do so slowly. On the other hand, other methods may produce a sequence of iterates that quickly
converges to the solution, but may be unreliable for some problems.
After learning how to solve nonlinear equations of the form f (x) = 0 using iterative methods
such as Newton’s method, we will learn how to generalize such methods to solve systems of nonlinear
equations of the form f (x) = 0, where f : Rn → Rn . In particular, for Newton’s method, computing
xn+1 − xn = −f (xn )/f 0 (xn ) in the single-variable case is generalized to solving the system of
equations Jf (xn )sn = −f (xn ), where Jf (xn ) is the Jacobian of f evaluated at xn , and sn =
xn+1 − xn is the step from each iterate to the next.
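As a hedged preview (applied to one of the equations above, with an arbitrarily chosen starting
guess), Newton's method for a single equation can be sketched in a few lines of Matlab:

>> f=@(x) x.*cos(x)-1/2;       % solve x cos x = 1/2
>> fp=@(x) cos(x)-x.*sin(x);   % derivative of f
>> x=1;                        % arbitrary initial guess
>> for k=1:10, x=x-f(x)/fp(x); end
>> x, f(x)                     % the residual f(x) should now be nearly zero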

1.1.6 Initial Value Problems


Next, we study various algorithms for solving an initial value problem, which consists of an ordinary
differential equation
    dy/dt = f (t, y),    a ≤ t ≤ b,
and an initial condition
    y(t0 ) = α.
Unlike analytical methods for solving such problems, which are used to find the exact solution in the
form of a function y(t), numerical methods typically compute values y1 , y2 , y3 , . . . that approximate
y(t) at discrete time values t1 , t2 , t3 , . . .. At each time tn+1 , for n > 0, the value of the solution is
approximated using its values at previous times.
We will learn about two general classes of methods: one-step methods, which are derived using
Taylor series and compute yn+1 only from yn , and multistep methods, which are based on polynomial
interpolation and compute yn+1 from yn , yn−1 , . . . , yn−s+1 , where s is the number of steps in the
method. Either type of method can be explicit, in which yn+1 can be described in terms of an
explicit formula, or implicit, in which yn+1 is described implicitly using an equation, normally
nonlinear, that must be solved during each time step.
The difference between consecutive times tn and tn+1 , called the time step, need not be uniform;
we will learn about how it can be varied to achieve a desired level of accuracy as efficiently as
possible. We will also learn about how the methods used for the first-order initial-value problem
described above can be generalized to solve higher-order equations, as well as systems of equations.
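As a hedged sketch of the simplest one-step method, Euler's method, applied to the test problem
y 0 = −2ty, y(0) = 1 (the same problem shown in Figure 9.1), with an arbitrarily chosen time step:

>> f=@(t,y) -2*t*y;
>> h=0.1; t=0; y=1;            % arbitrary step size and initial condition y(0)=1
>> for n=1:10, y=y+h*f(t,y); t=t+h; end
>> [y exp(-t^2)]               % Euler approximation vs. exact solution at t = 1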

One key issue with time-stepping methods is stability. If the time step is not chosen to be
sufficiently small, the computed solution can grow without bound, even if the exact solution is
bounded. Generally, the need for stability imposes a more severe restriction on the size of the time
step for explicit methods, which is why implicit methods are commonly used, even though they
tend to require more computational effort per time step. Certain systems of differential equations
can require an extraordinarily small time step to be solved by explicit methods; such systems are
said to be stiff.

1.1.7 Boundary Value Problems


We conclude with a discussion of solution methods for the two-point boundary value problem

y 00 = f (x, y, y 0 ), a ≤ x ≤ b,

with boundary conditions


y(a) = α, y(b) = β.
One approach, called the shooting method, transforms this boundary-value problem into an initial-
value problem so that methods for such problems can then be used. However, it is necessary to
find the correct initial values so that the boundary condition at x = b is satisfied. An alternative
approach is to discretize y 00 and y 0 using finite differences, the approximation schemes covered last
semester, to obtain a system of equations to solve for an approximation of y(x); this system can be
linear or nonlinear. We conclude with the Rayleigh-Ritz method, which treats the boundary value
problem as a continuous least-squares problem.
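As a hedged preview of the finite difference approach (using a simple linear test problem chosen
only for this sketch, not one from later chapters), replacing y 00 by a centered difference quotient
on a uniform grid leads to a tridiagonal linear system:

>> % y'' = -y on [0,pi/2], with y(0)=0, y(pi/2)=1; the exact solution is sin(x)
>> N=49; h=(pi/2)/(N+1); x=h*(1:N)';
>> A=diag((h^2-2)*ones(N,1))+diag(ones(N-1,1),1)+diag(ones(N-1,1),-1);
>> rhs=zeros(N,1); rhs(N)=-1;  % boundary value y(pi/2)=1 moved to the right-hand side
>> y=A\rhs;
>> max(abs(y-sin(x)))          % the error decreases as O(h^2)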

1.2 Getting Started with MATLAB


Matlab is commercial software, originally developed by Cleve Moler in 1982 [24] and currently
sold by The Mathworks. It can be purchased and downloaded from mathworks.com. As of this
writing, the student version can be obtained for $50, whereas academic and industrial licenses are
much more expensive. For any license, “toolboxes” can be purchased in order to obtain additional
functionality, but for the tasks performed in this book, the core product will be sufficient.
As an alternative, one can instead use Octave, a free application which uses the same program-
ming language as Matlab, with only minor differences. It can be obtained from gnu.org. Its
user interface is not as “friendly” as that of Matlab, but it has improved significantly in its most
recent versions. In this book, examples will feature only Matlab, but the code will also work in
Octave, without modification except where noted.
Figure 1.2 shows Matlab when it is launched. The large window on the right is the command
window, in which commands can be entered interactively at the “>>” prompt. On the left, the top
window lists the files in the current working directory. By default, this directory is a subdirectory
of your Documents (Mac OS X) or My Documents (Windows) directory that is named MATLAB. It
is important to keep track of your current working directory, because that is the first place that
Matlab looks for M-files, or files that contain Matlab code. These will be discussed in detail
later in this section.
We will now present basic commands, operations and programming constructs in Matlab.
Before we begin, we will use the diary command to save all subsequent commands, and their

Figure 1.2: Screen shot of Matlab at startup in Mac OS X

output, to a text file. By default, this output is saved to a file that is named diary in the current
working directory, but we will supply our own filename as an argument, to make the saved file
easier to open later in a text editor.

>> diary tutorial.txt

1.2.1 Basic Mathematical Operations


Next, we will perform basic mathematical operations. Try out these commands, and observe the
corresponding output. Once a command is typed at the prompt, simply hit the Enter key to execute
the command.

>> a=3+4

a =

     7

>> b=sqrt(a)

b =

2.645751311064591

>> c=exp(a)

c =

1.096633158428459e+003
As can be seen from these statements, arithmetic operations and standard mathematical func-
tions can readily be performed, so Matlab could be used as a “desk calculator” in which results
of expressions can be stored in variables, such as a, b and c in the preceding example. Also, note
that once a command is executed, the output displayed is the variable name, followed by its value.
This is typical behavior in Matlab, so for the rest of this tutorial, the output will not be displayed
in the text.

1.2.2 Obtaining Help


Naturally, when learning new software, it is important to be able to obtain help. Matlab includes
the help command for this purpose. Try the following commands. You will observe that Matlab
offers help on various categories of commands, functions or operators (hereafter referred to simply
as “commands”), and also help pages on individual commands, such as lu.
>> help
>> help ops
>> help lu

1.2.3 Basic Matrix Operations


Matlab, which is short for “matrix laboratory”, is particularly easy to use with vectors and
matrices. We will now see this for ourselves by constructing and working with some simple matrices.
Try the following commands.
>> A=[ 1 0; 0 2 ]
>> B=[ 5 7; 9 10 ]
>> A+B
>> 2*ans
>> C=A+B
>> 4*A
>> C=A-B
>> C=A*B
>> w=[ 4; 5; 6 ]
As we can see, matrix arithmetic is easily performed. However, what happens if we attempt an
operation that, mathematically, does not make sense? Consider the following example, in which a
2 × 2 matrix is multiplied by a 3 × 1 vector.
>> A*w
??? Error using ==> mtimes
Inner matrix dimensions must agree.
Since this operation is invalid, Matlab does not attempt to perform the operation and instead
displays an error message. The function name mtimes refers to the function that implements the
matrix multiplication operator that is represented in the above command by *.

1.2.4 Storage of Variables


Let’s examine the variables that we have created so far. The whos command is useful for this
purpose.
>> whos
Name Size Bytes Class Attributes

A 2x2 32 double
B 2x2 32 double
C 2x2 32 double
a 1x1 8 double
ans 2x2 32 double
b 1x1 8 double
c 1x1 8 double
w 3x1 24 double
Note that each number, such as a, or each entry of a matrix, occupies 8 bytes of storage, which
is the amount of memory allocated to a double-precision floating-point number. This system of
representing real numbers will be discussed further in Section 1.5. Also, note the variable ans. It
was not explicitly created by any of the commands that we have entered. It is a special variable
that is assigned the most recent expression that is not already assigned to a variable. In this case,
the value of ans is the output of the operation 4*A, since that was not assigned to any variable.

1.2.5 Complex Numbers


Matlab can also work with complex numbers. The following command creates a vector with one
real element and one complex element.
>> z=[ 6; 3+4i ]
Now run the whos command again. Note that it states that z occupies 32 bytes, even though it has
only two elements. This is because each element of z has a real part and an imaginary part, and
each part occupies 8 bytes. It is important to note that if a single element of a vector or matrix is
complex, then the entire vector or matrix is considered complex. This can result in wasted storage
if imaginary parts are supposed to be zero, but in fact are small, nonzero numbers due to roundoff
error (which will be discussed in Section 1.5).
The real and imag commands can be used to extract the real and imaginary parts, respectively,
of a complex scalar, vector or matrix. The output of these commands are stored as real numbers.
>> y=real(z)
>> y=imag(z)

1.2.6 Creating Special Vectors and Matrices


It can be very tedious to create matrices entry-by-entry, as we have done so far. Fortunately,
Matlab has several functions that can be used to easily create certain matrices of interest. Try
the following commands to learn what these functions do. In particular, note the behavior when
only one argument is given, instead of two.

>> E=ones(6,5)
>> E=ones(3)
>> R=rand(3,2)

As the name suggests, rand creates a matrix with random entries. More precisely, the entries
are random numbers that are uniformly distributed on [0, 1].

Exercise 1.2.1 What if we want the entries distributed within a different interval, such
as [−1, 1]? Create such a matrix, of size 3 × 2, using matrix arithmetic that we have seen,
and the ones function.
In many situations, it is helpful to have a vector of equally spaced values. For example, if we
want a vector consisting of the integers from 1 to 10, inclusive, we can create it using the statement

>> z=[ 1 2 3 4 5 6 7 8 9 10 ]

However, this can be very tedious if a vector with many more entries is needed. Imagine creating
a vector with all of the integers from 1 to 1000! Fortunately, this can easily be accomplished using
the colon operator. Try the following commands to see how this operator behaves.

>> z=1:10
>> z=1:2:10
>> z=10:-2:1
>> z=1:-2:10

It should be noted that the second argument, which determines the spacing between entries, need not
be an integer.
Exercise 1.2.2 Use the colon operator to create a vector of real numbers between 0 and
1, inclusive, with spacing 0.01.

1.2.7 Transpose Operators


We now know how to create row vectors with equally spaced values, but what if we would rather
have a column vector? This is just one of many instances in which we need to be able to compute
the transpose of a matrix in Matlab. Fortunately, this is easily accomplished, using the single
quote as an operator. For example, this statement

>> z=(0:0.1:1)'

has the desired effect. However, one should not simply conclude that the single quote is the
transpose operator, or they could be in for an unpleasant surprise when working with complex-
valued matrices. Try these commands to see why:

>> z=[ 6; 3+4i ]


>> z'
>> z.'

We can see that the single quote is an operator that takes the Hermitian transpose of a matrix A,
commonly denoted by A^H: it is the transpose and complex conjugate of A. That is, A^H = conj(A)^T,
where conj(A) denotes the entrywise complex conjugate of A. Meanwhile, the dot followed by the
single quote is the transpose operator.
Either operator can be used to take the transpose for matrices with real entries, but one must
be more careful when working with complex entries. That said, why is the “default” behavior, rep-
resented by the simpler single quote operator, the Hermitian transpose rather than the transpose?
This is because in general, results or techniques established for real matrices, that make use of the
transpose, do not generalize to the complex case unless the Hermitian transpose is used instead.
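To see the difference concretely, here are the expected results for the vector z above (an illustration of ours; the exact display formatting may vary by version):

>> z'        % Hermitian transpose: the row vector [ 6, 3-4i ]
>> z.'       % transpose only:      the row vector [ 6, 3+4i ]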

1.2.8 if Statements
Now, we will learn some essential programming constructs that Matlab shares with many other
programming languages. The first is an if statement, which is used to perform a different task
based on the result of a given conditional expression that is either true or false.
At this point, we will also learn how to write a script in Matlab. Scripts are very useful for
the following reasons:

• Some Matlab statements, such as the programming constructs we are about to discuss, are
quite complicated and span several lines. Typing them at the command window prompt can
be very tedious, and if a mistake is made, then the entire construct must be retyped.

• It frequently occurs that a sequence of commands needs to be executed several times, with no
or minor changes. It can be very tedious and inefficient to repeatedly type out such command
sequences several times, even if Matlab’s history features (such as using the arrow keys to
scroll through past commands) are used.

A script can be written in a plain text file, called an M-file, which is a file that has a .m extension.
An M-file can be written in any text editor, or in Matlab’s own built-in editor. To create a new
M-file or edit an existing M-file, one can use the edit command at the prompt:

>> edit entermonthyear

If no extension is given, a .m extension is assumed. If the file does not exist in the current working
directory, Matlab will ask if the file should be created.
In the editor, type in the following code, that computes the number of days in a given month,
while taking leap years into account.

% entermonthyear - script that asks the user to provide a month and year,
% and displays the number of days in that month

% Prompt user for month and year


month=input('Enter the month (1-12): ');
year=input('Enter the 4-digit year: ');
% For displaying the month by name, we construct an array of strings
% containing the names of all the months, in numerical order.
% This is a 12-by-9 matrix, since the longest month name (September)
% contains 9 letters. Each row must contain the same number of columns, so
% other month names must be padded to 9 characters.

months=[ 'January  '; 'February '; 'March    '; 'April    '; 'May      '; ...
         'June     '; 'July     '; 'August   '; 'September'; 'October  '; ...
         'November '; 'December ' ];

% extract the name of the month indicated by the user
monthname=months(month,:);
% remove trailing blanks
monthname=deblank(monthname);
if month==2
% month is February
if rem(year,4)==0
% leap year
days=29;
else
% non-leap year
days=28;
end
elseif month==4 || month==6 || month==9 || month==11
% "30 days hath April, June, September and November..."
days=30;
else
% "...and all the rest have 31"
days=31;
end
% display number of days in the given month
disp([ monthname ', ' num2str(year) ' has ' num2str(days) ' days.' ])

Can you figure out how an if statement works, based on your knowledge of what the result should
be?
Note that this M-file includes comments, which are preceded by a percent sign (%). Once a %
is entered on a line, the rest of that line is ignored. This is very useful for documenting code so
that a reader can understand what it is doing. The importance of documenting one’s code cannot
be overstated. In fact, it is good practice to write the documentation before the code, so that the
process of writing code is informed with a clearer idea of the task at hand.
As this example demonstrates, if statements can be nested within one another. Also note the
use of the keywords else and elseif. These are used to provide alternative conditions under which
different code can be executed, if the original condition in the if statement turns out to be false.
If any conditions paired with the elseif keyword also turn out to be false, then the code following
the else keyword, if any, is executed.
This script features some new functions that can be useful in many situations (a short demonstration follows the list below):

• deblank(s): returns a new string variable that is the same as s, except that any “white
space” (spaces, tabs, or newlines) at the end of s is removed

• rem(a,b): returns the remainder after dividing a by b

• num2str(x): returns a string variable based on formatting the number x as text
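As a quick illustration of these three functions (an example of our own, not part of the script above), the following commands could be typed at the prompt:

>> s=deblank('March    ')     % returns 'March' with the trailing spaces removed
>> r=rem(2001,4)              % returns 1, since 2001 = 4*500 + 1
>> t=num2str(31)              % returns the string '31'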

To execute a script M-file, simply type the name of the file (without the .m extension) at the
prompt.

>> entermonthyear
Enter the month (1-12): 5
Enter the 4-digit year: 2001
May, 2001 has 31 days.

Note that in the above script M-file, most of the statements are terminated with semicolons.
Unlike in programming languages such as C++, the semicolon is not required. If it is omitted from
a statement, then the value of any expression that is computed in that statement is displayed, along
with its variable name (or ans, if there is no variable associated with the expression). Including the
semicolon suppresses printing. In most cases, this is the desired behavior, because excessive output
can make software less user-friendly. However, omitting semicolons can be useful when writing and
debugging new code, because seeing intermediate results of a computation can expose bugs. Once
the code is working, then semicolons can be added to suppress superfluous output.

1.2.9 for Loops


The next programming construct is the for loop, which is used to carry out an iterative process
that runs for a predetermined number of iterations. To illustrate for loops, we examine the script
file gausselim.m:

% gausselim - script that performs Gaussian elimination on a random 40-by-40
% matrix
m=40;
n=40;
% generate random matrix
A=rand(m,n);
% display it
disp('Before elimination:')
disp(A)
for j=1:n-1,
% use elementary row operations to zero all elements in column j below
% the diagonal
for i=j+1:m,
mult=A(i,j)/A(j,j);
% subtract mult * row j from row i
for k=j:n,
A(i,k)=A(i,k)-mult*A(j,k);
end
% equivalent code:
%A(i,j:n)=A(i,j:n)-mult*A(j,j:n);
end
end
% display updated matrix
disp('After elimination:')
disp(A)

Note the syntax for a for statement: the keyword for is followed by a loop variable, such as i,
j or k in this example, and that variable is assigned a value. Then the body of the loop is given,
followed by the keyword end.
What does this loop actually do? During the nth iteration, the loop variable is set equal to the
nth column of the expression that is assigned to it by the for statement. Then, the loop variable
retains this value throughout the body of the loop (unless the loop variable is changed within
the body of the loop, which is ill-advised, and sometimes done by mistake!), until the iteration is
completed. Then, the loop variable is assigned the next column for the next iteration. In most
cases, the loop variable is simply used as a counter, in which case assigning to it a row vector of
values, created using the colon operator, yields the desired behavior.
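The following small loop (an illustration of ours, separate from gausselim) shows this column-by-column behavior directly:

M=[ 1 2 3; 4 5 6 ];
for v=M
    disp(v)     % displays the columns [1;4], [2;5] and [3;6], one per iteration
end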
Now run this script, just like in the previous example. The script displays a randomly generated
matrix A, then performs Gaussian elimination on A to obtain an upper triangular matrix U , and
then displays the final result U .
Exercise 1.2.3 An upper triangular matrix U has the property that u_ij = 0 whenever
i > j; that is, the entire “lower triangle” of U , consisting of all entries below the main
diagonal, must be zero. Examine the matrix U produced by the script gausselim above.
Why are some subdiagonal entries nonzero?

1.2.10 while Loops


Next, we introduce the while loop, which, like a for loop, is also used to implement an iterative
process, but is controlled by a conditional expression rather than a predetermined set of values such
as 1:n. The following script, saved in the file newtonsqrt.m, illustrates the use of a while loop.

% newtonsqrt - script that uses Newton’s method to compute the square root
% of 2

% choose initial iterate


x=1;
% announce what we are doing
disp('Computing the square root of 2...')
% iterate until convergence. we will test for convergence inside the loop
% and use the break keyword to exit, so we can use a loop condition that’s
% always true
while true
% save previous iterate for convergence test
oldx=x;
% compute new iterate
x=x/2+1/x;
% display new iterate
disp(x)
% if relative difference in iterates is less than machine precision,
% exit the loop
if abs(x-oldx)<eps*abs(x)
break;
end
end
% display result and verify that it really is the square root of 2
disp('The square root of 2 is:')
x
disp('x^2 is:')
disp(x^2)

Note the use of the expression true in the while statement. The value of the predefined
variable true is 1, while the value of false is 0, following the convention used in many programming
languages that a nonzero number is interpreted as the boolean value “true”, while zero is interpreted
as “false”.
A while loop runs as long as the condition in the while statement is true. It follows that
this particular while statement is an infinite loop, since the value of true will never be false.
However, this loop will exit when the condition in the enclosed if statement is true, due to the
break statement. A break statement causes the enclosing for or while loop to immediately exit.
This particular while loop computes an approximation to √2 such that the relative difference
between each new iterate x and the previous iterate oldx is less than the value of the predefined
variable eps, which is informally known as “unit roundoff” or “machine precision”. This concept
will be discussed further in Section 1.5.1. The approximation to √2 itself is computed using
Newton’s method, which will be discussed in Section 8.4.
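The same iteration can also be written with the convergence test placed directly in the while statement, avoiding break altogether. The following sketch (a variation of ours on newtonsqrt that does not display the intermediate iterates) behaves equivalently:

x=1;
oldx=0;                           % any value that fails the convergence test initially
while abs(x-oldx)>=eps*abs(x)
    oldx=x;                       % save previous iterate
    x=x/2+1/x;                    % compute new iterate
end
disp(x)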
Go ahead and run this script by typing its name, newtonsqrt, at the prompt. The code will
display each iteration as the approximation is improved until it is sufficiently accurate. Note that
convergence to √2 is quite rapid! This effect will be explored further in Chapter 8.

1.2.11 Function M-files


In addition to script M-files, Matlab also uses function M-files. A function M-file is also a text
file with a .m extension, but unlike a script M-file, which simply contains a sequence of commands
that could have been typed at the prompt, a function M-file extends the capability of Matlab by
defining a new function that can then be used by other code.
The following function M-file, called fahtocel.m, illustrates function definition.

function tc=fahtocel(tf)
% converts function input ’tf’ of temperatures in Fahrenheit to
% function output ’tc’ of temperatures in Celsius
temp=tf-32;
tc=temp*5/9;

Note that a function definition begins with the keyword function; this is how Matlab distinguishes
a script M-file from a function M-file (though in a function M-file, comments can still precede
function).
After the keyword function, the output arguments of the function are specified. In this function,
there is only one output argument, tc, which represents the Celsius temperature. If there were
more than one, then they would be enclosed in square brackets, and in a comma-separated list.

After the output arguments, there is a = sign, then the function name, which should match the
name of the M-file aside from the .m extension. Finally, if there are any input arguments, then they
are listed after the function name, separated by commas and enclosed in parentheses.
After this first line, all subsequent code is considered the body of the function–the statements
that are executed when the function is called. The only exception is that other functions can be
defined within a function M-file, but they are “helper” functions, that can only be called by code
within the same M-file. Helper functions must appear after the function after which the M-file is
named.
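For instance, a function M-file celstats.m (a hypothetical example of ours) might use a helper function that only code within the same file can call:

function [tcmean,tcmax]=celstats(tf)
% celstats - converts a vector of Fahrenheit temperatures to Celsius and
% returns the mean and maximum Celsius temperatures
tc=convert(tf);
tcmean=mean(tc);
tcmax=max(tc);

% helper function: visible only to code within this M-file
function tc=convert(tf)
tc=(tf-32)*5/9;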
Type in the above code for fahtocel into a file fahtocel.m in the current working directory.
Then, it can be executed as follows:

>> tc=fahtocel(212)
tc =
100

If tc= had been omitted, then the output value 100 would have been assigned to the special variable
ans, described in §1.2.4.
Note that the definition of fahtocel uses a variable temp. Here, it should be emphasized that
all variables defined within a function, including input and output arguments, are only defined
within the function itself. If a variable inside a function, such as temp, happens to have the same
name as another variable defined in the top-level workspace (the memory used by variables defined
outside of any function), or in another function, then this other variable is completely independent
of the one that is internal to the function. Consider the following example:

>> temp=32
temp =
32
>> tfreeze=fahtocel(temp)
tfreeze =
0
>> temp
temp =
32

Inside fahtocel, temp is set equal to zero by the subtraction of 32, but the temp in the top-level
workspace retains its value of 32.
Comments included at the top of an M-file (whether script or function) are assumed by Matlab
to provide documentation of the M-file. As such, these comments are displayed by the help
command, as applied to that function. Try the following commands:

>> help fahtocel


>> help newtonsqrt

We now illustrate other important aspects of functions, using the following M-file, which is
called vecangle2.m:

% vecangle2 - function that computes the angle between two given vectors
% in both degrees and radians. We use the formula
% x’*y = ||x||_2 ||y||_2 cos(theta), where theta is the angle between x and
% y
function [anglerad,angledeg]=vecangle2(x,y)
n=length(x);
if n~=length(y)
error('vector lengths must agree')
end
% compute needed quantities for above formula
dotprod=x'*y;
xnorm=norm(x);
ynorm=norm(y);
% obtain cos(angle)
cosangle=dotprod/(xnorm*ynorm);
% use inverse cosine to obtain angle in radians
anglerad=acos(cosangle);
% if angle in degrees is desired (that is, two output arguments are
% specified), then convert to degrees. Otherwise, don’t bother
if nargout==2,
angledeg=anglerad*180/pi;
end
As described in the comments, the purpose of this function is to compute the angle between two
vectors in n-dimensional space, in both radians and degrees. Note that this function accepts
multiple input arguments and multiple output arguments. The way in which this function is called
is similar to how it is defined. For example, try this command:
>> [arad,adeg]=vecangle2(rand(5,1),rand(5,1))
It is important that code include error-checking, if it might be used by other people. To that
end, the first task performed in this function is to check whether the input arguments x and y have
the same length, using the length function that returns the number of elements of a vector or
matrix. If they do not have the same length, then the error function is used to immediately exit
the function vecangle2 and display an informative error message.
Note the use of the variable nargout at the end of the function definition. The function
vecangle2 is defined to have two output arguments, but nargout is the number of output arguments
that are actually specified when the function is called. Similarly, nargin is the number of input
arguments that are specified.
These variables allow functions to behave more flexibly and more efficiently. In this case, the
angle between the vectors is only converted to degrees if the user specified both output arguments,
thus making nargout equal to 2. Otherwise, it is assumed that the user only wanted the angle in
radians, so the conversion is never performed. Matlab typically provides several interfaces to its
functions, and uses nargin and nargout to determine which interface is being used. These multiple
interfaces are described in the help pages for such functions.
Exercise 1.2.4 Try calling vecangle2 in various ways, with different numbers of input
and output arguments, and with vectors of either the same or different lengths. Observe
the behavior of Matlab in each case.

1.2.12 Graphics
Next, we learn some basic graphics commands. We begin by plotting the graph of the function
y = x^2 on [−1, 1]. Start by creating a vector x, of equally spaced values between −1 and 1, using
the colon operator. Then, create a vector y that contains the squares of the values in x. Make sure
to use the correct approach to squaring the elements of a vector!
The plot command, in its simplest form, takes two input arguments that are vectors, that must
have the same length. The first input argument contains x-values, and the second input argument
contains y-values. The command plot(x,y) creates a new figure window (if none already exists)
and plots y as a function of x in a set of axes contained in the figure window. Try plotting the
graph of the function y = x^2 on [−1, 1] using this command.
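One possible sequence of commands (with a spacing of 0.01, which is a choice of ours) is:

>> x=-1:0.01:1;
>> y=x.^2;        % note the .^ operator for element-wise squaring
>> plot(x,y)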
Note that by default, plot produces a solid blue curve. In reality, it is not a curve; it simply
“connects the dots” using solid blue line segments, but if the segments are small enough, the
resulting piecewise linear function resembles a smooth curve. But what if we want to plot curves
using different colors, different line styles, or different symbols at each point?
Use the help command to view the help page for plot, which lists the specifications for different
colors, line styles, and marker styles (which are used at each point that is plotted). The optional
third argument to the plot command is used to specify these colors and styles. They can be mixed
together; for example, the third argument 'r--' plots a dashed red curve. Experiment with these
different colors and styles, and with different functions.
Matlab provides several commands that can be used to produce more sophisticated plots. It
is recommended that you view the help pages for these commands, and also experiment with their
usage; a short example combining several of these commands follows the list below.

• hold is used to specify that subsequent plot commands should be superimposed on the same
set of axes, rather than the default behavior in which the current axes are cleared with each
new plot command.

• subplot is used to divide a figure window into an m × n matrix of axes, and specify which
set of axes should be used for subsequent plot commands.

• xlabel and ylabel are used to label the horizontal and vertical axes, respectively, with given
text.

• title is used to place given text at the top of a set of axes.

• legend is used to place a legend within a set of axes, so that the curves displayed on the axes
can be labeled with given text.

• gtext is used to place given text at an arbitrary point within a figure window, indicated by
clicking the mouse at the desired point.
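As a brief, self-contained illustration of a few of these commands (a sketch of ours, not tied to Figure 1.3), consider:

x=0:0.1:2*pi;
plot(x,sin(x),'b-')          % solid blue curve
hold on
plot(x,cos(x),'r--')         % dashed red curve superimposed on the same axes
xlabel('x')
ylabel('y')
title('sine and cosine')
legend('sin x','cos x')
hold off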

Exercise 1.2.5 Reproduce the plot shown in Figure 1.3 using the commands discussed
in this section.
Finally, it is essential to be able to save a figure so that it can be printed or included in a
document. In the figure window, go to the File menu and choose “Save” or “Save As”. You will

Figure 1.3: Figure for Exercise 1.2.5

see that the figure can be saved in a variety of standard image formats, such as JPEG or Windows
bitmap (BMP).
Another format is “Matlab Figure (*.fig)”. It is highly recommended that you save your figure
in this format, as well as the desired image format. Then, if you need to go back and change
something about the figure after you have already closed the figure window, you can simply use
the open command, with the .fig filename as its input argument, to reopen the figure window.
Otherwise, you would have to recreate the entire figure from scratch.

1.2.13 Polynomial Functions


Finally, Matlab provides several functions for working with polynomials. A polynomial is repre-
sented within Matlab using a row vector of coefficients, with the highest degree coefficient listed
first. For example, f(x) = x^4 − 3x^2 + 2x + 5 is represented by the vector [ 1 0 -3 2 5 ]. Note that
the zero for the second element is necessary, or this vector would be interpreted as x^3 − 3x^2 + 2x + 5.
The following functions work with polynomials in this format:

• r=roots(p) returns a column vector r consisting of the roots of the polynomial represented
by p

• p=poly(r) is, in a sense, an inverse of roots. This function produces a row vector p that
represents the monic polynomial (that is, with leading coefficient 1) whose roots are the
entries of the vector r.

• y=polyval(p,x) evaluates the polynomial represented by p at all of the entries of x (which
can be a scalar, vector or matrix) and returns the resulting values in y.

• q=polyder(p) computes the coefficients of the polynomial q that is the derivative of the
polynomial p.

• q=polyint(p) computes the coefficients of the polynomial q that is the antiderivative, or
indefinite integral, of the polynomial p. A constant of integration of zero is assumed.

• r=conv(p,q) computes the coefficients of the polynomial r that is the product of the poly-
nomials p and q.
It is recommended that you experiment with these functions in order to get used to working with
them.
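For instance, using the polynomial f(x) = x^4 − 3x^2 + 2x + 5 from above, one might try:

>> p=[ 1 0 -3 2 5 ];
>> r=roots(p);          % column vector containing the four roots of f
>> y=polyval(p,2)       % evaluates f(2) = 16 - 12 + 4 + 5 = 13
>> q=polyder(p)         % derivative f'(x) = 4x^3 - 6x + 2, i.e. [ 4 0 -6 2 ]
>> s=conv(p,q);         % coefficients of the product f(x)*f'(x)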

1.2.14 Number Formatting


By default, numbers are displayed in “short” format, which uses 5 significant digits. The format
command, with one of Matlab’s predefined formats, can be used to change how numbers are
displayed. For example, type format long at the prompt, and afterwards, numbers are displayed
using 15 decimal digits. Here is how 1/3 is displayed using various formats:

short     0.3333
short e   3.3333e-01
long      0.333333333333333
long e    3.333333333333333e-01
bank      0.33
hex       3fd5555555555555
rat       1/3

1.2.15 Inline and Anonymous Functions


Often, it is desirable to use functions in Matlab that compute relatively simple expressions, but
it is tedious to make a single small function M-file for each such function. Instead, very simple
functions can be defined using the inline command:

>> f=inline('exp(sin(2*x))');
>> f(pi/4)
ans =
2.7183

If an inline function takes more than one argument, it is important to specify which argument
is first, which is second, and so on. For example, to construct an inline function for
f(x, y) = √(x^2 + y^2), it is best to proceed as follows:


>> f=inline('sqrt(x^2+y^2)','x','y');
>> f(2,1)
ans =
2.2361

Alternatively, one can define an anonymous function as follows:

>> f=@(x)exp(sin(2*x));
>> f(0)
ans =
1

Anonymous functions are helpful when it is necessary to pass a function f as a parameter to


another function g, but g assumes that f has fewer input parameters than it actually accepts. An
anonymous function can be used to fill in values for the extra parameters of f before it is passed
to g.
This will be particularly useful when using Matlab’s functions for solving ordinary differential
equations, as they expect the time derivative f in the ODE dy/dt = f (t, y) to be a function of only
two arguments. If the function f that computes the time derivative has additional input arguments,
one can use, for example, the anonymous function @(t,y)f(t,y,a,b,c) where a, b and c are the
additional input arguments, whose values must be known.
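As a sketch of this pattern (the function f and the parameter k below are a hypothetical example of ours), suppose the time derivative depends on one extra parameter k:

f=@(t,y,k) -k*y;                        % decay model with rate parameter k
k=2.5;
[t,y]=ode45(@(t,y) f(t,y,k),[0 1],1);   % ode45 sees a function of (t,y) only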

1.2.16 Saving and Loading Data


It is natural to have to end a Matlab session, and then start a new session in which the data from
the previous session needs to be re-used. Also, one might wish to send their data to another user.
Fortunately, Matlab supports saving variables to a file and loading them into a workspace.
The save command saves all variables in the current workspace (whether it is the top-level
workspace, or the scope of a function) into a file. By default, this file is given a .mat extension.
For example, save mydata saves all variables in a file called mydata.mat in the current working
directory. Similarly, the load command loads data from a given file.
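For example (the file names below are our own):

>> save mydata          % saves all workspace variables to mydata.mat
>> clear                % removes all variables from the workspace
>> load mydata          % restores the variables saved in mydata.mat
>> save somevars x y    % saves only the variables x and y to somevars.mat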

1.2.17 Using Octave


In Octave, by default output is paginated, meaning that output stops after each full page and waits
for user input to continue. You can use the space bar to advance by a full page, or the enter key to
advance one line at a time. Type q to immediately return to the prompt. To turn off pagination,
type more off at the prompt. This makes the output behave like in Matlab, which does not
paginate output.

1.3 How to Use This Book


In order to get the most out of this book, it is recommended that you not simply read through it.
As you have already seen, there are frequent “interruptions” in the text at which you are asked
to perform some coding task. The purposes of these coding tasks are to get you accustomed to
programming, particularly in Matlab and Python, but also to reinforce the concepts of numerical
analysis that are presented throughout the book.
Reading about the design, analysis and implementation of the many algorithms covered in
this book is not sufficient to be able to fully understand these aspects or be able to efficiently
and effectively work with these algorithms in code, especially as part of a larger application. A
“hands-on” approach is needed to achieve this level of proficiency, and this book is written with
this necessity in mind.

1.4 Error Analysis


Mathematical problems arising from scientific applications present a wide variety of difficulties that
prevent us from solving them exactly. This has led to an equally wide variety of techniques for
computing approximations to quantities occurring in such problems in order to obtain approximate
solutions. In this chapter, we will describe the types of approximations that can be made, and
learn some basic techniques for analyzing the accuracy of these approximations.

1.4.1 Sources of Approximation


Suppose that we are attempting to solve a particular instance of a problem arising from a mathe-
matical model of a scientific application. We say that such a problem is well-posed if it meets the
following criteria:
• A solution of the problem exists.
• The solution is unique.
• A small perturbation in the problem data results in a small perturbation in the solution; i.e.,
the solution depends continuously on the data.
By the first condition, the process of solving a well-posed problem can be seen to be equivalent to
the evaluation of some function f at some known value x, where x represents the problem data.
Since, in many cases, knowledge of the function f is limited, the task of computing f (x) can be
viewed, at least conceptually, as the execution of some (possibly infinite) sequence of steps that
solves the underlying problem for the data x. The goal in numerical analysis is to develop a finite
sequence of steps, i.e., an algorithm, for computing an approximation to the value f (x).
There are two general types of error that occur in the process of computing this approximation
to f (x):
1. data error is the error in the data x. In reality, numerical analysis involves solving a problem
with approximate data x̂. The exact data is often unavailable because it must be obtained
by measurements or other computations that fail to be exact due to limited precision. In
addition, data may be altered in order to simplify the solution process.
2. computational error refers to the error that occurs when attempting to compute f (x̂). Ef-
fectively, we must approximate f (x̂) by the quantity fˆ(x̂), where fˆ is a function that ap-
proximates f . This approximation may be the result of truncation, which occurs when it
is not possible to evaluate f exactly using a finite sequence of steps, and therefore a finite
sequence that evaluates f approximately must be used instead. This particular source of
computational error will be discussed in this chapter. Another source of computational error
is roundoff error, which will discussed in Section 1.5.

Exercise 1.4.1 Consider the process of computing cos(π/4) using a calculator or com-
puter. Indicate sources of data error and computational error, including both truncation
and roundoff error.

1.4.2 Error Measurement


Now that we have been introduced to some specific errors that can occur during computation,
we introduce useful terminology for discussing such errors. Suppose that a real number ŷ is an
approximation to some real number y. For instance, ŷ may be the closest number to y that can
be represented using finite precision, or ŷ may be the result of a sequence of arithmetic operations
performed using finite-precision arithmetic, where y is the result of the same operations performed
using exact arithmetic.

Definition 1.4.1 (Absolute Error, Relative Error) Let ŷ be a real number that is
an approximation to the real number y. The absolute error in ŷ is

Eabs = ŷ − y.

The relative error in ŷ is

Erel = (ŷ − y)/y,

provided that y is nonzero.
The absolute error is the most natural measure of the accuracy of an approximation, but it can
be misleading. Even if the absolute error is small in magnitude, the approximation may still be
grossly inaccurate if the exact value y is even smaller in magnitude. For this reason, it is preferable
to measure accuracy in terms of the relative error.
The magnitude of the relative error in ŷ can be interpreted as a percentage of |y|. For example,
if the relative error is greater than 1 in magnitude, then ŷ can be considered completely erroneous,
since the error is larger in magnitude than the exact value. Another useful interpretation of the
relative error concerns significant digits, which are all digits excluding leading zeros. Specifically,
if the relative error is at most β^(−p), where β is an integer greater than 1, then the representation of
ŷ in base β has at least p correct significant digits.
It should be noted that the absolute error and relative error are often defined using absolute
value; that is,
Eabs = |ŷ − y|,    Erel = |(ŷ − y)/y|.
This definition is preferable when one is only interested in the magnitude of the error, which is
often the case. If the sign, or direction, of the error is also of interest, then the first definition must
be used.

Example 1.4.2 If we add the numbers 0.4567 × 10^0 and 0.8530 × 10^−2, we obtain the exact result

x = 0.4567 × 10^0 + 0.008530 × 10^0 = 0.46523 × 10^0,

which is rounded to

x̂ = 0.4652 × 10^0.

The absolute error in this computation is

Eabs = x̂ − x = 0.4652 − 0.46523 = −0.00003,

while the relative error is

Erel = (x̂ − x)/x = (0.4652 − 0.46523)/0.46523 ≈ −0.000064484.

Now, suppose that we multiply 0.4567 × 10^4 and 0.8530 × 10^−2. The exact result is

x = (0.4567 × 10^4) × (0.8530 × 10^−2) = 0.3895651 × 10^2 = 38.95651,

which is rounded to

x̂ = 0.3896 × 10^2 = 38.96.

The absolute error in this computation is

Eabs = x̂ − x = 38.96 − 38.95651 = 0.00349,

while the relative error is

Erel = (x̂ − x)/x = (38.96 − 38.95651)/38.95651 ≈ 0.000089587.

We see that in this case, the relative error is smaller than the absolute error, because the exact
result is larger than 1, whereas in the previous operation, the relative error was larger in magnitude,
because the exact result is smaller than 1. □

Example 1.4.3 Suppose that the exact value of a computation is supposed to be 10^−16, and an
approximation of 2 × 10^−16 is obtained. Then the absolute error in this approximation is

Eabs = 2 × 10^−16 − 10^−16 = 10^−16,

which suggests the computation is accurate because this error is small. However, the relative error
is

Erel = (2 × 10^−16 − 10^−16)/10^−16 = 1,

which suggests that the computation is completely erroneous, because by this measure, the error is
equal in magnitude to the exact value; that is, the error is 100%. It follows that an approximation of
zero would be just as accurate. This example, although an extreme case, illustrates why the absolute
error can be a misleading measure of error. □

1.4.3 Forward and Backward Error


Suppose that we compute an approximation ŷ = fˆ(x) of the value y = f (x) for a given function
f and given problem data x. Before we can analyze the accuracy of this approximation, we must
have a precisely defined notion of error in such an approximation. We now provide this precise
definition.
Definition 1.4.4 (Forward Error) Let x be a real number and let f : R → R be a
function. If ŷ is a real number that is an approximation to y = f (x), then the forward
error in ŷ is the difference ∆y = ŷ − y. If y ≠ 0, then the relative forward error in
ŷ is defined by

∆y/y = (ŷ − y)/y.

Clearly, our primary goal in error analysis is to obtain an estimate of the forward error ∆y. Un-
fortunately, it can be difficult to obtain this estimate directly.
An alternative approach is to instead view the computed value ŷ as the exact solution of a
problem with modified data; i.e., ŷ = f (x̂) where x̂ is a perturbation of x.

Definition 1.4.5 (Backward Error) Let x be a real number and let f : R → R be


a function. Suppose that the real number ŷ is an approximation to y = f (x), and that
ŷ is in the range of f ; that is, ŷ = f (x̂) for some real number x̂. Then, the quantity
∆x = x̂ − x is the backward error in ŷ. If x ≠ 0, then the relative backward error
in ŷ is defined by

∆x/x = (x̂ − x)/x.
The process of estimating ∆x is known as backward error analysis. As we will see, this estimate of
the backward error, in conjunction with knowledge of f , can be used to estimate the forward error.
As will be discussed in Section 1.5, floating-point arithmetic does not follow the laws of real
arithmetic. This tends to make forward error analysis difficult. In backward error analysis, however,
real arithmetic is employed, since it is assumed that the computed result is the exact solution to a
modified problem. This is one reason why backward error analysis is sometimes preferred.

Exercise 1.4.2 Let x0 = 1, and f (x) = ex . If the magnitude of the forward error
in computing f (x0 ), given by |fˆ(x0 ) − f (x0 )|, is 0.01, then determine a bound on the
magnitude of the backward error.

Exercise 1.4.3 For a general function f (x), explain when the magnitude of the forward
error is greater than, or less than, that of the backward error. Assume f is differentiable
near x and use calculus to explain your reasoning.

1.4.4 Conditioning and Stability


In most cases, the goal of error analysis is to obtain an estimate of the forward relative error
(f (x̂) − f (x))/f (x), but it is often easier to instead estimate the relative backward error (x̂ − x)/x.
Therefore, it is necessary to be able to estimate the forward error in terms of the backward error.
The following definition addresses this need.

Definition 1.4.6 (Condition Number) Let x be a real number and let f : R → R


be a function. The absolute condition number, denoted by κabs , is the ratio of the
magnitude of the forward error to the magnitude of the backward error,

κabs = |f(x̂) − f(x)| / |x̂ − x|.

If f(x) ≠ 0, then the relative condition number of the problem of computing y = f (x),
denoted by κrel , is the ratio of the magnitude of the relative forward error to the magnitude
of the relative backward error,

κrel = [ |(f(x̂) − f(x))/f(x)| ] / [ |(x̂ − x)/x| ] = |∆y/y| / |∆x/x|.

Intuitively, either condition number is a measure of the change in the solution due to a change in
the data. Since the relative condition number tends to be a more reliable measure of this change,
it is sometimes referred to as simply the condition number.
If the condition number is large, e.g. much greater than 1, then a small change in the data can
cause a disproportionately large change in the solution, and the problem is said to be ill-conditioned
or sensitive. If the condition number is small, then the problem is said to be well-conditioned or
insensitive.
Since the condition number, as defined above, depends on knowledge of the exact solution f (x),
it is necessary to estimate the condition number in order to estimate the relative forward error. To
that end, we assume, for simplicity, that f : R → R is differentiable and obtain

κrel = |x∆y| / |y∆x|
     = |x(f(x + ∆x) − f(x))| / |f(x)∆x|
     ≈ |x f′(x)∆x| / |f(x)∆x|
     = |x f′(x)/f(x)|.

Therefore, if we can estimate the backward error ∆x, and if we can bound f and f′ near x, we can
then bound the relative condition number and obtain an estimate of the relative forward error. Of
course, the relative condition number is undefined if the exact value f(x) is zero. In this case, we
can instead use the absolute condition number. Using the same approach as before, the absolute
condition number can be estimated using the derivative of f. Specifically, we have κabs ≈ |f′(x)|.
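As a brief illustration of this estimate (an example of ours, using f(x) = √x): since f′(x) = 1/(2√x), we have κrel ≈ |x · (1/(2√x))/√x| = 1/2 for any x > 0, so computing a square root is a well-conditioned problem. The same estimate can be checked numerically in Matlab:

f=@(x) sqrt(x);
fprime=@(x) 1./(2*sqrt(x));
x=2;
kappa=abs(x*fprime(x)/f(x))    % displays 0.5, independent of the chosen x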

Exercise 1.4.4 Let f(x) = e^x, g(x) = e^−x, and x0 = 2. Suppose that the relative
backward error in x0 satisfies |∆x0/x0| = |x̂0 − x0|/|x0| ≤ 10^−2. What is an upper
bound on the relative forward error in f(x0) and g(x0)? Use Matlab or a calculator to
experimentally confirm that this bound is valid, by evaluating f (x) and g(x) at selected
points and comparing values.
The condition number of a function f depends on, among other things, the absolute forward
error f (x̂) − f (x). However, an algorithm for evaluating f (x) actually evaluates a function fˆ
that approximates f , producing an approximation ŷ = fˆ(x) to the exact solution y = f (x). In
our definition of backward error, we have assumed that fˆ(x) = f (x̂) for some x̂ that is close
to x; i.e., our approximate solution to the original problem is the exact solution to a “nearby”
problem. This assumption has allowed us to define the condition number of f independently of any
approximation fˆ. This independence is necessary, because the sensitivity of a problem depends
solely on the problem itself and not any algorithm that may be used to approximately solve it.
Is it always reasonable to assume that any approximate solution is the exact solution to a
nearby problem? Unfortunately, it is not. It is possible that an algorithm that yields an accurate
approximation for given data may be unreasonably sensitive to perturbations in that data. This
leads to the concept of a stable algorithm: an algorithm applied to a given problem with given data
x is said to be stable if it computes an approximate solution that is the exact solution to the same
problem with data x̂, where x̂ is a small perturbation of x.

It can be shown that if a problem is well-conditioned, and if we have a stable algorithm for
solving it, then the computed solution can be considered accurate, in the sense that the relative
error in the computed solution is small. On the other hand, a stable algorithm applied to an
ill-conditioned problem cannot be expected to produce an accurate solution.

Example 1.4.7 This example will illustrate the last point made above. To solve a system of linear
equations Ax = b in Matlab, we can use the \ operator:

x = A\b

Enter the following matrix and column vectors in Matlab, as shown. Recall that a semicolon (;)
separates rows.

>> A=[ 0.6169 0.4798; 0.4925 0.3830 ]


>> b1=[ 0.7815; 0.6239 ]
>> b2=[ 0.7753; 0.6317 ]

Then, solve the systems A*x1 = b1 and A*x2 = b2. Note that b1 and b2 are not very different,
but what about the solutions x1 and x2? The algorithm implemented by the \ operator is stable, but
what can be said about the conditioning of the problem of solving Ax = b for this matrix A? The
conditioning of systems of linear equations will be studied in depth in Chapter 2. □

Exercise 1.4.5 Let f (x) be a function that is one-to-one. Then, solving the equation
f(x) = c for some c in the range of f is equivalent to computing x = f^(−1)(c). What is the
condition number of the problem of solving f (x) = c?

1.4.5 Convergence
Many algorithms in numerical analysis are iterative methods that produce a sequence {αn } of ap-
proximate solutions which, ideally, converges to a limit α that is the exact solution as n approaches
∞. Because we can only perform a finite number of iterations, we cannot obtain the exact solution,
and we have introduced computational error.
If our iterative method is properly designed, then this computational error will approach zero
as n approaches ∞. However, it is important that we obtain a sufficiently accurate approximate
solution using as few computations as possible. Therefore, it is not practical to simply perform
enough iterations so that the computational error is determined to be sufficiently small, because it
is possible that another method may yield comparable accuracy with less computational effort.
The total computational effort of an iterative method depends on both the effort per iteration
and the number of iterations performed. Therefore, in order to determine the amount of compu-
tation that is needed to attain a given accuracy, we must be able to measure the error in αn as
a function of n. The more rapidly this function approaches zero as n approaches ∞, the more
rapidly the sequence of approximations {αn } converges to the exact solution α, and as a result,
fewer iterations are needed to achieve a desired accuracy. We now introduce some terminology that
will aid in the discussion of the convergence behavior of iterative methods.

Definition 1.4.8 (Big-O Notation) Let f and g be two functions defined on a domain
D ⊆ R that is not bounded above. We write that f (n) = O(g(n)) if there exists a positive
constant c such that
|f (n)| ≤ c|g(n)|, n ≥ n0 ,
for some n0 ∈ D.
As sequences are functions defined on N, the domain of the natural numbers, we can apply
big-O notation to sequences. Therefore, this notation is useful to describe the rate at which a
sequence of computations converges to a limit.

Definition 1.4.9 (Rate of Convergence) Let {αn} and {βn} be sequences that
satisfy

lim_{n→∞} αn = α,    lim_{n→∞} βn = 0,

where α is a real number. We say that {αn} converges to α with rate of convergence
O(βn) if αn − α = O(βn).

We say that an iterative method converges rapidly, in some sense, if it produces a sequence
of approximate solutions whose rate of convergence is O(βn ), where the terms of the sequence
βn approach zero rapidly as n approaches ∞. Intuitively, if two iterative methods for solving the
same problem perform a comparable amount of computation during each iteration, but one method
exhibits a faster rate of convergence, then that method should be used because it will require less
overall computational effort to obtain an approximate solution that is sufficiently accurate.

Example 1.4.10 Consider the sequence {αn} defined by

αn = (n + 1)/(n + 2),    n = 1, 2, . . . .

Then, we have

lim_{n→∞} αn = lim_{n→∞} [(n + 1)/(n + 2)] · [(1/n)/(1/n)]
            = lim_{n→∞} (1 + 1/n)/(1 + 2/n)
            = [lim_{n→∞}(1 + 1/n)] / [lim_{n→∞}(1 + 2/n)]
            = [1 + lim_{n→∞} 1/n] / [1 + lim_{n→∞} 2/n]
            = (1 + 0)/(1 + 0)
            = 1.

That is, the sequence {αn} converges to α = 1. To determine the rate of convergence, we note that

αn − α = (n + 1)/(n + 2) − 1 = (n + 1)/(n + 2) − (n + 2)/(n + 2) = −1/(n + 2),

and since

|−1/(n + 2)| ≤ 1/n

for any positive integer n, it follows that

αn = α + O(1/n).

On the other hand, consider the sequence {αn} defined by

αn = (2n^2 + 4n)/(n^2 + 2n + 1),    n = 1, 2, . . . .

Then, we have

lim_{n→∞} αn = lim_{n→∞} [(2n^2 + 4n)/(n^2 + 2n + 1)] · [(1/n^2)/(1/n^2)]
            = lim_{n→∞} (2 + 4/n)/(1 + 2/n + 1/n^2)
            = [lim_{n→∞}(2 + 4/n)] / [lim_{n→∞}(1 + 2/n + 1/n^2)]
            = [2 + lim_{n→∞} 4/n] / [1 + lim_{n→∞}(2/n + 1/n^2)]
            = 2.

That is, the sequence {αn} converges to α = 2. To determine the rate of convergence, we note that

αn − α = (2n^2 + 4n)/(n^2 + 2n + 1) − 2
       = (2n^2 + 4n)/(n^2 + 2n + 1) − (2n^2 + 4n + 2)/(n^2 + 2n + 1)
       = −2/(n^2 + 2n + 1),

and since

|−2/(n^2 + 2n + 1)| = 2/(n + 1)^2 ≤ 2/n^2

for any positive integer n, it follows that

αn = α + O(1/n^2). □

We can also use big-O notation to describe the rate of convergence of a function to a limit.

Example 1.4.11 Consider the function f(h) = 1 + 2h. Since this function is continuous for all
h, we have

lim_{h→0} f(h) = f(0) = 1.

It follows that

f(h) − f0 = (1 + 2h) − 1 = 2h = O(h),

so we can conclude that as h → 0, 1 + 2h converges to 1 of order O(h). □

Example 1.4.12 Consider the function f(h) = 1 + 4h + 2h^2. Since this function is continuous for
all h, we have

lim_{h→0} f(h) = f(0) = 1.

It follows that

f(h) − f0 = (1 + 4h + 2h^2) − 1 = 4h + 2h^2.

To determine the rate of convergence as h → 0, we consider h in the interval [−1, 1]. In this
interval, |h^2| ≤ |h|. It follows that

|4h + 2h^2| ≤ |4h| + |2h^2| ≤ |4h| + |2h| ≤ 6|h|.

Since there exists a constant C (namely, 6) such that |4h + 2h^2| ≤ C|h| for h satisfying |h| ≤ h0 for
some h0 (namely, 1), we can conclude that as h → 0, 1 + 4h + 2h^2 converges to 1 of order O(h). □

In general, when f (h) denotes an approximation that depends on h, and

f0 = lim f (h)
h→0

denotes the exact value, f (h) − f0 represents the absolute error in the approximation f (h). When
this error is a polynomial in h, as in this example and the previous example, the rate of convergence
is O(h^k), where k is the smallest exponent of h in the error. This is because as h → 0, the
smallest power of h approaches zero more slowly than higher powers, thereby making the dominant
contribution to the error.
By contrast, when determining the rate of convergence of a sequence {αn } as n → ∞, the
highest power of n determines the rate of convergence. As powers of n are negative if convergence
occurs at all as n → ∞, and powers of h are positive if convergence occurs at all as h → 0, it can be
said that for either type of convergence, it is the exponent that is closest to zero that determines
the rate of convergence.

Example 1.4.13 Consider the function f(h) = cos h. Since this function is continuous for all h,
we have

lim_{h→0} f(h) = f(0) = 1.

Using Taylor’s Theorem, with center h0 = 0, we obtain

f(h) = f(0) + f′(0)h + [f″(ξ(h))/2] h^2,

where ξ(h) is between 0 and h. Substituting f(h) = cos h into the above, we obtain

cos h = 1 − (sin 0)h + [−cos ξ(h)/2] h^2,

or

cos h = 1 − [cos ξ(h)/2] h^2.

Because |cos x| ≤ 1 for all x, we have

|cos h − 1| = |[cos ξ(h)/2] h^2| ≤ (1/2) h^2,

so we can conclude that as h → 0, cos h converges to 1 of order O(h^2). □

Exercise 1.4.6 Determine the rate of convergence of

lim_{h→0} (e^h − h − (1/2)h^2) = 1.

Example 1.4.14 The following approximation to the derivative is based on the definition:

f′(x0) ≈ [f(x0 + h) − f(x0)] / h,    error = −(h/2) f″(ξ).

An alternative approximation is

f′(x0) ≈ [f(x0 + h) − f(x0 − h)] / (2h),    error = −(h^2/6) f‴(ξ).
Taylor series can be used to obtain the error terms in both cases.
Exercise 1.4.7 Use Taylor’s Theorem to derive the error terms in both formulas.

Exercise 1.4.8 Try both of these approximations with f(x) = sin x, x0 = 1, and h =
10^−1, 10^−2, 10^−3, and then h = 10^−14. What happens, and can you explain why?
If you can’t explain what happens for the smallest value of h, fortunately this will be addressed in
Section 1.5. □

1.5 Computer Arithmetic


In computing the solution to any mathematical problem, there are many sources of error that can
impair the accuracy of the computed solution. The study of these sources of error is called error
analysis, which will be discussed later in this section. First, we will focus on one type of error that
occurs in all computation, whether performed by hand or on a computer: roundoff error.
This error is due to the fact that in computation, real numbers can only be represented using
a finite number of digits. In general, it is not possible to represent real numbers exactly with this
limitation, and therefore they must be approximated by real numbers that can be represented using
a fixed number of digits, which is called the precision. Furthermore, as we shall see, arithmetic
operations applied to numbers that can be represented exactly using a given precision do not
necessarily produce a result that can be represented using the same precision. It follows that if a
fixed precision is used, then every arithmetic operation introduces error into a computation.
Given that scientific computations can have several sources of error, one would think that it
would be foolish to compound the problem by performing arithmetic using fixed precision. However,
using a fixed precision is actually far more practical than other options, and, as long as computations
are performed carefully, sufficient accuracy can still be achieved.

1.5.1 Floating-Point Representation


We now describe a typical system for representing real numbers on a computer.

Definition 1.5.1 (Floating-point Number System) Given integers β > 1, p ≥ 1,


L, and U ≥ L, a floating-point number system F is defined to be the set of all real
numbers of the form
x = ±m β^E.

The number m is the mantissa of x, and has the form

m = Σ_{j=0}^{p−1} d_j β^(−j),

where each digit d_j, j = 0, . . . , p − 1, is an integer satisfying 0 ≤ d_j ≤ β − 1. The number
E is called the exponent or the characteristic of x, and it is an integer satisfying
L ≤ E ≤ U. The integer p is called the precision of F, and β is called the base of F.

The term “floating-point” comes from the fact that as a number x ∈ F is multiplied by or divided by
a power of β, the mantissa does not change, only the exponent. As a result, the decimal point shifts,
or “floats,” to account for the changing exponent. Nearly all computers use a binary floating-point
system, in which β = 2.

Example 1.5.2 Let x = −117. Then, in a floating-point number system with base β = 10, x is
represented as

x = −(1.17) × 10^2,

where 1.17 is the mantissa and 2 is the exponent. If the base β = 2, then we have

x = −(1.110101) × 2^6,

where 1.110101 is the mantissa and 6 is the exponent. The mantissa should be interpreted as a
string of binary digits, rather than decimal digits; that is,

1.110101 = 1 · 2^0 + 1 · 2^(−1) + 1 · 2^(−2) + 0 · 2^(−3) + 1 · 2^(−4) + 0 · 2^(−5) + 1 · 2^(−6)
         = 1 + 1/2 + 1/4 + 1/16 + 1/64
         = 117/64
         = 117/2^6.

□

Exercise 1.5.1 Express x = π as accurately as possible in a floating-point number system


with base β = 2 and precision p = 12.

1.5.1.1 Overflow and Underflow


A floating-point system F can only represent a finite subset of the real numbers. As such, it is
important to know how large in magnitude a number can be and still be represented, at least
approximately, by a number in F. Similarly, it is important to know how small in magnitude a
number can be and still be represented by a nonzero number in F; if its magnitude is too small,
then it is most accurately represented by zero.

Definition 1.5.3 (Underflow, Overflow) Let F be a floating-point number system.


The smallest positive number in F is called the underflow level, and it has the value

UFL = m_min β^L,

where L is the smallest valid exponent and m_min is the smallest mantissa. The largest
positive number in F is called the overflow level, and it has the value

OFL = β^(U+1) (1 − β^(−p)).

The value of m_min depends on whether floating-point numbers are normalized in F; this point
will be discussed later. The overflow level is the value obtained by setting each digit in the mantissa
to β − 1 and using the largest possible value, U , for the exponent.
Exercise 1.5.2 Determine the value of OFL for a floating-point system with base β = 2,
precision p = 53, and largest exponent U = 1023.
It is important to note that the real numbers that can be represented in F are not equally spaced
along the real line. Numbers having the same exponent are equally spaced, and the spacing between
numbers in F decreases as their magnitude decreases.

1.5.1.2 Normalization
It is common to normalize floating-point numbers by specifying that the leading digit d0 of the
mantissa be nonzero. In a binary system, with β = 2, this implies that the leading digit is equal
to 1, and therefore need not be stored. In addition to the benefit of gaining one additional bit of
precision, normalization also ensures that each floating-point number has a unique representation.
One drawback of normalization is that fewer numbers near zero can be represented exactly than
if normalization is not used. One workaround is a practice called gradual underflow, in which the
leading digit of the mantissa is allowed to be zero when the exponent is equal to L, thus allowing
smaller values of the mantissa. In such a system, the number UFL is equal to β^(L−p+1), whereas in
a normalized system, UFL = β^L.
Exercise 1.5.3 Determine the value of UFL for a floating-point system with base β = 2,
precision p = 53, and smallest exponent L = −1022, both with and without normalization.

1.5.1.3 Rounding
A number that can be represented exactly in a floating-point system is called a machine number.
Since only finitely many real numbers are machine numbers, it is necessary to determine how non-
machine numbers are to be approximated by machine numbers. The process of choosing a machine
number to approximate a non-machine number is called rounding, and the error introduced by such
an approximation is called roundoff error. Given a real number x, the machine number obtained
by rounding x is denoted by fl(x).
In most floating-point systems, rounding is achieved by one of two strategies:

• chopping, or rounding to zero, is the simplest strategy, in which the base-β expansion of a
number is truncated after the first p digits. As a result, fl(x) is the unique machine number
between 0 and x that is nearest to x.

• rounding to nearest sets fl(x) to be the machine number that is closest to x in absolute value;
if two numbers satisfy this property, then an appropriate tie-breaking rule must be used, such
as setting fl(x) equal to the choice whose last digit is even.

Example 1.5.4 Suppose we are using a floating-point system with β = 10 (decimal), with p = 4
significant digits. Then, if we use chopping, or rounding to zero, we have fl(2/3) = 0.6666, whereas
if we use rounding to nearest, then we have fl(2/3) = 0.6667. □

Example 1.5.5 When rounding to even in decimal, 88.5 is rounded to 88, not 89, so that the last
digit is even, while 89.5 is rounded to 90, again to make the last digit even. □

1.5.1.4 Machine Precision


In error analysis, it is necessary to estimate error incurred in each step of a computation. As such,
it is desirable to know an upper bound for the relative error introduced by rounding. This leads to
the following definition.

Definition 1.5.6 (Machine Precision) Let F be a floating-point number system. The


unit roundoff or machine precision, denoted by u, is the real number that satisfies

|(fl(x) − x)/x| ≤ u

for any real number x such that UFL < x < OFL.
An intuitive definition of u is that it is the smallest positive number such that

fl (1 + u) > 1.

The value of u depends on the rounding strategy that is used. If rounding toward zero is used,
then u = β^(1−p), whereas if rounding to nearest is used, u = (1/2) β^(1−p).
It is important to avoid confusing u with the underflow level UFL. The unit roundoff is deter-
mined by the number of digits in the mantissa, whereas the underflow level is determined by the
range of allowed exponents. However, we do have the relation that 0 < UFL < u.
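A small experiment (our own, not from the text) illustrates the intuitive definition of u for IEEE double precision, where u = 2^−53 under rounding to nearest. Note that Matlab’s predefined variable eps equals 2^−52, the spacing between 1 and the next larger machine number, which is twice this value of u.

u=2^(-53);             % unit roundoff for IEEE double precision
disp(1+2*u>1)          % displays 1 (true): 1+2u is a machine number larger than 1
disp(1+u/2>1)          % displays 0 (false): 1+u/2 rounds back to exactly 1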
In analysis of roundoff error, it is assumed that fl(x op y) = (x op y)(1 + δ), where op is an
arithmetic operation and δ is an unknown constant satisfying |δ| ≤ u. From this assumption, it
can be seen that the relative error in fl(x op y) is |δ|. In the case of addition, the relative backward
error in each operand is also |δ|.

1.5.1.5 The IEEE Floating-Point Standard

In fact, most computers conform to the IEEE standard for floating-point arithmetic. The standard
specifies, among other things, how floating-point numbers are to be represented in memory. Two
representations are given, one for single-precision and one for double-precision.
Under the standard, single-precision floating-point numbers occupy 4 bytes in memory, with
23 bits used for the mantissa, 8 for the exponent, and one for the sign. IEEE double-precision
floating-point numbers occupy eight bytes in memory, with 52 bits used for the mantissa, 11 for
the exponent, and one for the sign. That is, in the IEEE floating-point standard, p = 24 for single
precision, and p = 53 for double precision, even though only 23 and 52 bits, respectively, are used
to store mantissas.
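In Matlab, which uses double precision by default, the corresponding parameters of both formats can be queried directly (an illustrative check; eps and realmin accept a class name as an argument):

    eps('double')      % 2^(-52): spacing between 1 and the next double
    eps('single')      % 2^(-23): spacing between 1 and the next single
    realmin('double')  % 2^(-1022): smallest normalized positive double
    realmin('single')  % 2^(-126):  smallest normalized positive single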

Example 1.5.7 The following table summarizes the main aspects of a general floating-point system
and a double-precision floating-point system that uses a 52-bit mantissa and 11-bit exponent. For
both systems, we assume that rounding to nearest is used, and that normalization is used. 2

                             General                      Double Precision

Form of machine number       ±m × β^E                     ±1.d1 d2 · · · d52 × 2^E
Precision                    p                            53
Exponent range               L ≤ E ≤ U                    −1023 ≤ E ≤ 1024
UFL (Underflow Level)        β^L                          2^(−1023)
OFL (Overflow Level)         β^(U+1) (1 − β^(−p))         2^1025 (1 − 2^(−53))
Unit roundoff u              (1/2) β^(1−p)                2^(−53)

Exercise 1.5.4 Are the values for UFL and OFL given in the table above the actual
values used in the IEEE double-precision floating point system? Experiment with powers
of 2 in Matlab to find out. What are the largest and smallest positive numbers you can
represent? Can you explain any discrepancies between these values and the ones in the
table?

1.5.2 Issues with Floating-Point Arithmetic


We now discuss the various issues that arise when performing floating-point arithmetic, or finite-
precision arithmetic, which approximates arithmetic operations on real numbers.

1.5.3 Loss of Precision


When adding or subtracting floating-point numbers, it is necessary to shift one of the operands
so that both operands have the same exponent, before adding or subtracting the mantissas. As a
result, digits of precision are lost in the operand that is smaller in magnitude, and the result of the
operation cannot be represented using a machine number. In fact, if x is the smaller operand and
y is the larger operand, and |x| < |y|u, then the result of the operation will simply be y (or −y, if
y is to be subtracted from x), since the entire value of x is lost in rounding the result.
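This absorption of the smaller operand is easy to observe in Matlab (an illustrative check, not from the text); here |x| < |y|u, so x is lost entirely:

    x = 1;  y = 1e20;
    (y + x) == y      % returns true: adding x leaves y unchanged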

Exercise 1.5.5 Consider the evaluation of the summation


   x1 + x2 + · · · + xn ,

where each term xi is positive. Will the sum be computed more accurately in floating-
point arithmetic if the numbers are added in order from smallest to largest, or largest to
smallest? Justify your answer.
In multiplication or division, the operands need not be shifted, but the mantissas, when mul-
tiplied or divided, cannot necessarily be represented using only p digits of precision. The product
of two mantissas requires 2p digits to be represented exactly, while the quotient of two mantissas
could conceivably require infinitely many digits.

1.5.3.1 Violation of Arithmetic Rules

Because floating-point arithmetic operations are not exact, they do not follow all of the laws of real
arithmetic. In particular, floating-point arithmetic is not associative; i.e., x + (y + z) ≠ (x + y) + z
in floating-point arithmetic.

Exercise 1.5.6 In Matlab, generate three random numbers x, y and z, and compute
x + (y + z) and (x + y) + z. Do they agree? Try this a few times with different random
numbers and observe what happens.

1.5.3.2 Overflow and Underflow

Furthermore, overflow or underflow may occur depending on the exponents of the operands, since
their sum or difference may lie outside of the interval [L, U ].
Exercise 1.5.7 Consider the formula z = √(x² + y²). Explain how overflow can occur in
computing z, even if x, y and z all have magnitudes that can be represented. How can
this formula be rewritten so that overflow does not occur?

1.5.3.3 Cancellation

Subtraction of floating-point numbers presents a unique difficulty, in addition to the rounding error
previously discussed. If the operands, after shifting exponents as needed, have leading digits in
common, then these digits cancel and the first digit in which the operands do not match becomes
the leading digit. However, since each operand is represented using only p digits, it follows that
the result contains only p − m correct digits, where m is the number of leading digits that cancel.
In an extreme case, if the two operands differ by less than u, then the result contains no
correct digits; it consists entirely of roundoff error from previous computations. This phenomenon
is known as catastrophic cancellation. Because of the highly detrimental effect of this cancellation,
it is important to ensure that no steps in a computation compute small values from relatively large
operands. Often, computations can be rearranged to avoid this risky practice.

Example 1.5.8 When performing the subtraction


0.345769258233
−0.345769258174
0.000000000059
only two significant digits can be included in the result, even though each of the operands includes
twelve, because ten digits of precision are lost to catastrophic cancellation. 2

Example 1.5.9 Consider the quadratic equation


ax2 + bx + c = 0,
which has the solutions
   x1 = (−b + √(b² − 4ac)) / (2a),        x2 = (−b − √(b² − 4ac)) / (2a).
Suppose that b > 0. Then, in computing x1, we encounter catastrophic cancellation if b is much
larger than a and c, because this implies that √(b² − 4ac) ≈ b and as a result we are subtracting
two numbers that are nearly equal in computing the numerator. On the other hand, if b < 0, we
encounter this same difficulty in computing x2 .
Suppose that we use 4-digit rounding arithmetic to compute the roots of the equation
x² + 10,000x + 1 = 0.
Then, we obtain x1 = 0 and x2 = −10, 000. Clearly, x1 is incorrect because if we substitute
x = 0 into the equation then we obtain the contradiction 1 = 0. In fact, if we use 7-digit rounding
arithmetic then we obtain the same result. Only if we use at least 8 digits of precision do we obtain
roots that are reasonably correct,
x1 ≈ −1 × 10^(−4),        x2 ≈ −9.9999999 × 10^3.
A similar result is obtained if we use 4-digit rounding arithmetic but compute x1 using the
formula
   x1 = −2c / (b + √(b² − 4ac)),
which can be obtained by multiplying and dividing the original formula for x1 by the conjugate of the
numerator, −b − √(b² − 4ac). The resulting formula is not susceptible to catastrophic cancellation,
because an addition is performed instead of a subtraction. 2
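The same effect can be reproduced in double precision by choosing b large enough that b² dominates 4ac. In the following Matlab comparison (an illustrative sketch; the coefficients are chosen only to expose the cancellation, and are not from the text), the first formula typically loses most of its significant digits, while the second remains accurate to nearly full precision:

    a = 1;  b = 1e8;  c = 1;                          % true root x1 is approximately -1e-8
    x1_subtract  = (-b + sqrt(b^2 - 4*a*c)) / (2*a)   % subtraction of nearly equal numbers
    x1_conjugate = -2*c / (b + sqrt(b^2 - 4*a*c))     % addition; no cancellation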

Exercise 1.5.8 Use the Matlab function randn to generate 1000 normally distributed
random numbers with mean 1000 and standard deviation 0.1. Then, use these formulas
to compute the variance:

1. (1/n) ∑_{i=1}^{n} (xi − x̄)²

2. (1/n) ∑_{i=1}^{n} xi² − x̄²

where x̄ is the mean. How do the results differ? Which formula is more susceptible to
issues with floating-point arithmetic, and why?

Exercise 1.5.9 Recall Exercise 1.4.8, in which two approximations of the derivative were
tested using various values of the spacing h. In light of the discussion in this chapter,
explain the behavior for the case of h = 10^(−14).
Part II

Numerical Linear Algebra

Chapter 2

Methods for Systems of Linear


Equations

In this chapter we discuss methods for solving the problem Ax = b, where A is a square invertible
matrix of size n, x is an unknown n-vector, and b is a vector of size n. Solving a system of
linear equations comes up in a number of applications such as solving a linear system of ordinary
differential equations (ODEs).

2.1 Triangular Systems


The basic idea behind methods for solving a system of linear equations is to reduce it to linear
equations involving a single unknown, because such equations are trivial to solve. Systems made up
of such equations are called triangular systems of equations. There are three types of
triangular systems:

1. upper triangular, where aij = 0 for i > j,

2. diagonal, where aij = 0 for i 6= j,

3. and lower triangular, where aij = 0 for i < j.

2.1.1 Upper Triangular Systems


In this section we will look at how each system can be solved efficiently, and analyze the cost of
each method. First we will study the upper triangular system, which has the form
 
        [ a11  a12  a13  · · ·  a1n ]
        [  0   a22  a23  · · ·  a2n ]
   A =  [  .         .           .  ]  .
        [               .  .        ]
        [  0   · · ·        0   ann ]
To illustrate this process, let us take a look at a system of equations in upper triangular form where
n = 3.


a11 x1 + a12 x2 + a13 x3 = b1


a22 x2 + a23 x3 = b2
a33 x3 = b3

From the above system, it is easy to see that we can now simply solve

a33 x3 = b3
x3 = b3 /a33

Now that we have found x3 , we substitute it in to the previous equation to solve for the unknown
in that equation, which is

a22 x2 + a23 x3 = b2
x2 = (b2 − a23 x3 )/a22 .

As before, we now substitute x2 and x3 into the first equation, and again we have a linear
equation with only one unknown.

a11 x1 + a12 x2 + a13 x3 = b1


x1 = (b1 − a12 x2 − a13 x3 )/a11

The process we have just described is known as back substitution. Just by looking at the
above 3 × 3 case you can see a pattern emerging, and this pattern can be described in terms of an
algorithm that can be used to solve these types of triangular systems.
Exercise 2.1.1 (a) Write a Matlab function to solve the following triangular system:

2x + y − 3z = −10
−2y + z = −2
z = 6

(b) How many floating point operations does this function perform?

In the algorithm, we assume that U is the upper triangular matrix containing the coefficients
of the system, and y is the vector containing the right-hand sides of the equations.

for i = n, n − 1, . . . , 1 do
xi = yi
for j = i + 1, i + 2, . . . , n do
xi = xi − uij xj
end
xi = xi /uii
end

Upper triangular systems are the goal we are trying to achieve through Gaussian elimination, which
we discuss in the next section. If we take a look at the loops in the above algorithm, we can see
that there are two levels of nesting involved; therefore, it takes approximately n² floating
point operations to solve these systems.
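For reference, one possible Matlab realization of this pseudocode is the following sketch (the function name backsub is our own choice):

    function x = backsub(U, y)
    % BACKSUB  Solve the upper triangular system U*x = y by back substitution.
    n = length(y);
    x = zeros(n, 1);
    for i = n:-1:1
        x(i) = y(i);
        for j = i+1:n
            x(i) = x(i) - U(i,j)*x(j);
        end
        x(i) = x(i)/U(i,i);
    end
    end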

2.1.2 Diagonal Systems


Now we switch our focus to a system that is even simpler to solve. This system has the form
 
        [ a11   0    0   · · ·   0  ]
        [  0   a22   0   · · ·   0  ]
   A =  [  0    0   a33  · · ·   0  ]  .
        [  .    .    .    .      .  ]
        [  0    0    0   · · ·  ann ]

From looking at the augmented matrix, [A b],


 
              [ a11   0    0   · · ·   0  |  b1 ]
              [  0   a22   0   · · ·   0  |  b2 ]
   [ A  b ] = [  0    0   a33  · · ·   0  |  b3 ]
              [  .    .    .    .      .  |  .  ]
              [  0    0    0   · · ·  ann |  bn ]

it can be seen that in the system, each linear equation only has one unknown, so we can solve each
equation independently of the other equations. It does not matter which equation we start with in
solving this system, since they can all be solved independently of each other. But for the purpose
of this book, we will start at the top. The equations we need to solve are as follows:

a11 x1 = b1
a22 x2 = b2
a33 x3 = b3
..
.
ann xn = bn .

From the above we can see that each component of the solution is given by xi = bi /aii .
Exercise 2.1.2 (a) Write a Matlab function to solve the following diagonal system
of equations:

3x = 4
2y = 8
7z = 21

(b) How many floating point operations does this function perform?

We can also look at this in terms of an algorithm:

for i = 1, 2, . . . , n do
xi = bi /aii
end

We can see that the above algorithm has only one loop; therefore, the number of
floating point operations required to solve a system of this type is only n.
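In Matlab, this entire algorithm collapses to a single elementwise division (a quick sketch, assuming the diagonal matrix A is stored in full):

    x = b ./ diag(A);   % divide each bi by the corresponding diagonal entry aii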

2.1.3 Lower Triangular Systems


The next type of system we will look at is the lower triangular system, which has the form

        [ a11                            0  ]
        [ a21  a22                          ]
   A =  [ a31  a32   a33                    ]  .
        [  .    .          .                ]
        [ an1  an2  · · ·  an,n−1      ann  ]
These types of systems can be solved similarly to the upper triangular systems using a method
called forward substitution. That will be discussed later in this chapter.

2.2 Gaussian Elimination


2.2.1 Row Operations
Now that we have looked at the simplest cases, we want to switch our focus to solving a general
square invertible system. An approach that we use is to first reduce our original system to one of
the types of systems discussed in Section 2.1, since we now know an efficient way to solve these
systems. Such a reduction is achieved by manipulating the equations in the system in such a way
that the solution does not change, but unknowns are eliminated from selected equations until,
finally, we obtain an equation involving only a single unknown. These manipulations are called
elementary row operations, and they are defined as follows:

• Multiplying both sides of an equation by a nonzero scalar (Ri → sRi )



• Reordering the equations by interchanging both sides of the ith and jth equation in the
system (Ri ↔ Rj )

• Replacing equation i by the sum of equation i and a multiple of both sides of equation j
(Ri → Ri + sRj )

Exercise 2.2.1 Prove that the following row operations do not change the solution set.

• Multiplying both sides of an equation by a nonzero scalar.

• Replacing equation i by the sum of equation i and a multiple of both sides of equation j.
The third operation is by far the most useful. We will now demonstrate how it can be used to
reduce a system of equations to a form in which it can easily be solved.
Example Consider the system of linear equations

x1 + 2x2 + x3 = 5,
3x1 + 2x2 + 4x3 = 17,
4x1 + 4x2 + 3x3 = 26.

First, we eliminate x1 from the second equation by subtracting 3 times the first equation from the
second. This yields the equivalent system

x1 + 2x2 + x3 = 5,
−4x2 + x3 = 2,
4x1 + 4x2 + 3x3 = 26.

Next, we subtract 4 times the first equation from the third, to eliminate x1 from the third equation
as well:

x1 + 2x2 + x3 = 5,
−4x2 + x3 = 2,
−4x2 − x3 = 6.

Then, we eliminate x2 from the third equation by subtracting the second equation from it, which
yields the system

x1 + 2x2 + x3 = 5,
−4x2 + x3 = 2,
−2x3 = 4.

This system is in upper-triangular form, because the third equation depends only on x3 , and the
second equation depends on x2 and x3 .
Because the third equation is a linear equation in x3 , it can easily be solved to obtain x3 = −2.

x1 + 2x2 + x3 = 5,                    x1 + 2x2 + x3 = 5,
     −4x2 + x3 = 2,           →            −4x2 + x3 = 2,
          −2x3 = 4.                              x3 = −2.

Then, we can substitute this value into the second equation, which yields −4x2 = 4.

x1 + 2x2 + x3 = 5,                    x1 + 2x2 + x3 = 5,
  −4x2 + (−2) = 2,            →                  x2 = −1,
           x3 = −2.                              x3 = −2.

This equation only depends on x2 , so we can easily solve it to obtain x2 = −1. Finally, we substitute
the values of x2 and x3 into the first equation to obtain x1 = 9.

x1 + 2(−1) + (−2) = 5,        →                  x1 = 9,
               x2 = −1,                          x2 = −1,
               x3 = −2.                          x3 = −2.
This process of computing the unknowns from a system that is in upper-triangular form is called
back substitution. 2
Exercise 2.2.2 Write a Matlab function to solve the following upper triangular system
using back substitution.

3x1 + 2x2 + x3 − x4 = 0
−x2 + 2x3 + x4 = 0
x3 + x4 = 1
−x4 = 2

Your function should return the solution, and should take the corresponding matrix and
right hand side vector as input.
In general, a system of n linear equations in n unknowns is in upper-triangular form if the ith
equation depends only on the unknowns xj for i ≤ j ≤ n.
Now, performing row operations on the system Ax = b can be accomplished by performing
them on the augmented matrix
 
              [ a11  a12  · · ·  a1n  |  b1 ]
              [ a21  a22  · · ·  a2n  |  b2 ]
   [ A  b ] = [  .    .           .   |  .  ]  .
              [ an1  an2  · · ·  ann  |  bn ]

By working with the augmented matrix instead of the original system, there is no need to continually
rewrite the unknowns or arithmetic operators. Once the augmented matrix is reduced to upper
triangular form, the corresponding system of linear equations can be solved by back substitution,
as before.

The process of eliminating variables from the equations, or, equivalently, zeroing entries of
the corresponding matrix, in order to reduce the system to upper-triangular form is called Gaus-
sian elimination. We will now step through an example as we discuss the steps of the Gaussian
Elimination algorithm. The algorithm is as follows:

for j = 1, 2, . . . , n − 1 do
for i = j + 1, j + 2, . . . , n do
mij = aij /ajj
for k = j + 1, j + 2, . . . , n do
aik = aik − mij ajk
end
bi = bi − mij bj
end
end

This algorithm requires approximately (2/3)n³ arithmetic operations, so it can be quite expensive if n


is large. Later, we will discuss alternative approaches that are more efficient for certain kinds of
systems, but Gaussian elimination remains the most generally applicable method of solving systems
of linear equations.
The number mij is called a multiplier. It is the number by which row j is multiplied before
adding it to row i, in order to eliminate the unknown xj from the ith equation. Note that this
algorithm is applied to the augmented matrix, as the elements of the vector b are updated by the
row operations as well.
It should be noted that in the above description of Gaussian elimination, each entry below
the main diagonal is never explicitly zeroed, because that computation is unnecessary. It is only
necessary to update entries of the matrix that are involved in subsequent row operations or the
solution of the resulting upper triangular system. We will see that when solving systems of equations
in which the right-hand side vector b is changing, but the coefficient matrix A remains fixed, it is
quite practical to apply Gaussian elimination to A only once, and then repeatedly apply it to each
b, along with back substitution, because the latter two steps are much less expensive.
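Before working through an example, we give a minimal Matlab sketch of the elimination algorithm above (the function name gausselim is our own choice); as noted, it never explicitly zeros the subdiagonal entries:

    function [A, b] = gausselim(A, b)
    % GAUSSELIM  Reduce the system Ax = b to upper triangular form (no pivoting).
    % The upper triangular part of the returned A, together with the returned b,
    % can then be passed to back substitution.
    n = length(b);
    for j = 1:n-1
        for i = j+1:n
            m = A(i,j)/A(j,j);                       % multiplier m_ij
            A(i,j+1:n) = A(i,j+1:n) - m*A(j,j+1:n);  % update remainder of row i
            b(i) = b(i) - m*b(j);                    % update right-hand side
        end
    end
    end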
We now illustrate the use of both these algorithms with an example.
Example Consider the system of linear equations

x1 + 2x2 + x3 − x4 = 5
3x1 + 2x2 + 4x3 + 4x4 = 16
4x1 + 4x2 + 3x3 + 4x4 = 22
2x1 + x3 + 5x4 = 15.

This system can be represented by the coefficient matrix A and right-hand side vector b, as follows:
   
        [ 1  2  1  −1 ]            [  5 ]
        [ 3  2  4   4 ]            [ 16 ]
   A =  [ 4  4  3   4 ]  ,   b =   [ 22 ]  .
        [ 2  0  1   5 ]            [ 15 ]

To perform row operations to reduce this system to upper triangular form, we define the augmented
matrix

                    [ 1  2  1  −1 |  5 ]
                    [ 3  2  4   4 | 16 ]
   Ã = [ A  b ]  =  [ 4  4  3   4 | 22 ]  .
                    [ 2  0  1   5 | 15 ]
We first define Ã(1) = Ã to be the original augmented matrix. Then, we denote by Ã(2) the result of
the first elementary row operation, which entails subtracting 3 times the first row from the second
in order to eliminate x1 from the second equation:
 
          [ 1   2  1  −1 |  5 ]
          [ 0  −4  1   7 |  1 ]
   Ã(2) = [ 4   4  3   4 | 22 ]  .
          [ 2   0  1   5 | 15 ]
Next, we eliminate x1 from the third equation by subtracting 4 times the first row from the
third:

          [ 1   2   1  −1 |  5 ]
          [ 0  −4   1   7 |  1 ]
   Ã(3) = [ 0  −4  −1   8 |  2 ]  .
          [ 2   0   1   5 | 15 ]
Then, we complete the elimination of x1 by subtracting 2 times the first row from the fourth:
 
          [ 1   2   1  −1 | 5 ]
          [ 0  −4   1   7 | 1 ]
   Ã(4) = [ 0  −4  −1   8 | 2 ]  .
          [ 0  −4  −1   7 | 5 ]
We now need to eliminate x2 from the third and fourth equations. This is accomplished by sub-
tracting the second row from the third, which yields
 
          [ 1   2   1  −1 | 5 ]
          [ 0  −4   1   7 | 1 ]
   Ã(5) = [ 0   0  −2   1 | 1 ]  ,
          [ 0  −4  −1   7 | 5 ]
and the fourth, which yields
 
          [ 1   2   1  −1 | 5 ]
          [ 0  −4   1   7 | 1 ]
   Ã(6) = [ 0   0  −2   1 | 1 ]  .
          [ 0   0  −2   0 | 4 ]
Finally, we subtract the third row from the fourth to obtain the augmented matrix of an upper-
triangular system,

          [ 1   2   1  −1 |  5 ]
          [ 0  −4   1   7 |  1 ]
   Ã(7) = [ 0   0  −2   1 |  1 ]  .
          [ 0   0   0  −1 |  3 ]

Note that in a matrix for such a system, all entries below the main diagonal (the entries where the
row index is equal to the column index) are equal to zero. That is, aij = 0 for i > j.
From this, we see that we need to examine all columns j = 1, 2, . . . , n − 1, and within each column j,
the rows i = j + 1, j + 2, . . . , n, which is exactly the structure of the algorithm given above.
Now, we can perform back substitution on the corresponding system,

x1 + 2x2 + x3 − x4 = 5,
−4x2 + x3 + 7x4 = 1,
−2x3 + x4 = 1,
−x4 = 3,

to obtain the solution, which yields x4 = −3, x3 = −2, x2 = −6, and x1 = 16. 2

2.2.2 The LU Factorization


We have learned how to solve a system
 of linear equations Ax = b by applying Gaussian elimination
to the augmented matrix à = A b , and then performing back substitution on the resulting
upper-triangular matrix. However, this approach is not practical if the right-hand side b of the
system is changed, while A is not. This is due to the fact that the choice of b has no effect on the
row operations needed to reduce A to upper-triangular form. Therefore, it is desirable to instead
apply these row operations to A only once, and then “store” them in some way in order to apply
them to any number of right-hand sides.
To accomplish this, we first note that subtracting mij times row j from row i to eliminate aij
is equivalent to multiplying A by the matrix
 
         [ 1                             ]
         [    . .                        ]
         [         1                     ]
   Mij = [            . .                ]  ,
         [   −mij          1             ]
         [                     . .       ]
         [                          1    ]

where the entry −mij is in row i, column j. Each such matrix Mij is an example of an elementary
row matrix, which is a matrix that results from applying any elementary row operation to the
identity matrix.
More generally, if we let A(1) = A and let A(k+1) be the matrix obtained by eliminating elements
of column k in A(k) , then we have, for k = 1, 2, . . . , n − 1,

A(k+1) = M (k) A(k)



where

          [ 1                                    ]
          [    . .                               ]
          [         1                            ]
   M(k) = [       −mk+1,k    1                   ]  ,
          [          .           . .             ]
          [          .                 1         ]
          [       −mnk                      1    ]
with the elements −mk+1,k , . . . , −mnk occupying column k. It follows that the matrix

U = A(n) = M (n−1) A(n−1) = M (n−1) M (n−2) · · · M (1) A

is upper triangular, and the vector

y = M (n−1) M (n−2) · · · M (1) b,

being the result of applying the same row operations to b, is the right-hand side for the upper-
triangular system that is to be solved by back substitution.

Exercise 2.2.3 (a) Write a Matlab function that computes the matrix U by us-
ing the above description. Start in the first column, and accumulate all of the
multipliers in a matrix. Once you have done this for each n − 1 columns, you
can multiply them together in the manner described to accumulate the matrix
U = M (n−1) M (n−2) · · · M (1) A.

(b) Your function should store the values of mij in the appropriate entry of a matrix
we will call L. This matrix will be lower unit triangular, and is discussed in the
next section.

2.2.2.1 Unit Lower Triangular Matrices


We have previously learned about upper triangular matrices that result from Gaussian elimination.
Recall that an m × n matrix A is upper triangular if aij = 0 whenever i > j. This means that all
entries below the main diagonal, which consists of the entries a11 , a22 , . . ., are equal to zero. A
system of linear equations of the form U x = y, where U is an n × n nonsingular upper triangular
matrix, can be solved by back substitution.
Exercise 2.2.4 Prove that an upper triangular matrix, such as the matrix U described above, is
nonsingular if and only if all of its diagonal entries are nonzero.
Similarly, a matrix L is lower triangular if all of its entries above the main diagonal, that
is, entries `ij for which i < j, are equal to zero. We will see that a system of equations of the
form Ly = b, where L is an n × n nonsingular lower triangular matrix, can be solved using a
process similar to back substitution, called forward substitution. As with upper triangular matri-
ces, a lower triangular matrix is nonsingular if and only if all of its diagonal entries are nonzero.
Exercise 2.2.5 Prove the following useful properties of triangular matrices:

(a) The product of two upper/lower triangular matrices is upper/lower triangular.

(b) The inverse of a nonsingular upper/lower triangular matrix is upper/lower trian-


gular.
Now that you have proven these properties, we can say that matrix multiplication and inversion
preserve triangularity.
Now, we note that each matrix M (k) , k = 1, 2, . . . , n − 1, is not only a lower-triangular matrix,
but a unit lower triangular matrix, because all of its diagonal entries are equal to 1. Next, we note
two important properties of unit lower/upper triangular matrices:
• The product of two unit lower/upper triangular matrices is unit lower/upper triangular.

• A unit lower/upper triangular matrix is nonsingular, and its inverse is unit lower/upper
triangular.
In fact, the inverse of each M (k) is easily computed. We have
 
                        [ 1                                    ]
                        [    . .                               ]
                        [         1                            ]
   L(k) = [M(k)]^(−1) = [        mk+1,k    1                   ]
                        [          .           . .             ]
                        [          .                 1         ]
                        [        mnk                      1    ]

It follows that if we define M = M (n−1) · · · M (1) , then M is unit lower triangular, and M A = U ,
where U is upper triangular. It follows that A = M −1 U = LU , where

L = L(1) · · · L(n−1) = [M (1) ]−1 · · · [M (n−1) ]−1

is also unit lower triangular. Furthermore, from the structure of each matrix L(k) , it can readily
be determined that  
        [  1                            0 ]
        [ m21    1                        ]
   L =  [ m31   m32    1                  ]  .
        [  .     .          . .           ]
        [ mn1   mn2   · · ·  mn,n−1    1  ]

That is, L stores all of the multipliers used during Gaussian elimination. The factorization of A
that we have obtained,
A = LU,
is called the LU decomposition, or LU factorization, of A.
Exercise 2.2.6 Write a Matlab function to compute the matrices L and U for a ran-
domly generated matrix. Check your accuracy by multiplying them together to see if you
get A = LU .

2.2.2.2 Solution of Ax = b
Once the LU decomposition A = LU has been computed, we can solve the system Ax = b by first
noting that if x is the solution, then
Ax = LU x = b.
Therefore, we can obtain x by first solving the system

Ly = b,

and then solving


U x = y.
Then, if b should change, then only these last two systems need to be solved in order to obtain the
new solution; the LU decomposition does not need to be recomputed.
The system U x = y can be solved by back substitution, since U is upper-triangular. To solve
Ly = b, we can use forward substitution, since L is unit lower triangular.

for i = 1, 2, . . . , n do
yi = bi
for j = 1, 2, . . . , i − 1 do
yi = yi − `ij yj
end
end

Exercise 2.2.7 Implement a function for forward substitution in Matlab. Try your
function on the following lower unit triangular matrix.

 
        [ 1  0  0  0 ]
        [ 2  1  0  0 ]
   A =  [ 3  3  1  0 ]
        [ 4  6  4  1 ]
Like back substitution, this algorithm requires O(n2 ) floating-point operations. Unlike back sub-
stitution, there is no division of the ith component of the solution by a diagonal element of the
matrix, but this is only because in this context, L is unit lower triangular, so `ii = 1. When
applying forward substitution to a general lower triangular matrix, such a division is required.

Example The matrix  


        [ 1  2  1  −1 ]
        [ 3  2  4   4 ]
   A =  [ 4  4  3   4 ]
        [ 2  0  1   5 ]
can be reduced to the upper-triangular matrix
 
        [ 1   2   1  −1 ]
        [ 0  −4   1   7 ]
   U =  [ 0   0  −2   1 ]
        [ 0   0   0  −1 ]
by performing the following row operations. These row operations are represented below in the
elementary row matrices E1 , . . . E6 .
• Subtracting three times the first row from the second

• Subtracting four times the first row from the third

• Subtracting two times the first row from the fourth

• Subtracting the second row from the third

• Subtracting the second row from the fourth

• Subtracting the third row from the fourth


   
        [  1  0  0  0 ]           [  1  0  0  0 ]
        [ −3  1  0  0 ]           [  0  1  0  0 ]
   E1 = [  0  0  1  0 ]  ,  E2 =  [ −4  0  1  0 ]  ,
        [  0  0  0  1 ]           [  0  0  0  1 ]

        [  1  0  0  0 ]           [  1  0  0  0 ]
        [  0  1  0  0 ]           [  0  1  0  0 ]
   E3 = [  0  0  1  0 ]  ,  E4 =  [  0 −1  1  0 ]  ,
        [ −2  0  0  1 ]           [  0  0  0  1 ]

        [  1  0  0  0 ]           [  1  0  0  0 ]
        [  0  1  0  0 ]           [  0  1  0  0 ]
   E5 = [  0  0  1  0 ]  ,  E6 =  [  0  0  1  0 ]  .
        [  0 −1  0  1 ]           [  0  0 −1  1 ]

Exercise 2.2.8 (a) How might these elementary row matrices be used as a part of the
matrix factorization? Consider how you would apply each one of them individually
to the matrix A to get the intermediate result, U = L−1 A.

(b) Check your results by multiplying the matrices in the order that you have discovered
to get U as a result.

(c) Explain why this works, what is this matrix multiplication equivalent to?

Now that you have discovered what we call pre-multiplying the matrix A by L−1 , we know that
 
                                                      [ 1  0  0  0 ]
                                                      [ 3  1  0  0 ]
   L = E1^(−1) E2^(−1) E3^(−1) E4^(−1) E5^(−1) E6^(−1) =  [ 4  1  1  0 ]  ,
                                                      [ 2  1  1  1 ]

and we have the factorization A = LU .


We see that L is a unit lower-triangular matrix, with the subdiagonal entries equal to the
multipliers. That is, if mij times row j is subtracted from row i, where i > j, then `ij = mij .
Let b = [ 5  16  22  15 ]^T. Applying the same row operations to b, which is equivalent to
pre-multiplying b by L^(−1), or solving the system Ly = b, yields the modified right-hand side

                              [ 5 ]
                              [ 1 ]
   y = E6 E5 E4 E3 E2 E1 b =  [ 1 ]  .
                              [ 3 ]
We then use back substitution to solve the system U x = y:

x4 = y4 /u44 = 3/(−1) = −3,


x3 = (y3 − u34 x4 )/u33 = (1 − 1(−3))/(−2) = −2,
x2 = (y2 − u23 x3 − u24 x4 )/u22 = (1 − 1(−2) − 7(−3))/(−4) = −6,
x1 = (y1 − u12 x2 − u13 x3 − u14 x4 )
= (5 − 2(−6) − 1(−2) + 1(−3))/1 = 16.

2.2.2.3 Implementation Details


Because both forward and back substitution require only O(n²) operations, whereas Gaussian elim-
ination requires O(n3 ) operations, changes in the right-hand side b can be handled quite efficiently
by computing the factors L and U once, and storing them. This can be accomplished quite effi-
ciently, because L is unit lower triangular. It follows from this that L and U can be stored in a
single n × n matrix by storing U in the upper triangular part, and the multipliers mij in the lower
triangular part. Once the factorization A = LU has been found, there are two systems to solve
once the substitution A = LU has been made into Ax = b to get LU x = b.

   L(U x) = b        let y = U x
   Ly = b            solve for y by forward substitution: O(n²) operations
   Ux = y            solve for x by back substitution: O(n²) operations

2.2.2.4 Existence and Uniqueness


Not every nonsingular n × n matrix A has an LU decomposition. For example, if a11 = 0, then the
multipliers mi1 , for i = 2, 3, . . . , n, are not defined, so no multiple of the first row can be added to
the other rows to eliminate subdiagonal elements in the first column. That is, Gaussian elimination
can break down. Even if a11 6= 0, it can happen that the (j, j) element of A(j) is zero, in which
case a similar breakdown occurs. When this is the case, the LU decomposition of A does not exist.
This will be addressed by pivoting, resulting in a modification of the LU decomposition.
It can be shown that the LU decomposition of an n × n matrix A does exist if the leading
principal submatrices of A, defined by
 
                  [ a11  · · ·  a1k ]
   [A]1:k,1:k =   [  .    .      .  ]  ,        k = 1, 2, . . . , n,
                  [ ak1  · · ·  akk ]
are all nonsingular. Furthermore, when the LU decomposition exists, it is unique.
Exercise 2.2.9 Prove, by contradiction, that the LU factorization is unique. Start by
supposing that two distinct LU factorizations exist.

2.2.2.5 Practical Computation of Determinants


Computing the determinant of an n × n matrix A using its definition requires a number of arith-
metic operations that is exponential in n. However, more practical methods for computing the
determinant can be obtained by using its properties:
• If à is obtained from A by adding a multiple of a row of A to another row, then det(Ã) =
det(A).
• If B is an n × n matrix, then det(AB) = det(A) det(B).
• If A is a triangular matrix (either upper or lower), then det(A) = ∏_{i=1}^{n} aii , the product of
its diagonal entries.
It follows from these properties that if Gaussian elimination is used to reduce A to an upper-
triangular matrix U , then det(A) = det(U ), where U is the resulting upper-triangular matrix,
because the elementary row operations needed to reduce A to U do not change the determinant.
Because U is upper triangular, det(U ), being the product of its diagonal entries, can be computed
in n − 1 multiplications. It follows that the determinant of any matrix can be computed in O(n3 )
operations.
It can also be seen that det(A) = det(U ) by noting that if A = LU , then det(A) = det(L) det(U ),
by one of the abovementioned properties, but det(L) = 1, because L is a unit lower triangular
matrix. It follows from the fact that L is lower triangular that det(L) is the product of its diagonal
entries, and it follows from the fact that L is unit lower triangular that all of its diagonal entries
are equal to 1.

2.2.2.6 Perturbations and the Inverse


Using what we have learned about matrix norms and convergence of sequences of matrices, we can
quantify the change in the inverse of a matrix A in terms of the change in A. Suppose that an n × n
matrix F satisfies ‖F‖p < 1, for some p. Then, from our previous discussion, lim_{k→∞} F^k = 0. It
follows that, by convergence of telescoping series,

   lim_{k→∞} ( ∑_{i=0}^{k} F^i ) (I − F) = lim_{k→∞} ( I − F^(k+1) ) = I,

and therefore (I − F) is nonsingular, with inverse (I − F)^(−1) = ∑_{i=0}^{∞} F^i. By the properties of matrix
norms, and convergence of geometric series, we then obtain

   ‖(I − F)^(−1)‖p ≤ ∑_{i=0}^{∞} ‖F‖p^i = 1 / (1 − ‖F‖p).

Now, let A be a nonsingular matrix, and let E be a perturbation of A such that r ≡ ‖A^(−1)E‖p < 1.
Because A + E = A(I − F) where F = −A^(−1)E, with ‖F‖p = r < 1, I − F is nonsingular, and
therefore so is A + E. We then have

   ‖(A + E)^(−1) − A^(−1)‖p = ‖−A^(−1)E(A + E)^(−1)‖p
                            = ‖−A^(−1)E(I − F)^(−1)A^(−1)‖p
                            ≤ ‖A^(−1)‖p² ‖E‖p ‖(I − F)^(−1)‖p
                            ≤ ‖A^(−1)‖p² ‖E‖p / (1 − r),
from the formula for the difference of inverses, and the submultiplicative property of matrix norms.

2.2.2.7 Bounding the perturbation in A


From a roundoff analysis of forward and back substitution, which we do not reproduce here, we
have the bounds

   max_{i,j} |δL̄ij| ≤ nu` + O(u²),

   max_{i,j} |δŪij| ≤ nuGa + O(u²),

where a = maxi,j |aij |, ` = maxi,j |L̄ij |, and G is the growth factor. Putting our bounds together,
we have

   max_{i,j} |δAij| ≤ max_{i,j} |eij| + max_{i,j} |(L̄ δŪ)ij| + max_{i,j} |(Ū δL̄)ij| + max_{i,j} |(δL̄ δŪ)ij|
                   ≤ n(1 + `)Gau + n²`Gau + n²`Gau + O(u²),

from which it follows that

   ‖δA‖∞ ≤ n²(2n` + ` + 1)Gau + O(u²).

2.2.2.8 Bounding the error in the solution


Let x̄ = x + δx be the computed solution. Since the exact solution to Ax = b is given by
x = A−1 b, we are also interested in examining x̄ = (A + δA)−1 b. Can we say something about
k(A + δA)−1 b − A−1 bk?
We assume that kA−1 δAk = r < 1. We have

A + δA = A(I + A−1 δA) = A(I − F ), F = −A−1 δA.



From the manipulations

   (A + δA)^(−1) b − A^(−1) b = (I + A^(−1)δA)^(−1) A^(−1) b − A^(−1) b
                              = (I + A^(−1)δA)^(−1) (A^(−1) − (I + A^(−1)δA)A^(−1)) b
                              = (I + A^(−1)δA)^(−1) (−A^(−1)(δA)A^(−1)) b

and this result from Section 2.2.2.6,

   ‖(I − F)^(−1)‖ ≤ 1 / (1 − r),

we obtain

   ‖δx‖/‖x‖ = ‖(x + δx) − x‖ / ‖x‖
            = ‖(A + δA)^(−1) b − A^(−1) b‖ / ‖A^(−1) b‖
            ≤ (1/(1 − r)) ‖A^(−1)‖ ‖δA‖
            ≤ κ(A) (‖δA‖/‖A‖) / (1 − ‖A^(−1)δA‖)
            ≤ κ(A) (‖δA‖/‖A‖) / (1 − κ(A)‖δA‖/‖A‖).

Note that a similar conclusion is reached if we assume that the computed solution x̄ solves a nearby
problem in which both A and b are perturbed, rather than just A.
We see that the important factors in the accuracy of the computed solution are
• The growth factor G

• The size of the multipliers mik , bounded by `

• The condition number κ(A)

• The precision u
In particular, κ(A) must be large with respect to the accuracy in order to be troublesome. For
example, consider the scenario where κ(A) = 102 and u = 10−3 , as opposed to the case where
κ(A) = 102 and u = 10−50 . However, it is important to note that even if A is well-conditioned, the
error in the solution can still be very large, if G and ` are large.

2.2.3 Pivoting
During Gaussian elimination, it is necessary to interchange rows of the augmented matrix whenever
the diagonal element of the column currently being processed, known as the pivot element, is equal
to zero.
However, if we examine the main step in Gaussian elimination,
   aik^(j+1) = aik^(j) − mij ajk^(j) ,

we can see that any roundoff error in the computation of ajk^(j) is amplified by mij . Because the mul-
tipliers can be arbitrarily large, it follows from the previous analysis that the error in the computed
solution can be arbitrarily large, meaning that Gaussian elimination is numerically unstable.
Therefore, it is helpful if it can be ensured that the multipliers are small. This can be accom-
plished by performing row interchanges, or pivoting, even when it is not absolutely necessary to do
so for elimination to proceed.

2.2.3.1 Partial Pivoting


One approach is called partial pivoting. When eliminating elements in column j, we seek the largest
element in column j, on or below the main diagonal, and then interchanging that element’s row
with row j. That is, we find an integer p, j ≤ p ≤ n, such that

   |apj | = max_{j≤i≤n} |aij |.

Then, we interchange rows p and j.


In view of the definition of the multiplier, mij = aij^(j) / ajj^(j) , it follows that |mij | ≤ 1 for
j = 1, . . . , n − 1 and i = j + 1, . . . , n. Furthermore, while pivoting in this manner requires O(n2 )
comparisons to determine the appropriate row interchanges, that extra expense is negligible com-
pared to the overall cost of Gaussian elimination, and therefore is outweighed by the potential
reduction in roundoff error. We note that when partial pivoting is used, the growth factor G is
2n−1 , where A is n × n.

2.2.3.2 Complete Pivoting


While partial pivoting helps to control the propagation of roundoff error, loss of significant digits
can still result if, in the abovementioned main step of Gaussian elimination, mij ajk^(j) is much larger
in magnitude than aik^(j) . Even though mij is not large, this can still occur if ajk^(j) is particularly
large.
Complete pivoting entails finding integers p and q such that

   |apq | = max_{j≤i≤n, j≤k≤n} |aik |,

and then using both row and column interchanges to move apq into the pivot position in row j and
column j. It has been proven that this is an effective strategy for ensuring that Gaussian elimination
is backward stable, meaning it does not cause the entries of the matrix to grow exponentially as
they are updated by elementary row operations, which is undesirable because it can cause undue
amplification of roundoff error.

2.2.3.3 The LU Decomposition with Pivoting


Suppose that pivoting is performed during Gaussian elimination. Then, if row j is interchanged
with row p, for p > j, before entries in column j are eliminated, the matrix A(j) is effectively
multiplied by a permutation matrix P (j) . A permutation matrix is a matrix obtained by permuting
the rows (or columns) of the identity matrix I. In P (j) , rows j and p of I are interchanged, so that
multiplying A(j) on the left by P (j) interchanges these rows of A(j) . It follows that the process of
Gaussian elimination with pivoting can be described in terms of the matrix multiplications.

Exercise 2.2.10 (a) Find the order, as described above, in which the permutation ma-
trices P and the multiplier matrices M should be multiplied by A.

(b) What permutation matrix P would constitute no change in the previous matrix?
However, because each permutation matrix P (k) at most interchanges row k with row p, where
p > k, there is no difference between applying all of the row interchanges “up front”, instead of
applying P (k) immediately before applying M (k) for each k. It follows that

[M (n−1) M (n−2) · · · M (1) ][P (n−1) P (n−2) · · · P (1) ]A = U,

and because a product of permutation matrices is a permutation matrix, we have

P A = LU,

where L is defined as before, and P = P (n−1) P (n−2) · · · P (1) . This decomposition exists for any
nonsingular matrix A.
Once the LU decomposition P A = LU has been computed, we can solve the system Ax = b
by first noting that if x is the solution, then

P Ax = LU x = P b.

Therefore, we can obtain x by first solving the system Ly = P b, and then solving U x = y. Then,
if b should change, then only these last two systems need to be solved in order to obtain the new
solution; as in the case of Gaussian elimination without pivoting, the LU decomposition does not
need to be recomputed.
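In Matlab, this pattern can be carried out with the built-in lu function, which returns the factors of P A = LU; the sketch below (b1 and b2 stand for two different right-hand sides) factors once and then solves cheaply for each b:

    [L, U, P] = lu(A);        % one O(n^3) factorization
    x1 = U \ (L \ (P*b1));    % each new right-hand side costs only O(n^2)
    x2 = U \ (L \ (P*b2));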
Example Let  
1 4 7
A =  2 8 5 .
3 6 9
Applying Gaussian elimination to A, we subtract twice the first row from the second, and three
times the first row from the third, to obtain
 
1 4 7
A(2) =  0 0 −9  .
0 −6 −12

At this point, Gaussian elimination breaks down, because the multiplier m32 = a32 /a22 = −6/0 is
undefined.
Therefore, we must interchange the second and third rows, which yields the upper triangular
matrix  
1 4 7
U = A(3) = P (2) A(2) =  0 −6 −12  ,
0 0 −9

where P (2) is the permutation matrix


 
1 0 0
P (2) = 0 0 1 
0 1 0
obtained by interchanging the second and third rows of the identity matrix.
It follows that we have computed the factorization

P A = LU,

or      
1 0 0 1 4 7 1 0 0 1 4 7
 0 0 1   2 8 5  =  3 1 0   0 −6 −12  .
0 1 0 3 6 9 2 0 1 0 0 −9
It can be seen in advance that A does not have an LU factorization because the second minor of
A, a1:2,1:2 , is a singular matrix. 2

2.2.3.4 Practical Computation of Determinants, Revisited


When Gaussian elimination is used without pivoting to obtain the factorization A = LU , we have
det(A) = det(U ), because det(L) = 1 due to L being unit lower-triangular. When pivoting is used,
we have the factorization P A = LU , where P is a permutation matrix. Because a permutation
matrix is orthogonal; that is, P T P = I, and det(A) = det(AT ) for any square matrix, it follows
that det(P )2 = 1, or det(P ) = ±1. Therefore, det(A) = ± det(U ), where the sign depends on the
sign of det(P ).
To determine this sign, we note that when two rows (or columns) of A are interchanged, the
sign of the determinant changes. Therefore, det(P ) = (−1)^p , where p is the number of row inter-
changes that are performed during Gaussian elimination. The quantity (−1)^p is known as the sign
of the permutation represented by P that determines the final ordering of the rows. We conclude that
det(A) = (−1)^p det(U ).
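As a brief Matlab illustration (a sketch, not the only way to obtain the sign), the determinant can be recovered from the factors of P A = LU:

    [L, U, P] = lu(A);
    detA = det(P) * prod(diag(U));   % det(P) is exactly +1 or -1 for a permutation matrix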

2.3 Estimating and Improving Accuracy


2.3.1 The Condition Number
Let A ∈ R^{n×n} be nonsingular, and let A = U ΣV^T be the SVD of A. Then the solution x of the
system of linear equations Ax = b can be expressed as

   x = A^(−1) b = V Σ^(−1) U^T b = ∑_{i=1}^{n} (ui^T b / σi) vi .

This formula for x suggests that if σn is small relative to the other singular values, then the system
Ax = b can be sensitive to perturbations in A or b. This makes sense, considering that σn is the
distance between A and the set of all singular n × n matrices.
In an attempt to measure the sensitivity of this system, we consider the parameterized system

   (A + εE)x(ε) = b + εe,

where E ∈ Rn×n and e ∈ Rn are perturbations of A and b, respectively. Taking the Taylor
expansion of x() around  = 0 yields

x() = x + x0 (0) + O(2 ),

where
x0 () = (A + E)−1 (e − Ex),
which yields x0 (0) = A−1 (e − Ex).
Using norms to measure the relative error in x, we obtain
kA−1 (e − Ex)k
 
kx() − xk kek
= || + O(2 ) ≤ ||kA−1 k + kEk + O(2 ).
kxk kxk kxk
Multiplying and dividing by kAk, and using Ax = b to obtain kbk ≤ kAkkxk, yields
 
kx() − xk kek kEk
= κ(A)|| + ,
kxk kbk kAk
where
   κ(A) = ‖A‖ ‖A^(−1)‖
is called the condition number of A. We conclude that the relative errors in A and b can be
amplified by κ(A) in the solution. Therefore, if κ(A) is large, the problem Ax = b can be quite
sensitive to perturbations in A and b. In this case, we say that A is ill-conditioned; otherwise, we
say that A is well-conditioned.
The definition of the condition number depends on the matrix norm that is used. Using the
`2 -norm, we obtain
   κ2(A) = ‖A‖2 ‖A^(−1)‖2 = σ1(A) / σn(A).
It can readily be seen from this formula that κ2 (A) is large if σn is small relative to σ1 . We
also note that because the singular values are the lengths of the semi-axes of the hyperellipsoid
{Ax | kxk2 = 1}, the condition number in the `2 -norm measures the elongation of this hyperellipsoid.
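This relationship is easy to check numerically in Matlab (an illustrative snippet using a random matrix):

    A = rand(5);
    s = svd(A);             % singular values in decreasing order
    kappa2 = s(1)/s(end)    % ratio of largest to smallest singular value
    cond(A)                 % Matlab's 2-norm condition number; the two values agree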
Example The matrices
   
   A1 = [ 0.7674  0.0477 ]  ,     A2 = [ 0.7581  0.1113 ]
        [ 0.6247  0.1691 ]             [ 0.6358  0.0933 ]

do not appear to be very different from one another, but κ2 (A1 ) = 10 while κ2 (A2 ) = 1010 . That
is, A1 is well-conditioned while A2 is ill-conditioned.
To illustrate the ill-conditioned nature of A2 , we solve the two systems of equations A2 x1 = b1
and A2 x2 = b2 for the unknown vectors x1 and x2 , where
   
   b1 = [ 0.7662 ]  ,     b2 = [ 0.7019 ]  .
        [ 0.6426 ]             [ 0.7192 ]

These vectors differ from one another by roughly 10%, but the solutions
   x1 = [ 0.9894 ]  ,     x2 = [ −1.4522 × 10^8 ]
        [ 0.1452 ]             [  9.8940 × 10^8 ]

differ by several orders of magnitude, because of the sensitivity of A2 to perturbations. 2

Just as the largest singular value of A is the `2 -norm of A, and the smallest singular value is
the distance from A to the nearest singular matrix in the `2 -norm, we have, for any `p -norm,

   1/κp(A) = min_{A+ΔA singular} ‖ΔA‖p / ‖A‖p .

That is, in any `p -norm, κp (A) measures the relative distance in that norm from A to the set of
singular matrices.
Because det(A) = 0 if and only if A is singular, it would appear that the determinant could be
used to measure the distance from A to the nearest singular matrix. However, this is generally not
the case. It is possible for a matrix to have a relatively large determinant, but be very close to a
singular matrix, or for a matrix to have a relatively small determinant, but not be nearly singular.
In other words, there is very little correlation between det(A) and the condition number of A.

Example Let
        [ 1 −1 −1 −1 −1 −1 −1 −1 −1 −1 ]
        [ 0  1 −1 −1 −1 −1 −1 −1 −1 −1 ]
        [ 0  0  1 −1 −1 −1 −1 −1 −1 −1 ]
        [ 0  0  0  1 −1 −1 −1 −1 −1 −1 ]
   A =  [ 0  0  0  0  1 −1 −1 −1 −1 −1 ]  .
        [ 0  0  0  0  0  1 −1 −1 −1 −1 ]
        [ 0  0  0  0  0  0  1 −1 −1 −1 ]
        [ 0  0  0  0  0  0  0  1 −1 −1 ]
        [ 0  0  0  0  0  0  0  0  1 −1 ]
        [ 0  0  0  0  0  0  0  0  0  1 ]

Then det(A) = 1, but κ2(A) ≈ 1,918, and σ10 ≈ 0.0029. That is, A is quite close to a singular
matrix, even though det(A) is not near zero. For example, the nearby matrix à = A − σ10 u10 v10^T,
whose entries are equal to those of A to within two decimal places, is singular. 2

2.3.2 Iterative Refinement

Although we have learned about solving a system of linear equations Ax = b, we have yet to
discuss methods of estimating the error in a computed solution x̃. A simple approach to judging
the accuracy of x̃ is to compute the residual vector r = b − Ax̃, and then compute the magnitude
of r using any vector norm. However, this approach can be misleading, as a small residual does not
necessarily imply that the error in the solution, which is e = x − x̃, is small.
To see this, we first note that

Ae = A(x − x̃) = Ax − Ax̃ = b − Ax̃ = r.



It follows that for any vector norm k · k, and the corresponding induced matrix norm, we have

   ‖e‖ = ‖A^(−1) r‖
       ≤ ‖A^(−1)‖ ‖r‖
       = ‖A^(−1)‖ ‖r‖ (‖b‖ / ‖b‖)
       = ‖A^(−1)‖ ‖r‖ (‖Ax‖ / ‖b‖)
       ≤ ‖A‖ ‖A^(−1)‖ (‖r‖ / ‖b‖) ‖x‖.

We conclude that the magnitude of the relative error in x̃ is bounded as follows:

   ‖e‖/‖x‖ ≤ κ(A) ‖r‖/‖b‖,

where
   κ(A) = ‖A‖ ‖A^(−1)‖
is the condition number of A. Therefore, it is possible for the residual to be small, and the error
to still be large.
We can exploit the relationship between the error e and the residual r, Ae = r, to obtain an
estimate of the error, ẽ, by solving the system Ae = r in the same manner in which we obtained x̃
by attempting to solve Ax = b.
Since ẽ is an estimate of the error e = x − x̃ in x̃, it follows that x̃ + ẽ is a more accurate
approximation of x than x̃ is. This is the basic idea behind iterative refinement, also known as
iterative improvement or residual correction. The algorithm is as follows:

Choose a desired accuracy level (error tolerance) T OL


x̃(0) = 0
r(0) = b
for k = 0, 1, 2, . . .
Solve Aẽ(k) = r(k)
if kẽ(k) k∞ < T OL
break
end
x̃(k+1) = x̃(k) + ẽ(k)
r(k+1) = r(k) − Aẽ(k)
end

The algorithm repeatedly applies the relationship Ae = r, where e is the error and r is the
residual, to update the computed solution with an estimate of its error. For this algorithm to be
effective, it is important that the residual r̃(k) be computed as accurately as possible, for example
using higher-precision arithmetic than for the rest of the computation.
It can be shown that if the vector r(k) is computed using double or extended precision that x(k)
converges to a solution where almost all digits are correct when κ(A)u ≤ 1.

It is important to note that in the above algorithm, the new residual r(k+1) is computed using
the formula r(k+1) = r(k) − Aẽ(k) , rather than the definition r(k+1) = b − Ax(k+1) . To see that
these formulas are equivalent, we use the definition of x(k+1) to obtain

r(k+1) = b − Ax̃(k+1)
= b − A(x̃(k) + ẽ(k) )
= b − Ax̃(k) − Aẽ(k)
= r(k) − Aẽ(k) .

This formula is preferable to the definition because as k increases, both r(k) and ẽ(k) should approach
the zero vector, and therefore smaller vectors than b and Ax̃(k+1) will be subtracted to obtain r(k+1) ,
thus reducing the amount of cancellation error that occurs.
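A Matlab sketch of this algorithm is given below (the function name itref, the iteration cap maxit, and the use of a stored LU factorization are our own choices); note that it computes the residual update in working precision, whereas, as discussed above, higher precision is preferable in practice:

    function x = itref(A, b, tol, maxit)
    % ITREF  Iterative refinement for Ax = b, re-using one LU factorization.
    [L, U, P] = lu(A);
    x = zeros(size(b));
    r = b;
    for k = 1:maxit
        e = U \ (L \ (P*r));      % solve A*e = r for the error estimate
        if norm(e, inf) < tol
            break
        end
        x = x + e;
        r = r - A*e;              % update residual without recomputing b - A*x
    end
    end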

2.3.3 Scaling and Equilibration


As we have seen, the bounds for the error depend on κ(A) = ‖A‖ ‖A^(−1)‖. Perhaps we can re-scale
the equations so that the condition number is changed. We replace the system

Ax = b

by the equivalent system


DAx = Db
or possibly
DAEy = Db
where D and E are diagonal matrices and y = E −1 x.
Suppose A is symmetric positive definite; that is, A = AT and xT Ax > 0 for any nonzero vector
x. We want to replace A by DAD, that is, replace aij by di dj aij , so that κ(DAD) is minimized.
It turns out that for a class of symmetric matrices, such minimization is possible. A symmetric
positive definite matrix A is said to have Property A if there exists a permutation matrix Π such
that

   Π A Π^T = [ D    F ]
             [ F^T  D ]  ,
where D is a diagonal matrix. For example, all tridiagonal matrices that are symmetric positive
definite have Property A.
For example, suppose

   A = [ 50  7 ]
       [  7  1 ]  .
Then λmax ≈ 51 and λmin ≈ 1/51, which means that κ(A) ≈ 2500. However, with D = diag(1/√50, 1),

   DAD = [ 1/√50   0 ] [ 50  7 ] [ 1/√50   0 ]  =  [   1      7/√50 ]
         [   0     1 ] [  7  1 ] [   0     1 ]     [ 7/√50      1   ]
and
   κ(DAD) = (1 + 7/√50) / (1 − 7/√50) ≈ 200.

One scaling strategy is called equilibration. The idea is to set A^(0) = A and compute A^(1/2) =
D^(1) A^(0) = {di^(1) aij^(0)}, choosing the diagonal matrix D^(1) so that di^(1) ∑_{j=1}^{n} |aij^(0)| = 1. That
is, all row sums of |D^(1) A^(0)| are equal to one. Then, we compute A^(1) = A^(1/2) E^(1) = {aij^(1/2) ej^(1)},
choosing each element of the diagonal matrix E^(1) so that ej^(1) ∑_{i=1}^{n} |aij^(1/2)| = 1. That is, all
column sums of |A^(1/2) E^(1)| are equal to one. We then repeat this process, which yields

A(k+1/2) = D(k+1) A(k) ,


A(k+1) = A(k+1/2) E (k+1) .

Under very general conditions, the A(k) converge to a matrix whose row and column sums are all
equal.
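One sweep of this process can be written in a few lines of Matlab (an illustrative sketch; the function name equilibrate_sweep is our own choice):

    function [A, D, E] = equilibrate_sweep(A)
    % EQUILIBRATE_SWEEP  Scale rows, then columns, so their absolute sums are 1.
    D = diag(1 ./ sum(abs(A), 2));   % D*A has unit row sums of absolute values
    A = D*A;
    E = diag(1 ./ sum(abs(A), 1));   % A*E has unit column sums of absolute values
    A = A*E;
    end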

2.4 Special Matrices


2.4.1 Banded Matrices
We now take a look at different types of special matrices. The types of matrices you will see in this
section have special properties that often make them easier to solve than a general matrix. The
first type of special matrix we will discuss is called a banded matrix. An n × n matrix A is said
to have upper bandwidth p if aij = 0 whenever j − i > p. Let’s take a moment to visualize this
scenario.

        [ −2   1   5   0   0 ]
        [  1   1   8   4   0 ]
   B =  [  0   4   3  −2   7 ]  .
        [  0   0  11   1   1 ]
        [  0   0   0   3  −2 ]
The above matrix has an upper bandwidth of p = 2, since aij = 0 whenever j − i > 2 (for example,
a14 = 0). Similarly, A has lower bandwidth q if aij = 0 whenever i − j > q. A matrix that
has upper bandwidth p and lower bandwidth q is said to have bandwidth w = p + q + 1. The above
matrix has a lower bandwidth of q = 1, since aij = 0 whenever i − j > 1 (for example, a31 = 0).
Any n × n matrix A has a bandwidth w ≤ 2n − 1. If w < 2n − 1, then A is said to be banded.
However, cases in which the bandwidth is O(1), such as when A is a tridiagonal matrix for which
p = q = 1, are of particular interest because for such matrices, Gaussian elimination, forward
substitution and back substitution are much more efficient.

Exercise 2.4.1 (a) If A has lower bandwidth q, and A = LU is the LU decomposition


of A (without pivoting), then what is the lower bandwidth of the lower triangular
matrix L?

(b) How many elements, at most, need to be eliminated per column?

(c) If A has upper bandwidth p, and A = LU is the LU decomposition of A (without


pivoting), then what is the upper bandwidth of the upper triangular matrix U ?

(d) How many elements, at most, need to be eliminated per column?



Exercise 2.4.2 If A has O(1) bandwidth, then how many FLOPS do Gaussian elimina-
tion, forward substitution and back substitution require?
Example The matrix  
        [ −2   1   0   0   0 ]
        [  1  −2   1   0   0 ]
   A =  [  0   1  −2   1   0 ]  ,
        [  0   0   1  −2   1 ]
        [  0   0   0   1  −2 ]
which arises from discretization of the second derivative operator, is banded with lower bandwidth
and upper bandwidth 1, and total bandwidth 3. Its LU factorization is

   [ −2   1   0   0   0 ]
   [  1  −2   1   0   0 ]
   [  0   1  −2   1   0 ]  =
   [  0   0   1  −2   1 ]
   [  0   0   0   1  −2 ]

   [   1     0     0     0    0 ] [ −2    1     0     0     0   ]
   [ −1/2    1     0     0    0 ] [  0  −3/2    1     0     0   ]
   [   0   −2/3    1     0    0 ] [  0    0   −4/3    1     0   ]  .
   [   0     0   −3/4    1    0 ] [  0    0     0   −5/4    1   ]
   [   0     0     0   −4/5   1 ] [  0    0     0     0   −6/5  ]

We see that L has lower bandwidth 1, and U has upper bandwidth 1. 2

Exercise 2.4.3 (a) Write a Matlab function to find the LU factorization of a tridi-
agonal matrix.

(b) Now write a Matlab function for any banded matrix with bandwidth w.

(c) How do the two functions differ in performance? How many FLOPS do each re-
quire?
When a matrix A is banded with bandwidth w, it is wasteful to store it in the traditional
2-dimensional array. Instead, it is much more efficient to store the elements of A in w vectors of
length at most n. Then, the algorithms for Gaussian elimination, forward substitution and back
substitution can be modified appropriately to work with these vectors. For example, to perform
Gaussian elimination on a tridiagonal matrix, we can proceed as in the following algorithm. We
assume that the main diagonal of A is stored in the vector a, the subdiagonal (entries aj+1,j ) is
stored in the vector l, and the superdiagonal (entries aj,j+1 ) is stored in the vector u.

for j = 1, 2, . . . , n − 1 do
lj = lj /aj
aj+1 = aj+1 − lj uj
end

Notice that this algorithm is much shorter than regular Gaussian Elimination. That is because
the number of operations for solving a tridiagonal system is significantly reduced.

We can then solve the system by forward substitution and back substitution as follows:

y1 = b1
for i = 2, 3, . . . , n do
yi = bi − li−1 yi−1
end
xn = yn /an
for i = n − 1, n − 2, . . . , 1 do
xi = (yi − ui xi+1 )/ai
end

After Gaussian elimination, the components of the vector l are the subdiagonal entries of L in the
LU decomposition of A, and the components of the vector u are the superdiagonal entries of U .
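Putting the elimination, forward substitution, and back substitution steps together gives a complete O(n) tridiagonal solver in Matlab (a sketch; the function name trisolve and the argument order are our own choices):

    function x = trisolve(l, a, u, b)
    % TRISOLVE  Solve a tridiagonal system with subdiagonal l, diagonal a,
    % superdiagonal u, and right-hand side b, following the algorithms above.
    n = length(a);
    for j = 1:n-1                       % elimination: overwrite l with multipliers
        l(j) = l(j)/a(j);
        a(j+1) = a(j+1) - l(j)*u(j);
    end
    y = zeros(n, 1);                    % forward substitution
    y(1) = b(1);
    for i = 2:n
        y(i) = b(i) - l(i-1)*y(i-1);
    end
    x = zeros(n, 1);                    % back substitution
    x(n) = y(n)/a(n);
    for i = n-1:-1:1
        x(i) = (y(i) - u(i)*x(i+1))/a(i);
    end
    end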
Pivoting can cause difficulties for banded systems because it can cause fill-in: the introduction
of nonzero entries outside of the band. For this reason, when pivoting is necessary, pivoting schemes
that offer more flexibility than partial pivoting are typically used. The resulting trade-off is that
the entries of L are permitted to be somewhat larger, but the sparsity (that is, the occurrence of
zero entries) of A is preserved to a greater extent.

2.4.2 Symmetric Matrices


2.4.2.1 The LDLT Factorization
Suppose that A is a nonsingular n × n matrix that has an LU factorization A = LU . We know
that L is unit lower-triangular, and U is upper triangular. If we define the diagonal matrix D by
 
        [ u11   0   · · ·   0  ]
        [  0   u22  · · ·   0  ]
   D =  [  .    .    .      .  ]  =  diag(u11 , u22 , . . . , unn ),
        [  0    0   · · ·  unn ]

then D is also nonsingular, and the matrix D^(−1) U has entries

   [D^(−1) U ]ij = uij / uii ,        i, j = 1, 2, . . . , n.

The diagonal entries of this matrix are equal to one, and therefore D−1 U is unit upper-triangular.
Therefore, if we define the matrix M by M T = D−1 U , then we have the factorization

A = LU = LDD−1 U = LDM T ,

where both L and M are unit lower-triangular, and D is diagonal. This is called the LDM T
factorization of A.
Because of the close connection between the LDM T factorization and the LU factorization,
the LDM T factorization is not normally used in practice for solving the system Ax = b for a
general nonsingular matrix A. However, this factorization becomes much more interesting when A
is symmetric.

If A = AT , then LDM T = (LDM T )T = M DT LT = M DLT , because D, being a diagonal


matrix, is also symmetric. Because L and M , being unit lower-triangular, are nonsingular, it
follows that
M −1 LD = DLT M −T = D(M −1 L)T .
The matrix M −1 L is unit lower-triangular. Therefore, the above equation states that a lower-
triangular matrix is equal to an upper-triangular matrix, which implies that both matrices must
be diagonal. It follows that M −1 L = I, because its diagonal entries are already known to be equal
to one.
We conclude that L = M , and thus we have the LDLT factorization

A = LDLT .

Example 2.4.1 Let's take a look at the factorization of the symmetric matrix
\[
S = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}.
\]
The first step would be to subtract half of row one from row two:
\[
S = \begin{bmatrix} 1 & 0 \\ \frac{1}{2} & 1 \end{bmatrix} \begin{bmatrix} 4 & 2 \\ 0 & 2 \end{bmatrix}.
\]
The symmetry has been lost, but can be regained by finishing the factorization:
\[
S = \begin{bmatrix} 1 & 0 \\ \frac{1}{2} & 1 \end{bmatrix} \begin{bmatrix} 4 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 & \frac{1}{2} \\ 0 & 1 \end{bmatrix}.
\]
From the above example we see that whenever we have a symmetric matrix we have the factorization
LDU where L is lower unit triangular, D is diagonal, and U is upper unit triangular. Note that
U = LT , and therefore we have the factorization LDLT . This factorization is quite economical,
compared to the LU and LDM T factorizations, because only n(n + 1)/2 entries are needed to
represent L and D. Once these factors are obtained, we can solve Ax = b by solving the simple
systems
Ly = b, Dz = y, LT x = z,
using forward substitution, simple divisions, and back substitution.
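As a small illustration of this three-stage solution process, assuming the factors L (unit lower triangular) and D (diagonal) have already been computed, the solves can be carried out in Matlab as follows (a sketch, not a complete program):

y = L \ b;          % forward substitution: L*y = b
z = y ./ diag(D);   % simple divisions:     D*z = y
x = L' \ z;         % back substitution:    L'*x = z

Here the backslash operator recognizes the triangular structure of L and L', so each stage costs at most O(n^2) operations.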
The LDLT factorization can be obtained by performing Gaussian elimination, but this is not
efficient, because Gaussian elimination requires performing operations on entire rows of A, which
does not exploit symmetry. This can be addressed by omitting updates of the upper-triangular
portion of A, as they do not influence the computation of L and D. An alternative approach, that
is equally efficient in terms of the number of floating-point operations, but more desirable overall
due to its use of vector operations, involves computing L column-by-column.
If we multiply both sides of the equation A = LDLT by the standard basis vector ej to extract
the jth column of this matrix equation, we obtain
\[
a_j = \sum_{k=1}^{j} \ell_k v_{kj},
\]
where
\[
A = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}, \qquad L = \begin{bmatrix} \ell_1 & \cdots & \ell_n \end{bmatrix}
\]
are column partitions of A and L, and $v_j = DL^T e_j$.
Suppose that columns $1, 2, \ldots, j-1$ of L, as well as $d_{11}, d_{22}, \ldots, d_{j-1,j-1}$, the first $j-1$ diagonal
elements of D, have already been computed. Then, we can compute $v_{kj} = d_{kk}\ell_{jk}$ for $k = 1, 2, \ldots, j-1$,
because these quantities depend on elements of L and D that are available. It follows that
\[
a_j - \sum_{k=1}^{j-1} \ell_k v_{kj} = \ell_j v_{jj} = \ell_j d_{jj} \ell_{jj}.
\]

However, $\ell_{jj} = 1$, which means that we can obtain $d_{jj}$ from the jth component of the vector
\[
u_j = a_j - \sum_{k=1}^{j-1} \ell_k v_{kj},
\]

and then obtain the "interesting" portion of the new column $\ell_j$, that is, entries $j:n$, by computing
$\ell_j = u_j/d_{jj}$. The remainder of this column is zero, because L is lower-triangular.

Exercise 2.4.4 Write a Matlab function that implements the following LDLT algo-
rithm. This algorithm should take the random symmetric matrix A as input, and return
L and D as output. Check to see if A = LDLT .
The entire algorithm proceeds as follows:

L = 0
D = 0
for j = 1 : n do
    for k = 1 : j − 1 do
        vkj = dkk ℓjk
    end
    uj = aj:n,j
    for k = 1 : j − 1 do
        uj = uj − ℓj:n,k vkj
    end
    djj = uj (1) (the component of uj corresponding to row j)
    ℓj:n,j = uj /djj
end

This algorithm requires approximately $\frac{1}{3}n^3$ floating-point operations, which is half as many as
Gaussian elimination. If pivoting is required, then we obtain a factorization of the form $PA = LDL^T$.
However, we will soon see that for an important class of symmetric matrices, pivoting is unnecessary.

2.4.3 Symmetric Positive Definite Matrices


In the last section we looked at solving symmetric systems where A = AT . Now we consider the
same symmetric systems, but in the special case where all of the diagonal entries of D in the
factorization A = LDLT are positive. This type of matrix has the following properties:

2.4.3.1 Properties
A real, n × n symmetric matrix A is symmetric positive definite if A = AT and, for any nonzero
vector x,
xT Ax > 0.

Exercise 2.4.5 Show that if matrices A and B are positive definite, then A+B is positive
definite.
A symmetric positive definite matrix is the generalization to n×n matrices of a positive number.
If A is symmetric positive definite, then it has the following properties:

• A is nonsingular; in fact, det(A) > 0.

• All of the diagonal elements of A are positive.

• The largest element of the matrix lies on the diagonal.

• All of the eigenvalues of A are positive.

In general it is not easy to determine whether a given n × n symmetric matrix A is also positive
definite. One approach is to check the matrices
 
\[
A_k = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} \end{bmatrix}, \quad k = 1, 2, \ldots, n,
\]

which are the leading principal submatrices of A.

Exercise 2.4.6 Show that A is positive definite if and only if det(Ak ) > 0 for k =
1, 2, . . . , n.
There are other classes of matrices defined by similar conditions on xT Ax, in which the inequality is
reversed or zero values are allowed. These matrices are defined as follows: negative definite, where
xT Ax < 0; positive semi-definite, where xT Ax ≥ 0; and negative semi-definite, where xT Ax ≤ 0,
in each case for all nonzero vectors x.
Exercise 2.4.7 Find the values of c for which the following matrix is

(a) positive definite

(b) positive semi-definite

(c) negative definite

(d) negative semi-definite


 
\[
\begin{bmatrix} 3 & -1 & c \\ -1 & 3 & -1 \\ c & -1 & 3 \end{bmatrix}
\]
One desirable property of symmetric positive definite matrices is that Gaussian elimination
can be performed on them without pivoting, and all pivot elements are positive. Furthermore,

Gaussian elimination applied to such matrices is robust with respect to the accumulation of roundoff
error. However, Gaussian elimination is not the most practical approach to solving systems of
linear equations involving symmetric positive definite matrices, because it is not the most efficient
approach in terms of the number of floating-point operations that are required.

2.4.3.2 The Cholesky Factorization

Instead, it is preferable to compute the Cholesky factorization of A,

A = GGT ,

where G is a lower triangular matrix with positive diagonal entries. Because A is factored into two
matrices that are the transpose of one another, the process of computing the Cholesky factorization
requires about half as many operations as the LU decomposition.
The algorithm for computing the Cholesky factorization can be derived by matching entries of
GGT with those of A. This yields the following relation between the entries of G and A,

\[
a_{ik} = \sum_{j=1}^{k} g_{ij} g_{kj}, \quad i, k = 1, 2, \ldots, n, \; i \geq k.
\]

From this relation, we obtain the following algorithm.

for j = 1, 2, . . . , n do
    gjj = √(ajj)
    for i = j + 1, j + 2, . . . , n do
        gij = aij /gjj
        for k = j + 1, . . . , i do
            aik = aik − gij gkj
        end
    end
end

The innermost loop subtracts off all terms but the last (corresponding to j = k) in the above
summation that expresses aik in terms of entries of G. Equivalently, for each j, this loop subtracts
the matrix gj gjT from A, where gj is the jth column of G. Note that based on the outer product
view of matrix multiplication, the equation A = GGT is equivalent to

\[
A = \sum_{j=1}^{n} g_j g_j^T.
\]

Therefore, for each j, the contributions of all columns g` of G, where ` < j, have already been
subtracted from A, thus allowing column j of G to easily be computed by the steps in the outer
loops, which account for the last term in the summation for aik , in which j = k.

Exercise 2.4.8 Write a Matlab function that performs the Cholesky factorization A =
GGT . Have your function:

(a) take the symmetric matrix A as input,

(b) return variable isposdef that checks to see if the matrix A is positive definite, and

(c) return the lower triangular matrix G as output.

Exercise 2.4.9 How many FLOPs does this algorithm require?

(a) Use the Matlab commands 'tic' and 'toc' to time your implementation as the size of the
matrix is doubled. Compare the calculation times for these matrices, and try using large matrices.

(b) Count how many FLOPs are performed in your implementation of the Cholesky
algorithm.

Example 2.4.2 Let
\[
A = \begin{bmatrix} 9 & -3 & 3 & 9 \\ -3 & 17 & -1 & -7 \\ 3 & -1 & 17 & 15 \\ 9 & -7 & 15 & 44 \end{bmatrix}.
\]

A is a symmetric positive definite matrix. To compute its Cholesky decomposition $A = GG^T$, we
equate entries of A to those of $GG^T$, which yields the matrix equation
\[
\begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} =
\begin{bmatrix} g_{11} & 0 & 0 & 0 \\ g_{21} & g_{22} & 0 & 0 \\ g_{31} & g_{32} & g_{33} & 0 \\ g_{41} & g_{42} & g_{43} & g_{44} \end{bmatrix}
\begin{bmatrix} g_{11} & g_{21} & g_{31} & g_{41} \\ 0 & g_{22} & g_{32} & g_{42} \\ 0 & 0 & g_{33} & g_{43} \\ 0 & 0 & 0 & g_{44} \end{bmatrix},
\]

and the equivalent scalar equations
\begin{align*}
a_{11} &= g_{11}^2, \\
a_{21} &= g_{21} g_{11}, \\
a_{31} &= g_{31} g_{11}, \\
a_{41} &= g_{41} g_{11}, \\
a_{22} &= g_{21}^2 + g_{22}^2, \\
a_{32} &= g_{31} g_{21} + g_{32} g_{22}, \\
a_{42} &= g_{41} g_{21} + g_{42} g_{22}, \\
a_{33} &= g_{31}^2 + g_{32}^2 + g_{33}^2, \\
a_{43} &= g_{41} g_{31} + g_{42} g_{32} + g_{43} g_{33}, \\
a_{44} &= g_{41}^2 + g_{42}^2 + g_{43}^2 + g_{44}^2.
\end{align*}

We compute the nonzero entries of G one column at a time. For the first column, we have
\begin{align*}
g_{11} &= \sqrt{a_{11}} = \sqrt{9} = 3, \\
g_{21} &= a_{21}/g_{11} = -3/3 = -1, \\
g_{31} &= a_{31}/g_{11} = 3/3 = 1, \\
g_{41} &= a_{41}/g_{11} = 9/3 = 3.
\end{align*}
Before proceeding to the next column, we first subtract all contributions to the remaining entries of
A from the entries of the first column of G. That is, we update A as follows:
\begin{align*}
a_{22} &= a_{22} - g_{21}^2 = 17 - (-1)^2 = 16, \\
a_{32} &= a_{32} - g_{31} g_{21} = -1 - (1)(-1) = 0, \\
a_{42} &= a_{42} - g_{41} g_{21} = -7 - (3)(-1) = -4, \\
a_{33} &= a_{33} - g_{31}^2 = 17 - 1^2 = 16, \\
a_{43} &= a_{43} - g_{41} g_{31} = 15 - (3)(1) = 12, \\
a_{44} &= a_{44} - g_{41}^2 = 44 - 3^2 = 35.
\end{align*}
Now, we can compute the nonzero entries of the second column of G just as for the first column:
\begin{align*}
g_{22} &= \sqrt{a_{22}} = \sqrt{16} = 4, \\
g_{32} &= a_{32}/g_{22} = 0/4 = 0, \\
g_{42} &= a_{42}/g_{22} = -4/4 = -1.
\end{align*}
We then remove the contributions from G's second column to the remaining entries of A:
\begin{align*}
a_{33} &= a_{33} - g_{32}^2 = 16 - 0^2 = 16, \\
a_{43} &= a_{43} - g_{42} g_{32} = 12 - (-1)(0) = 12, \\
a_{44} &= a_{44} - g_{42}^2 = 35 - (-1)^2 = 34.
\end{align*}
The nonzero portion of the third column of G is then computed as follows:
\begin{align*}
g_{33} &= \sqrt{a_{33}} = \sqrt{16} = 4, \\
g_{43} &= a_{43}/g_{33} = 12/4 = 3.
\end{align*}
Finally, we compute $g_{44}$:
\[
a_{44} = a_{44} - g_{43}^2 = 34 - 3^2 = 25, \qquad g_{44} = \sqrt{a_{44}} = \sqrt{25} = 5.
\]
Thus the complete Cholesky factorization of A is
\[
\begin{bmatrix} 9 & -3 & 3 & 9 \\ -3 & 17 & -1 & -7 \\ 3 & -1 & 17 & 15 \\ 9 & -7 & 15 & 44 \end{bmatrix} =
\begin{bmatrix} 3 & 0 & 0 & 0 \\ -1 & 4 & 0 & 0 \\ 1 & 0 & 4 & 0 \\ 3 & -1 & 3 & 5 \end{bmatrix}
\begin{bmatrix} 3 & -1 & 1 & 3 \\ 0 & 4 & 0 & -1 \\ 0 & 0 & 4 & 3 \\ 0 & 0 & 0 & 5 \end{bmatrix}.
\]
2

If A is not symmetric positive definite, then the algorithm will break down, because it will attempt
to compute gjj , for some j, by taking the square root of a negative number, or divide by a zero gjj .

Example 2.4.3 The matrix
\[
A = \begin{bmatrix} 4 & 3 \\ 3 & 2 \end{bmatrix}
\]
is symmetric but not positive definite, because $\det(A) = 4(2) - 3(3) = -1 < 0$. If we attempt to
compute the Cholesky factorization $A = GG^T$, we have
\begin{align*}
g_{11} &= \sqrt{a_{11}} = \sqrt{4} = 2, \\
g_{21} &= a_{21}/g_{11} = 3/2, \\
a_{22} &= a_{22} - g_{21}^2 = 2 - 9/4 = -1/4, \\
g_{22} &= \sqrt{a_{22}} = \sqrt{-1/4},
\end{align*}
and the algorithm breaks down. 2


In fact, because computing determinants is expensive, attempting the Cholesky factorization is
also an efficient way of checking whether a symmetric matrix is positive definite. Once
the Cholesky factor G of A is computed, a system Ax = b can be solved by first solving Gy = b by
forward substitution, and then solving GT x = y by back substitution.

This is similar to the process of solving Ax = b using the LDLT factorization, except that
there is no diagonal system to solve. In fact, the LDLT factorization is also known as the “square-
root-free Cholesky factorization”, since it computes factors that are similar in structure to the
Cholesky factors, but without computing any square roots. Specifically, if A = GGT is the Cholesky
factorization of A, then G = LD1/2 . As with the LU factorization, the Cholesky factorization is
unique, because the diagonal is required to be positive.
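In Matlab, the built-in function chol returns an upper triangular factor R with A = R^T R, so the lower triangular Cholesky factor is G = R^T; its optional second output indicates whether the matrix is positive definite. A minimal sketch of the solution process just described:

[R, p] = chol(A);        % A = R'*R when p == 0; p > 0 means A is not SPD
if p > 0
    error('A is not symmetric positive definite');
end
G = R';                  % lower triangular factor, A = G*G'
y = G \ b;               % forward substitution: G*y = b
x = G' \ y;              % back substitution:    G'*x = y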

2.5 Iterative Methods


Given a system of n linear equations in n unknowns, described by the matrix-vector equation

Ax = b,

where A is an invertible n × n matrix, we can obtain the solution using a direct method such as
Gaussian elimination in conjunction with forward and back substitution. However, there are several
drawbacks to this approach:

• If we have an approximation to the solution x, a direct method does not provide any means
of taking advantage of this information to reduce the amount of computation required.

• If we only require an approximate solution, rather than the exact solution except for roundoff
error, it is not possible to terminate the algorithm for a direct method early in order to obtain
such an approximation.

• If the matrix A is sparse, Gaussian elimination or similar methods can cause fill-in, which is
the introduction of new nonzero elements in the matrix, thus reducing efficiency.

• In some cases, A may not be represented as a two-dimensional array or a set of vectors;


instead, it might be represented implicitly by a function that returns as output the matrix-
vector product Ax, given the vector x as input. A direct method is not practical for such a
representation, because the individual entries of A are not readily available.

For this reason, it is worthwhile to consider alternative approaches to solving Ax = b, such as


iterative methods. In particular, an iterative method based on matrix-vector multiplication will
address all of these drawbacks of direct methods.
There are two general classes of iterative methods: stationary iterative methods and non-
stationary methods. Either type of method accepts as input an initial guess x(0) (usually chosen
to be the zero vector) and computes a sequence of iterates x(1) , x(2) , x(3) , . . ., that, hopefully,
converges to the solution x. A stationary method has the form

x(k+1) = g(x(k) ),

for some function g : Rn → Rn . The solution x is a fixed point, or stationary point, of g. In other
words, a stationary iterative method is one in which fixed-point iteration, which we have previously
applied to solve nonlinear equations, is used to obtain the solution.

2.5.1 Stationary Iterative Methods


To construct a suitable function g, we compute a splitting of the matrix A = M − N , where M is
nonsingular. Then, the solution x satisfies

M x = N x + b,

or
x = M −1 (N x + b).
We therefore define
g(x) = M −1 (N x + b),
so that the iteration takes the form

M x(k+1) = N x(k) + b.

It follows that for the sake of efficiency, the splitting A = M − N should be chosen so that the
system M y = c is easily solved.

2.5.1.1 Convergence Analysis


Before we describe specific splittings, we examine the convergence of this type of iteration. Using
the fact that x is the exact solution of Ax = b, we obtain

M (x(k+1) − x) = N (x(k) − x) + b − b,

which yields
x(k+1) − x = M −1 N (x(k) − x)

and
x(k) − x = (M −1 N )k (x(0) − x).

That is, the error after each iteration is obtained by multiplying the error from the previous iteration
by T = M −1 N . Therefore, in order for the error to converge to the zero vector, for any choice of
the initial guess x(0) , we must have ρ(T ) < 1, where ρ(T ) is the spectral radius of T . Here, if
λ1 , . . . , λn are the eigenvalues of a matrix A, then ρ(A) = max{|λ1 |, . . . , |λn |}.

2.5.1.2 Jacobi Method

We now discuss some basic stationary iterative methods. For convenience, we write

A = D + L + U,

where D is a diagonal matrix whose diagonal entries are the diagonal entries of A, L is a strictly
lower triangular matrix defined by

\[
\ell_{ij} = \begin{cases} a_{ij}, & i > j, \\ 0, & i \leq j, \end{cases}
\]
and U is a strictly upper triangular matrix that is similarly defined: $u_{ij} = a_{ij}$ if $i < j$, and 0
otherwise.
The Jacobi method is defined by the splitting

A = M − N, M = D, N = −(L + U ).

That is,
x(k+1) = D−1 [−(L + U )x(k) + b].

If we write each row of this vector equation individually, we obtain


 
\[
x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij} x_j^{(k)} \right).
\]

This description of the Jacobi method is helpful for its practical implementation, but it also reveals
how the method can be improved. If the components of x(k+1) are computed in order, then the
computation of $x_i^{(k+1)}$ uses components 1, 2, . . . , i − 1 of x(k) even though these components of
x(k+1) have already been computed.
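A minimal Matlab sketch of the Jacobi iteration, written at the matrix level using the splitting M = D, N = −(L + U); the function name and the parameters tol and maxit are illustrative assumptions:

function x = jacobi(A, b, tol, maxit)
    n = length(b);
    D = diag(diag(A));           % M = D
    N = -(A - D);                % N = -(L + U)
    x = zeros(n, 1);             % initial guess x^(0) = 0
    for k = 1:maxit
        xnew = D \ (N*x + b);    % x^(k+1) = D^{-1}[-(L+U)x^(k) + b]
        if norm(xnew - x) < tol
            x = xnew;
            return
        end
        x = xnew;
    end
end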

Exercise 2.5.1 Solve the linear system Ax = b by using the Jacobi method, where
 
\[
A = \begin{bmatrix} 2 & 7 & 1 \\ 4 & 1 & -1 \\ 1 & -3 & 12 \end{bmatrix}
\quad \text{and} \quad
b = \begin{bmatrix} 19 \\ 3 \\ 31 \end{bmatrix}.
\]
Compute the iteration matrix T using the fact that M = D and N = −(L + U ) for the
Jacobi method. Is ρ(T ) < 1?
Hint: First rearrange the order of the equations so that the matrix is strictly diagonally
dominant.

2.5.1.3 Gauss-Seidel Method


By modifying the Jacobi method to use the most up-to-date information available, we obtain the
Gauss-Seidel method
\[
x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j < i} a_{ij} x_j^{(k+1)} - \sum_{j > i} a_{ij} x_j^{(k)} \right).
\]
This is equivalent to using the splitting A = M − N where M = D + L and N = −U ; that is,
\[
x^{(k+1)} = (D + L)^{-1} [-U x^{(k)} + b].
\]
Typically, this iteration converges more rapidly than the Jacobi method, but the Jacobi method
retains one significant advantage: because each component of x(k+1) is computed independently
of the others, the Jacobi method can be parallelized, whereas the Gauss-Seidel method cannot,
because the computation of $x_i^{(k+1)}$ depends on $x_j^{(k+1)}$ for j < i.
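For comparison, a sketch of a single Gauss-Seidel sweep in Matlab, which overwrites x in place and thereby uses the most recently computed components:

function x = gauss_seidel_sweep(A, b, x)
    n = length(b);
    for i = 1:n
        x(i) = (b(i) - A(i,1:i-1)*x(1:i-1) - A(i,i+1:n)*x(i+1:n)) / A(i,i);
    end
end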

Exercise 2.5.2 Solve the same system Ax = b from above, where


 
\[
A = \begin{bmatrix} 2 & 7 & 1 \\ 4 & 1 & -1 \\ 1 & -3 & 12 \end{bmatrix}
\quad \text{and} \quad
b = \begin{bmatrix} 19 \\ 3 \\ 31 \end{bmatrix},
\]
using the Gauss-Seidel Method. What are the differences between this computation and
the one from Exercise 2.5.1?

2.5.1.4 Successive Overrelaxation


Both iterations are guaranteed to converge if A is strictly diagonally dominant. Furthermore,
Gauss-Seidel is guaranteed to converge if A is symmetric positive definite. However, in certain

important applications, both methods can converge quite slowly. To accelerate convergence, we
first rewrite the Gauss-Seidel method as follows:

x(k+1) = x(k) + [x(k+1) − x(k) ].

The quantity in brackets is the step taken from x(k) to x(k+1) . However, if the direction of this
step corresponds closely to the step x − x(k) to the exact solution, it may be possible to accelerate
convergence by increasing the length of this step. That is, we introduce a parameter ω so that
\[
x^{(k+1)} = x^{(k)} + \omega \left[ x_{GS}^{(k+1)} - x^{(k)} \right],
\]
where $x_{GS}^{(k+1)}$ is the iterate obtained from x(k) by the Gauss-Seidel method. By choosing ω > 1,
which is called overrelaxation, we take a larger step in the direction of $[x_{GS}^{(k+1)} - x^{(k)}]$ than Gauss-
Seidel would call for.
This approach leads to the method of successive overrelaxation (SOR),

(D + ωL)x(k+1) = [(1 − ω)D − ωU ]x(k) + ωb.

Note that if ω = 1, then SOR reduces to the Gauss-Seidel method. If we examine the iteration
matrix Tω for SOR, we have

Tω = (D + ωL)−1 [(1 − ω)D − ωU ].

Because the matrices (D + ωL) and [(1 − ω)D − ωU ] are both triangular, it follows that
\[
\det(T_\omega) = \left( \prod_{i=1}^{n} a_{ii} \right)^{-1} \left( \prod_{i=1}^{n} (1 - \omega) a_{ii} \right) = (1 - \omega)^n.
\]

Because the determinant is the product of the eigenvalues, it follows that ρ(Tω ) ≥ |1 − ω|.

Exercise 2.5.3 By the above argument, find a lower and upper bound for the parameter
ω. Hint: Consider criteria for ρ(Tω ) if this method converges.
In some cases, it is possible to analytically determine the optimal value of ω, for which con-
vergence is most rapid. For example, if A is symmetric positive definite and tridiagonal, then the
optimal value is
\[
\omega = \frac{2}{1 + \sqrt{1 - [\rho(T_j)]^2}},
\]
where $T_j$ is the iteration matrix $-D^{-1}(L + U)$ for the Jacobi method.
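A sketch of one SOR sweep in Matlab, obtained by blending the Gauss-Seidel update with the previous iterate through the relaxation parameter omega (omega = 1 recovers Gauss-Seidel):

function x = sor_sweep(A, b, x, omega)
    n = length(b);
    for i = 1:n
        xGS  = (b(i) - A(i,1:i-1)*x(1:i-1) - A(i,i+1:n)*x(i+1:n)) / A(i,i);
        x(i) = x(i) + omega*(xGS - x(i));
    end
end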

Exercise 2.5.4 Suppose we wish to solve

Ax = b.

Show that if there exists a diagonal matrix D such that

B = DAD−1

is symmetric and positive definite, then SOR converges for the original problem.
Hint: Use the fact that SOR converges for any symmetric positive definite matrix.

A natural criterion for stopping any iterative method is to check whether kx(k) − x(k−1) k is less
than some tolerance. However, if kT k < 1 in some natural matrix norm, then we have

kx(k) − xk ≤ kT kkx(k−1) − xk ≤ kT kkx(k−1) − x(k) k + kT kkx(k) − xk,

which yields the estimate
\[
\|x^{(k)} - x\| \leq \frac{\|T\|}{1 - \|T\|} \|x^{(k)} - x^{(k-1)}\|.
\]

Therefore, the tolerance must be chosen with kT k/(1 − kT k) in mind, as this can be quite large
when kT k ≈ 1.

2.5.2 Krylov Subspace Methods


We have learned about stationary iterative methods for solving Ax = b, that have the form of a
fixed-point iteration. Now, we will consider an alternative approach to developing iterative methods,
that leads to non-stationary iterative methods, in which search directions are used to progress from
each iterate to the next. That is,
x(k+1) = x(k) + αk pk

where pk is a search direction that is chosen so as to be approximately parallel to the error ek =


x − x(k) , and αk is a constant that determines how far along that direction to proceed so that
x(k+1) , in some sense, will be as close to x as possible.

2.5.2.1 Steepest Descent

We assume that A is symmetric positive definite, and consider the problem of minimizing the
function
\[
\phi(x) = \frac{1}{2} x^T A x - b^T x.
\]
Differentiating, we obtain
∇φ(x) = Ax − b.

Therefore, this function has one critical point, when Ax = b. Differentiating ∇φ, we find that
the Hessian matrix of φ is A. Because A is symmetric positive definite, it follows that the unique
minimizer of φ is the solution to Ax = b. Therefore, we can use techniques for minimizing φ to
solve Ax = b.
From any vector x0 , the direction of steepest descent is given by

−∇φ(x0 ) = b − Ax0 = r0 ,

the residual vector. This suggests a simple non-stationary iterative method, which is called the
method of steepest descent. The basic idea is to choose the search direction pk to be rk = b − Ax(k) ,

and then to choose αk so as to minimize φ(x(k+1) ) = φ(x(k) + αk rk ). This entails solving a single-
variable minimization problem to obtain αk . We have

\begin{align*}
\frac{d}{d\alpha_k} \left[ \phi(x^{(k)} + \alpha_k r_k) \right] &= \frac{d}{d\alpha_k} \left[ \frac{1}{2} (x^{(k)} + \alpha_k r_k)^T A (x^{(k)} + \alpha_k r_k) - b^T (x^{(k)} + \alpha_k r_k) \right] \\
&= r_k^T A x^{(k)} + \alpha_k r_k^T A r_k - b^T r_k \\
&= -r_k^T r_k + \alpha_k r_k^T A r_k.
\end{align*}
It follows that the optimal choice for $\alpha_k$ is
\[
\alpha_k = \frac{r_k^T r_k}{r_k^T A r_k},
\]
and since A is symmetric positive definite, the denominator is guaranteed to be positive.
The method of steepest descent is effective when A is well-conditioned, but when A is ill-
conditioned, convergence is very slow, because the level curves of φ become long, thin hyperellipsoids
in which the direction of steepest descent does not yield much progress toward the minimum.
Another problem with this method is that while it can be shown that rk+1 is orthogonal to rk , so that
each direction is completely independent of the previous one, rk+1 is not necessarily independent
of previous search directions.
Exercise 2.5.5 Show that each search direction rk is orthogonal to the previous search
direction rk−1 .
In fact, even in the 2 × 2 case, where only two independent search directions are available, the
method of steepest descent exhibits a “zig-zag” effect because it continually alternates between two
orthogonal search directions, and the more ill-conditioned A is, the smaller each step tends to be.
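A minimal Matlab sketch of the method of steepest descent, assuming A is symmetric positive definite and that a tolerance tol is given:

x = zeros(size(b));
r = b - A*x;
while norm(r) > tol
    Ar    = A*r;
    alpha = (r'*r) / (r'*Ar);   % optimal step length along r
    x     = x + alpha*r;
    r     = r - alpha*Ar;       % update the residual without recomputing b - A*x
end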

2.5.2.2 The Lanczos Algorithm


A more efficient iteration can be obtained if it can be guaranteed that each residual is orthogonal
to all previous residuals. While it would be preferable to require that all search directions are
orthogonal, this goal is unrealistic, so we settle for orthogonality of the residuals instead. For
simplicity, and without loss of generality, we assume that bT b = kbk22 = 1. Then, this orthogonality
can be realized if we prescribe that each iterate x(k) has the form
x(k) = Qk yk
where Qk is an n×k orthogonal matrix, meaning that QTk Qk = I, and yk is a k-vector of coefficients.
If we also prescribe that the first column of Qk is b, then, to ensure that each residual is
orthogonal to all previous residuals, we first note that
b − Ax(k) = Qk e1 − AQk yk .
That is, the residual lies in the span of the space spanned by the columns of Qk , and the vectors
obtained by multiplying A by those columns. Therefore, if we choose the columns of Qk so that
they form an orthonormal basis for the Krylov subspace
K(b, A, k) = span{b, Ab, A2 b, . . . , Ak−1 b},

then we ensure that rk is in the span of the columns of Qk+1 .


Then, we can guarantee that the residual is orthogonal to the columns of Qk , which span the
space that contains all previous residuals, by requiring
QTk (b − Ax(k) ) = QTk b − QTk AQk yk = e1 − Tk yk = 0,
where
Tk = QTk AQk .
It is easily seen that Tk is symmetric positive definite, since A is.
The columns of each matrix Qk , denoted by q1 , q2 , . . . , qk , are defined to be
qk = pk−1 (A)b,
where p0 , p1 , . . . define a sequence of orthogonal polynomials with respect to the inner product
hp, qi = bT p(A)q(A)b.
That is, these polynomials satisfy
\[
\langle p_i, p_j \rangle = b^T p_i(A) p_j(A) b = \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases}
\]
Like any sequence of orthogonal polynomials, they satisfy a 3-term recurrence relation
βj pj (λ) = (λ − αj )pj−1 (λ) − βj−1 pj−2 (λ), j ≥ 1, p0 (λ) ≡ 1, p−1 (λ) ≡ 0,
which is obtained by applying Gram-Schmidt orthogonalization to the monomial basis.
To obtain the recursion coefficients αj and βj , we use the requirement that the polynomials
must be orthogonal. This yields
αj = hpj−1 (λ), λpj−1 (λ)i, βj2 = hpj−1 (λ), λ2 pj−1 (λ)i − αj2 .
It also follows from the 3-term recurrence relation that
Aqj = βj−1 qj−1 + αj qj + βj qj+1 ,
and therefore Tk is tridiagonal. In fact, we have
AQk = Qk Tk + βk qk+1 eTk ,
where
\[
T_k = \begin{bmatrix} \alpha_1 & \beta_1 & & & \\ \beta_1 & \alpha_2 & \beta_2 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{k-2} & \alpha_{k-1} & \beta_{k-1} \\ & & & \beta_{k-1} & \alpha_k \end{bmatrix}.
\]
Furthermore, the residual is given by
b − Ax(k) = Qk e1 − AQk yk = Qk e1 − Qk Tk yk − βk qk+1 eTk yk = −βk qk+1 yk .
We now have the following algorithm for generating a sequence of approximations to the solution
of Ax = b, for which each residual is orthogonal to all previous residuals.

k = 0, rk = b, qk = 0, x(k) = 0
while x(k) is not converged do
βk = krk k2
qk+1 = rk /βk
k =k+1
vk = Aqk
αk = qTk vk
rk = vk − αk qk − βk−1 qk−1
x(k) = β0 Qk Tk−1 e1
end

This method is the Lanczos iteration. It is not only used for solving linear systems; the matrix
Tk is also useful for approximating extremal eigenvalues of A, and for approximating quadratic or
bilinear forms involving functions of A, such as the inverse or exponential.
Exercise 2.5.6 Implement the above Lanczos algorithm in Matlab. Your function
should take a random symmetric matrix A, a random initial vector u, and n where n
is the number of iterations as input. Also, your function should return the symmetric
matrix T , where T is the tridiagonal matrix described above that contains quantities αj
and βj that are computed by the algorithm.

2.5.2.3 The Conjugate Gradient Method


To improve efficiency, the tridiagonal structure of Tk can be exploited so that the vector yk can
easily be obtained from yk−1 by a few simple recurrence relations. However, the Lanczos iteration
is not normally used directly to solve Ax = b, because it does not provide a simple method of
computing x(k+1) from x(k) , as in a general non-stationary iterative method.
Instead, we note that because Tk is tridiagonal, and symmetric positive definite, it has an LDLT
factorization
Tk = Lk Dk LTk ,
where
\[
L_k = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ l_1 & 1 & \ddots & \vdots \\ & \ddots & \ddots & 0 \\ 0 & \cdots & l_{k-1} & 1 \end{bmatrix}, \qquad
D_k = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & d_k \end{bmatrix},
\]
and the diagonal entries of Dk are positive. Note that because Tk is tridiagonal, Lk has lower
bandwidth 1.
It follows that if we define the matrix P̃k by the matrix equation

P̃k LTk = Qk ,

then
x(k) = Qk yk = P̃k LTk Tk−1 e1 = P̃k wk
where wk satisfies
Lk Dk wk = e1 .

This representation of x(k) is more convenient than Qk yk , because, as a consequence of the recursive
definitions of Lk and Dk , and the fact that Lk Dk is lower triangular, we have
 
\[
w_k = \begin{bmatrix} w_{k-1} \\ w_k \end{bmatrix},
\]
that is, the first k − 1 elements of $w_k$ are the elements of $w_{k-1}$ (with a slight abuse of notation,
$w_k$ also denotes the last, scalar component of the vector $w_k$).


Therefore,
x(k) = x(k−1) + wk p̃k , x(0) = 0.
That is, the columns of P̃k , which are the vectors p̃1 , . . . , p̃k , are search directions. Furthermore,
in view of
\[
\tilde{P}_k^T A \tilde{P}_k = L_k^{-1} Q_k^T A Q_k L_k^{-T} = L_k^{-1} T_k L_k^{-T} = L_k^{-1} L_k D_k L_k^T L_k^{-T} = D_k,
\]
which implies
\[
\tilde{p}_i^T A \tilde{p}_j = \begin{cases} d_i > 0, & i = j, \\ 0, & i \neq j, \end{cases}
\]
we see that these search directions, while not orthogonal, are A-orthogonal, or A-conjugate. There-
fore, they are linearly independent, thus guaranteeing convergence, in exact arithmetic, within n
iterations. It is for this reason that the Lanczos iteration, reformulated in this manner with these
search directions, is called the conjugate gradient method.
From the definition of Pk , we obtain the relation

lk−1 p̃k−1 + p̃k = q̃k , k > 1, p̃1 = q̃1 .

It follows that each search direction p̃k is a linear combination of the residual rk = b − Ax(k−1) ,
which is a scalar multiple of qk , and the previous search direction pk−1 , except for the first direction,
which is equal to q1 = b. The exact linear combination can be determined by the requirement that
the search directions be A-conjugate.
Specifically, if we define pk = krk k2 p̃k , then, from the previous linear combination, we have

pk = rk + µk pk−1 ,

for some constant µk . From the requirement that pTk−1 Apk = 0, we obtain

\[
\mu_k = -\frac{p_{k-1}^T A r_k}{p_{k-1}^T A p_{k-1}}.
\]

We have eliminated the computation of the qk from the algorithm, as we can now use the residuals
rk instead to obtain the search directions that we will actually use, the pk .
The relationship between the residuals and the search direction also provides a simple way to
compute each residual from the previous one. We have

rk+1 = b − Ax(k) = b − A[x(k−1) + wk p̃k ] = rk − wk Ap̃k .

The orthogonality of the residuals, and the A-orthogonality of the search directions, yields the
relations
\[
r_k^T r_k = -w_{k-1} r_k^T A \tilde{p}_{k-1} = -\frac{w_{k-1}}{\|r_{k-1}\|_2} r_k^T A p_{k-1},
\]
and
\begin{align*}
r_{k-1}^T r_{k-1} &= w_{k-1} r_{k-1}^T A \tilde{p}_{k-1} \\
&= w_{k-1} (p_{k-1} - \mu_{k-1} p_{k-2})^T A \tilde{p}_{k-1} \\
&= w_{k-1} p_{k-1}^T A \tilde{p}_{k-1} \\
&= \frac{w_{k-1}}{\|r_{k-1}\|_2} p_{k-1}^T A p_{k-1}.
\end{align*}
This yields the alternative formula
\[
\mu_k = \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}},
\]
which eliminates one unnecessary matrix-vector multiplication per iteration.
To complete the algorithm, we need an efficient method of computing

x(k) = x(k−1) + wk p̃k = x(k−1) + νk pk , rk+1 = rk − νk Apk

for some constant νk . From the definition of pk = krk k2 p̃k , we know that νk = wk /krk k2 , but
we wish to avoid the explicit computation of Tk and its LDLT factorization that are needed to
compute wk . Instead, we use the relation

rk+1 = rk − νk Apk

and the orthogonality of the residuals to obtain


\[
\nu_k = \frac{r_k^T r_k}{r_k^T A p_k}.
\]
From the relationship between the residuals and search directions, and the A-orthogonality of the
search directions, we obtain
\[
\nu_k = \frac{r_k^T r_k}{p_k^T A p_k}.
\]
The following algorithm results:

k = 1, r1 = b, x(0) = 0
while not converged do
if k > 1 then
µk = rTk rk /rTk−1 rk−1
pk = rk + µk pk−1
else
p1 = r1
end if
vk = Apk
νk = rTk rk /pTk vk
x(k) = x(k−1) + νk pk
rk+1 = rk − νk vk
k =k+1
end while

An appropriate stopping criterion is that the norm of the residual rk+1 is smaller than some
tolerance. It is also important to impose a maximum number of iterations. Note that only one
matrix-vector multiplication per iteration is required.
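A direct Matlab transcription of this algorithm (a sketch; tol and maxit are assumed stopping parameters):

x = zeros(size(b));
r = b;                          % r_1 = b, since x^(0) = 0
rho = r'*r;
for k = 1:maxit
    if k > 1
        mu = rho/rho_old;       % mu_k = r_k'*r_k / r_{k-1}'*r_{k-1}
        p  = r + mu*p;
    else
        p = r;
    end
    v   = A*p;
    nu  = rho/(p'*v);           % nu_k = r_k'*r_k / p_k'*A*p_k
    x   = x + nu*p;
    r   = r - nu*v;
    rho_old = rho;
    rho = r'*r;
    if sqrt(rho) < tol, break, end
end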

2.5.2.4 Preconditioning
The conjugate gradient method is far more effective than the method of steepest descent, but it
can also suffer from slow convergence when A is ill-conditioned. Therefore, the conjugate gradient
method is often paired with a preconditioner that transforms the problem Ax = b into an equivalent
problem in which the matrix is close to I. The basic idea is to solve the problem

Ãx̃ = b̃

where
à = C −1 AC −1 , x̃ = Cx, b̃ = C −1 b,
and C is symmetric positive definite. Modifying the conjugate gradient method to solve this
problem, we obtain the algorithm

k = 1, C −1 r1 = C −1 b, Cx(0) = 0
while not converged do
if k > 1 then
µk = rTk C −2 rk /rTk−1 C −2 rk−1
Cpk = C −1 rk + µk Cpk−1
else
Cp1 = C −1 r1
end if
C −1 vk = C −1 Apk
νk = rTk C −2 rk /pTk Cvk
Cx(k) = Cx(k−1) + νk Cpk
C −1 rk+1 = C −1 rk − νk vk
k =k+1
end while

which, upon defining M = C 2 , simplifies to

k = 1, r1 = b, x(0) = 0
while not converged do
Solve M zk = rk
if k > 1 then
µk = rTk zk /rTk−1 zk−1
pk = zk + µk pk−1
else
p1 = z1
end if
vk = Apk
νk = rTk zk /pTk vk

x(k) = x(k−1) + νk pk
rk+1 = rk − νk vk
k =k+1
end while

We see that the action of the transformation is only felt through the preconditioner M = C 2 .
Because a system involving M is solved during each iteration, it is essential that such a system
is easily solved. One example of such a preconditioner is to define M = HH T , where H is an
“incomplete Cholesky factor” of A, which is a sparse matrix that approximates the true Cholesky
factor.
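Matlab provides the functions pcg and, for sparse matrices, ichol, which can be combined along these lines (a sketch; the tolerance and iteration limit are illustrative):

A = sparse(A);                      % ichol requires a sparse matrix
H = ichol(A);                       % incomplete Cholesky factor, lower triangular
tol = 1e-8; maxit = 200;
x = pcg(A, b, tol, maxit, H, H');   % solves A*x = b with preconditioner M = H*H'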
Chapter 3

Least Squares Problems

3.1 The Full Rank Least Squares Problem


Given an m × n matrix A, with m ≥ n, and an m-vector b, we consider the overdetermined system
of equations Ax = b, in the case where A has full column rank. If b is in the range of A, then
there exists a unique solution x∗ . For example, there exists a unique solution in the case of
\[
A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix},
\]
but not if $b = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}^T$. In such cases, when b is not in the range of A, then we seek to
minimize kb − Axkp for some p. Recall that the vector r = b − Ax is known as the residual vector.
Different norms give different solutions. If p = 1 or p = ∞, then the function we seek to mini-
mize, f (x) = kb − Axkp is not differentiable, so we cannot use standard minimization techniques.
However, if p = 2, f (x) is differentiable, and thus the problem is more tractable. We now consider
two methods.

3.1.1 Derivation of the Normal Equations


The second method is to define $\phi(x) = \frac{1}{2}\|b - Ax\|_2^2$, which is a differentiable function of x. To help
us to characterize the minimum of this function, we first compute the gradient of simpler functions.
To that end, let
ψ(x) = cT x
where c is a constant vector. Then, from
\[
\psi(x) = \sum_{j=1}^{n} c_j x_j,
\]
we have
\[
\frac{\partial \psi}{\partial x_k} = \sum_{j=1}^{n} c_j \frac{\partial x_j}{\partial x_k} = \sum_{j=1}^{n} c_j \delta_{jk} = c_k,
\]


and therefore
∇ψ(x) = c.
Now, let
\[
\varphi(x) = x^T B x = \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij} x_i x_j.
\]
Then
\begin{align*}
\frac{\partial \varphi}{\partial x_k} &= \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij} \frac{\partial (x_i x_j)}{\partial x_k} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij} (\delta_{ik} x_j + x_i \delta_{jk}) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij} x_j \delta_{ik} + \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij} x_i \delta_{jk} \\
&= \sum_{j=1}^{n} b_{kj} x_j + \sum_{i=1}^{n} b_{ik} x_i \\
&= (Bx)_k + \sum_{i=1}^{n} (B^T)_{ki} x_i \\
&= (Bx)_k + (B^T x)_k.
\end{align*}

We conclude that
∇ϕ(x) = (B + B T )x.
From
\[
\|y\|_2^2 = y^T y,
\]
and the properties of the transpose, we obtain
\begin{align*}
\frac{1}{2} \|b - Ax\|_2^2 &= \frac{1}{2} (b - Ax)^T (b - Ax) \\
&= \frac{1}{2} b^T b - \frac{1}{2} (Ax)^T b - \frac{1}{2} b^T Ax + \frac{1}{2} x^T A^T A x \\
&= \frac{1}{2} b^T b - b^T Ax + \frac{1}{2} x^T A^T A x \\
&= \frac{1}{2} b^T b - (A^T b)^T x + \frac{1}{2} x^T A^T A x.
\end{align*}
Using the above formulas, with $c = A^T b$ for the linear term and $B = \frac{1}{2} A^T A$ for the quadratic term, we have
\[
\nabla \left( \frac{1}{2} \|b - Ax\|_2^2 \right) = -A^T b + \frac{1}{2} \left( A^T A + (A^T A)^T \right) x.
\]

However, because
(AT A)T = AT (AT )T = AT A,

this simplifies to
\[
\nabla \left( \frac{1}{2} \|b - Ax\|_2^2 \right) = -A^T b + A^T A x = A^T A x - A^T b.
\]
The Hessian of the function $\varphi(x)$, denoted by $H_\varphi(x)$, is the matrix with entries
\[
h_{ij} = \frac{\partial^2 \varphi}{\partial x_i \partial x_j}.
\]
Because mixed second partial derivatives satisfy
\[
\frac{\partial^2 \varphi}{\partial x_i \partial x_j} = \frac{\partial^2 \varphi}{\partial x_j \partial x_i}
\]

as long as they are continuous, the Hessian is symmetric under these assumptions.
In the case of ϕ(x) = xT Bx, whose gradient is ∇ϕ(x) = (B + B T )x, the Hessian is Hϕ (x) =
B + B T . It follows from the previously computed gradient of 12 kb − Axk22 that its Hessian is AT A.
Recall that A is m × n, with m ≥ n and rank(A) = n. Then, if x 6= 0, it follows from the linear
independence of A’s columns that Ax 6= 0. We then have

xT AT Ax = (Ax)T Ax = kAxk22 > 0,

since the norm of a nonzero vector must be positive. It follows that AT A is not only symmetric,
but positive definite as well. Therefore, the Hessian of φ(x) is positive definite, which means that
the unique critical point x, the solution to the equations AT Ax − AT b = 0, is a minimum.
In general, if the Hessian at a critical point is

• positive definite, meaning that its eigenvalues are all positive, then the critical point is a local
minimum.

• negative definite, meaning that its eigenvalues are all negative, then the critical point is a
local maximum.

• indefinite, meaning that it has both positive and negative eigenvalues, then the critical point
is a saddle point.

• singular, meaning that one of its eigenvalues is zero, then the second derivative test is incon-
clusive.

In summary, we can minimize φ(x) by noting that ∇φ(x) = AT (b − Ax), which means that
∇φ(x) = 0 if and only if AT Ax = AT b. This system of equations is called the normal equations,
which were used by Gauss to solve the least squares problem. If m ≫ n then AT A is n × n, which
is a much smaller system to solve than Ax = b, and if κ(AT A) is not too large, we can use the
Cholesky factorization to solve for x, as AT A is symmetric positive definite.
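A sketch of this approach in Matlab; for comparison, the backslash operator applied directly to A solves the same least squares problem, but via a QR factorization, which is generally more stable:

M = A'*A;                 % n x n, symmetric positive definite
c = A'*b;
R = chol(M);              % M = R'*R
x = R \ (R' \ c);         % two triangular solves
% for reference: x = A\b computes the least squares solution directly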

3.1.2 Solving the Normal Equations


Instead of using the Cholesky factorization, we can solve the linear least squares problem using the
normal equations
AT Ax = AT b
as follows: first, we solve the above system to obtain an approximate solution x̂, and compute the
residual vector r = b − Ax̂. Now, because AT r = AT b − AT Ax̂ = 0, we obtain the system

r + Ax̂ = b
AT r = 0

or, in matrix form,
\[
\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ \hat{x} \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}.
\]
This is a large system, but it preserves the sparsity of A. It can be used in connection with iterative
refinement, but unfortunately this procedure does not work well because it is very sensitive to the
residual.

3.1.3 The Condition Number of AT A


Because the coefficient matrix of the normal equations is AT A, it is important to understand its
condition number. When A is n × n and invertible,
\[
\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2 = \frac{\sigma_1}{\sigma_n},
\]
where σ1 and σn are the largest and smallest singular values, respectively, of A. If A is m × n
with m > n and rank(A) = n, A−1 does not exist, but the quantity σ1 /σn is still defined and an
appropriate measure of the sensitivity of the least squares problem to perturbations in the data, so
we denote this ratio by κ2 (A) in this case as well.
From the relations
Avj = σj uj , AT uj = σj vj , j = 1, . . . , n,
where uj and vj are the left and right singular vectors, respectively, of A, we have

AT Avj = σj2 vj .

That is, σj2 is an eigenvalue of AT A. Furthermore, because AT A is symmetric positive definite, the
eigenvalues of AT A are also its singular values. Specifically, if $A = U\Sigma V^T$ is the SVD of A, then
$A^T A = V(\Sigma^T \Sigma)V^T$ is the SVD of $A^T A$.
It follows that the condition number in the 2-norm of AT A is
\[
\kappa_2(A^T A) = \|A^T A\|_2 \|(A^T A)^{-1}\|_2 = \frac{\sigma_1^2}{\sigma_n^2} = \left( \frac{\sigma_1}{\sigma_n} \right)^2 = \kappa_2(A)^2.
\]

Note that because A has full column rank, AT A is nonsingular, and therefore (AT A)−1 exists, even
though A−1 may not.

3.2 The QR Factorization


The first approach is to take advantage of the fact that the 2-norm is invariant under orthogonal
transformations, and seek an orthogonal matrix Q such that the transformed problem

min kb − Axk2 = min kQT (b − Ax)k2

is "easy" to solve. Let
\[
A = QR = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix} = Q_1 R_1,
\]
where Q1 is m × n and R1 is n × n. Then, because Q is orthogonal, QT A = R and

\begin{align*}
\min \|b - Ax\|_2 &= \min \|Q^T (b - Ax)\|_2 \\
&= \min \|Q^T b - (Q^T A)x\|_2 \\
&= \min \left\| Q^T b - \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x \right\|_2.
\end{align*}
If we partition
\[
Q^T b = \begin{bmatrix} c \\ d \end{bmatrix},
\]
where c is an n-vector, then
\[
\min \|b - Ax\|_2^2 = \min \left\| \begin{bmatrix} c \\ d \end{bmatrix} - \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x \right\|_2^2 = \min \left( \|c - R_1 x\|_2^2 + \|d\|_2^2 \right).
\]
Therefore, the minimum is achieved by the vector x such that $R_1 x = c$, and therefore
\[
\min_x \|b - Ax\|_2 = \|d\|_2 \equiv \rho_{LS}.
\]

It makes sense to seek a factorization of the form A = QR where Q is orthogonal, and R is


upper-triangular, so that R1 x = c is easily solved. This is called the QR factorization of A.
Let A be an m × n matrix with full column rank. The QR factorization of A is a decomposition
A = QR, where Q is an m × m orthogonal matrix and R is an m × n upper triangular matrix.
There are three ways to compute this decomposition:

1. Using Householder matrices, developed by Alston S. Householder

2. Using Givens rotations, also known as Jacobi rotations, used by W. Givens and originally
invented by Jacobi for use in solving the symmetric eigenvalue problem in 1846.

3. A third, less frequently used approach is the Gram-Schmidt orthogonalization.

3.2.1 Gram-Schmidt Orthogonalization


Givens rotations or Householder reflections can be used to compute the "full" QR decomposition
\[
A = QR = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix},
\]

where Q is an m × m orthogonal matrix, and R1 is an n × n upper-triangular matrix that is


nonsingular if and only if A is of full column rank (that is, rank(A) = n).
It can be seen from the above partitions of Q and R that A = Q1 R1 . Furthermore, it can be
shown that range(A) = range(Q1 ), and (range(A))⊥ = range(Q2 ). In fact, for k = 1, . . . , n, we
have
span{a1 , . . . , ak } = span{q1 , . . . , qk },

where
\[
A = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}, \qquad Q = \begin{bmatrix} q_1 & \cdots & q_m \end{bmatrix}
\]
are column partitions of A and Q, respectively.


We now examine two methods for computing the “thin” or “economy-size” QR factorization
A = Q1 R1 , which is sufficient for solving full-rank least-squares problems, as the least-squares
solution x that minimizes kb − Axk2 can be obtained by solving the upper-triangular system
R1 x = QT1 b by back substitution.
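In Matlab, the economy-size factorization can be obtained with qr(A,0), after which the least squares solution follows exactly as described above (a sketch):

[Q1, R1] = qr(A, 0);      % economy-size QR: Q1 is m x n, R1 is n x n
x = R1 \ (Q1'*b);         % back substitution applied to R1*x = Q1'*b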

3.2.2 Classical Gram-Schmidt


Consider the "thin" QR factorization
\[
A = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix} = \begin{bmatrix} q_1 & \cdots & q_n \end{bmatrix} \begin{bmatrix} r_{11} & \cdots & r_{1n} \\ & \ddots & \vdots \\ & & r_{nn} \end{bmatrix} = QR.
\]
From the above matrix product we can see that $a_1 = r_{11} q_1$, from which it follows that
\[
r_{11} = \pm \|a_1\|_2, \qquad q_1 = \frac{1}{\|a_1\|_2} a_1.
\]
For convenience, we choose the + sign for $r_{11}$.


Next, by taking the inner product of both sides of the equation a2 = r12 q1 + r22 q2 with q1 , and
imposing the requirement that the columns of Q form an orthonormal set, we obtain

\[
r_{12} = q_1^T a_2, \qquad r_{22} = \|a_2 - r_{12} q_1\|_2, \qquad q_2 = \frac{1}{r_{22}} (a_2 - r_{12} q_1).
\]
In general, we use the relation
\[
a_k = \sum_{j=1}^{k} r_{jk} q_j
\]
to obtain
\[
q_k = \frac{1}{r_{kk}} \left( a_k - \sum_{j=1}^{k-1} r_{jk} q_j \right), \qquad r_{jk} = q_j^T a_k.
\]

Note that $q_k$ can be rewritten as
\begin{align*}
q_k &= \frac{1}{r_{kk}} \left( a_k - \sum_{j=1}^{k-1} (q_j^T a_k) q_j \right) \\
&= \frac{1}{r_{kk}} \left( a_k - \sum_{j=1}^{k-1} q_j q_j^T a_k \right) \\
&= \frac{1}{r_{kk}} \left( I - \sum_{j=1}^{k-1} q_j q_j^T \right) a_k.
\end{align*}
If we define $P_i = q_i q_i^T$, then each $P_i$ is a symmetric projection that satisfies $P_i^2 = P_i$, and $P_i P_j = \delta_{ij} P_i$.
Thus we can write
\[
q_k = \frac{1}{r_{kk}} \left( I - \sum_{j=1}^{k-1} P_j \right) a_k = \frac{1}{r_{kk}} \prod_{j=1}^{k-1} (I - P_j) a_k.
\]

Unfortunately, Gram-Schmidt orthogonalization, as described, is numerically unstable, because,


for example, if a1 and a2 are almost parallel, then a2 − r12 q1 is almost zero, and roundoff error
becomes significant due to catastrophic cancellation.

3.2.3 Modified Gram-Schmidt


The Modified Gram-Schmidt method alleviates the numerical instability of “Classical” Gram-
Schmidt. Recall
 
\[
A = Q_1 R_1 = \begin{bmatrix} r_{11} q_1 & r_{12} q_1 + r_{22} q_2 & \cdots \end{bmatrix}.
\]
We define
\[
C^{(k)} = \sum_{i=1}^{k-1} q_i r_i^T, \qquad r_i^T = \begin{bmatrix} 0 & \cdots & 0 & r_{ii} & r_{i,i+1} & \cdots & r_{in} \end{bmatrix},
\]
which means
\[
A - C^{(k)} = \begin{bmatrix} 0 & 0 & \cdots & 0 & A^{(k)} \end{bmatrix},
\]
because the first k − 1 columns of A are linear combinations of the first k − 1 columns of Q1 , and
the contributions of these columns of Q1 to all columns of A are removed by subtracting C (k) .
If we write
\[
A^{(k)} = \begin{bmatrix} z_k & B_k \end{bmatrix},
\]
then, because the kth column of A is a linear combination of the first k columns of Q1 , and the
contributions of the first k − 1 columns are removed in A(k) , zk must be a multiple of qk . Therefore,
\[
r_{kk} = \|z_k\|_2, \qquad q_k = \frac{1}{r_{kk}} z_k.
\]
We then compute
\[
\begin{bmatrix} r_{k,k+1} & \cdots & r_{k,n} \end{bmatrix} = q_k^T B_k,
\]
which yields
\[
A^{(k+1)} = B_k - q_k \begin{bmatrix} r_{k,k+1} & \cdots & r_{kn} \end{bmatrix}.
\]
This process is numerically stable.
Note that Modified Gram-Schmidt computes the entries of R1 row-by-row, rather than column-
by-column, as Classical Gram-Schmidt does. This rearrangement of the order of operations, while
mathematically equivalent to Classical Gram-Schmidt, is much more stable, numerically, because
each entry of R1 is obtained by computing an inner product of a column of Q1 with a modified
column of A, from which the contributions of all previous columns of Q1 have been removed.
To see why this is significant, consider the inner products

uT v, uT (v + w),

where uT w = 0. The above inner products are equal, but suppose that |uT v|  kwk. Then uT v
is a small number that is being computed by subtraction of potentially large numbers, which is
susceptible to catastrophic cancellation.
It can be shown that Modified Gram-Schmidt produces a matrix Q̂1 such that

Q̂T1 Q̂1 = I + EM GS , kEM GS k ≈ uκ2 (A),

and Q̂1 can be computed in approximately 2mn2 flops (floating-point operations), whereas with
Householder QR,
Q̂T1 Q̂1 = I + En , kEn k ≈ u,
with Q̂1 being computed in approximately 2mn2 − 2n3 /3 flops to factor A and an additional
2mn2 − 2n3 /3 flops to obtain Q1 , the first n columns of Q. That is, Householder QR is much less
sensitive to roundoff error than Gram-Schmidt, even with modification, although Gram-Schmidt is
more efficient if an explicit representation of Q1 is desired.
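A compact Matlab sketch of Modified Gram-Schmidt, computing the thin factors and filling in R1 row by row as described above (the function name mgs is illustrative):

function [Q, R] = mgs(A)
    [~, n] = size(A);
    Q = A;                          % overwritten with orthonormal columns
    R = zeros(n);
    for k = 1:n
        R(k,k) = norm(Q(:,k));
        Q(:,k) = Q(:,k)/R(k,k);
        % remove the component along q_k from all remaining columns
        R(k,k+1:n) = Q(:,k)'*Q(:,k+1:n);
        Q(:,k+1:n) = Q(:,k+1:n) - Q(:,k)*R(k,k+1:n);
    end
end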

3.2.4 Householder Reflections


It is natural to ask whether we can introduce more zeros with each orthogonal rotation. To that
end, we examine Householder reflections. Consider a matrix of the form P = I − τ uuT , where
u 6= 0 and τ is a nonzero constant. It is clear that P is a symmetric rank-1 change of I. Can we
choose τ so that P is also orthogonal? From the desired relation P T P = I we obtain

\begin{align*}
P^T P &= (I - \tau uu^T)^T (I - \tau uu^T) \\
&= I - 2\tau uu^T + \tau^2 uu^T uu^T \\
&= I - 2\tau uu^T + \tau^2 (u^T u) uu^T \\
&= I + (\tau^2 u^T u - 2\tau) uu^T \\
&= I + \tau(\tau u^T u - 2) uu^T.
\end{align*}

It follows that if τ = 2/uT u, then P T P = I for any nonzero u. Without loss of generality, we can
stipulate that uT u = 1, and therefore P takes the form P = I − 2vvT , where vT v = 1.
Why is the matrix P called a reflection? This is because for any nonzero vector x, P x is the
reflection of x across the hyperplane that is normal to v. To see this, we consider the 2 × 2 case

and set $v = \begin{bmatrix} 1 & 0 \end{bmatrix}^T$ and $x = \begin{bmatrix} 1 & 2 \end{bmatrix}^T$. Then
\[
P = I - 2vv^T = I - 2 \begin{bmatrix} 1 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - 2 \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}.
\]
Therefore
\[
Px = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}.
\]
Now, let x be any vector. We wish to construct P so that $Px = \alpha e_1$ for some α, where
$e_1 = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}^T$. From the relations
\[
\|Px\|_2 = \|x\|_2, \qquad \|\alpha e_1\|_2 = |\alpha| \|e_1\|_2 = |\alpha|,
\]
we obtain $\alpha = \pm\|x\|_2$. To determine P , we begin with the equation
\[
Px = (I - 2vv^T)x = x - 2vv^T x = \alpha e_1.
\]

Rearranging, we obtain
\[
\frac{1}{2}(x - \alpha e_1) = (v^T x) v.
\]
It follows that the vector v, which is a unit vector, must be a scalar multiple of $x - \alpha e_1$. Therefore,
v is defined by the equations
\begin{align*}
v_1 &= \frac{x_1 - \alpha}{\|x - \alpha e_1\|_2} = \frac{x_1 - \alpha}{\sqrt{\|x\|_2^2 - 2\alpha x_1 + \alpha^2}} = \frac{x_1 - \alpha}{\sqrt{2\alpha^2 - 2\alpha x_1}} = -\frac{\alpha - x_1}{\sqrt{2\alpha(\alpha - x_1)}} = -\operatorname{sgn}(\alpha) \sqrt{\frac{\alpha - x_1}{2\alpha}}, \\
v_2 &= \frac{x_2}{\sqrt{2\alpha(\alpha - x_1)}} = -\frac{x_2}{2\alpha v_1}, \\
&\;\;\vdots \\
v_n &= \frac{x_n}{\sqrt{2\alpha(\alpha - x_1)}} = -\frac{x_n}{2\alpha v_1}.
\end{align*}

To avoid catastrophic cancellation, it is best to choose the sign of α so that it has the opposite sign
of x1 . It can be seen that the computation of v requires about 3n operations.
Note that the matrix P is never formed explicitly. For any vector b, the product P b can be
computed as follows:
P b = (I − 2vvT )b = b − 2(vT b)v.
This process requires only 4n operations. It is easy to see that we can represent P simply by storing
only v.
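A sketch of these two steps in Matlab; the helper name house is an illustrative choice, not a built-in function:

% compute a unit vector v such that (I - 2*v*v')*x = alpha*e_1
function [v, alpha] = house(x)
    normx = norm(x);
    if x(1) >= 0                % choose the sign of alpha opposite to x(1)
        alpha = -normx;
    else
        alpha = normx;
    end
    v = x;
    v(1) = x(1) - alpha;
    v = v/norm(v);
end

% applying P = I - 2*v*v' to a vector b without forming P:
% Pb = b - 2*(v'*b)*v;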
Now, suppose that $x = a_1$ is the first column of a matrix A. Then we construct a
Householder reflection $H_1 = I - 2v_1 v_1^T$ such that $H_1 x = \alpha e_1$, and we have
\[
A^{(2)} = H_1 A = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ 0 & & & \\ \vdots & a^{(2)}_{2:m,2} & \cdots & a^{(2)}_{2:m,n} \\ 0 & & & \end{bmatrix},
\]
where we denote the constant α by $r_{11}$, as it is the (1, 1) element of the updated matrix $A^{(2)}$. Now,
we can construct $\tilde{H}_2$ such that
\[
\tilde{H}_2 a^{(2)}_{2:m,2} = \begin{bmatrix} r_{22} \\ 0 \\ \vdots \\ 0 \end{bmatrix},
\qquad
A^{(3)} = \begin{bmatrix} 1 & 0 \\ 0 & \tilde{H}_2 \end{bmatrix} A^{(2)} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1n} \\ 0 & r_{22} & r_{23} & \cdots & r_{2n} \\ 0 & 0 & & & \\ \vdots & \vdots & a^{(3)}_{3:m,3} & \cdots & a^{(3)}_{3:m,n} \\ 0 & 0 & & & \end{bmatrix}.
\]
Note that the first column of $A^{(2)}$ is unchanged by $\tilde{H}_2$, because $\tilde{H}_2$ only operates on rows 2 through
m, which, in the first column, have zero entries. Continuing this process, we obtain
\[
H_n \cdots H_1 A = A^{(n+1)} = R,
\]
where, for $j = 1, 2, \ldots, n$,
\[
H_j = \begin{bmatrix} I_{j-1} & 0 \\ 0 & \tilde{H}_j \end{bmatrix}
\]
and R is an upper triangular matrix. We have thus factored A = QR, where Q = H1 H2 · · · Hn is
an orthogonal matrix.
Note that for each j = 1, 2, . . . , n, H̃j is also a Householder reflection, based on a vector whose
first j − 1 components are equal to zero. Therefore, application of Hj to a matrix does not affect
the first j rows or columns. We also note that

AT A = RT QT QR = RT R,

and thus R is the Cholesky factor of AT A.



Example We apply Householder reflections to compute the QR factorization of the matrix from
the previous example,  
0.8147 0.0975 0.1576
 0.9058 0.2785 0.9706 
A(1) = A = 
 
 0.1270 0.5469 0.9572  .

 0.9134 0.9575 0.4854 
0.6324 0.9649 0.8003
First, we work with the first column of A,
 
0.8147
 0.9058 
(1)  
x1 = a1:5,1 =
 0.1270 ,
 kx1 k2 = 1.6536.
 0.9134 
0.6324

The corresponding Householder vector is


     
0.8147 1.0000 2.4684
 0.9058   0   0.9058 
     
ṽ1 = x1 + kx1 k2 e1 = 
 0.1270
 + 1.6536 
  0 =
  0.1270 .

 0.9134   0   0.9134 
0.6324 0 0.6324

From this vector, we build the Householder reflection


2
c= = 0.2450, H̃1 = I − cṽ1 ṽ1T .
ṽ1T ṽ1

Applying this reflection to A(1) yields


 
−1.6536 −1.1405 −1.2569
 0 −0.1758 0.4515 
(1)  
H̃1 A1:5,1:3 =
 0 0.4832 0.8844 ,

 0 0.4994 −0.0381 
0 0.6477 0.4379
 
−1.6536 −1.1405 −1.2569
 0 −0.1758 0.4515 
(2)
 
A =
 0 0.4832 0.8844 .

 0 0.4994 −0.0381 
0 0.6477 0.4379
Next, we take the “interesting” portion of the second column of the updated matrix A(2) , from
rows 2 to 5:  
−0.1758
(2)  0.4832 
 0.4994  , kx2 k2 = 0.9661.
x2 = a2:5,2 =  

0.6477

The corresponding Householder vector is


     
−0.1758 1.0000 −1.1419
 0.4832 0 
 =  0.4832  .
   
ṽ2 = x2 − kx2 k2 e1 = 
 0.4994
 − 0.9661 
  0   0.4994 
0.6477 0 0.6477

From this vector, we build the Householder reflection


2
c= = 0.9065, H̃2 = I − cṽ2 ṽ2T .
ṽ2T ṽ2

Applying this reflection to A(2) yields


 
0.9661 0.6341
(2)  0 0.8071 
H̃2 A2:5,2:3 = ,
 0 −0.1179 
0 0.3343
 
−1.6536 −1.1405 −1.2569
 0 0.9661 0.6341 
(3)
 
A =
 0 0 0.8071 .

 0 0 −0.1179 
0 0 0.3343
Finally, we take the interesting portion of the third column of A(3) , from rows 3 to 5:
 
0.8071
(3)
x3 = a3:5,3 =  −0.1179  , kx3 k2 = 0.8816.
0.3343

The corresponding Householder vector is


     
0.8071 1.0000 1.6887
ṽ3 = x3 + kx3 k2 e1 =  −0.1179  + 0.8816  0  =  −0.1179  .
0.3343 0 0.3343

From this vector, we build the Householder reflection


2
c= = 0.6717, H̃3 = I − cṽ3 ṽ3T .
ṽ3T ṽ3

Applying this reflection to A(3) yields


 
  −1.6536 −1.1405 −1.2569
−0.8816  0 0.9661 0.6341 
(3) (4)
 
H̃3 A3:5,3:3 = 0 , A =
 0 0 −0.8816 
.
0  0 0 0 
0 0 0

Applying these same Householder reflections, in order, on the right of the identity matrix, yields
the orthogonal matrix
 
−0.4927 −0.4806 0.1780 −0.6015 −0.3644
 −0.5478 −0.3583 −0.5777 0.3760 0.3104 
 
 −0.0768
Q = H1 H2 H3 =  0.4754 −0.6343 −0.1497 −0.5859 

 −0.5523 0.3391 0.4808 0.5071 −0.3026 
−0.3824 0.5473 0.0311 −0.4661 0.5796

such that  
−1.6536 −1.1405 −1.2569
 0 0.9661 0.6341 
A(4) = R = QT A = H3 H2 H1 A = 
 
 0 0 −0.8816 

 0 0 0 
0 0 0
is upper triangular, where
 
  1 0 0
1 0
H1 = H̃1 , H2 = , H3 =  0 1 0  ,
0 H̃2
0 0 H̃3

are the same Householder transformations as before, defined in such a way that they can be applied
to the entire matrix A. Note that for j = 1, 2, 3,
 
T 0
Hj = I − 2vj vj , vj = , kvj k2 = kṽj k2 = 1,
ṽj

where the first j − 1 components of vj are equal to zero.


Also, note that the first n = 3 columns of Q are the same as those of the matrix Q that was
computed in the previous example, except for possible negation. The fourth and fifth columns are
not the same, but they do span the same subspace, the orthogonal complement of the range of A.
2

3.2.5 Givens Rotations


We illustrate the process in the case where A is a 2×2 matrix. In Gaussian elimination, we compute
L−1 A = U where L−1 is unit lower triangular and U is upper triangular. Specifically,
\[
\begin{bmatrix} 1 & 0 \\ m_{21} & 1 \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ 0 & a_{22}^{(2)} \end{bmatrix}, \qquad m_{21} = -\frac{a_{21}}{a_{11}}.
\]
By contrast, the QR decomposition computes $Q^T A = R$, or
\[
\begin{bmatrix} \gamma & -\sigma \\ \sigma & \gamma \end{bmatrix}^T \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} \\ 0 & r_{22} \end{bmatrix},
\]
where $\gamma^2 + \sigma^2 = 1$.

From the relationship $-\sigma a_{11} + \gamma a_{21} = 0$ we obtain
\[
\gamma a_{21} = \sigma a_{11}, \qquad \gamma^2 a_{21}^2 = \sigma^2 a_{11}^2 = (1 - \gamma^2) a_{11}^2,
\]
which yields
\[
\gamma = \pm \frac{a_{11}}{\sqrt{a_{21}^2 + a_{11}^2}}.
\]
It is conventional to choose the + sign. Then, we obtain
\[
\sigma^2 = 1 - \gamma^2 = 1 - \frac{a_{11}^2}{a_{21}^2 + a_{11}^2} = \frac{a_{21}^2}{a_{21}^2 + a_{11}^2},
\]
or
\[
\sigma = \pm \frac{a_{21}}{\sqrt{a_{21}^2 + a_{11}^2}}.
\]
Again, we choose the + sign. As a result, we have
\[
r_{11} = a_{11} \frac{a_{11}}{\sqrt{a_{21}^2 + a_{11}^2}} + a_{21} \frac{a_{21}}{\sqrt{a_{21}^2 + a_{11}^2}} = \sqrt{a_{21}^2 + a_{11}^2}.
\]
The matrix
\[
Q = \begin{bmatrix} \gamma & -\sigma \\ \sigma & \gamma \end{bmatrix}^T
\]
is called a Givens rotation. It is called a rotation because it is orthogonal, and therefore length-
preserving, and also because there is an angle θ such that sin θ = σ and cos θ = γ, and its effect is
to rotate a vector clockwise through the angle θ. In particular,
\[
\begin{bmatrix} \gamma & -\sigma \\ \sigma & \gamma \end{bmatrix}^T \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} \rho \\ 0 \end{bmatrix},
\]
where $\rho = \sqrt{\alpha^2 + \beta^2}$, α = ρ cos θ and β = ρ sin θ. It is easy to verify that the product of two
rotations is itself a rotation. Now, in the case where A is an n × n matrix, suppose that we have
the vector  
\[
v = \begin{bmatrix} \times & \cdots & \times & \alpha & \times & \cdots & \times & \beta & \times & \cdots & \times \end{bmatrix}^T,
\]
in which α appears in position i and β in position j. Then, applying a rotation whose entries γ and σ
are placed in rows and columns i and j of the identity matrix, we obtain
\[
\begin{bmatrix}
1 & & & & & & \\
& \ddots & & & & & \\
& & \gamma & \cdots & \sigma & & \\
& & \vdots & \ddots & \vdots & & \\
& & -\sigma & \cdots & \gamma & & \\
& & & & & \ddots & \\
& & & & & & 1
\end{bmatrix}
\begin{bmatrix} \times \\ \vdots \\ \alpha \\ \vdots \\ \beta \\ \vdots \\ \times \end{bmatrix}
=
\begin{bmatrix} \times \\ \vdots \\ \rho \\ \vdots \\ 0 \\ \vdots \\ \times \end{bmatrix},
\]
so that the entry β is zeroed out while all entries other than those in positions i and j are unchanged.
So, in order to transform A into an upper triangular matrix R, we can find a product of rotations Q
such that QT A = R. It is easy to see that O(n2 ) rotations are required. Each rotation takes O(n)
operations, so the entire process of computing the QR factorization requires O(n3 ) operations.
It is important to note that the straightforward approach to computing the entries γ and σ of
the Givens rotation,
\[
\gamma = \frac{\alpha}{\sqrt{\alpha^2 + \beta^2}}, \qquad \sigma = \frac{\beta}{\sqrt{\alpha^2 + \beta^2}},
\]
is not always advisable, because in floating-point arithmetic, the computation of $\sqrt{\alpha^2 + \beta^2}$ could
overflow. To get around this problem, suppose that |β| ≥ |α|. Then, we can instead compute
\[
\tau = \frac{\alpha}{\beta}, \qquad \sigma = \frac{1}{\sqrt{1 + \tau^2}}, \qquad \gamma = \sigma\tau,
\]
which is guaranteed not to overflow since the only number that is squared is less than one in
magnitude. On the other hand, if |α| ≥ |β|, then we compute
\[
\tau = \frac{\beta}{\alpha}, \qquad \gamma = \frac{1}{\sqrt{1 + \tau^2}}, \qquad \sigma = \gamma\tau.
\]
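A sketch of this overflow-safe computation as a Matlab function, written to match the description of givens(a, b) given next (the handling of b = 0 is an added safeguard):

function [c, s] = givens(a, b)
    if b == 0
        c = 1; s = 0;
    elseif abs(b) >= abs(a)
        tau = a/b;
        s = 1/sqrt(1 + tau^2);
        c = s*tau;
    else
        tau = b/a;
        c = 1/sqrt(1 + tau^2);
        s = c*tau;
    end
end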
Now, we describe the entire algorithm for computing the QR factorization using Givens rota-
tions. Let [c, s] = givens(a, b) be a Matlab-style function that computes c and s such that
\[
\begin{bmatrix} c & -s \\ s & c \end{bmatrix}^T \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix}, \qquad r = \sqrt{a^2 + b^2}.
\]

Then, let G(i, j, c, s)T be the Givens rotation matrix that rotates the ith and jth elements of a
vector v clockwise by the angle θ such that cos θ = c and sin θ = s, so that if vi = a and vj = b,
and [c, s] = givens(a, b), then in the updated vector u = G(i, j, c, s)T v, ui = r = √(a2 + b2 ) and
uj = 0. The QR factorization of an m × n matrix A is then computed as follows.

Q = I
R = A
for j = 1 : n do
    for i = m : −1 : j + 1 do
        [c, s] = givens(ri−1,j , rij )
        R = G(i, j, c, s)T R
        Q = QG(i, j, c, s)
    end
end

Note that the matrix Q is accumulated by column rotations of the identity matrix, because the
matrix by which A is multiplied to reduce A to upper-triangular form, a product of row rotations,
is QT .

Example We use Givens rotations to compute the QR factorization of

 
$$A = \begin{bmatrix} 0.8147 & 0.0975 & 0.1576 \\ 0.9058 & 0.2785 & 0.9706 \\ 0.1270 & 0.5469 & 0.9572 \\ 0.9134 & 0.9575 & 0.4854 \\ 0.6324 & 0.9649 & 0.8003 \end{bmatrix}.$$

First, we compute a Givens rotation that, when applied to a41 and a51 , zeros a51 :

$$\begin{bmatrix} 0.8222 & -0.5692 \\ 0.5692 & 0.8222 \end{bmatrix}^T \begin{bmatrix} 0.9134 \\ 0.6324 \end{bmatrix} = \begin{bmatrix} 1.1109 \\ 0 \end{bmatrix}.$$

Applying this rotation to rows 4 and 5 yields

$$\begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&0.8222&-0.5692 \\ 0&0&0&0.5692&0.8222 \end{bmatrix}^T \begin{bmatrix} 0.8147&0.0975&0.1576 \\ 0.9058&0.2785&0.9706 \\ 0.1270&0.5469&0.9572 \\ 0.9134&0.9575&0.4854 \\ 0.6324&0.9649&0.8003 \end{bmatrix} = \begin{bmatrix} 0.8147&0.0975&0.1576 \\ 0.9058&0.2785&0.9706 \\ 0.1270&0.5469&0.9572 \\ 1.1109&1.3365&0.8546 \\ 0&0.2483&0.3817 \end{bmatrix}.$$

Next, we compute a Givens rotation that, when applied to a31 and a41 , zeros a41 :

$$\begin{bmatrix} 0.1136 & -0.9935 \\ 0.9935 & 0.1136 \end{bmatrix}^T \begin{bmatrix} 0.1270 \\ 1.1109 \end{bmatrix} = \begin{bmatrix} 1.1181 \\ 0 \end{bmatrix}.$$

Applying this rotation to rows 3 and 4 yields


$$\begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&0.1136&-0.9935&0 \\ 0&0&0.9935&0.1136&0 \\ 0&0&0&0&1 \end{bmatrix}^T \begin{bmatrix} 0.8147&0.0975&0.1576 \\ 0.9058&0.2785&0.9706 \\ 0.1270&0.5469&0.9572 \\ 1.1109&1.3365&0.8546 \\ 0&0.2483&0.3817 \end{bmatrix} = \begin{bmatrix} 0.8147&0.0975&0.1576 \\ 0.9058&0.2785&0.9706 \\ 1.1181&1.3899&0.9578 \\ 0&-0.3916&-0.8539 \\ 0&0.2483&0.3817 \end{bmatrix}.$$
Next, we compute a Givens rotation that, when applied to a21 and a31 , zeros a31 :
$$\begin{bmatrix} 0.6295 & -0.7770 \\ 0.7770 & 0.6295 \end{bmatrix}^T \begin{bmatrix} 0.9058 \\ 1.1181 \end{bmatrix} = \begin{bmatrix} 1.4390 \\ 0 \end{bmatrix}.$$
Applying this rotation to rows 2 and 3 yields
$$\begin{bmatrix} 1&0&0&0&0 \\ 0&0.6295&-0.7770&0&0 \\ 0&0.7770&0.6295&0&0 \\ 0&0&0&1&0 \\ 0&0&0&0&1 \end{bmatrix}^T \begin{bmatrix} 0.8147&0.0975&0.1576 \\ 0.9058&0.2785&0.9706 \\ 1.1181&1.3899&0.9578 \\ 0&-0.3916&-0.8539 \\ 0&0.2483&0.3817 \end{bmatrix} = \begin{bmatrix} 0.8147&0.0975&0.1576 \\ 1.4390&1.2553&1.3552 \\ 0&0.6585&-0.1513 \\ 0&-0.3916&-0.8539 \\ 0&0.2483&0.3817 \end{bmatrix}.$$
To complete the first column, we compute a Givens rotation that, when applied to a11 and a21 ,
zeros a21 :
$$\begin{bmatrix} 0.4927 & -0.8702 \\ 0.8702 & 0.4927 \end{bmatrix}^T \begin{bmatrix} 0.8147 \\ 1.4390 \end{bmatrix} = \begin{bmatrix} 1.6536 \\ 0 \end{bmatrix}.$$
Applying this rotation to rows 1 and 2 yields
$$\begin{bmatrix} 0.4927&-0.8702&0&0&0 \\ 0.8702&0.4927&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&1&0 \\ 0&0&0&0&1 \end{bmatrix}^T \begin{bmatrix} 0.8147&0.0975&0.1576 \\ 1.4390&1.2553&1.3552 \\ 0&0.6585&-0.1513 \\ 0&-0.3916&-0.8539 \\ 0&0.2483&0.3817 \end{bmatrix} = \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.5336&0.5305 \\ 0&0.6585&-0.1513 \\ 0&-0.3916&-0.8539 \\ 0&0.2483&0.3817 \end{bmatrix}.$$

Moving to the second column, we compute a Givens rotation that, when applied to a42 and a52 ,
zeros a52 :
$$\begin{bmatrix} 0.8445 & 0.5355 \\ -0.5355 & 0.8445 \end{bmatrix}^T \begin{bmatrix} -0.3916 \\ 0.2483 \end{bmatrix} = \begin{bmatrix} 0.4636 \\ 0 \end{bmatrix}.$$

Applying this rotation to rows 4 and 5 yields

$$\begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&0.8445&0.5355 \\ 0&0&0&-0.5355&0.8445 \end{bmatrix}^T \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.5336&0.5305 \\ 0&0.6585&-0.1513 \\ 0&-0.3916&-0.8539 \\ 0&0.2483&0.3817 \end{bmatrix} = \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.5336&0.5305 \\ 0&0.6585&-0.1513 \\ 0&-0.4636&-0.9256 \\ 0&0&-0.1349 \end{bmatrix}.$$

Next, we compute a Givens rotation that, when applied to a32 and a42 , zeros a42 :

$$\begin{bmatrix} 0.8177 & 0.5757 \\ -0.5757 & 0.8177 \end{bmatrix}^T \begin{bmatrix} 0.6585 \\ -0.4636 \end{bmatrix} = \begin{bmatrix} 0.8054 \\ 0 \end{bmatrix}.$$

Applying this rotation to rows 3 and 4 yields

$$\begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&0.8177&0.5757&0 \\ 0&0&-0.5757&0.8177&0 \\ 0&0&0&0&1 \end{bmatrix}^T \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.5336&0.5305 \\ 0&0.6585&-0.1513 \\ 0&-0.4636&-0.9256 \\ 0&0&-0.1349 \end{bmatrix} = \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.5336&0.5305 \\ 0&0.8054&0.4091 \\ 0&0&-0.8439 \\ 0&0&-0.1349 \end{bmatrix}.$$

Next, we compute a Givens rotation that, when applied to a22 and a32 , zeros a32 :

$$\begin{bmatrix} 0.5523 & -0.8336 \\ 0.8336 & 0.5523 \end{bmatrix}^T \begin{bmatrix} 0.5336 \\ 0.8054 \end{bmatrix} = \begin{bmatrix} 0.9661 \\ 0 \end{bmatrix}.$$

Applying this rotation to rows 2 and 3 yields


$$\begin{bmatrix} 1&0&0&0&0 \\ 0&0.5523&-0.8336&0&0 \\ 0&0.8336&0.5523&0&0 \\ 0&0&0&1&0 \\ 0&0&0&0&1 \end{bmatrix}^T \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.5336&0.5305 \\ 0&0.8054&0.4091 \\ 0&0&-0.8439 \\ 0&0&-0.1349 \end{bmatrix} = \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.9661&0.6341 \\ 0&0&-0.2163 \\ 0&0&-0.8439 \\ 0&0&-0.1349 \end{bmatrix}.$$
Moving to the third column, we compute a Givens rotation that, when applied to a43 and a53 , zeros
a53 :
$$\begin{bmatrix} 0.9875 & -0.1579 \\ 0.1579 & 0.9875 \end{bmatrix}^T \begin{bmatrix} -0.8439 \\ -0.1349 \end{bmatrix} = \begin{bmatrix} 0.8546 \\ 0 \end{bmatrix}.$$
Applying this rotation to rows 4 and 5 yields
$$\begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&0.9875&-0.1579 \\ 0&0&0&0.1579&0.9875 \end{bmatrix}^T \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.9661&0.6341 \\ 0&0&-0.2163 \\ 0&0&-0.8439 \\ 0&0&-0.1349 \end{bmatrix} = \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.9661&0.6341 \\ 0&0&-0.2163 \\ 0&0&-0.8546 \\ 0&0&0 \end{bmatrix}.$$
Finally, we compute a Givens rotation that, when applied to a33 and a43 , zeros a43 :
$$\begin{bmatrix} 0.2453 & -0.9694 \\ 0.9694 & 0.2453 \end{bmatrix}^T \begin{bmatrix} -0.2163 \\ -0.8546 \end{bmatrix} = \begin{bmatrix} 0.8816 \\ 0 \end{bmatrix}.$$
Applying this rotation to rows 3 and 4 yields
$$\begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&0.2453&-0.9694&0 \\ 0&0&0.9694&0.2453&0 \\ 0&0&0&0&1 \end{bmatrix}^T \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.9661&0.6341 \\ 0&0&-0.2163 \\ 0&0&-0.8546 \\ 0&0&0 \end{bmatrix} = \begin{bmatrix} 1.6536&1.1405&1.2569 \\ 0&0.9661&0.6341 \\ 0&0&-0.8816 \\ 0&0&0 \\ 0&0&0 \end{bmatrix} = R.$$

Applying the transpose of each Givens rotation, in order, to the columns of the identity matrix
yields the matrix
 
$$Q = \begin{bmatrix} 0.4927&-0.4806&0.1780&-0.7033&0 \\ 0.5478&-0.3583&-0.5777&0.4825&0.0706 \\ 0.0768&0.4754&-0.6343&-0.4317&-0.4235 \\ 0.5523&0.3391&0.4808&0.2769&-0.5216 \\ 0.3824&0.5473&0.0311&-0.0983&0.7373 \end{bmatrix}$$

such that QT A = R is upper triangular. 2


We showed how to construct Givens rotations in order to rotate two elements of a column vector
so that one element would be zero, and that approximately n2 /2 such rotations could be used to
transform A into an upper triangular matrix R. Because each rotation only modifies two rows of
A, it is possible to interchange the order of rotations that affect different rows, and thus apply
sets of rotations in parallel. This is the main reason why Givens rotations can be preferable to
Householder reflections. Other reasons are that they are easy to use when the QR factorization
needs to be updated as a result of adding a row to A or deleting a column of A. They are also
more efficient when A is sparse.

3.3 Rank-Deficient Least Squares


3.3.1 QR with Column Pivoting
When A does not have full column rank, the property

span{a1 , . . . , ak } = span{q1 , . . . , qk }

can not be expected to hold, because the first k columns of A could be linearly dependent, while
the first k columns of Q, being orthonormal, must be linearly independent.
Example The matrix
$$A = \begin{bmatrix} 1 & -2 & 1 \\ 2 & -4 & 0 \\ 1 & -2 & 3 \end{bmatrix}$$
has rank 2, because the first two columns are parallel, and therefore are linearly dependent, while
the third column is not parallel to either of the first two. Columns 1 and 3, or columns 2 and 3,
form linearly independent sets. 2
Therefore, in the case where rank(A) = r < n, we seek a decomposition of the form AΠ = QR,
where Π is a permutation matrix chosen so that the diagonal elements of R are maximized at each
stage. Specifically, suppose H1 is a Householder reflection chosen so that
 
$$H_1 A = \begin{bmatrix} r_{11} & \ast \\ 0 & \ast \\ \vdots & \vdots \\ 0 & \ast \end{bmatrix}, \qquad r_{11} = \|a_1\|_2.$$

To maximize r11 , we choose Π1 so that in the column-permuted matrix A = AΠ1 , we have ka1 k2 ≥
kaj k2 for j ≥ 2. For Π2 , we examine the lengths of the columns of the submatrix of A obtained by
removing the first row and column. It is not necessary to recompute the lengths of the columns,
because we can update them by subtracting the square of the first component from the square of
the total length.
This process is called QR with column pivoting. It yields the decomposition
$$A\Pi = Q \begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix},$$
where $Q = H_1 \cdots H_r$, $\Pi = \Pi_1 \cdots \Pi_r$, and $R$ is an upper triangular $r \times r$ matrix. The last $m - r$ rows are necessarily zero, because every column of $A$ is a linear combination of the first $r$ columns of $Q$.
Example We perform QR with column pivoting on the matrix
 
$$A = \begin{bmatrix} 1 & 3 & 5 & 1 \\ 2 & -1 & 2 & 1 \\ 1 & 4 & 6 & 1 \\ 4 & 5 & 10 & 1 \end{bmatrix}.$$
Computing the squares of the 2-norms of the columns yields
$$\|a_1\|_2^2 = 22, \quad \|a_2\|_2^2 = 51, \quad \|a_3\|_2^2 = 165, \quad \|a_4\|_2^2 = 4.$$
We see that the third column has the largest 2-norm. We therefore interchange the first and third
columns to obtain
   
$$A^{(1)} = A\Pi_1 = A \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 3 & 1 & 1 \\ 2 & -1 & 2 & 1 \\ 6 & 4 & 1 & 1 \\ 10 & 5 & 4 & 1 \end{bmatrix}.$$
We then apply a Householder transformation H1 to A(1) to make the first column a multiple of e1 ,
which yields
$$H_1 A^{(1)} = \begin{bmatrix} -12.8452 & -6.7729 & -4.2817 & -1.7905 \\ 0 & -2.0953 & 1.4080 & 0.6873 \\ 0 & 0.7141 & -0.7759 & 0.0618 \\ 0 & -0.4765 & 1.0402 & -0.5637 \end{bmatrix}.$$
Next, we consider the submatrix obtained by removing the first row and column of H1 A(1) :
 
$$\tilde{A}^{(2)} = \begin{bmatrix} -2.0953 & 1.4080 & 0.6873 \\ 0.7141 & -0.7759 & 0.0618 \\ -0.4765 & 1.0402 & -0.5637 \end{bmatrix}.$$
We compute the lengths of the columns, as before, except that this time, we update the lengths of
the columns of A, rather than recomputing them. This yields
$$\|\tilde{a}_1^{(2)}\|_2^2 = \|a_2^{(1)}\|_2^2 - [a_{12}^{(1)}]^2 = 51 - (-6.7729)^2 = 5.1273,$$
$$\|\tilde{a}_2^{(2)}\|_2^2 = \|a_3^{(1)}\|_2^2 - [a_{13}^{(1)}]^2 = 22 - (-4.2817)^2 = 3.6667,$$
$$\|\tilde{a}_3^{(2)}\|_2^2 = \|a_4^{(1)}\|_2^2 - [a_{14}^{(1)}]^2 = 4 - (-1.7905)^2 = 0.7939.$$

The second column is the largest, so there is no need for a column interchange this time. We
apply a Householder transformation H̃2 to the first column of Ã(2) so that the updated column is a
multiple of e1 , which is equivalent to applying a 4 × 4 Householder transformation H2 = I − 2v2 v2T ,
where the first component of v2 is zero, to the second column of A(2) so that the updated column
is a linear combination of e1 and e2 . This yields
 
$$\tilde{H}_2 \tilde{A}^{(2)} = \begin{bmatrix} 2.2643 & -1.7665 & -0.4978 \\ 0 & -0.2559 & 0.2559 \\ 0 & 0.6933 & -0.6933 \end{bmatrix}.$$
Then, we consider the submatrix obtained by removing the first row and column of H2 Ã(2) :
 
$$\tilde{A}^{(3)} = \begin{bmatrix} -0.2559 & 0.2559 \\ 0.6933 & -0.6933 \end{bmatrix}.$$
Both columns have the same lengths, so no column interchange is required. Applying a Householder
reflection H̃3 to the first column to make it a multiple of e1 will have the same effect on the second
column, because they are parallel. We have
 
$$\tilde{H}_3 \tilde{A}^{(3)} = \begin{bmatrix} 0.7390 & -0.7390 \\ 0 & 0 \end{bmatrix}.$$
It follows that the matrix Ã(4) obtained by removing the first row and column of H3 Ã(3) will be
the zero matrix. We conclude that rank(A) = 3, and that A has the factorization
 
$$A\Pi = Q \begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix},$$
where
$$\Pi = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},$$
$$R = \begin{bmatrix} -12.8452 & -6.7729 & -4.2817 \\ 0 & 2.2643 & -1.7665 \\ 0 & 0 & 0.7390 \end{bmatrix}, \qquad S = \begin{bmatrix} -1.7905 \\ -0.4978 \\ -0.7390 \end{bmatrix},$$
and Q = H1 H2 H3 is the product of the Householder reflections used to reduce AΠ to upper-
triangular form. 2
Using this decomposition, we can solve the linear least squares problem Ax = b by observing
that
$$\|b - Ax\|_2^2 = \left\| b - Q \begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix} \Pi^T x \right\|_2^2 = \left\| Q^T b - \begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \right\|_2^2 = \left\| \begin{bmatrix} c \\ d \end{bmatrix} - \begin{bmatrix} Ru + Sv \\ 0 \end{bmatrix} \right\|_2^2 = \|c - Ru - Sv\|_2^2 + \|d\|_2^2,$$

where
$$Q^T b = \begin{bmatrix} c \\ d \end{bmatrix}, \qquad \Pi^T x = \begin{bmatrix} u \\ v \end{bmatrix},$$
with c and u being r-vectors. Thus min kb − Axk22 = kdk22 , provided that Ru + Sv = c. A basic
solution is obtained by choosing v = 0. A second solution is to choose u and v so that kuk22 + kvk22
is minimized. This criterion is related to the pseudo-inverse of A.
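As an illustration, a basic solution can be computed in Matlab using the built-in pivoted QR factorization; the tolerance-based rank estimate below is an assumption of this sketch, not a robust rank decision.

% Sketch: basic solution of a rank-deficient least squares problem
% via QR with column pivoting (A*P = Q*R in Matlab).
[Q, R, P] = qr(A);
tol = max(size(A)) * eps(norm(A));
r = sum(abs(diag(R)) > tol);          % estimated numerical rank
c = Q' * b;
u = R(1:r, 1:r) \ c(1:r);             % solve R*u = c(1:r), with v = 0
x = P * [u; zeros(size(A,2) - r, 1)]; % undo the column permutation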

3.3.2 Complete Orthogonal Decomposition


After performing the QR factorization with column pivoting, we have
 
$$A = Q \begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix} \Pi^T,$$

where R is upper triangular. Then

$$A^T = \Pi \begin{bmatrix} R^T & 0 \\ S^T & 0 \end{bmatrix} Q^T,$$

where RT is lower triangular. We apply Householder reflections so that


$$Z_r \cdots Z_2 Z_1 \begin{bmatrix} R^T & 0 \\ S^T & 0 \end{bmatrix} = \begin{bmatrix} U & 0 \\ 0 & 0 \end{bmatrix},$$

where U is upper-triangular. Then


 
$$A^T = Z \begin{bmatrix} U & 0 \\ 0 & 0 \end{bmatrix} Q^T,$$

where Z = ΠZ1 · · · Zr . In other words,


 
$$A = Q \begin{bmatrix} L & 0 \\ 0 & 0 \end{bmatrix} Z^T,$$

where L is a lower-triangular matrix of size r × r, where r is the rank of A. This is the complete
orthogonal decomposition of A.
Recall that X is the pseudo-inverse of A if

1. AXA = A

2. XAX = X

3. (XA)T = XA

4. (AX)T = AX

Given the above complete orthogonal decomposition of $A$, the pseudo-inverse of $A$, denoted $A^+$, is given by
$$A^+ = Z \begin{bmatrix} L^{-1} & 0 \\ 0 & 0 \end{bmatrix} Q^T.$$

Let $X = \{x : \|b - Ax\|_2 = \min\}$. If $x \in X$ and we desire $\|x\|_2 = \min$, then $x = A^+ b$. Note that in this case,
$$r = b - Ax = b - AA^+ b = (I - AA^+)b,$$
where the matrix $I - AA^+$ is a projection matrix $P^\perp$. To see that $P^\perp$ is a projection, note that
$$P = AA^+ = Q \begin{bmatrix} L & 0 \\ 0 & 0 \end{bmatrix} Z^T Z \begin{bmatrix} L^{-1} & 0 \\ 0 & 0 \end{bmatrix} Q^T = Q \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} Q^T.$$
It can then be verified directly that $P = P^T$ and $P^2 = P$, so $P$ is a projection, and therefore so is $P^\perp = I - P$.

3.3.3 The Pseudo-Inverse


The singular value decomposition is very useful in studying the linear least squares problem. Sup-
pose that we are given an m-vector b and an m × n matrix A, and we wish to find x such that
kb − Axk2 = minimum.
From the SVD of $A$, we can simplify this minimization problem as follows:
$$\|b - Ax\|_2^2 = \|b - U\Sigma V^T x\|_2^2 = \|U^T b - \Sigma V^T x\|_2^2 = \|c - \Sigma y\|_2^2 = (c_1 - \sigma_1 y_1)^2 + \cdots + (c_r - \sigma_r y_r)^2 + c_{r+1}^2 + \cdots + c_m^2,$$
where c = U T b and y = V T x. We see that in order to minimize kb − Axk2 , we must set yi = ci /σi
for i = 1, . . . , r, but the unknowns yi , for i = r + 1, . . . , n, can have any value, since they do not
influence kc − Σyk2 . Therefore, if A does not have full rank, there are infinitely many solutions to
the least squares problem. However, we can easily obtain the unique solution of minimum 2-norm
by setting yr+1 = · · · = yn = 0.
In summary, the solution of minimum length to the linear least squares problem is
$$x = Vy = V\Sigma^+ c = V\Sigma^+ U^T b = A^+ b,$$
where $\Sigma^+$ is the diagonal matrix
$$\Sigma^+ = \begin{bmatrix} \sigma_1^{-1} & & & & & \\ & \ddots & & & & \\ & & \sigma_r^{-1} & & & \\ & & & 0 & & \\ & & & & \ddots & \\ & & & & & 0 \end{bmatrix}$$

and A+ = V Σ+ U T . The matrix A+ is called the pseudo-inverse of A. In the case where A is square
and has full rank, the pseudo-inverse is equal to A−1 . Note that A+ is independent of b. It also
has the properties

1. AA+ A = A

2. A+ AA+ = A+

3. (A+ A)T = A+ A

4. (AA+ )T = AA+

The solution x of the least-squares problem minimizes kb − Axk2 , and therefore is the vector
that solves the system Ax = b as closely as possible. However, we can use the SVD to show that
x is the exact solution to a related system of equations. We write b = b1 + b2 , where

b1 = AA+ b, b2 = (I − AA+ )b.

The matrix $AA^+$ has the form
$$AA^+ = U\Sigma V^T V\Sigma^+ U^T = U\Sigma\Sigma^+ U^T = U \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} U^T,$$

where $I_r$ is the $r \times r$ identity matrix. It follows that $b_1$ is a linear combination of $u_1, \ldots, u_r$, the columns of $U$ that form an orthonormal basis for the range of $A$. From $x = A^+ b$ we obtain
$$Ax = AA^+ b = Pb = b_1,$$
where P = AA+ . Therefore, the solution to the least squares problem, is also the exact solution to
the system Ax = P b. It can be shown that the matrix P has the properties

1. P = P T

2. P 2 = P

In other words, the matrix P is a projection. In particular, it is a projection onto the space spanned
by the columns of A, i.e. the range of A. That is, P = Ur UrT , where Ur is the matrix consisting of
the first r columns of U .
The residual vector r = b − Ax can be expressed conveniently using this projection. We have

r = b − Ax = b − AA+ b = b − P b = (I − P )b = P ⊥ b.

That is, the residual is the projection of b onto the orthogonal complement of the range of A, which
is the null space of AT . Furthermore, from the SVD, the 2-norm of the residual satisfies

$$\rho_{LS}^2 \equiv \|r\|_2^2 = c_{r+1}^2 + \cdots + c_m^2,$$

where, as before, c = U T b.
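These formulas translate directly into Matlab. The following sketch computes the minimum-norm solution and the residual norm from the SVD; in practice one can simply call pinv(A)*b, and the tolerance used to estimate the rank is an assumption of this sketch.

% Sketch: minimum-norm least squares solution via the SVD.
[U, S, V] = svd(A);
s = diag(S);
tol = max(size(A)) * eps(max(s));
r = sum(s > tol);                 % numerical rank
c = U' * b;
y = zeros(size(A,2), 1);
y(1:r) = c(1:r) ./ s(1:r);        % y_i = c_i / sigma_i; remaining y_i = 0
x = V * y;                        % x = V*Sigma^+*U'*b, i.e. pinv(A)*b
resnorm = norm(c(r+1:end));       % rho_LS = sqrt(c_{r+1}^2 + ... + c_m^2)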

3.3.4 Perturbation Theory


Suppose that we perturb the data, so that we are solving $(A + \epsilon E)x(\epsilon) = b$. Then what is $\|x - x(\epsilon)\|_2$ or $\|r - r(\epsilon)\|_2$? Using the fact that $PA = AA^+A = A$, we differentiate with respect to $\epsilon$ and obtain
$$P\frac{dA}{d\epsilon} + \frac{dP}{d\epsilon}A = \frac{dA}{d\epsilon}.$$
It follows that
$$\frac{dP}{d\epsilon}A = (I - P)\frac{dA}{d\epsilon} = P^\perp\frac{dA}{d\epsilon}.$$
Multiplying through by $A^+$, we obtain
$$\frac{dP}{d\epsilon}P = P^\perp\frac{dA}{d\epsilon}A^+.$$
Because $P$ is a projection,
$$\frac{d(P^2)}{d\epsilon} = P\frac{dP}{d\epsilon} + \frac{dP}{d\epsilon}P = \frac{dP}{d\epsilon},$$
so, using the symmetry of $P$,
$$\frac{dP}{d\epsilon} = P^\perp\frac{dA}{d\epsilon}A^+ + (A^+)^T\frac{dA^T}{d\epsilon}P^\perp.$$
Now, using a Taylor expansion around $\epsilon = 0$, as well as the relations $\hat{x} = A^+ b$ and $r = P^\perp b$, we obtain
$$r(\epsilon) = r(0) + \epsilon\frac{dP^\perp}{d\epsilon}b + O(\epsilon^2) = r(0) + \epsilon\frac{d(I - P)}{d\epsilon}b + O(\epsilon^2) = r(0) - \epsilon\frac{dP}{d\epsilon}b + O(\epsilon^2) = r(0) - \epsilon[P^\perp E\hat{x}(0) + (A^+)^T E^T r(0)] + O(\epsilon^2).$$
Taking norms, we obtain
$$\frac{\|r(\epsilon) - r(0)\|_2}{\|\hat{x}\|_2} = |\epsilon|\,\|E\|_2\left(1 + \|A^+\|_2\frac{\|r(0)\|_2}{\|\hat{x}(0)\|_2}\right) + O(\epsilon^2).$$
Note that if $A$ is scaled so that $\|A\|_2 = 1$, then the second term above involves the condition number $\kappa_2(A)$. We also have
$$\frac{\|x(\epsilon) - x(0)\|_2}{\|\hat{x}\|_2} = |\epsilon|\,\|E\|_2\left(2\kappa_2(A) + \kappa_2(A)^2\frac{\|r(0)\|_2}{\|\hat{x}(0)\|_2}\right) + O(\epsilon^2).$$

Note that a small perturbation in the residual does not imply a small perturbation in the solution.

3.4 The Singular Value Decomposition


3.4.1 Existence
The matrix $\ell_2$-norm can be used to obtain a highly useful decomposition of any $m \times n$ matrix $A$. Let $x$ be a unit $\ell_2$-norm vector (that is, $\|x\|_2 = 1$) such that $\|Ax\|_2 = \|A\|_2$. If $z = Ax$, and we define $y = z/\|z\|_2$, then we have $Ax = \sigma y$, with $\sigma = \|A\|_2$, and both $x$ and $y$ are unit vectors.
We can extend $x$ and $y$, separately, to orthonormal bases of $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively, to obtain orthogonal matrices $V_1 = \begin{bmatrix} x & V_2 \end{bmatrix} \in \mathbb{R}^{n \times n}$ and $U_1 = \begin{bmatrix} y & U_2 \end{bmatrix} \in \mathbb{R}^{m \times m}$. We then have
$$U_1^T A V_1 = \begin{bmatrix} y^T \\ U_2^T \end{bmatrix} A \begin{bmatrix} x & V_2 \end{bmatrix} = \begin{bmatrix} y^T Ax & y^T AV_2 \\ U_2^T Ax & U_2^T AV_2 \end{bmatrix} = \begin{bmatrix} \sigma y^T y & y^T AV_2 \\ \sigma U_2^T y & U_2^T AV_2 \end{bmatrix} = \begin{bmatrix} \sigma & w^T \\ 0 & B \end{bmatrix} \equiv A_1,$$

where $B = U_2^T AV_2$ and $w = V_2^T A^T y$. Now, let
$$s = \begin{bmatrix} \sigma \\ w \end{bmatrix}.$$
Then $\|s\|_2^2 = \sigma^2 + \|w\|_2^2$. Furthermore, the first component of $A_1 s$ is $\sigma^2 + \|w\|_2^2$.


It follows from the invariance of the matrix $\ell_2$-norm under orthogonal transformations that
$$\sigma^2 = \|A\|_2^2 = \|A_1\|_2^2 \geq \frac{\|A_1 s\|_2^2}{\|s\|_2^2} \geq \frac{(\sigma^2 + \|w\|_2^2)^2}{\sigma^2 + \|w\|_2^2} = \sigma^2 + \|w\|_2^2,$$
and therefore we must have $w = 0$. We then have
$$U_1^T A V_1 = A_1 = \begin{bmatrix} \sigma & 0 \\ 0 & B \end{bmatrix}.$$

Continuing this process on $B$, and keeping in mind that $\|B\|_2 \leq \|A\|_2$, we obtain the decomposition
$$U^T AV = \Sigma$$
where
$$U = \begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix} \in \mathbb{R}^{m \times m}, \qquad V = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix} \in \mathbb{R}^{n \times n}$$
are both orthogonal matrices, and
$$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb{R}^{m \times n}, \qquad p = \min\{m, n\},$$
is a diagonal matrix, in which $\Sigma_{ii} = \sigma_i$ for $i = 1, 2, \ldots, p$, and $\Sigma_{ij} = 0$ for $i \neq j$. Furthermore, we have
$$\|A\|_2 = \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0.$$

This decomposition of A is called the singular value decomposition, or SVD. It is more commonly
written as a factorization of A,
A = U ΣV T .

3.4.2 Properties
The diagonal entries of Σ are the singular values of A. The columns of U are the left singular
vectors, and the columns of V are the right singular vectors. It follows from the SVD itself that
the singular values and vectors satisfy the relations

Avi = σi ui , AT ui = σi vi , i = 1, 2, . . . , min{m, n}.

For convenience, we denote the ith largest singular value of A by σi (A), and the largest and smallest
singular values are commonly denoted by σmax (A) and σmin (A), respectively.
The SVD conveys much useful information about the structure of a matrix, particularly with
regard to systems of linear equations involving the matrix. Let r be the number of nonzero singular
values. Then r is the rank of A, and

range(A) = span{u1 , . . . , ur }, null(A) = span{vr+1 , . . . , vn }.

That is, the SVD yields orthonormal bases of the range and null space of A.
It follows that we can write
$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T.$$

This is called the SVD expansion of A. If m ≥ n, then this expansion yields the “economy-size”
SVD
$$A = U_1 \Sigma_1 V^T,$$
where
$$U_1 = \begin{bmatrix} u_1 & \cdots & u_n \end{bmatrix} \in \mathbb{R}^{m \times n}, \qquad \Sigma_1 = \mathrm{diag}(\sigma_1, \ldots, \sigma_n) \in \mathbb{R}^{n \times n}.$$

Example The matrix
$$A = \begin{bmatrix} 11 & 19 & 11 \\ 9 & 21 & 9 \\ 10 & 20 & 10 \end{bmatrix}$$
has the SVD $A = U\Sigma V^T$ where
$$U = \begin{bmatrix} 1/\sqrt{3} & -1/\sqrt{2} & 1/\sqrt{6} \\ 1/\sqrt{3} & 1/\sqrt{2} & 1/\sqrt{6} \\ 1/\sqrt{3} & 0 & -\sqrt{2/3} \end{bmatrix}, \qquad V = \begin{bmatrix} 1/\sqrt{6} & -1/\sqrt{3} & 1/\sqrt{2} \\ \sqrt{2/3} & 1/\sqrt{3} & 0 \\ 1/\sqrt{6} & -1/\sqrt{3} & -1/\sqrt{2} \end{bmatrix},$$
and
$$\Sigma = \begin{bmatrix} 42.4264 & 0 & 0 \\ 0 & 2.4495 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$

   
Let $U = \begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix}$ and $V = \begin{bmatrix} v_1 & v_2 & v_3 \end{bmatrix}$ be column partitions of $U$ and $V$, respectively. Because there are only two nonzero singular values, we have $\mathrm{rank}(A) = 2$. Furthermore, $\mathrm{range}(A) = \mathrm{span}\{u_1, u_2\}$ and $\mathrm{null}(A) = \mathrm{span}\{v_3\}$. We also have
$$A = 42.4264\, u_1 v_1^T + 2.4495\, u_2 v_2^T.$$
2
The SVD is also closely related to the $\ell_2$-norm and Frobenius norm. We have
$$\|A\|_2 = \sigma_1, \qquad \|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2,$$
and
$$\min_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \sigma_p, \qquad p = \min\{m, n\}.$$
These relationships follow directly from the invariance of these norms under orthogonal transfor-
mations.

3.4.3 Applications
The SVD has many useful applications, but one of particular interest is that the truncated SVD
expansion
$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T,$$

where k < r = rank(A), is the best approximation of A by a rank-k matrix. It follows that the
distance between A and the set of matrices of rank k is

$$\min_{\mathrm{rank}(B) = k} \|A - B\|_2 = \|A - A_k\|_2 = \left\| \sum_{i=k+1}^{r} \sigma_i u_i v_i^T \right\|_2 = \sigma_{k+1},$$

because the `2 -norm of a matrix is its largest singular value, and σk+1 is the largest singular value
of the matrix obtained by removing the first k terms of the SVD expansion. Therefore, σp , where
p = min{m, n}, is the distance between A and the set of all rank-deficient matrices (which is zero
when A is already rank-deficient). Because a matrix of full rank can have arbitrarily small, but still
positive, singular values, it follows that the set of all full-rank matrices in Rm×n is both open and
dense.
Example The best approximation of the matrix A from the previous example, which has rank
two, by a matrix of rank one is obtained by taking only the first term of its SVD expansion,
 
$$A_1 = 42.4264\, u_1 v_1^T = \begin{bmatrix} 10 & 20 & 10 \\ 10 & 20 & 10 \\ 10 & 20 & 10 \end{bmatrix}.$$

The absolute error in this approximation is
$$\|A - A_1\|_2 = \sigma_2\|u_2 v_2^T\|_2 = \sigma_2 = 2.4495,$$
while the relative error is
$$\frac{\|A - A_1\|_2}{\|A\|_2} = \frac{2.4495}{42.4264} = 0.0577.$$
That is, the rank-one approximation of A is off by a little less than six percent. 2
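The truncated SVD expansion is straightforward to form in Matlab; the sketch below reproduces the rank-one approximation and its error from this example.

% Sketch: best rank-k approximation of A via the truncated SVD.
[U, S, V] = svd(A);
k = 1;                                   % desired rank
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';  % first k terms of the SVD expansion
abs_err = norm(A - Ak);                  % equals sigma_{k+1}
rel_err = abs_err / norm(A);             % equals sigma_{k+1} / sigma_1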
Suppose that the entries of a matrix $A$ have been determined experimentally, for example by measurements, and the error in the entries is determined to be bounded by some value $\epsilon$. Then, if $\sigma_k > \epsilon \geq \sigma_{k+1}$ for some $k < \min\{m, n\}$, $\epsilon$ is at least as large as the distance between $A$ and the set of all matrices of rank $k$. Therefore, due to the uncertainty of the measurement, $A$ could very well be a matrix of rank $k$, but it cannot be of lower rank, because $\sigma_k > \epsilon$. We say that the $\epsilon$-rank of $A$, defined by
$$\mathrm{rank}(A, \epsilon) = \min_{\|A - B\|_2 \leq \epsilon} \mathrm{rank}(B),$$
is equal to $k$. If $k < \min\{m, n\}$, then we say that $A$ is numerically rank-deficient.

3.4.4 Minimum-norm least squares solution


One of the most well-known applications of the SVD is that it can be used to obtain the solution
to the problem
kb − Axk2 = min , kxk2 = min .

The solution is
x̂ = A+ b = V Σ+ U T b

where A+ is the pseudo-inverse of A.

3.4.5 Closest Orthogonal Matrix


Let Qn be the set of all n × n orthogonal matrices. Given an n × n matrix A, we wish to find the
matrix Q that satisfies
kA − QkF = min , Q ∈ Qn , σi (Q) = 1.

Given $A = U\Sigma V^T$, if we compute $\hat{Q} = UIV^T = UV^T$, then
$$\|A - \hat{Q}\|_F^2 = \|U(\Sigma - I)V^T\|_F^2 = \|\Sigma - I\|_F^2 = (\sigma_1 - 1)^2 + \cdots + (\sigma_n - 1)^2.$$

It can be shown that this is in fact the minimum.


A more general problem is to find Q ∈ Qn such that

kA − BQkF = min

for given matrices A and B. The solution is

Q̂ = U V T , B T A = U ΣV T .

3.4.6 Other Low-Rank Approximations


Let $\mathcal{M}_{m,n}^{(r)}$ be the set of all $m \times n$ matrices of rank $r$, and let $A \in \mathcal{M}_{m,n}^{(r)}$. We wish to find $B \in \mathcal{M}_{m,n}^{(k)}$, where $k < r$, such that $\|A - B\|_F = \min$.
To solve this problem, let A = U ΣV T be the SVD of A, and let B̂ = U Ωk V T where
 
$$\Omega_k = \begin{bmatrix} \sigma_1 & & & & & \\ & \ddots & & & & \\ & & \sigma_k & & & \\ & & & 0 & & \\ & & & & \ddots & \\ & & & & & 0 \end{bmatrix}.$$

Then
$$\|A - \hat{B}\|_F^2 = \|U(\Sigma - \Omega_k)V^T\|_F^2 = \|\Sigma - \Omega_k\|_F^2 = \sigma_{k+1}^2 + \cdots + \sigma_r^2.$$

We now consider a variation of this problem. Suppose that $B$ is a perturbation of $A$ such that $A = B + E$, where $\|E\|_F^2 \leq \epsilon^2$. We wish to find $\hat{B}$ such that $\|A - \hat{B}\|_F^2 \leq \epsilon^2$, where the rank of $\hat{B}$ is minimized. We know that if $B_k = U\Omega_k V^T$ then
$$\|A - B_k\|_F^2 = \sigma_{k+1}^2 + \cdots + \sigma_r^2.$$

It follows that $\hat{B} = B_k$ is the solution if
$$\sigma_{k+1}^2 + \cdots + \sigma_r^2 \leq \epsilon^2, \qquad \sigma_k^2 + \cdots + \sigma_r^2 > \epsilon^2.$$
Note that
$$\|A^+ - \hat{B}^+\|_F^2 = \frac{1}{\sigma_{k+1}^2} + \cdots + \frac{1}{\sigma_r^2}.$$

3.5 Least Squares with Constraints


3.5.1 Linear Constraints
Suppose that we wish to fit data as in the least squares problem, except that we are using different
functions to fit the data on different subintervals. A common example is the process of fitting data
using cubic splines, with a different cubic polynomial approximating data on each subinterval.
Typically, it is desired that the functions assigned to each piece form a function that is continuous
on the entire interval within which the data lies. This requires that constraints be imposed on the
functions themselves. It is also not uncommon to require that the function assembled from these
pieces also has a continuous first or even second derivative, resulting in additional constraints.
The result is a least squares problem with linear constraints, as the constraints are applied to
coefficients of predetermined functions chosen as a basis for some function space, such as the space
of polynomials of a given degree.

The general form of a least squares problem with linear constraints is as follows: we wish to
find an n-vector x that minimizes kAx − bk2 , subject to the constraint C T x = d, where C is a
known n × p matrix and d is a known p-vector.
This problem is usually solved using Lagrange multipliers. We define

f (x; λ) = kb − Axk22 + 2λT (C T x − d).

Then
∇f = 2(AT Ax − AT b + Cλ).
To minimize $f$, we can solve the system
$$\begin{bmatrix} A^T A & C \\ C^T & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} A^T b \\ d \end{bmatrix}.$$

From AT Ax = AT b − Cλ, we see that we can first compute x = x̂ − (AT A)−1 Cλ where x̂ is the
solution to the unconstrained least squares problem. Then, from the equation C T x = d we obtain
the equation
C T (AT A)−1 Cλ = C T x̂ − d,
which we can now solve for λ. The algorithm proceeds as follows:

1. Solve the unconstrained least squares problem Ax = b for x̂.

2. Compute A = QR.

3. Form W = (RT )−1 C.

4. Compute W = P U , the QR factorization of W .

5. Solve $U^T U\lambda = \eta = C^T\hat{x} - d$ for $\lambda$. Note that
$$U^T U = (P^T W)^T(P^T W) = W^T PP^T W = W^T W = C^T R^{-1}(R^T)^{-1}C = C^T(R^T R)^{-1}C = C^T(R^T Q^T QR)^{-1}C = C^T(A^T A)^{-1}C.$$
6. Set x = x̂ − (AT A)−1 Cλ.

This method is not the most practical since it has more unknowns than the unconstrained least
squares problem, which is odd because the constraints should have the effect of eliminating un-
knowns, not adding them. We now describe an alternate approach.
Suppose that we compute the QR factorization of C to obtain
 
$$Q^T C = \begin{bmatrix} R \\ 0 \end{bmatrix},$$

where R is a p × p upper triangular matrix. Then the constraint C T x = d takes the form
 
$$R^T u = d, \qquad Q^T x = \begin{bmatrix} u \\ v \end{bmatrix}.$$

Then
$$\|b - Ax\|_2 = \|b - AQQ^T x\|_2 = \left\| b - \tilde{A} \begin{bmatrix} u \\ v \end{bmatrix} \right\|_2 = \left\| b - \begin{bmatrix} \tilde{A}_1 & \tilde{A}_2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \right\|_2 = \|b - \tilde{A}_1 u - \tilde{A}_2 v\|_2, \qquad \tilde{A} = AQ.$$

Thus we can obtain x by the following procedure:


1. Compute the QR factorization of C

2. Compute à = AQ

3. Solve RT u = d

4. Solve the new least squares problem of minimizing k(b − Ã1 u) − Ã2 vk2

5. Compute
$$x = Q \begin{bmatrix} u \\ v \end{bmatrix}.$$

This approach has the advantage that there are fewer unknowns in each system that needs to be
solved, and also that κ(Ã2 ) ≤ κ(A). The drawback is that sparsity can be destroyed.
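A direct Matlab transcription of this procedure is sketched below, under the assumption that C is n-by-p with full column rank p.

% Sketch: equality-constrained least squares, min ||b - A*x|| subject to C'*x = d,
% using the QR factorization of C as described above.
[n, p] = size(C);
[Q, R] = qr(C);              % Q is n x n; R(1:p,1:p) is upper triangular
At = A * Q;                  % A-tilde
A1 = At(:, 1:p);
A2 = At(:, p+1:n);
u = R(1:p, 1:p)' \ d;        % solve R'*u = d (the constraint)
v = A2 \ (b - A1 * u);       % unconstrained LS in the remaining unknowns
x = Q * [u; v];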

3.5.2 Quadratic Constraints


We wish to solve the problem

kb − Axk2 = min , kxk2 = α, α ≤ kA+ bk2 .

This problem is known as least squares with quadratic constraints. To solve this problem, we define

$$\varphi(x; \mu) = \|b - Ax\|_2^2 + \mu(\|x\|_2^2 - \alpha^2)$$

and seek to minimize ϕ. From

$$\nabla\varphi = 2(A^T Ax - A^T b + \mu x)$$

we obtain the system


(AT A + µI)x = AT b.
If we denote the eigenvalues of AT A by

λi (AT A) = λ1 , . . . , λn , λ1 ≥ λ2 ≥ · · · ≥ λn ≥ 0

then
$$\lambda_i(A^T A + \mu I) = \lambda_1 + \mu, \ldots, \lambda_n + \mu.$$
If $\mu \geq 0$, then $\kappa(A^T A + \mu I) \leq \kappa(A^T A)$, because
$$\frac{\lambda_1 + \mu}{\lambda_n + \mu} \leq \frac{\lambda_1}{\lambda_n},$$
so $A^T A + \mu I$ is better conditioned.
The least squares problem with quadratic constraints arises in many fields, including

1. Statistics: Ridge Regression

2. Regularization: Tikhonov

3. Generalized cross-validation (GCV)

To solve this problem, we see that we need to compute

x = (AT A + µI)−1 AT b

where
xT x = bT A(AT A + µI)−2 AT b = α2 .

If $A = U\Sigma V^T$ is the SVD of $A$, then we have
$$\alpha^2 = b^T U\Sigma V^T(V\Sigma^T\Sigma V^T + \mu I)^{-2}V\Sigma^T U^T b = c^T\Sigma(\Sigma^T\Sigma + \mu I)^{-2}\Sigma^T c = \sum_{i=1}^{r} \frac{c_i^2\sigma_i^2}{(\sigma_i^2 + \mu)^2} = \chi(\mu), \qquad c = U^T b.$$
The function χ(µ) has poles at −σi2 for i = 1, . . . , r. Furthermore, χ(µ) → 0 as µ → ∞.


We now have the following procedure for solving this problem, given A, b, and α2 :

1. Compute the SVD of A to obtain A = U ΣV T .

2. Compute c = U T b.

3. Solve χ(µ∗ ) = α2 where µ∗ ≥ 0. Don’t use Newton’s method on this equation directly; solving
1/χ(µ) = 1/α2 is much better.

4. Use the SVD to compute

x = (AT A + µI)−1 AT b = V (ΣT Σ + µI)−1 ΣT U T b.
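The following Matlab sketch carries out these steps for given A, b, and alpha (with alpha no larger than norm(pinv(A)*b)), using fzero on the reciprocal form of the secular equation as suggested above; the initial guess mu0 is an arbitrary assumption of this sketch.

% Sketch: least squares with the quadratic constraint ||x||_2 = alpha.
[U, S, V] = svd(A);
s = diag(S);
r = sum(s > 0);
c = U' * b;
% chi(mu) = sum_i c_i^2 s_i^2 / (s_i^2 + mu)^2; solve 1/chi(mu) = 1/alpha^2
chi = @(mu) sum((c(1:r).^2 .* s(1:r).^2) ./ (s(1:r).^2 + mu).^2);
f = @(mu) 1 ./ chi(mu) - 1 / alpha^2;
mu0 = 1;                          % assumed starting guess; a bracket may be safer
mu = fzero(f, mu0);
% x = V (Sigma'*Sigma + mu*I)^{-1} Sigma' U' b, written term by term:
x = V(:,1:r) * ((s(1:r) .* c(1:r)) ./ (s(1:r).^2 + mu));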



3.6 Total Least Squares


In the ordinary least squares problem, we are solving

Ax = b + r, krk2 = min .

In the total least squares problem, we wish to solve

(A + E)x = b + r, kEk2F + λ2 krk22 = min .

From $Ax - b + Ex - r = 0$, we obtain the system
$$\begin{bmatrix} A & b \end{bmatrix} \begin{bmatrix} x \\ -1 \end{bmatrix} + \begin{bmatrix} E & r \end{bmatrix} \begin{bmatrix} x \\ -1 \end{bmatrix} = 0,$$
or
$$(C + F)z = 0.$$
We need the matrix $C + F$ to have rank less than $n + 1$, and we want to minimize $\|F\|_F$. To solve this problem, we compute the SVD of $C = \begin{bmatrix} A & b \end{bmatrix} = U\Sigma V^T$, and let $\hat{C} = U\Omega_n V^T$. Then, if $v_i$ is the $i$th column of $V$, we have
$$\hat{C}v_{n+1} = U\Omega_n V^T v_{n+1} = 0.$$
Our solution is
$$\begin{bmatrix} \hat{x} \\ -1 \end{bmatrix} = -\frac{1}{v_{n+1,n+1}} v_{n+1},$$
provided that $v_{n+1,n+1} \neq 0$.
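In Matlab, this basic total least squares solution can be sketched as follows.

% Sketch: total least squares solution of (A + E)x = b + r via the SVD of [A b].
[m, n] = size(A);
C = [A b];
[U, S, V] = svd(C);
vlast = V(:, n+1);              % right singular vector of the smallest singular value
if vlast(n+1) == 0
    error('TLS solution does not exist (v_{n+1,n+1} = 0)');
end
x = -vlast(1:n) / vlast(n+1);   % so that [x; -1] is parallel to vlast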
Now, suppose that only some of the data is contaminated, i.e. $E = \begin{bmatrix} 0 & E_1 \end{bmatrix}$, where the first $p$ columns of $E$ are zero. Then, in solving $(C + F)z = 0$, we use Householder transformations to compute $Q^T(C + F)$, where the first $p$ columns are zero below the diagonal. Since $\|F\|_F = \|Q^T F\|_F$, we then have a block upper triangular system
$$\begin{bmatrix} R_{11} & R_{12} + F_{12} \\ 0 & R_{22} + F_{22} \end{bmatrix} z = 0, \qquad z = \begin{bmatrix} u \\ v \end{bmatrix}.$$

We can find the total least squares solution of

(R22 + F22 )v = 0,

and then set F12 = 0 and solve


R11 u + R12 v = 0.
Chapter 4

Eigenvalue Problems

4.1 Eigenvalues and Eigenvectors


4.1.1 Definitions and Properties
Let A be an n × n matrix. A nonzero vector x is called an eigenvector of A if there exists a scalar
λ such that
Ax = λx.
The scalar λ is called an eigenvalue of A, and we say that x is an eigenvector of A corresponding
to λ. We see that an eigenvector of A is a vector for which matrix-vector multiplication with A is
equivalent to scalar multiplication by λ.
We say that a nonzero vector y is a left eigenvector of A if there exists a scalar λ such that

λyH = yH A.

The superscript H refers to the Hermitian transpose, which includes transposition and complex
conjugation. That is, for any matrix $A$, $A^H = \overline{A}^T$. An eigenvector of A, as defined above, is
sometimes called a right eigenvector of A, to distinguish from a left eigenvector. It can be seen
that if y is a left eigenvector of A with eigenvalue λ, then y is also a right eigenvector of AH , with
eigenvalue $\overline{\lambda}$, the complex conjugate of $\lambda$.
Because x is nonzero, it follows that if x is an eigenvector of A, then the matrix A − λI is
singular, where λ is the corresponding eigenvalue. Therefore, λ satisfies the equation

det(A − λI) = 0.

The expression det(A−λI) is a polynomial of degree n in λ, and therefore is called the characteristic
polynomial of A (eigenvalues are sometimes called characteristic values). It follows from the fact
that the eigenvalues of A are the roots of the characteristic polynomial that A has n eigenvalues,
which can repeat, and can also be complex, even if A is real. However, if A is real, any complex
eigenvalues must occur in complex-conjugate pairs.
The set of eigenvalues of A is called the spectrum of A, and denoted by λ(A). This terminology
explains why the largest magnitude of the eigenvalues is called the spectral radius of A. The trace
of A, denoted by tr(A), is the sum of the diagonal elements of A. It is also equal to the sum of the
eigenvalues of A. Furthermore, det(A) is equal to the product of the eigenvalues of A.


Example A $2 \times 2$ matrix
$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$
has trace $\mathrm{tr}(A) = a + d$ and determinant $\det(A) = ad - bc$. Its characteristic polynomial is
$$\det(A - \lambda I) = \begin{vmatrix} a - \lambda & b \\ c & d - \lambda \end{vmatrix} = (a - \lambda)(d - \lambda) - bc = \lambda^2 - (a + d)\lambda + (ad - bc) = \lambda^2 - \mathrm{tr}(A)\lambda + \det(A).$$
From the quadratic formula, the eigenvalues are
$$\lambda_1 = \frac{a + d}{2} + \frac{\sqrt{(a - d)^2 + 4bc}}{2}, \qquad \lambda_2 = \frac{a + d}{2} - \frac{\sqrt{(a - d)^2 + 4bc}}{2}.$$
It can be verified directly that the sum of these eigenvalues is equal to tr(A), and that their product
is equal to det(A). 2

4.1.2 Decompositions
A subspace W of Rn is called an invariant subspace of A if, for any vector x ∈ W , Ax ∈ W .
Suppose that dim(W ) = k, and let X be an n × k matrix such that range(X) = W . Then, because
each column of X is a vector in W , each column of AX is also a vector in W , and therefore is a
linear combination of the columns of X. It follows that AX = XB, where B is a k × k matrix.
Now, suppose that y is an eigenvector of B, with eigenvalue λ. It follows from By = λy that

XBy = X(By) = X(λy) = λXy,

but we also have


XBy = (XB)y = AXy.
Therefore, we have
A(Xy) = λ(Xy),
which implies that λ is also an eigenvalue of A, with corresponding eigenvector Xy. We conclude
that λ(B) ⊆ λ(A).
If k = n, then X is an n × n invertible matrix, and it follows that A and B have the same
eigenvalues. Furthermore, from AX = XB, we now have B = X −1 AX. We say that A and B are
similar matrices, and that B is a similarity transformation of A.
Similarity transformations are essential tools in algorithms for computing the eigenvalues of a
matrix A, since the basic idea is to apply a sequence of similarity transformations to A in order to
obtain a new matrix B whose eigenvalues are easily obtained. For example, suppose that B has a
2 × 2 block structure
$$B = \begin{bmatrix} B_{11} & B_{12} \\ 0 & B_{22} \end{bmatrix},$$
where B11 is p × p and B22 is q × q.

Let $x = \begin{bmatrix} x_1^T & x_2^T \end{bmatrix}^T$ be an eigenvector of $B$, where $x_1 \in \mathbb{C}^p$ and $x_2 \in \mathbb{C}^q$. Then, for some scalar $\lambda \in \lambda(B)$, we have
$$\begin{bmatrix} B_{11} & B_{12} \\ 0 & B_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \lambda \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$
If x2 6= 0, then B22 x2 = λx2 , and λ ∈ λ(B22 ). But if x2 = 0, then B11 x1 = λx1 , and λ ∈ λ(B11 ). It
follows that, λ(B) ⊆ λ(B11 ) ∪ λ(B22 ). However, λ(B) and λ(B11 ) ∪ λ(B22 ) have the same number
of elements, so the two sets must be equal. Because A and B are similar, we conclude that

λ(A) = λ(B) = λ(B11 ) ∪ λ(B22 ).

Therefore, if we can use similarity transformations to reduce A to such a block structure, the
problem of computing the eigenvalues of A decouples into two smaller problems of computing the
eigenvalues of Bii for i = 1, 2. Using an inductive argument, it can be shown that if A is block
upper-triangular, then the eigenvalues of A are equal to the union of the eigenvalues of the diagonal
blocks. If each diagonal block is 1 × 1, then it follows that the eigenvalues of any upper-triangular
matrix are the diagonal elements. The same is true of any lower-triangular matrix; in fact, it can
be shown that because det(A) = det(AT ), the eigenvalues of AT are the same as the eigenvalues of
A.
Example The matrix
$$A = \begin{bmatrix} 1 & -2 & 3 & -3 & 4 \\ 0 & 4 & -5 & 6 & -5 \\ 0 & 0 & 6 & -7 & 8 \\ 0 & 0 & 0 & 7 & 0 \\ 0 & 0 & 0 & -8 & 9 \end{bmatrix}$$
has eigenvalues 1, 4, 6, 7, and 9. This is because A has a block upper-triangular structure
$$A = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix}, \qquad A_{11} = \begin{bmatrix} 1 & -2 & 3 \\ 0 & 4 & -5 \\ 0 & 0 & 6 \end{bmatrix}, \qquad A_{22} = \begin{bmatrix} 7 & 0 \\ -8 & 9 \end{bmatrix}.$$

Because both of these blocks are themselves triangular, their eigenvalues are equal to their diagonal
elements, and the spectrum of A is the union of the spectra of these blocks. 2
Suppose that x is a normalized eigenvector of A, with eigenvalue λ. Furthermore, suppose that
P is a Householder reflection such that P x = e1 . Because P is symmetric and orthogonal, P is its
own inverse, so P e1 = x. It follows that the matrix P T AP , which is a similarity transformation of
A, satisfies
P T AP e1 = P T Ax = λP T x = λP x = λe1 .
That is, e1 is an eigenvector of P T AP with eigenvalue λ, and therefore P T AP has the block
structure
$$P^T AP = \begin{bmatrix} \lambda & v^T \\ 0 & B \end{bmatrix}.$$
Therefore, λ(A) = {λ} ∪ λ(B), which means that we can now focus on the (n − 1) × (n − 1) matrix
B to find the rest of the eigenvalues of A. This process of reducing the eigenvalue problem for A
to that of B is called deflation.

Continuing this process, we obtain the Schur Decomposition
$$A = QTQ^H,$$

where T is an upper-triangular matrix whose diagonal elements are the eigenvalues of A, and Q is
a unitary matrix, meaning that QH Q = I. That is, a unitary matrix is the generalization of a real
orthogonal matrix to complex matrices. Every square matrix has a Schur decomposition.
The columns of Q are called Schur vectors. However, for a general matrix A, there is no relation
between Schur vectors of A and eigenvectors of A, as each Schur vector qj satisfies Aqj = AQej =
QT ej . That is, Aqj is a linear combination of q1 , . . . , qj . It follows that for j = 1, 2, . . . , n, the
first j Schur vectors q1 , q2 , . . . , qj span an invariant subspace of A.
The Schur vectors and eigenvectors of A are the same when A is a normal matrix, which means
that AH A = AAH . Any symmetric or skew-symmetric matrix, for example, is normal. It can be
shown that in this case, the normalized eigenvectors of A form an orthonormal basis for Rn . It
follows that if λ1 , λ2 , . . . , λn are the eigenvalues of A, with corresponding (orthonormal) eigenvectors
q1 , q2 , . . . , qn , then we have
 
AQ = QD, Q = q1 · · · qn , D = diag(λ1 , . . . , λn ).

Because Q is a unitary matrix, it follows that

QH AQ = QH QD = D,

and A is similar to a diagonal matrix. We say that A is diagonalizable. Furthermore, because D


can be obtained from A by a similarity transformation involving a unitary matrix, we say that A
is unitarily diagonalizable.
Even if A is not a normal matrix, it may be diagonalizable, meaning that there exists an
invertible matrix P such that P −1 AP = D, where D is a diagonal matrix. If this is the case, then,
because AP = P D, the columns of P are eigenvectors of A, and the rows of P −1 are eigenvectors
of AT (as well as the left eigenvectors of A, if P is real).
By definition, an eigenvalue of A corresponds to at least one eigenvector. Because any nonzero
scalar multiple of an eigenvector is also an eigenvector, corresponding to the same eigenvalue,
an eigenvalue actually corresponds to an eigenspace, which is the span of any set of eigenvectors
corresponding to the same eigenvalue, and this eigenspace must have a dimension of at least one.
Any invariant subspace of a diagonalizable matrix A is a union of eigenspaces.
Now, suppose that λ1 and λ2 are distinct eigenvalues, with corresponding eigenvectors x1 and
x2 , respectively. Furthermore, suppose that x1 and x2 are linearly dependent. This means that
they must be parallel; that is, there exists a nonzero constant c such that x2 = cx1 . However, this
implies that Ax2 = λ2 x2 and Ax2 = cAx1 = cλ1 x1 = λ1 x2 . However, because λ1 6= λ2 , this is a
contradiction. Therefore, x1 and x2 must be linearly independent.
More generally, it can be shown, using an inductive argument, that a set of k eigenvectors
corresponding to k distinct eigenvalues must be linearly independent. Suppose that x1 , . . . , xk are
eigenvectors of A, with distinct eigenvalues λ1 , . . . , λk . Trivially, x1 is linearly independent. Using
induction, we assume that we have shown that x1 , . . . , xk−1 are linearly independent, and show
that x1 , . . . , xk must be linearly independent as well. If they are not, then there must be constants
c1 , . . . , ck−1 , not all zero, such that

xk = c1 x1 + c2 x2 + · · · + ck−1 xk−1 .
4.1. EIGENVALUES AND EIGENVECTORS 131

Multiplying both sides by A yields

Axk = c1 λ1 x1 + c2 λ2 x2 + · · · + ck−1 λk−1 xk−1 ,

because Axi = λi xi for i = 1, 2, . . . , k − 1. However, because both sides are equal to xk , and
Axk = λk xk , we also have

Axk = c1 λk x1 + c2 λk x2 + · · · + ck−1 λk xk−1 .

It follows that

c1 (λk − λ1 )x1 + c2 (λk − λ2 )x2 + · · · + ck−1 (λk − λk−1 )xk−1 = 0.

However, because the eigenvalues λ1 , . . . , λk are distinct, and not all of the coefficients c1 , . . . , ck−1
are zero, this means that we have a nontrivial linear combination of linearly independent vectors be-
ing equal to the zero vector, which is a contradiction. We conclude that eigenvectors corresponding
to distinct eigenvalues are linearly independent.
It follows that if A has n distinct eigenvalues, then it has a set of n linearly independent
eigenvectors. If X is a matrix whose columns are these eigenvectors, then AX = XD, where D is
a diagonal matrix of the eigenvalues, and because the columns of X are linearly independent, X
is invertible, and therefore X −1 AX = D, and A is diagonalizable.
Now, suppose that the eigenvalues of A are not distinct; that is, the characteristic polynomial
has repeated roots. Then an eigenvalue with multiplicity m does not necessarily correspond to m
linearly independent eigenvectors. The algebraic multiplicity of an eigenvalue λ is the number of
times that λ occurs as a root of the characteristic polynomial. The geometric multiplicity of λ is
the dimension of the eigenspace corresponding to λ, which is equal to the maximal size of a set of
linearly independent eigenvectors corresponding to λ. The geometric multiplicity of an eigenvalue
λ is always less than or equal to the algebraic multiplicity. When it is strictly less, then we say
that the eigenvalue is defective. When both multiplicities are equal to one, then we say that the
eigenvalue is simple.
The Jordan canonical form of an n × n matrix A is a decomposition that yields information
about the eigenspaces of A. It has the form

A = XJX −1

where $J$ has the block diagonal structure
$$J = \begin{bmatrix} J_1 & 0 & \cdots & 0 \\ 0 & J_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & J_p \end{bmatrix}.$$
Each diagonal block $J_i$ is a Jordan block of the form
$$J_i = \begin{bmatrix} \lambda_i & 1 & & \\ & \lambda_i & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{bmatrix}, \qquad i = 1, 2, \ldots, p.$$

The number of Jordan blocks, p, is equal to the number of linearly independent eigenvectors of A.
The diagonal element of Ji , λi , is an eigenvalue of A. The number of Jordan blocks associated with
λi is equal to the geometric multiplicity of λi . The sum of the sizes of these blocks is equal to the
algebraic multiplicity of λi . If A is diagonalizable, then each Jordan block is 1 × 1.
Example Consider a matrix with Jordan canonical form
 
$$J = \begin{bmatrix} 2 & 1 & 0 & & & \\ 0 & 2 & 1 & & & \\ 0 & 0 & 2 & & & \\ & & & 3 & 1 & \\ & & & 0 & 3 & \\ & & & & & 2 \end{bmatrix}.$$
The eigenvalues of this matrix are 2, with algebraic multiplicity 4, and 3, with algebraic multiplicity
2. The geometric multiplicity of the eigenvalue 2 is 2, because it is associated with two Jordan
blocks. The geometric multiplicity of the eigenvalue 3 is 1, because it is associated with only one
Jordan block. Therefore, there are a total of three linearly independent eigenvectors, and the matrix
is not diagonalizable. 2
The Jordan canonical form, while very informative about the eigensystem of A, is not practical
to compute using floating-point arithmetic. This is due to the fact that while the eigenvalues of a
matrix are continuous functions of its entries, the Jordan canonical form is not. If two computed
eigenvalues are nearly equal, and their computed corresponding eigenvectors are nearly parallel, we
do not know if they represent two distinct eigenvalues with linearly independent eigenvectors, or a
multiple eigenvalue that could be defective.

4.1.3 Perturbation Theory


Just as the problem of solving a system of linear equations Ax = b can be sensitive to pertur-
bations in the data, the problem of computing the eigenvalues of a matrix can also be sensitive
to perturbations in the matrix. We will now obtain some results concerning the extent of this
sensitivity.
Suppose that A is obtained by perturbing a diagonal matrix D by a matrix F whose diagonal
entries are zero; that is, A = D + F . If λ is an eigenvalue of A with corresponding eigenvector x,
then we have
(D − λI)x + F x = 0.
If λ is not equal to any of the diagonal entries of A, then D − λI is nonsingular and we have

x = −(D − λI)−1 F x.

Taking ∞-norms of both sides, we obtain

kxk∞ = k(D − λI)−1 F xk∞ ≤ k(D − λI)−1 F k∞ kxk∞ ,

which yields
$$\|(D - \lambda I)^{-1}F\|_\infty = \max_{1 \leq i \leq n} \sum_{j=1, j \neq i}^{n} \frac{|f_{ij}|}{|d_{ii} - \lambda|} \geq 1.$$

It follows that for some $i$, $1 \leq i \leq n$, $\lambda$ satisfies
$$|d_{ii} - \lambda| \leq \sum_{j=1, j \neq i}^{n} |f_{ij}|.$$

That is, $\lambda$ lies within one of the Gerschgorin circles in the complex plane, which has center $a_{ii}$ and radius
$$r_i = \sum_{j=1, j \neq i}^{n} |a_{ij}|.$$
This result is known as the Gerschgorin Circle Theorem.


Example The eigenvalues of the matrix
 
$$A = \begin{bmatrix} -5 & -1 & 1 \\ -2 & 2 & -1 \\ 1 & -3 & 7 \end{bmatrix}$$
are
λ(A) = {6.4971, 2.7930, −5.2902}.
The Gerschgorin disks are

D1 = {z ∈ C | |z − 7| ≤ 4},
D2 = {z ∈ C | |z − 2| ≤ 3},
D3 = {z ∈ C | |z + 5| ≤ 2}.

We see that each disk contains one eigenvalue. 2
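The disks in this example can be computed with a few lines of Matlab: the centers are the diagonal entries of A and the radii are the off-diagonal absolute row sums.

% Sketch: Gerschgorin disk centers and radii for the matrix above.
A = [-5 -1 1; -2 2 -1; 1 -3 7];
centers = diag(A);
radii = sum(abs(A), 2) - abs(diag(A));  % off-diagonal absolute row sums
disp([centers radii])                   % each row: [a_ii, r_i]
eig(A)                                  % compare with the eigenvalues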


It is important to note that while there are n eigenvalues and n Gerschgorin disks, it is not
necessarily true that each disk contains an eigenvalue. The Gerschgorin Circle Theorem only states
that all of the eigenvalues are contained within the union of the disks.
Another useful sensitivity result that applies to diagonalizable matrices is the Bauer-Fike The-
orem, which states that if
X −1 AX = diag(λ1 , . . . , λn ),
and µ is an eigenvalue of a perturbed matrix A + E, then

$$\min_{\lambda \in \lambda(A)} |\lambda - \mu| \leq \kappa_p(X)\|E\|_p.$$

That is, µ is within κp (X)kEkp of an eigenvalue of A. It follows that if A is “nearly non-


diagonalizable”, which can be the case if eigenvectors are nearly linearly dependent, then a small
perturbation in A could still cause a large change in the eigenvalues.
It would be desirable to have a concrete measure of the sensitivity of an eigenvalue, just as we
have the condition number that measures the sensitivity of a system of linear equations. To that
end, we assume that λ is a simple eigenvalue of an n × n matrix A that has Jordan canonical form
J = X −1 AX. Then, λ = Jii for some i, and xi , the ith column of X, is a corresponding right
eigenvector.

If we define $Y = X^{-H} = (X^{-1})^H$, then $y_i$ is a left eigenvector of $A$ corresponding to $\lambda$. From $Y^H X = I$, it follows that $y^H x = 1$. We now let $A$, $\lambda$, and $x$ be functions of a parameter $\epsilon$ that satisfy
$$A(\epsilon)x(\epsilon) = \lambda(\epsilon)x(\epsilon), \qquad A(\epsilon) = A + \epsilon F, \qquad \|F\|_2 = 1.$$
Differentiating with respect to $\epsilon$, and evaluating at $\epsilon = 0$, yields
$$Fx + Ax'(0) = \lambda x'(0) + \lambda'(0)x.$$

Taking the inner product of both sides with $y$ yields
$$y^H Fx + y^H Ax'(0) = \lambda y^H x'(0) + \lambda'(0)y^H x.$$
Because $y$ is a left eigenvector corresponding to $\lambda$, and $y^H x = 1$, we have
$$y^H Fx + \lambda y^H x'(0) = \lambda y^H x'(0) + \lambda'(0).$$
We conclude that
$$|\lambda'(0)| = |y^H Fx| \leq \|y\|_2\|F\|_2\|x\|_2 \leq \|y\|_2\|x\|_2.$$

However, because $\theta$, the angle between $x$ and $y$, is given by
$$\cos\theta = \frac{y^H x}{\|y\|_2\|x\|_2} = \frac{1}{\|y\|_2\|x\|_2},$$
it follows that
$$|\lambda'(0)| \leq \frac{1}{|\cos\theta|}.$$
We define 1/| cos θ| to be the condition number of the simple eigenvalue λ. We require λ to be
simple because otherwise, the angle between the left and right eigenvectors is not unique, because
the eigenvectors themselves are not unique.
It should be noted that the condition number is also defined by 1/|yH x|, where x and y are
normalized so that kxk2 = kyk2 = 1, but either way, the condition number is equal to 1/| cos θ|. The
interpretation of the condition number is that an O() perturbation in A can cause an O(/| cos θ|)
perturbation in the eigenvalue λ. Therefore, if x and y are nearly orthogonal, a large change in the
eigenvalue can occur. Furthermore, if the condition number is large, then A is close to a matrix
with a multiple eigenvalue.
Example The matrix
$$A = \begin{bmatrix} 3.1482 & -0.2017 & -0.5363 \\ 0.4196 & 0.5171 & 1.0888 \\ 0.3658 & -1.7169 & 3.3361 \end{bmatrix}$$
has a simple eigenvalue $\lambda = 1.9833$ with right and left eigenvectors
$$x = \begin{bmatrix} 0.4150 \\ 0.6160 \\ 0.6696 \end{bmatrix}, \qquad y = \begin{bmatrix} -7.9435 \\ 83.0701 \\ -70.0066 \end{bmatrix},$$

such that yH x = 1. It follows that the condition number of this eigenvalue is kxk2 kyk2 = 108.925.
In fact, the nearby matrix
 
$$B = \begin{bmatrix} 3.1477 & -0.2023 & -0.5366 \\ 0.4187 & 0.5169 & 1.0883 \\ 0.3654 & -1.7176 & 3.3354 \end{bmatrix}$$
has a double eigenvalue that is equal to 2. 2
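Matlab's condeig function returns condition numbers of the eigenvalues (the reciprocals of the cosines of the angles between left and right eigenvectors), so the sensitivity in this example can be checked directly.

% Sketch: condition numbers of the eigenvalues of the matrix A above.
A = [ 3.1482 -0.2017 -0.5363;
      0.4196  0.5171  1.0888;
      0.3658 -1.7169  3.3361];
[V, D, s] = condeig(A);   % s(i) = 1/|cos(theta_i)| for the ith eigenvalue
[diag(D) s]               % eigenvalues alongside their condition numbers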
We now consider the sensitivity of repeated eigenvalues. First, it is important to note that while
the eigenvalues of a matrix A are continuous functions of the entries of A, they are not necessarily
differentiable functions of the entries. To see this, we consider the matrix
 
$$A = \begin{bmatrix} 1 & a \\ \epsilon & 1 \end{bmatrix},$$
where $a > 0$. Computing its characteristic polynomial
$$\det(A - \lambda I) = \lambda^2 - 2\lambda + 1 - a\epsilon$$
and computing its roots yields the eigenvalues $\lambda = 1 \pm \sqrt{a\epsilon}$. Differentiating these eigenvalues with respect to $\epsilon$ yields
$$\frac{d\lambda}{d\epsilon} = \pm\frac{1}{2}\sqrt{\frac{a}{\epsilon}},$$
which is undefined at $\epsilon = 0$. In general, an $O(\epsilon)$ perturbation in $A$ causes an $O(\epsilon^{1/p})$ perturbation in
an eigenvalue associated with a p×p Jordan block, meaning that the “more defective” an eigenvalue
is, the more sensitive it is.
We now consider the sensitivity of eigenvectors, or, more generally, invariant subspaces of a
matrix A, such as a subspace spanned by the first k Schur vectors, which are the first k columns in
a matrix Q such that QH AQ is upper triangular. Suppose that an n × n matrix A has the Schur
decomposition
$$A = QTQ^H, \qquad Q = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}, \qquad T = \begin{bmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{bmatrix},$$
where Q1 is n × r and T11 is r × r. We define the separation between the matrices T11 and T22 by
$$\mathrm{sep}(T_{11}, T_{22}) = \min_{X \neq 0} \frac{\|T_{11}X - XT_{22}\|_F}{\|X\|_F}.$$
It can be shown that an $O(\epsilon)$ perturbation in $A$ causes an $O(\epsilon/\mathrm{sep}(T_{11}, T_{22}))$ perturbation in the invariant subspace $Q_1$.
We now consider the case where r = 1, meaning that Q1 is actually a vector q1 , that is also an
eigenvector, and T11 is the corresponding eigenvalue, λ. Then, we have
$$\mathrm{sep}(\lambda, T_{22}) = \min_{X \neq 0} \frac{\|\lambda X - XT_{22}\|_F}{\|X\|_F} = \min_{\|y\|_2 = 1} \|y^H(T_{22} - \lambda I)\|_2 = \min_{\|y\|_2 = 1} \|(T_{22} - \lambda I)^H y\|_2 = \sigma_{\min}((T_{22} - \lambda I)^H) = \sigma_{\min}(T_{22} - \lambda I),$$

since the Frobenius norm of a vector is equivalent to the vector 2-norm. Because the smallest
singular value indicates the distance to a singular matrix, sep(λ, T22 ) provides a measure of the
separation of λ from the other eigenvalues of A. It follows that eigenvectors are more sensitive to
perturbation if the corresponding eigenvalues are clustered near one another. That is, eigenvectors
associated with nearby eigenvalues are “wobbly”.
It should be emphasized that there is no direct relationship between the sensitivity of an eigen-
value and the sensitivity of its corresponding invariant subspace. The sensitivity of a simple eigen-
value depends on the angle between its left and right eigenvectors, while the sensitivity of an
invariant subspace depends on the clustering of the eigenvalues. Therefore, a sensitive eigenvalue,
that is nearly defective, can be associated with an insensitive invariant subspace, if it is distant
from other eigenvalues, while an insensitive eigenvalue can have a sensitive invariant subspace if it
is very close to other eigenvalues.

4.2 Power Iterations


4.2.1 The Power Method
We now consider the problem of computing eigenvalues of an n × n matrix A. For simplicity, we
assume that A has eigenvalues λ1 , . . . , λn such that

|λ1 | > |λ2 | ≥ |λ3 | ≥ · · · ≥ |λn |.

We also assume that A is diagonalizable, meaning that it has n linearly independent eigenvectors
x1 , . . . , xn such that Axi = λi xi for i = 1, . . . , n.
Suppose that we continually multiply a given vector $x^{(0)}$ by $A$, generating a sequence of vectors $x^{(1)}, x^{(2)}, \ldots$ defined by
$$x^{(k)} = Ax^{(k-1)} = A^k x^{(0)}, \qquad k = 1, 2, \ldots.$$


Because A is diagonalizable, any vector in Rn is a linear combination of the eigenvectors, and
therefore we can write
x(0) = c1 x1 + c2 x2 + · · · + cn xn .
We then have
$$x^{(k)} = A^k x^{(0)} = \sum_{i=1}^{n} c_i A^k x_i = \sum_{i=1}^{n} c_i \lambda_i^k x_i = \lambda_1^k \left[ c_1 x_1 + \sum_{i=2}^{n} c_i \left( \frac{\lambda_i}{\lambda_1} \right)^k x_i \right].$$

Because |λ1 | > |λi | for i = 2, . . . , n, it follows that the coefficients of xi , for i = 2, . . . , n, converge
to zero as k → ∞. Therefore, the direction of x(k) converges to that of x1 . This leads to the most
basic method of computing an eigenvalue and eigenvector, the Power Method:

Choose an initial vector q0 such that kq0 k2 = 1


for k = 1, 2, . . . do
zk = Aqk−1
qk = zk /kzk k2
end

This algorithm continues until qk converges to within some tolerance. If it converges, it con-
verges to a unit vector that is a scalar multiple of x1 , an eigenvector corresponding to the largest
eigenvalue, λ1 . The rate of convergence is |λ1 /λ2 |, meaning that the distance between qk and a
vector parallel to x1 decreases by roughly this factor from iteration to iteration.
It follows that convergence can be slow if λ2 is almost as large as λ1 , and in fact, the power
method fails to converge if |λ2 | = |λ1 |, but λ2 6= λ1 (for example, if they have opposite signs). It
is worth noting the implementation detail that if λ1 is negative, for example, it may appear that
qk is not converging, as it “flip-flops” between two vectors. This is remedied by normalizing qk so that, in addition to being a unit vector, a fixed component of qk (for example, its first nonzero component) is a positive number.
Once the normalized eigenvector x1 is found, the corresponding eigenvalue λ1 can be computed
using a Rayleigh quotient. Then, deflation can be carried out by constructing a Householder
reflection P1 so that P1 x1 = e1 , as discussed previously, and then P1 AP1 is a matrix with block
upper-triangular structure. This decouples the problem of computing the eigenvalues of A into the
(solved) problem of computing λ1 , and then computing the remaining eigenvalues by focusing on
the lower right (n − 1) × (n − 1) submatrix.
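A bare-bones Matlab implementation of the power method, with a Rayleigh quotient estimate of the eigenvalue, might read as follows; the tolerance and iteration limit are arbitrary choices for this sketch.

% Sketch: power method for the dominant eigenvalue/eigenvector of A.
n = size(A, 1);
q = randn(n, 1);  q = q / norm(q);
tol = 1e-10;  maxit = 1000;
for k = 1:maxit
    z = A * q;
    qnew = z / norm(z);
    if qnew(1) ~= 0
        qnew = qnew * sign(qnew(1));   % fix the sign to avoid "flip-flopping"
    end
    lambda = qnew' * A * qnew;         % Rayleigh quotient estimate of lambda_1
    if norm(qnew - q) < tol            % crude convergence test
        break
    end
    q = qnew;
end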

4.2.2 Orthogonal Iteration


This method can be impractical, however, due to the contamination of smaller eigenvalues by
roundoff error from computing the larger ones and then deflating. An alternative is to compute
several eigenvalues “at once” by using a block version of the Power Method, called Orthogonal
Iteration. In this method, A is multiplied by an n×r matrix, with r > 1, and then the normalization
of the vector computed by the power method is generalized to the orthogonalization of the block,
through the QR factorization. The method is as follows:

Choose an n × r matrix Q0 such that Q0H Q0 = Ir
for k = 1, 2, . . . do
    Zk = A Qk−1
    Zk = Qk Rk (QR Factorization)
end

Generally, this method computes a convergent sequence {Qk }, as long as Q0 is not deficient in
the directions of certain eigenvectors of AH , and |λr | > |λr+1 |. From the relationship

$$R_k = Q_k^H Z_k = Q_k^H A Q_{k-1},$$

we see that if Qk converges to a matrix Q, then QH AQ = R is upper-triangular, and because


AQ = QR, the columns of Q span an invariant subspace.

Furthermore, if $Q^\perp$ is a matrix whose columns span $(\mathrm{range}(Q))^\perp$, then
$$\begin{bmatrix} Q^H \\ (Q^\perp)^H \end{bmatrix} A \begin{bmatrix} Q & Q^\perp \end{bmatrix} = \begin{bmatrix} Q^H AQ & Q^H AQ^\perp \\ (Q^\perp)^H AQ & (Q^\perp)^H AQ^\perp \end{bmatrix} = \begin{bmatrix} R & Q^H AQ^\perp \\ 0 & (Q^\perp)^H AQ^\perp \end{bmatrix}.$$

That is, λ(A) = λ(R) ∪ λ((Q⊥ )H AQ⊥ ), and because R is upper-triangular, the eigenvalues of R are
its diagonal elements. We conclude that Orthogonal Iteration, when it converges, yields the largest
r eigenvalues of A.
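A Matlab sketch of orthogonal iteration for the leading eigenvalues is given below; the block size r, the random starting block, and the fixed iteration count are assumptions of this sketch.

% Sketch: orthogonal iteration for the r dominant eigenvalues of A.
n = size(A, 1);  r = 3;                 % block size chosen arbitrarily here
[Qk, ~] = qr(randn(n, r), 0);           % random orthonormal starting block
for k = 1:500
    Zk = A * Qk;
    [Qk, Rk] = qr(Zk, 0);               % economy-size QR factorization
end
eig(Qk' * A * Qk)                       % approximations to the r largest eigenvalues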

4.2.3 Inverse Iteration

4.3 The QR Algorithm


If we let r = n, then, if the eigenvalues of A are separated in magnitude, then generally Orthogonal
Iteration converges, yielding the Schur Decomposition of A, A = QT QH . However, this convergence
is generally quite slow. Before determining how convergence can be accelerated, we examine this
instance of Orthogonal Iteration more closely.
For each integer $k$, we define $T_k = Q_k^H AQ_k$. Then, from the algorithm for Orthogonal Iteration, we have
$$T_{k-1} = Q_{k-1}^H AQ_{k-1} = Q_{k-1}^H Z_k = (Q_{k-1}^H Q_k)R_k,$$
and
$$T_k = Q_k^H AQ_k = Q_k^H AQ_{k-1}Q_{k-1}^H Q_k = Q_k^H Z_k Q_{k-1}^H Q_k = Q_k^H Q_k R_k(Q_{k-1}^H Q_k) = R_k(Q_{k-1}^H Q_k).$$

That is, Tk is obtained from Tk−1 by computing the QR factorization of Tk−1 , and then multi-
plying the factors in reverse order. Equivalently, Tk is obtained by applying a unitary similarity
transformation to Tk−1 , as

$$T_k = R_k(Q_{k-1}^H Q_k) = (Q_{k-1}^H Q_k)^H T_{k-1}(Q_{k-1}^H Q_k) = U_k^H T_{k-1} U_k.$$

If Orthogonal Iteration converges, then Tk converges to an upper-triangular matrix T = QH AQ


whose diagonal elements are the eigenvalues of A. This simple process of repeatedly computing
the QR factorization and multiplying the factors in reverse order is called the QR Iteration, which
proceeds as follows:

Choose Q0 so that Q0H Q0 = In
T0 = Q0H A Q0
for k = 1, 2, . . . do
Tk−1 = Uk Rk (QR factorization)
Tk = Rk Uk
end

It is this version of Orthogonal Iteration that serves as the cornerstone of an efficient algorithm
for computing all of the eigenvalues of a matrix. As described, QR iteration is prohibitively expen-
sive, because O(n3 ) operations are required in each iteration to compute the QR factorization of
Tk−1 , and typically, many iterations are needed to obtain convergence. However, we will see that
with a judicious choice of Q0 , the amount of computational effort can be drastically reduced.
It should be noted that if A is a real matrix with complex eigenvalues, then Orthogonal Iteration
or the QR Iteration will not converge, due to distinct eigenvalues having equal magnitude. However,
the structure of the matrix Tk in QR Iteration generally will converge to “quasi-upper-triangular”
form, with 1 × 1 or 2 × 2 diagonal blocks corresponding to real eigenvalues or complex-conjugate
pairs of eigenvalues, respectively. It is this type of convergence that we will seek in our continued
development of the QR Iteration.
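The unshifted QR iteration is only a few lines of Matlab; the sketch below simply runs a fixed number of iterations and should not be taken as a practical eigenvalue solver.

% Sketch: unshifted QR iteration. For suitable A, T converges to
% (quasi-)upper-triangular form with the eigenvalues on the diagonal.
T = A;                     % corresponds to the choice Q0 = I
for k = 1:500
    [Uk, Rk] = qr(T);      % T_{k-1} = U_k * R_k
    T = Rk * Uk;           % T_k = R_k * U_k
end
diag(T)                    % eigenvalue estimates (real, well-separated case)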

4.3.1 Hessenberg Reduction


Let A be a real n × n matrix. It is possible that A has complex eigenvalues, which must occur
in complex-conjugate pairs, meaning that if a + ib is an eigenvalue, where a and b are real, then
so is a − ib. On the one hand, it is preferable that complex arithmetic be avoided as much as
possible when using QR iteration to obtain the Schur Decomposition of A. On the other hand, in
the algorithm for QR iteration, if the matrix Q0 used to compute T0 = Q0^H A Q0 is real, then every
matrix Tk generated by the iteration will also be real, so it will not be possible to obtain the Schur
Decomposition.
We compromise by instead seeking to compute the Real Schur Decomposition A = QT QT
where Q is a real, orthogonal matrix and T is a real, quasi-upper-triangular matrix that has a block
upper-triangular structure
$$T = \begin{bmatrix}
T_{11} & T_{12} & \cdots & T_{1p} \\
0 & T_{22} & \ddots & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & T_{pp}
\end{bmatrix},$$

where each diagonal block Tii is 1 × 1, corresponding to a real eigenvalue, or a 2 × 2 block, corre-
sponding to a pair of complex eigenvalues that are conjugates of one another.
If QR iteration is applied to such a matrix, then the sequence {Tk } will not converge, but a
block upper-triangular structure will be obtained, which can then be used to compute all of the
eigenvalues. Therefore, the iteration can be terminated when appropriate entries below the diagonal
have been made sufficiently small.
However, one significant drawback to the QR iteration is that each iteration is too expensive, as
it requires O(n3 ) operations to compute the QR factorization, and to multiply the factors in reverse
order. Therefore, it is desirable to first use a similarity transformation H = U T AU to reduce A to
a form for which the QR factorization and matrix multiplication can be performed more efficiently.
Suppose that U T includes a Householder reflection, or a product of Givens rotations, that trans-
forms the first column of A to a multiple of e1 , as in algorithms to compute the QR factorization.
Then U operates on all rows of A, so when U is applied to the columns of A, to complete the
similarity transformation, it affects all columns. Therefore, the work of zeroing the elements of the
first column of A is undone.

Now, suppose that instead, U T is designed to zero all elements of the first column except the
first two. Then, U T affects all rows except the first, meaning that when U T A is multiplied by U
on the right, the first column is unaffected. Continuing this reasoning with subsequent columns
of A, we see that a sequence of orthogonal transformations can be used to reduce A to an upper
Hessenberg matrix H, in which hij = 0 whenever i > j +1. That is, all entries below the subdiagonal
are equal to zero.
It is particularly efficient to compute the QR factorization of an upper Hessenberg, or simply
Hessenberg, matrix, because it is only necessary to zero one element in each column. Therefore,
it can be accomplished with a sequence of n − 1 Givens row rotations, which requires only O(n2 )
operations. Then, these same Givens rotations can be applied, in the same order, to the columns in
order to complete the similarity transformation, or, equivalently, accomplish the task of multiplying
the factors of the QR factorization.
Specifically, given a Hessenberg matrix H, we apply Givens row rotations $G_1^T, G_2^T, \ldots, G_{n-1}^T$ to
H, where $G_i^T$ rotates rows i and i + 1, to obtain
$$G_{n-1}^T \cdots G_2^T G_1^T H = (G_1 G_2 \cdots G_{n-1})^T H = Q^T H = R,$$
where R is upper-triangular. Then, we compute
$$\tilde{H} = Q^T H Q = RQ = R G_1 G_2 \cdots G_{n-1}$$

by applying column rotations to R, to obtain a new matrix H̃.


By considering which rows or columns the Givens rotations affect, it can be shown that Q
is Hessenberg, and therefore H̃ is Hessenberg as well. The process of applying an orthogonal
similarity transformation to a Hessenberg matrix to obtain a new Hessenberg matrix with the same
eigenvalues that, hopefully, is closer to quasi-upper-triangular form is called a Hessenberg QR step.
The following algorithm overwrites H with H̃ = RQ = QT HQ, and also computes Q as a product
of Givens column rotations, which is only necessary if the full Schur Decomposition of A is required,
as opposed to only the eigenvalues.

for j = 1, 2, . . . , n − 1 do
    [c, s] = givens(hjj , hj+1,j )
    Gj = [ c −s ; s c ]
    H(j : j + 1, j : n) = Gj^T H(j : j + 1, j : n)
end
Q=I
for j = 1, 2, . . . , n − 1 do
H(1 : j + 1, j : j + 1) = H(1 : j + 1, j : j + 1)Gj
Q(1 : j + 1, j : j + 1) = Q(1 : j + 1, j : j + 1)Gj
end

The function givens(a, b) returns c and s such that
$$\begin{bmatrix} c & -s \\ s & c \end{bmatrix}^T \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix}, \qquad r = \sqrt{a^2 + b^2}.$$

Note that when performing row rotations, it is only necessary to update certain columns, and
when performing column rotations, it is only necessary to update certain rows, because of the
structure of the matrix at the time the rotation is performed; for example, after the first loop, H
is upper-triangular.
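A direct Matlab transcription of this Hessenberg QR step is given below as a sketch; the cosine-sine computation plays the role of the givens function above (it assumes the relevant subdiagonal entries are nonzero), and Q is accumulated only to illustrate the second loop.

% One Hessenberg QR step: overwrite H with R*Q = Q'*H*Q, where H is upper Hessenberg.
n = size(H, 1);
C = zeros(1, n-1);  S = zeros(1, n-1);        % store the rotation parameters
for j = 1:n-1
    r = hypot(H(j,j), H(j+1,j));              % assumes r > 0 (H unreduced)
    c = H(j,j)/r;  s = H(j+1,j)/r;            % givens(h_jj, h_j+1,j)
    G = [c -s; s c];
    H(j:j+1, j:n) = G' * H(j:j+1, j:n);       % row rotation: zeros H(j+1,j)
    C(j) = c;  S(j) = s;
end
Q = eye(n);
for j = 1:n-1
    G = [C(j) -S(j); S(j) C(j)];
    H(1:j+1, j:j+1) = H(1:j+1, j:j+1) * G;    % column rotation
    Q(1:j+1, j:j+1) = Q(1:j+1, j:j+1) * G;
end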
Before a Hessenberg QR step can be performed, it is necessary to actually reduce the original
matrix A to Hessenberg form H = U T AU . This can be accomplished by performing a sequence of
Householder reflections U = P1 P2 · · · Pn−2 on the columns of A, as in the following algorithm.

U = I
for j = 1, 2, . . . , n − 2 do
    v = house(A(j + 1 : n, j)), c = 2/(v^T v)
    A(j + 1 : n, j : n) = A(j + 1 : n, j : n) − c v v^T A(j + 1 : n, j : n)
    A(1 : n, j + 1 : n) = A(1 : n, j + 1 : n) − c A(1 : n, j + 1 : n) v v^T
end

The function house(x) computes a vector v such that $Px = (I − cvv^T)x = \alpha e_1$, where $c = 2/v^Tv$
and $\alpha = \pm\|x\|_2$. The algorithm for the Hessenberg reduction requires O(n3) operations, but it is
performed only once, before the QR Iteration begins, so it still leads to a substantial reduction in
the total number of operations that must be performed to compute the Schur Decomposition.
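The reduction itself can be sketched in Matlab as follows; this is only an illustration of the algorithm above (for simplicity it assumes each pivot column has a nonzero leading entry), and in practice one would call the built-in function hess, which computes the same kind of decomposition.

% Sketch of Householder reduction to upper Hessenberg form: U'*A*U = H.
n = size(A, 1);
U = eye(n);
for j = 1:n-2
    x = A(j+1:n, j);
    v = x;
    v(1) = v(1) + sign(x(1))*norm(x);     % house(x); assumes x(1) ~= 0 for simplicity
    c = 2/(v'*v);
    A(j+1:n, j:n) = A(j+1:n, j:n) - c*v*(v'*A(j+1:n, j:n));  % apply reflection on the left
    A(:, j+1:n)   = A(:, j+1:n) - c*(A(:, j+1:n)*v)*v';      % apply reflection on the right
    U(:, j+1:n)   = U(:, j+1:n) - c*(U(:, j+1:n)*v)*v';      % accumulate U = P1*P2*...*P(n-2)
end
H = A;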
If a subdiagonal entry hj+1,j of a Hessenberg matrix H is equal to zero, then the problem of
computing the eigenvalues of H decouples into two smaller problems of computing the eigenvalues
of H11 and H22 , where
$$H = \begin{bmatrix} H_{11} & H_{12} \\ 0 & H_{22} \end{bmatrix}$$
and H11 is j ×j. Therefore, an efficient implementation of the QR Iteration on a Hessenberg matrix
H focuses on a submatrix of H that is unreduced, meaning that all of its subdiagonal entries are
nonzero. It is also important to monitor the subdiagonal entries after each iteration, to determine if
any of them have become nearly zero, thus allowing further decoupling. Once no further decoupling
is possible, H has been reduced to quasi-upper-triangular form and the QR Iteration can terminate.
It is essential to choose a maximal unreduced diagonal block of H for applying a Hessenberg
QR step. That is, the step must be applied to a submatrix H22 such that H has the structure
$$H = \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ 0 & H_{22} & H_{23} \\ 0 & 0 & H_{33} \end{bmatrix}$$

where H22 is unreduced. This condition ensures that the eigenvalues of H22 are also eigenvalues
of H, as λ(H) = λ(H11 ) ∪ λ(H22 ) ∪ λ(H33 ) when H is structured as above. Note that the size of
either H11 or H33 may be 0 × 0.
The following property of unreduced Hessenberg matrices is useful for improving the efficiency
of a Hessenberg QR step.
Theorem (Implicit Q Theorem) Let A be an n × n matrix, and let Q and V be n × n orthogonal
matrices such that $Q^T A Q = H$ and $V^T A V = G$ are both upper Hessenberg, and H is unreduced.
If $Q = \begin{bmatrix} q_1 & \cdots & q_n \end{bmatrix}$ and $V = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix}$, and if $q_1 = v_1$, then $q_i = \pm v_i$ for $i = 2, \ldots, n$,
and $|h_{ij}| = |g_{ij}|$ for $i, j = 1, 2, \ldots, n$.

That is, if two orthogonal similarity transformations that reduce A to Hessenberg form have the
same first column, then they are “essentially equal”, as are the Hessenberg matrices.
The proof of the Implicit Q Theorem proceeds as follows: From the relations QT AQ = H
and V T AV = G, we obtain GW = W H, where W = V T Q is orthogonal. Because q1 = v1 , we
have W e1 = e1 . Equating first columns of GW = W H, and keeping in mind that G and H are
both upper Hessenberg, we find that only the first two elements of W e2 are nonzero. Proceeding
by induction, it follows that W is upper triangular, and therefore W −1 is also upper triangular.
However, because W is orthogonal, W −1 = W T , which means that W −1 is lower triangular as well.
Therefore, W is a diagonal matrix, so by the orthogonality of W , W must have diagonal entries
that are equal to ±1, and the theorem follows.
Another important property of an unreduced Hessenberg matrix is that all of its eigenvalues
have a geometric multiplicity of one. To see this, consider the matrix H − λI, where H is an n × n
unreduced Hessenberg matrix and λ is an arbitrary scalar. If λ is not an eigenvalue of H, then H
is nonsingular and rank(H) = n. Otherwise, because H is unreduced, from the structure of H it
can be seen that the first n − 1 columns of H − λI must be linearly independent. We conclude
that rank(H − λI) = n − 1, and therefore at most one vector x (up to a scalar multiple) satisfies
the equation Hx = λx. That is, there can only be one linearly independent eigenvector. It follows
that if any eigenvalue of H repeats, then it is defective.

4.3.2 Shifted QR Iteration


The efficiency of the QR Iteration for computing the eigenvalues of an n × n matrix A is signifi-
cantly improved by first reducing A to a Hessenberg matrix H, so that only O(n2 ) operations per
iteration are required, instead of O(n3 ). However, the iteration can still converge very slowly, so
additional modifications are needed to make the QR Iteration a practical algorithm for computing
the eigenvalues of a general matrix.
In general, the pth subdiagonal entry of H converges to zero at the rate

$$\left| \frac{\lambda_{p+1}}{\lambda_p} \right|,$$

where λp is the pth largest eigenvalue of A in magnitude. It follows that convergence can be
particularly slow if eigenvalues are very close to one another in magnitude. Suppose that we shift
H by a scalar µ, meaning that we compute the QR factorization of H − µI instead of H, and then
update H to obtain a new Hessenberg H̃ by multiplying the QR factors in reverse order as before,
but then adding µI. Then, we have

H̃ = RQ + µI
= QT (H − µI)Q + µI
= QT HQ − µQT Q + µI
= QT HQ − µI + µI
= QT HQ.

So, we are still performing an orthogonal similarity transformation of H, but with a different Q.
The convergence rate then becomes |λp+1 − µ|/|λp − µ|, so if µ is close to an eigenvalue,
convergence of the corresponding subdiagonal entry will be much more rapid.

In fact, suppose H is unreduced, and that µ happens to be an eigenvalue of H. When we


compute the QR factorization of H − µI, which is now singular, then, because the first n − 1
columns of H − µI must be linearly independent, it follows that the first n − 1 columns of R must
be linearly independent as well, and therefore the last row of R must be zero. Then, when we
compute RQ, which involves rotating columns of R, it follows that the last row of RQ must also
be zero. We then add µI, but as this only changes the diagonal elements, we can conclude that
h̃n,n−1 = 0. In other words, H̃ is not an unreduced Hessenberg matrix, and deflation has occurred
in one step.
If µ is not an eigenvalue of H, but is still close to an eigenvalue, then H − µI is nearly singular,
which means that its columns are nearly linearly dependent. It follows that rnn is small, and it can
be shown that h̃n,n−1 is also small, and h̃nn ≈ µ. Therefore, the problem is nearly decoupled, and
µ is revealed by the structure of H̃ as an approximate eigenvalue of H. This suggests using hnn as
the shift µ during each iteration, because if hn,n−1 is small compared to hnn , then this choice of shift
will drive hn,n−1 toward zero. In fact, it can be shown that this strategy generally causes hn,n−1
to converge to zero quadratically, meaning that only a few similarity transformations are needed to
achieve decoupling. This improvement over the linear convergence rate reported earlier is due to
the changing of the shift during each step.
Example Consider the 2 × 2 matrix
$$H = \begin{bmatrix} a & b \\ \epsilon & 0 \end{bmatrix}, \qquad \epsilon > 0,$$
that arises naturally when using hnn as a shift. To compute the QR factorization of H, we perform
a single Givens rotation to obtain H = GR, where
$$G = \begin{bmatrix} c & -s \\ s & c \end{bmatrix}, \qquad c = \frac{a}{\sqrt{a^2 + \epsilon^2}}, \qquad s = \frac{\epsilon}{\sqrt{a^2 + \epsilon^2}}.$$

Performing the similarity transformation $\tilde{H} = G^T H G$ yields
$$\begin{aligned}
\tilde{H} &= \begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} a & b \\ \epsilon & 0 \end{bmatrix} \begin{bmatrix} c & -s \\ s & c \end{bmatrix} \\
&= \begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} ac + bs & bc - as \\ \epsilon c & -\epsilon s \end{bmatrix} \\
&= \begin{bmatrix} ac^2 + bcs + \epsilon cs & bc^2 - acs - \epsilon s^2 \\ -acs - bs^2 + \epsilon c^2 & -bcs + as^2 - \epsilon cs \end{bmatrix} \\
&= \begin{bmatrix} a + bcs & bc^2 - \epsilon \\ -bs^2 & -bcs \end{bmatrix}.
\end{aligned}$$
We see that the one subdiagonal element is
$$-bs^2 = \frac{-b\epsilon^2}{\epsilon^2 + a^2},$$
compared to the original element ε. It follows that if ε is small compared to a and b, then subsequent
QR steps will cause the subdiagonal element to converge to zero quadratically. For example, if
$$H = \begin{bmatrix} 0.6324 & 0.2785 \\ 0.0975 & 0.5469 \end{bmatrix},$$

then the value of h21 after each of the first three QR steps is 0.1575, −0.0037, and 2.0876 × 10−5 .
2
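The behavior reported in this example can be reproduced with a few lines of Matlab; the following sketch repeatedly applies the shifted QR step with µ = hnn to the 2 × 2 matrix above, and the printed values of h21 should agree with those just reported, up to sign and rounding.

% Shifted QR steps applied to the 2-by-2 example above.
H = [0.6324 0.2785; 0.0975 0.5469];
for k = 1:3
    mu = H(2,2);                       % shift by h_nn
    [Q, R] = qr(H - mu*eye(2));
    H = R*Q + mu*eye(2);               % H <- Q'*H*Q
    fprintf('step %d: h21 = %g\n', k, H(2,1));
end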
This shifting strategy is called the single shift strategy. Unfortunately, it is not very effective
if H has complex eigenvalues. An alternative is the double shift strategy, which is used if the two
eigenvalues, µ1 and µ2 , of the lower-right 2 × 2 block of H are complex. Then, these two eigenvalues
are used as shifts in consecutive iterations to achieve quadratic convergence in the complex case as
well. That is, we compute

H − µ1 I = U1 R1
H1 = R1 U1 + µ1 I
H1 − µ2 I = U2 R2
H2 = R2 U2 + µ2 I.

To avoid complex arithmetic when using complex shifts, the double implicit shift strategy is
used. We first note that

U1 U2 R2 R1 = U1 (H1 − µ2 I)R1
= U1 H1 R1 − µ2 U1 R1
= U1 (R1 U1 + µ1 I)R1 − µ2 (H − µ1 I)
= U1 R1 U1 R1 + µ1 U1 R1 − µ2 (H − µ1 I)
= (H − µ1 I)2 + µ1 (H − µ1 I) − µ2 (H − µ1 I)
= H 2 − 2µ1 H + µ21 I + µ1 H − µ21 I − µ2 H + µ1 µ2 I
= H 2 − (µ1 + µ2 )H + µ1 µ2 I.

Since µ1 = a + bi and µ2 = a − bi are a complex-conjugate pair, it follows that µ1 + µ2 = 2a and


µ1 µ2 = a2 + b2 are real. Therefore, U1 U2 R2 R1 = (U1 U2 )(R2 R1 ) represents the QR factorization of
a real matrix.
Furthermore,

H2 = R2 U2 + µ2 I
= U2T U2 R2 U2 + µ2 U2T U2
= U2T (U2 R2 + µ2 I)U2
= U2T H1 U2
= U2T (R1 U1 + µ1 I)U2
= U2T (U1T U1 R1 U1 + µ1 U1T U1 )U2
= U2T U1T (U1 R1 + µ1 I)U1 U2
= U2T U1T HU1 U2 .

That is, U1 U2 is the orthogonal matrix that implements the similarity transformation of H to obtain
H2 . Therefore, we could use exclusively real arithmetic by forming M = H 2 − (µ1 + µ2 )H + µ1 µ2 I,
compute its QR factorization to obtain M = ZR, and then compute H2 = Z T HZ, since Z = U1 U2 ,
in view of the uniqueness of the QR decomposition. However, M is computed by squaring H, which
requires O(n3 ) operations. Therefore, this is not a practical approach.

We can work around this difficulty using the Implicit Q Theorem. Instead of forming M in its
entirety, we only form its first column, which, being a second-degree polynomial of a Hessenberg
matrix, has only three nonzero entries. We compute a Householder transformation P0 that makes
this first column a multiple of e1 . Then, we compute P0 HP0 , which is no longer Hessenberg, because
it operates on the first three rows and columns of H. Finally, we apply a series of Householder
reflections P1 , P2 , . . . , Pn−2 that restore Hessenberg form. Because these reflections are not applied
to the first row or column, it follows that if we define Z̃ = P0 P1 P2 · · · Pn−2 , then Z and Z̃ have
the same first column. Since both matrices implement similarity transformations that preserve the
Hessenberg form of H, it follows from the Implicit Q Theorem that Z and Z̃ are essentially equal,
and that they essentially produce the same updated matrix H2 . This variation of a Hessenberg QR
step is called a Francis QR step.
A Francis QR step requires 10n2 operations, with an additional 10n2 operations if orthogonal
transformations are being accumulated to obtain the entire real Schur decomposition. Generally,
the entire QR algorithm, including the initial reduction to Hessenberg form, requires about 10n3
operations, with an additional 15n3 operations to compute the orthogonal matrix Q such that
A = QT QT is the real Schur decomposition of A.

4.3.3 Computation of Eigenvectors

4.4 The Symmetric Eigenvalue Problem


4.4.1 Properties
The eigenvalue problem for a real, symmetric matrix A, or a complex, Hermitian matrix A, for which
A = AH , is a considerable simplification of the eigenvalue problem for a general matrix. Consider
the Schur decomposition A = QT QH , where T is upper-triangular. Then, if A is Hermitian, it
follows that T = T H . But because T is upper-triangular, it follows that T must be diagonal. That
is, any symmetric real matrix, or Hermitian complex matrix, is unitarily diagonalizable, as stated
previously because A is normal. What’s more, because the Hermitian transpose includes complex
conjugation, T must equal its complex conjugate, which implies that the eigenvalues of A are real,
even if A itself is complex.
Because the eigenvalues are real, we can order them. By convention, we prescribe that if A is
an n × n symmetric matrix, then it has eigenvalues
λ1 ≥ λ2 ≥ · · · ≥ λn .
Furthermore, by the Courant-Fischer Minimax Theorem, each of these eigenvalues has the following
characterization:
$$\lambda_k = \max_{\dim(S) = k} \; \min_{y \in S,\, y \neq 0} \frac{y^H A y}{y^H y}.$$
That is, the kth largest eigenvalue of A is equal to the maximum, over all k-dimensional subspaces
of Cn , of the minimum value of the Rayleigh quotient
$$r(y) = \frac{y^H A y}{y^H y}, \qquad y \neq 0,$$
on each subspace. It follows that λ1 , the largest eigenvalue, is the absolute maximum value of the
Rayleigh quotient on all of Cn , while λn , the smallest eigenvalue, is the absolute minimum value.

In fact, by computing the gradient of r(y), it can be shown that every eigenvector of A is a critical
point of r(y), with the corresponding eigenvalue being the value of r(y) at that critical point.

4.4.2 Perturbation Theory


In the symmetric case, the Gerschgorin circles become Gerschgorin intervals, because the eigenval-
ues of a symmetric matrix are real.
Example The eigenvalues of the 3 × 3 symmetric matrix
$$A = \begin{bmatrix} -10 & -3 & 2 \\ -3 & 4 & -2 \\ 2 & -2 & 14 \end{bmatrix}$$

are
λ(A) = {14.6515, 4.0638, −10.7153}.
The Gerschgorin intervals are

D1 = {x ∈ R | |x − 14| ≤ 4},
D2 = {x ∈ R | |x − 4| ≤ 5},
D3 = {x ∈ R | |x + 10| ≤ 5}.

We see that each interval contains one eigenvalue. 2


The characterization of the eigenvalues of a symmetric matrix as constrained maxima of the
Rayleigh quotient leads to the following results about the eigenvalues of a perturbed symmetric
matrix. As the eigenvalues are real, and therefore can be ordered, we denote by λi (A) the ith
largest eigenvalue of A.
Theorem (Wielandt-Hoffman) If A and A + E are n × n symmetric matrices, then
$$\sum_{i=1}^{n} \left(\lambda_i(A + E) - \lambda_i(A)\right)^2 \leq \|E\|_F^2.$$

It is also possible to bound the distance between individual eigenvalues of A and A + E.


Theorem If A and A + E are n × n symmetric matrices, then

λn (E) ≤ λk (A + E) − λk (A) ≤ λ1 (E).

Furthermore,
$$|\lambda_k(A + E) - \lambda_k(A)| \leq \|E\|_2.$$

The second inequality in the above theorem follows directly from the first, as the 2-norm of the
symmetric matrix E, being equal to its spectral radius, must be equal to the larger of the absolute
value of λ1 (E) or λn (E).

Theorem (Interlacing Property) If A is an n × n symmetric matrix, and Ar is the r × r leading
principal submatrix of A, then, for r = 1, 2, . . . , n − 1,
$$\lambda_{r+1}(A_{r+1}) \leq \lambda_r(A_r) \leq \lambda_r(A_{r+1}) \leq \cdots \leq \lambda_2(A_{r+1}) \leq \lambda_1(A_r) \leq \lambda_1(A_{r+1}).$$

For a symmetric matrix, or even a more general normal matrix, the left eigenvectors and right
eigenvectors are the same, from which it follows that every simple eigenvalue is “perfectly condi-
tioned”; that is, the condition number 1/| cos θ| is equal to 1 because θ = 0 in this case. However,
the same results concerning the sensitivity of invariant subspaces from the nonsymmetric case apply
in the symmetric case as well: such sensitivity increases as the eigenvalues become more clustered,
even though there is no chance of a defective eigenvalue. This is because for a nondefective, re-
peated eigenvalue, there are infinitely many possible bases of the corresponding invariant subspace.
Therefore, as the eigenvalues approach one another, the eigenvectors become more sensitive to small
perturbations, for any matrix.
Let Q1 be an n × r matrix with orthonormal columns, meaning that QT1 Q1 = Ir . If it spans an
invariant subspace of an n × n symmetric matrix A, then AQ1 = Q1 S, where S = QT1 AQ1 . On the
other hand, if range(Q1 ) is not an invariant subspace, but the matrix
AQ1 − Q1 S = E1
is small for any given r × r symmetric matrix S, then the columns of Q1 define an approximate
invariant subspace.
It turns out that $\|E_1\|_F$ is minimized by choosing $S = Q_1^T A Q_1$. Furthermore, we have
$$\|A Q_1 - Q_1 S\|_F = \|P_1^\perp A Q_1\|_F,$$
where $P_1^\perp = I - Q_1 Q_1^T$ is the orthogonal projection into $(\mathrm{range}(Q_1))^\perp$, and there exist eigenvalues
µ1 , . . . , µr ∈ λ(A) such that
$$|\mu_k - \lambda_k(S)| \leq 2\|E_1\|_2, \qquad k = 1, \ldots, r.$$
That is, r eigenvalues of A are close to the eigenvalues of S, which are known as Ritz values,
while the corresponding eigenvectors are called Ritz vectors. If (θk , yk ) is an eigenvalue-eigenvector
pair, or an eigenpair of S, then, because S is defined by S = QT1 AQ1 , it is also known as a Ritz
pair. Furthermore, as θk is an approximate eigenvalue of A, Q1 yk is an approximate corresponding
eigenvector.
To see this, let σk (not to be confused with singular values) be an eigenvalue of S, with eigen-
vector yk . We multiply both sides of the equation Syk = σk yk by Q1 :
Q1 Syk = σk Q1 yk .
Then, we use the relation AQ1 − Q1 S = E1 to obtain
(AQ1 − E1 )yk = σk Q1 yk .
Rearranging yields
A(Q1 yk ) = σk (Q1 yk ) + E1 yk .
If we let xk = Q1 yk , then we conclude
Axk = σk xk + E1 yk .
Therefore, if $E_1$ is small in some norm, $Q_1 y_k$ is nearly an eigenvector.

4.4.3 Rayleigh Quotient Iteration


The Power Method, when applied to a symmetric matrix to obtain its largest eigenvalue, is more
effective than for a general matrix: its rate of convergence is |λ2 /λ1 |2 , meaning that it generally
converges twice as rapidly.
Let A be an n×n symmetric matrix. Even more rapid convergence can be obtained if we consider
a variation of the Power Method. Inverse Iteration is the Power Method applied to (A − µI)−1 .
The algorithm is as follows:

Choose x0 so that ‖x0‖2 = 1
for k = 0, 1, 2, . . . do
    Solve (A − µI)zk = xk for zk
    xk+1 = zk /‖zk‖2
end

Let A have eigenvalues λ1 , . . . , λn . Then, the eigenvalues of the matrix (A − µI)−1 are 1/(λi − µ),
for i = 1, 2, . . . , n. Therefore, this method finds the eigenvalue of A that is closest to µ.
Now, suppose that we vary µ from iteration to iteration, by setting it equal to the Rayleigh
quotient
$$r(x) = \frac{x^H A x}{x^H x},$$
of which the eigenvalues of A are constrained extrema. We then obtain Rayleigh Quotient Iteration:

Choose a vector x0 with ‖x0‖2 = 1
for k = 0, 1, 2, . . . do
    µk = xk^H A xk
    Solve (A − µk I)zk = xk for zk
    xk+1 = zk /‖zk‖2
end
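A minimal Matlab sketch of Rayleigh Quotient Iteration follows; the symmetric matrix A, a unit-norm starting vector x, an iteration limit maxit, and a tolerance tol are assumed to be given, and no safeguard is included for the case in which A − µk I is exactly singular.

% Sketch of Rayleigh Quotient Iteration for a symmetric matrix A.
n = size(A, 1);
for k = 1:maxit
    mu = x'*A*x;                       % Rayleigh quotient (x has unit 2-norm)
    z = (A - mu*eye(n)) \ x;           % one step of Inverse Iteration with shift mu
    x = z/norm(z);
    if norm(A*x - mu*x) < tol          % residual test
        break;
    end
end
lambda = x'*A*x;                       % approximate eigenvalue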

When this method converges, it converges cubically to an eigenvalue-eigenvector pair. To see


this, consider the diagonal 2 × 2 matrix
$$A = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}, \qquad \lambda_1 > \lambda_2.$$
This matrix has eigenvalues λ1 and λ2 , with eigenvectors e1 and e2 . Suppose that $x_k = \begin{bmatrix} c_k & s_k \end{bmatrix}^T$,
where $c_k^2 + s_k^2 = 1$. Then we have
$$\mu_k = r(x_k) = \begin{bmatrix} c_k & s_k \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} c_k \\ s_k \end{bmatrix} = \lambda_1 c_k^2 + \lambda_2 s_k^2.$$

From
$$A - \mu_k I = \begin{bmatrix} \lambda_1 - (\lambda_1 c_k^2 + \lambda_2 s_k^2) & 0 \\ 0 & \lambda_2 - (\lambda_1 c_k^2 + \lambda_2 s_k^2) \end{bmatrix} = (\lambda_1 - \lambda_2) \begin{bmatrix} s_k^2 & 0 \\ 0 & -c_k^2 \end{bmatrix},$$

we obtain
$$z_k = \frac{1}{\lambda_1 - \lambda_2} \begin{bmatrix} c_k/s_k^2 \\ -s_k/c_k^2 \end{bmatrix} = \frac{1}{c_k^2 s_k^2 (\lambda_1 - \lambda_2)} \begin{bmatrix} c_k^3 \\ -s_k^3 \end{bmatrix}.$$

Normalizing yields
$$x_{k+1} = \frac{1}{\sqrt{c_k^6 + s_k^6}} \begin{bmatrix} c_k^3 \\ -s_k^3 \end{bmatrix},$$
which indicates cubic convergence to a vector that is parallel to e1 or e2 , provided $|c_k| \neq |s_k|$.
It should be noted that Inverse Iteration is also useful for a general (unsymmetric) matrix A, for
finding selected eigenvectors after computing the Schur decomposition A = QT QH , which reveals
the eigenvalues of A, but not the eigenvectors. Then, a computed eigenvalue can be used as the
shift µ, causing rapid convergence to a corresponding eigenvector. In fact, in practice a single
iteration is sufficient. However, when no such information about eigenvalues is available, Inverse
Iteration is far more practical for a symmetric matrix than an unsymmetric matrix, due to the
superior convergence of the Power Method in the symmetric case.

4.4.4 The Symmetric QR Algorithm


A symmetric Hessenberg matrix is tridiagonal. Therefore, the same kind of Householder reflections
that can be used to reduce a general matrix to Hessenberg form can be used to reduce a symmetric
matrix A to a tridiagonal matrix T . However, the symmetry of A can be exploited to reduce the
number of operations needed to apply each Householder reflection on the left and right of A.
It can be verified by examining the structure of the matrices involved, and the rows and columns
influenced by Givens rotations, that if T is a symmetric tridiagonal matrix, and T = QR is its QR
factorization, then Q is upper Hessenberg, and R is upper-bidiagonal (meaning that it is upper-
triangular, with upper bandwidth 1, so that all entries below the main diagonal and above the
superdiagonal are zero). Furthermore, T̃ = RQ is also tridiagonal.
Because each Givens rotation only affects O(1) nonzero elements of a tridiagonal matrix T , it
follows that it only takes O(n) operations to compute the QR factorization of a tridiagonal matrix,
and to multiply the factors in reverse order. However, to compute the eigenvectors of A as well as
the eigenvalues, it is necessary to compute the product of all of the Givens rotations, which still
takes O(n2 ) operations.
The Implicit Q Theorem applies to symmetric matrices as well, meaning that if two orthogonal
similarity transformations reduce a matrix A to unreduced tridiagonal form, and they have the same
first column, then they are essentially equal, as are the tridiagonal matrices that they produce.
In the symmetric case, there is no need for a double-shift strategy, because the eigenvalues are
real. However, the Implicit Q Theorem can be used for a different purpose: computing the similarity
transformation to be used during each iteration without explicitly computing T − µI, where T is
the tridiagonal matrix that is to be reduced to diagonal form. Instead, the first column of T − µI
can be computed, and then a Householder transformation can be constructed to make it a multiple of e1 . This can
then be applied directly to T , followed by a series of Givens rotations to restore tridiagonal form.
By the Implicit Q Theorem, this accomplishes the same effect as computing the QR factorization
U R = T − µI and then computing T̃ = RU + µI.

While the shift µ = tnn can always be used, it is actually more effective to use the Wilkinson
shift, which is given by
$$\mu = t_{nn} + d - \operatorname{sign}(d)\sqrt{d^2 + t_{n,n-1}^2}, \qquad d = \frac{t_{n-1,n-1} - t_{nn}}{2}.$$
This expression yields the eigenvalue of the lower 2 × 2 block of T that is closer to tnn . It can be
shown that this choice of shift leads to cubic convergence of tn,n−1 to zero.
The symmetric QR algorithm is much faster than the unsymmetric QR algorithm. A single
QR step requires about 30n operations, because it operates on a tridiagonal matrix rather than a
Hessenberg matrix, with an additional 6n2 operations for accumulating orthogonal transformations.
The overall symmetric QR algorithm requires 4n3 /3 operations to compute only the eigenvalues,
and approximately 8n3 additional operations to accumulate transformations. Because a symmetric
matrix is unitarily diagonalizable, the columns of the orthogonal matrix Q such that QT AQ
is diagonal are the eigenvectors of A.

4.5 The SVD Algorithm


Let A be an m × n matrix. The Singular Value Decomposition (SVD) of A,

A = U ΣV T ,

where U is m × m and orthogonal, V is n × n and orthogonal, and Σ is an m × n diagonal matrix


with nonnegative diagonal entries

σ1 ≥ σ2 ≥ · · · ≥ σp , p = min{m, n},

known as the singular values of A, is an extremely useful decomposition that yields much informa-
tion about A, including its range, null space, rank, and 2-norm condition number. We now discuss
a practical algorithm for computing the SVD of A, due to Golub and Kahan.
Let U and V have column partitions
$$U = \begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix}, \qquad V = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix}.$$

From the relations


Avj = σj uj , AT uj = σj vj , j = 1, . . . , p,

it follows that
AT Avj = σj2 vj .

That is, the squares of the singular values are the eigenvalues of AT A, which is a symmetric matrix.
It follows that one approach to computing the SVD of A is to apply the symmetric QR algorithm
to AT A to obtain a decomposition AT A = V ΣT ΣV T . Then, the relations Avj = σj uj , j = 1, . . . , p,
can be used in conjunction with the QR factorization with column pivoting to obtain U . However,
this approach is not the most practical, because of the expense and loss of information incurred
from computing AT A.
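The loss of information mentioned above can be observed directly in Matlab; in the following sketch, the matrix A is only an illustrative choice with one tiny singular value, and the two displayed columns should agree for the large singular value but differ noticeably for the small one.

% Why forming A'*A is avoided: small singular values are computed far less
% accurately from the eigenvalues of A'*A than from A itself.
A = [1 1; 0 1e-8; 0 0];
s1 = svd(A);                               % singular values from A directly
e  = sort(eig(A'*A), 'descend');
s2 = sqrt(max(e, 0));                      % the small eigenvalue may even be computed as negative
disp([s1 s2])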

Instead, we can implicitly apply the symmetric QR algorithm to AT A. As the first step of the
symmetric QR algorithm is to use Householder reflections to reduce the matrix to tridiagonal form,
we can use Householder reflections to instead reduce A to upper bidiagonal form
$$U_1^T A V_1 = B = \begin{bmatrix}
d_1 & f_1 & & & \\
& d_2 & f_2 & & \\
& & \ddots & \ddots & \\
& & & d_{n-1} & f_{n-1} \\
& & & & d_n
\end{bmatrix}.$$

It follows that $T = B^T B$ is symmetric and tridiagonal.


We could then apply the symmetric QR algorithm directly to T , but, again, to avoid the loss of
information from computing T explicitly, we implicitly apply the QR algorithm to T by performing
the following steps during each iteration:

1. Determine the first Givens row rotation GT1 that would be applied to T − µI, where µ is the
Wilkinson shift from the symmetric QR algorithm. This requires only computing the first
column of T , which has only two nonzero entries, $t_{11} = d_1^2$ and $t_{21} = d_1 f_1$.

2. Apply G1 as a column rotation to columns 1 and 2 of B to obtain B1 = BG1 . This introduces


an unwanted nonzero in the (2, 1) entry.

3. Apply a Givens row rotation H1 to rows 1 and 2 to zero the (2, 1) entry of B1 , which yields
B2 = H1T BG1 . Then, B2 has an unwanted nonzero in the (1, 3) entry.

4. Apply a Givens column rotation G2 to columns 2 and 3 of B2 , which yields B3 = H1T BG1 G2 .
This introduces an unwanted nonzero in the (3, 2) entry.

5. Continue applying alternating row and column rotations to “chase” the unwanted nonzero
entry down the diagonal of B, until finally B is restored to upper bidiagonal form.

By the Implicit Q Theorem, since G1 is the Givens rotation that would be applied to the first
column of T , the column rotations that help restore upper bidiagonal form are essentially equal to
those that would be applied to T if the symmetric QR algorithm was being applied to T directly.
Therefore, the symmetric QR algorithm is being correctly applied, implicitly, to B.
To detect decoupling, we note that if any superdiagonal entry fi is small enough to be “declared”
equal to zero, then decoupling has been achieved, because the ith subdiagonal entry of T is equal
to di fi , and therefore such a subdiagonal entry must be zero as well. If a diagonal entry di becomes
zero, then decoupling is also achieved, because row or column rotations can be used to zero an
entire row or column of B. In summary, if any diagonal or superdiagonal entry of B becomes zero,
then the tridiagonal matrix T = B T B is no longer unreduced.
Eventually, sufficient decoupling is achieved so that B is reduced to a diagonal matrix Σ. All
Householder reflections that have pre-multiplied A, and all row rotations that have been applied
to B, can be accumulated to obtain U , and all Householder reflections that have post-multiplied
A, and all column rotations that have been applied to B, can be accumulated to obtain V .

4.6 Jacobi Methods


One of the major drawbacks of the symmetric QR algorithm is that it is not parallelizable. Each
orthogonal similarity transformation that is needed to reduce the original matrix A to diagonal
form is dependent upon the previous one. In view of the evolution of parallel architectures, it is
therefore worthwhile to consider whether there are alternative approaches to reducing an n × n
symmetric matrix A to diagonal form that can exploit these architectures.

4.6.1 The Jacobi Idea


To that end, we consider a rotation matrix, denoted by J(p, q, θ), of the form
$$J(p, q, \theta) = \begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & & \vdots \\
0 & \cdots & c & \cdots & s & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & -s & \cdots & c & \cdots & 0 \\
\vdots & & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{bmatrix},$$
in which the entry c appears in the (p, p) and (q, q) positions, s in the (p, q) position, and −s in the (q, p) position,
where c = cos θ and s = sin θ. This matrix, when applied as a similarity transformation to a
symmetric matrix A, rotates rows and columns p and q of A through the angle θ so that the (p, q)
and (q, p) entries are zeroed. We call the matrix J(p, q, θ) a Jacobi rotation. It is actually identical
to a Givens rotation, but in this context we call it a Jacobi rotation to acknowledge its inventor.
Let off(A) be the square root of the sum of squares of all off-diagonal elements of A. That is,
$$\mathrm{off}(A)^2 = \|A\|_F^2 - \sum_{i=1}^{n} a_{ii}^2.$$
Furthermore, let
B = J(p, q, θ)T AJ(p, q, θ).
Then, because the Frobenius norm is invariant under orthogonal transformations, and because only
rows and columns p and q of A are modified in B, we have
$$\begin{aligned}
\mathrm{off}(B)^2 &= \|B\|_F^2 - \sum_{i=1}^{n} b_{ii}^2 \\
&= \|A\|_F^2 - \sum_{i \neq p,q} b_{ii}^2 - (b_{pp}^2 + b_{qq}^2) \\
&= \|A\|_F^2 - \sum_{i \neq p,q} a_{ii}^2 - (a_{pp}^2 + 2a_{pq}^2 + a_{qq}^2) \\
&= \|A\|_F^2 - \sum_{i=1}^{n} a_{ii}^2 - 2a_{pq}^2 \\
&= \mathrm{off}(A)^2 - 2a_{pq}^2 \\
&< \mathrm{off}(A)^2.
\end{aligned}$$

We see that the “size” of the off-diagonal part of the matrix is guaranteed to decrease from such
a similarity transformation.

4.6.2 The 2-by-2 Symmetric Schur Decomposition


We now determine the values c and s such that the diagonalization
$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix}^T \begin{bmatrix} a_{pp} & a_{pq} \\ a_{pq} & a_{qq} \end{bmatrix} \begin{bmatrix} c & s \\ -s & c \end{bmatrix} = \begin{bmatrix} b_{pp} & 0 \\ 0 & b_{qq} \end{bmatrix}$$
is achieved. From the above matrix equation, we obtain the equation
$$0 = b_{pq} = a_{pq}(c^2 - s^2) + (a_{pp} - a_{qq})cs.$$

If we define
$$\tau = \frac{a_{qq} - a_{pp}}{2a_{pq}}, \qquad t = \frac{s}{c},$$
then t satisfies the quadratic equation

t2 + 2τ t − 1 = 0.

Choosing the root that is smaller in absolute value,
$$t = -\tau \pm \sqrt{1 + \tau^2} = \tan\theta,$$

and using the relationships
$$\frac{s}{c} = t, \qquad s^2 + c^2 = 1,$$
we obtain
$$c = \frac{1}{\sqrt{1 + t^2}}, \qquad s = ct.$$
We choose the smaller root to ensure that |θ| ≤ π/4 and that the difference between A and B is small, since
$$\|B - A\|_F^2 = 4(1 - c)\sum_{i \neq p,q}\left(a_{ip}^2 + a_{iq}^2\right) + 2a_{pq}^2/c^2.$$

Therefore, we want c to be larger, which is accomplished if t is smaller.
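This computation can be packaged as a small Matlab function; the following sketch follows the formulas above, chooses the smaller root in a numerically stable way, and the name jacobirot is an arbitrary choice.

function [c, s] = jacobirot(app, apq, aqq)
% Compute c = cos(theta) and s = sin(theta) so that the Jacobi rotation
% J(p,q,theta) zeros the (p,q) entry of a symmetric matrix, with |theta| <= pi/4.
if apq ~= 0
    tau = (aqq - app)/(2*apq);
    if tau >= 0
        t = 1/(tau + sqrt(1 + tau^2));     % smaller root of t^2 + 2*tau*t - 1 = 0
    else
        t = -1/(-tau + sqrt(1 + tau^2));
    end
    c = 1/sqrt(1 + t^2);
    s = c*t;
else
    c = 1;  s = 0;                         % the (p,q) entry is already zero
end
end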

4.6.3 The Classical Jacobi Algorithm


The classical Jacobi algorithm proceeds as follows: find indices p and q, p ≠ q, such that |apq | is
maximized. Then, we use a single Jacobi rotation to zero apq , and then repeat this process until
off(A) is sufficiently small.
Let N = n(n − 1)/2. A sequence of N Jacobi rotations is called a sweep. Because, during each
iteration, apq is the largest off-diagonal entry, it follows that

$$\mathrm{off}(A)^2 \leq 2N a_{pq}^2.$$

Because $\mathrm{off}(B)^2 = \mathrm{off}(A)^2 - 2a_{pq}^2$, we have
$$\mathrm{off}(B)^2 \leq \left(1 - \frac{1}{N}\right)\mathrm{off}(A)^2,$$

which implies linear convergence. However, it has been shown that for sufficiently large k, there
exists a constant c such that
$$\mathrm{off}(A^{(k+N)}) \leq c \cdot \mathrm{off}(A^{(k)})^2,$$
where A(k) is the matrix after k Jacobi updates, meaning that the classical Jacobi algorithm con-
verges quadratically as a function of sweeps. Heuristically, it has been argued that approximately
log n sweeps are needed in practice.
It is worth noting that the guideline that θ be chosen so that |θ| ≤ π/4 is actually essential
to ensure quadratic convergence, because otherwise it is possible that Jacobi updates may simply
interchange nearly converged diagonal entries.

4.6.4 The Cyclic-by-Row Algorithm


The classical Jacobi algorithm is impractical because it requires O(n2 ) comparisons to find the
largest off-diagonal element. A variation, called the cyclic-by-row algorithm, avoids this expense
by simply cycling through the rows of A, in order. It can be shown that quadratic convergence is
still achieved.
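As an illustration, one cyclic-by-row sweep can be sketched in Matlab as follows; it reuses the hypothetical jacobirot helper from the previous sketch, assumes A is symmetric and that V has been initialized to the identity, and forms the full rotation matrix only to keep the sketch short (in practice only rows and columns p and q are updated).

% One cyclic-by-row Jacobi sweep, accumulating the rotations in V.
n = size(A, 1);
for p = 1:n-1
    for q = p+1:n
        [c, s] = jacobirot(A(p,p), A(p,q), A(q,q));
        J = eye(n);
        J([p q], [p q]) = [c s; -s c];
        A = J'*A*J;                        % zeros A(p,q) and A(q,p)
        V = V*J;                           % approximate eigenvectors accumulate in V
    end
end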

4.6.5 Error Analysis


In terms of floating-point operations and comparisons, the Jacobi method is not competitive with
the symmetric QR algorithm, as the expense of two Jacobi sweeps is comparable to that of the entire
symmetric QR algorithm, even with the accumulation of transformations to obtain the matrix of
eigenvectors. On the other hand, the Jacobi method can exploit a known approximate eigenvector
matrix, whereas the symmetric QR algorithm cannot.
The relative error in the computed eigenvalues is quite small if A is positive definite. If λi is an
exact eigenvalue of A and λ̃i is the closest computed eigenvalue, then it has been shown by Demmel
and Veselić that
$$\frac{|\tilde{\lambda}_i - \lambda_i|}{|\lambda_i|} \approx u\,\kappa_2(D^{-1} A D^{-1}) \ll u\,\kappa_2(A),$$
where D is a diagonal matrix with diagonal entries $\sqrt{a_{11}}, \sqrt{a_{22}}, \ldots, \sqrt{a_{nn}}$.

4.6.6 Parallel Jacobi


The primary advantage of the Jacobi method over the symmetric QR algorithm is its parallelism.
As each Jacobi update consists of a row rotation that affects only rows p and q, and a column
rotation that affects only columns p and q, up to n/2 Jacobi updates can be performed in parallel.
Therefore, a sweep can be efficiently implemented by performing n−1 series of n/2 parallel updates
in which each row i is paired with a different row j, for i ≠ j.
As the size of the matrix, n, is generally much greater than the number of processors, p, it is
common to use a block approach, in which each update consists of the computation of a 2r × 2r
symmetric Schur decomposition for some chosen block size r. This is accomplished by applying

another algorithm, such as the symmetric QR algorithm, on a smaller scale. Then, if p ≥ n/(2r),
an entire block Jacobi sweep can be parallelized.

4.6.7 Jacobi SVD Procedures


The Jacobi method can be adapted to compute the SVD, just as the symmetric QR algorithm is.
Two types of Jacobi SVD procedures are:

• Two-sided Jacobi: In each Jacobi update, a 2 × 2 SVD is computed in place of a 2 × 2 Schur


decomposition, using a pair of rotations to zero out the off-diagonal entries apq and aqp . This
process continues until off(A), whose square is reduced by a2pq + a2qp , is sufficiently small.
The idea is to first use a row rotation to make the block symmetric, and then perform a
Jacobi update as in the symmetric eigenvalue problem to diagonalize the symmetrized block.

• One-sided Jacobi: This approach, like the Golub-Kahan SVD algorithm, implicitly applies
the Jacobi method for the symmetric eigenvalue problem to AT A. The idea is, within each
update, to use a column Jacobi rotation to rotate columns p and q of A so that they are
orthogonal, which has the effect of zeroing the (p, q) entry of AT A. Once all columns of AV
are orthogonal, where V is the accumulation of all column rotations, the relation AV = U Σ
is used to obtain U and Σ by simple column scaling. To find a suitable rotation, we note that
if ap and aq , the pth and qth columns of A, are rotated through an angle θ, then the rotated
columns satisfy

$$(c\,a_p - s\,a_q)^T(s\,a_p + c\,a_q) = cs\left(\|a_p\|_2^2 - \|a_q\|_2^2\right) + (c^2 - s^2)\,a_p^T a_q,$$

where c = cos θ and s = sin θ. Dividing by c2 and defining t = s/c, we obtain a quadratic
equation for t that can be solved to obtain c and s.
Part III

Data Fitting and Function Approximation

Chapter 5

Polynomial Interpolation

Calculus provides many tools that can be used to understand the behavior of functions, but in most
cases it is necessary for these functions to be continuous or differentiable. This presents a problem
in most “real” applications, in which functions are used to model relationships between quantities,
but our only knowledge of these functions consists of a set of discrete data points, where the data
is obtained from measurements. Therefore, we need to be able to construct continuous functions
based on discrete data.
The problem of constructing such a continuous function is called data fitting. In this chapter,
we discuss a special case of data fitting known as interpolation, in which the goal is to find
a linear combination of n known functions to fit a set of data that imposes n constraints, thus
guaranteeing a unique solution that fits the data exactly, rather than approximately. The broader
term “constraints” is used, rather than simply “data points”, since the description of the data may
include additional information such as rates of change or requirements that the fitting function
have a certain number of continuous derivatives.
When it comes to the study of functions using calculus, polynomials are particularly simple to
work with. Therefore, in this course we will focus on the problem of constructing a polynomial
that, in some sense, fits given data. We first discuss some algorithms for computing the unique
polynomial pn (x) of degree n that satisfies pn (xi ) = yi , i = 0, . . . , n, where the points (xi , yi ) are
given. The points x0 , x1 , . . . , xn are called interpolation points. The polynomial pn (x) is called
the interpolating polynomial of the data (x0 , y0 ), (x1 , y1 ), . . ., (xn , yn ). At first, we will assume
that the interpolation points are all distinct; this assumption will be relaxed in a later section.

5.1 Existence and Uniqueness

A straightforward method of computing the interpolating polynomial

pn (x) = a0 + a1 x + a2 x2 + · · · + an xn

is to solve the equations pn (xi ) = yi , for i = 0, 1, . . . , n, for the unknown coefficients aj , j =


0, 1, . . . , n.


Exercise 5.1.1 Solve the system of equations

a0 + a1 x0 = y0 ,
a0 + a1 x1 = y1

for the coefficients of the linear function p1 (x) = a0 + a1 x that interpolates the data
(x0 , y0 ), (x1 , y1 ). What is the system of equations that must be solved to compute the
coefficients a0 , a1 and a2 of the quadratic function p2 (x) = a0 +a1 x+a2 x2 that interpolates
the data (x0 , y0 ), (x1 , y1 ), (x2 , y2 )? Express both systems of equations in matrix-vector
form.
For general n, computing the coefficients a0 , a1 , . . . , an of pn (x) requires solving the system of
linear equations Vn a = y, where the entries of Vn are defined by $[V_n]_{ij} = x_i^j$, i, j = 0, . . . , n, where
x0 , x1 , . . . , xn are the points at which the data y0 , y1 , . . . , yn are obtained. The basis {1, x, . . . , xn }
of the space of polynomials of degree n is called the monomial basis, and the corresponding
matrix Vn is called the Vandermonde matrix for the points x0 , x1 , . . . , xn .
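For instance, this monomial-basis approach can be carried out directly in Matlab for the data used in Example 5.2.2 below; the sketch is included only to make the idea concrete, since, as discussed next, this is not the recommended way to compute the interpolant.

% Interpolation in the monomial basis: set up and solve the Vandermonde system.
x = [-1; 0; 1; 2];                 % interpolation points
y = [3; -4; 5; -6];                % data values
V = fliplr(vander(x));             % Vandermonde matrix, [V]_{ij} = x_i^j
a = V \ y;                         % coefficients of p_n(x) = a_0 + a_1*x + ... + a_n*x^n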
Unfortunately, this approach to computing pn (x) is not practical. Solving this system of equa-
tions requires O(n3 ) floating-point operations; we will see that O(n2 ) is possible. Furthermore, the
Vandermonde matrix can be ill-conditioned, especially when the interpolation points x0 , x1 , . . . , xn
are close together. Instead, we will construct pn (x) using a representation other than the monomial
basis. That is, we will represent pn (x) as

n
X
pn (x) = ai ϕi (x),
i=0

for some choice of polynomials ϕ0 (x), ϕ1 (x), . . . , ϕn (x). This is equivalent to solving the linear
system P a = y, where the matrix P has entries pij = ϕj (xi ). By choosing the basis functions
{ϕi (x)}ni=0 judiciously, we can obtain a simpler system of linear equations to solve.

Exercise 5.1.2 Write down the Vandermonde matrix Vn for the points x0 , x1 , . . . , xn .
Show that
$$\det V_n = \prod_{i=0}^{n} \prod_{j=0}^{i-1} (x_i - x_j).$$

Conclude that the system of equations Vn a = y has a unique solution.

Exercise 5.1.3 In this exercise, we consider another approach to proving the uniqueness
of the interpolating polynomial. Let pn (x) and qn (x) be polynomials of degree n such that
pn (xi ) = qn (xi ) = yi for i = 0, 1, 2, . . . , n. Prove that pn (x) ≡ qn (x) for all x.

Exercise 5.1.4 Suppose we express the interpolating polynomial of degree one in the
form
p1 (x) = a0 (x − x1 ) + a1 (x − x0 ).
What is the matrix of the system of equations p1 (xi ) = yi , for i = 0, 1? How should the
form of the interpolating polynomial of degree two, p2 (x), be chosen to obtain an equally
simple system of equations to solve for the coefficients a0 , a1 , and a2 ?

5.2 Lagrange Interpolation


In Lagrange interpolation, the matrix P is simply the identity matrix, by virtue of the fact that
the interpolating polynomial is written in the form
n
X
pn (x) = yj Ln,j (x),
j=0

where the polynomials $\{L_{n,j}\}_{j=0}^{n}$ have the property that
$$L_{n,j}(x_i) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$

The polynomials {Ln,j }, j = 0, . . . , n, are called the Lagrange polynomials for the interpolation
points x0 , x1 , . . ., xn .
To obtain a formula for the Lagrange polynomials, we note that the above definition specifies
the roots of $L_{n,j}(x)$: $x_i$, for $i \neq j$. It follows that $L_{n,j}(x)$ has the form
$$L_{n,j}(x) = \beta_j \prod_{i=0, i \neq j}^{n} (x - x_i)$$
for some constant $\beta_j$. Substituting $x = x_j$ and requiring $L_{n,j}(x_j) = 1$ yields $\beta_j = \prod_{i=0, i \neq j}^{n} \frac{1}{x_j - x_i}$.
We conclude that
$$L_{n,j}(x) = \prod_{k=0, k \neq j}^{n} \frac{x - x_k}{x_j - x_k}.$$

As the following result indicates, the problem of polynomial interpolation can be solved using
Lagrange polynomials.

Theorem 5.2.1 Let x0 , x1 , . . . , xn be n + 1 distinct numbers, and let f (x) be a function


defined on a domain containing these numbers. Then the polynomial defined by
$$p_n(x) = \sum_{j=0}^{n} f(x_j) L_{n,j}(x)$$

is the unique polynomial of degree n that satisfies

pn (xj ) = f (xj ), j = 0, 1, . . . , n.

Example 5.2.2 We will use Lagrange interpolation to find the unique polynomial p3 (x), of degree
3 or less, that agrees with the following data:
i xi yi
0 −1 3
1 0 −4
2 1 5
3 2 −6

In other words, we must have p3 (−1) = 3, p3 (0) = −4, p3 (1) = 5, and p3 (2) = −6.
First, we construct the Lagrange polynomials $\{L_{3,j}(x)\}_{j=0}^{3}$, using the formula
$$L_{3,j}(x) = \prod_{i=0, i \neq j}^{3} \frac{x - x_i}{x_j - x_i}.$$

This yields
$$\begin{aligned}
L_{3,0}(x) &= \frac{(x - x_1)(x - x_2)(x - x_3)}{(x_0 - x_1)(x_0 - x_2)(x_0 - x_3)} = \frac{(x - 0)(x - 1)(x - 2)}{(-1 - 0)(-1 - 1)(-1 - 2)} = \frac{x(x^2 - 3x + 2)}{(-1)(-2)(-3)} = -\frac{1}{6}(x^3 - 3x^2 + 2x), \\
L_{3,1}(x) &= \frac{(x - x_0)(x - x_2)(x - x_3)}{(x_1 - x_0)(x_1 - x_2)(x_1 - x_3)} = \frac{(x + 1)(x - 1)(x - 2)}{(0 + 1)(0 - 1)(0 - 2)} = \frac{(x^2 - 1)(x - 2)}{(1)(-1)(-2)} = \frac{1}{2}(x^3 - 2x^2 - x + 2), \\
L_{3,2}(x) &= \frac{(x - x_0)(x - x_1)(x - x_3)}{(x_2 - x_0)(x_2 - x_1)(x_2 - x_3)} = \frac{(x + 1)(x - 0)(x - 2)}{(1 + 1)(1 - 0)(1 - 2)} = \frac{x(x^2 - x - 2)}{(2)(1)(-1)} = -\frac{1}{2}(x^3 - x^2 - 2x), \\
L_{3,3}(x) &= \frac{(x - x_0)(x - x_1)(x - x_2)}{(x_3 - x_0)(x_3 - x_1)(x_3 - x_2)} = \frac{(x + 1)(x - 0)(x - 1)}{(2 + 1)(2 - 0)(2 - 1)} = \frac{x(x^2 - 1)}{(3)(2)(1)} = \frac{1}{6}(x^3 - x).
\end{aligned}$$
By substituting xi for x in each Lagrange polynomial L3,j (x), for j = 0, 1, 2, 3, it can be verified
that
$$L_{3,j}(x_i) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$

It follows that the Lagrange interpolating polynomial p3 (x) is given by
$$\begin{aligned}
p_3(x) &= \sum_{j=0}^{3} y_j L_{3,j}(x) \\
&= y_0 L_{3,0}(x) + y_1 L_{3,1}(x) + y_2 L_{3,2}(x) + y_3 L_{3,3}(x) \\
&= (3)\left(-\frac{1}{6}(x^3 - 3x^2 + 2x)\right) + (-4)\left(\frac{1}{2}(x^3 - 2x^2 - x + 2)\right) + (5)\left(-\frac{1}{2}(x^3 - x^2 - 2x)\right) + (-6)\left(\frac{1}{6}(x^3 - x)\right) \\
&= -\frac{1}{2}(x^3 - 3x^2 + 2x) + (-2)(x^3 - 2x^2 - x + 2) - \frac{5}{2}(x^3 - x^2 - 2x) - (x^3 - x) \\
&= \left(-\frac{1}{2} - 2 - \frac{5}{2} - 1\right)x^3 + \left(\frac{3}{2} + 4 + \frac{5}{2}\right)x^2 + (-1 + 2 + 5 + 1)x - 4 \\
&= -6x^3 + 8x^2 + 7x - 4.
\end{aligned}$$

Substituting each xi , for i = 0, 1, 2, 3, into p3 (x), we can verify that we obtain p3 (xi ) = yi in each
case. 2
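This result is easy to check in Matlab using the built-in function polyfit, which, when the degree equals the number of data points minus one, computes the same interpolating polynomial:

% Verify the interpolating polynomial from Example 5.2.2.
x = [-1 0 1 2];
y = [3 -4 5 -6];
p = polyfit(x, y, 3)     % expected: [-6 8 7 -4], i.e. -6x^3 + 8x^2 + 7x - 4
polyval(p, x)            % should reproduce y, up to roundoff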

Exercise 5.2.1 Write a Matlab function L=makelagrange(x) that accepts as input a


vector x of length n + 1 consisting of the points x0 , x1 , . . . , xn , which must be distinct,
and returns a (n + 1) × (n + 1) matrix L, each row of which consists of the coefficients
of the Lagrange polynomial Ln,j , j = 0, 1, 2, . . . , n, with highest-degree coefficients in the
first column. Use the conv function to multiply polynomials that are represented as row
vectors of coefficients.

Exercise 5.2.2 Write a Matlab function p=lagrangefit(x,y) that accepts as input


vectors x and y of length n+1 consisting of the x- and y-coordinates, respectively, of points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), where the x-values must all be distinct, and returns a (n+1)-
vector p consisting of the coefficients of the Lagrange interpolating polynomial pn (x),
with highest-degree coefficient in the first position. Use your makelagrange function
from Exercise 5.2.1. Test your function by comparing your output to that of the built-in
function polyfit.
The Lagrange interpolating polynomial can be inefficient to evaluate, as written, because it
involves O(n2 ) subtractions for a polynomial of degree n. We can evaluate the interpolating poly-
nomial more efficiently using a technique called barycentric interpolation [5].
Starting with the Lagrange form of an interpolating polynomial pn (x),
$$p_n(x) = \sum_{j=0}^{n} y_j L_{n,j}(x),$$

we define the barycentric weights wj , j = 0, 1, . . . , n, by
$$w_j = \prod_{i=0, i \neq j}^{n} \frac{1}{x_j - x_i}. \qquad (5.1)$$

Next, we define
πn (x) = (x − x0 )(x − x1 ) · · · (x − xn ).
Then, each Lagrange polynomial can be rewritten as

πn (x)wj
Ln,j (x) = , x 6= xj ,
x − xj

and the interpolant itself can be written as
$$p_n(x) = \pi_n(x) \sum_{j=0}^{n} \frac{y_j w_j}{x - x_j}.$$

However, if we let yj = 1 for j = 0, 1, . . . , n, we have
$$1 = \pi_n(x) \sum_{j=0}^{n} \frac{w_j}{x - x_j}.$$

Dividing the previous two equations yields
$$p_n(x) = \frac{\displaystyle\sum_{j=0}^{n} \frac{y_j w_j}{x - x_j}}{\displaystyle\sum_{j=0}^{n} \frac{w_j}{x - x_j}}.$$

Although O(n2 ) products are needed to compute the barycentric weights, they need only be com-
puted once, and then re-used for each x, which is not the case with the Lagrange form.
Exercise 5.2.3 Write a Matlab function w=baryweights(x) that accepts as input a
vector x of length n + 1, consisting of the distinct interpolation points x0 , x1 , . . . , xn , and
returns a vector w of length n + 1 consisting of the barycentric weights wj as defined in
(5.1).

Exercise 5.2.4 Write a Matlab function yy=lagrangeval(x,y,xx) that accepts


as input vectors x and y, both of length n + 1, representing the data points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), and a vector of x-values xx. This function uses the barycen-
tric weights (5.1), as computed by the function baryweights from Exercise 5.2.3, to com-
pute the value of the Lagrange interpolating polynomial for the given data points at each
x-value in xx. The corresponding y-values must be returned in the vector yy, which must
have the same dimensions as xx.

5.3 Divided Differences


While the Lagrange polynomials are easy to compute, they are difficult to work with. Furthermore,
if new interpolation points are added, all of the Lagrange polynomials must be recomputed. Unfor-
tunately, it is not uncommon, in practice, to add to an existing set of interpolation points. It may

be determined after computing the kth-degree interpolating polynomial pk (x) of a function f (x)
that pk (x) is not a sufficiently accurate approximation of f (x) on some domain. Therefore, an in-
terpolating polynomial of higher degree must be computed, which requires additional interpolation
points.
To address these issues, we consider the problem of computing the interpolating polynomial
recursively. More precisely, let k > 0, and let pk (x) be the polynomial of degree k that interpolates
the function f (x) at the points x0 , x1 , . . . , xk . Ideally, we would like to be able to obtain pk (x) from
polynomials of degree k − 1 that interpolate f (x) at points chosen from among x0 , x1 , . . . , xk . The
following result shows that this is possible.

Theorem 5.3.1 Let n be a positive integer, and let f (x) be a function defined on a
domain containing the n + 1 distinct points x0 , x1 , . . . , xn , and let pn (x) be the polynomial
of degree n that interpolates f (x) at the points x0 , x1 , . . . , xn . For each i = 0, 1, . . . , n, we
define pn−1,i (x) to be the polynomial of degree n − 1 that interpolates f (x) at the points
x0 , x1 , . . . , xi−1 , xi+1 , . . . , xn . If i and j are distinct nonnegative integers not exceeding n,
then
$$p_n(x) = \frac{(x - x_j)\, p_{n-1,j}(x) - (x - x_i)\, p_{n-1,i}(x)}{x_i - x_j}.$$

This theorem can be proved by substituting x = xi into the above form for pn (x), and using the
fact that the interpolating polynomial is unique.

Exercise 5.3.1 Prove Theorem 5.3.1.


This result leads to an algorithm called Neville’s Method [26] that computes the value of
pn (x) at a given point using the values of lower-degree interpolating polynomials at x. We now
describe this algorithm in detail.

Algorithm 5.3.2 Let x0 , x1 , . . . , xn be distinct numbers, and let f (x) be a function de-
fined on a domain containing these numbers. Given a number x∗ , the following algorithm
computes y ∗ = pn (x∗ ), where pn (x) is the nth interpolating polynomial of f (x) that in-
terpolates f (x) at the points x0 , x1 , . . . , xn .

for j = 0 to n do
Qj = f (xj )
end
for j = 1 to n do
for k = n to j do
Qk = [(x − xk )Qk−1 − (x − xk−j )Qk ]/(xk−j − xk )
end
end
y ∗ = Qn

At the jth iteration of the outer loop, the number Qk , for k = n, n − 1, . . . , j, represents the value
at x of the polynomial that interpolates f (x) at the points xk , xk−1 , . . . , xk−j .
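A direct Matlab transcription of this algorithm might look as follows; the function name nevilleval is an arbitrary choice, and the indices are shifted by one to account for Matlab's 1-based indexing.

function ystar = nevilleval(x, y, xstar)
% Evaluate the interpolating polynomial for the data (x, y) at xstar
% using Neville's Method.
n = length(x) - 1;
Q = y(:);                              % Q(j+1) = f(x_j) initially
for j = 1:n
    for k = n:-1:j
        Q(k+1) = ((xstar - x(k+1))*Q(k) - (xstar - x(k-j+1))*Q(k+1)) / ...
                 (x(k-j+1) - x(k+1));
    end
end
ystar = Q(n+1);
end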
The preceding theorem can be used to compute the polynomial pn (x) itself, rather than its value
at a given point. This yields an alternative method of constructing the interpolating polynomial,

called Newton interpolation, that is more suitable for tasks such as inclusion of additional
interpolation points. The basic idea is to represent interpolating polynomials using the Newton
form, which uses linear factors involving the interpolation points, instead of monomials of the form
$x^j$.

5.3.1 Newton Form


Recall that if the interpolation points x0 , . . . , xn are distinct, then the process of finding a polyno-
mial that passes through the points (xi , yi ), i = 0, . . . , n, is equivalent to solving a system of linear
equations Ax = b that has a unique solution. The matrix A is determined by the choice of basis for
the space of polynomials of degree n or less. Each entry aij of A is the value of the jth polynomial
in the basis at the point xi .
In Newton interpolation, the matrix A is upper triangular, and the basis functions are defined
to be the set $\{N_j(x)\}_{j=0}^{n}$, where
$$N_0(x) = 1, \qquad N_j(x) = \prod_{k=0}^{j-1} (x - x_k), \quad j = 1, \ldots, n.$$

The advantage of Newton interpolation is that the interpolating polynomial is easily updated as
interpolation points are added, since the basis functions {Nj (x)}, j = 0, . . . , n, do not change from
the addition of the new points.
Using Theorem 5.3.1, it can be shown that the coefficients cj of the Newton interpolating
polynomial
$$p_n(x) = \sum_{j=0}^{n} c_j N_j(x)$$

are given by
cj = f [x0 , . . . , xj ]
where f [x0 , . . . , xj ] denotes the divided difference of x0 , . . . , xj . The divided difference is defined
as follows:

$$\begin{aligned}
f[x_i] &= y_i, \\
f[x_i, x_{i+1}] &= \frac{y_{i+1} - y_i}{x_{i+1} - x_i}, \\
f[x_i, x_{i+1}, \ldots, x_{i+k}] &= \frac{f[x_{i+1}, \ldots, x_{i+k}] - f[x_i, \ldots, x_{i+k-1}]}{x_{i+k} - x_i}.
\end{aligned}$$

This definition implies that for each nonnegative integer j, the divided difference f [x0 , x1 , . . . , xj ]
only depends on the interpolation points x0 , x1 , . . . , xj and the value of f (x) at these points. It
follows that the addition of new interpolation points does not change the coefficients c0 , . . . , cn .
Specifically, we have
$$p_{n+1}(x) = p_n(x) + \frac{y_{n+1} - p_n(x_{n+1})}{N_{n+1}(x_{n+1})}\, N_{n+1}(x).$$
This ease of updating makes Newton interpolation the most commonly used method of obtaining
the interpolating polynomial.

The following result shows how the Newton interpolating polynomial bears a resemblance to a
Taylor polynomial.

Theorem 5.3.3 Let f be n times continuously differentiable on [a, b], and let
x0 , x1 , . . . , xn be distinct points in [a, b]. Then there exists a number ξ ∈ [a, b] such
that
$$f[x_0, x_1, \ldots, x_n] = \frac{f^{(n)}(\xi)}{n!}.$$

Exercise 5.3.2 Prove Theorem 5.3.3 for the case of n = 2, using the definition of divided
differences and Taylor’s theorem.

Exercise 5.3.3 Let pn (x) be the interpolating polynomial for f (x) at points
x0 , x1 , . . . , xn ∈ [a, b], and assume that f is n times differentiable. Use Rolle’s Theo-
rem to prove that $p_n^{(n)}(\xi) = f^{(n)}(\xi)$ for some point $\xi \in [a, b]$.

Exercise 5.3.4 Use Exercise 5.3.3 to prove Theorem 5.3.3. Hint: Think of pn−1 (x) as
an interpolant of pn (x).

5.3.2 Computing the Newton Interpolating Polynomial

We now describe in detail how to compute the coefficients cj = f [x0 , x1 , . . . , xj ] of the Newton
interpolating polynomial pn (x), and how to evaluate pn (x) efficiently using these coefficients.

The computation of the coefficients proceeds by filling in the entries of a divided-difference


table. This is a triangular table consisting of n + 1 columns, where n is the degree of the inter-
polating polynomial to be computed. For j = 0, 1, . . . , n, the jth column contains n − j + 1 entries,
which are the divided differences f [xk , xk+1 , . . . , xk+j ], for k = 0, 1, . . . , n − j.

We construct this table by filling in the n + 1 entries in column 0, which are the trivial divided
differences f [xj ] = f (xj ), for j = 0, 1, . . . , n. Then, we use the recursive definition of the divided
differences to fill in the entries of subsequent columns. Once the construction of the table is
complete, we can obtain the coefficients of the Newton interpolating polynomial from the first
entry in each column, which is f [x0 , x1 , . . . , xj ], for j = 0, 1, . . . , n.

In a practical implementation of this algorithm, we do not need to store the entire table, because
we only need the first entry in each column. Because each column has one fewer entry than the
previous column, we can overwrite all of the other entries that we do not need. The following
algorithm implements this idea.

Algorithm 5.3.4 (Divided-Difference Table) Given n + 1 distinct interpolation points
x_0, x_1, . . . , x_n, and the values of a function f(x) at these points, the following algorithm
computes the coefficients c_j = f[x_0, x_1, . . . , x_j] of the Newton interpolating polynomial.

for i = 0, 1, . . . , n do
di,0 = f (xi )
end
for j = 1, 2, . . . , n do
for i = n, n − 1, . . . , j do
di,j = (di,j−1 − di−1,j−1 )/(xi − xi−j )
end
end
for j = 0, 1, . . . , n do
cj = dj,j
end
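To make the procedure concrete, the following Matlab function is one possible translation of Algorithm 5.3.4 into code; it is only a sketch, the function name divdifftable and the shift to 1-based indexing are our own choices, and readers working Exercise 5.3.5 below may prefer to write their own version first.

function c = divdifftable(x,y)
% Sketch of Algorithm 5.3.4: x and y are vectors of length n+1 containing
% the interpolation points and function values; c(j+1) = f[x_0,...,x_j].
n = length(x) - 1;
d = y(:);                          % column 0 of the divided-difference table
for j = 1:n
    for i = n+1:-1:j+1             % overwrite entries from the bottom up
        d(i) = (d(i) - d(i-1))/(x(i) - x(i-j));
    end
end
c = d;

For the data of Example 5.3.5 below, divdifftable([-1 0 1 2],[3 -4 5 -6]) returns the coefficients 3, −7, 8, −6.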

Example 5.3.5 We will use Newton interpolation to construct the third-degree polynomial p3 (x)
that fits the data

i xi f (xi )
0 −1 3
1 0 −4
2 1 5
3 2 −6

In other words, we must have p3 (−1) = 3, p3 (0) = −4, p3 (1) = 5, and p3 (2) = −6.

First, we construct the divided-difference table from this data. The divided differences in the
table are computed as follows:

f [x0 ] = f (x0 ) = 3, f [x1 ] = f (x1 ) = −4, f [x2 ] = f (x2 ) = 5, f [x3 ] = f (x3 ) = −6,

f[x_0, x_1] = \frac{f[x_1] - f[x_0]}{x_1 - x_0} = \frac{-4 - 3}{0 - (-1)} = -7,

f[x_1, x_2] = \frac{f[x_2] - f[x_1]}{x_2 - x_1} = \frac{5 - (-4)}{1 - 0} = 9,

f[x_2, x_3] = \frac{f[x_3] - f[x_2]}{x_3 - x_2} = \frac{-6 - 5}{2 - 1} = -11,

f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = \frac{9 - (-7)}{1 - (-1)} = 8,

f[x_1, x_2, x_3] = \frac{f[x_2, x_3] - f[x_1, x_2]}{x_3 - x_1} = \frac{-11 - 9}{2 - 0} = -10,

f[x_0, x_1, x_2, x_3] = \frac{f[x_1, x_2, x_3] - f[x_0, x_1, x_2]}{x_3 - x_0} = \frac{-10 - 8}{2 - (-1)} = -6.

The resulting divided-difference table is

x0 = −1 f [x0 ] = 3
f [x0 , x1 ] = −7
x1 = 0 f [x1 ] = −4 f [x0 , x1 , x2 ] = 8
f [x1 , x2 ] = 9 f [x0 , x1 , x2 , x3 ] = −6
x2 = 1 f [x2 ] = 5 f [x1 , x2 , x3 ] = −10
f [x2 , x3 ] = −11
x3 = 2 f [x3 ] = −6

It follows that the interpolating polynomial p3 (x) can be expressed in Newton form as follows:

p_3(x) = \sum_{j=0}^{3} f[x_0, \ldots, x_j] \prod_{i=0}^{j-1} (x - x_i)
       = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + f[x_0, x_1, x_2, x_3](x - x_0)(x - x_1)(x - x_2)
       = 3 - 7(x + 1) + 8(x + 1)x - 6(x + 1)x(x - 1).

We see that Newton interpolation produces an interpolating polynomial that is in the Newton form,
with centers x0 = −1, x1 = 0, and x2 = 1. 2

Exercise 5.3.5 Write a Matlab function c=divdiffs(x,y) that computes the divided
difference table from the given data stored in the input vectors x and y, and returns a
vector c consisting of the divided differences f [x0 , . . . , xj ], j = 0, 1, 2, . . . , n, where n + 1
is the length of both x and y.
Once the coefficients have been computed, we can use nested multiplication to evaluate the
resulting interpolating polynomial, which is represented using the Newton form
p_n(x) = \sum_{j=0}^{n} c_j N_j(x)                                                          (5.2)
       = \sum_{j=0}^{n} f[x_0, x_1, \ldots, x_j] \prod_{i=0}^{j-1} (x - x_i)                (5.3)
       = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + \cdots +      (5.4)
         f[x_0, x_1, \ldots, x_n](x - x_0)(x - x_1) \cdots (x - x_{n-1}).                   (5.5)

Algorithm 5.3.6 (Nested Multiplication) Given n + 1 distinct interpolation points
x_0, x_1, . . . , x_n and the coefficients c_j = f[x_0, x_1, . . . , x_j] of the Newton interpolating poly-
nomial p_n(x), the following algorithm computes y = p_n(x) for a given real number x.

bn = cn
for i = n − 1, n − 2, . . . , 0 do
bi = ci + (x − xi )bi+1
end
y = b0
It can be seen that this algorithm closely resembles Horner’s Method, which is a special case of
nested multiplication that works with the power form of a polynomial, whereas nested multiplication
works with the more general Newton form.
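As a companion sketch, Algorithm 5.3.6 can be translated into Matlab as follows; the function name newtonhorner is our own, and c and x are assumed to hold the divided differences f[x_0, . . . , x_j] and the interpolation points, as above.

function y = newtonhorner(c,x,z)
% Sketch of Algorithm 5.3.6: evaluate the Newton form with coefficients c
% and centers x(1:n) at the scalar z, by nested multiplication.
n = length(c) - 1;
b = c(n+1);                        % b_n = c_n
for i = n:-1:1                     % corresponds to i = n-1,...,0 above
    b = c(i) + (z - x(i))*b;       % b_{i-1} = c_{i-1} + (z - x_{i-1}) b_i
end
y = b;                             % y = b_0 = p_n(z)

For instance, newtonhorner([3 -7 8 -6],[-1 0 1 2],0) returns −4, which agrees with the data point (0, −4) of Example 5.3.5 above.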

Example 5.3.7 Consider the interpolating polynomial obtained in the previous example,

p3 (x) = 3 − 7(x + 1) + 8(x + 1)x − 6(x + 1)x(x − 1).

We will use nested multiplication to write this polynomial in the power form

p3 (x) = b3 x3 + b2 x2 + b1 x + b0 .

This requires repeatedly applying nested multiplication to a polynomial of the form

p(x) = c0 + c1 (x − x0 ) + c2 (x − x0 )(x − x1 ) + c3 (x − x0 )(x − x1 )(x − x2 ),

and for each application it will perform the following steps,

b3 = c3
b2 = c2 + (z − x2 )b3
b1 = c1 + (z − x1 )b2
b0 = c0 + (z − x0 )b1 ,

where, in this example, we will set z = 0 each time.


The numbers b0 , b1 , b2 and b3 computed by the algorithm are the coefficients of p(x) in the
Newton form, with the centers x0 , x1 and x2 changed to z, x0 and x1 ; that is,

p(x) = b0 + b1 (x − z) + b2 (x − z)(x − x0 ) + b3 (x − z)(x − x0 )(x − x1 ).

It follows that b0 = p(z), which is why this algorithm is the preferred method for evaluating a
polynomial in Newton form at a given point z.
It should be noted that the algorithm can be derived by writing p(x) in the nested form

p(x) = c0 + (x − x0 )[c1 + (x − x1 )[c2 + (x − x2 )c3 ]]

and computing p(z) as follows:

p(z) = c0 + (z − x0 )[c1 + (z − x1 )[c2 + (z − x2 )c3 ]]


= c0 + (z − x0 )[c1 + (z − x1 )[c2 + (z − x2 )b3 ]]
= c0 + (z − x0 )[c1 + (z − x1 )b2 ]
= c0 + (z − x0 )b1
= b0 .

Initially, we have

p(x) = 3 − 7(x + 1) + 8(x + 1)x − 6(x + 1)x(x − 1),

so the coefficients of p(x) in this Newton form are

c0 = 3, c1 = −7, c2 = 8, c3 = −6,

with the centers


x0 = −1, x1 = 0, x2 = 1.
Applying nested multiplication to these coefficients and centers, with z = 0, yields

b3 = −6
b2 = 8 + (0 − 1)(−6)
= 14
b1 = −7 + (0 − 0)(14)
= −7
b0 = 3 + (0 − (−1))(−7)
= −4.

It follows that

p(x) = −4 + (−7)(x − 0) + 14(x − 0)(x − (−1)) + (−6)(x − 0)(x − (−1))(x − 0)


= −4 − 7x + 14x(x + 1) − 6x2 (x + 1),

and the centers are now 0, −1 and 0.



For the second application of nested multiplication, we have

p(x) = −4 − 7x + 14x(x + 1) − 6x2 (x + 1),

so the coefficients of p(x) in this Newton form are

c0 = −4, c1 = −7, c2 = 14, c3 = −6,

with the centers


x0 = 0, x1 = −1, x2 = 0.
Applying nested multiplication to these coefficients and centers, with z = 0, yields

b3 = −6
b2 = 14 + (0 − 0)(−6)
= 14
b1 = −7 + (0 − (−1))(14)
= 7
b0 = −4 + (0 − 0)(7)
= −4.

It follows that

p(x) = −4 + 7(x − 0) + 14(x − 0)(x − 0) + (−6)(x − 0)(x − 0)(x − (−1))


= −4 + 7x + 14x2 − 6x2 (x + 1),

and the centers are now 0, 0 and −1.


For the third and final application of nested multiplication, we have

p(x) = −4 + 7x + 14x2 − 6x2 (x + 1),

so the coefficients of p(x) in this Newton form are

c0 = −4, c1 = 7, c2 = 14, c3 = −6,

with the centers


x0 = 0, x1 = 0, x2 = −1.
Applying nested multiplication to these coefficients and centers, with z = 0, yields

b3 = −6
b2 = 14 + (0 − (−1))(−6)
= 8
b1 = 7 + (0 − 0)(8)
= 7
b0 = −4 + (0 − 0)(7)
= −4.

It follows that

p(x) = −4 + 7(x − 0) + 8(x − 0)(x − 0) + (−6)(x − 0)(x − 0)(x − 0)


= −4 + 7x + 8x2 − 6x3 ,

and the centers are now 0, 0 and 0. Since all of the centers are equal to zero, the polynomial is
now in the power form. 2

Exercise 5.3.6 Write a Matlab function yy=newtonval(x,y,xx) that accepts as


input vectors x and y, both of length n + 1, representing the data points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), and a vector of x-values xx. This function uses Algorithm
5.3.6, as well as the function divdiffs from Exercise 5.3.5, to compute the value of the
Newton interpolating polynomial for the given data points at each x-value in xx. The
corresponding y-values must be returned in the vector yy, which must have the same
dimensions as xx.
It should be noted that this is not the most efficient way to convert the Newton form of p(x)
to the power form. To see this, we observe that after one application of nested multiplication, we
have

p(x) = b0 + b1 (x − z) + b2 (x − z)(x − x0 ) + b3 (x − z)(x − x0 )(x − x1 )


= b0 + (x − z)[b1 + b2 (x − x0 ) + b3 (x − x0 )(x − x1 )].

Therefore, we can apply nested multiplication to the second-degree polynomial

q(x) = b1 + b2 (x − x0 ) + b3 (x − x0 )(x − x1 ),

which is the quotient obtained by dividing p(x) by (x − z). Because

p(x) = b0 + (x − z)q(x),

it follows that once we have changed all of the centers of q(x) to be equal to z, then all of the centers
of p(x) will be equal to z as well. In summary, we can convert a polynomial of degree n from Newton
form to power form by applying nested multiplication n times, where the jth application is to a
polynomial of degree n − j + 1, for j = 1, 2, . . . , n.
Since the coefficients of the appropriate Newton form of each of these polynomials of successively
lower degree are computed by the nested multiplication algorithm, it follows that we can implement
this more efficient procedure simply by proceeding exactly as before, except that during the jth
application of nested multiplication, we do not compute the coefficients b0 , b1 , . . . , bj−2 , because
they will not change anyway, as can be seen from the previous computations. For example, in the
second application, we did not need to compute b0 , and in the third, we did not need to compute
b0 and b1 .
Exercise 5.3.7 Write a Matlab function p=powerform(x,c) that accepts as input vec-
tors x and c, both of length n + 1, consisting of the interpolation points xj and divided
differences f [x0 , x1 , . . . , xj ], respectively, j = 0, 1, . . . , n. The output is an (n + 1)-vector
consisting of the coefficients of the interpolating polynomial pn (x) in power form, ordered
from highest degree to lowest.

Exercise 5.3.8 Write a Matlab function p=newtonfit(x,y) that accepts as input vec-
tors x and y of length n + 1 consisting of the x- and y-coordinates, respectively, of points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), where the x-values must all be distinct, and returns an (n+1)-
vector p consisting of the coefficients of the Newton interpolating polynomial pn (x), in
power form, with highest-degree coefficient in the first position. Use your divdiffs func-
tion from Exercise 5.3.5 and your powerform function from Exercise 5.3.7. Test your
function by comparing your output to that of the built-in function polyfit.

5.3.3 Equally Spaced Points


Suppose that the interpolation points x0 , x1 , . . . , xn are equally spaced; that is, xi = x0 + ih for
some positive number h. In this case, the Newton interpolating polynomial can be simplified, since
the denominators of all of the divided differences can be expressed in terms of the spacing h. If we
define the forward difference operator ∆ by

∆xk = xk+1 − xk ,

where {xk } is any sequence, then the divided differences f [x0 , x1 , . . . , xk ] are given by
f[x_0, x_1, \ldots, x_k] = \frac{1}{k!\, h^k} \Delta^k f(x_0).    (5.6)
The interpolating polynomial can then be described by the Newton forward-difference for-
mula
p_n(x) = f[x_0] + \sum_{k=1}^{n} \binom{s}{k} \Delta^k f(x_0),    (5.7)

where the new variable s is related to x by


s = \frac{x - x_0}{h},

and the extended binomial coefficient \binom{s}{k} is defined by

\binom{s}{k} = \frac{s(s-1)(s-2)\cdots(s-k+1)}{k!},

where k is a nonnegative integer.

Exercise 5.3.9 Use induction to prove (5.6). Then show that the Newton interpolating
polynomial (5.5) reduces to (5.7) in the case of equally spaced interpolation points.

Example 5.3.8 We will use the Newton forward-difference formula


p_n(x) = f[x_0] + \sum_{k=1}^{n} \binom{s}{k} \Delta^k f(x_0)

to compute the interpolating polynomial p3 (x) that fits the data



i xi f (xi )
0 −1 3
1 0 −4
2 1 5
3 2 −6
In other words, we must have p3 (−1) = 3, p3 (0) = −4, p3 (1) = 5, and p3 (2) = −6. Note that the
interpolation points x0 = −1, x1 = 0, x2 = 1 and x3 = 2 are equally spaced, with spacing h = 1.
To apply the forward-difference formula, we define s = (x − x_0)/h = x + 1 and compute the
extended binomial coefficients

\binom{s}{1} = s = x + 1, \qquad \binom{s}{2} = \frac{s(s-1)}{2} = \frac{x(x+1)}{2}, \qquad \binom{s}{3} = \frac{s(s-1)(s-2)}{6} = \frac{(x+1)x(x-1)}{6},
and then the coefficients

f[x_0] = f(x_0) = 3,

\Delta f(x_0) = f(x_1) - f(x_0) = -4 - 3 = -7,

\Delta^2 f(x_0) = \Delta(\Delta f(x_0)) = \Delta[f(x_1) - f(x_0)] = [f(x_2) - f(x_1)] - [f(x_1) - f(x_0)]
              = f(x_2) - 2f(x_1) + f(x_0) = 5 - 2(-4) + 3 = 16,

\Delta^3 f(x_0) = \Delta(\Delta^2 f(x_0)) = \Delta[f(x_2) - 2f(x_1) + f(x_0)]
              = [f(x_3) - f(x_2)] - 2[f(x_2) - f(x_1)] + [f(x_1) - f(x_0)]
              = f(x_3) - 3f(x_2) + 3f(x_1) - f(x_0) = -6 - 3(5) + 3(-4) - 3 = -36.

It follows that
p_3(x) = f[x_0] + \sum_{k=1}^{3} \binom{s}{k} \Delta^k f(x_0)
       = 3 + \binom{s}{1} \Delta f(x_0) + \binom{s}{2} \Delta^2 f(x_0) + \binom{s}{3} \Delta^3 f(x_0)
       = 3 + (x + 1)(-7) + \frac{x(x+1)}{2}(16) + \frac{(x+1)x(x-1)}{6}(-36)
       = 3 - 7(x + 1) + 8(x + 1)x - 6(x + 1)x(x - 1).

Note that the forward-difference formula computes the same form of the interpolating polynomial
as the Newton divided-difference formula. 2
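The forward differences in this example can also be obtained with the built-in Matlab function diff, which, applied k times to the vector of y-values, produces the differences ∆^k f(x_0), ∆^k f(x_1), and so on; this observation is the starting point for Exercise 5.3.11 below. A minimal check of the computations above:

y = [3 -4 5 -6];
diff(y)        % returns [-7  9 -11]
diff(y,2)      % returns [16 -20]
diff(y,3)      % returns [-36]

The first entry of each result, namely −7, 16 and −36, matches ∆f(x_0), ∆²f(x_0) and ∆³f(x_0) computed above.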

Exercise 5.3.10 Define the backward difference operator ∇ by

∇xk = xk − xk−1 ,

for any sequence {xk }. Then derive the Newton backward-difference formula
p_n(x) = f[x_n] + \sum_{k=1}^{n} (-1)^k \binom{-s}{k} \nabla^k f(x_n),

where s = (x − xn )/h, and the preceding definition of the extended binomial coefficient
applies.

Exercise 5.3.11 Look up the documentation for the Matlab function diff. Then write
functions yy=newtonforwdiff(x,y,xx) and yy=newtonbackdiff(x,y,xx) that use diff
to implement the Newton forward-difference and Newton backward-difference formulas,
respectively, and evaluate the interpolating polynomial pn (x), where n = length(x) − 1,
at the elements of xx. The resulting values must be returned in yy.

5.4 Error Analysis


In some applications, the interpolating polynomial p_n(x) is used to fit a known function f(x) at the
points x_0, . . . , x_n, usually because f(x) is not amenable to tasks such as differentiation or integration,
which are easy to carry out on polynomials, or because it is not easy to evaluate f(x) at points other
than the interpolation points. In such an application, it is possible to determine how well p_n(x)
approximates f(x).

5.4.1 Error Estimation


From Theorem 5.3.3, we can obtain the following result.

Theorem 5.4.1 (Interpolation error) If f is n + 1 times continuously differentiable


on [a, b], and pn (x) is the unique polynomial of degree n that interpolates f (x) at the n + 1
distinct points x0 , x1 , . . . , xn in [a, b], then for each x ∈ [a, b],
f(x) - p_n(x) = \prod_{j=0}^{n} (x - x_j) \frac{f^{(n+1)}(\xi(x))}{(n+1)!},

where ξ(x) ∈ [a, b].

It is interesting to note that the error closely resembles the Taylor remainder Rn (x).

Exercise 5.4.1 Prove Theorem 5.4.1. Hint: work with the Newton interpolating polyno-
mial for the points x0 , x1 , . . . , xn , x.

Exercise 5.4.2 Determine a bound on the error |f (x) − p2 (x)| for x in [0, 1], where
f (x) = ex , and p2 (x) is the interpolating polynomial of f (x) at x0 = 0, x1 = 0.5, and
x2 = 1.
If the number of data points is large, then polynomial interpolation becomes problematic, since
high-degree interpolating polynomials tend to be highly oscillatory, even when the data come from
a smooth function.

Example 5.4.2 Suppose that we wish to approximate the function f (x) = 1/(1+x2 ) on the interval
[−5, 5] with a tenth-degree interpolating polynomial that agrees with f (x) at 11 equally-spaced points
x0 , x1 , . . . , x10 in [−5, 5], where xj = −5+j, for j = 0, 1, . . . , 10. Figure 5.1 shows that the resulting
polynomial is not a good approximation of f (x) on this interval, even though it agrees with f (x) at
the interpolation points. The following MATLAB session shows how the plot in the figure can be
created.

>> % create vector of 11 equally spaced points in [-5,5]


>> x=linspace(-5,5,11);
>> % compute corresponding y-values
>> y=1./(1+x.^2);
>> % compute 10th-degree interpolating polynomial
>> p=polyfit(x,y,10);
>> % for plotting, create vector of 100 equally spaced points
>> xx=linspace(-5,5);
>> % compute corresponding y-values to plot function
>> yy=1./(1+xx.^2);
>> % plot function
>> plot(xx,yy)
>> % tell MATLAB that next plot should be superimposed on
>> % current one
>> hold on
>> % plot polynomial, using polyval to compute values
>> % and a red dashed curve
>> plot(xx,polyval(p,xx),'r--')
>> % indicate interpolation points on plot using circles
>> plot(x,y,'o')
>> % label axes
>> xlabel('x')
>> ylabel('y')
>> % set caption
>> title('Runge''s example')

The example shown in Figure 5.1 is a well-known example of the difficulty of high-degree polynomial
interpolation using equally-spaced points, and it is known as Runge’s example [33]. 2

5.4.2 Chebyshev Interpolation


In general, it is not wise to use a high-degree interpolating polynomial and equally-spaced interpo-
lation points to approximate a function on an interval [a, b] unless this interval is sufficiently small.

Figure 5.1: The function f (x) = 1/(1 + x2 ) (solid curve) cannot be interpolated accurately on
[−5, 5] using a tenth-degree polynomial (dashed curve) with equally-spaced interpolation points.

Is it possible to choose the interpolation points so that the error is minimized? To answer this
question, we introduce the Chebyshev polynomials

T_k(x) = \cos(k \cos^{-1}(x)),    |x| \le 1,    k = 0, 1, 2, \ldots.    (5.8)

Using (5.8) and the sum and difference formulas for cosine,

cos(A + B) = cos A cos B − sin A sin B, (5.9)


cos(A − B) = cos A cos B + sin A sin B, (5.10)

it can be shown that the Chebyshev polynomials satisfy the three-term recurrence relation

Tk+1 (x) = 2xTk (x) − Tk−1 (x), k ≥ 1. (5.11)

It can easily be seen from this relation, and the first two Chebyshev polynomials, that Tk (x) is in
fact a polynomial for all integers k ≥ 0.
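As a quick illustration of the recurrence (5.11), the following Matlab snippet (a sketch of our own) builds the coefficient vectors, in power form with the highest degree first, of T_0 through T_5; multiplication by 2x is carried out with conv.

T = cell(6,1);
T{1} = 1;                          % T_0(x) = 1
T{2} = [1 0];                      % T_1(x) = x
for k = 2:5
    T{k+1} = conv([2 0],T{k}) - [0 0 T{k-1}];   % T_k = 2x T_{k-1} - T_{k-2}
end
T{6}                               % [16 0 -20 0 5 0], i.e., T_5(x) = 16x^5 - 20x^3 + 5x

The leading coefficient 16 = 2^4 is consistent with the first property listed below.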
The Chebyshev polynomials have the following properties of interest:

1. For k ≥ 1, the leading coefficient of T_k(x) is 2^{k-1}.

2. T_k(x) is an even function if k is even, and an odd function if k is odd.

3. The zeros of T_k(x), for k ≥ 1, are

   x_j = \cos\frac{(2j-1)\pi}{2k},    j = 1, 2, \ldots, k.

4. The extrema of T_k(x) on [−1, 1] are

   \tilde{x}_j = \cos\frac{j\pi}{k},    j = 0, 1, \ldots, k,

   and the corresponding extremal values are ±1.

5. |T_k(x)| ≤ 1 on [−1, 1] for all k ≥ 0.

Exercise 5.4.3 Use (5.8), (5.9), and (5.10) to prove (5.11).

Exercise 5.4.4 Use (5.11) and induction to show that the leading coefficient of Tk (x) is
2k−1 , for k ≥ 1.

Exercise 5.4.5 Use the roots of cosine to compute the roots of Tk (x). Show that they are
real, distinct, and lie within (−1, 1). These roots are known as the Chebyshev points.
Let f(x) be a function that is n + 1 times continuously differentiable on [a, b]. If we approximate
f(x) by an nth-degree polynomial p_n(x) that interpolates f(x) at the n + 1 roots of the Chebyshev
polynomial T_{n+1}(x), mapped from [−1, 1] to [a, b],

\xi_j = \frac{1}{2}(b - a) \cos\frac{(2j+1)\pi}{2n+2} + \frac{1}{2}(a + b), \qquad j = 0, 1, \ldots, n,

then the error in this approximation is

f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \left( \frac{b-a}{2} \right)^{n+1} 2^{-n} T_{n+1}(t(x)),

where

t(x) = -1 + \frac{2}{b-a}(x - a)

is the linear map from [a, b] to [−1, 1]. This is because

\prod_{j=0}^{n} (x - \xi_j) = \left( \frac{b-a}{2} \right)^{n+1} \prod_{j=0}^{n} (t(x) - \tau_j) = \left( \frac{b-a}{2} \right)^{n+1} 2^{-n} T_{n+1}(t(x)),

where \tau_j is the jth root of T_{n+1}(t). From |T_{n+1}(t)| \le 1, we obtain

|f(x) - p_n(x)| \le \frac{(b-a)^{n+1}}{2^{2n+1}(n+1)!} \max_{\xi \in [a,b]} |f^{(n+1)}(\xi)|.
It can be shown that using the Chebyshev points leads to a much smaller interpolation error for
the function f(x) = 1/(1 + x^2) from Runge's example [28].
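As a rough numerical check (a sketch of our own, using polyfit and polyval as in the earlier session), one can compare interpolation of Runge's function at equally spaced points with interpolation at the Chebyshev points mapped to [−5, 5]:

f = @(x) 1./(1+x.^2);
n = 10;  a = -5;  b = 5;
xe = linspace(a,b,n+1);                              % equally spaced points
xc = (b-a)/2*cos((2*(0:n)+1)*pi/(2*n+2)) + (a+b)/2;  % mapped Chebyshev points
xx = linspace(a,b,1001);
ee = max(abs(f(xx) - polyval(polyfit(xe,f(xe),n),xx)))
ec = max(abs(f(xx) - polyval(polyfit(xc,f(xc),n),xx)))

The second maximum error is much smaller than the first, in agreement with the bound above.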

5.5 Osculatory Interpolation


Suppose that the interpolation points are perturbed so that two neighboring points xi and xi+1 ,
0 ≤ i < n, approach each other. What happens to the interpolating polynomial? In the limit, as
xi+1 → xi , the interpolating polynomial pn (x) not only satisfies pn (xi ) = yi , but also the condition
p_n'(x_i) = \lim_{x_{i+1} \to x_i} \frac{y_{i+1} - y_i}{x_{i+1} - x_i}.
It follows that in order to ensure uniqueness, the data must specify the value of the derivative of
the interpolating polynomial at xi .
In general, the inclusion of an interpolation point x_i k times within the set x_0, . . . , x_n must be
accompanied by specification of p_n^{(j)}(x_i), j = 0, . . . , k − 1, in order to ensure a unique solution.
These values are used in place of divided differences of identical interpolation points in Newton
interpolation.
Interpolation with repeated interpolation points is called osculatory interpolation, since it
can be viewed as the limit of distinct interpolation points approaching one another, and the term
“osculatory” is based on the Latin word for “kiss”.
Exercise 5.5.1 Use a divided-difference table to compute the interpolating polynomial
p_2(x) for the function f(x) = cos x on [0, π], with interpolation points x_0 = x_1 = 0 and
x_2 = π. That is, p_2(x) must satisfy p_2(0) = f(0), p_2'(0) = f'(0) and p_2(π) = f(π). Then,
extend p_2(x) to a cubic polynomial p_3(x) by also requiring that p_3'(π) = f'(π), updating
the divided-difference table accordingly.

Exercise 5.5.2 Suppose that osculatory interpolation is used to construct the polynomial
p_n(x) that interpolates f(x) at only one x-value, x_0, and satisfies p_n(x_0) = f(x_0),
p_n'(x_0) = f'(x_0), p_n''(x_0) = f''(x_0), and so on, up to p_n^{(n)}(x_0) = f^{(n)}(x_0). What polynomial
approximation of f(x) is obtained?

5.5.1 Hermite Interpolation


In the case where each of the interpolation points x0 , x1 , . . . , xn is repeated exactly once, the
interpolating polynomial for a differentiable function f (x) is called the Hermite polynomial of
f (x), and is denoted by p2n+1 (x), since this polynomial must have degree 2n + 1 in order to satisfy
the 2n + 2 constraints

p2n+1 (xi ) = f (xi ), p02n+1 (xi ) = f 0 (xi ), i = 0, 1, . . . , n.

To satisfy these constraints, we define, for i = 0, 1, . . . , n,

H_i(x) = [L_i(x)]^2 (1 - 2L_i'(x_i)(x - x_i)),    (5.12)

K_i(x) = [L_i(x)]^2 (x - x_i),    (5.13)

where, as before, Li (x) is the ith Lagrange polynomial for the interpolation points x0 , x1 , . . . , xn .
It can be verified directly that these polynomials satisfy, for i, j = 0, 1, . . . , n,

H_i(x_j) = \delta_{ij}, \quad H_i'(x_j) = 0, \quad K_i(x_j) = 0, \quad K_i'(x_j) = \delta_{ij},


where \delta_{ij} is the Kronecker delta, defined by \delta_{ij} = 1 if i = j and \delta_{ij} = 0 if i \neq j.
It follows that

p_{2n+1}(x) = \sum_{i=0}^{n} [f(x_i) H_i(x) + f'(x_i) K_i(x)]

is a polynomial of degree 2n + 1 that satisfies the above constraints.

Exercise 5.5.3 Derive the formulas (5.12), (5.13) for Hi (x) and Ki (x), respectively,
using the specified constraints for these polynomials. Hint: use an approach similar to
that used to derive the formula for Lagrange polynomials.
To prove that this polynomial is the unique polynomial of degree 2n + 1 satisfying these constraints,
we assume that there is another polynomial p̃_{2n+1} of degree 2n + 1 that satisfies them. Because
p_{2n+1}(x_i) = p̃_{2n+1}(x_i) = f(x_i) for i = 0, 1, . . . , n, p_{2n+1} − p̃_{2n+1} has at least n + 1 zeros. It follows
from Rolle's Theorem that p'_{2n+1} − p̃'_{2n+1} has n zeros that lie within the intervals (x_{i−1}, x_i), for
i = 1, 2, . . . , n. Furthermore, because p'_{2n+1}(x_i) = p̃'_{2n+1}(x_i) = f'(x_i) for i = 0, 1, . . . , n, it follows
that p'_{2n+1} − p̃'_{2n+1} has n + 1 additional zeros, for a total of at least 2n + 1. However, p'_{2n+1} − p̃'_{2n+1}
is a polynomial of degree 2n, and the only way that a polynomial of degree 2n can have 2n + 1 zeros
is if it is identically zero. Therefore, p_{2n+1} − p̃_{2n+1} is a constant function, but since this function is
known to have at least n + 1 zeros, that constant must be zero, and the Hermite polynomial is unique.
Using a similar approach as for the Lagrange interpolating polynomial, the following result can
be proved.

Theorem 5.5.1 Let f be 2n + 2 times continuously differentiable on [a, b], and let p2n+1
denote the Hermite polynomial of f with interpolation points x0 , x1 , . . . , xn in [a, b]. Then
there exists a point ξ(x) ∈ [a, b] such that

f(x) - p_{2n+1}(x) = \frac{f^{(2n+2)}(\xi(x))}{(2n+2)!} (x - x_0)^2 (x - x_1)^2 \cdots (x - x_n)^2.

The proof will be left as an exercise.

5.5.2 Divided Differences


The Hermite polynomial can be described using Lagrange polynomials and their derivatives, but
this representation is not practical because of the difficulty of differentiating and evaluating these
polynomials. Instead, one can construct the Hermite polynomial using a divided-difference table, as
discussed previously, in which each entry corresponding to two identical interpolation points is filled
with the value of f 0 (x) at the common point. Then, the Hermite polynomial can be represented
using the Newton divided-difference formula.
Example 5.5.2 We will use Hermite interpolation to construct the third-degree polynomial p3 (x)
that fits f (x) and f 0 (x) at x0 = 0 and x1 = 1. For convenience, we define new interpolation points
zi that list each (distinct) x-value twice:
z2i = z2i+1 = xi , i = 0, 1, . . . , n.

Our data is as follows:

i zi f (zi ) f 0 (zi )
0,1 0 0 1
2,3 1 0 1

In other words, we must have p3 (0) = 0, p03 (0) = 1, p3 (1) = 0, and p03 (1) = 1. To include the values
of f 0 (x) at the two distinct interpolation points, we repeat each point once, so that the number of
interpolation points, including repetitions, is equal to the number of constraints described by the
data.
First, we construct the divided-difference table from this data. The divided differences in the
table are computed as follows:

f [z0 ] = f (z0 ) = 0, f [z1 ] = f (z1 ) = 0, f [z2 ] = f (z2 ) = 0, f [z3 ] = f (z3 ) = 0,

f[z_0, z_1] = f'(z_0) = 1   (since z_0 = z_1),

f[z_1, z_2] = \frac{f[z_2] - f[z_1]}{z_2 - z_1} = \frac{0 - 0}{1 - 0} = 0,

f[z_2, z_3] = f'(z_2) = 1   (since z_2 = z_3),

f[z_0, z_1, z_2] = \frac{f[z_1, z_2] - f[z_0, z_1]}{z_2 - z_0} = \frac{0 - 1}{1 - 0} = -1,

f[z_1, z_2, z_3] = \frac{f[z_2, z_3] - f[z_1, z_2]}{z_3 - z_1} = \frac{1 - 0}{1 - 0} = 1,

f[z_0, z_1, z_2, z_3] = \frac{f[z_1, z_2, z_3] - f[z_0, z_1, z_2]}{z_3 - z_0} = \frac{1 - (-1)}{1 - 0} = 2.

Note that the values of the derivative are used whenever a divided difference of the form f[z_i, z_{i+1}]
is to be computed, where z_i = z_{i+1}. This makes sense because

\lim_{z_{i+1} \to z_i} f[z_i, z_{i+1}] = \lim_{z_{i+1} \to z_i} \frac{f(z_{i+1}) - f(z_i)}{z_{i+1} - z_i} = f'(z_i).
The resulting divided-difference table is
z0 = 0 f [z0 ] = 0
f [z0 , z1 ] = 1
z1 = 0 f [z1 ] = 0 f [z0 , z1 , z2 ] = −1
f [z1 , z2 ] = 0 f [z0 , z1 , z2 , z3 ] = 2
z2 = 1 f [z2 ] = 0 f [z1 , z2 , z3 ] = 1
f [z2 , z3 ] = 1
z3 = 1 f [z3 ] = 0
It follows that the interpolating polynomial p3 (x) can be obtained using the Newton divided-difference
formula as follows:
p_3(x) = \sum_{j=0}^{3} f[z_0, \ldots, z_j] \prod_{i=0}^{j-1} (x - z_i)
       = f[z_0] + f[z_0, z_1](x - z_0) + f[z_0, z_1, z_2](x - z_0)(x - z_1) + f[z_0, z_1, z_2, z_3](x - z_0)(x - z_1)(x - z_2)
       = 0 + (x - 0) + (-1)(x - 0)(x - 0) + 2(x - 0)(x - 0)(x - 1)
       = x - x^2 + 2x^2(x - 1).
We see that Hermite interpolation, using divided differences, produces an interpolating polynomial
that is in the Newton form, with centers z0 = 0, z1 = 0, and z2 = 1. 2
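As a quick check of this example, the polynomial can be expanded into power form and the four interpolation conditions verified in Matlab (a minimal sketch using polyval and polyder):

p = [2 -3 1 0];                    % p_3(x) = x - x^2 + 2x^2(x-1) = 2x^3 - 3x^2 + x
dp = polyder(p);                   % p_3'(x) = 6x^2 - 6x + 1
[polyval(p,[0 1]); polyval(dp,[0 1])]

The result is [0 0; 1 1], confirming that p_3(0) = p_3(1) = 0 and p_3'(0) = p_3'(1) = 1.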

Exercise 5.5.4 Use the Newton form of the Hermite interpolating polynomial to prove
Theorem 5.5.1.

Exercise 5.5.5 Write a Matlab function c=hermdivdiffs(x,y,yp) that computes the


divided difference table from the given data stored in the input vectors x and y, as well
as the derivative values stored in yp, and returns a vector c consisting of the divided
differences f [z0 , . . . , zj ], j = 0, 1, 2, . . . , 2n + 1, where n + 1 is the length of both x and y.

Exercise 5.5.6 Write a Matlab function p=hermpolyfit(x,y,yp) that is similar to


the built-in function polyfit, in that it returns a vector of coefficients, in power form,
for the interpolating polynomial corresponding to the given data, except that Hermite
interpolation is used instead of Lagrange interpolation. Use the function hermdivdiffs
from Exercise 5.5.5 as well as the function powerform from Exercise 5.3.7.

5.6 Piecewise Polynomial Interpolation


We have seen that high-degree polynomial interpolation can be problematic. However, if the fitting
function is only required to have a few continuous derivatives, then one can construct a piecewise
polynomial to fit the data. We now precisely define what we mean by a piecewise polynomial.

Definition 5.6.1 (Piecewise polynomial) Let [a, b] be an interval that is divided into
subintervals [xi , xi+1 ], where i = 0, . . . , n − 1, x0 = a and xn = b. A piecewise polyno-
mial is a function p(x) defined on [a, b] by

p(x) = pi (x), xi−1 ≤ x ≤ xi , i = 1, 2, . . . , n,

where, for i = 1, 2, . . . , n, each function pi (x) is a polynomial defined on [xi−1 , xi ]. The


degree of p(x) is the maximum degree of each polynomial pi (x), for i = 1, 2, . . . , n.

It is essential to note that by this definition, a piecewise polynomial defined on [a, b] is equal to
some polynomial on each subinterval [xi−1 , xi ] of [a, b], for i = 1, 2, . . . , n, but a different polynomial
may be used for each subinterval.
To study the accuracy of piecewise polynomials, we need to work with various function spaces,
including Sobolev spaces; these function spaces are defined in Section B.12.

5.6.1 Piecewise Linear Approximation


We first consider one of the simplest types of piecewise polynomials, a piecewise linear polynomial.
Let f ∈ C[a, b]. Given the points x0 , x1 , . . . , xn defined as above, the linear spline sL (x) that
interpolates f at these points is defined by
s_L(x) = f(x_{i-1}) \frac{x - x_i}{x_{i-1} - x_i} + f(x_i) \frac{x - x_{i-1}}{x_i - x_{i-1}}, \qquad x \in [x_{i-1}, x_i], \quad i = 1, 2, \ldots, n.    (5.14)
The points x0 , x1 , . . . , xn are the knots of the spline.

Exercise 5.6.1 Given that sL (x) must satisfy sL (xi ) = f (xi ) for i = 0, 1, 2, . . . , n, ex-
plain how the formula (5.14) can be derived.

Exercise 5.6.2 Given the points (x0 , f (x0 )), (x1 , f (x1 )), . . . , (xn , f (xn )) plotted in the
xy-plane, explain how sL can easily be graphed. How can the graph be produced in a
single line of Matlab code, given vectors x and y containing the x- and y- coordinates
of these points, respectively?
If f ∈ C 2 [a, b], then by the error in Lagrange interpolation (Theorem 5.4.1), on each subinterval
[xi−1 , xi ], for i = 1, 2, . . . , n, we have
f(x) - s_L(x) = \frac{f''(\xi)}{2} (x - x_{i-1})(x - x_i).
This leads to the following Theorem.

Theorem 5.6.2 Let f ∈ C 2 [a, b], and let sL be the piecewise linear spline defined by
(5.14). For i = 1, 2, . . . , n, let hi = xi − xi−1 , and define h = max1≤i≤n hi . Then

\|f - s_L\|_\infty \le \frac{M}{8} h^2,
where |f 00 (x)| ≤ M on [a, b].

Exercise 5.6.3 Prove Theorem 5.6.2.
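A numerical illustration of Theorem 5.6.2 (a sketch of our own, not a proof) uses f(x) = sin x on [0, π], for which |f''(x)| ≤ M = 1, and evaluates s_L directly from (5.14):

f = @(x) sin(x);
n = 20;  xk = linspace(0,pi,n+1);  yk = f(xk);  h = pi/n;
xx = linspace(0,pi,2001);  sL = zeros(size(xx));
for i = 1:n                        % evaluate (5.14) on each subinterval
    in = xx >= xk(i) & xx <= xk(i+1);
    sL(in) = yk(i)*(xx(in)-xk(i+1))/(xk(i)-xk(i+1)) + ...
             yk(i+1)*(xx(in)-xk(i))/(xk(i+1)-xk(i));
end
[max(abs(f(xx)-sL))  h^2/8]        % observed error versus the bound M*h^2/8

The observed maximum error stays below the bound M h^2/8, as the theorem guarantees.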



In Section 5.4, it was observed in Runge’s example that even when f (x) is smooth, an inter-
polating polynomial of f (x) can be highly oscillatory, depending on the number and placement of
interpolation points. By contrast, one of the most welcome properties of the linear spline sL (x)
is that among all functions in H 1 (a, b) that interpolate f (x) at the knots x0 , x1 , . . . , xn , it is the
“flattest”. That is, for any function v ∈ H 1 (a, b) that interpolates f at the knots,

\|s_L'\|_2 \le \|v'\|_2.

To prove this, we first write

\|v'\|_2^2 = \|v' - s_L'\|_2^2 + 2\langle v' - s_L', s_L' \rangle + \|s_L'\|_2^2.

We then note that on each subinterval [x_{i-1}, x_i], since s_L is a linear function, s_L' is a constant
function, which we denote by

s_L'(x) \equiv m_i = \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}}, \qquad i = 1, 2, \ldots, n.
We then have
\langle v' - s_L', s_L' \rangle = \int_a^b [v'(x) - s_L'(x)]\, s_L'(x)\, dx
 = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} [v'(x) - s_L'(x)]\, s_L'(x)\, dx
 = \sum_{i=1}^{n} m_i \int_{x_{i-1}}^{x_i} [v'(x) - m_i]\, dx
 = \sum_{i=1}^{n} m_i \left[ v(x) - m_i x \right]_{x_{i-1}}^{x_i}
 = \sum_{i=1}^{n} m_i \left[ v(x_i) - v(x_{i-1}) - \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} (x_i - x_{i-1}) \right]
 = 0,

because by assumption, v(x) interpolates f (x) at the knots. This leaves us with

\|v'\|_2^2 = \|v' - s_L'\|_2^2 + \|s_L'\|_2^2,

which establishes the result.

5.6.2 Cubic Spline Interpolation


A major drawback of the linear spline is that it does not have any continuous derivatives. This
is significant when the function to be approximated, f (x), is a smooth function. Therefore, it
is desirable that a piecewise polynomial approximation possess a certain number of continuous
derivatives. This requirement imposes additional constraints on the piecewise polynomial, and
therefore the degree of the polynomials used on each subinterval must be chosen sufficiently high
to ensure that these constraints can be satisfied.

We therefore define a spline of degree k to be a piecewise polynomial of degree k that has k − 1


continuous derivatives. The most commonly used spline is a cubic spline, which we now define.

Definition 5.6.3 (Cubic Spline) Let f(x) be a function defined on an interval [a, b], and
let x_0, x_1, . . . , x_n be n + 1 distinct points in [a, b], where a = x_0 < x_1 < · · · < x_n = b. A
cubic spline, or cubic spline interpolant, is a piecewise polynomial s(x) that satisfies
the following conditions:

1. On each interval [xi−1 , xi ], for i = 1, . . . , n, s(x) = si (x), where si (x) is a cubic


polynomial.

2. s(xi ) = f (xi ) for i = 0, 1, . . . , n.

3. s(x) is twice continuously differentiable on (a, b).

4. One of the following boundary conditions is satisfied:

(a) s''(a) = s''(b) = 0, which are called free or natural boundary conditions,
or
(b) s'(a) = f'(a) and s'(b) = f'(b), which are called clamped boundary conditions.

If s(x) satisfies free boundary conditions, we say that s(x) is a natural spline. The
points x0 , x1 , . . . , xn are called the nodes of s(x).

Clamped boundary conditions are often preferable because they use more information about f (x),
which yields a spline that better approximates f (x) on [a, b]. However, if information about f 0 (x)
is not available, then natural boundary conditions must be used instead.

5.6.2.1 Constructing Cubic Splines


Suppose that we wish to construct a cubic spline interpolant s(x) that fits the given data (x0 , y0 ),
(x1 , y1 ), . . . , (xn , yn ), where a = x0 < x1 < · · · < xn = b, and yi = f (xi ), for some known function
f (x) defined on [a, b]. From the preceding discussion, this spline is a piecewise polynomial of the
form

s(x) = s_i(x) = d_i(x - x_{i-1})^3 + c_i(x - x_{i-1})^2 + b_i(x - x_{i-1}) + a_i, \qquad x_{i-1} \le x \le x_i, \quad i = 1, 2, \ldots, n.    (5.15)
That is, the value of s(x) is obtained by evaluating a different cubic polynomial for each subinterval
[xi−1 , xi ], for i = 1, 2, . . . , n.
We now use the definition of a cubic spline to construct a system of equations that must be
satisfied by the coefficients ai , bi , ci and di for i = 1, 2, . . . , n. We can then compute these coefficients
by solving the system. Because s(x) must fit the given data, we have

s(xi−1 ) = ai = yi−1 = f (xi−1 ), i = 1, 2, . . . , n. (5.16)

If we define hi = xi − xi−1 , for i = 1, 2, . . . , n, and define an+1 = yn , then the requirement that s(x)
is continuous at the interior nodes implies that we must have si (xi ) = si+1 (xi ) for i = 1, 2, . . . , n−1.

Furthermore, because s(x) must fit the given data, we must also have s(xn ) = sn (xn ) = yn . These
conditions lead to the constraints
si (xi ) = di h3i + ci h2i + bi hi + ai = ai+1 = si+1 (xi ), i = 1, 2, . . . , n. (5.17)
To ensure that s(x) has a continuous first derivative at the interior nodes, we require that
s0i (xi )
= s0i+1 (xi ) for i = 1, 2 . . . , n − 1, which imposes the constraints
s0i (xi ) = 3di h2i + 2ci hi + bi = bi+1 = s0i+1 (xi ), i = 1, 2, . . . , n − 1. (5.18)
Similarly, to enforce continuity of the second derivative at the interior nodes, we require that
s00i (xi ) = s00i+1 (xi ) for i = 1, 2, . . . , n − 1, which leads to the constraints
s00i (xi ) = 6di hi + 2ci = 2ci+1 = s00i+1 (xi ), i = 1, 2, . . . , n − 1. (5.19)
There are 4n coefficients to determine, since there are n cubic polynomials, with 4 coefficients
each. However, we have only prescribed 4n − 2 constraints, so we must specify 2 more in order to
determine a unique solution. If we use natural boundary conditions, then these constraints are
s001 (x0 ) = 2c1 = 0, (5.20)
s00n (xn ) = 6dn hn + 2cn = 0. (5.21)
On the other hand, if we use clamped boundary conditions, then our additional constraints are
s01 (x0 ) = b1 = z0 , (5.22)
s0n (xn ) = 3dn h2n + 2cn hn + bn = zn , (5.23)
where zi = f 0 (xi ) for i = 0, 1, . . . , n.
Having determined our constraints that must be satisfied by s(x), we can set up a system of 4n
linear equations based on these constraints, and then solve this system to determine the coefficients
ai , bi , ci , di for i = 1, 2 . . . , n.
However, it is not necessary to construct the matrix for such a system, because it is possible
to instead solve a smaller system of only O(n) equations obtained from the continuity conditions
(5.18) and the boundary conditions (5.20), (5.21) or (5.22), (5.23), depending on whether natural
or clamped boundary conditions, respectively, are imposed. This reduced system is accomplished
by using equations (5.16), (5.17) and (5.19) to eliminate the ai , bi and di , respectively.

Exercise 5.6.4 Show that under natural boundary conditions, the coefficients c_2, . . . , c_n
of the cubic spline (5.15) satisfy the system of equations Ac = b, where A is the tridiagonal matrix

A = \begin{bmatrix}
2(h_1 + h_2) & h_2 & & \\
h_2 & 2(h_2 + h_3) & h_3 & \\
& \ddots & \ddots & \ddots \\
& & h_{n-1} & 2(h_{n-1} + h_n)
\end{bmatrix},

c = \begin{bmatrix} c_2 \\ \vdots \\ c_n \end{bmatrix}, \qquad
b = \begin{bmatrix}
\frac{3}{h_2}(a_3 - a_2) - \frac{3}{h_1}(a_2 - a_1) \\
\vdots \\
\frac{3}{h_n}(a_{n+1} - a_n) - \frac{3}{h_{n-1}}(a_n - a_{n-1})
\end{bmatrix}.

Exercise 5.6.5 Show that under clamped boundary conditions, the coefficients
c_1, . . . , c_{n+1} of the cubic spline (5.15) satisfy the system of equations Ac = b, where

A = \begin{bmatrix}
2h_1 & h_1 & & & & \\
h_1 & 2(h_1 + h_2) & h_2 & & & \\
& h_2 & 2(h_2 + h_3) & h_3 & & \\
& & \ddots & \ddots & \ddots & \\
& & & h_{n-1} & 2(h_{n-1} + h_n) & h_n \\
& & & & h_n & 2h_n
\end{bmatrix},

c = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_{n+1} \end{bmatrix}, \qquad
b = \begin{bmatrix}
\frac{3}{h_1}(a_2 - a_1) - 3z_0 \\
\frac{3}{h_2}(a_3 - a_2) - \frac{3}{h_1}(a_2 - a_1) \\
\vdots \\
\frac{3}{h_n}(a_{n+1} - a_n) - \frac{3}{h_{n-1}}(a_n - a_{n-1}) \\
3z_n - \frac{3}{h_n}(a_{n+1} - a_n)
\end{bmatrix},

and c_{n+1} = s_n''(x_n)/2.
Example 5.6.4 We will construct a cubic spline interpolant for the following data on the interval
[0, 2].
j xj yj
0 0 3
1 1/2 −4
2 1 5
3 3/2 −6
4 2 7
The spline, s(x), will consist of four pieces {sj (x)}4j=1 , each of which is a cubic polynomial of the
form
sj (x) = aj + bj (x − xj−1 ) + cj (x − xj−1 )2 + dj (x − xj−1 )3 , j = 1, 2, 3, 4.
We will impose natural boundary conditions on this spline, so it will satisfy the conditions s00 (0) =
s00 (2) = 0, in addition to the “essential” conditions imposed on a spline: it must fit the given data
and have continuous first and second derivatives on the interval [0, 2].
These conditions lead to the following system of equations that must be solved for the coefficients
c1 , c2 , c3 , c4 , and c5 , where cj = s00 (xj−1 )/2 for j = 1, 2, . . . , 5. We define h = (2 − 0)/4 = 1/2 to be
the spacing between the interpolation points.
c_1 = 0,
\frac{h}{3}(c_1 + 4c_2 + c_3) = \frac{y_2 - 2y_1 + y_0}{h},
\frac{h}{3}(c_2 + 4c_3 + c_4) = \frac{y_3 - 2y_2 + y_1}{h},
\frac{h}{3}(c_3 + 4c_4 + c_5) = \frac{y_4 - 2y_3 + y_2}{h},
c_5 = 0.

Substituting h = 1/2 and the values of yj , and also taking into account the boundary conditions,
we obtain
\frac{1}{6}(4c_2 + c_3) = 32,
\frac{1}{6}(c_2 + 4c_3 + c_4) = -40,
\frac{1}{6}(c_3 + 4c_4) = 48.
This system has the solutions
c2 = 516/7, c3 = −720/7, c4 = 684/7.
Using (5.16), (5.17), and (5.19), we obtain
a1 = 3, a2 = −4, a3 = 5, a4 = −6,
b1 = −184/7, b2 = 74/7, b3 = −4, b4 = −46/7,
and
d1 = 344/7, d2 = −824/7, d3 = 936/7, d4 = −456/7.
We conclude that the spline s(x) that fits the given data, has two continuous derivatives on [0, 2],
and satisfies natural boundary conditions is

s(x) = \begin{cases}
\frac{344}{7}x^3 - \frac{184}{7}x + 3 & x \in [0, 0.5] \\
-\frac{824}{7}(x - 1/2)^3 + \frac{516}{7}(x - 1/2)^2 + \frac{74}{7}(x - 1/2) - 4 & x \in [0.5, 1] \\
\frac{936}{7}(x - 1)^3 - \frac{720}{7}(x - 1)^2 - 4(x - 1) + 5 & x \in [1, 1.5] \\
-\frac{456}{7}(x - 3/2)^3 + \frac{684}{7}(x - 3/2)^2 - \frac{46}{7}(x - 3/2) - 6 & x \in [1.5, 2].
\end{cases}
The graph of the spline is shown in Figure 5.2. 2
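The coefficients in this example can be checked by setting up and solving the tridiagonal system described in Exercise 5.6.4 above; the following Matlab snippet is a sketch of that computation for the given data.

xk = [0 0.5 1 1.5 2];  yk = [3 -4 5 -6 7];
n = length(xk)-1;  h = diff(xk);  a = yk;        % a_i = y_{i-1}
A = diag(2*(h(1:n-1)+h(2:n))) + diag(h(2:n-1),1) + diag(h(2:n-1),-1);
rhs = 3*(a(3:n+1)-a(2:n))./h(2:n) - 3*(a(2:n)-a(1:n-1))./h(1:n-1);
c = [0, (A\rhs')', 0]              % c_1,...,c_5 with natural boundary conditions

The interior values agree with c_2 = 516/7, c_3 = −720/7 and c_4 = 684/7 computed above, and the remaining coefficients follow from (5.16), (5.17) and (5.19).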
The Matlab function spline can be used to construct cubic spline interpolants satisfying either
“not-a-knot” boundary conditions (its default, which are not the same as the natural boundary
conditions described above) or clamped boundary conditions. The following exercises require
reading the documentation for this function.
Exercise 5.6.6 Use spline to construct a cubic spline for the data from Example 5.6.4.
First, use the interface pp=spline(x,y), where x and y are vectors consisting of the x-
and y-coordinates, respectively, of the given data points, and pp is a structure that rep-
resents the cubic spline s(x). Examine the members of pp and determine how to interpret
them. Where do you see the coefficients computed in Example 5.6.4?

Exercise 5.6.7 The interface yy=spline(x,y,xx), where xx is a vector of x-values at


which the spline constructed from x and y should be evaluated, produces a vector yy of
corresponding y-values. Use this interface on the data from Example 5.6.4 to reproduce
Figure 5.2.

Exercise 5.6.8 If the input argument y in the function call pp=spline(x,y) has two
components more than x, it is assumed that the first and last components are the slopes
z0 = s0 (x0 ) and zn = s0 (xn ) imposed by clamped boundary conditions. Use the given data
from Example 5.6.4, with various values of z0 and zn , and construct the clamped cubic
spline using this interface to spline. Compare the coefficients and graphs to that of the
natural cubic spline from Exercises 5.6.6 and 5.6.7.

Figure 5.2: Cubic spline passing through the points (0, 3), (1/2, −4), (1, 5), (3/2, −6), and (2, 7).

5.6.2.2 Well-Posedness and Accuracy


For both sets of boundary conditions, the system Ac = b has a unique solution, because the
matrix A is strictly row diagonally dominant. This property guarantees that A is invertible, due
to Gerschgorin’s Circle Theorem. We therefore have the following results.

Theorem 5.6.5 Let x0 , x1 , . . . , xn be n + 1 distinct points in the interval [a, b], where
a = x0 < x1 < · · · < xn = b, and let f (x) be a function defined on [a, b]. Then f
has a unique cubic spline interpolant s(x) that is defined on the nodes x0 , x1 , . . . , xn that
satisfies the natural boundary conditions s00 (a) = s00 (b) = 0.

Theorem 5.6.6 Let x0 , x1 , . . . , xn be n + 1 distinct points in the interval [a, b], where
a = x0 < x1 < · · · < xn = b, and let f (x) be a function defined on [a, b] that is
differentiable at a and b. Then f has a unique cubic spline interpolant s(x) that is defined
on the nodes x0 , x1 , . . . , xn that satisfies the clamped boundary conditions s0 (a) = f 0 (a)
and s0 (b) = f 0 (b).

Just as the linear spline is the “flattest” interpolant, in an average sense, the natural cubic spline
has the least “average curvature”. Specifically, if s2 (x) is the natural cubic spline for f ∈ C[a, b] on
[a, b] with knots a = x0 < x1 < · · · < xn = b, and v ∈ H 2 (a, b) is any interpolant of f with these
knots, then
\|s_2''\|_2 \le \|v''\|_2.
This can be proved in the same way as the corresponding result for the linear spline. It is this
property of the natural cubic spline, called the smoothest interpolation property, from which
splines were named.
The following result, proved in [34, p. 57-58], provides insight into the accuracy with which a
cubic spline interpolant s(x) approximates a function f (x).

Theorem 5.6.7 Let f be four times continuously differentiable on [a, b], and assume
that \|f^{(4)}\|_\infty = M. Let s(x) be the unique clamped cubic spline interpolant of f(x) on
the nodes x_0, x_1, . . . , x_n, where a = x_0 < x_1 < · · · < x_n = b. Then for x ∈ [a, b],

\|f(x) - s(x)\|_\infty \le \frac{5M}{384} \max_{1 \le i \le n} h_i^4,

where h_i = x_i − x_{i−1}.
A similar result applies in the case of natural boundary conditions [6].

5.6.2.3 Hermite Cubic Splines


We have seen that it is possible to construct a piecewise cubic polynomial that interpolates a
function f (x) at knots a = x0 < x1 < · · · < xn = b, that belongs to C 2 [a, b]. Now, suppose that
we also know the values of f 0 (x) at the knots. We wish to construct a piecewise cubic polynomial
s(x) that agrees with f (x), and whose derivative agrees with f 0 (x) at the knots. This piecewise
polynomial is called a Hermite cubic spline.
Because s(x) is cubic on each subinterval [xi−1 , xi ] for i = 1, 2, . . . , n, there are 4n coefficients,
and therefore 4n degrees of freedom, that can be used to satisfy any criteria that are imposed on

s(x). Requiring that s(x) interpolates f (x) at the knots, and that s0 (x) interpolates f 0 (x) at the
knots, imposes 2n + 2 constraints on the coefficients. We can then use the remaining 2n − 2 degrees
of freedom to require that s(x) belong to C 1 [a, b]; that is, it is continuously differentiable on [a, b].
Note that unlike the cubic spline interpolant, the Hermite cubic spline does not have a continuous
second derivative.
The following result provides insight into the accuracy with which a Hermite cubic spline inter-
polant s(x) approximates a function f (x).

Theorem 5.6.8 Let f be four times continuously differentiable on [a, b], and assume that
kf (4) k∞ = M . Let s(x) be the unique Hermite cubic spline interpolant of f (x) on the
nodes x0 , x1 , . . . , xn , where a = x0 < x1 < · · · < xn < b. Then
M
kf (x) − s(x)k∞ ≤ max h4 ,
384 1≤i≤n i
where hi = xi − xi−1 .
This can be proved in the same way as the error bound for the linear spline, except that the error
formula for Hermite interpolation is used instead of the error formula for Lagrange interpolation.
Exercise 5.6.9 Prove Theorem 5.6.8.
An advantage of Hermite cubic splines over cubic spline interpolants is that they are local
approximations rather than global; that is, if the values of f (x) and f 0 (x) change at some knot
xi , only the polynomials defined on the pieces containing xi need to be changed. In cubic spline
interpolation, all pieces are coupled, so a change at one point changes the polynomials for all pieces.
To see this, we represent the Hermite cubic spline using the same form as in the cubic spline
interpolant,

si (x) = ai + bi (x − xi−1 ) + ci (x − xi−1 )2 + di (x − xi−1 )3 , x ∈ [xi−1 , xi ], (5.24)

for i = 1, 2, . . . , n. Then, the coefficients ai , bi , ci , di can be determined explicitly in terms of only


f (xi−1 ), f 0 (xi−1 ), f (xi ) and f 0 (xi ).

Exercise 5.6.10 Use the conditions

si (xi−1 ) = f (xi−1 ), si (xi ) = f (xi ), s0i (xi−1 ) = f 0 (xi−1 ), s0i (xi ) = f 0 (xi )

to obtain the values of the coefficients ai , bi , ci , di in (5.24).

Exercise 5.6.11 Write a Matlab function hp=hermitespline(x,y) that constructs a


Hermite cubic spline for the data given in the vectors x and y. The output hp should
be a structure that contains enough information so that the spline can be evaluated at
any x-value without having to specify any additional parameters. Write a second function
y=hsplineval(hp,x) that performs this evaluation.
Chapter 6

Approximation of Functions

Previously we have considered the problem of polynomial interpolation, in which a function f (x)
is approximated by a polynomial pn (x) that agrees with f (x) at n + 1 distinct points, based on the
assumption that pn (x) will be, in some sense, a good approximation of f (x) at other points. As we
have seen, however, this assumption is not always valid, and in fact, such an approximation can be
quite poor, as demonstrated by Runge’s example.
Therefore, we consider an alternative approach to approximation of a function f (x) on an
interval [a, b] by a polynomial, in which the polynomial is not required to agree with f at any
specific points, but rather approximate f well in an “overall” sense, by not deviating much from
f at any point in [a, b]. This requires that we define an appropriate notion of “distance” between
functions that is, intuitively, consistent with our understanding of distance between numbers or
points in space.
To that end, we can use vector norms, as defined in Section B.11, where the vectors in question
consist of the values of functions at selected points. In this case, the problem can be reduced to a
least squares problem; this approach is discussed in Section 6.1.
Still, finding an approximation of f (x) that is accurate with respect to any discrete, finite
subset of the domain cannot guarantee that it accurately approximates f (x) on the entire domain.
Therefore, in Section 6.2 we generalize least squares approximations to a continuous setting by
working with norms on function spaces, which are vector spaces in which the vectors are
functions. Such function spaces and norms are reviewed in Section B.12.
In the remainder of the chapter, we consider approximating f (x) by functions other than poly-
nomials. Section 6.3 presents an approach to approximating f (x) with a rational function, to
overcome the limitations of polynomial approximation, while Section 6.4 explores approximation
through trigonometric polynomials, or sines and cosines, to capture the frequency content of f (x).

6.1 Discrete Least Squares Approximations


As stated previously, one of the most fundamental problems in science and engineering is data
fitting–constructing a function that, in some sense, conforms to given data points. So far, we have
discussed two data-fitting techniques, polynomial interpolation and piecewise polynomial interpo-
lation. Interpolation techniques, of any kind, construct functions that agree exactly with the data.
That is, given points (x1 , y1 ), (x2 , y2 ), . . ., (xm , ym ), interpolation yields a function f (x) such that


f (xi ) = yi for i = 1, 2, . . . , m.
However, fitting the data exactly may not be the best approach to describing the data with a
function. We have seen that high-degree polynomial interpolation can yield oscillatory functions
that behave very differently than a smooth function from which the data is obtained. Also, it
may be pointless to try to fit data exactly, for if it is obtained by previous measurements or other
computations, it may be erroneous. Therefore, we consider revising our notion of what constitutes
a “best fit” of given data by a function.   
Let f = [f(x_1)  f(x_2)  \cdots  f(x_m)] and let y = [y_1  y_2  \cdots  y_m]. One alternative ap-
proach to data fitting is to solve the minimax problem, which is the problem of finding a function
f(x) of a given form for which

\|f - y\|_\infty = \max_{1 \le i \le m} |f(x_i) - y_i|

is minimized. However, this is a very difficult problem to solve.


Another approach is to minimize the total absolute deviation of f (x) from the data. That
is, we seek a function f (x) of a given form for which
\|f - y\|_1 = \sum_{i=1}^{m} |f(x_i) - y_i|

is minimized. However, we cannot apply standard minimization techniques to this function, be-
cause, like the absolute value function that it employs, it is not differentiable.
This defect is overcome by considering the problem of finding f (x) of a given form for which
\|f - y\|_2^2 = \sum_{i=1}^{m} [f(x_i) - y_i]^2

is minimized. This is known as the least squares problem. We will first show how this problem is
solved for the case where f (x) is a linear function of the form f (x) = a1 x + a0 , and then generalize
this solution to other types of functions.
When f (x) is linear, the least squares problem is the problem of finding constants a0 and a1
such that the function
E(a_0, a_1) = \sum_{i=1}^{m} (a_1 x_i + a_0 - y_i)^2
is minimized. In order to minimize this function of a0 and a1 , we must compute its partial derivatives
with respect to a0 and a1 . This yields
\frac{\partial E}{\partial a_0} = \sum_{i=1}^{m} 2(a_1 x_i + a_0 - y_i), \qquad \frac{\partial E}{\partial a_1} = \sum_{i=1}^{m} 2(a_1 x_i + a_0 - y_i) x_i.

At a minimum, both of these partial derivatives must be equal to zero. This yields the system of
linear equations
m a_0 + \left( \sum_{i=1}^{m} x_i \right) a_1 = \sum_{i=1}^{m} y_i,

\left( \sum_{i=1}^{m} x_i \right) a_0 + \left( \sum_{i=1}^{m} x_i^2 \right) a_1 = \sum_{i=1}^{m} x_i y_i.

These equations are called the normal equations.


Using the formula for the inverse of a 2 × 2 matrix,

\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix},

we obtain the solutions


a_0 = \frac{\left( \sum_{i=1}^{m} x_i^2 \right) \left( \sum_{i=1}^{m} y_i \right) - \left( \sum_{i=1}^{m} x_i \right) \left( \sum_{i=1}^{m} x_i y_i \right)}{m \sum_{i=1}^{m} x_i^2 - \left( \sum_{i=1}^{m} x_i \right)^2},

a_1 = \frac{m \sum_{i=1}^{m} x_i y_i - \left( \sum_{i=1}^{m} x_i \right) \left( \sum_{i=1}^{m} y_i \right)}{m \sum_{i=1}^{m} x_i^2 - \left( \sum_{i=1}^{m} x_i \right)^2}.

Example 6.1.1 We wish to find the linear function y = a1 x + a0 that best approximates the data
shown in Table 6.1, in the least-squares sense. Using the summations

Table 6.1: Data points (xi , yi ), for i = 1, 2, . . . , 10, to be fit by a linear function
i xi yi
1 2.0774 3.3123
2 2.3049 3.8982
3 3.0125 4.6500
4 4.7092 6.5576
5 5.5016 7.5173
6 5.8704 7.0415
7 6.2248 7.7497
8 8.4431 11.0451
9 8.7594 9.8179
10 9.3900 12.2477

\sum_{i=1}^{m} x_i = 56.2933, \quad \sum_{i=1}^{m} x_i^2 = 380.5426, \quad \sum_{i=1}^{m} y_i = 73.8373, \quad \sum_{i=1}^{m} x_i y_i = 485.9487,

we obtain
a_0 = \frac{380.5426 \cdot 73.8373 - 56.2933 \cdot 485.9487}{10 \cdot 380.5426 - 56.2933^2} = \frac{742.5703}{636.4906} = 1.1667,

a_1 = \frac{10 \cdot 485.9487 - 56.2933 \cdot 73.8373}{10 \cdot 380.5426 - 56.2933^2} = \frac{702.9438}{636.4906} = 1.1044.

We conclude that the linear function that best fits this data in the least-squares sense is

y = 1.1044x + 1.1667.

The data, and this function, are shown in Figure 6.1. 2



Figure 6.1: Data points (xi , yi ) (circles) and least-squares line (solid line)

Exercise 6.1.1 Write a Matlab function [m,b]=leastsqline(x,y) that computes the


slope a1 = m and y-intercept a0 = b of the line y = a1 x + a0 that best fits the data (xi , yi ),
i = 1, 2, . . . , m where m = length(x), in the least-squares sense.

Exercise 6.1.2 Generalize the above derivation of the coefficients a0 and a1 of the least-
squares line to obtain formulas for the coefficients a, b and c of the quadratic function
y = ax2 + bx + c that best fits the data (xi , yi ), i = 1, 2, . . . , m, in the least-squares sense.
Then generalize your function leastsqline from Exercise 6.1.1 to obtain a new function
leastsqquad that computes these coefficients.

It is interesting to note that if we define the m × 2 matrix A, the 2-vector a, and the m-vector
y by

A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}, \qquad a = \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix},
then a is the solution to the system of equations

AT Aa = AT y.

These equations are the normal equations defined earlier, written in matrix-vector form. They arise

from the problem of finding the vector a such that

\|Aa - y\|_2

is minimized, where \|u\|_2 denotes the magnitude, or length, of a vector u.
In this case, this expression is equivalent to the square root of the expression we originally
intended to minimize,

\sum_{i=1}^{m} (a_1 x_i + a_0 - y_i)^2,

but the normal equations also characterize the solution a, an n-vector, to the more general linear
least squares problem of minimizing \|Aa - y\|_2 for any matrix A that is m × n, where m ≥ n, whose
columns are linearly independent.
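In Matlab, the backslash operator can be used to compute the least squares solution of such an overdetermined system directly, without explicitly forming A^T A. The following sketch reproduces the line fit of Example 6.1.1 above; readers may also wish to recover the same values from the closed-form expressions given earlier.

x = [2.0774 2.3049 3.0125 4.7092 5.5016 5.8704 6.2248 8.4431 8.7594 9.3900]';
y = [3.3123 3.8982 4.6500 6.5576 7.5173 7.0415 7.7497 11.0451 9.8179 12.2477]';
A = [ones(size(x)) x];             % columns correspond to a_0 and a_1
a = A\y                            % approximately [1.1667; 1.1044], as in Example 6.1.1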
We now consider the problem of finding a polynomial of degree n that gives the best least-
squares fit. As before, let (x1 , y1 ), (x2 , y2 ), . . ., (xm , ym ) be given data points that need to be
approximated by a polynomial of degree n. We assume that n < m − 1, for otherwise, we can use
polynomial interpolation to fit the points exactly.
Let the least-squares polynomial have the form
p_n(x) = \sum_{j=0}^{n} a_j x^j.

Our goal is to minimize the sum of squares of the deviations in p_n(x) from each y-value,

E(a) = \sum_{i=1}^{m} [p_n(x_i) - y_i]^2 = \sum_{i=1}^{m} \left( \sum_{j=0}^{n} a_j x_i^j - y_i \right)^2,

where a is a column vector of the unknown coefficients of p_n(x),

a = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}.
Differentiating this function with respect to each a_k yields

\frac{\partial E}{\partial a_k} = 2 \sum_{i=1}^{m} \left( \sum_{j=0}^{n} a_j x_i^j - y_i \right) x_i^k, \qquad k = 0, 1, \ldots, n.

Setting each of these partial derivatives equal to zero yields the system of equations

\sum_{j=0}^{n} \left( \sum_{i=1}^{m} x_i^{j+k} \right) a_j = \sum_{i=1}^{m} x_i^k y_i, \qquad k = 0, 1, \ldots, n.

These are the normal equations. They are a generalization of the normal equations previously
defined for the linear case, where n = 1. Solving this system yields the coefficients {aj }nj=0 of the
least-squares polynomial pn (x).

As in the linear case, the normal equations can be written in matrix-vector form
A^T A a = A^T y,

where

A = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^n \end{bmatrix}, \qquad a = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.    (6.1)
The matrix A is called a Vandermonde matrix for the points x0 , x1 , . . . , xm .
The normal equations can be used to compute the coefficients of any linear combination of functions {φ_j(x)}_{j=0}^n that best fits data in the least-squares sense, provided that these functions are linearly independent. In this general case, the entries of the matrix A are given by a_{ij} = φ_j(x_i), for i = 1, 2, \ldots, m and j = 0, 1, \ldots, n.
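The following Matlab fragment is a minimal sketch of such a fit with the monomial basis φ_j(x) = x^j; the data and the degree are hypothetical.

x = linspace(0,1,20)';                         % hypothetical data
y = 1 + 2*x - 3*x.^2 + 0.1*randn(20,1);
n = 2;                                         % hypothetical degree
A = zeros(length(x), n+1);
for j = 0:n
    A(:,j+1) = x.^j;                           % column j+1 holds phi_j evaluated at the data
end
a = A\y;                                       % least-squares solution of A*a ~ y
% a(1), a(2), ..., a(n+1) are the coefficients a_0, a_1, ..., a_n.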

Example 6.1.2 We wish to find the quadratic function y = a2 x2 + a1 x + a0 that best approximates
the data shown in Table 6.2, in the least-squares sense. By defining

Table 6.2: Data points (xi , yi ), for i = 1, 2, . . . , 10, to be fit by a quadratic function
i xi yi
1 2.0774 2.7212
2 2.3049 3.7798
3 3.0125 4.8774
4 4.7092 6.6596
5 5.5016 10.5966
6 5.8704 9.8786
7 6.2248 10.5232
8 8.4431 23.3574
9 8.7594 24.0510
10 9.3900 27.4827

\[ A = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{10} & x_{10}^2 \end{bmatrix}, \quad a = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{10} \end{bmatrix}, \]
and solving the normal equations
\[ A^T A a = A^T y, \]
we obtain the coefficients
\[ a_0 = 4.7681, \quad a_1 = -1.5193, \quad a_2 = 0.4251, \]
and conclude that the quadratic function that best fits this data in the least-squares sense is
\[ y = 0.4251x^2 - 1.5193x + 4.7681. \]
The data, and this function, are shown in Figure 6.2. 2

Figure 6.2: Data points (xi , yi ) (circles) and quadratic least-squares fit (solid curve)

Exercise 6.1.3 Write a Matlab function a=leastsqpoly(x,y,n) that computes the coefficients a_j, j = 0, 1, \ldots, n of the polynomial of degree n that best fits the data (x_i, y_i) in the least-squares sense. Use the vander function to easily construct the Vandermonde matrix A used in the normal equations. Make sure you solve the normal equations without explicitly computing A^T A. Test your function on the data from Example 6.1.2, with different values of n, but with n < 10. How does the residual \|Aa - y\|_2 behave as n increases?

Exercise 6.1.4 Use your function leastsqpoly from Exercise 6.1.3 to approximate the function y = e^{-cx} on the interval [0, 1], where c is a chosen positive constant. Experiment with different values of c, as well as m and n, the number of data points and degree of the approximating polynomial, respectively. What combination yields the smallest relative residual \|Aa - y\|_2 / \|y\|_2?

Least-squares fitting can also be used to fit data with functions that are not linear combinations
of functions such as polynomials. Suppose we believe that given data points can best be matched
to an exponential function of the form y = beax , where the constants a and b are unknown. Taking
the natural logarithm of both sides of this equation yields

ln y = ln b + ax.

If we define z = ln y and c = ln b, then the problem of fitting the original data points {(x_i, y_i)}_{i=1}^m with an exponential function is transformed into the problem of fitting the data points {(x_i, z_i)}_{i=1}^m with a linear function of the form c + ax, for unknown constants a and c.
Similarly, suppose the given data is believed to approximately conform to a function of the form y = bx^a, where the constants a and b are unknown. Taking the natural logarithm of both sides of this equation yields
\[ \ln y = \ln b + a \ln x. \]
If we define z = ln y, c = ln b and w = ln x, then the problem of fitting the original data points {(x_i, y_i)}_{i=1}^m with a constant times a power of x is transformed into the problem of fitting the data points {(w_i, z_i)}_{i=1}^m with a linear function of the form c + aw, for unknown constants a and c.
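The following Matlab fragment is a minimal sketch of the exponential case via the logarithmic transformation; the data are hypothetical, and all y(i) are assumed positive.

x = (1:8)';                                     % hypothetical positive data
y = 0.8*exp(0.4*x).*exp(0.05*randn(8,1));
z = log(y);                                     % z = ln y
A = [ones(length(x),1) x];
c = A\z;                                        % fit z ~ c(1) + c(2)*x in the least-squares sense
b = exp(c(1));                                  % b = e^c
a = c(2);
% The fitted model is y ~ b*exp(a*x); the leastsqexp function of Exercise 6.1.5
% asks for essentially this computation packaged as a function.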

Example 6.1.3 We wish to find the exponential function y = beax that best approximates the data
shown in Table 6.3, in the least-squares sense. By defining

Table 6.3: Data points (xi , yi ), for i = 1, 2, . . . , 5, to be fit by an exponential function


i xi yi
1 2.0774 1.4509
2 2.3049 2.8462
3 3.0125 2.1536
4 4.7092 4.7438
5 5.5016 7.7260

   
\[ A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_5 \end{bmatrix}, \quad c = \begin{bmatrix} c \\ a \end{bmatrix}, \quad z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_5 \end{bmatrix}, \]
where c = ln b and z_i = ln y_i for i = 1, 2, \ldots, 5, and solving the normal equations
\[ A^T A c = A^T z, \]
we obtain the coefficients
\[ a = 0.4040, \quad b = e^c = e^{-0.2652} = 0.7670, \]
and conclude that the exponential function that best fits this data in the least-squares sense is
\[ y = 0.7670 e^{0.4040x}. \]
2

Exercise 6.1.5 Write a Matlab function [a,b]=leastsqexp(x,y) that computes the coefficients a and b of a function y = be^{ax} that fits the given data (x_i, y_i), i = 1, 2, \ldots, m, where m = length(x), in the least-squares sense.

Exercise 6.1.6 Write a Matlab function [a,b]=leastsqpower(x,y) that computes the coefficients a and b of a function y = bx^a that fits the given data (x_i, y_i), i = 1, 2, \ldots, m, where m = length(x), in the least-squares sense.

6.2 Continuous Least Squares Approximation


Now, suppose we have a continuous set of data. That is, we have a function f(x) defined on an interval [a, b], and we wish to approximate it as closely as possible, in some sense, by a function f_n(x) that is a linear combination of given real-valued functions {φ_j(x)}_{j=0}^n. If we choose m equally spaced points {x_i}_{i=1}^m in [a, b], and let m → ∞, we obtain the continuous least-squares problem of finding the function
\[ f_n(x) = \sum_{j=0}^n c_j \varphi_j(x) \]
that minimizes
\[ E(c_0, c_1, \ldots, c_n) = \|f_n - f\|_2^2 = \int_a^b [f_n(x) - f(x)]^2\, dx = \int_a^b \left( \sum_{j=0}^n c_j \varphi_j(x) - f(x) \right)^2 dx, \]

where
\[ \|f_n - f\|_2 = \left( \int_a^b [f_n(x) - f(x)]^2\, dx \right)^{1/2}. \]
We refer to f_n as the best approximation in span(φ_0, φ_1, \ldots, φ_n) to f in the 2-norm on (a, b).
This minimization can be performed for f ∈ C[a, b], the space of functions that are continuous on [a, b], but it is not necessary for a function f(x) to be continuous for \|f\|_2 to be defined. Rather, we consider the space L^2(a, b), the space of real-valued functions such that |f(x)|^2 is integrable over (a, b). Both of these spaces, in addition to being normed spaces, are also inner product spaces, as they are equipped with an inner product
\[ \langle f, g \rangle = \int_a^b f(x) g(x)\, dx. \]
Such spaces are reviewed in Section B.13.


To obtain the coefficients {c_j}_{j=0}^n, we can proceed as in the discrete case. We compute the partial derivatives of E(c_0, c_1, \ldots, c_n) with respect to each c_k and obtain
\[ \frac{\partial E}{\partial c_k} = 2 \int_a^b \varphi_k(x) \left( \sum_{j=0}^n c_j \varphi_j(x) - f(x) \right) dx, \]
and requiring that each partial derivative be equal to zero yields the normal equations
\[ \sum_{j=0}^n \left( \int_a^b \varphi_k(x) \varphi_j(x)\, dx \right) c_j = \int_a^b \varphi_k(x) f(x)\, dx, \quad k = 0, 1, \ldots, n. \]

We can then solve this system of equations to obtain the coefficients {c_j}_{j=0}^n. This system can be solved as long as the functions {φ_j(x)}_{j=0}^n are linearly independent. That is, the condition
\[ \sum_{j=0}^n c_j \varphi_j(x) \equiv 0, \quad x \in [a, b], \]
is only true if c_0 = c_1 = \cdots = c_n = 0. In particular, this is the case if, for j = 0, 1, \ldots, n, φ_j(x) is a polynomial of degree j.

Exercise 6.2.1 Prove that the functions {φ_j(x)}_{j=0}^n are linearly independent in C[a, b] if, for j = 0, 1, \ldots, n, φ_j(x) is a polynomial of degree j.

Exercise 6.2.2 Let A be the (n + 1) × (n + 1) matrix defined by
\[ a_{ij} = \int_a^b \varphi_i(x) \varphi_j(x)\, dx, \]
where the functions {φ_j(x)}_{j=0}^n are real-valued functions that are linearly independent in C[a, b]. Prove that A is symmetric positive definite. Why is the assumption of linear independence essential? How does this guarantee that the solution of the normal equations yields a minimum rather than a maximum or saddle point?

Example 6.2.1 We approximate f(x) = e^x on the interval [0, 5] by a fourth-degree polynomial
\[ f_4(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4. \]
The normal equations have the form
\[ \sum_{j=0}^n a_{ij} c_j = b_i, \quad i = 0, 1, \ldots, 4, \]
or, in matrix-vector form, Ac = b, where
\[ a_{ij} = \int_0^5 x^i x^j\, dx = \int_0^5 x^{i+j}\, dx = \frac{5^{i+j+1}}{i+j+1}, \quad i, j = 0, 1, \ldots, 4, \]
\[ b_i = \int_0^5 x^i e^x\, dx, \quad i = 0, 1, \ldots, 4. \]
Integration by parts yields the relation
\[ b_i = 5^i e^5 - i b_{i-1}, \quad b_0 = e^5 - 1. \]
Solving this system of equations yields the polynomial
\[ f_4(x) = 2.3002 - 6.226x + 9.5487x^2 - 3.86x^3 + 0.6704x^4. \]
As Figure 6.3 shows, this polynomial is barely distinguishable from ex on [0, 5].
However, it should be noted that the matrix A is closely related to the n × n Hilbert matrix H_n, which has entries
\[ [H_n]_{ij} = \frac{1}{i+j-1}, \quad 1 \le i, j \le n. \]
This matrix is famous for being highly ill-conditioned, meaning that solutions to systems of linear equations involving this matrix that are computed using floating-point arithmetic are highly sensitive to roundoff error. In fact, the matrix A in this example has a condition number of 1.56 × 10^7, which means that a change of size ε in the right-hand side vector b, with entries b_i, can cause a change of size 1.56 × 10^7 ε in the solution c. 2

Figure 6.3: Graphs of f (x) = ex (red dashed curve) and 4th-degree continuous least-squares poly-
nomial approximation f4 (x) on [0, 5] (blue solid curve)
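The following Matlab fragment is a minimal sketch reproducing the computation in Example 6.2.1, forming the normal equations with the monomial basis and evaluating the right-hand side with the built-in integral function.

n = 4;
A = zeros(n+1); b = zeros(n+1,1);
for i = 0:n
    for j = 0:n
        A(i+1,j+1) = 5^(i+j+1)/(i+j+1);          % integral of x^i * x^j over [0,5]
    end
    b(i+1) = integral(@(x) x.^i.*exp(x), 0, 5);  % integral of x^i * e^x over [0,5]
end
c = A\b;                                         % coefficients c_0, ..., c_4
% cond(A) is large (about 1.6e7), illustrating the ill-conditioning discussed above.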

Exercise 6.2.3 Repeat Example 6.2.1 with f(x) = x^7. What happens to the coefficients {c_j}_{j=0}^4 if the right-hand side vector b is perturbed?

For the remainder of this section, we restrict ourselves to the case where the functions {φ_j(x)}_{j=0}^n are polynomials. These polynomials form a basis of P_n, the vector space of polynomials of degree at most n. Then, for f ∈ L^2(a, b), we refer to the polynomial f_n that minimizes \|f - p\|_2 over all p ∈ P_n as the best 2-norm approximating polynomial, or least-squares approximating polynomial, of degree n to f on (a, b).

6.2.1 Orthogonal Polynomials


As the preceding example shows, it is important to choose the polynomials {φj (x)}nj=0 wisely,
so that the resulting system of normal equations is not unduly sensitive to round-off errors. An
even better choice is one for which this system can be solved analytically, with relatively few
computations. An ideal choice of polynomials is one for which the task of computing fn+1 (x) can
reuse the computations needed to compute fn (x).
Suppose that we can construct a set of polynomials {φ_j(x)}_{j=0}^n that is orthogonal with respect to the inner product of functions on (a, b). That is,
\[ \langle \varphi_k, \varphi_j \rangle = \int_a^b \varphi_k(x) \varphi_j(x)\, dx = \begin{cases} 0 & k \ne j, \\ \alpha_k > 0 & k = j. \end{cases} \]

Then, the normal equations simplify to a trivial system
\[ \left( \int_a^b [\varphi_k(x)]^2\, dx \right) c_k = \int_a^b \varphi_k(x) f(x)\, dx, \quad k = 0, 1, \ldots, n, \]
or, in terms of norms and inner products,
\[ \|\varphi_k\|_2^2\, c_k = \langle \varphi_k, f \rangle, \quad k = 0, 1, \ldots, n. \]

It follows that the coefficients {c_j}_{j=0}^n of the least-squares approximation f_n(x) are simply
\[ c_k = \frac{\langle \varphi_k, f \rangle}{\|\varphi_k\|_2^2}, \quad k = 0, 1, \ldots, n. \]
If the constants {α_k}_{k=0}^n above satisfy α_k = 1 for k = 0, 1, \ldots, n, then we say that the orthogonal set of functions {φ_j(x)}_{j=0}^n is orthonormal. In that case, the solution to the continuous least-squares problem is simply given by
\[ c_k = \langle \varphi_k, f \rangle, \quad k = 0, 1, \ldots, n. \quad (6.2) \]
Next, we will learn how sets of orthogonal polynomials can be constructed.

6.2.2 Construction of Orthogonal Polynomials


Recall the process known as Gram-Schmidt orthogonalization for obtaining a set of orthogonal vectors p_1, p_2, \ldots, p_n from a set of linearly independent vectors a_1, a_2, \ldots, a_n:
\begin{align*}
p_1 &= a_1, \\
p_2 &= a_2 - \frac{p_1 \cdot a_2}{p_1 \cdot p_1} p_1, \\
&\ \vdots \\
p_n &= a_n - \sum_{j=1}^{n-1} \frac{p_j \cdot a_n}{p_j \cdot p_j} p_j.
\end{align*}

By normalizing each vector p_j, we obtain a unit vector
\[ q_j = \frac{1}{\|p_j\|} p_j, \]
and a set of orthonormal vectors {q_j}_{j=1}^n, in that they are orthogonal (q_k \cdot q_j = 0 for k ≠ j), and unit vectors (q_j \cdot q_j = 1).
We can use a similar process to compute a set of orthogonal polynomials {p_j(x)}_{j=0}^n. For convenience, we will require that all polynomials in the set be monic; that is, their leading (highest-degree) coefficient must be equal to 1. We then define p_0(x) = 1. Then, because p_1(x) is supposed to be of degree 1, it must have the form p_1(x) = x - α_1 for some constant α_1. To ensure that p_1(x) is orthogonal to p_0(x), we compute their inner product, and obtain
\[ 0 = \langle p_0, p_1 \rangle = \langle 1, x - \alpha_1 \rangle, \]
so we must have
\[ \alpha_1 = \frac{\langle 1, x \rangle}{\langle 1, 1 \rangle}. \]
For j > 1, we start by setting p_j(x) = x p_{j-1}(x), since p_j should be of degree one greater than that of p_{j-1}, and this satisfies the requirement that p_j be monic. Then, we need to subtract polynomials of lower degree to ensure that p_j is orthogonal to p_i, for i < j. To that end, we apply Gram-Schmidt orthogonalization and obtain
\[ p_j(x) = x p_{j-1}(x) - \sum_{i=0}^{j-1} \frac{\langle p_i, x p_{j-1} \rangle}{\langle p_i, p_i \rangle} p_i(x). \]

However, by the definition of the inner product, \langle p_i, x p_{j-1} \rangle = \langle x p_i, p_{j-1} \rangle. Furthermore, because x p_i is of degree i + 1, and p_{j-1} is orthogonal to all polynomials of degree less than j - 1, it follows that \langle p_i, x p_{j-1} \rangle = 0 whenever i < j - 2.
We have shown that sequences of orthogonal polynomials satisfy a three-term recurrence relation
\[ p_j(x) = (x - \alpha_j) p_{j-1}(x) - \beta_{j-1}^2 p_{j-2}(x), \quad j > 1, \]
where the recursion coefficients α_j and β_{j-1}^2 are defined to be
\[ \alpha_j = \frac{\langle p_{j-1}, x p_{j-1} \rangle}{\langle p_{j-1}, p_{j-1} \rangle}, \quad j > 1, \]
\[ \beta_j^2 = \frac{\langle p_{j-1}, x p_j \rangle}{\langle p_{j-1}, p_{j-1} \rangle} = \frac{\langle x p_{j-1}, p_j \rangle}{\langle p_{j-1}, p_{j-1} \rangle} = \frac{\langle p_j, p_j \rangle}{\langle p_{j-1}, p_{j-1} \rangle} = \frac{\|p_j\|_2^2}{\|p_{j-1}\|_2^2}, \quad j \ge 1. \]
Note that \langle x p_{j-1}, p_j \rangle = \langle p_j, p_j \rangle because x p_{j-1} differs from p_j by a polynomial of degree at most j - 1, which is orthogonal to p_j. The recurrence relation is also valid for j = 1, provided that we define p_{-1}(x) \equiv 0, and α_1 is defined as above. That is,
\[ p_1(x) = (x - \alpha_1) p_0(x), \quad \alpha_1 = \frac{\langle p_0, x p_0 \rangle}{\langle p_0, p_0 \rangle}. \]

If we also define the recursion coefficient β_0 by
\[ \beta_0^2 = \langle p_0, p_0 \rangle, \]
and then define
\[ q_j(x) = \frac{p_j(x)}{\beta_0 \beta_1 \cdots \beta_j}, \]
then the polynomials q_0, q_1, \ldots, q_n are also orthogonal, and
\[ \langle q_j, q_j \rangle = \frac{\langle p_j, p_j \rangle}{\beta_0^2 \beta_1^2 \cdots \beta_j^2} = \langle p_j, p_j \rangle\, \frac{\langle p_{j-1}, p_{j-1} \rangle}{\langle p_j, p_j \rangle} \cdots \frac{\langle p_0, p_0 \rangle}{\langle p_1, p_1 \rangle}\, \frac{1}{\langle p_0, p_0 \rangle} = 1. \]
That is, these polynomials are orthonormal.
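The following Matlab fragment is a minimal sketch of the three-term recurrence for monic orthogonal polynomials, with the interval, weight function, and degree chosen only for illustration; it is not the orthpoly function requested in Exercise 6.2.5. Row j+1 of P stores the coefficients of p_j in Matlab's convention (highest degree first, padded with leading zeros).

a = -1; b = 1; w = @(x) ones(size(x)); n = 4;       % hypothetical choices
ip = @(p,q) integral(@(x) polyval(p,x).*polyval(q,x).*w(x), a, b);
P = zeros(n+1, n+1);
P(1, end) = 1;                                      % p_0(x) = 1
for j = 1:n
    pjm1 = P(j, :);                                 % coefficients of p_{j-1}
    alpha = ip(conv([1 0], pjm1), pjm1) / ip(pjm1, pjm1);
    t = conv([1 -alpha], pjm1);                     % (x - alpha_j) p_{j-1}(x)
    pj = t(2:end);                                  % drop the leading zero
    if j > 1
        pjm2 = P(j-1, :);                           % coefficients of p_{j-2}
        beta2 = ip(pjm1, pjm1) / ip(pjm2, pjm2);    % beta_{j-1}^2
        pj = pj - beta2 * pjm2;
    end
    P(j+1, :) = pj;
end
% With w(x) = 1 on (-1,1), the rows of P are the monic Legendre polynomials.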


Exercise 6.2.4 Compute the first three monic orthogonal polynomials with respect to the inner product
\[ \langle f, g \rangle = \int_0^1 f(x) g(x) w(x)\, dx, \]
with weight functions w(x) = 1 and w(x) = x.

Exercise 6.2.5 Write a Matlab function P=orthpoly(a,b,w,n) that computes the co-
efficients of monic orthogonal polynomials on the interval (a, b), up to and including
degree n, and stores their coefficients in the rows of the (n + 1) × (n + 1) matrix P . The
vector w stores the coefficients of a polynomial w(x) that serves as the weight function.
Use Matlab’s polynomial functions to evaluate the required inner products. How can
you ensure that the weight function does not change sign on (a, b)?

6.2.3 Legendre Polynomials


If we consider the inner product
\[ \langle f, g \rangle = \int_{-1}^1 f(x) g(x)\, dx, \]
then by Gram-Schmidt Orthogonalization, a sequence of orthogonal polynomials, with respect to
this inner product, can be defined as follows:

\[ L_0(x) = 1, \quad (6.3) \]
\[ L_1(x) = x, \quad (6.4) \]
\[ L_{j+1}(x) = \frac{2j+1}{j+1} x L_j(x) - \frac{j}{j+1} L_{j-1}(x), \quad j = 1, 2, \ldots \quad (6.5) \]
These are known as the Legendre polynomials [23]. One of their most important applications is
in the construction of Gaussian quadrature rules (see Section 7.5). Specifically, the roots of Ln (x),
for n ≥ 1, are the nodes of a Gaussian quadrature rule for the interval (−1, 1). However, they
can also be used to easily compute continuous least-squares polynomial approximations, as the
following example shows.
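The following Matlab fragment is a minimal sketch of the recurrence (6.5) using conv to multiply by x, with a hypothetical maximum degree; row j+1 of L holds the coefficients of L_j (highest degree first).

n = 5;                                             % hypothetical degree
L = zeros(n+1, n+1);
L(1, end) = 1;                                     % L_0(x) = 1
L(2, end-1:end) = [1 0];                           % L_1(x) = x
for j = 1:n-1
    t = (2*j+1)/(j+1) * conv([1 0], L(j+1, :));    % ((2j+1)/(j+1)) x L_j(x)
    L(j+2, :) = t(2:end) - (j/(j+1)) * L(j, :);    % subtract (j/(j+1)) L_{j-1}(x)
end
% For example, x = linspace(-1,1,200); plot(x, polyval(L(4,:), x)) plots L_3(x).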

Example 6.2.2 We will use Legendre polynomials to approximate f(x) = cos x on [-π/2, π/2] by a quadratic polynomial. First, we note that the first three Legendre polynomials, which are the ones of degree 0, 1 and 2, are
\[ L_0(x) = 1, \quad L_1(x) = x, \quad L_2(x) = \frac{1}{2}(3x^2 - 1). \]
However, it is not practical to use these polynomials directly to approximate f(x), because they are orthogonal with respect to the inner product defined on the interval (-1, 1), and we wish to approximate f(x) on (-π/2, π/2).
To obtain orthogonal polynomials on (-π/2, π/2), we replace x by 2t/π, where t belongs to [-π/2, π/2], in the Legendre polynomials, which yields
\[ \tilde{L}_0(t) = 1, \quad \tilde{L}_1(t) = \frac{2t}{\pi}, \quad \tilde{L}_2(t) = \frac{1}{2}\left( \frac{12}{\pi^2} t^2 - 1 \right). \]
Then, we can express our quadratic approximation f_2(x) of f(x) by the linear combination
\[ f_2(x) = c_0 \tilde{L}_0(x) + c_1 \tilde{L}_1(x) + c_2 \tilde{L}_2(x), \]
where
\[ c_j = \frac{\langle f, \tilde{L}_j \rangle}{\langle \tilde{L}_j, \tilde{L}_j \rangle}, \quad j = 0, 1, 2. \]

Computing these inner products yields
\begin{align*}
\langle f, \tilde{L}_0 \rangle &= \int_{-\pi/2}^{\pi/2} \cos t\, dt = 2, \\
\langle f, \tilde{L}_1 \rangle &= \int_{-\pi/2}^{\pi/2} \frac{2t}{\pi} \cos t\, dt = 0, \\
\langle f, \tilde{L}_2 \rangle &= \int_{-\pi/2}^{\pi/2} \frac{1}{2}\left( \frac{12}{\pi^2} t^2 - 1 \right) \cos t\, dt = \frac{2}{\pi^2}(\pi^2 - 12), \\
\langle \tilde{L}_0, \tilde{L}_0 \rangle &= \int_{-\pi/2}^{\pi/2} 1\, dt = \pi, \\
\langle \tilde{L}_1, \tilde{L}_1 \rangle &= \int_{-\pi/2}^{\pi/2} \left( \frac{2t}{\pi} \right)^2 dt = \frac{\pi}{3}, \\
\langle \tilde{L}_2, \tilde{L}_2 \rangle &= \int_{-\pi/2}^{\pi/2} \left[ \frac{1}{2}\left( \frac{12}{\pi^2} t^2 - 1 \right) \right]^2 dt = \frac{\pi}{5}.
\end{align*}

It follows that
\[ c_0 = \frac{2}{\pi}, \quad c_1 = 0, \quad c_2 = \frac{5}{\pi} \cdot \frac{2}{\pi^2}(\pi^2 - 12) = \frac{10}{\pi^3}(\pi^2 - 12), \]
and therefore
\[ f_2(x) = \frac{2}{\pi} + \frac{5}{\pi^3}(\pi^2 - 12)\left( \frac{12}{\pi^2} x^2 - 1 \right) \approx 0.98016 - 0.4177x^2. \]

This approximation is shown in Figure 6.4. 2

Exercise 6.2.6 Write a Matlab script that computes the coefficients of the Legendre
polynomials up to a given degree n, using the recurrence relation (6.5) and the function
conv for multiplying polynomials. Then, plot the graphs of these polynomials on the
interval (−1, 1). What properties can you observe in these graphs? Is there any symmetry
to them?

Exercise 6.2.7 Prove that the Legendre polynomial Lj (x) is an odd function if j is odd,
and an even function if j is even. Hint: use mathematical induction.

Figure 6.4: Graph of cos x (solid blue curve) and its continuous least-squares quadratic approxi-
mation (red dashed curve) on (−π/2, π/2)

Exercise 6.2.8 Let A be the Vandermonde matrix from (6.1), where the points
x1 , x2 , . . . , xm are equally spaced points in the interval (−1, 1). Construct this matrix
in Matlab for a small chosen value of n and a large value of m, and then compute the
QR factorization of A (See Chapter 6). How do the columns of Q relate to the Legendre
polynomials?

6.2.4 Chebyshev Polynomials

It is possible to compute sequences of orthogonal polynomials with respect to other inner products.
A generalization of the inner product that we have been using is defined by
\[ \langle f, g \rangle = \int_a^b f(x) g(x) w(x)\, dx, \]
where w(x) is a weight function. To be a weight function, it is required that w(x) ≥ 0 on (a, b), and that w(x) not be identically zero on any subinterval of (a, b). So far, we have only considered the case of w(x) ≡ 1.

Exercise 6.2.9 Prove that the discussion of Section 6.2.2 also applies when using the inner product
\[ \langle f, g \rangle = \int_a^b f(x) g(x) w(x)\, dx, \]
where w(x) is a weight function that satisfies w(x) ≥ 0 on (a, b). That is, polynomials orthogonal with respect to this inner product also satisfy a three-term recurrence relation, with analogous definitions of the recursion coefficients α_j and β_j.
Another weight function of interest is
\[ w(x) = \frac{1}{\sqrt{1 - x^2}}, \quad -1 < x < 1. \]
A sequence of polynomials that is orthogonal with respect to this weight function, and the associated inner product
\[ \langle f, g \rangle = \int_{-1}^1 f(x) g(x) \frac{1}{\sqrt{1 - x^2}}\, dx, \]
is the sequence of Chebyshev polynomials, previously introduced in Section 5.4.2:
\[ T_0(x) = 1, \quad T_1(x) = x, \quad T_{j+1}(x) = 2x T_j(x) - T_{j-1}(x), \quad j = 1, 2, \ldots, \]
which can also be defined by
\[ T_j(x) = \cos(j \cos^{-1} x), \quad -1 \le x \le 1. \]

It is interesting to note that if we let x = cos θ, then
\[ \langle f, T_j \rangle = \int_{-1}^1 f(x) \cos(j \cos^{-1} x) \frac{1}{\sqrt{1 - x^2}}\, dx = \int_0^\pi f(\cos\theta) \cos j\theta\, d\theta. \]

In Section 6.4, we will investigate continuous and discrete least-squares approximation of functions
by linear combinations of trigonometric polynomials such as cos jθ or sin jθ, which will reveal how
these coefficients hf, Tj i can be computed very rapidly.
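The following Matlab fragment is a minimal sketch that checks the identity above numerically for a sample function and index, both chosen only for illustration.

f = @(x) exp(x); j = 3;                          % hypothetical choices
Tj = @(x) cos(j*acos(x));                        % Chebyshev polynomial T_j
I1 = integral(@(x) f(x).*Tj(x)./sqrt(1-x.^2), -1, 1);
I2 = integral(@(t) f(cos(t)).*cos(j*t), 0, pi);
% I1 and I2 agree to within the quadrature tolerance.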

Exercise 6.2.10 Write a Matlab function fn=best2normapprox(f,a,b,n,w) that computes the coefficients of f_n(x), the best 2-norm approximating polynomial of degree n to the given function f(x) on (a, b), with respect to the inner product
\[ \langle f, g \rangle = \int_a^b f(x) g(x) w(x)\, dx, \]
where w is a function handle for the weight function w(x). Use the built-in function integral to evaluate the required inner products. Make the fifth argument w an optional argument, using w(x) ≡ 1 as a default.

Exercise 6.2.11 Compute the best 2-norm approximating polynomial of degree 3 to the
functions f (x) = ex and g(x) = sin πx on (−1, 1), using both Legendre and Chebyshev
polynomials. Comment on the accuracy of these approximations.

6.2.5 Error Analysis


Let p ∈ P_n, where P_n is the space of polynomials of degree at most n, and let f_n be the best 2-norm approximating polynomial of degree n to f ∈ L^2(a, b). As before, we assume the polynomials q_0(x), q_1(x), \ldots, q_n(x) are orthonormal, in the sense that
\[ \langle q_j, q_k \rangle = \int_a^b q_j(x) q_k(x) w(x)\, dx = \delta_{jk}, \quad j, k = 0, 1, \ldots, n. \]
Then, from (6.2) we have
\[ f_n(x) = \sum_{j=0}^n \langle q_j, f \rangle q_j(x). \quad (6.6) \]
This form of f_n(x) can be used to prove the following result.

Theorem 6.2.3 The polynomial f_n ∈ P_n is the best 2-norm approximating polynomial of degree n to f ∈ L^2(a, b) if and only if
\[ \langle f - f_n, p \rangle = 0 \]
for all p ∈ P_n.

Exercise 6.2.12 Use (6.6) to prove one part of Theorem 6.2.3: assume f_n is the best 2-norm approximating polynomial to f ∈ L^2(a, b), and show that \langle f - f_n, p \rangle = 0 for any p ∈ P_n.

Exercise 6.2.13 Use the Cauchy-Schwarz inequality to prove the converse of Exercise 6.2.12: that if f ∈ L^2(a, b) and \langle f - f_n, p \rangle = 0 for an arbitrary p ∈ P_n, then f_n is the best 2-norm approximating polynomial of degree n to f; that is,
\[ \|f - f_n\|_2 \le \|f - p\|_2. \]
Hint: by the assumptions, f - f_n is orthogonal to any polynomial in P_n.

6.2.6 Roots of Orthogonal Polynomials


Finally, we prove one property of orthogonal polynomials that will prove useful in our upcoming discussion of the role of orthogonal polynomials in numerical integration. Let φ_j(x) be a polynomial of degree j ≥ 1 that is orthogonal to all polynomials of lower degree, with respect to the inner product
\[ \langle f, g \rangle = \int_a^b f(x) g(x) w(x)\, dx, \]
and let the points ξ_1, ξ_2, \ldots, ξ_k be the points in (a, b) at which φ_j(x) changes sign. This set of points cannot be empty, because φ_j, being a polynomial of degree at least one, is orthogonal to a constant function, which means
\[ \int_a^b \varphi_j(x) w(x)\, dx = 0. \]
Because w(x) is a weight function, it does not change sign. Therefore, in order for the integral to be zero, φ_j(x) must change sign at least once in (a, b).
If we define
\[ \pi_k(x) = (x - \xi_1)(x - \xi_2) \cdots (x - \xi_k), \]
then φ_j(x)π_k(x) does not change sign on (a, b), because π_k changes sign at exactly the same points in (a, b) as φ_j. Because both polynomials are also nonzero on (a, b), we must have
\[ \langle \varphi_j, \pi_k \rangle = \int_a^b \varphi_j(x) \pi_k(x) w(x)\, dx \ne 0. \]

If k < j, then we have a contradiction, because ϕj is orthogonal to any polynomial of lesser degree.
Therefore, k ≥ j. However, if k > j, we also have a contradiction, because a polynomial of degree
j cannot change sign more than j times on the entire real number line, let alone an interval. We
conclude that k = j, which implies that all of the roots of ϕj are real and distinct, and lie in (a, b).

Exercise 6.2.14 Use your function orthpoly from Exercise 6.2.5 to generate orthogonal polynomials of a fixed degree n for various weight functions. How does the distribution of the roots of p_n(x) vary based on where the weight function has smaller or larger values? Hint: consider the distribution of the roots of Chebyshev polynomials, and their weight function w(x) = (1 - x^2)^{-1/2}.

6.3 Rational Approximation


In some cases, it is not practical to approximate a given function f (x) by a polynomial, because it
simply cannot capture the behavior of f (x), regardless of the degree. This is because higher-degree
polynomials tend to be oscillatory, so if f (x) is mostly smooth, the degree n of an approximating
polynomial fn (x) must be unreasonably high. Therefore, in this section we consider an alternative
to polynomial approximation.
Specifically, we seek a rational function of the form
\[ r_{m,n}(x) = \frac{p_m(x)}{q_n(x)} = \frac{a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m}{b_0 + b_1 x + b_2 x^2 + \cdots + b_n x^n}, \]
where p_m(x) and q_n(x) are polynomials of degree m and n, respectively. For convenience, we impose b_0 = 1, since otherwise the other coefficients can simply be scaled.
To construct p_m(x) and q_n(x), we generalize approximation of f(x) by a Taylor polynomial of degree n. Consider the error
\[ E(x) = f(x) - r_{m,n}(x) = \frac{f(x) q_n(x) - p_m(x)}{q_n(x)}. \]

As in Taylor polynomial approximation, our goal is to choose the coefficients of p_m and q_n so that
\[ E(0) = E'(0) = E''(0) = \cdots = E^{(m+n)}(0) = 0. \]
That is, 0 is a root of multiplicity m + n + 1. It follows that x^{m+n+1} is included in the factorization of the numerator of E(x).
For convenience, we express p and q as polynomials of degree m + n, by padding them with coefficients that are zero: a_{m+1} = a_{m+2} = \cdots = a_{m+n} = 0 and b_{n+1} = b_{n+2} = \cdots = b_{n+m} = 0.
Taking a Maclaurin expansion of f(x),
\[ f(x) = \sum_{i=0}^\infty c_i x^i, \quad c_i = \frac{f^{(i)}(0)}{i!}, \]

we obtain the following expression for this numerator:
\begin{align*}
f(x) q_n(x) - p_m(x) &= \sum_{i=0}^\infty c_i x^i \sum_{j=0}^{m+n} b_j x^j - \sum_{i=0}^{m+n} a_i x^i \\
&= \sum_{i=0}^\infty \sum_{j=0}^{m+n} c_i b_j x^{i+j} - \sum_{i=0}^{m+n} a_i x^i \\
&= \sum_{i=0}^\infty \left( \sum_{j=0}^{\min(m+n,i)} b_j c_{i-j} \right) x^i - \sum_{i=0}^{m+n} a_i x^i \\
&= \sum_{i=0}^{m+n} \left( \sum_{j=0}^i b_j c_{i-j} - a_i \right) x^i + \sum_{i=m+n+1}^\infty \sum_{j=0}^{m+n} b_j c_{i-j} x^i.
\end{align*}

We can then ensure that 0 is a root of multiplicity m + n + 1 if the numerator has no terms of
degree m + n or less. That is, each coefficient of xi in the first summation must equal zero.
This entails solving the system of m + n + 1 equations
\begin{align*}
c_0 &= a_0 \\
c_1 + b_1 c_0 &= a_1 \\
c_2 + b_1 c_1 + b_2 c_0 &= a_2 \quad (6.7) \\
&\ \vdots \\
c_{m+n} + b_1 c_{m+n-1} + \cdots + b_{m+n} c_0 &= a_{m+n}.
\end{align*}

This is a system of m+n+1 linear equations in the m+n+1 unknowns b1 , b2 , . . . , bn , a0 , a1 , . . . , am .


The resulting rational function rm,n (x) is called a Padé approximant of f (x) [27].
While this system can certainly be solved using Gaussian elimination with partial pivoting, we
would like to find out if the structure of this system can somehow be exploited to solve it more
efficiently than is possible for a general system of linear equations. To that end, we consider a
simple example.

Example 6.3.1 We consider the approximation of f(x) = e^{-x} by a rational function of the form
\[ r_{2,3}(x) = \frac{a_0 + a_1 x + a_2 x^2}{1 + b_1 x + b_2 x^2 + b_3 x^3}. \]
The Maclaurin series for f(x) has coefficients c_j = (-1)^j / j!. The system of equations (6.7) becomes
\begin{align*}
c_0 &= a_0 \\
c_1 + b_1 c_0 &= a_1 \\
c_2 + b_1 c_1 + b_2 c_0 &= a_2 \\
c_3 + b_1 c_2 + b_2 c_1 + b_3 c_0 &= 0 \\
c_4 + b_1 c_3 + b_2 c_2 + b_3 c_1 &= 0 \\
c_5 + b_1 c_4 + b_2 c_3 + b_3 c_2 &= 0.
\end{align*}

This can be written as Ax = b, where
\[ A = \begin{bmatrix} -1 & & & & & \\ & -1 & & c_0 & & \\ & & -1 & c_1 & c_0 & \\ & & & c_2 & c_1 & c_0 \\ & & & c_3 & c_2 & c_1 \\ & & & c_4 & c_3 & c_2 \end{bmatrix}, \quad x = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}, \quad b = \begin{bmatrix} -c_0 \\ -c_1 \\ -c_2 \\ -c_3 \\ -c_4 \\ -c_5 \end{bmatrix}. \]

If possible, we would like to work with a structure that facilitates Gaussian elimination. To that end, we can reverse the rows and columns of A, x and b to obtain the system
\[ A = \begin{bmatrix} c_2 & c_3 & c_4 & & & \\ c_1 & c_2 & c_3 & & & \\ c_0 & c_1 & c_2 & & & \\ & c_0 & c_1 & -1 & & \\ & & c_0 & & -1 & \\ & & & & & -1 \end{bmatrix}, \quad x = \begin{bmatrix} b_3 \\ b_2 \\ b_1 \\ a_2 \\ a_1 \\ a_0 \end{bmatrix}, \quad b = \begin{bmatrix} -c_5 \\ -c_4 \\ -c_3 \\ -c_2 \\ -c_1 \\ -c_0 \end{bmatrix}. \]

It follows that Gaussian elimination can be carried out by eliminating m = 2 entries in each of the
first n = 3 columns. After that, the matrix will be reduced to upper triangular form so that back
substitution can be carried out. If pivoting is required, it can be carried out on only the first n rows,
because due to the block lower triangular structure of A, it follows that A is nonsingular if and only
if the upper left n × n block is.
After carrying out Gaussian elimination for this example, with Maclaurin series coefficients c_j = (-1)^j / j!, we obtain the rational approximation
\[ e^{-x} \approx r_{2,3}(x) = \frac{p_2(x)}{q_3(x)} = \frac{\frac{1}{20}x^2 - \frac{2}{5}x + 1}{\frac{1}{60}x^3 + \frac{3}{20}x^2 + \frac{3}{5}x + 1}. \]
Plotting the error in this approximation on the interval [0, 1], we see that the error is maximum at x = 1, at roughly 4.5 × 10^{-5}. 2

Exercise 6.3.1 Write a Matlab function [p,q]=padeapprox(c,m,n) that computes the coefficients of the polynomials p_m(x) and q_n(x) such that r_{m,n}(x) = p_m(x)/q_n(x) is the Padé approximant of degree m, n for the function f(x) with Maclaurin series coefficients c_0, c_1, \ldots, c_{m+n} stored in the vector c.
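The following Matlab fragment is a minimal sketch that sets up and solves the system (6.7) with a general solver for the specific case of Example 6.3.1; it is not the padeapprox function requested above, and it does not exploit the structure discussed in that example.

m = 2; n = 3;
c = (-1).^(0:m+n)'./factorial((0:m+n)');         % Maclaurin coefficients of e^{-x}
N = m + n + 1;
A = zeros(N); rhs = zeros(N,1);
for i = 0:m+n
    if i <= m, A(i+1, i+1) = -1; end             % unknowns a_0,...,a_m in columns 1..m+1
    for j = 1:min(i, n)
        A(i+1, m+1+j) = c(i-j+1);                % unknowns b_1,...,b_n in columns m+2..N
    end
    rhs(i+1) = -c(i+1);
end
x = A\rhs;
a = x(1:m+1);                                    % coefficients of p_m, in ascending powers
b = [1; x(m+2:end)];                             % coefficients of q_n, in ascending powers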

6.3.1 Continued Fraction Form


We now examine the process of efficiently evaluating rm,n (x). A natural approach is to simply
apply nested multiplication to pm (x) and qn (x) individually.

Example 6.3.2 If we apply nested multiplication to p_2(x) and q_3(x) from Example 6.3.1, we obtain
\[ p_2(x) = 1 + x\left( -\frac{2}{5} + \frac{1}{20}x \right), \quad q_3(x) = 1 + x\left( \frac{3}{5} + x\left( \frac{3}{20} + \frac{1}{60}x \right) \right). \]
It follows that evaluating r2,3 (x) requires 5 multiplications, 5 additions, and one division.
An alternative approach is to represent r_{2,3}(x) as a continued fraction [29, p. 285-322]. We have
\begin{align*}
r_{2,3}(x) = \frac{p_2(x)}{q_3(x)} &= \frac{\frac{1}{20}x^2 - \frac{2}{5}x + 1}{\frac{1}{60}x^3 + \frac{3}{20}x^2 + \frac{3}{5}x + 1} \\
&= \frac{3}{\dfrac{x^3 + 9x^2 + 36x + 60}{x^2 - 8x + 20}} \\
&= \frac{3}{x + 17 + \dfrac{152x - 280}{x^2 - 8x + 20}} \\
&= \frac{3}{x + 17 + \dfrac{8}{\dfrac{x^2 - 8x + 20}{19x - 35}}} \\
&= \frac{3}{x + 17 + \dfrac{152}{x - \dfrac{117}{19} + \dfrac{3125/361}{x - 35/19}}}.
\end{align*}

In this form, evaluation of r_{2,3}(x) requires three divisions, no multiplications, and five additions, which is significantly more efficient than using nested multiplication on p_2(x) and q_3(x). 2

It is important to note that the efficiency of this approach comes from the ability to make the polynomial in each denominator monic (that is, having a leading coefficient of one), which removes the need for a multiplication.
Exercise 6.3.2 Write a Matlab function y=contfrac(p,q,x) that takes as input poly-
nomials p(x) and q(x), represented as vectors of coefficients p and q, respectively, and
outputs y = p(x)/q(x) by evaluating p(x)/q(x) as a continued fraction. Hint: use the
Matlab function deconv to divide polynomials.
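The following Matlab fragment is a minimal sketch of a single division step of the continued fraction construction in Example 6.3.2, using deconv; the coefficient vectors are the polynomials of that example scaled to be monic.

q3 = [1 9 36 60];                % 60*q_3(x) = x^3 + 9x^2 + 36x + 60
p2 = [1 -8 20];                  % 20*p_2(x) = x^2 - 8x + 20
[quot, rem] = deconv(q3, p2);    % quot = [1 17], rem = [0 0 152 -280]
% That is, (x^3+9x^2+36x+60)/(x^2-8x+20) = x + 17 + (152x - 280)/(x^2 - 8x + 20),
% matching the second step of the continued fraction above.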

6.3.2 Chebyshev Rational Approximation


One drawback of the Padé approximant is that while it is highly accurate near x = 0, it loses
accuracy as x moves away from zero. Certainly it is straightforward to perform Taylor expansion
around a different center x0 , which ensures similar accuracy near x0 , but it would be desirable to
instead compute an approximation that is accurate on an entire interval [a, b].
To that end, we can employ the Chebyshev polynomials, previously discussed in Section 5.4.2.
Just as they can help reduce the error in polynomial interpolation over an interval, they can also
improve the accuracy of a rational approximation over an interval. For simplicity, we consider the
interval (−1, 1), on which each Chebyshev polynomial Tk (x) satisfies |Tk (x)| ≤ 1, but the approach
described here can readily be applied to an arbitrary interval through shifting and scaling as needed.
The main idea is to use T_k(x) in place of x^k in constructing our rational approximation. That is, our rational approximation now has the form
\[ r_{m,n}(x) = \frac{p_m(x)}{q_n(x)} = \frac{\sum_{k=0}^m a_k T_k(x)}{\sum_{k=0}^n b_k T_k(x)}. \]

If we also expand f(x) in a series of Chebyshev polynomials,
\[ f(x) = \sum_{k=0}^\infty c_k T_k(x), \quad (6.8) \]

then the error in our approximation is
\[ E(x) = f(x) - r_{m,n}(x) = \frac{f(x) q_n(x) - p_m(x)}{q_n(x)} = \frac{1}{q_n(x)} \left( \sum_{i=0}^\infty \sum_{j=0}^n c_i b_j T_i(x) T_j(x) - \sum_{i=0}^m a_i T_i(x) \right). \]

By applying the identity
\[ T_i(x) T_j(x) = \frac{1}{2}\left[ T_{i+j}(x) + T_{|i-j|}(x) \right], \quad (6.9) \]
we obtain the error
\begin{align*}
E(x) &= \frac{1}{q_n(x)} \left[ \frac{1}{2} \sum_{i=0}^\infty \sum_{j=0}^n c_i b_j \left[ T_{i+j}(x) + T_{|i-j|}(x) \right] - \sum_{i=0}^m a_i T_i(x) \right] \\
&= \frac{1}{q_n(x)} \left[ \sum_{i=0}^\infty c_i T_i(x) + \frac{1}{2} \sum_{j=1}^n b_j \left( \sum_{i=j}^\infty c_{i-j} T_i(x) + \sum_{i=1}^j c_{j-i} T_i(x) + \sum_{i=0}^\infty c_{i+j} T_i(x) \right) - \sum_{i=0}^m a_i T_i(x) \right]. \quad (6.10)
\end{align*}

The coefficients {a_j}_{j=0}^m, {b_j}_{j=1}^n are then determined by requiring that the coefficient of T_i(x) in E(x) vanishes, for i = 0, 1, 2, \ldots, m + n.
To obtain the coefficients {c_j}_{j=0}^\infty in the series expansion of f(x) from (6.8), we use the fact that the Chebyshev polynomials are orthogonal on (-1, 1) with respect to the weight function w(x) = (1 - x^2)^{-1/2}. By taking the inner product of both sides of (6.8) with T_j, formulas for c_j can be obtained.

Exercise 6.3.3 Derive a formula for the coefficients c_j, j = 0, 1, 2, \ldots, of the expansion of f(x) in a series of Chebyshev polynomials in (6.8).

Example 6.3.3 We consider the approximation of f(x) = e^{-x} by a rational function of the form
\[ r_{2,3}(x) = \frac{a_0 T_0(x) + a_1 T_1(x) + a_2 T_2(x)}{1 + b_1 T_1(x) + b_2 T_2(x) + b_3 T_3(x)}. \]
The Chebyshev series (6.8) for f(x) has coefficients c_j that can be obtained using the result of Exercise 6.3.3. The system of equations implied by (6.10) becomes
\begin{align*}
c_0 + \tfrac{1}{2}(b_1 c_1 + b_2 c_2 + b_3 c_3) &= a_0 \\
c_1 + b_1 c_0 + \tfrac{1}{2}(b_1 c_2 + b_2 c_1 + b_2 c_3 + b_3 c_2 + b_3 c_4) &= a_1 \\
c_2 + b_2 c_0 + \tfrac{1}{2}(b_1 c_1 + b_1 c_3 + b_2 c_4 + b_3 c_1 + b_3 c_5) &= a_2 \\
c_3 + b_3 c_0 + \tfrac{1}{2}(b_1 c_2 + b_1 c_4 + b_2 c_1 + b_2 c_5 + b_3 c_6) &= 0 \\
c_4 + \tfrac{1}{2}(b_1 c_3 + b_1 c_5 + b_2 c_2 + b_2 c_6 + b_3 c_1 + b_3 c_7) &= 0 \\
c_5 + \tfrac{1}{2}(b_1 c_4 + b_1 c_6 + b_2 c_3 + b_2 c_7 + b_3 c_2 + b_3 c_8) &= 0.
\end{align*}
This can be written as Ax = b, where
\[ A = \begin{bmatrix}
-1 & & & \tfrac{1}{2}c_1 & \tfrac{1}{2}c_2 & \tfrac{1}{2}c_3 \\
& -1 & & c_0 + \tfrac{1}{2}c_2 & \tfrac{1}{2}(c_1 + c_3) & \tfrac{1}{2}(c_2 + c_4) \\
& & -1 & \tfrac{1}{2}(c_1 + c_3) & c_0 + \tfrac{1}{2}c_4 & \tfrac{1}{2}(c_1 + c_5) \\
& & & \tfrac{1}{2}(c_2 + c_4) & \tfrac{1}{2}(c_1 + c_5) & c_0 + \tfrac{1}{2}c_6 \\
& & & \tfrac{1}{2}(c_3 + c_5) & \tfrac{1}{2}(c_2 + c_6) & \tfrac{1}{2}(c_1 + c_7) \\
& & & \tfrac{1}{2}(c_4 + c_6) & \tfrac{1}{2}(c_3 + c_7) & \tfrac{1}{2}(c_2 + c_8)
\end{bmatrix}, \quad
x = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}, \quad
b = \begin{bmatrix} -c_0 \\ -c_1 \\ -c_2 \\ -c_3 \\ -c_4 \\ -c_5 \end{bmatrix}. \]
After carrying out Gaussian elimination for this example, we obtain the rational approximation
\[ e^{-x} \approx r_{2,3}(x) = \frac{p_2(x)}{q_3(x)} \approx \frac{0.0231x^2 - 0.3722x + 0.9535}{0.0038x^3 + 0.0696x^2 + 0.5696x + 1}. \]
Plotting the error in this approximation on the interval (-1, 1), we see that the error is maximum at x = -1, at roughly 1.1 × 10^{-5}, which is less than one-fourth of the error in the Padé approximant on [0, 1]. In fact, on [0, 1], the error is maximum at x = 0 and is only 4.1 × 10^{-6}. 2

Exercise 6.3.4 Write a Matlab function [p,q]=chebyrat(c,m,n) that accepts as inputs a vector c consisting of the coefficients c_0, c_1, \ldots, c_{m+n} in the expansion of a given function f(x) in a series of Chebyshev polynomials as in (6.8), along with the degrees m and n of the numerator and denominator, respectively, of a rational Chebyshev approximation r_{m,n}(x) of f(x). The output must be row vectors p and q containing the coefficients of the polynomials p_m(x) and q_n(x) for the numerator and denominator, respectively, of r_{m,n}(x).

6.4 Trigonometric Interpolation


In many application areas, such as differential equations and signal processing, it is more useful to
express a given function u(x) as a linear combination of sines and cosines, rather than polynomials.
In differential equations, this form of approximation is beneficial due to the simplicity of differen-
tiating sines and cosines, and in signal processing, one can readily analyze the frequency content
of u(x). In this section, we develop efficient algorithms for approximation of functions with such
trigonometric functions.

6.4.1 Fourier Series


Suppose that a function u(x) defined on the interval [0, L] is intended to satisfy periodic boundary conditions u(0) = u(L). Then, since sin x and cos x are both 2π-periodic, u(x) can be expressed in terms of both sines and cosines, as follows:
\[ u(x) = \frac{a_0}{2} + \sum_{j=1}^\infty a_j \cos\frac{2\pi jx}{L} + b_j \sin\frac{2\pi jx}{L}, \quad (6.11) \]
where, for j = 0, 1, 2, \ldots, the coefficients a_j and b_j are defined by
\[ a_j = \frac{2}{L} \int_0^L u(x) \cos\frac{2\pi jx}{L}\, dx, \quad b_j = \frac{2}{L} \int_0^L u(x) \sin\frac{2\pi jx}{L}\, dx. \quad (6.12) \]
This series representation of u(x) is called the Fourier series of u(x).


The formulas for the coefficients {a_j}, {b_j} in (6.12) are obtained using the fact that the functions {cos(2πjx/L)}, {sin(2πjx/L)} are orthogonal with respect to the inner product
\[ (f, g) = \int_0^L \overline{f(x)}\, g(x)\, dx, \quad (6.13) \]
which can be established using trigonometric identities. The complex conjugation of f(x) in (6.13) is necessary to ensure that the norm \| \cdot \| defined by
\[ \|u\| = \sqrt{(u, u)} \quad (6.14) \]
satisfies one of the essential properties of norms, that the norm of a function must be nonnegative.
Exercise 6.4.1 Prove that if m, n are integers, then
\[ \left( \cos\frac{2\pi mx}{L}, \cos\frac{2\pi nx}{L} \right) = \begin{cases} 0 & m \ne n, \\ L/2 & m = n,\ n \ne 0, \\ L & m = n = 0, \end{cases} \]
\[ \left( \sin\frac{2\pi mx}{L}, \sin\frac{2\pi nx}{L} \right) = \begin{cases} 0 & m \ne n, \\ L/2 & m = n, \end{cases} \]
\[ \left( \cos\frac{2\pi mx}{L}, \sin\frac{2\pi nx}{L} \right) = 0, \]
where the inner product (f, g) is as defined in (6.13).
Alternatively, we can use the relation e^{iθ} = cos θ + i sin θ to express the solution in terms of complex exponentials,
\[ u(x) = \frac{1}{\sqrt{L}} \sum_{\omega=-\infty}^\infty \hat{u}(\omega) e^{2\pi i \omega x / L}, \quad (6.15) \]
where
\[ \hat{u}(\omega) = \frac{1}{\sqrt{L}} \int_0^L e^{-2\pi i \omega x / L} u(x)\, dx. \quad (6.16) \]

Like the sines and cosines in (6.11), the functions e^{2\pi i \omega x/L} are orthogonal with respect to the inner product (6.13). Specifically, we have
\[ \left( e^{2\pi i \omega x / L}, e^{2\pi i \eta x / L} \right) = \begin{cases} L & \omega = \eta, \\ 0 & \omega \ne \eta. \end{cases} \quad (6.17) \]
This explains the presence of the scaling constant 1/\sqrt{L} in (6.15). It normalizes the functions e^{2\pi i \omega x/L} so that they form an orthonormal set, meaning that they are orthogonal to one another, and have unit norm.
Exercise 6.4.2 Prove (6.17).
We say that f(x) is square-integrable on (0, L) if
\[ \int_0^L |f(x)|^2\, dx < \infty. \quad (6.18) \]
That is, the above integral must be finite; we also say that f ∈ L^2(0, L). If such a function is also piecewise continuous, the following identity, known as Parseval's identity, is satisfied:
\[ \sum_{\omega=-\infty}^\infty |\hat{f}(\omega)|^2 = \|f\|^2, \quad (6.19) \]
where the norm \| \cdot \| is as defined in (6.14).

Exercise 6.4.3 Prove (6.19).



6.4.2 The Discrete Fourier Transform


Suppose that we define a grid on the interval [0, L], consisting of the N points x_j = jΔx, where Δx = L/N, for j = 0, \ldots, N - 1. Given an L-periodic function f(x), we would like to compute an approximation to its Fourier series of the form
\[ f_N(x) = \frac{1}{\sqrt{L}} \sum_{\omega=-N/2+1}^{N/2} e^{2\pi i \omega x / L} \tilde{f}(\omega), \quad (6.20) \]
where each \tilde{f}(\omega) approximates the corresponding coefficient \hat{f}(\omega) of the true Fourier series. Ideally, this approximate series should satisfy
\[ f_N(x_j) = f(x_j), \quad j = 0, 1, \ldots, N - 1. \quad (6.21) \]
That is, f_N(x) should be an interpolant of f(x), with the N points x_j, j = 0, 1, \ldots, N - 1, as the interpolation points.

6.4.2.1 Fourier Interpolation


The problem of finding this interpolant, called the Fourier interpolant of f, has a unique solution that can easily be computed. The coefficients \tilde{f}(\omega) are obtained by approximating the integrals that define the coefficients of the Fourier series:
\[ \tilde{f}(\omega) = \frac{1}{\sqrt{L}} \sum_{j=0}^{N-1} e^{-2\pi i \omega x_j / L} f(x_j)\, \Delta x, \quad \omega = -N/2+1, \ldots, N/2. \quad (6.22) \]

Because the functions e^{2\pi i \omega x/L} are orthogonal with respect to the discrete inner product
\[ (u, v)_N = \Delta x \sum_{j=0}^{N-1} \overline{u(x_j)}\, v(x_j), \quad (6.23) \]
it is straightforward to verify that f_N(x) does in fact satisfy the conditions (6.21). Note that the discrete inner product is an approximation of the continuous inner product.
From (6.21), we have
\[ f(x_j) = \frac{1}{\sqrt{L}} \sum_{\eta=-N/2+1}^{N/2} e^{2\pi i \eta x_j / L} \tilde{f}(\eta). \quad (6.24) \]
Multiplying both sides by \Delta x\, e^{-2\pi i \omega x_j / L}, and summing from j = 0 to j = N - 1 yields
\[ \Delta x \sum_{j=0}^{N-1} e^{-2\pi i \omega x_j / L} f(x_j) = \Delta x \frac{1}{\sqrt{L}} \sum_{j=0}^{N-1} \sum_{\eta=-N/2+1}^{N/2} e^{-2\pi i \omega x_j / L} e^{2\pi i \eta x_j / L} \tilde{f}(\eta), \quad (6.25) \]
or
\[ \Delta x \sum_{j=0}^{N-1} e^{-2\pi i \omega x_j / L} f(x_j) = \frac{1}{\sqrt{L}} \sum_{\eta=-N/2+1}^{N/2} \tilde{f}(\eta) \left( \Delta x \sum_{j=0}^{N-1} e^{-2\pi i \omega x_j / L} e^{2\pi i \eta x_j / L} \right). \quad (6.26) \]

Because
\[ \left( e^{2\pi i \omega x / L}, e^{2\pi i \eta x / L} \right)_N = \begin{cases} L & \omega = \eta, \\ 0 & \omega \ne \eta, \end{cases} \quad (6.27) \]
all terms in the outer sum on the right side of (6.26) vanish except for η = ω, and we obtain the formula (6.22). It should be noted that the algebraic operations performed on (6.24) are equivalent to taking the discrete inner product of both sides of (6.24) with e^{2\pi i \omega x/L}.
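The following Matlab fragment is a minimal sketch of the direct O(N^2) evaluation of (6.22), with a hypothetical interval, grid size, and sampled function.

L = 2*pi; N = 16;                                % hypothetical values
dx = L/N; xj = (0:N-1)'*dx;
f = sin(3*xj);                                   % sample function values f(x_j)
omega = (-N/2+1:N/2)';
ftilde = zeros(N,1);
for k = 1:N
    ftilde(k) = (dx/sqrt(L)) * sum(exp(-2i*pi*omega(k)*xj/L).*f);
end
% Up to the scaling by dx/sqrt(L) and a reordering of the frequencies, these
% values match MATLAB's built-in fft applied to the vector f.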

Exercise 6.4.4 Prove (6.27). Hint: use formulas associated with geometric series.
The process of obtaining the approximate Fourier coefficients as in (6.22) is called the discrete Fourier transform (DFT) of f(x). The discrete inverse Fourier transform is given by (6.20). As at the beginning of this section, we can also work with the real form of the Fourier interpolant,
\[ f_N(x) = \frac{\tilde{a}_0}{2} + \sum_{j=1}^{N/2-1} \tilde{a}_j \cos\frac{2\pi jx}{L} + \tilde{b}_j \sin\frac{2\pi jx}{L} + \tilde{a}_{N/2} \cos\frac{\pi N x}{L}, \quad (6.28) \]
where the coefficients \tilde{a}_j, \tilde{b}_j are approximations of the coefficients a_j, b_j from (6.12).

Exercise 6.4.5 Express the coefficients ãj , b̃j of the real form of the Fourier interpolant
(6.28) in terms of the coefficients f˜(ω) from the complex exponential form (6.20).

Exercise 6.4.6 Why is there no need for a coefficient b̃N/2 in (6.28)?

Exercise 6.4.7 Use the result of Exercise 6.4.4 to prove the following discrete orthogonality relations:
\[ \left( \cos\frac{2\pi mx}{L}, \cos\frac{2\pi nx}{L} \right)_N = \begin{cases} 0 & m \ne n, \\ L/2 & m = n,\ n \ne 0, \\ L & m = n = 0, \end{cases} \]
\[ \left( \sin\frac{2\pi mx}{L}, \sin\frac{2\pi nx}{L} \right)_N = \begin{cases} 0 & m \ne n, \\ L/2 & m = n, \end{cases} \]
\[ \left( \cos\frac{2\pi mx}{L}, \sin\frac{2\pi nx}{L} \right)_N = 0, \]
where m and n are integers, and the discrete inner product (f, g)_N is as defined in (6.23).

6.4.2.2 De-Noising and Aliasing


Suppose we have N = 128 data points sampled from the following function over [0, 2π]:

f (x) = sin(10x) + noise. (6.29)

The function f (x), shown in Figure 6.5(a), is quite noisy. However, by taking the discrete Fourier
transform (Figure 6.5(b)), we can extract the original sine wave quite easily. The DFT shows two
distinct spikes, corresponding to frequencies of ω = ±10, that is, the frequencies of the original sine
wave. The first N/2 + 1 values of the Fourier transform correspond to frequencies of 0 ≤ ω ≤ ωmax ,

Figure 6.5: (a) Left plot: noisy signal (b) Right plot: discrete Fourier transform

where ωmax = N/2. The remaining N/2 − 1 values of the Fourier transform correspond to the
frequencies −ωmax < ω < 0.
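The following Matlab fragment is a minimal sketch reproducing a noisy-signal experiment like (6.29) with the built-in fft; the noise level is hypothetical.

N = 128; x = (0:N-1)'*2*pi/N;
f = sin(10*x) + 0.5*randn(N,1);          % hypothetical noise level
F = fft(f);
% plot(abs(F)) shows two large spikes, near indices 11 and N-9, which correspond
% to the frequencies omega = 10 and omega = -10 of the underlying sine wave.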
The DFT only considers a finite range of frequencies. If there are frequencies beyond this range present in the Fourier series, an effect known as aliasing occurs. The effect of aliasing is shown in Figure 6.6: it "folds" these frequencies back into the computed DFT. Specifically,
\[ \tilde{f}(\omega) = \sum_{\ell=-\infty}^\infty \hat{f}(\omega + \ell N), \quad -N/2+1 \le \omega \le N/2. \quad (6.30) \]

Aliasing can be avoided by filtering the function before the DFT is applied, to prevent high-
frequency components from “contaminating” the coefficients of the DFT.

Exercise 6.4.8 Use (6.18) and (6.20) to prove (6.30). Hint: Let x = xj for some j.

6.4.3 The Fast Fourier Transform

The discrete Fourier transform, as it was presented in the previous lecture, requires O(N 2 ) oper-
ations to compute. In fact the discrete Fourier transform can be computed much more efficiently
than that (O(N log2 N ) operations) by using the fast Fourier transform (FFT). The FFT arises by
noting that a DFT of length N can be written as the sum of two Fourier transforms each of length
N/2. One of these transforms is formed from the even-numbered points of the original N , and the
other from the odd-numbered points.

Figure 6.6: Aliasing effect on noisy signal: coefficients fˆ(ω), for ω outside (−63, 64), are added to
coefficients inside this interval.

We have
\begin{align*}
\tilde{f}(\omega) &= \frac{\Delta x}{\sqrt{L}} \sum_{j=0}^{N-1} e^{-2\pi i j \omega / N} f(x_j) \\
&= \frac{\Delta x}{\sqrt{L}} \sum_{j=0}^{N/2-1} e^{-2\pi i \omega (2j)/N} f(x_{2j}) + \frac{\Delta x}{\sqrt{L}} \sum_{j=0}^{N/2-1} e^{-2\pi i \omega (2j+1)/N} f(x_{2j+1}) \\
&= \frac{\Delta x}{\sqrt{L}} \sum_{j=0}^{N/2-1} e^{-2\pi i \omega j/(N/2)} f(x_{2j}) + \frac{\Delta x}{\sqrt{L}} W^\omega \sum_{j=0}^{N/2-1} e^{-2\pi i \omega j/(N/2)} f(x_{2j+1}), \quad (6.31)
\end{align*}
where
\[ W = e^{-2\pi i / N}. \quad (6.32) \]
It follows that
\[ \tilde{f}(\omega) = \frac{1}{2} \tilde{f}_e(\omega) + \frac{1}{2} W^\omega \tilde{f}_o(\omega), \quad \omega = -N/2+1, \ldots, N/2, \quad (6.33) \]
where \tilde{f}_e(\omega) is the DFT of f obtained from its values at the even-numbered points of the N-point grid on which f is defined, and \tilde{f}_o(\omega) is the DFT of f obtained from its values at the odd-numbered points. Because the coefficients of a DFT of length N are N-periodic, in view of the identity e^{2\pi i} = 1, evaluation of \tilde{f}_e and \tilde{f}_o at ω between -N/2 + 1 and N/2 is valid, even though they are transforms of length N/2 instead of N.
This reduction to half-size transforms can be performed recursively; i.e. a transform of length
N/2 can be written as the sum of two transforms of length N/4, etc. Because only O(N ) operations
are needed to construct a transform of length N from two transforms of length N/2, the entire
process requires only O(N log2 N ) operations.
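The following Matlab fragment is a minimal sketch verifying the even/odd splitting numerically with the built-in fft, which uses the standard 0, ..., N-1 frequency ordering and no normalization; consequently the factors of 1/2 in (6.33), which come from the Δx/√L scaling used there, do not appear. The data are hypothetical.

N = 8; f = randn(N,1);                        % hypothetical data
W = exp(-2i*pi/N);
Fe = fft(f(1:2:end)); Fo = fft(f(2:2:end));   % transforms of even- and odd-indexed samples
k = (0:N/2-1)';
F = [Fe + W.^k.*Fo; Fe + W.^(k+N/2).*Fo];     % combine the two half-length transforms
% norm(F - fft(f)) is at the level of roundoff error.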

Exercise 6.4.9 Write two functions to compute the DFT of a function f(x) defined on [0, L], represented by an N-vector f that contains its values at x_j = jΔx, j = 0, 1, 2, \ldots, N - 1, where Δx = L/N. For the first function, use the formula (6.22), and for the second, use recursion and the formula (6.33) for the FFT. Compare the efficiency of your functions for different values of N. How does the running time increase as N increases?

6.4.4 Convergence and Gibbs’ Phenomenon


The Fourier series for an L-periodic function f (x) will converge to f (x) at any point in [0, L] at
which f is continuously differentiable. If f has a jump discontinuity at a point c, then the series
will converge to \frac{1}{2}[f(c^+) + f(c^-)], where
\[ f(c^+) = \lim_{x \to c^+} f(x), \quad f(c^-) = \lim_{x \to c^-} f(x). \quad (6.34) \]

If f (x) is not L-periodic, then there is a jump discontinuity in the L-periodic extension of f (x)
beyond [0, L], and the Fourier series will again converge to the average of the values of f (x) on
either side of this discontinuity.
Such discontinuities pose severe difficulties for trigonometric interpolation, because the basis
functions eiωx grow more oscillatory as |ω| increases. In particular, the truncated Fourier series of a
function f (x) with a jump discontinuity at x = c exhibits what is known as Gibbs’ phenomenon,
first discussed in [37], in which oscillations appear on either side of x = c, even if f (x) itself is
smooth there.
Convergence of the Fourier series of f is more rapid when f is smooth. In particular, if f is
p-times differentiable and its pth derivative is at least piecewise continuous (that is, continuous
except possibly for jump discontinuities), then the coefficients of the complex exponential form of
the Fourier series satisfy
\[ |\hat{f}(\omega)| \le \frac{C}{|\omega|^{p+1} + 1} \quad (6.35) \]
for some constant C that is independent of ω [17].

Exercise 6.4.10 Generate a random vector of DFT coefficients that satisfy the decay
rate (6.35), for some value of p. Then, perform an inverse FFT to obtain the truncated
Fourier series (6.20), and plot the resulting function fN (x). How does the behavior of the
function change as p decreases?

Exercise 6.4.11 Demonstrate Gibbs' phenomenon by plotting truncated Fourier series of the function f(x) = x on [0, 2π]. Use the formula (6.20), evaluated on a finer grid (that is, using Ñ equally spaced points in [0, 2π], where Ñ ≫ N). What happens as N increases?
Chapter 7

Differentiation and Integration

The solution of many mathematical models requires performing the basic operations of calculus,
differentiation and integration. In this chapter, we will learn several techniques for approximating a
derivative of a function at a point, and a definite integral of a function over an interval. As we will
see, our previous discussion of polynomial interpolation will play an essential role, as polynomials
are the easiest functions on which to perform these operations.

7.1 Numerical Differentiation


We first discuss how Taylor series and polynomial interpolation can be applied to help solve a
fundamental problem from calculus that frequently arises in scientific applications, the problem of
computing the derivative of a given function f (x) at a given point x = x0 . The basics of derivatives
are reviewed in Section A.2.

7.1.1 Taylor Series


Recall that the derivative of f(x) at a point x_0, denoted f'(x_0), is defined by
\[ f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}. \]
This definition suggests a method for approximating f'(x_0). If we choose h to be a small positive constant, then
\[ f'(x_0) \approx \frac{f(x_0 + h) - f(x_0)}{h}. \]
This approximation is called the forward difference formula.
To estimate the accuracy of this approximation, we note that if f''(x) exists on [x_0, x_0 + h], then, by Taylor's Theorem, f(x_0 + h) = f(x_0) + f'(x_0)h + f''(ξ)h^2/2, where ξ ∈ [x_0, x_0 + h]. Solving for f'(x_0), we obtain
\[ f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{f''(\xi)}{2} h, \]
so the error in the forward difference formula is O(h). We say that this formula is first-order
accurate.


The forward-difference formula is called a finite difference approximation to f'(x_0), because it approximates f'(x) using values of f(x) at points that have a small, but finite, distance between them, as opposed to the definition of the derivative, which takes a limit and therefore computes the derivative using an "infinitely small" value of h. The forward-difference formula, however, is just one example of a finite difference approximation. If we replace h by -h in the forward-difference formula, where h is still positive, we obtain the backward-difference formula
\[ f'(x_0) \approx \frac{f(x_0) - f(x_0 - h)}{h}. \]
Like the forward-difference formula, the backward-difference formula is first-order accurate.
If we average these two approximations, we obtain the centered-difference formula
\[ f'(x_0) \approx \frac{f(x_0 + h) - f(x_0 - h)}{2h}. \]
To determine the accuracy of this approximation, we assume that f'''(x) exists on the interval [x_0 - h, x_0 + h], and then apply Taylor's Theorem again to obtain
\begin{align*}
f(x_0 + h) &= f(x_0) + f'(x_0)h + \frac{f''(x_0)}{2} h^2 + \frac{f'''(\xi_+)}{6} h^3, \\
f(x_0 - h) &= f(x_0) - f'(x_0)h + \frac{f''(x_0)}{2} h^2 - \frac{f'''(\xi_-)}{6} h^3,
\end{align*}
where ξ_+ ∈ [x_0, x_0 + h] and ξ_- ∈ [x_0 - h, x_0]. Subtracting the second equation from the first and solving for f'(x_0) yields
\[ f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} - \frac{f'''(\xi_+) + f'''(\xi_-)}{12} h^2. \]
Suppose that f''' is continuous on [x_0 - h, x_0 + h]. By the Intermediate Value Theorem, f'''(x) must assume every value between f'''(ξ_-) and f'''(ξ_+) on the interval (ξ_-, ξ_+), including the average of these two values. Therefore, we can simplify this equation to
\[ f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} - \frac{f'''(\xi)}{6} h^2, \quad (7.1) \]
where ξ ∈ [x0 −h, x0 +h]. We conclude that the centered-difference formula is second-order accurate.
This is due to the cancellation of the terms involving f 00 (x0 ).
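The following Matlab fragment is a minimal sketch comparing the forward and centered difference formulas; the test function, point, and step sizes are chosen only for illustration.

f = @(x) sin(x); x0 = 1.2; exact = cos(x0);
h = 10.^(-(1:8))';
fwd = (f(x0+h) - f(x0))./h;            % first-order accurate
ctr = (f(x0+h) - f(x0-h))./(2*h);      % second-order accurate
[abs(fwd - exact) abs(ctr - exact)]    % errors decrease like h and h^2, respectively,
% until roundoff error takes over for very small h (see Section 7.1.4).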
Example 7.1.1 Consider the function
\[ f(x) = \frac{\sin^2\left( \dfrac{\sqrt{x^2 + x}}{\cos x - x} \right)}{\sin\left( \dfrac{\sqrt{x} - 1}{\sqrt{x^2 + 1}} \right)}. \]
Our goal is to compute f'(0.25). Differentiating, using the Quotient Rule and the Chain Rule, we obtain
\[ f'(x) = \frac{2 \sin\left( \dfrac{\sqrt{x^2+x}}{\cos x - x} \right) \cos\left( \dfrac{\sqrt{x^2+x}}{\cos x - x} \right) \left[ \dfrac{2x+1}{2\sqrt{x^2+x}\,(\cos x - x)} + \dfrac{\sqrt{x^2+x}\,(\sin x + 1)}{(\cos x - x)^2} \right]}{\sin\left( \dfrac{\sqrt{x} - 1}{\sqrt{x^2+1}} \right)} - \frac{\sin^2\left( \dfrac{\sqrt{x^2+x}}{\cos x - x} \right) \cos\left( \dfrac{\sqrt{x} - 1}{\sqrt{x^2+1}} \right) \left[ \dfrac{1}{2\sqrt{x}\sqrt{x^2+1}} - \dfrac{x(\sqrt{x}-1)}{(x^2+1)^{3/2}} \right]}{\sin^2\left( \dfrac{\sqrt{x} - 1}{\sqrt{x^2+1}} \right)}. \]

Evaluating this monstrous function at x = 0.25 yields f'(0.25) = -9.066698770.
An alternative approach is to use a centered difference approximation,
\[ f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}. \]
Using this formula with x = 0.25 and h = 0.005, we obtain the approximation
\[ f'(0.25) \approx \frac{f(0.255) - f(0.245)}{0.01} = -9.067464295, \]
which has absolute error 7.7 × 10^{-4}. While this complicated function must be evaluated twice to obtain this approximation, that is still much less work than using differentiation rules to compute f'(x), and then evaluating f'(x), which is much more complicated than f(x). 2

A similar approach can be used to obtain finite difference approximations of f 0 (x0 ) involving
any points of our choosing, and at an arbitrarily high order of accuracy, provided that sufficiently
many points are used.

Exercise 7.1.1 Use Taylor series expansions of f (x0 ± jh), for j = 1, 2, 3, to derive a
finite difference approximation of f 0 (x0 ) that is 6th-order accurate. What is the error
formula?

Exercise 7.1.2 Generalizing the process carried out by hand in Exercise 7.1.1, write a
Matlab function c=makediffrule(p) that takes as input a row vector of indices p and
returns in a vector c the coefficients of a finite-difference approximation of f 0 (x0 ) that
has the form
\[ f'(x_0) \approx \frac{1}{h} \sum_{j=1}^n c_j f(x_0 + p_j h), \]

where n is the length of p.

7.1.2 Lagrange Interpolation


While Taylor’s Theorem can be used to derive formulas with higher-order accuracy simply by
evaluating f (x) at more points, this process can be tedious. An alternative approach is to compute
the derivative of the interpolating polynomial that fits f (x) at these points. Specifically, suppose
we want to compute the derivative at a point x0 using the data

(x−j , y−j ), . . . , (x−1 , y−1 ), (x0 , y0 ), (x1 , y1 ), . . . , (xk , yk ),

where j and k are known nonnegative integers, x−j < x−j+1 < · · · < xk−1 < xk , and yi = f (xi ) for
i = −j, . . . , k. Then, a finite difference formula for f 0 (x0 ) can be obtained by analytically computing
the derivatives of the Lagrange polynomials {Ln,i (x)}ki=−j for these points, where n = j + k, and
the values of these derivatives at x0 are the proper weights for the function values y−j , . . . , yk . If
f (x) is n + 1 times continuously differentiable on [x−j , xk ], then we obtain an approximation of the
form
\[ f'(x_0) = \sum_{i=-j}^k y_i L'_{n,i}(x_0) + \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{i=-j,\, i \ne 0}^k (x_0 - x_i), \quad (7.2) \]

where ξ ∈ [x−j , xk ].

Exercise 7.1.3 Prove (7.2) using the error formula for Lagrange interpolation. Hint:
Use the fact that the unknown point ξ in the error formula is an (unknown) function of
x.

Exercise 7.1.4 Modify your function makediffrule from Exercise 7.1.2 so that it uses
Lagrange interpolation rather than Taylor series expansion. Make it return a second out-
put err which is the constant C such that the error in (7.2) is of the form Chn f (n+1) (ξ),
where n = j + k.
Among the best-known finite difference formulas that can be derived using this approach is the second-order-accurate three-point formula
\[ f'(x_0) = \frac{-3f(x_0) + 4f(x_0 + h) - f(x_0 + 2h)}{2h} + \frac{f'''(\xi)}{3} h^2, \quad \xi \in [x_0, x_0 + 2h], \quad (7.3) \]
which is useful when there is no information available about f(x) for x < x_0. If there is no
which is useful when there is no information available about f (x) for x < x0 . If there is no
information available about f (x) for x > x0 , then we can replace h by −h in the above formula to
obtain a second-order-accurate three-point formula that uses the values of f (x) at x0 , x0 − h and
x0 − 2h.
Another formula is the five-point formula
\[ f'(x_0) = \frac{f(x_0 - 2h) - 8f(x_0 - h) + 8f(x_0 + h) - f(x_0 + 2h)}{12h} + \frac{f^{(5)}(\xi)}{30} h^4, \quad \xi \in [x_0 - 2h, x_0 + 2h], \]
which is fourth-order accurate. The reason it is called a five-point formula, even though it uses the value of f(x) at four points, is that it is derived from the Lagrange polynomials for the five points x_0 - 2h, x_0 - h, x_0, x_0 + h, and x_0 + 2h. However, f(x_0) is not used in the formula because L'_{4,0}(x_0) = 0, where L_{4,0}(x) is the Lagrange polynomial that is equal to one at x_0 and zero at the other four points.
If we do not have any information about f(x) for x < x_0, then we can use the following five-point formula that actually uses the values of f(x) at five points,
\[ f'(x_0) = \frac{-25f(x_0) + 48f(x_0 + h) - 36f(x_0 + 2h) + 16f(x_0 + 3h) - 3f(x_0 + 4h)}{12h} + \frac{f^{(5)}(\xi)}{5} h^4, \]
where ξ ∈ [x0 , x0 + 4h]. As before, we can replace h by −h to obtain a similar formula that
approximates f 0 (x0 ) using the values of f (x) at x0 , x0 − h, x0 − 2h, x0 − 3h, and x0 − 4h.

Exercise 7.1.5 Use (7.2) to derive a general error formula for the approximation of
f 0 (x0 ) in the case where xi = x0 + ih, for i = −j, . . . , k. Use the preceding examples to
check the correctness of your error formula.

Example 7.1.2 We will construct a formula for approximating f 0 (x) at a given point x0 by inter-
polating f (x) at the points x0 , x0 + h, and x0 + 2h using a second-degree polynomial p2 (x), and
then approximating f 0 (x0 ) by p02 (x0 ). Since p2 (x) should be a good approximation of f (x) near x0 ,
especially when h is small, its derivative should be a good approximation to f 0 (x) near this point.
Using Lagrange interpolation, we obtain

p2 (x) = f (x0 )L2,0 (x) + f (x0 + h)L2,1 (x) + f (x0 + 2h)L2,2 (x),

where {L_{2,j}(x)}_{j=0}^2 are the Lagrange polynomials for the points x_0, x_1 = x_0 + h and x_2 = x_0 + 2h. Recall that these polynomials satisfy
\[ L_{2,j}(x_k) = \delta_{jk} = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise.} \end{cases} \]

Using the formula for the Lagrange polynomials,
\[ L_{2,j}(x) = \prod_{i=0,\, i \ne j}^2 \frac{x - x_i}{x_j - x_i}, \]

we obtain
\begin{align*}
L_{2,0}(x) &= \frac{(x - (x_0 + h))(x - (x_0 + 2h))}{(x_0 - (x_0 + h))(x_0 - (x_0 + 2h))} = \frac{x^2 - (2x_0 + 3h)x + (x_0 + h)(x_0 + 2h)}{2h^2}, \\
L_{2,1}(x) &= \frac{(x - x_0)(x - (x_0 + 2h))}{(x_0 + h - x_0)(x_0 + h - (x_0 + 2h))} = \frac{x^2 - (2x_0 + 2h)x + x_0(x_0 + 2h)}{-h^2}, \\
L_{2,2}(x) &= \frac{(x - x_0)(x - (x_0 + h))}{(x_0 + 2h - x_0)(x_0 + 2h - (x_0 + h))} = \frac{x^2 - (2x_0 + h)x + x_0(x_0 + h)}{2h^2}.
\end{align*}

2x − (2x0 + 3h)
L02,0 (x) =
2h2
2x − (2x0 + 2h)
L02,1 (x) = −
h2
2x − (2x0 + h)
L02,2 (x) =
2h2

We conclude that f 0 (x0 ) ≈ p02 (x0 ), where

p02 (x0 ) = f (x0 )L02,0 (x0 ) + f (x0 + h)L02,0 (x0 ) + f (x0 + 2h)L02,0 (x0 )
−3 2 −1
≈ f (x0 ) + f (x0 + h) + f (x0 + 2h)
2h h 2h
3f (x0 ) + 4f (x0 + h) − f (x0 + 2h)
≈ .
2h

From (7.2), it can be shown (see Exercise 7.1.5) that the error in this approximation is O(h2 ), and
that this formula is exact when f (x) is a polynomial of degree 2 or less. The error formula is given
in (7.3). 2

7.1.3 Higher-Order Derivatives


The approaches of combining Taylor series or differentiating Lagrange polynomials to approximate
derivatives can be used to approximate higher-order derivatives. For example, the second derivative
can be approximated using a centered difference formula,

    f''(x_0) \approx \frac{f(x_0 + h) - 2f(x_0) + f(x_0 - h)}{h^2}, \qquad (7.4)
which is second-order accurate.
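A brief Matlab check of (7.4) (a sketch, not from the text):

    % Verify the second-order accuracy of (7.4) for f(x) = cos(x) at x0 = 0.5,
    % where f''(x0) = -cos(0.5) exactly.
    f = @(x) cos(x); x0 = 0.5; exact = -cos(x0);
    for h = [ 0.1 0.05 0.025 ]
        approx = (f(x0+h) - 2*f(x0) + f(x0-h)) / h^2;
        fprintf('h = %5.3f  error = %.3e\n', h, abs(approx - exact));
    end
    % Halving h should reduce the error by a factor of approximately 4.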
Exercise 7.1.6 Use both Taylor series and Lagrange polynomials to derive (7.4). Which
approach best facilitates computing an error formula for this approximation? What is the
error formula?

Exercise 7.1.7 Generalize your function makediffrule from Exercise 7.1.4 so that it
can compute the coefficients of a finite difference approximation to a derivative of a given
order, which is specified as an input argument.

7.1.4 Sensitivity
Based on the error formula for each of these finite difference approximations, one would expect
that it is possible to obtain an approximation that is accurate to within machine precision simply
by choosing h sufficiently small. In the following exercise, we can put this expectation to the test.

Exercise 7.1.8 Use the centered difference formula (7.1) to compute an approximation
of f 0 (x0 ) for f (x) = sin x, x0 = 1.2, and h = 10−d for d = 1, 2, . . . , 15. Compare the
error in each approximation with an upper bound for the error formula given in (7.1).
How does the actual error compare to theoretical expectations?

The reason for the discrepancy observed in Exercise 7.1.8 is that the error formula in (7.1), or any
other finite difference approximation, only accounts for discretization error, not roundoff error.
In a practical implementation of finite difference formulas, it is essential to note that roundoff
error in evaluating f (x) is bounded independently of the spacing h between points at which f (x)
is evaluated. It follows that the roundoff error in the approximation of f 0 (x) actually increases
as h decreases, because the errors incurred by evaluating f (x) are divided by h. Therefore, one
must choose h sufficiently small so that the finite difference formula can produce an accurate
approximation, and sufficiently large so that this approximation is not too contaminated by roundoff
error.
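The tradeoff described above can be visualized with a few lines of Matlab (a sketch, not from the text):

    % Sweep h for the centered difference (7.1) applied to f(x) = sin(x) at x0 = 1.2.
    f = @(x) sin(x); x0 = 1.2; exact = cos(x0);
    h = 10.^(-(1:15))';
    err = abs((f(x0+h) - f(x0-h))./(2*h) - exact);
    loglog(h, err, 'o-'), xlabel('h'), ylabel('error')
    % The error decreases like h^2 until roundoff error, which behaves roughly
    % like eps/h, takes over; the smallest total error occurs near h = eps^(1/3),
    % not at the smallest value of h.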

7.1.5 Differentiation Matrices


It is often necessary to compute derivatives of a function f (x) at a set of points within a given
domain. Suppose that f(x) and f'(x) are represented by vectors f and g, respectively, whose elements are the values of f and f' at N selected points. Then, in view of the linearity of differentiation, f
and g should be related by a linear transformation. That is, g = Df , where D is an N × N matrix.
In this context, D is called a differentiation matrix.

Example 7.1.3 We construct a differentiation matrix for functions defined on [0, 1], and satisfying
the boundary conditions f (0) = f (1) = 0. Let x1 , x2 , . . . , xn be n equally spaced points in (0, 1),
defined by xi = ih, where h = 1/(n + 1). If we use the forward difference approximation, we then
have
    f'(x_1) \approx \frac{f(x_2) - f(x_1)}{h}, \quad
    f'(x_2) \approx \frac{f(x_3) - f(x_2)}{h}, \quad \ldots, \quad
    f'(x_{n-1}) \approx \frac{f(x_n) - f(x_{n-1})}{h}, \quad
    f'(x_n) \approx \frac{0 - f(x_n)}{h}.
Writing these equations in matrix-vector form, we obtain a relation of the form g \approx Df, where

    g = \begin{bmatrix} f'(x_1) \\ f'(x_2) \\ \vdots \\ f'(x_n) \end{bmatrix}, \quad
    f = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix}, \quad
    D = \frac{1}{h} \begin{bmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \\ & & & & -1 \end{bmatrix}.
The entries of D can be determined from the coefficients of each value f (xj ) used to approximate
f 0 (xi ), for i = 1, 2, . . . , n. From the structure of this upper bidiagonal matrix, it follows that we
can approximate f 0 (x) at these grid points by a matrix-vector multiplication which costs only O(n)
floating-point operations.
Now, suppose that we instead impose periodic boundary conditions f (0) = f (1). In this case,
we again use n equally spaced points, but including the left boundary: xi = ih, i = 0, 1, . . . , n − 1,
where h = 1/n. Using forward differencing again, we have the same approximations as before,
except
    f'(x_{n-1}) \approx \frac{f(1) - f(x_{n-1})}{h} = \frac{f(0) - f(x_{n-1})}{h} = \frac{f(x_0) - f(x_{n-1})}{h}.
It follows that the differentiation matrix is

    D = \frac{1}{h} \begin{bmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \\ 1 & & & & -1 \end{bmatrix}.
Note the “wrap-around” effect in which the superdiagonal appears to continue past the last column
into the first column. For this reason, D is an example of what is called a circulant matrix. 2
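The two matrices in Example 7.1.3 can be built in Matlab using sparse storage; the following is a sketch (variable names are illustrative, not from the text):

    n = 8; e = ones(n,1);
    h = 1/(n+1);
    D = spdiags([-e e], [0 1], n, n) / h;   % boundary conditions f(0) = f(1) = 0
    hp = 1/n;
    Dp = spdiags([-e e], [0 1], n, n);      % periodic case: add the wrap-around entry
    Dp(n,1) = 1;
    Dp = Dp / hp;
    % Applying D (or Dp) to a vector of samples of f costs only O(n) operations.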

Exercise 7.1.9 What are the differentiation matrices corresponding to (7.4) for func-
tions defined on [0, 1], for (a) boundary conditions f (0) = f (1) = 0, and (b) periodic
boundary conditions f (0) = f (1)?

7.2 Numerical Integration


Numerous applications call for the computation of the integral of some function f : R → R over an
interval [a, b],
    I[f] = \int_a^b f(x)\,dx.

In some cases, I[f ] can be computed by applying the Fundamental Theorem of Calculus and
computing
I[f ] = F (b) − F (a),

where F (x) is an antiderivative of f , meaning that F 0 (x) = f (x). Unfortunately, this is not practical
if an antiderivative of f is not available. In such cases, numerical techniques must be employed
instead. The basics of integrals are reviewed in Section A.4.

7.2.1 Quadrature Rules


Clearly, if f is a Riemann integrable function and \{R_n\}_{n=1}^{\infty} is any sequence of Riemann sums that
converges to I[f ], then any particular Riemann sum Rn can be viewed as an approximation of I[f ].
However, such an approximation is usually not practical since a large value of n may be necessary
to achieve sufficient accuracy.

Exercise 7.2.1 Write a Matlab script that computes the Riemann sum Rn for
    \int_0^1 x^2\,dx = \frac{1}{3},

where the left endpoint of each subinterval is used to obtain the height of the corresponding
rectangle. How large must n, the number of subintervals, be to obtain an approximate
answer that is accurate to within 10−5 ?
Instead, we use a quadrature rule to approximate I[f ]. A quadrature rule is a sum of the
form
    Q_n[f] = \sum_{i=1}^{n} f(x_i)\, w_i, \qquad (7.5)

where the points xi , i = 1, . . . , n, are called the nodes of the quadrature rule, and the numbers wi ,
i = 1, . . . , n, are the weights. We say that a quadrature rule is open if the nodes do not include
the endpoints a and b, and closed if they do.
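In Matlab, applying a quadrature rule of the form (7.5) amounts to an inner product of the weights with the function values at the nodes; a minimal sketch (the names are illustrative, not from the text) is:

    applyrule = @(f, x, w) dot(w, f(x));    % x, w: column vectors of nodes and weights
    % For example, the Trapezoidal Rule on [0, 1] (see Section 7.3):
    Q = applyrule(@(x) exp(x), [0; 1], [0.5; 0.5]);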
The objective in designing quadrature rules is to achieve sufficient accuracy in approximating
I[f ], for any Riemann integrable function f , while using as few nodes as possible in order to
maximize efficiency. In order to determine suitable nodes and weights, we consider the following
questions:

• For what functions f is I[f ] easy to compute?

• Given a general Riemann integrable function f , can I[f ] be approximated by the integral of
a function g for which I[g] is easy to compute?

7.2.2 Interpolatory Quadrature


One class of functions for which integrals are easily evaluated is the class of polynomial functions.
If we choose n nodes x1 , . . . , xn , then any polynomial pn−1 (x) of degree n − 1 can be written in the
form
    p_{n-1}(x) = \sum_{i=1}^{n} p_{n-1}(x_i)\, L_{n-1,i}(x),

where L_{n-1,i}(x) is the ith Lagrange polynomial for the points x_1, \ldots, x_n. It follows that

    I[p_{n-1}] = \int_a^b p_{n-1}(x)\,dx
               = \int_a^b \sum_{i=1}^{n} p_{n-1}(x_i)\, L_{n-1,i}(x)\,dx
               = \sum_{i=1}^{n} p_{n-1}(x_i) \left[ \int_a^b L_{n-1,i}(x)\,dx \right]
               = \sum_{i=1}^{n} p_{n-1}(x_i)\, w_i
               = Q_n[p_{n-1}],

where

    w_i = \int_a^b L_{n-1,i}(x)\,dx, \quad i = 1, \ldots, n, \qquad (7.6)
are the weights of a quadrature rule with nodes x1 , . . . , xn .
Therefore, any n-point quadrature rule with weights chosen as in (7.6) computes I[f ] exactly
when f is a polynomial of degree less than n. For a more general function f , we can use this
quadrature rule to approximate I[f ] by I[pn−1 ], where pn−1 is the polynomial that interpolates
f at the points x1 , . . . , xn . Quadrature rules that use the weights defined above for given nodes
x1 , . . . , xn are called interpolatory quadrature rules. We say that an interpolatory quadrature
rule has degree of accuracy n if it integrates polynomials of degree n exactly, but is not exact
for polynomials of degree n + 1.
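One way to explore the degree of accuracy of a given rule in Matlab is to test it on monomials of increasing degree; the following function is a sketch (the name and tolerance are illustrative, not from the text):

    function d = degreeofaccuracy(x, w, a, b)
        % Returns the largest k for which the rule with nodes x and weights w
        % integrates x^k exactly (up to rounding error) over [a, b].
        d = -1;
        for k = 0:50
            exact = (b^(k+1) - a^(k+1)) / (k+1);
            approx = dot(w, x.^k);
            if abs(approx - exact) > 1e-8 * max(1, abs(exact)), break, end
            d = k;
        end
    end

Rounding error may end the loop slightly early for high-degree rules, so the result should be treated as an estimate.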
Exercise 7.2.2 Use Matlab’s polynomial functions to write a function
I=polydefint(p,a,b) that computes and returns the definite integral of a polyno-
mial with coefficients stored in the vector p over the interval [a, b].

Exercise 7.2.3 Use your function polydefint from Exercise 7.2.2 to write a function
w=interpweights(x,a,b) that returns a vector of weights w for an interpolatory quadra-
ture rule for the interval [a, b] with nodes stored in the vector x.

Exercise 7.2.4 Use your function interpweights from Exercise 7.2.3 to write a func-
tion I=interpquad(f,a,b,x) that approximates I[f ] over [a, b] using an interpolatory
quadrature rule with nodes stored in the vector x. The input argument f must be a func-
tion handle. Test your function by using it to evaluate the integrals of polynomials of
various degrees, comparing the results to the exact integrals returned by your function
polydefint from Exercise 7.2.2.

7.2.3 Sensitivity
To determine the sensitivity of I[f ], we define the ∞-norm of a function f (x) by

    \|f\|_\infty = \max_{x \in [a,b]} |f(x)|

and let fˆ be a perturbation of f that is also Riemann integrable. Then the absolute condition
number of the problem of computing I[f ] can be approximated by

    \frac{|I[\hat{f}] - I[f]|}{\|\hat{f} - f\|_\infty} = \frac{|I[\hat{f} - f]|}{\|\hat{f} - f\|_\infty}
    \le \frac{I[\,|\hat{f} - f|\,]}{\|\hat{f} - f\|_\infty}
    \le \frac{(b-a)\,\|\hat{f} - f\|_\infty}{\|\hat{f} - f\|_\infty}
    = b - a,

from which it follows that the problem is fairly well-conditioned in most cases. Similarly, pertur-
bations of the endpoints a and b do not lead to large perturbations in I[f ], in most cases.

Exercise 7.2.5 What is the relative condition number of the problem of computing I[f ]?
If the weights wi , i = 1, . . . , n, are nonnegative, then the quadrature rule is stable, as its absolute
condition number can be bounded by (b − a), which is the same absolute condition number as the
underlying integration problem. However, if any of the weights are negative, then the condition
number can be arbitrarily large.

Exercise 7.2.6 Find the absolute condition number of the problem of computing Qn [f ]
for a general quadrature rule of the form (7.5).

7.3 Newton-Cotes Rules


The family of Newton-Cotes quadrature rules consists of interpolatory quadrature rules in which
the nodes are equally spaced points within the interval [a, b]. The most commonly used Newton-
Cotes rules are:
• The Trapezoidal Rule, which is a closed rule with two nodes, is defined by
      \int_a^b f(x)\,dx \approx \frac{b-a}{2}\,[f(a) + f(b)].
It is of degree one, and it is based on the principle that the area under f (x) from x = a to
x = b can be approximated by the area of a trapezoid with heights f (a) and f (b) and width
b − a.
• The Midpoint Rule, which is an open rule with one node, is defined by
      \int_a^b f(x)\,dx \approx (b-a)\, f\!\left(\frac{a+b}{2}\right).

It is of degree one, and it is based on the principle that the area under f (x) can be approxi-
mated by the area of a rectangle with width b − a and height f (m), where m = (a + b)/2 is
the midpoint of the interval [a, b].

• Simpson’s Rule, which is a closed rule with three nodes, is defined by


b    
b−a
Z
a+b
f (x) dx ≈ f (a) + 4f + f (b) .
a 6 2

It is of degree three, and it is derived by computing the integral of the quadratic polynomial
that interpolates f (x) at the points a, (a + b)/2, and b.

Example 7.3.1 Let f(x) = x^3, a = 0 and b = 1. We have

    \int_a^b f(x)\,dx = \int_0^1 x^3\,dx = \left.\frac{x^4}{4}\right|_0^1 = \frac{1}{4}.

Approximating this integral with the Midpoint Rule yields

    \int_0^1 x^3\,dx \approx (1-0)\left(\frac{0+1}{2}\right)^3 = \frac{1}{8}.

Using the Trapezoidal Rule, we obtain

    \int_0^1 x^3\,dx \approx \frac{1-0}{2}\left[ 0^3 + 1^3 \right] = \frac{1}{2}.

Finally, Simpson’s Rule yields


" 3 #
1   

Z
3 1 0 3 0 + 1 3 1 1 1
x dx ≈ 0 +4 +1 = 0+4 +1 = .
0 6 2 6 8 4

That is, the approximation of the integral by Simpson’s Rule is actually exact, which is expected
because Simpson’s Rule is of degree three. On the other hand, if we approximate the integral of
f (x) = x4 from 0 to 1, Simpson’s Rule yields 5/24, while the exact value is 1/5. Still, this is a
better approximation than those obtained using the Midpoint Rule (1/16) or the Trapezoidal Rule
(1/2). 2
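The numbers in Example 7.3.1 can be reproduced with a few lines of Matlab (a sketch, not from the text):

    f = @(x) x.^3; a = 0; b = 1; m = (a+b)/2;
    mid  = (b-a)*f(m)                       % 1/8
    trap = (b-a)/2*(f(a) + f(b))            % 1/2
    simp = (b-a)/6*(f(a) + 4*f(m) + f(b))   % 1/4, exact since the rule has degree 3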

Exercise 7.3.1 Write Matlab functions I=quadmidpoint(f,a,b),


I=quadtrapezoidal(f,a,b) and I=quadsimpsons(f,a,b) that implement the Midpoint
Rule, Trapezoidal Rule and Simpson’s Rule, respectively, to approximate the integral of
f (x), implemented by the function handle f, over the interval [a, b].

Exercise 7.3.2 Use your code from Exercise 7.2.4 to write a function
I=quadnewtoncotes(f,a,b,n) to integrate f (x), implemented by the function handle f,
over [a, b] using an n-node Newton-Cotes rule.

7.3.1 Error Analysis


The error in any interpolatory quadrature rule defined on an interval [a, b], such as a Newton-Cotes
rule or a Clenshaw-Curtis rule can be obtained by computing the integral from a to b of the error
in the polynomial interpolant on which the rule is based.
For the Trapezoidal Rule, which is obtained by integrating a linear polynomial that interpolates
the integrand f (x) at x = a and x = b, this approach to error analysis yields
    \int_a^b f(x)\,dx - \frac{b-a}{2}\,[f(a) + f(b)] = \int_a^b \frac{f''(\xi(x))}{2}\,(x-a)(x-b)\,dx,
where ξ(x) lies in [a, b] for a ≤ x ≤ b. The function (x − a)(x − b) does not change sign on [a, b],
which allows us to apply the Weighted Mean Value Theorem for Integrals and obtain a more useful
expression for the error,

    \int_a^b f(x)\,dx - \frac{b-a}{2}\,[f(a) + f(b)] = \frac{f''(\eta)}{2}\int_a^b (x-a)(x-b)\,dx = -\frac{f''(\eta)}{12}\,(b-a)^3, \qquad (7.7)
where a ≤ η ≤ b. Because the error depends on the second derivative, it follows that the Trapezoidal
Rule is exact for any linear function.
A similar approach can be used to obtain expressions for the error in the Midpoint Rule and
Simpson’s Rule, although the process is somewhat more complicated due to the fact that the
functions (x − m), for the Midpoint Rule, and (x − a)(x − m)(x − b), for Simpson’s Rule, where in
both cases m = (a + b)/2, change sign on [a, b], thus making the Weighted Mean Value Theorem
for Integrals impossible to apply in the same straightforward manner as it was for the Trapezoidal
Rule.
We instead use the following approach, illustrated for the Midpoint Rule and adapted from a
similar proof for Simpson’s Rule from [36]. We assume that f is twice continuously differentiable
on [a, b]. First, we make a change of variable
    x = \frac{a+b}{2} + \frac{b-a}{2}\,t, \quad t \in [-1, 1],
to map the interval [−1, 1] to [a, b], and then define F (t) = f (x(t)). The error in the Midpoint Rule
is then given by
    \int_a^b f(x)\,dx - (b-a)\,f\!\left(\frac{a+b}{2}\right) = \frac{b-a}{2}\left[ \int_{-1}^{1} F(\tau)\,d\tau - 2F(0) \right].

We now define

    G(t) = \int_{-t}^{t} F(\tau)\,d\tau - 2tF(0).

It is easily seen that the error in the Midpoint Rule is \frac{1}{2}(b-a)\,G(1). We then define

H(t) = G(t) − t3 G(1).

Because H(0) = H(1) = 0, it follows from Rolle’s Theorem that there exists a point ξ1 ∈ (0, 1)
such that H 0 (ξ1 ) = 0. However, from

H 0 (0) = G0 (0) = [F (t) + F (−t)]|t=0 − 2F (0) = 2F (0) − 2F (0) = 0,



it follows from Rolle’s Theorem that there exists a point ξ2 ∈ (0, 1) such that H 00 (ξ2 ) = 0.
From
H 00 (t) = G00 (t) − 6tG(1) = F 0 (t) − F 0 (−t) − 6tG(1),
and the Mean Value Theorem, we obtain, for some ξ3 ∈ (−1, 1),

0 = H 00 (ξ2 ) = 2ξ2 F 00 (ξ3 ) − 6ξ2 G(1),

or

    G(1) = \frac{1}{3}\,F''(\xi_3) = \frac{1}{3}\left(\frac{b-a}{2}\right)^2 f''(x(\xi_3)).
Multiplying by (b − a)/2 yields the error in the Midpoint Rule.
The result of the analysis is that for the Midpoint Rule,
    \int_a^b f(x)\,dx - (b-a)\,f\!\left(\frac{a+b}{2}\right) = \frac{f''(\eta)}{24}\,(b-a)^3, \qquad (7.8)

and for Simpson’s Rule,


    \int_a^b f(x)\,dx - \frac{b-a}{6}\left[ f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b) \right] = -\frac{f^{(4)}(\eta)}{90}\left(\frac{b-a}{2}\right)^5, \qquad (7.9)

where, in both cases, η is some point in [a, b].


It follows that the Midpoint Rule is exact for any linear function, just like the Trapezoidal
Rule, even though it uses one less interpolation point, because of the cancellation that results from
choosing the midpoint of [a, b] as the interpolation point. Similar cancellation causes Simpson’s
Rule to be exact for polynomials of degree three or less, even though it is obtained by integrating
a quadratic interpolant over [a, b].

Exercise 7.3.3 Adapt the approach in the preceding derivation of the error formula (7.8)
for the Midpoint Rule to obtain the error formula (7.9) for Simpson’s Rule.
In general, the degree of accuracy of Newton-Cotes rules can easily be determined by expanding
the integrand f (x) in a Taylor series around the midpoint of [a, b], m = (a + b)/2. This technique
can be used to show that n-point Newton-Cotes rules with an odd number of nodes have degree
n, which is surprising since, in general, interpolatory n-point quadrature rules have degree n − 1.
This extra degree of accuracy is due to the cancellation of the high-order error terms in the Taylor
expansion used to determine the error. Such cancellation does not occur with Newton-Cotes rules
that have an even number of nodes.
Exercise 7.3.4 Prove the statement from the preceding paragraph: an n-node Newton-Cotes rule has degree of accuracy n if n is odd, and n − 1 if n is even.

7.3.2 Higher-Order Rules


Unfortunately, Newton-Cotes rules are not practical when the number of nodes is large, due to the
inaccuracy of high-degree polynomial interpolation using equally spaced points. Furthermore, for
n ≥ 11, n-point Newton-Cotes rules have at least one negative weight, and therefore such rules can

be ill-conditioned. This can be seen by revisiting Runge’s Example from Section 5.4, and attempting
to approximate

    \int_{-5}^{5} \frac{1}{1+x^2}\,dx \qquad (7.10)
using a Newton-Cotes rule. As n increases, the approximate integral does not converge to the exact
result; in fact, it increases without bound.
Exercise 7.3.5 What is the smallest value of n for which an n-node Newton-Cotes rule has a negative weight?

Exercise 7.3.6 Use your code from Exercise 7.3.2 to evaluate the integral from (7.10)
for increasing values of n and describe the behavior of the error as n increases.

7.4 Composite Rules


When using a quadrature rule to approximate I[f ] on some interval [a, b], the error is proportional
to hr , where h = b − a and r is some positive integer. Therefore, if the interval [a, b] is large, it is
advisable to divide [a, b] into smaller intervals, use a quadrature rule to compute the integral of f
on each subinterval, and add the results to approximate I[f ]. Such a scheme is called a composite
quadrature rule.
It can be shown that the approximate integral obtained using a composite rule that divides
[a, b] into n subintervals will converge to I[f ] as n → ∞, provided that the maximum width of the
n subintervals approaches zero, and the quadrature rule used on each subinterval has a degree of
at least zero. It should be noted that using closed quadrature rules on each subinterval improves
efficiency, because the nodes on the endpoints of each subinterval, except for a and b, are shared
by two quadrature rules. As a result, fewer function evaluations are necessary, compared to a
composite rule that uses open rules with the same number of nodes.
We will now state some of the most well-known composite quadrature rules. In the following
discussion, we assume that the interval [a, b] is divided into n subintervals of equal width h = (b −
a)/n, and that these subintervals have endpoints [xi−1 , xi ], where xi = a + ih, for i = 0, 1, 2, . . . , n.
Given such a partition of [a, b], we can compute I[f ] using
• the Composite Midpoint Rule
      \int_a^b f(x)\,dx \approx 2h\,[f(x_1) + f(x_3) + \cdots + f(x_{n-1})], \quad n \text{ even}, \qquad (7.11)

• the Composite Trapezoidal Rule


      \int_a^b f(x)\,dx \approx \frac{h}{2}\,[f(x_0) + 2f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-1}) + f(x_n)], \qquad (7.12)
or
• the Composite Simpson’s Rule
      \int_a^b f(x)\,dx \approx \frac{h}{3}\,[f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n)], \qquad (7.13)
for which n is required to be even, as in the Composite Midpoint Rule.
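As an illustration (a sketch, not the solution intended for Exercise 7.4.1; the function name is arbitrary), the Composite Trapezoidal Rule (7.12) can be realized as follows:

    function I = comptrap(f, a, b, n)
        h = (b - a)/n;
        x = a + (0:n)*h;                    % the n+1 equally spaced nodes
        y = f(x);
        I = h/2 * (y(1) + 2*sum(y(2:n)) + y(n+1));
    end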

Exercise 7.4.1 Write Matlab functions I=quadcompmidpt(f,a,b,n),


I=quadcomptrap(f,a,b,n) and I=quadcompsimp(f,a,b,n) that implement the
Composite Midpoint rule (7.11), Composite Trapezoidal Rule (7.12), and Composite
Simpson’s Rule (7.13), respectively, to approximate the integral of f (x), implemented by
the function handle f, over [a, b] with n + 1 nodes x0 , x1 , . . . , xn .

Exercise 7.4.2 Apply your functions from Exercise 7.4.1 to approximate the integrals
    \int_0^1 \sqrt{x}\,dx, \qquad \int_1^2 \sqrt{x}\,dx.

Use different values of n, the number of subintervals. How does the accuracy increase
as n increases? Explain any discrepancy between the observed behavior and theoretical
expectations.

7.4.1 Error Analysis


To obtain the error in each of these composite rules, we can sum the errors in the corresponding
basic rules over the n subintervals. For the Composite Trapezoidal Rule, this yields

    E_{trap} = \int_a^b f(x)\,dx - \frac{h}{2}\left[ f(x_0) + 2\sum_{i=1}^{n-1} f(x_i) + f(x_n) \right]
             = -\sum_{i=1}^{n} \frac{f''(\eta_i)}{12}\,(x_i - x_{i-1})^3
             = -\frac{h^3}{12} \sum_{i=1}^{n} f''(\eta_i)
             = -\frac{h^3}{12}\, n f''(\eta)
             = -\frac{f''(\eta)}{12}\, nh \cdot h^2
             = -\frac{f''(\eta)}{12}\,(b-a)\,h^2, \qquad (7.14)
where, for i = 1, \ldots, n, \eta_i belongs to [x_{i-1}, x_i], and a \le \eta \le b. The replacement of \sum_{i=1}^{n} f''(\eta_i) by n f''(\eta) is justified by the Intermediate Value Theorem, provided that f''(x) is continuous on [a, b].
We see that the Composite Trapezoidal Rule is second-order accurate. Furthermore, its degree of
accuracy, which is the highest degree of polynomial that is guaranteed to be integrated exactly, is
the same as for the basic Trapezoidal Rule, which is one.
Similarly, for the Composite Midpoint Rule, we have
    E_{mid} = \int_a^b f(x)\,dx - 2h \sum_{i=1}^{n/2} f(x_{2i-1}) = \sum_{i=1}^{n/2} \frac{f''(\eta_i)}{24}\,(2h)^3 = \frac{f''(\eta)}{6}\,(b-a)\,h^2.

Although it appears that the Composite Midpoint Rule is less accurate than the Composite Trape-
zoidal Rule, it should be noted that it uses about half as many function evaluations. In other words,

the Basic Midpoint Rule is applied n/2 times, each on a subinterval of width 2h. Rewriting the
Composite Midpoint Rule in such a way that it uses n function evaluations, each on a subinterval
of width h, we obtain
    \int_a^b f(x)\,dx = h \sum_{i=1}^{n} f\!\left( x_{i-1} + \frac{h}{2} \right) + \frac{f''(\eta)}{24}\,(b-a)\,h^2, \qquad (7.15)
which reveals that the Composite Midpoint Rule is generally more accurate.
Finally, for the Composite Simpson’s Rule, we have
    E_{simp} = -\sum_{i=1}^{n/2} \frac{f^{(4)}(\eta_i)}{90}\,h^5 = -\frac{f^{(4)}(\eta)}{180}\,(b-a)\,h^4, \qquad (7.16)

because the Basic Simpson's Rule is applied n/2 times, each on a subinterval of width 2h. We conclude that the Composite Simpson's Rule is fourth-order accurate.

Exercise 7.4.3 Derive the error formula (7.16) for the Composite Simpson’s Rule
(7.13).
Example 7.4.1 We wish to approximate
    \int_0^1 e^x\,dx
using composite quadrature, to 3 decimal places. That is, the error must be less than 10−3 . This
requires choosing the number of subintervals, n, sufficiently large so that an upper bound on the
error is less than 10−3 .
For the Composite Trapezoidal Rule, the error is

    E_{trap} = -\frac{f''(\eta)}{12}\,(b-a)\,h^2 = -\frac{e^\eta}{12n^2},

since f(x) = e^x, a = 0 and b = 1, which yields h = (b − a)/n = 1/n. Since 0 \le \eta \le 1, and e^x is increasing, the factor e^\eta is bounded above by e^1 = e. It follows that |E_{trap}| < 10^{-3} if

    \frac{e}{12n^2} < 10^{-3} \;\Rightarrow\; \frac{1000e}{12} < n^2 \;\Rightarrow\; n > 15.0507.
Therefore, the error will be sufficiently small provided that we choose n ≥ 16.
On the other hand, if we use the Composite Simpson's Rule, the error is

    E_{simp} = -\frac{f^{(4)}(\eta)}{180}\,(b-a)\,h^4 = -\frac{e^\eta}{180n^4}

for some \eta in [0, 1], which is less than 10^{-3} in absolute value if

    n > \left( \frac{1000e}{180} \right)^{1/4} \approx 1.9713,
so n = 2 is sufficient. That is, we can approximate the integral to 3 decimal places by setting
h = (b − a)/n = (1 − 0)/2 = 1/2 and computing
    \int_0^1 e^x\,dx \approx \frac{h}{3}\left[ e^{x_0} + 4e^{x_1} + e^{x_2} \right] = \frac{1/2}{3}\left[ e^0 + 4e^{1/2} + e^1 \right] \approx 1.71886,
whereas the exact value is approximately 1.71828. 2
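A quick Matlab check of the numbers in Example 7.4.1 (a sketch, not from the text):

    f = @(x) exp(x); exact = exp(1) - 1;
    % Composite Simpson's Rule with n = 2 subintervals (h = 1/2):
    Isimp = (1/2)/3*(f(0) + 4*f(1/2) + f(1));
    err_simp = abs(Isimp - exact)           % below 10^-3, as predicted
    % Composite Trapezoidal Rule with n = 16 subintervals:
    n = 16; x = (0:n)/n;
    Itrap = (1/n)*(f(x(1))/2 + sum(f(x(2:n))) + f(x(n+1))/2);
    err_trap = abs(Itrap - exact)           % also below 10^-3, as predicted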

7.5 Gaussian Quadrature


Previously, we learned that a Newton-Cotes quadrature rule with n nodes has degree at most n.
Therefore, it is natural to ask whether it is possible to select the nodes and weights of an n-point
quadrature rule so that the rule has degree greater than n. Gaussian quadrature rules [1] have
the surprising property that they can be used to integrate polynomials of degree 2n − 1 exactly
using only n nodes.

7.5.1 Direct Construction


Gaussian quadrature rules can be constructed using a technique known as moment matching, or
direct construction. For any nonnegative integer k, the k th moment is defined to be
    \mu_k = \int_a^b x^k\,dx.

For given n, our goal is to select weights and nodes so that the first 2n moments are computed
exactly; i.e.,

    \mu_k = \sum_{i=1}^{n} w_i x_i^k, \quad k = 0, 1, \ldots, 2n-1. \qquad (7.17)

Since we have 2n free parameters, it is reasonable to think that appropriate nodes and weights can
be found. Unfortunately, this system of equations is nonlinear, so it can be quite difficult to solve.

Exercise 7.5.1 Use direct construction to solve the equations (7.17) for the case of n = 2
on the interval (a, b) = (−1, 1) for the nodes x1 , x2 and weights w1 , w2 .

7.5.2 Orthogonal Polynomials


Suppose g(x) is a polynomial of degree at most 2n − 1. For convenience, we will write g ∈ P2n−1 ,
where, for any natural number k, Pk denotes the space of polynomials of degree at most k. We
shall show that there exist nodes \{x_i\}_{i=1}^{n} and weights \{w_i\}_{i=1}^{n} such that

    I[g] = \int_a^b g(x)\,dx = \sum_{i=1}^{n} w_i\, g(x_i).

Furthermore, for more general functions, G(x),


    \int_a^b G(x)\,dx = \sum_{i=1}^{n} w_i\, G(x_i) + E[G]

where

1. xi are real, distinct, and a < xi < b for i = 1, 2, . . . , n.

2. The weights {wi } satisfy wi > 0 for i = 1, 2, . . . , n.


3. The error E[G] satisfies

       E[G] = \frac{G^{(2n)}(\xi)}{(2n)!} \int_a^b \prod_{i=1}^{n} (x - x_i)^2\,dx.

Notice that this method is exact for polynomials of degree 2n − 1 since the error functional E[G]
depends on the (2n)th derivative of G.
To prove this, we shall construct an orthonormal family of polynomials {qi (x)}ni=0 , as in Section
6.2, so that
    \langle q_r, q_s \rangle = \int_a^b q_r(x)\, q_s(x)\,dx = \begin{cases} 0 & r \neq s, \\ 1 & r = s. \end{cases}
Recall that this can be accomplished using the fact that such a family of polynomials satisfies a
three-term recurrence relation

βj qj (x) = (x − αj )qj−1 (x) − βj−1 qj−2 (x), q0 (x) = (b − a)−1/2 , q−1 (x) = 0,

where
    \alpha_j = \langle q_{j-1}, x q_{j-1} \rangle = \int_a^b x\, q_{j-1}(x)^2\,dx, \qquad
    \beta_j^2 = \langle q_j, x q_{j-1} \rangle = \int_a^b x\, q_j(x)\, q_{j-1}(x)\,dx, \quad j \ge 1,

with β02 = b − a.
We choose the nodes {xi } to be the roots of the nth -degree polynomial in this family, which are
real, distinct and lie within (a, b), as proved in Section 6.2.6. Next, we construct the interpolant of
degree n − 1, denoted pn−1 (x), of g(x) through the nodes:
    p_{n-1}(x) = \sum_{i=1}^{n} g(x_i)\, L_{n-1,i}(x),

where, for i = 1, . . . , n, Ln−1,i (x) is the ith Lagrange polynomial for the points x1 , . . . , xn . We shall
now look at the interpolation error function

e(x) = g(x) − pn−1 (x).

Clearly, since g ∈ P2n−1 , e ∈ P2n−1 . Since e(x) has roots at each of the roots of qn (x), we can
factor e so that
e(x) = qn (x)r(x),
where r ∈ Pn−1 . It follows from the fact that qn (x) is orthogonal to any polynomial in Pn−1 that
the integral of g can then be written as
    I[g] = \int_a^b p_{n-1}(x)\,dx + \int_a^b q_n(x)\, r(x)\,dx
         = \int_a^b p_{n-1}(x)\,dx
         = \int_a^b \sum_{i=1}^{n} g(x_i)\, L_{n-1,i}(x)\,dx
         = \sum_{i=1}^{n} g(x_i) \int_a^b L_{n-1,i}(x)\,dx
         = \sum_{i=1}^{n} g(x_i)\, w_i,

where

    w_i = \int_a^b L_{n-1,i}(x)\,dx, \quad i = 1, 2, \ldots, n.

For a more general function G(x), the error functional E[G] can be obtained from the expression
for Hermite interpolation error presented in Section 5.5.1, as we will now investigate.

7.5.3 Error Analysis


It is easy to show that the weights w_i are positive. Since the interpolation basis functions L_{n-1,i} belong to P_{n-1}, it follows that L_{n-1,i}^2 \in P_{2n-2}, and therefore

    0 < \int_a^b L_{n-1,i}(x)^2\,dx = \sum_{j=1}^{n} w_j\, L_{n-1,i}(x_j)^2 = w_i.

Note that we have thus obtained an alternative formula for the weights.
This formula also arises from an alternative approach to constructing Gaussian quadrature
rules, from which a representation of the error can easily be obtained. We construct the Hermite
interpolating polynomial G2n−1 (x) of G(x), using the Gaussian quadrature nodes as interpolation
points, that satisfies the 2n conditions

    G_{2n-1}(x_i) = G(x_i), \quad G'_{2n-1}(x_i) = G'(x_i), \quad i = 1, 2, \ldots, n.

We recall from Section 5.5.1 that this interpolant has the form
    G_{2n-1}(x) = \sum_{i=1}^{n} G(x_i)\, H_i(x) + \sum_{i=1}^{n} G'(x_i)\, K_i(x),

where, as in our previous discussion of Hermite interpolation,

    H_i(x_j) = \delta_{ij}, \quad H_i'(x_j) = 0, \quad K_i(x_j) = 0, \quad K_i'(x_j) = \delta_{ij}, \quad i, j = 1, 2, \ldots, n.

Then, we have
    \int_a^b G_{2n-1}(x)\,dx = \sum_{i=1}^{n} G(x_i) \int_a^b H_i(x)\,dx + \sum_{i=1}^{n} G'(x_i) \int_a^b K_i(x)\,dx.

We recall from Section 5.5.1 that

    H_i(x) = L_{n-1,i}(x)^2\,[1 - 2L'_{n-1,i}(x_i)(x - x_i)], \qquad K_i(x) = L_{n-1,i}(x)^2\,(x - x_i), \quad i = 1, 2, \ldots, n,

and for convenience, we define

πn (x) = (x − x1 )(x − x2 ) · · · (x − xn ),

and note that


    L_{n-1,i}(x) = \frac{\pi_n(x)}{(x - x_i)\,\pi_n'(x_i)}, \quad i = 1, 2, \ldots, n.

We then have
    \int_a^b H_i(x)\,dx = \int_a^b L_{n-1,i}(x)^2\,dx - 2L'_{n-1,i}(x_i) \int_a^b L_{n-1,i}(x)^2\,(x - x_i)\,dx
                        = \int_a^b L_{n-1,i}(x)^2\,dx - \frac{2L'_{n-1,i}(x_i)}{\pi_n'(x_i)} \int_a^b L_{n-1,i}(x)\,\pi_n(x)\,dx
                        = \int_a^b L_{n-1,i}(x)^2\,dx,

as the second term vanishes because Ln−1,i (x) is of degree n − 1, and πn (x), a polynomial of degree
n, is orthogonal to all polynomials of lesser degree.
Similarly,
    \int_a^b K_i(x)\,dx = \int_a^b L_{n-1,i}(x)^2\,(x - x_i)\,dx = \frac{1}{\pi_n'(x_i)} \int_a^b L_{n-1,i}(x)\,\pi_n(x)\,dx = 0.
We conclude that

    \int_a^b G_{2n-1}(x)\,dx = \sum_{i=1}^{n} G(x_i)\, w_i,
where, as before,
    w_i = \int_a^b L_{n-1,i}(x)^2\,dx = \int_a^b L_{n-1,i}(x)\,dx, \quad i = 1, 2, \ldots, n.

The equivalence of these formulas for the weights can be seen from the fact that the difference
Ln−1,i (x)2 − Ln−1,i (x) is a polynomial of degree 2n − 2 that is divisible by πn (x), because it
vanishes at all of the nodes. The quotient, a polynomial of degree n − 2, is orthogonal to πn (x).
Therefore, the integrals of Ln−1,i (x)2 and Ln−1,i (x) must be equal.
We now use the error in the Hermite interpolating polynomial to obtain
    E[G] = \int_a^b G(x)\,dx - \sum_{i=1}^{n} G(x_i)\, w_i
         = \int_a^b \left[ G(x) - G_{2n-1}(x) \right] dx
         = \int_a^b \frac{G^{(2n)}(\xi(x))}{(2n)!}\,\pi_n(x)^2\,dx
         = \frac{G^{(2n)}(\xi)}{(2n)!} \int_a^b \prod_{i=1}^{n} (x - x_i)^2\,dx,

where ξ ∈ (a, b). The last step is obtained using the Weighted Mean Value Theorem for Integrals,
which applies because πn (x)2 does not change sign.
In addition to this error formula, we can easily obtain qualitative bounds on the error. For
instance, if we know that the even derivatives of g are positive, then we know that the quadrature
rule yields a lower bound for I[g]. Similarly, if the even derivatives of g are negative, then the
quadrature rule gives an upper bound.

Finally, it can be shown that as n → ∞, the n-node Gaussian quadrature approximation of I[f ]
converges to I[f ]. The key to the proof is the fact that the weights are guaranteed to be positive,
and therefore the sum of the weights is always equal to b − a. Such a result does not hold for n-node
Newton-Cotes quadrature rules, because the sum of the absolute values of the weights cannot be
bounded, due to the presence of negative weights.

Example 7.5.1 We will use Gaussian quadrature to approximate the integral


    \int_0^1 e^{-x^2}\,dx.

The particular Gaussian quadrature rule that we will use consists of 5 nodes x1 , x2 , x3 , x4 and x5 ,
and 5 weights w1 , w2 , w3 , w4 and w5 . To determine the proper nodes and weights, we use the fact
that the nodes and weights of a 5-point Gaussian rule for integrating over the interval [−1, 1] are
given by
i Nodes r5,i Weights c5,i
1 0.9061798459 0.2369268850
2 0.5384693101 0.4786286705
3 0.0000000000 0.5688888889
4 −0.5384693101 0.4786286705
5 −0.9061798459 0.2369268850
To obtain the corresponding nodes and weights for integrating over [0, 1], we can use the fact that
in general,
    \int_a^b f(x)\,dx = \int_{-1}^{1} f\!\left( \frac{b-a}{2}\,t + \frac{a+b}{2} \right) \frac{b-a}{2}\,dt,
as can be shown using the change of variable x = [(b − a)/2]t + (a + b)/2 that maps [a, b] into [−1, 1].
We then have
    \int_a^b f(x)\,dx = \int_{-1}^{1} f\!\left( \frac{b-a}{2}\,t + \frac{a+b}{2} \right) \frac{b-a}{2}\,dt
                      \approx \sum_{i=1}^{5} f\!\left( \frac{b-a}{2}\,r_{5,i} + \frac{a+b}{2} \right) \frac{b-a}{2}\,c_{5,i}
                      \approx \sum_{i=1}^{5} f(x_i)\, w_i,

where
    x_i = \frac{b-a}{2}\,r_{5,i} + \frac{a+b}{2}, \qquad w_i = \frac{b-a}{2}\,c_{5,i}, \quad i = 1, \ldots, 5.
In this example, a = 0 and b = 1, so the nodes and weights for a 5-point Gaussian quadrature rule
for integrating over [0, 1] are given by
    x_i = \frac{1}{2}\,r_{5,i} + \frac{1}{2}, \qquad w_i = \frac{1}{2}\,c_{5,i}, \quad i = 1, \ldots, 5,
2 2 2
which yields

i Nodes xi Weights wi
1 0.95308992295 0.11846344250
2 0.76923465505 0.23931433525
3 0.50000000000 0.28444444444
4 0.23076534495 0.23931433525
5 0.04691007705 0.11846344250

It follows that

    \int_0^1 e^{-x^2}\,dx \approx \sum_{i=1}^{5} e^{-x_i^2}\, w_i
        \approx 0.11846344250\,e^{-0.95308992295^2} + 0.23931433525\,e^{-0.76923465505^2} +
                0.28444444444\,e^{-0.5^2} + 0.23931433525\,e^{-0.23076534495^2} +
                0.11846344250\,e^{-0.04691007705^2}
        \approx 0.74682412673352.

Since the exact value is 0.74682413281243, the absolute error is 6.08 × 10^{-9}, which is remarkably accurate considering that only five nodes are used. 2
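The computation in Example 7.5.1 can be reproduced in Matlab as follows (a sketch, not from the text):

    r = [ 0.9061798459; 0.5384693101; 0; -0.5384693101; -0.9061798459 ];
    c = [ 0.2369268850; 0.4786286705; 0.5688888889; 0.4786286705; 0.2369268850 ];
    a = 0; b = 1; f = @(x) exp(-x.^2);
    x = (b-a)/2*r + (a+b)/2;    % transformed nodes
    w = (b-a)/2*c;              % transformed weights
    I = dot(w, f(x))            % approximately 0.74682412673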

The high degree of accuracy of Gaussian quadrature rules makes them the most commonly used
rules in practice. However, they are not without their drawbacks:

• They are not progressive, so the nodes must be recomputed whenever additional degrees of
accuracy are desired. An alternative is to use Gauss-Kronrod rules [22]. A (2n + 1)-point
Gauss-Kronrod rule uses the nodes of the n-point Gaussian rule. For this reason, practical
quadrature procedures use both the Gaussian rule and the corresponding Gauss-Kronrod rule
to estimate accuracy.

• Because the nodes are the roots of a polynomial, they must be computed using traditional
root-finding methods, which are not always accurate. Errors in the computed nodes lead
to lost degrees of accuracy in the approximate integral. In practice, however, this does not
normally cause significant difficulty.

Exercise 7.5.2 Write a Matlab function I=gaussquadrule(f,a,b,n) that approximates the integral of f(x), implemented by the function handle f, over [a, b] with an n-node
Gaussian quadrature rule. Use your function interpquad from Exercise 7.2.4 as well
as your function makelegendre from Section 6.2. Test your function by comparing its
output to that of the built-in function integral. How does its accuracy compare to that
of your function quadnewtoncotes from Exercise 7.3.2?

Exercise 7.5.3 A 5-node Gaussian quadrature rule is exact for the integrand f(x) = x^8, while a 5-node Newton-Cotes rule is not. How important is it that the Gaussian quadrature nodes be computed with high accuracy? Investigate this by approximating \int_{-1}^{1} x^8\,dx using a 5-node interpolatory quadrature rule with nodes

xi = θx̃i + (1 − θ)x̂i ,

where {x̃i }5i=1 and {x̂i }5i=1 are the nodes for a 5-node Gaussian and Newton-Cotes rule,
respectively, and θ ∈ [0, 1]. Use your function interpquad from Exercise 7.2.4 and let θ
vary from 0 to 1. How does the error behave as θ increases?

7.5.4 Other Weight Functions

In Section 6.2 we learned how to construct sequences of orthogonal polynomials for the inner
product

    \langle f, g \rangle = \int_a^b f(x)\, g(x)\, w(x)\,dx,

where f and g are real-valued functions on (a, b) and w(x) is a weight function satisfying w(x) > 0 on
(a, b). These orthogonal polynomials can be used to construct Gauss quadrature rules for integrals
with weight functions, in a similar manner to how they were constructed earlier in this section for
the case w(x) ≡ 1.

Exercise 7.5.4 Let w(x) be a weight function satisfying w(x) > 0 in (a, b). Derive a
Gauss quadrature rule of the form
    \int_a^b f(x)\, w(x)\,dx = \sum_{i=1}^{n} f(x_i)\, w_i + E[f]

that is exact for f ∈ P2n−1 . What is the error functional E[f ]?

A case of particular interest is the interval (−1, 1) with the weight function w(x) = (1 − x2 )−1/2 ,
as the orthogonal polynomials for the corresponding inner product are the Chebyshev polynomials,
introduced in Section 5.4.2. Unlike other Gauss quadrature rules, there are simple formulas for the
nodes and weights in this case.

Exercise 7.5.5 Use trigonometric identities to prove that

    \cos\theta + \cos 3\theta + \cos 5\theta + \cdots + \cos(2n-1)\theta = \frac{\sin 2n\theta}{2\sin\theta},
when θ is not an integer multiple of π.

Exercise 7.5.6 Use the result of Exercise 7.5.5 and direct construction to derive the
nodes and weights for an n-node Gaussian quadrature rule of the form
    \int_{-1}^{1} (1 - x^2)^{-1/2} f(x)\,dx = \sum_{i=1}^{n} f(x_i)\, w_i,

that is exact when f ∈ P2n−1 .

7.5.5 Prescribing Nodes


We have seen that for an integrand f (x) that has even derivatives that do not change sign on (a, b),
it can be determined that the Gauss quadrature approximation of I[f ] is either an upper bound or
a lower bound. By prescribing either or both of the endpoints x = a or x = b as quadrature nodes,
we can obtain additional bounds and bracket the exact value of I[f ]. However, it is important to
prescribe such nodes in a manner that, as much as possible, maintains the high degree of accuracy
of the quadrature rule.
A Gauss quadrature rule with n + 1 nodes is exact for any integrand in P2n+1 . Our goal is to
construct a quadrature rule with n + 1 nodes, one of which is at x = a, that computes
    I[f] = \int_a^b f(x)\, w(x)\,dx,

for a given weight function w(x), exactly when f ∈ P2n . That is, prescribing a node reduces the
degree of accuracy by only one. Such a quadrature rule is called a Gauss-Radau quadrature
rule [1].
We begin by dividing f (x) by (x − a), which yields

f (x) = (x − a)q2n−1 (x) + f (a).

We then construct an n-node Gauss quadrature rule


    \int_a^b g(x)\, w^*(x)\,dx = \sum_{i=1}^{n} g(x_i^*)\, w_i^* + E[g]

for the weight function w∗ (x) = (x − a)w(x). It is clear that w∗ (x) > 0 on (a, b). Because this rule
is exact for g ∈ P2n−1 , we have
    I[f] = \int_a^b q_{2n-1}(x)\, w^*(x)\,dx + f(a) \int_a^b w(x)\,dx
         = \sum_{i=1}^{n} q_{2n-1}(x_i^*)\, w_i^* + f(a) \int_a^b w(x)\,dx
         = \sum_{i=1}^{n} \frac{f(x_i^*) - f(a)}{x_i^* - a}\, w_i^* + f(a) \int_a^b w(x)\,dx
         = \sum_{i=1}^{n} f(x_i^*)\, w_i + f(a)\left[ \int_a^b w(x)\,dx - \sum_{i=1}^{n} w_i \right],

where w_i = w_i^*/(x_i^* - a) for i = 1, 2, \ldots, n. By defining

    w_0 = \int_a^b w(x)\,dx - \sum_{i=1}^{n} w_i,

and defining the nodes

    x_0 = a, \quad x_i = x_i^*, \quad i = 1, 2, \ldots, n,

we obtain a quadrature rule

    I[f] = \sum_{i=0}^{n} f(x_i)\, w_i + E[f]

that is exact for f \in P_{2n}. Clearly the weights w_1, w_2, \ldots, w_n are positive; it can be shown that w_0 > 0 by noting that it is the error in the Gauss quadrature approximation of

    I^*\!\left[ \frac{1}{x-a} \right] = \int_a^b \frac{1}{x-a}\, w^*(x)\,dx.
It can also be shown that if the integrand f (x) is sufficiently differentiable and satisfies f (2n) > 0
on (a, b), then this Gauss-Radau rule yields a lower bound for I[f ].

Exercise 7.5.7 Following the discussion above, derive a Gauss-Radau quadrature rule in
which a node is prescribed at x = b. Prove that the weights w1 , w2 , . . . , wn+1 are positive.
Does this rule yield an upper bound or a lower bound for the integral of f(x) = 1/(x − a)?

Exercise 7.5.8 Derive formulas for the nodes and weights for a Gauss-Lobatto
quadrature rule [1], in which nodes are prescribed at x = a and x = b. Specifically, the
rule must have n + 1 nodes x0 = a < x1 < x2 < · · · < xn−1 < xn = b. Prove that the
weights w0 , w1 , . . . , wn are positive. What is the degree of accuracy of this rule?

Exercise 7.5.9 Explain why developing a Gauss-Radau rule by prescribing a node at


x = c, where c ∈ (a, b), is problematic.

7.6 Extrapolation to the Limit


We have seen that the accuracy of methods for computing integrals or derivatives of a function
f (x) depends on the spacing between points at which f is evaluated, and that the approximation
tends to the exact value as this spacing tends to 0.
Suppose that a uniform spacing h is used. We denote by F (h) the approximation computed
using the spacing h, from which it follows that the exact value is given by F (0). Let p be the order
of accuracy in our approximation; that is,
    F(h) = a_0 + a_1 h^p + O(h^r), \quad r > p, \qquad (7.18)
where a_0 is the exact value F(0). Then, if we choose a value for h and compute F(h) and F(h/q) for some positive integer q, we can neglect the O(h^r) terms and solve a system of two equations for the unknowns a_0 and a_1, thus obtaining an approximation that is rth-order accurate. If we
for the unknowns a0 and a1 , thus obtaining an approximation that is rth order accurate. If we
can describe the error in this approximation in the same way that we can describe the error in our
original approximation F (h), we can repeat this process to obtain an approximation that is even
more accurate.

7.6.1 Richardson Extrapolation


This process of extrapolating from F (h) and F (h/q) to approximate F (0) with a higher order
of accuracy is called Richardson extrapolation [31]. In a sense, Richardson extrapolation is
similar in spirit to Aitken’s ∆2 method (see Section 8.5), as both methods use assumptions about
the convergence of a sequence of approximations to “solve” for the exact solution, resulting in a
more accurate method of computing approximations.

Example 7.6.1 Consider the function

    f(x) = \frac{\sin\!\left( \sqrt{\dfrac{x^2 + x}{2\cos x - x}} \right)}{\sin\!\left( \dfrac{x\sqrt{x} - 1}{\sqrt{x^2 + 1}} \right)}.

Our goal is to compute f 0 (0.25) as accurately as possible. Using a centered difference approximation,

    f'(x) = \frac{f(x+h) - f(x-h)}{2h} + O(h^2),
with x = 0.25 and h = 0.01, we obtain the approximation
    f'(0.25) \approx \frac{f(0.26) - f(0.24)}{0.02} = -9.06975297890147,
0.02
which has absolute error 3.0 × 10−3 , and if we use h = 0.005, we obtain the approximation
    f'(0.25) \approx \frac{f(0.255) - f(0.245)}{0.01} = -9.06746429492149,
0.01
which has absolute error 7.7 × 10−4 . As expected, the error decreases by a factor of approximately
4 when we halve the step size h, because the error in the centered difference formula is of O(h2 ).
We can obtain a more accurate approximation by applying Richardson Extrapolation to these ap-
proximations. We define the function N1 (h) to be the centered difference approximation to f 0 (0.25)
obtained using the step size h. Then, with h = 0.01, we have

N1 (h) = −9.06975297890147, N1 (h/2) = −9.06746429492149,

and the exact value is given by N1 (0) = −9.06669877124279. Because the error in the centered
difference approximation satisfies

N1 (h) = N1 (0) + K1 h2 + K2 h4 + K3 h6 + O(h8 ), (7.19)

where the constants K1 , K2 and K3 depend on the derivatives of f (x) at x = 0.25, it follows that
the new approximation
    N_2(h) = N_1(h/2) + \frac{N_1(h/2) - N_1(h)}{2^2 - 1} = -9.06670140026149,
has fourth-order accuracy. Specifically, if we denote the exact value by N2 (0), we have

N2 (h) = N2 (0) + K̃2 h4 + K̃3 h6 + O(h8 ),



where the constants K̃2 and K̃3 are independent of h.


Now, suppose that we compute
    N_1(h/4) = \frac{f(x + h/4) - f(x - h/4)}{2(h/4)} = \frac{f(0.2525) - f(0.2475)}{0.005} = -9.06689027527046,
which has an absolute error of 1.9 × 10−4 , we can use extrapolation again to obtain a second fourth-
order accurate approximation,
    N_2(h/2) = N_1(h/4) + \frac{N_1(h/4) - N_1(h/2)}{3} = -9.06669893538678,
which has absolute error of 1.7 × 10−7 . It follows from the form of the error in N2 (h) that we can
use extrapolation on N2 (h) and N2 (h/2) to obtain a sixth-order accurate approximation,
    N_3(h) = N_2(h/2) + \frac{N_2(h/2) - N_2(h)}{2^4 - 1} = -9.06669877106180,
24 − 1
which has an absolute error of 1.8 × 10−10 . 2
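The extrapolation table used in this example can be generated systematically; the following Matlab sketch (not from the text) applies it to a simpler function, f(x) = e^{sin x}, whose derivative at x_0 = 0.25 is known exactly:

    f = @(x) exp(sin(x)); x0 = 0.25; exact = cos(x0)*exp(sin(x0));
    h = 0.1; levels = 4;
    N = zeros(levels);
    for j = 1:levels
        hj = h / 2^(j-1);
        N(j,1) = (f(x0+hj) - f(x0-hj)) / (2*hj);    % centered difference
        for k = 2:j
            N(j,k) = N(j,k-1) + (N(j,k-1) - N(j-1,k-1)) / (4^(k-1) - 1);
        end
    end
    disp(abs(diag(N) - exact))   % the k-th diagonal entry is accurate to O(h^(2k))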

Exercise 7.6.1 Use Taylor series expansion to prove (7.19); that is, the error in the
centered difference approximation can be expressed as a sum of terms involving only even
powers of h.

Exercise 7.6.2 Based on the preceding example, give a general formula for Richardson
extrapolation, applied to the approximation F (h) from (7.18), that uses F (h) and F (h/d),
for some integer d > 1, to obtain an approximation of F (0) that is of order r.

7.6.2 The Euler-Maclaurin Expansion


In the previous example, it was stated, without proof, that the error in the centered difference
approximation could be expressed as a sum of terms involving even powers of the spacing h.
We would like to use Richardson Extrapolation to enhance the accuracy of approximate integrals
computed using the Composite Trapezoidal Rule, but first we must determine the form of the error
in these approximations. We have established that the Composite Trapezoidal Rule is second-order
accurate, but if Richardson Extrapolation is used once to eliminate the O(h2 ) portion of the error,
we do not know the order of what remains.
Suppose that g(t) is differentiable on (−1, 1). From integration by parts, we have
    \int_{-1}^{1} g(t)\,dt = t\,g(t)\Big|_{-1}^{1} - \int_{-1}^{1} t\,g'(t)\,dt = [g(-1) + g(1)] - \int_{-1}^{1} t\,g'(t)\,dt.
The first term on the right side of the equals sign is the basic Trapezoidal Rule approximation of
the integral on the left side of the equals sign. The second term on the right side is the error in
this approximation. If g is 2k-times differentiable on (−1, 1), and we repeatedly apply integration
by parts, 2k − 1 times, we obtain
    \int_{-1}^{1} g(t)\,dt - [g(-1) + g(1)] = \left[ q_2(t)\,g'(t) - q_3(t)\,g''(t) + \cdots + q_{2k}(t)\,g^{(2k-1)}(t) \right]_{-1}^{1} - \int_{-1}^{1} q_{2k}(t)\,g^{(2k)}(t)\,dt,

where the sequence of polynomials q1 (t), . . . , q2k (t) satisfy


    q_1(t) = -t, \qquad q_{r+1}'(t) = q_r(t), \quad r = 1, 2, \ldots, 2k - 1.
If we choose the constants of integration correctly, then, because q1 (t) is an odd function, we can
ensure that qr (t) is an odd function if r is odd, and an even function if r is even. This ensures that
q2r (−1) = q2r (1). Furthermore, we can ensure that qr (−1) = qr (1) = 0 if r is odd. This causes the
boundary terms involving qr (t) to vanish when r is odd, which yields
    \int_{-1}^{1} g(t)\,dt - [g(-1) + g(1)] = \sum_{r=1}^{k} q_{2r}(1)\left[ g^{(2r-1)}(1) - g^{(2r-1)}(-1) \right] - \int_{-1}^{1} q_{2k}(t)\,g^{(2k)}(t)\,dt.

Using this expression for the error in the context of the Composite Trapezoidal Rule, applied
to the integral of a 2k-times differentiable function f (x) on a general interval [a, b], yields the
Euler-Maclaurin Expansion
    \int_a^b f(x)\,dx = \frac{h}{2}\left[ f(a) + 2\sum_{i=1}^{n-1} f(x_i) + f(b) \right]
        + \sum_{r=1}^{k} c_r h^{2r}\left[ f^{(2r-1)}(b) - f^{(2r-1)}(a) \right]
        - \left( \frac{h}{2} \right)^{2k} \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} q_{2k}(t)\, f^{(2k)}(x)\,dx,

where, for each i = 1, 2, \ldots, n, t = -1 + \frac{2}{h}(x - x_{i-1}), and the constants

    c_r = \frac{q_{2r}(1)}{2^{2r}} = -\frac{B_{2r}}{(2r)!}, \quad r = 1, 2, \ldots, k
are closely related to the Bernoulli numbers Br .
It can be seen from this expansion that the error Etrap (h) in the Composite Trapezoidal Rule,
like the error in the centered difference approximation of the derivative, has the form
Etrap (h) = K1 h2 + K2 h4 + K3 h6 + · · · + O(h2k ),
where the constants Ki are independent of h, provided that the integrand is at least 2k times
continuously differentiable. This knowledge of the error provides guidance on how Richardson Ex-
trapolation can be repeatedly applied to approximations obtained using the Composite Trapezoidal
Rule at different spacings in order to obtain higher-order accurate approximations.
It can also be seen from the Euler-Maclaurin Expansion that the Composite Trapezoidal Rule
is particularly accurate when the integrand is a periodic function, of period b − a, as this causes
the terms involving the derivatives of the integrand at a and b to vanish. Specifically, if f (x) is
periodic with period b − a, and is at least 2k times continuously differentiable, then the error in the Composite Trapezoidal Rule approximation to \int_a^b f(x)\,dx, with spacing h, is O(h^{2k}), rather than
O(h2 ). It follows that if f (x) is infinitely differentiable, such as a finite linear combination of sines
or cosines, then the Composite Trapezoidal Rule has an exponential order of accuracy, meaning
that as h → 0, the error converges to zero more rapidly than any power of h.
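This behavior is easy to observe in Matlab; the following sketch (not from the text) uses the smooth periodic integrand f(x) = e^{cos 2πx} on [0, 1], whose exact integral equals the modified Bessel function value I_0(1):

    f = @(x) exp(cos(2*pi*x));
    exact = besseli(0, 1);                  % exact value of the integral
    for n = [4 8 16 32]
        x = (0:n)/n;
        T = (1/n)*(f(x(1))/2 + sum(f(x(2:n))) + f(x(n+1))/2);
        fprintf('n = %2d  error = %.2e\n', n, abs(T - exact));
    end
    % The error falls to roundoff level almost immediately, much faster than O(h^2).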
Exercise 7.6.3 Use integration by parts to obtain explicit formulas for the polynomials
qr (t) for r = 2, 3, 4.

Exercise 7.6.4 Use the Composite Trapezoidal Rule to integrate f (x) = sin kπx, where
k is an integer, over [0, 1]. How does the error behave?

7.6.3 Romberg Integration


Richardson extrapolation is not only used to compute more accurate approximations of derivatives,
but is also used as the foundation of a numerical integration scheme called Romberg integration
[32]. In this scheme, the integral
    I[f] = \int_a^b f(x)\,dx

is approximated using the Composite Trapezoidal Rule with step sizes hk = (b − a)2−k , where k is
a nonnegative integer. Then, for each k, Richardson extrapolation is used k − 1 times to previously
computed approximations in order to improve the order of accuracy as much as possible.
More precisely, suppose that we compute approximations T1,1 and T2,1 to the integral, using
the Composite Trapezoidal Rule with one and two subintervals, respectively. That is,

    T_{1,1} = \frac{b-a}{2}\,[f(a) + f(b)], \qquad
    T_{2,1} = \frac{b-a}{4}\left[ f(a) + 2f\!\left( \frac{a+b}{2} \right) + f(b) \right].

Suppose that f has continuous derivatives of all orders on [a, b]. Then, the Composite Trapezoidal
Rule, for a general number of subintervals n, satisfies
 
    \int_a^b f(x)\,dx = \frac{h}{2}\left[ f(a) + 2\sum_{j=1}^{n-1} f(x_j) + f(b) \right] + \sum_{i=1}^{\infty} K_i h^{2i},

where h = (b − a)/n, xj = a + jh, and the constants {Ki }∞ i=1 depend only on the derivatives of f .
It follows that we can use Richardson Extrapolation to compute an approximation with a higher
order of accuracy. If we denote the exact value of the integral by I[f ] then we have

    T_{1,1} = I[f] + K_1 h^2 + O(h^4), \qquad
    T_{2,1} = I[f] + K_1 (h/2)^2 + O(h^4).

Neglecting the O(h4 ) terms, we have a system of equations that we can solve for K1 and I[f ]. The
value of I[f ], which we denote by T2,2 , is an improved approximation given by

    T_{2,2} = T_{2,1} + \frac{T_{2,1} - T_{1,1}}{3}.
It follows from the representation of the error in the Composite Trapezoidal Rule that I[f ] =
T2,2 + O(h4 ).
Suppose that we compute another approximation T3,1 using the Composite Trapezoidal Rule
with 4 subintervals. Then, as before, we can use Richardson Extrapolation with T2,1 and T3,1
to obtain a new approximation T3,2 that is fourth-order accurate. Now, however, we have two
approximations, T2,2 and T3,2 , that satisfy

    T_{2,2} = I[f] + \tilde{K}_2 h^4 + O(h^6), \qquad
    T_{3,2} = I[f] + \tilde{K}_2 (h/2)^4 + O(h^6)

for some constant K̃2 . It follows that we can apply Richardson Extrapolation to these approxima-
tions to obtain a new approximation T3,3 that is sixth-order accurate. We can continue this process
to obtain as high an order of accuracy as we wish. We now describe the entire algorithm.

Algorithm 7.6.2 (Romberg Integration) Given a positive integer J, an interval [a, b] and a function f(x), the following algorithm computes an approximation to I[f] = \int_a^b f(x)\,dx that is accurate to order 2J.

    h = b − a
    for j = 1, 2, \ldots, J do
        T_{j,1} = \frac{h}{2}\left[ f(a) + 2\sum_{i=1}^{2^{j-1}-1} f(a + ih) + f(b) \right]   (Composite Trapezoidal Rule)
        for k = 2, 3, \ldots, j do
            T_{j,k} = T_{j,k-1} + \frac{T_{j,k-1} - T_{j-1,k-1}}{4^{k-1} - 1}   (Richardson Extrapolation)
        end
        h = h/2
    end
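A direct Matlab transcription of Algorithm 7.6.2 follows (a sketch; the function name is arbitrary, and Exercise 7.6.5 asks for your own implementation). For clarity it recomputes T_{j,1} from scratch, rather than reusing T_{j-1,1} as described in the next paragraph.

    function I = romberg(f, a, b, J)
        T = zeros(J);
        h = b - a;
        for j = 1:J
            x = a + (1:2^(j-1)-1)*h;                    % interior nodes (empty when j = 1)
            T(j,1) = h/2*(f(a) + 2*sum(f(x)) + f(b));   % Composite Trapezoidal Rule
            for k = 2:j
                T(j,k) = T(j,k-1) + (T(j,k-1) - T(j-1,k-1))/(4^(k-1) - 1);
            end
            h = h/2;
        end
        I = T(J,J);
    end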
It should be noted that in a practical implementation, Tj,1 can be computed more efficiently by
using Tj−1,1 , because Tj−1,1 already includes more than half of the function values used to compute
Tj,1 , and they are weighted correctly relative to one another. It follows that for j > 1, if we split
the summation in the algorithm into two summations containing odd- and even-numbered terms,
respectively, we obtain
 
    T_{j,1} = \frac{h}{2}\left[ f(a) + 2\sum_{i=1}^{2^{j-2}} f(a + (2i-1)h) + 2\sum_{i=1}^{2^{j-2}-1} f(a + 2ih) + f(b) \right]
            = \frac{h}{2}\left[ f(a) + 2\sum_{i=1}^{2^{j-2}-1} f(a + 2ih) + f(b) \right] + \frac{h}{2}\left[ 2\sum_{i=1}^{2^{j-2}} f(a + (2i-1)h) \right]
            = \frac{1}{2}\,T_{j-1,1} + h \sum_{i=1}^{2^{j-2}} f(a + (2i-1)h).

Example 7.6.3 We will use Romberg integration to obtain a sixth-order accurate approximation
to

    \int_0^1 e^{-x^2}\,dx,
an integral that cannot be computed using the Fundamental Theorem of Calculus. We begin by
using the Trapezoidal Rule, or, equivalently, the Composite Trapezoidal Rule
 
    \int_a^b f(x)\,dx \approx \frac{h}{2}\left[ f(a) + 2\sum_{j=1}^{n-1} f(x_j) + f(b) \right], \quad h = \frac{b-a}{n}, \quad x_j = a + jh,

with n = 1 subintervals. Since h = (b − a)/n = (1 − 0)/1 = 1, we have


    T_{1,1} = \frac{1}{2}\,[f(0) + f(1)] = 0.68393972058572,

which has an absolute error of 6.3 × 10−2 .


If we bisect the interval [0, 1] into two subintervals of equal width, and approximate the area
under e^{-x^2} using two trapezoids, then we are applying the Composite Trapezoidal Rule with n = 2
and h = (1 − 0)/2 = 1/2, which yields
and h = (1 − 0)/2 = 1/2, which yields

    T_{2,1} = \frac{0.5}{2}\,[f(0) + 2f(0.5) + f(1)] = 0.73137025182856,
which has an absolute error of 1.5 × 10−2 . As expected, the error is reduced by a factor of 4 when
the step size is halved, since the error in the Composite Trapezoidal Rule is of O(h2 ).
Now, we can use Richardson Extrapolation to obtain a more accurate approximation,
    T_{2,2} = T_{2,1} + \frac{T_{2,1} - T_{1,1}}{3} = 0.74718042890951,
which has an absolute error of 3.6 × 10−4 . Because the error in the Composite Trapezoidal Rule
satisfies
 
    \int_a^b f(x)\,dx = \frac{h}{2}\left[ f(a) + 2\sum_{j=1}^{n-1} f(x_j) + f(b) \right] + K_1 h^2 + K_2 h^4 + K_3 h^6 + O(h^8),

where the constants K1 , K2 and K3 depend on the derivatives of f (x) on [a, b] and are independent
of h, we can conclude that T_{2,2} has fourth-order accuracy.
We can obtain a second approximation of fourth-order accuracy by using the Composite Trape-
zoidal Rule with n = 4 to obtain a third approximation of second-order accuracy. We set h =
(1 − 0)/4 = 1/4, and then compute
    T_{3,1} = \frac{0.25}{2}\left[ f(0) + 2[f(0.25) + f(0.5) + f(0.75)] + f(1) \right] = 0.74298409780038,
which has an absolute error of 3.8 × 10−3 . Now, we can apply Richardson Extrapolation to T2,1 and
T3,1 to obtain
    T_{3,2} = T_{3,1} + \frac{T_{3,1} - T_{2,1}}{3} = 0.74685537979099,
3
which has an absolute error of 3.1 × 10−5 . This significant decrease in error from T2,2 is to be
expected, since both T2,2 and T3,2 have fourth-order accuracy, and T3,2 is computed using half the
step size of T2,2 .
It follows from the error term in the Composite Trapezoidal Rule, and the formula for Richardson
Extrapolation, that
    T_{2,2} = \int_0^1 e^{-x^2}\,dx + \tilde{K}_2 h^4 + O(h^6), \qquad
    T_{3,2} = \int_0^1 e^{-x^2}\,dx + \tilde{K}_2 \left( \frac{h}{2} \right)^4 + O(h^6).
Therefore, we can use Richardson Extrapolation with these two approximations to obtain a new
approximation
    T_{3,3} = T_{3,2} + \frac{T_{3,2} - T_{2,2}}{2^4 - 1} = 0.74683370984975,
24 − 1
which has an absolute error of 9.6 × 10−6 . Because T3,3 is a linear combination of T3,2 and T2,2 in
which the terms of order h4 cancel, we can conclude that T3,3 is of sixth-order accuracy. 2

Exercise 7.6.5 Write a Matlab function I=quadromberg(f,a,b,J) that implements


Algorithm 7.6.2 for Romberg integration described in this section. Apply it to approximate
the integrals

    \int_0^1 e^x\,dx, \qquad \int_0^1 \frac{1}{1+x^2}\,dx, \qquad \int_0^1 \sqrt{x}\,dx.
0 0 1+x 0
How does the accuracy of the approximations improve as the number of extrapolations, J,
increases? Explain the difference in the observed behavior.

7.7 Adaptive Quadrature


Composite rules can be used to implement an automatic quadrature procedure, in which all of
the subintervals of [a, b] are continually subdivided until sufficient accuracy is achieved. However,
this approach is impractical since small subintervals are not necessary in regions where the integrand
is smooth.
An alternative is adaptive quadrature [30]. Adaptive quadrature is a technique in which the
interval [a, b] is divided into n subintervals [aj , bj ], for j = 0, 1, . . . , n − 1, and a quadrature rule,
such as the Trapezoidal Rule, is used on each subinterval to compute
    I_j[f] = \int_{a_j}^{b_j} f(x)\,dx,

as in any composite quadrature rule. However, in adaptive quadrature, a subinterval [aj , bj ] is


subdivided if it is determined that the quadrature rule has not computed Ij [f ] with sufficient
accuracy.
To make this determination, we use the quadrature rule on [aj , bj ] to obtain an approximation
I1 , and then use the corresponding composite rule on [aj , bj ], with two subintervals, to compute
a second approximation I2 . If I1 and I2 are sufficiently close, then it is reasonable to conclude
that these two approximations are accurate, so there is no need to subdivide [aj , bj ]. Otherwise,
we divide [aj , bj ] into two subintervals, and repeat this process on these subintervals. We apply
this technique to all subintervals, until we can determine that the integral of f over each one
has been computed with sufficient accuracy. By subdividing only when it is necessary, we avoid
unnecessary computation and obtain the same accuracy as with composite rules or automatic
quadrature procedures, but with much less computational effort.
How do we determine whether I1 and I2 are sufficiently close? Suppose that the composite rule
has order of accuracy p. Then, I1 and I2 should satisfy
    I - I_2 \approx \frac{1}{2^p}\,(I - I_1),

where I is the exact value of the integral over [a_j, b_j]. We then have

    I - I_2 \approx \frac{1}{2^p}\,(I - I_2 + I_2 - I_1),

which can be rearranged to obtain

    I - I_2 \approx \frac{1}{2^p - 1}\,(I_2 - I_1).

Thus we have obtained an error estimate in terms of our two approximations.


We now describe an algorithm for adaptive quadrature. This algorithm uses the Trapezoidal
Rule to integrate over intervals, and intervals are subdivided as necessary into two subintervals
of equal width. The algorithm uses a data structure called a stack in order to keep track of the
subintervals over which f still needs to be integrated. A stack is essentially a list of elements,
where the elements, in this case, are subintervals. An element is added to the stack using a push
operation, and is removed using a pop operation. A stack is often described using the phrase “last-
in-first-out,” because the most recent element to be pushed onto the stack is the first element to be
popped. This corresponds to our intuitive notion of a stack of objects, in which objects are placed
on top of the stack and are removed from the top as well.
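One simple way to realize such a stack in Matlab is to store each subinterval as a row of a matrix and to push and pop at the last row, as suggested in Exercise 7.7.1 below; the following lines are only a sketch of this representation.

S = [];              % an empty stack of subintervals
S = [S; 0 1];        % push the interval [0,1]
S = [S; 0 0.5];      % push the interval [0,0.5]; it is now on top
v = S(end,:);        % pop: retrieve the most recently pushed interval ...
S(end,:) = [];       % ... and remove it from the top of the stack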

Algorithm 7.7.1 (Adaptive Quadrature) Given a function f(x) that is integrable on an interval [a, b], the following algorithm computes an approximation I to $I[f] = \int_a^b f(x)\,dx$ that is accurate to within $(b-a)\,TOL$, where TOL is a given error tolerance.

S is an empty stack
push(S, [a, b])
I=0
while S is not empty do
[a, b] = pop(S) (the interval [a, b] on top of S is removed from S)
I1 = [(b − a)/2][f (a) + f (b)] (Trapezoidal Rule)
m = (a + b)/2
I2 = [(b − a)/4][f (a) + 2f (m) + f (b)] (Composite Trapezoidal Rule with 2 subintervals)
if |I1 − I2 | < 3(b − a)T OL then
I = I + I2 (from error term in Trapezoidal Rule, $|I[f] - I_2| \approx \frac{1}{3}|I_1 - I_2|$)
else
push(S, [a, m])
push(S, [m, b])
end
end
Throughout the execution of the loop in the above algorithm, the stack S contains all intervals
over which f still needs to be integrated to within the desired accuracy. Initially, the only such
interval is the original interval [a, b]. As long as intervals remain in the stack, the interval on top of
the stack is removed, and we attempt to integrate over it. If we obtain a sufficiently accurate result,
then we are finished with the interval. Otherwise, the interval is bisected into two subintervals,
both of which are pushed on the stack so that they can be processed later. Once the stack is
empty, we know that we have accurately integrated f over a collection of intervals whose union is
the original interval [a, b], so the algorithm can terminate.
Example 7.7.2 We will use adaptive quadrature to compute the integral
$$\int_0^{\pi/4} e^{3x} \sin 2x\,dx$$

to within $(\pi/4)10^{-4}$.
Let f (x) = e3x sin 2x denote the integrand. First, we use Simpson’s Rule, or, equivalently, the
Composite Simpson’s Rule with n = 2 subintervals, to obtain an approximation I1 to this integral.

We have
$$I_1 = \frac{\pi/4}{6}[f(0) + 4f(\pi/8) + f(\pi/4)] = 2.58369640324748.$$
Then, we divide the interval [0, π/4] into two subintervals of equal width, [0, π/8] and [π/8, π/4],
and integrate over each one using Simpson’s Rule to obtain a second approximation I2 . This is
equivalent to using the Composite Simpson’s Rule on [0, π/4] with n = 4 subintervals. We obtain
$$I_2 = \frac{\pi/8}{6}[f(0) + 4f(\pi/16) + f(\pi/8)] + \frac{\pi/8}{6}[f(\pi/8) + 4f(3\pi/16) + f(\pi/4)]$$
$$= \frac{\pi/16}{3}[f(0) + 4f(\pi/16) + 2f(\pi/8) + 4f(3\pi/16) + f(\pi/4)] = 2.58770145345862.$$

Now, we need to determine whether the approximation I2 is sufficiently accurate. Because the
error in the Composite Simpson's Rule is $O(h^4)$, where h is the width of each subinterval used in the rule, it follows that the actual error in I2 satisfies
$$|I_2 - I[f]| \approx \frac{1}{15}|I_2 - I_1|,$$
where I[f ] is the exact value of the integral of f .
We find that the relation
$$|I_2 - I[f]| \approx \frac{1}{15}|I_2 - I_1| < \frac{\pi}{4}\,10^{-4}$$
is not satisfied, so we must divide the interval [0, π/4] into two subintervals of equal width, [0, π/8]
and [π/8, π/4], and use the Composite Simpson’s Rule with these smaller intervals in order to
achieve the desired accuracy.
First, we work with the interval [0, π/8]. Proceeding as before, we use the Composite Simpson’s
Rule with n = 2 and n = 4 subintervals to obtain approximations I1 and I2 to the integral of f (x)
over this interval. We have
$$I_1 = \frac{\pi/8}{6}[f(0) + 4f(\pi/16) + f(\pi/8)] = 0.33088926959519$$
and
$$I_2 = \frac{\pi/16}{6}[f(0) + 4f(\pi/32) + f(\pi/16)] + \frac{\pi/16}{6}[f(\pi/16) + 4f(3\pi/32) + f(\pi/8)]$$
$$= \frac{\pi/32}{3}[f(0) + 4f(\pi/32) + 2f(\pi/16) + 4f(3\pi/32) + f(\pi/8)] = 0.33054510467064.$$

Since these approximations satisfy the relation


$$|I_2 - I[f]| \approx \frac{1}{15}|I_2 - I_1| < \frac{\pi}{8}\,10^{-4},$$
where I[f ] denotes the exact value of the integral of f over [0, π/8], we have achieved sufficient ac-
curacy on this interval and we do not need to subdivide it further. The more accurate approximation
I2 can be included in our approximation to the integral over the original interval [0, π/4].

Now, we need to achieve sufficient accuracy on the remaining subinterval, [π/8, π/4]. As before,
we compute the approximations I1 and I2 of the integral of f over this interval and obtain

$$I_1 = \frac{\pi/8}{6}[f(\pi/8) + 4f(3\pi/16) + f(\pi/4)] = 2.25681218386343$$
and
$$I_2 = \frac{\pi/16}{6}[f(\pi/8) + 4f(5\pi/32) + f(3\pi/16)] + \frac{\pi/16}{6}[f(3\pi/16) + 4f(7\pi/32) + f(\pi/4)]$$
$$= \frac{\pi/32}{3}[f(\pi/8) + 4f(5\pi/32) + 2f(3\pi/16) + 4f(7\pi/32) + f(\pi/4)] = 2.25801455892266.$$

Since these approximations do not satisfy the relation


$$|I_2 - I[f]| \approx \frac{1}{15}|I_2 - I_1| < \frac{\pi}{8}\,10^{-4},$$
where I[f ] denotes the exact value of the integral of f over [π/8, π/4], we have not achieved suf-
ficient accuracy on this interval and we need to subdivide it into two subintervals of equal width,
[π/8, 3π/16] and [3π/16, π/4], and use the Composite Simpson’s Rule with these smaller intervals
in order to achieve the desired accuracy.
The discrepancy in these two approximations to the integral of f over [π/8, π/4] is larger than
the discrepancy in the two approximations of the integral over [0, π/8] because even though these
intervals have the same width, the derivatives of f are larger on [π/8, π/4], and therefore the error
in the Composite Simpson’s Rule is larger.
We continue the process of adaptive quadrature on the interval [π/8, 3π/16]. As before, we
compute the approximations I1 and I2 of the integral of f over this interval and obtain

$$I_1 = \frac{\pi/16}{6}[f(\pi/8) + 4f(5\pi/32) + f(3\pi/16)] = 0.72676545197054$$
and
$$I_2 = \frac{\pi/32}{6}[f(\pi/8) + 4f(9\pi/64) + f(5\pi/32)] + \frac{\pi/32}{6}[f(5\pi/32) + 4f(11\pi/64) + f(3\pi/16)]$$
$$= \frac{\pi/64}{3}[f(\pi/8) + 4f(9\pi/64) + 2f(5\pi/32) + 4f(11\pi/64) + f(3\pi/16)] = 0.72677918153379.$$

Since these approximations satisfy the relation


$$|I_2 - I[f]| \approx \frac{1}{15}|I_2 - I_1| < \frac{\pi}{16}\,10^{-4},$$
where I[f ] denotes the exact value of the integral of f over [π/8, 3π/16], we have achieved suf-
ficient accuracy on this interval and we do not need to subdivide it further. The more accurate
approximation I2 can be included in our approximation to the integral over the original interval
[0, π/4].

Now, we work with the interval [3π/16, π/4]. Proceeding as before, we use the Composite Simp-
son’s Rule with n = 2 and n = 4 subintervals to obtain approximations I1 and I2 to the integral of
f (x) over this interval. We have

$$I_1 = \frac{\pi/16}{6}[f(3\pi/16) + 4f(7\pi/32) + f(\pi/4)] = 1.53124910695212$$
and
$$I_2 = \frac{\pi/32}{6}[f(3\pi/16) + 4f(13\pi/64) + f(7\pi/32)] + \frac{\pi/32}{6}[f(7\pi/32) + 4f(15\pi/64) + f(\pi/4)]$$
$$= \frac{\pi/64}{3}[f(3\pi/16) + 4f(13\pi/64) + 2f(7\pi/32) + 4f(15\pi/64) + f(\pi/4)] = 1.53131941583939.$$

Since these approximations satisfy the relation


$$|I_2 - I[f]| \approx \frac{1}{15}|I_2 - I_1| < \frac{\pi}{16}\,10^{-4},$$
where I[f ] denotes the exact value of the integral of f over [3π/16, π/4], we have achieved suf-
ficient accuracy on this interval and we do not need to subdivide it further. The more accurate
approximation I2 can be included in our approximation to the integral over the original interval
[0, π/4].
We conclude that the integral of f (x) over [0, π/4] can be approximated by the sum of our
approximate integrals over [0, π/8], [π/8, 3π/16], and [3π/16, π/4], which yields
$$\int_0^{\pi/4} e^{3x} \sin 2x\,dx \approx 0.33054510467064 + 0.72677918153379 + 1.53131941583939 \approx 2.58864370204382.$$

Since the exact value is 2.58862863250716, the absolute error is $1.507 \times 10^{-5}$, which is less than
our desired error bound of $(\pi/4)10^{-4} \approx 7.854 \times 10^{-5}$. This is because on each
subinterval, we ensured that our approximation was accurate to within $10^{-4}$ times the width of
the subinterval, so that when we added these approximations, the total error in the sum would be
bounded by $10^{-4}$ times the width of the union of these subintervals, which is the original interval
[0, π/4].
The graph of the integrand over the interval of integration is shown in Figure 7.1. 2
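As a quick check, the value quoted above as exact can be reproduced in Matlab with the built-in adaptive integrator integral; the commands below are only a sanity check on the computations in this example.

>> format long
>> f=@(x) exp(3*x).*sin(2*x);
>> integral(f,0,pi/4)

The result should agree with the exact value 2.58862863250716 quoted above, at least to within the default tolerances used by integral.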

Adaptive quadrature can be very effective, but it should be used cautiously, for the following
reasons:

• The integrand is only evaluated at a few points within each subinterval. Such sampling can
miss a portion of the integrand whose contribution to the integral can be misjudged.

• Regions in which the function is not smooth will still only make a small contribution to the
integral if the region itself is small, so this should be taken into account to avoid unnecessary
function evaluations.

Figure 7.1: Graph of f (x) = e3x sin 2x on [0, π/4], with quadrature nodes from Example 7.7.2
shown on the graph and on the x-axis.

• Adaptive quadrature can be very inefficient if the integrand has a discontinuity within a subin-
terval, since repeated subdivision will occur. This is unnecessary if the integrand is smooth
on either side of the discontinuity, so subintervals should be chosen so that discontinuities
occur between subintervals whenever possible.

Exercise 7.7.1 Write Matlab functions S=stackpush(S,v) and [S,v]=stackpop(S)


that implement the push and pop operations described in this section, in a manner that
is useful for adaptive quadrature. Assume that v is a row vector of a fixed length, and S
is a matrix in which each row represents an element of the stack.

Exercise 7.7.2 Write a Matlab function I=adapquad(f,a,b,tol) that implements


Algorithm 7.7.1 and uses the functions stackpush and stackpop from Exercise 7.7.1.
Then change your function so that Simpson’s Rule is used in place of the Trapezoidal
Rule.

Exercise 7.7.3 Write a Matlab function I=adapquadrecur(f,a,b,tol) that imple-


ments adaptive quadrature as described in this section, but uses recursion instead of a
stack to keep track of subintervals.

Exercise 7.7.4 Let $f(x) = e^{-1000(x-c)^2}$, where c is a parameter. Use your function
adapquadrecur from Exercise 7.7.3 to approximate $\int_0^1 f(x)\,dx$ for the cases c = 1/8 and
c = 1/4. Explain the difference in performance between these two cases.

Exercise 7.7.5 Explain why a straightforward implementation of adaptive quadrature us-


ing recursion, as in your function adapquadrecur from Exercise 7.7.3, is not as efficient
as it could be. What can be done to make it more efficient? Modify your implementation
accordingly.

7.8 Multiple Integrals


As many problems in scientific computing involve two- or three-dimensional domains, it is essential
to be able to compute integrals over such domains. In this section, we explore the generalization
of techniques for integrals of functions of one variable to such multivariable cases.

7.8.1 Double Integrals


Double integrals can be evaluated using the following strategies:
• If a two-dimensional domain Ω can be decomposed into rectangles, then the integral of a function f(x, y) over Ω can be computed by evaluating integrals of the form
$$I[f] = \int_a^b \int_c^d f(x,y)\,dy\,dx. \qquad (7.20)$$
Then, to evaluate I[f], one can use a Cartesian product rule, whose nodes and weights are obtained by combining one-dimensional quadrature rules that are applied to each dimension. For example, if functions of x are integrated along the line between x = a and x = b using nodes $x_i$ and weights $w_i$, for i = 1, ..., n, and if functions of y are integrated along the line between y = c and y = d using nodes $y_i$ and weights $z_i$, for i = 1, ..., m, then the resulting Cartesian product rule
$$Q_{n,m}[f] = \sum_{i=1}^n \sum_{j=1}^m f(x_i, y_j) w_i z_j$$
has nodes $(x_i, y_j)$ and corresponding weights $w_i z_j$ for i = 1, ..., n and j = 1, ..., m. A short Matlab sketch of such a rule is given after this list.


• If the domain Ω can be described as the region between two curves $y_1(x)$ and $y_2(x)$ for $x \in [a, b]$, then we can write
$$I[f] = \int\!\!\int_\Omega f(x,y)\,dA$$
as an iterated integral
$$I[f] = \int_a^b \int_{y_1(x)}^{y_2(x)} f(x,y)\,dy\,dx,$$
which can be evaluated by applying a one-dimensional quadrature rule to compute the outer integral
$$I[f] = \int_a^b g(x)\,dx,$$
where g(x) is evaluated by using a one-dimensional quadrature rule to compute the inner integral
$$g(x) = \int_{y_1(x)}^{y_2(x)} f(x,y)\,dy.$$

• For various simple regions such as triangles, there exist cubature rules that are not combi-
nations of one-dimensional quadrature rules. Cubature rules are more direct generalizations
of quadrature rules, in that they evaluate the integrand at selected nodes and use weights
determined by the geometry of the domain and the placement of the nodes.
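To illustrate the first strategy, the following Matlab sketch assembles a Cartesian product rule from two Composite Trapezoidal Rules; the integrand and the values of a, b, c, d, n, m are only illustrative (they match Example 7.8.1 below), and the weight vectors are written out directly from the one-dimensional rule.

f = @(x,y) exp(y - x);                 % integrand (illustrative)
a = 0; b = 1/2; c = 0; d = 1/2;        % rectangular domain [a,b] x [c,d]
n = 2; m = 2;                          % subintervals in x and y
hx = (b - a)/n;  x = a:hx:b;           % nodes x_i
hy = (d - c)/m;  y = c:hy:d;           % nodes y_j
wx = hx*[1/2 ones(1,n-1) 1/2];         % 1-D Composite Trapezoidal weights w_i
wy = hy*[1/2 ones(1,m-1) 1/2];         % 1-D Composite Trapezoidal weights z_j
[X,Y] = meshgrid(x,y);                 % all nodes (x_i, y_j)
Q = wy*f(X,Y)*wx'                      % Q_{n,m}[f] = sum_i sum_j w_i z_j f(x_i,y_j)

For this integrand, Q reproduces the value 0.25791494889765 obtained by hand in Example 7.8.1 below.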

It should be noted that each of these strategies applies only to certain special cases. The first algorithm
capable of integrating over a general two-dimensional domain was developed by Lambers and Rice
[21]. This algorithm combines the second and third strategies described above, decomposing the
domain into subdomains that are either triangles or regions between two curves.

Example 7.8.1 We will use the Composite Trapezoidal Rule with m = n = 2 to evaluate the
double integral
$$\int_0^{1/2}\int_0^{1/2} e^{y-x}\,dy\,dx.$$

The Composite Trapezoidal Rule with n = 2 subintervals is


$$\int_a^b f(x)\,dx \approx \frac{h}{2}\left[f(a) + 2f\left(\frac{a+b}{2}\right) + f(b)\right], \qquad h = \frac{b-a}{n}.$$

If a = 0 and b = 1/2, then h = (1/2 − 0)/2 = 1/4 and this simplifies to


$$\int_0^{1/2} f(x)\,dx \approx \frac{1}{8}[f(0) + 2f(1/4) + f(1/2)].$$

We first use this rule to evaluate the “single” integral


$$\int_0^{1/2} g(x)\,dx$$
where
$$g(x) = \int_0^{1/2} e^{y-x}\,dy.$$

This yields
$$\int_0^{1/2}\int_0^{1/2} e^{y-x}\,dy\,dx = \int_0^{1/2} g(x)\,dx \approx \frac{1}{8}[g(0) + 2g(1/4) + g(1/2)]$$
$$\approx \frac{1}{8}\left[\int_0^{1/2} e^{y-0}\,dy + 2\int_0^{1/2} e^{y-1/4}\,dy + \int_0^{1/2} e^{y-1/2}\,dy\right].$$

Now, to evaluate each of these integrals, we use the Composite Trapezoidal Rule in the y-direction
with m = 2. If we let k denote the step size in the y-direction, we have k = (1/2 − 0)/2 = 1/4, and
therefore we have
$$\int_0^{1/2}\int_0^{1/2} e^{y-x}\,dy\,dx \approx \frac{1}{8}\left[\int_0^{1/2} e^{y-0}\,dy + 2\int_0^{1/2} e^{y-1/4}\,dy + \int_0^{1/2} e^{y-1/2}\,dy\right]$$
$$\approx \frac{1}{8}\left\{\frac{1}{8}\left[e^{0-0} + 2e^{1/4-0} + e^{1/2-0}\right] + 2\cdot\frac{1}{8}\left[e^{0-1/4} + 2e^{1/4-1/4} + e^{1/2-1/4}\right] + \frac{1}{8}\left[e^{0-1/2} + 2e^{1/4-1/2} + e^{1/2-1/2}\right]\right\}$$
$$\approx \frac{1}{64}\left[e^0 + 2e^{1/4} + e^{1/2}\right] + \frac{1}{32}\left[e^{-1/4} + 2e^0 + e^{1/4}\right] + \frac{1}{64}\left[e^{-1/2} + 2e^{-1/4} + e^0\right]$$
$$\approx \frac{3}{32}e^0 + \frac{1}{16}e^{-1/4} + \frac{1}{64}e^{-1/2} + \frac{1}{16}e^{1/4} + \frac{1}{64}e^{1/2} \approx 0.25791494889765.$$
The exact value, to 15 digits, is 0.255251930412762. The error is $2.66 \times 10^{-3}$, which is to be
expected due to the use of few subintervals, and the fact that the Composite Trapezoidal Rule is
only second-order accurate. 2
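The nested use of the one-dimensional rule in this example can be mirrored directly in Matlab; the sketch below is only meant to reproduce the hand computation above, with trap2 an illustrative name for the three-point Composite Trapezoidal Rule.

f = @(x,y) exp(y - x);
trap2 = @(fun,a,b) (b - a)/4*(fun(a) + 2*fun((a + b)/2) + fun(b));  % n = 2 rule
g = @(x) trap2(@(y) f(x,y), 0, 1/2);   % inner integral in y, for fixed x
I = trap2(g, 0, 1/2)                   % outer integral in x; about 0.2579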

Example 7.8.2 We will use the Composite Simpson’s Rule with n = 2 and m = 4 to evaluate the
double integral
$$\int_0^1 \int_x^{2x} x^2 + y^3\,dy\,dx.$$
In this case, the domain of integration described by the limits is not a rectangle, but a triangle
defined by the lines y = x, y = 2x, and x = 1. The Composite Simpson's Rule with n = 2 subintervals is
$$\int_a^b f(x)\,dx \approx \frac{h}{3}\left[f(a) + 4f\left(\frac{a+b}{2}\right) + f(b)\right], \qquad h = \frac{b-a}{n}.$$
If a = 0 and b = 1, then h = (1 − 0)/2 = 1/2, and this simplifies to
$$\int_0^1 f(x)\,dx \approx \frac{1}{6}[f(0) + 4f(1/2) + f(1)].$$
We first use this rule to evaluate the “single” integral
$$\int_0^1 g(x)\,dx$$
where
$$g(x) = \int_x^{2x} x^2 + y^3\,dy.$$

This yields
$$\int_0^1 \int_x^{2x} x^2 + y^3\,dy\,dx = \int_0^1 g(x)\,dx \approx \frac{1}{6}[g(0) + 4g(1/2) + g(1)]$$
$$\approx \frac{1}{6}\left[\int_0^0 0^2 + y^3\,dy + 4\int_{1/2}^1 \left(\frac{1}{2}\right)^2 + y^3\,dy + \int_1^2 1^2 + y^3\,dy\right].$$

The first integral will be zero, since the limits of integration are equal. To evaluate the second and
third integrals, we use the Composite Simpson’s Rule in the y-direction with m = 4. If we let k
denote the step size in the y-direction, we have k = (2x−x)/4 = x/4, and therefore we have k = 1/8
for the second integral and k = 1/4 for the third. This yields
$$\int_0^1 \int_x^{2x} x^2 + y^3\,dy\,dx \approx \frac{1}{6}\left[4\int_{1/2}^1 \left(\frac{1}{2}\right)^2 + y^3\,dy + \int_1^2 1^2 + y^3\,dy\right]$$
$$\approx \frac{1}{6}\left\{4\cdot\frac{1}{24}\left[\left(\frac{1}{4} + \left(\frac{1}{2}\right)^3\right) + 4\left(\frac{1}{4} + \left(\frac{5}{8}\right)^3\right) + 2\left(\frac{1}{4} + \left(\frac{3}{4}\right)^3\right) + 4\left(\frac{1}{4} + \left(\frac{7}{8}\right)^3\right) + \left(\frac{1}{4} + 1^3\right)\right]\right.$$
$$\left. + \frac{1}{12}\left[\left(1 + 1^3\right) + 4\left(1 + \left(\frac{5}{4}\right)^3\right) + 2\left(1 + \left(\frac{3}{2}\right)^3\right) + 4\left(1 + \left(\frac{7}{4}\right)^3\right) + \left(1 + 2^3\right)\right]\right\}$$
$$\approx 1.03125.$$

The exact value is 1. The error $3.125 \times 10^{-2}$ is rather large, which is to be expected due to the poor
distribution of nodes through the triangular domain of integration. A better distribution is achieved
if we use n = 4 and m = 2, which yields the much more accurate approximation of 1.001953125.
2

Exercise 7.8.1 Write a Matlab function I=quadcomptrap2d(f,a,b,c,d,m,n) that


approximates (7.20) using the Composite Trapezoidal Rule with m subintervals in the
x-direction and n subintervals in the y-direction.

Exercise 7.8.2 Generalize your function quadcomptrap2d from Exercise 7.8.1 so that
the arguments c and d can be either scalars or function handles. If they are function
handles, then your function approximates the integral
$$I[f] = \int_a^b \int_{c(x)}^{d(x)} f(x,y)\,dy\,dx.$$

Hint: use the Matlab function isnumeric to determine whether c and d are numbers.

Exercise 7.8.3 Generalize your function quadcomptrap2d from Exercise 7.8.2 so that
the arguments a, b, c and d can be either scalars or function handles. If a and b are
function handles, then your function approximates the integral
$$I[f] = \int_c^d \int_{a(y)}^{b(y)} f(x,y)\,dx\,dy.$$

Exercise 7.8.4 Use the error formula for the Composite Trapezoidal Rule to obtain an
error formula for a Cartesian product rule such as the one implemented in Exercise
7.8.1. As in that exercise, assume that m subintervals are used in the x-direction and n
subintervals in the y-direction. Hint: first, apply the single-variable error formula to the
integral
$$\int_a^b g(x)\,dx, \qquad g(x) = \int_c^d f(x,y)\,dy.$$

7.8.2 Higher Dimensions


The techniques presented in this section can readily be adapted for the purpose of approximating
triple integrals. This adaptation is carried out in the following exercises.
Exercise 7.8.5 Write a Matlab function I=quadcomptrap3d(f,a,b,c,d,s,t,m,n,p)
that approximates
$$\int_a^b \int_c^d \int_s^t f(x,y,z)\,dz\,dy\,dx$$
using the Composite Trapezoidal Rule with m subintervals in the x-direction, n subin-
tervals in the y-direction, and p subintervals in the z-direction. Hint: use your function
quadcomptrap2d from Exercise 7.8.1 to simplify the implementation.

Exercise 7.8.6 Modify your function quadcomptrap3d from Exercise 7.8.5 to obtain
a function quadcompsimp3d that uses the Composite Simpson’s Rule in each direction.
Then, use an approach similar to that used in Exercise 7.8.4 to obtain an error formula
for the Cartesian product rule used in quadcompsimp3d.

Exercise 7.8.7 Generalize your function quadcomptrap3d from Exercise 7.8.5 so that
any of the arguments a, b, c, d, s and t can be either scalars or function handles.
For example, if a and b are scalars, c and d are function handles that have two input
arguments, and s and t are function handles that have one input argument, then your
function approximates the integral
$$I[f] = \int_a^b \int_{s(x)}^{t(x)} \int_{c(x,z)}^{d(x,z)} f(x,y,z)\,dy\,dz\,dx.$$

Hint: use the Matlab function isnumeric to determine whether any arguments are
numbers, and the function nargin to determine how many arguments a function handle
requires.
In more than three dimensions, generalizations of quadrature rules are not practical, since

the number of function evaluations needed to attain sufficient accuracy grows very rapidly as the
number of dimensions increases. An alternative is the Monte Carlo method, which samples the
integrand at n randomly selected points and attempts to compute the mean value of the integrand
on the entire domain. The method converges rather slowly but its convergence rate depends only
on n, not the number of dimensions.
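A minimal Monte Carlo sketch in Matlab, for an integral over the unit cube [0, 1]^d, multiplies the sample mean of the integrand by the volume of the domain (which is 1 here); the integrand, dimension and sample size are only illustrative.

d = 5;                            % number of dimensions (illustrative)
n = 1e5;                          % number of random sample points
f = @(X) exp(-sum(X.^2,2));       % integrand, evaluated at each row of X (illustrative)
X = rand(n,d);                    % n points uniformly distributed in [0,1]^d
I = mean(f(X))                    % Monte Carlo estimate of the integral

The statistical error of such an estimate decreases like $1/\sqrt{n}$ regardless of d, which is the slow but dimension-independent convergence referred to above.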
Part IV

Nonlinear and Differential Equations

Chapter 8

Zeros of Nonlinear Functions

8.1 Nonlinear Equations in One Variable


To this point, we have only considered the solution of linear equations. We now explore the much
more difficult problem of solving nonlinear equations of the form

f (x) = 0,

where $f : \mathbb{R}^n \to \mathbb{R}^m$ can be any known function. A solution x of such a nonlinear equation is
called a root of the equation, as well as a zero of the function f .
In general, nonlinear equations cannot be solved in a finite sequence of steps. Whereas linear equations
can be solved using direct methods such as Gaussian elimination, nonlinear equations usually require
iterative methods. In iterative methods, an approximate solution is refined with each iteration until
it is determined to be sufficiently accurate, at which time the iteration terminates. Since it is
desirable for iterative methods to converge to the solution as rapidly as possible, it is necessary to
be able to measure the speed with which an iterative method converges.
To that end, we assume that an iterative method generates a sequence of iterates x0 , x1 , x2 , . . .
that converges to the exact solution x∗ . Ideally, we would like the error in a given iterate xk+1 to
be much smaller than the error in the previous iterate xk . For example, if the error is raised to a
power greater than 1 from iteration to iteration, then, because the error is typically less than 1, it
will approach zero very rapidly. This leads to the following definition.

Definition 8.1.1 (Order and Rate of Convergence) Let $\{x_k\}_{k=0}^\infty$ be a sequence in $\mathbb{R}^n$ that converges to $x^* \in \mathbb{R}^n$ and assume that $x_k \neq x^*$ for each k. We say that the order of convergence of $\{x_k\}$ to $x^*$ is r, with asymptotic error constant C, if
$$\lim_{k\to\infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^r} = C,$$
where $r \geq 1$. If r = 1, then the number $\rho = -\log_{10} C$ is called the asymptotic rate of convergence.

If r = 1, and 0 < C < 1, we say that convergence is linear. If r = 1 and C = 0, or if 1 < r < 2 for
any positive C, then we say that convergence is superlinear. If r = 2, then the method converges


quadratically, and if r = 3, we say it converges cubically, and so on. Note that the value of C need
only be bounded above in the case of linear convergence.
When convergence is linear, the asymptotic rate of convergence ρ indicates the number of correct
decimal digits obtained in a single iteration. In other words, $\lfloor 1/\rho \rfloor + 1$ iterations are required to
obtain an additional correct decimal digit, where $\lfloor x \rfloor$ is the “floor” of x, which is the largest integer
that is less than or equal to x.
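For example, a linearly convergent method with asymptotic error constant C = 1/2 has asymptotic rate of convergence $\rho = -\log_{10}(1/2) \approx 0.301$, so that $\lfloor 1/\rho \rfloor + 1 = 4$ iterations are needed to gain an additional correct decimal digit, which is consistent with the behavior of bisection described in Section 8.2.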

8.1.1 Existence and Uniqueness


For simplicity, we assume that the function $f : \mathbb{R}^n \to \mathbb{R}^m$ is continuous on the domain under
consideration. Then, each equation $f_i(x) = 0$, $i = 1, \ldots, m$, defines a hypersurface in $\mathbb{R}^n$. The
solution of f(x) = 0 is the intersection of these hypersurfaces, if the intersection is not empty. It is
not hard to see that there can be a unique solution, infinitely many solutions, or no solution at all.
For a general equation f (x) = 0, it is not possible to characterize the conditions under which
a solution exists or is unique. However, in some situations, it is possible to determine existence
analytically. For example, in one dimension, the Intermediate Value Theorem implies that if a
continuous function f (x) satisfies f (a) ≤ 0 and f (b) ≥ 0 where a < b, then f (x) = 0 for some
x ∈ (a, b).
Similarly, it can be concluded that f (x) = 0 for some x ∈ (a, b) if the function (x − z)f (x) ≥ 0
for x = a and x = b, where z ∈ (a, b). This condition can be generalized to higher dimensions. If
S ⊂ Rn is an open, bounded set, and (x − z)T f (x) ≥ 0 for all x on the boundary of S and for some
z ∈ S, then f (x) = 0 for some x ∈ S. Unfortunately, checking this condition can be difficult in
practice.
One useful result from calculus that can be used to establish existence and, in some sense,
uniqueness of a solution is the Inverse Function Theorem, which states that if the Jacobian of f
is nonsingular at a point x0 , then f is invertible near x0 and the equation f (x) = y has a unique
solution for all y near f (x0 ).
If the Jacobian of f at a point x0 is singular, then f is said to be degenerate at x0 . Suppose
that x0 is a solution of f (x) = 0. Then, in one dimension, degeneracy means f 0 (x0 ) = 0, and we
say that x0 is a double root of f (x). Similarly, if f (j) (x0 ) = 0 for j = 0, . . . , m − 1, then x0 is a root
of multiplicity m. We will see that degeneracy can cause difficulties when trying to solve nonlinear
equations.

8.1.2 Sensitivity of Solutions


The absolute condition number of a function f(x) is a measure of how a perturbation in x, denoted by $x + \epsilon$ for some small $\epsilon$, is amplified by f(x). Using the Mean Value Theorem, we have
$$|f(x + \epsilon) - f(x)| = |f'(c)(x + \epsilon - x)| = |f'(c)||\epsilon|,$$
where c is between x and $x + \epsilon$. With $\epsilon$ being small, the absolute condition number can be approximated by $|f'(x)|$, the factor by which the perturbation $\epsilon$ in x is amplified to obtain the perturbation in f(x).
In solving a nonlinear equation in one dimension, we are trying to solve an inverse problem; that
is, instead of computing y = f (x) (the forward problem), we are computing x = f −1 (0), assuming
8.2. THE BISECTION METHOD 273

that f is invertible near the root. It follows from the differentiation rule
$$\frac{d}{dx}\left[f^{-1}(x)\right] = \frac{1}{f'(f^{-1}(x))}$$
that the condition number for solving f(x) = 0 is approximately $1/|f'(x^*)|$, where $x^*$ is the solution.
This discussion can be generalized to higher dimensions, where the condition number is measured
using the norm of the Jacobian.
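For example, for f(x) = x − cos x, which we will revisit in Section 8.2, f′(x) = 1 + sin x is greater than 1 at the root, so $1/|f'(x^*)| < 1$ and the problem of computing its root is well-conditioned; by contrast, for a function such as $f(x) = (x-1)^3$, whose graph has a horizontal tangent at its root $x^* = 1$, we have $f'(x^*) = 0$ and the problem is ill-conditioned.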
Using backward error analysis, we assume that the approximate solution $\hat{x} = \hat{f}^{-1}(0)$, obtained
by evaluating an approximation of $f^{-1}$ at the exact input y = 0, can also be viewed as evaluating
the exact function $f^{-1}$ at a nearby input $\hat{y} = \epsilon$. That is, the approximate solution $\hat{x} = f^{-1}(\epsilon)$ is
the exact solution of a nearby problem.
From this viewpoint, it can be seen from a graph that if $|f'|$ is large near $x^*$, which means that
the condition number of the problem f(x) = 0 is small (that is, the problem is well-conditioned),
then even if $\epsilon$ is relatively large, $\hat{x} = f^{-1}(\epsilon)$ is close to $x^*$. On the other hand, if $|f'|$ is small near
$x^*$, so that the problem is ill-conditioned, then even if $\epsilon$ is small, $\hat{x}$ can be far away from $x^*$. These
contrasting situations are illustrated in Figure 8.1.

Figure 8.1: Left plot: Well-conditioned problem of solving f(x) = 0. $f'(x^*) = 24$, and an approximate solution $\hat{x} = f^{-1}(\epsilon)$ has small error relative to $\epsilon$. Right plot: Ill-conditioned problem of solving f(x) = 0. $f'(x^*) = 0$, and $\hat{x}$ has large error relative to $\epsilon$.

8.2 The Bisection Method


Suppose that f (x) is a continuous function that changes sign on the interval [a, b]. Then, by the
Intermediate Value Theorem, f (x) = 0 for some x ∈ [a, b]. How can we find the solution, knowing
that it lies in this interval?
To determine how to proceed, we consider some examples in which such a sign change occurs.
We work with the functions

$$f(x) = x - \cos x, \qquad g(x) = e^x \cos(x^2),$$

on the intervals [0, π/2] and [0, π], respectively. The graphs of these functions are shown in Figure
8.2. It can be seen that f (a) and f (b) have different signs, and since both functions are continuous

Figure 8.2: Illustrations of the Intermediate Value Theorem. Left plot: f (x) = x − cos x has a
unique root on [0, π/2]. Right plot: g(x) = ex cos(x2 ) has multiple roots on [0, π].

on [a, b], the Intermediate Value Theorem guarantees the existence of a root in [a, b]. However,
both of these intervals are rather large, so we cannot obtain a useful approximation of a root from
this information alone.
At each root in these examples, f(x) changes sign, so f(x) > 0 for x on one side of the root
$x^*$, and f(x) < 0 on the other side. Therefore, if we can find two values $a_0$ and $b_0$ such that $f(a_0)$

and f (b0 ) have different signs, but a0 and b0 are very close to one another, then we can accurately
approximate x∗ .
Consider the first example in Figure 8.2, that has a unique root. We have f (0) < 0 and
f (π/2) > 0. From the graph, we see that if we evaluate f at any other point x0 in (0, π/2), and
we do not “get lucky” and happen to choose the root, then either f (x0 ) > 0 or f (x0 ) < 0. If
f (x0 ) > 0, then f has a root on (0, x0 ), because f changes sign on this interval. On the other hand,
if f (x0 ) < 0, then f has a root on (x0 , π/2), because of a sign change. This is illustrated in Figure
8.3. The bottom line is, by evaluating f (x) at an intermediate point within (a, b), the size of the
interval in which we need to search for a root can be reduced.
The Method of Bisection attempts to reduce the size of the interval in which a solution is known
to exist. Suppose that we evaluate f (m), where m = (a + b)/2. If f (m) = 0, then we are done.
Otherwise, f must change sign on the interval [a, m] or [m, b], since f (a) and f (b) have different
signs and therefore f (m) must have a different sign from one of these values.
Let us try this approach on the function f (x) = x − cos x, on [a, b] = [0, π/2]. This example
can be set up in Matlab as follows:
>> a=0;
>> b=pi/2;
>> f=inline('x-cos(x)');
To help visualize the results of the computational process that we will carry out to find an approx-
imate solution, we will also graph f (x) on [a, b]:
>> dx=(b-a)/100;
>> x=a:dx:b;
>> plot(x,f(x))

Figure 8.3: Because f (π/4) > 0, f (x) has a root in (0, π/4).

>> axis tight


>> xlabel('x')
>> ylabel('y')
>> title('f(x) = x - cos x')
>> hold on
>> plot([ 0 pi/2 ],[ 0 0 ],'k')
The last statement is used to plot the relevant portion of the x-axis, so that we can more easily
visualize our progress toward computing a root.
Now, we can begin searching for an approximate root of f (x). We can reduce the size of our
search space by evaluating f (x) at any point in (a, b), but for convenience, we choose the midpoint:
>> m=(a+b)/2;
>> plot(m,f(m),'ro')
The plot statement is used to plot the point (m, f (m)) on the graph of f (x), using a red circle.
You have now reproduced Figure 8.3.
Now, we examine the values of f at a, b and the midpoint m:
>> f(a)
ans =
-1
>> f(m)
ans =
0.078291382210901
>> f(b)
ans =
1.570796326794897

We can see that f (a) and f (m) have different signs, so a root exists within (a, m). We can therefore
update our search space [a, b] accordingly:

>> b=m;

We then repeat the process, working with the midpoint of our new interval:

>> m=(a+b)/2
>> plot(m,f(m),'ro')

Now, it does not matter at which endpoint of our interval f (x) has a positive value; we only need
the signs of f (a) and f (b) to be different. Therefore, we can simply check whether the product of
the values of f at the endpoints is negative:

>> f(a)*f(m)
ans =
0.531180450812563
>> f(m)*f(b)
ans =
-0.041586851697525

We see that the sign of f changes on (m, b), so we update a to reflect that this is our new interval
to search:

>> a=m;
>> m=(a+b)/2
>> plot(m,f(m),'ro')

The progress toward a root can be seen in Figure 8.4.

Exercise 8.2.1 Repeat this process a few more times: check whether f changes sign on
(a, m) or (m, b), update [a, b] accordingly, and then compute a new midpoint m. After
computing some more midpoints, and plotting each one as we have been, what behavior
can you observe? Are the midpoints converging, and if so, are they converging to a root
of f ? Check by evaluating f at each midpoint.
We can continue this process until the interval of interest [a, b] is sufficiently small, in which
case we must be close to a solution. By including these steps in a loop, we obtain the following
algorithm that implements the approach that we have been carrying out.

Figure 8.4: Progress of the Bisection method toward finding a root of f (x) = x − cos x on (0, π/2)

Algorithm 8.2.1 (Bisection) Let f be a continuous function on the interval [a, b] that
changes sign on (a, b). The following algorithm computes an approximation x∗ to a num-
ber x in (a, b) such that f (x) = 0.

for j = 1, 2, . . . do
m = (a + b)/2
if f (m) = 0 or b − a is sufficiently small then
x∗ = m
return x∗
end
if f (a)f (m) < 0 then
b=m
else
a=m
end
end

At the beginning, it is known that (a, b) contains a solution. During each iteration, this algo-
rithm updates the interval (a, b) by checking whether f changes sign in the first half (a, m), or in
the second half (m, b). Once the correct half is found, the interval (a, b) is set equal to that half.
Therefore, at the beginning of each iteration, it is known that the current interval (a, b) contains a
solution.

Exercise 8.2.2 Implement Algorithm 8.2.1 in a Matlab function


[x,niter]=bisection(f,a,b,tol) that accepts as input a function handle f for
f (x), the endpoints a and b of an interval [a, b], and an absolute error tolerance tol so
that the function will exit and return the latest midpoint m of [a, b], in x when the length
of the interval [a, b] is less than 2tol, meaning that m is a distance less than tol from a
root x∗ of f . The output niter is the number of iterations; that is, the number of times
that the midpoint m of [a, b] is examined. Use the error function to ensure that if f
does not change sign on [a, b], the function immediately exits with an appropriate error
message.
In comparison to other methods, including some that we will discuss, bisection tends to converge
rather slowly, but it is also guaranteed to converge. These qualities can be seen in the following
result concerning the accuracy of bisection.

Theorem 8.2.2 Let f be continuous on [a, b], and assume that f (a)f (b) < 0. For each
positive integer n, let pn be the nth iterate that is produced by the bisection algorithm.
Then the sequence $\{p_n\}_{n=1}^\infty$ converges to a number p in (a, b) such that f(p) = 0, and
each iterate $p_n$ satisfies
$$|p_n - p| \leq \frac{b-a}{2^n}.$$

Exercise 8.2.3 Use induction to prove Theorem 8.2.2.

Exercise 8.2.4 On a computer using the IEEE double-precision floating-point system,


what is the largest number of iterations of bisection that is practical to perform? Justify
your answer.

It should be noted that because the nth iterate can lie anywhere within the interval (a, b) that is
used during the nth iteration, it is possible that the error bound given by this theorem may be
quite conservative.
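For example, to guarantee that $|p_n - p| \leq 10^{-6}$ when [a, b] = [0, π/2], it suffices by Theorem 8.2.2 to choose n so that $(b-a)/2^n \leq 10^{-6}$, that is,
$$n \geq \log_2\frac{b-a}{10^{-6}} = \log_2\left(\frac{\pi}{2}\cdot 10^6\right) \approx 20.6,$$
so n = 21 iterations certainly suffice.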

Example 8.2.3 We seek a solution of the equation f (x) = 0, where


f (x) = x2 − x − 1.
Because f (1) = −1 and f (2) = 1, and f is continuous, we can use the Intermediate Value Theorem
to conclude that f (x) = 0 has a solution in the interval (1, 2), since f (x) must assume every value
between −1 and 1 in this interval.
We use the method of bisection to find a solution. First, we compute the midpoint of the interval,
which is (1 + 2)/2 = 1.5. Since f (1.5) = −0.25, we see that f (x) changes sign between x = 1.5
and x = 2, so we can apply the Intermediate Value Theorem again to conclude that f (x) = 0 has a
solution in the interval (1.5, 2).
Continuing this process, we compute the midpoint of the interval (1.5, 2), which is (1.5 + 2)/2 =
1.75. Since f (1.75) = 0.3125, we see that f (x) changes sign between x = 1.5 and x = 1.75, so we
conclude that there is a solution in the interval (1.5, 1.75). The following table shows the outcome
of several more iterations of this procedure. Each row shows the current interval (a, b) in which
we know that a solution exists, as well as the midpoint of the interval, given by (a + b)/2, and the
value of f at the midpoint. Note that from iteration to iteration, only one of a or b changes, and
the endpoint that changes is always set equal to the midpoint.

a              b              m = (a + b)/2    f(m)
1              2              1.5              −0.25
1.5            2              1.75             0.3125
1.5            1.75           1.625            0.015625
1.5            1.625          1.5625           −0.12109
1.5625         1.625          1.59375          −0.053711
1.59375        1.625          1.609375         −0.019287
1.609375       1.625          1.6171875        −0.0018921
1.6171875      1.625          1.62109375       0.0068512
1.6171875      1.62109375     1.619140625      0.0024757
1.6171875      1.619140625    1.6181640625     0.00029087

The correct solution, to ten decimal places, is 1.6180339887, which is the number known as the
golden ratio. 2
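The computations in this example can be reproduced with a few lines of Matlab that simply repeat the update from Algorithm 8.2.1; the sketch below prints one row of the above table per iteration.

f = @(x) x.^2 - x - 1;
a = 1; b = 2;
for j = 1:10
    m = (a + b)/2;
    fprintf('%14.10f %14.10f %14.10f %12.6g\n', a, b, m, f(m));
    if f(a)*f(m) < 0    % f changes sign on (a,m)
        b = m;
    else                % otherwise f changes sign on (m,b)
        a = m;
    end
end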

Exercise 8.2.5 The function f (x) = x2 − 2x − 2 has two real roots, one positive and one
negative. Find two disjoint intervals [a1 , b1 ] and [a2 , b2 ] that can be used with bisection to
find the negative and positive roots, respectively. Why is it not practical to use a much
larger interval that contains both roots, so that bisection can supposedly find one of them?

For this method, it is easier to determine the order of convergence if we use a different measure
of the error in each iterate $x_k$. Since each iterate is contained within an interval $[a_k, b_k]$ where
$b_k - a_k = 2^{-k}(b-a)$, with [a, b] being the original interval, it follows that we can bound the error
$x_k - x^*$ by $e_k = b_k - a_k$. Using this measure, we can easily conclude that bisection converges
linearly, with asymptotic error constant 1/2.

8.3 Fixed-Point Iteration

A nonlinear equation of the form f (x) = 0 can be rewritten to obtain an equation of the form

x = g(x),

in which case the solution is a fixed point of the function g.

8.3.1 Successive Substitution

The formulation of the original problem f (x) = 0 into one of the form x = g(x) leads to a simple
solution method known as fixed-point iteration, or simple iteration, which we now describe.

Algorithm 8.3.1 (Fixed-Point Iteration) Let g be a continuous function defined on


the interval [a, b]. The following algorithm computes a number x∗ ∈ (a, b) that is a solution
to the equation g(x) = x.

Choose an initial guess x0 in [a, b].


for k = 0, 1, 2, . . . do
xk+1 = g(xk )
if |xk+1 − xk | is sufficiently small then
x∗ = xk+1
return x∗
end
end

When rewriting this equation in the form x = g(x), it is essential to choose the function g
wisely. One guideline is to choose g(x) = x − φ(x)f (x), where the function φ(x) is, ideally, nonzero
except possibly at a solution of f (x) = 0. This can be satisfied by choosing φ(x) to be constant,
but this can fail, as the following example illustrates.

Example 8.3.2 Consider the equation

x + ln x = 0.

By the Intermediate Value Theorem, this equation has a solution in the interval [0.5, 0.6]. Further-
more, this solution is unique. To see this, let f (x) = x + ln x. Then f 0 (x) = 1 + 1/x > 0 on the
domain of f , which means that f is increasing on its entire domain. Therefore, it is not possible
for f (x) = 0 to have more than one solution.
We consider using Fixed-point Iteration to solve the equivalent equation

x = g(x) = x − φ(x)f (x) = x − (1)(x + ln x) = − ln x.

That is, we choose φ(x) ≡ 1. Let us try applying Fixed-point Iteration in Matlab:

>> g=inline('-log(x)')
g =
Inline function:
g(x) = -log(x)
>> x=0.55;
>> x=g(x)
x =
0.597837000755620
>> x=g(x)
x =
0.514437136173803
>> x=g(x)
x =
0.664681915480620

Exercise 8.3.1 Try this for a few more iterations. What happens?

Clearly, we need to use a different approach for converting our original equation f (x) = 0 to an
equivalent equation of the form x = g(x).
What went wrong? To help us answer this question, we examine the error ek = xk −x∗ . Suppose
that x = g(x) has a solution x∗ in [a, b], as it does in the preceding example, and that g is also
continuously differentiable on [a, b], as was the case in that example. We can use the Mean Value
Theorem to obtain

$$e_{k+1} = x_{k+1} - x^* = g(x_k) - g(x^*) = g'(\xi_k)(x_k - x^*) = g'(\xi_k)e_k,$$

where ξk lies between xk and x∗ .


We do not yet know the conditions under which Fixed-Point Iteration will converge, but if it
does converge, then it follows from the continuity of g 0 at x∗ that it does so linearly with asymptotic
error constant |g 0 (x∗ )|, since, by the definition of ξk and the continuity of g 0 ,

$$\lim_{k\to\infty} \frac{|e_{k+1}|}{|e_k|} = \lim_{k\to\infty} |g'(\xi_k)| = |g'(x^*)|.$$

Recall, though, that for linear convergence, the asymptotic error constant C = |g 0 (x∗ )| must satisfy
C < 1. Unfortunately, with g(x) = − ln x, we have |g 0 (x)| = | − 1/x| > 1 on the interval [0.5, 0.6],
so it is not surprising that the iteration diverged.
What if we could convert the original equation f (x) = 0 into an equation of the form x = g(x)
so that g 0 satisfied |g 0 (x)| < 1 on an interval [a, b] where a fixed point was known to exist? What
we can do is take advantage of the differentiation rule
$$\frac{d}{dx}\left[f^{-1}(x)\right] = \frac{1}{f'(f^{-1}(x))}$$

and apply $g^{-1}(x) = e^{-x}$ to both sides of the equation x = g(x) to obtain
$$g^{-1}(x) = g^{-1}(g(x)) = x,$$
which simplifies to
$$x = e^{-x}.$$
The function $g(x) = e^{-x}$ satisfies $|g'(x)| < 1$ on [0.5, 0.6], as $g'(x) = -e^{-x}$, and $e^{-x} < 1$ when the
argument x is positive. What happens if you try Fixed-point Iteration with this choice of g?

>> g=inline('exp(-x)')
g =
Inline function:
g(x) = exp(-x)
>> x=0.55;
>> x=g(x)
x =
0.576949810380487
>> x=g(x)
x =

0.561608769952327
>> x=g(x)
x =
0.570290858658895
This is more promising.
Exercise 8.3.2 Continue this process to confirm that the iteration is in fact converging.
2

Having seen what can go wrong if we are not careful in applying Fixed-Point Iteration, we
should now address the questions of existence and uniqueness of a solution to the modified problem
g(x) = x. The following result, first proved in [9], answers the first of these questions.

Theorem 8.3.3 (Brouwer’s Fixed Point Theorem) Let g be continuous on [a, b]. If
g(x) ∈ [a, b] for each x ∈ [a, b], then g has a fixed point in [a, b].

Exercise 8.3.3 Use the Intermediate Value Theorem (see Theorem A.1.10) to prove
Theorem 8.3.3.
Given a continuous function g that is known to have a fixed point in an interval [a, b], we can try to
find this fixed point by repeatedly evaluating g at points in [a, b] until we find a point x for which
g(x) = x. This is the essence of the method of Fixed-point Iteration. However, just because g has
a fixed point does not mean that this iteration will necessarily converge. We will now investigate
this further.

8.3.2 Convergence Analysis


Under what circumstances will fixed-point iteration converge to a fixed point x∗ ? We say that a
function g that is continuous on [a, b] satisfies a Lipschitz condition on [a, b] if there exists a positive
constant L such that
|g(x) − g(y)| ≤ L|x − y|, x, y ∈ [a, b].
The constant L is called a Lipschitz constant. If, in addition, L < 1, we say that g is a contraction
on [a, b].
If we denote the error in xk by ek = xk − x∗ , we can see from the fact that g(x∗ ) = x∗ that if
xk ∈ [a, b], then

|ek+1 | = |xk+1 − x∗ | = |g(xk ) − g(x∗ )| ≤ L|xk − x∗ | ≤ L|ek | < |ek |.

Therefore, if g satisfies the conditions of the Brouwer Fixed-Point Theorem, and g is a contraction
on [a, b], and x0 ∈ [a, b] , then fixed-point iteration is convergent; that is, xk converges to x∗ .
Furthermore, the fixed point x∗ must be unique, for if there exist two distinct fixed points x∗
and y ∗ in [a, b], then, by the Lipschitz condition, we have

0 < |x∗ − y ∗ | = |g(x∗ ) − g(y ∗ )| ≤ L|x∗ − y ∗ | < |x∗ − y ∗ |,

which is a contradiction. Therefore, we must have x∗ = y ∗ . We summarize our findings with the
statement of the following result, first established in [3].

Theorem 8.3.4 (Contraction Mapping Theorem) Let g be a continuous function


on the interval [a, b]. If g(x) ∈ [a, b] for each x ∈ [a, b], and if there exists a constant
0 < L < 1 such that
|g(x) − g(y)| ≤ L|x − y|, x, y ∈ [a, b],
then g has a unique fixed point $x^*$ in [a, b], and the sequence of iterates $\{x_k\}_{k=0}^\infty$ converges
to $x^*$, for any initial guess $x_0 \in [a, b]$.

In general, when fixed-point iteration converges, it does so at a rate that varies inversely with the
Lipschitz constant L.
If g satisfies the conditions of the Contraction Mapping Theorem with Lipschitz constant L,
then Fixed-point Iteration achieves at least linear convergence, with an asymptotic error constant
that is bounded above by L. This value can be used to estimate the number of iterations needed
to obtain an additional correct decimal digit, but it can also be used to estimate the total number
of iterations needed for a specified degree of accuracy.
From the Lipschitz condition, we have, for k ≥ 1,
$$|x_k - x^*| \leq L|x_{k-1} - x^*| \leq \cdots \leq L^k|x_0 - x^*|.$$
From
$$|x_0 - x^*| \leq |x_0 - x_1 + x_1 - x^*| \leq |x_0 - x_1| + |x_1 - x^*| \leq |x_0 - x_1| + L|x_0 - x^*|$$
we obtain
$$|x_k - x^*| \leq \frac{L^k}{1-L}|x_1 - x_0|. \qquad (8.1)$$
Thus, after performing only a single iteration, we can bound the number of iterations that will be
needed to achieve a desired accuracy, as long as the Lipschitz constant L is known.
Exercise 8.3.4 Use (8.1) to determine a lower bound on the number of iterations re-
quired to ensure that |xk − x∗ | ≤  for some error tolerance .
We can now develop a practical implementation of Fixed-Point Iteration.
Exercise 8.3.5 Write a Matlab function [x,niter]=fixedpt(g,x0,tol) that imple-
ments Algorithm 8.3.1 to solve the equation x = g(x) with initial guess x0 , except that
instead of simply using the absolute difference between iterates to test for convergence, the
error estimate (8.1) is compared to the specified tolerance tol, and a maximum number
of iterations is determined based on the result of Exercise 8.3.4. The output arguments x
and niter are the computed solution x∗ and number of iterations, respectively. Test your
function on the equation from Example 8.3.2.
We know that Fixed-point Iteration will converge to the unique fixed point in [a, b] if g satisfies
the conditions of the Contraction Mapping Theorem. However, if g is differentiable on [a, b], its
derivative can be used to obtain an alternative criterion for convergence that can be more practical
than computing the Lipschitz constant L. If we denote the error in xk by ek = xk − x∗ , we can see
from the Mean Value Theorem and the fact that g(x∗ ) = x∗ that
$$e_{k+1} = x_{k+1} - x^* = g(x_k) - g(x^*) = g'(\xi_k)(x_k - x^*),$$
where $\xi_k$ lies between $x_k$ and $x^*$. However, from
$$|g'(\xi_k)| = \left|\frac{g(x_k) - g(x^*)}{x_k - x^*}\right|$$

it follows that if |g 0 (x)| ≤ L on (a, b), where L < 1, then the Contraction Mapping Theorem applies.
This leads to the following result.

Theorem 8.3.5 (Fixed-point Theorem) Let g be a continuous function on the inter-


val [a, b], and let g be differentiable on (a, b). If g(x) ∈ [a, b] for each x ∈ [a, b], and if
there exists a constant L < 1 such that

$$|g'(x)| \leq L, \qquad x \in (a, b),$$
then the sequence of iterates $\{x_k\}_{k=0}^\infty$ converges to the unique fixed point $x^*$ of g in [a, b],
for any initial guess $x_0 \in [a, b]$.

Using Taylor expansion of the error around x∗ , it can also be shown that if g 0 is continuous at x∗
and |g 0 (x∗ )| < 1, then Fixed-point Iteration is locally convergent; that is, it converges if x0 is chosen
sufficiently close to x∗ .
It can be seen from the preceding discussion why g 0 (x) must be bounded away from 1 on (a, b),
as opposed to the weaker condition |g 0 (x)| < 1 on (a, b). If g 0 (x) is allowed to approach 1 as x
approaches a point c ∈ (a, b), then it is possible that the error ek might not approach zero as k
increases, in which case Fixed-point Iteration would not converge.

Exercise 8.3.6 Find a function g and interval [a, b] such that g is continuous on [a, b] and
differentiable on (a, b), but does not satisfy a Lipschitz condition on [a, b] for any Lipschitz
constant L.
The derivative can also be used to indicate why Fixed-point Iteration might not converge.
Example 8.3.6 The function $g(x) = x^2 + \frac{3}{16}$ has two fixed points, $x_1^* = 1/4$ and $x_2^* = 3/4$, as can
be determined by solving the quadratic equation $x^2 + \frac{3}{16} = x$. If we consider the interval [0, 3/8],
then g satisfies the conditions of the Fixed-point Theorem, as $g'(x) = 2x < 1$ on this interval, and
therefore Fixed-point Iteration will converge to $x_1^*$ for any $x_0 \in [0, 3/8]$.
On the other hand, $g'(3/4) = 2(3/4) = 3/2 > 1$. Therefore, it is not possible for g to satisfy the
conditions of the Fixed-point Theorem. Furthermore, if $x_0$ is chosen so that $1/4 < x_0 < 3/4$, then
Fixed-point Iteration will converge to $x_1^* = 1/4$, whereas if $x_0 > 3/4$, then Fixed-point Iteration
diverges. 2
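The differing behavior near the two fixed points can be observed directly; the sketch below runs a few iterations of $x_{k+1} = g(x_k)$ from starting points on either side of 3/4, and the particular starting values are only illustrative.

g = @(x) x.^2 + 3/16;
x = 0.7;                          % 1/4 < x0 < 3/4
for k = 1:20, x = g(x); end
x                                 % close to the fixed point 1/4
y = 0.8;                          % x0 > 3/4
for k = 1:20, y = g(y); end
y                                 % the iterates grow without bound (eventually overflowing)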

The fixed point x∗2 = 3/4 in the preceding example is an unstable fixed point of g, meaning that
no choice of x0 yields a sequence of iterates that converges to x∗2 . The fixed point x∗1 = 1/4 is a
stable fixed point of g, meaning that any choice of x0 that is sufficiently close to x∗1 yields a sequence
of iterates that converges to x∗1 .
The preceding example shows that Fixed-point Iteration applied to an equation of the form
x = g(x) can fail to converge to a fixed point x∗ if |g 0 (x∗ )| > 1. We wish to determine whether
this condition indicates non-convergence in general. If |g 0 (x∗ )| > 1, and g 0 is continuous in a
neighborhood of x∗ , then there exists an interval |x − x∗ | ≤ δ such that |g 0 (x)| > 1 on the interval.
If xk lies within this interval, it follows from the Mean Value Theorem that
|xk+1 − x∗ | = |g(xk ) − g(x∗ )| = |g 0 (η)||xk − x∗ |,
where η lies between xk and x∗ . Because η is also in this interval, we have
|xk+1 − x∗ | > |xk − x∗ |.

In other words, the error in the iterates increases whenever they fall within a sufficiently small
interval containing the fixed point. Because of this increase, the iterates must eventually fall outside
of the interval. Therefore, it is not possible to find a k0 , for any given δ, such that |xk − x∗ | ≤ δ
for all k ≥ k0 . We have thus proven the following result.

Theorem 8.3.7 Let g have a fixed point at $x^*$, and let $g'$ be continuous in a neighborhood
of $x^*$. If $|g'(x^*)| > 1$, then Fixed-point Iteration does not converge to $x^*$ from any initial
guess $x_0$, except in the trivial case that some iterate $x_k$ is exactly equal to $x^*$.
Now, suppose that in addition to the conditions of the Fixed-point Theorem, we assume that
g 0 (x∗ ) = 0, and that g is twice continuously differentiable on [a, b]. Then, using Taylor’s Theorem,
we obtain
$$e_{k+1} = g(x_k) - g(x^*) = g'(x^*)(x_k - x^*) + \frac{1}{2}g''(\xi_k)(x_k - x^*)^2 = \frac{1}{2}g''(\xi_k)e_k^2,$$
where ξk lies between xk and x∗ . It follows that for any initial iterate x0 ∈ [a, b], Fixed-point
Iteration converges at least quadratically, with asymptotic error constant |g 00 (x∗ )/2|. Later, this
will be exploited to obtain a quadratically convergent method for solving nonlinear equations of
the form f (x) = 0.

8.3.3 Relaxation
Now that we understand the convergence behavior of Fixed-point Iteration, we consider the appli-
cation of Fixed-point Iteration to the solution of an equation of the form f (x) = 0.

Example 8.3.8 We use Fixed-point Iteration to solve the equation f (x) = 0, where f (x) = x −
cos x − 2. It makes sense to work with the equation x = g(x), where g(x) = cos x + 2.
Where should we look for a solution to this equation? For example, consider the interval [0, π/4].
On this interval, $g'(x) = -\sin x$, which certainly satisfies the condition $|g'(x)| \leq \rho < 1$ where
$\rho = \sqrt{2}/2$, but g does not map this interval into itself, as required by the Brouwer Fixed-point
Theorem.
On the other hand, if we consider the interval [1, 3], it can readily be confirmed that g(x) maps
this interval to itself, as 1 ≤ 2 + cos x ≤ 3 for all real x, so a fixed point exists in this interval.
First, let’s set up a figure with a graph of g(x):

>> x=1:0.01:3;
>> figure(1)
>> plot(x,g(x))
>> hold on
>> plot([ 1 3 ],[ 1 3 ],'k--')
>> xlabel('x')
>> ylabel('y')
>> title('g(x) = cos x + 2')

Exercise 8.3.7 Go ahead and try Fixed-point Iteration on g(x), with initial guess x0 = 2.
What happens?

Figure 8.5: Fixed-point Iteration applied to g(x) = cos x + 2.

The behavior is quite interesting, as the iterates seem to bounce back and forth. This is illustrated in
Figure 8.5. Continuing, we see that convergence is achieved, but it is quite slow. An examination
of the derivative explains why: g 0 (x) = − sin x, and we have |g 0 (π/2)| = | − sin π/2| = 1, so
the conditions of the Fixed-point Theorem are not satisfied–in fact, we could not be assured of
convergence at all, though it does occur in this case.
An examination of the iterates shown in Figure 8.5, along with an indication of the solution,
suggests how convergence can be accelerated. What if we used the average of x and g(x) at each
iteration? That is, we solve x = h(x), where
$$h(x) = \frac{1}{2}[x + g(x)] = \frac{1}{2}[x + \cos x + 2].$$
You can confirm that if x = h(x), then f (x) = 0. However, we have
$$h'(x) = \frac{1}{2}[1 - \sin x],$$
and how large can this be on the interval [1, 3]? In this case, the Fixed-point Theorem does apply.

Exercise 8.3.8 Try Fixed-point Iteration with h(x), and with initial guess x0 = 2. What
behavior can you observe?

The lesson to be learned here is that the most straightforward choice of g(x) is not always the
wisest–the key is to minimize the size of the derivative near the solution. 2

As previously discussed, a common choice for a function g(x) to use with Fixed-point Iteration
to solve the equation f (x) = 0 is a function of the form g(x) = x − φ(x)f (x), where φ(x) is nonzero.

Clearly, the simplest choice of φ(x) is a constant function φ(x) ≡ λ, but it is important to choose
λ so that Fixed-point Iteration with g(x) will converge.
Suppose that x∗ is a solution of the equation f (x) = 0, and that f is continuously differentiable
in a neighborhood of x∗ , with f 0 (x∗ ) = α > 0. Then, by continuity, there exists an interval
[x∗ − δ, x∗ + δ] containing x∗ on which m ≤ f 0 (x) ≤ M , for positive constants m and M . We can
then prove the following results.

Exercise 8.3.9 Let f 0 (x∗ ) = α > 0. Prove that there exist δ, λ > 0 such that on the
interval |x − x∗ | ≤ δ, there exists L < 1 such that |g 0 (x)| ≤ L, where g(x) = x − λf (x).
What is the value of L? Hint: choose λ so that upper and lower bounds on g 0 (x) are equal
to ±L.

Exercise 8.3.10 Under the assumptions of Exercise 8.3.9, prove that if Iδ = [x∗ −δ, x∗ +
δ], and λ > 0 is chosen so that |g 0 (x)| ≤ L < 1 on Iδ , then g maps Iδ into itself.
We conclude from the preceding two exercises that the Fixed-point Theorem applies, and Fixed-
point Iteration converges linearly to x∗ for any choice of x0 in [x∗ − δ, x∗ + δ], with asymptotic error
constant |1 − λα| ≤ L.
In summary, if f is continuously differentiable in a neighborhood of a root $x^*$ of f(x) = 0, and
$f'(x^*)$ is nonzero, then there exists a constant λ such that Fixed-point Iteration with $g(x) = x - \lambda f(x)$
converges to x∗ for x0 chosen sufficiently close to x∗ . This approach to Fixed-point Iteration, with
a constant φ, is known as relaxation.
Convergence can be accelerated by allowing λ to vary from iteration to iteration. Intuitively,
an effective choice is to try to minimize |g 0 (x)| near x∗ by setting λ = 1/f 0 (xk ), for each k, so that
g 0 (xk ) = 1 − λf 0 (xk ) = 0. This results in linear convergence with an asymptotic error constant of 0,
which indicates faster than linear convergence. We will see that convergence is actually quadratic.
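To make this concrete, the sketch below applies relaxation to the function f(x) = x − cos x − 2 from Example 8.3.8, for which f′(x) = 1 + sin x, once with a constant λ and once with λ = 1/f′(xₖ); the constant λ = 1/4 and the number of iterations are only illustrative choices.

f = @(x) x - cos(x) - 2;
fp = @(x) 1 + sin(x);             % f'(x)
x = 2; y = 2;                     % common initial guess
for k = 1:5
    x = x - 0.25*f(x);            % relaxation with constant lambda = 1/4
    y = y - f(y)/fp(y);           % lambda = 1/f'(y_k), varying with each iteration
end
[abs(f(x)) abs(f(y))]             % residuals: the second is far smaller

Note that the constant choice λ = 1/2 would reproduce the function h(x) used in Example 8.3.8, since x − f(x)/2 = (x + cos x + 2)/2.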

8.4 Newton’s Method and the Secant Method


To develop a more effective method for solving this problem of computing a solution to f (x) = 0,
we can address the following questions:

• Are there cases in which the problem is easy to solve, and if so, how do we solve it in such cases?

• Is it possible to apply our method of solving the problem in these “easy” cases to more general
cases?

A recurring theme in this book is that these questions are useful for solving a variety of problems.

8.4.1 Newton’s Method


For the problem at hand, we ask whether the equation f (x) = 0 is easy to solve for any particular
choice of the function f . Certainly, if f is a linear function, then it has a formula of the form
f (x) = m(x − a) + b, where m and b are constants and m 6= 0. Setting f (x) = 0 yields the equation

m(x − a) + b = 0,

which can easily be solved for x to obtain the unique solution


$$x = a - \frac{b}{m}.$$
We now consider the case where f is not a linear function. For example, recall when we solved
the equation f (x) = x − cos x using Bisection. In Figure 8.2, note that near the root, f is well-
approximated by a linear function. How shall we exploit this?
Using Taylor’s theorem, it is simple to construct a linear function that approximates f (x) near
a given point x0 . This function is simply the first Taylor polynomial of f (x) with center x0 ,

$$P_1(x) = f(x_0) + f'(x_0)(x - x_0).$$

This function has a useful geometric interpretation, as its graph is the tangent line of f (x) at the
point (x0 , f (x0 )).
We will illustrate this in Matlab, for the example f (x) = x − cos x, and initial guess x0 = 1.
The following code plots f (x) and the tangent line at (x0 , f (x0 )).
>> f=inline('x - cos(x)');
>> a=0.5;
>> b=1.5;
>> h=0.01;
>> x=a:h:b;
>> % plot f(x) on [a,b]
>> plot(x,f(x))
>> hold on
>> % plot x-axis
>> plot([ a b ],[ 0 0 ],'k')
>> x0=1;
>> % plot initial guess on graph of f(x)
>> plot(x0,f(x0),'ro')
>> % f'(x) = 1 + sin(x)
>> % slope of tangent line: m = f'(x0)
>> m=1+sin(x0);
>> % plot tangent line using points x=a,b
>> plot([ a b ],[ f(x0) + m*([ a b ] - x0) ],'r')
>> xlabel('x')
>> ylabel('y')

Exercise 8.4.1 Rearrange the formula for the tangent line approximation P1 (x) to obtain
a formula for its x-intercept x1 . Compute this value in Matlab and plot the point
(x1 , f (x1 )) as a red ’+’.

The plot that should be obtained from the preceding code and Exercise 8.4.1 is shown in Figure
8.6.
As can be seen in Figure 8.6, we can obtain an approximate solution to the equation f (x) = 0
by determining where the linear function P1 (x) is equal to zero. If the resulting value, x1 , is not a

Figure 8.6: Approximating a root of f (x) = x − cos x using the tangent line of f (x) at x0 = 1.

solution, then we can repeat this process, approximating f by a linear function near x1 and once
again determining where this approximation is equal to zero.
Exercise 8.4.2 Modify the above Matlab statements to effectively “zoom in” on the
graph of f (x) near x = x1 , the zero of the tangent line approximation P1 (x) above. Use
the tangent line at (x1 , f (x1 )) to compute a second approximation x2 of the root of f (x).
What do you observe?

The algorithm that results from repeating this process of approximating a root of f (x) using
tangent line approximations is known as Newton’s method, which we now describe in detail.

Algorithm 8.4.1 (Newton’s Method) Let f : R → R be a differentiable function.


The following algorithm computes an approximate solution x∗ to the equation f (x) = 0.

Choose an initial guess x0


for k = 0, 1, 2, . . . do
if f (xk ) is sufficiently small then
x∗ = xk
return x∗
end
xk+1 = xk − f(xk)/f'(xk)
if |xk+1 − xk | is sufficiently small then
x∗ = xk+1
return x∗
end
end

Exercise 8.4.3 Write a Matlab function [x,niter]=newton(f,fp,x0,tol) that im-


plements Algorithm 8.4.1 to solve the equation f (x) = 0 using Newton’s Method with
initial guess x0 and absolute error tolerance tol. The second input argument fp must
be a function handle for f 0 (x). The output arguments x and niter are the computed
solution x∗ and number of iterations, respectively.
It is worth noting that Newton’s Method is equivalent to performing Fixed-point Iteration with
g(x) = x − λk f (x), where λk = 1/f 0 (xk ) for the kth iteration that computes xk+1 . Recall that this
value was chosen in order to make relaxation as rapidly convergent as possible.
In fact, when Newton’s method converges, it does so very rapidly. However, it can be difficult
to ensure convergence, particularly if f (x) has horizontal tangents near the solution x∗ . Typically,
it is necessary to choose a starting iterate x0 that is close to x∗ . As the following result indicates,
such a choice, if close enough, is indeed sufficient for convergence [36, Theorem 1.7].

Theorem 8.4.2 (Convergence of Newton’s Method) Let f be twice continuously


differentiable on the interval [a, b], and suppose that f (c) = 0 and f 0 (c) 6= 0 for some
c ∈ [a, b]. Then there exists a δ > 0 such that Newton’s Method applied to f (x) converges
to c for any initial guess x0 in the interval [c − δ, c + δ].

Example 8.4.3 We will use Newton’s Method to compute a root of f (x) = x − cos x. Since f 0 (x) =
1 + sin x, it follows that in Newton’s Method, we can obtain the next iterate xn+1 from the previous
iterate xn by
xn+1 = xn − f(xn)/f'(xn) = xn − (xn − cos xn)/(1 + sin xn) = (xn sin xn + cos xn)/(1 + sin xn).
We choose our starting iterate x0 = 1, and compute the next several iterates as follows:

x1 = (1 · sin 1 + cos 1)/(1 + sin 1) = 0.750363867840244
x2 = (x1 sin x1 + cos x1)/(1 + sin x1) = 0.739112890911362
x3 = 0.739085133385284
x4 = 0.739085133215161
x5 = 0.739085133215161.

Since the fourth and fifth iterates agree to 15 decimal places, we assume that 0.739085133215161
is a correct solution to f (x) = 0, to at least 15 decimal places. 2
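The iterates above can be reproduced with a few lines of Matlab; the following loop is only a sketch that applies the update formula derived in this example.

% Newton's Method for f(x) = x - cos(x), starting from x0 = 1
x = 1;
for n = 1:5
    x = (x*sin(x) + cos(x))/(1 + sin(x));   % x_{n+1} in terms of x_n
    fprintf('x%d = %.15f\n', n, x)
end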

Exercise 8.4.4 How many correct decimal places are obtained in each xk in the preceding
example? What does this suggest about the order of convergence of Newton’s method?
We can see from this example that Newton’s method converged to a root far more rapidly than
the Bisection method did when applied to the same function. However, unlike Bisection, Newton’s
method is not guaranteed to converge. Whereas Bisection is guaranteed to converge to a root in
[a, b] if f (a)f (b) < 0, we must be careful about our choice of the initial guess x0 for Newton’s
method, as the following example illustrates.

Example 8.4.4 Newton’s Method can be used to compute the reciprocal of a number a without
performing any divisions. The solution, 1/a, satisfies the equation f (x) = 0, where

f(x) = a − 1/x.

Since

f'(x) = 1/x²,

it follows that in Newton's Method, we can obtain the next iterate xn+1 from the previous iterate
xn by

xn+1 = xn − (a − 1/xn)/(1/xn²) = xn − (a xn² − xn) = 2xn − a xn².

Note that no divisions are necessary to obtain xn+1 from xn . This iteration was actually used on
older IBM computers to implement division in hardware.
We use this iteration to compute the reciprocal of a = 12. Choosing our starting iterate to be
0.1, we compute the next several iterates as follows:

x1 = 2(0.1) − 12(0.1)² = 0.08
x2 = 2(0.08) − 12(0.08)² = 0.0832
x3 = 0.0833312
x4 = 0.08333333333279
x5 = 0.08333333333333.

We conclude that 0.08333333333333 is an accurate approximation to the correct solution.


Now, suppose we repeat this process, but with an initial iterate of x0 = 1. Then, we have

x1 = 2(1) − 12(1)² = −10
x2 = 2(−10) − 12(−10)² = −1220
x3 = 2(−1220) − 12(−1220)² = −17863240

It is clear that this sequence of iterates is not going to converge to the correct solution. In general,
for this iteration to converge to the reciprocal of a, the initial iterate x0 must be chosen so that
0 < x0 < 2/a. This condition guarantees that the next iterate x1 will at least be positive. The
contrast between the two choices of x0 is illustrated in Figure 8.7 for a = 8. 2
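The following script is a sketch of the computation in this example; it carries out the division-free iteration for a = 12 from both starting values used above.

a = 12;
x = 0.1;                        % starting iterate in (0, 2/a): converges
for n = 1:5
    x = 2*x - a*x^2;            % one Newton step for f(x) = a - 1/x
    fprintf('%.14f\n', x)
end
x = 1;                          % starting iterate outside (0, 2/a): diverges
for n = 1:3
    x = 2*x - a*x^2;
    disp(x)
end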

8.4.2 Convergence Analysis

We now analyze the convergence of Newton’s Method applied to the equation f (x) = 0, where we
assume that f is twice continuously differentiable near the exact solution x∗ . As before, we define

Figure 8.7: Newton’s Method used to compute the reciprocal of 8 by solving the equation f (x) =
8 − 1/x = 0. When x0 = 0.1, the tangent line of f (x) at (x0 , f (x0 )) crosses the x-axis at x1 = 0.12,
which is close to the exact solution. When x0 = 1, the tangent line crosses the x-axis at x1 = −6,
which causes searching to continue on the wrong portion of the graph, so the sequence of iterates
does not converge to the correct solution.

ek = xk − x∗ to be the error after k iterations. Using a Taylor expansion around xk , we obtain

ek+1 = xk+1 − x∗
     = xk − f(xk)/f'(xk) − x∗
     = ek − f(xk)/f'(xk)
     = ek − (1/f'(xk)) [ f(x∗) − f'(xk)(x∗ − xk) − (1/2) f''(ξk)(xk − x∗)² ]
     = (f''(ξk)/(2 f'(xk))) ek²,
where ξk is between xk and x∗ .
Because, for each k, ξk lies between xk and x∗ , ξk converges to x∗ as well. By the continuity
of f 00 , we conclude that Newton’s method converges quadratically to x∗ , with asymptotic error
constant
C = |f''(x∗)| / (2|f'(x∗)|).

Example 8.4.5 Suppose that Newton’s Method is used to find the solution of f (x) = 0, where

f(x) = x² − 2. We examine the error ek = xk − x∗, where x∗ = √2 is the exact solution. The first
two iterations are illustrated in Figure 8.8. Continuing, we obtain

Figure 8.8: Newton’s Method applied to f (x) = x2 − 2. The bold curve is the graph of f . The
initial iterate x0 is chosen to be 1. The tangent line of f (x) at the point (x0 , f (x0 )) is used to
approximate f (x), and it crosses the x-axis at x1 = 1.5, which is much closer to the exact solution
than x0 . Then, the tangent line at (x1 , f (x1 )) is used to approximate f (x), and it crosses the x-axis
at x2 = 1.416̄, which is already very close to the exact solution.

k xk |ek |
0 1 0.41421356237310
1 1.5 0.08578643762690
2 1.41666666666667 0.00245310429357
3 1.41421568627457 0.00000212390141
4 1.41421356237469 0.00000000000159

We can determine analytically that Newton's Method converges quadratically, and in this example,
the asymptotic error constant is |f''(√2)/(2f'(√2))| ≈ 0.35355. Examining the numbers in the
table above, we can see that the number of correct decimal places approximately doubles with each
iteration, which is typical of quadratic convergence. Furthermore, we have

|e4|/|e3|² ≈ 0.35352,

so the actual behavior of the error is consistent with the behavior that is predicted by theory. 2
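The error behavior in the table can be checked directly in Matlab; the following sketch prints the ratio |e_{k+1}|/|e_k|², which should approach the asymptotic error constant |f''(√2)|/(2|f'(√2)|) ≈ 0.35355.

f = @(x) x.^2 - 2;  fp = @(x) 2*x;
xstar = sqrt(2);
x = 1;  e = abs(x - xstar);
for k = 1:4
    x = x - f(x)/fp(x);                         % Newton step
    enew = abs(x - xstar);
    fprintf('%18.14f  %10.3e  %8.5f\n', x, enew, enew/e^2)
    e = enew;
end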

It is easy to see from the above analysis, however, that if f 0 (x∗ ) is very small, or zero, then
convergence can be very slow, or may not even occur.

Example 8.4.6 The function


f (x) = (x − 1)2 ex
has a double root at x∗ = 1, and therefore f 0 (x∗ ) = 0. Therefore, the previous convergence analysis
does not apply. Instead, we obtain

ek+1 = xk+1 − 1
     = xk − f(xk)/f'(xk) − 1
     = xk − (xk − 1)² e^{xk} / ([2(xk − 1) + (xk − 1)²] e^{xk}) − 1
     = ek − ek²/(2ek + ek²)
     = ek · xk/(xk + 1).

It follows that if we choose x0 > 0, then Newton’s method converges to x∗ = 1 linearly, with
asymptotic error constant C = 1/2. 2

Exercise 8.4.5 Prove that if f(x∗) = 0, f'(x∗) = 0, and f''(x∗) ≠ 0, then Newton's
method converges linearly with asymptotic error constant C = 1/2. That is, the result
obtained in Example 8.4.6 applies in general.
Normally, convergence of Newton’s method is only assured if x0 is chosen sufficiently close to
x∗ . However, in some cases, it is possible to prove that Newton’s method converges quadratically
on an interval, under certain conditions on the sign of the derivatives of f on that interval. For
example, suppose that on the interval Iδ = [x∗ , x∗ + δ], f 0 (x) > 0 and f 00 (x) > 0, so that f is
increasing and concave up on this interval.
Let xk ∈ Iδ. Then, from

xk+1 = xk − f(xk)/f'(xk),

we have xk+1 < xk, because f, being equal to zero at x∗ and increasing on Iδ, must be positive at
xk. However, because

xk+1 − x∗ = (f''(ξk)/(2f'(xk))) (xk − x∗)²,

and f' and f'' are both positive at xk, we must also have xk+1 > x∗.
It follows that the sequence {xk} is monotonic and bounded, and therefore must be convergent
to a limit x̂ ∈ Iδ. From the convergence of the sequence and the determination of xk+1 from xk,
it follows that f(x̂) = 0. However, f is positive on (x∗, x∗ + δ], which means that we must have

x̂ = x∗, so Newton's method converges to x∗. Using the previous analysis, it can be shown that
this convergence is quadratic.

Exercise 8.4.6 Prove that Newton’s method converges quadratically if f 0 (x) > 0 and
f 00 (x) < 0 on an interval [a, b] that contains x∗ and x0 , where x0 < x∗ . What goes wrong
if x0 > x∗ ?

8.4.3 The Secant Method


One drawback of Newton’s method is that it is necessary to evaluate f 0 (x) at various points, which
may not be practical for some choices of f . The secant method avoids this issue by using a finite
difference to approximate the derivative. As a result, f (x) is approximated by a secant line through
two points on the graph of f , rather than a tangent line through one point on the graph.
Since a secant line is defined using two points on the graph of f (x), as opposed to a tangent
line that requires information at only one point on the graph, it is necessary to choose two initial
iterates x0 and x1 . Then, as in Newton’s method, the next iterate x2 is then obtained by computing
the x-value at which the secant line passing through the points (x0 , f (x0 )) and (x1 , f (x1 )) has a
y-coordinate of zero. This yields the equation

[(f(x1) − f(x0))/(x1 − x0)] (x2 − x1) + f(x1) = 0,

which has the solution

x2 = x1 − f(x1)(x1 − x0)/(f(x1) − f(x0)).
This leads to the following algorithm.

Algorithm 8.4.7 (Secant Method) Let f : R → R be a continuous function. The


following algorithm computes an approximate solution x∗ to the equation f (x) = 0.

Choose two initial guesses x0 and x1


for k = 1, 2, 3, . . . do
if f (xk ) is sufficiently small then
x∗ = xk
return x∗
end
xk+1 = xk − f(xk)(xk − xk−1)/(f(xk) − f(xk−1))
if |xk+1 − xk | is sufficiently small then
x∗ = xk+1
return x∗
end
end

Exercise 8.4.7 Write a Matlab function [x,niter]=secant(f,x0,x1,tol) that im-


plements Algorithm 8.4.7 to solve the equation f (x) = 0 using the Secant Method with
initial guesses x0 and x1 , and absolute error tolerance tol. The output arguments x and
niter are the computed solution x∗ and number of iterations, respectively.

As with Newton's method, it is necessary to choose the starting iterate x0 to be reasonably close
to the solution x∗ . Convergence is not as rapid as that of Newton’s Method, since the secant-line
approximation of f is not as accurate as the tangent-line approximation employed by Newton’s
method.

Example 8.4.8 We will use the Secant Method to solve the equation f (x) = 0, where f (x) = x2 −2.
This method requires that we choose two initial iterates x0 and x1 , and then compute subsequent
iterates using the formula
xn+1 = xn − f(xn)(xn − xn−1)/(f(xn) − f(xn−1)),   n = 1, 2, 3, . . . .
We choose x0 = 1 and x1 = 1.5. Applying the above formula, we obtain

x2 = 1.4
x3 = 1.413793103448276
x4 = 1.414215686274510
x5 = 1.414213562057320.

As we can see, the iterates produced by the Secant Method are converging to the exact solution
x∗ = √2, but not as rapidly as those produced by Newton's Method. 2
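As a sketch, the iterates in this example can be generated with the following loop; only the two starting values and the update formula are needed.

f = @(x) x.^2 - 2;
x0 = 1;  x1 = 1.5;
for n = 1:4
    x2 = x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0));   % Secant update
    fprintf('x%d = %.15f\n', n+1, x2)
    x0 = x1;  x1 = x2;
end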

Exercise 8.4.8 How many correct decimal places are obtained in each xk in the preceding
example? What does this suggest about the order of convergence of the Secant method?
We now prove that the Secant Method converges if x0 is chosen sufficiently close to a solution
x∗ of f(x) = 0, if f is continuously differentiable near x∗ and f'(x∗) = α ≠ 0. Without loss of
generality, we assume α > 0. Then, by the continuity of f', there exists an interval Iδ = [x∗ − δ, x∗ + δ]
such that

3α/4 ≤ f'(x) ≤ 5α/4,   x ∈ Iδ.
It follows from the Mean Value Theorem that

xk+1 − x∗ = xk − x∗ − f(xk) (xk − xk−1)/(f(xk) − f(xk−1))
          = xk − x∗ − f'(θk)(xk − x∗)/f'(ϕk)
          = [1 − f'(θk)/f'(ϕk)] (xk − x∗),

where θk lies between xk and x∗, and ϕk lies between xk and xk−1. Therefore, if xk−1 and xk are
in Iδ, then so are ϕk and θk, and xk+1 satisfies

|xk+1 − x∗| ≤ max{ |1 − (5α/4)/(3α/4)|, |1 − (3α/4)/(5α/4)| } |xk − x∗| = (2/3) |xk − x∗|.

We conclude that if x0, x1 ∈ Iδ, then all subsequent iterates lie in Iδ, and the Secant Method
converges at least linearly, with asymptotic rate constant 2/3.

The order of convergence of the Secant Method can be determined using a result, which we will
not prove here, stating that if {xk }∞k=0 is the sequence of iterates produced by the Secant Method
for solving f (x) = 0, and if this sequence converges to a solution x∗ , then for k sufficiently large,
|xk+1 − x∗ | ≈ S|xk − x∗ ||xk−1 − x∗ |
for some constant S.
We assume that {xk } converges to x∗ of order α. Then, dividing both sides of the above relation
by |xk − x∗ |α , we obtain
|xk+1 − x∗| / |xk − x∗|^α ≈ S |xk − x∗|^{1−α} |xk−1 − x∗|.
Because α is the order of convergence, the left side must converge to a positive constant C as
k → ∞. It follows that the right side must converge to a positive constant as well, as must its
reciprocal. In other words, there must exist positive constants C1 and C2 such that

|xk − x∗| / |xk−1 − x∗|^α → C1,    |xk − x∗|^{α−1} / |xk−1 − x∗| → C2.

This can only be the case if there exists a nonzero constant β such that

|xk − x∗| / |xk−1 − x∗|^α = ( |xk − x∗|^{α−1} / |xk−1 − x∗| )^β,

which implies that

1 = (α − 1)β and α = β.
Eliminating β, we obtain the equation
α² − α − 1 = 0,

which has the solutions

α1 = (1 + √5)/2 ≈ 1.618,    α2 = (1 − √5)/2 ≈ −0.618.
Since we must have α ≥ 1, the order of convergence is 1.618.
Exercise 8.4.9 What is the value of S in the preceding discussion? That is, compute
S = lim_{xk−1, xk → x∗} (xk+1 − x∗) / [(xk − x∗)(xk−1 − x∗)].

Hint: Take one limit at a time, and use Taylor expansion. Assume that x∗ is not a double
root.

Exercise 8.4.10 Use the value of S from the preceding exercise to obtain the asymptotic
error constant for the Secant method.

Exercise 8.4.11 Use both Newton’s method and the Secant method to compute a root
of the same polynomial. For both methods, count the number of floating-point operations
required in each iteration, and the number of iterations required to achieve convergence
with the same error tolerance. Which method requires fewer floating-point operations?

8.5 Convergence Acceleration


Suppose that a sequence {xk}∞k=0 converges linearly to a limit x∗, in such a way that xk − x∗ has
the same sign for all k sufficiently large; that is, {xk} converges monotonically to x∗. It
follows from the linear convergence of {xk } that for sufficiently large k,
(xk+2 − x∗)/(xk+1 − x∗) ≈ (xk+1 − x∗)/(xk − x∗).  (8.2)

Solving for x∗ yields

x∗ ≈ xk − (xk+1 − xk)² / (xk+2 − 2xk+1 + xk).  (8.3)

Exercise 8.5.1 Solve (8.2) for x∗ to obtain (8.3).


Therefore, we can construct an alternative sequence {x̂k}∞k=0, where

x̂k = xk − (xk+1 − xk)² / (xk+2 − 2xk+1 + xk),
that also converges to x∗ . This sequence has the following desirable property.

Theorem 8.5.1 Suppose that the sequence {xk}∞k=0 converges linearly to a limit x∗ and
that for k sufficiently large, (xk+1 − x∗)(xk − x∗) > 0. Then, if the sequence {x̂k}∞k=0 is
defined by

x̂k = xk − (xk+1 − xk)² / (xk+2 − 2xk+1 + xk),   k = 0, 1, 2, . . . ,

then

lim_{k→∞} (x̂k − x∗)/(xk − x∗) = 0.

In other words, the sequence {x̂k } converges to x∗ more rapidly than {xk } does.

Exercise 8.5.2 Prove Theorem 8.5.1. Assume that the sequence {xk}∞k=0 converges
linearly with asymptotic error constant C, where 0 < C < 1.
If we define the forward difference operator ∆ by
∆xk = xk+1 − xk ,
then
∆2 xk = ∆(xk+1 − xk ) = (xk+2 − xk+1 ) − (xk+1 − xk ) = xk+2 − 2xk+1 + xk ,
and therefore x̂k can be rewritten as
x̂k = xk − (∆xk)² / ∆²xk,   k = 0, 1, 2, . . .
For this reason, the method of accelerating the convergence of {xk } by constructing {x̂k } is called
Aitken’s ∆2 Method [2].
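The formula for x̂k is easy to apply in vectorized form. The following sketch uses a synthetic, exactly geometric sequence xk = 2 + (1/2)^k (chosen only for illustration); for such a sequence Aitken's ∆² formula recovers the limit exactly, up to rounding error.

x = 2 + 0.5.^(0:7)';             % linearly convergent test sequence with limit 2
dx = diff(x);                    % forward differences, Delta x_k
d2x = diff(x, 2);                % second differences, Delta^2 x_k
xhat = x(1:6) - dx(1:6).^2 ./ d2x;
disp([x(1:6) - 2, xhat - 2])     % errors of the original and accelerated sequences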
A slight variation of this method, called Steffensen’s Method, can be used to accelerate the
convergence of Fixed-point Iteration, which, as previously discussed, is linearly convergent. The
basic idea is as follows:

1. Choose an initial iterate x0

2. Compute x1 and x2 using Fixed-point Iteration

3. Use Aitken’s ∆2 Method to compute x̂0 from x0 , x1 , and x2

4. Repeat steps 2 and 3 with x0 = x̂0 .

The principle behind Steffensen’s Method is that x̂0 is thought to be a better approximation to the
fixed point x∗ than x2 , so it should be used as the next iterate for Fixed-point Iteration.

Example 8.5.2 We wish to find the unique fixed point of the function g(x) = cos x on the interval
[0, 1]. If we use Fixed-point Iteration with x0 = 0.5, then we obtain the following iterates from the
formula xk+1 = g(xk) = cos(xk). All iterates are rounded to five decimal places.

x1 = 0.87758
x2 = 0.63901
x3 = 0.80269
x4 = 0.69478
x5 = 0.76820
x6 = 0.71917.

These iterates show little sign of converging, as they are oscillating around the fixed point.
If, instead, we use Fixed-point Iteration with acceleration by Aitken’s ∆2 method, we obtain a
new sequence of iterates {x̂k }, where

x̂k = xk − (∆xk)² / ∆²xk
    = xk − (xk+1 − xk)² / (xk+2 − 2xk+1 + xk),

for k = 0, 1, 2, . . .. The first few iterates of this sequence are

x̂0 = 0.73139
x̂1 = 0.73609
x̂2 = 0.73765
x̂3 = 0.73847
x̂4 = 0.73880.

Clearly, these iterates are converging much more rapidly than Fixed-point Iteration, as they are not
oscillating around the fixed point, but convergence is still linear.
Finally, we try Steffensen’s Method. We begin with the first three iterates of Fixed-point Itera-
tion,
x_0^(0) = x0 = 0.5,    x_1^(0) = x1 = 0.87758,    x_2^(0) = x2 = 0.63901.

Then, we use the formula from Aitken's ∆² Method to compute

x_0^(1) = x_0^(0) − (x_1^(0) − x_0^(0))² / (x_2^(0) − 2x_1^(0) + x_0^(0)) = 0.73139.

We use this value to restart Fixed-point Iteration and compute two iterates, which are
x_1^(1) = cos(x_0^(1)) = 0.74425,    x_2^(1) = cos(x_1^(1)) = 0.73560.
Repeating this process, we apply the formula from Aitken's ∆² Method to the iterates x_0^(1), x_1^(1) and
x_2^(1) to obtain

x_0^(2) = x_0^(1) − (x_1^(1) − x_0^(1))² / (x_2^(1) − 2x_1^(1) + x_0^(1)) = 0.739076.
Restarting Fixed-point Iteration with x_0^(2) as the initial iterate, we obtain

x_1^(2) = cos(x_0^(2)) = 0.739091,    x_2^(2) = cos(x_1^(2)) = 0.739081.

The most recent iterate x_2^(2) is correct to five decimal places.
Using all three methods to compute the fixed point to ten decimal digits of accuracy, we find that
Fixed-point Iteration requires 57 iterations, so x57 must be computed. Aitken's ∆² Method requires
us to compute 25 iterates of the modified sequence {x̂k}, which in turn requires 27 iterates of the
sequence {xk}, where the first iterate x0 is given. Steffensen's Method requires us to compute x_2^(3),
which means that only 11 iterates need to be computed, 8 of which require a function evaluation. 2
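The three Steffensen cycles of this example can be reproduced with the following sketch, in which each pass performs two fixed-point steps and one Aitken update.

g = @(x) cos(x);
x0 = 0.5;
for j = 1:3
    x1 = g(x0);  x2 = g(x1);                     % two steps of Fixed-point Iteration
    x0 = x0 - (x1 - x0)^2/(x2 - 2*x1 + x0);      % Aitken's formula gives the new x0
    fprintf('%.6f\n', x0)
end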

Exercise 8.5.3 Write a Matlab function [x,niter]=steffensen(g,x0,tol) that


modifies your function fixedpt from Exercise 8.3.5 to accelerate convergence using Stef-
fensen’s Method. Test your function on the equation from Example 8.3.2. How much
more rapidly do the iterates converge?

8.6 Systems of Nonlinear Equations


The techniques that have been presented in this chapter for solving a single nonlinear equation of
the form f (x) = 0, or x = g(x), can be generalized to solve a system of n nonlinear equations in n
unknowns x1 , x2 , . . . , xn , using concepts and techniques from numerical linear algebra.

8.6.1 Fixed-Point Iteration


Previously, we have learned how to use fixed-point iteration to solve a single nonlinear equation of
the form
f (x) = 0
by first transforming the equation into one of the form

x = g(x).

Then, after choosing an initial guess x(0) , we compute a sequence of iterates by

x(k+1) = g(x(k) ), k = 0, 1, 2, . . . ,

that, hopefully, converges to a solution of the original equation.


We have also learned that if the function g is a continuous function that maps an interval I into
itself, then g has a fixed point (also called a stationary point) x∗ in I, which is a point that satisfies
x∗ = g(x∗ ). That is, a solution to f (x) = 0 exists within I. Furthermore, if there is a constant
ρ < 1 such that
|g 0 (x)| < ρ, x ∈ I,
then this fixed point is unique.
It is worth noting that the constant ρ, which can be used to indicate the speed of convergence
of fixed-point iteration, corresponds to the spectral radius ρ(T ) of the iteration matrix T = M −1 N
used in a stationary iterative method of the form

x(k+1) = T x(k) + M −1 b

for solving Ax = b, where A = M − N .


We now generalize fixed-point iteration to the problem of solving a system of n nonlinear
equations in n unknowns,

f1 (x1 , x2 , . . . , xn ) = 0,
f2 (x1 , x2 , . . . , xn ) = 0,
..
.
fn (x1 , x2 , . . . , xn ) = 0.

For simplicity, we express this system of equations in vector form,

F(x) = 0,

where F : D ⊆ Rn → Rn is a vector-valued function of n variables represented by the vector


x = (x1 , x2 , . . . , xn ), and f1 , f2 , . . . , fn are the component functions, or coordinate functions, of F
(see A.1.2).
The notions of limits and continuity carry over naturally to functions of several variables. We now
define fixed-point iteration for solving a system of nonlinear equations

F(x) = 0.

First, we transform this system of equations into an equivalent system of the form

x = G(x).

One approach to doing this is to solve the ith equation in the original system for xi . This is
analogous to the derivation of the Jacobi method for solving systems of linear equations. Next, we
choose an initial guess x(0) . Then, we compute subsequent iterates by

x(k+1) = G(x(k) ), k = 0, 1, 2, . . . .

Exercise 8.6.1 Write a Matlab function x=fixedptsys(G,x0,tol) that performs


fixed-point iteration to solve the equation x = G(x) with an initial guess x0 and absolute
error tolerance tol.
The existence and uniqueness of fixed points of vector-valued functions of several variables
can be described in an analogous manner to how it is described in the single-variable case. The
function G has a fixed point in a domain D ⊆ Rn if G is continuous on D and G maps D into D.
Furthermore, if G has continuous first partial derivatives and there exists a constant ρ < 1 such
that, in some natural matrix norm,

‖JG(x)‖ ≤ ρ,   x ∈ D,

where JG (x) is the Jacobian matrix of first partial derivatives of G evaluated at x, then G has a
unique fixed point x∗ in D, and fixed-point iteration is guaranteed to converge to x∗ for any initial
guess chosen in D. This can be seen by computing a multivariable Taylor expansion of the error
x(k+1) − x∗ around x∗ .
Exercise 8.6.2 Use a multivariable Taylor expansion to prove that if G satisfies the
assumptions in the preceding discussion (that it is continuous, maps D into itself, has
continuous first partial derivatives, and satisfies ‖JG(x)‖ ≤ ρ < 1 for x ∈ D and any
natural matrix norm ‖ · ‖), then G has a unique fixed point x∗ ∈ D and fixed-point
iteration will converge to x∗ for any initial guess x(0) ∈ D.

Exercise 8.6.3 Use the assumptions of Exercise 8.6.2 to obtain a bound for the error
after k iterations, ‖x(k) − x∗‖, in terms of the initial difference ‖x(1) − x(0)‖.
The constant ρ measures the rate of convergence of fixed-point iteration, as the error approxi-
mately decreases by a factor of ρ at each iteration. It is interesting to note that the convergence
of fixed-point iteration for functions of several variables can be accelerated by using an approach
similar to how the Jacobi method for linear systems is modified to obtain the Gauss-Seidel method.
That is, when computing x_i^(k+1) by evaluating gi(x(k)), we replace x_j^(k), for j < i, by x_j^(k+1), since it
has already been computed (assuming all components of x(k+1) are computed in order). Therefore,
as in Gauss-Seidel, we are using the most up-to-date information available when computing each
iterate.

Example 8.6.1 Consider the system of equations

x2 = x1²,
x1² + x2² = 1.

The first equation describes a parabola, while the second describes the unit circle. By graphing both
equations, it can easily be seen that this system has two solutions, one of which lies in the first
quadrant (x1 , x2 > 0).
To solve this system using fixed-point iteration, we solve the second equation for x1 , and obtain
the equivalent system
x1 = √(1 − x2²),
x2 = x1².

If we consider the rectangle

D = {(x1 , x2 ) | 0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1},

we see that the function

G(x1, x2) = ( √(1 − x2²),  x1² )

maps D into itself. Because G is also continuous on D, it follows that G has a fixed point in D.
However, G has the Jacobian matrix

JG(x) = [    0      −x2/√(1 − x2²)
           2x1            0        ],

which cannot satisfy ‖JG‖ < 1 on D. Therefore, we cannot guarantee that fixed-point iteration with
this choice of G will converge, and, in fact, it can be shown that it does not converge. Instead, the
iterates tend to approach the corners of D, at which they remain.
In an attempt to achieve convergence, we note that ∂g2 /∂x1 = 2x1 > 1 near the fixed point.
Therefore, we modify G as follows:

G(x1, x2) = ( √x2,  √(1 − x1²) ).

For this choice of G, JG still has partial derivatives that are greater than 1 in magnitude near the
fixed point. However, there is one crucial distinction: near the fixed point, ρ(JG ) < 1, whereas with
the original choice of G, ρ(JG ) > 1. Attempting fixed-point iteration with the new G, we see that
convergence is actually achieved, although it is slow. 2
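A sketch of the iteration with the modified G follows; the starting guess (0.5, 0.5) and the iteration count are arbitrary choices for illustration.

G = @(x) [sqrt(x(2)); sqrt(1 - x(1)^2)];   % modified G from this example
x = [0.5; 0.5];                            % initial guess in D
for k = 1:50
    x = G(x);                              % fixed-point iteration
end
disp(x')    % slowly approaches the first-quadrant solution, roughly (0.7862, 0.6180)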

It can be seen from this example that the conditions for the existence and uniqueness of a fixed
point are sufficient, but not necessary.

8.6.2 Newton’s Method


Suppose that fixed-point iteration is being used to solve an equation of the form F(x) = 0, where F
is a vector-valued function of n variables, by transforming it into an equation of the form x = G(x).
Furthermore, suppose that G is known to map a domain D ⊆ Rn into itself, so that a fixed point
exists in D. We have learned that the number ρ, where

‖JG(x)‖ ≤ ρ,   x ∈ D,

provides an indication of the rate of convergence, in the sense that as the iterates converge to a
fixed point x∗ , if they converge, the error satisfies

‖x(k+1) − x∗‖ ≤ ρ ‖x(k) − x∗‖.

Furthermore, as the iterates converge, a suitable value for ρ is given by ρ(JG (x∗ )), the spectral
radius of the Jacobian matrix at the fixed point.
Therefore, it makes sense to ask: what if this spectral radius is equal to zero? In that case, if the
first partial derivatives are continuous near x∗ , and the second partial derivatives are continuous and

bounded at x∗ , then fixed-point iteration converges quadratically. That is, there exists a constant
M such that
‖x(k+1) − x∗‖ ≤ M ‖x(k) − x∗‖².

Exercise 8.6.4 Let G : D ⊆ Rn → Rn be a twice continuously differentiable function


that has a fixed point x∗ ∈ D. Furthermore, assume that JG(x∗) = 0 and that the
second partial derivatives of the component functions gi , i = 1, 2, . . . , n of G are bounded
on D. Use a multivariable Taylor expansion to prove that fixed-point iteration applied to
G converges quadratically for any initial guess x(0) ∈ D.
We have previously learned that for a single nonlinear equation f (x) = 0, Newton’s method
generally achieves quadratic convergence. Recall that this method computes iterates by

x(k+1) = x(k) − f(x(k))/f'(x(k)),   k = 0, 1, 2, . . . ,

where x(0) is an initial guess. We now wish to generalize this method to systems of nonlinear
equations.
Consider the fixed-point iteration function

G(x) = x − [JF (x)]−1 F(x). (8.4)

Then, it can be shown by direct differentiation that the Jacobian matrix of this function is equal
to the zero matrix at x = x∗ , a solution to F(x) = 0. If we define
x = (x1, x2, . . . , xn)^T,   F(x) = (f1(x), f2(x), . . . , fn(x))^T,   G(x) = (g1(x), g2(x), . . . , gn(x))^T,

where fi and gi, i = 1, 2, . . . , n, are the coordinate functions of F and G, respectively, then we have

∂gi/∂xk (x) = ∂/∂xk [ xi − Σ_{j=1}^{n} bij(x) fj(x) ]
            = δik − Σ_{j=1}^{n} bij(x) ∂fj/∂xk(x) − Σ_{j=1}^{n} ∂bij/∂xk(x) fj(x),

where bij (x) is the (i, j) element of [JF (x)]−1 .

Exercise 8.6.5 Prove that if G(x) is defined as in (8.4) and F(x∗) = 0, then JG(x∗) = 0.
We see that this choice of fixed-point iteration is a direct generalization of Newton’s method to
systems of equations, in which the division by f 0 (x(k) ) is replaced by multiplication by the inverse
of JF (x(k) ), the total derivative of F(x).
In summary, Newton’s method proceeds as follows: first, we choose an initial guess x(0) . Then,
for k = 0, 1, 2, . . . , we iterate as follows:

yk = −F(x(k) )
sk = [JF (x(k) )]−1 yk
x(k+1) = x(k) + sk

The vector sk is computed by solving the system

JF (x(k) )sk = yk ,

using a method such as Gaussian elimination with back substitution.

Example 8.6.2 Recall the system of equations

x2 − x21 = 0,
x21 + x22 − 1 = 0.

Fixed-point iteration converged rather slowly for this system, if it converged at all. Now, we apply
Newton’s method to this system. We have

F(x1, x2) = [ x2 − x1² ;  x1² + x2² − 1 ],     JF(x1, x2) = [ −2x1  1 ;  2x1  2x2 ].

Using the formula for the inverse of a 2 × 2 matrix, we obtain the iteration

[ x1^(k+1) ; x2^(k+1) ] = [ x1^(k) ; x2^(k) ]
    + (1 / (4 x1^(k) x2^(k) + 2 x1^(k))) [ 2x2^(k)  −1 ;  −2x1^(k)  −2x1^(k) ] [ x2^(k) − (x1^(k))² ;  (x1^(k))² + (x2^(k))² − 1 ].

Implementing this iteration in Matlab, we see that it converges quite rapidly, much more so than
fixed-point iteration. Note that in order for the iteration to not break down, we must have x1^(k) ≠ 0
and x2^(k) ≠ −1/2. 2
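In practice one would not form the inverse Jacobian explicitly; the following sketch instead solves the linear system with Matlab's backslash operator, which is mathematically equivalent to the iteration written out above. The initial guess is an illustrative choice.

F  = @(x) [x(2) - x(1)^2;  x(1)^2 + x(2)^2 - 1];
JF = @(x) [-2*x(1)  1;  2*x(1)  2*x(2)];
x = [0.5; 0.5];                 % initial guess
for k = 1:6
    x = x - JF(x)\F(x);         % Newton step: solve J_F(x) s = -F(x), then x = x + s
end
disp(x')                        % converges rapidly to the first-quadrant solution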

Exercise 8.6.6 Write a Matlab function x=newtonsys(F,JF,x0,tol) that solves the


system of equations F(x) = 0 using Newton’s method, where JF is a function handle that
takes x as input and returns the matrix JF (x). The parameters x0 and tol are the initial
guess and absolute error tolerance, respectively. Test your function on the system from
Example 8.6.2.
One of the drawbacks of Newton’s method for systems of nonlinear equations is the need to
compute the Jacobian matrix during every iteration, and then solve a system of linear equations,
which can be quite expensive. Furthermore, it is not possible to take advantage of information
from one iteration to use in the next, in order to save computational effort. These difficulties will
be addressed shortly.

8.6.3 Broyden’s Method


One of the drawbacks of using Newton’s Method to solve a system of nonlinear equations F(x) = 0
is the computational expense that must be incurred during each iteration to evaluate the partial
derivatives of F at x(k) , and then solve a system of linear equations involving the resulting Jacobian
matrix. The algorithm does not facilitate the re-use of data from previous iterations, and in some
cases evaluation of the partial derivatives can be unnecessarily costly.
An alternative is to modify Newton’s Method so that approximate partial derivatives are used, as
in the Secant Method for a single nonlinear equation, since the slightly slower convergence is offset

by the improved efficiency of each iteration. However, simply replacing the analytical Jacobian
matrix of F with a matrix consisting of finite difference approximations of the partial derivatives
does not do much to reduce the cost of each iteration, because the cost of solving the system of
linear equations is unchanged.
However, because the Jacobian matrix consists of the partial derivatives evaluated at an element
of a convergent sequence, intuitively Jacobian matrices from consecutive iterations are “near” one
another in some sense, which suggests that it should be possible to cheaply update an approximate
Jacobian matrix from iteration to iteration, in such a way that the inverse of the Jacobian matrix
can be updated efficiently as well.
This is the case when a matrix has the form

B = A + uvT ,

where u and v are given vectors. This modification of A to obtain B is called a rank-one update,
since uvT , an outer product, has rank one: every vector in the range of uvT is a scalar multiple
of u. To obtain B −1 from A−1 , we note that if

Ax = u,

then
Bx = (A + uvT )x = (1 + vT x)u,
which yields
B^{-1} u = A^{-1} u / (1 + v^T A^{-1} u).
On the other hand, if x is such that vT A−1 x = 0, then

BA−1 x = (A + uvT )A−1 x = x,

which yields
B −1 x = A−1 x.
This takes us to the following more general problem: given a matrix C, we wish to construct a
matrix D such that the following conditions are satisfied:

• Dw = z, for given vectors w and z

• Dy = Cy, if y is orthogonal to a given vector g.

In our application, C = A^{-1}, D = B^{-1}, w = u, z = [1/(1 + v^T A^{-1} u)] A^{-1} u, and g = A^{-T} v. To
solve this problem, we set

D = C + (z − Cw) g^T / (g^T w).  (8.5)
It can be verified directly that D satisfies the above conditions.

Exercise 8.6.7 Prove that the matrix D defined in (8.5) satisfies Dw = z and Dy = Cy
for gT y = 0.

Applying this definition of D, we obtain

B^{-1} = A^{-1} + { [1/(1 + v^T A^{-1} u)] A^{-1} u − A^{-1} u } v^T A^{-1} / (v^T A^{-1} u)
       = A^{-1} − A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u).  (8.6)
This formula for the inverse of a rank-one update is known as the Sherman-Morrison Formula.
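The Sherman-Morrison formula (8.6) is easy to verify numerically; the following sketch compares it against a direct inverse for random data (the matrix is shifted only to keep the example well conditioned).

n = 5;
A = randn(n) + n*eye(n);                 % random, well-conditioned test matrix
u = randn(n,1);  v = randn(n,1);
B = A + u*v';                            % rank-one update of A
Ainv = inv(A);
Binv = Ainv - (Ainv*u)*(v'*Ainv)/(1 + v'*Ainv*u);   % formula (8.6)
disp(norm(Binv - inv(B)))                % should be on the order of roundoff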

Exercise 8.6.8 Prove the final form of the Sherman-Morrison formula given in (8.6).
We now return to the problem of approximating the Jacobian of F, and efficiently obtaining
its inverse, at each iterate x(k) . We begin with an exact Jacobian, A0 = JF (x(0) ), and use A0 to
compute the first iterate, x(1) , using Newton’s Method. Then, we recall that for the Secant Method,
we use the approximation
f'(x1) ≈ (f(x1) − f(x0))/(x1 − x0).
Generalizing this approach to a system of equations, we seek an approximation A1 to JF(x(1)) that
has these properties:
• A1 (x(1) − x(0) ) = F(x(1) ) − F(x(0) )

• If zT (x(1) − x(0) ) = 0, then A1 z = JF (x(0) )z = A0 z.


It follows from previous discussion that
A1 = A0 + (y1 − A0 s1) s1^T / (s1^T s1),
where
s1 = x(1) − x(0) , y1 = F (x(1) ) − F(x(0) ).
Furthermore, once we have computed A_0^{-1}, we have

A_1^{-1} = A_0^{-1} − [ A_0^{-1} ((y1 − A0 s1)/(s1^T s1)) s1^T A_0^{-1} ] / [ 1 + s1^T A_0^{-1} (y1 − A0 s1)/(s1^T s1) ]
         = A_0^{-1} + (s1 − A_0^{-1} y1) s1^T A_0^{-1} / (s1^T A_0^{-1} y1).

Then, as A1 is an approximation to JF (x(1) ), we can obtain our next iterate x(2) as follows:

A1 s2 = −F(x(1) ), x(2) = x(1) + s2 .

Repeating this process, we obtain the following algorithm, which is known as Broyden’s Method:
Choose x(0)
A0 = JF(x(0))
s1 = −A_0^{-1} F(x(0))
x(1) = x(0) + s1
k = 1
while not converged do

    yk = F(x(k)) − F(x(k−1))
    wk = A_{k−1}^{-1} yk
    c = 1/(sk^T wk)
    A_k^{-1} = A_{k−1}^{-1} + c (sk − wk) sk^T A_{k−1}^{-1}
    s_{k+1} = −A_k^{-1} F(x(k))
    x(k+1) = x(k) + s_{k+1}
    k = k + 1
end

Note that it is not necessary to compute Ak for k ≥ 1; only A_k^{-1} is needed. It follows that no systems
of linear equations need to be solved during an iteration; only matrix-vector multiplications are
required, thus saving an order of magnitude of computational effort during each iteration compared
to Newton's Method.
Exercise 8.6.9 Write a Matlab function x=broyden(F,JF,x0,tol) that solves the
system of equations F(x) = 0 using Broyden’s method, where JF is a function handle
that takes x as input and returns the matrix JF (x). The parameters x0 and tol are the
initial guess and absolute error tolerance, respectively. Test your function on the system
from Example 8.6.2 and compare the efficiency to that of Newton’s method as implemented
in Exercise 8.6.6.
Chapter 9

Initial Value Problems

In this chapter, we begin our exploration of the development of numerical methods for solving differ-
ential equations, which are equations that depend on derivatives of unknown quantities. Differential
equations arise in mathematical models of a wide variety of phenomena, such as propagation of
waves, dissipation of heat energy, population growth, or motion of fluids. Solutions of differen-
tial equations yield valuable insight about such phenomena, and therefore techniques for solving
differential equations are among the most essential methods of applied mathematics.
We now illustrate mathematical models based on differential equations. Newton’s Second Law
states
F = ma = m dv/dt,
where F , m, a and v represent force, mass, acceleration and velocity, respectively. We use this
law to develop a mathematical model for the velocity of a falling object that includes a differential
equation. The forces on the falling object include gravity and air resistance, or drag; to simplify
the discussion, we neglect any other forces.
The force due to gravity is equal to mg, where g is the acceleration due to gravity, and the
drag force is equal to −γv, where γ is the drag coefficient. We use downward orientation, so that
gravity is acting in the positive (downward) direction and drag is acting in the negative (upward)
direction. In summary, we have
F = mg − γv.
Combining with Newton’s Second Law yields the differential equation

m dv/dt = mg − γv
for the velocity v of the falling object.
Another example of a mathematical model is a differential equation for the population p of a
species, which can have the form
dp/dt = rp − d,
where the constant r is the rate of reproduction of the species. In general, r is called a rate constant
or growth rate. The constant d indicates the number of specimens that die per unit of time, perhaps
due to predation or other causes.


A differential equation such as this one does not have a unique solution, as it does not include
enough information. Typically, the differential equation is paired with an initial condition of the
form
y(t0 ) = y0 ,
where t0 represents an initial time and y0 is an initial value. The differential equation, together
with the intial condition, is called an initial value problem. As discussed in the next section,
under certain assumptions, it can be proven that an initial value problem has a unique solution.
This chapter explores the numerical solution of initial value problems. Chapter 10 investigates the
numerical solution of boundary value problems, which are differential equations defined on a spatial
domain, such as a bounded interval [a, b], paired with boundary conditions that ensure a unique
solution.
There are many types of differential equations, and a wide variety of solution techniques, even
for equations of the same type, let alone different types. We now introduce some terminology that
aids in classification of equations and, by extension, selection of solution techniques.
• An ordinary differential equation, or ODE, is an equation that depends on one or more deriva-
tives of functions of a single variable. Differential equations given in the preceding examples
are all ordinary differential equations, and we will consider these equations exclusively in this
book.
• A partial differential equation, or PDE, is an equation that depends on one or more partial
derivatives of functions of several variables. In many cases, PDEs are solved by reducing to
multiple ODEs.

Example 9.0.1 The heat equation


∂u/∂t = k² ∂²u/∂x²,
where k is a constant, is an example of a partial differential equation, as its solution u(x, t)
is a function of two independent variables, and the equation includes partial derivatives with
respect to both variables. 2

• The order of a differential equation is the order of the highest derivative of any unknown
function in the equation.

Example 9.0.2 The differential equation


dy/dt = ay − b,
where a and b are constants, is a first-order differential equation, as only the first derivative
of the solution y(t) appears in the equation. On the other hand, the ODE
y'' + 3y' + 2y = 0
is a second-order differential equation, whereas the PDE known as the beam equation
ut = uxxxx
is a fourth-order differential equation. 2

In this chapter, we limit ourselves to numerical methods for the solution of first-order ODEs. In
Section 9.6, we consider systems of first-order ODEs, which allows these numerical methods to be
applied to higher-order ODEs.

9.1 Existence and Uniqueness of Solutions


Consider the general first-order initial value problem, or IVP, that has the form
y' = f(t, y),   t0 ≤ t ≤ T,  (9.1)
y(t0) = y0.  (9.2)
We would like to have an understanding of when this problem can be solved, and whether any
solution that can be obtained is unique. The following notion of continuity, applied previously in
Section 8.3 to establish convergence criteria for Fixed-Point Iteration, is helpful for this purpose.

Definition 9.1.1 (Lipschitz condition) A function f (t, y) satisfies a Lipschitz con-


dition in y on D ⊂ R2 if

|f (t, y2 ) − f (t, y1 )| ≤ L|y2 − y1 |, (t, y1 ), (t, y2 ) ∈ D, (9.3)

for some constant L > 0, which is called a Lipschitz constant for f .

If, in addition, ∂f /∂y exists on D, we can also conclude that |∂f /∂y| ≤ L on D.
When solving a problem numerically, it is not sufficient to know that a unique solution exists.
As discussed in Section 1.4.4, if a small change in the problem data can cause a substantial change
in the solution, then the problem is ill-conditioned, and a numerical solution is therefore unreliable,
because it could be unduly influenced by roundoff error. The following definition characterizes
problems involving differential equations for which numerical solution is feasible.

Definition 9.1.2 (Well-posed problem) A differential equation of any type, in con-


junction with any other information such as an initial condition, is said to describe a
well-posed problem if it satisfies three conditions, known as Hadamard’s conditions
for well-posedness:

• A solution of the problem exists.


• A solution of the problem is unique.

• The unique solution depends continuously on the problem data

If a problem is not well-posed, then it is said to be ill-posed.


The “problem data” in this definition may include, for example, initial values or coefficients of the
differential equation.
We are now ready to describe a class of initial-value problems that can be solved numerically.

Theorem 9.1.3 (Existence-Uniqueness, Well-Posedness) Let D = [t0 , T ] × R, and


let f (t, y) be continuous on D. If f satisfies a Lipschitz condition on D in y, then the
initial value problem (9.1), (9.2) has a unique solution y(t) on [t0 , T ]. Furthermore, the
problem is well-posed.

This theorem can be proved using Fixed-Point Iteration, in which the Lipschitz condition on f is
used to prove that the iteration converges [7, p. 142-155].

Exercise 9.1.1 Consider the initial value problem

y 0 = 3y + 2t, 0 < t ≤ 1, y(0) = 1.

Show that this problem is well-posed.

9.2 One-Step Methods


Numerical methods for the initial-value problem (9.1), (9.2) can be developed using Taylor series.
We wish to approximate the solution at times tn , n = 1, 2, . . ., where

tn = t0 + nh,

with h being a chosen time step. Computing approximate solution values in this manner is
called time-stepping or time-marching. Taking a Taylor expansion of the exact solution y(t)
at t = tn+1 around the center t = tn , we obtain

y(tn+1) = y(tn) + h y'(tn) + (h²/2) y''(ξ),

where tn ≤ ξ ≤ tn+1.

9.2.1 Euler’s Method


Using the fact that y 0 = f (t, y), we obtain a numerical scheme by truncating the Taylor series after
the second term. The result is a difference equation

yn+1 = yn + hf (tn , yn ),

where each yn , for n = 1, 2, . . ., is an approximation of y(tn ). This method is called Euler’s


method, the simplest example of what is known as a one-step method.
We now need to determine whether this method converges; that is, whether

lim_{h→0} max_{0≤n≤T/h} |y(tn) − yn| = 0.

To that end, we attempt to bound the error at time tn . We begin with a comparison of the difference
equation and the Taylor expansion of the exact solution,

y(tn+1) = y(tn) + h f(tn, y(tn)) + (h²/2) y''(ξ),
yn+1 = yn + h f(tn, yn).

It follows that if we define en = y(tn ) − yn , then

en+1 = en + h[f(tn, y(tn)) − f(tn, yn)] + (h²/2) y''(ξ).

Using the assumption that f satisfies a Lipschitz condition (9.3) in y, we obtain


|en+1| ≤ (1 + hL)|en| + h²M/2,  (9.4)

where

|y''(t)| ≤ M,   t0 ≤ t ≤ T,
and L is the Lipschitz constant for f in y on [t0 , T ] × R.
Applying the relationship (9.4) repeatedly yields
|en| ≤ (1 + hL)^n |e0| + (h²M/2) Σ_{i=0}^{n−1} (1 + hL)^i
     = (h²M/2) [(1 + hL)^n − 1] / [(1 + hL) − 1]          (since e0 = y(t0) − y0 = 0)
     ≤ (h²M/2) [(e^{hL})^n − 1] / (hL)
     = (hM/(2L)) [e^{L(tn−t0)} − 1].
Here, we have used the formula for the partial sum of a geometric series,

1 + r + r² + · · · + r^{n−1} = (r^n − 1)/(r − 1),

as well as the inequality 1 + hL ≤ e^{hL}.
We conclude that for t0 ≤ tn ≤ T,

|y(tn) − yn| ≤ (hM/(2L)) [e^{L(tn−t0)} − 1] ≤ (hM/(2L)) [e^{L(T−t0)} − 1].
That is, as h → 0, the solution obtained using Euler’s method converges to the exact solution, and
the convergence is O(h); that is, first-order.
This convergence analysis, however, assumes exact arithmetic. To properly account for roundoff
error, we note that the approximate solution values ỹn , n = 0, 1, 2, . . ., satisfy the modified difference
equation
ỹn+1 = ỹn + hf (tn , ỹn ) + δn+1 , ỹ0 = y0 + δ0 , (9.5)
where, for n = 0, 1, 2, . . ., |δn| ≤ δ, which is O(u), where u is the machine precision (i.e., unit
roundoff) introduced in Section 1.5.1. Note that even the initial value ỹ0 has an error term, which
arises from representation of y0 in the floating-point system.

Exercise 9.2.1 Repeat the convergence analysis for Euler’s method on (9.5) to obtain
the error bound
|y(tn) − ỹn| ≤ (1/L)(hM/2 + δ/h)[e^{L(tn−t0)} − 1] + δ e^{L(tn−t0)}.

What happens to this error bound as h → 0? What is an optimal choice of h so that the
error bound is minimized?
We conclude our discussion of Euler’s method with an example of how the previous convergence
analyses can be used to select a suitable time step h.

Example 9.2.1 Consider the IVP

y 0 = −y, 0 < t < 10, y(0) = 1.

We know that the exact solution is y(t) = e−t . Euler’s method applied to this problem yields the
difference equation
yn+1 = yn − hyn = (1 − h)yn , y0 = 1.

We wish to select h so that the error at time T = 10 is less than 0.001. To that end, we use the
error bound
|y(tn) − yn| ≤ (hM/(2L)) [e^{L(tn−t0)} − 1],
with M = 1, since y 00 (t) = e−t , which satisfies 0 < y 00 (t) < 1 on [0, 10], and L = 1, since
f (t, y) = −y satisfies |∂f /∂y| = | − 1| ≡ 1. Substituting tn = 10 and t0 = 0 yields

|y(10) − yn| ≤ (h/2)[e^{10} − 1] ≈ 11012.7 h.

Ensuring that the error at this time is less than 10−3 requires choosing h < 9.08 × 10−8 . However,
the bound on the error at t = 10 is quite crude. Applying Euler’s method with this time step yields
a solution whose error at t = 10 is 2 × 10−11 .
Now, suppose that we include roundoff error in our error analysis. The optimal time step is
h = √(2δ/M),

where δ is a bound on the roundoff error during any time step. We use δ = 2u, where u is the
unit roundoff, because each time step performs only two floating-point operations. Even if 1 − h
is computed once, in advance, its error still propagates to the multiplication with yn . In a typical
double-precision floating-point number system, u ≈ 1.1 × 10−16 . It follows that the optimal time
step is
h = √(2δ/M) = √(2(1.1 × 10^{−16})/1) ≈ 1.5 × 10^{−8}.
With this value of h, we find that the error at t = 10 is approximately 3.55 × 10−12 . This is even
more accurate than with the previous choice of time step, which makes sense, because the new value
of h is smaller. 2
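The convergence behavior described above can be observed with a much larger time step than the optimal one (which would require hundreds of millions of steps). The following sketch applies Euler's method to this IVP with the illustrative choice h = 0.01.

h = 0.01;  t = 0;  y = 1;            % y' = -y, y(0) = 1
n = round(10/h);
for k = 1:n
    y = y - h*y;                     % Euler step with f(t,y) = -y
    t = t + h;
end
disp(abs(y - exp(-10)))              % error at t = 10 is O(h)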

9.2.2 Solving IVPs in Matlab


Matlab provides several functions for solving IVPs [35]. To solve an IVP of the form

y 0 = f (t, y), t0 < t ≤ T, y(t0 ) = y0 ,

one can use, for example, the command

>> [t,y]=ode23(f,[ t_0 T ],y_0);



where f is a function handle for f (t, y). The first output t is a column vector consisting of times
t0, t1, . . . , tn = T, where n is the number of time steps. The second output y is a matrix with one
row per entry of t and m columns, where m is the length of y_0; the ith row of y contains the
approximate solution at time t(i).
This is the simplest usage of one of the ODE solvers; additional interfaces are described in the
documentation.
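For example, the IVP from Example 9.2.1 can be solved and compared with its exact solution as follows; the plotting line is optional.

f = @(t,y) -y;                       % right-hand side f(t,y) for y' = -y
[t,y] = ode23(f, [0 10], 1);         % solve on [0, 10] with y(0) = 1
plot(t, y, t, exp(-t), '--')         % computed solution vs. exact solution
disp(max(abs(y - exp(-t))))          % error at the returned time points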

Exercise 9.2.2 (a) Write a Matlab function [T,Y]=eulersmethod(f,tspan,y0,h)


that solves a given IVP of the form (9.1), (9.2) using Euler's method. Assume that
tspan is a vector of the form [ t0 T ] that contains the initial and final times, as
in the typical usage of Matlab ODE solvers. The output T must be a column vector
of time values, and the output Y must be a matrix, each row of which represents the
computed solution at the corresponding time value in the same row of T.

(b) Test your function on the IVP from Example 9.2.1 with h=0.1 and h=0.01, and
compute the error at the final time t using the known exact solution. What happens
to the error as h decreases? Is the behavior what you would expect based on theory?

9.2.3 Runge-Kutta Methods


We have seen that Euler’s method is first-order accurate. We would like to use Taylor series to
design methods that have a higher order of accuracy. First, however, we must get around the fact
that an analysis of the global error, as was carried out for Euler’s method, is quite cumbersome.
Instead, we will design new methods based on the criteria that their local truncation error, the
error committed during a single time step, is higher-order in h.
Using higher-order Taylor series directly to approximate y(tn+1 ) is cumbersome, because it
requires evaluating derivatives of f . Therefore, our approach will be to use evaluations of f at
carefully chosen values of its arguments, t and y, in order to create an approximation that is just
as accurate as a higher-order Taylor series expansion of y(t + h).
To find the right values of t and y at which to evaluate f , we need to take a Taylor expansion of
f evaluated at these (unknown) values, and then match the resulting numerical scheme to a Taylor
series expansion of y(t + h) around t.
We now illustrate our proposed approach in order to obtain a method that is second-order
accurate; that is, its local truncation error is O(h2 ). The proposed method has the form
yn+1 = yn + hf (t + α1 , y + β1 ),
where α1 and β1 are to be determined. To ensure second-order accuracy, we must match the Taylor
expansion of the exact solution,
y(t + h) = y(t) + h f(t, y(t)) + (h²/2) d/dt[f(t, y(t))] + (h³/6) d²/dt²[f(ξ, y(ξ))],
to
y(t + h) = y(t) + hf (t + α1 , y + β1 ),
where t ≤ ξ ≤ t + h. After simplifying by removing terms or factors that already match, we see
that we only need to match
f(t, y) + (h/2) d/dt[f(t, y)] + (h²/6) d²/dt²[f(t, y)]

with
f (t + α1 , y + β1 ),

at least up to and including terms of O(h), so that the local truncation error will be O(h2 ).
Applying the multivariable version of Taylor’s theorem to f (see Theorem A.6.6), we obtain

f(t + α1, y + β1) = f(t, y) + α1 ∂f/∂t(t, y) + β1 ∂f/∂y(t, y) +
                    (α1²/2) ∂²f/∂t²(ξ, µ) + α1 β1 ∂²f/∂t∂y(ξ, µ) + (β1²/2) ∂²f/∂y²(ξ, µ),

where ξ is between t and t + α1 and µ is between y and y + β1 . Meanwhile, computing the full
derivatives with respect to t in the Taylor expansion of the solution yields

f(t, y) + (h/2) ∂f/∂t(t, y) + (h/2) ∂f/∂y(t, y) f(t, y) + O(h²).

Comparing terms yields


α1 = h/2,    β1 = (h/2) f(t, y).
The resulting numerical scheme is
yn+1 = yn + h f(tn + h/2, yn + (h/2) f(tn, yn)).

This scheme is known as the midpoint method, or the explicit midpoint method. Note that
it evaluates f at the midpoint of the intervals [tn , tn+1 ] and [yn , yn+1 ], where the midpoint in y is
approximated using Euler’s method with time step h/2.
The midpoint method is the simplest example of a Runge-Kutta method, which is the name
given to any of a class of time-stepping schemes that are derived by matching multivariable Taylor
series expansions of f (t, y) with terms in a Taylor series expansion of y(t + h). Another often-used
Runge-Kutta method is the modified Euler method

yn+1 = yn + (h/2)[f(tn, yn) + f(tn+1, yn + h f(tn, yn))],  (9.6)

also known as the explicit trapezoidal method, as it resembles the Trapezoidal Rule from
numerical integration. This method is also second-order accurate.

Exercise 9.2.3 Derive the explicit trapezoidal method (9.6) by finding a method of the
form
yn+1 = yn + h[a1 f (tn + α1 , yn + β1 ) + a2 f (tn + α2 , yn + β2 )]
that is second-order accurate.
However, the best-known Runge-Kutta method is the fourth-order Runge-Kutta method,

which uses four evaluations of f during each time step. The method proceeds as follows:

k1 = h f(tn, yn),
k2 = h f(tn + h/2, yn + (1/2) k1),
k3 = h f(tn + h/2, yn + (1/2) k2),
k4 = h f(tn+1, yn + k3),
yn+1 = yn + (1/6)(k1 + 2k2 + 2k3 + k4).  (9.7)
In a sense, this method is similar to Simpson’s Rule from numerical integration, which is also
fourth-order accurate, as values of f at the midpoint in time are given four times as much weight
as values at the endpoints tn and tn+1 .
The values k1 , . . . , k4 are referred to as stages; more precisely, a stage of a Runge-Kutta method
is an evaluation of f (t, y), and the number of stages of a Runge-Kutta method is the number of
evaluations required per time step. We therefore say that (9.7) is a four-stage, fourth-order method,
while the explicit trapezoidal method (9.6) is a two-stage, second-order method. We will see in
Section 9.5 that the number of stages does not always correspond to the order of accuracy.

Example 9.2.2 We compare Euler’s method with the fourth-order Runge-Kutta scheme on the
initial value problem
y 0 = −2ty, 0 < t ≤ 1, y(0) = 1,
which has the exact solution y(t) = e^{−t²}. We use a time step of h = 0.1 for both methods. The
computed solutions, and the exact solution, are shown in Figure 9.1.
It can be seen that the fourth-order Runge-Kutta method is far more accurate than Euler’s
method, which is first-order accurate. In fact, the solution computed using the fourth-order Runge-
Kutta method is visually indistinguishable from the exact solution. At the final time T = 1, the
relative error in the solution computed using Euler’s method is 0.038, while the relative error in the
solution computed using the fourth-order Runge-Kutta method is 4.4 × 10−6 . 2

Exercise 9.2.4 Modify your function eulersmethod from Exercise 9.2.2 to obtain a
new function [T,Y]=rk4(f,tspan,y0,h) that implements the fourth-order Runge-Kutta
method.

9.2.4 Implicit Methods


Suppose that we approximate the equation
y(t_{n+1}) = y(t_n) + \int_{t_n}^{t_{n+1}} y'(s)\, ds

by applying the Trapezoidal Rule to the integral. This yields a one-step method

y_{n+1} = y_n + \frac{h}{2}[f(t_n, y_n) + f(t_{n+1}, y_{n+1})], \qquad (9.8)

Figure 9.1: Solutions of y' = −2ty, y(0) = 1 on [0, 1], computed using Euler's method and the
fourth-order Runge-Kutta method

known as the trapezoidal method.


The trapezoidal method contrasts with Euler's method because it is an implicit method, due
to the evaluation of f (t, y) at yn+1 . It follows that it is generally necessary to solve a nonlinear
equation to obtain yn+1 from yn . This additional computational effort is offset by the fact that
implicit methods are generally more stable than explicit methods such as Euler’s method. Another
example of an implicit method is backward Euler’s method
yn+1 = yn + hf (tn+1 , yn+1 ). (9.9)
Like Euler’s method, backward Euler’s method is first-order accurate.
Exercise 9.2.5 Write a Matlab function [T,Y]=backwardeuler(f,tspan,y0,h) that
implements backward Euler’s method (9.9). Use the secant method to solve for yn+1 at
each time step. For initial guesses, use yn and yn + hf (tn , yn ), the approximation of yn+1
obtained using (forward) Euler’s method.

Exercise 9.2.6 Suppose that fixed-point iteration is used to solve for yn+1 in backward
Euler’s method. What is the function g in the equation yn+1 = g(yn+1 )? Assuming that
g satisfies the condition for a fixed point to exist, how should h be chosen to help ensure
convergence of the fixed-point iteration?

Exercise 9.2.7 Repeat Exercise 9.2.6 for the trapezoidal method (9.8).

9.3 Multistep Methods


All of the numerical methods that we have developed for solving initial value problems are classified
as one-step methods, because they only use information about the solution at time tn to approximate
the solution at time tn+1 . As n increases, that means that there are additional values of the solution,
at previous times, that could be helpful, but are unused.
Multistep methods are time-stepping methods that do use this information. A general mul-
tistep method has the form
\sum_{i=0}^{s} \alpha_i y_{n+1-i} = h\sum_{i=0}^{s} \beta_i f(t_{n+1-i}, y_{n+1-i}),

where s is the number of steps in the method (s = 1 for a one-step method), and h is the time step
size, as before.
By convention, α0 = 1, so that yn+1 can be conveniently expressed in terms of other values.
If β0 = 0, the multistep method is said to be explicit, because then yn+1 can be described using
an explicit formula, whereas if β0 ≠ 0, the method is implicit, because then an equation, generally
nonlinear, must be solved to compute yn+1 .
For a general implicit multistep method, for which β0 ≠ 0, Newton's method can be applied to
the function
F(y) = \alpha_0 y + \sum_{i=1}^{s} \alpha_i y_{n+1-i} - h\beta_0 f(t_{n+1}, y) - h\sum_{i=1}^{s} \beta_i f_{n+1-i}.
The resulting iteration is
y_{n+1}^{(k+1)} = y_{n+1}^{(k)} - \frac{F(y_{n+1}^{(k)})}{F'(y_{n+1}^{(k)})}
            = y_{n+1}^{(k)} - \frac{\alpha_0 y_{n+1}^{(k)} + \sum_{i=1}^{s}\alpha_i y_{n+1-i} - h\beta_0 f(t_{n+1}, y_{n+1}^{(k)}) - h\sum_{i=1}^{s}\beta_i f_{n+1-i}}{\alpha_0 - h\beta_0 f_y(t_{n+1}, y_{n+1}^{(k)})},
with y_{n+1}^{(0)} = y_n. If one does not wish to compute f_y, then the Secant Method can be used instead.
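This iteration is straightforward to code for a specific method. The sketch below is an illustration only: it assumes row vectors alpha and beta holding [α0, ..., αs] and [β0, ..., βs], column vectors yprev and fprev holding y_n, ..., y_{n+1−s} and the corresponding values of f, function handles f and fy for f(t, y) and ∂f/∂y, and user-chosen values tnew, h, tol and maxit.

    % Newton iteration for y_{n+1} in a general implicit multistep method
    c = alpha(2:end)*yprev - h*(beta(2:end)*fprev);  % terms not involving y_{n+1}
    y = yprev(1);                                    % initial guess y_{n+1}^{(0)} = y_n
    for k = 1:maxit
        F  = alpha(1)*y + c - h*beta(1)*f(tnew, y);
        dF = alpha(1) - h*beta(1)*fy(tnew, y);
        ynew = y - F/dF;
        if abs(ynew - y) < tol, y = ynew; break; end
        y = ynew;
    end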

9.3.1 Adams Methods


Adams methods [4] involve the integral form of the ODE,
y(t_{n+1}) = y(t_n) + \int_{t_n}^{t_{n+1}} f(s, y(s))\, ds.

The general idea behind Adams methods is to approximate the above integral using polynomial
interpolation of f at the points tn+1−s , tn+2−s , . . . , tn if the method is explicit, and tn+1 as well if
the method is implicit. In all Adams methods, α0 = 1, α1 = −1, and αi = 0 for i = 2, . . . , s.
Explicit Adams methods are called Adams-Bashforth methods. To derive an Adams-
Bashforth method, we interpolate f at the points tn , tn−1 , . . . , tn−s+1 with a polynomial of degree
s − 1. We then integrate this polynomial exactly. It follows that the constants βi , i = 1, . . . , s,
are the integrals of the corresponding Lagrange polynomials from tn to tn+1 , divided by h, because
there is already a factor of h in the general multistep formula.

Example 9.3.1 We derive the three-step Adams-Bashforth method,

yn+1 = yn + h[β1 f (tn , yn ) + β2 f (tn−1 , yn−1 ) + β3 f (tn−2 , yn−2 )].

The constants βi , i = 1, 2, 3, are obtained by evaluating the integral from tn to tn+1 of a polynomial
p2 (t) that passes through f (tn , yn ), f (tn−1 , yn−1 ), and f (tn−2 , yn−2 ).
Because we can write
p_2(t) = \sum_{i=0}^{2} f(t_{n-i}, y_{n-i})L_i(t),
where Li (t) is the ith Lagrange polynomial for the interpolation points tn , tn−1 and tn−2 , and because
our final method expresses yn+1 as a linear combination of yn and values of f , it follows that the
constants βi , i = 1, 2, 3, are the integrals of the Lagrange polynomials from tn to tn+1 , divided by h.
However, using a change of variable u = (tn+1 − s)/h, we can instead interpolate at the points
u = 1, 2, 3, thus simplifying the integration. If we define p̃2 (u) = p2 (s) = p2 (tn+1 − hu) and
L̃i (u) = Li (tn+1 − hu), then we obtain
\int_{t_n}^{t_{n+1}} f(s, y(s))\, ds = \int_{t_n}^{t_{n+1}} p_2(s)\, ds
    = h\int_0^1 \tilde{p}_2(u)\, du
    = h\int_0^1 \left[f(t_n, y_n)\tilde{L}_0(u) + f(t_{n-1}, y_{n-1})\tilde{L}_1(u) + f(t_{n-2}, y_{n-2})\tilde{L}_2(u)\right] du
    = h\left[f(t_n, y_n)\int_0^1 \tilde{L}_0(u)\, du + f(t_{n-1}, y_{n-1})\int_0^1 \tilde{L}_1(u)\, du + f(t_{n-2}, y_{n-2})\int_0^1 \tilde{L}_2(u)\, du\right]
    = h\left[f(t_n, y_n)\int_0^1 \frac{(u-2)(u-3)}{(1-2)(1-3)}\, du + f(t_{n-1}, y_{n-1})\int_0^1 \frac{(u-1)(u-3)}{(2-1)(2-3)}\, du + f(t_{n-2}, y_{n-2})\int_0^1 \frac{(u-1)(u-2)}{(3-1)(3-2)}\, du\right]
    = h\left[\frac{23}{12}f(t_n, y_n) - \frac{4}{3}f(t_{n-1}, y_{n-1}) + \frac{5}{12}f(t_{n-2}, y_{n-2})\right].
We conclude that the three-step Adams-Bashforth method is
y_{n+1} = y_n + \frac{h}{12}[23f(t_n, y_n) - 16f(t_{n-1}, y_{n-1}) + 5f(t_{n-2}, y_{n-2})]. \qquad (9.10)
This method is third-order accurate. 2
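The coefficients obtained in this derivation are easy to check numerically: integrating the shifted Lagrange polynomials over [0, 1] in Matlab should reproduce 23/12, −4/3 and 5/12. This is a quick sanity check written for this discussion, not part of the derivation itself.

    % Numerical check of the coefficients in (9.10)
    L0 = @(u) (u-2).*(u-3)/((1-2)*(1-3));
    L1 = @(u) (u-1).*(u-3)/((2-1)*(2-3));
    L2 = @(u) (u-1).*(u-2)/((3-1)*(3-2));
    beta = [integral(L0,0,1), integral(L1,0,1), integral(L2,0,1)]
    % beta is approximately [23/12, -4/3, 5/12]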

The same approach can be used to derive an implicit Adams method, which is known as an
Adams-Moulton method. The only difference is that because tn+1 is an interpolation point, after
the change of variable to u, the interpolation points 0, 1, 2, . . . , s are used. Because the resulting
interpolating polynomial is of degree one greater than in the explicit case, the error in an s-step
Adams-Moulton method is O(hs+1 ), as opposed to O(hs ) for an s-step Adams-Bashforth method.

Exercise 9.3.1 Derive the four-step Adams-Bashforth method


y_{n+1} = y_n + \frac{h}{24}[55f(t_n, y_n) - 59f(t_{n-1}, y_{n-1}) + 37f(t_{n-2}, y_{n-2}) - 9f(t_{n-3}, y_{n-3})] \qquad (9.11)
and the three-step Adams-Moulton method
y_{n+1} = y_n + \frac{h}{24}[9f(t_{n+1}, y_{n+1}) + 19f(t_n, y_n) - 5f(t_{n-1}, y_{n-1}) + f(t_{n-2}, y_{n-2})]. \qquad (9.12)
What is the order of accuracy of each of these methods?

9.3.2 Predictor-Corrector Methods


An Adams-Moulton method can be impractical because, being implicit, it requires an iterative
method for solving nonlinear equations, such as fixed-point iteration, and this method must be
applied during every time step. An alternative is to pair an Adams-Bashforth method with an
Adams-Moulton method to obtain an Adams-Moulton predictor-corrector method [25]. Such
a method proceeds as follows:

• Predict: Use the Adams-Bashforth method to compute a first approximation to yn+1 , which
we denote by ỹn+1 .

• Evaluate: Evaluate f at this value, computing f (tn+1 , ỹn+1 ).

• Correct: Use the Adams-Moulton method to compute yn+1 , but instead of solving an equation,
use f (tn+1 , ỹn+1 ) in place of f (tn+1 , yn+1 ) so that the Adams-Moulton method can be used
as if it were an explicit method.

• Evaluate: Evaluate f at the newly computed value of yn+1 , computing f (tn+1 , yn+1 ), to use
during the next time step.

Example 9.3.2 We illustrate the predictor-corrector approach with the two-step Adams-Bashforth
method
y_{n+1} = y_n + \frac{h}{2}[3f(t_n, y_n) - f(t_{n-1}, y_{n-1})]
and the two-step Adams-Moulton method
y_{n+1} = y_n + \frac{h}{12}[5f(t_{n+1}, y_{n+1}) + 8f(t_n, y_n) - f(t_{n-1}, y_{n-1})]. \qquad (9.13)
First, we apply the Adams-Bashforth method, and compute
\tilde{y}_{n+1} = y_n + \frac{h}{2}[3f(t_n, y_n) - f(t_{n-1}, y_{n-1})].
Then, we compute f (tn+1 , ỹn+1 ) and apply the Adams-Moulton method, to compute
y_{n+1} = y_n + \frac{h}{12}[5f(t_{n+1}, \tilde{y}_{n+1}) + 8f(t_n, y_n) - f(t_{n-1}, y_{n-1})].
This new value of yn+1 is used when evaluating f (tn+1 , yn+1 ) during the next time step. 2
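In code, one predict-evaluate-correct-evaluate (PECE) step of this pair is very short. The sketch below is illustrative: fn and fnm1 are assumed to hold f(t_n, y_n) and f(t_{n−1}, y_{n−1}) from the previous steps, and f is a function handle.

    % One PECE step: two-step Adams-Bashforth predictor,
    % two-step Adams-Moulton corrector (9.13)
    ytilde = y + h/2*(3*fn - fnm1);              % predict
    ftilde = f(t + h, ytilde);                   % evaluate
    ynew   = y + h/12*(5*ftilde + 8*fn - fnm1);  % correct
    fnew   = f(t + h, ynew);                     % evaluate, reused next step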

One drawback of multistep methods is that because they rely on values of the solution from
previous time steps, they cannot be used during the first time steps, because not enough values
are available. Therefore, it is necessary to use a one-step method, with at least the same order
of accuracy, to compute enough starting values of the solution to be able to use the multistep
method. For example, to use the three-step Adams-Bashforth method, it is necessary to first use
a one-step method such as the fourth-order Runge-Kutta method to compute y1 and y2 , and then
the Adams-Bashforth method can be used to compute y3 using y2 , y1 and y0 .
Exercise 9.3.2 How many starting values are needed to use an s-step multistep method?

Exercise 9.3.3 Write a Matlab function [T,Y]=adamsbashforth3(f,tspan,y0,h)


that implements the 3-step Adams-Bashforth method (9.10) to solve the given IVP (9.1),
(9.2). Use the fourth-order Runge-Kutta method to generate the necessary starting values. Use
different values of h on a sample IVP to confirm that your method is third-order accurate.

Exercise 9.3.4 Write a Matlab function [T,Y]=predictcorrect(f,tspan,y0,h)


that implements an Adams-Moulton predictor-corrector method using a four-step predic-
tor (9.11) and three-step corrector (9.12). Use the fourth-order Runge-Kutta method to generate
the necessary starting values. Use different values of h on a sample IVP to confirm that
your method is fourth-order accurate.

9.3.3 Backward Differentiation Formulae


Another class of multistep methods, known as backward differentiation formulae (BDF) [36,
p. 349], can be derived using polynomial interpolation as in Adams methods, but for a different
purpose: to approximate the derivative of y at tn+1 . This approximation is then equated to
f (tn+1 , yn+1 ). It follows that all methods based on BDFs are implicit, and they all satisfy β0 = 1,
with βi = 0 for i = 1, 2, . . . , s.
More precisely, a BDF has the form
\sum_{i=0}^{s} \alpha_i y_{n+1-i} = hf(t_{n+1}, y_{n+1}),

where
\alpha_i = L'_{s,i}(t_{n+1}),
with Ls,0 (t), Ls,1 (t), . . . , Ls,s (t) being the Lagrange polynomials for tn+1−s , . . . , tn , tn+1 .
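In practice the derivatives of the Lagrange polynomials can be evaluated numerically. The sketch below is an illustration for this discussion only: it takes s = 2 and h = 1 (so the interpolation points are tn−1 = −2, tn = −1, tn+1 = 0) and uses polyfit, polyder and polyval to compute the derivatives at tn+1 .

    % Coefficients of a 2-step BDF from derivatives of Lagrange polynomials
    t = [-2 -1 0];                         % t_{n-1}, t_n, t_{n+1}, with h = 1
    alpha = zeros(1,3);
    for i = 1:3
        e = zeros(1,3); e(i) = 1;          % data for the ith Lagrange polynomial
        p = polyfit(t, e, 2);              % its coefficients
        alpha(i) = polyval(polyder(p), 0); % derivative at t_{n+1} = 0
    end
    alpha   % approximately [1/2, -2, 3/2], ordered t_{n-1}, t_n, t_{n+1}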

Exercise 9.3.5 Show that a 1-step BDF is simply backward Euler’s method.

Exercise 9.3.6 Derive a 2-step BDF. How do the coefficients αi , i = 0, 1, 2, relate to


those of the 3-point second-order numerical differentiation formula (7.3)?

9.4 Convergence Analysis


We have previously determined that when applying Euler’s method

yn+1 = yn + hf (tn , yn )

to the initial value problem (9.1), (9.2), the error in the computed solution satisfies the error bound

|y(t_n) - y_n| \le \frac{Mh}{2L}\left(e^{L(t_n - t_0)} - 1\right),

where L is the Lipschitz constant for f and M is an upper bound on |y''(t)|. This error bound
indicates that the numerical solution converges to the exact solution as h → 0; that is,
\lim_{h \to 0}\, \max_{0 \le n \le (T - t_0)/h} |y(t_n) - y_n| = 0.

It would be desirable to be able to prove that a numerical method converges without having to
proceed through the same detailed error analysis that was carried out with Euler’s method, since
other methods are more complex and such analysis would require more assumptions to obtain a
similar bound on the error in yn .
To that end, we define two properties that a numerical method should have in order to be
convergent.

Definition 9.4.1

• Consistency: a numerical method for the initial-value problem (9.1) is said to be


consistent if
\lim_{h \to 0}\, \max_{0 \le n \le (T - t_0)/h} |\tau_n(h)| = 0,

where τn (h) is the local truncation error at time tn .

• Stability: a numerical method is said to be stable if there exists a constant α such


that for any two numerical solutions yn and ỹn

|yn+1 − ỹn+1 | ≤ (1 + αh)|yn − ỹn |, 0 ≤ n ≤ (T − t0 )/h.

It follows from this relation that

|yn − ỹn | ≤ eα(T −t0 ) |y0 − ỹ0 |.

Informally, a consistent method converges to the differential equation as h → 0, and the solution
computed using a stable method is not overly sensitive to perturbations in the initial data. While
the difference in solutions is allowed to grow over time, it is “controlled” growth, meaning that the
rate of growth is independent of the step size h.

9.4.1 Consistency

The definition of consistency in Definition 9.4.1 can be cumbersome to apply directly to a given
method. Therefore, we consider one-step and multistep methods separately to obtain simple ap-
proaches for determining whether a given method is consistent, and if so, its order of accuracy.

9.4.1.1 One-Step Methods


We have learned that the numerical solution obtained from Euler’s method,

yn+1 = yn + hf (tn , yn ), tn = t0 + nh,

converges to the exact solution y(t) of the initial value problem

y 0 = f (t, y), y(t0 ) = y0 ,

as h → 0.
We now analyze the convergence of a general one-step method of the form

yn+1 = yn + hΦ(tn , yn , h), (9.14)

for some continuous function Φ(t, y, h). We define the local truncation error of this one-step
method by
\tau_n(h) = \frac{y(t_{n+1}) - y(t_n)}{h} - \Phi(t_n, y(t_n), h).
That is, the local truncation error is the result of substituting the exact solution into the approxi-
mation of the ODE by the numerical method.

Exercise 9.4.1 Find the local truncation error of the modified Euler method (9.6).
As h → 0 and n → ∞, in such a way that t0 + nh = t ∈ [t0 , T ], we obtain

τn (h) → y'(t) − Φ(t, y(t), 0).

It follows from Definition 9.4.1 that the one-step method is consistent if

Φ(t, y, 0) = f (t, y).

Recall that a consistent one-step method is one that converges to the ODE as h → 0.

Example 9.4.2 Consider the midpoint method


 
y_{n+1} = y_n + hf\left(t_n + \frac{h}{2}, y_n + \frac{h}{2}f(t_n, y_n)\right).

To check consistency, we compute


 
\Phi(t, y, 0) = f\left(t + \frac{0}{2}, y + \frac{0}{2}f(t, y)\right) = f(t, y),

so it is consistent. 2

Exercise 9.4.2 Verify that the fourth-order Runge-Kutta method (9.7) is consistent.

9.4.1.2 Multistep Methods


For multistep methods, we must define consistency slightly differently, because we must account for
the fact that a multistep method requires starting values that are computed using another method.
Therefore, we say that a multistep method is consistent if its own local truncation error τn (h)
approaches zero as h → 0, and if the one-step method used to compute its starting values is also
consistent. We also say that an s-step multistep method is stable, or zero-stable, if there exists a
constant K such that for any two sequences of values {yk } and {zk } produced by the method with
step size h from different sets of starting values {y0 , y1 , . . . , ys−1 } and {z0 , z1 , . . . , zs−1 },
|y_n - z_n| \le K \max_{0 \le j \le s-1} |y_j - z_j|,

as h → 0.
To compute the local truncation error of Adams methods, integrate the error in the polynomial
interpolation used to derive the method from tn to tn+1 . For the explicit s-step method, this yields
\tau_n(h) = \frac{1}{h}\int_{t_n}^{t_{n+1}} \frac{f^{(s)}(\xi, y(\xi))}{s!}\,(t - t_n)(t - t_{n-1}) \cdots (t - t_{n-s+1})\, dt.
Using the substitution u = (tn+1 − t)/h, and the Weighted Mean Value Theorem for Integrals,
yields
\tau_n(h) = \frac{1}{h}\,\frac{f^{(s)}(\xi, y(\xi))}{s!}\,h^{s+1}(-1)^s \int_0^1 (u - 1)(u - 2) \cdots (u - s)\, du.
Evaluating the integral yields the constant in the error term. We also use the fact that y 0 =
f (t, y) to replace f (s) (ξ, y(ξ)) with y (s+1) (ξ). Obtaining the local truncation error for an implicit,
Adams-Moulton method can be accomplished in the same way, except that tn+1 is also used as an
interpolation point.
For a general multistep method, we substitute the exact solution into the method, as in one-step
methods, and obtain
\tau_n(h) = \frac{\sum_{j=0}^{s} \alpha_j y(t_{n+1-j}) - h\sum_{j=0}^{s} \beta_j f(t_{n+1-j}, y(t_{n+1-j}))}{h\sum_{j=0}^{s} \beta_j},
where the scaling by h\sum_{j=0}^{s}\beta_j is designed to make this definition of local truncation error consistent
with that of one-step methods.
By replacing each evaluation of y(t) by a Taylor series expansion around tn+1 , we obtain
\tau_n(h) = \frac{1}{h\sum_{j=0}^{s}\beta_j}\left[\sum_{j=0}^{s}\alpha_j\sum_{k=0}^{\infty}\frac{1}{k!}y^{(k)}(t_{n+1})(-jh)^k - h\sum_{j=0}^{s}\beta_j\sum_{k=0}^{\infty}\frac{1}{k!}\frac{d^k}{dt^k}[f(t_{n+1}, y(t_{n+1}))](-jh)^k\right]
    = \frac{1}{h\sum_{j=0}^{s}\beta_j}\sum_{j=0}^{s}\left[\sum_{k=0}^{\infty}(-1)^k\frac{h^k}{k!}\alpha_j y^{(k)}(t_{n+1})j^k + \sum_{k=1}^{\infty}(-1)^k\frac{h^k}{(k-1)!}\beta_j y^{(k)}(t_{n+1})j^{k-1}\right]
    = \frac{1}{h\sum_{j=0}^{s}\beta_j}\left\{y(t_{n+1})\sum_{j=0}^{s}\alpha_j + \sum_{k=1}^{\infty}(-h)^k y^{(k)}(t_{n+1})\left[\frac{1}{k!}\sum_{j=1}^{s}j^k\alpha_j + \frac{1}{(k-1)!}\sum_{j=0}^{s}j^{k-1}\beta_j\right]\right\}
    = \frac{1}{h\sum_{j=0}^{s}\beta_j}\left[y(t_{n+1})C_0 + \sum_{k=1}^{\infty}(-h)^k y^{(k)}(t_{n+1})C_k\right]

where
C_0 = \sum_{j=0}^{s}\alpha_j, \qquad C_k = \frac{1}{k!}\sum_{j=1}^{s}j^k\alpha_j + \frac{1}{(k-1)!}\sum_{j=0}^{s}j^{k-1}\beta_j, \quad k = 1, 2, \ldots.

We find that τn (h) → 0 as h → 0 only if C0 = C1 = 0. Furthermore, the method is of order p if


and only if
C_0 = C_1 = C_2 = \cdots = C_p = 0, \quad C_{p+1} \ne 0. \qquad (9.15)
Finally, we can conclude that the local truncation error for a method of order p is
\tau_n(h) = \frac{1}{\sum_{j=0}^{s}\beta_j}(-h)^p y^{(p+1)}(t_{n+1})C_{p+1} + O(h^{p+1}).
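The conditions (9.15) are easy to check numerically for a given method. The following Matlab sketch, written for this illustration, computes C0 , . . . , C5 from the coefficient vectors of the three-step Adams-Bashforth method (9.10); it relies on the convention 0^0 = 1 used by Matlab so that the j = 0 terms are handled correctly.

    % Order conditions (9.15) for a multistep method
    alpha = [1 -1 0 0];            % alpha_0, ..., alpha_s
    beta  = [0 23 -16 5]/12;       % beta_0, ..., beta_s
    s = length(alpha) - 1;  j = 0:s;
    C = zeros(1,6);
    C(1) = sum(alpha);             % C_0
    for k = 1:5
        C(k+1) = sum(j.^k.*alpha)/factorial(k) + sum(j.^(k-1).*beta)/factorial(k-1);
    end
    C    % C_0 = ... = C_3 = 0 and C_4 is nonzero: third-order accuracy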

Exercise 9.4.3 Use the conditions (9.15) to verify the order of accuracy of the four-step
Adams-Bashforth method (9.11). What is the local truncation error?
Further analysis is required to obtain the local truncation error of a predictor-corrector method
that is obtained by combining two Adams methods. The result of this analysis is the following
theorem, which is proved in [19, p. 387-388].

Theorem 9.4.3 Let the solution of the initial value problem

y 0 = f (t, y), t0 < t ≤ T, y(t0 ) = y0

be approximated by the Adams-Moulton s-step predictor-corrector method with predictor


\tilde{y}_{n+1} = y_n + h\sum_{i=1}^{s}\tilde{\beta}_i f_{n+1-i}

and corrector
y_{n+1} = y_n + h\left[\beta_0 f(t_{n+1}, \tilde{y}_{n+1}) + \sum_{i=1}^{s}\beta_i f_{n+1-i}\right].

Then the local truncation error of the predictor-corrector method is


S_n(h) = \tilde{T}_n(h) + T_n(h)\beta_0\frac{\partial f}{\partial y}(t_{n+1}, y(t_{n+1}) + \xi_{n+1})

where Tn (h) and T̃n (h) are the local truncation errors of the predictor and corrector,
respectively, and ξn+1 is between 0 and hTn (h). Furthermore, there exist constants α and
β such that
|y(t_n) - y_n| \le \left[\max_{0 \le i \le s-1}|y(t_i) - y_i| + \beta S(h)\right]e^{\alpha(t_n - t_0)},

where S(h) = maxs≤n≤(T −t0 )/h |Sn (h)|.


A single time step of a predictor-corrector method, as we have described it, can be viewed as
an instance of Fixed-Point Iteration in which only one iteration is performed, with the initial guess
being the prediction ỹn+1 and the function g(y) being the corrector. If desired, the iteration can
be continued until convergence is achieved.

Exercise 9.4.4 Show that an s-step predictor-corrector method, in which the corrector
is repeatedly applied until yn+1 converges, has local truncation error O(hs+1 ).

9.4.2 Stability
We now specialize the definition of stability from Definition 9.4.1 to one-step and multistep methods,
so that their stability (or lack thereof) can readily be determined.

9.4.2.1 One-Step Methods

From Definition 9.4.1, a one-step method of the form (9.14) is stable if Φ(t, y, h) is Lipschitz
continuous in y. That is,

|Φ(t, u, h) − Φ(t, v, h)| ≤ LΦ |u − v|, t ∈ [t0 , T ], u, v ∈ R, h ∈ [0, h0 ], (9.16)

for some constant LΦ .

Example 9.4.4 Consider the midpoint method


 
y_{n+1} = y_n + hf\left(t_n + \frac{h}{2}, y_n + \frac{h}{2}f(t_n, y_n)\right).

First, we check whether


 
\Phi(t, y, h) = f\left(t + \frac{h}{2}, y + \frac{h}{2}f(t, y)\right)

satisfies a Lipschitz condition in y. We assume that f (t, y) satisfies a Lipschitz condition in y on


[t0 , T ] × (−∞, ∞) with Lipschitz constant L. Then we have
   
|\Phi(t, y, h) - \Phi(t, \tilde{y}, h)| = \left|f\left(t + \frac{h}{2}, y + \frac{h}{2}f(t, y)\right) - f\left(t + \frac{h}{2}, \tilde{y} + \frac{h}{2}f(t, \tilde{y})\right)\right|
    \le L\left|\left(y + \frac{h}{2}f(t, y)\right) - \left(\tilde{y} + \frac{h}{2}f(t, \tilde{y})\right)\right|
    \le L\left(|y - \tilde{y}| + \frac{hL}{2}|y - \tilde{y}|\right)
    \le \left(L + \frac{1}{2}hL^2\right)|y - \tilde{y}|.

It follows that Φ(t, y, h) satisfies a Lipschitz condition on the domain [t0 , T ] × (−∞, ∞) × [0, h0 ] with
Lipschitz constant L̃ = L + \frac{1}{2}h_0 L^2. Therefore the midpoint method is stable. 2

Exercise 9.4.5 Prove that the modified Euler method (9.6) is stable.

9.4.2.2 Multistep Methods


We now examine the stability of a general s-step multistep method of the form
\sum_{i=0}^{s}\alpha_i y_{n+1-i} = h\sum_{i=0}^{s}\beta_i f(t_{n+1-i}, y_{n+1-i}).

If this method is applied to the initial value problem

y' = 0, \qquad y(t_0) = y_0, \qquad y_0 \ne 0,

for which the exact solution is y(t) = y0 , then for the method to be stable, the computed solution
must remain bounded.
It follows that the computed solution satisfies the m-term recurrence relation
\sum_{i=0}^{s}\alpha_i y_{n+1-i} = 0,

which has a solution of the form


y_n = \sum_{i=1}^{s} c_i n^{p_i}\lambda_i^n,
where the ci and pi are constants, and the λi are the roots of the characteristic equation

\alpha_0\lambda^s + \alpha_1\lambda^{s-1} + \cdots + \alpha_{s-1}\lambda + \alpha_s = 0.

When a root λi is distinct, pi = 0. Therefore, to ensure that the solution does not grow exponen-
tially, the method must satisfy the root condition:
• All roots must satisfy |λi | ≤ 1.

• If |λi | = 1 for any i, then it must be a simple root, meaning that its multiplicity is one.
It can be shown that a multistep method is zero-stable if and only if it satisfies the root condition.
Furthermore, λ = 1 is always a root, because in order to be consistent, a multistep method must
have the property that \sum_{i=0}^{s}\alpha_i = 0. If this is the only root that has absolute value 1, then we say
that the method is strongly stable, whereas if there are multiple roots that are distinct from one
another, but have absolute value 1, then the method is said to be weakly stable.
Because all Adams methods have the property that α0 = 1, α1 = −1, and αi = 0 for i =
2, 3, . . . , s, it follows that the roots of the characteristic equation are all zero, except for one root
that is equal to 1. Therefore, all Adams methods are strongly stable. The same is not true for
BDFs; they are zero-unstable for s > 6 [36, p. 349].
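The root condition can be checked numerically by handing the characteristic polynomial to Matlab's roots function. The sketch below is written for this discussion and uses the three-step Adams-Bashforth method, whose characteristic polynomial has coefficients [1, −1, 0, 0].

    % Root condition check for a multistep method
    alpha = [1 -1 0 0];       % alpha_0*lambda^s + ... + alpha_s
    lambda = roots(alpha);
    abs(lambda)               % all magnitudes at most 1, and any root of
                              % magnitude 1 must be simple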

Example 9.4.5 A multistep method that is neither an Adams method, nor a backward differenti-
ation formula, is an implicit 2-step method known as Simpson’s method:
y_{n+1} = y_{n-1} + \frac{h}{3}[f_{n+1} + 4f_n + f_{n-1}].
Although it is only a 2-step method, it is fourth-order accurate, due to the high degree of accuracy
of Simpson’s Rule.

This method is obtained from the relation satisfied by the exact solution,
y(t_{n+1}) = y(t_{n-1}) + \int_{t_{n-1}}^{t_{n+1}} f(t, y(t))\, dt.

Since the integral is over an interval of width 2h, it follows that the coefficients βi obtained by
polynomial interpolation of f must satisfy the condition
\sum_{i=0}^{s}\beta_i = 2,

as opposed to summing to 1 for Adams methods.


For this method, we have s = 2, α0 = 1, α1 = 0 and α2 = −1, which yields the characteristic
polynomial λ2 − 1. This polynomial has two distinct roots, 1 and −1, that both have absolute value
1. It follows that Simpson’s method is only weakly stable. 2

Exercise 9.4.6 Determine whether the 2-step BDF from Exercise 9.3.6 is strongly
stable, weakly stable, or unstable.

9.4.3 Convergence
It can be shown that a consistent and stable one-step method of the form (9.14) is convergent.
Using the same approach and notation as in the convergence proof of Euler’s method, and the fact
that the method is stable, we obtain the following bound for the global error en = y(tn ) − yn :
|e_n| \le \left(\frac{e^{L_\Phi(T - t_0)} - 1}{L_\Phi}\right)\max_{0 \le m \le n-1}|\tau_m(h)|,

where LΦ is the Lipschitz constant for Φ, as in (9.16).


Because the method is consistent, we have

\lim_{h \to 0}\, \max_{0 \le n \le T/h} |\tau_n(h)| = 0.

It follows that as h → 0 and n → ∞ in such a way that t0 + nh = t, we have

\lim_{n \to \infty} |e_n| = 0,

and therefore the method is convergent.


In the case of Euler’s method, we have

\Phi(t, y, h) = f(t, y), \qquad \tau_n(h) = \frac{h}{2}y''(\tau), \quad \tau \in (t_0, T).
Therefore, there exists a constant K such that

|τn (h)| ≤ Kh, 0 < h ≤ h0 ,



for some sufficiently small h0 . We say that Euler’s method is first-order accurate. More generally,
we say that a one-step method has order of accuracy p if, for any sufficiently smooth solution
y(t), there exist constants K and h0 such that

|\tau_n(h)| \le Kh^p, \qquad 0 < h \le h_0 .

Exercise 9.4.7 Prove that the modified Euler method (9.6) is convergent and second-
order accurate.
As for multistep methods, a consistent multistep method is convergent if and only if it is stable.
Because Adams methods are always strongly stable, it follows that all Adams-Moulton predictor-
corrector methods are convergent.

9.4.4 Stiff Differential Equations


To this point, we have evaluated the accuracy of numerical methods for initial-value problems
in terms of the rate at which the error approaches zero, when the step size h approaches zero.
However, this characterization of accuracy is not always informative, because it neglects the fact
that the local truncation error of any one-step or multistep method also depends on higher-order
derivatives of the solution. In some cases, these derivatives can be quite large in magnitude, even
when the solution itself is relatively small, which requires that h be chosen particularly small in
order to achieve even reasonable accuracy.
This leads to the concept of a stiff differential equation. A differential equation of the form
y 0 = f (t, y) is said to be stiff if its exact solution y(t) includes a term that decays exponentially to
zero as t increases, but whose derivatives are much greater in magnitude than the term itself. An
example of such a term is e−ct , where c is a large, positive constant, because its kth derivative is
ck e−ct . Because of the factor of ck , this derivative decays to zero much more slowly than e−ct as t
increases. Because the error includes a term of this form, evaluated at a time less than t, the error
can be quite large if h is not chosen sufficiently small to offset this large derivative. Furthermore,
the larger c is, the smaller h must be to maintain accuracy.

Example 9.4.6 Consider the initial value problem

y' = -100y, \qquad t > 0, \qquad y(0) = 1.

The exact solution is y(t) = e−100t , which rapidly decays to zero as t increases. If we solve this
problem using Euler’s method, with step size h = 0.1, then we have

yn+1 = yn − 100hyn = −9yn ,

which yields the exponentially growing solution yn = (−9)n . On the other hand, if we choose
h = 10−3 , we obtain the computed solution yn = (0.9)n , which is much more accurate, and correctly
captures the qualitative behavior of the exact solution, in that it rapidly decays to zero. 2

The ODE in the preceding example is a special case of the test equation

y' = λy, \qquad y(0) = 1, \qquad Re λ < 0.



The exact solution to this problem is y(t) = eλt . However, as λ increases in magnitude, the problem
becomes increasingly stiff. By applying a numerical method to this problem, we can determine how
small h must be, for a given value of λ, in order to obtain a qualitatively accurate solution.
When applying a one-step method to the test equation, the computed solution has the form
yn+1 = Q(hλ)yn ,
where Q(hλ) is a polynomial in hλ if the method is explicit, and a rational function if it is implicit.
This polynomial is meant to approximate e^{hλ}, since the exact solution satisfies y(tn+1 ) = e^{hλ} y(tn ).
However, to obtain a qualitatively correct solution, that decays to zero as t increases, we must
choose h so that |Q(hλ)| < 1.

Example 9.4.7 Consider the modified Euler method


y_{n+1} = y_n + \frac{h}{2}[f(t_n, y_n) + f(t_n + h, y_n + hf(t_n, y_n))].
Setting f (t, y) = λy yields the computed solution
 
y_{n+1} = y_n + \frac{h}{2}[\lambda y_n + \lambda(y_n + h\lambda y_n)] = \left(1 + h\lambda + \frac{1}{2}h^2\lambda^2\right)y_n,
so Q(hλ) = 1 + hλ + \frac{1}{2}(hλ)^2. If we assume λ is real, then in order to satisfy |Q(hλ)| < 1, we must
have −2 < hλ < 0. It follows that the larger |λ| is, the smaller h must be. 2
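This interval can also be seen numerically. The sketch below, written for this illustration and assuming λ is real, plots |Q(hλ)| for the modified Euler method against hλ; the curve drops below 1 exactly on the interval (−2, 0).

    % Interval of absolute stability for the modified Euler method
    Q = @(z) abs(1 + z + z.^2/2);       % z = h*lambda
    z = linspace(-3, 0, 301);
    plot(z, Q(z), z, ones(size(z)), '--')
    xlabel('h\lambda'), ylabel('|Q(h\lambda)|')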

The test equation can also be used to determine how to choose h for a multistep method. The
process is similar to the one used to determine whether a multistep method is stable, except that
we use f (t, y) = λy, rather than f (t, y) ≡ 0.
Given a general multistep method of the form
\sum_{i=0}^{s}\alpha_i y_{n+1-i} = h\sum_{i=0}^{s}\beta_i f_{n+1-i},

we substitute fn = λyn and obtain the recurrence relation


\sum_{i=0}^{s}(\alpha_i - h\lambda\beta_i)y_{n+1-i} = 0.

It follows that the computed solution has the form


y_n = \sum_{i=1}^{s} c_i n^{p_i}\mu_i^n,

where each µi is a root of the stability polynomial


Q(\mu, h\lambda) = (\alpha_0 - h\lambda\beta_0)\mu^s + (\alpha_1 - h\lambda\beta_1)\mu^{s-1} + \cdots + (\alpha_{s-1} - h\lambda\beta_{s-1})\mu + (\alpha_s - h\lambda\beta_s).
The exponents pi range from 0 to the multiplicity of µi minus one, so if the roots are all
distinct, all pi are equal to zero. In order to ensure that the numerical solution yn decays to zero as
n increases, we must have |µi | < 1 for i = 1, 2, . . . , s. Otherwise, the solution will either converge
to a nonzero value, or grow in magnitude.

Example 9.4.8 Consider the 3-step Adams-Bashforth method


y_{n+1} = y_n + \frac{h}{12}[23f_n - 16f_{n-1} + 5f_{n-2}].
Applying this method to the test equation yields the stability polynomial
 
Q(\mu, h\lambda) = \mu^3 + \left(-1 - \frac{23}{12}h\lambda\right)\mu^2 + \frac{4}{3}h\lambda\,\mu - \frac{5}{12}h\lambda.

Let λ = −100. If we choose h = 0.1, so that λh = −10, then Q(µ, hλ) has a root approximately
equal to −18.884, so h is too large for this method. On the other hand, if we choose h = 0.005,
so that hλ = −1/2, then the largest root of Q(µ, hλ) is approximately −0.924, so h is sufficiently
small to produce a qualitatively correct solution.
Next, we consider the 2-step Adams-Moulton method
y_{n+1} = y_n + \frac{h}{12}[5f_{n+1} + 8f_n - f_{n-1}].
In this case, we have
   
Q(\mu, h\lambda) = \left(1 - \frac{5}{12}h\lambda\right)\mu^2 + \left(-1 - \frac{2}{3}h\lambda\right)\mu + \frac{1}{12}h\lambda.

Setting h = 0.05, so that hλ = −5, the largest root of Q(µ, hλ) turns out to be approximately
−0.906, so a larger step size can safely be chosen for this method. 2
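The root computations in this example are straightforward to reproduce. The sketch below, written for this illustration, forms the stability polynomial of the 3-step Adams-Bashforth method for a chosen value of hλ and computes its roots with Matlab's roots function.

    % Roots of the stability polynomial of the 3-step Adams-Bashforth method
    z = -0.5;                              % h*lambda, e.g. lambda = -100, h = 0.005
    p = [1, -1-23/12*z, 4/3*z, -5/12*z];   % coefficients of Q(mu, h*lambda)
    mu = roots(p);
    max(abs(mu))                           % < 1 for this choice of h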

In general, larger step sizes can be chosen for implicit methods than for explicit methods.
However, the savings achieved from having to take fewer time steps can be offset by the expense
of having to solve a nonlinear equation during every time step.

9.4.4.1 Region of Absolute Stability


The region of absolute stability of a one-step method or a multistep method is the region R of
the complex plane such that if hλ ∈ R, then a solution of the test equation computed using h and
λ will decay to zero, as desired. That is, for a one-step method, |Q(hλ)| < 1 for hλ ∈ R, and for a
multistep method, the roots µ1 , µ2 , . . . , µs of Q(µ, hλ) satisfy |µi | < 1.
Because a larger region of absolute stability allows a larger step size h to be chosen for a given
value of λ, it is preferable to use a method that has as large a region of absolute stability as possible.
The ideal situation is when a method is A-stable, which means that its region of absolute stability
contains the entire left half-plane, because then, the solution will decay to zero regardless of the
choice of h.
An example of an A-stable one-step method is the Backward Euler method

yn+1 = yn + hf (tn+1 , yn+1 ),

an implicit method. For this method,


Q(h\lambda) = \frac{1}{1 - h\lambda},

and since Re λ < 0, it follows that |Q(hλ)| < 1 regardless of the value of h. Another example of an
A-stable method is the implicit trapezoidal method
y_{n+1} = y_n + \frac{h}{2}[f_{n+1} + f_n],
because    
Q(\mu, h\lambda) = \left(1 - \frac{h\lambda}{2}\right)\mu + \left(-1 - \frac{h\lambda}{2}\right),
which has the root
\mu = \frac{1 + \frac{h\lambda}{2}}{1 - \frac{h\lambda}{2}}.
The numerator and denominator have imaginary parts of the same magnitude, but because Re λ <
0, the real part of the denominator has a larger magnitude than that of the numerator, so |µ| < 1,
regardless of h.
Implicit multistep methods, such as the implicit trapezoidal method, are often used for stiff
differential equations because of their larger regions of absolute stability. However, as the next
exercises illustrate, it is important to properly estimate the largest possible value of λ for a given
ODE in order to select an h such that hλ actually lies within the region of absolute stability.
Exercise 9.4.8 Form the stability polynomial for the 2-step Adams-Moulton method
y_{n+1} = y_n + \frac{h}{12}[5f_{n+1} + 8f_n - f_{n-1}]. \qquad (9.17)

Exercise 9.4.9 Suppose the 2-step Adams-Moulton method (9.17) is applied to the IVP

y' = −2y, \qquad y(0) = 1.

How small must h be so that a bounded solution can be ensured?

Exercise 9.4.10 Now, suppose the same Adams-Moulton method is applied to the IVP

y' = −2y + e^{−100t}, \qquad y(0) = 1.

How does the addition of the source term e−100t affect the choice of h?

Exercise 9.4.11 In general, for an ODE of the form y' = f (t, y), how should the value
of λ be determined for the purpose of choosing an h such that hλ lies within the region of
absolute stability?

9.4.4.2 Dahlquist’s Theorems


We conclude our discussion of multistep methods with some important results, due to Germund
Dahlquist, concerning the consistency, zero-stability, and convergence of multistep methods.

Theorem 9.4.9 (Dahlquist’s Equivalence Theorem) A consistent multistep method


with local truncation error O(hp ) is convergent with global error O(hp ) if and only if it is
zero-stable.

This theorem shows that local error provides an indication of global error only for zero-stable
methods. A proof can be found in [15, Theorem 6.3.4].
The second theorem imposes a limit on the order of accuracy of zero-stable methods.

Theorem 9.4.10 (Dahlquist’s Barrier Theorem) The order of accuracy of a zero-


stable s-step method is at most s + 1, if s is odd, or s + 2, if s is even.

For example, because of this theorem, it can be concluded that a 6th-order accurate three-step
method cannot be zero stable, whereas a 4th-order accurate, zero-stable two-step method has the
highest order of accuracy that can be achieved. A proof can be found in [15, Section 4.2].
Finally, we state a result, proved in [11], concerning absolute stability that highlights the trade-
off between explicit and implicit methods.

Theorem 9.4.11 (Dahlquist’s Second Barrier Theorem) No explicit multistep


method is A-stable. Furthermore, no A-stable multistep method can have an order of
accuracy greater than 2. The second-order accurate, A-stable multistep method with the
smallest asymptotic error constant is the trapezoidal method.

In order to obtain A-stable methods with higher-order accuracy, it is necessary to relax the
condition of A-stability. Backward differentiation formulae (BDF), mentioned previously in our
initial discussion of multistep methods, are efficient implicit methods that are high-order accurate
and have a region of absolute stability that includes a large portion of the negative half-plane,
including the entire negative real axis.

Exercise 9.4.12 Find a BDF of order greater than 1 that has a region of absolute sta-
bility that includes the entire negative real axis.

9.5 Adaptive Methods


So far, we have assumed that the time-stepping methods that we have been using for solving
y 0 = f (t, y) on the interval t0 < t < T compute the solution at times t1 , t2 , . . . that are equally
spaced. That is, we define tn+1 − tn = h for some value of h that is fixed over the entire interval
[t0 , T ] on which the problem is being solved. However, in practice, this is ill-advised because

• the chosen time step may be too large to resolve the solution with sufficient accuracy, especially
if it is highly oscillatory, or

• the chosen time step may be too small when the solution is particularly smooth, thus wasting
computational effort required for evaluations of f .

This is reminiscent of the problem of choosing appropriate subintervals when applying composite
quadrature rules to approximate definite integrals. In that case, adaptive quadrature rules were
designed to get around this problem. These methods used estimates of the error in order to
determine whether certain subintervals should be divided. In this section, we seek to develop an
analogous strategy for time-stepping to solve initial value problems.

9.5.1 Error Estimation


The counterpart to this approach for initial value problems would involve estimating the global
error, perhaps measured by max0≤n≤T /h |y(tn ) − yn |, and then, if it is too large, repeating the
time-stepping process with a smaller value of h. However, this is impractical, because it is difficult
to obtain a sharp estimate of global error, and much of the work involved would be wasted due
to overwriting of solution values, unlike with adaptive quadrature, where each subinterval can be
integrated independently.
Instead, we propose to estimate the local truncation error at each time step, and use that
estimate to determine whether h should be varied for the next time step. This approach minimizes
the amount of extra work that is required to implement this kind of adaptive time-stepping, and it
relies on an error estimate that is easy to compute.
We first consider error estimation for one-step methods. This error estimation is accomplished
using a pair of one-step methods,

yn+1 = yn + hΦp (tn , yn , h), (9.18)


ỹn+1 = ỹn + hΦp+1 (tn , ỹn , h), (9.19)

of orders p and p + 1, respectively. Recall that their local truncation errors are
\tau_{n+1}(h) = \frac{1}{h}[y(t_{n+1}) - y(t_n)] - \Phi_p(t_n, y(t_n), h),
\tilde{\tau}_{n+1}(h) = \frac{1}{h}[y(t_{n+1}) - y(t_n)] - \Phi_{p+1}(t_n, y(t_n), h).
We make the assumption that both methods are exact at time tn ; that is, yn = ỹn = y(tn ). It
then follows from (9.18) and (9.19) that
\tau_{n+1}(h) = \frac{1}{h}[y(t_{n+1}) - y_{n+1}], \qquad \tilde{\tau}_{n+1}(h) = \frac{1}{h}[y(t_{n+1}) - \tilde{y}_{n+1}].
Subtracting these equations yields
\tau_{n+1}(h) = \tilde{\tau}_{n+1}(h) + \frac{1}{h}[\tilde{y}_{n+1} - y_{n+1}].
Because τn+1 (h) is O(hp ) while τ̃n+1 (h) is O(hp+1 ), we neglect τ̃n+1 (h) and obtain the simple error
estimate
\tau_{n+1}(h) = \frac{1}{h}(\tilde{y}_{n+1} - y_{n+1}).
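As a simple concrete example of such a pair, Euler's method (order 1) and the modified Euler method (9.6) (order 2) share the evaluation f (tn , yn ), so the error estimate costs only one extra evaluation of f per step. The sketch below is illustrative only; f is a function handle and t, y, h are the current time, value and step size.

    % Local error estimate from an order-1 / order-2 pair
    k1 = f(t, y);
    ylow  = y + h*k1;                       % Euler's method (order 1)
    yhigh = y + h/2*(k1 + f(t + h, ylow));  % modified Euler method (order 2)
    tau = (yhigh - ylow)/h;                 % estimate of the local truncation
                                            % error of the lower-order method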
The approach for multistep methods is similar. We use a pair of Adams methods, consisting of
an s-step Adams-Bashforth (explicit) method,
\sum_{i=0}^{s}\alpha_i y_{n+1-i} = h\sum_{i=1}^{s}\beta_i f_{n+1-i},

and an (s − 1)-step Adams-Moulton (implicit) method,


\sum_{i=0}^{s}\tilde{\alpha}_i\tilde{y}_{n+1-i} = h\sum_{i=0}^{s}\tilde{\beta}_i\tilde{f}_{n+1-i},

where \tilde{f}_i = f(t_i, \tilde{y}_i), so that both are O(h^s)-accurate. We then have
\sum_{i=0}^{s}\alpha_i y(t_{n+1-i}) = h\sum_{i=1}^{s}\beta_i f(t_{n+1-i}, y(t_{n+1-i})) + h\tau_{n+1}(h),
\sum_{i=0}^{s}\tilde{\alpha}_i y(t_{n+1-i}) = h\sum_{i=0}^{s}\tilde{\beta}_i f(t_{n+1-i}, y(t_{n+1-i})) + h\tilde{\tau}_{n+1}(h),

where τn+1 (h) and τ̃n+1 (h) are the local truncation errors of the explicit and implicit methods,
respectively.
As before, we assume that yn+1−s , . . . , yn are exact, which yields

\tau_{n+1}(h) = \frac{1}{h}[y(t_{n+1}) - y_{n+1}], \qquad \tilde{\tau}_{n+1}(h) = \frac{1}{h}[y(t_{n+1}) - \tilde{y}_{n+1}],
as in the case of one-step methods. It follows that

ỹn+1 − yn+1 = h[τn+1 (h) − τ̃n+1 (h)].

The local truncation errors have the form

\tau_{n+1}(h) = Ch^s y^{(s+1)}(\xi_n), \qquad \tilde{\tau}_{n+1}(h) = \tilde{C}h^s y^{(s+1)}(\tilde{\xi}_n),

where ξn , ξ˜n ∈ [tn+1−s , tn+1 ]. We assume that these unknown values are equal, which yields


\tilde{\tau}_{n+1}(h) \approx \frac{\tilde{C}}{h(C - \tilde{C})}[\tilde{y}_{n+1} - y_{n+1}]. \qquad (9.20)

Exercise 9.5.1 Formulate an error estimate of the form (9.20) for the case s = 4;
that is, estimate the error in the 3-step Adams-Moulton method (9.12) using the 4-step
Adams-Bashforth method (9.11). Hint: use the result of Exercise 9.4.3.

9.5.2 Adaptive Time-Stepping


Our goal is to determine how to modify h so that the local truncation error is approximately equal
to a prescribed tolerance ε, and therefore is not too large nor too small.
When using two one-step methods as previously discussed, because τn+1 (h) is the local trun-
cation error of a method that is pth-order accurate, it follows that if we replace h by qh for some
scaling factor q, the error is multiplied by q p . Therefore, we relate the error obtained with step size
qh to our tolerance, and obtain
|\tau_{n+1}(qh)| \approx \frac{q^p}{h}\left|\tilde{y}_{n+1} - y_{n+1}\right| \le \varepsilon.

Solving for q yields
q \le \left(\frac{\varepsilon h}{|\tilde{y}_{n+1} - y_{n+1}|}\right)^{1/p}.

In practice, though, the step size is kept bounded by chosen values hmin and hmax in order to avoid
missing sensitive regions of the solution by using excessively large time steps, as well as expending
too much computational effort on regions where y(t) is oscillatory by using step sizes that are too
small [10].
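In code, this adjustment is typically combined with a safety factor and the bounds hmin and hmax just mentioned. The sketch below is one such heuristic, written for this discussion; epsilon, p, hmin and hmax are user-chosen, and err = |ỹn+1 − yn+1 |.

    % Step size update from the local error estimate of a pth-order method
    q = (epsilon*h/err)^(1/p);
    q = min(max(q, 0.1), 4);                % avoid drastic changes in q
    h = min(max(0.9*q*h, hmin), hmax);      % safety factor 0.9, clamped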
For one-step methods, if the error is deemed small enough so that yn+1 can be accepted, but
ỹn+1 is obtained using a higher-order method, then it makes sense to instead use ỹn+1 as input for
the next time step, since it is ostensibly more accurate, even though the error estimate applies to
yn+1 . Using ỹn+1 instead is called local extrapolation.
The Runge-Kutta-Fehlberg method [14] is an example of an adaptive time-stepping method.
It uses a four-stage, fourth-order Runge-Kutta method and a five-stage, fifth-order Runge-Kutta
method. These two methods share some evaluations of f (t, y), in order to reduce the number of
evaluations of f per time step to six, rather than the nine that would normally be required from a
pairing of fourth- and fifth-order methods. A pair of Runge-Kutta methods that can share stages
in this way is called an embedded pair.
The Bogacki-Shampine method [8], which is used in the Matlab function ode23, is an
embedded pair consisting of a four-stage, second-order Runge-Kutta method and a three-stage,
third-order Runge-Kutta method. As in the Runge-Kutta-Fehlberg metod, evaluations of f (t, y)
are shared, reducing the number of evaluations per time step from seven to four. However, unlike
the Runge-Kutta Felhberg method, its last stage, k4 = f (tn+1 , yn+1 ), is the same as the first stage
of the next time step, k1 = f (tn , yn ), if yn+1 is accepted, as local extrapolation is used. This reduces
the number of new evaluations of f per time step from four to three. A Runge-Kutta method that
shares stages across time-steps in this manner is called a FSAL (First Same as Last) method.

Exercise 9.5.2 Find the definitions of the two Runge-Kutta methods used in the Bogacki-
Shampine method (they can easily be found online). Use these definitions to write a
Matlab function [t,y]=rk23(f,tspan,y0,h) that implements the Bogacki-Shampine
method, using an initial step size specified in the input argument h. How does the perfor-
mance of your method compare to that of ode23?
The Matlab ODE solver ode45 uses the Dormand-Prince method [13], which consists of a
5-stage, 5th-order Runge-Kutta method and a 6-stage, 4th-order Runge-Kutta method. By sharing
stages, the number of evaluations of f (t, y) is reduced from eleven to seven. Like the Bogacki-
Shampine method, the Dormand-Prince method is FSAL, so in fact only six new evaluations per
time step are required.

Exercise 9.5.3 Find the definitions of the two Runge-Kutta methods used in the
Dormand-Prince method (they can easily be found online). Use these definitions to write
a Matlab function [t,y]=rk45(f,tspan,y0,h) that implements the Dormand-Prince
method, using an initial step size specified in the input argument h. How does the perfor-
mance of your method compare to that of ode45?
For multistep methods, we assume as before that an s-step predictor and (s − 1)-step corrector
are used. Recall that the error estimate τn+1 (h) for the corrector is given in (9.20). As with one-
step methods, we relate the error estimate τn+1 (qh) to the error tolerance ε and solve for q, which
yields
q \approx \left(\frac{\varepsilon}{\tau_{n+1}(h)}\right)^{1/s}.

Then, the time step can be adjusted to qh, but as with one-step methods, q is constrained to
avoid drastic changes in the time step. Unlike one-step methods, a change in the time step is
computationally expensive, as it requires the computation of new starting values at equally spaced
times.
Exercise 9.5.4 Implement an adaptive multistep method based on the 4-step Adams
Bashforth method (9.11) and 3-step Adams-Moulton method (9.12). Use the fourth-order
Runge-Kutta method to obtain starting values.

9.6 Higher-Order Equations and Systems of Differential Equations
Numerical methods for solving a single, first-order ODE of the form y' = f (t, y) can also be applied
to more general ODE, including systems of first-order equations, and equations with higher-order
derivatives. We will now learn how to generalize these methods to such problems.

9.6.1 Systems of First-Order Equations


We consider a system of m first-order equations, that has the form

y_1' = f_1(t, y_1, y_2, \ldots, y_m),
y_2' = f_2(t, y_1, y_2, \ldots, y_m),
    \vdots
y_m' = f_m(t, y_1, y_2, \ldots, y_m),

where t0 < t ≤ T , with initial conditions

y_1(t_0) = y_{1,0}, \qquad y_2(t_0) = y_{2,0}, \qquad \cdots \qquad y_m(t_0) = y_{m,0}.

This problem can be written more conveniently in vector form

\mathbf{y}' = \mathbf{f}(t, \mathbf{y}), \qquad \mathbf{y}(t_0) = \mathbf{y}_0,

where y(t) is a vector-valued function with component functions

\mathbf{y}(t) = \langle y_1(t), y_2(t), \ldots, y_m(t)\rangle,

f is a vector-valued function of t and y, with component functions

\mathbf{f}(t, \mathbf{y}) = \langle f_1(t, \mathbf{y}), f_2(t, \mathbf{y}), \ldots, f_m(t, \mathbf{y})\rangle,

and y0 is the vector of initial values,

\mathbf{y}_0 = \langle y_{1,0}, y_{2,0}, \ldots, y_{m,0}\rangle.

This initial-value problem has a unique solution y(t) on [t0 , T ] if f is continuous on the domain D =
[t0 , T ] × (−∞, ∞)m , and satisfies a Lipschitz condition on D in each of the variables y1 , y2 , . . . , ym .

Applying a one-step method of the form


yn+1 = yn + hφ(tn , yn , h)
to a system is straightforward. It simply requires generalizing the function φ(tn , yn , h) to a vector-
valued function that evaluates f (t, y) in the same way as it evaluates f (t, y) in the case of a single
equation, with its arguments obtained from tn and yn in the same way as they are from tn and yn
for a single equation.

Example 9.6.1 Consider the modified Euler method


  
y_{n+1} = y_n + \frac{h}{2}\left[f(t_n, y_n) + f(t_n + h, y_n + hf(t_n, y_n))\right].
To apply this method to a system of m equations of the form \mathbf{y}' = \mathbf{f}(t, \mathbf{y}), we compute
\mathbf{y}_{n+1} = \mathbf{y}_n + \frac{h}{2}\left[\mathbf{f}(t_n, \mathbf{y}_n) + \mathbf{f}(t_n + h, \mathbf{y}_n + h\mathbf{f}(t_n, \mathbf{y}_n))\right],
where yn is an approximation to y(tn ). The vector yn has components
\mathbf{y}_n = \langle y_{1,n}, y_{2,n}, \ldots, y_{m,n}\rangle,
where, for i = 1, 2, . . . , m, yi,n is an approximation to yi (tn ).
We illustrate this method on the system of two equations
y_1' = f_1(t, y_1, y_2) = -2y_1 + 3ty_2, \qquad (9.21)
y_2' = f_2(t, y_1, y_2) = t^2 y_1 - e^{-t}y_2. \qquad (9.22)
First, we rewrite the method in the more convenient form
k_1 = hf(t_n, y_n)
k_2 = hf(t_n + h, y_n + k_1)
y_{n+1} = y_n + \frac{1}{2}[k_1 + k_2].
Then, the modified Euler method, applied to this system of ODE, takes the form
k_{1,1} = hf_1(t_n, y_{1,n}, y_{2,n}) = h[-2y_{1,n} + 3t_n y_{2,n}]
k_{2,1} = hf_2(t_n, y_{1,n}, y_{2,n}) = h[t_n^2 y_{1,n} - e^{-t_n}y_{2,n}]
k_{1,2} = hf_1(t_n + h, y_{1,n} + k_{1,1}, y_{2,n} + k_{2,1}) = h[-2(y_{1,n} + k_{1,1}) + 3(t_n + h)(y_{2,n} + k_{2,1})]
k_{2,2} = hf_2(t_n + h, y_{1,n} + k_{1,1}, y_{2,n} + k_{2,1}) = h[(t_n + h)^2(y_{1,n} + k_{1,1}) - e^{-(t_n + h)}(y_{2,n} + k_{2,1})]
y_{1,n+1} = y_{1,n} + \frac{1}{2}[k_{1,1} + k_{1,2}]
y_{2,n+1} = y_{2,n} + \frac{1}{2}[k_{2,1} + k_{2,2}].

This can be written in vector form as follows:

\mathbf{k}_1 = h\mathbf{f}(t_n, \mathbf{y}_n)
\mathbf{k}_2 = h\mathbf{f}(t_n + h, \mathbf{y}_n + \mathbf{k}_1)
\mathbf{y}_{n+1} = \mathbf{y}_n + \frac{1}{2}[\mathbf{k}_1 + \mathbf{k}_2].
2
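In Matlab, the right-hand side of this system can be written as one function handle that accepts and returns column vectors, after which the vector form of the method is coded exactly as in the scalar case. The sketch below is illustrative only.

    % Right-hand side of the system (9.21), (9.22); y = [y1; y2]
    f = @(t,y) [-2*y(1) + 3*t*y(2);
                t^2*y(1) - exp(-t)*y(2)];
    % One step of the modified Euler method in vector form
    k1 = h*f(t, y);
    k2 = h*f(t + h, y + k1);
    ynew = y + (k1 + k2)/2;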

Exercise 9.6.1 Try your rk4 function from Exercise 9.2.4 on the system (9.21), (9.22)
with initial conditions y1 (0) = 1, y2 (0) = −1. Write your time derivative function
yp=f(t,y) for this system so that the input argument y and the value yp returned by f
are both column vectors, and pass a column vector containing the initial values as the
input argument y0. Do you even need to modify rk4?

Multistep methods generalize in a similar way. A general s-step multistep method for a system of
first-order ODE y0 = f (t, y) has the form

\sum_{i=0}^{s}\alpha_i \mathbf{y}_{n+1-i} = h\sum_{i=0}^{s}\beta_i \mathbf{f}(t_{n+1-i}, \mathbf{y}_{n+1-i}),

where the constants αi and βi , for i = 0, 1, . . . , s, are determined in the same way as in the case of
a single equation.

Example 9.6.2 The explicit 3-step Adams-Bashforth method applied to the system in the previous
example has the form

y_{1,n+1} = y_{1,n} + \frac{h}{12}[23f_{1,n} - 16f_{1,n-1} + 5f_{1,n-2}],
y_{2,n+1} = y_{2,n} + \frac{h}{12}[23f_{2,n} - 16f_{2,n-1} + 5f_{2,n-2}],

where
f_{1,i} = -2y_{1,i} + 3t_i y_{2,i}, \qquad f_{2,i} = t_i^2 y_{1,i} - e^{-t_i}y_{2,i}, \qquad i = 0, \ldots, n.

The order of accuracy for a one-step or multistep method, when applied to a system of equations,
is the same as when it is applied to a single equation. For example, the modified Euler’s method is
second-order accurate for systems, and the 3-step Adams-Bashforth method is third-order accurate.
However, when using adaptive step size control for any of these methods, it is essential that the
step size h is selected so that all components of the solution are sufficiently accurate, or it is likely
that none of them will be.

9.6.2 Higher-Order Equations


The numerical methods we have learned are equally applicable to differential equations of higher
order, that have the form
y^{(m)} = f(t, y, y', y'', \ldots, y^{(m-1)}), \qquad t_0 < t \le T,
because such equations are equivalent to systems of first-order equations, in which new variables
are introduced that correspond to lower-order derivatives of y. Specifically, we define the variables
u_1 = y, \qquad u_2 = y', \qquad u_3 = y'', \qquad \cdots \qquad u_m = y^{(m-1)}.
Then, the above ODE of order m is equivalent to the system of first-order ODEs
u_1' = u_2
u_2' = u_3
    \vdots
u_m' = f(t, u_1, u_2, \ldots, u_m).
The initial conditions of the original higher-order equation,
y(t_0) = y_0^{(0)}, \quad y'(t_0) = y_0^{(1)}, \quad \ldots, \quad y^{(m-1)}(t_0) = y_0^{(m-1)},
are equivalent to the following initial conditions of the first order system
u_1(t_0) = y_0^{(0)}, \quad u_2(t_0) = y_0^{(1)}, \quad \ldots, \quad u_m(t_0) = y_0^{(m-1)}.
We can then apply any one-step or multistep method to this first-order system.

Example 9.6.3 Consider the second-order equation


y'' + 3y' + 2y = \cos t, \qquad y(0) = 2, \quad y'(0) = -1.
By defining u1 = y and u2 = y', we obtain the equivalent first-order system
u_1' = u_2
u_2' = -3u_2 - 2u_1 + \cos t,
with initial conditions
u1 (0) = 2, u2 (0) = −1.
To apply the 4th-order Runge-Kutta method,
\mathbf{k}_1 = h\mathbf{f}(t_n, \mathbf{u}_n)
\mathbf{k}_2 = h\mathbf{f}\left(t_n + \frac{h}{2}, \mathbf{u}_n + \frac{1}{2}\mathbf{k}_1\right)
\mathbf{k}_3 = h\mathbf{f}\left(t_n + \frac{h}{2}, \mathbf{u}_n + \frac{1}{2}\mathbf{k}_2\right)
\mathbf{k}_4 = h\mathbf{f}(t_n + h, \mathbf{u}_n + \mathbf{k}_3)
\mathbf{u}_{n+1} = \mathbf{u}_n + \frac{1}{6}(\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4),

to this system, we compute

k_{1,1} = hf_1(t_n, u_{1,n}, u_{2,n}) = hu_{2,n}
k_{2,1} = hf_2(t_n, u_{1,n}, u_{2,n}) = h[-3u_{2,n} - 2u_{1,n} + \cos t_n]
k_{1,2} = hf_1\left(t_n + \frac{h}{2}, u_{1,n} + \frac{1}{2}k_{1,1}, u_{2,n} + \frac{1}{2}k_{2,1}\right) = h\left(u_{2,n} + \frac{1}{2}k_{2,1}\right)
k_{2,2} = hf_2\left(t_n + \frac{h}{2}, u_{1,n} + \frac{1}{2}k_{1,1}, u_{2,n} + \frac{1}{2}k_{2,1}\right) = h\left[-3\left(u_{2,n} + \frac{1}{2}k_{2,1}\right) - 2\left(u_{1,n} + \frac{1}{2}k_{1,1}\right) + \cos\left(t_n + \frac{h}{2}\right)\right]
k_{1,3} = hf_1\left(t_n + \frac{h}{2}, u_{1,n} + \frac{1}{2}k_{1,2}, u_{2,n} + \frac{1}{2}k_{2,2}\right) = h\left(u_{2,n} + \frac{1}{2}k_{2,2}\right)
k_{2,3} = hf_2\left(t_n + \frac{h}{2}, u_{1,n} + \frac{1}{2}k_{1,2}, u_{2,n} + \frac{1}{2}k_{2,2}\right) = h\left[-3\left(u_{2,n} + \frac{1}{2}k_{2,2}\right) - 2\left(u_{1,n} + \frac{1}{2}k_{1,2}\right) + \cos\left(t_n + \frac{h}{2}\right)\right]
k_{1,4} = hf_1(t_n + h, u_{1,n} + k_{1,3}, u_{2,n} + k_{2,3}) = h(u_{2,n} + k_{2,3})
k_{2,4} = hf_2(t_n + h, u_{1,n} + k_{1,3}, u_{2,n} + k_{2,3}) = h[-3(u_{2,n} + k_{2,3}) - 2(u_{1,n} + k_{1,3}) + \cos(t_n + h)]
u_{1,n+1} = u_{1,n} + \frac{1}{6}(k_{1,1} + 2k_{1,2} + 2k_{1,3} + k_{1,4})
u_{2,n+1} = u_{2,n} + \frac{1}{6}(k_{2,1} + 2k_{2,2} + 2k_{2,3} + k_{2,4}).
2
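The same conversion is easy to express in Matlab: the state vector u = [y; y'] is advanced by any solver for first-order systems. The sketch below is illustrative; the call to rk4 at the end assumes the function from Exercise 9.2.4 accepts vector-valued initial conditions.

    % y'' + 3y' + 2y = cos(t) as a first-order system, with u = [y; y']
    f = @(t,u) [u(2);
                -3*u(2) - 2*u(1) + cos(t)];
    u0 = [2; -1];                 % y(0) = 2, y'(0) = -1
    % e.g. [T,U] = rk4(f, [0 1], u0, 0.1); U(:,1) approximates y(t)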

Exercise 9.6.2 Modify your rk4 function from Exercise 9.2.4 so that it solves a single
ODE of the form
y^{(m)} = f(t, y, y', y'', \ldots, y^{(m-1)})
with initial conditions
y(t_0) = y_0, \quad y'(t_0) = y_0', \quad y''(t_0) = y_0'', \quad \cdots \quad y^{(m-1)}(t_0) = y_0^{(m-1)}.

Assume that the time derivative function, called as ym = f(t, y), treats the input argument y as a row vector
consisting of the values of y, y 0 , y 00 , . . . , y (m−1) at time t, and that f returns a scalar value
ym that represents the value of y (m) . Your function should also assume that the argument
y0 containing the initial values is a row vector. The value of m indicating the order of the
ODE can be automatically inferred from length(y0), rather than passed as a parameter.

Exercise 9.6.3 How would you use your function from the previous exercise to solve a
system of ODEs of order m?
Chapter 10

Two-Point Boundary Value Problems

We now consider the two-point boundary value problem (BVP)

y'' = f(x, y, y'), \qquad a < x < b, \qquad (10.1)

a second-order ODE, with boundary conditions

y(a) = α, y(b) = β. (10.2)

This problem is guaranteed to have a unique solution if the following conditions hold:
• f , fy , and fy′ are continuous on the domain

D = {(x, y, y′) | a ≤ x ≤ b, −∞ < y < ∞, −∞ < y′ < ∞}.

• fy > 0 on D.

• fy′ is bounded on D.
In this chapter, we will introduce several methods for solving this kind of problem, most of which
can be generalized to partial differential equations (PDEs) on higher-dimensional domains. A
comprehensive treatment of numerical methods for two-point BVPs can be found in [20].

10.1 The Shooting Method


There are several approaches to solving this type of problem. The first method that we will examine
is called the shooting method. It treats the two-point boundary value problem as an initial value
problem (IVP), in which x plays the role of the time variable, with a being the “initial time” and
b being the “final time”. Specifically, the shooting method solves the initial value problem

y 00 = f (x, y, y 0 ), a < x < b,

with initial conditions


y(a) = α, \qquad y'(a) = t,
where t must be chosen so that the solution satisfies the remaining boundary condition, y(b) = β.
Since t, being the first derivative of y(x) at x = a, is the “initial slope” of the solution, this


approach requires selecting the proper slope, or “trajectory”, so that the solution will “hit the
target” of y(x) = β at x = b. This viewpoint indicates how the shooting method earned its name.
Note that since the ODE associated with the IVP is of second-order, it must normally be rewritten
as a system of first-order equations before it can be solved by standard numerical methods such as
Runge-Kutta or multistep methods.

10.1.1 Linear Problems


In the case where y'' = f (x, y, y') is a linear ODE of the form

y'' = p(x)y' + q(x)y + r(x), \qquad a < x < b, \qquad (10.3)

selecting the slope t is relatively simple. Let y1 (x) be the solution of the IVP

y'' = p(x)y' + q(x)y + r(x), \quad a < x < b, \quad y(a) = α, \quad y'(a) = 0, \qquad (10.4)

and let y2 (x) be the solution of the IVP

y'' = p(x)y' + q(x)y, \quad a < x < b, \quad y(a) = 0, \quad y'(a) = 1. \qquad (10.5)

Then, the solution of the original BVP has the form

y(x) = y1 (x) + ty2 (x), (10.6)

where t is the correct slope, since any linear combination of solutions of the ODE also satisfies the
ODE, and the initial values are linearly combined in the same manner as the solutions themselves.

Exercise 10.1.1 Assume y2 (b) ≠ 0. Find the value of t in (10.6) such that the boundary
conditions (10.2) are satisfied.

Exercise 10.1.2 Explain why the condition y2 (b) ≠ 0 is guaranteed to be satisfied, due to the
previously stated assumptions about f (x, y, y') that guarantee the existence and uniqueness
of the solution.

Exercise 10.1.3 Write a Matlab function y=shootlinear(p,q,r,a,b,alpha,beta,n)


that solves the linear BVP (10.3), (10.2) using the shooting method. Use the fourth-order
Runge-Kutta method to solve the IVPs (10.4), (10.5). Hint: consult Section 9.6 on
solving second-order ODEs. The input arguments p, q, and r are function handles for
the coefficients p(x), q(x) and r(x), respectively, of (10.3). The input arguments a, b,
alpha and beta specify the boundary conditions (10.2), and n refers to the number of
interior grid points; that is, a time step of h = (b − a)/(n + 1) is to be used. The output
y is a vector consisting of n + 2 values, including both the boundary and interior values
of the approximation of the solution y(x) on [a, b]. Test your function on the BVP from
Example 10.2.1.

10.1.2 Nonlinear Problems


If the ODE is nonlinear, however, then t satisfies a nonlinear equation of the form

y(b, t) − β = 0,

where y(b, t) is the value of the solution, at x = b, of the IVP specified by the shooting method, with
initial slope t. This nonlinear equation can be solved using an iterative method such as the bisection
method, fixed-point iteration, Newton’s Method, or the Secant Method. The only difference is that
each evaluation of the function y(b, t), at a new value of t, is relatively expensive, since it requires
the solution of an IVP over the interval [a, b], for which y 0 (a) = t. The value of that solution at
x = b is taken to be the value of y(b, t).
If Newton’s Method is used, then an additional complication arises, because it requires the
derivative of y(b, t), with respect to t, during each iteration. This can be computed using the fact
that z(x, t) = ∂y(x, t)/∂t satisfies the ODE

z'' = f_y z + f_{y'} z', \quad a < x < b, \quad z(a, t) = 0, \quad z'(a, t) = 1,

which can be obtained by differentiating the original BVP and its boundary conditions with respect
to t. Therefore, each iteration of Newton’s Method requires two IVPs to be solved, but this extra
effort can be offset by the rapid convergence of Newton’s Method.
Suppose that Euler’s method,

y_{i+1} = y_i + hf(x_i, y_i),

for the IVP y' = f (x, y), y(x0 ) = y0 , is to be used to solve any IVPs arising from the Shooting
Method in conjunction with Newton’s Method. Because each IVP, for y(x, t) and z(x, t), is of
second order, we must rewrite each one as a first-order system. We first define

y 1 = y, y2 = y0, z 1 = z, z2 = z0.

We then have the systems

∂y^1/∂x = y^2,
∂y^2/∂x = f(x, y^1, y^2),
∂z^1/∂x = z^2,
∂z^2/∂x = f_y(x, y^1, y^2) z^1 + f_{y'}(x, y^1, y^2) z^2,

with initial conditions

y^1(a) = α,  y^2(a) = t,  z^1(a) = 0,  z^2(a) = 1.

The algorithm then proceeds as follows:

Choose t^(0)
Choose h such that b − a = hN, where N is the number of steps
for k = 0, 1, 2, . . . until convergence do
    y_0^1 = α, y_0^2 = t^(k), z_0^1 = 0, z_0^2 = 1
    for i = 0, 1, 2, . . . , N − 1 do
        x_i = a + ih
        y_{i+1}^1 = y_i^1 + h y_i^2
        y_{i+1}^2 = y_i^2 + h f(x_i, y_i^1, y_i^2)
        z_{i+1}^1 = z_i^1 + h z_i^2
        z_{i+1}^2 = z_i^2 + h [f_y(x_i, y_i^1, y_i^2) z_i^1 + f_{y'}(x_i, y_i^1, y_i^2) z_i^2]
    end
    t^(k+1) = t^(k) − (y_N^1 − β)/z_N^1
end
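A minimal Matlab sketch of this loop follows, assuming f, fy and fyp are function handles for f, f_y and f_{y'}; these names, and the fixed iteration count used in place of a convergence test, are illustrative choices rather than part of the algorithm as stated.

% hedged sketch of the shooting method with Newton's Method and Euler's method
h = (b - a)/N;
t = 0;                          % placeholder initial slope (see Exercise 10.1.4)
for k = 1:20                    % fixed number of Newton iterations for simplicity
    y1 = alpha;  y2 = t;  z1 = 0;  z2 = 1;
    for i = 0:N-1
        x = a + i*h;
        y1new = y1 + h*y2;
        y2new = y2 + h*f(x, y1, y2);
        z1new = z1 + h*z2;
        z2new = z2 + h*(fy(x, y1, y2)*z1 + fyp(x, y1, y2)*z2);
        y1 = y1new;  y2 = y2new;  z1 = z1new;  z2 = z2new;
    end
    t = t - (y1 - beta)/z1;     % Newton update for the slope
end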

Exercise 10.1.4 What would be a logical choice of initial guess for the slope t(0) , that
would not require any information about the function f (x, y, y 0 )?

Exercise 10.1.5 Implement the above algorithm to solve the BVP from Example 10.2.2.
Changing the implementation to use a different IVP solver, such as a Runge-Kutta method or
multistep method, in place of Euler’s method only changes the inner loop.
Exercise 10.1.6 Modify your code from Exercise 10.1.5 to use the fourth-order Runge-
Kutta method in place of Euler’s method. How does this affect the convergence of the
Newton iteration?

Exercise 10.1.7 Modify your code from Exercise 10.1.5 to use the Secant Method instead
of Newton’s Method. How can the function f (x, y, y 0 ) from the ODE be used to obtain a
logical second initial guess t(1) ? Hint: consider a solution that is a parabola. How is the
efficiency of the iteration affected by the change to the Secant Method?

Exercise 10.1.8 Write a Matlab function SHOOTBVP(f,a,b,alpha,beta,N) that


solves the general nonlinear BVP (10.1), (10.2) using the shooting method in conjunction
with the Secant Method. The input argument f is a function handle for the function f.
Test your function on the BVP from Exercise 10.1.5. What happens to the performance
as N, the number of time steps, increases?

10.2 Finite Difference Methods


The shooting method for a two-point boundary value problem of the form

y'' = f(x, y, y'),  a < x < b,  y(a) = α,  y(b) = β,

while taking advantage of effective methods for initial value problems, cannot readily be generalized to boundary value problems in higher spatial dimensions. We therefore consider an alternative approach, in which the first and second derivatives of the solution y(x) are approximated by finite differences.
We discretize the problem by dividing the interval [a, b] into N + 1 subintervals of equal width h = (b − a)/(N + 1). Each subinterval is of the form [x_{i−1}, x_i], for i = 1, 2, . . . , N + 1, where x_i = a + ih. We denote by y_i an approximation of the solution at x_i; that is, y_i ≈ y(x_i). Then, assuming y(x) is at least four times continuously differentiable, we approximate y' and y'' at each x_i, i = 1, 2, . . . , N, by the finite differences

y'(x_i) = [y(x_{i+1}) − y(x_{i−1})]/(2h) − (h^2/6) y'''(η_i),

y''(x_i) = [y(x_{i+1}) − 2y(x_i) + y(x_{i−1})]/h^2 − (h^2/12) y^(4)(ξ_i),

where η_i and ξ_i lie in [x_{i−1}, x_{i+1}].
If we substitute these finite differences into the boundary value problem, and apply the boundary conditions to impose

y_0 = α,  y_{N+1} = β,

then we obtain a system of equations

[y_{i+1} − 2y_i + y_{i−1}]/h^2 = f(x_i, y_i, [y_{i+1} − y_{i−1}]/(2h)),  i = 1, 2, . . . , N,

for the values of the solution at each x_i, in which the local truncation error is O(h^2).

10.2.1 Linear Problems


We first consider the case in which the boundary value problem includes a linear ODE of the form

y'' = p(x)y' + q(x)y + r(x).    (10.7)

Then, the above system of equations is also linear, and can therefore be expressed in matrix-vector form

Ay = r,

where A is a tridiagonal matrix, since the approximations of y' and y'' at x_i only use y_{i−1}, y_i and y_{i+1}, and r is a vector that includes the values of r(x) at the grid points, as well as additional terms that account for the boundary conditions.
Specifically,

a_{ii} = 2 + h^2 q(x_i),  i = 1, 2, . . . , N,
a_{i,i+1} = −1 + (h/2) p(x_i),  i = 1, 2, . . . , N − 1,
a_{i+1,i} = −1 − (h/2) p(x_{i+1}),  i = 1, 2, . . . , N − 1,
r_1 = −h^2 r(x_1) + [1 + (h/2) p(x_1)] α,
r_i = −h^2 r(x_i),  i = 2, 3, . . . , N − 1,
r_N = −h^2 r(x_N) + [1 − (h/2) p(x_N)] β.

This system of equations is guaranteed to have a unique solution if A is diagonally dominant, which is the case if q(x) ≥ 0 and h < 2/L, where L is an upper bound on |p(x)|.
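For example, one way to assemble and solve this system in Matlab is sketched below, assuming p, q and r are function handles (vectorized, as in the script of Example 10.2.1) and a, b, alpha, beta and N are already defined; the name rhs is used for the right-hand side vector to avoid clashing with the handle r.

% hedged sketch: assemble and solve the tridiagonal system for (10.7)
h = (b - a)/(N + 1);
x = a + h*(1:N)';                           % interior grid points
A = diag(2 + h^2*q(x)) ...
    + diag(-1 + (h/2)*p(x(1:N-1)), 1) ...
    + diag(-1 - (h/2)*p(x(2:N)), -1);
rhs = -h^2*r(x);
rhs(1) = rhs(1) + (1 + (h/2)*p(x(1)))*alpha;
rhs(N) = rhs(N) + (1 - (h/2)*p(x(N)))*beta;
y = A\rhs;                                   % interior values y_1, ..., y_N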

Example 10.2.1 We solve the BVP

y'' = 2y' − y + x e^x − x,  0 < x < 2,  y(0) = 0,  y(2) = −4.    (10.8)

The following script uses the function FDBVP (see Exercise 10.2.1) to solve this problem with N = 10
interior grid points, and then visualize the exact and approximate solutions, as well as the error.

% coefficients
p=@(x)(2*ones(size(x)));
q=@(x)(-ones(size(x)));
r=@(x)(x.*exp(x)-x);
% boundary conditions
a=0;
b=2;
alpha=0;
beta=-4;
% number of interior grid points
N=10;
% solve using finite differences
[x,y]=FDBVP(p,q,r,a,b,alpha,beta,N);
% exact solution: y = x^3 e^x/6 - 5xe^x/3 + 2e^x - x - 2
yexact=x.^3.*exp(x)/6-5*x.*exp(x)/3+2*exp(x)-x-2;
% plot exact and approximate solutions for comparison
subplot(121)
plot(x,yexact,'b-')
hold on
plot(x,y,'r--o')
hold off
xlabel('x')
ylabel('y')
subplot(122)
plot(x,abs(yexact-y))
xlabel('x')
ylabel('error')

The plots are shown in Figure 10.1. 2

Exercise 10.2.1 Write a Matlab function FDBVP(p,q,r,a,b,alpha,beta,N) that


solves the linear BVP (10.7), (10.2). The input arguments p, q and r must be func-
tion handles that represent the functions p(x), q(x) and r(x), respectively. Test your
function on the BVP from Example 10.2.1 for different values of N . How does the error
behave as N increases? Specifically, if the error is O(hp ), then what is the value of p?
Does this value match the theoretical expectation? Hint: use the diag function to set up
the matrix A.

Exercise 10.2.2 After evaluating the coefficients p(x), q(x) and r(x) from (10.7) at
the grid points xi , i = 1, 2, . . . , N , how many floating-point operations are necessary to
solve the system Ay = r? If the boundary conditions are changed but the ODE remains
the same, how many additional floating-point operations are needed? Hint: review the
material in Chapter 2 on the solution of banded systems.

Figure 10.1: Left plot: exact (solid curve) and approximate (dashed curve with circles) solutions of
the BVP (10.8) computed using finite differences. Right plot: error in the approximate solution.

10.2.2 Nonlinear Problems

If the ODE is nonlinear, then we must solve a system of nonlinear equations of the form

F(y) = 0,

where F(y) is a vector-valued function with coordinate functions F_i(y), for i = 1, 2, . . . , N. These coordinate functions are defined as follows:

F_1(y) = y_2 − 2y_1 + α − h^2 f(x_1, y_1, (y_2 − α)/(2h)),

F_2(y) = y_3 − 2y_2 + y_1 − h^2 f(x_2, y_2, (y_3 − y_1)/(2h)),

    ⋮                                                              (10.9)

F_{N−1}(y) = y_N − 2y_{N−1} + y_{N−2} − h^2 f(x_{N−1}, y_{N−1}, (y_N − y_{N−2})/(2h)),

F_N(y) = β − 2y_N + y_{N−1} − h^2 f(x_N, y_N, (β − y_{N−1})/(2h)).

This system of equations can be solved approximately using an iterative method such as Fixed-point
Iteration, Newton’s Method, or the Secant Method.
For example, if Newton's Method is used, then, by the Chain Rule, the entries of the Jacobian matrix J_F(y), a tridiagonal matrix, are defined as follows:

J_F(y)_{ii} = ∂F_i/∂y_i (y) = −2 − h^2 f_y(x_i, y_i, (y_{i+1} − y_{i−1})/(2h)),  i = 1, 2, . . . , N,

J_F(y)_{i,i+1} = ∂F_i/∂y_{i+1} (y) = 1 − (h/2) f_{y'}(x_i, y_i, (y_{i+1} − y_{i−1})/(2h)),  i = 1, 2, . . . , N − 1,    (10.10)

J_F(y)_{i,i−1} = ∂F_i/∂y_{i−1} (y) = 1 + (h/2) f_{y'}(x_i, y_i, (y_{i+1} − y_{i−1})/(2h)),  i = 2, 3, . . . , N,

where, for convenience, we use y0 = α and yN +1 = β. Then, during each iteration of Newton’s
Method, the system of equations

JF (y(k) )sk+1 = −F(y(k) )

is solved in order to obtain the next iterate

y(k+1) = y(k) + sk+1

from the previous iterate. An appropriate initial guess is the unique linear function that satisfies the boundary conditions,

y^(0) = α + [(β − α)/(b − a)] (x − a),

where x is the vector with coordinates x_1, x_2, . . . , x_N.
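A minimal Matlab sketch of this Newton iteration follows, assuming helper functions newtF and newtJ that evaluate F(y) and J_F(y) (see Example 10.2.2 and Exercise 10.2.4); the tolerance and iteration limit are arbitrary choices.

% hedged sketch of Newton's Method for the nonlinear finite difference system
h = (b - a)/(N + 1);
x = a + h*(0:N+1)';                          % all grid points, including boundaries
y = alpha + (beta - alpha)/(b - a)*(x - a);  % linear initial guess y^(0)
for k = 1:20
    F = newtF(x, y, f, h);
    J = newtJ(x, y, fy, fyp, h);
    s = -J\F;
    y(2:end-1) = y(2:end-1) + s;             % update interior values only
    if norm(s, inf) < 1e-10, break, end
end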

Exercise 10.2.3 Derive the formulas (10.10) from (10.9).

Example 10.2.2 We solve the BVP


y'' = y^3 − y y',  1 < x < 2,  y(1) = 1/2,  y(2) = 1/3.    (10.11)
This BVP has the exact solution y(x) = 1/(1 + x). To solve this problem using finite differences,
we apply Newton's Method to solve the equation F(y) = 0, where

F_1(y) = y_2 − 2y_1 + 1/2 − h^2 [y_1^3 − y_1 (y_2 − 1/2)/(2h)],

F_i(y) = y_{i+1} − 2y_i + y_{i−1} − h^2 [y_i^3 − y_i (y_{i+1} − y_{i−1})/(2h)],  i = 2, 3, . . . , N − 1,

F_N(y) = 1/3 − 2y_N + y_{N−1} − h^2 [y_N^3 − y_N (1/3 − y_{N−1})/(2h)].

The following Matlab function can be used to evaluate F(y) for a general BVP of the form (10.1).
Its arguments are assumed to be vectors of x- and y-values, including boundary values, along with
a function handle for the right-hand side f (x, y, y 0 ) of the ODE (10.1) and the spacing h.

% newtF: evaluates F(y) for solving ODE


% y’’ = f(x,y,y’) with Newton’s Method
function F=newtF(x,y,f,h)
% use only interior x-values
xi=x(2:end-1);
% y_i
yi=y(2:end-1);
% y_{i+1}
yip1=y(3:end);
% y_{i-1}
yim1=y(1:end-2);
% centered difference approximation of y’:
% (y_{i+1} - y_{i-1})/(2h)
ypi=(yip1-yim1)/(2*h);
% evaluate F(y)
F=yip1-2*yi+yim1-h^2*f(xi,yi,ypi);
Using f_y = 3y^2 − y' and f_{y'} = −y, we obtain

J_F(y)_{ii} = −2 − h^2 [3y_i^2 − (y_{i+1} − y_{i−1})/(2h)],  i = 1, 2, . . . , N,

J_F(y)_{i,i+1} = 1 + (h/2) y_i,  i = 1, 2, . . . , N − 1,

J_F(y)_{i,i−1} = 1 − (h/2) y_i,  i = 2, 3, . . . , N.
A Matlab function similar to newtF can be written to construct JF (y) for a general ODE of the
form (10.1). This is left to Exercise 10.2.4.
The following script sets up this BVP, calls FDNLBVP (see Exercise 10.2.5) to compute an approximate solution, and then visualizes the approximate solution and exact solution.
% set up BVP y’’ = f(x,y,y’)
f=@(x,y,yp)(y.^3-y.*yp);
fy=@(x,y,yp)(3*y.^2-yp);
fyp=@(x,y,yp)(-y);
% boundary conditions: y(a)=alpha, y(b)=beta
a=1;
b=2;
alpha=1/2;
beta=1/3;
% N: number of interior nodes
N=10;
% use Newton’s method
[x,y]=FDNLBVP(f,fy,fyp,a,b,alpha,beta,N);
% compare to exact solution
yexact=1./(x+1);
plot(x,yexact,'b-')
hold on
plot(x,y,'r--o')
hold off
xlabel('x')
ylabel('y')
Using an absolute error tolerance of 10−8 , Newton’s Method converges in just three iterations, and
does so quadratically. The resulting plot is shown in Figure 10.2. 2

Figure 10.2: Exact (solid curve) and approximate (dashed curve with circles) solutions of the BVP
(10.11) from Example 10.2.2.

Exercise 10.2.4 Write a Matlab function newtJ(x,y,fy,fyp,h) that uses (10.10) to


construct the Jacobian matrix JF (y) for use with Newton’s Method. The input arguments
x and y contain x- and y-values, respectively, including boundary values. The input
arguments fy and fyp are assumed to be function handles that implement f_y(x, y, y') and f_{y'}(x, y, y'), respectively. Use newtF from Example 10.2.2 as a model.

Exercise 10.2.5 Write a Matlab function FDNLBVP(f,fy,fyp,a,b,alpha,beta,N)


that solves the general nonlinear BVP (10.1), (10.2) using finite differences in conjunc-
tion with Newton's Method. The input arguments f, fy and fyp are function handles for the functions f, f_y and f_{y'}, respectively. Use newtF from Example 10.2.2 and newtJ
from Exercise 10.2.4 as helper functions. Test your function on the BVP from Example
10.2.2. What happens to the error as N, the number of interior grid points, increases?
It is worth noting that for two-point boundary value problems that are discretized by finite differences, it is much more practical to use Newton's Method, as opposed to a quasi-Newton method such as the Secant Method or Broyden's Method, than for a general system of nonlinear equations, because the Jacobian matrix is tridiagonal. This reduces the expense of the computation of s_{k+1} from O(N^3) operations in the general case to only O(N) for two-point boundary value problems.
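In Matlab, this savings can be realized by storing the Jacobian as a sparse matrix, so that the backslash operator uses a banded solver. A sketch, assuming the three diagonals of J_F(y) have already been computed as column vectors main (length N) and lower, upper (length N − 1), and that F = F(y) has been evaluated:

% hedged sketch: sparse storage of a tridiagonal Jacobian
rows = [(1:N)'; (1:N-1)'; (2:N)'];
cols = [(1:N)'; (2:N)'; (1:N-1)'];
J = sparse(rows, cols, [main; upper; lower], N, N);
s = -J\F;                    % banded solve, O(N) operations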
Exercise 10.2.6 Modify your function FDNLBVP from Exercise 10.2.5 to use Broyden’s
Method instead of Newton’s Method. How does this affect the efficiency, when applied to
the BVP from Example 10.2.2?

Exercise 10.2.7 Although Newton’s Method is much more efficient for such a problem
than for a general system of nonlinear equations, what is an advantage of using the Secant
Method over Newton’s Method or Broyden’s Method?
It can be shown that regardless of the choice of iterative method used to solve the system of
equations arising from discretization, the local truncation error of the finite difference method for
nonlinear problems is O(h2 ), as in the linear case. The order of accuracy can be increased by
applying Richardson extrapolation.

10.3 Collocation
While the finite-difference approach from the previous section is generally effective for two-point
boundary value problems, and is more flexible than the Shooting Method as it can be applied to
higher-dimensional BVPs, it does have its drawbacks.
• First, the accuracy of finite difference approximations relies on the existence of the higher-
order derivatives that appear in their error formulas. Unfortunately, the existence of these
higher-order derivatives is not assured.

• Second, a matrix obtained from a finite-difference approximation can be ill-conditioned, and this conditioning worsens as the spacing h decreases.

• Third, it is best suited for problems in which the domain is relatively simple, such as a
rectangular domain.
We now consider an alternative approach that, in higher dimensions, is more readily applied to
problems on domains with complicated geometries.
First, we assume that the solution y(x) is approximated by a function yN (x) that is a linear
combination of chosen linearly independent functions φ1 (x), φ2 (x), . . . , φN (x), called basis functions
as they form a basis for an N-dimensional vector space. We then have

y_N(x) = Σ_{i=1}^N c_i φ_i(x),    (10.12)

where the constants c_1, c_2, . . . , c_N are unknown. Substituting this form of the solution into (10.1), (10.2) yields the equations

Σ_{j=1}^N c_j φ_j''(x) = f( x, Σ_{j=1}^N c_j φ_j(x), Σ_{j=1}^N c_j φ_j'(x) ),  a < x < b,    (10.13)
Σ_{j=1}^N c_j φ_j(a) = α,    Σ_{j=1}^N c_j φ_j(b) = β.    (10.14)

Already, the convenience of this assumption is apparent: instead of solving for a function y(x) on the entire interval (a, b), we only need to solve for the N coefficients c_1, c_2, . . . , c_N. However, it is
not realistic to think that there is any choice of these coefficients that satisfies (10.13) on the entire
interval (a, b), as well as the boundary conditions (10.14). Rather, we need to impose N conditions
on these N unknowns, in the hope that the resulting system of N equations will have a unique
solution that is also an accurate approximation of the exact solution y(x).
To that end, we require that (10.13) is satisfied at N −2 points in (a, b), denoted by x1 , x2 , . . . , xN −2 ,
and that the boundary conditions (10.14) are satisfied. The points a = x0 , x1 , x2 , . . . , xN −2 , xN −1 =
b are called collocation points. This approach of approximating y(x) by imposing (10.12) and solving
the system of N equations given by

Σ_{j=1}^N c_j φ_j''(x_i) = f( x_i, Σ_{j=1}^N c_j φ_j(x_i), Σ_{j=1}^N c_j φ_j'(x_i) ),  i = 1, 2, . . . , N − 2,    (10.15)

and (10.14), is called collocation.


For simplicity, we assume that the BVP (10.1) is linear. We are then solving a problem of the form

y''(x) = p(x)y'(x) + q(x)y(x) + r(x),  a < x < b.    (10.16)
From (10.15), we obtain the system of linear equations

Σ_{j=1}^N c_j φ_j''(x_i) = r(x_i) + Σ_{j=1}^N c_j q(x_i) φ_j(x_i) + Σ_{j=1}^N c_j p(x_i) φ_j'(x_i),  i = 1, 2, . . . , N − 2,    (10.17)

along with (10.14). This system can be written in the form Ac = b, where c = [c_1 · · · c_N]^T.
We can then solve this system using any of the methods from Chapter 2.

Exercise 10.3.1 Express the system of linear equations (10.17), (10.14) in the form
Ac = b, where c is defined as above. What are the entries aij and bi of the matrix A and
right-hand side vector b, respectively?

Example 10.3.1 Consider the BVP

y'' = x^2,  0 < x < 1,    (10.18)

y(0) = 0,  y(1) = 1.    (10.19)


We assume that our approximation of y(x) has the form

y4 (x) = c1 + c2 x + c3 x2 + c4 x3 .

That is, N = 4, since we are assuming that y(x) is a linear combination of the four functions
1, x, x2 and x3 . Substituting this form into the BVP yields

2c3 + 6c4 xi = x2i , i = 1, 2,



c_1 = 0,  c_1 + c_2 + c_3 + c_4 = 1.

Writing this system of equations in matrix-vector form, we obtain

[ 1  0  0  0    ] [ c_1 ]   [ 0     ]
[ 0  0  2  6x_1 ] [ c_2 ] = [ x_1^2 ]    (10.20)
[ 0  0  2  6x_2 ] [ c_3 ]   [ x_2^2 ]
[ 1  1  1  1    ] [ c_4 ]   [ 1     ]

For the system to be specified completely, we need to choose the two collocation points x_1, x_2 ∈ (0, 1). As long as these points are chosen to be distinct, the matrix of the system will be nonsingular. For this example, we choose x_1 = 1/3 and x_2 = 2/3. We then have the system

[ 1  0  0  0 ] [ c_1 ]   [ 0   ]
[ 0  0  2  2 ] [ c_2 ] = [ 1/9 ]    (10.21)
[ 0  0  2  4 ] [ c_3 ]   [ 4/9 ]
[ 1  1  1  1 ] [ c_4 ]   [ 1   ]

We can now solve this system in Matlab:

>> x=[ 1/3 2/3 ];


>> A=[ 1 0 0 0;
0 0 2 6*x(1);
0 0 2 6*x(2);
1 1 1 1 ];
>> b=[ 0; x(1)^2; x(2)^2; 1 ];
>> format rat
>> c=A\b

c =

0
17/18
-1/9
1/6

The format rat statement was used to obtain exact values of the entries of c, since these entries
are guaranteed to be rational numbers in this case. It follows that our approximate solution y_N(x) is

y_4(x) = (17/18) x − (1/9) x^2 + (1/6) x^3.

The exact solution of the original BVP is easily obtained by integration:

y(x) = (1/12) x^4 + (11/12) x.
From these formulas, though, it is not easy to gauge how accurate y4 (x) is. To visualize the error,
we plot both solutions:

>> xp=0:0.01:1;
>> y4p=c(1)+c(2)*xp+c(3)*xp.^2+c(4)*xp.^3;
>> yp=xp.^4/12+11*xp/12;
>> plot(xp,yp)
>> hold on
>> plot(xp,y4p,'r--')
>> xlabel('x')
>> ylabel('y')
>> legend('exact','approximate')

The result is shown in Figure 10.3. As we can see, this approximate solution is reasonably accurate.

Figure 10.3: Exact (blue curve) and approximate (dashed curve) solutions of (10.18), (10.19) from
Example 10.3.1.

To get a numerical indication of the accuracy, we can measure the error at the points in xp that
were used for plotting:

>> format short


>> norm(yp-y4p,'inf')

ans =

0.0023

Since the exact solution and approximation are both polynomials, we can also compute the L2
norm of the error:
>> py=[ 1/12 0 0 11/12 0 ];
>> py4=c(end:-1:1)';
>> err4=py-[ 0 py4 ];
>> err42=conv(err4,err4);
>> Ierr42=polyint(err42);
>> norm22=polyval(Ierr42,1)-polyval(Ierr42,0);
>> norm2=sqrt(norm22)

norm2 =

0.0019
2

Exercise 10.3.2 Solve the BVP from Example 10.3.1 again, but with different collocation
points x1 , x2 ∈ (0, 1). What happens to the error?

Exercise 10.3.3 Use Matlab to compute the relative error in the ∞-norm and L2 -
norm from the preceding example.

Exercise 10.3.4 What would happen if N = 5 collocation points were used, along with
the functions φj (x) = xj−1 , j = 1, 2, . . . , 5?

Exercise 10.3.5 Write a Matlab function [x,y]=linearcollocation(p,q,r,a,b,alpha,beta,N)


that uses collocation to solve the linear BVP (10.16), (10.2). The input arguments p,
q and r must be function handles for the functions p(x), q(x) and r(x), respectively,
from (10.16). Use N equally spaced collocation points, which must be returned in the
output x. The output y must contain the values of the approximate solution yN (x) at
the collocation points. Use the basis functions φj (x) = xj−1 , j = 1, 2, . . . , N . Test your
function on the BVP from Example 10.3.1.

Exercise 10.3.6 Use your function linearcollocation from Exercise 10.3.5 to solve
the BVP
y'' = e^x,  0 < x < 1,  y(0) = 0,  y(1) = 1.
What happens to the error in the approximate solution as the number of collocation points,
N , increases? Plot the error as a function of N , using logarithmic scales.

The choice of functions φj (x), j = 1, 2, . . . , N , can significantly affect the process of solving the
resulting system of equations. The choice used in Example 10.3.1, φj (x) = xj−1 , while convenient,
is not a good choice, especially when N is large. As illustrated in Section 6.2, these functions can
be nearly linearly dependent on the interval [a, b], which can lead to ill-conditioned systems.

Exercise 10.3.7 What happens to the condition number of the matrix used by your func-
tion linearcollocation from Exercise 10.3.5 as N increases?

Alternatives include orthogonal polynomials, such as Chebyshev polynomials, or trigonometric


polynomials, as discussed in Section 6.4.
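For instance, on [−1, 1] the Chebyshev polynomials and the Chebyshev points can be evaluated directly; a small sketch follows (mapping a general interval [a, b] onto [−1, 1] is omitted for brevity).

% hedged sketch: Chebyshev points and Chebyshev basis values on [-1, 1]
N = 10;
xc = cos((2*(1:N)' - 1)*pi/(2*N));     % Chebyshev points (roots of T_N)
T = cos(acos(xc)*(0:N-1));             % T(i,j) = T_{j-1}(xc(i))
% cond(T) typically grows far more slowly with N than for the monomial basis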

Exercise 10.3.8 Modify your function linearcollocation from Exercise 10.3.5 to use
Chebyshev polynomials instead of the monomial basis, and the Chebyshev points as the
collocation points instead of equally spaced points. What happens to the condition number
of the matrix as N increases?

Collocation can be used for either linear or nonlinear BVPs. In the nonlinear case, choosing the functions φ_j(x), j = 1, 2, . . . , N, and the collocation points x_i, i = 0, 1, . . . , N − 1, yields a system of nonlinear equations for the unknowns c_1, c_2, . . . , c_N. This system can then be solved using any of the techniques from Section 8.6, just as when using finite differences.

Exercise 10.3.9 Describe the system of nonlinear equations F(c) = 0 that must be solved
at each iteration when using Newton’s method to solve a general nonlinear BVP of the
form (10.1), (10.2). What is the Jacobian of F, JF (c)?

Exercise 10.3.10 Write a Matlab function [x,y]=nonlinearcollocation(f,a,b,alpha,beta,N)


that solves a BVP of the form (10.1), (10.2) using Newton’s method. The input argument
f must be a function handle for the function f (x, y, y 0 ) from (10.1), and N indicates the
number of collocation points. Use the Chebyshev points as the collocation points, and the
Chebyshev polynomials as the basis functions. For the initial guess, use the coefficients
corresponding to the unique linear function that satisfies the boundary conditions (10.2).
Test your function on the BVP

y'' = y^2,  y(1) = 6,  y(2) = 3/2.

What is the exact solution? Hint: Solve the simpler ODE y' = y^2; the form of its solution suggests the form of the solution of y'' = y^2.

10.4 The Finite Element Method


In collocation, the approximate solution y_N(x) is defined to be an element of an N-dimensional function space, which is the span of the chosen basis functions φ_j(x), j = 1, 2, . . . , N. In this section, we describe another method for solving a BVP in which the approximate solution is again restricted to an N-dimensional function space, but instead of requiring the residual R(x, y_N, y_N', y_N'') ≡ y_N'' − f(x, y_N, y_N') to vanish at selected points in (a, b), as in collocation, we require that the residual is orthogonal to a given function space, consisting of functions called test functions. That is, we require the residual to be zero in an “average” sense, rather than a pointwise sense. In fact, this approach is called the weighted mean residual method.
For concreteness and simplicity, we consider the linear boundary value problem

−u''(x) = f(x),  0 < x < 1,    (10.22)

with boundary conditions


u(0) = 0, u(1) = 0. (10.23)

This equation can be used to model, for example, transverse vibration of a string due to an external
force f (x), or longitudinal displacement of a beam subject to a load f (x). In either case, the
boundary conditions prescribe that the endpoints of the object in question are fixed.
If we multiply both sides of (10.22) by a test function w(x), and then integrate over the domain [0, 1], we obtain

−∫_0^1 w(x) u''(x) dx = ∫_0^1 w(x) f(x) dx.

Applying integration by parts, we obtain

∫_0^1 w(x) u''(x) dx = [w(x) u'(x)]_0^1 − ∫_0^1 w'(x) u'(x) dx.

Let C^2[0, 1] be the space of all functions with two continuous derivatives on [0, 1], and let C_0^2[0, 1] be the space of all functions in C^2[0, 1] that are equal to zero at the endpoints x = 0 and x = 1. If we require that our test function w(x) belongs to C_0^2[0, 1], then w(0) = w(1) = 0, and the boundary term in the above application of integration by parts vanishes. We then have

∫_0^1 w'(x) u'(x) dx = ∫_0^1 w(x) f(x) dx.

This is called the weak form of the boundary value problem (10.22), (10.23); the original statement is known as the strong form or classical form. The name reflects the fact that the weak form only requires that the first derivative of u(x) exist, as opposed to the original boundary value problem, which requires the existence of the second derivative. The weak form is also known as the variational form. It can be shown that both the weak form and the strong form have the same solution u ∈ C_0^2[0, 1].
To find an approximate solution of the weak form, we restrict ourselves to an N-dimensional subspace V_N of C_0^2[0, 1] by requiring that the approximate solution, denoted by u_N(x), satisfies

u_N(x) = Σ_{j=1}^N c_j φ_j(x),    (10.24)

where the trial functions φ_1, φ_2, . . . , φ_N form a basis for V_N. For now, we only assume that these trial functions belong to C_0^2[0, 1], and are linearly independent. Substituting this form into the weak form yields

Σ_{j=1}^N ( ∫_0^1 w'(x) φ_j'(x) dx ) c_j = ∫_0^1 w(x) f(x) dx.

Since our trial functions and test functions come from the same space, this version of the weighted
mean residual method is known as the Galerkin method. As in collocation, we need N equations
to uniquely determine the N unknowns c1 , c2 , . . . , cN . To that end, we use the basis functions
φ_1, φ_2, . . . , φ_N as test functions. This yields the system of equations

Σ_{j=1}^N ( ∫_0^1 φ_i'(x) φ_j'(x) dx ) c_j = ∫_0^1 φ_i(x) f(x) dx,  i = 1, 2, . . . , N.

This system can be written in matrix-vector form

Ac = f,

where c is a vector of the unknown coefficients c_1, c_2, . . . , c_N and

a_{ij} = ∫_0^1 φ_i'(x) φ_j'(x) dx,  i, j = 1, 2, . . . , N,

f_i = ∫_0^1 φ_i(x) f(x) dx,  i = 1, 2, . . . , N.

By finding the coefficients c_1, c_2, . . . , c_N that satisfy these equations, we ensure that the residual R(x, u_N, u_N', u_N'') = f(x) + u_N''(x) satisfies

⟨w, R⟩ = 0,  w ∈ V_N,

as each w ∈ V_N can be expressed as a linear combination of the test functions φ_1, φ_2, . . . , φ_N. It follows that the residual is orthogonal to V_N.
We must now choose trial (and test) functions φ1 , φ2 , . . . , φN . A simple choice is a set of
piecewise linear “hat” functions, or “tent” functions, so named because of the shapes of their
graphs, which are illustrated in Figure 10.4. We divide the interval [0, 1] into N + 1 subintervals [x_{i−1}, x_i], for i = 1, 2, . . . , N + 1, with uniform spacing h = 1/(N + 1), where x_0 = 0 and x_{N+1} = 1.

Figure 10.4: Piecewise linear basis functions φ_j(x), as defined in (10.25), for j = 1, 2, 3, 4, with N = 4.

Then we define

           { 0,                  0 ≤ x ≤ x_{j−1},
           { (x − x_{j−1})/h,    x_{j−1} < x ≤ x_j,
φ_j(x) =   { (x_{j+1} − x)/h,    x_j < x ≤ x_{j+1},        j = 1, 2, . . . , N.    (10.25)
           { 0,                  x_{j+1} < x ≤ 1,

These functions automatically satisfy the boundary conditions. Because they are only piecewise linear, their derivatives are discontinuous. They are

           { 0,      0 ≤ x ≤ x_{j−1},
           { 1/h,    x_{j−1} < x ≤ x_j,
φ_j'(x) =  { −1/h,   x_j < x ≤ x_{j+1},        j = 1, 2, . . . , N.
           { 0,      x_{j+1} < x ≤ 1,

It follows from these definitions that φ_i(x) and φ_j(x) cannot simultaneously be nonzero at any point in [0, 1] unless |i − j| ≤ 1. This yields a symmetric tridiagonal matrix A with entries

a_{ii} = (1/h)^2 ∫_{x_{i−1}}^{x_i} 1 dx + (−1/h)^2 ∫_{x_i}^{x_{i+1}} 1 dx = 2/h,  i = 1, 2, . . . , N,

a_{i,i+1} = −(1/h^2) ∫_{x_i}^{x_{i+1}} 1 dx = −1/h,  i = 1, 2, . . . , N − 1,

a_{i+1,i} = a_{i,i+1},  i = 1, 2, . . . , N − 1.

For the right-hand side vector f, known as the load vector, we have

f_i = (1/h) ∫_{x_{i−1}}^{x_i} (x − x_{i−1}) f(x) dx + (1/h) ∫_{x_i}^{x_{i+1}} (x_{i+1} − x) f(x) dx,  i = 1, 2, . . . , N.    (10.26)
When the Galerkin method is used with basis functions such as these, that are only nonzero within
a small portion of the spatial domain, the method is known as the finite element method. In this
context, the subintervals [xi−1 , xi ] are called elements, and each xi is called a node. As we have
seen, an advantage of this choice of trial function is that the resulting matrix A, known as the
stiffness matrix, is sparse.
It can be shown that the matrix A with entries defined from these approximate integrals is
not only symmetric and tridiagonal, but also positive definite. It follows that the system Ac = f
is stable with respect to roundoff error, and can be solved using methods such as the conjugate
gradient method that are appropriate for sparse symmetric positive definite systems.

Example 10.4.1 We illustrate the finite element method by solving (10.22), (10.23) with f (x) = x,
with N = 4. The following Matlab commands are used to help specify the problem.
>> % solve -u’’ = f on (0,1), u(0)=u(1)=0, f polynomial
>> % represent f(x) = x as a polynomial
>> fx=[ 1 0 ];
>> % set number of interior nodes
>> N=4;
>> h=1/(N+1);
>> % compute vector containing all nodes, including boundary nodes
>> x=h*(0:N+1)’;

This vector of nodes will be convenient when constructing the load vector f and performing other
tasks.
We need to solve the system Ac = f, where

          [  2  −1   0   0 ]         [ c_1 ]
A = (1/h) [ −1   2  −1   0 ],    c = [ c_2 ],
          [  0  −1   2  −1 ]         [ c_3 ]
          [  0   0  −1   2 ]         [ c_4 ]

with h = 1/5. The following Matlab commands set up the stiffness matrix for a general value of N.
>> % construct stiffness matrix:
>> e=ones(N-1,1);
>> % use diag to place entries on subdiagonal and superdiagonal
>> A=1/h*(2*eye(N)-diag(e,1)-diag(e,-1));
The load vector f has elements

f_i = (1/h) ∫_{x_{i−1}}^{x_i} (x − x_{i−1}) x dx + (1/h) ∫_{x_i}^{x_{i+1}} (x_{i+1} − x) x dx,  i = 1, 2, . . . , N.

The following statements compute these elements when f is a polynomial.


% construct load vector:
% pre-allocate column vector
f=zeros(N,1);
for i=1:N
% note that in text, 0-based indexing is used
% for x-values, while Matlab uses 1-based indexing
% phi_{i-1}(x) = (x - x_{i-1})/h
hat1=[ 1 -x(i) ]/h;
% multiply hat function by f(x)
integrand1=conv(fx,hat1);
% anti-differentiate
antideriv1=polyint(integrand1);
% substitute limits of integration into antiderivative
integral1=polyval(antideriv1,x(i+1))-polyval(antideriv1,x(i));
% phi_i(x) = (x_{i+1} - x)/h
hat2=[ -1 x(i+2) ]/h;
% repeat integration process on [x_i,x_{i+1}]
integrand2=conv(fx,hat2);
% anti-differentiate
antideriv2=polyint(integrand2);
% substitute limits of integration into antiderivative
integral2=polyval(antideriv2,x(i+2))-polyval(antideriv2,x(i+1));
f(i)=integral1+integral2;
end

Now that the system Ac = f is set up, we can easily solve it in Matlab using the command c=A\f.
For this simple BVP, we can obtain the exact solution analytically, to help gauge the accuracy of our approximate solution. The following statements accomplish this for the BVP −u'' = f on (0, 1), u(x_0) = u(x_{N+1}) = 0, when f is a polynomial.

% obtain exact solution by integration


u=-polyint(polyint(fx));
% to solve for constants of integration:
% u + d1 x + d2 = 0 at x = x_0, x_{N+1}
% in matrix-vector form:
% [ x_0 1 ] [ d1 ] = [ -u(x0) ]
% [ x_{N+1} 1 ] [ d2 ] [ -u(x_{N+1}) ]
V=[ x(1) 1; x(end) 1 ];
b=[ -polyval(u,x(1)); -polyval(u,x(end)) ];
d=V\b;
u=u+[ 0 0 d(1) d(2) ];

Now, we can visualize the exact solution u(x) and approximate solution uN (x), which is a piecewise
linear function due to the use of piecewise linear trial functions.

% make vector of x-values for plotting exact solution


xp=x(1):h/100:x(end);
plot(xp,polyval(u,xp),'b')
hold on
plot(x,[ 0; c; 0 ],'r--o')
hold off
xlabel('x')
ylabel('y')

Because each of the trial functions φ_j(x), j = 1, 2, 3, 4, is equal to 1 at x_j and equal to 0 at x_i for i ≠ j, each element c_j of c is equal to the value of u_4(x) at the corresponding node x_j. The exact and approximate solutions are shown in Figure 10.5. It can be seen that there is very close agreement between the exact and approximate solutions at the nodes; in fact, in this example, they are exactly equal, though this does not occur in general. 2

In the preceding example, the integrals in (10.26) could be evaluated exactly. Generally, how-
ever, they must be approximated, using techniques such as those presented in Chapter 7.
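As one possibility (a sketch, not the Trapezoidal-rule approach of the exercises below), the entries (10.26) can be approximated with Matlab's adaptive quadrature routine integral, given the node vector x and spacing h from Example 10.4.1 and a function handle f:

% hedged sketch: approximate the load vector (10.26) by adaptive quadrature
f = @(x) exp(x);                             % example right-hand side
fvec = zeros(N, 1);
for i = 1:N
    % hat function phi_i, centered at x(i+1) (that is, x_i in the text),
    % with support of width 2h
    phi = @(t) max(1 - abs(t - x(i+1))/h, 0);
    fvec(i) = integral(@(t) phi(t).*f(t), x(i), x(i+2));
end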
Exercise 10.4.1 What is the value of fi if the Trapezoidal Rule is used on each of the
integrals in (10.26)? What if Simpson’s Rule is used?

Exercise 10.4.2 Write a Matlab function [x,u]=FEMBVP(f,N) that solves the BVP
(10.22), (10.23) with N interior nodes. The input argument f must be a function handle.
The outputs x and u must be column vectors consisting of the nodes and values of the
approximate solution at the nodes, respectively. Use the Trapezoidal rule to approximate
the integrals (10.26). Test your function with f (x) = ex . What happens to the error as
N increases?

Figure 10.5: Exact (solid curve) and approximate (dashed curve) solutions of (10.22), (10.23) with
f (x) = x and N = 4

Exercise 10.4.3 Generalize your function FEMBVP from Exercise 10.4.2 so that it can be
used to solve the BVP −u00 + q(x)u = f (x) on (0, 1), with boundary conditions u(0) =
u(1) = 0, for a given function q(x) that must be passed as an input argument that is a
function handle. Hint: re-derive the weak form of the BVP to determine how the matrix
A must be modified. Use the Trapezoidal Rule to approximate any integrals involving q(x).

Exercise 10.4.4 Modify your function FEMBVP from Exercise 10.4.3 so that it can be
used to solve the BVP −u00 + q(x)u = f (x) on (0, 1), with boundary conditions u(0) = u0 ,
u(1) = u1 , where the scalars u0 and u1 must be passed as input arguments. Hint: Modify
(10.24) to include additional basis functions φ0 (x) and φN +1 (x), that are equal to 1 at
x = x0 and x = xN +1 , respectively, and equal to 0 at all other nodes. How must the load
vector f be modified to account for these nonhomogeneous boundary conditions?

Exercise 10.4.5 Modify your function FEMBVP from Exercise 10.4.4 so that it can be
used to solve the BVP −(p(x)u')' + q(x)u = f(x) on (0, 1), with boundary conditions
u(0) = u0 , u(1) = u1 , where the coefficient p(x) must be passed as an input argument
that is a function handle. Hint: re-derive the weak form of the BVP to determine how
the matrix A must be modified. Use the Trapezoidal Rule to approximate any integrals
involving p(x).

It can be shown that when using the finite element method with piecewise linear trial functions, the error in the approximate solution is O(h^2). Higher-order accuracy can be achieved by using higher-degree piecewise polynomials as basis functions, such as cubic B-splines. Such a choice also helps to ensure that the approximate solution is differentiable, unlike the solution computed using piecewise linear basis functions, which is continuous but not differentiable at the points x_i, i = 1, 2, . . . , N. With cubic B-splines, the error in the computed solution is O(h^4), as opposed to O(h^2) in the piecewise linear case, due to the two additional degrees of differentiability. However, the drawback is that the matrix arising from the use of higher-degree basis functions is no longer tridiagonal; the upper and lower bandwidths are each equal to the degree of the piecewise polynomial that is used.

10.5 Further Reading


Part V

Appendices

Appendix A

Review of Calculus

Among the mathematical problems that can be solved using techniques from numerical analysis
are the basic problems of differential and integral calculus:

• computing the instantaneous rate of change of one quantity with respect to another, which
is a derivative, and

• computing the total change in a function over some portion of its domain, which is a definite
integral.

Calculus also plays an essential role in the development and analysis of techniques used in numerical
analysis, including those techniques that are applied to problems not arising directly from calculus.
Therefore, it is appropriate to review some basic concepts from calculus before we begin our study
of numerical analysis.

A.1 Limits and Continuity

A.1.1 Limits

The basic problems of differential and integral calculus described in the previous paragraph can be
solved by computing a sequence of approximations to the desired quantity and then determining
what value, if any, the sequence of approximations approaches. This value is called a limit of the
sequence. As a sequence is a function, we begin by defining, precisely, the concept of the limit of a
function.


Definition A.1.1 We write


lim f (x) = L
x→a

if for any open interval I1 containing L, there is some open interval I2 containing a such
that f(x) ∈ I_1 whenever x ∈ I_2 and x ≠ a. We say that L is the limit of f(x) as x
approaches a.
We write
lim f (x) = L
x→a−

if, for any open interval I1 containing L, there is an open interval I2 of the form (c, a),
where c < a, such that f (x) ∈ I1 whenever x ∈ I2 . We say that L is the limit of f (x) as
x approaches a from the left, or the left-hand limit of f (x) as x approaches a.
Similarly, we write
lim f (x) = L
x→a+

if, for any open interval I1 containing L, there is an open interval I2 of the form (a, c),
where c > a, such that f (x) ∈ I1 whenever x ∈ I2 . We say that L is the limit of f (x) as
x approaches a from the right, or the right-hand limit of f (x) as x approaches
a.
We can make the definition of a limit a little more concrete by imposing sizes on the intervals
I1 and I2 , as long as the interval I1 can still be of arbitrary size. It can be shown that the following
definition is equivalent to the previous one.
Definition A.1.2 We write
lim f (x) = L
x→a

if, for any ε > 0, there exists a number δ > 0 such that |f(x) − L| < ε whenever 0 < |x − a| < δ.

Similar definitions can be given for the left-hand and right-hand limits.
Note that in either definition, the point x = a is specifically excluded from consideration when
requiring that f (x) be close to L whenever x is close to a. This is because the concept of a limit
is only intended to describe the behavior of f (x) near x = a, as opposed to its behavior at x = a.
Later in this appendix we discuss the case where the two distinct behaviors coincide.
To illustrate limits, we consider

L = lim_{x→0+} (sin x)/x.    (A.1)

We will visualize this limit in Matlab. First, we construct a vector of x-values that are near zero,
but excluding zero. This can readily be accomplished using the colon operator:

>> dx=0.01;
>> x=dx:dx:1;

Then, the vector x contains the values xi = i∆x, i = 1, 2, . . . , 100, where ∆x = 0.01.

Exercise A.1.1 Use the vector x to plot sin x/x on the interval (0, 1]. What appears to
be the value of the limit L in (A.1)?

The preceding exercise can be completed using a for loop, but it is much easier to use component-
wise operators. Since x is a vector, the expression sin(x)/x would cause an error. Instead, the ./
operator can be used to perform componentwise division of the vectors sin(x) and x. The . can
be used with several other arithmetic operators to perform componentwise operations on matrices
and vectors. For example, if A is a matrix, then A.^2 is a matrix in which each entry is the square
of the corresponding entry of A.
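For instance (a small sketch with arbitrary values):

x = 1:5;
y = x.^2;              % squares each entry: [1 4 9 16 25]
z = y./x;              % componentwise division: [1 2 3 4 5]
w = sin(x)./x;         % the ratio in (A.1), evaluated componentwise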

Exercise A.1.2 Use one statement to plot sin x/x on the interval (0, 1].

A.1.2 Limits of Functions of Several Variables

The notions of limit and continuity generalize to vector-valued functions and functions of several
variables in a straightforward way.

Definition A.1.3 Given a function f : D ⊆ Rn → R and a point x0 ∈ D, we write

lim f (x) = L
x→x0

if, for any ε > 0, there exists a δ > 0 such that

|f(x) − L| < ε

whenever x ∈ D and

0 < ‖x − x_0‖ < δ.
In this definition, we can use any appropriate vector norm ‖ · ‖, as discussed in Section B.11.

Definition A.1.4 We also say that f is continuous at a point x0 ∈ D if

lim f (x) = f (x0 ).


x→x0

It can be shown that f is continuous at x_0 if its partial derivatives are bounded near x_0.


Having defined limits and continuity for scalar-valued functions of several variables, we can now
define these concepts for vector-valued functions. Given a vector-valued function F : D ⊆ Rn → Rn ,
and x = (x_1, x_2, . . . , x_n) ∈ D, we write

F(x) = [ f_1(x) ]   [ f_1(x_1, x_2, . . . , x_n) ]
       [ f_2(x) ] = [ f_2(x_1, x_2, . . . , x_n) ]
       [   ⋮    ]   [            ⋮               ]
       [ f_n(x) ]   [ f_n(x_1, x_2, . . . , x_n) ]

where f1 , f2 , . . . , fn are the component functions, or coordinate functions, of F.



Definition A.1.5 Given F : D ⊆ Rn → Rn and x0 ∈ D, we say that

lim F(x) = L
x→x0

if and only if
lim fi (x) = Li , i = 1, 2, . . . , n.
x→x0

Similarly, we say that F is continuous at x0 if and only if each coordinate function fi is


continuous at x0 . Equivalently, F is continuous at x0 if

lim F(x) = F(x0 ).


x→x0

A.1.3 Limits at Infinity


The concept of a limit defined above is useful for describing the behavior of a function f (x) as x
approaches a finite value a. However, suppose that the function f is a sequence, which is a function
that maps N, the set of natural numbers, to R, the set of real numbers. We will denote such a
sequence by {fn }∞n=0 , or simply {fn }. In numerical analysis, it is sometimes necessary to determine
the value that the terms of a sequence {fn } approach as n → ∞. Such a value, if it exists, is not
a limit, as defined previously. However, it is natural to use the notation of limits to describe this
behavior of a function. We therefore define what it means for a sequence {fn } to have a limit as n
becomes infinite.
Definition A.1.6 (Limit at Infinity) Let {fn } be a sequence defined for all integers
not less than some integer n0 . We say that the limit of {fn } as n approaches ∞ is
equal to L, and write
lim fn = L,
n→∞

if for any open interval I containing L, there exists a number M such that f_n ∈ I whenever n > M.

Example A.1.7 Let the sequence {f_n}_{n=1}^∞ be defined by f_n = 1/n for every positive integer n. Then

lim_{n→∞} f_n = 0,

since for any ε > 0, no matter how small, we can find a positive integer n_0 such that |f_n| < ε for all n ≥ n_0. In fact, for any given ε, we can choose n_0 = ⌈1/ε⌉, where ⌈x⌉, known as the ceiling function, denotes the smallest integer that is greater than or equal to x. 2

A.1.4 Continuity
In many cases, the limit of a function f(x) as x approaches a can be obtained by simply computing
f (a). Intuitively, this indicates that f has to have a graph that is one continuous curve, because
any “break” or “jump” in the graph at x = a is caused by f approaching one value as x approaches
a, only to actually assume a different value at a. This leads to the following precise definition of
what it means for a function to be continuous at a given point.

Definition A.1.8 (Continuity) We say that a function f is continuous at a if

lim f (x) = f (a).


x→a

We also say that f(x) has the Direct Substitution Property at x = a.


We say that a function f is continuous from the right at a if

lim f (x) = f (a).


x→a+

Similarly, we say that f is continuous from the left at a if

lim f (x) = f (a).


x→a−

The preceding definition describes continuity at a single point. In describing where a function
is continuous, the concept of continuity over an interval is useful, so we define this concept as well.

Definition A.1.9 (Continuity on an Interval) We say that a function f is contin-


uous on the interval (a, b) if f is continuous at every point in (a, b). Similarly, we say
that f is continuous on

1. [a, b) if f is continuous on (a, b), and continuous from the right at a.

2. (a, b] if f is continuous on (a, b), and continuous from the left at b.

3. [a, b] if f is continuous on (a, b), continuous from the right at a, and continuous
from the left at b.
In numerical analysis, it is often necessary to construct a continuous function, such as a polyno-
mial, based on data obtained by measurements and problem-dependent constraints. In this course,
we will learn some of the most basic techniques for constructing such continuous functions by a
process called interpolation.

A.1.5 The Intermediate Value Theorem

Suppose that a function f is continuous on some closed interval [a, b]. The graph of such a function
is a continuous curve connecting the points (a, f (a)) with (b, f (b)). If one were to draw such a
graph, their pen would not leave the paper in the process, and therefore it would be impossible
to “avoid” any y-value between f (a) and f (b). This leads to the following statement about such
continuous functions.
Theorem A.1.10 (Intermediate Value Theorem) Let f be continuous on [a, b].
Then, on (a, b), f assumes every value between f (a) and f (b); that is, for any value
y between f (a) and f (b), f (c) = y for some c in (a, b).

The Intermediate Value Theorem has a very important application in the problem of finding
solutions of a general equation of the form f (x) = 0, where x is the solution we wish to compute
and f is a given continuous function. Often, methods for solving such an equation try to identify
an interval [a, b] where f (a) > 0 and f (b) < 0, or vice versa. In either case, the Intermediate Value

Theorem states that f must assume every value between f (a) and f (b), and since 0 is one such
value, it follows that the equation f (x) = 0 must have a solution somewhere in the interval (a, b).
We can find an approximation to this solution using a procedure called bisection, which re-
peatedly applies the Intermediate Value Theorem to smaller and smaller intervals that contain the
solution. We will study bisection, and other methods for solving the equation f (x) = 0, in this
course.
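As a preview, a minimal Matlab sketch of bisection, with an arbitrary choice of function and interval:

% hedged sketch of bisection for f(x) = 0, assuming f(a) and f(b) differ in sign
f = @(x) x.^2 - 2;              % example: solve x^2 = 2 on [1, 2]
a = 1;  b = 2;
for k = 1:50
    c = (a + b)/2;
    if f(a)*f(c) <= 0
        b = c;                  % a root lies in [a, c]
    else
        a = c;                  % a root lies in [c, b]
    end
end
root = (a + b)/2;               % approximately sqrt(2)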

A.2 Derivatives
The basic problem of differential calculus is computing the instantaneous rate of change of one
quantity y with respect to another quantity x. For example, y may represent the position of an
object and x may represent time, in which case the instantaneous rate of change of y with respect
to x is interpreted as the velocity of the object.
When the two quantities x and y are related by an equation of the form y = f (x), it is certainly
convenient to describe the rate of change of y with respect to x in terms of the function f . Because
the instantaneous rate of change is so commonplace, it is practical to assign a concise name and
notation to it, which we do now.

Definition A.2.1 (Derivative) The derivative of a function f(x) at x = a, denoted by f'(a), is

f'(a) = lim_{h→0} [f(a + h) − f(a)]/h,

provided that the above limit exists. When this limit exists, we say that f is differentiable at a.
Remark Given a function f (x) that is differentiable at x = a, the following numbers are all equal:

• the derivative of f at x = a, f 0 (a),

• the slope of the tangent line of f at the point (a, f (a)), and

• the instantaneous rate of change of y = f (x) with respect to x at x = a.

This can be seen from the fact that all three numbers are defined in the same way. 2

Exercise A.2.1 Let f (x) = x2 − 3x + 2. Use the Matlab function polyder to compute
the coefficients of f 0 (x). Then use the polyval function to obtain the equation of the
tangent line of f (x) at x = 2. Finally, plot the graph of f (x) and this tangent line, on
the same graph, restricted to the interval [0, 4].

Many functions can be differentiated using differentiation rules such as those learned in a cal-
culus course. However, many functions cannot be differentiated using these rules. For example,
we may need to compute the instantaneous rate of change of a quantity y = f (x) with respect to
another quantity x, where our only knowledge of the function f that relates x and y is a set of pairs
of x-values and y-values that may be obtained using measurements. In this course we will learn how
to approximate the derivative of such a function using this limited information. The most common
methods involve constructing a continuous function, such as a polynomial, based on the given data,
using interpolation. The polynomial can then be differentiated using differentiation rules. Since the
polynomial is an approximation to the function f(x), its derivative is an approximation to f'(x).
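As a numerical illustration of the limit in Definition A.2.1 (a sketch with an arbitrary function and point):

% hedged sketch: difference quotients approaching f'(a)
f = @(x) sin(x);
a = 1;                                   % exact derivative is cos(1) = 0.5403...
for h = 10.^(-1:-1:-8)
    fprintf('h = %8.1e   (f(a+h)-f(a))/h = %12.8f\n', h, (f(a+h) - f(a))/h);
end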

A.2.1 Differentiability and Continuity


Consider a tangent line of a function f at a point (a, f (a)). When we consider that this tangent
line is the limit of secant lines that can cross the graph of f at points on either side of a, it seems
impossible that f can fail to be continuous at a. The following result confirms this: a function
that is differentiable at a given point (and therefore has a tangent line at that point) must also be
continuous at that point.

Theorem A.2.2 If f is differentiable at a, then f is continuous at a.

It is important to keep in mind, however, that the converse of the above statement, “if f
is continuous, then f is differentiable”, is not true. It is actually very easy to find examples of
functions that are continuous at a point, but fail to be differentiable at that point. As an extreme
example, it is known that there is a function that is continuous everywhere, but is differentiable
nowhere.

Example A.2.3 The functions f (x) = |x| and g(x) = x1/3 are examples of functions that are
continuous for all x, but are not differentiable at x = 0. The graph of the absolute value function
|x| has a sharp corner at x = 0, since the one-sided limits

lim_{h→0−} [f(h) − f(0)]/h = −1,    lim_{h→0+} [f(h) − f(0)]/h = 1

do not agree, but in general these limits must agree in order for f (x) to have a derivative at x = 0.
The cube root function g(x) = x^{1/3} is not differentiable at x = 0 because the tangent line to the graph at the point (0, 0) is vertical, so it has no finite slope. We can also see that the derivative does not exist at this point by noting that the function g'(x) = (1/3)x^{−2/3} has a vertical asymptote at x = 0. 2

Exercise A.2.2 Plot both of the functions from this example on the interval [−1, 1]. Use the colon operator to create a vector of x-values and use the dot for performing componentwise operations on a vector, where applicable. From the plot, identify the non-differentiability in these continuous functions.

A.3 Extreme Values


In many applications, it is necessary to determine where a given function attains its minimum or
maximum value. For example, a business wishes to maximize profit, so it can construct a function
that relates its profit to variables such as payroll or maintenance costs. We now consider the basic
problem of finding a maximum or minimum value of a general function f (x) that depends on a
single independent variable x. First, we must precisely define what it means for a function to have
a maximum or minimum value.

Definition A.3.1 (Absolute extrema) A function f has an absolute maximum or


global maximum at c if f (c) ≥ f (x) for all x in the domain of f . The number f (c) is
called the maximum value of f on its domain. Similarly, f has an absolute minimum
or global minimum at c if f (c) ≤ f (x) for all x in the domain of f . The number f (c) is
then called the minimum value of f on its domain. The maximum and minimum values
of f are called the extreme values of f , and the absolute maximum and minimum are
each called an extremum of f .

Before computing the maximum or minimum value of a function, it is natural to ask whether it
is possible to determine in advance whether a function even has a maximum or minimum, so that
effort is not wasted in trying to solve a problem that has no solution. The following result is very
helpful in answering this question.

Theorem A.3.2 (Extreme Value Theorem) If f is continuous on [a, b], then f has
an absolute maximum and an absolute minimum on [a, b].

Now that we can easily determine whether a function has a maximum or minimum on a closed
interval [a, b], we can develop a method for actually finding them. It turns out that it is easier
to find points at which f attains a maximum or minimum value in a “local” sense, rather than
a “global” sense. In other words, we can best find the absolute maximum or minimum of f
by finding points at which f achieves a maximum or minimum with respect to “nearby” points,
and then determine which of these points is the absolute maximum or minimum. The following
definition makes this notion precise.

Definition A.3.3 (Local extrema) A function f has a local maximum at c if f (c) ≥


f (x) for all x in an open interval containing c. Similarly, f has a local minimum at
c if f (c) ≤ f (x) for all x in an open interval containing c. A local maximum or local
minimum is also called a local extremum.
At each point at which f has a local maximum, the function either has a horizontal tangent
line, or no tangent line due to not being differentiable. It turns out that this is true in general,
and a similar statement applies to local minima. To state the formal result, we first introduce the
following definition, which will also be useful when describing a method for finding local extrema.

Definition A.3.4 (Critical Number) A number c in the domain of a function f is a


critical number of f if f 0 (c) = 0 or f 0 (c) does not exist.

The following result describes the relationship between critical numbers and local extrema.

Theorem A.3.5 (Fermat’s Theorem) If f has a local minimum or local maximum at


c, then c is a critical number of f ; that is, either f 0 (c) = 0 or f 0 (c) does not exist.

This theorem suggests that the maximum or minimum value of a function f (x) can be found by
solving the equation f 0 (x) = 0. As mentioned previously, we will be learning techniques for solving
such equations in this course. These techniques play an essential role in the solution of problems
in which one must compute the maximum or minimum value of a function, subject to constraints
on its variables. Such problems are called optimization problems.
The following exercise highlights the significance of critical numbers. It relies on some of the

Matlab functions for working with polynomials that were introduced in Section 1.2.

Exercise A.3.1 Consider the polynomial f (x) = x3 − 4x2 + 5x − 2. Plot the graph of
this function on the interval [0, 3]. Use the colon operator to create a vector of x-values
and use the dot for componentwise operations on vectors wherever needed. Then, use the
polyder and roots functions to compute the critical numbers of f (x). How do they relate
to the absolute maximum and minimum values of f (x) on [0, 3], or any local maxima or
minima on this interval?

A.4 Integrals
There are many cases in which some quantity is defined to be the product of two other quantities.
For example, a rectangle of width w has uniform height h, and the area A of the rectangle is given
by the formula A = wh. Unfortunately, in many applications, we cannot necessarily assume that
certain quantities such as height are constant, and therefore formulas such as A = wh cannot be
used directly. However, they can be used indirectly to solve more general problems by employing
the notation known as integral calculus.
Suppose we wish to compute the area of a shape that is not a rectangle. To simplify the
discussion, we assume that the shape is bounded by the vertical lines x = a and x = b, the x-axis,
and the curve defined by some continuous function y = f (x), where f (x) ≥ 0 for a ≤ x ≤ b. Then,
we can approximate this shape by n rectangles that have width ∆x = (b − a)/n and height f (xi ),
where x_i = a + iΔx, for i = 0, . . . , n. We obtain the approximation

A ≈ A_n = Σ_{i=1}^n f(x_i) Δx.

Intuitively, we can conclude that as n → ∞, the approximate area An will converge to the exact
area of the given region. This can be seen by observing that as n increases, the n rectangles defined
above comprise a more accurate approximation of the region.
More generally, suppose that for each n = 1, 2, . . ., we define the quantity Rn by choosing points
a = x_0 < x_1 < · · · < x_n = b, and computing the sum

R_n = Σ_{i=1}^n f(x_i*) Δx_i,   Δx_i = x_i − x_{i−1},   x_{i−1} ≤ x_i* ≤ x_i.

The sum that defines Rn is known as a Riemann sum. Note that the interval [a, b] need not
be divided into subintervals of equal width, and that f (x) can be evaluated at arbitrary points
belonging to each subinterval.
If f (x) ≥ 0 on [a, b], then Rn converges to the area under the curve y = f (x) as n → ∞,
provided that the widths of all of the subintervals [xi−1 , xi ], for i = 1, . . . , n, approach zero. This
behavior is ensured if we require that

$$\lim_{n\to\infty}\delta(n) = 0, \qquad \text{where } \delta(n) = \max_{1\le i\le n}\Delta x_i.$$

This condition is necessary because if it does not hold, then, as n → ∞, the region formed by the
n rectangles will not converge to the region whose area we wish to compute. If f assumes negative
values on [a, b], then, under the same conditions on the widths of the subintervals, Rn converges
to the net area between the graph of f and the x-axis, where area below the x-axis is counted
negatively.

Definition A.4.1 We define the definite integral of f (x) from a to b by


$$\int_a^b f(x)\,dx = \lim_{n\to\infty} R_n,$$
where the sequence of Riemann sums $\{R_n\}_{n=1}^\infty$ is defined so that δ(n) → 0 as n → ∞, as
in the previous discussion. The function f (x) is called the integrand, and the values a
and b are the lower and upper limits of integration, respectively. The process of computing
an integral is called integration.

In Chapter 7, we will study the problem of computing an approximation to the definite integral
of a given function f (x) over an interval [a, b]. We will learn a number of techniques for computing
such an approximation, and all of these techniques involve the computation of an appropriate
Riemann sum.
Exercise A.4.1 Let f(x) = e^{−x²}. Write a Matlab function that takes as input a parameter n, and computes the Riemann sum R_n for f(x) using n rectangles. Use a for loop to compute the Riemann sum. First use x_i^* = x_{i−1}, then use x_i^* = x_i, and then use x_i^* = (x_{i−1} + x_i)/2. For each case, compute Riemann sums for several values of n. What can you observe about the convergence of the Riemann sum in these three cases?
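A minimal sketch of one way to structure such a function is shown below, assuming the interval [a, b] = [0, 1] and the hypothetical function name riemannsum; the midpoint rule is active, and the commented alternatives give the left- and right-endpoint choices of x_i^*.

function R = riemannsum(n)
% Riemann sum for f(x) = exp(-x^2) on [0,1] using n rectangles
a = 0; b = 1;
dx = (b - a)/n;
R = 0;
for i = 1:n
    xstar = a + (i - 1/2)*dx;    % midpoint of [x_{i-1}, x_i]
    % xstar = a + (i-1)*dx;      % left endpoint
    % xstar = a + i*dx;          % right endpoint
    R = R + exp(-xstar^2)*dx;
end
end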

A.5 The Mean Value Theorem


While the derivative describes the behavior of a function at a point, we often need to understand
how the derivative influences a function’s behavior on an interval. This understanding is essential
in numerical analysis because it is often necessary to approximate a function f(x) by a function
g(x) using knowledge of f (x) and its derivatives at various points. It is therefore natural to ask
how well g(x) approximates f (x) away from these points.
The following result, a consequence of Fermat’s Theorem, gives limited insight into the rela-
tionship between the behavior of a function on an interval and the value of its derivative at a
point.

Theorem A.5.1 (Rolle's Theorem) If f is continuous on a closed interval [a, b] and is differentiable on the open interval (a, b), and if f(a) = f(b), then f'(c) = 0 for some number c in (a, b).

By applying Rolle’s Theorem to a function f , then to its derivative f 0 , its second derivative f 00 , and
so on, we obtain the following more general result, which will be useful in analyzing the accuracy
of methods for approximating functions by polynomials.

Theorem A.5.2 (Generalized Rolle's Theorem) Let x_0, x_1, x_2, ..., x_n be distinct points in an interval [a, b]. If f is n times differentiable on (a, b), and if f(x_i) = 0 for i = 0, 1, 2, ..., n, then f^{(n)}(c) = 0 for some number c in (a, b).

A more fundamental consequence of Rolle’s Theorem is the Mean Value Theorem itself, which
we now state.

Theorem A.5.3 (Mean Value Theorem) If f is continuous on a closed interval [a, b] and is differentiable on the open interval (a, b), then
$$\frac{f(b) - f(a)}{b - a} = f'(c)$$
for some number c in (a, b).

Remark The expression
$$\frac{f(b) - f(a)}{b - a}$$

is the slope of the secant line passing through the points (a, f (a)) and (b, f (b)). The Mean Value
Theorem therefore states that under the given assumptions, the slope of this secant line is equal to
the slope of the tangent line of f at the point (c, f (c)), where c ∈ (a, b). 2

The Mean Value Theorem has the following practical interpretation: the average rate of change of
y = f(x) with respect to x on an interval [a, b] is equal to the instantaneous rate of change of y with
respect to x at some point in (a, b).

A.5.1 The Mean Value Theorem for Integrals

Suppose that f (x) is a continuous function on an interval [a, b]. Then, by the Fundamental Theorem
of Calculus, f (x) has an antiderivative F (x) defined on [a, b] such that F 0 (x) = f (x). If we apply
the Mean Value Theorem to F (x), we obtain the following relationship between the integral of f
over [a, b] and the value of f at a point in (a, b).

Theorem A.5.4 (Mean Value Theorem for Integrals) If f is continuous on [a, b], then
$$\int_a^b f(x)\,dx = f(c)(b - a)$$
for some c in (a, b).

In other words, f assumes its average value over [a, b], defined by

$$f_{\mathrm{ave}} = \frac{1}{b - a}\int_a^b f(x)\,dx,$$

at some point in [a, b], just as the Mean Value Theorem states that the derivative of a function
assumes its average value over an interval at some point in the interval.
The Mean Value Theorem for Integrals is also a special case of the following more general result.

Theorem A.5.5 (Weighted Mean Value Theorem for Integrals) If f is continuous on [a, b], and g is a function that is integrable on [a, b] and does not change sign on [a, b], then
$$\int_a^b f(x)g(x)\,dx = f(c)\int_a^b g(x)\,dx$$
for some c in (a, b).

In the case where g(x) is a function that is easy to antidifferentiate and f (x) is not, this theorem
can be used to obtain an estimate of the integral of f (x)g(x) over an interval.

Example A.5.6 Let f (x) be continuous on the interval [a, b]. Then, for any x ∈ [a, b], by the
Weighted Mean Value Theorem for Integrals, we have
$$\int_a^x f(s)(s - a)\,ds = f(c)\int_a^x (s - a)\,ds = f(c)\,\frac{(s - a)^2}{2}\bigg|_a^x = f(c)\,\frac{(x - a)^2}{2},$$

where a < c < x. It is important to note that we can apply the Weighted Mean Value Theorem
because the function g(x) = (x − a) does not change sign on [a, b]. 2

A.6 Taylor’s Theorem


In many cases, it is useful to approximate a given function f (x) by a polynomial, because one can
work much more easily with polynomials than with other types of functions. As such, it is necessary
to have some insight into the accuracy of such an approximation. The following theorem, which is
a consequence of the Weighted Mean Value Theorem for Integrals, provides this insight.

Theorem A.6.1 (Taylor's Theorem) Let f be n times continuously differentiable on an interval [a, b], and suppose that f^{(n+1)} exists on [a, b]. Let x_0 ∈ [a, b]. Then, for any point x ∈ [a, b],
$$f(x) = P_n(x) + R_n(x),$$
where
$$P_n(x) = \sum_{j=0}^n \frac{f^{(j)}(x_0)}{j!}(x - x_0)^j = f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2}f''(x_0)(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n$$
and
$$R_n(x) = \int_{x_0}^x \frac{f^{(n+1)}(s)}{n!}(x - s)^n\,ds = \frac{f^{(n+1)}(\xi(x))}{(n+1)!}(x - x_0)^{n+1},$$
where ξ(x) is between x_0 and x.

The polynomial Pn (x) is the nth Taylor polynomial of f with center x0 , and the expression Rn (x)
is called the Taylor remainder of Pn (x). When the center x0 is zero, the nth Taylor polynomial is
also known as the nth Maclaurin polynomial.

Exercise A.6.1 Plot the graph of f (x) = cos x on the interval [0, π], and then use the
hold command to include the graphs of P0 (x), P2 (x), and P4 (x), the Maclaurin polyno-
mials of degree 0, 2 and 4, respectively, on the same plot but with different colors and line
styles. To what extent do these Taylor polynomials agree with f (x)?
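A sketch of one possible approach follows; the colors and line styles are arbitrary choices.

x = linspace(0, pi, 200);
plot(x, cos(x), 'k-')
hold on
plot(x, ones(size(x)), 'r--')             % P_0(x) = 1
plot(x, 1 - x.^2/2, 'b-.')                % P_2(x)
plot(x, 1 - x.^2/2 + x.^4/24, 'g:')       % P_4(x)
legend('cos x', 'P_0', 'P_2', 'P_4')
hold off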

The final form of the remainder is obtained by applying the Mean Value Theorem for Integrals
to the integral form. As Pn (x) can be used to approximate f (x), the remainder Rn (x) is also
referred to as the truncation error of Pn (x). The accuracy of the approximation on an interval can
be analyzed by using techniques for finding the extreme values of functions to bound the (n + 1)-st
derivative on the interval.
Because approximation of functions by polynomials is employed in the development and analysis
of many techniques in numerical analysis, the usefulness of Taylor’s Theorem cannot be overstated.
In fact, it can be said that Taylor’s Theorem is the Fundamental Theorem of Numerical Analysis,
just as the theorem describing the inverse relationship between derivatives and integrals is called the
Fundamental Theorem of Calculus.
The following examples illustrate how the nth-degree Taylor polynomial Pn (x) and the remain-
der Rn (x) can be computed for a given function f (x).

Example A.6.2 If we set n = 1 in Taylor’s Theorem, then we have

f (x) = P1 (x) + R1 (x)

where
P1 (x) = f (x0 ) + f 0 (x0 )(x − x0 ).
This polynomial is a linear function that describes the tangent line to the graph of f at the point
(x0 , f (x0 )).
If we set n = 0 in the theorem, then we obtain

f (x) = P0 (x) + R0 (x),

where
P0 (x) = f (x0 )
and
$$R_0(x) = f'(\xi(x))(x - x_0),$$
where ξ(x) lies between x_0 and x. If we use the integral form of the remainder,
$$R_n(x) = \int_{x_0}^x \frac{f^{(n+1)}(s)}{n!}(x - s)^n\,ds,$$
then we have
$$f(x) = f(x_0) + \int_{x_0}^x f'(s)\,ds,$$

which is equivalent to the Total Change Theorem and part of the Fundamental Theorem of Calculus.
Using the Mean Value Theorem for integrals, we can see how the first form of the remainder can
be obtained from the integral form. 2

Example A.6.3 Let f (x) = sin x. Then

f (x) = P3 (x) + R3 (x),

where
$$P_3(x) = x - \frac{x^3}{3!} = x - \frac{x^3}{6},$$
and
$$R_3(x) = \frac{1}{4!}x^4\sin\xi(x) = \frac{1}{24}x^4\sin\xi(x),$$
where ξ(x) is between 0 and x. The polynomial P3 (x) is the 3rd Maclaurin polynomial of sin x, or
the 3rd Taylor polynomial with center x0 = 0.
If x ∈ [−1, 1], then
$$|R_3(x)| = \left|\frac{1}{24}x^4\sin\xi(x)\right| = \frac{1}{24}|x^4||\sin\xi(x)| \le \frac{1}{24},$$
since |sin x| ≤ 1 for all x. This bound on |R_3(x)| serves as an upper bound for the error in the approximation of sin x by P_3(x) for x ∈ [−1, 1]. 2

Exercise A.6.2 On the interval [−1, 1], plot f(x) = sin x and its Taylor polynomial of degree 3 centered at x_0 = 0, P_3(x). How do they compare? In a separate figure window, plot the error R_3(x) = f(x) − P_3(x) on [−1, 1], and also plot the lines y = ±1/24, corresponding to the upper bound on |R_3(x)| from the preceding example. Confirm that the error actually does satisfy this bound.
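One possible sketch, with illustrative figure numbers and line styles:

x = linspace(-1, 1, 200);
P3 = x - x.^3/6;                           % degree-3 Taylor polynomial of sin x
figure(1), plot(x, sin(x), 'k-', x, P3, 'r--'), legend('sin x', 'P_3')
figure(2), plot(x, sin(x) - P3, 'b-')      % the error R_3(x)
hold on
plot(x, (1/24)*ones(size(x)), 'k--', x, -(1/24)*ones(size(x)), 'k--')
hold off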

Example A.6.4 Let f (x) = ex . Then

f (x) = P2 (x) + R2 (x),

where
$$P_2(x) = 1 + x + \frac{x^2}{2},$$
and
$$R_2(x) = \frac{x^3}{6}e^{\xi(x)},$$
6
where ξ(x) is between 0 and x. The polynomial P2 (x) is the 2nd Maclaurin polynomial of ex , or
the 2nd Taylor polynomial with center x0 = 0.
If x > 0, then R2 (x) can become quite large, whereas its magnitude is much smaller if x < 0.
Therefore, one method of computing ex using a Maclaurin polynomial is to use the nth Maclaurin
polynomial Pn (x) of ex when x < 0, where n is chosen sufficiently large so that Rn (x) is small for
the given value of x. If x > 0, then we instead compute e−x using the nth Maclaurin polynomial
for e−x , which is given by
$$P_n(x) = 1 - x + \frac{x^2}{2} - \frac{x^3}{6} + \cdots + \frac{(-1)^n x^n}{n!},$$
and then obtain an approximation to e^x by taking the reciprocal of our computed value of e^{−x}.
2
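A minimal sketch of this approach, assuming a fixed value of x > 0 and a fixed degree n (in practice, n would be chosen from a bound on the remainder):

x = 2; n = 10;
s = 0;
for j = 0:n
    s = s + (-x)^j/factorial(j);   % Maclaurin polynomial of e^{-x}, evaluated at x
end
approx = 1/s                        % approximation to e^x via the reciprocal
err = abs(approx - exp(x))          % compare with the built-in exponential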

Example A.6.5 Let f (x) = x2 . Then, for any real number x0 ,

f (x) = P1 (x) + R1 (x),

where
$$P_1(x) = x_0^2 + 2x_0(x - x_0) = 2x_0 x - x_0^2,$$
and
$$R_1(x) = (x - x_0)^2.$$
Note that the remainder does not include a “mystery point” ξ(x) since the 2nd derivative of x2 is
only a constant. The linear function P1 (x) describes the tangent line to the graph of f (x) at the
point (x0 , f (x0 )). If x0 = 1, then we have

P1 (x) = 2x − 1,

and
R1 (x) = (x − 1)2 .
We can see that near x = 1, P1 (x) is a reasonable approximation to x2 , since the error in this
approximation, given by R1 (x), would be small in this case. 2

Taylor’s theorem can be generalized to functions of several variables, using partial derivatives.
Here, we consider the case of two independent variables.

Theorem A.6.6 (Taylor's Theorem in Two Variables) Let f(t, y) be (n + 1) times continuously differentiable on a convex set D, and let (t_0, y_0) ∈ D. Then, for every (t, y) ∈ D, there exists ξ between t_0 and t, and µ between y_0 and y, such that
$$f(t, y) = P_n(t, y) + R_n(t, y),$$
where P_n(t, y) is the nth Taylor polynomial of f about (t_0, y_0),
$$P_n(t, y) = f(t_0, y_0) + \left[(t - t_0)\frac{\partial f}{\partial t}(t_0, y_0) + (y - y_0)\frac{\partial f}{\partial y}(t_0, y_0)\right] + \left[\frac{(t - t_0)^2}{2}\frac{\partial^2 f}{\partial t^2}(t_0, y_0) + (t - t_0)(y - y_0)\frac{\partial^2 f}{\partial t\,\partial y}(t_0, y_0) + \frac{(y - y_0)^2}{2}\frac{\partial^2 f}{\partial y^2}(t_0, y_0)\right] + \cdots + \left[\frac{1}{n!}\sum_{j=0}^n \binom{n}{j}(t - t_0)^{n-j}(y - y_0)^j\,\frac{\partial^n f}{\partial t^{n-j}\,\partial y^j}(t_0, y_0)\right],$$
and R_n(t, y) is the remainder term associated with P_n(t, y),
$$R_n(t, y) = \frac{1}{(n+1)!}\sum_{j=0}^{n+1} \binom{n+1}{j}(t - t_0)^{n+1-j}(y - y_0)^j\,\frac{\partial^{n+1} f}{\partial t^{n+1-j}\,\partial y^j}(\xi, \mu).$$
Appendix B

Review of Linear Algebra

B.1 Matrices
Writing a system of equations can be quite tedious. Therefore, we instead represent a system of
linear equations using a matrix, which is an array of elements, or entries. We say that a matrix A
is m × n if it has m rows and n columns, and we denote the element in row i and column j by aij .
We also denote the matrix A by [aij ].
With this notation, a general system of m equations with n unknowns can be represented using
a matrix A that contains the coefficients of the equations, a vector x that contains the unknowns,
and a vector b that contains the quantities on the right-hand sides of the equations. Specifically,
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}.$$

Note that the vectors x and b are represented by column vectors.


Example The coefficients in the linear system

3x1 + 2x2 = 4,
−x1 + 5x2 = −3

can be represented by the matrix
$$A = \begin{bmatrix} 3 & 2 \\ -1 & 5 \end{bmatrix}.$$
The coefficient of x2 in the first equation is represented by the entry in the first row and second
column of A, which is a12 = 2. 2

B.2 Vector Spaces


Matrices are much more than notational conveniences for writing systems of linear equations. A
matrix A can also be used to represent a linear function fA whose domain and range are both sets of vectors called vector spaces. A vector space over a field (such as the field of real or complex
numbers) is a set of vectors, together with two operations: addition of vectors, and multiplication
of a vector by a scalar from the field.
Specifically, if u and v are vectors belonging to a vector space V over a field F , then the sum
of u and v, denoted by u + v, is a vector in V , and the scalar product of u with a scalar α in F ,
denoted by αu, is also a vector in V . These operations have the following properties:
• Commutativity: For any vectors u and v in V ,
u+v =v+u

• Associativity: For any vectors u, v and w in V ,


(u + v) + w = u + (v + w)

• Identity element for vector addition: There is a vector 0, known as the zero vector, such that
for any vector u in V ,
u+0=0+u=u

• Additive inverse: For any vector u in V , there is a unique vector −u in V such that
u + (−u) = −u + u = 0

• Distributivity over vector addition: For any vectors u and v in V , and scalar α in F ,
α(u + v) = αu + αv

• Distributivity over scalar multiplication: For any vector u in V , and scalars α and β in F ,
(α + β)u = αu + βu

• Associativity of scalar multiplication: For any vector u in V and any scalars α and β in F ,
α(βu) = (αβ)u

• Identity element for scalar multiplication: For any vector u in V ,


1u = u

Example Let V be the vector space R^3. The vector addition operation on V consists of adding corresponding components of vectors, as in
$$\begin{bmatrix} 3 \\ 0 \\ 1 \end{bmatrix} + \begin{bmatrix} -2 \\ 4 \\ 5 \end{bmatrix} = \begin{bmatrix} 1 \\ 4 \\ 6 \end{bmatrix}.$$
The scalar multiplication operation consists of scaling each component of a vector by a scalar:
$$\frac{1}{2}\begin{bmatrix} 3 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 3/2 \\ 0 \\ 1/2 \end{bmatrix}.$$
2

B.3 Subspaces
Before we can explain how matrices can be used to easily describe linear transformations, we must
introduce some important concepts related to vector spaces.
A subspace of a vector space V is a subset of V that is, itself, a vector space. In particular, a
subset S of V is also a subspace if it is closed under the operations of vector addition and scalar
multiplication. That is, if u and v are vectors in S, then the vectors u + v and αu, where α is any
scalar, must also be in S. In particular, S cannot be a subspace unless it includes the zero vector.
Example The set S of all vectors in R^3 of the form
$$x = \begin{bmatrix} x_1 \\ x_2 \\ 0 \end{bmatrix},$$
where x1 , x2 ∈ R, is a subspace of R3 , as the sum of any two vectors in S, or a scalar multiple of
any vector in S, must have a third component of zero, and therefore is also in S.
On the other hand, the set S̃ consisting of all vectors in R3 that have a third component of 1 is
not a subspace of R3 , as the sum of vectors in S̃ will not have a third component of 1, and therefore
will not be in S̃. That is, S̃ is not closed under addition. 2

B.4 Linear Independence and Bases


Often a vector space or subspace can be characterized as the set of all vectors that can be obtained
by adding and/or scaling members of a given set of specific vectors. For example, R2 can be
described as the set of all vectors that can be obtained by adding and/or scaling the vectors
$$e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$
These vectors comprise what is known as the standard basis of R2 .
More generally, given a set of vectors {v1 , v2 , . . . , vk } from a vector space V , a vector v ∈ V is
called a linear combination of v_1, v_2, ..., v_k if there exist constants c_1, c_2, ..., c_k such that
$$v = c_1 v_1 + c_2 v_2 + \cdots + c_k v_k = \sum_{i=1}^k c_i v_i.$$

We then define the span of {v1 , v2 , . . . , vk }, denoted by span {v1 , v2 , . . . , vk }, to be the set of all
linear combinations of v1 , v2 , . . ., vk . From the definition of a linear combination, it follows that
this set is a subspace of V .
Example Let
$$v_1 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 3 \\ 4 \\ 0 \end{bmatrix}, \qquad v_3 = \begin{bmatrix} -1 \\ 2 \\ 1 \end{bmatrix}.$$
Then the vector
$$v = \begin{bmatrix} 6 \\ 10 \\ 2 \end{bmatrix}$$
is a linear combination of v1 , v2 and v3 , as

v = v1 + 2v2 + v3 .

2
When a subspace is defined as the span of a set of vectors, it is helpful to know whether the set
includes any vectors that are, in some sense, redundant, for if this is the case, the description of
the subspace can be simplified. To that end, we say that a set of vectors {v1 , v2 , . . . , vk } is linearly
independent if the equation
c1 v1 + c2 v2 + · · · + ck vk = 0
holds if and only if c1 = c2 = · · · = ck = 0. Otherwise, we say that the set is linearly dependent.
If the set is linearly independent, then any vector v in the span of the set is a unique linear
combination of members of the set; that is, there is only one way to choose the coefficients of a
linear combination that is used to obtain v.
Example The subspace S of R^3 defined by
$$S = \left\{ \begin{bmatrix} x_1 \\ x_2 \\ 0 \end{bmatrix} : x_1, x_2 \in \mathbb{R} \right\}$$
can be described as
$$S = \mathrm{span}\left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \right\}$$
or
$$S = \mathrm{span}\left\{ \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix} \right\},$$
but
$$S \ne \mathrm{span}\left\{ \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ -1 \\ 0 \end{bmatrix} \right\},$$

as these vectors only span the subspace of vectors whose first two components are equal, and whose
third component is zero, which does not account for every vector in S. It should be noted that the
two vectors in the third set are linearly dependent, while the pairs of vectors in the previous two
sets are linearly independent. 2
Example The vectors
$$v_1 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}$$
are linearly independent. It follows that the only way in which the vector
$$v = \begin{bmatrix} 3 \\ 1 \\ 2 \end{bmatrix}$$
can be expressed as a linear combination of v_1 and v_2 is
$$v = 2v_1 + v_2.$$

On the other hand, if


     
$$v_1 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 2 \\ 2 \\ 0 \end{bmatrix}, \qquad v = \begin{bmatrix} 3 \\ 3 \\ 0 \end{bmatrix},$$

then, because v1 and v2 are linearly dependent, any linear combination of the form c1 v1 + c2 v2 ,
such that c1 + 2c2 = 3, will equal v. 2
Given a vector space V , if there exists a set of vectors {v1 , v2 , . . . , vk } such that V is the span
of {v1 , v2 , . . . , vk }, and {v1 , v2 , . . . , vk } is linearly independent, then we say that {v1 , v2 , . . . , vk }
is a basis of V . Any basis of V must have the same number of elements, k. We call this number
the dimension of V , which is denoted by dim(V ).
Example The standard basis of R^3 is
$$e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad e_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad e_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.$$
The set
$$v_1 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}, \qquad v_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

is also a basis for R3 , as it consists of three linearly independent vectors, and the dimension of R3
is three. 2

B.5 Linear Transformations


A function fA : V → W , whose domain V and range W are vector spaces over a field F , is a linear
transformation if it has the properties

fA (x + y) = fA (x) + fA (y), fA (αx) = αfA (x),

where x and y are vectors in V and α is a scalar from F . If V and W are the same vector space,
then we say that fA is a linear operator on V .

B.5.1 The Matrix of a Linear Transformation


If V is a vector space of dimension n over a field, such as Rn or Cn , and W is a vector space of
dimension m, then a linear transformation fA with domain V and range W can be represented by
an m × n matrix A whose entries belong to the field.

Suppose that the set of vectors {v1 , v2 , . . . , vn } is a basis for V , and the set {w1 , w2 , . . . , wm }
is a basis for W . Then, aij is the scalar by which wi is multiplied when applying the function fA
to the vector v_j. That is,
$$f_A(v_j) = a_{1j}w_1 + a_{2j}w_2 + \cdots + a_{mj}w_m = \sum_{i=1}^m a_{ij}w_i.$$

In other words, the jth column of A describes the image under fA of the vector vj , in terms of the
coefficients of fA (vj ) in the basis {w1 , w2 , . . . , wm }.
If V and W are spaces of real or complex vectors, then, by convention, the bases $\{v_j\}_{j=1}^n$ and $\{w_i\}_{i=1}^m$ are each chosen to be the standard bases for R^n and R^m, respectively. The jth vector in
the standard basis is a vector whose components are all zero, except for the jth component, which
is equal to one. These vectors are called the standard basis vectors of an n-dimensional space of
real or complex vectors, and are denoted by ej . From this point on, we will generally assume that
V is Rn , and that the field is R, for simplicity.
Example The standard basis for R^3 consists of the vectors
$$e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad e_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad e_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.$$
2

B.5.2 Matrix-Vector Multiplication


To describe the action of A on a general vector x from V, we can write
$$x = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n = \sum_{j=1}^n x_j e_j.$$

Then, because A represents a linear function,
$$f_A(x) = \sum_{j=1}^n x_j f_A(e_j) = \sum_{j=1}^n x_j a_j,$$

where aj is the jth column of A.


We define the vector y = fA (x) above to be the matrix-vector product of A and x, which we
denote by Ax. Each element of the vector y = Ax is given by
$$y_i = [Ax]_i = a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n = \sum_{j=1}^n a_{ij}x_j.$$

From this definition, we see that the jth column of A is equal to the matrix-vector product Aej .
Example Let
$$A = \begin{bmatrix} 3 & 0 & -1 \\ 1 & -4 & 2 \\ 5 & 1 & -3 \end{bmatrix}, \qquad x = \begin{bmatrix} 10 \\ 11 \\ 12 \end{bmatrix}.$$
Then
$$Ax = 10\begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix} + 11\begin{bmatrix} 0 \\ -4 \\ 1 \end{bmatrix} + 12\begin{bmatrix} -1 \\ 2 \\ -3 \end{bmatrix} = \begin{bmatrix} 18 \\ -10 \\ 25 \end{bmatrix}.$$
We see that Ax is a linear combination of the columns of A, with the coefficients of the linear
combination obtained from the components of x. 2
Example Let
$$A = \begin{bmatrix} 3 & 1 \\ 1 & 0 \\ 2 & 4 \end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.$$
Then the matrix-vector product of A and x is
$$y = Ax = \begin{bmatrix} 3(1) + 1(-1) \\ 1(1) + 0(-1) \\ 2(1) + 4(-1) \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ -2 \end{bmatrix}.$$
2
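The following sketch verifies the first example above in Matlab, computing Ax both with the built-in product and as a linear combination of the columns of A.

A = [ 3 0 -1; 1 -4 2; 5 1 -3 ];
x = [ 10; 11; 12 ];
y1 = A*x                                      % built-in matrix-vector product
y2 = x(1)*A(:,1) + x(2)*A(:,2) + x(3)*A(:,3)  % same result, column by column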

B.5.3 Special Subspaces


Let A be an m × n matrix. Then the range of A, denoted by ran(A), is the set of all vectors of the
form y = Ax, where x ∈ Rn . It follows that ran(A) is the span of the columns of A, which is also
called the column space of A.
The dimension of ran(A) is called the column rank of A. Similarly, the dimension of the row
space of A is called the row rank of A. It can be shown that the row rank and column rank are
equal; this common value is simply called the rank of A, and is denoted by rank(A). We say that
A is rank-deficient if rank(A) < min{m, n}; otherwise, we say that A has full rank. It is interesting
to note that any outer product of vectors has rank one.
The null space of A, denoted by null(A), is the set of all vectors x ∈ Rn such that Ax = 0. Its
dimension is called the nullity of A. It can be shown that for an m × n matrix A,

dim(null(A)) + rank(A) = n.

B.6 Matrix-Matrix Multiplication


It follows from this definition that a general system of m linear equations in n unknowns can be
described in matrix-vector form by the equation

Ax = b,

where Ax is a matrix-vector product of the m × n coefficient matrix A and the vector of unknowns
x, and b is the vector of right-hand side values.
Of course, if m = n = 1, the system of equations Ax = b reduces to the scalar linear equation
ax = b, which has the solution x = a−1 b, provided that a 6= 0. As a−1 is the unique number
such that a−1 a = aa−1 = 1, it is desirable to generalize the concepts of multiplication and identity
element to square matrices, for which m = n.

The matrix-vector product can be used to define the composition of linear functions represented
by matrices. Let A be an m × n matrix, and let B be an n × p matrix. Then, if x is a vector of length p and y = Bx, we have
$$Ay = A(Bx) = (AB)x = Cx,$$
where C is an m × p matrix with entries
$$c_{ij} = \sum_{k=1}^n a_{ik}b_{kj}.$$
We define the matrix product of A and B to be the matrix C = AB with entries defined in
this manner. It should be noted that the product BA is not defined, unless m = p. Even if this
is the case, in general, AB 6= BA. That is, matrix multiplication is not commutative. However,
matrix multiplication is associative, meaning that if A is m × n, B is n × p, and C is p × k, then
A(BC) = (AB)C.
Example Consider the 2 × 2 matrices
$$A = \begin{bmatrix} 1 & -2 \\ -3 & 4 \end{bmatrix}, \qquad B = \begin{bmatrix} -5 & 6 \\ 7 & -8 \end{bmatrix}.$$
Then
$$AB = \begin{bmatrix} 1 & -2 \\ -3 & 4 \end{bmatrix}\begin{bmatrix} -5 & 6 \\ 7 & -8 \end{bmatrix} = \begin{bmatrix} 1(-5) - 2(7) & 1(6) - 2(-8) \\ -3(-5) + 4(7) & -3(6) + 4(-8) \end{bmatrix} = \begin{bmatrix} -19 & 22 \\ 43 & -50 \end{bmatrix},$$
whereas
$$BA = \begin{bmatrix} -5 & 6 \\ 7 & -8 \end{bmatrix}\begin{bmatrix} 1 & -2 \\ -3 & 4 \end{bmatrix} = \begin{bmatrix} -5(1) + 6(-3) & -5(-2) + 6(4) \\ 7(1) - 8(-3) & 7(-2) - 8(4) \end{bmatrix} = \begin{bmatrix} -23 & 34 \\ 31 & -46 \end{bmatrix}.$$
We see that AB ≠ BA. 2
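A quick Matlab check of this example confirms that the two products differ:

A = [ 1 -2; -3 4 ];
B = [ -5 6; 7 -8 ];
A*B                  % [ -19 22; 43 -50 ]
B*A                  % [ -23 34; 31 -46 ]
isequal(A*B, B*A)    % logical 0 (false)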
Example If
$$A = \begin{bmatrix} 3 & 1 \\ 1 & 0 \\ 2 & 4 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 5 \\ 4 & -1 \end{bmatrix},$$
then the matrix-matrix product of A and B is
$$C = AB = \begin{bmatrix} 3(1) + 1(4) & 3(5) + 1(-1) \\ 1(1) + 0(4) & 1(5) + 0(-1) \\ 2(1) + 4(4) & 2(5) + 4(-1) \end{bmatrix} = \begin{bmatrix} 7 & 14 \\ 1 & 5 \\ 18 & 6 \end{bmatrix}.$$
It does not make sense to compute BA, because the dimensions are incompatible. 2

B.7 Other Fundamental Matrix Operations


B.7.1 Vector Space Operations
The set of all matrices of size m × n, for fixed m and n, is itself a vector space of dimension mn.
The operations of vector addition and scalar multiplication for matrices are defined as follows: If
A and B are m × n matrices, then the sum of A and B, denoted by A + B, is the m × n matrix C
with entries
cij = aij + bij .
If α is a scalar, then the product of α and an m × n matrix A, denoted by αA, is the m × n matrix
B with entries
bij = αaij .
It is natural to identify m×n matrices with vectors of length mn, in the context of these operations.
Matrix addition and scalar multiplication have properties analogous to those of vector addition
and scalar multiplication. In addition, matrix multiplication has the following properties related
to these operations. We assume that A is an m × n matrix, B and D are n × k matrices, and α is
a scalar.
• Distributivity: A(B + D) = AB + AD
• Commutativity of scalar multiplication: α(AB) = (αA)B = A(αB)

B.7.2 The Transpose of a Matrix


An n × n matrix A is said to be symmetric if aij = aji for i, j = 1, 2, . . . , n. The n × n matrix
B whose entries are defined by bij = aji is called the transpose of A, which we denote by AT .
Therefore, A is symmetric if A = A^T. More generally, if A is an m × n matrix, then A^T is the n × m matrix B whose entries are defined by b_{ij} = a_{ji}. The transpose has the following properties:
• (AT )T = A
• (A + B)T = AT + B T
• (AB)T = B T AT

Example If
$$A = \begin{bmatrix} 3 & 1 \\ 1 & 0 \\ 2 & 4 \end{bmatrix},$$
then
$$A^T = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 0 & 4 \end{bmatrix}.$$
2
Example Let A be the matrix from a previous example,
$$A = \begin{bmatrix} 3 & 0 & -1 \\ 1 & -4 & 2 \\ 5 & 1 & -3 \end{bmatrix}.$$
Then
$$A^T = \begin{bmatrix} 3 & 1 & 5 \\ 0 & -4 & 1 \\ -1 & 2 & -3 \end{bmatrix}.$$
It follows that
$$A + A^T = \begin{bmatrix} 3+3 & 0+1 & -1+5 \\ 1+0 & -4-4 & 2+1 \\ 5-1 & 1+2 & -3-3 \end{bmatrix} = \begin{bmatrix} 6 & 1 & 4 \\ 1 & -8 & 3 \\ 4 & 3 & -6 \end{bmatrix}.$$
This matrix is symmetric. This can also be seen by the properties of the transpose, since
$$(A + A^T)^T = A^T + (A^T)^T = A^T + A = A + A^T.$$
2
Example The matrix
$$A = \begin{bmatrix} 3 & 1 & 5 \\ 1 & 2 & 0 \\ 5 & 0 & 4 \end{bmatrix}$$
is symmetric, while
$$B = \begin{bmatrix} 0 & 1 & 2 \\ -1 & 0 & -3 \\ -2 & 3 & 0 \end{bmatrix}$$
is skew-symmetric, meaning that B^T = −B. 2

B.7.3 Inner and Outer Products


We now define several other operations on matrices and vectors that will be useful in our study of
numerical linear algebra. For simplicity, we work with real vectors and matrices.
Given two vectors x and y in R^n, the dot product, or inner product, of x and y is the scalar
$$x^T y = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = \sum_{i=1}^n x_i y_i,$$
where
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$
Note that x and y must both be defined to be column vectors, and they must have the same length.
If xT y = 0, then we say that x and y are orthogonal.
Let x ∈ R^m and y ∈ R^n, where m and n are not necessarily equal. The term "inner product" suggests the existence of another operation called the outer product, which is defined by
$$xy^T = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{bmatrix}.$$

Note that whereas the inner product is a scalar, the outer product is an m × n matrix.
Example Let
$$x = \begin{bmatrix} 1 \\ 0 \\ 2 \end{bmatrix}, \qquad y = \begin{bmatrix} 4 \\ -1 \\ 3 \end{bmatrix}.$$
Then the inner (dot) product of x and y is
$$x^T y = 1(4) + 0(-1) + 2(3) = 10,$$
while the outer product of x and y is
$$xy^T = \begin{bmatrix} 1(4) & 1(-1) & 1(3) \\ 0(4) & 0(-1) & 0(3) \\ 2(4) & 2(-1) & 2(3) \end{bmatrix} = \begin{bmatrix} 4 & -1 & 3 \\ 0 & 0 & 0 \\ 8 & -2 & 6 \end{bmatrix}.$$
2
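In Matlab, with x and y stored as column vectors, the two products from this example can be computed as follows.

x = [ 1; 0; 2 ];
y = [ 4; -1; 3 ];
x'*y    % inner (dot) product: 10
x*y'    % outer product: a 3 x 3 rank-one matrix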
Example Let
$$A = \begin{bmatrix} 1 & -3 & 7 \\ 2 & 5 & -8 \\ 4 & -6 & -9 \end{bmatrix}.$$
To change a_{11} from 1 to 10, we can perform the outer product update B = A + (10 − 1)e_1 e_1^T. Similarly, the outer product update C = B + 5e_2 e_1^T adds 5 to b_{21}, resulting in the matrix
$$C = \begin{bmatrix} 10 & -3 & 7 \\ 7 & 5 & -8 \\ 4 & -6 & -9 \end{bmatrix}.$$
Note that
$$e_2 e_1^T = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\begin{bmatrix} 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$
2

B.7.4 Hadamard Product

If x, y ∈ Rn , the Hadamard product, or componentwise product, of x and y, denoted by x ◦ y or


x. ∗ y, is the vector z obtained by multiplying corresponding components of x and y. That is, if
z = x .* y, then z_i = x_i y_i, for i = 1, 2, ..., n.
Example If x = [1  −2]^T and y = [−3  4]^T, then
$$x^T y = 1(-3) + (-2)(4) = -11, \qquad xy^T = \begin{bmatrix} 1(-3) & 1(4) \\ -2(-3) & -2(4) \end{bmatrix} = \begin{bmatrix} -3 & 4 \\ 6 & -8 \end{bmatrix},$$
and
$$x .\!* y = \begin{bmatrix} 1(-3) \\ -2(4) \end{bmatrix} = \begin{bmatrix} -3 \\ -8 \end{bmatrix}.$$
2
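The following sketch contrasts the three products from this example in Matlab.

x = [ 1; -2 ];
y = [ -3; 4 ];
x'*y    % inner product: -11
x*y'    % outer product: [ -3 4; 6 -8 ]
x.*y    % Hadamard (componentwise) product: [ -3; -8 ]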

B.7.5 Kronecker Product


B.7.6 Partitioning

It is useful to describe matrices as collections of row or column vectors. Specifically, a row partition
of an m × n matrix A is a description of A as a "stack" of row vectors r_1^T, r_2^T, ..., r_m^T. That is,
$$A = \begin{bmatrix} r_1^T \\ r_2^T \\ \vdots \\ r_m^T \end{bmatrix}.$$

On the other hand, we can view A as a "concatenation" of column vectors c_1, c_2, ..., c_n:
$$A = \begin{bmatrix} c_1 & c_2 & \cdots & c_n \end{bmatrix}.$$

This description of A is called a column partition.


Example If
$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix},$$
then a column partitioning of A is
$$A = \begin{bmatrix} c_1 & c_2 \end{bmatrix}, \qquad c_1 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \qquad c_2 = \begin{bmatrix} 2 \\ 4 \end{bmatrix}.$$
A row partitioning of A is
$$A = \begin{bmatrix} r_1^T \\ r_2^T \end{bmatrix}, \qquad r_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \qquad r_2 = \begin{bmatrix} 3 \\ 4 \end{bmatrix}.$$
2

B.8 Understanding Matrix-Matrix Multiplication


The fundamental operation of matrix-matrix multiplication can be understood in three different
ways, based on other operations that can be performed on matrices and vectors. Let A be an m × n
matrix, and B be an n × p matrix, in which case C = AB is an m × p matrix. We can then view
the computation of C in the following ways:

• Dot product: each entry cij is the dot product of the ith row of A and the jth column of B.

• Matrix-vector multiplication: the jth column of C is a linear combination of the columns of A, where the coefficients are obtained from the jth column of B. That is, if
$$C = \begin{bmatrix} c_1 & c_2 & \cdots & c_p \end{bmatrix}, \qquad B = \begin{bmatrix} b_1 & b_2 & \cdots & b_p \end{bmatrix}$$
are column partitions of C and B, then c_j = Ab_j, for j = 1, 2, ..., p.



• Outer product: given the partitions
$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}, \qquad B = \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_n^T \end{bmatrix},$$
we can write
$$C = a_1 b_1^T + a_2 b_2^T + \cdots + a_n b_n^T = \sum_{i=1}^n a_i b_i^T.$$
That is, C is a sum of outer product updates, as illustrated in the sketch below.
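The sketch below illustrates the three views for randomly generated matrices; all three agree with the built-in product A*B. The sizes m, n and p are arbitrary choices for illustration.

m = 4; n = 3; p = 5;
A = rand(m, n); B = rand(n, p);
C1 = zeros(m, p);                 % dot product view
for i = 1:m
    for j = 1:p
        C1(i,j) = A(i,:)*B(:,j);
    end
end
C2 = zeros(m, p);                 % matrix-vector (column) view
for j = 1:p
    C2(:,j) = A*B(:,j);
end
C3 = zeros(m, p);                 % outer product view
for k = 1:n
    C3 = C3 + A(:,k)*B(k,:);
end
norm(C1 - A*B), norm(C2 - A*B), norm(C3 - A*B)   % all essentially zero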

B.8.1 The Identity Matrix


When n = 1, the identity element of 1 × 1 matrices, the number 1, is the unique number such that
a(1) = 1(a) = a for any number a. To determine the identity element for n × n matrices, we seek
a matrix I such that AI = IA = A for any n × n matrix A. That is, we must have
$$\sum_{k=1}^n a_{ik}I_{kj} = a_{ij}, \qquad i, j = 1, \ldots, n.$$

This can only be the case for any matrix A if I_{jj} = 1 for j = 1, 2, ..., n, and I_{ij} = 0 when i ≠ j. We call this matrix the identity matrix
$$I = \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0 \\ 0 & 1 & 0 & & \vdots \\ \vdots & 0 & \ddots & \ddots & \vdots \\ \vdots & & \ddots & 1 & 0 \\ 0 & \cdots & \cdots & 0 & 1 \end{bmatrix}.$$

Note that the jth column of I is the standard basis vector ej .

B.8.2 The Inverse of a Matrix


Given an n × n matrix A, it is now natural to ask whether it is possible to find an n × n matrix B
such that AB = BA = I. Such a matrix, if it exists, would then serve as the inverse of A, in the
sense of matrix multiplication. We denote this matrix by A−1 , just as we denote the multiplicative
inverse of a nonzero number a by a−1 . If the inverse of A exists, we say that A is invertible or
nonsingular; otherwise, we say that A is singular. If A−1 exists, then we can use it to describe the
solution of the system of linear equations Ax = b, for

A−1 Ax = (A−1 A)x = Ix = x = A−1 b,

which generalizes the solution x = a−1 b of a single linear equation in one unknown.

However, just as we can use the inverse to describe the solution to a system of linear equations,
we can use systems of linear equations to characterize the inverse. Because A−1 satisfies AA−1 = I,
it follows from multiplication of both sides of this equation by the jth standard basis vector ej that
Abj = ej , j = 1, 2, . . . , n,
where bj = A−1 ej is the jth column of B = A−1 . That is, we can compute A−1 by solving n
systems of linear equations of the form Abj = ej , using a method such as Gaussian elimination
and back substitution. If Gaussian elimination fails due to the inability to obtain a nonzero pivot
element for each column, then A−1 does not exist, and we conclude that A is singular.
The inverse of a nonsingular matrix A has the following properties:
• A−1 is unique.
• A−1 is nonsingular, and (A−1 )−1 = A.
• If B is also a nonsingular n × n matrix, then (AB)−1 = B −1 A−1 .
• (A−1 )T = (AT )−1 . It is common practice to denote the transpose of A−1 by A−T .
Because the set of all n × n matrices has an identity element, matrix multiplication is associative,
and each nonsingular n × n matrix has a unique inverse with respect to matrix multiplication that
is also an n × n nonsingular matrix, this set forms a group, which is denoted by GL(n), the general
linear group.
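As a sketch of the column-by-column characterization of the inverse described above, the following Matlab code solves Ab_j = e_j for each j using the backslash operator (which performs Gaussian elimination); the particular matrix A is an arbitrary nonsingular example.

A = [ 3 0 -1; 1 -4 2; 5 1 -3 ];
n = size(A, 1);
I = eye(n);
B = zeros(n);
for j = 1:n
    B(:,j) = A \ I(:,j);          % jth column of A^{-1}
end
norm(B - inv(A))                  % essentially zero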

B.9 Triangular and Diagonal Matrices


There are certain types of matrices for which the fundamental problems of numerical linear algebra,
solving systems of linear equations or computing eigenvalues and eigenvectors, are relatively easy
to solve. We now discuss a few such types.
Let A be an m × n matrix. We define the main diagonal of A to be the entries a11 , a22 , . . . , app ,
where p = min{m, n}. That is, the main diagonal consists of all entries for which the row index
and column index are equal. We then say that A is a diagonal matrix if the only nonzero entries of
A lie on the main diagonal. That is, A is a diagonal matrix if aij = 0 whenever i 6= j.
We say that A is upper triangular if aij = 0 whenever i > j. That is, all nonzero entries of
A are confined to the “upper triangle” of A, which consists of all entries on or “above” the main
diagonal. Similarly, we say that A is lower triangular if aij = 0 whenever i < j. Such a matrix has
all of its nonzero entries on or below the main diagonal.
We will see that a system of linear equations of the form Ax = b is easily solved if A is an
upper or lower triangular matrix, and that the eigenvalues of a square matrix A are easy to obtain
if A is triangular. As such, certain methods for solving both kinds of problems proceed by reducing
A to triangular form.
Example The matrices
$$U = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{bmatrix}, \qquad L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 3 & 0 \\ 4 & 5 & 6 \end{bmatrix}, \qquad D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{bmatrix}$$
are upper triangular, lower triangular, and diagonal, respectively. 2

B.10 Determinants
We previously learned that a 2 × 2 matrix A is invertible if and only if the quantity a11 a22 − a12 a21
is nonzero. This generalizes the fact that a 1 × 1 matrix a is invertible if and only if its single
entry, a11 = a, is nonzero. We now discuss the generalization of this determination of invertibility
to general square matrices.
The determinant of an n × n matrix A, denoted by det(A) or |A|, is defined as follows:

• If n = 1, then det(A) = a11 .

• If n > 1, then det(A) is recursively defined by
$$\det(A) = \sum_{j=1}^n a_{ij}(-1)^{i+j}\det(M_{ij}), \qquad 1 \le i \le n,$$
where M_{ij}, called a minor of A, is the matrix obtained by removing row i and column j of A.

• Alternatively,
$$\det(A) = \sum_{i=1}^n a_{ij}(-1)^{i+j}\det(M_{ij}), \qquad 1 \le j \le n.$$
The scalar A_{ij} = (−1)^{i+j} det(M_{ij}) is called a cofactor of A.


This definition of the determinant, however, does not lead directly to a practical algorithm for
its computation, because it requires O(n!) floating-point operations, whereas typical algorithms for
matrix computations run in polynomial time. However, the computational effort can be reduced by
choosing from the multiple formulas for det(A) above. By consistently choosing the row or column
with the most zeros, the number of operations can be minimized.
However, more practical methods for computing the determinant can be obtained by using its
properties:

• If any row or column of A has only zero entries, then det(A) = 0.

• If any two rows or columns of A are the same, then det(A) = 0.

• If à is obtained from A by adding a multiple of a row of A to another row, then det(Ã) =


det(A).

• If à is obtained from A by interchanging two rows of A, then det(Ã) = − det(A).

• If à is obtained from A by scaling a row by a scalar λ, then det(Ã) = λ det(A).

• If B is an n × n matrix, then det(AB) = det(A) det(B).

• det(AT ) = det(A)

• If A is nonsingular, then det(A−1 ) = (det(A))−1


• If A is a triangular matrix (either upper or lower), then $\det(A) = \prod_{i=1}^n a_{ii}$.

The best-known application of the determinant is the fact that it indicates whether a matrix A is
nonsingular, or invertible. The following statements are all equivalent.
• det(A) ≠ 0.

• A is nonsingular.

• A−1 exists.

• The system Ax = b has a unique solution for any n-vector b.

• The system Ax = 0 has only the trivial solution x = 0.


The determinant has other interesting applications. The absolute value of the determinant of a 3 × 3 matrix is equal to the volume of the parallelepiped defined by the vectors that are the rows (or columns) of the matrix.
Example Because the matrices
$$L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 4 & 0 \\ -3 & -5 & -6 \end{bmatrix}, \qquad U = \begin{bmatrix} 1 & -1 & 0 & 5 \\ 0 & 2 & 3 & -6 \\ 0 & 0 & -4 & 7 \\ 0 & 0 & 0 & 8 \end{bmatrix}$$
are lower and upper triangular, respectively, their determinants are the products of their diagonal entries. That is,
$$\det(L) = 1(4)(-6) = -24, \qquad \det(U) = 1(2)(-4)(8) = -64.$$
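A quick Matlab check of this example compares det with the product of the diagonal entries.

L = [ 1 0 0; 2 4 0; -3 -5 -6 ];
U = [ 1 -1 0 5; 0 2 3 -6; 0 0 -4 7; 0 0 0 8 ];
det(L), prod(diag(L))     % both -24
det(U), prod(diag(U))     % both -64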

B.11 Vector and Matrix Norms


B.11.1 Vector Norms
Given vectors x and y of length one, which are simply scalars x and y, the most natural notion of
distance between x and y is obtained from the absolute value; we define the distance to be |x − y|.
We therefore define a distance function for vectors that has similar properties.
A function k · k : Rn → R is called a vector norm if it has the following properties:
1. kxk ≥ 0 for any vector x ∈ Rn , and kxk = 0 if and only if x = 0

2. kαxk = |α|kxk for any vector x ∈ Rn and any scalar α ∈ R

3. kx + yk ≤ kxk + kyk for any vectors x, y ∈ Rn .


The last property is called the triangle inequality. It should be noted that when n = 1, the absolute
value function is a vector norm.
The most commonly used vector norms belong to the family of p-norms, or `p -norms, which
are defined by
n
!1/p
X
p
kxkp = |xi | .
i=1

It can be shown that for any p ≥ 1, ‖·‖_p defines a vector norm. The following p-norms are of
particular interest:

• p = 1: The ℓ_1-norm
$$\|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|$$
• p = 2: The ℓ_2-norm or Euclidean norm
$$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \sqrt{x^T x}$$
• p = ∞: The ℓ_∞-norm
$$\|x\|_\infty = \max_{1\le i\le n}|x_i|$$

Example Given the vector
$$x = \begin{bmatrix} 1 \\ 2 \\ -3 \end{bmatrix},$$
we have
$$\|x\|_1 = |1| + |2| + |-3| = 6,$$
$$\|x\|_2 = \sqrt{1^2 + 2^2 + (-3)^2} = \sqrt{14},$$
$$\|x\|_\infty = \max\{|1|, |2|, |-3|\} = 3.$$

2
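These norms can be computed in Matlab with the built-in norm function, as in the following sketch.

x = [ 1; 2; -3 ];
norm(x, 1)      % 6
norm(x, 2)      % sqrt(14); this is also the default, norm(x)
norm(x, inf)    % 3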
It can be shown that the `2 -norm satisfies the Cauchy-Bunyakovsky-Schwarz inequality, also
known as simply the Cauchy-Schwarz inequality,

|xT y| ≤ kxk2 kyk2

for any vectors x, y ∈ Rn . This inequality is useful for showing that the `2 -norm satisfies the
triangle inequality. It is a special case of the Hölder inequality
$$|x^T y| \le \|x\|_p\|y\|_q, \qquad \frac{1}{p} + \frac{1}{q} = 1.$$

We will prove the Cauchy-Schwarz inequality for vectors in Rn ; the proof can be generalized to
a complex vector space. For x, y ∈ Rn and c ∈ R, with y 6= 0, we have

(x − cy)T (x − cy) = kx − cyk22 ≥ 0.

It follows from the properties of the inner product that
$$0 \le (x - cy)^T(x - cy) = x^T x - x^T(cy) - (cy)^T x + (cy)^T(cy) = \|x\|_2^2 - 2c\,x^T y + c^2\|y\|_2^2.$$

We now try to find the value of c that minimizes this expression. Differentiating with respect to c
and equating to zero yields the equation

−2xT y + 2ckyk22 = 0,

and therefore the minimum occurs when c = xT y/kyk22 . It follows that

0 ≤ kxk22 − 2cxT y + c2 kyk2


xT y T (xT y)2
≤ kxk22 − 2 x y + kyk22
kyk22 kyk42
(xT y)2 (xT y)2
≤ kxk22 − 2 +
kyk22 kgk2
(xT y)2
≤ kxk22 − .
kyk22
It follows that
(xT y)2 ≤ kxk22 kyk22 .
Taking the square root of both sides yields the Cauchy-Schwarz inequality.
Now that we have defined various notions of the size, or magnitude, of a vector, we can discuss
distance and convergence. Given a vector norm k · k, and vectors x, y ∈ Rn , we define the distance
between x and y, with respect to this norm, by kx − yk. Then, we say that a sequence of n-vectors
$\{x^{(k)}\}_{k=0}^\infty$ converges to a vector x if

lim kx(k) − xk = 0.
k→∞

That is, the distance between x(k) and x must approach zero. It can be shown that regardless of
the choice of norm, x(k) → x if and only if
$$x_i^{(k)} \to x_i, \qquad i = 1, 2, \ldots, n.$$

That is, each component of x(k) must converge to the corresponding component of x. This is due
to the fact that for any vector norm k · k, kxk = 0 if and only if x is the zero vector.
Because we have defined convergence with respect to an arbitrary norm, it is important to know
whether a sequence can converge to a limit with respect to one norm, while converging to a different
limit in another norm, or perhaps not converging at all. Fortunately, for p-norms, this is never the
case. We say that two vector norms k · kα and k · kβ are equivalent if there exists constants C1 and
C2 , that are independent of x, such that for any vector x ∈ Rn ,

C1 kxkα ≤ kxkβ ≤ C2 kxkα .

It follows that if two norms are equivalent, then a sequence of vectors that converges to a limit
with respect to one norm will converge to the same limit in the other. It can be shown that all
`p -norms are equivalent. In particular, if x ∈ Rn , then

$$\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2,$$
$$\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty,$$
$$\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty.$$

We will now prove the equivalence of ‖·‖_1 and ‖·‖_2. Let x ∈ R^n. First, we have
$$\|x\|_2^2 = \sum_{i=1}^n |x_i|^2 \le \sum_{i,j=1}^n |x_i||x_j| = \|x\|_1^2,$$
and therefore ‖x‖_2 ≤ ‖x‖_1. Then, we define the vector y by
$$y_i = \begin{cases} 1 & x_i \ge 0, \\ -1 & x_i < 0. \end{cases}$$
It follows that ‖x‖_1 = y^T x. By the Cauchy-Schwarz inequality,
$$\|x\|_1 = y^T x \le \|y\|_2\|x\|_2 = \sqrt{n}\,\|x\|_2,$$
and the equivalence of the norms has been established.

B.11.2 Matrix Norms


It is also very useful to be able to measure the magnitude of a matrix, or the distance between
matrices. However, it is not sufficient to simply define the norm of an m × n matrix A as the norm
of an mn-vector x whose components are the entries of A. We instead define a matrix norm to be
a function k · k : Rm×n → R that has the following properties:

• kAk ≥ 0 for any A ∈ Rm×n , and kAk = 0 if and only if A = 0

• kαAk = |α|kAk for any m × n matrix A and scalar α

• kA + Bk ≤ kAk + kBk for any m × n matrices A and B

Another property that is often, but not always, included in the definition of a matrix norm is the
submultiplicative property: if A is m × n and B is n × p, we require that

kABk ≤ kAkkBk.

This is particularly useful when A and B are square matrices.


Any vector norm induces a matrix norm. It can be shown that given a vector norm, defined
appropriately for m-vectors and n-vectors, the function k · k : Rm×n → R defined by

$$\|A\| = \sup_{x\ne 0}\frac{\|Ax\|}{\|x\|} = \max_{\|x\|=1}\|Ax\|$$

is a matrix norm. It is called the natural, or induced, matrix norm. Furthermore, if the vector
norm is a `p -norm, then the induced matrix norm satisfies the submultiplicative property.
The following matrix norms are of particular interest:

• The ℓ_1-norm:
$$\|A\|_1 = \max_{\|x\|_1=1}\|Ax\|_1 = \max_{1\le j\le n}\sum_{i=1}^m |a_{ij}|.$$
That is, the ℓ_1-norm of a matrix is the maximum column sum of |A|. To see this, let x ∈ R^n satisfy ‖x‖_1 = 1. Then
$$\|Ax\|_1 = \sum_{i=1}^m |(Ax)_i| = \sum_{i=1}^m \left|\sum_{j=1}^n a_{ij}x_j\right| \le \sum_{j=1}^n |x_j|\left(\sum_{i=1}^m |a_{ij}|\right) \le \left(\sum_{j=1}^n |x_j|\right)\left(\max_{1\le j\le n}\sum_{i=1}^m |a_{ij}|\right) = \max_{1\le j\le n}\sum_{i=1}^m |a_{ij}|.$$
Equality is achieved if x = e_J, where the index J satisfies
$$\max_{1\le j\le n}\sum_{i=1}^m |a_{ij}| = \sum_{i=1}^m |a_{iJ}|.$$
It follows that the maximum column sum of |A| is equal to the maximum of ‖Ax‖_1 taken over the set of all unit 1-norm vectors.

• The ℓ_∞-norm:
$$\|A\|_\infty = \max_{\|x\|_\infty=1}\|Ax\|_\infty = \max_{1\le i\le m}\sum_{j=1}^n |a_{ij}|.$$

That is, the `∞ -norm of a matrix is its maximum row sum. This formula can be obtained in
a similar manner as the one for the matrix 1-norm.

• The ℓ_2-norm:
$$\|A\|_2 = \max_{\|x\|_2=1}\|Ax\|_2.$$
To obtain a formula for this norm, we note that the function
$$g(x) = \frac{\|Ax\|_2^2}{\|x\|_2^2}$$
has a local maximum or minimum whenever x is a unit ℓ_2-norm vector (that is, ‖x‖_2 = 1) that satisfies
$$A^T Ax = \|Ax\|_2^2\,x,$$
as can be shown by differentiation of g(x). That is, x is an eigenvector of A^T A, with corresponding eigenvalue ‖Ax‖_2^2 = g(x). We conclude that
$$\|A\|_2 = \sqrt{\max_{1\le i\le n}\lambda_i(A^T A)}.$$

That is, the `2 -norm of a matrix is the square root of the largest eigenvalue of AT A, which is
guaranteed to be nonnegative, as can be shown using the vector 2-norm. We see that unlike
the vector `2 -norm, the matrix `2 -norm is much more difficult to compute than the matrix
`1 -norm or `∞ -norm.

• The Frobenius norm:
$$\|A\|_F = \left(\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2\right)^{1/2}.$$

It should be noted that the Frobenius norm is not induced by any vector `p -norm, but it
is equivalent to the vector `2 -norm in the sense that kAkF = kxk2 where x is obtained by
reshaping A into a vector.

Like vector norms, matrix norms are equivalent. For example, if A is an m × n matrix, we have
$$\|A\|_2 \le \|A\|_F \le \sqrt{n}\,\|A\|_2,$$
$$\frac{1}{\sqrt{n}}\|A\|_\infty \le \|A\|_2 \le \sqrt{m}\,\|A\|_\infty,$$
$$\frac{1}{\sqrt{m}}\|A\|_1 \le \|A\|_2 \le \sqrt{n}\,\|A\|_1.$$

Example Let
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 0 \\ -1 & 0 & 4 \end{bmatrix}.$$
Then
$$\|A\|_1 = \max\{|1| + |0| + |-1|,\; |2| + |1| + |0|,\; |3| + |0| + |4|\} = 7,$$
and
$$\|A\|_\infty = \max\{|1| + |2| + |3|,\; |0| + |1| + |0|,\; |-1| + |0| + |4|\} = 6.$$
2
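The following sketch computes these matrix norms in Matlab; norm(A, 2) and norm(A, 'fro') are included for comparison.

A = [ 1 2 3; 0 1 0; -1 0 4 ];
norm(A, 1)             % maximum column sum: 7
norm(A, inf)           % maximum row sum: 6
norm(A, 2)             % square root of the largest eigenvalue of A'*A
norm(A, 'fro')         % Frobenius norm
sqrt(max(eig(A'*A)))   % agrees with norm(A, 2)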

B.12 Function Spaces and Norms


We now define norms on more general vector spaces. Let V be a vector space over the field of real
numbers R. A norm on V is a function k · k : V → R that has the following properties:

1. kf k ≥ 0 for all f ∈ V, and kf k = 0 if and only if f is the zero vector of V.

2. kcf k = |c|kf k for any vector f ∈ V and any scalar c ∈ R.



3. kf + gk ≤ kf k + kgk for all f, g ∈ V.


The last property is known as the triangle inequality. A vector space V, together with a norm k · k,
is called a normed vector space or normed linear space. In particular, we are interested in working
with function spaces, which are vector spaces in which the vectors are functions.
Example The space C[a, b] of functions that are continuous on the interval [a, b] is a normed vector
space with the norm
$$\|f\|_\infty = \max_{a\le x\le b}|f(x)|,$$
known as the ∞-norm or maximum norm. 2
Example The space C[a, b] can be equipped with a different norm, such as
$$\|f\|_2 = \left(\int_a^b |f(x)|^2 w(x)\,dx\right)^{1/2},$$

where the weight function w(x) is positive and integrable on (a, b). It is allowed to be singular
at the endpoints, as will be seen in certain examples. This norm is called the 2-norm or weighted
2-norm. 2
The 2-norm and ∞-norm are related as follows:
$$\|f\|_2 \le W\|f\|_\infty, \qquad W = \|1\|_2.$$
However, unlike the ∞-norm and 2-norm defined for the vector space R^n, these norms are not equivalent: a function that has a small 2-norm does not necessarily have a small ∞-norm. In fact, given any ε > 0, no matter how small, and any M > 0, no matter how large, there exists a function f ∈ C[a, b] such that
$$\|f\|_2 < \varepsilon, \qquad \|f\|_\infty > M.$$
We say that a function f is absolutely continuous on [a, b] if its derivative is finite almost everywhere in [a, b] (meaning that the set of points at which it fails to be finite has measure zero), is integrable on [a, b], and satisfies
$$\int_a^x f'(s)\,ds = f(x) - f(a), \qquad a \le x \le b.$$

Any continuously differentiable function is absolutely continuous, but the converse is not necessarily
true.

Example B.12.1 For example, f (x) = |x| is absolutely continuous on any interval of the form
[−a, a], but it is not continuously differentiable on such an interval. 2

Next, we define the Sobolev spaces H k (a, b) as follows. The space H 1 (a, b) is the set of all
absolutely continuous functions on [a, b] whose derivatives belong to L2 (a, b). Then, for k > 1,
H k (a, b) is the subset of H k−1 (a, b) consisting of functions whose (k −1)st derivatives are absolutely
continuous, and whose kth derivatives belong to L2 (a, b). If we denote by C k [a, b] the set of all
functions defined on [a, b] that are k times continuously differentiable, then C k [a, b] is a proper
subset of H^k(a, b). For example, any piecewise linear function belongs to H^1(a, b), but does not generally
belong to C 1 [a, b].

Example B.12.2 The function f(x) = x^{3/4} belongs to H^1(0, 1) because f'(x) = (3/4)x^{-1/4} is integrable on [0, 1], and also square-integrable on [0, 1], since
$$\int_0^1 |f'(x)|^2\,dx = \int_0^1 \frac{9}{16}x^{-1/2}\,dx = \frac{9}{8}x^{1/2}\bigg|_0^1 = \frac{9}{8}.$$
However, f ∉ C^1[0, 1], because f'(x) is singular at x = 0. 2

B.13 Inner Product Spaces


Recall that two m-vectors u = ⟨u_1, u_2, ..., u_m⟩ and v = ⟨v_1, v_2, ..., v_m⟩ are orthogonal if
$$u \cdot v = \sum_{i=1}^m u_i v_i = 0,$$

where u · v is the dot product, or inner product, of u and v.


By viewing functions defined on an interval [a, b] as infinitely long vectors, we can generalize
the inner product, and the concept of orthogonality, to functions. To that end, we define the inner
product of two real-valued functions f (x) and g(x) defined on the interval [a, b] by
Z b
hf, gi = f (x)g(x) dx.
a

Then, we say f and g are orthogonal with respect to this inner product if hf, gi = 0.
In general, an inner product on a vector space V over R, be it continuous or discrete, has the
following properties:
1. hf + g, hi = hf, hi + hg, hi for all f, g, h ∈ V

2. hcf, gi = chf, gi for all c ∈ R and all f ∈ V

3. hf, gi = hg, f i for all f, g ∈ V

4. hf, f i ≥ 0 for all f ∈ V, and hf, f i = 0 if and only if f = 0.


This inner product can be used to define the norm of a function, which generalizes the concept
of the magnitude of a vector to functions, and therefore provides a measure of the “magnitude” of
a function. Recall that the magnitude of a vector v, denoted by kvk, can be defined by

kvk = (v · v)1/2 .

Along similar lines, we define the 2-norm of a function f (x) defined on [a, b] by
$$\|f\|_2 = (\langle f, f\rangle)^{1/2} = \left(\int_a^b [f(x)]^2\,dx\right)^{1/2}.$$

It can be verified that this function does in fact satisfy the properties required of a norm.
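As a sketch (not taken from the text), such inner products can be approximated numerically in Matlab with the built-in integral function; here the interval [a, b] = [−π, π] and the functions f(x) = sin x, g(x) = cos x are assumptions chosen for illustration.

f = @(x) sin(x); g = @(x) cos(x);
a = -pi; b = pi;
integral(@(x) f(x).*g(x), a, b)          % essentially zero: f and g are orthogonal
sqrt(integral(@(x) f(x).^2, a, b))       % ||f||_2 = sqrt(pi)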
One very important property that k · k2 has is that it satisfies the Cauchy-Schwarz inequality

|hf, gi| ≤ kf k2 kgk2 , f, g ∈ V.



This can be proven by noting that for any scalar c ∈ R,

c2 kf k22 + 2chf, gi + kgk22 = kcf + gk22 ≥ 0.

The left side is a quadratic polynomial in c. In order for this polynomial to not have any negative
values, it must either have complex roots or a double real root. This is the case if the discriminant satisfies
$$4\langle f, g\rangle^2 - 4\|f\|_2^2\|g\|_2^2 \le 0,$$
from which the Cauchy-Schwarz inequality immediately follows. By setting c = 1 and applying this
inequality, we immediately obtain the triangle-inequality property of norms.

B.14 Eigenvalues
We have learned what it means for a sequence of vectors to converge to a limit. However, using
the definition alone, it may still be difficult to determine, conclusively, whether a given sequence of
vectors converges. For example, suppose a sequence of vectors is defined as follows: we choose the
initial vector x(0) arbitrarily, and then define the rest of the sequence by

x(k+1) = Ax(k) , k = 0, 1, 2, . . .

for some matrix A. Such a sequence actually arises in the study of the convergence of various
iterative methods for solving systems of linear equations.
An important question is whether a sequence of this form converges to the zero vector. This
will be the case if
lim kx(k) k = 0
k→∞

in some vector norm. From the definition of x(k) , we must have

$$\lim_{k\to\infty}\|A^k x^{(0)}\| = 0.$$

From the submultiplicative property of matrix norms,

$$\|A^k x^{(0)}\| \le \|A\|^k\|x^{(0)}\|,$$

from which it follows that the sequence will converge to the zero vector if kAk < 1. However, this
is only a sufficient condition; it is not necessary.
To obtain a sufficient and necessary condition, it is necessary to achieve a better understanding
of the effect of matrix-vector multiplication on the magnitude of a vector. However, because
matrix-vector multiplication is a complicated operation, this understanding can be difficult to
acquire. Therefore, it is helpful to identify circumstances under which this operation can be simply
described.
To that end, we say that a nonzero vector x is an eigenvector of an n × n matrix A if there
exists a scalar λ such that
Ax = λx.
The scalar λ is called an eigenvalue of A corresponding to x. Note that although x is required to
be nonzero, it is possible that λ can be zero. It can also be complex, even if A is a real matrix.

If we rearrange the above equation, we have

(A − λI)x = 0.

That is, if λ is an eigenvalue of A, then A − λI is a singular matrix, and therefore det(A − λI) = 0.
This equation is actually a polynomial in λ, which is called the characteristic polynomial of A. If
A is an n × n matrix, then the characteristic polynomial is of degree n, which means that A has n
eigenvalues, which may repeat.
The following properties of eigenvalues and eigenvectors are helpful to know:

• If λ is an eigenvalue of A, then there is at least one eigenvector of A corresponding to λ

• If there exists an invertible matrix P such that B = P AP −1 , then A and B have the same
eigenvalues. We say that A and B are similar, and the transformation P AP −1 is called a
similarity transformation.

• If A is a symmetric matrix, then its eigenvalues are real.

• If A is a skew-symmetric matrix, meaning that AT = −A, then its eigenvalues are either equal
to zero, or are purely imaginary.

• If A is a real matrix, and λ = u + iv is a complex eigenvalue of A, then λ̄ = u − iv is also an


eigenvalue of A.

• If A is a triangular matrix, then its diagonal entries are the eigenvalues of A.

• det(A) is equal to the product of the eigenvalues of A.

• tr(A), the sum of the diagonal entries of A, is also equal to the sum of the eigenvalues of A.

It follows that any method for computing the roots of a polynomial can be used to obtain the
eigenvalues of a matrix A. However, in practice, eigenvalues are normally computed using iterative
methods that employ orthogonal similarity transformations to reduce A to upper triangular form,
thus revealing the eigenvalues of A. In practice, such methods for computing eigenvalues are used
to compute roots of polynomials, rather than using polynomial root-finding methods to compute
eigenvalues, because they are much more robust with respect to roundoff error.
It can be shown that if each eigenvalue λ of a matrix A satisfies |λ| < 1, then, for any vector x,

lim Ak x = 0.
k→∞

Furthermore, the converse of this statement is also true: if there exists a vector x such that Ak x
does not approach 0 as k → ∞, then at least one eigenvalue λ of A must satisfy |λ| ≥ 1.
Therefore, it is through the eigenvalues of A that we can describe a necessary and sufficient
condition for a sequence of vectors of the form $x^{(k)} = A^k x^{(0)}$ to converge to the zero vector.
Specifically, we need only check whether the largest eigenvalue magnitude is less than 1. For
convenience, we define the spectral radius of A, denoted by ρ(A), to be the maximum of |λ| over
all eigenvalues λ of A. We can then conclude that the sequence $x^{(k)} = A^k x^{(0)}$ converges to the
zero vector if and only if ρ(A) < 1.
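
Here is a minimal sketch of this criterion, using an arbitrarily chosen matrix and starting vector: the spectral radius is computed as the largest eigenvalue magnitude, and because it is less than 1, the iterates $x^{(k)} = A^k x^{(0)}$ decay to the zero vector.

% Arbitrarily chosen matrix; its spectral radius is less than 1
A = [0.5 0.4; -0.3 0.2];
rho = max(abs(eig(A)));      % spectral radius rho(A), about 0.47

x = [1; 1];                  % arbitrary starting vector x^(0)
for k = 1:50
    x = A*x;                 % x^(k) = A*x^(k-1), so x^(k) = A^k*x^(0)
end
disp(rho)
disp(norm(x))                % very small, consistent with rho(A) < 1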

Example Let
$$A = \begin{bmatrix} 2 & 3 & 1 \\ 0 & 4 & 5 \\ 0 & 0 & 1 \end{bmatrix}.$$
Then
$$A - 2I = \begin{bmatrix} 2-2 & 3 & 1 \\ 0 & 4-2 & 5 \\ 0 & 0 & 1-2 \end{bmatrix} = \begin{bmatrix} 0 & 3 & 1 \\ 0 & 2 & 5 \\ 0 & 0 & -1 \end{bmatrix}.$$
Because A − 2I has a column of all zeros, it is singular. Therefore, 2 is an eigenvalue of A. In fact,
$Ae_1 = 2e_1$, so $e_1 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}^T$ is an eigenvector.
Because A is upper triangular, its eigenvalues are the diagonal elements, 2, 4 and 1. Because
the largest eigenvalue in magnitude is 4, the spectral radius of A is ρ(A) = 4. □
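
The computations in this example can be checked with a couple of lines of Matlab:

A = [2 3 1; 0 4 5; 0 0 1];
eig(A)               % the eigenvalues 2, 4 and 1, in some order
max(abs(eig(A)))     % the spectral radius rho(A) = 4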
The spectral radius is closely related to natural (induced) matrix norms. Let λ be the largest
eigenvalue of A in magnitude, so that |λ| = ρ(A), and let x be a corresponding eigenvector. Then,
for any natural matrix norm $\|\cdot\|$, we have

$$\rho(A)\|x\| = |\lambda|\,\|x\| = \|\lambda x\| = \|Ax\| \le \|A\|\,\|x\|.$$

Therefore, we have $\rho(A) \le \|A\|$. When A is symmetric, we also have

$$\|A\|_2 = \rho(A).$$

For a general matrix A, we have

$$\|A\|_2 = [\rho(A^T A)]^{1/2},$$

which can be seen to reduce to ρ(A) when $A^T = A$, since, in general, $\rho(A^k) = \rho(A)^k$.
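
These relationships can be checked in Matlab for any particular matrix; the sketch below uses an arbitrarily chosen nonsymmetric matrix.

% Arbitrarily chosen nonsymmetric matrix
A = [1 2; -3 0.5];
rho = max(abs(eig(A)));

% rho(A) is a lower bound for every natural norm of A
disp([rho, norm(A,1), norm(A,2), norm(A,inf)])

% The 2-norm equals the square root of the spectral radius of A'*A
disp(norm(A,2) - sqrt(max(abs(eig(A'*A)))))   % nearly zero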
Because the condition ρ(A) < 1 is necessary and sufficient to ensure that $\lim_{k\to\infty} A^k x = 0$, it
is possible that such convergence may occur even if $\|A\| \ge 1$ for some natural norm $\|\cdot\|$. However,
if ρ(A) < 1, we can conclude that

$$\lim_{k\to\infty} \|A^k\| = 0,$$

even though $\lim_{k\to\infty} \|A\|^k$ may not even exist.
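
For instance, the arbitrarily chosen matrix in the following sketch has spectral radius 0.5, yet its 1-, 2- and ∞-norms all exceed 1; the norms of its powers still decay to zero.

% rho(A) = 0.5, but the 1-, 2- and infinity-norms are all greater than 1
A = [0.5 2; 0 0.5];
disp(max(abs(eig(A))))                     % 0.5
disp([norm(A,1), norm(A,2), norm(A,inf)])  % all greater than 1

disp(norm(A^10))    % the norms of the powers nevertheless decay...
disp(norm(A^50))    % ...to zero as k increases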


In view of the property of a matrix norm that $\|A\| = 0$ if and only if A = 0, we can conclude
that if ρ(A) < 1, then $A^k$ converges to the zero matrix as k → ∞. In summary, the following
statements are all equivalent:

1. ρ(A) < 1

2. $\lim_{k\to\infty} \|A^k\| = 0$, for any natural norm $\|\cdot\|$

3. $\lim_{k\to\infty} (A^k)_{ij} = 0$, for i, j = 1, 2, . . . , n

4. $\lim_{k\to\infty} A^k x = 0$, for every vector x

These results are very useful for analyzing the convergence behavior of various iterative methods
for solving systems of linear equations.

B.15 Differentiation of Matrices


Suppose that A(t) is an m × n matrix in which each entry is a function of a parameter t. Then the
matrix $A'(t)$, or dA/dt, is the m × n matrix obtained by differentiating each entry with respect to
t. That is,

$$\left[ \frac{dA(t)}{dt} \right]_{ij} = \frac{d}{dt}[a_{ij}(t)].$$
Matrices obey differentiation rules that are analogous to differentiation rules for functions, but the
rules of matrix-matrix multiplication, in particular its noncommutativity, must be taken into account.
For example, if A(t) is an m × n matrix and B(t) is an n × p matrix, then

$$\frac{d}{dt}[A(t)B(t)] = \frac{d}{dt}[A(t)]\,B(t) + A(t)\,\frac{d}{dt}[B(t)],$$

and if A(t) is a nonsingular n × n matrix, then

$$\frac{d}{dt}[A(t)^{-1}] = -A(t)^{-1}\,\frac{d}{dt}[A(t)]\,A(t)^{-1}.$$
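
The rule for differentiating an inverse can be verified numerically with a finite-difference check; the following sketch uses an arbitrarily chosen smooth, nonsingular matrix-valued function A(t).

% Arbitrarily chosen smooth matrix-valued function and its entrywise derivative
A  = @(t) [2 + cos(t), sin(t); -sin(t), 3 + t^2];
dA = @(t) [-sin(t), cos(t); -cos(t), 2*t];

t = 0.7;
h = 1e-6;

% Centered-difference approximation of d[A(t)^(-1)]/dt
fd = (inv(A(t+h)) - inv(A(t-h))) / (2*h);

% The formula -A(t)^(-1) * A'(t) * A(t)^(-1)
formula = -inv(A(t)) * dA(t) * inv(A(t));

disp(norm(fd - formula))   % small, on the order of the finite-difference error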
It is also useful to know how to differentiate functions of vectors with respect to their compo-
nents. Let A be an n × n matrix and let x, b ∈ Rn. Then, we have

$$\nabla\left( x^T A x \right) = (A + A^T)x, \qquad \nabla(x^T b) = b.$$

These formulas are useful in problems involving minimization of functions of x, such as the least-
squares problem, which entails approximately solving a system of m equations in n unknowns, where
typically m > n.
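
The first of these gradient formulas can be verified with a simple finite-difference check, as in the following sketch with arbitrarily chosen A and x.

% Arbitrarily chosen nonsymmetric matrix and vector
A = [4 1 0; 2 3 1; 0 1 5];
x = [0.5; 1; -1];

f = @(x) x'*A*x;     % the quadratic form x^T A x
g = (A + A')*x;      % the claimed gradient at x

% Approximate each component of the gradient by a centered difference
h = 1e-6;
fd = zeros(3,1);
for i = 1:3
    e = zeros(3,1);  e(i) = 1;
    fd(i) = (f(x + h*e) - f(x - h*e)) / (2*h);
end
disp(norm(fd - g))   % small, on the order of the finite-difference error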