Numerical Methods and Methods of Approximation in Science and Engineering
Karan S. Surana
Department of Mechanical Engineering
The University of Kansas
Lawrence, Kansas
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
Preface xv
1 Introduction 1
1.1 Numerical Solutions . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Numerical Methods without any Approximation . . . 1
1.1.2 Numerical Methods with Approximations . . . . . . . 2
1.2 Accuracy of Numerical Solution, Error . . . . . . . . . . . . 2
1.3 Concept of Convergence . . . . . . . . . . . . . . . . . . . . . 3
1.4 Mathematical Models . . . . . . . . . . . . . . . . . . . . . . 4
1.5 A Brief Description of Topics and Methods . . . . . . . . . . 4
8.2.1 First Derivative df/dx at x = xi . . . . . . . . . . . . 349
8.2.2 Second Derivative d²f/dx² at x = xi : Central Difference Method . . . . . . . . . . . . 350
8.2.3 Third Derivative d³f/dx³ at x = xi . . . . . . . . . . . . 351
8.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . 354
BIBLIOGRAPHY 467
INDEX 471
Preface
a computer program.
In this book, all numerical methods are clearly grouped into two categories:
performing some numerical studies, and for bringing the original preliminary version of the manuscript of the book to a significant level of completion. Aaron's interest in the subject, hard work, and commitment to this book project were instrumental in the completion of the major portion of this book. Also my very sincere and special thanks to Mr. Dhaval Mysore, my current Ph.D. student, for completing the typing and typesetting of much of the newer material in Chapters 7 through 11. His interest in the subject, hard work, and commitment have helped in the completion of the final manuscript of this book. My sincere thanks to many of my colleagues in the mechanical engineering department at the University of Kansas, and in particular to my colleague and good friend Professor Peter TenPas, for valuable suggestions and many discussions that have helped me in improving the manuscript of the book.
This book contains so many equations, derivations, mathematical details, and tables of solutions that it is hardly possible to avoid some typographical and other errors. The author would be grateful to those readers who are willing to draw attention to the errors using the email [email protected].
1
Introduction
Remarks.
(a) For a given class of mathematical models, some methods of obtaining nu-
merical solutions may be numerical methods (no approximation), while
others may be methods of approximation. For example, if the math-
ematical model consists of a system of linear simultaneous algebraic
equations (Chapter 2), then methods like Gauss elimination, Gauss-
Jordan method, and Cramer’s rule for obtaining their solution are nu-
merical methods without any approximation, while Gauss-Seidel and
Jacobi methods are methods of approximation.
(i) If the true solution is known, the error can be measured as the difference
between the true solution and the calculated solution in the pointwise
sense, or if possible in the sense of L2 -norm.
(ii) When the theoretical solution is not known, as is the case with most
practical applications, we can possibly consider some of the following.
(a) We can attempt to estimate the error bounds. This provides the
least upper bound of the error in the solution, i.e., the true error
is less than or equal to the estimated error bound. In many cases
(but not always), this estimation of the error bound is possible.
(b) There are methods of approximation in which errors can be computed based on the current numerical solution without knowledge of the theoretical solution. The residual functional or L2-norms of residuals in the finite element methods with minimally conforming approximation spaces are examples of this approach. This approach is highly meritorious as it provides a quantitative measure of error in the computed solution without knowledge of the theoretical solution, hence can be used to compute errors in practical applications.
(c) There are methods in which the solution error can neither be es-
timated nor computed but there is some vague indication of im-
provement. Order of truncation errors in finite difference processes
fall under this category. With increasing order of truncation, the
solution errors are expected to reduce.
[Figure: flowchart for a linear physical system: linear mathematical model (BVP or IVP as examples) (A), discretization into linear algebraic equations (B), solution, error estimate or error computation, and a convergence check leading to the converged solution of (A).]

[Figure: flowchart for a non-linear physical system: non-linear mathematical model (BVP or IVP as examples) (A), discretization into linear algebraic equations (B), iterative solution procedure with its own convergence check, error computation or estimation, and a final convergence check leading to the approximate (converged) solution of (A).]
fi (x1 , x2 , . . . , xn ) = bi ; i = 1, 2, . . . , n (2.1)
(3) If the number of equations is large (large value of n in equation (2.1)), then the representation (2.2) is cumbersome, i.e., not very compact. We use matrix and vector notation to represent (2.2).
Definition 2.3 (Matrix). A matrix is an ordered rectangular (in general)
arrangement of elements and is generally denoted by a symbol. Thus, n × m
elements aij ; i = 1, 2, . . . , n; j = 1, 2, . . . , m can be represented by a symbol
[A] called the matrix A as follows:
[A] = [ a11  a12  ...  a1m ]
      [ a21  a22  ...  a2m ]
      [  .    .         .  ]        (2.4)
      [ an1  an2  ...  anm ]
The elements along each horizontal line are called rows whereas the elements
along each vertical line are called columns. Thus, the matrix [A] has n rows
and m columns. We refer to [A] as an n × m matrix. We identify each
element of [A] by row and column location. Thus, the element aij of [A] is
located at row i and column j. The first subscript in aij is the row location
and the second subscript is the column location. This is a standard notation
and is used throughout the book.
The notation (2.12) is helpful when expressing [I] in terms of its components
(Einstein notation). Thus δij is in fact the identity matrix expressed in
Einstein notation. If we consider the product of [A] and [I], then we can
write:
[A][I] = aij δjk = aik = [A] ; i, j, k = 1, 2, . . . , n (2.13)
Likewise:
[I][I] = δij δjk = δik = [I] (2.14)
[D] is defined by
[C](n×l) is defined by
We note that the number of columns in [A] must be the same as the number
of rows in [B], otherwise the product of [A] and [B] is not valid. Consider
[A] = [ a11  a12 ]        [B] = [ b11  b12 ]        (2.24)
      [ a21  a22 ]              [ b21  b22 ]
      [ a31  a32 ]
Then
[C] = [A][B] = [ a11  a12 ] [ b11  b12 ]   [ (a11 b11 + a12 b21)  (a11 b12 + a12 b22) ]
               [ a21  a22 ] [ b21  b22 ] = [ (a21 b11 + a22 b21)  (a21 b12 + a22 b22) ]        (2.25)
               [ a31  a32 ]                [ (a31 b11 + a32 b21)  (a31 b12 + a32 b22) ]
We note that [A](n×n) [I](n×n) = [I](n×n) [A](n×n) = [A](n×n) .
Distributive Property:
The sum of [A] and [B] multiplied with [C] is the same as [A] multiplied with [C] plus [B] multiplied with [C], i.e., ([A] + [B])[C] = [A][C] + [B][C].
Commutative Property:
The product of [A] and [B] is not the same as product of [B] and [A].
Thus, in taking the product of [A] and [B], their positions cannot be changed.
A singular matrix is one for which its inverse does not exist. The inverse is
only defined for a square matrix.
then
[A]^T = [ a11  a21 ]
        [ a12  a22 ]        (2.32)
        [ a13  a23 ]  (3×2)
Row one of [A] is the same as column one of [A]T . Likewise row one of
[A]T is the same as column one of [A] and so on. That is, rows of [A] are
same as columns of [A]T and vice versa.
then
[A]^T(m×1) = { a1 }
             { a2 }        (2.34)
             {  . }
             { am }  (m×1)
Transpose of a Vector:
If {A} is a vector defined by
{A}(n×1) = { a1 }
           { a2 }        (2.35)
           {  . }
           { an }  (n×1)
then
{A}^T(1×n) = [ a1  a2  ...  an ]  (1×n)        (2.36)
That is, the transpose of a column vector is a row matrix.
Likewise:
([A][B][C])^T = [C]^T [B]^T [A]^T        (2.38)
and
([A](m×n) {c}(n×1))^T = {c}^T [A]^T  (1×m)        (2.39)
Thus, the transpose of the product of matrices is the product of their trans-
poses in reverse order.
or
[A]^T = −[A]        (2.43)
Therefore, we have:
([A][B])^T = [B]^T [A]^T = −[B][A]        (2.45)
Likewise:
([B][A])^T = [A]^T [B]^T = [A](−[B]) = −[A][B]        (2.46)
One can conclude from this that the product of a symmetric matrix and a
skew-symmetric matrix is a skew-symmetric matrix.
and strictly greater than zero, and the associated eigenvectors are real (see
Chapter 4).
for some square matrix [B]. Neither [A] nor [B] are necessarily symmetric.
When [A] is not symmetric, [B] and [B]∗ are complex. If [B] is not complex,
then [B]∗ = [B]T . This is only ensured if [A] is symmetric. Thus, if [A] is
symmetric then [B] is also symmetric and in this case (see Chapter 4 for
proof):
{x}^T [A]{x} = {x}^T [B]^T [B]{x} ≥ 0   ∀ {x} ≠ {0}        (2.51)
and
{x}^T [A]{x} = 0   for some {x} ≠ {0}        (2.52)
when {x}i and {x}j are orthogonal with respect to [I]. Thus, when (2.56)
holds, so does (2.53). We note that (2.56) is a special case of (2.55) with
[M ] = [I].
Consider
[Aag] = [ a11  a12  a13 | b1 ]
        [ a12  a22  a23 | b2 ]        (2.64)
        [ a13  a23  a33 | b3 ]
[Aag] in this case is the (3 × 4) matrix defined by augmenting [A] by a vector whose components are b1, b2, and b3.
Definition 2.25 (Linear Dependence and Independence of Rows).
If a row of a matrix can be generated by a linear combination of the other
rows of the matrix, then this row is called linearly dependent. Otherwise,
the row is called linearly independent.
Definition 2.26 (Linear Dependence and Independence of Columns).
If a column of a matrix can be generated by a linear combination of the other
columns of the matrix, then this column is called linearly dependent. Oth-
erwise, the column is called linearly independent.
Definition 2.27 (Rank of a Matrix). The rank of a square matrix is the
number of linearly independent rows or columns. In a (n × n) square matrix,
if all rows and all columns are linearly independent, then n is the rank of
the matrix.
Definition 2.28 (Rank Deficient Matrix). In a rank deficient (n × n)
square matrix, there is at least one row and one column that can be expressed
as a linear combination of the other rows and columns. Thus, in a (n × n)
matrix of rank (n−m) there are m rows and columns that can be expressed as
linear combinations of the others. In such matrices, a reduced (n−m×n−m)
matrix can be formed by removing the linearly dependent rows and columns
that would have a rank of (n − m).
The determinant is only defined for a square matrix. Obviously, the cal-
culation of det[A] is facilitated by choosing a row or a column containing
zeros.
Find |A|.
Solution:
Determine |A| using the first row of [A].
(i) The minors m11 and m12 of a11 and a12 are given by
(ii) The cofactors of a11 and a12 are given by the signed minors of a11 and
a12 .
Find |A|.
Solution:
Determine |A| using the first row of [A].
(i) Minors m11 , m12 , and m13 of a11 , a12 , and a13 are given by
(iii)
|A| = a11 ā11 + a12 ā12 + a13 ā13
Substituting for ā11, ā12, and ā13:
|A| = a11 (1) | a22  a23 | + a12 (−1) | a21  a23 | + a13 (1) | a21  a22 |
              | a32  a33 |            | a31  a33 |           | a31  a32 |
Similarly:
| a21  a23 | = a21 a33 − a23 a31
| a31  a33 |
| a21  a22 | = a21 a32 − a22 a31
| a31  a32 |
|A| = a11 (a22 a33 − a23 a32 ) − a12 (a21 a33 − a23 a31 ) + a13 (a21 a32 − a22 a31 )
In matrix [A], row two is identical to row one. It can be shown that if [A] is a square matrix (n × n) and if any two rows are the same, then |A| = 0, regardless of n.
fi (x1 , x2 , . . . , xn ) = bi i = 1, 2, . . . , n (2.70)
in which
[A] = [ a11  a12  ...  a1n ]     {b} = { b1 }     {x} = { x1 }
      [ a21  a22  ...  a2n ]           { b2 }           { x2 }
      [  .    .         .  ]           {  . }           {  . }        (2.73)
      [ an1  an2  ...  ann ]           { bn }           { xn }
The matrix [A] is called the coefficient matrix, {b} is called the right-hand
side or non-homogeneous part, and {x} is a vector of unknowns to be deter-
mined such that (2.72) holds. Sometimes we augment [A] by {b} by including
it as (n + 1)th column in [A]. Thus augmented matrix [Aag ] would be:
[Aag] = [ a11  a12  ...  a1n | b1 ]
        [ a21  a22  ...  a2n | b2 ]
        [  .    .         .  |  . ]        (2.74)
        [ an1  an2  ...  ann | bn ]
We rewrite (2.76) by solving each equation for y, i.e., by dividing the first equation by a12 and the second equation by a22 (provided a12 ≠ 0 and a22 ≠ 0).
y = (−a11/a12) x + (b1/a12)
y = (−a21/a22) x + (b2/a22)        (2.77)
If we define
m1 = (−a11/a12)     m2 = (−a21/a22)
c1 = (b1/a12)       c2 = (b2/a22)        (2.78)
then (2.77) can be written as
y = m1 x + c1
y = m2 x + c2        (2.79)
Remarks.
(1) When the determinant of the coefficient matrix in (2.76), i.e., (a11 a22 − a12 a21), is not equal to zero, the intersection of the straight lines is distinct and we clearly have a unique solution (as shown in Figure 2.1).
a11 x + a12 y = b1
a11 x + a12 y = b2        (2.80)
[Figure 2.1: Graphical method of obtaining solution of two linear simultaneous equations; the lines y = m1 x + c1 and y = m2 x + c2 intersect at a distinct point.]
or
y = (−a11/a12) x + (b1/a12)
y = (−a11/a12) x + (b2/a12)        (2.81)
Let
m1 = −a11/a12     c1 = b1/a12     c2 = b2/a12        (2.82)
Hence, (2.81) can be written as
y = m1 x + c1
y = m1 x + c2        (2.83)
Equations (2.83) are the equations of straight lines that are parallel.
Parallel lines have the same slopes but different intercepts, thus these
will never intersect. In this case we obviously cannot find a solution
(x, y) of (2.83). We also note that the determinant of the coefficient
matrix of (2.80) is zero. In (2.80) row one of the coefficient matrix is
the same as row two and the columns are multiples of each other. This
system of equations (2.80) is rank deficient. Figure 2.2 shows plots of
(2.83).
(3) Consider a case in which column two of the coefficient matrix in (2.76)
is a multiple of column one, i.e., for a scalar s we have
a12 = sa11
a22 = sa21 (2.84)
[Figure 2.3: Infinitely many solutions when two equations are identical; the lines y = m1 x + c1 and y = m1 x + c2 coincide (c1 = c2).]
(5) Consider system of equations (2.76). It could happen that the coeffi-
cients aij ; i, j = 1, 2 are such that the determinant of the coefficient
matrix (a11 a22 − a12 a21 ) may not be zero but may be close to zero. In
this case the straight lines defined by the two equations in (2.76) do have
an intersection but their intersection may not be distinct (Figure 2.4).
(7) We clearly see that for n > 3, the graphical approach is difficult and
impractical. However, the graphical approach gives deeper insight into
the meaning of the solutions of linear simultaneous equations.
Then
x1 = (1/|A|) | b1  a12  a13 |
             | b2  a22  a23 |
             | b3  a32  a33 |

x2 = (1/|A|) | a11  b1  a13 |
             | a21  b2  a23 |        (2.91)
             | a31  b3  a33 |

x3 = (1/|A|) | a11  a12  b1 |
             | a21  a22  b2 |
             | a31  a32  b3 |
Thus, to calculate x1, we replace the first column of [A] by {b}, then divide its determinant by the determinant of [A]. For x2 and x3 we use the second and third columns of [A] with {b} respectively, with the rest of the procedure remaining the same as for x1.
Remarks.
(1) If det[A] is zero then xj ; j = 1, 2, 3 involve division by zero, i.e., they are not defined.
(2) When n is large, calculation of determinants is tedious and time consuming. Hence, this method is not preferred for large systems of linear simultaneous algebraic equations. However, unlike the graphical method, this method can be used for n ≥ 3.
x1 + x2 + x3 = 6
0.1x1 + x2 + 0.2x3 = 2.7
x1 + 0.2x2 + x3 = 4.4
In this case
[A] = [ 1    1    1   ]     {x} = { x1 }     {b} = {  6  }
      [ 0.1  1    0.2 ]           { x2 }           { 2.7 }
      [ 1    0.2  1   ]           { x3 }           { 4.4 }
in
[A]{x} = {b}
We use Cramer's rule to obtain the solution {x}. Following (2.91):
det[A] = | 1    1    1   |
         | 0.1  1    0.2 | = 0.08
         | 1    0.2  1   |
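The computation above can be sketched in a few lines of Python; this is an illustrative sketch only (the function names are not from the text), using cofactor expansion for the determinants as in (2.91). As Remark (2) below notes, this approach is not recommended for large n.

    def det(A):
        # determinant by cofactor expansion along the first row
        n = len(A)
        if n == 1:
            return A[0][0]
        total = 0.0
        for j in range(n):
            minor = [row[:j] + row[j+1:] for row in A[1:]]
            total += (-1) ** j * A[0][j] * det(minor)
        return total

    def cramer(A, b):
        # x_j = det(A with column j replaced by b) / det(A), per (2.91)
        d = det(A)
        x = []
        for j in range(len(b)):
            Aj = [row[:] for row in A]
            for i in range(len(b)):
                Aj[i][j] = b[i]
            x.append(det(Aj) / d)
        return x

    A = [[1.0, 1.0, 1.0], [0.1, 1.0, 0.2], [1.0, 0.2, 1.0]]
    b = [6.0, 2.7, 4.4]
    print(cramer(A, b))   # expected [1.0, 2.0, 3.0]; det(A) = 0.08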
upper triangular, without switching rows or columns. The row and column
locations in [Aag ] are preserved during the elimination process.
Consider (2.92) with n = 3, i.e., three linear simultaneous algebraic equa-
tions in three unknowns: x1 , x2 , and x3 . [A]{x} = {b} is given by:
[ a11  a12  a13 ] { x1 }   { b1 }
[ a21  a22  a23 ] { x2 } = { b2 }        (2.93)
[ a31  a32  a33 ] { x3 }   { b3 }
In (2.96), we note that all elements below the diagonal are zero in [A], i.e., [A] in (2.96) is in upper triangular form. This is the main objective of elimination.
x3 = b″3 / a″33        (2.98)
In this case a″33 is the pivot element. Next we can use the second equation in (2.97) to solve for x2, as x3 is already known.
x2 = (b′2 − a′23 x3) / a′22        (2.99)
Now using the first equation in (2.97), we can solve for x1 as x2 and x3 are already known.
x1 = (b1 − a12 x2 − a13 x3) / a11        (2.100)
Thus, the complete solution [x1 x2 x3]^T is known.
Remarks.
(1) The elements a11, a′22, and a″33 are pivots that are used to divide other coefficients. These cannot be zero, otherwise this method will fail.
(2) In this method we maintain the positions of rows and columns in the augmented matrix, i.e., we do not perform row and column interchanges even if zero pivots are encountered, hence the name naive Gauss elimination.
(3) It is a two-step process: in the first step we make the matrix [A] in the augmented form upper triangular using elementary row operations. In the second step we use back substitution to obtain the solution {x}.
(4) When a solution is required for more than one {b} (i.e., more than one
right side), then the matrix [A] can be augmented by all of the right side
vectors before performing elementary row operations to make [A] upper
triangular. As an example consider (2.92), a (3 × 3) system in which we
desire solutions {x} for {b} = {p} and {b} = {q}.
{b} = {p} = { p1 }        {b} = {q} = { q1 }
            { p2 }                    { q2 }        (2.101)
            { p3 }                    { q3 }
and
Now we can use back substitution for (2.104) and (2.105) to find solutions
for {x} for {b} = {p} and {b} = {q}.
x1 + x2 + x3 = 6
0.1x1 + x2 + 0.2x3 = 2.7
x1 + 0.2x2 + x3 = 4.4
where
[A] = [ 1    1    1   ]     {x} = { x1 }     {b} = {  6  }
      [ 0.1  1    0.2 ]           { x2 }           { 2.7 }
      [ 1    0.2  1   ]           { x3 }           { 4.4 }
Make column one in [Aag ] upper triangular by using the elementary row
operations shown below.
                     [ 1    1    1   |  6   ]
R2 − (0.1/1) R1      [ 0    0.9  0.1 |  2.1 ]
R3 − (1/1) R1        [ 0   −0.8  0   | −1.6 ]
Next we make column two in the modified [Aag] upper triangular using the elementary row operations shown below.
                     [ 1    1    1                  |  6                    ]
                     [ 0    0.9  0.1                |  2.1                  ]
R3 − (−0.8/0.9) R2   [ 0    0    0 + (0.8/0.9)(0.1) | −1.6 + (0.8/0.9)(2.1) ]
which simplifies to
[ 1    1    1     |  6    ]
[ 0    0.9  0.1   |  2.1  ]
[ 0    0    0.8/9 |  0.8/3 ]
Back Substitution
From the third row in the upper triangular form:
(0.8/9) x3 = 0.8/3
∴ x3 = 3
From the second row, 0.9 x2 + 0.1 x3 = 2.1, so with x3 = 3:
∴ x2 = 2
From the first row:
x1 = 6 − x2 − x3 = 6 − 2 − 3 = 1
Hence,
{x} = { x1 }   { 1 }
      { x2 } = { 2 }
      { x3 }   { 3 }
The solution {x} is the same as that obtained using Cramer’s rule.
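A minimal Python sketch of the two-step process just illustrated (forward elimination on the augmented matrix followed by back substitution), assuming no zero pivots are encountered; the function name is illustrative only.

    def naive_gauss(A, b):
        n = len(b)
        # form the augmented matrix [A | b]
        aug = [A[i][:] + [b[i]] for i in range(n)]
        # forward elimination: make [A] upper triangular
        for k in range(n - 1):
            for i in range(k + 1, n):
                factor = aug[i][k] / aug[k][k]      # fails if the pivot is zero
                for j in range(k, n + 1):
                    aug[i][j] -= factor * aug[k][j]
        # back substitution
        x = [0.0] * n
        for i in range(n - 1, -1, -1):
            s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
            x[i] = (aug[i][n] - s) / aug[i][i]
        return x

    A = [[1.0, 1.0, 1.0], [0.1, 1.0, 0.2], [1.0, 0.2, 1.0]]
    b = [6.0, 2.7, 4.4]
    print(naive_gauss(A, b))   # approximately [1.0, 2.0, 3.0]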
In some cases the coefficients in the system of equations (2.106) may be such that a11 = 0 even though the system of equations (2.106) does have a unique solution. In this case the naive Gauss elimination method will fail due to the fact that we must divide by the pivot a11. In such situations we can employ partial pivoting that helps in avoiding zero pivots. This procedure involves the interchange of rows for a column under consideration during upper triangulation such that the largest element (absolute value) in this column becomes the pivot. This is followed by the upper triangulation for the column under consideration. This procedure is continued for subsequent columns, keeping in mind that the columns (and corresponding rows) that are already in upper triangular form are exempted, i.e., are not considered in the subsequent pivot search.
x1 + x2 + x3 = 6
8x1 + 1.6x2 + 8x3 = 35.2
0.1x1 + x2 + 0.2x3 = 2.7
We want to make column one upper triangular by using the largest element,
i.e., 8, as the pivot. This requires that we interchange rows one and two in
the augmented matrix.
R1 ↔ R2     [ 8    1.6  8   | 35.2 ]
            [ 1    1    1   |  6   ]
            [ 0.1  1    0.2 |  2.7 ]
We then make column one upper triangular by using elementary row oper-
ations.
                     [ 8    1.6   8   | 35.2  ]
R2 − (1/8) R1        [ 0    0.8   0   |  1.6  ]
R3 − (0.1/8) R1      [ 0    0.98  0.1 |  2.26 ]
Next we consider column two. The elements on the diagonal and below the
diagonal in column two are 0.8 and 0.98. We want to use 0.98 as the pivot
(the larger of the two). This requires that we interchange rows two and
three.
R2 ↔ R3     [ 8    1.6   8   | 35.2  ]
            [ 0    0.98  0.1 |  2.26 ]
            [ 0    0.8   0   |  1.6  ]
We now make column two upper triangular by elementary row operations.
                       [ 8    1.6   8   | 35.2  ]
                       [ 0    0.98  0.1 |  2.26 ]
R3 − (0.8/0.98) R2     [ 0    0.8 − (0.8/0.98)(0.98)   0 − (0.8/0.98)(0.1)   1.6 − (0.8/0.98)(2.26) ]
which after simplification becomes:
[ 8    1.6   8       | 35.2    ]
[ 0    0.98  0.1     |  2.26   ]
[ 0    0    −0.0816  | −0.2449 ]
This augmented form contains the final upper triangular form of [A]. Now
we can find x3 , x2 , and x1 using back substitution. Using the last equation:
x3 = (−0.2448980) / (−0.0816327) = 3
Using the second equation with the known value of x3:
x2 = (2.26 − 0.1 x3) / 0.98 = (2.26 − 0.1(3)) / 0.98 = 1.96 / 0.98 = 2
Now, we can find x1 using the first equation and the known values of x2 and x3:
x1 = (1/8)(35.2 − 1.6 x2 − 8 x3) = (1/8)(35.2 − 1.6(2) − 8(3)) = 8/8 = 1
Hence we have the solution {x}:
{x} = { x1 }   { 1 }
      { x2 } = { 2 }
      { x3 }   { 3 }
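A Python sketch of Gauss elimination with partial pivoting, under the assumption that [A] is nonsingular; at each column the row with the largest magnitude entry on or below the diagonal is swapped into the pivot position, as described above. Names are illustrative.

    def gauss_partial_pivot(A, b):
        n = len(b)
        aug = [A[i][:] + [b[i]] for i in range(n)]
        for k in range(n - 1):
            # choose as pivot the largest |a_ik| for i >= k, and swap rows
            p = max(range(k, n), key=lambda i: abs(aug[i][k]))
            aug[k], aug[p] = aug[p], aug[k]
            for i in range(k + 1, n):
                factor = aug[i][k] / aug[k][k]
                for j in range(k, n + 1):
                    aug[i][j] -= factor * aug[k][j]
        x = [0.0] * n
        for i in range(n - 1, -1, -1):
            s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
            x[i] = (aug[i][n] - s) / aug[i][i]
        return x

    A = [[1.0, 1.0, 1.0], [8.0, 1.6, 8.0], [0.1, 1.0, 0.2]]
    b = [6.0, 35.2, 2.7]
    print(gauss_partial_pivot(A, b))   # approximately [1.0, 2.0, 3.0]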
2. Search the entire matrix [A] for the element with the largest magnitude
(absolute value).
5. Next consider column two of the reduced sub-matrix without row one
and column one. Search for the element with largest magnitude (absolute
value) in this sub-matrix. Perform row and column interchanges in the
augmented matrix (with column one in upper triangular form) so that the
element with the largest magnitude is the pivot for column two. Make
column two upper triangular by elementary row operations.
6. We continue this procedure for the remaining columns until [Aag ] becomes
upper triangular.
7. The solution {x} is then calculated using the upper triangular form of [A], obtaining the components of {x} in reverse order, i.e., xn, xn−1, . . . , x1.
Remarks.
(3) It is important to note that row interchanges do not affect the order of the variables in vector {x}, but column interchanges require that we also interchange the corresponding unknowns in {x}, i.e., keep track of the new ordering of the variables.
Consider the (2 × 2) sub-matrix (i.e., a022 , a023 , a032 , and a033 ). The element
with the largest magnitude is 2.9 (a023 ). We want 2.9 to be pivot, i.e., at
location (2,2). This could be done in two ways:
(i) First interchange row two with row three and then interchange columns
two and three.
(ii) Alternatively, first interchange columns two and three and then inter-
change rows two and three.
Regardless of whether we choose (i) or (ii), the end result is the same. In
the following we consider (i).
Interchange rows two and three.
              x1    x2    x3
            [ 8    1.6   8    | 35.2  ]
            [ 0    0.98  2.9  | 10.66 ]
            [ 0    0.8   0    |  4.6  ]
Now interchange columns two and three. In doing so, we should also interchange x2 and x3 respectively.
              x1    x3    x2
            [ 8    8     1.6  | 35.2  ]
            [ 0    2.9   0.98 | 10.66 ]
            [ 0    0     0.8  |  4.6  ]
Since a032 is already zero, column two is already in upper triangular form,
hence no elementary row operations are required. This system of equations
are in the desired upper triangular form. Now, we can use back substitution
to calculate x2 , x3 , and x1 (in this order).
Consider the last equation from which we can calculate x2.
x2 = 4.6 / 0.8 = 5.75
Using the second equation and the known value of x2 we can calculate x3.
x3 = (10.66 − 0.98 x2) / 2.9 = (10.66 − 0.98(5.75)) / 2.9 = 1.73
Now using the first equation, we can calculate x1.
x1 = (35.2 − 8 x3 − 1.6 x2) / 8 = (35.2 − 8(1.73) − 1.6(5.75)) / 8 = 1.52
Hence
{x} = { x1 }   { 1.52 }
      { x2 } = { 5.75 }
      { x3 }   { 1.73 }
Comparing (2.115) with (2.117) suggests that if we augment [A] by {b} and then make [A] an identity matrix by elementary row operations, the modified {b} will be the solution {x}.
Consider a general (3 × 3) system of linear simultaneous algebraic equa-
tions.
Augment the coefficient matrix [A] of the coefficients aij by the right side
vector {b}.
[Aag] = [ a11  a12  a13 | b1 ]
        [ a21  a22  a23 | b2 ]        (2.120)
        [ a31  a32  a33 | b3 ]
In the first step our goal is to make a11 unity and to bring column one into upper triangular form using elementary row operations. First, we make a11 unity.
R1 → R1/a11     [ 1    a′12  a′13 | b′1 ]
                [ a21  a22   a23  | b2  ]        (2.121)
                [ a31  a32   a33  | b3  ]
Make column one in (2.121) upper triangular using elementary row operations.
                [ 1    a′12  a′13 | b′1 ]
R2 − a21 R1     [ 0    a′22  a′23 | b′2 ]        (2.122)
R3 − a31 R1     [ 0    a′32  a′33 | b′3 ]
Next, divide row two by a′22 so that the diagonal element in column two becomes unity (equation (2.123)).
We make the elements of column two below the diagonal zero by using row two and elementary row operations in (2.123).
                [ 1    a′12  a′13 | b′1 ]
                [ 0    1     a″23 | b″2 ]        (2.124)
R3 − a′32 R2    [ 0    0     a″33 | b″3 ]
Next, divide row three by a″33 so that the diagonal element in column three becomes unity (equation (2.125)). Make the elements of column three in (2.125) above the diagonal zero by using row three and elementary row operations.
                [ 1    a′12  0 | b‴1 ]
R2 − a″23 R3    [ 0    1     0 | b‴2 ]        (2.126)
                [ 0    0     1 | b‴3 ]
Lastly, make the elements of column two in (2.126) above the diagonal zero using row two and elementary row operations.
R1 − a′12 R2    [ 1  0  0 | b‴1 ]
                [ 0  1  0 | b‴2 ]        (2.127)
                [ 0  0  1 | b‴3 ]
In (2.127), the vector [b‴1  b‴2  b‴3]^T is the solution vector {x} = [x1  x2  x3]^T.
Remarks.
(1) If the solution of (2.115) is required for more than one right-hand side,
then [A] in (2.121) must be augmented by all right-hand sides before
making [A] identity. When [A] becomes identity, the locations of the
right-hand side column vectors contain solutions for them.
or
[ 1  0  0 | 1 ]
[ 0  1  0 | 2 ]        (2.134)
[ 0  0  1 | 3 ]
The vector in the location of {b} in (2.134) is the solution vector {x}. Thus:
{x} = { x1 }   { 1 }
      { x2 } = { 2 }        (2.135)
      { x3 }   { 3 }
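A Python sketch of the Gauss-Jordan procedure applied to the augmented matrix, assuming nonzero pivots (no pivoting is performed, as in the description above); the function name is illustrative.

    def gauss_jordan(A, b):
        n = len(b)
        aug = [A[i][:] + [b[i]] for i in range(n)]
        for k in range(n):
            # normalize the pivot row so the diagonal entry becomes unity
            pivot = aug[k][k]                      # assumed nonzero
            aug[k] = [v / pivot for v in aug[k]]
            # zero all of column k except the pivot row
            for i in range(n):
                if i != k:
                    factor = aug[i][k]
                    aug[i] = [aug[i][j] - factor * aug[k][j] for j in range(n + 1)]
        return [aug[i][n] for i in range(n)]       # [A] is now [I]; last column is {x}

    A = [[1.0, 1.0, 1.0], [0.1, 1.0, 0.2], [1.0, 0.2, 1.0]]
    b = [6.0, 2.7, 4.4]
    print(gauss_jordan(A, b))   # approximately [1.0, 2.0, 3.0]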
Consider
[A]{x} = {b} (2.136)
The rules for determining the coefficients of [L] and [U ] are established by
forming the product [L][U ] and equating the elements of the product to the
corresponding elements of [A]. We present details in the following. Consider
First Row of [U ]:
To determine the first row of [U ] we equate coefficients in the first row
on both sides of (2.139).
That is, the first row of [U ] is the same as the first row of the coefficient
matrix [A].
Second Row of [U ]:
Equate the coefficients of the second row on both sides of (2.139).
Third Row of [U ]:
Equate coefficients of the third row on both sides of (2.139).
Remarks.
(1) The coefficients of [U] and [L] can be expressed more compactly as follows:

u1j = a1j                                          j = 1, 2, . . . , n   (i = 1)

Li1 = ai1 / u11                                    i = 2, 3, . . . , n   (j = 1)

uij = aij − Σ (k = 1 to i−1) Lik ukj               i = 2, 3, . . . , n ;  j = i, i+1, . . . , n   (for each value of i)        (2.147)

Lij = ( aij − Σ (k = 1 to j−1) Lik ukj ) / ujj     j = 2, 3, . . . , n ;  i = j+1, . . . , n

Using n = 4 in (2.147) we can obtain (2.140) – (2.146). The form in (2.147) is helpful in programming [L][U] decomposition.
(2) We can economize in the storage of the coefficients of [L] and [U].
(i) There is no need to store zeros in either [L] or [U].
(ii) Ones on the diagonal of [L] do not need to be stored either.
(iii) A closer examination of the expressions for the coefficients of [L] and [U] shows that once the elements aij of [A] are used, they do not appear again in the further calculations of the coefficients of [L] and [U].
(iv) Thus we can store the coefficients of [L] and [U] in the same storage space as [A]:
uij stored in the same locations as aij ;   i = 1, 2, . . . , n ;   j = i, i+1, . . . , n (for each i)
Lij stored in the same locations as aij ;   j = 1, 2, . . . , n ;   i = j+1, . . . , n
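A Python sketch of [L][U] decomposition following the compact formulas (2.147), together with the forward and backward passes; the names are illustrative, and no pivoting or zero-pivot protection is included.

    def lu_doolittle(A):
        # [L] carries the unit diagonal; formulas follow (2.147)
        n = len(A)
        L = [[0.0] * n for _ in range(n)]
        U = [[0.0] * n for _ in range(n)]
        for i in range(n):
            L[i][i] = 1.0
        for i in range(n):
            for j in range(i, n):       # row i of [U]
                U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
            for r in range(i + 1, n):   # column i of [L]
                L[r][i] = (A[r][i] - sum(L[r][k] * U[k][i] for k in range(i))) / U[i][i]
        return L, U

    def lu_solve(L, U, b):
        n = len(b)
        y = [0.0] * n                   # forward pass: [L]{y} = {b}
        for i in range(n):
            y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
        x = [0.0] * n                   # backward pass: [U]{x} = {y}
        for i in range(n - 1, -1, -1):
            x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
        return x

    A = [[3.0, -0.1, -0.2], [0.1, 7.0, -0.3], [0.3, -0.2, 10.0]]
    b = [7.85, -19.3, 71.4]
    L, U = lu_doolittle(A)
    print(lu_solve(L, U, b))   # approximately [3.0, -2.5, 7.0]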
[A]{x} = {b}
in which
[A] = [ 3    −0.1  −0.2 ]     {x} = { x1 }     {b} = {  7.85 }
      [ 0.1   7    −0.3 ]           { x2 }           { −19.3 }
      [ 0.3  −0.2  10   ]           { x3 }           {  71.4 }

[L] = [ 1       0        0 ]     [U] = [ 3   −0.1      −0.2     ]
      [ 0.0333  1        0 ]           [ 0    7.00333  −0.29333 ]
      [ 0.1    −0.02713  1 ]           [ 0    0         10.012  ]
Let
[U]{x} = {y}
∴ [L]{y} = {b}
[ 1       0        0 ] { y1 }   {  7.85 }
[ 0.0333  1        0 ] { y2 } = { −19.3 }
[ 0.1    −0.02713  1 ] { y3 }   {  71.4 }
y1 = 7.85
y2 = −19.3 − (0.0333)(7.85) = −19.561405
y3 = 71.4 − (0.1)(7.85) − (−0.02713)(−19.561405) = 70.0843
Now we know {y}, hence we can use [U]{x} = {y} to find {x} (backward pass).
[ 3   −0.1      −0.2     ] { x1 }   {   7.85    }
[ 0    7.00333  −0.29333 ] { x2 } = { −19.56125 }
[ 0    0         10.012  ] { x3 }   {  70.0843  }
The rules for determining the elements of [L] and [U ] are established by
forming the product [L][U ] and equating the elements of the product to the
corresponding elements of [A]. We present details in the following. Consider
[A] to be a (4 × 4) matrix. We equate [A] to the product of [L] and [U ].
[ a11  a12  a13  a14 ]   [ L11  0    0    0   ] [ 1  u12  u13  u14 ]
[ a21  a22  a23  a24 ] = [ L21  L22  0    0   ] [ 0  1    u23  u24 ]        (2.153)
[ a31  a32  a33  a34 ]   [ L31  L32  L33  0   ] [ 0  0    1    u34 ]
[ a41  a42  a43  a44 ]   [ L41  L42  L43  L44 ] [ 0  0    0    1   ]
To determine the elements of [L] and [U ], we form the product of [L] and
[U ] in (2.153).
[ a11  a12  a13  a14 ]   [ L11   L11 u12          L11 u13                   L11 u14                         ]
[ a21  a22  a23  a24 ] = [ L21   L21 u12 + L22    L21 u13 + L22 u23         L21 u14 + L22 u24               ]        (2.154)
[ a31  a32  a33  a34 ]   [ L31   L31 u12 + L32    L31 u13 + L32 u23 + L33   L31 u14 + L32 u24 + L33 u34     ]
[ a41  a42  a43  a44 ]   [ L41   L41 u12 + L42    L41 u13 + L42 u23 + L43   L41 u14 + L42 u24 + L43 u34 + L44 ]
First Row of [U]:
Equating elements of the first row on both sides of (2.154):
u1j = a1j / L11 ;   j = 1, 2, . . . , 4 (or n in general)        (2.156)
Second Column of [L]:
Equating elements of the second column on both sides of (2.154):
L22 = a22 − L21 u12
L32 = a32 − L31 u12        (2.157)
L42 = a42 − L41 u12
Second Row of [U]:
Equating elements of the second row on both sides of (2.154):
u23 = (a23 − L21 u13) / L22
u24 = (a24 − L21 u14) / L22        (2.158)
Third Row of [U ]:
Equating the elements of the third row in (2.154):
Remarks.
(1) The elements of [L] and [U ] can be expressed more compactly as follows:
Li1 = ai1                                          i = 1, 2, . . . , n   (j = 1 for this case)

u1j = a1j / L11                                    j = 2, 3, . . . , n   (i = 1 for this case)

Lij = aij − Σ (k = 1 to j−1) Lik ukj               j = 2, 3, . . . , n ;  i = j, j+1, . . . , n   (for each j)        (2.162)

uij = ( aij − Σ (k = 1 to i−1) Lik ukj ) / Lii     i = 2, 3, . . . , n ;  j = i+1, i+2, . . . , n
(2) Just like Cholesky decomposition, [L] and [U ] can be stored in the same
space that is used for [A], however [A] is obviously destroyed in this case.
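For comparison with the earlier sketch, a Python sketch of the Crout form following (2.162), in which [U] carries the unit diagonal; the names are illustrative and no zero-pivot protection is included.

    def lu_crout(A):
        # [U] has unit diagonal; formulas follow (2.162)
        n = len(A)
        L = [[0.0] * n for _ in range(n)]
        U = [[0.0] * n for _ in range(n)]
        for i in range(n):
            U[i][i] = 1.0
        for j in range(n):
            for i in range(j, n):       # column j of [L]
                L[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
            for c in range(j + 1, n):   # row j of [U]
                U[j][c] = (A[j][c] - sum(L[j][k] * U[k][c] for k in range(j))) / L[j][j]
        return L, U

    A = [[3.0, -0.1, -0.2], [0.1, 7.0, -0.3], [0.3, -0.2, 10.0]]
    L, U = lu_crout(A)
    print(L[1][1], U[1][2])   # about 7.003333 and -0.0418848, as in the example below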
[A]{x} = {b}
in which
[A] = [ 3    −0.1  −0.2 ]     {x} = { x1 }     {b} = {  7.85 }
      [ 0.1   7    −0.3 ]           { x2 }           { −19.3 }
      [ 0.3  −0.2  10   ]           { x3 }           {  71.4 }
L11 = a11 = 3
L21 = a21 = 0.1
L31 = a31 = 0.3
First Row of [U]:
u12 = a12 / L11 = −0.1 / 3 = −0.033333
u13 = a13 / L11 = −0.2 / 3 = −0.066667
Second Column of [L]:
Second Row of [U]:
u23 = (a23 − L21 u13) / L22 = (−0.3 − 0.1(−0.066667)) / 7.003333
u23 = −0.0418848
[A] = [L][U] = [ 3     0         0         ] [ 1  −0.033333  −0.066667  ]
               [ 0.1   7.003333  0         ] [ 0   1         −0.0418848 ]
               [ 0.3  −0.19      10.012042 ] [ 0   0          1         ]
By taking the product of [L] and [U] we recover [A].
∴ [L]{y} = {b}
Using the forward pass, calculate {y} using [L]{y} = {b}.
[ 3     0         0         ] { y1 }   {  7.85 }
[ 0.1   7.003333  0         ] { y2 } = { −19.3 }
[ 0.3  −0.19      10.012042 ] { y3 }   {  71.4 }
y1 = 7.85 / 3 = 2.616667
y2 = (−19.3 − (0.1)(2.616667)) / 7.003333 = −2.7931936
y3 = (71.4 − (0.3)(2.616667) − (−0.19)(−2.7931936)) / 10.012042 = 7
∴ {y} = { y1 }   {  2.616667  }
        { y2 } = { −2.7931936 }
        { y3 }   {  7         }
Consider
[A]{x} = {b} (2.163)
When making the first column in (2.164) upper triangular, we need to make a21 and a31 zero by elementary row operations. In this process we multiply row one by a21/a11 = C21 and by a31/a11 = C31 and then subtract these from rows two and three of (2.164). This results in
[ a11  a12   a13  ] { x1 }   { b1  }
[ 0    a′22  a′23 ] { x2 } = { b′2 }        (2.165)
[ 0    a′32  a′33 ] { x3 }   { b′3 }
To make the second column upper triangular we multiply the second row in (2.165) by a′32/a′22 = C32 and subtract it from row three of (2.165).
[ a11  a12   a13  ] { x1 }   { b1  }
[ 0    a′22  a′23 ] { x2 } = { b′2 }        (2.166)
[ 0    0     a″33 ] { x3 }   { b″3 }
or
[ 1    0    0 ] [ a11  a12   a13  ] { x1 }   { b1 }
[ C21  1    0 ] [ 0    a′22  a′23 ] { x2 } = { b2 }        (2.168)
[ C31  C32  1 ] [ 0    0     a″33 ] { x3 }   { b3 }
By using C21 = a21/a11, C31 = a31/a11, and C32 = a′32/a′22 in (2.168) and by carrying out the product of [L] and [U] in (2.168), the matrix [A] is recovered.
C32 = a′32 / a′22 = −0.19 / 7.003333 = −0.027130        (2.171)
The new upper triangular form of [A] is in fact [U] and is given by:
[U] = [ 3   −0.1       −0.2      ]
      [ 0    7.003333  −0.293333 ]        (2.172)
      [ 0    0          10.012   ]
and
[L] = [ 1    0    0 ]   [ 1    0    0 ]   [ 1       0        0 ]
      [ L21  1    0 ] = [ C21  1    0 ] = [ 0.0333  1        0 ]        (2.173)
      [ L31  L32  1 ]   [ C31  C32  1 ]   [ 0.1    −0.02713  1 ]
We can check that the product of [L] in (2.173) and [U ] in (2.172) is in fact
[A].
= [ (L̃11)²          (L̃11)(L̃21)                 (L̃11)(L̃31)                  ]
  [ (L̃21)(L̃11)     (L̃21)² + (L̃22)²            (L̃21)(L̃31) + (L̃22)(L̃32)     ]        (2.176)
  [ (L̃31)(L̃11)     (L̃31)(L̃21) + (L̃32)(L̃22)    (L̃31)² + (L̃32)² + (L̃33)²    ]
Using (2.188):
[I]{x} = [A]−1 {b} (2.190)
or
{x} = [A]−1 {b} (2.191)
Thus, if we can find [A]−1 then the solution {x} of (2.187) can be obtained using (2.191).
4. Find the cofactor matrix of [A], i.e., [Ā], by using cofactors āij ; i, j = 1, 2, . . . , n.
[Ā] = [ ā11  ā12  ...  ā1n ]
      [ ā21  ā22  ...  ā2n ]        (2.193)
      [  .    .         .  ]
      [ ān1  ān2  ...  ānn ]
[ [A]  [I]  {b} ]        (2.200)
Using the Gauss-Jordan method, when [A] is made identity using elementary row operations, we have:
[ [I]  [A]−1  {x} ]        (2.201)
The location of [I] in (2.200) contains [A]−1 and the location of {b} in (2.200) contains the solution vector {x}.
Let
[B] = [A]−1 (2.203)
To obtain the first column of [B] we solve the following system of equations:
[L][U] { b11 }   { 1 }
       { b21 }   { 0 }
       {  .  } = { . }        (2.204)
       { bi1 }   { . }
       {  .  }   { . }
       { bn1 }   { 0 }
Here {b11, b21, . . . , bn1} is the first column of [B] and the right-hand side is the first column of [I], i.e., unity in the first row and zeros elsewhere.
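A Python sketch of computing [A]−1 column by column as in (2.204); it reuses the lu_doolittle and lu_solve sketches given earlier in this chapter (illustrative names, not from the text).

    def inverse_via_lu(A):
        # column j of [B] = inverse of [A] solves [A]{b_j} = {e_j}, the j-th column of [I]
        n = len(A)
        L, U = lu_doolittle(A)                   # factor once
        cols = []
        for j in range(n):
            e = [1.0 if i == j else 0.0 for i in range(n)]
            cols.append(lu_solve(L, U, e))       # one forward/backward pass per column
        # transpose the list of columns into the rows of [B]
        return [[cols[j][i] for j in range(n)] for i in range(n)]

    A = [[3.0, -0.1, -0.2], [0.1, 7.0, -0.3], [0.3, -0.2, 10.0]]
    B = inverse_via_lu(A)
    # check: the product [A][B] should be close to [I]
    print([[round(sum(A[i][k] * B[k][j] for k in range(3)), 6) for j in range(3)] for i in range(3)])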
(iv) Use the {x} vector from (2.212) in (2.209) to solve for x2, say x″2.
(v) Update the {x} in (2.212) by replacing x̃2 with x″2, hence the updated {x} becomes:
{x} = { x′1 }
      { x″2 }        (2.213)
      { x̃3 }
(vi) Use the vector {x} from (2.213) in (2.210) to solve for x3, say x‴3.
Convergence Criterion
Let {x}j−1 and {x}j be two successive solutions at the end of (j − 1)th
and j th iterations. We consider the iterative solution procedure converged
when the corresponding components of {x}j−1 and {x}j are within a preset
tolerance ∆.
(εi)j is the percentage error in the ith component of {x}, i.e., xi, based on the most up-to-date solution for the ith component of {x}, (xi)j. When (2.215) is satisfied, we consider the iterative process converged and we have an approximation {x}j of the true solution {x} of (2.207).
Solve for x1, x2, and x3 using the first, second, and third equations in (2.216).
x1 = (7.85 + 0.1 x2 + 0.2 x3) / 3        (2.217)
x2 = (−19.3 − 0.1 x1 + 0.3 x3) / 7        (2.218)
x3 = (71.4 − 0.3 x1 + 0.2 x2) / 10        (2.219)
(i) Choose
{x} = { x̃1 }   { 0 }
      { x̃2 } = { 0 }        (2.220)
      { x̃3 }   { 0 }
(ii) Solve for x1 using (2.217) and (2.220) and denote the new value of x1 by x′1.
x1 = (7.85 + 0 + 0) / 3 = 2.616667 = x′1        (2.221)
(iii) Using the new value of x1, i.e. x′1, update the starting solution vector (2.220).
{x} = { x′1 }   { 2.616667 }
      { x̃2 } = { 0        }        (2.222)
      { x̃3 }   { 0        }
(iv) Using the most recent {x} (2.222), calculate x2 using (2.218) and denote the new value of x2 by x″2.
x2 = (−19.3 − 0.1(2.616667) + 0) / 7 = −2.794524 = x″2        (2.223)
(vi) Calculate x3 using (2.219) and (2.224) and denote the new value of x3 by x‴3.
x3 = (71.4 − 0.3(2.616667) + 0.2(−2.794524)) / 10 = 7.00561 = x‴3        (2.225)
(vii) Update {x} in (2.224) using the new value of x3, i.e. x‴3.
{x}1 = { x′1 }   {  2.616667 }
       { x″2 } = { −2.794524 }        (2.226)
       { x‴3 }   {  7.00561  }
Steps (i)-(vii) complete the first iteration. At the end of the first iteration,
{x} in (2.226) is the most recent estimate of the solution. We denote this
by {x}1 .
Using (2.226) as the initial solution for the second iteration and repeating
steps (ii)-(vii), the second iteration would yield the following new estimate
of the solution {x}.
x′1 = (7.85 + 0.1(−2.794524) + 0.2(7.005610)) / 3 = 2.990557
x″2 = (−19.3 − 0.1(2.990557) + 0.3(7.005610)) / 7 = −2.499625        (2.227)
x‴3 = (71.4 − 0.3(2.990557) + 0.2(−2.499625)) / 10 = 7.000291
Thus at the end of the second iteration the solution vector {x} is
{x}2 = {  2.990557 }
       { −2.499625 }        (2.228)
       {  7.000291 }
Using {x}1 and {x}2 in (2.226) and (2.228), we can compute (εi)2 (using (2.215)).
(ε1)2 = | (2.990557 − 2.616667) / 2.990557 | × 100 = 12.5%
(ε2)2 = | (−2.499625 − (−2.794524)) / (−2.499625) | × 100 = 11.8%        (2.229)
(ε3)2 = | (7.000291 − 7.00561) / 7.000291 | × 100 = 0.076%
Thus, at the end of iteration 6, {x}6 is the converged solution in which each component of {ε}6 < 10⁻⁷.
Thus, at the end of iteration 8, {x}8 is the converged solution in which each component of {ε}8 < 10⁻⁷.
Remarks.
(1) In Gauss-Seidel method, we begin with a starting or assumed solution
vector and obtain new values of the components of {x} individually.
The new computed vector {x} is used as the starting vector for the next
iteration.
(2) We observe that the coefficient matrix [A] in (2.216), if well-conditioned, generally has the largest elements on the diagonal, i.e., it is a diagonally dominant coefficient matrix. Iterative methods have good convergence characteristics for algebraic systems with such coefficient matrices.
(3) The choice of starting vector is crucial. Sometimes the physics from
which the algebraic equations are derived provides enough information to
prudently select a starting vector. When this information is not available
or helpful, null or unity vectors are often useful as initial guess solutions.
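A Python sketch of the Gauss-Seidel iteration as described above, with the percentage-error convergence test of (2.215); the names and the default tolerance are illustrative.

    def gauss_seidel(A, b, x0, tol=1.0e-7, max_iter=100):
        # each new component overwrites the old one immediately, as in (2.217)-(2.219)
        n = len(b)
        x = x0[:]
        for it in range(1, max_iter + 1):
            max_err = 0.0
            for i in range(n):
                s = sum(A[i][j] * x[j] for j in range(n) if j != i)
                new_xi = (b[i] - s) / A[i][i]
                if new_xi != 0.0:
                    max_err = max(max_err, abs((new_xi - x[i]) / new_xi) * 100.0)
                x[i] = new_xi
            if max_err <= tol:          # percentage-error criterion, as in (2.215)
                return x, it
        return x, max_iter

    A = [[3.0, -0.1, -0.2], [0.1, 7.0, -0.3], [0.3, -0.2, 10.0]]
    b = [7.85, -19.3, 71.4]
    x, iters = gauss_seidel(A, b, [0.0, 0.0, 0.0])
    print(x, iters)   # approximately [3.0, -2.5, 7.0] after a few iterations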
or
{x} = {b̂} − ([Â] − [I]){x} (2.233)
It is more convenient to write (2.233) in the following form for performing
iterations.
{x}j+1 = {b̂} − ([Â] − [I]){x}j (2.234)
{x}j+1 is the most recent estimate of {x} and {x}j is the immediately pre-
ceding estimate of {x}.
(i) Assume a guess or initial vector {x}1 for {x} (i.e., j = 1 in (2.234)).
This could be a null vector, a unit vector, or any other appropriate
choice.
(ii) Use (2.234) to solve for {x}2 . {x}2 is the improved estimate of the
solution.
(iii) Check for convergence using the criterion defined in the next section
or using the same criterion as in the case of Gauss-Seidel method,
(2.215). We repeat steps (ii)-(iii) if the most recent estimate of {x} is
not converged.
or
{x}j+1 = {b̂} − [B̂]{x}j (2.236)
For j = 1:
{x}2 = {b̂} − [B̂]{x}1 (2.237)
For j = 2:
{x}3 = {b̂} − [B̂]{x}2 (2.238)
Substitute {x}2 from (2.237) into (2.238).
For j = 3:
{x}4 = {b̂} − [B̂]{x}3 (2.241)
Substituting for {x}3 from (2.240) in (2.241) and rearranging terms:
Thus, at the end of iteration 8, {x}8 is the converged solution in which each component of {ε}8 < 10⁻⁷.
Thus, at the end of iteration 17, {x}17 is the converged solution in which each component of {ε}17 < 10⁻⁷.
Remarks.
(1) We note that the convergence characteristics of the Jacobi method for
this example are much poorer than the Gauss-Seidel method in terms of
the number of iterations.
(2) The observation in Remark (1) is not surprising due to the fact that in the Jacobi method the entire solution from the previous iteration is used to compute the new solution, whereas in the Gauss-Seidel method the most recently updated values of the components of {x} are used immediately.
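A Python sketch of the Jacobi iteration; in contrast to the Gauss-Seidel sketch above, the entire previous iterate is held fixed while the new one is computed. Names are illustrative.

    def jacobi(A, b, x0, tol=1.0e-7, max_iter=200):
        n = len(b)
        x = x0[:]
        for it in range(1, max_iter + 1):
            x_new = [0.0] * n
            for i in range(n):
                s = sum(A[i][j] * x[j] for j in range(n) if j != i)
                x_new[i] = (b[i] - s) / A[i][i]
            max_err = max((abs((x_new[i] - x[i]) / x_new[i]) * 100.0
                           for i in range(n) if x_new[i] != 0.0), default=0.0)
            x = x_new
            if max_err <= tol:
                return x, it
        return x, max_iter

    A = [[3.0, -0.1, -0.2], [0.1, 7.0, -0.3], [0.3, -0.2, 10.0]]
    b = [7.85, -19.3, 71.4]
    print(jacobi(A, b, [0.0, 0.0, 0.0]))   # same solution, but more iterations than Gauss-Seidel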
2. If 0 < λ < 1, the vectors {x}c and {x}p get multiplied with factors be-
tween 0 and 1. For this choice of λ, the method is called under-relaxation.
3. If 1 < λ < 2, then the vector {x}c is multiplied with a factor greater
than one and {x}p is multiplied with a factor that is less than zero. The
motivation for this is that {x}c is supposedly more accurate than {x}p ,
hence a bigger weight factor is appropriate to assign to {x}c compared to
{x}p . For this choice of λ, the method is called successive or simultaneous
over-relaxation or SOR.
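A Python sketch of successive over-relaxation built on the Gauss-Seidel update, with the weighting of {x}c and {x}p described in the items above; the relaxation factor used in the call is only an illustrative choice.

    def sor(A, b, x0, lam=1.2, tol=1.0e-7, max_iter=200):
        # lam = 1 recovers Gauss-Seidel; 0 < lam < 1 under-relaxes, 1 < lam < 2 over-relaxes
        n = len(b)
        x = x0[:]
        for it in range(1, max_iter + 1):
            max_err = 0.0
            for i in range(n):
                s = sum(A[i][j] * x[j] for j in range(n) if j != i)
                xc = (b[i] - s) / A[i][i]                # Gauss-Seidel value {x}_c
                new_xi = lam * xc + (1.0 - lam) * x[i]   # weighted with the previous value {x}_p
                if new_xi != 0.0:
                    max_err = max(max_err, abs((new_xi - x[i]) / new_xi) * 100.0)
                x[i] = new_xi
            if max_err <= tol:
                return x, it
        return x, max_iter

    A = [[3.0, -0.1, -0.2], [0.1, 7.0, -0.3], [0.3, -0.2, 10.0]]
    b = [7.85, -19.3, 71.4]
    print(sor(A, b, [0.0, 0.0, 0.0], lam=1.1))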
Problems
2.1 Matrices [A], [B], [C] and the vector {x} are given in the following:
[A] = [ 1   2  −1   3 ]        [B] = [ 0   1   0 ]
      [ 0   4  −2   1 ]              [ 1   2  −1 ]
      [ 3  −1   1   1 ]              [ 1  −1   3 ]
                                     [ 2  −2   1 ]
[C] = [ 0   1   4   2 ]        {x} = { −1.2 }
      [−1  −2   3   1 ]              {  3.7 }
      [ 0   1   2  −1 ]              {  2.0 }
2.3 Write the following system of equations in the matrix form using x, y, z
as vector of unknowns (in this order).
d1 = b1 y + c1 z
d2 = b2 y + a2 x
d3 = a3 x + b3 y
3x1 − x2 + 2x3 = 12
x1 + 2x2 + 3x3 = 11
2x1 − 2x2 − x3 = 2
4x1 + x2 − x3 = −2
5x1 + x2 + 2x3 = 4
6x1 + x2 + x3 = 6
(a) Obtain solution {x} using Gauss elimination with partial pivoting.
(b) Obtain solution {x} using Gauss elimination with full pivoting.
2x1 + x2 − x3 = 1
5x1 + 2x2 + 2x3 = −4
3x1 + x2 + x3 = 5
(a) Calculate solution {x} using Gauss-Jordan method with partial piv-
oting.
(b) Calculate solution {x} using Gauss-Jordan method with full pivot-
ing.
2.9 Consider the following system of equations in which the coefficient ma-
trix [A] is symmetric.
2x1 − x2 = 1.5
−x1 + 2x2 − x3 = −0.25
−x2 + x3 = −0.25
[A] = [L̃][L̃]T
For each iteration tabulate the starting solution, computed solution, and
percentage error in each component of the computed solution using the cal-
culated solution as the improved solution. Allow maximum of 20 iterations
and use a convergence tolerance of 0.1 × 10−6 for the percentage error in
each component of the solution vector.
(a)
[ 3.0  −0.1  −0.2 ] { x1 }   {  7.85 }
[ 0.1   7.0  −0.3 ] { x2 } = { −19.3 }  ;  use {x} = {0, 0, 0}^T as initial or starting solution
[ 0.3  −0.2  10.0 ] { x3 }   {  71.4 }
(b)
[ 10   1   2   3  ] { x1 }   { 10 }
[  1  20   2   3  ] { x2 } = { 20 }  ;  use {x} = {0, 0, 0, 0}^T as initial or starting solution
[  2   2  30   4  ] { x3 }   { 30 }
[  3   3   4  40  ] { x4 }   { 40 }
Based on the numerical studies for (a) and (b) comment on the performance
of Gauss-Seidel method and Jacobi method.
show whether this system of equations has a unique solution or not without
computing the solution.
6x + 4y = 4
4x + 5y = 1
3.1 Introduction
Nonlinear simultaneous equations are nonlinear expressions in unknown
quantities of interest. These may arise in some physical processes directly
due to consideration of the mathematical descriptions of their physics. On
the other hand, in many physical processes described by nonlinear differ-
ential or partial differential equations, the use of approximation methods
such as finite difference, finite volume, and finite element methods for ob-
taining their approximate numerical solutions naturally results in nonlinear simultaneous equations. Solutions of these nonlinear simultaneous equations provide the solutions of the associated nonlinear differential and partial differential equations.
In this chapter we consider systems of nonlinear simultaneous equations
and methods of obtaining their solutions. Consider a system of n nonlinear
simultaneous equations:
fi (x1 , x2 , . . . , xn ) = bi ; i = 1, 2, . . . , n (3.1)
[Figure: graph of a function f(x) over the interval [a, b] showing three roots x1, x2, and x3.]
We note that for the root x1 , this condition holds in the immediate neigh-
borhood of x = x1 as long as xl < x1 and xu > x1 . For the second root x2
we have:
f (xl ) < 0 , f (xu ) > 0
(3.5)
f (xl )f (xu ) < 0 for xl < x2 , xu > x2
For the third root x3 :
f (xl ) > 0 , f (xu ) < 0
(3.6)
f (xl )f (xu ) < 0
[Figure: a root x1 bracketed by x = xl and x = xu, with f(xl) > 0 and f(xu) < 0.]
in which xi is the root of f(x), holds for each root. Thus, the condition (3.7) is helpful in the root-finding methods considered in the following sections.
In the graphical method, we simply plot a graph of f (x) versus x and
locate values of x for which f (x) = 0 in the range [a, b]. These of course are
the approximate values of the desired roots within the limits of the graphical
precision.
[Figure: plot of f(x) versus x over the range −4 ≤ x ≤ 4.]
for each i value in (3.13), calculate f (xi+1 ). Using two successive values
f (xi ) and f (xi+1 ) of f (x), consider:
f(xi) f(xi+1) < 0  ⟹  a root in [xi, xi+1]
f(xi) f(xi+1) > 0  ⟹  no root in [xi, xi+1]        (3.14)
∆x = 0.41000E + 00
xmin = −0.40000E + 01
xmax = 0.40000E + 01
Remarks.
(2) A value of ∆x larger than 0.41 can be used too, but in this case ∆x may
be too large, hence we may miss one or more roots.
(3) For each range of the roots in (3.15), we can perform incremental search
with progressively reduced ∆x to eventually obtain an accurate value of
each root. This approach to obtaining accurate values of each root is
obviously rather inefficient.
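A Python sketch of the incremental search, returning the bracketing sub-intervals detected by the sign test (3.14). The cubic used in the call is the one given later in (3.72), which appears to be the f(x) of these examples; names are illustrative.

    def incremental_search(f, x_min, x_max, dx):
        # return the sub-intervals [x_i, x_i + dx] over which f changes sign, per (3.14)
        brackets = []
        x = x_min
        while x < x_max:
            x_next = min(x + dx, x_max)
            if f(x) * f(x_next) < 0.0:
                brackets.append((x, x_next))
            x = x_next
        return brackets

    f = lambda x: x**3 + 2.3 * x**2 - 5.08 * x - 7.04
    print(incremental_search(f, -4.0, 4.0, 0.41))   # three brackets, one per root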
(ii) Divide the interval [xl , xu ] into two equal intervals, [xl , xk ] and [xk , xu ],
in which xk = 1/2(xl + xu ).
Since the root lies in [xl , xk ], the interval [xk , xu ] can be discarded.
(iv) We now reinitialize the x values, i.e., keep x = xl the same but set
xu = xk to create a new, smaller range of [xl , xu ].
[Figure: bisection of the bracket [xl, xu] at xk, with f(xl) > 0, f(xk) < 0, and f(xu) < 0.]
We discard the range x that does not contain the root and reinitialize the
other half-interval to [xl , xu ]. We repeat steps in (3.19) and (3.20) until:
% Error = | (xu − xl) / xu | × 100 ≤ ∆, a preset tolerance        (3.21)
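A Python sketch of the bisection procedure with the convergence test (3.21); function and parameter names are illustrative, and a bracket with f(xl) f(xu) < 0 is assumed.

    def bisection(f, xl, xu, tol=1.0e-4, max_iter=20):
        # assumes the bracket [xl, xu] contains a root, i.e., f(xl) * f(xu) < 0
        for it in range(1, max_iter + 1):
            xk = 0.5 * (xl + xu)
            if f(xl) * f(xk) < 0.0:      # root lies in [xl, xk]
                xu = xk
            else:                        # root lies in [xk, xu]
                xl = xk
            err = abs((xu - xl) / xu) * 100.0     # criterion (3.21)
            if err <= tol:
                break
        return 0.5 * (xl + xu), it

    f = lambda x: x**3 + 2.3 * x**2 - 5.08 * x - 7.04
    print(bisection(f, 1.74, 2.15))   # approximately 2.0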
Table 3.2: Results of bisection method for the first root of equation (3.12)
xl = -0.35900E+01
xu = -0.31800E+01
∆= 0.10000E−03
I= 20
Table 3.3: Results of bisection method for the second root of equation (3.12)
xl = -0.11300E+01
xu = -0.72000E+00
∆= 0.10000E−03
I= 20
Table 3.4: Results of bisection method for the third root of equation (3.12)
xl = 0.17400E+01
xu = 0.21500E+01
∆= 0.10000E−03
I= 20
(i) Connect the points (xl , f (xl )) and (xu , f (xu )) by a straight line (see
Figure 3.5) and define the intersection of this line with the x-axis as
x = xr . Using the equation of the straight line, similar triangles, or
Figure 3.5: Method of false position (the straight line joining (xl, f(xl)) and (xu, f(xu)) intersects the x-axis at xr).
simply equating tan(θ) from the two triangles shown in Figure 3.5, we obtain:
f(xl) / (xl − xr) = f(xu) / (xu − xr)        (3.23)
Solving for xr:
xr = (xu f(xl) − xl f(xu)) / (f(xl) − f(xu))        (3.24)
Equation (3.24) is known as the false position formula. An alternate form of (3.24) can be obtained. From (3.24):
xr = xu f(xl) / (f(xl) − f(xu)) − xl f(xu) / (f(xl) − f(xu))        (3.25)
xr = xu + xu f(xl) / (f(xl) − f(xu)) − xl f(xu) / (f(xl) − f(xu)) − xu        (3.26)
Next, consider the products
f(xl) f(xr)  and  f(xr) f(xu)        (3.28)
Check which is less than zero. From Figure 3.5 we note that
Hence, the root lies in the interval [xr , xu ] (for the function shown in
Figure 3.5). Therefore, we discard the interval [xl , xr ].
xl = xr
and xu = xu unchanged
(iv) In this method xr is the new estimate of the root. We check the
convergence of the method using the following (approximate percentage
relative error):
| ((xr)i+1 − (xr)i) / (xr)i+1 | × 100 < ∆        (3.30)
in which (xr )i is the estimate of the root in the ith iteration. When
converged, i.e., when (3.30) is satisfied, (xr )i+1 is the final value of the
root.
Example 3.3 (False Position Method). In this method, once a root has
been bracketed we use the following to obtain an estimate of xr of the root
in the bracketed range:
xr = xu − f(xu)(xu − xl) / (f(xu) − f(xl))        (3.31)
Then, we consider the products f (xl )f (xr ) and f (xr )f (xu ) to determine the
range containing the root. We discard the range not containing the root and
reinitialize the range containing the root to [xl , xu ]. We iterate (3.31) and
use the steps following it until:
% Error = | ((xr)i+1 − (xr)i) / (xr)i+1 | × 100 ≤ ∆        (3.32)
We consider f (x) = 0 defined by (3.12) and the bracketed ranges of the roots
determined in Example 3.1 to present details of the false position method
for each root. We choose ∆ = 0.0001 and maximum of twenty iterations
(I = 20).
Table 3.5: Results of false position method for the first root of equation (3.12)
xl = -0.35900E+01
xu = -0.31800E+01
∆= 0.10000E−03
I= 20
Table 3.6: Results of false position method for the second root of equation (3.12)
xl = -0.11700E+01
xu = -0.72000E+00
∆= 0.10000E−03
I= 20
Table 3.7: Results of false position method for the third root of equation (3.12)
xl = 0.17400E+01
xu = 0.21500E+01
∆= 0.10000E−03
I= 20
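A Python sketch of the false position iteration of Example 3.3, using (3.24) for xr and (3.32) as the convergence test; names are illustrative.

    def false_position(f, xl, xu, tol=1.0e-4, max_iter=20):
        xr_old = xu
        for it in range(1, max_iter + 1):
            # equation (3.24): intersection of the chord with the x-axis
            xr = (xu * f(xl) - xl * f(xu)) / (f(xl) - f(xu))
            if f(xl) * f(xr) < 0.0:
                xu = xr                  # root in [xl, xr]
            else:
                xl = xr                  # root in [xr, xu]
            if xr != 0.0 and abs((xr - xr_old) / xr) * 100.0 <= tol:   # criterion (3.32)
                return xr, it
            xr_old = xr
        return xr_old, max_iter

    f = lambda x: x**3 + 2.3 * x**2 - 5.08 * x - 7.04
    print(false_position(f, -1.17, -0.72))   # approximately -1.1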
f(xi) ≠ 0        (3.33)
If we neglect O((∆x)²) in (3.35) (valid when (∆x)² << ∆x) then (3.35) is approximate, and we can solve for ∆x.
∆x = − f(xi) / f′(xi)        (3.36)
xi+1 = xi + ∆x        (3.37)
xi+1 = xi − f(xi) / f′(xi)        (3.38)
Remarks.
f′(xi) = df/dx evaluated at x = xi        (3.39)
f′(xi) ≈ f(xi) / (xi − xi+1)        (3.40)
xi+1 = xi − f(xi) / f′(xi)        (3.41)
xi+1 is the improved value (i.e., more accurate) of the root of f (x) in [xl , xu ].
Clearly (3.41) is the same as (3.38), hence earlier remarks hold here as well.
[Figure: geometric interpretation of the Newton-Raphson method; the tangent to f(x) at x = xi intersects the x-axis at xi+1 within the bracket [xl, xu].]
Convergence Criterion
| (xi+1 − xi) / xi+1 | × 100 ≤ ∆ ;   ∆ is a preset value        (3.42)
(1) The method requires a range [xl , xu ] that brackets the desired root and
an initial guess x0 ∈ [xl , xu ] of the root.
(2) The method works extremely well if f 0 (x) and f 00 (x) are well-behaved
in the range [xl , xu ].
(3) The method fails if f 0 (xi ) becomes zero, i.e., f 0 (x) changes sign in the
neighborhood of xi .
(4) When the initial guess xi is sufficiently close to the root of f (x), the
method has quadratic convergence (shown in a later section), hence only
a few iterations are required to obtain a highly accurate value of the root.
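A Python sketch of the Newton-Raphson iteration (3.38) with the convergence test (3.42); it assumes f′(x) does not vanish during the iterations, per Remark (3) in the list above. Names are illustrative.

    def newton_raphson(f, df, x0, tol=1.0e-4, max_iter=10):
        x = x0
        for it in range(1, max_iter + 1):
            x_new = x - f(x) / df(x)         # equation (3.38); fails if df(x) = 0
            if abs((x_new - x) / x_new) * 100.0 <= tol:   # criterion (3.42)
                return x_new, it
            x = x_new
        return x, max_iter

    f  = lambda x: x**3 + 2.3 * x**2 - 5.08 * x - 7.04
    df = lambda x: 3.0 * x**2 + 4.6 * x - 5.08
    print(newton_raphson(f, df, 1.9))   # approximately 2.0 in a few iterations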
where
xi+1 = xi + ∆x        (3.44)
Expand f(xi + ∆x) = f(xi+1) in a Taylor series about xi.
f(xi+1) = f(xi) + f′(xi) ∆x + f″(ξ) (∆x)²/2 = 0 ;   ξ ∈ [xl, xu]        (3.45)
If we neglect the f″(ξ) term in (3.45), then we obtain Newton's linear method or the Newton-Raphson method.
or
f(xi) + f′(xi)(xi+1 − xi) = 0        (3.47)
∴ xi+1 = xi − f(xi) / f′(xi)        (3.48)
We can use Taylor series expansion to estimate the error in (3.48). We go back to the Taylor series expansion (3.45) and use ∆x = xi+1 − xi.
f(xi) + f′(xi)(xi+1 − xi) + (f″(ξ)/2)(xi+1 − xi)² = 0        (3.49)
Since (3.49) is exact (in the sense that the influence of all terms in the Taylor series expansion is accounted for in the last term), xi+1 in (3.49) must be the exact root or true root, say xi+1 = xt. We substitute xt for xi+1 in (3.49).
f(xi) + f′(xi)(xt − xi) + (f″(ξ)/2)(xt − xi)² = 0        (3.50)
Subtract (3.47) from (3.50), noting that xi+1 in (3.47) is not xt.
f′(xi)(xt − xi+1) + (f″(ξ)/2)(xt − xi)² = 0        (3.51)
Since xt is the true solution:
f(xi + ∆x) = f(xi) + ∆x f′(xi) + ((∆x)²/2!) f″(xi) + O((∆x)³) = 0        (3.60)
If we neglect O((∆x)³), then (3.60) can be written as
f(xi + ∆x) = f(xi) + ∆x f′(xi) + ((∆x)²/2!) f″(xi) = 0        (3.61)
In (3.61), f (xi ), f 0 (xi ), and f 00 (xi ) are known and have numerical values.
∆x is unknown, but in the following we treat it as a known increment in
part of the expression, or as completely unknown and to be determined. We
consider both cases in the following.
xi+1 = xi + ∆x1
xi+1 = xi + ∆x2        (3.62)
Of the two values of xi+1 in (3.62), the value that lies in the range (xl, xu) is the correct improved or new value of the root. We check for convergence based on the approximate percentage relative error given by:
| (xi+1 − xi) / xi | × 100 ≤ ∆        (3.63)
If (3.63) is satisfied, then xi+1 is the desired value of the root, otherwise
use xi+1 as the new initial or starting value instead of xi and repeat the
calculations, beginning with (3.61).
∆x = − f(xi) / ( f′(xi) − f(xi) f″(xi) / (2 f′(xi)) )        (3.67)
xi+1 = xi + ∆x        (3.68)
or
xi+1 = xi − f(xi) / ( f′(xi) − f(xi) f″(xi) / (2 f′(xi)) )        (3.69)
| (xi+1 − xi) / xi | × 100 ≤ ∆        (3.70)
If (3.70) is satisfied then we have a converged value of the root, i.e., xi+1 is
the desired root of f (x) in the range [xl , xu ]. If not, then using xi+1 as the
new initial or guess value, we repeat the calculations using (3.69).
Remarks.
(2) Case I does require the solution of ∆x, i.e., ∆x1 and ∆x2 , using the
expression for roots of a quadratic equation.
(3) Newton’s second order method requires f 00 (x) to be well-behaved in the
neighborhood of xi .
(4) As in the case of Newton’s linear method, Newton’s second order method
(both Case I and Case II) also have good convergence characteristics as
long as the initial or starting solution is in a sufficiently small neighbor-
hood of the correct value of the root.
(5) The convergence rate of Case II is similar to Newton-Raphson method
due to the introduction of approximation (3.65).
Case I:
In this approach we use
f(xi) + f′(xi) ∆x + ((∆x)²/2) f″(xi) = 0        (3.71)
in which
f(x) = x³ + 2.3x² − 5.08x − 7.04
f′(x) = 3x² + 4.6x − 5.08        (3.72)
f″(x) = 6x + 4.6
and xi is the current estimate of the root in the bracketed range. Choose
i = 0, and hence x0 as the initial guess, and calculate f (xi ), f 0 (xi ), and
f 00 (xi ) using (3.72). Using these values and (3.71) find two values of ∆x, say
∆x1 and ∆x2, using the quadratic formula. Let
xi+1 = xi + ∆x1
xi+1 = xi + ∆x2        for i = 1, 2, . . .        (3.73)
Choose the value of xi+1 that falls within [xl, xu], the range that brackets the root. Increment i = i + 1 and repeat the calculations beginning with (3.71). The convergence criterion is given as (approximate percentage relative error)
| (xi+1 − xi) / xi+1 | × 100 ≤ ∆        (3.74)
All roots have been determined within the desired accuracy of ∆ = 0.0001
in three iterations, compared to four iterations for Newton’s linear method.
Strictly in terms of iteration count, this is an improvement, but this method
also requires additional calculation of two values of ∆x at each iteration
using the quadratic formula, and determination of which choice of ∆x is ap-
propriate. This may result in worse overall efficiency compared to Newton’s
linear method.
Case II
In this approach we approximate ∆x/2 using Newton's linear method to obtain:
xi+1 = xi − f(xi) / ( f′(xi) − f(xi) f″(xi) / (2 f′(xi)) ) ;   i = 0, 1, 2, . . .        (3.75)
f′(xi) ≅ (f(xi) − f(xi−1)) / (xi − xi−1)        (3.77)
This is a backward difference approximation for f′(xi). Substituting from (3.77) into (3.76) for f′(xi):
Figure 3.7: Secant method (the line through (xi−1, f(xi−1)) and (xi, f(xi)) determines xi+1 within the bracket [xl, xu]).
This is known as the secant method. This expression for xi+1 in (3.78) is
the same as that derived for false position method (see Example 3.3 for a
numerical example).
Remarks.
(2) This method is helpful when f (x) is complicated, in which case deter-
mining f 0 (x) may be involved, but is avoided in this case.
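A Python sketch of the secant iteration, with f′(xi) replaced by the backward difference (3.77); two starting values are required. Names are illustrative.

    def secant(f, x_prev, x_curr, tol=1.0e-4, max_iter=20):
        for it in range(1, max_iter + 1):
            # derivative replaced by a backward difference, per (3.77)
            x_next = x_curr - f(x_curr) * (x_curr - x_prev) / (f(x_curr) - f(x_prev))
            if abs((x_next - x_curr) / x_next) * 100.0 <= tol:
                return x_next, it
            x_prev, x_curr = x_curr, x_next
        return x_curr, max_iter

    f = lambda x: x**3 + 2.3 * x**2 - 5.08 * x - 7.04
    print(secant(f, 1.74, 2.15))   # approximately 2.0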
We begin with xi ∈ [xl , xu ] as the assumed value of the root in the bracketed
range [xl , xu ] and iterate using (3.79) until converged, i.e., until the following
holds (approximate percentage relative error):
| (xi+1 − xi) / xi+1 | × 100 ≤ ∆        (3.80)
Case (a)
Consider
f(x) = x² − 4x + 3
We express f(x) = 0 as x = g(x).
x = x² − 3x + 3 = g(x)
∴ xi+1 = xi² − 3xi + 3 ;   i = 0, 1, . . . ;   x0 is initial guess
(iii) x = ±√(4x − 3) = ±g(x)
∴ xi+1 = ±√(4xi − 3) ;   i = 0, 1, . . . ;   x0 is initial guess
Case (b)
Consider
f(x) = sin(x) = 0
Add x to both sides of f(x) = 0.
x = sin(x) + x = g(x)
∴ xi+1 = sin(xi) + xi ;   i = 0, 1, . . .
Case (c)
Consider
f(x) = e^(−x) − x
∴ x = e^(−x) = g(x)
Hence,
xi+1 = e^(−xi) ;   i = 0, 1, . . . ;   x0 is initial guess
We present a numerical study of Case (c) using x0 = 0. Calculated values
are tabulated in the following.
Table 3.17: Results of fixed point method for Case (c) with x0 = 0.
x0 = 0.00000E+00
∆= 0.10000E−03
I= 20
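A Python sketch of the fixed point iteration x_(i+1) = g(x_i) applied to Case (c); the tolerance and iteration limit mirror the table parameters above, and the names are illustrative.

    import math

    def fixed_point(g, x0, tol=1.0e-4, max_iter=20):
        x = x0
        for it in range(1, max_iter + 1):
            x_new = g(x)                      # x_(i+1) = g(x_i), per (3.79)
            if x_new != 0.0 and abs((x_new - x) / x_new) * 100.0 <= tol:
                return x_new, it
            x = x_new
        return x, max_iter                    # returned after max_iter if not yet within tol

    g = lambda x: math.exp(-x)                # Case (c): f(x) = e^(-x) - x, so g(x) = e^(-x)
    print(fixed_point(g, 0.0))                # slow convergence toward about 0.567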
(3) Newton’s linear method has even better performance than false position
method due to the fact that in false position method (or secant method),
the function derivative is approximated. Error O(10−5 ) (lower than (1)
and (2)) required only four iterations for each root. From third to fourth
iteration the error (relative error) reduces from O(10−1 ) or O(10−2 ) to
O(10−5 ) or O(10−6 ), better than the theoretical quadratic convergence
rate.
(6) Fixed point method is even worse than bisection method. For the nu-
merical example considered here error O(10−1 ) required 20 iterations.
f (x, y, z) = 0
g(x, y, z) = 0 (3.81)
h(x, y, z) = 0
f(xi, yi, zi) ≠ 0
g(xi, yi, zi) ≠ 0        (3.82)
h(xi, yi, zi) ≠ 0
xi+1 = xi + ∆x
yi+1 = yi + ∆y (3.83)
zi+1 = zi + ∆z
∴ { xi+1 }   { xi }   [ fx  fy  fz ]⁻¹       { f }
  { yi+1 } = { yi } − [ gx  gy  gz ]         { g }        (3.88)
  { zi+1 }   { zi }   [ hx  hy  hz ]         { h }
             (the Jacobian and the functions are evaluated at xi, yi, zi)
xi+1, yi+1, zi+1 from (3.88) are improved values of the solution compared to xi, yi, zi (previous iteration). Now we check for convergence (approximate percentage relative error):
| (xi+1 − xi) / xi+1 | × 100 ≤ ∆
| (yi+1 − yi) / yi+1 | × 100 ≤ ∆        (3.89)
| (zi+1 − zi) / zi+1 | × 100 ≤ ∆
Remarks.
(2) When these nonlinear equations describe a physical process, the physics is generally of help in estimating or guessing a good starting solution. For example, Stokes flow is a good assumption as a starting solution for a more general nonlinear viscous flow calculation.
(3) Often a null vector or a vector of ones may also serve as a crude guess. Such a choice may require many more iterations, since it may be far away from the true solution. In some cases, this choice may also result in lack of convergence.
(4) The most important point to remember is that Newton’s method has
excellent convergence characteristics, provided the starting solution is
in a sufficiently small neighborhood of the correct solution. Thus, a
choice of initial solution close to the true solution is necessary, otherwise
the method may require too many iterations to converge or may not
converge at all.
or
xi+1 = xi − f(xi) / f′(xi)    (3.91)
which is the same as Newton’s linear method or Newton-Raphson method
derived in Section 3.2.5 for a single nonlinear equation.
We wish to find all possible values of x,y that satisfy the above equations,
i.e., all roots, using Newton’s linear method or Newton’s first order method.
For a system of two equations described by f (x, y) = 0 and g(x, y) = 0, we
have:
{xi+1, yi+1}ᵀ = {xi, yi}ᵀ − [ fx fy ; gx gy ]⁻¹(xi, yi) {f, g}ᵀ(xi, yi)    (3.93)
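As a hedged illustration of (3.93), the Python sketch below applies Newton's linear method to a two-equation system; the sample system (taken from one of the end-of-chapter problems), the starting point, and the tolerance are illustrative choices.

    # Newton's linear method for two nonlinear equations, eq. (3.93).
    import numpy as np

    def newton_system(F, J, x0, delta=1.0e-5, max_iter=20):
        x = np.array(x0, dtype=float)
        for k in range(max_iter):
            dx = np.linalg.solve(J(x), F(x))     # [J]^{-1}{F} via a linear solve
            x_new = x - dx
            # approximate percentage relative error, checked component-wise
            if np.all(np.abs((x_new - x) / x_new) * 100.0 <= delta):
                return x_new, k + 1
            x = x_new
        return x, max_iter

    # Illustrative system: f = x^2 - y + 1 = 0, g = 3cos(x) - y = 0
    F = lambda v: np.array([v[0]**2 - v[1] + 1.0, 3.0*np.cos(v[0]) - v[1]])
    J = lambda v: np.array([[2.0*v[0], -1.0], [-3.0*np.sin(v[0]), -1.0]])
    print(newton_system(F, J, [1.0, 2.0]))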
[Figure: graphs of y = [(21 − x³)/3]^(1/2), y = −[(21 − x³)/3]^(1/2), and y = (−2 − x²)/2 in the xy-plane for −2.5 ≤ x ≤ 2; the intersection points (x1, y1) and (x2, y2) locate the roots.]
[Iteration table columns: xi, yi, f(xi, yi), g(xi, yi), fx, fy, gx, gy.]
Remarks.
(1) The method has convergence rates similar to Newton's linear method for finding roots of f(x).
(2) This method has good mathematical foundation and is perhaps the best
method for obtaining solutions of nonlinear systems of equations.
(3) It is important to note again that Newton’s methods (both linear and
quadratic) have small radii of convergence. The radius of convergence
is the region or neighborhood about the root such that a choice of the
starting solution or guess solution for the root in this region is assured to
yield a converged solution. Due to the fact that the radius of convergence
is small for Newton’s methods, close proximity of the guess or starting
solution to the true root or solution is essential for the convergence of
the methods.
Problems
Consider the following cubic algebraic polynomial.
3.1 Plot a graph of f (x) versus x to locate the roots of (1), approximately.
3.2 Using the brackets of the roots determined in 3.1, use bisection method
to determine more accurate values of each root. Use
( (xm)i+1 − (xm)i ) / (xm)i+1 × 100 ≤ 0.1 × 10⁻³
3.3 Using brackets of the roots determined in 3.1, use method of false posi-
tion to determine more accurate values of each root. Use
( (xr)i+1 − (xr)i ) / (xr)i+1 × 100 ≤ 0.1 × 10⁻³
as convergence criterion, where (xr )i is the value of the root in the ith iter-
ation. Tabulate your calculations in the same manner as in the examples.
Clearly show the values of the roots. Limit number of iterations to 20.
3.4 Using the brackets of the roots determined in 3.1, use Newton’s linear
method (Newton-Raphson method) to determine more accurate values of
each root. Use -0.8, 3.3 and 4.8 as starting values (initial guess) of the roots
(in ascending order). Use
( xi+1 − xi ) / xi+1 × 100 ≤ 0.1 × 10⁻⁴
as convergence criterion, where xi is the value of the root in the ith iteration.
Tabulate your calculations in the same manner as in the examples. Limit
the number of iterations to 10.
3.5 Using the brackets of the roots determined in 3.1, use Newton’s second
order method:
Use -0.8, 3.3 and 4.8 as initial guess of the roots (considered in ascending
order) for Case I. Use -0.8, 2.9, and 4.8 as initial guess of the roots (consid-
ered in ascending order) for Case II.
3.6 Based on the studies performed in problems 3.2 - 3.5, write a short
discussion regarding the accuracy of various methods, convergence charac-
teristics, and efficiency.
3.7 Consider
f(x) = −x² + 5.5x + 11.75
3.8 Consider
f(x) = 3x³ − 2.5x² + 3.5x − 1
3.9 Consider
3.10 Consider
f(x) = −x³ + 6.8x² − 8.8x − 4.4
3.11 Consider
f(x) = −x² + 1.889x + 2.778
3.12 Consider
f(x) = x³ − 8x² + 12x − 4
3.13 Consider
f(x) = 0.5x³ − 3x² + 5.5x − 3.05
(c) Also find roots of f (x) = 0 using secant method with the same accuracy
as in (b).
3.15 Consider
x² − 3x − 4 = 0    (1)
(a) Use basic iteration method or fixed point method to find a root of (1)
near x = 3. Perform five iterations
(b) Let (xl , xu ) = (3.2, 5) contain root of (1). Determine a value of the root.
Perform four iterations only.
(c) Use method of false position to obtain a root of (1). Perform four
iterations only.
Use decimal place accuracy of the computed solutions in (b) and (c) to
compare their convergence characteristics.
3.16 Find the square root of π (3.14159265) using Newton's linear method (but without taking a square root) starting with a value of 1.0. Calculate five new estimates.
Hint: Let x = √π.
3.17 Calculate e⁻¹ (the inverse of e) without taking its inverse, using Newton's linear method with accuracy up to four decimal places. Use e = 2.7182818 and a starting value of 0.3.
Hint: x = e⁻¹.
3.18 Let f (x) = cos(x) where x is in radians. Find a root of f (x) starting
with x0 = 1.0 using fixed point or basic iteration method. Calculate five
new estimates.
3.19 Find the cube root of 8 accurate up to three decimal places using Newton's linear method starting with a value of 1.
−2x² + 2x − 2y + 1 = 0
0.2x² − xy − 0.2y = 0    (1)
Use Newton’s linear method to find solutions of (1) using x0 , y0 = 1.0, 1.0
as initial guess with at least five decimal place accuracy. Write a computer
program to perform calculations.
(x − 2)² + (y − 2)² − 3 = 0
x² + y² − 4 = 0    (1)
Plot graphs of the functions in (1) to obtain initial guess of the roots. Use
Newton’s linear method to obtain values of the roots accurate upto five
decimal places. Write a computer program to perform all calculations.
x² − y + 1 = 0
3 cos(x) − y = 0    (1)
Plot graphs of the functions in (1) in the xy-plane to obtain initial values
of the roots. Use Newton’s linear method to obtain the values of the roots
accurate upto five decimal places. Write a computer program to perform all
calculations.
4  Algebraic Eigenvalue Problems
4.1 Introduction
Algebraic eigenvalue problems play a central and crucial role in dynamics,
mathematical physics, continuum mechanics, and many areas of engineer-
ing. Broadly speaking eigenvalue problems are mathematically classified as
standard eigenvalue problems or generalized eigenvalue problems.
holds, then (4.1) is called the standard eigenvalue problem. The scalar λ
(or λs) and the corresponding vector(s) {φ} are called eigenvalue(s) and
eigenvector(s). Together we refer to (λ,{φ}) as an eigenpair of the standard
eigenvalue problem (4.1).
(ii) Expansion of (4.3) and (4.4) using Laplace expansion will result in a
nth degree polynomial in λ called the characteristic polynomial p(λ)
corresponding to the eigenvalue problems (4.1) or (4.2).
(iii) The nth degree characteristic polynomial p(λ) has n roots λ1 , λ2 , . . . , λn
called eigenvalues of the eigenvalue problem (4.1) or (4.2). We generally
arrange λi s in ascending order.
λ1 < λ2 < · · · < λn (4.5)
(v) Consider the SEVP (4.1). When [A] is symmetric its eigenvalues λi ; i = 1, 2, . . . , n are real. When [A] is positive-definite, the λi are real and positive, i.e., λi > 0 ; i = 1, 2, . . . , n. When [A] is positive semi-definite, all eigenvalues of [A] are real, but one or more of the smallest eigenvalues can be zero. When [A] is non-symmetric, its eigenvalues can be all real, partly real and partly complex, or all complex. In this course, as far as possible, we only consider [A] to be symmetric. In the case of the GEVP the same rules hold for [A] and [B] together, i.e., both either symmetric or non-symmetric.
Let (λi , {φ}i ) and (λj , {φ}j ) be two distinct eigenpairs of (4.11), i.e., λi 6= λj .
Then we have:
or
(λi − λj ){φ}Tj [I]{φ}i = 0 (4.18)
We note that
{φ}Ti {φ}i > 0 (4.20)
and is equal to zero if and only if {φ}i = {0}, a null vector. Since the
eigenvectors only represent a direction, we can normalize them such that
their length is unity (in this case). Let ||{φ}i || be the euclidean norm or the
length of the eigenvector {φ}i .
||{φ}i|| = √( {φ}iᵀ{φ}i )    (4.21)
Consider:
{φ̃}i = (1/||{φ}i||) {φ}i    (4.22)
||{φ̃}i|| = √( (1/||{φ}i||) {φ}iᵀ (1/||{φ}i||) {φ}i ) = √( ||{φ}i||² / (||{φ}i|| ||{φ}i||) ) = 1    (4.23)
Thus, {φ̃}i is the normalized {φ}i such that the norm of {φ̃}i is one. With this normalization (4.19) reduces to:
{φ̃}iᵀ[I]{φ̃}j = {φ̃}iᵀ{φ̃}j = δij = 1 if j = i ; 0 if j ≠ i    (4.24)
The quantity δij is called the Kronecker delta. The condition (4.24) is called
the orthonormality condition of the normalized eigenvectors of SEVP.
Let (λi , {φ}i ) and (λj , {φ}j ) be two eigenpairs of (4.25) in which λi and λj
are distinct, i.e., λi 6= λj . Then we have:
Take the transpose of (4.29) (since [A] and [B] are symmetric [A]T = [A]
and [B]T = [B]).
Consider:
{φ̃}i = {φ}i / ||{φ}i||B    (4.36)
Taking the [B]-norm of {φ̃}i:
||{φ̃}i||B = √( {φ̃}iᵀ[B]{φ̃}i )    (4.37)
4.2.2.1 SEVP
If (λi , {φ}i ) is an eigenpair of the SEVP, then:
or
β([A]{φ}i − λi [I]{φ}i ) = {0} (4.42)
Since β 6= 0:
[A]{φ}i − λi [I]{φ}i = {0} must hold (4.43)
Hence, (λi , β{φ}i ) is an eigenpair of SEVP as (4.43) is identical to the SEVP.
GEVP
or
β([A]{φ}i − λi [B]{φ}i ) = {0} (4.46)
Since β 6= 0:
[A]{φ}i − λi [B]{φ}i = {0} must hold    (4.47)
Consider:
[A]{φ} − λ[I]{φ} = {0} (4.48)
[A]{φ̃}i − λi[I]{φ̃}i = {0}    (4.49)
Premultiply (4.49) by {φ̃}iᵀ:
{φ̃}iᵀ[A]{φ̃}i − λi{φ̃}iᵀ[I]{φ̃}i = 0    (4.50)
Since {φ̃}i is normalized with respect to [I]:
{φ̃}iᵀ[I]{φ̃}i = 1    (4.51)
∴ {φ̃}iᵀ[A]{φ̃}i = λi    (4.52)
Consider:
[A]{φ} − λ[B]{φ} = {0} (4.53)
For an eigenpair (λi, {φ̃}i):
[A]{φ̃}i − λi[B]{φ̃}i = {0}    (4.54)
Premultiply (4.54) by {φ̃}iᵀ:
{φ̃}iᵀ[A]{φ̃}i − λi{φ̃}iᵀ[B]{φ̃}i = 0    (4.55)
Since {φ̃}i is normalized with respect to [B]:
{φ̃}iᵀ[B]{φ̃}i = 1    (4.56)
Basic Steps:
Let
[B1] = [A] ;  p1 = tr[B1] = Σⁿᵢ₌₁ (b1)ii
[B2] = [A]([B1] − p1[I]) ;  p2 = (1/2) tr[B2] = (1/2) Σⁿᵢ₌₁ (b2)ii
[B3] = [A]([B2] − p2[I]) ;  p3 = (1/3) tr[B3] = (1/3) Σⁿᵢ₌₁ (b3)ii    (4.62)
  ⋮
[Bn] = [A]([Bn−1] − pn−1[I]) ;  pn = (1/n) tr[Bn] = (1/n) Σⁿᵢ₌₁ (bn)ii
or
| 2−λ   −1    0  |
|  −1   2−λ  −1  | = 0
|   0   −1   1−λ |
Laplace expansion using the first row:
(2 − λ) | 2−λ  −1 ; −1  1−λ | − (−1) | −1  −1 ; 0  1−λ | + (0) | −1  2−λ ; 0  −1 | = 0
(2 − λ)[ (2 − λ)(1 − λ) − (−1)(−1) ] + [ (−1)(1 − λ) − (−1)(0) ] = 0
or
p(λ) = −λ3 + 5λ2 − 6λ + 1 = 0
[B2] = [A]([B1] − p1[I])
     = [ 2 −1 0 ; −1 2 −1 ; 0 −1 1 ] ( [ 2 −1 0 ; −1 2 −1 ; 0 −1 1 ] − (5)[ 1 0 0 ; 0 1 0 ; 0 0 1 ] )
     = [ 2 −1 0 ; −1 2 −1 ; 0 −1 1 ] [ −3 −1 0 ; −1 −3 −1 ; 0 −1 −4 ]
∴ [B2] = [ −5 1 1 ; 1 −4 2 ; 1 2 −3 ] ;  p2 = (1/2) tr[B2] = (1/2)(−5 − 4 − 3) = −6
[B3] = [A]([B2] − p2[I])
     = [ 2 −1 0 ; −1 2 −1 ; 0 −1 1 ] ( [ −5 1 1 ; 1 −4 2 ; 1 2 −3 ] − (−6)[ 1 0 0 ; 0 1 0 ; 0 0 1 ] )
     = [ 2 −1 0 ; −1 2 −1 ; 0 −1 1 ] [ 1 1 1 ; 1 2 2 ; 1 2 3 ]
∴ [B3] = [ 1 0 0 ; 0 1 0 ; 0 0 1 ] ;  p3 = (1/3) tr[B3] = (1/3)(1 + 1 + 1) = 1
The characteristic polynomial p(λ) is given by:
p(λ) = (−1)³(λ³ − p1λ² − p2λ − p3)
or
p(λ) = (−1)³(λ³ − 5λ² − (−6)λ − 1)
or
p(λ) = −λ³ + 5λ² − 6λ + 1
which is the same as obtained using the determinant method. We note
that:
[A]⁻¹ = (1/p3)([B2] − p2[I]) = (1/1) [ 1 1 1 ; 1 2 2 ; 1 2 3 ]
∴ [A]⁻¹ = [ 1 1 1 ; 1 2 2 ; 1 2 3 ]
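The Faddeev-Leverrier recursion (4.62) is easy to mechanize; the following Python sketch (an illustration, with numpy assumed available) reproduces p1, p2, p3 and [A]⁻¹ for the 3 × 3 matrix of this example.

    import numpy as np

    def faddeev_leverrier(A):
        n = A.shape[0]
        I = np.eye(n)
        B = A.copy()
        p = [np.trace(B)]              # p1 = tr[B1], with [B1] = [A]
        C = None                       # holds [B_{n-1}] - p_{n-1}[I] at the end
        for k in range(2, n + 1):
            C = B - p[-1] * I
            B = A @ C                  # [Bk] = [A]([B_{k-1}] - p_{k-1}[I])
            p.append(np.trace(B) / k)  # pk = (1/k) tr[Bk]
        A_inv = C / p[-1]              # [A]^{-1} = ([B_{n-1}] - p_{n-1}[I]) / pn
        return p, A_inv

    A = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 1.0]])
    p, A_inv = faddeev_leverrier(A)
    print(p)       # [5.0, -6.0, 1.0], so p(λ) = (-1)^3 (λ^3 - 5λ^2 + 6λ - 1)
    print(A_inv)   # [[1,1,1],[1,2,2],[1,2,3]]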
λ² − 4λ + 3 = 0  ⟹  (λ − 1)(λ − 3) = 0    (4.72)
∴ λ = 1 and λ = 3
Hence, the eigenvalues λ1 and λ2 are (in ascending order):
λ1 = 1 , λ2 = 3 (4.73)
Eigenvectors
Corresponding to eigenvalues λ1 = 1, λ2 = 3, we calculate eigenvectors in
the following. Each eigenvector must satisfy the eigenvalue problem (4.69).
Using λ = λ1 = 1 in (4.69):
( [ 2 −1 ; −1 2 ] − (1)[ 1 0 ; 0 1 ] ) {φ1, φ2}ᵀ = {0, 0}ᵀ    (4.74)
{φ}1 = {φ1, φ2}1ᵀ is the desired eigenvector corresponding to λ1 = 1. From equation (4.74):
[ 1 −1 ; −1 1 ] {φ1, φ2}ᵀ = {0, 0}ᵀ    (4.75)
Obviously det[ 1 −1 ; −1 1 ] = 0, as expected.
To determine {φ}1, we must choose a value for either φ1 or φ2 in (4.75) and then solve for the other. Let φ1 = 1; then using φ1 − φ2 = 0 (or −φ1 + φ2 = 0) we obtain φ2 = 1. Hence {φ}1 = {1, 1}ᵀ and the first eigenpair is (λ1, {φ}1) = (1, {1, 1}ᵀ).
Using λ = λ2 = 3 in (4.69):
( [ 2 −1 ; −1 2 ] − (3)[ 1 0 ; 0 1 ] ) {φ1, φ2}ᵀ = {0, 0}ᵀ    (4.78)
{φ}2 = {φ1, φ2}2ᵀ is the desired eigenvector corresponding to λ2 = 3. From equation (4.78):
[ −1 −1 ; −1 −1 ] {φ1, φ2}ᵀ = {0, 0}ᵀ    (4.79)
In this case we also note that the determinant of the coefficient matrix
in (4.79) is zero (as expected). To determine {φ}2 we must choose a
value of φ1 or φ2 in (4.79) and then solve for the other. Let φ1 = 1, then
using:
−φ1 − φ2 = 0 (4.80)
we obtain:
φ2 = −1
Hence
{φ}2 = {φ1, φ2}ᵀ = {1, −1}ᵀ    (4.81)
We now have the second eigenpair corresponding to the second eigen-
value λ2 = 3:
(λ2, {φ}2) = ( 3, {1, −1}ᵀ )    (4.82)
Thus, the two eigenpairs in ascending order of the eigenvalues are:
( 1, {1, 1}ᵀ )  and  ( 3, {1, −1}ᵀ )    (4.83)
Orthogonality of Eigenvectors
We note that
{φ}1ᵀ[I]{φ}2 = {φ}1ᵀ{φ}2 = [1 1]{1, −1}ᵀ = 0
{φ}2ᵀ[I]{φ}1 = {φ}2ᵀ{φ}1 = [1 −1]{1, 1}ᵀ = 0    (4.84)
That is, {φ}1 and {φ}2 are orthogonal to each other or with respect to [I].
Since
{φ}1ᵀ{φ}1 = [1 1]{1, 1}ᵀ = 2
and {φ}2ᵀ{φ}2 = [1 −1]{1, −1}ᵀ = 2    (4.85)
Therefore
{φ}Ti {φ}j 6= δij ; i, j = 1, 2 (4.86)
Hence, {φ}1 and {φ}2 are not orthonormal.
and
||{φ}2|| = √( {φ}2ᵀ[I]{φ}2 ) = √( {φ}2ᵀ{φ}2 ) = √2    (4.89)
∴ {φ̃}2 = (1/||{φ}2||) {φ}2 = (1/√2) {1, −1}ᵀ = {1/√2, −1/√2}ᵀ    (4.90)
Thus, {φ̃}1 and {φ̃}2 are orthogonal and normalized (with respect to [I]); hence these eigenvectors are orthonormal.
The only difference between (4.92) and (4.93) is that in the right side of
(4.92) we have [I] instead of [B]. Thus (4.92) can be obtained from (4.93)
by redefining [B] as [I]. Hence, instead of (4.92) and (4.93) we could define
a new eigenvalue problem:
or
[I]{x} = λ[A]−1 {x} (4.95)
If we define [Ã] = [I] and [B̃] = [A]⁻¹ in (4.95), then we obtain (4.94). In the case of (4.93), we could premultiply it by [A]⁻¹ (provided [A]⁻¹ exists) to obtain:
[A]⁻¹[A]{x} = λ[A]⁻¹[B]{x}
or
[I]{x} = λ([A]⁻¹[B]){x}    (4.96)
If we define [Ã] = [I] and [B̃] = [A]⁻¹[B] in (4.96), then again we obtain (4.94). Alternatively, in the case of (4.93), we can also premultiply by [B]⁻¹ (provided [B]⁻¹ exists) to obtain:
[B]⁻¹[A]{x} = λ[B]⁻¹[B]{x}
or
([B]⁻¹[A]){x} = λ[I]{x}    (4.97)
If we define [Ã] = [B]⁻¹[A] and [B̃] = [I] in (4.97), then we also obtain (4.94).
Thus, the eigenvalue problem in the form (4.94) is the most general representation of any one of the five eigenvalue problem forms defined by (4.92), (4.93), (4.95), (4.96), and (4.97). Regardless of the form we use, the eigenvalues remain unaffected due to the fact that in all these various forms λ has never been changed. However, a [B̃]-normalized eigenvector may be different if [B̃] is not the same.
It then suffices to consider the eigenvalue problem (4.94) for presenting details of the vector iteration method. The specific choices of [Ã] and [B̃] may be application dependent, but these choices do not influence the eigenvalues. As mentioned earlier, when the vector iteration method is applied to an eigenvalue problem such as (4.94), it always yields the lowest eigenpair (proof omitted) (λ1, {φ}1). Calculation of the lowest eigenpair by vector iteration is also called the Inverse Iteration Method.
Remarks.
(1) The method described above to calculate the smallest eigenpair is called
the inverse iteration method, in which we iterate for an eigenvector and
then for an eigenvalue, hence the method is sometimes also called the
vector iteration method.
(3) Using the method presented here it is only possible to determine (λ1 , {φ}1 )
(proof omitted), the smallest eigenvalue and the corresponding eigenvec-
tor.
Since the vector iteration technique described in the previous section only determines the lowest eigenpair, we must recast (4.108) and (4.109) in alternate forms that will allow us to determine the largest eigenpair. Divide (4.108) and (4.109) by λ:
[I]{x} = (1/λ)[A]{x}    (4.110)
[B]{x} = (1/λ)[A]{x}    (4.111)
Let
1/λ = λ̃    (4.112)
Then, (4.110) and (4.111) become:
[I]{x} = λ̃[A]{x}    (4.113)
[B]{x} = λ̃[A]{x}    (4.114)
[Ã]{x} = λ̃[B̃]{x}    (4.115)
The alternate forms of (4.113) and (4.114) described in Section 4.3.2.1 are
possible to define here too. If we premultiply (4.113) by [A]−1 (provided
[A]−1 exists), we obtain:
The lowest eigenpair (λ̃1, {φ̃}1) of this form gives us (1/λ̃1, {φ̃}1) = (λn, {φ}n), the largest eigenpair of the original problem.
[Ã]{x} = λ̃[B̃]{x}    (4.119)
The details are exactly the same as presented for calculating (λ1 , {φ}1 ), but
we repeat these in the following for the sake of completeness. We want
to calculate (λ̃1, {φ̃}1), in which λ̃1 is the lowest eigenvalue and {φ̃}1 is the associated eigenvector.
[Ã]{x̄}k+1 = P̃({x̄}k+1)[B̃]{x̄}k+1    (4.121)
{x̄}k+1ᵀ[Ã]{x̄}k+1 = P̃({x̄}k+1){x̄}k+1ᵀ[B̃]{x̄}k+1
∴ P̃({x̄}k+1) = ( {x̄}k+1ᵀ[Ã]{x̄}k+1 ) / ( {x̄}k+1ᵀ[B̃]{x̄}k+1 )    (4.122)
5. For each value of k we check for the convergence of the calculated eigen-
pair.
For eigenvalue:  ( P̃({x̄}k+1) − P̃({x̄}k) ) / P̃({x̄}k+1) ≤ ∆1    (4.126)
Remarks.
(1) In the vector iteration method described above for k = 1, 2, . . . the
following holds.
lim (k→∞) P̃({x̄}k) = λ̃1 ; lowest eigenvalue    (4.128)
1/λ̃1 = λn ; largest eigenvalue    (4.129)
lim (k→∞) {x}k = {φ̃}1 ; eigenvector corresponding to λn    (4.130)
(2) Using the method presented it is only possible to determine one eigen-
pair.
(3) Use of (4.115) in the development of the computational procedure permits treatment of the SEVP as well as the GEVP in any one of the desired forms shown earlier by choosing appropriate definitions of [Ã] and [B̃].
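A compact Python sketch of the vector iteration defined by (4.121)-(4.122) is given below; the matrix, starting vector, and tolerance are illustrative, and choosing [Ã] = [A], [B̃] = [I] corresponds to inverse iteration for the lowest eigenpair of the SEVP used in the following example.

    import numpy as np

    def vector_iteration(At, Bt, x0, tol=1.0e-5, max_iter=100):
        x = np.array(x0, dtype=float)
        lam_old = None
        for k in range(max_iter):
            x_bar = np.linalg.solve(At, Bt @ x)                 # [Ã]{x̄}_{k+1} = [B̃]{x}_k
            lam = (x_bar @ At @ x_bar) / (x_bar @ Bt @ x_bar)   # eq. (4.122)
            x = x_bar / np.sqrt(x_bar @ Bt @ x_bar)             # [B̃]-normalize
            if lam_old is not None and abs((lam - lam_old) / lam) <= tol:
                break
            lam_old = lam
        return lam, x

    A = np.array([[2.0, -1.0], [-1.0, 4.0]])
    lam1, phi1 = vector_iteration(A, np.eye(2), [1.0, 1.0])
    print(lam1, phi1)    # about 1.5858 and the corresponding normalized eigenvector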
For k = 1
We have ( 1, {1, 1}ᵀ ) as the initial guess of the eigenpair ({1, 1}ᵀ is not orthogonal to {φ}1). Using (4.138), we have:
[ 2 −1 ; −1 4 ] {x̄1, x̄2}2ᵀ = [ 1 0 ; 0 1 ] {1, 1}ᵀ
Hence
{x̄1, x̄2}2ᵀ = {0.71429, 0.42857}ᵀ = {x̄}2
P̃({x̄}2) = ( {x̄}2ᵀ[A]{x̄}2 ) / ( {x̄}2ᵀ[I]{x̄}2 ) = 1.6471
{x}2 = {x̄}2 / √( {x̄}2ᵀ[I]{x̄}2 ) = {0.85749, 0.51450}ᵀ
∴ The new estimate of the first eigenpair is
( P̃({x̄}2), {x1, x2}2ᵀ ) = ( 1.6471, {0.85749, 0.51450}ᵀ )
For k = 2
[ 2 −1 ; −1 4 ] {x̄1, x̄2}3ᵀ = [I] {x1, x2}2ᵀ
Hence
{x̄1, x̄2}3ᵀ = {0.56350, 0.26950}ᵀ = {x̄}3
P̃({x̄}3) = ( {x̄}3ᵀ[A]{x̄}3 ) / ( {x̄}3ᵀ[I]{x̄}3 ) = 1.5938
{x}3 = {x̄}3 / √( {x̄}3ᵀ[I]{x̄}3 ) = {0.90213, 0.43146}ᵀ
[Table 4.1 columns: k, P̃({x̄}k+1), (P̃({x̄}k+1) − P̃({x̄}k))/P̃({x̄}k+1), and the normalized eigenvector components x1, x2.]
The convergence tolerance ∆1 = O(10−5 ) has been used for the eigenvalue
in the results listed in Table 4.1.
Example 4.4 (Determining the Largest Eigenpair: Forward Itera-
tion Method). Consider the following SEVP:
[ 2 −1 ; −1 4 ] {x1, x2}ᵀ = λ [ 1 0 ; 0 1 ] {x1, x2}ᵀ    (4.139)
The basic steps to be used in this method have already been presented,
so we do not repeat these here. Instead we present calculation details and
numerical results.
Choose λ̃ = 1.0 and {x1, x2}1ᵀ = {1.0, 1.0}ᵀ and rewrite (4.140) in the difference equation form with λ̃ = 1. The vector {x}1 must not be orthogonal to {φ̃}1.
[ 1 0 ; 0 1 ] {x̄1, x̄2}k+1ᵀ = [ 2 −1 ; −1 4 ] {x1, x2}kᵀ  ;  k = 1, 2, . . .    (4.141)
For k = 1
We have ( 1, {1, 1}ᵀ ) as the initial guess of the largest eigenvalue and the corresponding eigenvector, hence:
[ 1 0 ; 0 1 ] {x̄1, x̄2}2ᵀ = [ 2 −1 ; −1 4 ] {1, 1}ᵀ
∴ {x̄1, x̄2}2ᵀ = {1.0, 3.0}ᵀ = {x̄}2 ; new estimate of eigenvector
For k = 2
[ 1 0 ; 0 1 ] {x̄1, x̄2}3ᵀ = [ 2 −1 ; −1 4 ] {x1, x2}2ᵀ
[ 1 0 ; 0 1 ] {x̄1, x̄2}3ᵀ = [ 2 −1 ; −1 4 ] {0.17678, 0.53033}ᵀ
{x̄1, x̄2}3ᵀ = {−0.17678, 1.9445}ᵀ = {x̄}3
Using {x̄}3 and (4.140), we obtain a new estimate of λ̃.
P̃({x̄}3) = ( {x̄}3ᵀ[Ã]{x̄}3 ) / ( {x̄}3ᵀ[B̃]{x̄}3 ) = ( {x̄}3ᵀ[I]{x̄}3 ) / ( {x̄}3ᵀ[ 2 −1 ; −1 4 ]{x̄}3 ) = 0.24016
Normalize {x̄}3 to obtain {x}3:
{x}3 = {x̄}3 / √( {x̄}3ᵀ[B̃]{x̄}3 ) = {−0.044368, 0.48805}ᵀ
[Iteration results table: k, P̃({x̄}k+1), (P̃({x̄}k+1) − P̃({x̄}k))/P̃({x̄}k+1), normalized eigenvector components x1, x2.]
As shown in the previous example, we can use (4.143) and the forward iteration method to find the smallest λ̃, i.e., λ̃1, and hence the largest eigenvalue of the original problem.
[Iteration results table: k, P̃({x̄}k+1), (P̃({x̄}k+1) − P̃({x̄}k))/P̃({x̄}k+1), normalized eigenvector components x1, x2.]
Thus, λ̃1 = 0.22653547 (same as in the previous example) and we have:
λ2 = 1/λ̃1 = 1/0.22653547 = 4.414320 (largest eigenvalue)
The eigenvector in this case is different than previous example due to the
fact that it is normalized differently. Hence, we have
(λ2, {φ}2) = ( 4.414320, {−0.38246, 0.92397}ᵀ )
We recall that the inverse iteration method only yields the lowest eigen-
pair while the forward iteration method gives the largest eigenpair. These
two methods do not have a mechanism for determining intermediate or sub-
sequent eigenpairs. For this purpose we utilize Gram-Schmidt orthogonaliza-
tion or the iteration vector deflation method in conjunction with the inverse
or forward iteration method.
The basis for Gram-Schmidt orthogonalization or iteration vector defla-
tion is that in order for an assumed eigenvector (iteration vector) to converge
to the desired eigenvector in the inverse or forward iteration method, the it-
eration vector must not be orthogonal to the desired eigenvector. In other
words, if the iteration vector is orthogonalized to the eigenvectors that have
already been calculated, then we can eliminate the possibility of convergence
of iteration vector to any one of them and hence convergence will occur to
the next eigenvector. A particular orthogonalization procedure used exten-
sively is called Gram-Schmidt orthogonalization process or iteration vector
deflation method. This procedure can be used for the SEVP as well as the
GEVP in the inverse or forward iteration methods.
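The following Python sketch illustrates inverse iteration combined with Gram-Schmidt (iteration vector) deflation; the GEVP matrices are those of the worked example that follows, while the starting vectors and tolerance are assumed for illustration.

    import numpy as np

    def inverse_iteration_deflated(A, B, known_phis, x0, tol=1.0e-5, max_iter=200):
        x = np.array(x0, dtype=float)
        lam = None
        for k in range(max_iter):
            # deflation: x̃ = x - Σ αi {φ}i, with αi = {φ}iᵀ[B]{x}
            for phi in known_phis:
                x = x - (phi @ B @ x) * phi
            x_bar = np.linalg.solve(A, B @ x)
            lam_new = (x_bar @ A @ x_bar) / (x_bar @ B @ x_bar)
            x = x_bar / np.sqrt(x_bar @ B @ x_bar)        # [B]-normalized iterate
            if lam is not None and abs((lam_new - lam) / lam_new) <= tol:
                return lam_new, x
            lam = lam_new
        return lam, x

    A = np.array([[1.0, -1.0], [-1.0, 2.0]])
    B = np.array([[2.0, 1.0], [1.0, 2.0]])
    lam1, phi1 = inverse_iteration_deflated(A, B, [], [1.0, 1.0])
    lam2, phi2 = inverse_iteration_deflated(A, B, [phi1], [1.0, 1.0])
    print(lam1, lam2)    # the second value is about 2.5352, as in the example below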
Based on the material presented for the inverse and forward iteration
methods it suffices to consider:
Remarks.
(1) In the inverse iteration method we find the lowest eigenpair (λ1 , {φ}1 )
using the usual procedure and then use vector deflation to find (λ2 , {φ}2 ),
(λ3 , {φ}3 ),. . . , in ascending order.
(2) In the forward iteration method we find the largest eigenpair (λn , {φ}n )
using usual procedure and then use iteration vector deflation to find
(λn−1 , {φ}n−1 ), (λn−2 , {φ}n−2 ), . . . , in descending order.
[B]{x} = λ̃[A]{x} ;  λ̃ = 1/λ    (4.156)
or [Ã]{x} = λ̃[B̃]{x}    (4.157)
where [Ã] = [B] ; [B̃] = [A]    (4.158)
5. Calculate a new estimate of λm+1 , say Pm+1 ({x̄}k+1 ), using (4.161) and
{x̄}k+1 .
Pm+1({x̄}k+1) = ( {x̄}k+1ᵀ[A]{x̄}k+1 ) / ( {x̄}k+1ᵀ[B]{x̄}k+1 )    (4.165)
7. Convergence check:
Remarks.
[A]{x} = λ[B]{x}
in which
[A] = [ 1 −1 ; −1 2 ]  and  [B] = [ 2 1 ; 1 2 ]
We want to calculate both eigenpairs of this GEVP by using inverse
iteration method with vector deflation.
[Iteration results table: k, P({x̄}k+1), (P({x̄}k+1) − P({x̄}k))/P({x̄}k+1), normalized eigenvector components x1, x2.]
We already have:
{φ}1 = {0.49079, 0.31971}ᵀ
hence m, the number of known eigenpairs, is one.
For k = 1
αi = {φ}iᵀ[B]{x}1
[A]{x̄}2 = [B]{x̃}1
Calculate {x̄}2 using {x̃}1 from Step 3.
{x̄}2 = [ 1 −1 ; −1 2 ]⁻¹ [ 2 1 ; 1 2 ] {−0.19335, 0.22262}ᵀ = {−0.076285, 0.087799}ᵀ
For k = 2
1. Choose λ = 1 and {x}2 = {−0.65268, 0.75120}ᵀ from Step 6 for k = 1.
2. Calculate the scalar α1:
α1 = {φ}1ᵀ[B]{x}2 = {0.49079, 0.31971} [ 2 1 ; 1 2 ] {−0.65268, 0.75120}ᵀ
or
α1 = −0.31083 × 10⁻³
[Iteration results table: k, P({x̄}k+1), (P({x̄}k+1) − P({x̄}k))/P({x̄}k+1), normalized eigenvector components x1, x2.]
From the relative error, we note that for k = 2, we have converged values
of the second eigenpair, hence:
(λ2, {φ}2) = ( 2.5352, {−0.65268, 0.75120}ᵀ )
Thus, the second eigenpair is determined using (λ1 , {φ}1 ) and iteration
vector deflation method. Just in case the estimates of the second eigenpair
are not accurate enough for k = 2, the process can be continued for
k = 3, 4, . . . until desired accuracy is achieved.
(iii) Shifting can be used to calculate eigenpairs other than (λ1 , {φ}1 ) and
(λn , {φ}n ) in inverse and forward iteration methods.
in which η and {y} are an eigenvalue and eigenvector of the GEVP (4.173).
µ is called the shift and the GEVP defined by (4.173) is called the shifted
GEVP.
λ=η+µ or η =λ−µ
(4.175)
{y} = {x}
Thus,
(i) The eigenvectors of the original GEVP and shifted GEVP are the same.
Remarks.
(1) Shifting also holds for the SEVP as in this case the only difference com-
pared to GEVP is that [B] = [I].
or
[P1]ᵀ[[A] − λ[I]][P1]{x}1 = {0}    (4.178)
In the eigenvalue problem (4.178), {x}1 is the eigenvector (and not {x}),
thus a change of basis alters eigenvectors. To determine the eigenvalues of
or
det[P1 ]T det[[A] − λ[I]] det[P1 ] = 0 (4.180)
Since [P1 ] is orthogonal:
in which [A] and [B] are symmetric matrices. Here also, we show that an
orthogonal transformation on (4.183) does not alter its eigenvalues but the
eigenvectors change. As in the case of the SEVP, we replace {x} by {x}1
through an orthogonal transformation of the type:
in which
[P1 ]−1 = [P1 ]T and det[P1 ] = det[P1 ]T = 1 (4.185)
Substituting from (4.184) into (4.183) and premultiplying (4.183) by [P1 ]T :
or
[P1]ᵀ[[A] − λ[B]][P1]{x}1 = {0}    (4.187)
In the eigenvalue problem (4.187), {x}1 is the eigenvector (and not {x}), thus
change of basis alters eigenvectors. To determine the eigenvalues of (4.187),
Remarks.
(a) In the case of the SEVP, [A] becomes a diagonal matrix but [I]
matrix remains unaltered. Then, the diagonals of this transformed
[A] matrix are the eigenvalues and columns of the products of the
transformation matrices contain the eigenvectors.
(b) In the case of the GEVP, we make both [A] and [B] diagonal ma-
trices through orthogonal transformations. Then, the ratios of the
corresponding diagonals of transformed [A] and [B] are the eigen-
values and the columns of the products of transformation matrices
contain the eigenvectors.
(3) The eigenpairs are not in any particular order, hence these must be
arranged in ascending order.
(4) Just like the root-finding methods used for the characteristic polynomial,
the transformation methods are also iterative. Thus, the eigenpairs are
determined only within the accuracy of preset thresholds for the eigenval-
ues and eigenvectors. The transformation methods are indeed methods
of approximation.
(5) In the following sections we present details of the Jacobi method for the SEVP and the Generalized Jacobi method for the GEVP, and only provide an outline of the Householder method with QR iterations.
and
{x} = [P1 ][P2 ]{x}2 (4.200)
Equations (4.200) describe how the eigenvectors {x}2 of (4.199) are related
to the eigenvectors {x} of the original SEVP (4.191). We construct [P2 ]
such that the transformation (4.199) makes an off-diagonal element zero in
[P2 ]T ([P1 ]T [A][P1 ])[P2 ]. This process is continued by choosing off-diagonal el-
ements of the progressively transformed [A] in sequence until all off-diagonal
elements have been considered. In this process it is possible that when we
zero out a specific element of the transformed [A], the element that was made
zero in the immediately preceding transformation may not remain zero, but
may be of a lower magnitude than its original value. Thus, to make [A] diago-
nal it may be necessary to make more than one pass through the transformed
off-diagonal elements of [A]. We discuss details in the following sections.
Thus, after k transformations, we obtain:
[Pk]ᵀ[Pk−1]ᵀ . . . [P2]ᵀ[P1]ᵀ[A][P1][P2] . . . [Pk−1][Pk]{x}k = λ[I]{x}k    (4.201)
lim (k→∞) [Pk]ᵀ[Pk−1]ᵀ . . . [P2]ᵀ[P1]ᵀ[A][P1][P2] . . . [Pk−1][Pk] = [Λ]    (4.202)
in which [Λ] is a diagonal matrix containing the eigenvalues and the columns
of the square matrix [Φ] are the corresponding eigenvectors. Thus in the
end, when [A] becomes the diagonal matrix [Λ] we have all eigenvalues in
[Λ] and [Φ] contains all eigenvectors. In (4.202) and (4.203), each value of
k corresponds to a complete pass through all of the off-diagonal elements of
the progressively transformed [A] that are not zero.
[Pl ] orthogonal matrices in the Jacobi method are called rotation matri-
ces as these represent rigid rotations of the coordinate axes. A specific [Pl ] is
designed to make a specific off-diagonal term of the transformed [A] (begin-
ning with [P1 ] for [A]) zero. Since [A] is symmetric, [Pl ] can be designed to
make an off-diagonal term of [Al ] (transformed [A]) as well as its transposed
term zero at the same time.
Let us say that we have already performed some transformations and the
new transformed [A] is [Al ]. We wish to make alij and alji of [Al ] zero. We
design [Pl ] as follows to accomplish this.
[Pl] is an identity matrix except for the entries in rows and columns i and j:
(Pl)ii = cos θ , (Pl)ij = −sin θ , (Pl)ji = sin θ , (Pl)jj = cos θ    (4.204)
In [Pl ], the elements corresponding to rows i, j and columns i, j are non-zero.
The remaining diagonal elements are unity, and all other elements are zero.
θ is chosen such that in the following transformed matrix:
The elements at the locations i, j and j, i, i.e., alij = alji , have become zero.
This allows us to determine θ.
tan(2θ) = 2aˡij / (aˡii − aˡjj)  ;  aˡjj ≠ aˡii
and θ = π/4 when aˡjj = aˡii    (4.206)
1. Consider the off-diagonal elements row-wise, and each element of the row
in sequence. That is first consider row one of the original matrix [A] and
the off-diagonal element a12 (a21 = a12 ). We use a11 , a22 , and a12 to
determine θ using (4.206) and then use this value of θ to construct [P1 ]
using (4.204) and perform the orthogonal transformation on [A] to obtain
[A1 ].
[P1 ]T [A][P1 ] = [A1 ] (4.208)
In [A1 ], a112 and a121 have become zero.
2. Next consider [A1 ] and the next element in row one, i.e., a113 (a131 = a113 ).
Using a111 , a133 , and a113 and (4.206) determine θ and then use this value
7. Threshold Jacobi
In threshold Jacobi we perform an orthogonal transformation to zero out
an off-diagonal element of [A] (or of the most recently transformed [A])
and then check the magnitude of the next off-diagonal element to be made
zero. If it is below the threshold we skip the orthogonal transformation
for it and move to the next off-diagonal element in sequence, keeping
in mind that the same rule applies to the current off-diagonal element as well as those to come. It is clear that in this procedure we avoid
unnecessary transformation for the elements that are already within the
threshold of zero. Thus, the threshold Jacobi clearly is more efficient than
cyclic Jacobi.
[A]{x} = λ[I]{x}
in which
[A] = [ 2 −1 ; −1 4 ]
In this case there is only one off-diagonal term in [A], a12 = a21 = −1 (row 1,
column 2 ; i = 1, j = 2). We construct [P1 ] or [P12 ] to make a12 = a21 = −1
zero in [A]. The subscript 12 in [P12 ] implies the matrix [P ] corresponding
to the element of [A] located at row 1, column 2.
[P12] = [ cos θ  −sin θ ; sin θ  cos θ ]
tan 2θ = 2a12 / (a11 − a22) = 2(−1) / (2 − 4) = −2/−2 = 1
∴ 2θ = π/4 ; θ = π/8
cos(π/8) = 0.92388 ; sin(π/8) = 0.38268
∴ [P12] = [ 0.92388  −0.38268 ; 0.38268  0.92388 ]
∴ [A1 ] = [P12 ]T [A][P12 ]
or
[A1] = [ 0.92388  0.38268 ; −0.38268  0.92388 ] [ 2 −1 ; −1 4 ] [ 0.92388  −0.38268 ; 0.38268  0.92388 ]
     = [ 0.92388  0.38268 ; −0.38268  0.92388 ] [ 1.46508  −1.68924 ; 0.60684  4.0782 ]
     = [ 1.58578  0 ; 0  4.41421 ] = [ λ1  0 ; 0  λ2 ]
Hence, we have:
(λ1, {φ}1) = ( 1.58578, {0.92388, 0.38268}ᵀ )
and (λ2, {φ}2) = ( 4.41421, {−0.38268, 0.92388}ᵀ )
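A short Python sketch of the cyclic Jacobi method for a symmetric SEVP is given below; it uses the rotation angle of (4.206), and the convergence threshold and sweep limit are assumed values.

    import numpy as np

    def jacobi_eigen(A, tol=1.0e-10, max_sweeps=20):
        A = A.astype(float).copy()
        n = A.shape[0]
        Phi = np.eye(n)                  # accumulates [P1][P2]... (columns -> eigenvectors)
        for _ in range(max_sweeps):
            off = max(abs(A[i, j]) for i in range(n) for j in range(n) if i != j)
            if off < tol:
                break
            for i in range(n - 1):
                for j in range(i + 1, n):
                    if abs(A[i, j]) < tol:
                        continue                    # threshold-style skip
                    if A[i, i] == A[j, j]:
                        theta = np.pi / 4.0         # special case of (4.206)
                    else:
                        theta = 0.5 * np.arctan(2.0 * A[i, j] / (A[i, i] - A[j, j]))
                    P = np.eye(n)
                    P[i, i] = P[j, j] = np.cos(theta)
                    P[i, j] = -np.sin(theta)
                    P[j, i] = np.sin(theta)
                    A = P.T @ A @ P
                    Phi = Phi @ P
        return np.diag(A), Phi

    lam, Phi = jacobi_eigen(np.array([[2.0, -1.0], [-1.0, 4.0]]))
    print(lam)    # approximately [1.58579, 4.41421]
    print(Phi)    # columns approximately equal to the eigenvectors found above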
in which [A] and [B] are symmetric matrices and [B] 6= [I]. Our aim in the
generalized Jacobi method is to perform a series of orthogonal transformation
on (4.211) such that:
(i) [A] becomes a diagonal matrix and [B] becomes [I]. The diagonal ele-
ments of the final transformed [A], i.e., [Ak ] (after k transformations),
will be the eigenvalues of (4.211) and the columns of the product of the
transformation matrices will contain the corresponding eigenvectors.
(ii) [A] and [B] both become diagonal, but [B] is not an identity matrix. In
this approach the ratios of the diagonal elements of the transformed [A]
and [B], i.e., [Ak ] and [B k ] (after k transformations), will be the eigen-
values and the columns of the product of the transformation matrices
will contain the corresponding eigenvectors.
(iii) Using either (i) or (ii), the results remain unaffected. In designing
transformation matrices, (ii) is easier.
in which [P1 ] is orthogonal, i.e., [P1 ]T = [P1 ]−1 . Substituting from (4.212)
in (4.211) and premultiplying by [P1 ]T :
We choose [P1 ] such that an off-diagonal element of [A] and the corresponding
off-diagonal element of [B] become zero. If we define:
[A1 ] and [B 1 ] are the transformed [A] and [B] after the first orthogonal
transformation. Perform another change of basis on (4.215).
We choose [P2 ] such that it makes an off-diagonal element of [A1 ] and the
corresponding element of [B 1 ] zero and
The [Pl ] matrices are called rotation matrices, same as in the case of
the Jacobi method for the SEVP. In the design of [Pl ] we take into account
that [A] and [B] are symmetric. To be general, consider [Al ] and [B l ] after l
transformations. Let us say that we want to make alij and blij zero (alji and
blji are automatically made zero as we consider symmetry of [A] and [B] in
designing [Pl+1 ]), then [Pl+1 ] can have the following form.
[Pl+1] is an identity matrix except for the entries in rows and columns i and j:
(Pl+1)ii = 1 , (Pl+1)ij = α , (Pl+1)ji = β , (Pl+1)jj = 1    (4.225)
The parameters α and β are determined such that in the transformed [Aˡ] and [Bˡ], i.e., in
[Pl+1]ᵀ[Aˡ][Pl+1] and [Pl+1]ᵀ[Bˡ][Pl+1]    (4.226)
aˡij and bˡij are zero (hence, aˡji and bˡji are zero). Using [Pl+1] in (4.226) and setting aˡij = bˡij = 0 gives us the following two equations.
a12 = a21 = −1
b12 = b21 = 1
[P12] = [ 1  α ; β  1 ]
We note that λ’s are not in ascending order, but can be arranged in ascending
or descending order.
2. Using the tridiagonal form of [A] and [I] in the transformed (4.229),
we perform QR iterations to extract the eigenvalues and eigenvectors of
(4.229).
(ii) Unlike the Jacobi method, in the Householder method once a row and
the corresponding column are in tridiagonal form, subsequent transfor-
mations for other rows and columns do not affect them.
[An−2] = [Pn−2]ᵀ[Pn−3]ᵀ . . . [P2]ᵀ[P1]ᵀ[A][P1][P2] . . . [Pn−3][Pn−2]    (4.234)
{x} = [P1][P2] . . . [Pn−3][Pn−2]{x}n−2    (4.235)
[An−2 ] is the final tridiagonal form of [A]. We note that [I] remains unaf-
fected.
[P1] = [ 1  [0] ; {0}  [P̄1] ]  ;  [A] = [ a11  {a1}ᵀ ; {a1}  [A11] ]  ;  {w1} = { 0 ; {w̄1} }    (4.238)
where [P̄1], {w̄1} and [A11] are of order (n − 1). Premultiply [A] by [P1]ᵀ and post-multiply by [P1] to obtain [A1].
[A1] = [ 1  [0] ; {0}  [P̄1]ᵀ ] [ a11  {a1}ᵀ ; {a1}  [A11] ] [ 1  [0] ; {0}  [P̄1] ]    (4.239)
or
[A1] = [ a11  {a1}ᵀ[P̄1] ; [P̄1]ᵀ{a1}  [P̄1]ᵀ[A11][P̄1] ]    (4.240)
In [A1] the first row and the first column should be tridiagonal, i.e., [A1] should have the following form: the first row is (a11, x, 0, 0, . . . , 0), the first column is (a11, x, 0, . . . , 0)ᵀ, and the remaining (n − 1) × (n − 1) block is [Ā1].    (4.241)
[P̄1 ] is called the reflection matrix. We are using [P̄1 ] to reflect {a1 } of
[A] into a vector such that only its first component is non-zero (obvious by
comparing (4.240) and (4.241)). Since the length of the vector corresponding
to row one or column one (excluding a11 ) must be the same as the length of
{a1 }, we can use this condition to determine {w1 } (i.e., first {w̄1 } and then
{w1 }).
where a21 is the element (2,1) of matrix [A]. The vector {w1 } is obtained
using {w̄1 } from (4.244) in (4.238). Thus, [P1 ] is defined and we can perform
the Householder transformation for column one and row one to obtain [A1 ],
in which column one and row one are in tridiagonal form.
Next we consider column two and row two to obtain [P2 ] and then use:
In [A2 ] the first two columns and rows are in tridiagonal form. We continue
this (n − 2) times to finally obtain [An−2 ] in tridiagonal form, and we can
write:
[An−2 ]{x}n−2 = λ[I]{x}n−2 (4.247)
and {x} = [P1 ][P2 ] . . . [Pn−3 ][Pn−2 ]{x}n−2 (4.248)
Equation (4.248) is essential to recover the original eigenvector {x}.
The matrix [Q] is orthogonal and [R] is upper triangular. Since [Q] is or-
thogonal we perform a change of basis on (4.247).
or
[Q1 ]T [An−2 ][Q1 ]{x}1n−2 = λ[I]{x}1n−2 (4.253)
Using (4.251) in the left side of (4.253) for [An−2 ]:
[Q1 ]T [An−2 ][Q1 ] = [Q1 ]T [Q1 ][R1 ][Q1 ] = [R1 ][Q1 ] (4.254)
sin θ = aⁿ⁻²ji / √( (aⁿ⁻²ii)² + (aⁿ⁻²ji)² )  ;  cos θ = aⁿ⁻²ii / √( (aⁿ⁻²ii)² + (aⁿ⁻²ji)² )    (4.258)
contain the eigenvectors of the SEVP and the diagonal elements of the final
transformed [An−2 ] are the eigenvalues. We also note that these eigenvalues
are not in any particular order.
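As a hedged illustration of the QR iteration, the Python sketch below repeatedly factors the current matrix and forms [R][Q], accumulating the [Q] factors; numpy's built-in QR factorization is used in place of explicit Givens rotations such as (4.258), and the tolerance is an assumed value.

    import numpy as np

    def qr_iteration(A, tol=1.0e-10, max_iter=500):
        A = A.astype(float).copy()
        n = A.shape[0]
        V = np.eye(n)                      # product of the [Q] matrices
        for _ in range(max_iter):
            Q, R = np.linalg.qr(A)         # [A_k] = [Q][R]
            A = R @ Q                      # [A_{k+1}] = [R][Q] = [Q]^T [A_k][Q]
            V = V @ Q
            if np.max(np.abs(A - np.diag(np.diag(A)))) < tol:
                break
        return np.diag(A), V

    # tridiagonal matrix [A2] obtained in the Householder example that follows
    A2 = np.array([[5.0, 4.1231, 0.0, 0.0],
                   [4.1231, 7.8823, -4.0276, 0.0],
                   [0.0, -4.0276, 7.3941, 2.3219],
                   [0.0, 0.0, 2.3219, 1.7236]])
    lam, V = qr_iteration(A2)
    print(np.sort(lam))                    # eigenvalues, sorted in ascending order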
{a1}ᵀ = [−4 1 0]
sign(a21) = sign(−4) = −
||a1|| = √( (−4)² + (1)² + (0)² ) = √17 = 4.1231
∴ {w̄1} = {−4, 1, 0}ᵀ − 4.1231 {1, 0, 0}ᵀ = {−8.1231, 1, 0}ᵀ
{w1} = { 0 ; {w̄1} } = {0, −8.1231, 1, 0}ᵀ
θ = 2 / ( {w1}ᵀ{w1} ) = 2 / ( 0 + (−8.1231)(−8.1231) + (1)(1) + 0 ) = 0.029857
∴ [P1] = [I] − θ{w1}{w1}ᵀ =
| 1      0        0       0 |
| 0   −0.9701   0.2425    0 |
| 0    0.2425   0.9701    0 |
| 0      0        0       1 |
[A1] = [P1]ᵀ[A][P1] =
| 5       4.1231    0        0      |
| 4.1231  7.8823    3.5294  −1.9403 |
| 0       3.5294    4.1177  −3.6380 |
| 0      −1.9403   −3.6380   5      |
∴ {w̄2} = {3.5294, −1.9403}ᵀ + 4.0276 {1, 0}ᵀ = {7.5570, −1.9403}ᵀ
{w2} = {0, 0, 7.5570, −1.9403}ᵀ ;  θ = 2 / ( {w2}ᵀ{w2} ) = 0.032855
∴ [P2] = [I] − θ{w2}{w2}ᵀ =
| 1   0     0        0      |
| 0   1     0        0      |
| 0   0   −0.8763   0.4817  |
| 0   0    0.4817   0.8763  |
∴ [P2][A1][P2] = [A2] =
| 5       4.1231    0        0      |
| 4.1231  7.8823   −4.0276   0      |
| 0      −4.0276    7.3941   2.3219 |
| 0       0         2.3219   1.7236 |
[A2 ] is the final tridiagonal form of [A]. The tridiagonal form is symmetric
as expected.
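The Householder reflections used above can be sketched in a few lines of Python; the 4 × 4 matrix below is an assumed illustration (only its first row and column, 5, −4, 1, 0, are taken from this example), and numpy is assumed available.

    import numpy as np

    def householder_tridiagonalize(A):
        A = A.astype(float).copy()
        n = A.shape[0]
        P_total = np.eye(n)
        for c in range(n - 2):
            a = A[c + 1:, c]                           # column below the diagonal
            w = a.copy()
            w[0] += np.sign(a[0]) * np.linalg.norm(a)  # {w̄} = {a} + sign(a)||a||{e1}
            wf = np.zeros(n); wf[c + 1:] = w
            theta = 2.0 / (wf @ wf)
            P = np.eye(n) - theta * np.outer(wf, wf)   # reflection matrix
            A = P.T @ A @ P
            P_total = P_total @ P
        return A, P_total

    A = np.array([[5.0, -4.0, 1.0, 0.0],               # assumed illustrative matrix
                  [-4.0, 6.0, 2.0, -1.0],
                  [1.0, 2.0, 4.0, -3.0],
                  [0.0, -1.0, -3.0, 5.0]])
    T, P = householder_tridiagonalize(A)
    print(np.round(T, 4))                              # tridiagonal form
    print(np.allclose(np.sort(np.linalg.eigvalsh(T)),
                      np.sort(np.linalg.eigvalsh(A)))) # eigenvalues are preserved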
(2) We find the projections of [K] and [M ] in the space spanned by the q
matrices using:
(3) We solve the eigenvalue problem constructed using [Kk+1 ] and [Mk+1 ].
(2) Based on the diagonal elements of [K] and [M ], some guidelines can be
used.
(a) Construct ri = kii/mii . If {e1 }, {e2 }, . . . , {en } are unit vectors con-
taining one at rows 1, 2, . . . , n, respectively, and zeroes everywhere
else and if rj , rk , rl are progressively increasing values of ri , then
[x]n×3 consists of [{ej }, {ek }, {el }].
(3) The eigenvalue problem (4.264) can be solved effectively using Gener-
alized Jacobi or QR-Householder method in which all eigenpairs are
determined.
in which
[K] = [ 2 −1 0 0 ; −1 2 −1 0 ; 0 −1 2 −1 ; 0 0 −1 1 ]  ;  [M] = [ 1 0 0 0 ; 0 2 0 0 ; 0 0 4 0 ; 0 0 0 3 ]
In this case:
ri = kii/mii = 2, 1, 0.5, 1/3  for  i = 1, 2, 3, 4
(2) The vector iteration method is quite effective in determining the smallest
and the largest eigenpairs. The vector iteration method with iteration
vector deflation is quite effective in calculating a few eigenpairs. For a
large number of eigenpairs, the orthogonalization process becomes er-
ror prone due to inaccuracies in the numerically computed eigenvectors
(hence, not ensuring their orthogonal property). This can cause the
computations to become erroneous or even to break down completely.
(3) The Jacobi and generalized Jacobi methods yield all eigenpairs, hence
are not practical for eigensystems larger than (50 × 50) or at the most
(100 × 100). In these methods, the off-diagonal terms made zero in
an orthogonal transformation become non-zero in the next orthogonal
transformation, thus these methods may require a larger number of cy-
cles or sweeps before convergence is achieved.
(4) In the Householder method with QR iterations for the SEVP (only), we
tridiagonalize the [A] matrix by Householder transformations and then
use QR iterations on the tridiagonal form to extract the eigenpairs. The
tridiagonalization process is not iterative, but QR iterations are (as the
name suggests). This method also yields all eigenpairs and hence is only
efficient for eigensystems that are smaller than (50 × 50) or at the most
(100 × 100). Extracting eigenpairs using QR iterations is more efficient
with the tridiagonal form of [A] than the original matrix [A]. This is
the main motivation for converting (transforming) [A] to the tridiagonal
form before extracting eigenpairs.
(5) The subspace iteration method is perhaps the most practical method
for larger eigensystems as in this method a large eigenvalue problem
is reduced to a very small eigenvalue problem. Computation of the
eigenpairs only requires working with an eigensystem that is (q × q),
q > p, where p is the desired number of eigenpairs. Generally we choose
q = 2p.
Problems
4.1 Use minors to expand and compute the determinant of
| 2−λ   2    10  |
|  8   3−λ   4   |
| 10    4   5−λ  |
Use Faddeev-Leverrier method to perform the same computations. Also
compute the matrix inverse and verify that it is correct.
where
[A] = [ 1 0 0 ; 0 2 0 ; 0 0 3 ]  ;  [B] = [ 1 0 0 ; 0 1 0 ; 0 0 1 ]
(a) Use inverse iteration method to compute the lowest eigenvalue and
the corresponding eigenvector.
(b) Transform (1) into new SEVP such that the transformed eigenvalue
problem can be used to determine the largest eigenvalue and the
corresponding eigenvector of (1). Compute the largest eigenvalue
and the corresponding eigenvector using this form.
(c) For eigenvalue problem (1), use inverse iteration with iteration vec-
tor deflation technique to compute all its eigenpairs.
(d) Consider the transformed SEVP in (b). Apply inverse iteration
with iteration vector deflation to compute all eigenpairs.
(e) Tabulate and compare the eigenpairs computed in (c) and (d). Dis-
cuss and comment on the results.
where
[A] = [ 2 −1 0 0 ; −1 2 −1 0 ; 0 −1 2 −1 ; 0 0 −1 1 ]  ;  [B] = [ 1 0 0 0 ; 0 1 0 0 ; 0 0 1 0 ; 0 0 0 1 ]
(a) Use inverse iteration method with iteration vector deflation to com-
pute all eigenpairs of (1).
(b) Transform (1) into SEVP such that the transformed eigenvalue
problem can be used to find the largest eigenvalue and the cor-
responding eigenvector of (1). Apply inverse iteration method to
this transformed eigenvalue problem with iteration vector deflation
to compute all eigenpairs.
(c) Tabulate and compare the eigenpairs computed in (a) and (b). Dis-
cuss and comment on the results.
4.4 (a) Show that λ = 1 and λ = 4 are the eigenvalues of the following
eigenvalue problem without calculating them.
[ 5 3 ; 3 5 ] {x, y}ᵀ = λ [ 2 0 ; 0 2 ] {x, y}ᵀ
Calculate both eigenpairs of (1) using standard Jacobi method. Discuss and
show the orthogonality of the eigenvectors calculated in this method.
are its eigenvectors, then how are the eigenvalues and the eigenvectors of
this problem related to the eigenvalue problem [K̂]{x} = µ[M ]{x}?
Where
[K̂] = [K] + 1.5[M ]
[K] and [M ] are same as in problem 4.10
4.12 Consider the following eigenvalue problem [A]{x} = λ[B]{x}
[ 1 −1 ; −1 1 ] {x1, x2}ᵀ = λ [ 1 0 ; 0 1 ] {x1, x2}ᵀ    (1)
5.1 Introduction
When taking measurements in experiments, we often collect discrete data
that may describe the behavior of a desired quantity of interest at selected
discrete locations. These data in general may be over irregular domains
in R1 , R2 , and R3 . Constructing a mathematical description of these data
is helpful and sometimes necessary if we desire to perform operations of
integration, differentiation, etc. for the physics described by these data. One
of the techniques or methods of constructing a mathematical description
for discrete data is called interpolation. Interpolation yields an analytical
expression for the discrete data. This expression then can be integrated or
differentiated, thus permitting operations of integration or differentiation on
discrete data. The interpolation technique ensures that the mathematical
expression so generated will yield precise function values at the discrete
locations that are used in generating it.
When the discrete data belong to irregular domains in R1 , R2 , and R3 ,
the interpolations may be quite difficult to construct. To facilitate the in-
terpolations over irregular domains, the data in the irregular domains are
mapped into regular domains of known shape and size in R1 , R2 , and R3 .
The desired operations of integration and differentiation are also performed
in the mapped domain and then mapped back to the original (physical)
irregular domain. Details of the mapping theories and interpolations are
considered in this chapter. First, we introduce the concepts of interpolation
theory in R1 in the physical coordinate space (say x). This is followed by
mapping theory in R1 that maps data in the physical coordinate space to
the natural coordinate space ξ in a domain of two unit length with the origin
located at the center of the two unit length. The concepts of piecewise map-
ping in R1 , R2 , and R3 as well as interpolations over the mapped domains
are presented with illustrative examples.
[Figure: discrete data f1, f2, f3, . . . , fi, fi+1, . . . , fn+1 plotted versus x at the locations x1, x2, x3, . . . , xi, xi+1, . . . , xn+1.]
f(x) = a0 + a1x + a2x² + · · · + anxⁿ    (5.3)
f (x)|x=xi = fi ; i = 1, 2, . . . , n + 1 (5.4)
Remarks.
(1) In order for {a} to be unique det[A] 6= 0 must hold. This is ensured if
each xi location is distinct or unique.
(2) When two data point locations (i.e., xi values) are extremely close to
each other the coefficient matrix [A] may become ill-conditioned.
(3) For large data sets (large values of n), this method requires solutions of a
large system of linear simultaneous algebraic equations. This obviously
leads to inefficiency in its computations.
Theorem 5.1. There exists a unique polynomial ψ(x) of degree not exceed-
ing n called the Lagrange interpolating polynomial such that
ψ(xi ) = fi ; i = 1, 2, . . . , n (5.7)
Proof. The existence of the polynomial ψ(x) can be proven if we can estab-
lish the existence of polynomials Lk (x) ; k = 1, 2, . . . , n with the following
properties:
Hence ψ(x) has the desired properties of f (x), an interpolation for the data
set (xi , fi ) ; i = 1, 2, . . . , n.
n
X
f (x) = ψ(x) = fi Li (x) (5.10)
i=1
Remarks.
(1) Lk (x) ; k = 1, 2, . . . , n are polynomials of degree less than or equal to n.
(2) ψ(x) is a linear combination of fk and Lk (x), hence ψ(x) is also a poly-
nomial of degree less than or equal to n.
(3) ψ(xk ) = fk = f (xk ) because Lk (xi ) = 0 for i 6= k and Lk (xk ) = 1.
(4) Lk (x) are called Lagrange interpolating polynomials or Lagrange inter-
polation functions.
(5) The property Σⁿᵢ₌₁ Li(x) = 1 is essential due to the fact that if fi = f* ; i = 1, 2, . . . , n, then f(x) from (5.10) must be f* for all values of x, which is only possible if Σⁿᵢ₌₁ Li(x) = 1.
The functions Lk (x) defined in (5.11) have the desired properties (5.8).
Hence we can write
n
X
f (x) = ψ(x) = fi Li (x) ; f (xi ) = fi (5.12)
i=1
Therefore we have:
L1(x) = Π³(m=1, m≠1) (x − xm)/(x1 − xm) = (x − x2)(x − x3)/[(x1 − x2)(x1 − x3)] = (x − 0)(x − 1)/[(−1 − 0)(−1 − 1)] = x(x − 1)/2
L2(x) = Π³(m=1, m≠2) (x − xm)/(x2 − xm) = (x − x1)(x − x3)/[(x2 − x1)(x2 − x3)] = (x − (−1))(x − 1)/[(0 − (−1))(0 − 1)] = 1 − x²
L3(x) = Π³(m=1, m≠3) (x − xm)/(x3 − xm) = (x − x1)(x − x2)/[(x3 − x1)(x3 − x2)] = (x − (−1))(x − 0)/[(1 − (−1))(1 − 0)] = x(x + 1)/2
    (5.15)
L1 (x), L2 (x), and L3 (x) defined in (5.15) are the desired Lagrange polyno-
mials in (5.13), hence f (x) is defined.
Remarks. Lk (x); k = 1, 2, 3 in (5.15) have the desired properties.
(i) Li(xj) = 1 if j = i ; 0 if j ≠ i    (5.16)
(ii) Σ³ᵢ₌₁ Li(x) = x(x − 1)/2 + (1 − x²) + x(x + 1)/2 = 1    (5.17)
[Sketch: the three points x1 = −1, x2 = 0, x3 = 1 on the x-axis.]
(iv) We have
f (x) = f1 L1 (x) + f2 L2 (x) + f3 L3 (x) (5.18)
Substituting for L1 (x), L2 (x), and L3 (x) from (5.15) in (5.18):
f(x) = f1 x(x − 1)/2 + f2 (1 − x²) + f3 x(x + 1)/2    (5.19)
where f1 , f2 , f3 are given numerical values. The function f (x) in (5.19)
is the desired interpolating polynomial for the three data points. We
note that f (x) is a quadratic polynomial in x (i.e., a polynomial of
degree two).
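A small Python sketch of the Lagrange construction (5.11) and the interpolant (5.10) follows; the data values fi used in the demonstration are assumed, while the three locations x = −1, 0, 1 are those of this example.

    # Lagrange basis functions Lk(x) and the interpolant f(x) = Σ fk Lk(x).
    def lagrange_basis(xs, k, x):
        Lk = 1.0
        for m, xm in enumerate(xs):
            if m != k:
                Lk *= (x - xm) / (xs[k] - xm)     # product form of (5.11)
        return Lk

    def lagrange_interpolant(xs, fs, x):
        return sum(fs[k] * lagrange_basis(xs, k, x) for k in range(len(xs)))

    xs = [-1.0, 0.0, 1.0]
    fs = [2.0, 5.0, 3.0]                          # illustrative fi values
    print([lagrange_basis(xs, k, 0.5) for k in range(3)])   # these sum to 1
    print(lagrange_interpolant(xs, fs, 0.0))                # reproduces f2 = 5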
The Lagrange interpolating polynomial f (x) for this data set can be written
as:
L3(x) = Π⁴(m=1, m≠3) (x − xm)/(x3 − xm) = (x − x1)(x − x2)(x − x4)/[(x3 − x1)(x3 − x2)(x3 − x4)]
      = (27/16)(1 + x)(1 − x)(1/3 + x)
L4(x) = Π⁴(m=1, m≠4) (x − xm)/(x4 − xm) = (x − x1)(x − x2)(x − x3)/[(x4 − x1)(x4 − x2)(x4 − x3)]
      = −(9/16)(1/3 + x)(1/3 − x)(1 + x)
    (5.22)
Li (x); i = 1, 2, . . . , 4 in (5.22) are the desired Lagrange interpolating poly-
nomials in (5.20).
(i) Li(xj) = 1 if j = i ; 0 if j ≠ i    (5.23)
(ii) Σⁿᵢ₌₁ Li(x) = 1    (5.24)
5.3 Mapping in R1
Consider a line segment in one-dimensional space x with equally spaced
coordinates xi ; i = 1, 2, . . . , n (Figure 5.3). We want to map this line seg-
ment in another coordinate space ξ in which its length becomes two units
and xi ; i = 1, 2, . . . , n are mapped into locations ξi ; i = 1, 2, . . . , n respec-
tively. Thus x1 maps into location ξ1 = −1, and xn into location ξn = +1
and so on. This can be done rather easily if we recall that for the data set
(xi , fi ), the Lagrange interpolation f (x) is given by:
f(x) = Σⁿᵢ₌₁ fi Li(x)    (5.25)
[Figure 5.3: points 1, 2, 3, . . . , n at x1, x2, x3, . . . , xn in x-space mapped to ξ1 = −1, ξ2, ξ3, . . . , ξn = +1 in the two unit long ξ-space with the origin ξ = 0 at its center.]
If we replace fi by xi, then the data set (xi, fi) becomes (ξi, xi) and (5.25) becomes:
x(ξ) = Σⁿᵢ₌₁ xi Li(ξ)    (5.26)
The ξ-coordinate space is called natural coordinate space. The origin of the
ξ-coordinate space is considered at the middle of the map of two unit length
(Figure 5.3). Equation (5.26) indeed is the desired equation that describes
the mapping of points in x- and ξ-coordinate spaces.
Remarks.
(1) The Lagrange interpolation functions Li (ξ) are constructed using the
configuration of Figure 5.3 in the ξ-coordinate space.
(2) xi ; i = 1, 2, . . . , n are the Cartesian coordinates of the points on the line
segment in the Cartesian coordinate space.
(3) In equation (5.26), x(ξ) is expressed as a linear combination of the La-
grange polynomials Li (ξ) using the Cartesian coordinates of the points
in the x-space.
(4) If we choose a point −1 ≤ ξ ∗ ≤ 1, then (5.26) gives its corresponding
location in x-space.
x* = x(ξ*) = Σⁿᵢ₌₁ xi Li(ξ*)    (5.27)
(6) The Lagrange interpolation functions Lk (ξ) are given by replacing x and
xi ; i = 1, 2, .., n with ξ and ξi ; i = 1, 2, . . . , n in (5.11).
Lk(ξ) = Πⁿ(m=1, m≠k) (ξ − ξm)/(ξk − ξm)  ;  k = 1, 2, . . . , n    (5.28)
∴ x(ξ) = x1 (1 − ξ)/2 + x2 (1 + ξ)/2
or x(ξ) = (2)(1 − ξ)/2 + (6)(1 + ξ)/2
or x(ξ) = (1 − ξ) + 3(1 + ξ)
or x(ξ) = 4 + 2ξ    (5.31)
∴ L1(ξ) = (ξ − ξ2)(ξ − ξ3)/[(ξ1 − ξ2)(ξ1 − ξ3)] = ξ(ξ − 1)/2
L2(ξ) = (ξ − ξ1)(ξ − ξ3)/[(ξ2 − ξ1)(ξ2 − ξ3)] = 1 − ξ²    (5.33)
L3(ξ) = (ξ − ξ1)(ξ − ξ2)/[(ξ3 − ξ1)(ξ3 − ξ2)] = ξ(ξ + 1)/2
and x(ξ) = x1 ξ(ξ − 1)/2 + x2 (1 − ξ²) + x3 ξ(ξ + 1)/2
x(ξ) = (2) ξ(ξ − 1)/2 + (4)(1 − ξ²) + (6) ξ(ξ + 1)/2
x(ξ) = 4 + 2ξ    (5.34)
Remarks.
(1) We note that the mapping (5.34) is the same as (5.31). This is not a
surprise as the mapping in this case is also a linear stretch mapping due
to the fact that points in x-space are equally spaced. Hence, in this case
we could have used points 1 and 3 with coordinates x1 and x3 in x-space
and linear Lagrange interpolation
functions corresponding to points at ξ = −1 and ξ = 1, i.e., (1 − ξ)/2 and (1 + ξ)/2, to derive the mapping:
x(ξ) = (2)(1 − ξ)/2 + (6)(1 + ξ)/2 = 4 + 2ξ    (5.35)
(2) The conclusion in (1) also holds for more than three equally spaced
points in the x-space.
(3) From (1) and (2) we conclude that when the points in x-space are equally spaced it is only necessary to use the coordinates of the two end points with the linear Lagrange polynomials (1 − ξ)/2 and (1 + ξ)/2 for the mapping between x- and ξ-spaces. Thus, if we have n equally spaced points in x-space that are mapped into ξ-space, the following x(ξ) can be used for defining the mapping.
x(ξ) = x1 (1 − ξ)/2 + xn (1 + ξ)/2    (5.36)
x(ξ) = (2) ξ(ξ − 1)/2 + (3)(1 − ξ²) + (6) ξ(ξ + 1)/2
or x(ξ) = 3 + 2ξ + ξ²    (5.37)
On the other hand, if we used linear mapping, i.e., x1, x3 and (1 − ξ)/2, (1 + ξ)/2, we obtain:
x(ξ) = (2)(1 − ξ)/2 + (6)(1 + ξ)/2
or x(ξ) = 4 + 2ξ (a linear stretch mapping)    (5.38)
When
ξ = −1 : x = 2 = x1
ξ = 1 : x = 6 = x3
ξ = 0 : x = 4 ≠ x2
Thus, mapping (5.38) is not valid in this case.
Remarks.
(1) Mapping (5.37) is not a stretch mapping due to the fact that points in
x-space are not equally spaced. In mapping (5.37) the length between
points 2 and 1 in x-space (x2 − x1 = 3 − 2 = 1) is mapped into unit length (ξ2 − ξ1 = 1) in the ξ-space. On the other hand the length between points 3 and 2 (x3 − x2 = 6 − 3 = 3) is also mapped into the unit length
(ξ3 − ξ2 = 1) in the ξ-space. Hence, this mapping is not a linear stretch
mapping for the entire domain in x-space.
(2) From this example we conclude that when the points in the x-space are
not equally spaced, we must utilize all points in the x-space in deriving the mapping. This is necessitated due to the fact that in this case the
mapping is not a linear stretch mapping.
i 1 2 3
xi 2 4 6
fi 0 10 0
Derive Lagrange interpolating polynomial for this data set using the map
of xi ; i = 1, 2, 3 in ξ-space in two unit length.
in which
L1(ξ) = ξ(ξ − 1)/2 ; L2(ξ) = (1 − ξ²) ; L3(ξ) = ξ(ξ + 1)/2
∴ f(ξ) = (0) ξ(ξ − 1)/2 + (1 − ξ²)(10) + (0) ξ(ξ + 1)/2
∴ f(ξ) = 10(1 − ξ²)    (5.43)
Hence, (5.41) and (5.43) complete the interpolation of the data in the table.
For a given ξ ∗ , we obtain f (ξ ∗ ) from (5.43) that corresponds to x∗ (in x-
space) obtained using (5.41), i.e., x∗ = x(ξ ∗ ).
[Figure: data points 1, 2, . . . , n in x-space grouped into subdomains Ω̄⁽ᵉ⁾; a typical subdomain with points xi−1, xi, xi+1 is mapped to ξ = −1, 0, +1 in ξ-space.]
Remarks.
(1) Rather than choosing three points for a subdomain Ω̄(e) , we could have
chosen four points in which case f (e) (ξ) over Ω̄(ξ) would be a polynomial
of degree three.
(2) Choice of the number of points for a subdomain depends upon the degree
p(e) of the Lagrange interpolation f (e) (ξ) desired.
(3) We note that when the entire data set (xi , fi ) ; i = 1, 2, . . . , n is interpo-
lated using a single Lagrange interpolating polynomial f (ξ), then f (ξ) is
a polynomial of degree (n − 1), hence it is of class C n−1 , i.e., derivatives
of f (ξ) of up to order (n − 1) are continuous.
(4) When we use piecewise mapping and interpolation, then f⁽ᵉ⁾(ξ) is of class C^(pe) ; pe ≤ n, but f(ξ) given by ∪ᴹ(e=1) f⁽ᵉ⁾(ξ) is only of class C⁰. This
is due to the fact that at the mating boundaries between the subdomains
Ω̄(e) , only the function f is continuous. This is a major and fundamental
difference between piecewise interpolation and using a single Lagrange
polynomial describing the interpolation of the data.
i : 1, 2, 3 ; xi : 2, 4, 6 (i.e., x1, x2, x3) ; fi : 0, 10, 0 (i.e., f1, f2, f3)
we have:
x(ξ) = 4 + 2ξ (5.46)
2
f (ξ) = 10(1 − ξ ) (5.47)
In the second case, we consider piecewise mapping and interpolations using
the subdomains Ω̄(1) = [x1 , x2 ], Ω̄(2) = [x2 , x3 ].
Consider Ω̄(1)
x1 = 2 , x2 = 4 ; f1 = 0 , f2 = 10
[Sketch: (a) subdomain Ω̄⁽¹⁾ with points 1, 2 at x1 = 2, x2 = 4 in x-space ; (b) its map with ξ1 = −1, ξ2 = 1 in ξ-space.]
∴ x⁽¹⁾(ξ) = x1 (1 − ξ)/2 + x2 (1 + ξ)/2 = (2)(1 − ξ)/2 + (4)(1 + ξ)/2
∴ x⁽¹⁾(ξ) = 3 + ξ    (5.48)
and
f⁽¹⁾(ξ) = f1 (1 − ξ)/2 + f2 (1 + ξ)/2 = (0)(1 − ξ)/2 + (10)(1 + ξ)/2
∴ f⁽¹⁾(ξ) = 5(1 + ξ)    (5.49)
Consider Ω̄(2)
x2 = 4 , x3 = 6 ; f2 = 10 , f3 = 0
[Sketch: (a) subdomain Ω̄⁽²⁾ with points 2, 3 at x2 = 4, x3 = 6 in x-space ; (b) its map with ξ1 = −1, ξ2 = 1 in ξ-space.]
∴ x⁽²⁾(ξ) = x2 (1 − ξ)/2 + x3 (1 + ξ)/2 = (4)(1 − ξ)/2 + (6)(1 + ξ)/2
x⁽²⁾(ξ) = 5 + ξ    (5.50)
and
f⁽²⁾(ξ) = f2 (1 − ξ)/2 + f3 (1 + ξ)/2 = (10)(1 − ξ)/2 + (0)(1 + ξ)/2
∴ f⁽²⁾(ξ) = 5(1 − ξ)    (5.51)
Summary
(i) Single Lagrange polynomial for the whole domain Ω = [x1 , x3 ]
x(ξ) = 4 + 2ξ
(5.52)
f (ξ) = 10(1 − ξ 2 )
(ii) Piecewise interpolation
(a) Subdomain Ω̄(1) :
x(1) (ξ) = 3 + ξ
(5.53)
f (1) (ξ) = 5(1 + ξ)
(b) Subdomain Ω̄(2) :
x(2) (ξ) = 5 + ξ
(5.54)
f (2) (ξ) = 5(1 − ξ)
Figure 5.6 shows plots of f (ξ), f (1) (ξ), and f (2) (ξ) over Ω̄ = [x1 , x3 ] = [2, 6].
[Figure 5.6: plots of f(ξ) = 10(1 − ξ²), f⁽¹⁾(ξ) = 5(1 + ξ), and f⁽²⁾(ξ) = 5(1 − ξ) versus x over Ω̄ = [2, 6].]
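The piecewise interpolation summarized in (5.53) and (5.54) can be evaluated with a short Python sketch; the inverse of the linear map used to find ξ for a given x is written out explicitly, and the subdomain data are those of this example.

    # Piecewise linear interpolation over two mapped subdomains.
    def piecewise_eval(x, subdomains):
        # subdomains: list of ((x_left, x_right), (f_left, f_right))
        for (xl, xr), (fl, fr) in subdomains:
            if xl <= x <= xr:
                xi = (2.0 * x - (xl + xr)) / (xr - xl)    # inverse of the linear map
                return fl * (1.0 - xi) / 2.0 + fr * (1.0 + xi) / 2.0
        raise ValueError("x outside the interpolated domain")

    subdomains = [((2.0, 4.0), (0.0, 10.0)), ((4.0, 6.0), (10.0, 0.0))]
    print(piecewise_eval(3.0, subdomains))   # 5.0, from f^(1)(ξ) = 5(1+ξ) at ξ = 0
    print(piecewise_eval(5.0, subdomains))   # 5.0, from f^(2)(ξ) = 5(1-ξ) at ξ = 0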
and
f(ξ) = Σⁿᵢ₌₁ fi Li(ξ)    (5.56)
L̃i(ξ) and Li(ξ) are suitable Lagrange polynomials in ξ for mapping of points and interpolation. We note that (5.55) only describes mapping of points, i.e., given ξ* we can obtain x* using (5.55), x* = x(ξ*). Mapping of length in x- and ξ-spaces requires a different relationship than (5.55). Consider the differential of (5.55):
dx(ξ) = Σⁿᵢ₌₁ xi ( dL̃i(ξ)/dξ ) dξ = ( Σⁿᵢ₌₁ xi dL̃i(ξ)/dξ ) dξ    (5.57)
Let
J = Σⁿᵢ₌₁ xi dL̃i(ξ)/dξ    (5.58)
∴ dx(ξ) = Jdξ (5.59)
Equation (5.59) describes a relationship between elemental lengths dξ and
dx in ξ- and x-spaces. J is called the Jacobian of mapping.
From (5.56), we note that f is a function of ξ; thus if we require df/dx, it cannot be obtained directly using (5.56). Differentiate (5.56) with respect to ξ and, since x = x(ξ), we also have ξ = ξ(x) (inverse of the mapping), hence we can use the chain rule of differentiation whenever needed.
df(ξ)/dx = Σⁿᵢ₌₁ fi dLi(ξ)/dx    (5.60)
5.6. MAPPING OF LENGTH AND DERIVATIVES OF F (·) 215
2 k
The derivatives dLdξ i (ξ)
, d dξ
Li (ξ)
2 , . . . , d dξ
Li (ξ)
k ; k = 1, 2, . . . , n can be deter-
mined by differentiating Li (ξ); i = 1, 2, . . . , n with respect to ξ. Hence, the
derivatives of f (ξ) with respect to x of any desired order can be determined.
The mapping of length and derivatives of f in the two spaces (x and ξ)
is quite important in many other instances than just obtaining derivatives
of f (·) with respect to x. We illustrate this in the following.
I = ∫_{−1}^{1} f (ξ) J dξ                              (5.69)
Use all three data points to derive the mapping x(ξ). Also derive the mapping
using points x1 and x3. Construct the Lagrange polynomial f (ξ) using all three data points.
But since the points are equally spaced, x2 = (x1 + x3)/2, hence we obtain the
same mapping as (5.71); in this case also J = h/2. Thus, for a linear stretch
mapping between x and ξ, the mapping is always linear in ξ and hence
J = h/2, h being the length of the domain in x-space.
As derived earlier, f (ξ) is given by

f (ξ) = 10(1 − ξ²) ,   J = (x3 − x1)/2 = (6 − 2)/2 = 2

df/dx = (1/J) df/dξ = (1/2)(−20ξ) = −10ξ

d²f/dx² = (1/J²) d²f/dξ² = (1/2²)(−20) = −5
Choice of data points for subdomains (i.e., Figure 5.8(a) or 5.8(b)) is not
arbitrary but depends upon the degree of interpolation desired for Ω̄(e) as
shown later.
Instead of choosing quadrilateral subdomains we could have chosen tri-
angular or any other desired shape of subdomain. We illustrate the details
using quadrilateral subdomains. Constructing interpolations of data over
subdomains of Figures 5.8(a) and 5.8(b) is quite a difficult task due to irreg-
ular geometry. This task can be simplified by using mapping of the domains
of Figures 5.8(a) and 5.8(b) into regular shapes such as two unit squares,
and then constructing interpolation in the mapped domain.
[Figure 5.8: (a), (b) two discretizations of Ω̄ into quadrilateral subdomains; typical four-node and nine-node subdomains Ω̄(e) in xy-space.]

The Lagrange polynomials Li(ξ, η) ; i = 1, 2, . . . , 4 used in the mapping have the following
properties.
1. Li(ξj, ηj) = 1 for j = i and 0 for j ≠ i ;   i, j = 1, 2, . . . , 4

2. Σ_{i=1}^{4} Li(ξ, η) = 1                            (5.75)

3. Li(ξ, η) ; i = 1, 2, . . . , 4 are polynomials of degree
   less than or equal to 2 in ξ and η
Equations (5.76) are the desired equations for mapping of points between
Ω̄(e) and Ω̄(ξη) .
Remarks.
(1) Equations (5.76) are explicit in ξ and η, i.e., given values of ξ and η
(ξ ∗ , η ∗ ), we can use (5.76) to determine their map (x∗ , y ∗ ) using (5.76),
(x∗ , y ∗ ) = (x(ξ ∗ , η ∗ ), y(ξ ∗ , η ∗ )).
(2) Equations (5.76) are implicit in x and y. Given (x∗, y∗) in xy-space, de-
termination of its map (ξ∗, η∗) in ξη-space requires the solution of simultaneous
(generally nonlinear) algebraic equations in ξ and η.
Figure 5.9: (a) Four-node Ω̄(e) in xy-space; (b) map of Ω̄(e) into Ω̄(ξ,η), a two unit square in ξη coordinate space; (c) a nine-node Ω̄(e) in xy-space; (d) map of Ω̄(e) into Ω̄(ξ,η), a two unit square in ξη coordinate space
(4) As shown subsequently, this mapping is bilinear due to the fact that
Li (ξ, η) in this case are linear in both ξ and η.
(5) For the nine-node configuration of Figures 5.9(c) and 5.9(d) we can write:

x(ξ, η) = Σ_{i=1}^{9} Li(ξ, η) xi
                                                    (5.77)
y(ξ, η) = Σ_{i=1}^{9} Li(ξ, η) yi
(6) Choice of the configurations of nodes (as in Figures 5.9(a) and 5.9(c)) is
not arbitrary and is based on the degree of the polynomial desired in ξ
and η, and can be determined using Pascal’s rectangle.
x(ξ, η) = c0 + c1 ξ + c2 η + c3 ξη (5.78)
y(ξ, η) = d0 + d1 ξ + d2 η + d3 ξη (5.79)
In this case the choice of monomials 1, ξ, η, and ξη was not too difficult.
However in case of nine-node configuration of Figure 5.9(d) the choice of the
monomials in ξ and η is not too obvious. Pascal’s rectangle facilitates (i)
the selection of monomials in ξ and η for complete polynomials of a chosen
degree in ξ and η, and (ii) determination of the number of nodes and their
location in the two unit square in ξη-space.
Consider increasing powers of ξ and η in the horizontal and vertical
directions (see Figure 5.10). This arrangement is called Pascal’s rectangle.
We can choose up to the desired degree in ξ and η using Figure 5.10. The
terms located at the intersections of ξ and η lines are the desired monomial
terms. The locations of the monomial terms are the locations of the nodes
in the ξη configuration.
Figure 5.10: Pascal's rectangle (increasing powers of ξ horizontally, increasing powers of η vertically):

1     ξ      ξ²      ξ³      ξ⁴
η     ξη     ξ²η     ξ³η     ξ⁴η
η²    ξη²    ξ²η²    ξ³η²    ξ⁴η²
η³    ξη³    ξ²η³    ξ³η³    ξ⁴η³
η⁴    ξη⁴    ξ²η⁴    ξ³η⁴    ξ⁴η⁴
Remarks.
(3) Pascal’s rectangle is still extremely useful as it can tell us the nodal
configurations and the monomials for complete polynomials of desired
degrees in ξ and η.
[Figure 5.11: (a) four-node configuration in ξη-space with corner nodes at (−1, −1), (1, −1), (1, 1), (−1, 1); (b) two-node configurations in the ξ- and η-directions.]
In the ξ-direction:

Lξ1(ξ) = (1 − ξ)/2 ,   Lξ2(ξ) = (1 + ξ)/2             (5.84)

In the η-direction:

Lη1(η) = (1 − η)/2 ,   Lη2(η) = (1 + η)/2             (5.85)
Arrange Lξ1 (ξ) and Lξ2 (ξ) as a vector along with their ξ coordinates of −1 and
+1. Note that ±1 are not elements of the vector, they have been included
to indicate the location of the node corresponding to each Lk . In this case,
this arrangement gives a 2 × 1 vector of Lξ1 and Lξ2 .
{Lξ1(ξ) ; Lξ2(ξ)}ᵀ = {(1 − ξ)/2 ; (1 + ξ)/2}ᵀ ,  located at ξ = −1 and ξ = +1          (5.86)

Arrange Lη1(η) and Lη2(η) as a row matrix along with their η coordinates of
−1 and +1:

[Lη1(η) , Lη2(η)] = [(1 − η)/2 , (1 + η)/2] ,  located at η = −1 and η = +1            (5.87)
Take the product of Lξi (ξ) in (5.86) with Lηj (η) in (5.87), keeping their ξ, η co-
ordinates together with the product terms. This is called the tensor product.
{Lξ1(ξ) ; Lξ2(ξ)} [Lη1(η) , Lη2(η)] = [ Lξ1(ξ)Lη1(η)   Lξ1(ξ)Lη2(η) ]
                                      [ Lξ2(ξ)Lη1(η)   Lξ2(ξ)Lη2(η) ]                 (5.88)

with the entries located at (−1, −1), (−1, +1), (+1, −1), and (+1, +1), respectively.
Substituting for Lξ1(ξ), Lξ2(ξ), Lη1(η), and Lη2(η) in (5.88):

((1 − ξ)/2)((1 − η)/2) = L1(ξ, η)   at (−1, −1)
((1 + ξ)/2)((1 − η)/2) = L2(ξ, η)   at (+1, −1)
((1 + ξ)/2)((1 + η)/2) = L3(ξ, η)   at (+1, +1)
((1 − ξ)/2)((1 + η)/2) = L4(ξ, η)   at (−1, +1)                                        (5.89)
The coordinates ξ, η associated with the terms and their comparisons with
the ξ, η coordinates of the four nodes in Figure 5.11(a) identifies Li (ξ, η) ; i =
1, 2, . . . , 4 for the four-node configuration of Figure 5.11(a). We could view
this process as the two-node configuration in η direction (Figure 5.11(b))
traversing along the ξ-direction. As it encounters a node in the ξ-direction,
we obtain a trace of the nodes. Each node of the trace contains products of
1D functions in ξ and η as the two 2D functions in ξ and η. Thus, we have
for the four-node configuration of Figure 5.11(a):
L1(ξ, η) = ((1 − ξ)/2)((1 − η)/2)

L2(ξ, η) = ((1 + ξ)/2)((1 − η)/2)
                                                    (5.90)
L3(ξ, η) = ((1 + ξ)/2)((1 + η)/2)

L4(ξ, η) = ((1 − ξ)/2)((1 + η)/2)
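The tensor-product construction in (5.88)–(5.90) is easy to check numerically; a minimal sketch follows (Python with NumPy; the helper names are illustrative, and the node ordering is the one used in (5.89)–(5.90)).

```python
import numpy as np

# 1D linear Lagrange functions on [-1, 1]
def L1d(s):
    return np.array([(1.0 - s) / 2.0, (1.0 + s) / 2.0])

# 2D bilinear functions by tensor product, ordered as in (5.90):
# node 1 (-1,-1), node 2 (+1,-1), node 3 (+1,+1), node 4 (-1,+1)
def L2d(xi, eta):
    lx, le = L1d(xi), L1d(eta)
    return np.array([lx[0]*le[0], lx[1]*le[0], lx[1]*le[1], lx[0]*le[1]])

nodes = [(-1, -1), (1, -1), (1, 1), (-1, 1)]
for j, (xj, ej) in enumerate(nodes):
    vals = L2d(xj, ej)
    # Kronecker delta property and partition of unity at the nodes
    assert abs(vals[j] - 1.0) < 1e-14 and abs(vals.sum() - 1.0) < 1e-14

print(L2d(0.3, -0.2), L2d(0.3, -0.2).sum())   # partition of unity at an interior point
```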
[Figure: nine-node configuration in ξη-space (nodes 1, 2, 3 along η = −1; 8, 9, 4 along η = 0; 7, 6, 5 along η = +1) and the three-node 1D configurations Lη1, Lη2, Lη3 at η = −1, 0, +1.]

Taking the tensor product of the 1D quadratic functions in ξ and η and comparing locations with the nine-node configuration gives

[ L1(ξ, η)   L8(ξ, η)   L7(ξ, η) ]
[ L2(ξ, η)   L9(ξ, η)   L6(ξ, η) ]
[ L3(ξ, η)   L4(ξ, η)   L5(ξ, η) ]
Thus Li (ξ, η) ; i = 1, 2, . . . , 9 are completely determined.
Recall
Lξ1(ξ) = ξ(ξ − 1)/2 ,   Lξ2(ξ) = (1 − ξ²) ,   Lξ3(ξ) = ξ(ξ + 1)/2
                                                                    (5.92)
Lη1(η) = η(η − 1)/2 ,   Lη2(η) = (1 − η²) ,   Lη3(η) = η(η + 1)/2
Remarks.
(1) Using this procedure it is possible to determine the complete polynomials
Li (ξ, η) for any desired degree in ξ and η.
(2) Hence, for mapping of geometry Ω̄(e) to Ω̄(ξ,η) we can write in general:

x(ξ, η) = Σ_{i=1}^{ñ} L̃i(ξ, η) xi
                                                    (5.93)
y(ξ, η) = Σ_{i=1}^{ñ} L̃i(ξ, η) yi

The choice of ñ and L̃i(ξ, η) depends upon the degrees of the polynomial in ξ
and η and is defined by Pascal's rectangle. We have intentionally used ñ
and L̃i(ξ, η) for the mapping of geometry.
x(ξ, η) = Σ_{i=1}^{ñ} L̃i(ξ, η) xi
                                                    (5.94)
y(ξ, η) = Σ_{i=1}^{ñ} L̃i(ξ, η) yi

ñ and L̃i(ξ, η) are a suitably chosen number of nodes and the Lagrange inter-
polation polynomials for mapping of Ω̄(e) to Ω̄(ξ,η). The function values fi
at the nodes of Ω̄(e) or Ω̄(ξ,η) can be interpolated using:

f (e)(ξ, η) = Σ_{i=1}^{n} Li(ξ, η) fi                  (5.95)
in which

dx = (∂x/∂ξ) dξ + (∂x/∂η) dη
                                                    (5.96)
dy = (∂y/∂ξ) dξ + (∂y/∂η) dη

Hence

{ dx }   [ ∂x/∂ξ   ∂x/∂η ] { dξ }        { dξ }
{ dy } = [ ∂y/∂ξ   ∂y/∂η ] { dη }  = [J] { dη }     (5.97)

where

[J] = [ ∂x/∂ξ   ∂x/∂η ]
      [ ∂y/∂ξ   ∂y/∂η ]                             (5.98)

The matrix [J] is called the Jacobian of the mapping. The matrix [J] provides
a relationship between the elemental lengths dξ, dη and dx, dy in ξη- and xy-
spaces.
[Figure: elemental lengths dx, dy in Ω̄(e) (xy-space, unit vectors ~i, ~j) and dξ, dη in Ω̄(ξ,η) (ξη-space, unit vectors ~eξ, ~eη).]
The magnitude of this vector represents the area formed by these two vectors,
i.e., dΩ. Thus:

dx~i × dy~j = dx dy (~i × ~j) = dx dy ~k              (5.99)

But

dx~i = (∂x/∂ξ) dξ ~eξ + (∂x/∂η) dη ~eη                (5.100)

dy~j = (∂y/∂ξ) dξ ~eξ + (∂y/∂η) dη ~eη                (5.101)

∴ dx~i × dy~j = dx dy ~k = ((∂x/∂ξ) dξ ~eξ + (∂x/∂η) dη ~eη) × ((∂y/∂ξ) dξ ~eξ + (∂y/∂η) dη ~eη)
                                                      (5.102)

Expanding the right side of (5.102):

dx dy ~k = (∂x/∂ξ)(∂y/∂ξ) dξ dξ (~eξ × ~eξ) + (∂x/∂η)(∂y/∂ξ) dη dξ (~eη × ~eξ)
         + (∂x/∂ξ)(∂y/∂η) dξ dη (~eξ × ~eη) + (∂x/∂η)(∂y/∂η) dη dη (~eη × ~eη)         (5.103)
Noting that ~eξ × ~eξ = ~eη × ~eη = 0 and ~eη × ~eξ = −(~eξ × ~eη), (5.103) reduces to dx dy = det[J] dξ dη,

or dΩ = |J| dΩ(ξ,η)                                   (5.108)
df/dx = Σ_{i=1}^{n} fi ∂Li(ξ, η)/∂x                    (5.109)

df/dy = Σ_{i=1}^{n} fi ∂Li(ξ, η)/∂y                    (5.110)

∴ { ∂Li/∂x }              { ∂Li/∂ξ }
  { ∂Li/∂y } = [J^T]^{−1} { ∂Li/∂η }                   (5.113)

Hence, ∂f/∂x and ∂f/∂y in (5.109) and (5.110) are now explicitly defined, hence
can be determined.
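A compact numerical sketch of (5.98) and (5.113) follows (Python with NumPy; the helper names are illustrative). The node coordinates and nodal values are those of the worked four-node example appearing later in this chapter.

```python
import numpy as np

# Four-node bilinear quad, node order (-1,-1), (+1,-1), (+1,+1), (-1,+1)
xy = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [0.0, 2.0]])
f  = np.array([0.0, 1.0, 2.0, 1.0])

def dshape(xi, eta):
    """dL_i/dxi and dL_i/deta of the bilinear functions, as an n x 2 array."""
    return 0.25 * np.array([[-(1 - eta), -(1 - xi)],
                            [ (1 - eta), -(1 + xi)],
                            [ (1 + eta),  (1 + xi)],
                            [-(1 + eta),  (1 - xi)]])

xi, eta = 0.25, -0.4
dL = dshape(xi, eta)
J = xy.T @ dL                        # [J] of eq. (5.98)
dL_xy = dL @ np.linalg.inv(J)        # eq. (5.113) applied to all nodes at once
grad_f = f @ dL_xy                   # [df/dx, df/dy], eqs. (5.109)-(5.110)
print(np.linalg.det(J), grad_f)
```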
Remarks.
(1) Many remarks made in Section 5.6 for mapping and interpolation in R1
hold here as well.
(2) We recognize that piecewise interpolation over the Ω̄(e) facilitates the process
    but is not the same as interpolation over the entire Ω̄, due to the limited
    differentiability of f (x, y) = ⋃_e f (e)(x, y) (only C^0 in this case) in the
    piecewise process.
and

Σ_{i=1}^{m} Ni(ξ, η) = 1                              (5.115)
[Figure: nine-node and four-node configurations in ξη-space; the sides of Ω̄(ξη) are the straight lines 1 + ξ = 0, 1 − ξ = 0, 1 + η = 0, 1 − η = 0.]
(a) First, we note that the four sides of the domain Ω̄(ξη) are described by the
equations of the straight lines as shown in the figure. Consider node 1.
N1 (ξ, η) is one at node 1 and zero at nodes 2,3 and 4. Hence, equations
of the straight lines connecting nodes 2 and 3 and nodes 3 and 4 can be
used to derive N1 (ξ, η). That is,
N1 (ξ, η) = c1 (1 − ξ)(1 − η) (5.116)
in which c1 is a constant. But N1 (−1, −1) = 1, hence using (5.116) we
get
N1(−1, −1) = 1 = c1 (1 − (−1))(1 − (−1))  ⇒  c1 = 1/4          (5.117)
Thus, we have

N1(ξ, η) = (1/4)(1 − ξ)(1 − η)                        (5.118)
which is the correct approximation function for node 1 of the bilinear
approximation functions. Similarly, for nodes 2, 3, and 4 we can write
N2 (ξ, η) = c2 (1 + ξ)(1 − η)
N3 (ξ, η) = c3 (1 + ξ)(1 + η) (5.119)
N4 (ξ, η) = c4 (1 − ξ)(1 + η)
But

N2(1, −1) = 1  ⇒  c2 = 1/4
N3(1, 1)  = 1  ⇒  c3 = 1/4                            (5.120)
N4(−1, 1) = 1  ⇒  c4 = 1/4
Thus, from (5.119) and (5.120) we obtain

N2(ξ, η) = (1/4)(1 + ξ)(1 − η)
N3(ξ, η) = (1/4)(1 + ξ)(1 + η)                        (5.121)
N4(ξ, η) = (1/4)(1 − ξ)(1 + η)
(5.118) and (5.121) are the correct approximation functions for the four-
node bilinear approximation functions.
(b) In the above derivations we have only utilized the property (5.114), hence
we must show that the interpolation functions in (5.118) and (5.121)
satisfy (5.115). In this case, obviously they do. However, this may not
always be the case.
Since

N1(ξ, η)|_(−1,−1) = 1  ⇒  c1 = −1/4                   (5.123)
[Figure: eight-node serendipity configuration in ξη-space; the sides are 1 + ξ = 0, 1 − ξ = 0, 1 + η = 0, 1 − η = 0, and the line 1 + ξ + η = 0 passes through the two mid-side nodes adjacent to corner node 1.]
we obtain

N1(ξ, η) = −(1/4)(1 − ξ)(1 − η)(1 + ξ + η)            (5.124)
For nodes 3, 5, and 7 one may use the equations of the lines indicated in
Fig. 5.17 and the conditions similar to (5.123) for N2 , N3 , and N4 .
[Figure: equations of the straight lines used for corner nodes 3, 5, and 7: 1 − ξ + η = 0, 1 − ξ − η = 0, 1 + ξ − η = 0, together with the sides 1 ± ξ = 0, 1 ± η = 0.]
For the mid-side nodes, the product of the equations of the straight lines not
containing the mid-side node provides the needed expression, and we have
N1 = (1/4)(1 − ξ)(1 − η)(−1 − ξ − η)
N2 = (1/2)(1 − ξ²)(1 − η)
N3 = (1/4)(1 + ξ)(1 − η)(−1 + ξ − η)
N4 = (1/2)(1 + ξ)(1 − η²)
N5 = (1/4)(1 + ξ)(1 + η)(−1 + ξ + η)
N6 = (1/2)(1 − ξ²)(1 + η)
N7 = (1/4)(1 − ξ)(1 + η)(−1 − ξ + η)
N8 = (1/2)(1 − ξ)(1 − η²)
In this case also we must show that Σ_{i=1}^{8} Ni(ξ, η) = 1, which holds.
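The partition-of-unity check can be done numerically; a minimal sketch follows (Python with NumPy; the function name is illustrative and the node ordering is the one used above, corner nodes 1, 3, 5, 7 and mid-side nodes 2, 4, 6, 8).

```python
import numpy as np

def serendipity8(xi, eta):
    """Eight-node serendipity functions N_1 ... N_8 as listed above."""
    return np.array([
        0.25 * (1 - xi) * (1 - eta) * (-1 - xi - eta),
        0.50 * (1 - xi**2) * (1 - eta),
        0.25 * (1 + xi) * (1 - eta) * (-1 + xi - eta),
        0.50 * (1 + xi) * (1 - eta**2),
        0.25 * (1 + xi) * (1 + eta) * (-1 + xi + eta),
        0.50 * (1 - xi**2) * (1 + eta),
        0.25 * (1 - xi) * (1 + eta) * (-1 - xi + eta),
        0.50 * (1 - xi) * (1 - eta**2),
    ])

# partition of unity at random points in the two unit square
for xi, eta in np.random.uniform(-1.0, 1.0, size=(5, 2)):
    assert abs(serendipity8(xi, eta).sum() - 1.0) < 1e-12
print("sum of N_i = 1 verified")
```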
[Figure: twelve-node serendipity configuration in ξη-space; corner nodes 1, 4, 9, 12 and two equally spaced nodes on each side.]
N1  = (1/32)(1 − ξ)(1 − η)[−10 + 9(ξ² + η²)]
N2  = (9/32)(1 − ξ²)(1 − η)(1 − 3ξ)
N3  = (9/32)(1 − ξ²)(1 − η)(1 + 3ξ)
N4  = (1/32)(1 + ξ)(1 − η)[−10 + 9(ξ² + η²)]
N5  = (9/32)(1 − ξ)(1 − η²)(1 − 3η)
N6  = (9/32)(1 + ξ)(1 − η²)(1 − 3η)
N7  = (9/32)(1 − ξ)(1 − η²)(1 + 3η)
N8  = (9/32)(1 + ξ)(1 − η²)(1 + 3η)
N9  = (1/32)(1 − ξ)(1 + η)[−10 + 9(ξ² + η²)]
N10 = (9/32)(1 − ξ²)(1 + η)(1 − 3ξ)
N11 = (9/32)(1 − ξ²)(1 + η)(1 + 3ξ)
N12 = (1/32)(1 + ξ)(1 + η)[−10 + 9(ξ² + η²)]
Remarks.
(1) Serendipity interpolations are obviously incomplete polynomials in ξ and
η, hence have poorer local approximation compared to the local approx-
imations based on Pascal’s rectangle.
(2) There is no particular theoretical basis for deriving them.
(3) In view of p-version hierarchical approximations [49, 50], serendipity ap-
proximations are precluded and are of no practical significance.
[Figure: (a) eight-node irregular hexahedron Ω̄(e) in xyz-space; (b) its map in a two unit cube in ξηζ-space; (c) a 27-node distorted hexahedron in xyz-space; (d) its map in ξηζ natural coordinate space in a two unit cube.]
x(ξ, η, ζ) = Σ_{i=1}^{ñ} L̃i(ξ, η, ζ) xi

y(ξ, η, ζ) = Σ_{i=1}^{ñ} L̃i(ξ, η, ζ) yi               (5.125)

z(ξ, η, ζ) = Σ_{i=1}^{ñ} L̃i(ξ, η, ζ) zi
ñ is suitably chosen for the mapping depending upon the degrees of the polynomials
in ξ, η, and ζ, and L̃i(ξ, η, ζ) are the Lagrange polynomials associated with the ñ
nodes. L̃i(ξ, η, ζ) have the following properties (similar to mapping in R²).
The importance of these properties has already been discussed for mapping
in R². The conclusions drawn for R² mapping hold here as well. Equations
(5.125) map Ω̄(e) from xyz-space to Ω̄(m), a two unit cube in ξηζ-space.
Once we know L̃i(ξ, η, ζ) ; i = 1, 2, . . . , ñ, the mapping is completely defined
by (5.125).
5.9.1.1 Construction of L̃i(ξ, η, ζ) using the Polynomial Approach
x(ξ, η, ζ) = c0 + c1 ξ + c2 η + c3 ζ + . . .
y(ξ, η, ζ) = d0 + d1 ξ + d2 η + d3 ζ + . . . (5.127)
z(ξ, η, ζ) = b0 + b1 ξ + b2 η + b3 ζ + . . .
(i) The locations of the terms are the locations of the points of the nodes
in Ω̄(m) and Ω̄(e) configuration.
(ii) The terms or monomials (and their products) are the choice we should
use in the linear combination (5.127).
[Figure: Pascal's rectangle extended to three dimensions, with increasing powers of ξ and η in the plane and increasing powers of ζ in the third direction.]

For the eight-node (trilinear) configuration the monomials are

1, ξ, η, ζ, ξη, ηζ, ζξ, ξηζ

and for the 27-node (triquadratic) configuration the monomials are all products
ξ^a η^b ζ^c with a, b, c = 0, 1, 2.
Remarks.
(1) This polynomial approach requires inverse of the coefficient matrix which
can be avoided by using tensor product approach similar to R2 .
The tensor product approach gives

L̃i(ξ, η, ζ) ; i = 1, 2, . . . , (n)(m)(q)

from the products of n one-dimensional Lagrange polynomials Lξi(ξ), m polynomials Lηj(η), and q polynomials Lζk(ζ).

[Figure 5.22: 1D Lagrange polynomials in ξ, η, and ζ.]
[Figures 5.23–5.25: tensor product construction in ξη-space using the 1D linear functions Lξ1 = (1 − ξ)/2, Lξ2 = (1 + ξ)/2, Lη1 = (1 − η)/2, Lη2 = (1 + η)/2, and Lζ1 = (1 − ζ)/2, Lζ2 = (1 + ζ)/2 in the ζ-direction; the products Lξi Lηj give the four bilinear functions in the ξη-plane.]
[Figure 5.26: eight-node configuration in ξηζ-space; each node carries the corresponding product Lξi Lηj Lζk.]
These are the desired Lagrange polynomials that are linear in ξ, η, and ζ
and correspond to the eight-node configuration in Figure 5.26.
ñ and L̃i(ξ, η, ζ) are a suitably chosen number of nodes and the Lagrange
interpolation polynomials for mapping of Ω̄(e) to Ω̄(m) in a two unit cube.
If fi are the function values at the nodes of Ω̄(e) or Ω̄(m), then these can be
interpolated using

f (e)(ξ, η, ζ) = Σ_{i=1}^{n} Li(ξ, η, ζ) fi

in which

n = 8    for linear Li(ξ, η, ζ) in ξ, η, and ζ
n = 27   for quadratic Li(ξ, η, ζ) in ξ, η, and ζ
n = 64   for cubic Li(ξ, η, ζ) in ξ, η, and ζ

and so on. Li(ξ, η, ζ) are determined using the tensor product. The functions
L̃i(ξ, η, ζ) and Li(ξ, η, ζ) are generally not the same but can be the same
if so desired. The choice of L̃i(ξ, η, ζ) depends on the geometry mapping
considerations, whereas Li(ξ, η, ζ) are chosen based on the data points to be
interpolated.
dx = (∂x/∂ξ) dξ + (∂x/∂η) dη + (∂x/∂ζ) dζ

dy = (∂y/∂ξ) dξ + (∂y/∂η) dη + (∂y/∂ζ) dζ             (5.130)

dz = (∂z/∂ξ) dξ + (∂z/∂η) dη + (∂z/∂ζ) dζ

or

{ dx }        { dξ }
{ dy } = [J]  { dη }                                  (5.131)
{ dz }        { dζ }

where

      [ ∂x/∂ξ   ∂x/∂η   ∂x/∂ζ ]
[J] = [ ∂y/∂ξ   ∂y/∂η   ∂y/∂ζ ]                       (5.132)
      [ ∂z/∂ξ   ∂z/∂η   ∂z/∂ζ ]
dx~i = (∂x/∂ξ) dξ ~eξ + (∂x/∂η) dη ~eη + (∂x/∂ζ) dζ ~eζ

dy~j = (∂y/∂ξ) dξ ~eξ + (∂y/∂η) dη ~eη + (∂y/∂ζ) dζ ~eζ        (5.133)

dz~k = (∂z/∂ξ) dξ ~eξ + (∂z/∂η) dη ~eη + (∂z/∂ζ) dζ ~eζ

We note that the elemental volume is given by the scalar triple product

dΩ = dx dy dz = dx~i · (dy~j × dz~k)                   (5.134)

Substituting for dx~i, dy~j, and dz~k from (5.133) into (5.134) and using the
properties of the dot product and cross product in ξηζ- and xyz-spaces, we
obtain:

dx dy dz = det[J] dξ dη dζ                             (5.135)

We note that for (5.135) to hold, det[J] > 0 must hold. Thus, for the mapping
between Ω̄(e) and Ω̄(m) to be valid (one-to-one and onto), det[J] > 0 must
hold. This is an important conclusion from (5.135).
we have:

∂f/∂x = Σ_{i=1}^{n} fi ∂Li(ξ, η, ζ)/∂x

∂f/∂y = Σ_{i=1}^{n} fi ∂Li(ξ, η, ζ)/∂y                 (5.137)

∂f/∂z = Σ_{i=1}^{n} fi ∂Li(ξ, η, ζ)/∂z

Thus, ∂f/∂x, ∂f/∂y, and ∂f/∂z can be determined if we know ∂Li/∂x, ∂Li/∂y,
and ∂Li/∂z.
Since Li = Li(ξ, η, ζ) and x = x(ξ, η, ζ), y = y(ξ, η, ζ), and z = z(ξ, η, ζ), we
can use the chain rule of differentiation to obtain:

∂Li/∂ξ = (∂Li/∂x)(∂x/∂ξ) + (∂Li/∂y)(∂y/∂ξ) + (∂Li/∂z)(∂z/∂ξ)

∂Li/∂η = (∂Li/∂x)(∂x/∂η) + (∂Li/∂y)(∂y/∂η) + (∂Li/∂z)(∂z/∂η)   ; i = 1, 2, . . . , n   (5.138)

∂Li/∂ζ = (∂Li/∂x)(∂x/∂ζ) + (∂Li/∂y)(∂y/∂ζ) + (∂Li/∂z)(∂z/∂ζ)

or

{ ∂Li/∂ξ }          { ∂Li/∂x }
{ ∂Li/∂η } = [J^T]  { ∂Li/∂y }    ; i = 1, 2, . . . , n          (5.139)
{ ∂Li/∂ζ }          { ∂Li/∂z }

∴ { ∂Li/∂x }              { ∂Li/∂ξ }
  { ∂Li/∂y } = [J^T]^{−1} { ∂Li/∂η }   ; i = 1, 2, . . . , n     (5.140)
  { ∂Li/∂z }              { ∂Li/∂ζ }
[Figure: (a) four-node Ω̄(e) in xy-space with node coordinates 1: (0, 0), 2: (2, 0), 3: (4, 4), 4: (0, 2); (b) its map Ω̄(ξ,η), a two unit square in ξη-space.]
Solution
(a) Equations describing the mapping: The Lagrange polynomials Li(ξ, η) ; i = 1, 2, . . . , 4 are

L1 = ((1 − ξ)/2)((1 − η)/2) ,   L2 = ((1 + ξ)/2)((1 − η)/2)
L3 = ((1 + ξ)/2)((1 + η)/2) ,   L4 = ((1 − ξ)/2)((1 + η)/2)

∴ x(ξ, η) = Σ_{i=1}^{4} Li xi ,   y(ξ, η) = Σ_{i=1}^{4} Li yi
Similarly

y(ξ, η) = ((1 − ξ)/2)((1 − η)/2)(0) + ((1 + ξ)/2)((1 − η)/2)(0)
        + ((1 + ξ)/2)((1 + η)/2)(4) + ((1 − ξ)/2)((1 + η)/2)(2)

or

y(ξ, η) = ((1 + η)/2)(3 + ξ)

and, proceeding in the same way for x,

x(ξ, η) = ((1 + ξ)/2)(3 + η)

Equations describing the mapping:

x(ξ, η) = ((1 + ξ)/2)(3 + η) ,   y(ξ, η) = ((1 + η)/2)(3 + ξ)

(b) Determination of det[J]:

det[J] = ((3 + η)/2)((3 + ξ)/2) − ((1 + ξ)/2)((1 + η)/2) = (1/4)(8 + 2ξ + 2η)
(c) Derivatives of Li with respect to x, y:

[J^T] = [ (3 + η)/2   (1 + η)/2 ]
        [ (1 + ξ)/2   (3 + ξ)/2 ]

[J^T]^{−1} = (1/det[J]) [  (3 + ξ)/2   −(1 + η)/2 ]
                        [ −(1 + ξ)/2    (3 + η)/2 ]

           = (1/((1/4)(8 + 2ξ + 2η))) [  (3 + ξ)/2   −(1 + η)/2 ]
                                      [ −(1 + ξ)/2    (3 + η)/2 ]

{ ∂Li/∂x }              { ∂Li/∂ξ }
{ ∂Li/∂y } = [J^T]^{−1} { ∂Li/∂η } ;   [J^T]^{−1} is defined above
and

∂L1/∂ξ = −(1/2)((1 − η)/2)     ∂L1/∂η = −(1/2)((1 − ξ)/2)
∂L2/∂ξ =  (1/2)((1 − η)/2)     ∂L2/∂η = −(1/2)((1 + ξ)/2)
∂L3/∂ξ =  (1/2)((1 + η)/2)     ∂L3/∂η =  (1/2)((1 + ξ)/2)
∂L4/∂ξ = −(1/2)((1 + η)/2)     ∂L4/∂η =  (1/2)((1 − ξ)/2)

Hence, ∂Li/∂x, ∂Li/∂y ; i = 1, 2, 3, 4 are completely defined.
(d) Determination of f (ξ, η):

f (ξ, η) = Σ Li(ξ, η) fi
         = ((1 − ξ)/2)((1 − η)/2)(0) + ((1 + ξ)/2)((1 − η)/2)(1)
         + ((1 + ξ)/2)((1 + η)/2)(2) + ((1 − ξ)/2)((1 + η)/2)(1)

After simplifying,

f (ξ, η) = (1/4)(4 + 2ξ + 2η)
(e) Determination of ∂f/∂x and ∂f/∂y:

∂f/∂ξ = (∂f/∂x)(∂x/∂ξ) + (∂f/∂y)(∂y/∂ξ)
∂f/∂η = (∂f/∂x)(∂x/∂η) + (∂f/∂y)(∂y/∂η)

∴ { ∂f/∂ξ }   [ ∂x/∂ξ   ∂y/∂ξ ] { ∂f/∂x }           { ∂f/∂x }
  { ∂f/∂η } = [ ∂x/∂η   ∂y/∂η ] { ∂f/∂y }  = [J^T]  { ∂f/∂y }

From f (ξ, η) = (1/4)(4 + 2ξ + 2η):

∂f/∂ξ = 1/2 ,   ∂f/∂η = 1/2

∴ { ∂f/∂x }                            [  (3 + ξ)/2   −(1 + η)/2 ] { 1/2 }
  { ∂f/∂y } = (1/((1/4)(8 + 2ξ + 2η))) [ −(1 + ξ)/2    (3 + η)/2 ] { 1/2 }

            = (1/((1/4)(8 + 2ξ + 2η))) {  (3 + ξ)/4 − (1 + η)/4 }
                                       { −(1 + ξ)/4 + (3 + η)/4 }
f (x) = C0 + C1 x + C2 x² + · · · + Cn x^n             (5.141)

We can also write (5.141) in an alternate way using the locations of the data
points.
Expanding (5.143) and collecting powers of x, if we define:

C0 = a0 − a1 x1 + a2 x1 x2
C1 = a1 − a2 x1 − a2 x2                                (5.146)
C2 = a2

then

f (x) = C0 + C1 x + C2 x²                              (5.147)
(i) If we let x = x1 , then except for a0 , all other terms become zero due
to the fact that they all contain (x − x1 ). Thus, we obtain:
f (x1 ) = a0 = f1 (5.148)
(ii) If we let x = x2 , then except the first two terms on the right side of
(5.142), all others are zero and we obtain (after substituting for a0 from
(5.148)):
∴ a1 = (f (x2) − f (x1))/(x2 − x1) = (f2 − f1)/(x2 − x1) = f [x2, x1]          (5.150)
f [x2 , x1 ] is a convenient notation. It is called first divided difference
between the points x1 and x2 . Thus, a1 is determined.
(iii) If we let x = x3, then except for the first three terms on the right side of
(5.142), all others are zero and we obtain:

f (x3) = f (x1) + ((f (x2) − f (x1))/(x2 − x1))(x3 − x1) + a2 (x3 − x1)(x3 − x2)        (5.152)

Solving for a2:

a2 = (f [x3, x2] − f [x2, x1])/(x3 − x1) = f [x3, x2, x1]                               (5.154)
a0 = f (x1 )
a1 = f [x2 , x1 ]
a2 = f [x3 , x2 , x1 ] (5.155)
..
.
an = f [xn+1 , xn , . . . , x1 ]
in which the f values in the square brackets are the divided differences
defined by:

f [x2, x1] = (f (x2) − f (x1))/(x2 − x1)                                 ; first divided difference

f [x3, x2, x1] = (f [x3, x2] − f [x2, x1])/(x3 − x1)                     ; second divided difference

f [x4, x3, x2, x1] = (f [x4, x3, x2] − f [x3, x2, x1])/(x4 − x1)         ; third divided difference

  ⋮

f [x_{n+1}, x_n, . . . , x1] = (f [x_{n+1}, x_n, . . . , x2] − f [x_n, x_{n−1}, . . . , x1])/(x_{n+1} − x1)
                                                                         (5.156)
f (x) = a0 + a1 (x − x1) + a2 (x − x1)(x − x2) + a3 (x − x1)(x − x2)(x − x3)            (5.157)
where
f (xi ) = fi ; i = 1, 2, . . . , 4
and
a0 = f (x1 )
(5.158)
a1 = f [x2 , x1 ]
a2 = f [x3 , x2 , x1 ]
a3 = f [x4 , x3 , x2 , x1 ]
i      1    2    3    4
xi     1    2    3    4
fi     0   10    0   −5
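The divided-difference table for this data, and the evaluation of (5.157), can be computed with a short sketch (Python with NumPy; the printed table in the original is only partially legible, so the code reproduces it rather than quoting it).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
f = np.array([0.0, 10.0, 0.0, -5.0])

def divided_differences(x, f):
    """Newton coefficients a0, a1, ..., built from eq. (5.156)."""
    n = len(x)
    a = f.astype(float).copy()
    for j in range(1, n):
        a[j:] = (a[j:] - a[j-1:-1]) / (x[j:] - x[:n-j])
    return a                      # a[k] = f[x_{k+1}, ..., x_1], cf. (5.155)

def newton_eval(x, a, t):
    """Evaluate the Newton form (5.157) at t by nested multiplication."""
    p = a[-1]
    for k in range(len(a) - 2, -1, -1):
        p = a[k] + (t - x[k]) * p
    return p

a = divided_differences(x, f)
print(a)                                      # a0, a1, a2, a3
print([newton_eval(x, a, t) for t in x])      # reproduces the data values
```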
x2 = x1 + h
x3 = x1 + 2h
  ⋮                                                    (5.159)
xn = x1 + (n − 1)h
f ′(x1) = (f (x2) − f (x1))/h = f [x2, x1]

f ″(x1)/2! = (f (x3) − 2f (x2) + f (x1))/(2h²) = f [x3, x2, x1]            (5.162)

and so on. Hence, the Newton interpolating polynomial can be written as

f (x) = f (x1) + f ′(x1)(x − x1) + (f ″(x1)/2!)(x − x1)(x − x2) + . . .     (5.164)

Equation (5.164) is an important form of Newton's interpolating polynomial.
If we let

(x − x1)/h = α    ∴ x − x1 = hα

then

x − x2 = x − (x1 + h) = x − x1 − h = αh − h = h(α − 1)
x − x3 = x − (x2 + h) = x − (x1 + 2h) = x − x1 − 2h                        (5.165)
       = αh − 2h = h(α − 2)
f (x) = f (x1) + f ′(x1) hα + (f ″(x1)/2!) h² α(α − 1) + · · · + (f ⁽ⁿ⁾(x1)/n!) hⁿ α(α − 1) . . . (α − (n − 1)) + Rn
                                                                            (5.166)

Rn = (f ⁽ⁿ⁺¹⁾(ξ)/(n + 1)!) hⁿ⁺¹ α(α − 1)(α − 2) . . . (α − n)   ; remainder  (5.167)
Remarks.
Problems
5.1 Consider the following table of data
i 1 2 3
xi 0 3 6
fi f1 f2 f3
We can express
3
X
f (x) = Lk (x)fk (1)
k=1
i 1 2 3 4 5
xi 0 1 2 3 4
fi 1 3 7 13 21
i 1 2 3
xi -1.0 -0.5 1.0
fi f1 f2 f3
i 1 2 3
xi 0 2 4
fi 1 5 17
Determine f (3) using Lagrange interpolating polynomial for the data in the
table.
i 1 2 3 4
xi 3 4 2.5 5
fi 8 2 7 1
i 1 2 3 4 5
xi 0 1 2 3 4
fi 0 2 6 12 20
5.7 Consider a two node configuration Ω̄(e) in R1 shown in Figure (a) with
coordinates. Figure (b) shows its map Ω̄(ξ) in the natural coordinate space
ξ.
[Figure: (a) a two-node domain Ω̄(e) with x = 1 and x = 4; (b) its map Ω̄(ξ) in the natural coordinate space ξ.]
(c) If f1 and f2 are function values at nodes 1 and 2 of Figure (a), then
establish interpolation f (ξ) in the natural coordinate space ξ i.e.
Ω̄(ξ) .
df (ξ)
(d) Derive an expression for dx using the interpolation f (ξ) derived
in (c).
(e) Using f1 = 10 and f2 = 20 calculate values of f at x = 1.75 and
x = 3.25 using the interpolation f (ξ) in (c).
df
(f) Also calculate dx at x = 1.75 and x = 3.25.
1 2 3
1 2 3 ξ
x −1 0 1
x=1 x = 2.5 x = 4 2
(a) A three node configuration Ω̄(e) (b) Map Ω̄(ξ) of Ω̄(e)
η
y
1 2 3
1 2 3 ξ
x −1 0 1
x = 1 x = 1.75 x=4 2
(a) A three node configuration Ω̄(e) (b) Map Ω̄(ξ) of Ω̄(e)
1 2 3
1 2 3 ξ
x −1 0 1
x=1 x = 3.25 x = 4 2
(a) A three node configuration Ω̄(e) (b) Map Ω̄(ξ) of Ω̄(e)
1 2 3 1 2 3
x ξ
x1 x2 x3 ξ=0 ξ=1 ξ=2
(a) A three node configuration Ω̄(e) (b) Map Ω̄(ξ) of Ω̄(e)
y η
3 (0, 2) (2, 2)
4 3
(p, q)
4
(0, 2)
1 2 1 2
x ξ
(0, 0) (2, 0) (0, 0) (2, 0)
(a) Ω̄(e) in x, y space (b) Map Ω̄(ξ,η) of Ω̄(e)
The coordinates of the nodes are also given in the two spaces in Figures (a)
and (b).
(a) Determine the equations describing the mapping of points in xy and
ξη spaces for Ω̄(e) and Ω̄(ξη) i.e. determine x = x(ξ, η), y = y(ξ, η).
Simplify the expressions till no further simplification is possible.
(b) Determine the relationship between p and q (the Cartesian coordi-
nates of node 3) for their admissibility in the geometric description
of the geometry Ω̄(e) in the xy space. Simplify the final expression
or equation.
5.13 Consider a four node quadrilateral bilinear geometry Ω̄(e) in R2 shown
in Figure (a).
3
B
(3, 3)
4
(0, 2)
A
1 2
x
(0, 0) (2, 0)
(a) A four node Ω̄(e) in R2
6 4
5 η
(0, 1) (1, 1)
6 5 4
y
1 2
3
1 2 3
ξ
x (0, 0) (1, 0)
(a) Ω̄(e) in R2 (b) Map Ω̄(ξ,η) of Ω̄(e)
5.15 Consider a four node bilinear Ω̄(e) in R2 shown in Figure (a). Its map
Ω̄(ξη) in ξη space is shown in Figure (b).
y η
4
(0, 4) 3
(0, 1) (1, 1)
(s, t) 4 3
4
1 2
x
(0, 0) (2, 0)
1 2
ξ
2
(0, 0) (1, 0)
(a) Ω̄(e) in R2 (b) Map Ω̄(ξ,η) of Ω̄(e)
5.16 Consider a six node para-linear Ω̄(e) in R2 shown in Figure (a). Its
map Ω̄(ξη) in ξη space is shown in Figure (b).
y
η
6 5 6
6 (1.5, 2.25) (3, 3)
(0, 3)
5 4
2 ξ
(1.5, 0.75)
3 1 2 3
1
x 2 (3, 0)
(0, 0) 2
(e) 2
(a) Ω̄ in R (b) Map Ω̄(ξ,η) of Ω̄(e)
y
√
5
3
B
A
4
√
5 3
4
2.25
2
1 2
x
5.18 Consider two-dimensional Ω̄(e) in R2 shown in Figure (a), (b), and (c).
The Cartesian coordinates of the nodes are given. The domains Ω̄(e) are
mapped into ξη-space into a two-unit square.
[Figure 1: the domains Ω̄(e) in R² for Problem 5.18, parts (a), (b), and (c), with the node coordinates and dimensions as indicated in the original figures.]
Provide program listing, results, tables, and plots along with a write-up on
the equations used as part of the report. Also provide a discussion of your
results.
6
Numerical Integration or
Quadrature
6.1 Introduction
In many situations, due to the complexity of integrands and irregular-
ity of the domain in definite integrals it becomes necessary to approximate
the value of the integral. The numerical integration methods or quadra-
ture methods are methods of obtaining approximate values of the definite
integrals. Many simple numerical integration methods are derived using the
simple fact that if we wish to calculate the integral of f (x) between the limits
x = A to x = B, i.e.,
I = ∫_A^B f (x) dx                                    (6.1)

[Figure: the shaded area under the curve f (x) between x = A and x = B is the integral I.]
methods are based on approximating the actual area under the curve f (x)
versus x between x = A and x = B.
Numerical integration methods such as the trapezoid rule, Simpson's 1/3
and 3/8 rules, Newton-Cotes integration, Richardson's extrapolation, and
the Romberg method presented in the following sections are all methods of
approximation. These methods are only effective in R1. Gauss quadrature,
presented later in this chapter, is equally effective in R1, R2, and R3.
(1) In the first category of methods the integration interval [A, B] is divided
into subintervals (may be considered of equal width for convenience).
The integration methods are developed for calculating the approximate
value of the integral for a subinterval. The sum of the approximated
integral values for each subinterval then yields the approximate value of
the integral over the entire interval of integration [A, B]. We consider
two methods:
in which

Ii = ∫_{ai}^{bi} f (x) dx                             (6.3)

f (x) ≈ f̃ (x) = f (ai) + ((f (bi) − f (ai))/(bi − ai)) (x − ai)             (6.4)

Ii ≈ Ĩi = ∫_{ai}^{bi} f̃ (x) dx                        (6.5)

Ii ≈ ∫_{ai}^{bi} ( f (ai) + ((f (bi) − f (ai))/(bi − ai)) (x − ai) ) dx     (6.6)

or

Ii ≈ ((f (ai) + f (bi))/2) (bi − ai)                  (6.7)
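A composite implementation of (6.7) is a few lines of code; the sketch below (Python with NumPy; function names are illustrative) uses the integrand f (x) = (sin x)² eˣ of Examples 6.1–6.3, with the limits [0, 2] inferred from the subinterval widths and totals in Tables 6.1–6.5.

```python
import numpy as np

def trapezoid(f, A, B, n):
    """Composite trapezoid rule, eq. (6.7), with n uniform subintervals."""
    edges = np.linspace(A, B, n + 1)
    a, b = edges[:-1], edges[1:]
    return np.sum((f(a) + f(b)) / 2.0 * (b - a))

f = lambda x: np.sin(x)**2 * np.exp(x)
for n in (1, 2, 4, 8, 16):
    print(n, trapezoid(f, 0.0, 2.0, n))
```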
[Figure: linear approximation f̃ (x) of f (x) over a subinterval [ai, bi]; the trapezoid area approximates Ii.]
6.2.2 Simpson's 1/3 Rule

Consider Ii for a subinterval [ai, bi].

Ii = ∫_{ai}^{bi} f (x) dx                             (6.8)
We calculate f (ai), f ((ai + bi)/2), and f (bi) using f (x), the integrand in (6.8).
Using (x1, f (x1)), (x2, f (x2)), and (x3, f (x3)), we establish a quadratic in-
terpolating polynomial f̃ (x) (say, using Lagrange polynomials) that is con-
sidered to approximate f (x) ∀x ∈ [ai, bi].

f̃ (x) = ((x − x2)(x − x3))/((x1 − x2)(x1 − x3)) f (x1) + ((x − x1)(x − x3))/((x2 − x1)(x2 − x3)) f (x2)
      + ((x − x1)(x − x2))/((x3 − x1)(x3 − x2)) f (x3)                      (6.10)

Ii ≈ Ĩi = ∫_{ai}^{bi} f̃ (x) dx                        (6.11)
This is called Simpson’s 13 Rule. Figure 6.3 shows fe(x) and the true f (x)
for the subinterval [ai , bi ]. We note that (6.12) can also be written as:
Ĩi = (bi − ai) (f (x1) + 4f (x2) + f (x3))/6 = (bi − ai) Hi                 (6.13)

i.e., the width (bi − ai) times the average height Hi.
Figure 6.3: f (x) and f̃ (x), quadratic approximation of f (x) for the subinterval [ai, bi]
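The corresponding composite form of (6.13) is sketched below (Python with NumPy; same assumed limits [0, 2] and integrand as in the trapezoid sketch above).

```python
import numpy as np

def simpson_13(f, A, B, n):
    """Composite Simpson's 1/3 rule, eq. (6.13), with n uniform subintervals."""
    edges = np.linspace(A, B, n + 1)
    a, b = edges[:-1], edges[1:]
    m = (a + b) / 2.0
    return np.sum((b - a) * (f(a) + 4.0 * f(m) + f(b)) / 6.0)

f = lambda x: np.sin(x)**2 * np.exp(x)
for n in (1, 2, 4, 8, 16):
    print(n, simpson_13(f, 0.0, 2.0, n))
```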
6.2.3 Simpson's 3/8 Rule

Consider Ii for a subinterval [ai, bi]. We divide the subinterval into three
equal parts and define the coordinates as:

x1 = ai ;   x2 = ai + (bi − ai)/3 ;   x3 = ai + 2(bi − ai)/3 ;   x4 = bi     (6.14)

We calculate f (x1), f (x2), f (x3), and f (x4) using f (x) in the integrand of Ii.

Ii = ∫_{ai}^{bi} f (x) dx                             (6.15)

Ii ≈ Ĩi = (3/8) hi (f (x1) + 3f (x2) + 3f (x3) + f (x4)) ;   hi = (bi − ai)/3         (6.18)
Figure 6.4: f (x) and f̃ (x), cubic approximation of f (x) for the interval [ai, bi]
We note that f̃ (x) can be quite different from f (x). From (6.19), we
can interpret Simpson's 3/8 rule as the area of a rectangle with base (bi − ai)
and height Hi (given by (6.19)). We calculate Ĩi for each subinterval and use
Ĩ = Σ_{i=1}^{n} Ĩi to obtain the approximate value of the integral (6.1).
Remarks.
(1) As in the other methods discussed, here also the accuracy of the method
is dependent on the size of the subinterval. The smaller the subinterval,
the better the accuracy of the approximated value of the integral I.
(2) It can be shown that the truncation error in calculating Ii using (6.19)
is of the order of O(h6 ) (proof omitted).
(3) This definition of hi is different than used in Sections 6.2.1 and 6.2.2.
In each of the three methods, we calculate I for each subinterval [ai , bi ] and
sum them to obtain I.
Example 6.1 (Trapezoid Rule). Consider (6.20) with the integrand f (x) = (sin x)² eˣ. For each subinterval,
Ii ≈ ((bi − ai)/2)(f (ai) + f (bi)), where f (ai) and f (bi) are calculated using f (x) = (sin x)² eˣ.
Table 6.1: Results of trapezoid rule for (6.20) using one subinterval
subintervals = 1
bi − ai = 0.200000E+01
i ai bi Ii
TOTAL 0.610943E+01
Table 6.2: Results of trapezoid rule for (6.20) using two subintervals
subintervals = 2
bi − ai = 0.100000+01
i ai bi Ii
TOTAL 0.497946E+01
Table 6.3: Results of trapezoid rule for (6.20) using four subintervals
subintervals = 4
bi − ai = 0.500000+00
i ai bi Ii
TOTAL 0.490884E+01
Table 6.4: Results of trapezoid rule for (6.20) using eight subintervals
subintervals = 8
bi − ai = 0.250000+00
i ai bi Ii
TOTAL 0.489874E+01
i ai bi Ii
TOTAL 0.489660E+01
Example 6.2 (Simpson's 1/3 Rule). For each subinterval,
Ii ≈ ((bi − ai)/6)(f (x1) + 4f (x2) + f (x3)), where f (x1), f (x2), and f (x3) are calculated using f (x) = (sin x)² eˣ.
Table 6.6: Results of Simpson's 1/3 rule for (6.20) using one subinterval
subintervals = 1
bi − ai = 0.200000E+01
i ai bi Ii
TOTAL 0.460280E+01
Table 6.7: Results of Simpson's 1/3 rule for (6.20) using two subintervals
subintervals = 2
bi − ai = 0.100000+01
i ai bi Ii
TOTAL 0.488530E+01
Table 6.8: Results of Simpson's 1/3 rule for (6.20) using four subintervals
subintervals = 4
bi − ai = 0.500000+00
i ai bi Ii
TOTAL 0.489538E+01
Table 6.9: Results of Simpson's 1/3 rule for (6.20) using eight subintervals
subintervals = 8
bi − ai = 0.250000+00
i ai bi Ii
TOTAL 0.489589E+01
Table 6.10: Results of Simpson's 1/3 rule for (6.20) using 16 subintervals
subintervals = 16
bi − ai = 0.250000+00
i ai bi Ii
TOTAL 0.489591E+01
Example 6.3 (Simpson's 3/8 Rule). For each subinterval,
Ii ≈ ((bi − ai)/8)(f (x1) + 3f (x2) + 3f (x3) + f (x4)), where f (x1), . . . , f (x4) are calculated using f (x) = (sin x)² eˣ.
Table 6.11: Results of Simpson's 3/8 rule for (6.20) using one subinterval
subintervals = 1
bi − ai = 0.200000E+01
i ai bi Ii
TOTAL 0.477375E+01
Table 6.12: Results of Simpson's 3/8 rule for (6.20) using two subintervals
subintervals = 2
bi − ai = 0.100000+01
i ai bi Ii
TOTAL 0.488530E+01
Table 6.13: Results of Simpson's 3/8 rule for (6.20) using four subintervals
subintervals = 4
bi − ai = 0.500000+00
i ai bi Ii
TOTAL 0.489568E+01
Table 6.14: Results of Simpson's 3/8 rule for (6.20) using eight subintervals
subintervals = 8
bi − ai = 0.250000+00
i ai bi Ii
TOTAL 0.489591E+01
Table 6.15: Results of Simpson's 3/8 rule for (6.20) using 16 subintervals
subintervals = 16
bi − ai = 0.250000+00
i ai bi Ii
TOTAL 0.489592E+01
The results of Examples 6.1 – 6.3 are summarized in Tables 6.1 – 6.15. In these studies all subintervals are
of uniform width; however, use of non-uniform width subintervals presents
no problem. In this case one needs to be careful to establish ai and bi based
on the subinterval widths for evaluating Ii for the subinterval [ai, bi].
In each method, as the number of subintervals is increased, the accuracy
of the value of the integral improves. For the same number of subintervals,
Simpson's 1/3 method produces integral values with better accuracy (error
O(h⁴)) compared to the trapezoid rule (error O(h²)), and Simpson's 3/8 rule (error
O(h⁶)) is more accurate than Simpson's 1/3 rule. In Simpson's 3/8 rule the
integral values for 8 and 16 subintervals are accurate up to four decimal
places.
I ≈ I_{h1} + C h1^N                                   (6.21)

I ≈ I_{h2} + C h2^N                                   (6.22)

where I_{h1} is the value of the integral I using subinterval size h1 and I_{h2} is
the value of the integral I using subinterval size h2. N depends upon the
polynomial approximation used in approximating the actual f (x). C h1^N is the
error in I_{h1} and C h2^N is the error in I_{h2}. The expressions (6.21) and (6.22)
are based on several assumptions:
(i) The constant C is not the same in (6.21) and (6.22), but we assume it
to be.
(ii) Since Ih2 is based on h2 < h1 , Ih2 is more accurate than Ih1 and hence
we expect I in (6.22) to have better accuracy than I in (6.21).
First, assuming I to be the same in (6.21) and (6.22), we can solve for C:

C ≈ (I_{h1} − I_{h2})/(h2^N − h1^N)                   (6.23)

or

I ≈ I_{h2} + (I_{h2} − I_{h1})/((h1/h2)^N − 1) = ((h1/h2)^N I_{h2} − I_{h1})/((h1/h2)^N − 1)          (6.25)
Value of N :
1. In trapezoid rule ; N =2
2. In Simpson’s 13 rule ; N =4
3. In Simpson’s 38 rule ; N = 6 and so on
Remarks.
(1) Use of (6.25) requires Ih1 for subinterval width h1 and Ih2 for subin-
terval width h2 < h1 . The integral value I in (6.25) is an improved
approximation of the integral.
(2) Thus, we can view the truncation error in the numerical integration
process when the integrand f (x) is approximated by a polynomial in a
subinterval to be of the following form, a series in h:
Et = C1 h2 + C2 h4 + C3 h6 + . . . (6.26)
(3) In Richardson’s extrapolation if Ih1 and Ih2 are obtained using trapezoid
rule, then N = 2 and by using (6.25), we eliminate errors O(h2 ).
(4) On the other hand if Ih1 and Ih2 are obtained using Simpson’s 13 rule,
then N = 4 and by using (6.25), we eliminate errors of the order of
O(h4 ) and so on.
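Equation (6.25) is a one-line computation; a minimal sketch follows (Python; the numerical inputs are the trapezoid-rule totals from Tables 6.1 and 6.2).

```python
def richardson(I_h1, I_h2, h1, h2, N):
    """Improved integral value from two approximations, eq. (6.25)."""
    r = (h1 / h2) ** N
    return I_h2 + (I_h2 - I_h1) / (r - 1.0)

# Trapezoid-rule totals from Tables 6.1 and 6.2 (h1 = 2, h2 = 1, N = 2):
print(richardson(6.10943, 4.97946, 2.0, 1.0, 2))   # about 4.60280, cf. Table 6.6
```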
I = ∫_A^B f (x) dx                                    (6.27)

Let us consider the trapezoid rule and calculate numerical values of the
integral I using one, two, three, etc. uniform subintervals. Then all these
integral values have truncation error of the order of O(h2 ), shown in Table
6.16 below.
We use the values of the integral in column two, which contain errors O(h²), in
Richardson's extrapolation to eliminate the O(h²) errors. In doing so we use

I ≈ I_{h2} + (I_{h2} − I_{h1})/((h1/h2)^N − 1)        (6.28)
and
f (x) = 0.2 + 25x − 200x2 + 675x3 − 900x4 + 400x5
Consider the trapezoid rule with 1, 2, 4, and 8 subintervals of uniform width
for calculating the numerical value of the integral. These numerical values
of the integral contain truncation errors O(h²). The numerical values of this
integral I are listed in Table 6.17 in column two.
We use this with values of the integral in column 2 to calculate the integral
values in column three of Table 6.17, which contain leading truncation error
O(h4 ). With the integral values in column three we use:
to obtain integral values in column four of table 6.17, which contain leading
truncation error O(h6 ). Using integral values in column four and
we obtain the integral value in column five that contains leading truncation
error O(h8 ). This is the final most accurate value of the integral I based on
Romberg method employing Richardson’s extrapolation.
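The repeated extrapolation can be organized as a small triangular table; the sketch below (Python with NumPy) uses the polynomial integrand of this example. The integration limits are not legible in this excerpt, so [0, 0.8] is used purely as an illustrative placeholder.

```python
import numpy as np

def trapezoid(f, A, B, n):
    edges = np.linspace(A, B, n + 1)
    return np.sum((f(edges[:-1]) + f(edges[1:])) / 2.0 * np.diff(edges))

def romberg(f, A, B, levels=4):
    """Romberg table: repeated Richardson extrapolation of trapezoid values."""
    T = [[trapezoid(f, A, B, 2**k)] for k in range(levels)]
    for j in range(1, levels):
        for k in range(j, levels):
            r = 4.0**j
            T[k].append((r * T[k][j-1] - T[k-1][j-1]) / (r - 1.0))
    return T[-1][-1]          # eliminates errors O(h^2), O(h^4), O(h^6)

f = lambda x: 0.2 + 25*x - 200*x**2 + 675*x**3 - 900*x**4 + 400*x**5
print(romberg(f, 0.0, 0.8))   # limits assumed for illustration only
```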
Remarks.
(1) Numerical integration methods such as the Newton-Cotes methods dis-
cussed in R1 are difficult to extend to integrals in R2 and R3 .
(2) Richardson's extrapolation and the Romberg method are specifically de-
signed to improve the accuracy of the numerically calculated values of
the integrals from the trapezoid rule, Simpson's 1/3 method, Simpson's 3/8
method, and in general Newton-Cotes methods. Thus, their extensions
to R2 and R3 are not possible either.
(3) In the next section we present Gauss quadrature that overcomes these
shortcomings.
f (ξ) = C0 + C1 ξ + C2 ξ 2 + C3 ξ 3 + C4 ξ 4 + C5 ξ 5 (6.46)
Let N , the degree of the polynomial f (ξ), be such that the n-point quadra-
ture rule integrates it exactly. Then, we can write:
I = Σ_{i=1}^{n} wi f (ξi)                             (6.48)
(2) Values of wi and ξi for various values of n are generally tabulated (see
Table 6.18). Since the locations of the quadrature points in the interval
[−1, 1] are symmetric about ξ = 0, the values of ξi in the table are listed
only in the interval [0, 1]. For example, for n = 3 we have −ξ1, ξ2 (= 0),
and ξ1, thus only ξ1 and ξ2 need to be listed as given in Table 6.18. For
n = 4, we have −ξ1, −ξ2, ξ2, and ξ1, thus only ξ1 and ξ2 need to be listed
in Table 6.18. The weight factors for ±ξi are the same, i.e., wi applies
to +ξi as well as −ξi. The values of wi and ξi are listed up to fifteen
decimal places.
(wi, ξi) ; i = 1, 2, . . . , n

(iv) Then

I = Σ_{i=1}^{n} wi f (ξi)                             (6.51)
Table 6.18: Sampling points and weight factors for Gauss quadrature for integration
limits [−1, 1]
I = ∫_{−1}^{+1} f (x) dx = Σ_{i=1}^{n} Wi f (xi)

±xi                          Wi
n=1
0 2.00000 00000 00000
n=2
0.57735 02691 89626 1.00000 00000 00000
n=3
0.77459 66692 41483 0.55555 55555 55556
0.00000 00000 00000 0.88888 88888 88889
n=4
0.86113 63115 94053 0.34785 48451 37454
0.33998 10435 84856 0.65214 51548 62546
n=5
0.90617 98459 38664 0.23692 68850 56189
0.53846 93101 05683 0.47862 86704 99366
0.00000 00000 00000 0.56888 88888 88889
n=6
0.93246 95142 03152 0.17132 44923 79170
0.66120 93864 66265 0.36076 15730 48139
0.23861 91860 83197 0.46791 39345 72691
n=7
0.94910 79123 42759 0.12948 49661 68870
0.74153 11855 99394 0.27970 53914 89277
0.40584 51513 77397 0.38183 00505 05119
0.00000 00000 00000 0.41795 91836 73469
n=8
0.96028 98564 97536 0.10122 85362 90376
0.79666 64774 13627 0.22238 10344 53374
0.52553 24099 16329 0.31370 66458 77887
0.18343 46424 95650 0.36268 37833 78362
n=9
0.96816 02395 07626 0.08127 43883 61574
0.83603 11073 26636 0.18064 81606 94857
0.61337 14327 00590 0.26061 06964 02935
0.32425 34234 03809 0.31234 70770 40003
0.00000 00000 00000 0.33023 93550 01260
n = 10
0.97390 65285 17172 0.06667 13443 08688
0.86506 33666 88985 0.14945 13491 50581
0.67940 95682 99024 0.21908 63625 15982
0.43339 53941 29247 0.26926 67193 09996
0.14887 43389 81631 0.29552 42247 14753
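In practice the quadrature points and weights of Table 6.18 can also be generated numerically; the sketch below (Python with NumPy, using the library routine numpy.polynomial.legendre.leggauss) applies (6.51) to the degree-5 polynomial of Problem 6.3, for which n = 3 points suffice.

```python
import numpy as np

def gauss_legendre(f, n):
    """n-point Gauss quadrature of f over [-1, 1], eq. (6.51)."""
    xi, w = np.polynomial.legendre.leggauss(n)   # points and weights, cf. Table 6.18
    return np.sum(w * f(xi))

f = lambda x: 10 + 5*x**2 + 2.5*x**3 + 1.25*x**4 + 0.625*x**5
print(gauss_legendre(f, 3))   # = 143/6, the exact value of the integral over [-1, 1]
```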
we find that integration variables are x and ξ in (6.52) and (6.53), but
that makes no difference. Secondly, the limits of integration in (6.52) (the
integral we want to evaluate) are [a, b], whereas in (6.53) they are [−1, 1].
By performing a change of variable from ξ to x in (6.53), we obtain (6.52).
We proceed as follows
(i) Determine the highest degree of the polynomial f (x) in (6.52), say N;
then the minimum number of quadrature points n is determined using
n = (N + 1)/2 (rounded to the next highest integer).

(ii) From Table 6.18 determine wi ; i = 1, 2, . . . , n and ξi ; i = 1, 2, . . . , n
for the integration interval [−1, 1].

(iii) Transform (wi, ξi) ; i = 1, 2, . . . , n to the integration interval [a, b] in
(6.52) using:

xi = (a + b)/2 + ((b − a)/2) ξi ;   wi^x = ((b − a)/2) wi ;   i = 1, 2, . . . , n        (6.54)
(iv) Now using the weight factors wi^x ; i = 1, 2, . . . , n and quadrature points
xi ; i = 1, 2, . . . , n for the integration interval [a, b], we can integrate
f (x) in (6.52):

I = Σ_{i=1}^{n} wi^x f (xi)                           (6.55)

This is the exact numerical value of the integral (6.52) using Gauss quadrature.
I = ∫_c^d ∫_a^b f (x, y) dx dy = ∫_c^d ( ∫_a^b f (x, y) dx ) dy              (6.61)

nx = (Nx + 1)/2 ;   ny = (Ny + 1)/2     (rounded to the next higher integers)           (6.62)

Now using (6.63) in (6.61), first we integrate with respect to x using Gauss
quadrature, holding y constant:

I = ∫_c^d ( Σ_{i=1}^{nx} wi^x f (xi, y) ) dy                                 (6.64)

I = Σ_{j=1}^{ny} wj^y ( Σ_{i=1}^{nx} wi^x f (xi, yj) )                       (6.65)

This is the exact numerical value of the integral (6.61) obtained using Gauss
quadrature.
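A tensor-product implementation of (6.65) is sketched below (Python with NumPy; function names are illustrative). The check uses the integrand xy + x²y² of Example 6.7 over [−1, 1] × [−1, 1], for which the exact value is 4/9.

```python
import numpy as np

def gauss_2d(f, a, b, c, d, nx, ny):
    """Tensor-product Gauss quadrature of f(x, y) over [a, b] x [c, d], eq. (6.65)."""
    xi, wx = np.polynomial.legendre.leggauss(nx)
    eta, wy = np.polynomial.legendre.leggauss(ny)
    x = (a + b) / 2.0 + (b - a) / 2.0 * xi          # transform (6.54) in x
    y = (c + d) / 2.0 + (d - c) / 2.0 * eta         # transform (6.54) in y
    wx = (b - a) / 2.0 * wx
    wy = (d - c) / 2.0 * wy
    return sum(wy[j] * sum(wx[i] * f(x[i], y[j]) for i in range(nx)) for j in range(ny))

f = lambda x, y: x * y + x**2 * y**2
print(gauss_2d(f, -1.0, 1.0, -1.0, 1.0, 2, 2))      # = 4/9, cf. Example 6.7
```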
or

I = ∫_{−1}^{1} ∫_{−1}^{1} ∫_{−1}^{1} f (ξ, η, ζ) dξ dη dζ                    (6.67)

nξ = (Nξ + 1)/2 ,   nη = (Nη + 1)/2 ,   nζ = (Nζ + 1)/2
(rounded to the next higher integers)                                        (6.68)

I = ∫_{−1}^{1} Σ_{j=1}^{nη} wj^η ( Σ_{i=1}^{nξ} wi^ξ f (ξi, ηj, ζ) ) dζ      (6.70)

This is the exact numerical value of the integral using Gauss quadrature.
I = ∫_e^f ∫_c^d ∫_a^b f (x, y, z) dx dy dz                                   (6.72)

or

I = ∫_e^f ∫_c^d ( ∫_a^b f (x, y, z) dx ) dy dz                               (6.73)

Let Nx, Ny, and Nz be the highest degrees of the polynomial f (x, y, z) in
x, y, and z; then nx, ny, and nz, the minimum numbers of quadrature points,
are given by

nx = (Nx + 1)/2 ,   ny = (Ny + 1)/2 ,   nz = (Nz + 1)/2
(rounded up to the next higher integer)                                      (6.74)
Now using (6.73) and (6.75) we can integrate f (x, y, z) with respect to x, y,
and z using Gauss quadrature:

I = Σ_{k=1}^{nz} wk^z ( Σ_{j=1}^{ny} wj^y ( Σ_{i=1}^{nx} wi^x f (xi, yj, zk) ) )        (6.76)
Remarks.
or
I = 1.9333334000
This value agrees with the theoretical value of I up to six decimal places.
We could check that if we use n = 3 (one order higher than minimum
quadrature rule) the value of the integral remains unaffected up to six deci-
mal places. Details are given in the following.
I = Σ_{i=1}^{3} wi f (xi)

From Table 6.18, for n = 3, we have:

x1 = −0.7745966910 ;   w1 = 0.5555555820
x2 =  0.0          ;   w2 = 0.8888888960
x3 =  0.7745966910 ;   w3 = 0.5555555820

Using these values of wi, xi ; i = 1, 2, 3 and I = Σ_{i=1}^{3} wi f (xi), we obtain:

I = 1.9333334000
which agrees with the integral value calculated using n = 2 up to all com-
puted decimal places. Thus, using n = 3, the integral value neither improved
nor deteriorated.
or I = 46.212463400
This value is accurate up to six decimal places when compared with the
theoretical value of I.
As in Example 6.5, here also if we use n = 3 instead of n = 2 (minimum)
and repeat the integration process, we find that I remains unaffected.
Determine the quadrature point location and weight factors in x and y using
Table 6.18 for the interval [−1, 1].
I = 0.4444443880
This value agrees with theoretical values up to at least seven decimal places.
It can be verified that using Gauss quadrature higher than (nx × ny ) =
(2 × 2), the value I of the integral remains unaffected.
I= (xy + x2 y 2 ) dx dy (6.84)
0.31 0.11
In this example the integrand is the same as in Example 6.7, but the limits
of integration are not [−1, 1] in x and y as in Example 6.7. In this case also
N x = 2, N y = 2, hence:
nx = (Nx + 1)/2 = (2 + 1)/2 = 3/2  ;   nx = 2
ny = (Ny + 1)/2 = (2 + 1)/2 = 3/2  ;   ny = 2
Determine the quadrature points and the weight function factors in ξ and η
for the integration interval [−1, 1] using Table 6.18.
I = 0.5034378170
This value agrees with the theoretical value up to at least seven decimal
places.
2 0.577189866000E−02
3 0.686041173000E−02
4 0.687233079000E−02
5 0.687239133000E−02
6 0.687239319000E−02
7 0.687239412000E−02
Problems
6.1 Calculate the value of the integral
I = ∫_1^2 ( x + 1/x )² dx
numerically.
6.2 Use Romberg method to evaluate the following integral with accuracy
of the order of O(h8 )
I = ∫_0^3 x e^{2x} dx
Hint: Use trapezoidal rule with 1, 2, 4 and 8 steps then apply Romberg
method.
6.3 Use lowest order Gauss quadrature to obtain exact value of the following
integral.
I = ∫_{−1}^{1} ( 10 + 5x² + 2.5x³ + 1.25x⁴ + 0.625x⁵ ) dx
6.4 Use lowest order Gauss quadrature to obtain exact value of the following
integral.
I = ∫_1^2 ( 4x² + 2x⁴ + x⁶ ) dx
Provide details of the sampling point locations and the weight functions.
6.5 Use two, three and four point Gauss quadrature to evaluate the following
integral.
I = ∫_0^{1/2} sin(πx) dx
Will the value of the integral improve with 5, 6 and higher order quadrature
and why? Can you determine the order of the Gauss quadrature that will
yield exact value of the integral I, explain.
I = ∫_c^d ( ∫_a^b f (x, y) dx ) dy
f (x, y) and the limits [a, b] and [c, d] are given in the following. Use lowest
order Gauss quadrature in each case.
(a)
(b)
f (x, y) = 1 + x + y + xy + x2 + y 2 + x3 + x2 y + xy 2 + y 3
[a, b] = [−1, 1] ; [c, d] = [1, 2]
(c)
f (x, y) = x2 y 2 exy
[a, b] = [1, 2] ; [c, d] = [1.2, 2.1]
Calculate numerical values of the integral I in (1) using (a), (b) and (c) with
1, 2, 4, 8, 16 and 32 number of strips for the following f (x) and [a, b].
I.   f (x) = ( x + 1/x )²                           ;  [a, b] = [1, 2]
II.  f (x) = x e^{2x}                               ;  [a, b] = [0, 3]
III. f (x) = 10 + 5x² + 2.5x³ + 1.25x⁴ + 0.625x⁵    ;  [a, b] = [−1, 1]
IV.  f (x) = 4x² + 2x⁴ + x⁶                         ;  [a, b] = [1, 2]
V.   f (x) = sin(πx)                                ;  [a, b] = [0, 1/2]
For each f (x), tabulate your results in the following form. Provide a listing
of your program and document your work.
Integral Value
Number of steps
Trapezoid rule Simpson’s 1/3 rule Simpson’s 3/8 rule
1
2
..
.
32
For each of the three methods ((a), (b) and (c)) apply Romberg method
to obtain the most improved values of the integrals. When using Romberg
method, provide a separate table for each of the three methods for each f (x).
6.8 Consider the following integral
I = ∫_2^3 ∫_1^2 ( x² + y² )^{1/3} dx dy
obtain numerical values of the integral I using Gauss quadrature: 1×1, 2×2,
. . . , n × n to obtain the integral value with seven decimal place accuracy.
Tabulate your calculations.
6.9 Consider the following integral
I = ∫_0^{π/2} cos(x) dx
Can this be integrated exactly using Gauss quadrature? If yes, then
find the minimum number of quadrature points in x and y. Clearly
illustrate and justify your answers.
6.12 Consider the following table of data.
i 1 2 3 4 5
xi 0 0.25 0.5 0.75 1.0
fi = f (xi ) 0.3989 0.3867 0.3521 0.3011 0.2410
(a) Compute I = ∫_0^1 f (x) dx with strip widths of h = 0.25, h = 0.5, and
h = 1.0 using Trapezoid rule. Using these computed values employ
Romberg method to compute the most accurate value of the integral
I. Tabulate your calculations. Show the orders of the errors being
eliminated.
(b) given
Z1 Z3.9 Z2.7
4.2 1.2 2 3 −y 3.6 1
I= x (1 + x ) y e z 1 + 2.6 dx dy dz
z
0.5 2.5 1.6
What is the lowest order of Gauss quadrature in x, y and z to
calculate exact value of I. Explain your reasoning.
7.1 Introduction
In the interpolation theory, we construct an analytical expression, say
f (x), for the data points (xi , fi ); i = 1, 2, . . . , n. The function f (x) is such
that it passes through the data points, i.e., f (xi ) = fi ; i = 1, 2, . . . , n. This
polynomial representation f (x) of the data points may some times be a
poor representation of the functional relationship described by the data (see
Figure 7.1).
Figure 7.1: Polynomial representation of data and comparison with the true functional relationship described by the data
we have a linear least squares fit to the data. When g(x) is a non-linear
function of the unknown constants or coefficients, we obviously have a non-
linear least squares fit. In this chapter we consider linear as well as non-linear
least squares fits. In case of linear least squares fit we also consider weighted
least squares fit in which the more accurate data points can be assigned
larger weight factors to ensure that the resulting fit is biased towards these
data points. The non-linear least squares fit is first presented for a special
class of g(x) in which taking log or ln of both sides of g(x) yields a form that
is suitable for linear least squares fit with appropriate correction so that true
residual is minimized. This is followed by a general non-linear least squares
fit process that is applicable to any form of g(x) in which g(x) is a desired
non-linear function of the unknown constants or coefficients. A weighted
non-linear least squares formulation of this non-linear least squares fit is
also presented. It is shown that this non-linear least squares fit formulation
naturally degenerates to linear and weighted linear least squares fit when
g(x) is a linear function of the unknown constants or coefficients.
Since g(x) does not necessarily pass through the data points, if xi ; i =
1, 2, . . . , n are substituted in g(x) to obtain g(xi ); i = 1, 2, . . . , n, these may
not agree with fi ; i = 1, 2, . . . , n. Let r1 , r2 , . . . , rn be the differences
between g(x1 ), g(x2 ), . . . , g(xn ) and f1 , f2 , . . . , fn , called residuals at each
of the location xi ; i = 1, 2, . . . , n.
Σ_{k=1}^{m} ck gk(xi) − fi = ri ;   i = 1, 2, . . . , n          (7.2)
or
[G]n×m {c}m×1 − {f }n×1 = {r}n×1 (7.3)
where
g1 (x1 ) g2 (x1 ) . . . gm (x1 )
g1 (x2 ) g2 (x2 ) . . . gm (x2 )
[G] = (7.4)
..
.
g1 (xn ) g2 (xn ) . . . gm (xn )
{c} = {c1, c2, . . . , cm}ᵀ ;   {f } = {f1, f2, . . . , fn}ᵀ ;   {r} = {r1, r2, . . . , rn}ᵀ          (7.5)
The vector {r} is called the residual vector. It represents the difference
between the assumed fit g(x) and the actual function values fi . In the least
squares fit we minimize the sum of squares of the residuals, i.e., we consider
minimization of the sum of the squares of the residuals R.
(R)minimize = ( Σ_{i=1}^{n} (ri)² )minimize            (7.6)

or

Σ_{i=1}^{n} 2 ri ∂ri/∂ck = 0 ;   k = 1, 2, . . . , m

or

Σ_{i=1}^{n} ri ∂ri/∂ck = 0 ;   k = 1, 2, . . . , m      (7.8)

But

∂ri/∂ck = gk(xi) ;   from (7.2)                        (7.9)

Hence, (7.8) can be written as:

Σ_{i=1}^{n} ri gk(xi) = 0 ;   k = 1, 2, . . . , m       (7.10)

or

{r}ᵀ [G] = [0, 0, . . . , 0]                            (7.11)

or

[G]ᵀ {r} = {0}                                          (7.12)

Substituting for {r} from (7.3) into (7.12):

[G]ᵀ ([G]{c} − {f }) = {0}                              (7.13)

or

[G]ᵀ [G]{c} = [G]ᵀ {f }                                 (7.14)
Using (7.14), the unknowns {c} can be calculated. Once {c} are known, the
desired least squares fit is given by g(x) in (7.1).
g(x) = c1 + c2 x3
g1(x) = 1 ,   g2(x) = x³

{f }ᵀ = [2.4   3.4   13.8   39.5]

[G]ᵀ[G] = [ 4.0   36.0 ;  36.0   794.0 ]

[G]ᵀ{f } = { 59.1 ;  1180.0 }

∴ [G]ᵀ[G] {c1 ; c2} = [G]ᵀ{f }

or

[ 4.0   36.0 ;  36.0   794.0 ] {c1 ; c2} = { 59.1 ;  1180.0 }
∴ c1 = 2.35883
c2 = 1.37957
Hence,
g(x) = 2.35883 + 1.37957x3
is the least squares fit to the data in the table.
For this least squares fit we have

R = Σ_{i=1}^{4} ri² = 0.291415
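Example 7.1 can be reproduced with a short computation; the sketch below (Python with NumPy) assembles the normal equations (7.14) directly. The xi values 0, 1, 2, 3 are inferred from the printed [G]ᵀ[G] and [G]ᵀ{f}, since the data table itself is not legible in this excerpt.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])        # inferred data locations
f = np.array([2.4, 3.4, 13.8, 39.5])

G = np.column_stack([np.ones_like(x), x**3])   # g1(x) = 1, g2(x) = x^3
c = np.linalg.solve(G.T @ G, G.T @ f)          # normal equations, eq. (7.14)
r = G @ c - f                                  # residuals, eq. (7.3)
print(c, np.sum(r**2))                         # about [2.35883, 1.37957], R about 0.2914
```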
[Figure 7.2: data points (xi, fi) and the curve fit g(x) versus x.]
Figure 7.2 shows plots of data points (xi , fi ) and g(x) versus x. In this case
g(x) provides good approximation to the data (xi , fi ).
i 1 2 3
xi 1 2 3
fi 4.5 9.5 19.5
Here we demonstrate the use of weight factors (considered unity in this case).
Determine the constants c1 and c2 for g(x) given below to be a least squares
fit to the data in the table. Use weight factors of 1.0 for each data point.
g(x) = c1 + c2 x2
g1(x) = 1 ,   g2(x) = x²

[G] = [ g1(x1)  g2(x1) ;  g1(x2)  g2(x2) ;  g1(x3)  g2(x3) ] = [ 1  1 ;  1  4 ;  1  9 ]

[W ] = diag(1, 1, 1) ;   {f } = { 4.5 ;  9.5 ;  19.5 }

∴ [G]ᵀ[W ][G] {c1 ; c2} = [G]ᵀ[W ]{f }

where

[G]ᵀ[W ][G] = [ 3.0   14.0 ;  14.0   98.0 ]

[G]ᵀ[W ]{f } = { 33.5 ;  218.0 }

[ 3.0   14.0 ;  14.0   98.0 ] {c1 ; c2} = { 33.5 ;  218.0 }

∴ c1 = 2.35714
  c2 = 1.88775

Hence,

g(x) = 2.35714 + 1.88775x²

is a least squares fit to the data in the table with weight factors of 1.0
assigned to each data point. We note that the least squares fit with or without
the weight factors will yield the same results due to the fact that the weight
factors are unity in this example.
In this case we have

R = Σ_{i=1}^{3} ri² = 0.255102
[Figure 7.3: data points (xi, fi) and the curve fit g(x) versus x.]
Figure 7.3 shows plots of data points (xi , fi ) and g(x) versus x. We note
that g(x) is a good fit to the data (xi , fi ).
[W ] = diag(1, 2, 1) ;   {f } = { 4.5 ;  9.5 ;  19.5 }

[G]ᵀ[W ][G] = [ 4.0   18.0 ;  18.0   114.0 ]

[G]ᵀ[W ]{f } = { 43.0 ;  256.0 }

[ 4.0   18.0 ;  18.0   114.0 ] {c1 ; c2} = { 43.0 ;  256.0 }

∴ c1 = 2.227273
  c2 = 1.89394

∴ g(x) = 2.227273 + 1.89394x²

This is a least squares fit to the data in the table with weight factors of 1,
2, and 1 for the three data points. Due to w2 = 2, c1 and c2 have changed
slightly compared to Example 7.2. In this case we have

R = Σ_{i=1}^{3} wi ri² = 0.378788
[Figure 7.4: data points (xi, fi) and the curve fit g(x) versus x.]
Figure 7.4 shows plots of (xi, fi) and g(x) versus x. The weight factor of 2 for
data point two does not appreciably alter g(x).
i 1 2 3 4
xi 1 2 2 3
fi 4.5 9.5 9.5 19.5
g(x) = c1 + c2 x²

g1(x) = 1 ,   g2(x) = x²

[G] = [ 1  1 ;  1  4 ;  1  4 ;  1  9 ]

[W ] = diag(1, 1, 1, 1) ;   {f } = { 4.5 ;  9.5 ;  9.5 ;  19.5 }

[G]ᵀ[W ][G] = [ 4.0   18.0 ;  18.0   114.0 ]

[G]ᵀ[W ]{f } = { 43.0 ;  256.0 }

[ 4.0   18.0 ;  18.0   114.0 ] {c1 ; c2} = { 43.0 ;  256.0 }

∴ c1 = 2.227273
  c2 = 1.89394       exactly the same as in Example 7.3
Thus, assigning an integer weight factor k to a data point is the same as repeating that data point k times
with a weight factor of one. In this case we have

R = Σ_{i=1}^{4} ri² = 0.378788
[Figure: data points (xi, fi) and the curve fit g(x) versus x for Example 7.4.]
then due to the specific form of g(x) in (7.23), the minimization of (7.24)
will result in a system of nonlinear algebraic equations in c and k. This can
be avoided by considering the following: consider (7.23) and take the log of
both sides.
log(g(x)) = log c + k log x = c1 g1 (x) + c2 g2 (x) (7.25)
where c1 = log c
c2 = k
(7.26)
and g1 (x) = 1
g2 (x) = log x
Now we can use (7.25) and apply linear least squares fit.
(R̃)minimize = ( Σ_{i=1}^{n} (log(g(xi)) − log(fi))² )minimize              (7.27)
We note that in (7.27), we are minimizing the sum of squares of the residuals
of logs of g(xi ) and fi . However, if we still insist on using (7.27), then some
adjustments or corrections must be made so that (7.27) indeed would result
in what we want.
Let ∆fi be the error in fi; then the corresponding error in log(g(xi)),
i.e., ∆(log(g(xi))), can be approximated:

∆(log(g(xi))) = d(log(g(xi))) ≃ d(log(fi)) = (d(log(fi))/dfi) dfi = (1/fi) dfi = ∆fi/fi           (7.28)

∴ ∆fi = fi ∆(log(g(xi)))                              (7.29)

∴ (∆fi)² = (fi)² (∆(log(g(xi))))²                     (7.30)
From (7.30), we note that minimization of the square of the error in fi requires
minimization of the square of the error in log(g(xi)) multiplied by fi², i.e., fi²
behaves like a weight factor. Thus, instead of considering the minimization (7.27),
we consider:

(R)minimize = ( Σ_{i=1}^{n} wi (log(g(xi)) − log(fi))² )minimize            (7.31)
with [W ] = diag(f1², f2², . . . , fn²) and

{f̂} = {log(f1), log(f2), . . . , log(fn)}ᵀ            (7.34)

The procedure described above involves the approximation (7.28), but it helps
in avoiding the nonlinear algebraic equations that would otherwise result from the least squares fit.
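The weighted log-transform fit is a direct application of (7.31) with linear least squares; a minimal sketch follows (Python with NumPy), using the data of Example 7.5 and the fit g(x) = c·x^k.

```python
import numpy as np

# Example 7.5 data; fit g(x) = c * x**k via the log transform of Section 7.4
x = np.array([1.0, 2.0, 3.0])
f = np.array([1.2, 3.63772, 6.959455])

G = np.column_stack([np.ones_like(x), np.log10(x)])   # log10 g = c1 + c2*log10 x
W = np.diag(f**2)                                      # weight factors f_i^2, eq. (7.31)
fhat = np.log10(f)
c1, c2 = np.linalg.solve(G.T @ W @ G, G.T @ W @ fhat)  # weighted normal equations
c, k = 10.0**c1, c2
print(c, k)        # about 1.2 and 1.6, i.e. g(x) = 1.2 x^1.6
```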
i 1 2 3
xi 1 2 3
fi 1.2 3.63772 6.959455
Let g(x) = cxk be a least squares fit to the data in the table. Determine c
and k using non-linear least squares fit procedure described in section 7.4.
Take log10 of both sides of g(x) = cxk .
Here g1(x) = 1 and g2(x) = log₁₀ x.

[W ] = diag(f1², f2², f3²) = diag(1.44, 13.233, 48.434)

{f̂} = { log₁₀ 1.2 ;  log₁₀ 3.63772 ;  log₁₀ 6.959455 }

∴ [G]ᵀ[W ][G] = [ 63.1070   27.0924 ;  27.0924   12.2249 ]

[G]ᵀ[W ]{f̂} = { 48.3448 ;  21.7051 }

∴ [G]ᵀ[W ][G] {c1 ; c2} = [G]ᵀ[W ]{f̂}  gives

[ 63.1070   27.0924 ;  27.0924   12.2249 ] {c1 ; c2} = { 48.3448 ;  21.7051 }

c1 = 0.0791822 = log₁₀ c   ∴ c = 1.2
c2 = 1.6 = k
Hence,
g(x) = 1.2x1.6
is the least squares fit to the data. For this least squares fit we have
R = Σ_{i=1}^{3} ri² = Σ_{i=1}^{3} (fi − g(xi))² = 1.16858 × 10⁻¹¹
[Figure 7.6: data points (xi, fi) and the curve fit g(x) versus x.]
Figure 7.6 shows plots of (xi , fi ) and g(x) versus x. The fit by g(x) is
almost exact. This is also obvious from such low value of R.
[Figure 7.7: data points (xi, fi) and the curve fit g(x) versus x.]
Figure 7.7 shows plots of (xi , fi ) and g(x) versus x. g(x) is almost exact fit
to the data. This is also obvious from very low R.
g1(x) = 1 ,   g2(x) = x

[G] = [ g1(x1)  g2(x1) ;  g1(x2)  g2(x2) ;  g1(x3)  g2(x3) ] = [ 1  0 ;  1  1 ;  1  2 ]

[W ] = diag(f1², f2², f3²) = diag(1.44, 7.1324, 35.327)

{f̂} = { ln(1.2) ;  ln(2.67065) ;  ln(5.94364) }

Therefore we have

[G]ᵀ[W ][G] {c1 ; c2} = [G]ᵀ[W ]{f̂}

where

[G]ᵀ[W ][G] = [ 43.8992   77.7861 ;  77.7861   148.440 ]

[G]ᵀ[W ]{f̂} = { 70.2327 ;  132.93 }

[ 43.8992   77.7861 ;  77.7861   148.440 ] {c1 ; c2} = { 70.2327 ;  132.93 }

∴ c1 = 0.1823226 = ln(c)  ⟹  c = 1.2
  c2 = 0.8 = k

Hence

g(x) = 1.2e^{0.8x}

is the least squares fit to the data given in the table. For this least squares fit we have

R = Σ_{i=1}^{3} ri² = Σ_{i=1}^{3} (fi − g(xi))² = 9.59730 × 10⁻¹³
6
Given Data
5.5 Curve fit g(x)
4.5
Data fi or g(x)
3.5
2.5
1.5
1
0 0.5 1 1.5 2
x
Figure 7.8 shows graphs of (xi , fi ) and g(x) versus x. In this case also g(x)
is almost exact fit to the data
∂g(xi , c1 , c2 , . . . , cm )k ∂g(xi , c1 , c2 , . . . , cm )k
(∆c1 )k + (∆c2 )k + · · · +
∂c1 ∂c2
g(xi , c1 , c2 , . . . , cm )k − fi = ri ; i = 1, 2, . . . , n (7.39)
in which
∂g(x
1 ,c1 ,c2 ,...,cm )k
∂g(x1 ,c1 ,c2 ,...,cm )k ∂g(x1 ,c1 ,c2 ,...,cm )k
∂c1 ∂c ... ∂cm
∂g(x2 ,c1 ,c2 ,...,cm )k ∂g(x2 ,c1 ,c22,...,cm )k ∂g(x2 ,c1 ,c2 ,...,cm )k
∂c1 ∂c2 ... ∂cm
[G]k =
.. .. ..
(7.41)
. . .
∂g(xn ,c1 ,c2 ,...,cm )k ∂g(xn ,c1 ,c2 ,...,cm )k ∂g(xn ,c1 ,c2 ,...,cm )k
∂c1 ∂c2 ... ∂cm n×m
n
!
X
(Rk )minimization = (ri )2k (7.44)
i=1 minimization
We note that (7.40) is similar to (7.3) when {d}k in (7.40) takes the place
of {f } in (7.3), hence the least squares fit becomes
[G]Tk [G]k {∆c}k = [G]Tk {d}k (7.45)
We solve for {∆c}k using (7.45). Improved values of {c} i.e. {c}k+1 are
given by
{c}k+1 = {c}k + {∆c}k (7.46)
Convergence check for the iterative process (7.45) and (7.46) is given by
(ci )k+1 − (ci )k
100 ≤ ∆ ; i = 1, 2, . . . , m (7.47)
(ci )k+1
or simply |(ci )k+1 − (ci )k | ≤ ∆1 .
We note that the method requires initial or starting values of ci ; i = 1, 2, . . . , m
i.e. {c}k so that coefficients of [G]k and g(x, c1 , c2 , . . . , cm )k in {d}k in (7.5)
can be calculated.
When the convergence criteria in (7.47) is satisfied we have the solution
{c}k+1 for ci ; i = 1, 2, . . . , m in g(x, c1 , c2 , . . . , cm ) otherwise we increment k
by one and repeat (7.45) and (7.46) till (7.47) is satisfied.
7.5.1.1 Using general non-linear least squares fit for linear least
squares fit
In case of linear least squares fit we have
m
X
g(x, c1 , c2 , . . . , cm ) = ci gi (x) (7.50)
i=1
7.5. GENERAL FORMULATION FOR NON-LINEAR LEAST SQUARES FIT (GNLSF) 331
Hence
∂g
= gk (x) (7.51)
∂ck
Hence, [G]k in (7.45) reduces to [G] ((7.4)) in linear least squares fit and we
have (omitting subscript k)
[G]T [G]{∆c} = [G]T {d} (7.52)
in which di = fi − g(xi ); i = 1, 2, . . . , n.
(1) With the initial choice of ci = 0; i = 1, 2, . . . , m, {d} becomes {f }, hence
(7.52) reduces to
[G]T [G]{∆c} = {f } (7.53)
we note that (7.53) is same as (7.4) in linear least squares fit. Clearly
{∆c} = {c}.
(2) With any non-zero choices of {c}, the iterative process converges in two
iterations as [G] is not a function of {c}.
Example 7.8. We consider the same problem as example 7.5 but apply the
general formulation for non-linear least squares fit presented in section 7.5.
g(x, c, k) = cxk
∂g ∂g
= xk , = ckxk−1
∂c ∂k
we consider matrix [G] and vector {d} using (7.41) and (7.43)
∂g(x1 ,ck ) ∂g(x1 ,ck )
∂g(x∂c2 ,ck ) ∂g(x∂k2 ,ck )
∂c ∂k
[G] = .. ..
. .
∂g(xn ,ck ) ∂g(xn ,ck )
∂c ∂k
43.8243 32.2682
[G]T1 [G]1 =
32.2682 25.9323
{[G]T1 {d}1 }T = [−3.65106 × 10−6 − 2.10685 × 10−6 ]
{c}T2 = [1.1999998 1.6000003]
[G]T1 [G]1 {∆c}1 = [G]T1 {d}1
gives
{∆c}T1 = [−2.08034 × 10−7 2.6759 × 10−7 ]
are
{c}T2 = {c}T1 + {∆c}T1 = [1.1999998 1.600003] = [c, k]
{c}2 is converged solution based on tolerance ∆1 = 10−6 .
Since {c}1 (initial values of c and k) are the correct values, the non-linear
iterations solution procedure converges in only one iteration and we have
g(x) = 1.2x1.6 with R = 1.38914 × 10−12
the desired least squares fit.
7
Given Data
Curve fit g(x)
6
5
Data fi or g(x)
1
1 1.5 2 2.5 3
x
Example 7.9. Here we consider the same problem as example 7.7 but apply
the general formulation for non-linear least squares fit.
g(x, c, k) = cekx
∂g ∂g
= ekx , = ckekx
∂c ∂k
Matrix [G] and vector {d} are constructed using using (7.41) and (7.43)
∂g(x1 ,ck ) ∂g(x1 ,ck )
∂g(x∂c2 ,ck ) ∂k
∂g(x2 ,ck )
∂c ∂k
[G] = .. ..
. .
∂g(xn ,ck ) ∂g(xn ,ck )
∂c ∂k
33.1457 70.5365
[G]T2 [G]2 =
70.5365 160.978
{[G]T2 {d}2 }T = [−1.41707 − 3.26531]
[G]T2 [G]2 {∆c}2 = [G]T2 {d}2
gives
{∆c}T2 = [6.1246 × 10−3 − 2.2968 × 10−2 ]
Hence,
{c}T3 = {c}T2 + {∆c}T2 = [1.19964123 0.800565183]
with R = 2.50561 × 10−5
For k = 3 (iteration three)
1.0 0.0
[G]3 = 2.22680 2.67136
4.95863 11.8972
T 30.5467 64.9423
[G]3 [G]3 =
64.9423 148.679
{[G]T3 {d}3 }T = [−2.57275 × 10−2 − 6.06922 × 10−2 ]
[G]T3 [G]3 {∆c}3 = [G]T3 {d}3
gives
{∆c}T3 = [3.5895 × 10−4 − 5.6500 × 10−4 ]
Hence,
{c}T4 = {c}T3 + {∆c}T3 = [1.20000017 0.80000019]
and R = 3.53244 × 10−12
For k = 4 (iteration four)
1.0 0.0
[G]4 = 2.22554 2.67065
4.95303 11.8873
T 30.4856 64.8218
[G]4 [G]4 =
64.8218 148.44
{[G]T4 {d}4 }T = [−9.38711 × 10−6 − 2.22698 × 10−5 ]
[G]T4 [G]4 {∆c}4 = [G]T4 {d}4
gives
{∆c}T4 = [1.5505 × 10−7 − 2.1773 × 10−7 ]
7.5. GENERAL FORMULATION FOR NON-LINEAR LEAST SQUARES FIT (GNLSF) 335
Hence,
{c}T5 = {c}T4 + {∆c}T4 = [1.200000290 0.799999952]
with R = 0.313247 × 10−13
Absolute value of each components of {c}4 − {c}3 is less than or equal to
∆1 = 10−6 , hence
{c}T2 = [c, k] = [1.2 0.8]
Thus we have
g(x) = 1.2e0.8x
is the desired least squares fit. This is same as in example 7.7.
6
Given Data
5.5 Curve fit g(x)
4.5
Data fi or g(x)
3.5
2.5
1.5
1
0 0.5 1 1.5 2
x
Remarks.
(1) Examples 7.1 - 7.4, linear least squares fit have also been solved using the
general non-linear least squares fit (section 7.5), the results are identical
to those obtained in examples 7.1 - 7.4, hence are not repeated here.
(2) In examples 7.1 - 7.4 when using formulations of section 7.5 the initial
336 CURVE FITTING
f (t) = f (t + T ) (7.54)
in which T is called period. T is the smallest value of time for which (7.54)
holds i.e. f (··) repeats after every value of time as a multiple of T .
In least squares fit we can generally use functions in time of the forms
repeats and φ is called phase shift that defines how the function is shifted
horizontally. Negative φ implies lag whereas positive φ results in lead.
An alternative to (7.55) is to use
or
g(t) = c̃1 + c̃2 (cos ωt cos φ − sin ωt sin φ)
(7.58)
= c̃1 + (c̃2 cos φ) cos ωt + (−c̃2 sin φ) sin ωt
Let
c1 = c̃1 , c2 = c̃2 cos φ , c3 = −c̃2 sin φ (7.59)
Then, we can write (7.57) as
or
[G]{c} − {f } = {r} (7.70)
In weighted least squares curve fit we consider
Xn
(R)minimizing =( wi ri2 )minimizing (7.71)
i=1
Hence
n
1 X
c1 = ( fi )
n
i=1
n
2 X
c2 = ( fi cos ωti ) (7.76)
n
i=1
n
2 X
c3 = ( fi sin ωti )
n
i=1
Example 7.10. In this example we consider least squares fit using sinusoidal
functions.
Consider
f (t) = 1.5 + 0.5 cos 4t + 0.25 sin 4t
for t ∈ [0, 1.5]. We generate ti and f (ti ) or fi ; i = 1, 2, . . . , 16 in equal incre-
ment ∆t = 0.1. (ti , fi ); i = 1, 2, . . . , 16 are given in the following (n = 16).
340 CURVE FITTING
i ti fi
1 0.00000E+00 0.20000E+01
2 0.10000E+00 0.20579E+01
3 0.20000E+00 0.20277E+01
4 0.30000E+00 0.19142E+01
5 0.40000E+00 0.17353E+01
6 0.50000E+00 0.15193E+01
7 0.60000E+00 0.13002E+01
8 0.70000E+00 0.11126E+01
9 0.80000E+00 0.98626E+00
10 0.90000E+00 0.94099E+00
11 0.10000E+01 0.98398E+01
12 0.11000E+01 0.11084E+01
13 0.12000E+01 0.12947E+01
14 0.13000E+01 0.15134E+01
15 0.14000E+01 0.17300E+01
16 0.15000E+01 0.19102E+01
3
X
g(t) = c1 + c2 cos 4t + c3 sin 4t = ck gk (t)
k=1
in which
g1 (t) = 1 , g2 (t) = cos 4t and g3 (t) = sin 4t
or
1 cos 4t1 sin 4t1
1 cos 4t2 sin 4t2
[G] = .
.. ..
.. . .
1 cos 4tn sin 4tn
7.6. LEAST SQUARES FIT USING SINUSOIDAL FUNCTIONS (LSFSF) 341
0.100000E+01 0.100000E+01 0.000000E+00
0.100000E+01 0.921061E+00 0.389418E+00
0.100000E+01 0.696707E+00 0.717356E+00
0.100000E+01 0.362358E+00 0.932039E+00
0.100000E+01 -0.291995E-01 0.999574E+00
0.100000E+01 -0.416147E+00 0.909297E+00
0.100000E+01 -0.737394E+00 0.675463E+00
0.100000E+01 -0.942222E+00 0.334988E+00
[G] =
0.100000E+01 -0.998295E+00 -0.583742E-01
0.100000E+01 -0.896758E+00 -0.442520E+00
0.100000E+01 -0.653644E+00 -0.756802E+00
0.100000E+01 -0.307333E+00 -0.951602E+00
0.100000E+01 0.874992E-01 -0.996165E+00
0.100000E+01 0.468516E+00 -0.883455E+00
0.100000E+01 0.775566E+00 -0.631267E+00
0.100000E+01 0.960170E+00 -0.279415E+00
0.160000E+02 0.290885E+00 -0.414649E-01
[G]T [G] = 0.290885E+00 0.814368E+01 -0.418135E-01
-0.414649E-01 -0.418135E-01 0.785632E+01
2.2
Given Data
Curve fit g(x)
2
1.8
Data fi or g(x)
1.6
1.4
1.2
0.8
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
x
well as non-linear least squares fit, both weighted as well as without weight
factors.
344 CURVE FITTING
Problems
7.1 Consider the following data
i 1 2 3 4
xi −2 −1 0 1
fi = f (xi ) 6 4 3 3
Consider g(x) = c1 +c2 x to be least squares fit to this data. Find coefficients
c1 and c2 . Plot graphs of data points and g(x) versus x as well as tabulate.
i 1 2 3
xi 0 π/4 π/2
fi = f (xi ) 0 1 0
g(x) = c1 + c2 ex
i 1 2 3
xi 0 1 2
fi = f (xi ) 1 2 2
i 1 2 3
xi 0 1 2
fi = f (xi ) 2 6.40 22.046
i 1 2 3
xi 1 2 4
fi = f (xi ) 1.083 3.394 9.6
7.6. LEAST SQUARES FIT USING SINUSOIDAL FUNCTIONS (LSFSF) 345
Obtain constants a and b in g(x) = axb so that g(x) is a least squares fit to
the data in the table. Use formulations in section 7.4 and 7.5. Compare and
discuss the results obtained from the two formulations.
i 1 2 3
xi 0 1 2
fi = f (xi ) 10 3 1
Using a form of the type g(x) = k1 e−k2 x . Calculate k1 and k2 using the
formulations in section 7.4 as well as 7.5. Compare and discuss the values
of k1 and k2 obtained using the two formulations.
i 1 2 3
xi 0 1 2
fi = f (xi ) 10 12 18
i 1 2 3
xi 0.1 0.2 0.3
fi = f (xi ) 12.161 13.457 20.332
8.1 Introduction
In many situations, given the discrete data set (xi , fi ); i = 1, 2, . . . , n, we
are faced with the problem of determining the derivative of a function f with
respect to x. The discrete data set (xi , fi ); i = 1, 2, . . . , n may be from an
experiment in which we have only determined values fi at discrete locations
xi . In such a situation the value of a function f ∀x 6= xi ; i = 1, 2, . . . , n
is not known. Secondly, we only have discrete data points. A function f (x)
describing this data set is not known yet.
For the data set (xi , fi ); i = 1, 2, . . . , n we wish to determine approximate
value of the derivative of f with respect to x. We consider the following two
approaches in this chapter.
dk f
8.1.1 Determination of Approximate Value of dxk
; k = 1, 2, . . . .
using Interpolation Theory
In this approach we consider the data set (xi , fi ); i = 1, 2, . . . , n and
establish the interpolating polynomial f (x) using (see Chapter 5):
347
348 NUMERICAL DIFFERENTIATION
(xi , fi ) ; i = 1, 2, . . . , n (8.1)
x1 x2 x3 xi−1 xi xi+1 xn
xi+1 = xi + h ; i = 1, 2, . . . , n − 1 (8.2)
The scalar h is the spacing between the two successive data points.
k
If we pose the problem of determining ddxfk ; k = 1, 2, . . . at x = xi ,
k
then by letting i = 1, 2, . . . we can determine ddxfk ; k = 1, 2, . . . at x = xi ;
i = 1, 2, . . . , n. Consider x = xi and fi and two sets (for example) of data
points immediately preceding x = xi as well as immediately following x = xi
(see Figure 8.2). Since fi is the value of f at xi , we can define:
h h h h
fi = f (xi ) ; i = 1, 2, . . . , n (8.3)
dk f
Our objective is to determine dxk
at x = xi ; k = 1, 2, . . . ; i = 1, 2, . . . , n.
df
8.2.1 First Derivative of dx
at x = xi
(a) Forward difference method or first forward difference
Consider Taylor expansion of f (xi+1 ) about x = xi .
h2 h3
f (xi+1 ) = f (xi ) + f 0 (xi )h + f 00 (xi ) + f 000 (xi ) + . . . (8.4)
2! 3!
or
h2 h3
f (xi+1 ) − f (xi ) = f 0 (xi )h + f 00 xi + f 000 (xi ) (8.5)
2! 3!
or
f (xi+1 ) − f (xi ) = f 0 (xi )h + O(h2 ) (8.6)
or
f (xi+1 ) − f (xi )
= f 0 (xi ) + O(h) (8.7)
h
f (xi+1 ) − f (xi )
∴ f 0 (xi ) ' (8.8)
h
The approximate value of the derivative of f with respect to x at x = xi
given by (8.8) has truncation error of the order of h O(h). This is called
df
forward difference approximation of dx at x = xi . By letting i = 1, 2, . . .
df
in (8.8), we can obtain dx at x = xi ; i = 1, 2, . . . , n − 1.
h2 h3
f (xi−1 ) = f (xi ) − f 0 (xi )h + f 00 (xi ) − f 000 (xi ) (8.9)
2! 3!
or
h2 h3
f (xi−1 ) − f (xi ) = −f 0 (xi )h + f 00 (xi ) − f 000 (xi ) (8.10)
2! 3!
or
f (xi−1 ) − f (xi ) = −f 0 (xi )h + O(h2 ) (8.11)
or
f (xi−1 ) − f (xi )
= −f 0 (xi ) + O(h) (8.12)
h
f (xi ) − f (xi−1 )
∴ f 0 (xi ) ' (8.13)
h
The approximate value of the derivative of f with respect to x at x = xi
given by (8.13) has truncation error of the order of h O(h). This is
350 NUMERICAL DIFFERENTIATION
df
called backward difference approximation of dx at x = xi . By letting
df
i = 1, 2, . . . in (8.13), we can obtain dx at x = xi ; i = 2, 3, . . . ,n.
(c) First central difference method
Consider Taylor series expansion (8.5) and (8.10).
h2 h3
f (xi+1 ) − f (xi ) = f 0 (xi )h + f 00 (xi ) + f 000 (xi ) + . . . (8.14)
2! 3!
h 2 h3
f (xi−1 ) − f (xi ) = −f 0 (xi )h + f 00 (xi ) − f 000 (xi ) + . . . (8.15)
2! 3!
Subtracting (8.15) from (8.14):
h3
f (xi+1 ) − f (xi−1 ) = 2hf 0 (xi ) + 2f 000 (xi ) (8.16)
3!
or
f (xi+1 ) − f (xi−1 ) = 2hf 0 (xi ) + O(h3 ) (8.17)
or
f (xi+1 ) − f (xi−1 )
= f 0 (xi ) + O(h2 ) (8.18)
2h
f (xi+1 ) − f (xi−1 )
∴ f 0 (xi ) ' (8.19)
2h
df
The approximate value of dx at x = xi given by (8.19) has truncation
2
error of the order of O(h ). This is known as central difference approx-
df
imation of dx at x = xi for u = 2, 3, . . . , n − 1.
Remarks.
(1) Forward difference and backward difference approximation have the same
order of truncation error O(h), hence we expect similar accuracy in ei-
ther of these two approaches.
(2) The central difference method has truncation error of the order of O(h2 ),
hence this method is superior to forward or backward difference method
and will yield better accuracy. Thus, this is higher order approximation
by one order than (a) and (b).
d2 f
8.2.2 Second Derivative dx2
at x = xi : Central Difference
Method
Consider Taylor series expansions (8.5) and (8.10).
h2 h3
f (xi+1 ) − f (xi ) = f 0 (xi ) + f 00 (xi ) + f 000 (xi ) + . . . (8.20)
2! 3!
h 2 h3
f (xi−1 ) − f (xi ) = −f 0 (xi ) + f 00 (x)i) − f 000 (xi ) + . . . (8.21)
2! 3!
8.2. NUMERICAL DIFFERENTIATION USING TAYLOR SERIES EXPANSIONS 351
or
f (xi+1 ) − 2f (xi ) + f (xi−1 )
= f 00 (xi ) + O(h2 ) (8.23)
h2
f (xi+1 ) − 2f (xi ) + f (xi−1 )
∴ f 00 (xi ) ' (8.24)
h2
2
The approximation of ddxf2 at x = xi ; i = 2, 3, . . . , n − 1 given by (8.24) has
truncation error of the order of O(h2 ).
d3 f
8.2.3 Third Derivative dx3
at x = xi
Recall (8.5) and (8.10) based on Taylor series expansions of f (xi+1 ) and
f (xi−1 ) about x = xi .
h2 h3
f (xi+1 ) − f (xi ) = f 0 (xi )h + f 00 (xi ) + f 000 (xi ) + . . . (8.25)
2! 3!
h 2 h3
f (xi−1 ) − f (xi ) = −f 0 (xi )h + f 00 (xi ) − f 000 (xi ) + . . . (8.26)
2! 3!
Also consider Taylor series expansions of f (xi+2 ) and f (xi−2 ) about x = xi .
(2h)2 (2h)3
f (xi+2 ) = f (xi ) + f 0 (xi )(2h) + f 00 (xi ) + f 000 (xi ) + ... (8.27)
2! 3!
(2h)2 (2h)3
f (xi−2 ) = f (xi ) − f 0 (xi )(2h) + f 00 (xi ) − f 000 (xi ) + ... (8.28)
2! 3!
Subtracting (8.26) from (8.25):
1
f (xi+1 ) − f (xi−1 ) = 2f 0 (xi )h + f 000 (xi )h3 + O(h5 ) (8.29)
3
Subtracting (8.28) from (8.27):
8
f (xi+2 ) − f (xi−2 ) = 4f 0 (xi )h + f 000 (xi )h3 + O(h5 ) (8.30)
3
Multiply (8.29) by 2 and subtract it from (8.30).
3
The approximation of ddxf3 at x = xi ; i = 3, 4, . . . , n − 2 using (8.33) has
truncation error of O(h2 ). Since in this derivation we have considered two
data points immediately before and after x = xi , (8.33) can be labeled as
3
central difference approximation of ddxf3 at x = xi ; 3, 4, . . . , n − 2.
Remarks.
f (xi+1 ) − f (xi )
f 0 (xi ) =
h
f (xi+2 ) − 2f (xi+1 ) + f (xi )
f 00 (xi ) =
h2
f (xi+3 ) − 3f (xi+2 ) + 3f (xi+1 ) − f (xi )
f 000 (xi ) =
h3
f (xi+4 ) − 4f (x i+3 ) + 6f (xi+2 ) − 4f (xi+1 ) + f (xi )
f iv (xi ) =
h4
(8.34)
f (xi ) − f (xi−1 )
f 0 (xi ) =
h
f (xi ) − 2f (xi−1 ) + f (xi−2 )
f 00 (xi ) =
h2
f (xi ) − 3f (xi−1 ) + 3f (xi−2 ) − f (xi−3 )
f 000 (xi ) =
h3
f (xi ) − 4f (xi−1 ) + 6f (xi−2 ) − 4f (xi−3 ) + f (xi−4 )
f iv (xi ) =
h4
(8.35)
(2) Approximating derivatives using Taylor series expansion works well and
is easier to use when the data points are equally spaced or have uniform
spacing.
(3) The various differencing expressions are often called finite difference ap-
proximations of the derivatives of the function f defined by the discrete
8.2. NUMERICAL DIFFERENTIATION USING TAYLOR SERIES EXPANSIONS 353
x1 = 0 x2 = 1 x3 = 2 x4 = 3
f1 = 0 f2 = 4 f3 = 0 f4 = −2
df
Thus we can determine dx using central difference at x = 1 and x = 2.
df f3 − f1 0−0
= = =0
dx x=1 2(1) 2(1)
df f1 − f2 −2 − (4)
= = = −3
dx x=2 2(1) 2
354 NUMERICAL DIFFERENTIATION
df f2 − f1 4−0
= = =4
dx x=0 (1) 1
df f4 − f3 −2 − 0
= = = −2
dx x=3 (1) 1
i 1 2 3 4
xi 0 1 2 3
df
dx x=x 4 0 -3 -2
i
i 1 2 3 4
xi 0 1 2 3
df 34
dx x 3 − 53 − 14
3
11
3
i
df
We note that dx values in the two tables are quite different. This is gen-
erally the case when only few data points are available and the spacing
between them is relatively large as is the case in this example.
Problems
8.1 Consider the following table of xi , f (xi ).
i 1 2 3 4 5 6 7
xi 0 1/16 1/8 3/16 1/4 3/8 1/2
df
Compute dx at x = 1/8 and x = 1/4 using forward difference and backward
difference approximation with truncation error of the order O(h) (h = 1/16 in
2
this case). Also compute ddxf2 at x = 1/8 and x = 1/4 using central difference
approximation with truncation error of the order O(h2 ).
Using f (x) = sin(πx) as the actual function describing the data in the ta-
ble, calculate percentage error in the estimates of the first and the second
derivatives.
i 1 2 3 4 5
xi 0 5 10 15 20
fi = f (xi ) 0 1.60944 2.30259 2.70805 2.99573
df
Compute dx at x =5, 10 and 15 using forward difference and backward dif-
ference approximation with truncation error of the order O(h) (h = 5 in
2
this case). Also compute ddxf2 at x = 5, 10 and 15 using central difference
approximation with truncation error of the order O(h2 ).
Using f (x) = ln(x) as the actual function describing the data in the ta-
ble, calculate percentage error in the estimates of the first and the second
derivatives.
i 1 2 3 4 5
xi 1 2 3 4 5
fi = f (xi ) 2.71828 7.3896 20.08554 54.59815 148.41316
df
Compute dx at x = 2, 3 and 4 using forward difference and backward differ-
ence approximation with truncation error of the order O(h) (h = 1 in this
2
case). Also compute ddxf2 at x = 2, 3 and 4 using central difference approxi-
mation with truncation error of the order O(h2 ).
Using f (x) = ex as the actual function describing the data in the table, calcu-
late percentage error in the estimates of the first and the second derivatives.
356 NUMERICAL DIFFERENTIATION
i 1 2 3 4 5
xi 1 2 3 4 5
fi = f (xi ) 0.36788 0.13536 0.04979 0.01832 0.006738
df
Compute dx at x = 2, 3 and 4 using forward difference and backward differ-
ence approximation with truncation error of the order O(h) (h = 1 in this
2
case). Also compute ddxf2 at x = 2, 3 and 4 using central difference approxi-
mation with truncation error of the order O(h2 ).
Using f (x) = e−x as the actual function describing the data in the table,
calculate percentage error in the estimates of the first and the second deriva-
tives.
i 1 2 3 4 5
xi 0 1 2 3 4
fi = f (xi ) 0 0.2 1.6 5.4 32
df
Compute dx at x = 2, 3 and 4 using forward difference and backward differ-
ence approximation with truncation error of the order O(h) (h = 1 in this
2
case). Also compute ddxf2 at x = 2, 3 and 4 using central difference approxi-
mation with truncation error of the order O(h2 ).
Using f (x) = 0.2x3 as the actual function describing the data in the ta-
ble, calculate percentage error in the estimates of the first and the second
derivatives.
d2 f (x)
and dx2
estimated in problem 8.2 using finite difference approximation as
8.3. CONCLUDING REMARKS 357
well as with those calculated using f (x) = ln(x). Also calculate percentage
2 f (x)
error in dfdx
(x)
and d dx 2 values using f (x) = ln(x) as the true behavior of
data in the table.
9.1 Introduction
A boundary value problem (BVP) describes a stationary process in which
the state of the process does not change over time, hence the values of the
dependent variables remain the same or fixed for all values of time. The
mathematical description of BVP result in ordinary or partial differential
equations in dependent variables and spatial coordinates x, y, and z but
not time t. The BVPs also have boundary conditions that may consist
of, specified values of dependent variables and/or their derivatives on the
boundaries of the domain of definition of the BVP.
There are many methods currently employed for obtaining approximate
numerical solutions of the BVPs:
(e) Others
359
360 NUMERICAL SOLUTIONS OF BVPS
Remarks.
(1) We see that integration of the ODE yields its solution and the differen-
tiation of the solution gives back the ODE.
(2) At this stage even though we do not know the specific details of the
various methods mentioned above (but regardless of the details), one
thing is clear, the methods of solution of ODEs and PDEs must consider
their integration in some form or the other over their domain of definition
as this is the only mathematically justifiable approach for obtaining their
solutions.
(3) We generally represent ODEs and PDEs using differential operators and
dependent variable(s). The differential operator contains operations of
differentiation (including differentiation of order zero). When the differ-
ential operator acts on the dependent variable it produces the original
differential or partial differential equations. For example in case of (9.1),
we can write
Aφ = x2 ∀x ∈ Ω (9.5)
in which the differential operator is A = d/dx.
If
dφ 1 d2 φ
− = f (x) ∀x ∈ (a, b) = Ω (9.6)
dx P e dx2
9.2. INTEGRAL FORMS 361
Aφ = f (x) ∀x ∈ Ω
d 1 d2 (9.7)
A= −
dx P e dx2
If
dφ 1 d2 φ
φ − = f (x) ∀x ∈ (a, b) = Ω (9.8)
dx Re dx2
is the BVP, then we can write (9.8) as
Aφ = f (x) ∀x ∈ (a, b) = Ω
d 1 d2 (9.9)
A=φ −
dx Re dx2
In (9.9) the differential operator is a function of the dependent variable
φ.
If
d2 φ
+ φ = f (x) ∀x ∈ (a, b) = Ω (9.10)
dx2
is the BVP, then we can write (9.10) as
Aφ = f (x) ∀x ∈ Ω
d2 (9.11)
A= +1
dx2
∂φn
v(x) = δφn (Ci ) = = ψj (x) ; j = 1, 2, . . . , n (9.15)
∂Ci
Z Z
(Aφn )v dx = f v dx (9.17)
Ω̄ Ω̄
9.2. INTEGRAL FORMS 363
or
n
Z
P
Ci Aψi (x) ψj (x) dx =
i=1
Ω̄
Z Z
f ψj (x) dx − Aψ0 (x)ψj (x) dx ; j = 1, 2, . . . , n (9.20)
Ω̄ Ω̄
[K]{C} = {F } (9.21)
Using (9.21), we calculate {C}. Then, equation (9.14) defines the approxi-
mation φn (x) of φ(x) over Ω̄.
Remarks.
R
(1) When v = δφn , (Aφn −f )v dx = 0 is called the Galerkin method (GM).
(2) When v(x) = w(x) = 0 where φn is specified but w(x) 6= δφn (x), then:
Z Z
(Aφn (x) − f )v(x) dx = (Aφn (x) − f )w(x) dx = 0 (9.23)
Ω̄ Ω̄
364 NUMERICAL SOLUTIONS OF BVPS
In B(φn , v) all terms contain both φn and v are included. The additional
expression l(v) is due to integration by parts and contains those terms
that
R only have
e v. It is called the concomitant. We can combine l(v) and
f v dx to obtain: e
B(φn , v) = l(v)
Z
l(v) = f v dx + l(v) (9.25)
e
Ω̄
This method is called the Galerkin method with weak form (GM/WF)
(v = δφn ) and the integral form (9.25) is called the weak form of (9.17).
The reason for transferring differentiation from φn to v in (9.17) is to
ensure that each term of the integrand of B(φn , v) contains equal orders
of differentiation
R of φn and v. We only perform integration by parts for
those terms in (Aφn )v dx that yield this. Thus, integration by parts is
Ω̄
performed on those terms that contain even order derivatives of φn . In
such terms, after integration by parts, φn and v are interchangeable in
the integrand in GM/WF in which v = δφn .
(4) We note that the integrals over Ω̄ are definite integrals, hence pro-
duce numbers after the limits are substituted. Such integralsR are called
functionals. Thus, (9.13), (9.17), B(φn , v), l(v), l(v), and Ω̄ f v dx are
all functionals. In GM, PGM, and WRM also e we can write (9.13) as
B(φn , v) = l(v) in which:
Z Z
B(φn , v) = (Aφn )v dx and l(v) = f v dx (9.26)
Ω̄ Ω̄
(5) The domain of definition Ω of the BVP is not discretized, hence GM,
PGM, WRM, and GM/WF considered here are often referred to as clas-
sical methods of approximation.
(6) These methods, as we have seen here, are rather simple and straight-
forward in principle. The major difficulty lies in the selection of ψi (x);
9.2. INTEGRAL FORMS 365
(8) We note that in GM/WF, [K] is symmetric when the differential oper-
ator A contains only even order derivatives.
or
E = [k]{c} − f ; E T = [c]{k} − f (9.28)
e e
in which
ki = Aψi ; i = 1, 2, . . . , n
(9.29)
f = −f + Aψ0
e
The residual functional I is given by:
Z Z Z
2 T
I = E dx = E E dx = [c]{k} − f [k]{c} − f dx (9.30)
e e
Ω̄ Ω̄ Ω̄
or Z
I= [c]{k}[k]{c} − f [k]{c} − f [c]{k} + f 2 dx (9.31)
e e e
Ω̄
Hence, we have:
Z Z
{k}[k]dx {c} = f dx (9.34)
e
Ω̄ Ω̄
or
[K]{C} = {F } (9.35)
in which
Z
Kij = (Aψi Aψj )dx
Ω̄
; i, j = 1, 2, . . . , n (9.36)
Fi = (f − Aψ0 )Aψi
(2) Here also we have the same problems associated with the choice of ψi (x)
as described in Section 9.2.1, hence its usefulness for practical applica-
tions is extremely limited.
(3) If we have more than one PDE, then we have a residual function for each
PDE, Ei ; i = 1, 2, . . . , m, and the residual functional is defined as:
m m
Z
(Ei )2 dx
P P
I= Ii = (9.37)
i=1 i=1
Ω̄
The details for each Ii follow what has been described for a single residual
function.
A, E
P
Ω̄
P
y
Ω̄T Ω̄e
1 2 3 x
y xe xe+1
P x
1 2 3 4 a typical element e
y
Ω̄T Ω̄e
1 2 3 x
y xe xe 1 xe+2
P x
1 2 3 4 a typical element e
t, E, ν
Ω̄ σx
Ω̄e
σx
a typical element e
x
y
Ω̄T
Ω̄e
σx
a typical element e
x
y
Ω̄T
Ω̄e
σx
a typical element e
x
in which Ω̄e = Ωe ∪Γe is the domain of an element with its closed boundary
Γe (see Figures 9.1 and 9.2). Let φh (x) be the approximation of φ over Ω̄T ,
then:
φh (x) = ∪φeh (x) (9.39)
e
in which φeh (x)is the approximation of φ(x) over an element e with domain
Ω̄e , called the local approximation of φ.
The test function v = δφh for GM, GM/WF and v(x) = w(x) 6= δφh for
in PGM, WRM. Since the definite integral in (9.40) is a functional, we can
write this as a sum of the integrals over the elements.
Z XZ
(Aφh − f )v dx = (Aφeh − f )v dx = 0 (9.41)
e
Ω̄T Ω̄e
or
B e (φeh , v) − le (v) = 0
P
(9.42)
e
Z
e
B (φeh , v) = (Aφeh )v dx
Ω̄e
Z (9.43)
e
l (v) = f v dx
Ω̄e
1 2
(p=1)
1 2 3
(a) Using two-node linear element
1 2
(p=2)
1 2 3 4 5
(b) Using three-node quadratic element
and two in Figure 9.3(a), the degrees of freedom are {δ 1 } and {δ 2 }. Using
Lagrange interpolating polynomials (Chapter 5), we can easily define local
approximations φ1h and φ2h for elements one and two. First, the elements are
mapped into ξ-space, i.e., Ω̄e → Ω̄ξ = [−1, 1] using (for an element):
1−ξ 1+ξ
x(ξ) = xe + xn (9.44)
2 2
For elements one and two of Figure 9.3(a), we have (e, n) = (1, 2) and (2, 3),
whereas for elements one and two of Figure 9.3(b) we have (e, n) = (1, 3)
and (3, 5). The mapping (9.44) is a linear stretch in both cases. The local
approximations φ1h (ξ) and φ2h (ξ) can now be established in ξ-space using
Lagrange interpolation and {δ 1 } and {δ 2 }.
In the case of Figure 9.3(a), the functions N1 (ξ) and N2 (ξ) for both φ1h (ξ)
and φ2h (ξ) are:
1−ξ 1+ξ
N1 (ξ) = ; N2 (ξ) = (9.46)
2 2
For Figure 9.3(b) we have N1 (ξ), N2 (ξ), and N3 (ξ).
ξ(ξ − 1) ξ(ξ + 1)
N1 (ξ) = ; N2 (ξ) = 1 − ξ 2 ; N3 (ξ) = (9.47)
2 2
Thus, we could write (9.45) as:
n
φeh (ξ) = Ni (ξ)δie
P
(9.48)
i=1
v = wj ; j = 1, 2, . . . , n (9.50)
in which
Z1
Z
e
Kij = Ni (ANj ) dx = Ni (ANj )J dξ ; J = he/2
e Ω̄ −1
i, j = 1, 2, . . . , n
Z1
fie =
f Ni J dξ
−1
(9.53)
Thus, for elements (1) and (2) of Figure 9.3(a) we have:
Z 1 1
1
1 K11 K12 φ1 f1
(Aφh )v dx = 1 1 − ; element one (9.54)
K21 K22 φ2 f21
Ω̄1
2 K2 f12
Z
K11 φ2
(Aφ2h )v dx = 12
2 K2 − ; element two (9.55)
K21 22 φ3 f22
Ω̄2
These are called the element equations.
For the two-element discretization Ω̄T we have:
X2 Z
(Aφeh − f )v dx = 0 (9.56)
e=1 T
Ω̄
Equations (9.54) and (9.55) must be substituted into (9.56) to obtain their
sum. Since the degrees of freedom for elements are different, the summation
372 NUMERICAL SOLUTIONS OF BVPS
process, or assembly, of the element equations requires care. From the dis-
cretization shown in Figure 9.3(a) and the dofs at the nodes or grid points,
we know that (9.56) will yield:
[K]{δ} = {F } (9.57)
Remarks.
(1) The assembly process remains the same for Figure 9.3(b), except that
in this case the element matrices and vectors are (3 × 3) and (3 × 1) and
the assemble [K] and {F } are (5 × 5) and (5 × 1).
integrand that contain even order derivatives of φeh , we transfer half of the
differentiation to v. By doing so, we can make the order of differentiation on
φeh and v in these terms the same. This results in a symmetric coefficient ma-
trix for the element corresponding to these terms. The integration by parts
results in boundary terms or boundary integrals, called the concomitant.
Thus, in this process we have:
Z Z Z
e e
(Aφh − f )v dx = (Aφh )v dx − f v dx (9.59)
Ω̄e Ω̄e Ω̄e
9.3. FINITE ELEMENT METHOD FOR BVPS 373
or Z Z
(Aφeh − f )v dx = B e
(φeh , v) e
− l (v) − f v dx (9.60)
e
Ω̄e Ω̄e
We note that (9.60) is due to (9.59) after integration by parts. This is
referred to as weak form of (9.59) due to the fact that it contains lower
order derivatives of φeh compared to the BVP. B e (φeh , v) contains only those
terms that contain both φeh and v. The concomitant le (v) only contains the
terms that resulting from integration by parts thatehave v (and not φeh ).
After substituting for φeh and v = δφeh = Nj ; j = 1, 2, . . . , n, we obtain:
Z
(Aφeh − f )v dx = [K e ]{δ e } − {P e } − {f e } (9.61)
Ω̄e
The vector {P e } is due to the concomitant and is called the vector of sec-
ondary variables. The assembly process for [K e ] and {f e } follows Section
9.3.1.1. Assembly for {P e }, giving {P }, is the same as that for {f e }.
Remarks.
(1) When the differential operator A contains only even order derivatives,
[K e ] and [K] are assured to be symmetric. This is not true in GM,
PGM, and WRM.
We shall see that NBC are naturally satisfied or absorbed while EBC
need to be specified or imposed on the assembled equations to ensure
uniqueness of the solution.
In R2 the concomitant is a contour integral over closed contour Γe of an
element e. In R3 the concomitant is a surface integral. Simplification of
concomitant in R2 and R3 requires that we split the integral over Γe into
integral over Γe1 , Γe2 on which EBC and NBC are specified. For specific
details on these see reference [49].
or
δI e = [K e ]{δ e } − {f e } (9.70)
9.3. FINITE ELEMENT METHOD FOR BVPS 375
in which Z
e
Kij = (ANi )(ANj ) dx
Ω̄e
Z i, j = 1, 2, . . . , n (9.71)
fie = f (ANi ) dx
Ω̄e
[K]{δ} = {F } (9.72)
[K] = [K e ] ; {δ} = ∪{δ e } ; {f e }
P P
{F } = (9.73)
e e e
Remarks.
(2) Surana, et al. [49] have shown that [K e ] and [K] can also be made
symmetric when A is nonlinear.
dα
+ φ = f (x) ∀x ∈ Ω (9.75)
dx
dφ
α= (9.76)
dx
in dependent variables φ and α. α is called auxiliary variable and equa-
tion (9.76) is called auxiliary equation. Using the same approach a higher
oder ODE in φ can be reduced to a system of first order ODEs.
d2 T
+ T = f (x) ∀x ∈ (0, 1) = Ω ⊂ R1 (9.77)
dx2
with boundary conditions
AT = f (x) ∀x ∈ Ω
d2 (9.79)
A= +1
dx2
Since A contains even order derivatives, GM/WF is ideally suited for de-
signing finite element process for (9.77). Let Ω̄T = ∪Ω̄e be discretization of
e
Ω̄ = [0, 1] in which Ω̄e = [xe , xe+1 ] is an element e.
Let Th be approximation of T over Ω̄T and The be approximation of T over
Ω̄e , then
Th = ∪The (9.80)
e
9.3. FINITE ELEMENT METHOD FOR BVPS 377
we consider Z
(ATh − f (x))v(x) dx = 0 ; v = δTh (9.81)
Ω̄T
or
XZ
(AThe − f )v(x) dx = 0 ; v = δThe (9.82)
e
Ω̄e
consider
d2 The
Z Z Z
(AThe − f )v(x) dx = + Th
e
v(x) dx − f v dx (9.83)
dx2
Ω̄e Ω̄e Ω̄e
In which e xe+1
dTh
< AThe , v >Γe = v(x) (9.85)
dx xe
is the concomitant resulting due to integration by parts. In this case since
we have an ODE in R1 , the concomitant consists of boundary terms. From
(9.85), we find that
• T is PV and T = T0 (given) on some boundary Γ∗1 is EBC.
Let
dThe dThe
= −P2e and = P1e (9.87)
dx xe+1 dx xe
dv dThe
Z Z
e
(ATh , v)Ωe = − + Th v dx − f v dx − v(xe )P1e − v(xe+1 )P2e
e
dx dx
Ω̄e Ω̄e
(9.89)
or
(AThe , v)Ωe = B e (The , v) − le (v) (9.90)
in which
dv dThe
Z
e
B (The , v) = − + The v dx (9.91)
dx dx
Ω̄e
Z
le (v) = f v dx + v(xe )P1e + v(xe+1 )P2e (9.92)
Ω̄e
B e (The , v) = B e (v, The ) i.e. interchanging the roles of The and v does not
change B e (·, ·), hence B e (·, ·) is symmetric. (9.90) is the weak form of the
integral from (9.83).
Consider a five element uniform discretization using two node linear element
1 1 2 2 3 3 4 4 5 5 6
x
T1 T2 T3 T4 T5 T6
Figure 9.4: A five element uniform discretization using two node elements
1 2
ξ
ξ = −1 Ω̄ξ ξ = +1
in which δ1e and δ2e are nodal degrees of freedom for nodes 1 and 2 (local
node numbers) of element e. Mapping of points is defined by
1−ξ 1+ξ
x(ξ) = xe + xe+1 (9.94)
2 2
Hence,
dx
dx = dξ = Jdξ (9.95)
dξ
Where
d 1−ξ d 1+ξ xe+1 − xe he
J= xe + xe+1 = = (9.96)
dξ 2 dξ 2 2 2
dv dThe
Z
e e e
B (Th , v) = − + Th v dx
dx dx
Ω̄e
2 2
Z ! ! !
dNj X dNi X
= − δie + Ni δie Nj dx
dx dx
i=1 i=1
Ω̄e
2 2
Z ! ! !
1 dNj X 1 dNi e X
= − δ + Ni δie Nj Jdξ
J dξ J dξ i
i=1 i=1
Ω̄e
Z2 2
!! Z+1 X
2
!
2 dNj X dNi he
= − δie dξ + Ni δie Nj dξ
he dξ dξ 2
i=1 i=1 −1 i=1
1 e e 2 e e
= [ K ]{δ } + [ K ]{δ }
(9.100)
380 NUMERICAL SOLUTIONS OF BVPS
in which
Z+1
1e 2 dNi dNj
Kij =− dξ ; i, j = 1, 2 (9.101)
he dξ dξ
−1
Z+1
2e he
Kij = Ni Nj dξ ; i, j = 1, 2 (9.102)
2
−1
Z+1
l (N1 ) = f (ξ)N1 J dξ + N1 (−1)P1e + N1 (1)P2e
e
(9.105)
−1
for j = 2
Z+1
le (N2 ) = f (ξ)N2 J dξ + N2 (−1)P1e + N2 (1)P2e (9.106)
−1
Since
N1 (−1) = 1 , N1 (1) = 0
(9.107)
N2 (−1) = 0 , N2 (1) = 1
We can write
le (v) = {F e } + {P e }
Z+1
Fie = f (ξ)Ni J dξ ; i = 1, 2 (9.108)
−1
{P } = [P1e
e T
P2e ]
in which
e −4.93333 5.03333
[K ] = ; e = 1, 2, . . . , 5 (9.118)
5.03333 −4.93333
382 NUMERICAL SOLUTIONS OF BVPS
T1 T6 T2 T3 T4 T5
−4.933 0 5.033 0 0 0 T
1
0 −4.933 0 0 0 5.033 T6
(−4.933
5.033 0 5.033 0 0 T2
− 4.933)
(−4.933
[K] = 0 0 5.033 5.033 0 T3
− 4.933)
(−4.933
0 0 0 5.033 5.033 T4
− 4.933)
(−4.933
0 5.033 0 0 5.033 T5
− 4.933)
(9.121)
0.88889 × 10−4
−1
0.819711 × 10
−4 −2
0.31111 × 10 + 0.20889 × 10
{F } = (9.122)
0.39111 × 10−2 + 0.10489 × 10−1
0.15511 × 10−1 + 0.30089 × 10−1
0.39911 × 10−1 + 0.65689 × 10−1
{δ}T = T1 T6 T2 T3 T4 T5
(9.123)
9.3. FINITE ELEMENT METHOD FOR BVPS 383
P21 + P12 = P2 = 0
P22 + P13 = P3 = 0
(9.124)
P23 + P14 = P4 = 0
P24 + P15 = P5 = 0
(2) Where primary variables are specified (or given) at a node i.e. when
the essential boundary conditions are given at a node, the sum of the
secondary variables is unknown at that node. Thus P11 = P1 and P25 =
P6 are unknown.
in which
{δ}T1 = T1 T6 = [0.0 − 0.5] ; known
{δ}T2 = T2 T3 T4 T5 ; unknown
{F }T1 = F1 F6 ; {F }T2 = F2 F3 F4 F5
(9.126)
{P }T1 = P1 P6 ; unknown
{P }T2 = P2 P3 P4 P5 ; known
and from the first set of equations in (9.127) we can solve for {P }1 .
First, we solve for {δ}2 using (9.128) and calculate {P }1 using (9.129).
in which
−4.93333 0
[K11 ] =
0 −4.93333
5.03333 0 0 0 (9.130)
[K12 ] =
0 0 0 5.03333
[K21 ] = [K12 ]T
−9.86666 5.03333 0 0
5.03333 −9.86666 5.03333 0
[K22 ] = (9.131)
0 5.03333 −9.86666 5.03333
0 0 5.03333 −9.86666
{F }T1 = 0.88889 × 10−4 0.818711 × 10−1
(9.132)
{F }T2 = 0.212 × 10−2 0.144 × 10−1 0.456 × 10−1 1.056 × 10−1
(9.133)
{[K21 ]{δ}1 }T = 0.0 0.0 0.0 −2.516667
(9.134)
{P }T2 = 0 0 0 0
(9.135)
Thus, using (9.128), we now have using (9.131), (9.133), (9.134) and (9.135)
in (9.128) we can solve for {δ}2 .
0.88889 × 10−4
(9.137)
0.819711 × 10−1
9.3. FINITE ELEMENT METHOD FOR BVPS 385
or
P1 −0.651643
{P }1 = = (9.138)
P6 0.111952
Thus, now {δ}T = [T1 , T2 , . . . , T6 ] is known, hence using local approximation
for each element we can describe T over each element domain Ω̄ξ .
1−ξ 1+ξ
T (ξ) = Te + Te+1 ; e = 1, 2, . . . , 5 (9.139)
2 2
node x T
1 0.0 0.0
2 0.2 -0.12945
3 0.46 -0.25328
4 0.6 -0.36418
5 0.8 -0.45155
6 1.0 -0.5
e
Table 9.2: dTh/dx versus x (example 9.1)
1 0.0 -0.64725
1
2 0.2 -0.64725
2 0.2 -0.6195
2
3 0.4 -0.6195
3 0.4 -0.5545
3
4 0.5 -0.5545
4 0.6 -0.43685
4
5 0.8 -0.43685
5 0.8 -0.24225
5
6 1.0 -0.24225
Table 9.2 gives dThe/dx values calculated at each of the five elements of the
discretization.
0
-0.1
Temperature Th
e
-0.2
-0.3
-0.4
-0.5
0 0.2 0.4 0.6 0.8 1
x
-0.2
-0.25
-0.3
-0.35
-0.4
d(Th)/dx
-0.45
e
-0.5
-0.55
-0.6
-0.65
-0.7
0 0.2 0.4 0.6 0.8 1
x
Figures 9.7 and 9.8 show plots of T versus x and dT/dx versus x. Since The is
of class C 0 (Ω̄e ), we observe inter element discontinuity of dThe/dx at the inter
element boundaries. Upon mesh refinement the jumps in dThe/dx diminishes
at the inter element boundaries.
(9.144)
388 NUMERICAL SOLUTIONS OF BVPS
0.88889 × 10−4
−4 + 0.20889 × 10−2
0.31111 × 10
−4 −1
0.39111 × 10 + 0.10489 × 10
{F } = −1 −1 (9.145)
0.15511 × 10 + 0.30089 × 10
0.39911 × 10 + 0.65689 × 10−1
−1
0.819711 × 10 −1
Using the rules for defining the sum of the secondary variables in example
(9.1), we have
P2 = 0 , P3 = 0 , P4 = 0 , P5 = 0 , P6 = −20 (9.146)
and P1 is unknown.
Due to EBC, we have
are unknown.
Assembled equations (9.120) with [K], {F } and {P } defined by (9.144) –
(9.147) can be written in partitioned form
[K11 ] [K12 ] {δ}1 {F }1 {P }1
= + (9.148)
[K21 ] [K22 ] {δ}2 {F }2 {P }2
in which
{δ}T1 = {0.0} ; known
{δ}T2 = T2 T3 T4 T5 T6 ; unknown
First, we solve for {δ}2 using (9.151) and then calculate {P }1 using (9.152).
in which
[K11 ] = [−4.93333]
[K12 ] = 5.03333 0.0 0.0 0.0 0.0 (9.153)
[K21 ] = [K12 ]T
−9.86666 5.03333 0 0 0
5.03333 −9.86666 5.03333 0 0
[K22 ] =
0 5.03333 −9.86666 5.03333 0
(9.154)
0 0 5.03333 −9.86666 5.03333
0 0 0 5.03333 −4.93333
{F }T1 = {0.88889 × 10−4 } (9.155)
{P }1 = P1 = [−4.93333]{0.0}+
7.2469
14.206
5.03333 0.0 0.0 0.0 0.0 20.604 −
26.192
30.761
dT 1 dT (ξ) 1 he
= = (Te+1 − Te ) ; J = e = 1, 2, . . . , 5 (9.163)
dx J dξ he 2
node x T
1 0.0 0.0
2 0.2 7.2469
3 0.46 14.206
4 0.6 20.604
5 0.8 26.192
6 1.0 30.761
e
Table 9.4: dTh/dx versus x (example 9.2)
30
25
Temperature Th
e
20
15
10
0
0 0.2 0.4 0.6 0.8 1
x
40
35
d(Th)/dx
30
e
25
20
0 0.2 0.4 0.6 0.8 1
x
Figures 9.9 and 9.10 show plots of T versus x and dT/dx versus x. Since The is
of class C 0 (Ω̄e ), we observe inter element discontinuity of dThe/dx at the inter
element boundaries. Upon mesh refinement the jumps in dThe/dx diminishes
at the inter element boundaries.
d2 u
+ λu = 0 ∀x ∈ (0, 1) = Ω ⊂ R1 (9.164)
dx2
BCs: u(0) = 0 , u(1) = 0 (9.165)
we wish to determine the eigenvalue λ and the corresponding eigenvectors
using finite element method.
We can write (9.164) as
Au = 0 ∀x ∈ Ω
d2 (9.166)
A= +λ
dx2
Since the operator A has even order derivatives, we consider GM/WF. Let
Ω̄T = ∪Ω̄e be discretization of Ω̄ = [0, 1] in which Ω̄e is an element. We
e
consider GM/WF over an element Ω̄e = Ωe ∪ Γe ; Γe is boundary of Ω̄e
consisting of xe and xe+1 end points.
Let uh be approximation of u over Ω̄T such that
uh = ∪ueh (9.167)
e
in which ueh is the local approximation of u over Ω̄e . Consider integral form
over Ω̄e = [xe , xe+1 ]
d2 ueh
Z Z
(Aueh )v dx = e
+ λuh v dx = 0 ; v = δueh (9.168)
dx2
Ω̄e Ω̄e
Consider a two node linear element Ω̄e with degrees of freedom δ1e and δ2e at
the nodes, then
Let
1−ξ 1+ξ
ueh = δ1e + δ2e = N1 (ξ)δ1e + N2 (ξ)δ2e (9.170)
2 2
and
v = δueh = Nj ; j = 1, 2 (9.171)
First, concomitant in (9.169)
dueh dueh
< Aueh , v >Γe = v(xe+1 ) − v(xe )
dx xe+1 dx xe
Let (9.172)
dueh dueh
− = P1e , = P2e
dx xe dx xe+1
Then,
dv dueh
Z Z
e
(Auh )v(x) dx = − + λuh v dx+v(xe+1 )P2e +v(xe )P1e (9.173)
e
dx dx
Ω̄e Ω̄e
2 2
Z Z ! ! !
dNj X dNi X
(Aueh )v(x) dx = − δie +λ Ni δie Nj dx+
dx dx
i=1 i=1
Ω̄e Ω̄e
Nj (xe+1 )P2e + Nj (xe )P1e (9.174)
Z+1 Z+1
e 2 dNi dNj λhe
Kij =− dξ + Ni Nj dξ (9.176)
he dξ dξ 2
−1 −1
394 NUMERICAL SOLUTIONS OF BVPS
1 1 2 2 3 3 4 4 5
x
u1 u2 u3 u4 u5
Figure 9.11: A four element uniform discretization using two node elements
Thus, for each element of the discretization of Figure 9.11 we can write
Z e
e
1 e 2 e
ue P1
(Auh )v dx = [ K ] + λ[ K ] + ; e = 1, 2, . . . , 4 (9.178)
ue+1 P2e
Ω̄e
1 1
−4 4
[1K e ] = ; [2K e ] = 12
1
24
1 (9.179)
4 −4 24 12
Assembly of the element equations can be written as
Z 4 Z
X 4 4 4
X X X
(Aueh )v dx 1 e
[ K ] {δ}+ {P e } = {0}
2 e
(Auh )v dx = = [ K ]+λ
e=1 e=1 e=1 e=1
Ω̄T Ω̄e
(9.180)
{δ} = ∪{δ e }
e
or
[K]{δ} = −{P }
" 4 4
#
X X
[K] = [1K e ] + λ [2K e ]
e=1 e=1
(9.181)
= [1K] + λ[2K]
4
X
and {P } = {P e }
e=1
{δ}T = u1 u5 u2 u3 u4
(9.182)
9.3. FINITE ELEMENT METHOD FOR BVPS 395
so that known u1 and u5 are together, hence the assembled equations will
be in partitioned form. Thus, we label rows and columns of [K] as u1 , u2 ,
u3 , u4 and u5 . Assembled [1K e ], [2K e ] and {P e } are shown in the following.
u1 u5 u2 u3 u4
−4 0 4 0 0 u
1
0 −4 0 0 4 u5
(−4
4 0 4 0 u2
1 e − 4)
[K ]= (9.183)
(−4
0 0 4 4 u3
− 4)
(−4
0 4 0 4 u4
− 4)
u1 u5 u2 u3 u4
1 1
12
0 24
0 0 u
1
1 1
0 12
0 0 24 u5
( 1/12
1 1
0 0 u2
24 24
2 e + 1/12)
[K ]= (9.184)
( 1/12
1 1
0 0 u3
24 + 1/12) 24
( 1/12
1 1
0 24
0 24
u4
+ 1/12)
P11
P1
4
P P5
2
{P } = P21 + P12 = P2 (9.185)
P 2 + P13 P3
23
P2 + P1 4
P4
using (9.185) – (9.187) in (9.188), we obtain the following from the second
set of partitioned equations
1
[ K22 ] + λ[2K22 ] {δ}2 = {0}
(9.191)
1 −4 0
[ K11 ] =
0 −4
1 400
[ K12 ] =
004
[1K21 ] = [1K12 ]T
−8 4 0
[1K22 ] = 4 −8 4
0 4 −8
" #
1 (9.192)
2 3 0
[ K11 ] =
0 13
" #
1
6 0 0
[2K12 ] =
0 0 16
[2K21 ] = [2K12 ]T
2 1
3 6 0
2 1 2 1
[ K22 ] = 6 3 6
0 16 23
Using [1K22 ] and [2K22 ] from (9.192) in (9.191) and changing sign throughout
2 1
8 −4 0
3 6 0 u2
−4 8 −4 − λ 16 2 1
u3 = {0} (9.193)
3 6
0 −4 8 u4
1 2
0 6 3
1 2 3 4 5
x
x1 x2 x3 x4 x5
h h h h
x=0 x=L
(b) Express the derivatives in the differential equation describing the BVP
in terms of their finite difference approximations using Taylor series ex-
pansions about the nodes in [0, L] and substitute these in the differential
equation.
(c) We also do the same for the derivative boundary conditions if there are
any.
(d) As far as possible we use finite difference expressions for the various
derivatives that have truncation error of the same order so that the
order of the truncation error in the solution is clearly defined. In case of
using finite difference expressions for the derivatives that have truncation
errors of different orders, it is the lowest order truncation error that
controls the order of the truncation error in the numerically computed
solution.
(e) In step (b), the differential equation is converted into a system of alge-
braic equations. Arrange the final equations resulting in (b) in matrix
form and solve for the numerical values of the unknown dependent vari-
ables at the grid or node points.
T1 T2 T3 T4 T5 T6
T1 = T (0) = 0 T6 = T (1) = −0.5
(BC) (BC)
i−1 i i+1
0.2 0.2
Using
Ti = T (xi ) ; i = 1, 2, . . . with h = 0.2 (9.197)
we can write:
d2 T Ti+1 − 2Ti + Ti−1 Ti+1 − 2Ti + Ti−1
= = (9.198)
dx2 x=xi h2 (0.2)2
Consider (9.195) at node i.
d2 T
+ T |x=i = x3i (9.199)
dx2 x=xi
d2 T
Substituting for dx2 x=x
from (9.198) into (9.199):
i
or
−49 25 0 0 T2
0.008
25 −49 25 0
T3 = 0.064
(9.203)
0 25 −49 25 T 0.216
4
0 0 25 −49 T5 13.012
T2 = −0.128947, T3 = −0.252416
(9.204)
T4 = −0.363228, T5 = −0.450872
node x T
1 0.0 0.0
2 0.2 -0.128947
3 0.46 -0.25328
4 0.6 -0.36418
5 0.8 -0.45155
6 1.0 -0.5
9.4. FINITE DIFFERENCE METHOD 401
-0.1
Temperature T
-0.2
-0.3
-0.4
-0.5
0 0.2 0.4 0.6 0.8 1
distance x
Table 9.5 gives values of T at the grid points. Figures 9.15 show plot of
(Ti , xi ); i = 1, 2, . . . , 6.
Remarks.
(1) We note that the coefficient matrix in (9.203) is in tridiagonal form,
hence we can take advantage in storing the coefficients of the matrix as
well as in solution methods for calculating Ti , i = 2, 3, . . . , 5.
(a) The finite difference expression for the derivatives are approximate.
(b) We have only used a finite number of node points (only six) in Ω̄T .
(c) If the number of points in Ω̄T are increased, accuracy of the com-
puted nodal values of T will improve.
(3) Both boundary conditions in (9.196) are function values, i.e., values of
T at the two boundaries x = 0 and x = 1.0.
(4) We only know the solution at the grid or node points. Between the node
points we only know that the solution is continuous and differentiable,
but we do not know what it is. This is not the case in finite element
method.
402 NUMERICAL SOLUTIONS OF BVPS
T1 T2 T3 T4 T5 T6 T7
dT
T1 = 0.0 dx = 20
x1 = 0.0 x6 = 1.0
i−1 i i+1
h h
d2 T
Substituting for dx2 x=x
in (9.208) from (9.207):
i
Since at x = 1.0, dT
dx is given, T at x = 1.0 is not known, hence (9.210) must
also hold at x = 1.0, i.e., at node 6 in addition to i = 2, 3, 4, 5. In order to
satisfy the BC dTdx = 20 at x = 1.0 using a central difference approximation
dT
of dx , we need an additional node 7 (outside the domain) as shown in Figure
9.16. Using a three-node stencil i − 1, i and i + 1, we can write the following
(central difference).
dT dT
= = 20 = 2.5(T7 − T5 ) (9.212)
dx x=x6 dx x=1
or
T7 = T5 + 8.0 (9.213)
Thus T7 is known in terms of T5 . Using (9.210) for i = 2, 3, 4, 5 and 6:
T2 = 7.32918 , T3 = 14.36551 ,
T6 = 31.07091 (9.216)
T4 = 20.82977 , T5 = 26.46949 ,
node x T
1 0.0 0.0
2 0.2 7.32918
3 0.46 14.36551
4 0.6 20.82977
5 0.8 26.46949
6 1.0 31.07091
30
25
Temperature T
20
15
10
0
0 0.2 0.4 0.6 0.8 1
distance x
Table 9.6 gives values of T at the grid points. Figures 9.18 show plot of
(Ti , xi ); i = 1, 2, . . . , 6.
Remarks.
(1) Node points such as point 7 that are outside the domain Ω̄ are called
imaginary points. These are necessary when the derivative (first or sec-
ond) boundary conditions are specified at the boundary points.
(2) This example demonstrates how the function value and derivative
boundary condition (first derivative in this case) are incorporated in
the finite difference solution procedure.
9.4. FINITE DIFFERENCE METHOD 405
(3) The accuracy of the approximation in general is poorer when the deriva-
tive boundary conditions are specified at the boundary points due to the
additional approximation of the boundary condition.
(4) Here also, we only know the solution at the grid or node points. Be-
tween the node points we only know that the solution is continuous and
differentiable, but we do not know what it is. This is not the case in
finite element method.
i−1 i i+1
ui−1 ui ui+1
d2 u
+ λui = 0 (9.220)
dx2 x=xi
Since nodes 1 and 5 have function u specified, at these locations the solution
is known. Thus, (9.222) must only be satisfied at xi ; i = 2, 3, 4.
or
32 −16 0 u2 1 0 0 u2
−16 32 16 u3 − λ 0 1 0 u3 = {0} (9.225)
0 −16 32 u4 001 u4
We can find the eigenpairs of (9.225) using any one of the methods discussed
in Chapter 4, and we obtain:
0.5
(λ1 , {φ}1 ) = 9.37, 0.707
0.5
0.70711
(λ2 , {φ}2 ) = 32.0, 0.0 (9.228)
−0.70711
0.49957
(λ3 , {φ}3 ) = 54.62, −0.70722
0.49957
Ω̄ = Ω ∪ Γ (9.236)
∂2φ ∂2φ
+ 2 = 0 ∀x, y ∈ Ω = (0, L) × (0, M ) (9.237)
∂x2 ∂y
BCs : φ(0, y) = φ(x, 0) = φ(L, y) = 0 , φ(x, M ) = 100 (9.238)
Figure 9.21 shows the domain Ω, its boundary Γ, and the boundary condi-
tions (9.238).
2 ∂2y
In the finite difference method we approximate ∂∂xφ2 and ∂x2 by their finite
difference approximation at a countable number of points in Ω̄. Consider
the following uniform grid (discretization Ω̄T of Ω̄) in which ∆x and ∆y are
spacing in x- and y-directions.
9.4. FINITE DIFFERENCE METHOD 409
φ = 100 ∀x ∈ [0, L] at y = M
y=M
φ=0
φ=0
x
x=0 x=L
φ=0
3
∆y
2
j=1 x
i=1 2 3 4 5
∆x
i, j + 1
j
i − 1, j i, j i + 1, j
i, j − 1
x
i
Figure 9.23: Node(i, j) and five-node stencil
1,4 5,4
2,4 3,4 4,4
φ=0 φ=0
1,3 5,3
2,3 3,3 4,3
1,2 5,2
2,2 3,2 4,2
x
1,1 2,1 3,1 4,1 5,1
φ=0
A A-A is line of symmetry
We note that A − A is a line (or plane) of symmetry. Ω̄, BCs, and the
solution φ are all symmetric about this line, i.e., the left half and right half
of line A − A are reflections of each other. Hence, for interior points we have:
φ4,4 = φ2,4
φ4,3 = φ2,3 (9.242)
φ4,2 = φ2,2
Using (9.244), we can obtain the following for the nine interior grid points
(after substituting the value of φ for boundary nodes):
matrix form:
1 −0.25 0 −0.25 0
0
φ2,2
0
−0.25 1 −0.25 0 −0.25 0
φ 2,3
0
0 −0.25 1 0 0 −0.25
φ 2,4
25
−0.5
= (9.247)
0 0 1 −0.25 0 φ3,2
0
0 −0.5 0 −0.25 1 −0.25 φ3,3 0
0 0 −0.5 0 −0.25 1
φ3,4 25
The solution values in (9.248) can be shown schematically (see Figure 9.25).
We note that this solution holds for any L = M i.e. for all square domains
Ω̄.
A
0.0 0.0
42.86 52.68 42.86
0.0 0.0
18.75 25.00 18.75
0.0 0.0
7.14 9.82 7.14
Consider:
∂2φ ∂2φ
+ 2 = −2 ∀x, y ∈ Ω = (0, L) × (0, L) (9.249)
∂x2 ∂y
In this case Ω is a square domain with side length L. Following the details
for Laplace equation, we can write the following for a grid points (i, j) (using
∆x = ∆y = h).
−4φ2,2 = −2(1)2
(9.252)
∴ φ2,2 = 0.5
2,3 3,3
1,3
L=1
1,2 3,2
2,2
x
1,1 2,1 3,1
h h
L
φ=0
5
4
C φ=0
B 3 φ=0 B
2 L=2
h = 0.5
j=1
φ=0
C i=1 2 3 4 5
A
Figure 9.27: A 25-node discretization of Ω̄T
At node (2,2):
C
2,3 3,3
1,3
fictitious nodes
(or imaginary nodes)
1,2 3,2
2,2
At node (2,3):
At node (3,3):
or in matrix form:
−4 2 0 φ2,2 −0.5
2 −4 1 φ2,3 = −0.5 (9.256)
0 4 −4 φ3,3 −0.5
(3) This method, though crude, is simple for obtaining quick solutions that
give some idea about the solution behavior.
(4) The finite element method of approximation for BVPs has sound math-
ematical foundation. This method eliminates or overcomes many of the
difficulties encountered in the finite difference method.
Problems
9.1 Consider the following 1D BVP in R1
d2 u
= x2 + 4u ∀x ∈ (0, 1) (1)
dx2
with boundary conditions:
d2 u
= x2 + 4u ∀x ∈ (0, 1) (1)
dx2
with boundary conditions:
d2 u du
2
+4 + 2u = x2 ∀x ∈ (0, 2) (1)
dx dx
with boundary conditions:
du
u(0) = 0 , u0 (2) = = 0.6 (2)
dx x=2
solution of (1) with BCs in (2) using a uniform discretization containing five
grid points (i.e. ∆x = 0.5). Plots graph of u, du/dx and d2 u/dx2 versus x.
du/dx and d2 u/dx2 can be calculated using central difference approximation
and forward and backward difference approximations at grid points 1 and 5
for du/dx.
d2 T x
− 1 − T =x ∀x ∈ (1, 3) (1)
dx2 5
with boundary conditions:
d2 T
T (1) = 10 , T 00 (3) = =6 (2)
dx2 x=3
9.5 Consider the same BVP (i.e. (1)) as in problem 9.4, but with the new
BCs :
d2 T x
− 1 − T = x ∀x ∈ (1, 3) (1)
dx2 5
with boundary conditions:
d2 T dT
T 00 (1) = = 2 , T 0 (3) = =1 (2)
dx2 x=1 dx x=3
∂2T ∂2T
+ = 0 ∀(x, y) ∈ (Ωxy ) = Ωx × Ωy (1)
∂x2 ∂y 2
The domain Ω̄xy and the discretization including boundary conditions are
given in the following.
418 NUMERICAL SOLUTIONS OF BVPS
y
∂T
=0
∂y
o
o ∂T ∂T
8 =0, =0
T = 15 7 9
∂x ∂y
5
T = 15 4 5
o ∂T
6 =0
∂x
5
T = 15 1 2 3 x
T = 30 T = 30
10 10
9 10 11 12 φ = 100
φ=5
y=4
φ=5 5 6 7 8 φ = 100
y=2
u(1, y) = 3y
3
u(0, y) = 0
x
u(x, 0) = 0
3
(a) Schematic of Ω̄xy
9
10 11 12
4 5 6
5
6 7 8
1 2 3 1 2 3 4
(b) A nine grid point discretization (c) A sixteen grid point discretiza-
tion
420 NUMERICAL SOLUTIONS OF BVPS
∂2u ∂2u
+ 2 = 0 ∀(x, y) ∈ (Ωxy ) = Ωx × Ωy (1)
∂x2 ∂y
The domain Ω̄xy and the discretization including boundary conditions are
shown in Figure (a) in the following.
y u(x, 1) = x2
u(1, y) = y 2
3
u(0, y) = 0
x
u(x, 0) = 0
3
(a) Schematic of Ω̄xy
7 8 9 13 14 15 16
9
10 11 12
4 5 6
5
6 7 8
1 2 3 1 2 3 4
(b) A nine grid point discretization (c) A sixteen grid point discretiza-
tion
9.5. CONCLUDING REMARKS 421
imations of the derivative in the BVP (1) using central difference method to
construct a system of algebraic equations for (1) using this approximation
of the derivatives. Find numerical values of u5 for discretization in Figure
(b) and the numerical values of u6 , u7 , u10 and u11 for the discretization in
Figure (c).
dφ 1 d2 φ
− = 0 ∀x ∈ (Ωx ) = (0, 1) (1)
dx P e dx2
with boundary conditions:
1 2 3 4 5
x
x=0 x=1
φ1 = 1 φ2 φ3 φ4 φ5 = 0
(a) A five grid point uniform discretization
dφ 1 d2 φ
φ − = 0 ∀x ∈ (Ωx ) = (0, 1) (1)
dx Re dx2
with boundary conditions:
This is a nonlinear BVP, hence the resulting algebraic system will be a system
of nonlinear algebraic equations. Consider finite difference approximation of
the derivatives using central difference method to obtain an algebraic system
for (1). Consider a five grid point uniform discretization shown below.
422 NUMERICAL SOLUTIONS OF BVPS
1 2 3 4 5
x
x=0 x=1
φ1 = 1 φ2 φ3 φ4 φ5 = 0
(a) A five grid point uniform discretization
d2 u
= x2 + 4u ∀x ∈ (0, 1) = Ω (1)
dx2
with boundary conditions:
d2 u
= x2 + 4u ∀x ∈ (0, 1) = Ω (1)
dx2
with boundary conditions:
of (1) over an element Ω̄e of Ω̄T using Galerkin method with weak form
(GM/WF). Assemble element equations and obtain numerical values of u at
the grid points with unknown u using boundary conditions (2). Plot graphs
of u versus x and du/dx versus x. Calculate du/dx using element local approx-
imation. Compare this solution with the solution calculated in problem 9.2.
Also calculate values of the unknown secondary variables (if any).
d2 u du
2
−4 + 2u = x2 ∀x ∈ (0, 2) = Ω (1)
dx dx
with boundary conditions:
d2 T x
− 1 − T =x ∀x ∈ (1, 3) (1)
dx2 5
with boundary conditions:
dφ 1 d2 φ
− = 0 ∀x ∈ (Ωx ) = (0, 1) (1)
dx P e dx2
424 NUMERICAL SOLUTIONS OF BVPS
425
426 NUMERICAL SOLUTION OF INITIAL VALUE PROBLEMS
t=τ
Γ4
BCs Γ1 Γ2 BCs
t=0 x
x=0 ICs Γ3 x=L
When the initial value problem contains two spatial coordinates, we have
space-time slab Ω̄xt shown in Fig. 10.2 in which
t
D1
C1
Γ1 D Γ4 C
Γ2
A1 B1
t=τ L2
Γ3
A B
t=0 x
L1
shown in Fig. 10.3. For an increment of time ∆t, that is for 0 ≤ t ≤ ∆t,
(1)
consider the first space-time strip Ω̄xt = [0, L] × [0, ∆t]. If we are only
interested in the evolution up to time t = ∆t and not beyond t = ∆t, then
the evolution in the space-time domain [0, L] × [∆t, τ ] has not taken place
(1)
yet, hence does not influence the evolution for Ω̄xt , t ∈ [0, ∆t]. We also note
(1)
that for Ω̄xt , the boundary at t = ∆t is open boundary that is similar to
the open boundary at t = τ for the whole space-time domain. We remark
(1)
that BCs and ICs for Ω̄xt and Ω̄xt are identical in the sense of those that
(2)
are known and those that are not known. For Ω̄xt , the second space-time
(1)
strip, the BCs are the same as for Ω̄xt but the ICs at t = ∆t are obtained
(1)
from the computed evolution for Ω̄xt at t = ∆t. Now, with the known ICs
(2)
at t = ∆t, the second space-time strip Ω̄xt is exactly similar to the first
(1) (1)
space-time strip Ω̄xt in terms of BCs, ICs, and open boundary. For Ω̄xt ,
(2)
t = ∆t is open boundary whereas for Ω̄xt , t = 2∆t is open boundary. Both
open boundaries are at final values of time for the corresponding space-time
strips.
t
open boundary
t=τ
Γ4
t = tn + ∆t = tn+1
(n)
Ω̄xt
t = tn
BCs Γ1 Γ2 BCs
(n−1)
ICs from Ω̄xt
t = 2∆t = t3
(2)
Ω̄xt
t = ∆t = t2
(1)
Ω̄xt
t = 0 = t1 x
x=0 ICs Γ3 x=L
Figure 10.3: Space-time domain with 1st , 2nd , and nth space-time strips
In this process the evolution is computed for the first space-time strip
(1)
Ω̄xt = [0, L]×[0, ∆t] and refinements are carried out (in discretization and p-
(1)
levels in the sense of finite element processes) until the evolution for Ω̄xt is a
430 NUMERICAL SOLUTION OF INITIAL VALUE PROBLEMS
(1)
converged solution. Using this converged solution for Ω̄xt , ICs are extracted
(2)
at t = ∆t for Ω̄xt and a converged evolution is computed for the second
(2)
space-time strip Ω̄xt . This process is continued until t = τ is reached.
Remarks.
h(t):
φ(x, t) = g(x)h(t) (10.4)
where g(x) is a known function that satisfies differentiability, continuity,
and the completeness requirements (and others) as dictated by (10.1). We
substitute (10.4) in (10.1) and obtain
Integrating (10.5) over Ω̄x = [0, L] while assuming h(t) and its time deriva-
tives to be constant for an instant of time, we can write
Z
(A (g(x)h(t)) − f (x, t)) dx = 0 (10.6)
Ω̄x
Since g(x) is known, the definite integral in (10.6) can be evaluated, thereby
eliminating g(x), its spatial derivatives (due to operator A), and more specif-
ically spatial coordinate x altogether. Hence, (10.6) reduces to
Remarks.
Ω̄ex that are obtained using interpolation theory irrespective of BCs and
ICs.
(5) In principle, (10.4) holds for all of the methods of approximation listed in
Section 10.2. In all these methods spatial coordinate is eliminated using
(10.4) for discretization in space that may be followed by integration
of A(g(x)h(t)) − f (x, t) over Ω̄Tx depending upon the method chosen.
In doing so the IVP (10.1) reduces to a system of ordinary differential
equations in time which are then integrated simultaneously using explicit
or implicit time integration methods or finite element method in time
after imposing BCs and the ICs of the IVP.
∂φ ∂φ 1 ∂2φ
+ − ∀(x, t) ∈ Ωxt = (0, 1) × (0, τ ) = Ωx × Ωt (10.9)
∂t ∂x P e ∂x2
with some BCs and ICs. Equation (10.9) is a linear partial differential equation
in the dependent variable φ, space coordinate x, and time t. Pe is the
Péclet number. Let
φ(x, t) = g(x)h(t) (10.10)
in which g(x) ∈ V ⊂ H³(Ω̄x). Substituting (10.10) in (10.9)
Let

C_1 = ∫_{Ω̄x} g(x) dx ;   C_2 = ∫_{Ω̄x} (dg(x)/dx) dx ;   C_3 = ∫_{Ω̄x} (d²g(x)/dx²) dx        (10.13)
Let

C_1 = ∫_{Ω̄x} g(x) dx ;   C_2 = ∫_{Ω̄x} g(x) (dg(x)/dx) dx ;   C_3 = ∫_{Ω̄x} (d²g(x)/dx²) dx        (10.19)
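To make the decoupling concrete, the following Python sketch (ours, not the author's program) illustrates the idea for the model problem (10.9) with zero forcing: a hypothetical choice g(x) = sin(πx) is made for the known spatial function, the constants C1, C2, C3 of (10.13) are evaluated by numerical quadrature, and the single ODE in time produced by integrating the residual over Ω̄x, C1 dh/dt + (C2 − C3/Pe) h = 0, is marched with explicit Euler steps. The value of Pe and the IC h(0) = 1 are assumed only for illustration.

```python
# A minimal sketch (not from the text) of the space-time decoupled idea for
# d(phi)/dt + d(phi)/dx - (1/Pe) d2(phi)/dx2 = 0 on (0, 1), with phi(x, t) = g(x) h(t).
# g(x) = sin(pi x) is a hypothetical choice; C1, C2, C3 follow the definitions in (10.13).
import numpy as np

def trap(y, x):
    """Composite trapezoidal rule for the quadratures defining C1, C2, C3."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

Pe = 10.0                               # Peclet number (assumed value)
x = np.linspace(0.0, 1.0, 2001)
g = np.sin(np.pi * x)                   # assumed known spatial function
dg = np.pi * np.cos(np.pi * x)          # dg/dx
d2g = -np.pi ** 2 * np.sin(np.pi * x)   # d2g/dx2

C1, C2, C3 = trap(g, x), trap(dg, x), trap(d2g, x)

# Integrating the residual over the spatial domain gives C1 h'(t) + (C2 - C3/Pe) h(t) = 0,
# i.e. a single ODE in h(t); march it with explicit Euler steps (assumed IC h(0) = 1).
lam = -(C2 - C3 / Pe) / C1
h, dt, t_final, t = 1.0, 1.0e-4, 0.1, 0.0
while t < t_final - 1e-12:
    h += dt * lam * h
    t += dt
print(f"C1 = {C1:.4f}, C2 = {C2:.4f}, C3 = {C3:.4f}, h({t_final}) ≈ {h:.4f}")
```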
in which N_i(x) are local approximation functions and δ_i^e(t) are nodal degrees of freedom
for an element e with spatial domain Ω̄_x^e. Using (10.1) we construct the integral form over
Ω̄_x^T using any of the standard methods of approximation. Let us consider the Galerkin method
with weak form:

(Aφ_h − f, v)_{Ω̄_x^T} = ∫_{Ω̄_x^T} (Aφ_h − f) v(x) dx = 0 ;   v = δφ_h        (10.22)
Remarks.
BCs). In case of (10.30), a first order ODE in time, we need one initial
condition at the commencement of the evolution (at t = t1 or simply t1 = 0).
φ(0) = φ|_{t=0} = φ_0        (10.31)
∫_{t_i}^{t_{i+1}} (dφ/dt) dt = ∫_{t_i}^{t_{i+1}} f(φ, t) dt        (10.32)

or   ∫_{t_i}^{t_{i+1}} dφ = ∫_{t_i}^{t_{i+1}} f(φ, t) dt        (10.33)

or   φ|_{t_{i+1}} − φ|_{t_i} = ∫_{t_i}^{t_{i+1}} f(φ, t) dt        (10.34)

or   φ_{i+1} = φ_i + ∫_{t_i}^{t_{i+1}} f(φ, t) dt        (10.35)

where   φ|_{t_{i+1}} = φ_{i+1}   and   φ|_{t_i} = φ_i        (10.36)
(Figure: f(φ, t) over the interval [t_i, t_{i+1}]; in Euler's method the integral of f(φ, t) over [t_i, t_{i+1}] is approximated by the rectangle of area h f(φ_i, t_i).)
That is,

∫_{t_i}^{t_{i+1}} f(φ, t) dt ≅ h f(φ_i, t_i)        (10.37)

so that

φ_{i+1} = φ_i + h f(φ_i, t_i)        (10.38)

The error made in doing so is illustrated by the empty area bounded by the dotted line, which
is neglected in the approximation (10.37) and hence in (10.38). In Euler's method we begin with
i = 0 and φ_0 defined using the initial condition at time t = 0, and march the solution in time
using (10.38). It is obvious that smaller values of h = ∆t will yield more accurate results.
Euler's method is one of the simplest and crudest approximation techniques for ODEs in time.
It is called an explicit method because the solution at the new value of time is explicitly
expressed in terms of the solution at the current value of time; thus, computing the solution
at the new value of time is simply a matter of substitution in (10.38). By using more accurate
approximations of ∫_{t_i}^{t_{i+1}} f(φ, t) dt we can devise methods that yield a more accurate
numerical solution φ(t) of (10.30).
dφ/dt − t − φ = 0   for t > 0        (10.39)

IC:  φ(0) = 1        (10.40)
We consider numerical solution of (10.39) with IC (10.40) using ∆t = 0.2,
0.1 and 0.05 for 0 ≤ t ≤ 1.
We rewrite (10.39) in the standard form (10.30).
dφ/dt = t + φ = f(φ, t)        (10.41)
Thus using (10.38) for (10.41), we have
Table 10.1: Euler's method solution of (10.41) for h = 0.2, 0 ≤ t ≤ 1 (columns: step number, time t, function value φ, time derivative dφ/dt|_i = f(φ_i, t_i))

Table 10.2: Euler's method solution of (10.41) for h = 0.1, 0 ≤ t ≤ 1 (same columns)
From Tables 10.1 and 10.2 we note that even for h = 0.2 and 0.1, rather large time increments,
the values of φ and dφ/dt are quite reasonable. Plots of φ and dφ/dt versus t, shown in
Figures 10.6 and 10.7, illustrate this.
Figure 10.6: φ versus time t (h = 0.2 and h = 0.1)

Figure 10.7: dφ/dt versus time t (h = 0.2 and h = 0.1)
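The tabulated values and the curves in Figures 10.6 and 10.7 can be regenerated with the short Python sketch below (ours, not the author's program). It marches (10.38) for f(φ, t) = t + φ with φ(0) = 1; the exact solution φ(t) = 2e^t − t − 1 is printed only for comparison.

```python
# Minimal sketch of Euler's method, phi_{i+1} = phi_i + h f(phi_i, t_i),
# applied to d(phi)/dt = t + phi, phi(0) = 1 (the example above).
import math

def euler(f, phi0, t0, t_end, h):
    """Return lists of time values and Euler approximations of phi."""
    ts, phis = [t0], [phi0]
    t, phi = t0, phi0
    while t < t_end - 1.0e-12:
        phi += h * f(phi, t)        # explicit update (10.38)
        t += h
        ts.append(t)
        phis.append(phi)
    return ts, phis

f = lambda phi, t: t + phi
exact = 2.0 * math.exp(1.0) - 1.0 - 1.0     # phi(1) from phi(t) = 2 e^t - t - 1
for h in (0.2, 0.1, 0.05):
    ts, phis = euler(f, 1.0, 0.0, 1.0, h)
    print(f"h = {h:<4}  phi(1) = {phis[-1]:.5f}   exact phi(1) = {exact:.5f}")
```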
Recall (10.35):

φ_{i+1} = φ_i + ∫_{t_i}^{t_{i+1}} f(φ, t) dt        (10.43)

∫_{t_i}^{t_{i+1}} f(φ, t) dt = h(a_1 k_1 + a_2 k_2 + ... + a_n k_n)        (10.44)

k_1 = f(φ_i, t_i)
k_2 = f(φ_i + q_{11} k_1 h, t_i + p_1 h)
k_3 = f(φ_i + q_{21} k_1 h + q_{22} k_2 h, t_i + p_2 h)
k_4 = f(φ_i + q_{31} k_1 h + q_{32} k_2 h + q_{33} k_3 h, t_i + p_3 h)        (10.46)
⋮
k_n = f(φ_i + q_{n−1,1} k_1 h + q_{n−1,2} k_2 h + ... + q_{n−1,n−1} k_{n−1} h, t_i + p_{n−1} h)
φ_{i+1} = φ_i + f(φ_i, t_i) h + f′(φ_i, t_i) h²/2!        (10.50)

where   f′(φ, t) = ∂f(φ, t)/∂t + (∂f(φ, t)/∂φ)(∂φ/∂t)        (10.51)

∴  φ_{i+1} = φ_i + f(φ_i, t_i) h + [ ∂f(φ, t)/∂t + (∂f(φ, t)/∂φ)(dφ/dt) ]_{t_i} h²/2!        (10.52)

But dφ/dt = f(φ, t), hence (10.52) becomes

φ_{i+1} = φ_i + f(φ_i, t_i) h + [ ∂f(φ_i, t_i)/∂t + (∂f(φ_i, t_i)/∂φ) f(φ_i, t_i) ] h²/2!        (10.53)
Consider the Taylor series expansion of f(·) in (10.49), using it as a function of two variables as in

g(x + u, y + v) = g(x, y) + (∂g/∂x) u + (∂g/∂y) v + ...        (10.54)

∴  k_2 = f(φ_i + q_{11} k_1 h, t_i + p_1 h) = f(φ_i, t_i) + p_1 h (∂f/∂t) + q_{11} k_1 h (∂f/∂φ) + O(h²)        (10.55)

Substituting from (10.55) and (10.48) into (10.47),

φ_{i+1} = φ_i + a_1 h f(φ_i, t_i) + a_2 h f(φ_i, t_i) + a_2 p_1 h² (∂f/∂t) + a_2 q_{11} k_1 h² (∂f/∂φ) + O(h³)        (10.56)

Rearranging terms in (10.56),

φ_{i+1} = φ_i + (a_1 + a_2) h f(φ_i, t_i) + a_2 p_1 h² (∂f/∂t) + a_2 q_{11} k_1 h² (∂f/∂φ) + O(h³)        (10.57)
For φ_{i+1} in (10.53) and (10.57) to be equivalent, the following must hold:

a_1 + a_2 = 1
a_2 p_1 = 1/2        (10.58)
a_2 q_{11} = 1/2

Equations (10.58) are three equations in four unknowns (a_1, a_2, p_1, and q_{11}), hence they
do not have a unique solution. There are infinitely many solutions, so an arbitrary choice must
be made at this point.
φ_{i+1} = φ_i + h k_2

where:

k_1 = f(φ_i, t_i)        (10.60)
k_2 = f(φ_i + k_1 h/2, t_i + h/2)
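A sketch (ours) of this particular choice, a_1 = 0, a_2 = 1, p_1 = q_{11} = 1/2 (the midpoint rule), applied to the same f(φ, t) = t + φ used in the Euler example; the step size, IC, and printed values are only illustrative.

```python
# Second order Runge-Kutta step with a1 = 0, a2 = 1, p1 = q11 = 1/2 (midpoint rule):
# phi_{i+1} = phi_i + h k2, k1 = f(phi_i, t_i), k2 = f(phi_i + k1 h/2, t_i + h/2).
def rk2_midpoint_step(f, phi, t, h):
    k1 = f(phi, t)
    k2 = f(phi + 0.5 * h * k1, t + 0.5 * h)
    return phi + h * k2

f = lambda phi, t: t + phi        # model problem d(phi)/dt = t + phi
phi, t, h = 1.0, 0.0, 0.2         # assumed IC phi(0) = 1
for _ in range(5):                # march to t = 1
    phi = rk2_midpoint_step(f, phi, t, h)
    t += h
print(f"phi({t:.1f}) ≈ {phi:.5f}")
```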
dφ/dt = f(φ, t)   ∀t ∈ (t_1, t_2) = (0, τ) = Ω_t        (10.61)

Then
φ_{i+1} = φ_i + h(a_1 k_1 + a_2 k_2 + a_3 k_3)        (10.62)

where
a_1 = 1/6 ,   a_2 = 4/6 ,   a_3 = 1/6        (10.63)

and
k_1 = f(φ_i, t_i)
k_2 = f(φ_i + k_1 h/2, t_i + h/2)        (10.64)
k_3 = f(φ_i + 2 h k_2 − h k_1, t_i + h)
dφ/dt = f(φ, t)   ∀t ∈ (t_1, t_2) = (0, τ) = Ω_t        (10.65)

Then
φ_{i+1} = φ_i + h(a_1 k_1 + a_2 k_2 + a_3 k_3 + a_4 k_4)        (10.66)

where
a_1 = a_4 = 1/6 ,   a_2 = a_3 = 1/3        (10.67)

and
k_1 = f(φ_i, t_i)
k_2 = f(φ_i + h k_1/2, t_i + h/2)
k_3 = f(φ_i + h k_2/2, t_i + h/2)        (10.68)
k_4 = f(φ_i + h k_3, t_i + h)
2nd and 3rd order Runge-Kutta methods as we see fit. For (10.72), we give
details for 4th order Runge-Kutta method.
du/dt = f_1(u, v, t) ;   dv/dt = f_2(u) = u        (10.73)
u_{i+1} = u_i + (h/6)(k_1 + 2k_2 + 2k_3 + k_4) ;   v_{i+1} = v_i + (h/6)(l_1 + 2l_2 + 2l_3 + l_4)

k_1 = f_1(u_i, v_i, t_i)                            ;   l_1 = f_2(u_i) = u_i
k_2 = f_1(u_i + h k_1/2, v_i + h l_1/2, t_i + h/2)  ;   l_2 = f_2(u_i + h k_1/2) = u_i + h k_1/2
k_3 = f_1(u_i + h k_2/2, v_i + h l_2/2, t_i + h/2)  ;   l_3 = f_2(u_i + h k_2/2) = u_i + h k_2/2
k_4 = f_1(u_i + h k_3, v_i + h l_3, t_i + h)        ;   l_4 = f_2(u_i + h k_3) = u_i + h k_3
u_{i+1} = u_i + h( (1/2) k_1 + (1/2) k_2 )

k_1 = f(u_i, t_i)
k_2 = f(u_i + h k_1, t_i + h)
For i = 1:
t = t1 = 0, u1 = u(0) = 1
For i = 2:
k_1 = f(u_1, t_1) = 1 + 0 = 1
k_2 = f(u_1 + k_1 h, t_1 + h) = (1 + 1(0.2)) + 0.2 = 1.4
u_2 = 1 + 0.2(1/2 + 1.4/2) = 1 + 0.24 = 1.24
For i = 3:
t = t3 = 2∆t = 0.4 using ∆t = h = 0.2
Thus we have

t       u         du/dt = f(u, t)
0       1         1
0.2     1.24      1.44
0.4     1.5768    1.9768
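The two steps tabulated above can be checked with a short Python sketch (ours, not the book's program) of the a_1 = a_2 = 1/2 scheme; it should reproduce u ≈ 1.24 at t = 0.2 and u ≈ 1.5768 at t = 0.4.

```python
# Second order Runge-Kutta (a1 = a2 = 1/2): u_{i+1} = u_i + (h/2)(k1 + k2),
# k1 = f(u_i, t_i), k2 = f(u_i + h k1, t_i + h), applied to du/dt = u + t, u(0) = 1.
def rk2_step(f, u, t, h):
    k1 = f(u, t)
    k2 = f(u + h * k1, t + h)
    return u + 0.5 * h * (k1 + k2)

f = lambda u, t: u + t
u, t, h = 1.0, 0.0, 0.2
for _ in range(2):                                  # two steps: t = 0.2 and t = 0.4
    u = rk2_step(f, u, t, h)
    t += h
    print(f"t = {t:.1f}, u = {u:.4f}, du/dt = {f(u, t):.4f}")
# expected: u(0.2) = 1.24 and u(0.4) = 1.5768, as in the table above
```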
For i = 1:
t = t1 = 0, u1 = 1, f (u, t) = u + t, h = 0.2
For i = 2:
t = t2 = ∆t = 0.2 using ∆t = h = 0.2
k_1 = f(u_1, t_1) = 1 + 0 = 1
k_2 = f(u_1 + k_1 h/2, t_1 + h/2) = (1 + 1(0.2)/2) + (0 + 0.2/2) = 1.2
k_3 = f(u_1 + 2 h k_2 − h k_1, t_1 + h) = (1 + 2(0.2)(1.2) − 0.2(1)) + (0 + 0.2) = 1.48
u_2 = u_1 + (h/6)(k_1 + 4k_2 + k_3)
u_2 = 1 + (0.2/6)(1 + 4(1.2) + 1.48) = 1.24267
For i = 3:
For i = 2:
t = t2 = ∆t = 0.2 using ∆t = h = 0.2
k_1 = f(u_1, t_1) = 1 + 0 = 1
k_2 = f(u_1 + h k_1/2, t_1 + h/2) = (u_1 + h k_1/2) + (t_1 + h/2)
    = (1 + (0.2)(1)/2) + (0 + 0.2/2) = 1.2
k_3 = f(u_1 + h k_2/2, t_1 + h/2) = (u_1 + h k_2/2) + (t_1 + h/2)
    = (1 + (0.2)(1.2)/2) + (0 + 0.2/2) = 1.22
k_4 = f(u_1 + h k_3, t_1 + h) = (u_1 + h k_3) + (t_1 + h)
    = (1 + (0.2)(1.22)) + (0 + 0.2) = 1.444
u_2 = u_1 + (h/6)(k_1 + 2k_2 + 2k_3 + k_4)
    = 1 + (0.2/6)(1 + 2(1.2) + 2(1.22) + 1.444) = 1.2428
For i = 3:
t = t3 = 2∆t = 0.4 using ∆t = h = 0.2
k_1 = 1.4428
k2 = 1.68708
k3 = 1.71151
k4 = 1.9851
u_3 = u_2 + (h/6)(k_1 + 2k_2 + 2k_3 + k_4)
    = 1.2428 + (0.2/6)(1.4428 + 2(1.68708) + 2(1.71151) + 1.9851)
    = 1.58364
u4 = 2.044218
Thus, we have

t       u          du/dt = f(u, t)
0       1          1
0.2     1.2428     1.4428
0.4     1.58364    1.98364
0.6     2.044218   2.644218
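The values above can be verified with a short Python sketch (ours, not the book's program) of the 4th order scheme (10.66)–(10.68); for du/dt = u + t with u(0) = 1 and h = 0.2 it reproduces u = 1.2428, 1.58364, and 2.044218 at t = 0.2, 0.4, and 0.6.

```python
# Classical 4th order Runge-Kutta step, (10.66)-(10.68), applied to du/dt = u + t, u(0) = 1.
def rk4_step(f, u, t, h):
    k1 = f(u, t)
    k2 = f(u + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(u + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(u + h * k3, t + h)
    return u + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

f = lambda u, t: u + t
u, t, h = 1.0, 0.0, 0.2
for _ in range(3):                                  # steps to t = 0.2, 0.4, 0.6
    u = rk4_step(f, u, t, h)
    t += h
    print(f"t = {t:.1f}, u = {u:.6f}, du/dt = {f(u, t):.6f}")
# to the figures shown in the table: 1.2428, 1.58364, 2.044218
```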
Remarks.
(3) The 4th order Runge-Kutta method has very good accuracy and is widely used in practical applications.
Example 10.5 (4th order Runge-Kutta Method for a System of First Order
ODEs in Time). Consider the following system of ODEs in time.
dx/dt = x y + t = f(x, y, t) ;   dy/dt = x + y t = g(x, y, t)   ∀t ∈ (t_1, t_2) = (0, τ) = Ω_t
k_2 = f(x_0 + k_1 h/2, y_0 + l_1 h/2, t_0 + h/2)
    = (1 + (−1)(0.2)/2)(−1 + (1)(0.2)/2) + 0.2/2
    = −0.71

l_2 = g(x_0 + k_1 h/2, y_0 + l_1 h/2, t_0 + h/2)
    = (1 + (−1)(0.2)/2) + (−1 + (1)(0.2)/2)(0.2/2)
    = 0.81

k_3 = f(x_0 + k_2 h/2, y_0 + l_2 h/2, t_0 + h/2)
    = (1 + (−0.71)(0.2)/2)(−1 + (0.81)(0.2)/2) + 0.2/2
    = −0.754

l_3 = g(x_0 + k_2 h/2, y_0 + l_2 h/2, t_0 + h/2)
    = (1 + (−0.71)(0.2)/2) + (−1 + (0.81)(0.2)/2)(0.2/2)
    = 0.837
k_4 = f(x_0 + k_3 h, y_0 + l_3 h, t_0 + h)
    = (1 + (−0.754)(0.2))(−1 + (0.837)(0.2)) + 0.2
    = −0.507

l_4 = g(x_0 + k_3 h, y_0 + l_3 h, t_0 + h)
    = (1 + (−0.754)(0.2)) + (−1 + (0.837)(0.2))(0.2)
    = 0.68

x_1 = x_0 + (h/6)(k_1 + 2k_2 + 2k_3 + k_4)
    = 1 + (0.2/6)(−1 + 2(−0.71) + 2(−0.754) − 0.507)
x_1 = 0.8522

y_1 = y_0 + (h/6)(l_1 + 2l_2 + 2l_3 + l_4)
    = −1 + (0.2/6)(1 + 2(0.81) + 2(0.837) + 0.68)
y_1 = −0.834

Hence the solution at t = 0.2 is (x_1, y_1) = (0.8522, −0.834).
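A short Python sketch (ours) of the coupled update confirms the hand calculation; the initial values x(0) = 1, y(0) = −1 and h = 0.2 are those implied by the computation above.

```python
# One 4th order Runge-Kutta step for the coupled system
# dx/dt = x*y + t = f(x, y, t), dy/dt = x + y*t = g(x, y, t).
def rk4_system_step(f, g, x, y, t, h):
    k1, l1 = f(x, y, t), g(x, y, t)
    k2 = f(x + 0.5 * h * k1, y + 0.5 * h * l1, t + 0.5 * h)
    l2 = g(x + 0.5 * h * k1, y + 0.5 * h * l1, t + 0.5 * h)
    k3 = f(x + 0.5 * h * k2, y + 0.5 * h * l2, t + 0.5 * h)
    l3 = g(x + 0.5 * h * k2, y + 0.5 * h * l2, t + 0.5 * h)
    k4 = f(x + h * k3, y + h * l3, t + h)
    l4 = g(x + h * k3, y + h * l3, t + h)
    x_new = x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    y_new = y + h * (l1 + 2.0 * l2 + 2.0 * l3 + l4) / 6.0
    return x_new, y_new

f = lambda x, y, t: x * y + t
g = lambda x, y, t: x + y * t
x1, y1 = rk4_system_step(f, g, 1.0, -1.0, 0.0, 0.2)
print(f"x1 = {x1:.4f}, y1 = {y1:.4f}")      # approximately 0.8522 and -0.8342
```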
Example 10.6 (4th Order Runge-Kutta Method for a Second Order ODE
in Time). Consider the following second order ODE in time.
d²θ/dt² + (32.2/r) sin θ = 0   ∀t ∈ (t_1, t_2) = (0, τ) = Ω_t ,   θ = θ(t)

with
θ(0) = θ|_{t=0} = θ_0 = 0.8 radians
dθ/dt|_{t=0} = 0 ;   r = 2

Use the fourth order Runge-Kutta method to calculate θ, dθ/dt, and d²θ/dt² for t = 0.1
using ∆t = h = 0.1. Convert the 2nd order ODE to a system of first order ODEs in time.

Let
dθ/dt = u = f(u) ;   du/dt = −16.1 sin θ = g(θ)   (r = 2)
Hence   u|_{t=0} = u_0 = 0
For i = 0:

k_1 = f(u_0) = 0 ;   l_1 = g(θ_0) = −16.1 sin(0.8) = −11.55
k_2 = f(u_0 + l_1 h/2) = 0 + (−11.55)(0.1)/2 = −0.578 ;   l_2 = g(θ_0 + k_1 h/2) = −16.1 sin(0.8 + 0(0.1)/2) = −11.55
k_3 = f(u_0 + l_2 h/2) = 0 + (−11.55)(0.1)/2 = −0.578 ;   l_3 = g(θ_0 + k_2 h/2) = −16.1 sin(0.8 + (−0.578)(0.1)/2) = −11.22
k_4 = f(u_0 + l_3 h) = 0 + (−11.22)(0.1) = −1.122 ;   l_4 = g(θ_0 + k_3 h) = −16.1 sin(0.8 + (−0.578)(0.1)) = −10.882
θ_1 = θ_0 + (h/6)(k_1 + 2k_2 + 2k_3 + k_4) = 0.8 + (0.1/6)(0 + 2(−0.578) + 2(−0.578) − 1.122)
θ_1 = 0.7429

u_1 = u_0 + (h/6)(l_1 + 2l_2 + 2l_3 + l_4) = 0 + (0.1/6)(−11.55 + 2(−11.55) + 2(−11.22) − 10.882)
u_1 = −1.133 = dθ/dt

d²θ/dt²|_{t=0.1} = (−32.2/2) sin(θ_1) = −16.1 sin(0.7429) = −10.89
θ|_{t=0.1} = 0.7429
dθ/dt|_{t=0.1} = −1.133
d²θ/dt²|_{t=0.1} = −10.89
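The single step above can be reproduced with the following Python sketch (ours, not the book's program), which uses the system dθ/dt = u, du/dt = −16.1 sin θ and one RK4 step of h = 0.1.

```python
# One RK4 step for d(theta)/dt = u, du/dt = -16.1 sin(theta),
# with theta(0) = 0.8, u(0) = 0, h = 0.1 (pendulum example; 32.2/r = 16.1 for r = 2).
import math

f = lambda u: u                               # d(theta)/dt
g = lambda theta: -16.1 * math.sin(theta)     # du/dt

theta0, u0, h = 0.8, 0.0, 0.1
k1, l1 = f(u0),                g(theta0)
k2, l2 = f(u0 + 0.5 * h * l1), g(theta0 + 0.5 * h * k1)
k3, l3 = f(u0 + 0.5 * h * l2), g(theta0 + 0.5 * h * k2)
k4, l4 = f(u0 + h * l3),       g(theta0 + h * k3)
theta1 = theta0 + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
u1 = u0 + h * (l1 + 2.0 * l2 + 2.0 * l3 + l4) / 6.0
print(f"theta(0.1) ≈ {theta1:.4f}, d(theta)/dt ≈ {u1:.4f}, d2(theta)/dt2 ≈ {g(theta1):.2f}")
# approximately 0.7429, -1.133, -10.89, matching the summary above
```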
(2) In case of two first order simultaneous ODEs the area constants ki ;
i = 1, 2, ..., 4 and li ; i = 1, 2, ..., 4 associated with the two ODEs must
be computed in the following order.
This is due to the fact that (k2 , l2 ) contain (k1 , l1 ) and (k3 , l3 ) contain
(k2 , l2 ) and so on.
(3) When there are more than two first order ODEs, we also follow rule (2):
(k_1, l_1, m_1, ...) are computed first, followed by (k_2, l_2, m_2, ...), and so on.
Problems

10.1 Consider the following ordinary differential equation in time:

du/dt = t u²   ∀t ∈ (1, 2) = (t_1, t_2) = Ω_t        (1)

with IC:  u(1) = 1        (2)

(a) Use Euler's method to calculate the solution u(t) and u′(t) ∀t ∈ (1, 2] using an
integration step of 0.1. Tabulate your results and plot graphs of u versus t and
u′ versus t.

(b) Repeat the calculations for a step size of 0.05. Tabulate and plot graphs of the
computed solution and compare the solution computed here with the results obtained
in (a). Write a short discussion of the results calculated in (a) and (b).
(a) Calculate u, u′, v and v′ ∀t ∈ (0, 1.0] with a time step of 0.1 using the second order
and fourth order Runge-Kutta methods. Plot graphs of u, u′, v and v′ versus t. Tabulate
your computed solution. Compare the two solutions from the second and the fourth order
methods.

(b) Repeat the calculations in (a) using a time step of 0.05. Tabulate your computed
solution and plot graphs similar to those in (a). Compare the computed solution here
with that in (a). Write a short discussion. Also compare the two solutions obtained
here from the second and fourth order Runge-Kutta methods.

(a) Calculate φ(t), φ′(t) ∀t ∈ (π/2, π/2 + 3] using the second order and fourth order
Runge-Kutta methods with a time step of 0.1. Tabulate your
(a) Calculate u(t), u′(t) ∀t ∈ (0, 2.0] with a time step of 0.1 using the second and fourth
order Runge-Kutta methods. Plot graphs of u(t) and u′(t) versus t using the calculated
solutions. Compare the solutions obtained using the second and fourth order Runge-Kutta
methods.

(b) Repeat the calculations and all the details in (a) using a time step of 0.05. Compare
the solution obtained here with that obtained in (a). Provide a short discussion.

(a) Calculate u(t), u′(t) ∀t ∈ (0, 2.0] using the Runge-Kutta methods of second and fourth
order with an integration time step of 0.2. Tabulate your results and plot graphs of u(t)
and u′(t) versus t using the calculated solutions. Compare the two sets of solutions.

(b) Repeat the calculations and all other details in (a) using a time step of 0.01. Compare
these results with those in (a). Write a short discussion.

(a) Calculate φ(t), φ′(t) ∀t ∈ (1, 3] using the first order and second order Runge-Kutta
methods with a time step of 0.2. Tabulate your calculated solution and plot graphs of
φ(t), φ′(t) versus t. Compare the two sets of solutions.

(b) Repeat the calculations and details in (a) using an integration time step of 0.1.
Compare this computed solution with the one calculated in (a). Write a short discussion.
11
Fourier Series
11.1 Introduction
In many applications such as initial value problems often the periodic
forcing functions i.e. periodic non-homogeneous part may not be analytical.
Such forcing functions are not continuous and differential every where in the
domain of definition. Rectangular or square waves, triangular waves, saw
tooth waves, etc. are a few examples. In such cases solutions of the initial
value problems may be difficult to obtain. Fourier series provides means
of approximate representation of such functions that are continuous and
differentiable everywhere in the domain of definitions, hence are meritorious
in the solutions of the IVPs.
Hence,

a_0 = (1/T) ∫_0^T f(t) dt        (11.4)
(ii) Determination of ak ; k = 1, 2, . . . , j, . . . , ∞
To determine aj , we multiply (11.1) by sin(jωt) and integrate with
respect to t with limits [0, T ].
∫_0^T f(t) sin(jωt) dt = a_0 ∫_0^T sin(jωt) dt + Σ_{k=1}^∞ ∫_0^T sin(jωt) ( a_k sin(kωt) + b_k cos(kωt) ) dt        (11.5)
We note that

∫_0^T sin(jωt) dt = 0
∫_0^T sin(jωt) sin(kωt) dt = 0 ;   k = 1, 2, . . . , ∞ ,  k ≠ j
∫_0^T sin(jωt) cos(kωt) dt = 0 ;   k = 1, 2, . . . , ∞ ,  k ≠ j        (11.6)
∫_0^T sin(jωt) sin(jωt) dt = ∫_0^T sin²(jωt) dt = T/2
Hence,

a_j = (2/T) ∫_0^T f(t) sin(jωt) dt ;   j = 1, 2, . . . , ∞        (11.8)
(iii) Determination of bk ; k = 1, 2, . . . , j, . . . , ∞
To determine bj , we multiply (11.1) by cos(jωt) and integrate with
respect to time with limits [0, T ].
∫_0^T f(t) cos(jωt) dt = a_0 ∫_0^T cos(jωt) dt + Σ_{k=1}^∞ ∫_0^T cos(jωt) ( a_k sin(kωt) + b_k cos(kωt) ) dt        (11.9)
We note that

∫_0^T cos(jωt) dt = 0
∫_0^T cos(jωt) sin(kωt) dt = 0 ;   k = 1, 2, . . . , ∞ ,  k ≠ j
∫_0^T cos(jωt) cos(kωt) dt = 0 ;   k = 1, 2, . . . , ∞ ,  k ≠ j        (11.10)
∫_0^T cos(jωt) cos(jωt) dt = ∫_0^T cos²(jωt) dt = T/2
Remarks.
(1) Regardless of whether f(t) is analytic or not, its Fourier series approximation (11.1) is always analytic.

(3) It is possible to define the L₂-norm of the error between the actual f(t) and its Fourier approximation to quantitatively judge the accuracy of the approximation.
(Figure: square wave f(t) of period T, with f(t) = 1 for −T/4 ≤ t ≤ T/4 and f(t) = −1 for T/4 < |t| ≤ T/2.)
a_0 = (1/T) ∫_0^T f(t) dt = (1/T) ( ∫_{−T/2}^{−T/4} (−1) dt + ∫_{−T/4}^{T/4} (1) dt + ∫_{T/4}^{T/2} (−1) dt )
    = (1/T) ( −(−T/4 + T/2) + (T/4 + T/4) − (T/2 − T/4) )
    = (1/T)(0) = 0
a_j = (2/T) ∫_{−T/2}^{T/2} f(t) sin(jωt) dt

or

a_j = (2/T) ( ∫_{−T/2}^{−T/4} (−1) sin(jωt) dt + ∫_{−T/4}^{T/4} (1) sin(jωt) dt + ∫_{T/4}^{T/2} (−1) sin(jωt) dt )
    = (2/T)(1/(jω)) ( [cos(jωt)]_{−T/2}^{−T/4} − [cos(jωt)]_{−T/4}^{T/4} + [cos(jωt)]_{T/4}^{T/2} )

or

a_j = (2/T)(1/(jω)) ( cos(jωT/4) − cos(jωT/2) − cos(jωT/4) + cos(jωT/4) + cos(jωT/2) − cos(jωT/4) ) = (2/T)(0) = 0
and

b_j = (2/T) ∫_{−T/2}^{T/2} f(t) cos(jωt) dt

or

b_j = (2/T) ( ∫_{−T/2}^{−T/4} (−1) cos(jωt) dt + ∫_{−T/4}^{T/4} (1) cos(jωt) dt + ∫_{T/4}^{T/2} (−1) cos(jωt) dt )

or

b_j = (2/T)(1/(jω)) ( −[sin(jωt)]_{−T/2}^{−T/4} + [sin(jωt)]_{−T/4}^{T/4} − [sin(jωt)]_{T/4}^{T/2} )
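Carrying the evaluation through gives b_j = (4/(jπ)) sin(jπ/2), i.e. b_j vanishes for even j and |b_j| = 4/(jπ) for odd j, while a_0 = a_j = 0 as found above. The following Python sketch (ours, not from the text) checks these results by numerical quadrature of the coefficient formulas; the choice T = 2 is arbitrary.

```python
# Numerical check (ours) of the square wave Fourier coefficients, using the book's convention:
# a0 = (1/T) int f dt, a_j = (2/T) int f sin(j w t) dt, b_j = (2/T) int f cos(j w t) dt,
# integrating over one period [-T/2, T/2] with w = 2 pi / T.
import math

T = 2.0                      # arbitrary period for the check
w = 2.0 * math.pi / T

def f(t):
    """Square wave: +1 on [-T/4, T/4] and -1 on the rest of the period."""
    return 1.0 if abs(t) <= T / 4.0 else -1.0

def integrate(func, a, b, n=20000):
    """Composite trapezoidal rule."""
    dt = (b - a) / n
    s = 0.5 * (func(a) + func(b)) + sum(func(a + i * dt) for i in range(1, n))
    return s * dt

a0 = integrate(f, -T / 2, T / 2) / T
print(f"a0 ≈ {a0:.2e}")
for j in range(1, 6):
    aj = 2.0 / T * integrate(lambda t: f(t) * math.sin(j * w * t), -T / 2, T / 2)
    bj = 2.0 / T * integrate(lambda t: f(t) * math.cos(j * w * t), -T / 2, T / 2)
    bj_exact = 4.0 / (j * math.pi) * math.sin(j * math.pi / 2.0)
    print(f"j = {j}: a_j ≈ {aj:+.4f}, b_j ≈ {bj:+.4f}, closed form b_j = {bj_exact:+.4f}")
```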
Problems
11.1 Figure (a) below shows a rectangular wave.
(Figure (a): a rectangular wave of period T and amplitude A.)

(Figure (a): a sawtooth wave of period T and amplitude A.)

(Figure (a): a triangular wave of time period T.)
[32]
BIBLIOGRAPHY
[1] Allaire, F.E.: Basics of the Finite Element Method. William C. Brown,
Dubuque, IA (1985)
[5] Bathe, K.J., Wilson, E.L.: Numerical Methods in Finite Element Analysis. Prentice-Hall, Englewood Cliffs, NJ (1976)
[7] Burden, R.L., Faires, J.D.: Numerical Analysis, 5th edn. PWS Pub-
lishing, Boston (1993)
[8] Carnahan, B., Luther, H.A., Wilkes, J.O.: Applied Numerical Methods.
Wiley, New York (1969)
[10] Cheney, W., Kincaid, D.: Numerical Mathematics and Computing, 2nd
edn. Brooks/Cole, Monterey, CA (1994)
[17] Forsythe, G.E., Malcolm, M.A., Moler, C.B.: Computer Methods for
Mathematical Computation. Prentice-Hall, Englewood Cliffs, NJ (1977)
[18] Froberg, C.E.: Introduction to Numerical Analysis. Addison-Wesley
Publishing Company (1969)
[19] Gear, C.W.: Numerical Initial-Value Problems in Ordinary Differential
Equations. Prentice-Hall, Englewood Cliffs, NJ (1971)
[20] Gear, C.W.: Applied Numerical Analysis, 3rd edn. Addison-Wesley,
Reading, MA (1989)
[21] Hamming, R.W.: Numerical Methods for Scientists and Engineers. Wi-
ley, New York (1973)
[22] Hartley, H.O.: The modified Gauss-Newton method for fitting non-linear regression functions by least squares. Technometrics 3, 269–280 (1961)
[23] Henrici, P.H.: Elements of Numerical Analysis. Wiley, New York (1964)
[24] Henrici, P.: Error Propagation for Finite Difference Methods. John Wiley & Sons (1963)
[25] Hildebrand, F.B.: Introduction to Numerical Analysis, 2nd edn.
McGraw-Hill, New York (1974)
[26] Hoffman, J.: The Theory of Matrices in Numerical Analysis. Blaisdell,
New York (1964)
[27] Hoffman, J.: Numerical Methods for Engineers and Scientists. McGraw-
Hill, New York (1992)
[28] Householder, A.S.: Principles of Numerical Analysis. McGraw-Hill,
New York (1953)
[29] Hurty, W.C., Rubinstein, M.F.: Dynamics of Structures. Prentice-Hall
(1964)
[30] Isaacson, E., Keller, H.B.: Analysis of Numerical Methods. Wiley, New
York (1966)
[31] Lapidus, L., Pinder, G.F.: Numerical Solution of Partial Differential
Equations in Science and Engineering. Wiley, New York (1981)
[32] Lapidus, L., Seinfield, J.H.: Numerical Solution of Ordinary Differential
Equations. Academic Press, New York (1971)
[33] Maron, M.J.: Numerical Analysis, A Practical Approach. Macmillan,
New York (1982)
[37] Paz, M.: Structural Dynamics: Theory and Computations. Van Nos-
trand Reinhold Company (1984)
[38] Ralston, A., Rabinowitz, P.: A First Course in Numerical Analysis, 2nd
edn. New York (1978)
[39] Reddy, J.N.: An Introduction to the Finite Element Method, 3rd edn.
McGraw-Hill (2006)
[44] Stasa, F.L.: Applied Finite Element Analysis for Engineers. Holt, Rinehart and Winston, New York (1985)
[46] Surana, K.S., Ahmadi, A.R., Reddy, J.N.: The k-version of finite ele-
ment method for self-adjoint operators in BVP. International Journal
of Computational Engineering Science 3(2), 155–218 (2002)
[47] Surana, K.S., Ahmadi, A.R., Reddy, J.N.: The k-version of finite ele-
ment method for non-self-adjoint operators in BVP. International Jour-
nal of Computational Engineering Science 4(4), 737–812 (2003)
[48] Surana, K.S., Ahmadi, A.R., Reddy, J.N.: The k-version of finite ele-
ment method for non-linear operators in BVP. International Journal of
Computational Engineering Science 5(1), 133–207 (2004)
[49] Surana, K.S., Reddy, J.N.: The Finite Element Method for Boundary
Value Problems: Mathematics and Computations. CRC Press/Taylor
& Francis Group (2016)
[50] Surana, K.S., Reddy, J.N.: The Finite Element Method for Initial Value
Problems: Mathematics and Computations. CRC Press/Taylor & Fran-
cis Group (2017)
[51] Surana, K.S., Reddy, J.N., Allu, S.: The k-version of finite element
method for initial value problems: Mathematical and computational
framework. International Journal for Computational Methods in Engi-
neering Science and Mechanics 8(3), 123–136 (2007)
[53] Wilkinson, J.H., Reinsch, C.: Linear Algebra: Handbook for Automatic Computation, vol. II. Springer-Verlag, Berlin (1971)
F Periodic, 459
Periodic functions, 459–466
Factorization or decomposition Representation of arbitrary periodic func-
Crout Decomposition, 56–60 tion, 459
[L][U ] decomposition, 49–54 Sawtooth wave, 464
False position method, 99 Square wave, 462–463, 464
Convergence, 100 Time period, 459
Derivation, 99 Triangular wave, 465
Relative error (stopping criteria), 100 Fourth order Runge-Kutta, 445–447
Finite difference methods, 397–415
BVPs, 397–415 G
ODEs, 397–407
Eigenvalue problem, 405–407 Gauss Elimination, 34–46
Second order non-homogeneous ODE, Full pivoting, 43–46
397–407 back substitution, 43–46
Function values as BCs, 397– upper triangular form, 43–44
407 Naive, 34–39
Function values and derivatives back substitution, 36–38
as BCs, 402–405 upper triangular form, 35–36
PDEs Partial pivoting, 39–43
Laplace equation, 408–412 back substitution, 41–43
Poisson’s equation, 408, 412–415 upper triangular form, 39–40
IVPs Gauss-Jordan method, 46–49
Heun method, 444-445 Algorithm, 46–48
Runge-Kutta methods, 442-454 Examples, 48–49
Numerical differentiation, 347–354 Gauss quadrature, 288–300
Finite element method, 359–397 Examples, 300–305
Differential operators, 360–361 in R1 over [−1, 1], 288–295
Discretization, 366-369 n point quadrature, 292–293
FEM based on FL, 369–374 Three point quadrature, 290–292
FEM based on residual functional, Two point quadrature, 289–290
374–375 in R1 over [a, b], 295–296
FEM using GM/WF, 372 in R2 over [−1, 1] × [−1, 1], 296
concomitant, 373 in R2 over [a, b] × [c, d], 273
EBC, NCM, 373 in R3 over a two unit cube, 297
PV, SV, 373 in R3 over a prism, 299
weak form, 372–373 Gauss-Seidel method, 68–74
FEM using GM, PGM, WRM, 371– Algorithm, 68–69
372 Convergence criterion, 70
assembly, 372 Examples, 70–74
element equations, 371 Gradient method, also see Newton’s method
integral form, 369–374 or Newton-Raphson method, 102
local approximations in R1 , 379 in R1 , 102–107
mapping in R1 , 379 error analysis, 105–106
second order ODE, 375 examples, 106–107
Global approximations, 369 method, 102–104
Integral form, 361 in R2 , 118–123
based on Fundamental Lemma, 362– example, 120–123
365 method, 118–120
residual functional, 365–366
Local approximations, 369–370 H
First Backward difference, 349–350
First Forward difference, 349 Half interval method, 95–98
First order approximation, 349–350 Harmonics, 459
First order ODEs, 360–407, 437–454 Heun’s method, 444–445
First order Runge-Kutta methods, 442 Higher order approximation, 350–353
Fourier series, 459–463 Householder’s method, also see Eigenvalue
Determination of coefficients, 459–461 problems, 180–186
Fundamental frequency, 459 House holder transformation, see Eigen-
Harmonic, 459 value problems, 181–183
Open interval, see BVPs and IVPs and root Trapezoidal Rule, 271–272
finding methods in R2
Ordinary Differential Equations Examples, 300–305
Boundary Value Problem, 359–407 over [−1, 1] × [−1, 1], 296
Finite difference method, 397–407 over [a, b] × [c, d], 297
Finite element method, 366–397 in R3
Initial Value Problem, 425–454 over a prism, 299
Finite element method, 434–436 over two unit cube, 298
Time integration of ODEs, 437–454
R
P
Relative error
Partial differential equation Bisection method, 95
Finite difference method, 408–415 False position method, 100
Partial pivoting, 39–43 Fixed pint method, 114
PDEs, see partial differential equations Gauss-Seidel method, 70
Pivoting, 39–43, 43–46 Newton’s method
Gauss elimination first order, Newton-Raphson, 104
full, 43–46 nonlinear simultaneous equations, 119
partial, 39–43 second order, 108
Poisson’s equation, 408 Romberg integration, 285
Polynomial interpolation or polynomial, also Relaxation techniques, 80, 82
see Interpolation Residual, 312-315, 365, 394
in R1 Romberg integration, 285–286
approximate error, see Approximate Roots of equations
relative error Bisection method (Half-interval method),
definition, 195–196 95–98
Lagrange interpolating polynomial, False position, 99-102
198–217 Fixed point, 114–116
Newton’s interpolating polynomial, Graphical, 91–92
251–255 Incremental search, 92–95
Pascale rectangle, 222 Newton-Raphson (Newton’s linear) method,
piecewise linear, 196 102–107
polynomial interpolation, 197–198 Newton’s second order method, 108–
in R2 , Lagrange interpolation, Tensor 113
product, 217–237, 224–231 Secant method, 113–114
in R3 , Lagrange interpolation, Tensor Runge-Kutta methods, 442–454
product, 237–247, 237–247 First order Runge-Kutta method, 442
Fourth order Runge-Kutta method, 445–
Q 447
Second order Runge-Kutta method, 443–
QR iteration, 183 445
Quadratic convergence, 102–106 Third order Runge-Kutta method, 445
Quadratic interpolation, 198–217
Quadrature, also see Integration or numeri- S
cal integration
in R1 , 269 Secant method, 113–114
Examples, 276–283, 286–288, 300– Serendipity (interpolation), 232–237
305 Shape functions, 369–370
Gauss quadrature, 288–300 Simpson’s method
n-point quadrature, 292–293 1/3 Rule, 272–274
over [−1, 1], 288–295 3/8 Rule, 274–276
over [a, b], 295–296 Simultaneous equations, also System of lin-
two point quadrature, 289–290 ear algebraic equations, 10–80
three point quadrature, 290–292 Definitions (linear, nonlinear), 10
Newton-Cotes integration, 276 Elementary row operations, 26
Richardson’s extrapolation, 284–285 Linear dependence, 20
Romberg method, 285–286 Linear independence, 20
Simpson’s 1/3 Rule, 272–274 Matrix and vector representation, 25
Simpson’s 3/8 Rule, 274–276 Methods of solution
T
Taylor series, 102, 105, 108, 109, 118, 256,
329, 348–354, 397, 398, 437, 443
Third order Runge-Kutta method, 445
Trace of matrices, 15
Transpose of a matrix, 15
Trapezoid rule, 271–273
Triangular matrices, 13
Tridiagonal matrices (banded), 13