
Numerical Methods for Least Squares Problems, Second Edition

The method of least squares, discovered by Gauss in 1795, is a principal tool for reducing the influence of errors when fitting a mathematical model to given observations. Applications arise in many areas of science and engineering. The increased use of automatic data capturing frequently leads to large-scale least squares problems. Such problems can be solved by using recent developments in preconditioned iterative methods and in sparse QR factorization.
The first edition of Numerical Methods for Least Squares Problems was the leading reference on this topic for many years. The updated second edition stands out compared to other books on the topic because it
• provides an in-depth and up-to-date treatment of direct and iterative methods for solving different types of least squares problems and for computing the singular value decomposition;
• covers generalized, constrained, and nonlinear least squares problems as well as partial least squares and regularization methods for discrete ill-posed problems; and
• contains a bibliography of over 1,100 historical and recent references, providing a comprehensive survey of past and present research in the field.

Audience
This book will be of interest to graduate students and researchers in applied
mathematics and to researchers working with numerical linear algebra applications.

Åke Björck is a professor emeritus at Linköping University, Sweden. He is the author of many research papers and books on numerical analysis and matrix computations. He served as managing editor of the journal BIT Numerical Mathematics from 1993 to 2003 and has been a SIAM Fellow since 2014.
For more information about SIAM books, journals, conferences, memberships, and activities, contact:

Society for Industrial and Applied Mathematics

3600 Market Street, 6th Floor
Philadelphia, PA 19104-2688 USA
+1-215-382-9800
[email protected] • www.siam.org

OT196
ISBN: 978-1-61197-794-3


Numerical Methods
for Least Squares
Problems
Second Edition

Åke Björck
Linköping University
Linköping, Sweden

Society for Industrial and Applied Mathematics


Philadelphia
Copyright © 2024 by the Society for Industrial and Applied Mathematics
10 9 8 7 6 5 4 3 2 1
All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or
transmitted in any manner without the written permission of the publisher. For information, write to the Society
for Industrial and Applied Mathematics, 3600 Market Street, 6th Floor, Philadelphia, PA 19104-2688 USA.
No warranties, express or implied, are made by the publisher, authors, and their employers that the programs
contained in this volume are free of error. They should not be relied on as the sole basis to solve a problem whose
incorrect solution could result in injury to person or property. If the programs are employed in such a manner, it is
at the user’s own risk and the publisher, authors, and their employers disclaim all liability for such misuse.
Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used
in an editorial context only; no infringement of trademark is intended.
MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information, please contact
The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098 USA, 508-647-7000, Fax: 508-647-7001,
[email protected], www.mathworks.com.

Publications Director Kivmars H. Bowling


Executive Editor Elizabeth Greenspan
Acquisitions Editor Elizabeth Greenspan
Developmental Editor Rose Kolassiba
Managing Editor Kelly Thomas
Production Editor Lisa Briggeman
Copy Editor Lisa Briggeman
Production Manager Rachel Ginder
Production Coordinator Cally A. Shrader
Compositor Cheryl Hufnagle
Graphic Designer Doug Smock

Library of Congress Control Number 2024007522

SIAM is a registered trademark.
Contents

List of Figures vii

List of Tables ix

Preface xi

Preface to the First Edition xiii

1 Mathematical and Statistical Foundations 1


1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Some Fundamental Matrix Decompositions . . . . . . . . . . . . . . . . . 8
1.3 Perturbation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Floating-Point Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2 Basic Numerical Methods 39


2.1 The Method of Normal Equations . . . . . . . . . . . . . . . . . . . . . . . 39
2.2 Orthogonalization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.3 Rank-Deficient Least Squares Problems . . . . . . . . . . . . . . . . . . . 71
2.4 Methods Based on LU Factorization . . . . . . . . . . . . . . . . . . . . . 84
2.5 Estimating Condition Numbers and Errors . . . . . . . . . . . . . . . . . . 94
2.6 Blocked Algorithms and Subroutine Libraries . . . . . . . . . . . . . . . . 106

3 Generalized and Constrained Least Squares 115


3.1 Generalized Least Squares Problems . . . . . . . . . . . . . . . . . . . . . 115
3.2 Weighted Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.3 Modified Least Squares Problems . . . . . . . . . . . . . . . . . . . . . . . 137
3.4 Equality Constrained Problems . . . . . . . . . . . . . . . . . . . . . . . . 155
3.5 Inequality Constrained Problems . . . . . . . . . . . . . . . . . . . . . . . 160
3.6 Regularized Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

4 Special Least Squares Problems 183


4.1 Band Least Squares Problems . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.2 Bidiagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
4.3 Some Structured Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 204
4.4 Total Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
4.5 Least Squares Problems with Special Bases . . . . . . . . . . . . . . . . . . 227


5 Direct Methods for Sparse Problems 243


5.1 Tools for Sparse Matrix Computations . . . . . . . . . . . . . . . . . . . . 243
5.2 Sparse QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
5.3 Special Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

6 Iterative Methods 267


6.1 Basic Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.2 Krylov Subspace Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
6.3 Preconditioners for Least Squares Problems . . . . . . . . . . . . . . . . . 306
6.4 Regularization by Iterative Methods . . . . . . . . . . . . . . . . . . . . . . 325

7 SVD Algorithms and Matrix Functions 339


7.1 The QRSVD Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
7.2 Alternative SVD Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 349
7.3 Computing Selected Singular Triplets . . . . . . . . . . . . . . . . . . . . . 363
7.4 Matrix Functions and SVD . . . . . . . . . . . . . . . . . . . . . . . . . . 376

8 Nonlinear Least Squares Problems 389


8.1 Newton-Type Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
8.2 Separable Least Squares Problems . . . . . . . . . . . . . . . . . . . . . . 402
8.3 Nonnegativity Constrained Problems . . . . . . . . . . . . . . . . . . . . . 416
8.4 Robust Regression and Related Topics . . . . . . . . . . . . . . . . . . . . 420

Bibliography 431

Index 487
List of Figures

1.1.1 Geometric interpretation of least squares property. . . . . . . . . . . . . . . . 6

2.2.1 Reflection of a vector a in a hyperplane with normal u. . . . . . . . . . . . . . 46

3.6.1 Singular values σi of a discretized integral operator. . . . . . . . . . . . . . . 173

4.1.1 Matrix A after reduction of first k = 3 blocks using Householder reflections . . 190
4.1.2 Reduction of a band matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
4.2.1 Relative error and residual for PLS and TSVD solutions. . . . . . . . . . . . . 203
4.3.1 One and two levels of dissection of a region. . . . . . . . . . . . . . . . . . . 206

5.1.1 Nonzero pattern of matrix from structural problem and its Cholesky factor. . . 244
5.1.2 The labeled graph G(C) of the matrix in (5.1.2). . . . . . . . . . . . . . . . . 248
5.1.3 Sequence of elimination graphs of the matrix in (5.1.2). . . . . . . . . . . . . 249
5.1.4 The transitive reduction and elimination tree T (ATA). . . . . . . . . . . . . . 250
5.1.5 The graph of a matrix for which minimum degree is not optimal. . . . . . . . . 252
5.2.1 Sparse matrix A and factor R using the MATLAB colperm reordering. . . . . 260
5.2.2 Sparse matrix A and factor R using the MATLAB colamd ordering. . . . . . . 260
5.2.3 Structure of upper triangular matrix R for a rank-deficient matrix. . . . . . . . 261

6.1.1 Structure of A and ATA for a simple image reconstruction problem. . . . . . . 268
6.2.1 ∥x† − xk ∥ and ∥AT rk ∥ for problem ILLC1850: LSQR and CGLS . . . . . . . 295
6.2.2 ∥x† − xk ∥ and ∥AT rk ∥ for problem ILLC1850: LSQR and LSMR . . . . . . 296
6.2.3 Underdetermined consistent problem with transpose of ILLC1850: ∥x† − xk ∥
and ∥rk ∥; CGME and LSME . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
6.2.4 Problem ILLC1850 overdetermined consistent: ∥x† − xk ∥ and ∥rk ∥; CGME
and CGLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

7.3.1 Spectral transformation with shift µ = 1. . . . . . . . . . . . . . . . . . . . . 366

8.2.1 Orthogonal distance regression for q = n = 1. . . . . . . . . . . . . . . . . . 407


8.4.1 The penalizing effect using the ℓp -norm for p = 0.1, 1, 2, 10. . . . . . . . . . . 421

List of Tables

1.4.1 IEEE 754–2008 binary floating-point formats. . . . . . . . . . . . . . . . . . 32

2.2.1 Condition number and loss of orthogonality in CGS and MGS. . . . . . . . . . 63


2.5.1 Average number of correct significant decimal digits. . . . . . . . . . . . . . . 106

3.2.1 Average number of correct significant decimal digits in the solution before and
after iterative refinement with various QR factorizations. . . . . . . . . . . . . 132

7.1.1 Comparison of multiplications for SVD algorithms. . . . . . . . . . . . . . . . 347

Preface

More than 25 years have passed since the first edition of this book was published in 1996. Least
squares and least-norm problems have become more significant with every passing decade, and
applications have grown in size, complexity, and variety. More advanced techniques for data
acquisition give larger amounts of data to be treated. What counts as a large matrix has gone
from dimension 1000 to 10^6. Hence, iterative methods play an increasingly crucial role for the
solution of least squares problems. On top of these changes, methods must be adapted to new
generations of multiprocessing hardware.
This second edition is primarily aimed at applied mathematicians and graduate students.
Like the first edition, it aims to give a comprehensive review of the state of the art and to survey
the most important publications. Special effort has gone into making the revised bibliography
as comprehensive and up to date as possible. More than half the references are new, and a
substantial share are from the last decade.
To address the mentioned trends, many parts of this edition are enlarged and completely
rewritten. Several new sections have been added, and the content has been reordered. Be-
cause underdetermined linear systems increasingly occur in applications, the duality between
least squares and least-norm problems is now more emphasized throughout.
The Cosine-Sine (CS) decomposition is a new addition to Chapter 1. Among the novelties
in Chapter 2 are new results on Gram–Schmidt and block QR algorithms. Blocked and recursive
algorithms for Cholesky and QR factorization are also covered. The section on computing the
SVD has been much enlarged and moved to the new Chapter 7.
Chapter 3 presents more complete treatments of generalized and weighted least squares prob-
lems. Oblique projectors and elliptic Modified Gram–Schmidt (MGS) and Householder algo-
rithms are other important additions, and new results are given on the stability of algorithms for
weighted least squares. Linear equality and inequality constrained least squares problems are
treated, along with a more complete treatment of quadratic constraints. A much enlarged treat-
ment of regularization methods is also found in Chapter 3, including truncated singular value
decomposition (SVD), Tikhonov regularization, and transformation to standard form.
Chapter 4 starts with a section on band matrices and methods for band least squares problems;
this section originally appeared in Section 6.1 of the first edition. Next, a new section follows on
Householder and Golub–Kahan bidiagonalizations and their connection to the concept of core
problems for linear systems and Krylov subspace approximations. Another new section covers
algorithms for the partial least squares (PLS) method for prediction and cause-effect inference.
(PLS is much used in chemometrics, bioinformatics, food research, medicine, pharmacology,
social sciences, and physiology.) Next, this chapter gives methods for least squares problems with
special structure, such as block-angular form, Kronecker, or Toeplitz. An introduction to tensor
computations and tensor decompositions is also a new addition to this chapter. The section on
total least squares problems now includes a treatment of large-scale problems.


Chapter 5 treats direct methods for sparse problems and corresponds to Chapter 6 in the first
edition. New software, such as SuiteSparse QR, is surveyed. A notable addition is a section on
methods for solving mixed sparse-dense least squares problems.
Iterative methods for least squares and least-norm problems are treated in Chapter 6. The
Krylov subspace methods CGLS and LSQR as well as the recently introduced LSMR are de-
scribed. The section on preconditioners is completely revised and covers new results on, e.g.,
approximate Cholesky and QR preconditioners. A survey of preconditioners based on random
sampling is also new. Section 6.4 now covers regularization by iterative methods, including
methods for saddle point and symmetric quasi-definite (SQD) systems.
Chapter 7 on algorithms for computing the SVD is a much enlarged version of Section 2.6
of the first edition. Here, new topics are Jacobi-type methods and differential qd and MRRR
algorithms. Brief surveys of the matrix square root and sign functions as well as the polar de-
composition are also included.
Chapter 8 covers methods for nonlinear least squares problems. Several new topics are in-
cluded, such as inexact Gauss–Newton methods, bilinear least squares, and nonnegative least
squares. Also discussed here are algorithms for robust regression, least-angle regression, and
LASSO; compressed sensing; and iteratively reweighted least squares (IRLS).

Acknowledgments
The works of Nick Higham, Lars Eldén, G. W. Stewart, Luc Giraud, and many others have been
prominent inspirations for many of the topics new to this edition. Special thanks goes to Michael
Saunders, who patiently read several versions of the book and gave valuable advice. Without his
encouragement and support, this revision would never have been finished.

Åke Björck
Linköping, March 2023
Preface to the First Edition

A basic problem in science is to fit a model to observations subject to errors. It is clear that
the more observations that are available, the more accurately it will be possible to calculate the
parameters in the model. This gives rise to the problem of “solving” an overdetermined linear
or nonlinear system of equations. It can be shown that the solution which minimizes a weighted
sum of the squares of the residual is optimal in a certain sense. Gauss claims to have discovered
the method of least squares in 1795 when he was 18 years old. Hence this book also marks the
bicentennial of the use of the least squares principle.
The development of the basic modern numerical methods for solving linear least squares
problems took place in the late sixties. The QR decomposition by Householder transformations
was developed by Golub and published in 1965. The implicit QR algorithm for computing the
singular value decomposition (SVD) was developed about the same time by Kahan, Golub, and
Wilkinson, and the final algorithm was published in 1970. These matrix decompositions have
since been developed and generalized to a high level of sophistication. Great progress has been
made in the last decade in methods for generalized and modified least squares problems and in
direct and iterative methods for large sparse problems. Methods for total least squares problems,
which allow errors also in the system matrix, have been systematically developed.
Applications of least squares of crucial importance occur in many areas of applied and en-
gineering research such as statistics, geodetics, photogrammetry, signal processing, and control.
Because of the great increase in the capacity for automatic data capturing, least squares prob-
lems of large size are now routinely solved. Therefore, sparse direct methods as well as iterative
methods play an increasingly important role. Applications in signal processing have created a
great demand for stable and efficient methods for modifying least squares solutions when data
are added or deleted. This has led to renewed interest in rank-revealing QR decompositions,
which lend themselves better to updating than the singular value decomposition. Generalized
and weighted least squares problems and problems of Toeplitz and Kronecker structure are be-
coming increasingly important.
Chapter 1 gives the basic facts and the mathematical and statistical background of least
squares methods. In Chapter 2 relevant matrix decompositions and basic numerical methods
are covered in detail. Although most proofs are omitted, these two chapters are more elementary
than the rest of the book and essentially self-contained. Chapter 3 treats modified least squares
problems and includes many recent results. In Chapter 4 generalized QR and SVD decomposi-
tions are presented, and methods for generalized and weighted problems surveyed. Here also,
robust methods and methods for total least squares are treated. Chapter 5 surveys methods for
problems with linear and quadratic constraints. Direct and iterative methods for large sparse least
squares problems are covered in Chapters 6 and 7. These methods are still subject to intensive
research, and the presentation is more advanced. Chapter 8 is devoted to problems with special
bases, including least squares fitting of polynomials and problems of Toeplitz and Kronecker
structures. Finally, Chapter 9 contains a short survey of methods for nonlinear problems.


This book will be of interest to mathematicians working in numerical linear algebra, com-
putational scientists and engineers, and statisticians, as well as electrical engineers. Although a
solid understanding of numerical linear algebra is needed for the more advanced sections, I hope
the book will be found useful in upper-level undergraduate and beginning graduate courses in
scientific computing and applied sciences.
I have aimed to make the book and the bibliography as comprehensive and up to date as
possible. Many recent research results are included, which were only available in the research
literature before. Inevitably, however, the content reflects my own interests, and I apologize in
advance to those whose work has not been mentioned. In particular, work on the least squares
problem in the former Soviet Union is, to a large extent, not covered.
The history of this book dates back to at least 1981, when I wrote a survey entitled “Least
Squares Methods in Physics and Engineering” for the Academic Training Programme at CERN
in Geneva. In 1985 I was invited to contribute a chapter on “Least Squares Methods” in the
Handbook of Numerical Analysis, edited by P. G. Ciarlet and J. L. Lions. This chapter was
finished in 1988 and appeared in Volume 1 of the Handbook, published by North-Holland in
1990. The present book is based on this contribution, although it has been extensively updated
and made more complete.
The book has greatly benefited from the insight and knowledge kindly provided by many
friends and colleagues. In particular, I have been greatly influenced by the work of Gene H.
Golub, Nick Higham, and G. W. Stewart. Per-Åke Wedin gave valuable advice on the chapter
on nonlinear problems. Part of the Handbook chapter was written while I had the benefit of
visiting the Division of Mathematics and Statistics at CSIRO in Canberra and the Chr. Michelsen
Institute in Bergen.
Thanks are due to Elsevier Science B.V. for the permission to use part of the material from the
Handbook chapter. Finally, I thank Beth Gallagher and Vickie Kearn at SIAM for the cheerful
and professional support they have given throughout the copy editing and production of the book.

Åke Björck
Linköping, February 1996
Chapter 1

Mathematical and
Statistical Foundations

De tous les principes qu’on peut proposer pour cet objet, je pense qu’il n’en est
pas de plus général, de plus exact, ni d’une application plus facile que celui qui . . .
consiste à rendre minimum la somme de quarrés des erreurs.1
—Adrien Marie Legendre, Nouvelles méthodes pour la détermination des or-
bites des comètes. Appendice. 1805.

1.1 Introduction
1.1.1 Historical Remarks
The least squares problem is a computational problem of primary importance in science and
engineering. Originally, it arose from the need to reduce the influence of errors when fitting
a mathematical model to given observations. A way to do this is to use a greater number of
measurements than the number of unknown parameters in the model. As an example, consider a
function known to be a linear combination of n known basis functions ϕ_j(t):

    f(t) = \sum_{j=1}^{n} c_j \phi_j(t).     (1.1.1)

The problem is to determine the n unknown parameters c_1, \ldots, c_n from m > n measurements f(t_i) = y_i + ϵ_i, i = 1, \ldots, m, subject to random errors ϵ_i.
In 1748, before the development of the principle of least squares, Tobias Mayer had devel-
oped a method for “solving” overdetermined systems of equations, which later became known
as the method of averages. The m equations in n unknowns are separated into n groups and
groupwise summed. In this way the overdetermined system is replaced by a square linear system
that can be solved by elimination. Cauchy developed a related interpolation algorithm that leads to systems of the form

    Z^T A x = Z^T b, \qquad Z = (z_1, \ldots, z_n),

where z_{ij} = ±1. An advantage of this choice is that forming the new system requires no multiplications.
Laplace proposed in 1793 that observations be combined by minimizing the sum of the abso-
lute values of the residuals ri = yi − f (ti ) with the added condition that the sum of the residuals
1 Of all the principles that can be proposed, I think there is none more general, more exact, nor easier to apply, than

that which consists of rendering the sum of squares of the errors minimum. (Our translation.)


be equal to zero. He showed that this implies that the solution x must satisfy exactly n out of the
m equations. Gauss argued against this, saying that by the principles of probability, greater or
smaller errors are equally possible in all equations. Therefore, a solution that precisely satisfies a
subset of the equations must be regarded as less consistent with the laws of probability. This led him to the alternative principle of minimizing the sum of squared residuals S = \sum_{i=1}^{m} r_i^2, which also gives a simpler computational procedure.
The first to publish the algebraic procedure and use the name “least squares method” was
Legendre [729, 1805]. A few years later, in a paper titled (in translation) Theory of the Motion
of the Heavenly Bodies Moving about the Sun in Conic Sections2 Gauss [444, 1809] justified the
method of least squares as a statistical procedure. Much to the annoyance of Legendre, he wrote
Our principle, which we have made use of since 1795, has lately been published by
Legendre.
Most historians agree that Gauss was right in his claim of precedence. He had used the least
squares principle earlier for the analyses of survey data and in astronomical calculations and had
communicated the principle to several astronomers. A famous example is Gauss’s prediction of
the orbit of the asteroid Ceres in 1801. After this success, the method of least squares quickly
became the standard procedure for analysis of astronomical and geodetic data and remains so to
this day.
Another early application of the least squares method is from 1793. At that time, the French
government decided to base the new metric system upon a unit, the meter, equal to one
10,000,000th part of the distance from the North Pole to the Equator along a meridian arc through
Paris. In a 1795 survey, four subsections of an arc from Dunkirk to Barcelona were measured.
For each subsection, the length S of the arc (in modules), the degrees d of latitude, and the
latitude L of the midpoint were determined by the following astronomical observations:

Segment                     Arc length S    Latitude d    Midpoint L
Dunkirk to Pantheon           62472.59       2.18910°     49° 56′ 30″
Pantheon to Evaux             76145.74       2.66868°     47° 30′ 46″
Evaux to Carcassonne          84424.55       2.96336°     44° 41′ 48″
Carcassonne to Barcelona      52749.48       1.85266°     42° 17′ 20″

If the earth is assumed to be ellipsoidal, then to a good approximation it holds that

    z + y \sin^2(L) = S/d,

where z and y are unknown parameters. The meridian quadrant is then M = 90(z + y/2),
and the eccentricity e is found from 1/e = 3(z/y + 1/2). The least squares estimates are
1/e = 157.951374 and M = 2,564,801.46; see Stigler [1038, 1981].
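For readers who want to reproduce this fit, it is a two-parameter linear least squares problem. The following NumPy sketch is illustrative only (it is not taken from the book; variable names are chosen for this example). It converts the midpoint latitudes to radians and fits z and y to the four observations S/d.

```python
# Reproduce the 1795 meridian-arc fit: S/d = z + y*sin^2(L) for the four segments.
import numpy as np

S = np.array([62472.59, 76145.74, 84424.55, 52749.48])   # arc length (modules)
d = np.array([2.18910, 2.66868, 2.96336, 1.85266])        # latitude span (degrees)
L_dms = [(49, 56, 30), (47, 30, 46), (44, 41, 48), (42, 17, 20)]
L = np.radians([deg + mi/60 + s/3600 for deg, mi, s in L_dms])

A = np.column_stack([np.ones(4), np.sin(L)**2])            # columns for z and y
b = S / d                                                  # observed arc length per degree
(z, y), *_ = np.linalg.lstsq(A, b, rcond=None)

M = 90 * (z + y/2)             # meridian quadrant, about 2,564,801 modules
inv_e = 3 * (z/y + 1/2)        # 1/e, about 157.95
print(z, y, M, inv_e)
```

Small rounding differences from the historical values are to be expected.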
The early development of statistical methods for estimating parameters in linear models
is surveyed by Farebrother [397, 1999]. Detailed accounts of the invention and history of
least squares are given by Plackett [895, 1972], Stigler [1037, 1977], [1038, 1981], and
Goldstine [484, 1977].
Analyzing data sets of very large size is now a regular task in a broad variety of applica-
tions. The method of least squares, now over 200 years old, is still one of the most frequently
used methods for data fitting. Applications of least squares fitting cover a wide range of scien-
tific disciplines, such as geodesy, photogrammetry, tomography, molecular modeling, structural
2 A German language version of this paper containing his least squares work had appeared in 1806. The 1809 publi-

cation is in Latin, and an English language translation was not available until 1857.

analysis, signal processing, cluster analysis and pattern matching. Many of these lead to prob-
lems of large size and complexity.
An application of spectacular size for its time is the least squares adjustment of coordinates
of the geodetic stations comprising the North American Datum; see Kolata [704, 1978]. This
problem consists of about 6.5 million equations in 540,000 unknowns (= twice the number of
stations). Since the equations are mildly nonlinear, only two or three linearized problems of this
size have to be solved.
A more recent application is the determination of the gravity field of Earth from highly accu-
rate satellite measurements; see Baboulin et al. [49, 2009]. To model the gravitational potential,
a function of the form
L l
GM X  r l+1 X  
V (r, θ, λ) = Plm (cos θ) clm cos mλ + slm sin mλ
R R m=0
l=0

is used, where G is the gravitational constant, M is Earth’s mass, R is Earth’s reference radius,
and Plm are the normalized Legendre polynomials of order m. The normalized harmonic co-
efficients clm and slm are to be determined. For L = 300, the resulting least squares problem
involves 90,000 unknowns and millions of observations and needs to be solved on a daily basis.
The demand for fast and accurate least squares solvers continues to grow as problem scales
become larger and larger. Analyzing data sets of billions of records is now a regular task at
many companies and institutions. Such large-scale problems arise in a variety of fields, such as
genetics, image processing, geophysics, language processing, and high-frequency trading.

1.1.2 Statistical Preliminaries


Gauss gave the method a sound theoretical basis in his two-part memoir, “Theoria Combinatio-
nis” [445, 1821], [446, 1823]. These contain his definitive treatment of the area. (An English
translation is given by Stewart [447, 1995].) Gauss proves the optimality of the least squares
estimate without assuming that the random errors follow a particular distribution. This contri-
bution of Gauss was somehow neglected until being rediscovered by Markov [775, 1912]; see
Theorem 1.1.4.
Let x be a random variable with distribution function F(x), where F(x) is nondecreasing and right continuous and satisfies

    0 \le F(x) \le 1, \qquad F(-\infty) = 0, \qquad F(\infty) = 1.

The expected value μ and variance σ² of x are defined as

    \mu = E(x) = \int_{-\infty}^{\infty} x\,dF(x), \qquad \sigma^2 = E\bigl((x - \mu)^2\bigr) = \int_{-\infty}^{\infty} (x - \mu)^2\,dF(x).

Let x = (x_1, \ldots, x_n)^T be a vector of random variables, where the joint distribution of x_i and x_j is F(x_i, x_j). Then the covariance σ_{ij} between x_i and x_j is defined by

    \sigma_{ij} = \mathrm{cov}(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)] = \int_{x_i, x_j = -\infty}^{\infty} (x_i - \mu_i)(x_j - \mu_j)\,dF(x_i, x_j).

Then σ_{ij} = E(x_i x_j) − μ_i μ_j, where μ_i = E(x_i). The covariance matrix V ∈ R^{n×n} of the vector x is defined by

    V(x) = V = E[(x - \mu)(x - \mu)^T] = E(xx^T) - \mu\mu^T,

where μ = E(x) = (μ_1, \ldots, μ_n). We now prove some useful properties.

Lemma 1.1.1. Let z = F x, where F ∈ Rm×n is a given matrix, and let x ∈ Rn be a random
vector with E(x) = µ and covariance matrix V . Then

E(z) = F µ, V(z) = F V F T .

Proof. The first property follows directly from the definition of expected value. The second is
proved as

V(F x) = E[F (x − µ)(x − µ)T F T ] = F E[(x − µ)(x − µ)T ]F T = F V F T .

In the special case when F = f T is a row vector, z = f T x is a linear functional of x. Then,


if V = σ 2 I, V(z) = σ 2 f Tf . The following lemma is given without proof.

Lemma 1.1.2. Let A ∈ Rn×n be a symmetric matrix, and let y be a random vector with expected
value µ and covariance matrix V . Then

E(y T Ay) = µT Aµ + trace (AV ),

where trace (AV ) denotes the sum of diagonal elements of AV .

In the Gauss–Markov linear model it is assumed that the random vector of observations
b ∈ Rm is related to a parameter vector x ∈ Rn by a linear equation

Ax = b, E(b) = b̄, V(b) = σ 2 V, (1.1.2)

where V is the known covariance of a random error vector ϵ of mean zero. The standard model
has V = I, i.e., the errors bi − b̄i are assumed to be uncorrelated and to have the same variance.

Definition 1.1.3. A function f (x) of a random vector x is an unbiased estimate of a parameter


θ if E(f (x)) = θ. If c is a vector of constants, then cT x is called a best linear unbiased estimate
of θ if E(cT x) = θ and V(cT x) is minimized over all linear estimates.

Theorem 1.1.4 (Gauss–Markov Theorem). Consider the standard Gauss–Markov linear model
(1.1.2), where A ∈ Rm×n is a known matrix of rank n. Then the best linear unbiased estimate
of any linear functional c^T x is c^T x̂, where x̂ is the least squares estimator that minimizes the sum of squares r^T r, where r = b − Ax. Furthermore, x̂ is obtained by solving the symmetric positive definite system of normal equations

    A^T A x = A^T b.     (1.1.3)

Proof. See Theorem 1.1.5 and Zelen [1143, 1962, pp. 560–561].

In the literature, this result is often stated in less general form, where the errors are assumed to
be normally distributed or independent and identically distributed. However, Gauss only assumed
the weaker condition that the errors are uncorrelated with zero mean and equal variance.
From Lemma 1.1.1 it follows that the covariance matrix of the least squares estimate x̂ = (A^T A)^{-1} A^T b is

    V(x̂) = (A^T A)^{-1} A^T V(b) A (A^T A)^{-1} = \sigma^2 (A^T A)^{-1}.     (1.1.4)

Let σ² = E(s²), where s² is the quadratic form

    s^2 = \frac{1}{m - n}\, \hat{r}^T \hat{r}, \qquad \hat{r} = b - A\hat{x}.     (1.1.5)

It can be shown that s², and therefore also r̂, is uncorrelated with x̂, i.e.,

    \mathrm{cov}(\hat{r}, \hat{x}) = 0, \qquad \mathrm{cov}(s^2, \hat{x}) = 0.

From the normal equations it follows that A^T r̂ = A^T(b − Ax̂) = 0. This shows that there are n linear relations among the m components of r̂.
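As a small numerical illustration of (1.1.3)–(1.1.5), the following sketch (synthetic data, not from the book) computes x̂ from the normal equations, the unbiased variance estimate s², and checks that A^T r̂ = 0:

```python
# Synthetic check of (1.1.3)-(1.1.5): solve the normal equations, estimate sigma^2,
# and verify that A^T rhat = 0 (n linear relations among the m residual components).
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 200, 5, 0.1
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + sigma * rng.standard_normal(m)   # uncorrelated errors, equal variance

xhat = np.linalg.solve(A.T @ A, A.T @ b)          # normal equations (1.1.3)
rhat = b - A @ xhat
s2 = rhat @ rhat / (m - n)                        # unbiased estimate of sigma^2, (1.1.5)
cov_xhat = s2 * np.linalg.inv(A.T @ A)            # estimated covariance of xhat, cf. (1.1.4)

print(np.linalg.norm(A.T @ rhat))  # essentially zero: residual orthogonal to columns of A
print(s2)                          # roughly sigma**2 = 0.01
```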
Some applications lead to complex least squares problems to minimize

    \|r\|_2^2 = r^H r = \sum_{i=1}^{m} |r_i|^2, \qquad r = b - Ax,     (1.1.6)

where A ∈ Cm×n , b ∈ Cm , and rH denotes the conjugate transpose of r. An example in


complex stochastic processes is given by Miller [795, 1973]. In the complex case, the normal
equations are
AHAx = AH b, (1.1.7)
where AH is the conjugate transpose of A. Most of the results and algorithms for the real case
given in this book admit straightforward extensions to the complex case.
The least squares method can be generalized to a Gauss–Markov linear model with rank(A) =
n and a positive definite covariance matrix V(ϵ) = σ 2 V .
Gauss and Laplace also treated weighted least squares problems, where the covariance matrix
is diagonal: V = diag (v1 , . . . , vm ). The case with a general positive definite covariance matrix
V was first considered by Aitken [12, 1934]. The best unbiased linear estimate of x is then
obtained from the generalized least squares problem

    \min_x\, (Ax - b)^T V^{-1} (Ax - b)
and satisfies the generalized normal equations (see Section 3.1)

AT V −1 Ax = AT V −1 b. (1.1.8)

1.1.3 Characterization of Least Squares Solutions

Theorem 1.1.5. Given A ∈ Rm×n , m > n, b ∈ Rm , let

S = {x ∈ Rn | ∥Ax − b∥2 = min} (1.1.9)

be the set of all least squares solutions, where ∥ · ∥2 denotes the Euclidean vector norm ∥x∥2 =
(xT x)1/2 . Then x ∈ S if and only if the orthogonality condition AT (b − Ax) = 0 holds or,
equivalently, x satisfies the normal equations

ATAx = AT b. (1.1.10)

Proof. Assume that x̂ satisfies AT r̂ = 0, where r̂ = b − Ax̂. Then for any x ∈ Rn we have
r = b − Ax = r̂ + A(x̂ − x) ≡ r̂ + Ae. From this we obtain

rTr = (r̂ + Ae)T (r̂ + Ae) = r̂T r̂ + ∥Ae∥22 ,

which is minimized when x = x̂. On the other hand, suppose AT r̂ = z ̸= 0. If x = x̂ + ϵz, then
r = r̂ − ϵAz and
rT r = r̂T r̂ − 2ϵz T z + ϵ2 (Az)T Az < r̂T r̂
for sufficiently small ϵ > 0. Hence x̂ is not a least squares solution.

Theorem 1.1.6. The matrix ATA of normal equations is positive definite if and only if the col-
umns of A ∈ Rm×n are linearly independent, i.e., rank(A) = n. Then the matrix (ATA)−1
exists, and the unique least squares solution and residual are

x = (ATA)−1 AT b, r = b − A(ATA)−1 AT b. (1.1.11)

Proof. If the columns of A are linearly independent, then x ̸= 0 ⇒ Ax ̸= 0, and therefore


x ̸= 0 ⇒ xT ATAx = ∥Ax∥22 > 0. Hence ATA is positive definite. On the other hand, if the
columns are linearly dependent, then for some x0 ̸= 0 we have Ax0 = 0. Then xT0 ATAx0 = 0,
and ATA is only positive semidefinite.

For a matrix A ∈ Rm×n of rank r, the range (or column space) is the subspace

    R(A) = \{y = Ax \mid x \in R^n\} \subseteq R^m     (1.1.12)

of dimension r. Because AT b ∈ R(AT ) = R(ATA), the normal equations are consistent.


Hence, for A of any dimensions and rank there always exists at least one least squares solution.
From the normal equations AT (b − Ax) = AT r = 0, it follows that

Ax ∈ R(A), r = b − Ax ⊥ R(A).

Thus, x is a least squares solution if and only if the residual r = b−Ax is perpendicular to R(A).
This geometric characterization is shown in Figure 1.1.1. The nullspace of a matrix A ∈ Rm×n
is defined as the subspace

    N(A) = \{z \in R^n \mid Az = 0\} \subseteq R^n     (1.1.13)

of dimension n − r. A fundamental theorem in linear algebra says that

    R(A) \oplus N(A^T) = R^m, \qquad N(A) \oplus R(A^T) = R^n,     (1.1.14)

where Rm and Rn denote the space of m-vectors and n-vectors. Further, from the singular
value decomposition (SVD) of A in Section 1.2.2 it follows that R(A) ⊥ N (AT ) and N (A) ⊥
R(AT ).

Definition 1.1.7. A square matrix P ∈ Cm×m is a projector onto R(P ) if it satisfies P 2 = P


or, equivalently, P (I − P ) = 0. Such a matrix is also called idempotent. If also P T = P , then
P is an orthogonal projector.

Figure 1.1.1. Geometric interpretation of least squares property.



If P is an orthogonal projector, then P (I − P )b = (P − P 2 )b = 0 and

(I − P )2 = I − 2P + P 2 = (I − P ).

It follows that I − P is an orthogonal projector onto N(P). The rank-one modification of the unit matrix

    P = I - qq^T, \qquad q^T q = 1,     (1.1.15)
is called an elementary orthogonal projector. An arbitrary vector b ∈ Rm is uniquely decom-
posed by an orthogonal projector P into two components b = P b + (I − P )b = b1 + b2 , such
that b1 ⊥ b2 . If λ is an eigenvalue of P , then P x = λx for some nonzero x. From P 2 = P it
follows that λ2 = λ. Hence the eigenvalues of P are either 1 or 0, and rank(P ) = trace (P ).
The orthogonal projector onto a given subspace can be shown to be unique. If rank(A) = n,
we find from the normal equation that Ax = A(ATA)−1 AT b. Hence,

P = A(ATA)−1 AT

is the orthogonal projector onto R(A). The above results can be summarized as follows.

Theorem 1.1.8. The following statements are equivalent:

1. x solves the least squares problem minx ∥Ax − b∥2 .

2. x satisfies the normal equation ATAx = AT b.

3. The residual r = b − Ax is orthogonal to R(A).

4. x solves the consistent linear system Ax = PR(A) b, where PR(A) is the orthogonal pro-
jector onto R(A).

If r = rank(A) < n, then A has a nullspace of dimension n − r > 0. Then the problem
minx ∥Ax − b∥2 is underdetermined, and its solution is not unique. If x̂ is a particular least
squares solution, then the set of all least squares solutions is S = {x = x̂ + z | z ∈ N (A)}. In
this case we can seek the least squares solution of least-norm ∥x∥2 , i.e., solve

    \min_{x \in S} \|x\|_2, \qquad S = \{x \in R^n \mid \|b - Ax\|_2 = \min\}.     (1.1.16)

This solution is always unique.

Theorem 1.1.9. Let x be a solution of the problem minx ∥Ax − b∥2 . Then x is a least squares
solution of least-norm if and only if x ⊥ N (A) or, equivalently, x = AT z, z ∈ Rm .

Proof. Let x̂ be any least squares solution, and set x = x̂ + z, where z ∈ N (A). Then Az = 0,
so r = b − Ax̂ = b − Ax, and x̂ is also a least squares solution. By the Pythagorean theorem,
∥x∥22 = ∥x̂∥22 + ∥z∥22 , which is minimized when z = 0.

If the system Ax = b is consistent, then the least-norm solution satisfies the normal equations
of second kind,
x = AT z, AATz = b. (1.1.17)
If rank(A) = m, then AAT is nonsingular, and the solution to (1.1.17) is unique.
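A minimal sketch (synthetic data, not from the book) of the least-norm solution computed from the normal equations of the second kind (1.1.17):

```python
# Minimum-norm solution of a consistent underdetermined system via (1.1.17):
# x = A^T z, where (A A^T) z = b and A has full row rank.
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 10                      # more unknowns than equations, rank(A) = m
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)        # any b is consistent since A has full row rank

z = np.linalg.solve(A @ A.T, b)
x = A.T @ z                       # least-norm solution; x lies in the row space of A

print(np.linalg.norm(A @ x - b))                              # essentially zero: x solves Ax = b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # matches lstsq's min-norm solution
```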

From this result and Theorem 1.1.8 we have the following characterization of a solution to
the least squares problem (1.1.16). It includes both the over- and underdetermined cases.

Theorem 1.1.10. For A of any dimension and rank, the least squares solution of minimum norm
∥x∥2 is unique and characterized by the conditions

r = b − Ax ⊥ R(A) and x ⊥ N (A). (1.1.18)

The normal equations ATAx = AT b are equivalent to the linear equations AT r = 0, and
r = b − Ax. Together, these form a symmetric augmented system of m + n equations

    \begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} y \\ x \end{pmatrix} = \begin{pmatrix} b \\ c \end{pmatrix},     (1.1.19)

where y = r and c = 0. The augmented system is nonsingular if and only if rank(A) = n. Then its inverse is

    \begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix}^{-1} = \begin{pmatrix} I - P_{R(A)} & A(A^T A)^{-1} \\ (A^T A)^{-1} A^T & -(A^T A)^{-1} \end{pmatrix},     (1.1.20)

where P_{R(A)} = A(A^T A)^{-1} A^T is the orthogonal projector onto R(A).

Theorem 1.1.11. If rank(A) = n, then the augmented system (1.1.19) has a unique solution
that solves the primal and dual least squares problems,
    \min_{x \in R^n}\ \tfrac{1}{2} \|b - Ax\|_2^2 + c^T x,     (1.1.21)

    \min_{y \in R^m}\ \tfrac{1}{2} \|y - b\|_2^2 \ \text{subject to}\ A^T y = c.     (1.1.22)

Proof. Differentiating (1.1.21) gives A^T(b − Ax) = c, which with y = b − Ax is the augmented system (1.1.19). This system is also obtained by differentiating the Lagrangian

    L(x, y) = \tfrac{1}{2} (y - b)^T (y - b) + x^T (A^T y - c)
for (1.1.22) and equating to zero. Here x is the vector of Lagrange multipliers.
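The equivalence stated in Theorem 1.1.11 is easy to verify numerically; the following sketch (synthetic data, not from the book) solves the augmented system (1.1.19) with c = 0 and compares the result with a standard least squares solve:

```python
# Solve the augmented system (1.1.19) with c = 0 and check that it returns
# the least squares solution x and the residual y = r = b - Ax.
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 8
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

K = np.block([[np.eye(m), A],
              [A.T, np.zeros((n, n))]])      # symmetric (m+n) x (m+n) system matrix
rhs = np.concatenate([b, np.zeros(n)])       # right-hand side (b, c) with c = 0

sol = np.linalg.solve(K, rhs)
y, x = sol[:m], sol[m:]                      # y is the residual, x the LS solution

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ls), np.allclose(y, b - A @ x_ls))   # True True
```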

1.2 Some Fundamental Matrix Decompositions


1.2.1 The Cholesky Factorization
The classical method for solving a linear least squares problem minx ∥Ax − b∥2 , A ∈ Rm×n , is
to form and solve the symmetric normal equations ATAx = AT b. If rank(A) = n, then x ̸= 0
implies that Ax ̸= 0. Hence

xTATAx > 0 ∀x ∈ Rn , x ̸= 0, (1.2.1)

and AT A is positive definite. Conversely, a symmetric positive definite matrix is nonsingular.


If it were singular, there would be a nonzero vector x such that Ax = 0. But then x^T A x = 0, which is a
contradiction. To solve the normal equations, Gauss developed an elimination process that uses
pivots from the diagonal; see Stewart [1029, 1995]. Then all reduced matrices are symmetric,
and only elements on and below (say) the main diagonal have to be computed. This reduces the
number of operations and amount of storage needed by half.

Theorem 1.2.1 (Cholesky Factorization). Let the matrix C ∈ Cn×n be Hermitian positive
definite. Then there exists a unique upper triangular matrix R = (rij ) with real positive diagonal
elements called the Cholesky factor of C such that

C = RHR. (1.2.2)

Proof. The proof is by induction. The result is clearly true for n = 1. If it is true for some
n − 1, the leading principal submatrix C_{n-1} of C has a unique Cholesky factorization C_{n-1} = R_{n-1}^H R_{n-1}, where R_{n-1} is nonsingular. Then

    C_n = \begin{pmatrix} C_{n-1} & d \\ d^H & \gamma \end{pmatrix}
        = \begin{pmatrix} R_{n-1}^H & 0 \\ r^H & \rho \end{pmatrix}
          \begin{pmatrix} R_{n-1} & r \\ 0 & \rho \end{pmatrix}     (1.2.3)

holds if r and ρ satisfy

    R_{n-1}^H r = d, \qquad r^H r + \rho^2 = \gamma.     (1.2.4)

The first equation has a unique solution r. It remains to show that \gamma - r^H r > 0. From the positive definiteness of C it follows that

    0 < \begin{pmatrix} -r^H R_{n-1}^{-H} & 1 \end{pmatrix} C_n \begin{pmatrix} -R_{n-1}^{-1} r \\ 1 \end{pmatrix}
      = r^H R_{n-1}^{-H} C_{n-1} R_{n-1}^{-1} r - 2 r^H R_{n-1}^{-H} d + \gamma
      = r^H r - 2 r^H r + \gamma = \gamma - r^H r.

Hence \rho = (\gamma - r^H r)^{1/2} is uniquely determined.

Substituting the Cholesky factorization C = AHA = RHR into the normal equations gives
RH Rx = d, where d = AH b. Hence, the solution is obtained by solving two triangular systems,

RH z = d, Rx = z. (1.2.5)

This method is easy to implement and often faster than other direct solution methods. It works
well unless A is ill-conditioned.
For a consistent underdetermined linear system Ax = b, the solution to the least-norm prob-
lem min ∥x∥2 subject to Ax = b satisfies the normal equations of the second kind,

x = AH z, AAH z = b.

If A has full row rank, then AAH is symmetric positive definite, and the Cholesky factorization
AAH = RHR exists. Then z is obtained by solving

RH w = b, Rz = w. (1.2.6)
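Both uses of the Cholesky factor can be written in a few lines; the sketch below is illustrative only (it uses SciPy's triangular solvers and synthetic data):

```python
# Least squares via the normal equations (1.2.5) and the least-norm solution of a
# consistent underdetermined system via (1.2.6), both using a Cholesky factor R.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(3)

# Overdetermined: minimize ||Ax - b||_2 with A of full column rank.
A = rng.standard_normal((30, 6))
b = rng.standard_normal(30)
R = cholesky(A.T @ A)                         # upper triangular, A^T A = R^T R
z = solve_triangular(R, A.T @ b, trans='T')   # R^T z = A^T b
x = solve_triangular(R, z)                    # R x = z
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))

# Underdetermined: minimum-norm solution of a consistent system Bx = c.
B = rng.standard_normal((6, 30))
c = rng.standard_normal(6)
R = cholesky(B @ B.T)                         # B B^T = R^T R
w = solve_triangular(R, c, trans='T')         # R^T w = c
z2 = solve_triangular(R, w)                   # R z = w
x_min = B.T @ z2                              # x = B^T z
print(np.allclose(B @ x_min, c))
```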

It is often preferable to work with the Cholesky factorization of the cross-product of the
extended matrix ( A  b ),

    (A\ \ b)^H (A\ \ b) = \begin{pmatrix} A^H A & A^H b \\ b^H A & b^H b \end{pmatrix},     (1.2.7)

when solving a least squares problem. If rank(A) = n, then the Cholesky factor of the cross-product (1.2.7),

    S = \begin{pmatrix} R & z \\ 0 & \rho \end{pmatrix},     (1.2.8)

exists, where we may have ρ = 0. Forming S^H S shows that

    A^H A = R^H R, \qquad R^H z = A^H b, \qquad b^H b = z^H z + \rho^2.

Hence, R is the Cholesky factor of A^H A, and the least squares solution is obtained from Rx = z. Since r = b − Ax is orthogonal to Ax, we have

    \|Ax\|_2^2 = (r + Ax)^H Ax = b^H Ax = b^H A R^{-1} R^{-H} A^H b = z^H z,

and hence \|r\|_2^2 = \rho^2 = b^H b - z^H z and \|b - Ax\|_2 = \rho.
Let A ∈ Cm×n have full column rank, let AHA = RHR be its Cholesky factorization, and
set Q1 = AR−1 . Then
    A = Q_1 R, \qquad Q_1^H Q_1 = I_n,     (1.2.9)

is the Cholesky QR factorization of A. The orthonormal factor Q_1 can be computed as the unique solution of the lower triangular matrix equation R^H Q_1^H = A^H by forward substitution. The normal equation simplifies to R^H Q_1^H Q_1 R x = R^H R x = R^H Q_1^H b, or

    Rx = Q_1^H b.

In the real case, the arithmetic cost of this Cholesky QR algorithm is 2mn^2 + n^3/3 flops. More accurate methods that compute the QR factorization (1.2.9) directly from A are described in Sections 2.2.2–2.2.4.
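A compact sketch of the Cholesky QR algorithm just described (real case, illustrative only):

```python
# Cholesky QR: factor A = Q1 R via the Cholesky factor of A^T A, then solve Rx = Q1^T b.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cholesky_qr(A):
    """Return Q1, R with A = Q1 @ R and Q1^T Q1 = I (A must have full column rank)."""
    R = cholesky(A.T @ A)                    # A^T A = R^T R, R upper triangular
    Q1 = solve_triangular(R, A.T, trans='T').T   # solve R^T Q1^T = A^T, i.e. Q1 = A R^{-1}
    return Q1, R

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

Q1, R = cholesky_qr(A)
x = solve_triangular(R, Q1.T @ b)            # least squares solution from Rx = Q1^T b

print(np.linalg.norm(Q1.T @ Q1 - np.eye(5)))               # small: columns nearly orthonormal
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
```

As noted above, this is cheap but inherits the sensitivity of the normal equations: for ill-conditioned A the computed Q_1 can lose orthogonality, which is one reason the more accurate QR methods of Sections 2.2.2–2.2.4 are of interest.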
We now state some useful properties of Hermitian positive definite matrices. From the proof
of Theorem 1.2.1 follows a well-known characterization.

Lemma 1.2.2 (Sylvester’s Criterion). Let Ck ∈ Ck×k , k = 1, . . . , n, be the leading principal


submatrices of the Hermitian matrix C ∈ Cn×n . Then C is positive definite if and only if
det(Ck ) > 0, k = 1, . . . , n.

Theorem 1.2.3. Let C ∈ C^{n×n} be Hermitian positive definite, and let X ∈ C^{n×p} have full column rank. Then X^H C X is positive definite. In particular, any principal p × p submatrix

    \begin{pmatrix} c_{i_1 i_1} & \cdots & c_{i_1 i_p} \\ \vdots & & \vdots \\ c_{i_p i_1} & \cdots & c_{i_p i_p} \end{pmatrix} \in C^{p \times p}, \qquad 1 \le p < n,

is positive definite. For p = 1 it follows that all diagonal elements of C are real and positive.

Proof. Suppose C is positive definite, z ̸= 0, and y = Xz. Then since X has full column rank,
it follows that y ̸= 0 and
z H (X HCX)z = y HCy > 0.
The result now follows because any principal submatrix of C can be written as X HCX, where the
columns of X are taken to be the columns k = ij , j = 1, . . . , p, of the identity
matrix.

Theorem 1.2.4. The element of maximum magnitude of a Hermitian positive definite matrix C = (c_{ij}) ∈ C^{n×n} lies on the diagonal.

Proof. From Theorem 1.2.3 and Sylvester's criterion it follows that

    \det \begin{pmatrix} c_{ii} & c_{ij} \\ \bar{c}_{ij} & c_{jj} \end{pmatrix} = c_{ii} c_{jj} - |c_{ij}|^2 > 0, \qquad 1 \le i \ne j \le n.     (1.2.10)

Hence |c_{ij}|^2 < c_{ii} c_{jj} \le \max_{1 \le i \le n} c_{ii}^2.

1.2.2 SVD and Related Eigenvalue Decompositions


The singular value decomposition (SVD) provides a diagonal form of a complex bilinear form
xT Ay, A ∈ Cm×n , under a unitary equivalence transformation. It has numerous applications in
areas such as signal and image processing, control theory, pattern recognition, and time-series
analysis. The use of the SVD in numerical computations first became practical with the develop-
ment of the efficient and stable QRSVD algorithm by Golub and Reinsch [507, 1971].

Theorem 1.2.5 (The Singular Value Decomposition). For any matrix A ∈ Cm×n of rank r
there exist unitary matrices U = (u1 , . . . , um ) ∈ Cm×m and V = (v1 , . . . , vn ) ∈ Cn×n such
that

    A = U \Sigma V^H = U \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} V^H,     (1.2.11)
where Σ1 = diag (σ1 , σ2 , . . . , σr ) ∈ Rr×r . The σi are the singular values of A, and ui ∈ Cm
and vi ∈ Cn are the left and right singular vectors. In the following we assume that the singular
values are ordered so that
σ1 ≥ σ2 ≥ · · · ≥ σr > 0.

Proof. We give an inductive proof that constructs the SVD from its largest singular value σ1 and
the associated left and right singular vectors. Let v1 ∈ Cn with ∥v1 ∥2 = 1 be a unit vector such
that
∥Av1 ∥2 = ∥A∥2 = σ1 ,
where σ1 is real and positive. The existence of such a vector follows from the definition of a
subordinate matrix norm ∥A∥. If σ1 = 0, then A = 0, and (1.2.11) holds with Σ = 0 and
arbitrary unitary matrices U and V . If σ1 > 0, set u1 = (1/σ1 )Av1 ∈ Cm , and let

V = ( v1 V1 ) ∈ Cn×n , U = ( u1 U1 ) ∈ Cm×m

be unitary matrices. (Recall that it is always possible to extend a unitary set of vectors to a
unitary basis for the whole space.) Since U_1^H A v_1 = \sigma_1 U_1^H u_1 = 0, it follows that U^H A V has the structure

    A_1 \equiv U^H A V = \begin{pmatrix} \sigma_1 & w^H \\ 0 & B \end{pmatrix},

where w^H = u_1^H A V_1 and B = U_1^H A V_1 \in C^{(m-1)\times(n-1)}.
From the two inequalities

    \|A_1\|_2\, (\sigma_1^2 + w^H w)^{1/2} \ge \left\| A_1 \begin{pmatrix} \sigma_1 \\ w \end{pmatrix} \right\|_2 = \left\| \begin{pmatrix} \sigma_1^2 + w^H w \\ Bw \end{pmatrix} \right\|_2 \ge \sigma_1^2 + w^H w

it follows that ∥A1 ∥2 ≥ (σ12 + wH w)1/2 . But U and V are unitary and ∥A1 ∥2 = ∥A∥2 = σ1 .
Hence w = 0, and the proof can now be completed by an induction argument on the smallest
dimension min(m, n).

Instead of the full SVD (1.2.11), it often suffices to consider the compact or economy size SVD,

    A = U_1 \Sigma_1 V_1^H = \sum_{i=1}^{r} \sigma_i u_i v_i^H,     (1.2.12)

where U_1 ∈ C^{m×r} and V_1 ∈ C^{n×r} contain the singular vectors corresponding to nonzero singular values. If A has full column rank, then V_1 = V. Similarly, if A has full row rank, then U_1 = U. By (1.2.12), A is decomposed into a sum of r rank-one matrices.
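In NumPy the full and economy-size decompositions correspond to the full_matrices option of numpy.linalg.svd; the rank-one expansion (1.2.12) can be checked directly. The sketch below is illustrative only (synthetic data, ad hoc rank tolerance):

```python
# Full vs. economy-size SVD, and the rank-one expansion (1.2.12).
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 5))     # 8 x 5, rank 3

U, s, Vh = np.linalg.svd(A, full_matrices=True)        # full: U is 8x8, Vh is 5x5
U1, s1, V1h = np.linalg.svd(A, full_matrices=False)    # economy: U1 is 8x5, V1h is 5x5

r = np.sum(s > 1e-12 * s[0])                           # numerical rank (here 3)
A_sum = sum(s[i] * np.outer(U1[:, i], V1h[i]) for i in range(r))   # r rank-one terms
print(r, np.allclose(A, A_sum))                        # 3 True
```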

Like the eigenvalues of a real Hermitian matrix, the singular values have a min-max charac-
terization.

Theorem 1.2.6. Let A ∈ C^{m×n} have singular values σ_1 ≥ σ_2 ≥ · · · ≥ σ_p ≥ 0, p = min(m, n), and let S be a linear subspace of C^n. Then

    \sigma_i = \min_{\dim(S) = n - i + 1}\ \max_{\substack{x \in S \\ x \ne 0}} \frac{\|Ax\|_2}{\|x\|_2}.     (1.2.13)

Proof. The result is established similarly to the characterization of the eigenvalues of a symmetric
matrix in the Courant–Fischer theorem; see Horn and Johnson [641, 2012].

If A ∈ Cm×n represents a linear mapping from Cn to Cm , the significance of Theorem 1.2.5


is that there exists a unitary basis in each of these spaces, with respect to which this mapping is
represented by a generalized diagonal matrix Σ with real nonnegative elements.
The singular values of A are uniquely determined. The singular vectors ui and vi corre-
sponding to a singular value σi are unique if and only if σi is simple. To a singular value σi of
multiplicity p > 1 there correspond p singular vectors that can be chosen as any unitary basis
for the unique subspace that they span. The SVD gives unitary bases for the four fundamental
subspaces associated with A. It is easy to verify that
R(A) = span{u1 , . . . , ur }, N (A) = span{vr+1 , . . . , vn }, (1.2.14)
H H
R(A ) = span{v1 , . . . , vr }, N (A ) = span{ur+1 , . . . , um }. (1.2.15)
Note that N (AH ) is the unitary complement in Cm of R(A), and N (A) is the unitary comple-
ment of R(AH ) in Cn :
N (A)⊥ = R(AH ), R(A)⊥ = N (AH ).
Once a singular vector vi or ui corresponding to a simple singular value σi > 0 has been
determined, ui or vi is uniquely determined from
σj uj = Avj , σj vj = AH uj . (1.2.16)
There is a close relationship between the SVD and the Hermitian (or real symmetric) eigen-
value problems for AHA and AAH . If A = U ΣV H ∈ Cm×n is the SVD, then these eigenvalue
decompositions are
    A^H A = V \Sigma^T \Sigma V^H, \qquad AA^H = U \Sigma \Sigma^T U^H.     (1.2.17)
It follows that the singular values σi are the nonnegative square roots of the eigenvalues of AHA
and AAH .
The Hermitian matrix

    C = \begin{pmatrix} 0 & A \\ A^H & 0 \end{pmatrix}, \qquad A \in C^{m \times n},     (1.2.18)
is often referred to as the Jordan–Wielandt matrix. The following theorem is implicit in Jordan’s
derivation of the SVD [676, 1874].

Theorem 1.2.7 (Jordan–Wielandt). Let A ∈ Cm×n , m ≥ n, be a matrix of rank r, and let its
SVD be A = UΣV^H, where Σ = diag(Σ_1, 0), U = (U_1  U_2), and V = (V_1  V_2). Then

    C = \begin{pmatrix} 0 & A \\ A^H & 0 \end{pmatrix}
      = P \begin{pmatrix} \Sigma_1 & 0 & 0 \\ 0 & -\Sigma_1 & 0 \\ 0 & 0 & 0 \end{pmatrix} P^H,     (1.2.19)

where P is the unitary matrix

    P = \frac{1}{\sqrt{2}} \begin{pmatrix} U_1 & U_1 & \sqrt{2}\,U_2 & 0 \\ V_1 & -V_1 & 0 & \sqrt{2}\,V_2 \end{pmatrix}.     (1.2.20)

The eigenvalues of C are ±σ_i, i = 1, \ldots, r, and zero is repeated (m + n − 2r) times.
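A quick numerical confirmation of Theorem 1.2.7 (a sketch for the real case, not from the book):

```python
# Eigenvalues of C = [[0, A], [A^T, 0]] are +/- the singular values of A, padded with zeros.
import numpy as np

rng = np.random.default_rng(6)
m, n = 6, 4
A = rng.standard_normal((m, n))

C = np.block([[np.zeros((m, m)), A],
              [A.T, np.zeros((n, n))]])
eig = np.sort(np.linalg.eigvalsh(C))
sigma = np.linalg.svd(A, compute_uv=False)

expected = np.sort(np.concatenate([sigma, -sigma, np.zeros(m + n - 2*len(sigma))]))
print(np.allclose(eig, expected))   # True
```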

The use of C for computing the SVD of A was pioneered by Lanczos [717, 1958], [714,
1961, Chap. 3]. Note that the matrix

    C^2 = \begin{pmatrix} AA^H & 0 \\ 0 & A^H A \end{pmatrix}     (1.2.21)

has block diagonal form. Such a matrix is said to be two-cyclic.


The following example illustrates that explicitly forming AH A (or AAH ) can lead to a severe
loss of accuracy in the smaller singular values.

Example 1.2.8. Let A = (u1 , u2 ), where u1 and u2 are unit vectors such that 0 < uT1 u2 =
cos γ, where γ is the angle between the vectors u_1 and u_2. The eigenvalues of the matrix

    A^T A = \begin{pmatrix} 1 & \cos\gamma \\ \cos\gamma & 1 \end{pmatrix}

are the roots of the equation (\lambda - 1)^2 = \cos^2\gamma and equal \lambda_1 = 2\cos^2(\gamma/2), \lambda_2 = 2\sin^2(\gamma/2). Hence the singular values of A are

    \sigma_1 = \sqrt{2}\,\cos(\gamma/2), \qquad \sigma_2 = \sqrt{2}\,\sin(\gamma/2).

The right singular vectors of A are the eigenvectors of A^T A,

    v_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
    v_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 \\ 1 \end{pmatrix}.

The left singular vectors can be determined from (1.2.16). However, if γ is less than the square root of the unit roundoff, then numerically cos γ ≈ 1 − γ²/2 = 1. Then the computed eigenvalues of A^T A are 0 and 2, i.e., the smallest singular value of A has been lost!
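The effect is easy to observe in double precision; the sketch below (not from the book) uses γ = 10⁻⁹, which is below the square root of the unit roundoff:

```python
# Forming A^T A squares the condition number: for a tiny angle gamma the smaller
# singular value sqrt(2)*sin(gamma/2) is lost, while the SVD of A itself retains it.
import numpy as np

gamma = 1e-9                       # below sqrt(unit roundoff) ~ 1.5e-8
u1 = np.array([1.0, 0.0])
u2 = np.array([np.cos(gamma), np.sin(gamma)])   # unit vector with u1^T u2 = cos(gamma)
A = np.column_stack([u1, u2])

sigma_true = np.array([np.sqrt(2)*np.cos(gamma/2), np.sqrt(2)*np.sin(gamma/2)])
sigma_svd = np.linalg.svd(A, compute_uv=False)                   # accurate
sigma_ata = np.sqrt(np.abs(np.linalg.eigvalsh(A.T @ A)))[::-1]   # via A^T A

print(sigma_true)   # approximately [1.414e+00, 7.07e-10]
print(sigma_svd)    # close to sigma_true
print(sigma_ata)    # smaller value comes out as 0: accuracy lost
```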

Notes and references


Beltrami [98, 1873] derived the SVD for a real, square, nonsingular matrix having distinct sin-
gular values. A year later, Jordan [676, 1874] independently published a derivation of the SVD
that handled multiple singular values and gave a variational characterization of the largest sin-
gular value as the maximum of a function. Picard [893, 1909] seems to have been the first to
call the numbers σi singular values. Autonne [42, 1913] extended the SVD to complex matrices.
The generalization to singular and rectangular matrices appeared in Autonne [43, 1915]. More
detailed accounts of the history of the SVD are given by Horn and Johnson [640, 1991, Sect. 3.0]
and Stewart [1026, 1993].

1.2.3 The Pseudoinverse and Generalized Inverses


By Theorem 1.1.10, for A ∈ Cm×n of any dimension and rank, the unique least squares solution
of minimum norm is characterized by r = b − Ax ⊥ R(A), and x ⊥ N (A). The full SVD

A = U ΣV H gives unitary bases for these two subspaces. Thus the SVD is a perfect tool for
solving least squares problems.

Theorem 1.2.9. The least squares problem

    \min_{x \in S} \|x\|_2, \qquad S = \{x \in R^n \mid \|b - Ax\|_2 = \min\},

has a unique solution that can be written as x = A†b, where

    A^{\dagger} = V \begin{pmatrix} \Sigma_1^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^H     (1.2.22)

is the pseudoinverse of A.

Proof. Let z = V^H x = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}, c = U^H b = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}, where z_1, c_1 \in C^r. Then

    \|b - Ax\|_2 = \|U^H (b - A V V^H x)\|_2
                 = \left\| \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} - \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} \right\|_2
                 = \left\| \begin{pmatrix} c_1 - \Sigma_1 z_1 \\ c_2 \end{pmatrix} \right\|_2.

Thus, the residual norm is minimized for z_1 = \Sigma_1^{-1} c_1 and any z_2. The choice z_2 = 0 minimizes \|z\|_2 and hence also \|x\|_2 = \|Vz\|_2 = \|z\|_2.
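A direct implementation of (1.2.22) for the real case takes a few lines of NumPy; the sketch below is illustrative only (the rank tolerance is an ad hoc choice) and agrees with numpy.linalg.pinv:

```python
# Minimum-norm least squares solution via the SVD, as in (1.2.22):
# x = V1 Sigma1^{-1} U1^T b, keeping only singular values above a tolerance.
import numpy as np

def svd_lstsq(A, b, rtol=1e-12):
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    r = np.sum(s > rtol * s[0])                     # numerical rank
    return Vh[:r].T @ ((U[:, :r].T @ b) / s[:r])

rng = np.random.default_rng(7)
A = rng.standard_normal((10, 6)) @ rng.standard_normal((6, 8))   # 10 x 8, rank 6
b = rng.standard_normal(10)

x = svd_lstsq(A, b)
print(np.allclose(x, np.linalg.pinv(A) @ b))   # matches the Moore-Penrose solution
print(np.linalg.norm(A.T @ (b - A @ x)))       # essentially zero: r = b - Ax is orthogonal to R(A)
```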

The pseudoinverse was introduced by Moore and rediscovered by Penrose [889, 1955]. The
pseudoinverse is therefore also called the Moore–Penrose inverse. Penrose gave it an elegant
algebraic characterization.

Theorem 1.2.10 (The Penrose Conditions). The pseudoinverse X is uniquely determined by the following four conditions:

    (1) AXA = A,          (2) XAX = X,
    (3) (AX)^H = AX,      (4) (XA)^H = XA.     (1.2.23)
It is easily verified that A† in (1.2.22) satisfies the Penrose conditions. By the uniqueness of A†, this does not depend on the particular choice of U and V in the SVD. It follows easily from Theorem 1.2.9 that A† minimizes ∥AX − I∥F.
If A is nonsingular, then A† = A−1 , so (1.2.22) is a generalization of the usual inverse. From
the SVD it easily follows that
A† = (AH A)† AH = AH (AAH )† . (1.2.24)
In the special case that rank(A) = n, we have
A† = (AH A)−1 AH , (AH )† = A(AH A)−1 . (1.2.25)
The pseudoinverse of a scalar is

    \sigma^{\dagger} = \begin{cases} 1/\sigma & \text{if } \sigma \ne 0, \\ 0 & \text{if } \sigma = 0. \end{cases}     (1.2.26)

This shows the important fact that the pseudoinverse A† is not a continuous function of A, unless we allow only perturbations that do not change the rank of A. The pseudoinverse has the property

    A^{\dagger} = \lim_{\delta \to 0} (A^H A + \delta I)^{-1} A^H.     (1.2.27)

The following properties of the pseudoinverse follow from (1.2.22).

1. (A† )† = A;

2. (A† )H = (AH )† ;

3. (αA)† = α† A† ;

4. (AH A)† = A† (A† )H ;

5. if U and V are unitary, (U AV H )† = V A† U H ;

6. if $A = \sum_i A_i$, where $A_i A_j^H = 0$ and $A_i^H A_j = 0$ for $i \neq j$, then $A^{\dagger} = \sum_i A_i^{\dagger}$;

7. if A is normal (AAH = AH A), then A† A = AA† and (An )† = (A† )n ;

8. A, AH , A† , and A† A all have rank equal to trace (A† A).

For the pseudoinverse, the relations $AA^{\dagger} = A^{\dagger}A$ and $(AB)^{\dagger} = B^{\dagger} A^{\dagger}$ are not in general
true. For example, let $A = (1\;\; 0)$ and $B = (1\;\; 1)^T$. Then $AB = 1$, but
\[
  B^{\dagger} A^{\dagger} = \tfrac{1}{2}\,(1\;\; 1) \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \tfrac{1}{2}.
\]
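This counterexample is easily checked numerically; a small sketch (Python/NumPy, not from the book):

import numpy as np

A = np.array([[1.0, 0.0]])                      # 1 x 2
B = np.array([[1.0], [1.0]])                    # 2 x 1
print(np.linalg.pinv(A @ B))                    # (AB)^+ = [[1.0]]
print(np.linalg.pinv(B) @ np.linalg.pinv(A))    # B^+ A^+ = [[0.5]]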

Necessary and sufficient conditions for the identity (AB)† = B † A† to hold have been given by
Greville [536, 1966]. The following theorem gives an important sufficient condition.

Theorem 1.2.11. If A ∈ Cm×p , B ∈ Cp×n , and rank(A) = rank(B) = p, then

(AB)† = B † A† = B H (BB H )−1 (AH A)−1 AH . (1.2.28)

Proof. The last equality follows from (1.2.25). The first equality is verified by showing that the
four Penrose conditions are satisfied.

The pseudoinverse and the singular vectors of A give simple expressions for orthogonal pro-
jections onto the four fundamental subspaces of A. The following expressions are easily verified
using the Penrose conditions and the SVD:

PR(A) = AA† = U1 U1H , PN (AH ) = I − AA† = U2 U2H ,


PR(AH ) = A† A = V1 V1H , PN (A) = I − A† A = V2 V2H , (1.2.29)

where U1 = (u1 , . . . , ur ) and V1 = (v1 , . . . , vr ). From this we get the following expression for
the inverse of the augmented system matrix when A has full column rank:
\[
  \begin{pmatrix} I & A \\ A^H & 0 \end{pmatrix}^{-1}
  = \begin{pmatrix} I - AA^{\dagger} & (A^{\dagger})^H \\ A^{\dagger} & -(A^H\!A)^{-1} \end{pmatrix}.
  \tag{1.2.30}
\]
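As an illustration, the projectors in (1.2.29) are easily formed from the SVD. The following sketch (Python/NumPy; the rank tolerance is an assumption, not part of the theory) is one way to do it:

import numpy as np

def fundamental_projectors(A, tol=1e-12):
    U, s, Vh = np.linalg.svd(A)
    r = int(np.sum(s > tol * max(s[0], 1.0)))   # numerical rank
    U1 = U[:, :r]
    V1 = Vh[:r, :].conj().T
    P_RA  = U1 @ U1.conj().T                    # P_R(A)   = U1 U1^H = A A^+
    P_NAH = np.eye(A.shape[0]) - P_RA           # P_N(A^H) = I - A A^+
    P_RAH = V1 @ V1.conj().T                    # P_R(A^H) = V1 V1^H = A^+ A
    P_NA  = np.eye(A.shape[1]) - P_RAH          # P_N(A)   = I - A^+ A
    return P_RA, P_NAH, P_RAH, P_NA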

If only some of the four Penrose conditions hold, the corresponding matrix is called a gen-
eralized inverse. Such inverses have been extensively analyzed; see Nashed [823, 1976]. Any
matrix A− satisfying the first Penrose condition AA−A = A is called an inner inverse or {1}-
inverse. If it satisfies the second condition A−AA− = A− , it is called an outer inverse or a
{2}-inverse.

Let A− be a {1}-inverse of A. Then for all b such that the system Ax = b is consistent,
x = A− b is a solution. The general solution can be written

x = A− b + (I − A−A)y, y ∈ Cn . (1.2.31)

For any {1}-inverse of A, it holds that

(AA− )2 = AA−AA− = AA− , (A−A)2 = A−AA−A = A−A.

This shows that AA− and A−A are idempotent and therefore (in general, oblique) projectors; see
Section 3.1.4. The residual norm ∥Ax − b∥2 is minimized when x satisfies the normal equations
AHAx = AH b. Suppose that a {1}-inverse A− also satisfies the third Penrose condition

(AA− )H = AA− . (1.2.32)

Then AA− is the orthogonal projector onto R(A) and A− is called a least squares inverse. We
have
AH = (AA−A)H = AHAA− ,
which shows that x = A− b satisfies the normal equations and therefore is a least squares solution.
The following dual result also holds. If $A^-$ is a generalized inverse and $(A^-A)^H = A^-A$,
then $A^-A$ is the orthogonal projector onto $R(A^H)$, and $A^-$ is called a least-norm inverse. If
Ax = b is consistent, the unique solution for which ∥x∥2 is smallest satisfies the normal equa-
tions of the second kind,
x = AH z, AAH z = b.
For a least-norm inverse A− it holds that

AH = (AA−A)H = A−AAH .

Hence, x = AH z = A− (AAH )z = A− b, which shows that x = A− b is the solution of smallest


norm. Conversely, let A− be such that, whenever Ax = b has a solution, x = A− b is a least-norm
solution. Then A− is a least-norm inverse.
Finally, a warning: Manipulations with generalized inverses can hide intrinsic computational
difficulties associated with nearly rank-deficient matrices and should be used with caution.

Notes and references

The notion of inverse of a matrix was generalized to include all matrices A, singular as well as
rectangular, by E. H. Moore [804, 1920]. Moore called this the “general reciprocal.” His con-
tribution used unnecessarily complicated notation and was soon sinking into oblivion; see Ben-
Israel [100, 2002]. A collection of papers on generalized inverses can be found in Nashed [823,
1976]. Ben-Israel and Greville [101, 2003] give a comprehensive survey of generalized inverses.

1.2.4 Principal Angles and the CS Decomposition


The acute angle between two unit vectors x, y ∈ Cn is

θ = ∠(x, y) = arccos |xH y|, 0 ≤ θ ≤ π/2. (1.2.33)

The general concept of principal angles between any two subspaces of Cn goes back to a re-
markable paper by Jordan [677, 1875].

Definition 1.2.12. Let X = R(X) and Y = R(Y ) be two subspaces of Cn . Without restriction
we can assume that
p = dim (X ) ≥ dim (Y) = q ≥ 1.

The principal angles θk between X and Y and the corresponding principal vectors uk , vk , k =
1, . . . , q, are recursively defined by

\[
  \cos\theta_k = \max_{u \in \mathcal{X}} \max_{v \in \mathcal{Y}} u^H v, \qquad \|u\|_2 = \|v\|_2 = 1,
  \tag{1.2.34}
\]
subject to the constraints $u \perp u_j$, $v \perp v_j$, $j = 1, \ldots, k - 1$.

Note that for k = 1, the constraints are empty, and $\theta_1$ is the smallest principal angle between
$\mathcal{X}$ and $\mathcal{Y}$. The principal vectors need not be uniquely defined, but the principal angles always are.
From the min-max characterization of the singular values and vectors given in Theorem 1.2.6
follows a relationship between principal angles and the SVD.

Theorem 1.2.13. Assume that X ∈ Cm×p and Y ∈ Cm×q form unitary bases for two subspaces
X and Y. Consider the SVD

M = X H Y = W CZ H , C = diag (σ1 , . . . , σq ), (1.2.35)

where σ1 ≥ σ2 ≥ · · · ≥ σq , W H W = Z H Z = Iq . Then the principal angles θk and principal


vectors are given by
cos θk = σk , U = XW, V = Y Z. (1.2.36)

Using this result, Björck and Golub [144, 1973] give stable algorithms for computing the
principal angles and vectors between subspaces. Golub and Zha [515, 1994], [516, 1995] give
a detailed perturbation analysis, discuss equivalent characterizations of principal angles, and
present algorithms for large and structured matrices. The stability of the Björck–Golub algorithm
is proved by Drmač [334, 2000].
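For illustration only, a minimal sketch of this computation (Python/NumPy, not the Björck–Golub code itself): orthonormal bases are obtained by QR factorization, and the cosines of the principal angles are the singular values of X^H Y as in Theorem 1.2.13. This simple version loses accuracy for very small angles, which is exactly what the more careful algorithms cited above address.

import numpy as np

def principal_angles(X, Y):
    Qx, _ = np.linalg.qr(X)            # orthonormal basis for R(X)
    Qy, _ = np.linalg.qr(Y)            # orthonormal basis for R(Y)
    M = Qx.conj().T @ Qy
    W, c, Zh = np.linalg.svd(M)        # cos(theta_k) = singular values of M
    theta = np.arccos(np.clip(c, -1.0, 1.0))
    U = Qx @ W                         # principal vectors
    V = Qy @ Zh.conj().T
    return theta, U, V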
The principal angles can be used to determine when two subspaces X and Y are close to each
other. X and Y are identical if and only if all principal angles are zero.

Definition 1.2.14. The largest principal angle θmax between two subspaces X and Y of the same
dimension is a measure of the distance between them:

dist(X , Y) = | sin θmax (X , Y)|. (1.2.37)

An alternative definition is dist(X , Y) = ∥PX − PY ∥2 , where PX and PY are the orthogonal


projectors onto the subspaces X and Y of Cn . This follows from the result that the nonzero
singular values of (PX − PY ) are

sin θk (X , Y), k = 1, . . . , q;

see Golub and Van Loan [512, 1996, Theorem 2.6.1].



Let $Q \in \mathbb{C}^{m\times n}$, $m > n$, have orthonormal columns and be partitioned as
\[
  Q = \begin{pmatrix} Q_{11} \\ Q_{21} \end{pmatrix}, \qquad
  Q_{11} \in \mathbb{C}^{m_1\times n}, \quad Q_{21} \in \mathbb{C}^{m_2\times n},
\]

where $m_1 + m_2 = m$. Then the SVDs of the blocks $Q_{11}$ and $Q_{21}$ are closely related. To simplify
the discussion, we assume that both $Q_{11}$ and $Q_{21}$ are square, i.e., $m_1 = m_2 = n$ and $Q_{11}$ is
nonsingular. Let
\[
  Q_{11} = U_1 C V_1^H, \qquad C = \operatorname{diag}(c_1, \ldots, c_n),
\]
be the SVD of $Q_{11}$, and set $X = Q_{21} V_1$. Then $V_1^H Q_{11}^H Q_{11} V_1 = C^2$, and because
$Q_{11}^H Q_{11} + Q_{21}^H Q_{21} = I_n$, the columns of X are orthogonal:
\[
  X^H X = V_1^H Q_{21}^H Q_{21} V_1 = V_1^H (I - Q_{11}^H Q_{11}) V_1 = I - C^2.
\]
Then $U_2 = X S^{-1}$, where $S = \operatorname{diag}(s_1, \ldots, s_n)$, $s_i = (1 - c_i^2)^{1/2}$, is orthonormal, and
$Q_{21} = U_2 S V_1^H$. Thus, we have the CS decomposition
\[
  \begin{pmatrix} Q_{11} \\ Q_{21} \end{pmatrix}
  = \begin{pmatrix} U_1 & 0 \\ 0 & U_2 \end{pmatrix}
    \begin{pmatrix} C \\ S \end{pmatrix} V_1^H.
  \tag{1.2.38}
\]

A more general CS decomposition for a unitary matrix, where Q11 and Q21 are not required
to be square matrices, is given by Paige and Saunders [856, 1981].

Theorem 1.2.15 (CS Decomposition). For an arbitrary partitioning of a square unitary matrix
$Q \in \mathbb{C}^{m\times m}$,
\[
  Q = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix},
  \qquad Q_{ij} \in \mathbb{C}^{m_i \times n_j},
  \tag{1.2.39}
\]
$n_1 + n_2 = m_1 + m_2 = m$, there are unitary matrices $U_1, U_2, V_1, V_2$ such that
\[
  \begin{pmatrix} U_1^H & 0 \\ 0 & U_2^H \end{pmatrix}
  \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}
  \begin{pmatrix} V_1 & 0 \\ 0 & V_2 \end{pmatrix}
  = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix},
  \tag{1.2.40}
\]
where $D_{ij} = U_i^H Q_{ij} V_j \in \mathbb{R}^{m_i \times n_j}$, $i, j = 1, 2$, are real and diagonal matrices,
\[
  \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix}
  = \left(\begin{array}{ccc|ccc}
      I   &   &     & O_s^H &    &       \\
          & C &     &       & S  &       \\
          &   & O_c &       &    & I     \\ \hline
      O_s &   &     & I     &    &       \\
          & S &     &       & -C &       \\
          &   & I   &       &    & O_c^H
    \end{array}\right),
  \tag{1.2.41}
\]
$C = \operatorname{diag}(c_1, \ldots, c_p)$, $S = \operatorname{diag}(s_1, \ldots, s_p)$, and
\[
  1 > c_1 \ge \cdots \ge c_r > 0, \qquad 0 < s_1 \le \cdots \le s_r < 1,
  \tag{1.2.42}
\]
and $c_i^2 + s_i^2 = 1$. Here $O_c$ and $O_s$ are zero blocks or may be empty matrices having no rows or
no columns. The unit matrices need not be equal and may not be present.

Proof. Paige and Wei [870, 1994] note that Qij = Ui CVjH , i, j = 1, 2, are essentially the SVDs
of the four blocks in the partitioned unitary matrix Q. Take U1 and V1 so that Q11 = U1 D11 V1H
is the SVD of Q11 . Hence, D11 is a nonnegative diagonal matrix with elements less than or equal
to unity. Choose unitary U2 and V2 to make U2H Q21 V1 lower triangular and U1H Q12 V2 upper
triangular with real nonnegative elements on their diagonals. Since the columns are orthonormal,
D21 must have the stated form. The orthonormality of rows gives D12 , except for the dimension
of the zero block denoted OsH . Since each row and column has unit length, the last block column
must have the form
\[
  \begin{pmatrix} U_1^H & 0 \\ 0 & U_2^H \end{pmatrix}
  \begin{pmatrix} Q_{12} \\ Q_{22} \end{pmatrix} V_2
  = \begin{pmatrix}
      O_{12} & 0 & 0 \\
      0      & S & 0 \\
      0      & 0 & I \\
      K      & L & 0 \\
      M      & N & 0 \\
      0      & 0 & O_{22}
    \end{pmatrix}.
\]
The orthogonality of the second and fourth blocks of columns shows that SM = 0. Hence
M = 0 because S is nonsingular. Similarly, from the second and fourth blocks of rows, L = 0.
Next, from the fifth and second blocks of rows, SC + N S = 0 and hence N = −C. Then we
see that KK H = I and K H K = I, and they can be transformed into I without altering the
rest of D. Finally, the unit matrices in the (1, 1) and (4, 4) blocks show that O12 = OsH and
O22 = OcH .

As remarked by Stewart [1017, 1977], the CS decomposition of a unitary matrix Q “often


enables one to obtain routine computational proofs of geometric theorems that would otherwise
require considerable ingenuity to establish.” Some contexts need only the compact CS decompo-
sition that corresponds to the first k1 columns Q11 and Q21 of Q. The above proof is constructive,
and U1 , V1 , and C can be computed by a standard SVD algorithm. However, if some singular
values ci are close to 1, this is not a stable algorithm for computing S and U2 .
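For illustration, a sketch of the constructive procedure above for the square 2×1 block case (Python/NumPy, not from the book); as just noted, it should not be trusted when some c_i are close to 1:

import numpy as np

def thin_cs(Q11, Q21):
    # SVD of the (nonsingular) top block: Q11 = U1 C V1^H
    U1, c, V1h = np.linalg.svd(Q11)
    X = Q21 @ V1h.conj().T               # columns of X are orthogonal
    s = np.sqrt(np.maximum(0.0, 1.0 - c**2))
    U2 = X / s                           # U2 = X S^{-1}; inaccurate if some c_i are near 1
    return U1, U2, c, s, V1h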

Notes and references


The canonical correlations cos θk , where θk are the principal angles, were introduced by Hotelling
[642, 1936]. These are used in statistical modeling in a wide variety of applications, such as
econometrics, psychology, and geodesy. The concept of principal angles is generalized to ellip-
tic scalar products by Knyazev and Argentati [701, 2002].
The CS decomposition is implicit in the works by Davis and Kahan [288, 1970] and Björck
and Golub [144, 1973]. An explicit form is put forward by Stewart [1017, 1977]. Stable al-
gorithms for computing the CS decomposition are given by Paige and Saunders [856, 1981],
Stewart [1020, 1982], and Van Loan [1080, 1985]. Higham [624, 2003] gives an analogue of the
CS decomposition for pseudounitary matrices. Sutton [1052, 2009] gives a two-phase algorithm
for computing the full CS decomposition. Applications are surveyed by Paige and Wei [870,
1994].

1.3 Perturbation Analysis


1.3.1 Vector and Matrix Norms
A norm is a function of a vector or matrix that gives a measure of its magnitude. Such norms can
be regarded as generalizations of the absolute value for scalars and are useful for error analysis
and other purposes. The Euclidean length of a vector is an example, but it is not the most
convenient in all situations.

A vector norm is a function ∥ · ∥ : Cn → R that satisfies the following conditions:

1. ∥x∥ > 0 ∀x ∈ Cn , x ̸= 0 (definiteness);

2. ∥αx∥ = |α|∥x∥ ∀α ∈ C, x ∈ Cn (homogeneity);

3. ∥x + y∥ ≤ ∥x∥ + ∥y∥ ∀x, y ∈ Cn (triangle inequality).

The most common vector norms are the Hölder $\ell_p$-norms
\[
  \|x\|_p = \bigl(|x_1|^p + |x_2|^p + \cdots + |x_n|^p\bigr)^{1/p}, \qquad 1 \le p < \infty.
  \tag{1.3.1}
\]

These have the property ∥x∥p = ∥ |x| ∥p , where |x| = (|x1 |, . . . , |xn |). Such norms are said to
be absolute. The most important particular cases are p = 1, 2 and the limit when p → ∞,

\[
  \|x\|_1 = |x_1| + \cdots + |x_n|, \qquad
  \|x\|_2 = \bigl(|x_1|^2 + \cdots + |x_n|^2\bigr)^{1/2} = (x^H x)^{1/2}, \qquad
  \|x\|_\infty = \max_{1\le i\le n} |x_i|.
  \tag{1.3.2}
\]

The vector $\ell_2$-norm is the Euclidean length of the vector. If Q is unitary, then $\|Qx\|_2^2 =
x^H Q^H Q x = x^H x = \|x\|_2^2$, i.e., this norm is invariant under unitary transformations. On a
finite-dimensional space, two norms differ by at most a positive constant that only depends on
the dimension. For the vector $\ell_p$-norms,
\[
  \|x\|_{p_2} \le \|x\|_{p_1} \le n^{\,1/p_1 - 1/p_2}\, \|x\|_{p_2}, \qquad p_1 \le p_2.
\]

An important property of ℓp -norms is the Hölder inequality,

|xH y| ≤ ∥x∥p ∥y∥q , 1/p + 1/q = 1. (1.3.3)

In the special case p = q = 2 this is the Cauchy–Schwarz inequality.

Definition 1.3.1. For any given vector norm ∥ · ∥ on Cn , the dual norm ∥ · ∥D is defined by

\[
  \|x\|_D = \max_{y \neq 0} |x^H y| / \|y\|.
  \tag{1.3.4}
\]

The vectors in the set {y ∈ Cn | ∥y∥D ∥x∥ = y H x = 1} are said to be dual vectors to x with
respect to ∥ · ∥.

From Hölder’s inequality it follows that the dual of the ℓp -norm is the ℓq -norm, where 1/p +
1/q = 1. Hence the dual of the ℓ2 -norm is itself. It is the only norm with this property; see Horn
and Johnson [641, 2012, Theorem 5.4.16].
A matrix norm is a function ∥·∥ : Cm×n → R that satisfies analogues of the three properties
of a vector norm. The matrix norm subordinate to a given vector norm is defined by

\[
  \|A\| = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \max_{\|x\|=1} \|Ax\|.
  \tag{1.3.5}
\]

From this it follows directly that

∥Ax∥ ≤ ∥A∥ ∥x∥ ∀x ∈ Cn .



A subordinate matrix norm is submultiplicative, i.e., whenever the product AB is defined, the
inequality ∥AB∥ ≤ ∥A∥∥B∥ holds. The matrix norms subordinate to the vector ℓp -norms are
especially important. For p = 2 it is given by the spectral norm

\[
  \|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2 = \sigma_1(A),
  \tag{1.3.6}
\]

where σ1 (A) is the largest singular value of A ∈ Cm×n . Because the nonzero singular values of
A and AH are the same, it follows that ∥A∥2 = ∥AH ∥2 .
For p = 1 and p = ∞ it can be shown that the matrix subordinate norms are
\[
  \|A\|_1 = \max_{1\le j\le n} \sum_{i=1}^m |a_{ij}|, \qquad
  \|A\|_\infty = \max_{1\le i\le m} \sum_{j=1}^n |a_{ij}|.
  \tag{1.3.7}
\]

If e = (1, . . . , 1)T is a vector of ones of appropriate dimension, we can write

∥A∥∞ = ∥ |A|e∥∞ = ∥AH ∥1 .

These norms are easily computable. A useful upper bound for the spectral norm, which is ex-
pensive to compute, is given by

∥A∥2 ≤ (∥A∥1 ∥A∥∞ )1/2 . (1.3.8)
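A quick numerical check of (1.3.8) (Python/NumPy; the random test matrix is of course just an illustration):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
lhs = np.linalg.norm(A, 2)                                       # spectral norm
rhs = np.sqrt(np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # bound (1.3.8)
assert lhs <= rhs * (1 + 1e-12)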

Another way to define matrix norms is to regard Cm×n as an mn-dimensional vector space
and apply a vector norm over that space. An example is the Frobenius norm derived from the
vector ℓ2 -norm,
\[
  \|A\|_F = \|A^H\|_F = \Bigl(\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2\Bigr)^{1/2}
  = \bigl(\operatorname{trace}(A^H A)\bigr)^{1/2}.
  \tag{1.3.9}
\]

The Frobenius norm is submultiplicative but is often larger than necessary, e.g., ∥I∥F = n1/2 .
Lower and upper bounds for the matrix $\ell_2$-norm in terms of the Frobenius norm are
\[
  \frac{1}{\sqrt{n}}\,\|A\|_F \le \|A\|_2 \le \|A\|_F.
\]

The Frobenius norm and the matrix subordinate norms for p = 1 and p = ∞ satisfy ∥ |A| ∥ =
∥A∥. However, for the ℓ2 -norm, the best result is ∥ |A| ∥2 ≤ n1/2 ∥A∥2 . The spectral and
Frobenius norms of A both can be expressed in terms of singular values σi (A) as
\[
  \|A\|_2 = \sigma_{\max}(A), \qquad
  \|A\|_F = \Bigl(\sum_{i=1}^r \sigma_i^2(A)\Bigr)^{1/2}, \qquad r = \min\{m, n\}.
  \tag{1.3.10}
\]

Such norms are unitarily invariant, i.e., ∥A∥ = ∥U H AV ∥ for any unitary U and V . The follow-
ing characterization of such norms was given by von Neumann [1094, 1937]; see Stewart and
Sun [1033, 1990].

Theorem 1.3.2. Any unitarily invariant matrix norm ∥A∥ is a symmetric function of the singular
values of A, i.e.,
∥A∥ = Φ(σ1 , . . . , σn ),
where Φ is invariant under permutation of its arguments.

Proof. Let the singular value decomposition of A be A = U ΣV H . The invariance implies that
∥A∥ = ∥Σ∥, which shows that Φ(A) only depends on Σ. As the ordering of the singular values
in Σ is arbitrary, Φ must be symmetric in σi .

The converse of Theorem 1.3.2 was also proved by von Neumann, i.e., any function
Φ(σ1 , . . . , σn ) that is symmetric in its arguments and satisfies the properties of a vector norm
defines a unitarily invariant matrix norm. Such functions are called symmetric gauge functions.
An important class of unitarily invariant matrix norms is the so-called Schatten norms
(Schatten [969, 1960]) obtained by taking the ℓp vector norm (1.3.1) of the vector of singular
values σ = (σ1 , . . . , σn ) of A:

∥A∥ = ∥σ∥p , 1 ≤ p < ∞. (1.3.11)

For p = 2 we get the Frobenius norm, and p → ∞ gives the spectral norm ∥A∥2 = σ1 . Taking
p = 1, we obtain the nuclear norm or Ky Fan norm (see Ky Fan [394, 1949])

\[
  \|A\|_* = \operatorname{trace}\bigl(\sqrt{A^H A}\bigr) = \sum_{i=1}^n \sigma_i(A).
  \tag{1.3.12}
\]

1.3.2 Sensitivity of Singular Values and Vectors


By applying classical perturbation bounds for Hermitian matrices to the Jordan–Wielandt ma-
trix (1.2.18), the following results for the sensitivity of the singular values and vectors of A to
perturbations can be derived; see Wedin [1107, 1972].

Theorem 1.3.3. Let A ∈ Rm×n have the singular values σ1 ≥ σ2 ≥ · · · ≥ σn . Then the
singular values σ̃1 ≥ σ̃2 ≥ · · · ≥ σ̃n of the perturbed matrix à = A + E, m ≥ n, satisfy
\[
  \text{(i)}\;\; |\sigma_i - \tilde{\sigma}_i| \le \|E\|_2, \qquad
  \text{(ii)}\;\; \sum_{i=1}^n |\sigma_i - \tilde{\sigma}_i|^2 \le \|E\|_F^2.
  \tag{1.3.13}
\]

Proof. See Stewart [1015, 1973, Theorem 6.6].

The second inequality in (1.3.13) is known as the Wielandt–Hoffman theorem for singular
values. The fact that a perturbation of A will produce perturbations of the same or smaller
magnitude in its singular values is important for the use of the SVD to determine the “numerical
rank” of a matrix; see Section 1.3.3.

Theorem 1.3.4. Let A ∈ Rm×n have a simple singular value σi with corresponding left and
right singular vectors ui and vi . Let γi = minj̸=i |σi − σj | be the absolute gap between σi and
the other singular values of A. Then, the perturbed matrix à = A + E, where ∥E∥2 < γi , has
a singular value $\tilde{\sigma}_i$ with singular vectors $\tilde{u}_i$ and $\tilde{v}_i$ that satisfy
\[
  \max\bigl\{\sin\theta(u_i, \tilde{u}_i),\; \sin\theta(v_i, \tilde{v}_i)\bigr\}
  \le \frac{\|E\|_2}{\gamma_i - \|E\|_2}.
  \tag{1.3.14}
\]

It is well known that the eigenvalues of the leading principal minor of order (n − 1) of a
Hermitian matrix A ∈ Rn×n interlace the eigenvalues of A. From the min-max characterization
in Theorem 1.2.6, a similar result for the singular values follows.

Theorem 1.3.5. Let A be bordered by a column $u \in \mathbb{R}^m$,
\[
  \widehat{A} = (A \;\; u) \in \mathbb{R}^{m\times n}, \qquad m \ge n.
\]
Then the ordered singular values $\sigma_i$ of A separate the ordered singular values $\widehat{\sigma}_i$ of $\widehat{A}$,
\[
  \widehat{\sigma}_1 \ge \sigma_1 \ge \widehat{\sigma}_2 \ge \sigma_2 \ge \cdots \ge \widehat{\sigma}_{n-1} \ge \sigma_{n-1} \ge \widehat{\sigma}_n.
\]
Similarly, if A is bordered by a row $v \in \mathbb{R}^n$,
$\widehat{A} = \begin{pmatrix} A \\ v^T \end{pmatrix} \in \mathbb{R}^{m\times n}$, $m \ge n$, it holds again that
\[
  \widehat{\sigma}_1 \ge \sigma_1 \ge \widehat{\sigma}_2 \ge \sigma_2 \ge \cdots \ge \widehat{\sigma}_{n-1} \ge \sigma_{n-1} \ge \widehat{\sigma}_n \ge \sigma_n.
\]

Lemma 1.3.6. Let $A \in \mathbb{C}^{m\times n}$ and $B_k = X_k Y_k^H$, where $X_k \in \mathbb{C}^{m\times k}$, $Y_k \in \mathbb{C}^{n\times k}$. Then
$\operatorname{rank}(B_k) \le k < \min\{m, n\}$ and $\sigma_1(A - B_k) \ge \sigma_{k+1}(A)$, where $\sigma_i(\cdot)$ denotes the ith singular value of its
argument.

Proof. Let $v_i$, $i = 1, \ldots, n$, be the right singular vectors of A. Since $\operatorname{rank}(Y_k) \le k < n$, there is
a unit vector $v = c_1 v_1 + \cdots + c_{k+1} v_{k+1}$ such that $\|v\|_2^2 = c_1^2 + \cdots + c_{k+1}^2 = 1$ and $Y_k^H v = 0$. It follows
that
\[
  \sigma_1^2(A - B_k) \ge v^H (A - B_k)^H (A - B_k) v = v^H A^H A v
  = |c_1|^2 \sigma_1^2 + \cdots + |c_{k+1}|^2 \sigma_{k+1}^2 \ge \sigma_{k+1}^2.
\]

Lemma 1.3.7. Let A = B + C, where B, C ∈ Cm×n , m ≥ n, have ordered singular values


σ1 (B) ≥ · · · ≥ σn (B) and σ1 (C) ≥ · · · ≥ σn (C), respectively. Then the ordered singular
values of A satisfy
σi+j−1 (A) ≤ σi (B) + σj (C). (1.3.15)

Proof. For $i = j = 1$ we have $\sigma_1(A) = u_1^H A v_1 = u_1^H B v_1 + u_1^H C v_1 \le \sigma_1(B) + \sigma_1(C)$. Now let
$B_{i-1}$ and $C_{j-1}$ denote the SVD expansions of B and C truncated to $i-1$ and $j-1$ terms. Then
$\sigma_1(B - B_{i-1}) = \sigma_i(B)$ and $\sigma_1(C - C_{j-1}) = \sigma_j(C)$. Moreover, $\operatorname{rank}(B_{i-1} + C_{j-1}) \le i + j - 2$. From these facts and
Lemma 1.3.6 it follows that
\[
  \sigma_i(B) + \sigma_j(C) = \sigma_1(B - B_{i-1}) + \sigma_1(C - C_{j-1})
  \ge \sigma_1\bigl(A - (B_{i-1} + C_{j-1})\bigr) \ge \sigma_{i+j-1}(A).
\]

We are now able to prove an important best approximation property of truncated SVD
expansions.

Theorem 1.3.8 (Eckart–Young–Mirsky Theorem). Let $A \in \mathbb{C}^{m\times n}$ be a matrix of rank r,
and let $\mathcal{M}_k^{m\times n}$ denote the set of matrices in $\mathbb{C}^{m\times n}$ of rank k. Then, for all unitarily invariant
norms, the solution of the problem
\[
  \min_B \|A - B\|, \qquad B \in \mathcal{M}_k^{m\times n}, \quad 1 \le k < r,
\]
is obtained by truncating the SVD expansion to $k < r$ terms: $A_k = \sum_{i=1}^k \sigma_i u_i v_i^H$. The minimum
distance is given by
\[
  \|A - A_k\|_2 = \sigma_{k+1}, \qquad
  \|A - A_k\|_F = \bigl(\sigma_{k+1}^2 + \cdots + \sigma_r^2\bigr)^{1/2}.
  \tag{1.3.16}
\]

Proof. For the spectral norm the result follows directly from Lemma 1.3.6. For the Frobenius
norm, set $B = A - B_k$, where $B_k$ has rank k. Then $\sigma_{k+1}(B_k) = 0$ and, setting $j = k + 1$ in
(1.3.15), we obtain $\sigma_{k+i}(A) \le \sigma_i(A - B_k)$, $i = 1, 2, \ldots$. From this it follows that
$\|A - B_k\|_F^2 \ge \sigma_{k+1}^2(A) + \cdots + \sigma_n^2(A)$.
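A minimal sketch (Python/NumPy, not from the book) of the truncated SVD approximation A_k and the error identities (1.3.16):

import numpy as np

def truncated_svd(A, k):
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
    err_2 = s[k] if k < s.size else 0.0       # ||A - A_k||_2 = sigma_{k+1}
    err_F = np.sqrt(np.sum(s[k:]**2))         # ||A - A_k||_F = (sigma_{k+1}^2 + ...)^{1/2}
    return Ak, err_2, err_F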

The Eckart–Young–Mirsky theorem was originally proved for the Frobenius norm, for which
the solution is unique; see Eckart and Young [357, 1936]. The result for an arbitrary unitarily
invariant norm is due to Mirsky [797, 1960]. An elementary proof of the general case is given by
Li and Strang [740, 2020].
The best approximation property of the partial sums of the SVD expansion has a wide range
of applications in applied mathematics and is a key tool for constructing reduced-order models.
In signal processing, A may be derived from data constituting a noisy signal, and a rank reduction
is used to filter out the noise and reconstruct the true signal. Other applications are noise filtering
in statistics and model reduction in control and systems theory. Recently it has been recognized
that most high-dimensional data sets can be well approximated by low-rank matrices; see Udell
and Townsend [1071, 2019].

Notes and references

Golub, Hoffman, and Stewart [494, 1987] prove a generalization that shows how to obtain a best
approximation when a specified set of columns in the matrix is to remain fixed.

1.3.3 Perturbation Theory of Pseudoinverses


The condition number of a problem is a measure of the sensitivity of the solution to small
perturbations in the input data. For a nonsingular matrix A ∈ Rn×n the inverse (A + E)−1
exists for sufficiently small ∥E∥, where the norm is assumed to be submultiplicative. Taking
norms of the identity $(A + E)^{-1} - A^{-1} = -(A + E)^{-1} E A^{-1}$, we obtain
\[
  \max_{\|E\|\le\epsilon} \bigl\|(A + E)^{-1} - A^{-1}\bigr\|
  \le \epsilon\, \bigl\|(A + E)^{-1}\bigr\|\, \bigl\|A^{-1}\bigr\|.
\]

The absolute condition number for computing the inverse $A^{-1}$ is defined as the limit
\[
  \lim_{\epsilon \to +0}\; \max_{\|E\|\le\epsilon} \frac{1}{\epsilon}\,\bigl\|(A + E)^{-1} - A^{-1}\bigr\| = \|A^{-1}\|^2.
  \tag{1.3.17}
\]

The relative condition number for computing the inverse is obtained by multiplying this quantity
by ∥A∥/∥A−1 ∥:
κ(A) = ∥A∥ ∥A−1 ∥. (1.3.18)

This is invariant under multiplication of A by a scalar, κ(αA) = κ(A). From (1.3.18) it follows
that
κ(AB) ≤ κ(A) κ(B).

From the identity AA−1 = I it follows that κp (A) ≥ ∥I∥p = 1 for all matrix ℓp -norms. A matrix
with large condition number is called ill-conditioned; otherwise, it is called well-conditioned.
To indicate that a particular norm is used, we write κ∞ (A). For the Euclidean norm the
condition number can be expressed in terms of the singular values σ1 ≥ · · · ≥ σn of A as

κ2 (A) = σ1 (A)/σn (A). (1.3.19)



This expression applies also to rectangular matrices A ∈ Cm×n with rank(A) = n. For a real
orthogonal or unitary matrix U ,

κ2 (U ) = ∥U ∥2 ∥U −1 ∥2 = 1.

Such matrices are perfectly conditioned in the ℓ2 -norm. Furthermore, if P and Q are real or-
thogonal or unitary, then κ(P AQ) = κ(A) for both the ℓ2 -norm and the Frobenius norm. This
is one reason why real orthogonal and unitary transformations play such a central role in matrix
computations.
The normwise relative distance of a matrix A to the set of singular matrices is defined as
\[
  \operatorname{dist}(A) := \min_E \bigl\{\, \|E\|/\|A\| \;\big|\; A + E \text{ is singular} \,\bigr\}.
  \tag{1.3.20}
\]

For the spectral norm it follows from the Eckart–Young–Mirsky theorem (Theorem 1.3.8) that

dist2 (A) = 1/(∥A∥2 ∥A−1 ∥2 ) = 1/κ2 (A). (1.3.21)

This equality holds for any subordinate matrix norm and can be used to get a lower bound for the
condition number; see Kahan [681, 1966] and Stewart and Sun [1033, 1990, Theorem III.2.8].
Let B = A + E be a perturbation of a matrix A ∈ Rm×n . Estimating the difference
∥(A + E)† − A† ∥ is complicated by the fact that A† varies discontinuously when the rank of A
changes. A trivial example is
\[
  A = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}, \qquad
  E = \begin{pmatrix} 0 & 0 \\ 0 & \epsilon \end{pmatrix},
\]

where σ > 0, ϵ ̸= 0. Then the perturbation in A† becomes unbounded when ϵ → 0:

∥(A + E)† − A† ∥2 = |ϵ|−1 .

Definition 1.3.9. Two subspaces R(A) and R(B) are said to be acute if the corresponding
orthogonal projections satisfy
\[
  \bigl\| P_{R(A)} - P_{R(B)} \bigr\|_2 < 1.
\]

A perturbation B = A + E of A is said to be acute if R(A) and R(B), as well as R(AH ) and


R(B H ), are acute.

Theorem 1.3.10. The matrix B is an acute perturbation of A if and only if

rank(A) = rank(B) = rank(PR(A) BPR(AT ) ). (1.3.22)

Proof. See Stewart [1018, 1977].

If the perturbation E does not change the rank of A, then unbounded growth of (A + E)†
cannot occur.

Lemma 1.3.11. If $\operatorname{rank}(A + E) = \operatorname{rank}(A)$ and $\eta = \|A^{\dagger}\|_2 \|E\|_2 < 1$, then
\[
  \|(A + E)^{\dagger}\|_2 \le \frac{1}{1 - \eta}\, \|A^{\dagger}\|_2.
  \tag{1.3.23}
\]

Proof. From the assumption and Theorem 1.3.3 it follows that


1/∥(A + E)† ∥2 = σr (A + E) ≥ σr (A) − ∥E∥2 = 1/∥A† ∥2 − ∥E∥2 > 0,
which implies (1.3.23).

By expressing the projections in terms of pseudoinverses and using the relations in (1.2.29),
we obtain Wedin’s identity (Wedin [1106, 1969]). If B = A + E, then
B † − A† = −B † EA† + (B H B)† E H PN (AT ) − PN (B) E H (AAH )† . (1.3.24)

Theorem 1.3.12. If $B = A + E$ and $\operatorname{rank}(B) = \operatorname{rank}(A)$, then
\[
  \|B^{\dagger} - A^{\dagger}\| \le \mu\, \|B^{\dagger}\|\, \|A^{\dagger}\|\, \|E\|,
  \tag{1.3.25}
\]
where $\mu = 1$ for the Frobenius norm $\|\cdot\|_F$ provided that $\operatorname{rank}(A) = \min(m, n)$. For the spectral
norm $\|\cdot\|_2$,
\[
  \mu = \begin{cases}
    \tfrac{1}{2}\bigl(1 + \sqrt{5}\bigr) & \text{if } \operatorname{rank}(A) < \min(m, n), \\[2pt]
    \sqrt{2} & \text{if } \operatorname{rank}(A) = \min(m, n).
  \end{cases}
\]

The result for the 2-norm is due to Wedin [1108, 1973]. For the Frobenius norm, µ = 1 as
shown by van der Sluis and Veltkamp [1074, 1979]. From the results above we deduce that
\[
  \lim_{E \to 0} (A + E)^{\dagger} = A^{\dagger} \iff \lim_{E \to 0} \operatorname{rank}(A + E) = \operatorname{rank}(A).
\]

Let B = A(τ ) = A + τ E, where τ is a scalar parameter. Letting τ → 0 and assuming


A(τ ) has constant local rank, we see that Wedin’s identity gives the following formula for the
derivative of the pseudoinverse:
\[
  \frac{dA^{\dagger}}{d\tau}
  = -A^{\dagger}\frac{dA}{d\tau}A^{\dagger}
  + (A^H\!A)^{\dagger}\frac{dA^H}{d\tau}P_{N(A^T)}
  - P_{N(A)}\frac{dA^H}{d\tau}(AA^H)^{\dagger}.
  \tag{1.3.26}
\]
Similar formulas for derivatives of orthogonal projectors and pseudoinverses are derived by
Golub and Pereyra [503, 1973]. For the least squares solution x = A† b, we obtain
\[
  \frac{dx}{d\tau}
  = -A^{\dagger}\frac{dA}{d\tau}x
  + (A^H\!A)^{\dagger}\frac{dA^H}{d\tau}P_{N(A^T)}b
  - P_{N(A)}\frac{dA^H}{d\tau}(A^H)^{\dagger}x.
  \tag{1.3.27}
\]
The discontinuity of the pseudoinverse means that the mathematical notion of rank is not
appropriate in numerical computations. If a matrix A has (mathematical) rank k < n, and E is a
random perturbation, then A+E will most likely have full rank n, and A† will differ significantly
from (A + E)† . Thus, if A is close to a rank-deficient matrix, for practical purposes it should be
considered as rank-deficient. It is important that this be recognized, because overestimating the
rank of A can lead to a computed solution x of very large norm.
The considerations above show that the numerical rank assigned to A should depend on a
tolerance reflecting the level of errors in A. We say that A has numerical δ-rank equal to k if
k = min{rank(B) | ∥A − B∥2 ≤ δ}. (1.3.28)
That is, the numerical rank of A is the smallest rank of all matrices at a distance from A less
than or equal to $\delta$. Let $A = U\Sigma V^H = \sum_{i=1}^n \sigma_i u_i v_i^H$ be the SVD expansion of A. By the
Wielandt–Hoffman Theorem 1.3.3, for $k < n$,
\[
  \inf_{\operatorname{rank}(B)\le k} \|A - B\|_2 = \sigma_{k+1}.
  \tag{1.3.29}
\]

This infimum is attained for $B = \sum_{i=1}^k \sigma_i u_i v_i^H$. It follows that A has numerical $\delta$-rank k if and
only if
\[
  \sigma_1 \ge \cdots \ge \sigma_k > \delta \ge \sigma_{k+1} \ge \cdots \ge \sigma_n.
  \tag{1.3.30}
\]
From (1.3.28) it follows that
\[
  \|A - A_k\|_2 = \|A V_2\|_2 \le \delta, \qquad V_2 = (v_{k+1}, \ldots, v_n),
\]

where R(V2 ) is the numerical nullspace of A. In many applications the cost of using SVD to
determine the numerical rank and nullspace of a matrix can be prohibitively high.
Choosing the parameter δ in (1.3.30) depends on the context and is not always an easy matter.
Let E = (eij ) be an upper bound on the absolute error in A. If the elements eij are about the
same magnitude, and |eij | ≤ ϵ for all i, j, then

∥E∥2 ≤ ∥E∥F ≤ (mn)1/2 ϵ.

In this case a reasonable choice in (1.3.30) is δ = (mn)1/2 ϵ.
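A sketch of determining the numerical δ-rank and the numerical nullspace from the SVD, following (1.3.30) (Python/NumPy; the elementwise error level eps_elem in the example is an assumed input):

import numpy as np

def numerical_rank(A, delta):
    U, s, Vh = np.linalg.svd(A)
    k = int(np.sum(s > delta))          # sigma_k > delta >= sigma_{k+1}
    V2 = Vh[k:, :].conj().T             # orthonormal basis for the numerical nullspace
    return k, V2

m, n = 100, 20
eps_elem = 1.0e-10                                  # assumed elementwise error level
delta = np.sqrt(m * n) * eps_elem                   # the choice suggested above
rng = np.random.default_rng(1)
A = rng.standard_normal((m, 5)) @ rng.standard_normal((5, n))   # exact rank 5
k, V2 = numerical_rank(A, delta)                    # k == 5 here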


Definition (1.3.30) is satisfactory only when there is a well-defined gap between σk+1 and
σk . This should be the case if the exact matrix à is rank-deficient but well-conditioned. However,
some matrices lack a well-determined numerical rank. Then additional information is needed to
determine a meaningful solution; see Section 3.5.3.

1.3.4 Perturbation of Least Squares Solutions


Let minx ∥Ax − b∥2 be a least squares problem with rank(A) = n. Denote by x + δx the
solution to the perturbed problem

min ∥(A + E)x − (b + f )∥2 , ∥E∥2 < σn (A).


x

Then rank(A + E) = n, and the perturbed solution satisfies the normal equations

(A + E)T ((A + E)(x + δx) − (b + f )) = 0.

We shall derive a first-order estimate of ∥δx∥2 . Subtracting AT (Ax − b) = 0 and neglecting


second-order terms, we get (ATA)δx = AT (f − Ex) + E T (b − Ax), or

δx = A† (f − Ex) + (ATA)−1 E T r, (1.3.31)

where A† = (ATA)−1 AT and r = b − Ax. (Note that δx = A† f = A† AA† f = A† PR(A) f


depends only on the component f1 = PR(A) f in R(A).) From the SVD A = U ΣV T we have

∥(ATA)−1 AT ∥2 = 1/σn , ∥(ATA)−1 ∥2 = 1/σn2 ,

and taking norms in (1.3.31) gives the first-order result
\[
  \|\delta x\|_2 \le \frac{1}{\sigma_n}\bigl(\|f\|_2 + \|E\|_2\|x\|_2\bigr)
  + \frac{1}{\sigma_n^2}\,\|E\|_2\,\|r\|_2.
  \tag{1.3.32}
\]
Since 1/σn = κ(A)/∥A∥2 , the last term in (1.3.32) is proportional to κ2 (A). Golub and Wilkin-
son [514, 1966] were the first to note that a term proportional to κ2 occurs when r ̸= 0. Van der
Sluis [1073, 1975] gives a geometrical explanation for the occurrence of this term.
Wedin [1108, 1973] gives a more refined analysis that applies also to rank-deficient prob-
lems. To be able to prove any meaningful result, he assumes that the perturbation E satisfies the
conditions
rank(A + E) = rank(A), η = ∥A† ∥2 ∥E∥2 < 1. (1.3.33)

Note that if rank(A) = min(m, n) then the condition η < 1 suffices to guarantee that rank(A +
E) = rank(A). The analysis needs the following estimate for the largest principal angle between
the fundamental subspaces of à and A (see Definition 1.2.12).

Lemma 1.3.13. Suppose à = A + E and conditions (1.3.33) are satisfied. Then if χ(·) denotes
any of the four fundamental subspaces,

sin θmax (χ(Ã), χ(A)) ≤ η < 1. (1.3.34)

Proof. See Wedin [1108, 1973, Lemma 4.1].

Nearly optimal bounds for the perturbation of the solution of least squares problems are
derived in Björck [125, 1967].

Theorem 1.3.14. Suppose rank(A + E) = rank(A) and that perturbations E and f satisfy the
normwise relative bounds

∥E∥2 /∥A∥2 ≤ ϵA , ∥f ∥2 /∥b∥2 ≤ ϵb . (1.3.35)

Then if η = κϵA < 1, the perturbations δx and δr in the least squares solution x and residual
$r = b - Ax$ satisfy
\[
  \|\delta x\|_2 \le \frac{\kappa}{1-\eta}
  \Bigl(\epsilon_A \|x\|_2 + \epsilon_b \frac{\|b\|_2}{\|A\|_2}
  + \epsilon_A \kappa \frac{\|r\|_2}{\|A\|_2}\Bigr)
  + \epsilon_A \kappa \|x\|_2,
  \tag{1.3.36}
\]
\[
  \|\delta r\|_2 \le \epsilon_A \|x\|_2 \|A\|_2 + \epsilon_b \|b\|_2 + \epsilon_A \kappa \|r\|_2.
  \tag{1.3.37}
\]

Proof. Decomposing $\delta x = \tilde{A}^{\dagger}\tilde{b} - x = \tilde{A}^{\dagger}(Ax + r + f) - x$ and using $P_{N(\tilde{A})} = I - \tilde{A}^{\dagger}\tilde{A}$, we
get
\[
  \delta x = \tilde{A}^{\dagger} P_{R(\tilde{A})}(f - Ex) + \tilde{A}^{\dagger} r - P_{N(\tilde{A})} x.
  \tag{1.3.38}
\]
We separately estimate each of the three terms in this decomposition of $\delta x$. From Lemma 1.3.11
it follows that
\[
  \|\tilde{A}^{\dagger}(f - Ex)\|_2 \le \frac{1}{1-\eta}\,\|A^{\dagger}\|_2 \bigl(\|E\|_2\|x\|_2 + \|f\|_2\bigr).
  \tag{1.3.39}
\]
From $r \perp R(A)$ we have $r = P_{N(A^T)} r$, and from (1.2.29) the second term becomes
\[
  \tilde{A}^{\dagger} r = \tilde{A}^{\dagger}\tilde{A}\tilde{A}^{\dagger} r = \tilde{A}^{\dagger} P_{R(\tilde{A})} P_{N(A^T)} r.
  \tag{1.3.40}
\]
By definition, $\|P_{R(\tilde{A})} P_{N(A^T)}\|_2 = \sin\theta_{\max}(R(\tilde{A}), R(A))$, where $\theta_{\max}$ is the largest principal
angle between the subspaces $R(\tilde{A})$ and $R(A)$. Similarly, $x = P_{R(A^T)} x$, and the third term
can be written as $P_{N(\tilde{A})} x = P_{N(\tilde{A})} P_{R(A^T)} x$, where by Lemma 1.3.13 $\|P_{N(\tilde{A})} P_{R(A^T)}\|_2 =
\sin\theta_{\max}(N(\tilde{A}), N(A)) \le \eta$. The estimate (1.3.36) now follows, and (1.3.37) is proved using
the decomposition
\[
  \tilde{r} - r = P_{N(\tilde{A}^T)}(b + f) - P_{N(A^T)} b
  = P_{N(\tilde{A}^T)} f + P_{N(\tilde{A}^T)} P_{R(A)} b - P_{R(\tilde{A})} P_{N(A^T)} r,
\]
and $P_{N(\tilde{A}^T)} P_{R(A)} b = P_{N(\tilde{A}^T)} Ax = -P_{N(\tilde{A}^T)} Ex$.



If rank(A) = n, then N (Ã) = {0}, and the last term in (1.3.36) (and therefore also in
(1.3.37)) vanishes. If the system is consistent, then r = 0, and the term involving κ2 in (1.3.36)
vanishes. If rank(A) = n and ϵb = 0, the condition number of the least squares problem can be
written as  
∥r∥2
κLS (A, b) = κ(A) 1 + κ(A) . (1.3.41)
∥A∥2 ∥x∥2
Note that the condition depends on r and therefore on the right-hand side b. This dependence is
negligible if κ(A)∥r∥2 ≪ ∥A∥2 ∥x∥2 . By considering first-order approximations of the terms, it
can be shown that for any matrix A of rank n and vector b there are perturbations E and f such
that the estimates in Theorem 1.3.14 can almost be attained.
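The condition number (1.3.41) is easy to evaluate once x and r are known; a sketch (Python/NumPy, assuming full column rank):

import numpy as np

def kappa_ls(A, b):
    x, _, _, sv = np.linalg.lstsq(A, b, rcond=None)
    r = b - A @ x
    kappa = sv[0] / sv[-1]                       # kappa_2(A) = sigma_1 / sigma_n
    return kappa * (1.0 + kappa * np.linalg.norm(r)
                    / (sv[0] * np.linalg.norm(x)))

Note that sv[0] equals the spectral norm of A, so the factor ||r||_2 / (||A||_2 ||x||_2) is formed directly from the singular values.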
In many applications, one is not directly interested in the least squares solution x but in some
functional c = LT x, where L ∈ Rn×k . For example, in the determination of positions using GPS
systems, the main quantities of interest are the three-dimensional coordinates, but the statistical
model involves several other auxiliary parameters; see Arioli, Baboulin, and Gratton [32, 2007].
The sensitivity of functionals of the solution of ill-conditioned least squares problems is studied
by Eldén [372, 1990]. From (1.3.31) we have

δLT x = LT A† (f − Ex) + LT (ATA)−1 E T r. (1.3.42)

Assume for convenience that $\operatorname{rank}(A) = n$, and let $A = QR$ be the QR factorization of A.
Then $A^{\dagger} = R^{-1}Q^T$, $(A^T\!A)^{-1} = R^{-1}R^{-T}$, and $\|R^{-T}\|_2 = 1/\sigma_n(A)$. We obtain the perturbation
bound
\[
  \|\delta c\|_2 \le \|C\|_2 \Bigl(\|f\|_2 + \|E\|_2\|x\|_2 + \frac{1}{\sigma_n}\,\|E\|_2\|r\|_2\Bigr),
  \tag{1.3.43}
\]
where $C = R^{-T}L$. A bound for the perturbation of a particular component $x_i$ of the least squares
solution is obtained by taking $L = e_i$, the ith column of the unit matrix $I_n$. In particular, for
$i = n$, the solution of $R^T C = e_n$ is simply $C = r_{nn}^{-1} e_n$ and $\|C\|_2 = r_{nn}^{-1}$. If A is rank-deficient,
the SVD $A = U\Sigma V^T$ can be used instead of the QR factors. Substituting
\[
  A^{\dagger} = V\Sigma^{\dagger}U^T, \qquad (A^T\!A)^{-1} = V\Sigma^{\dagger}(\Sigma^{\dagger})^T V^T
\]
into (1.3.42) gives (1.3.43) with $C^T = (L^T V)\Sigma^{\dagger}$ and $\sigma_n$ exchanged for $\sigma_r$, where $r = \operatorname{rank}(A)$.
Normwise perturbation bounds yield results that are easy to present but ignore how the per-
turbations are distributed among the elements of the matrix and vector. When the matrix is poorly
scaled or sparse, such bounds can greatly overestimate the error. For this reason, component-
wise perturbation analysis is gaining increasing attention. As stressed in the excellent survey
by Higham [620, 1994], the conditioning of a problem should always be defined with respect to a
particular class of perturbations. In normwise analysis, perturbations are considered that satisfy
the inequality ∥E∥ ≤ ϵ for some matrix norm. If the columns of A have vastly different norms,
then a more relevant class of perturbations might be

ãj = aj + δaj , ∥δaj ∥ ≤ ϵ∥aj ∥, j = 1, . . . , n. (1.3.44)

In componentwise analysis, scaling factors eij ≥ 0 and fi ≥ 0 are specified, and perturbations
such that
|δaij | ≤ ωeij , |δbi | ≤ ωfi , i, j = 1, . . . , n, (1.3.45)
for some ω > 0 are considered. By setting eij to zero, we can ensure that the corresponding
element aij is not perturbed. With eij = |aij | and fi = |bi |, ω bounds the componentwise
relative perturbation in each component of A and b.

We now introduce some notation to be used in the following. A matrix A is nonnegative,


A ≥ 0, if aij ≥ 0 for all i, j. Similarly, A is positive, A > 0, if aij > 0 for all i, j. If A and
B are nonnegative, then so are their sum A + B and product AB. Hence, nonnegative matrices
form a convex set. The partial orderings “≤” and “<” for nonnegative matrices A, B and vectors
x, y are to be interpreted componentwise, e.g.,

A ≤ B ⇐⇒ aij ≤ bij , x ≤ y ⇐⇒ xi ≤ yi . (1.3.46)

A ≥ B means the same as B ≤ A, and A > B is the same as B < A. These orderings are
transitive: if A ≤ B and B ≤ C, then A ≤ C. Note that there are matrices that cannot be
compared by any of these relations. It is rather obvious which rules for handling inequalities can
be generalized to this partial ordering in matrix spaces. If $C = AB$, it is easy to show that
$|c_{ij}| \le \sum_{k=1}^n |a_{ik}|\,|b_{kj}|$, i.e., $|C| \le |A|\,|B|$. A similar rule holds for matrix-vector multiplication.
With the above notation the componentwise bounds (1.3.45) can be written more compactly
as
|δA| ≤ ωE, |δb| ≤ ωf, (1.3.47)

where E > 0, f > 0. Componentwise relative perturbations are obtained by taking E = |A| and
f = |b|. We first consider a nonsingular square linear system Ax = b. The basic identity used
for a componentwise perturbation analysis is

δx = (I + A−1 δA)−1 A−1 (δAx + δb).

If |A−1 ||δA| < 1, then taking absolute values gives

|δx| ≤ (I − |A−1 ||δA|)−1 |A−1 |(|δA||x| + |δb|),

where the inequality is to be interpreted componentwise. The matrix (I − |A−1 ||δA|) is guaran-
teed to be nonsingular if ∥ |A−1 | |δA| ∥ < 1. For perturbations satisfying (1.3.47), we obtain

|δx| ≤ ω(I − ω|A−1 |E)−1 |A−1 |(E|x| + f ). (1.3.48)

Assuming that ωκE (A) < 1, it follows from (1.3.48) that for any absolute norm,

\[
  \|\delta x\| \le \frac{\omega}{1 - \omega\kappa_E(A)}\,\bigl\|\,|A^{-1}|\bigl(E|x| + f\bigr)\bigr\|.
  \tag{1.3.49}
\]

Hence κE (A) = ∥ |A−1 |E∥ can be taken to be the componentwise condition number with re-
spect to E. For componentwise relative error bounds (E = |A|), we obtain the Bauer–Skeel
condition number of A,

κ|A| (A) = cond (A) = ∥ |A−1 ||A| ∥. (1.3.50)

It can be shown that cond (A) and the bound (1.3.49) with E = |A| are invariant under row
scaling.
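A sketch of the Bauer–Skeel condition number (1.3.50) and of its invariance under row scaling (Python/NumPy; the explicit inverse is used only for illustration):

import numpy as np

def bauer_skeel_cond(A):
    Ainv = np.linalg.inv(A)
    return np.linalg.norm(np.abs(Ainv) @ np.abs(A), np.inf)

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
D = np.diag(10.0 ** np.arange(5))                                # badly scaled rows
print(np.linalg.cond(A, np.inf), np.linalg.cond(D @ A, np.inf))  # normwise kappa grows
print(bauer_skeel_cond(A), bauer_skeel_cond(D @ A))              # cond(A) is unchanged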
The Bauer–Skeel perturbation analysis of linear systems can be extended to linear least
squares problems by considering the augmented system
\[
  M z = d \;\equiv\;
  \begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix}
  \begin{pmatrix} y \\ x \end{pmatrix}
  = \begin{pmatrix} b \\ c \end{pmatrix}.
  \tag{1.3.51}
\]

If A has full column rank, then
\[
  M^{-1} = \begin{pmatrix} P_{N(A^T)} & (A^{\dagger})^T \\ A^{\dagger} & -(A^T\!A)^{-1} \end{pmatrix},
  \tag{1.3.52}
\]

where $P_{N(A^T)} = I - AA^{\dagger}$ is the orthogonal projection onto $N(A^T)$. Componentwise perturbations
$|\delta A| \le \omega E$, $|\delta b| \le \omega f$, and $|\delta c| \le \omega g$ give rise to perturbations
\[
  \delta M = \begin{pmatrix} 0 & \delta A \\ \delta A^T & 0 \end{pmatrix}, \qquad
  \delta d = \begin{pmatrix} \delta b \\ \delta c \end{pmatrix}
\]

of system (1.3.51). From (1.3.49) applied to M z = d, neglecting terms of order ω 2 , we obtain

|δy| ≤ ω |PN (AT ) |(E|x| + f ) + ω |(A† )T | (E T |y| + g), (1.3.53)

|δx| ≤ ω |A† |(f + E|x|) + ω |(ATA)−1 | (E T |y| + g). (1.3.54)

The least squares problem minx ∥Ax − b∥2 corresponds to taking c = g = 0 and r = y.
With E = |A| and f = |b| we obtain, after taking norms,

∥δr∥ ≤ ω ∥ PN (AT ) (|A||x| + |b|) ∥ + ω ∥ |(A† )T | |A|T |r| ∥, (1.3.55)

∥δx∥ ≤ ω ∥ |A† |(|A||x| + |b|) ∥ + ω∥ |(ATA)−1 | |A|T |r| ∥; (1.3.56)

see Björck [132, 1991]. For small-residual problems the componentwise condition number for
the least squares solution x can be defined as

cond (A) = ∥ |A† | |A| ∥. (1.3.57)

The least-norm problem $\min \|y\|_2$ subject to $A^T y = c$ corresponds to taking $b = f = 0$,
\[
  |\delta y| \le \omega\, |P_{N(A^T)}|\, E|x| + \omega\, |(A^{\dagger})^T|\, (E^T|y| + g).
  \tag{1.3.58}
\]

Notes and references

Componentwise perturbation analysis originated with Bauer [95, 1966] and was later refined by
Skeel [1000, 1979]; see also Higham [620, 1994]. A good survey on matrix perturbation theory
is given by Stewart and Sun [1033, 1990]. Demmel [304, 1992] conjectured that the distance of
a matrix from a singular matrix in a componentwise sense is close to the reciprocal of its Bauer–
Skeel condition number. This conjecture was later proved for the general weighted condition
number κE (A) by Rump [945, 1999].
Abdelmalek [3, 1974] gives a perturbation analysis for pseudoinverses and linear least squares
problems. Stewart [1017, 1977] gives a unified treatment of the perturbation theory for pseu-
doinverses and least squares solutions with historical comments. In particular, asymptotic forms
and derivatives for orthogonal projectors, pseudoinverses, and least squares solutions are derived.
Grcar [531, 2010] derives spectral condition numbers of orthogonal projections and full-rank lin-
ear least squares problems. Gratton [527, 1996] obtains condition numbers of the least squares
problem in a weighted Frobenius norm. A similar componentwise analysis is given in Arioli,
Duff, and de Rijk [36, 1989]. Baboulin and Gratton [50, 2009] give sharp bounds for the condi-
tion numbers of linear functionals of least squares solutions.

1.4 Floating-Point Computation


1.4.1 Floating-Point Arithmetic
In floating-point computation a real number a is represented in the form

a = ±m · β e , β −1 ≤ m < 1, (1.4.1)

where exponent e is an integer and β is the base of the system. If t digits are used to represent
the fraction part m, we write

m = (0.d1 d2 · · · dt )β , 0 ≤ di < β. (1.4.2)

The exponent is limited to a finite range emin ≤ e ≤ emax . In a floating-point number system,
every real number in the floating-point range can be represented with a relative error that does
not exceed the unit roundoff
\[
  u = \tfrac{1}{2}\beta^{-t+1}.
  \tag{1.4.3}
\]
The IEEE 754–2008 standard for binary floating-point arithmetic [655, 2019] is used on
virtually all general-purpose computers. It specifies formats for floating-point numbers, ele-
mentary operations, and rounding rules. Three basic formats are specified for representing a
number: single, double, and quadruple precision using 32, 64, and 128 bits, respectively. Also,
a half precision format fp16 using 16 bits was introduced. This offers massive speed-up, but
because the maximum number that can be represented is only about 65,000, overflow is more
likely. This motivated Google to propose another half precision format with wider range called
bfloat16. Half precision formats have applications in computer graphics as well as deep learn-
ing; see Pranesh [904, 2019]. Because it is cheaper to move data in lower precision, the cost
of communication is reduced. The characteristics of floating-point formats are summarized in
Table 1.4.1.

Table 1.4.1. IEEE 754–2008 binary floating-point formats.

  Format           bits    t     e    emin       emax       u
  Half bfloat16     16     8     8    −126       +127       3.91 · 10^{-3}
  Half fp16         16    11     5    −14        +15        4.88 · 10^{-4}
  Single            32    24     8    −126       +127       5.96 · 10^{-8}
  Double            64    53    11    −1022      +1023      1.11 · 10^{-16}
  Quadruple        128   113    15    −16,382    +16,383    0.963 · 10^{-34}

Four rounding modes are supported by the standard. The default rounding mode is round
to the nearest representable number, with rounding to even in the case of a tie. Chopping (i.e.,
rounding toward zero) is also supported, as well as directed rounding to ∞ and −∞. The latter
modes simplify the implementation of interval arithmetic.
The IEEE standard specifies that all arithmetic operations, including the square root, should
be performed as if they were first calculated to infinite precision and then rounded to a floating-
point number according to one of the four modes mentioned above. One reason for specifying
precisely the results of arithmetic operations is to improve the portability of software. If a pro-
gram is moved between two computers, both supporting the IEEE standard, intermediate and
final results should be the same.

If x and y are floating-point numbers, then


f l (x + y), f l (x − y), f l (x · y), f l (x/y)
denote the results of floating-point operations that the machine stores in memory (after rounding
or chopping). Unless underflow or overflow occurs, in IEEE floating-point arithmetic,
f l (x op y) = (x op y)(1 + δ), |δ| ≤ u, (1.4.4)
where u is the unit roundoff and “op” stands for one of the four elementary operations +, −, ·,
and /. Similarly, for the square root it holds that
√ √
f l ( x) = x(1 + δ), |δ| ≤ u. (1.4.5)
Complex arithmetic can be reduced to real arithmetic as follows. If we let x = a + ib and
y = c + id be two complex numbers, where y ̸= 0, then
\[
  x \pm y = a \pm c + i(b \pm d), \qquad
  x \times y = (ac - bd) + i(ad + bc), \qquad
  x / y = \frac{ac + bd}{c^2 + d^2} + i\,\frac{bc - ad}{c^2 + d^2}.
\]

The square root of a complex number, $u + iv = \sqrt{x + iy}$, is given by
\[
  u = \bigl((r + x)/2\bigr)^{1/2}, \qquad v = \bigl((r - x)/2\bigr)^{1/2}, \qquad r = \sqrt{x^2 + y^2}.
  \tag{1.4.6}
\]
When $x > 0$ there will be cancellation when computing v. This can be severe if also $|x| \gg |y|$
(cf. Section 2.3.4 of [284, 2008]). To avoid this, we note that
\[
  uv = \tfrac{1}{2}\sqrt{r^2 - x^2} = \frac{y}{2},
\]
so v can be computed from v = y/(2u). When x < 0 we instead compute v from (1.4.6) and set
u = y/(2v).
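A sketch of this computation (Python; the choice of sign for v, which gives the principal branch, is an implementation detail not discussed above):

import math

def complex_sqrt(x, y):
    # Principal square root u + i v of x + i y, avoiding cancellation.
    r = math.hypot(x, y)
    if x >= 0.0:
        u = math.sqrt((r + x) / 2.0)
        v = y / (2.0 * u) if u != 0.0 else 0.0
    else:
        v = math.copysign(math.sqrt((r - x) / 2.0), y)
        u = y / (2.0 * v)
    return u, v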
Very few quantities in the physical world are known to an accuracy beyond IEEE double
precision. A value of π correct to 20 decimal digits would suffice to calculate the circumference
of a circle around the Sun at the orbit of Earth to within the width of an atom! Occasionally
one may want to perform some calculation, e.g., evaluate some mathematical constant such as
π to very high precision. Extremely high precision is also sometimes needed in experimental
mathematics when searching for new identities. For such purposes, multiple precision packages
have been developed that simulate arithmetic of arbitrarily high precision using standard floating-
point arithmetic; see Brent [177, 178, 1978].
When validated answers to mathematical problems have to be computed, another possibility
is to use interval arithmetic. Input data are given as intervals, and inclusion intervals for each
intermediate result are systematically calculated. An interval vector is denoted by [x] and has
interval components $[x_i] = [\underline{x}_i, \overline{x}_i]$, $i = 1 : n$. Likewise, an interval matrix $[A] = ([a_{ij}])$ has
interval entries
\[
  [a_{ij}] = [\underline{a}_{ij}, \overline{a}_{ij}], \qquad i = 1 : m, \quad j = 1 : n.
\]
Operations between interval matrices and interval vectors are defined in an obvious manner.
The interval matrix-vector product [A][x] is the smallest interval vector that contains the set
{Ax | A ∈ [A], x ∈ [x]}, but it normally does not coincide with this set. By the inclusion prop-
erty,
\[
  \{Ax \mid A \in [A],\, x \in [x]\} \subseteq [A][x] = \Bigl(\sum_{j=1}^n [a_{ij}][x_j]\Bigr).
  \tag{1.4.7}
\]

In general the image of an interval vector under a transformation is not an interval vector. As
a consequence, the inclusion operation will usually yield an overestimation. This phenomenon,
intrinsic to interval computations, is called the wrapping effect. Rump [943, 1999] gives an
algorithm for computing the product of two interval matrices using eight matrix products.
A square interval matrix [A] is called nonsingular if it does not contain a singular matrix.
An interval linear system is a system of the form [A] x = [b], where A is a nonsingular interval
matrix and b an interval vector. The solution set of such an interval linear system is the set

X = {x | Ax = b, A ∈ [A], b ∈ [b]}. (1.4.8)

Computing this solution set can be shown to be an intractable (NP-complete) problem. Even for
a 2 × 2 linear system, this set may not be easy to represent.
An efficient and easy-to-use MATLAB toolbox called INTLAB (INTerval LABoratory) has
been developed by Rump [944, 1999]. It contains many useful subroutines and allows verified
solutions of linear least squares problems to be computed.

1.4.2 Rounding Errors in Matrix Operations


By repeatedly using the formula for floating-point multiplication, one can show that the com-
puted product f l(x1 x2 · · · xn ) is exactly equal to

x1 x2 (1 + ϵ2 )x3 (1 + ϵ3 ) · · · xn (1 + ϵn ),

where |ϵi | ≤ u, i = 2, 3, . . . , n. This can be interpreted as a backward error analysis; we have


shown that the computed product is the exact product of factors x1 , x̃i = xi (1+ϵi ), i = 2, . . . , n.
It also follows from this analysis that

|f l(x1 x2 · · · xn ) − x1 x2 · · · xn | ≤ δ|x1 x2 · · · xn |,

where δ = (1 + u)n−1 − 1 < 1.06(n − 1)u, and the last inequality holds if the condition
(n − 1)u < 0.1 is satisfied. This bounds the forward error in the computed result. Simi-
lar results can easily be derived for basic vector and matrix operations; see Wilkinson [1120,
1965, pp. 114–118].
In the following we mainly use notation due to Higham [623, 2002, Sect. 3.4]. Let |δi | ≤ u
and ρi = ±1, i = 1 : n. If nu < 1, then
\[
  \prod_{i=1}^n (1 + \delta_i)^{\rho_i} = 1 + \theta_n, \qquad |\theta_n| < \gamma_n, \qquad
  \gamma_n = \frac{nu}{1 - nu}.
  \tag{1.4.9}
\]

With the realistic assumption that nu < 0.1, it holds that γn < 1.06nu. Often it can be assumed
that nu ≪ 1, and we can set γn = nu. When it is not worth the trouble to keep precise track of
constants in the γk terms, we use Higham’s notation
\[
  \tilde{\gamma}_n \equiv \frac{cnu}{1 - cnu},
  \tag{1.4.10}
\]
where c denotes a small integer constant whose exact value is unimportant.
If the inner product $x^T y = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$ is accumulated from left to right,
repeated use of (1.4.4) gives

f l (xT y) = x1 y1 (1 + δ1 ) + x2 y2 (1 + δ2 ) + · · · + xn yn (1 + δn ),

where |δ1 | < γn , |δi | < γn+2−i , i = 2, . . . , n. This gives the forward error bound
\[
  |f l\,(x^T y) - x^T y| < \gamma_n |x_1||y_1| + \sum_{i=2}^n \gamma_{n+2-i} |x_i||y_i| < \gamma_n |x|^T |y|,
  \tag{1.4.11}
\]

where |x|, |y| denote vectors with elements |xi |, |yi |. Note that the error magnitudes depend
on the order of evaluation. The last upper bound in (1.4.11) holds independently of the sum-
mation order and is also valid for floating-point computation with no guard digit rounding. The
corresponding backward error bounds

f l (xT y) = (x + ∆x)T y = xT (y + ∆y), |∆x| ≤ γn |x|, |∆y| ≤ γn |y| (1.4.12)

also hold for any order of evaluation. This result is easily generalized to yield a forward error
analysis of matrix-matrix multiplication. However, for this case there is no backward error analy-
sis, because the rows and columns of the two matrices participate in several inner products! For
the outer product xy T of two vectors x, y ∈ Rn we have f l (xi yj ) = xi yj (1 + δij ), |δij | ≤ u,
and so
|f l (xy T ) − xy T | ≤ u |xy T |. (1.4.13)
This is a satisfactory result for many purposes. However, the computed result is usually not
a rank-one matrix. In general, it is not possible to find perturbations ∆x and ∆y such that
f l(xy T ) = (x + ∆x)(y + ∆y)T .
In many matrix algorithms, expressions of the form
\[
  y = \Bigl(c - \sum_{i=1}^{k-1} a_i b_i\Bigr) \Big/ d
\]

occur repeatedly. A simple extension of the roundoff analysis of an inner product shows that if
the term c is added last, then the computed ȳ satisfies
\[
  \bar{y}\, d\, (1 + \delta_k) = c - \sum_{i=1}^{k-1} a_i b_i (1 + \delta_i),
  \tag{1.4.14}
\]

where |δ1 | ≤ γk−1 , |δi | ≤ γk+1−i , i = 2, . . . , k − 1, and |δk | ≤ γ2 . The result is formulated so
that c is not perturbed. The forward error satisfies
\[
  \Bigl| \bar{y} d - \Bigl(c - \sum_{i=1}^{k-1} a_i b_i\Bigr) \Bigr|
  \le \gamma_k \Bigl( |\bar{y} d| + \sum_{i=1}^{k-1} |a_i||b_i| \Bigr),
  \tag{1.4.15}
\]

and this inequality holds for any summation order.


From the error analysis of inner products, error bounds for matrix-vector and matrix-matrix
products can easily be obtained. Suppose A ∈ Rm×n , x ∈ Rn , and y = Ax. Then yi = aTi x,
where $a_i^T$ is the ith row of A. From (1.4.12) we have $f l\,(a_i^T x) = (a_i + \Delta a_i)^T x$, where $|\Delta a_i| \le$
γn |ai |, giving the backward error result

f l (Ax) = (A + ∆A)x, |∆A| ≤ γn |A|, (1.4.16)

where the inequality is to be interpreted elementwise.


Consider the matrix product C = AB, where A ∈ Rm×n and B = (b1 , . . . , bp ) ∈ Rn×p . By
(1.4.16), we have
\[
  f l\,(A b_j) = (A + \Delta_j A) b_j, \qquad |\Delta_j A| \le \gamma_n |A|.
\]

Hence each computed column cj in C has a small backward error. The same cannot be said for
C = AB as a whole, because the perturbation of A depends on j. For the forward error we get
the bound
|f l (AB) − AB| ≤ γn |A||B|, (1.4.17)
and it follows that ∥f l (AB) − AB∥ ≤ γn ∥ |A| ∥ ∥ |B| ∥. Hence for any absolute norm, such as
ℓ1 , ℓ∞ , and Frobenius norms, it holds that

∥f l (AB) − AB∥ ≤ γn ∥A∥ ∥B∥. (1.4.18)

For the ℓ2 -norm the best upper bound is ∥f l (AB) − AB∥2 < nγn ∥A∥2 ∥B∥2 , unless A and B
have only nonnegative elements. The rounding error results here are formulated for real arith-
metic. Similar bounds hold for complex arithmetic provided the constants in the bounds are
increased appropriately.

1.4.3 Stability of Matrix Algorithms


Consider a finite algebraic algorithm that from data a = (a1 , . . . , ar ) computes the solution
f = (f1 , . . . , fs ). In a forward error analysis one attempts to bound some norm of the error
|fˆ − f | in the computed solution fˆi . An algorithm is said to be forward stable (or acceptable-
error stable) if for some norm ∥ · ∥ the computed solution f¯ satisfies

∥f¯ − f ∥ ≤ cκu, (1.4.19)

where c is a not-too-large constant and κ is the condition number of the problem from a pertur-
bation analysis. Clearly this is a weaker form of stability.
Backward error analysis for matrix algorithms was pioneered by J. H. Wilkinson in the late
1950s. When it applies, it tends to be markedly superior to forward analysis. In a backward error
analysis one attempts to show that for some class of input data, the algorithm computes a solution
fˆ that is the exact solution corresponding to a modified set of data ãi close to the original data
ai . There may be an infinite number of such sets, but it can also happen that no such set exists.
An algorithm is said to be backward stable if for some norm ∥ · ∥,

∥ã − a∥ < cu∥a∥,

where c is a not-too-large constant. Backward error analysis usually gives better insight into the
stability (or lack thereof) of the algorithm, which often is the primary purpose of an error analysis.
Notice that no reference is made to the exact solution for the original data. A backward stable
algorithm is guaranteed to give an accurate solution only if the problem is well-conditioned. To
yield error bounds for the solution, the backward error analysis has to be complemented with a
perturbation analysis. It can only be expected that the error satisfies an inequality of the form
(1.4.19). Nevertheless, if the backward error is within the uncertainties of the given data, it can
be argued that the computed solution is as good as the data warrants. From the error bound
(1.4.15) it is straightforward to derive a bound for the backward error in solving a triangular
system of equations.

Theorem 1.4.1. If the lower triangular system $Lx = b$, $L \in \mathbb{R}^{n\times n}$, is solved by forward substitution,
the computed solution $\bar{x}$ is the exact solution of $(L + \Delta L)\bar{x} = b$, where for $i = 1, \ldots, n$,
\[
  |\Delta l_{ij}| \le \begin{cases}
    \gamma_2\, |l_{ij}| & \text{if } j = i, \\
    \gamma_{i-j}\, |l_{ij}| & \text{if } j < i,
  \end{cases}
  \tag{1.4.20}
\]
where $\gamma_n = nu/(1 - nu)$. Hence $|\Delta L| \le \gamma_n |L|$ for any summation order.

Proof. See Higham [623, 2002, Theorem 8.3].

A similar result holds for the computed solution of an upper triangular system. We conclude
that the algorithm for solving triangular systems by forward or backward substitution is backward
stable. More precisely, we call the algorithm componentwise backward stable, because (1.4.20)
bounds the perturbations in L componentwise. Note that it is not necessary to perturb the right-
hand side b.
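For reference, a sketch of the forward substitution algorithm covered by Theorem 1.4.1 (Python/NumPy):

import numpy as np

def forward_substitution(L, b):
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        # x_i = (b_i - sum_{j<i} l_ij x_j) / l_ii
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x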
Often the data matrix M belongs to a class M of structured matrices, such as Toeplitz matri-
ces (see Section 4.5.5) or the augmented system matrix
\[
  M = \begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix}.
\]

An algorithm for solving M x = b, M ∈ M, is strongly backward stable if it computes an


exact solution of a system M̄ x = b̄, where M̄ ∈ M, and M̄ and b̄ are close to M and b; see
Bunch [185, 1987]. For many classes of structured problems, strongly backward stable methods
do not exist, but there are algorithms that can be proved to give a solution close to the exact
answer of some relevant nearby system.
Chapter 2

Basic Numerical Methods

2.1 The Method of Normal Equations


2.1.1 Introduction
The classical method used by Gauss for solving linear least squares problems

\[
  \min_x \|Ax - b\|_2, \qquad A \in \mathbb{R}^{m\times n},
\]

is to form and solve the normal equations ATAx = AT b. If A has full column rank, then AT A is
positive definite and hence nonsingular. Gauss developed an elimination method for solving the
normal equations that uses pivots chosen from the diagonal; see Stewart [1028, 1995]. Then all
reduced matrices are symmetric, and the storage and number of needed operations are reduced
by half. Later, the preferred way to organize this elimination algorithm was to use Cholesky
factorization, named after André-Louis Cholesky (1875–1918). The accuracy of the computed
least squares solution using the normal equations depends on the square of the condition number
of A. Indeed, accuracy may be lost already when forming ATA and AT b. Hence, the method
of normal equations works well only for well-conditioned problems or when modest accuracy is
required. Otherwise, algorithms based on orthogonalization should be preferred.
Much of the background theory on the matrix algorithms in this chapter is given in the ex-
cellent textbook of Stewart [1030, 1998]. Accuracy and stability properties of matrix algorithms
are admirably covered by Higham [623, 2002].

2.1.2 Forming the Normal Equations


The first step in the method of normal equations is to form

C = ATA ∈ Rn×n , d = AT b ∈ R n .

Given the symmetry of C, it suffices to compute the $\frac{1}{2} n(n + 1)$ elements of its upper triangular
part (say). If m ≥ n, this number is always less than the mn elements in A ∈ Rm×n . Hence,
forming the normal equations can be viewed as data compression, in particular when m ≫ n. If
A = (a1 , a2 , . . . , an ), the elements of C and d can be expressed in inner product form as

cjk = aTj ak , dj = aTj b, 1 ≤ j ≤ k ≤ n. (2.1.1)


If A is instead partitioned by rows ã_i^T, i = 1, . . . , m, a row-oriented algorithm is obtained:

    C = Σ_{i=1}^{m} ã_i ã_i^T,   d = Σ_{i=1}^{m} b_i ã_i.        (2.1.2)
i=1 i=1

This expresses C as a sum of m matrices of rank one and d as a linear combination of the rows
of A. It has the advantage that only one pass through the data A and b is required, and it can be
used to update the normal equations when equations are added or deleted. For a dense matrix,
both schemes require mn(n + 1) floating-point operations or flops, but the outer product form
requires more memory access. When A is sparse, zero elements in A can more easily be taken
into account by the outer product scheme. If the maximum number of nonzero elements in a
row of A is p, then the outer product scheme only requires m(p + 1)2 flops. Note that the flop
count only measures the amount of arithmetic work of a matrix computation but is often not an
adequate measure of the overall complexity of the computation.
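For illustration (this sketch and the function name are ours, not from the text), both ways of forming the normal equations can be written in a few lines of MATLAB; the column-oriented variant uses the inner products (2.1.1), and the row-oriented variant accumulates the rank-one terms (2.1.2).

function [C,d] = normaleq(A,b,scheme)
% NORMALEQ forms C = A'*A and d = A'*b, either columnwise from
% inner products (2.1.1) or rowwise as a sum of rank-one terms
% (2.1.2). Illustrative sketch only.
[m,n] = size(A);
C = zeros(n); d = zeros(n,1);
if nargin < 3 || strcmp(scheme,'inner')
    for k = 1:n
        for j = 1:k
            C(j,k) = A(:,j)'*A(:,k);   % c_jk = a_j'*a_k, j <= k
        end
        d(k) = A(:,k)'*b;
    end
    C = C + triu(C,1)';                % fill in the lower triangle
else                                   % one pass through the rows of A and b
    for i = 1:m
        ai = A(i,:)';
        C = C + ai*ai';                % rank-one update
        d = d + b(i)*ai;
    end
end
end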
Rounding errors made in forming the normal equations can be obtained from the results in
Section 1.4.2 for floating-point arithmetic. The computed elements in C = ATA are
    c̄_ij = Σ_{k=1}^{m} a_ik a_jk (1 + δ_k),   |δ_k| < 1.06(m + 2 − k)u,

where u is the unit roundoff. It follows that the computed matrix satisfies C̄ = C + E, where

    |e_ij| < 1.06 m u |a_i|^T |a_j| ≤ 1.06 m u ∥a_i∥_2 ∥a_j∥_2.        (2.1.3)

A similar estimate holds for the rounding errors in the computed vector AT b. It is important to
note that the rounding errors in forming ATA are in general not equivalent to small perturbations
of the initial data matrix A, i.e., it is not true that C̄ = (A + E)T (A + E) for some small error
matrix E. Furthermore, forming the normal equations squares the condition number:

    κ(C) = κ(A)^2.

Therefore methods that form the normal equations explicitly are not backward stable.

Example 2.1.1. A simple example where information is lost when ATA is formed is the matrix
studied by Läuchli [724, 1961]:

    A = ( 1  1  1 )              ( 1+ϵ^2   1       1     )
        ( ϵ  0  0 ),   A^T A =   ( 1       1+ϵ^2   1     ).
        ( 0  ϵ  0 )              ( 1       1       1+ϵ^2 )
        ( 0  0  ϵ )

Assume that ϵ = 10^{−8} and that six decimal digits are used for the elements of A^T A. Then, because 1 + ϵ^2 = 1 + 10^{−16} is rounded to 1 even in IEEE double precision, all information contained in the last three rows of A is irretrievably lost.

Sometimes an unsuitable formulation of the problem will cause the least squares problem
to be ill-conditioned. Then a different choice of parametrization may significantly reduce the
ill-conditioning. For example, in regression problems one should try to use orthogonal or nearly
orthogonal basis functions. Consider, for example, a linear regression problem for fitting a linear
model y = α + βt to the given data (yi , ti ), i = 1, . . . , m. This is a least squares problem

min_x ∥Ax − y∥_2, where x = (α, β)^T and

    A = (e  t) = ( 1   t_1 )          ( y_1 )
                 ( 1   t_2 )          ( y_2 )
                 ( ⋮    ⋮  ),     y = (  ⋮  ).
                 ( 1   t_m )          ( y_m )

The corresponding normal equations can be written as

    ( m       e^T t ) ( α )   ( e^T y )
    ( e^T t   t^T t ) ( β ) = ( y^T t ).

The solution is

    β = [ m(y^T t) − (e^T t)(e^T y) ] / [ m(t^T t) − (e^T t)^2 ],   α = (e^T y − β e^T t)/m.

Note that the mean values of the data ȳ = eT y/m, t̄ = eT t/m lie on the fitted line, that is,
α + β t̄ = ȳ.
A more accurate formula for β is obtained by centering the data, i.e., by making the change of variables ỹ_i = y_i − ȳ, t̃_i = t_i − t̄, i = 1, . . . , m, and writing the model as ỹ = β t̃. In the new variables, e^T t̃ = 0 and e^T ỹ = 0. Hence the matrix of normal equations is diagonal, and we get

    β = ỹ^T t̃ / t̃^T t̃.        (2.1.4)

When the elements in A and b are the original data, ill-conditioning cannot be avoided, as
in the above example, by choosing another parametrization. The method of normal equations
works well for well-conditioned problems or when only modest accuracy is required. However,
the accuracy of the computed solution will depend on the square of the condition number of A.
In view of the perturbation result in Theorem 1.3.14, this is not consistent with the sensitivity of
small-residual problems, and the method of normal equations can introduce errors much greater
than those of a backward stable algorithm. For less severely ill-conditioned problems, this can
be offset by iterative refinement; see Section 2.5.3. Otherwise, algorithms that use orthogonal
transformations and avoid forming ATA should be used; see Section 2.2.

2.1.3 Algorithms for Cholesky Factorization


The Cholesky factorization is the preferred way to organize the elimination algorithm for the
normal equations.3 By Theorem 1.2.1, for any symmetric positive definite matrix C ∈ Rn×n
there exists a unique Cholesky factorization

C = RT R, (2.1.5)

where R = (r_ij) is upper triangular with positive diagonal elements. Written componentwise, this is the n(n + 1)/2 equations

    c_ij = r_ii r_ij + Σ_{k=1}^{i−1} r_ki r_kj,   1 ≤ i ≤ j ≤ n,        (2.1.6)

3 It is named after André-Louis Cholesky (1875–1918), who was a French military officer involved in the surveying

of Crete and North Africa before the First World War. His work was posthumously published by Benoît [105, 1924].

for the n(n+1)/2 unknown elements in R. If properly sequenced, (2.1.6) can be used to compute
R. An element rij can be computed from equation (2.1.6) provided rii and the elements rki , rkj ,
k < i, are known. It follows that one way is to compute R column by column from left to right
as in the proof of Theorem 1.2.1. Such an algorithm is called left-looking. Let C_k ∈ R^{k×k} be the kth leading principal submatrix of C. If C_{k−1} = R_{k−1}^H R_{k−1}, then

    C_k = ( C_{k−1}   d_k )   ( R_{k−1}^H   0   ) ( R_{k−1}   r_k )
          ( d_k^T     γ_k ) = ( r_k^H       ρ_k ) ( 0         ρ_k )        (2.1.7)

implies that

    R_{k−1}^H r_k = d_k,   ρ_k^2 = γ_k − r_k^H r_k.        (2.1.8)

Hence, the kth step of columnwise Cholesky factorization requires the solution of a lower triangular system. The algorithm requires approximately n^3/3 flops and n square roots. Only the elements in the upper triangular part of C are referenced, and R can overwrite C.

Algorithm 2.1.1 (Columnwise Cholesky).


function R = cholc(C);
% CHOLC computes the Cholesky factor R = (r(i,j))
% of the positive definite Hermitian matrix C in
% columnwise order.
% -----------------------------------------------
n = size(C,1); R = zeros(n,n);
R(1,1) = sqrt(C(1,1));
for j = 2:n % Compute the j:th column of R.
I = 1:j-1; % I is index set
R(I,j) = R(I,I)'\C(I,j);
R(j,j) = sqrt(C(j,j) - R(I,j)'*R(I,j));
end

A second variant computes in step k = 1, . . . , n the elements in the kth row of R. This is a row-
wise or right-looking algorithm. The arithmetic work is dominated by matrix-vector products.

Algorithm 2.1.2 (Rowwise Cholesky).


function R = cholr(C);
% CHOLR computes the Cholesky factor R = (r(i,j)) of the
% positive definite Hermitian matrix C in rowwise order.
% --------------------------------------------------
n = size(C,1); R = zeros(n,n);
for i = 1:n % Compute the i:th row of R.
if i == 1
s(i:n) = C(i,i:n);
else
s(i:n) = C(i,i:n) - R(1:i-1,i)'*R(1:i-1,i:n);
end
R(i,i) = sqrt(s(i));
R(i,i+1:n) = s(i+1:n)/R(i,i);
end

It is simple to modify the Cholesky algorithms so that they instead compute a factorization of the form C = R^T D R, where D is diagonal and R unit upper triangular. This factorization requires no square roots and may therefore be slightly faster. The two algorithms above are numerically equivalent, i.e., they compute the same factor R, even taking rounding errors into account. A normwise error analysis is given by Wilkinson [1121, 1968].
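Returning to the square-root-free variant C = R^T D R mentioned above, a minimal MATLAB sketch in rowwise order might look as follows (the code and the function name are ours, for illustration only).

function [R,D] = rtdr(C)
% RTDR computes C = R'*D*R with R unit upper triangular and D
% diagonal; no square roots are needed. Illustrative sketch only.
n = size(C,1);
R = eye(n); d = zeros(n,1);
for i = 1:n
    v = d(1:i-1).*R(1:i-1,i);                     % d_k*r_ki, k < i
    d(i) = C(i,i) - R(1:i-1,i)'*v;
    R(i,i+1:n) = (C(i,i+1:n) - v'*R(1:i-1,i+1:n))/d(i);
end
D = diag(d);
end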
Taking i = j in (2.1.6) gives c_ii = Σ_{k=1}^{i} r_ki^2, i = 1, . . . , n. It follows that

    c_ii ≥ max_{1≤k≤i} |r_ki|^2,   i = 1, . . . , n.        (2.1.9)

Hence, the elements in R are bounded in size by the square root of the largest diagonal element of C. This shows the numerical stability of Cholesky factorization.

Theorem 2.1.2. Let C ∈ Rn×n be a symmetric positive definite matrix such that

2n3/2 uκ(C) < 0.1. (2.1.10)

Then the Cholesky factor of C can be computed without breakdown, and the computed R̄ satisfies

R̄T R̄ = C + E, ∥E∥2 < 2.5n3/2 u∥C∥2 . (2.1.11)

A componentwise result is given by Demmel [303, 1989].

Theorem 2.1.3. If Cholesky factorization applied to the symmetric positive definite matrix C ∈
R^{n×n} runs to completion, then the computed factor R satisfies

    R^T R = C + E,   |E| ≤ γ_{n+1}(1 − γ_{n+1})^{−1} dd^T,        (2.1.12)

where d_i = c_ii^{1/2} and γ_k = ku/(1 − ku).

These results show that the given algorithms for computing the Cholesky factor R from
C are backward stable. The error in the computed Cholesky factor will affect the error in
the least squares solution x̄ computed by the method of normal equations. When C = ATA,
κ(C) = κ(A)^2, and this squaring of the condition number implies that the Cholesky algorithm may break down, and roots of negative numbers may arise when κ(A) is of the order of 1/√u.
Rounding errors in the solution of the triangular systems RTRx = AT b are usually negligible;
see Higham [615, 1989]. From Theorem 2.1.2 it follows that the backward error in the computed
solution x̄ caused by the Cholesky factorization satisfies

(ATA + E)x̄ = AT b, ∥E∥2 < 2.5n3/2 u∥A∥22 .

A perturbation analysis shows that

    ∥x̄ − x∥_2 ≤ 2.5 n^{3/2} u κ(A)^2 ∥x∥_2.        (2.1.13)

The squared condition number shows that the method of normal equations is not a backward
stable method for solving a least squares problem.

2.1.4 Conditioning and Scaling Invariance


In a linear system Ax = b the unknowns x_j, j = 1, . . . , n, are often physical quantities. Chang-
ing the units in which these are measured is equivalent to scaling the columns of A. If we also
multiply each equation by a scalar, this corresponds to scaling the rows of A and b. The original
system is then transformed into an equivalent system A′ x′ = b′ , where

A′ = D2 AD1 , b′ = D2 b, x = D 1 x′ . (2.1.14)

It seems natural to expect that such a scaling should have no effect on the relative accuracy of
the computed solution. If the system is solved by Gaussian elimination or, equivalently, by LU
factorization, this is in fact true, as shown by the following theorem due to Bauer [95, 1966].

Theorem 2.1.4. Denote by x̄ and x̄′ the computed solution using LU factorization in floating-
point arithmetic to the two linear systems Ax = b and (D2 AD1 )x′ = D2 b, where D1 and D2
are diagonal scaling matrices. If no rounding errors are introduced by the scaling, and the same
pivots are used, then x̄ = D1 x̄′ holds exactly.

For a least squares problem minx ∥Ax − b∥2 , the column scaling AD corresponds to a two-
sided symmetric scaling (AD)T AD = DCD of the normal equations matrix. Note that scaling
the rows in A is not allowed because this would change the LS objective function; see Sec-
tion 3.2.1.
Cholesky factorization of C = ATA is a special case of LU factorization, and therefore
by Theorem 2.1.4 it is numerically invariant under a diagonal scaling that preserves symmetry.
Condition (2.1.10) in Theorem 2.1.2 can therefore be replaced by 2n3/2 u(κ′ )2 < 0.1, where
κ′ (A) = minD κ(AD), D > 0. Furthermore, the error bound (2.1.13) for the computed solution
by Cholesky factorization can be improved to
∥x̄ − x∥2 ≤ 2.5n3/2 u κ′ (A)κ(A)∥x∥2 . (2.1.15)
Note that these improvements hold without explicitly carrying out any scaling in the algorithm.

Theorem 2.1.5. Let C ∈ Rn×n be symmetric and positive definite with at most q ≤ n nonzero
elements in any row. If all diagonal elements in C are equal, then
κ(C) ≤ q min κ(DCD), (2.1.16)
D>0

where D > 0 denotes the set of all n × n diagonal matrices with positive entries.

Proof. See Higham [623, 2002, Theorem 7.5].

If C = A^T A and the columns of A are scaled to have equal 2-norms, then all diagonal elements of C are equal, and from Theorem 2.1.5 with q = n it follows that

    κ(A) ≤ √n  min_{D>0} κ(AD).        (2.1.17)

Hence, choosing D so that the columns of AD have equal length is a nearly optimal scaling.

Example 2.1.6. Sometimes the error in the solution computed by the method of normal equa-
tions is much smaller than the error bound in (2.1.13). This is often due to poor column scaling, and the observed error is then well predicted by the bound (2.1.15). Consider the least squares fitting of a
polynomial
    p(t) = c_0 + c_1 t + · · · + c_{n−1} t^{n−1}
to observations yi = p(ti ) at points ti = 0, 1, . . . , m − 1. The resulting least squares problem is
minc ∥Ac − y∥2 , where A ∈ Rm×n has elements
    a_ij = (i − 1)^{j−1},   1 ≤ i ≤ m,   1 ≤ j ≤ n.
For m = 21 and n = 6, the matrix is quite ill-conditioned: κ(A) = 6.40 · 10^6. However, the condition number of the scaled matrix AD, whose columns all have unit length, is more than three orders of magnitude smaller: κ(AD) = 2.22 · 10^3.
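The example is easily reproduced in MATLAB (an illustrative sketch, not from the text; it relies on implicit expansion, available in MATLAB R2016b and later):

% Polynomial data matrix of Example 2.1.6 and the effect of
% scaling its columns to unit length. Illustrative sketch only.
m = 21; n = 6;
i = (1:m)'; j = 1:n;
A = (i-1).^(j-1);                  % a_ij = (i-1)^(j-1)
kappaA  = cond(A)                  % ill-conditioned
D = diag(1./sqrt(sum(A.^2)));      % columns of A*D have unit 2-norm
kappaAD = cond(A*D)                % much smaller condition number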

2.1.5 Computing the Covariance Matrix


In least squares problems it is often necessary to compute the associated covariance matrix in
order to estimate the accuracy of the computed results. In particular, the variance of xi is pro-
portional to the ith diagonal element of (ATA)−1 . Let Ax = b + ϵ, A ∈ Rm×n , be a full-rank
linear model, where ϵ is a random vector with zero mean and covariance matrix σ 2 I. By the
Gauss–Markov theorem, the covariance matrix of the least squares estimate x̂ is σ 2 Cx , where

Cx = (ATA)−1 = (RT R)−1 = R−1 R−T . (2.1.18)

The inverse S = R^{−1} = (s_ij) is upper triangular and can be computed in n^3/3 flops from the matrix equation RS = I as follows:

    for j = n, n − 1, . . . , 1
        s_jj = 1/r_jj;
        for i = j − 1, . . . , 2, 1
            s_ij = −( Σ_{k=i+1}^{j} r_ik s_kj ) / r_ii;
        end
    end
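A direct MATLAB transcription of this recurrence, followed by the formation of C_x = SS^T, could read as follows (illustrative sketch; the function name is ours):

function Cx = covx(R)
% COVX computes Cx = (R'*R)^(-1) = S*S' with S = inv(R) from the
% upper triangular factor R. Illustrative sketch only.
n = size(R,1);
S = zeros(n);
for j = n:-1:1
    S(j,j) = 1/R(j,j);
    for i = j-1:-1:1
        S(i,j) = -(R(i,i+1:j)*S(i+1:j,j))/R(i,i);
    end
end
Cx = S*S';        % in practice only the upper triangle is formed
end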

The computed elements of S can overwrite the corresponding elements of R in storage. Forming
the upper triangular part of Cx = SS T requires an additional n3 /3 flops. This computation can
be sequenced so that the elements of Cx overwrite those of S. The variance of the components
of x is given by the diagonal elements of Cx :
    c_nn = s_nn^2 = 1/r_nn^2,   c_ii = Σ_{j=i}^{n} s_ij^2,   i = n − 1, . . . , 1.

Note that the variance for xn is available directly from the last diagonal element rnn . The
covariance matrix of the residual vector r̂ = b − Ax̂ is

Cr = (I − A(ATA)−1 AT ) = (I − Q1 QT1 ), Q1 = AR−1 . (2.1.19)

Here I − Q1 QT1 is the orthogonal projector onto the nullspace of AT .


An unbiased estimate of σ 2 in the Gauss–Markov model is given by

s2 = ∥r̂∥22 /(m − n), r̂ = b − Ax̂. (2.1.20)

The normalized residuals


    r̃ = (s diag(C_r))^{−1/2} r̂
are often used to detect and identify single or multiple bad data, which are assumed to correspond
to large components in r̃.
Often Cx occurs only as an intermediate quantity in a formula. For example, the variance of
a linear functional f T x̂ is
f T Cx f = f T R−1 R−T f = z T z, (2.1.21)
where z = R−T f . Thus, the variance can be computed by solving the lower triangular system
R^T z = f and forming z^T z. This is more stable and efficient than explicitly forming C_x and evaluating f^T C_x f.
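In MATLAB the computation amounts to one triangular solve (illustrative sketch; the function name is ours):

function vf = funcvar(R,f,s2)
% FUNCVAR returns the estimated variance s2*f'*(R'*R)^(-1)*f of the
% linear functional f'*xhat, computed as in (2.1.21) without forming
% the covariance matrix. Illustrative sketch only.
z  = R'\f;          % solve R'z = f
vf = s2*(z'*z);
end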
An algorithm for computing selected elements of Cx when A is a band matrix is given in
Section 4.1.3. This uses the identity RCx = R−T that follows from the definition (2.1.18).

2.2 Orthogonalization Methods


2.2.1 Householder and Givens Transformations
A rank-one modification of the unit matrix E = I − βuv T is called an elementary matrix. In
particular, a matrix of the form

P = I − βuuT , β = 2/(uT u) (2.2.1)

is symmetric and orthogonal: P T = P and P T P = I. It follows that P 2 = I and hence


P −1 = P . Note the similarity to an elementary orthogonal projection matrix; see (1.1.15). The
product of P with a given vector a is

P a = a − β(uT a)u

and can be found without explicitly forming P itself. The effect of this transformation is that
it reflects the vector a in the hyperplane with normal vector u; see Figure 2.2.1. Note that
P a ∈ span {a, u} and P u = −u so that P reverses u.

Figure 2.2.1. Reflection of a vector a in a hyperplane with normal u.

The use of orthogonal reflections in numerical linear algebra was initiated by Householder
[644, 1958]. Therefore, a matrix P of the form (2.2.1) is also called a Householder reflector.
The most common use of Householder reflectors is reducing a given vector a to a multiple of the
unit vector e1 = (1, 0, . . . , 0)T :

P a = a − β(uT a)u = ±σe1 . (2.2.2)

Since P is orthogonal, we have σ = ∥a∥2 ̸= 0. From (2.2.2) it follows that P a = ±σe1 for

u = a ∓ σe1 . (2.2.3)

Furthermore, setting α1 = aT e1 , we have

uT u = (a ∓ σe1 )T (a ∓ σe1 ) = σ 2 ∓ 2σα1 + σ 2 = 2σ(σ ∓ α1 ), (2.2.4)

so that 1/β = σ(σ ∓ α1 ). To avoid cancellation when a is close to a multiple of e1 , the standard
choice is to take
u = a + sign (α1 )σe1 , 1/β = σ(σ + |α1 |). (2.2.5)
This corresponds to choosing the outer bisector in the Householder reflector.
Given a real vector a ≠ 0, Algorithm 2.2.1 computes a Householder reflector P = I − βuu^T,
where u is normalized so that u1 = 1. If n = 1 or a(2 : n) = 0, then β = 0 is returned.

Algorithm 2.2.1 (Construct Householder Reflector).


function [u,beta,sigma] = houseg(a)
% Constructs a Householder reflector such
% that (I - beta*u*u')a = sigma*e_1
% -----------------------------------------
u = a;
sigma = norm(a);
u(1) = sigma + abs(a(1));
beta = u(1)/sigma;
if a(1) < 0
u(1) = -u(1);
else
sigma = -sigma;
end
u = u/u(1);
end

Householder reflections for use with complex vectors and matrices can be constructed as
follows; see Wilkinson [1119, 1965, pp. 49–50]. A complex Householder reflector has the form
    P = I − βuu^H,   β = 2/(u^H u),   u ∈ C^n,        (2.2.6)
and is Hermitian and unitary (P H = P , P H P = I). Given x ∈ Cn such that eT1 x = ξ1 =
eiα |ξ1 |, we want to determine P so that

P x = γe1 , |γ| = σ = ∥x∥2 .

Since P is Hermitian, x^H P x = γ x^H e_1 = γ ξ̄_1 must be real. It follows that γ = ±e^{iα}σ. To avoid cancellation in the first component of u, the sign is chosen so that

    u = x − γe_1,   γ = −e^{iα}σ,   β = 1/(σ(σ + |ξ_1|)),        (2.2.7)

and P x = −eiα σe1 . Complex Householder transformations are discussed in more detail by
Lehoucq [732, 1996] and Demmel et al. [309, 2008].
Householder matrices can be used to introduce zeros in a column or a row of a matrix A ∈
Rm×n . The product P A can be computed as follows using only β and the Householder vector u:

P A = (I − βuuT )A = A − βu(uTA). (2.2.8)

This requires 4mn flops and alters A by a matrix of rank one. Similarly, multiplication from the
right is computed as
AP = A(I − βuuT ) = A − β(Au)uT . (2.2.9)
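With u and beta as returned by Algorithm 2.2.1, these products are computed in MATLAB as rank-one updates (an illustrative sketch of ours, not from the text):

% Apply Householder reflectors to A without forming them explicitly.
% Illustrative sketch only.
m = 6; n = 4; A = randn(m,n);
[u,beta] = houseg(A(:,1));         % reflector for the first column (length m)
PA = A - (beta*u)*(u'*A);          % (2.2.8): P*A
[v,gam]  = houseg(A(1,:)');        % reflector acting on the columns (length n)
AP = A - (A*v)*(gam*v)';           % (2.2.9): A*P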
Another useful class of elementary orthogonal transformations are plane rotations, often
called Givens rotations; see Givens [480, 1958]. These have the form
 
    G = (  c   s )
        ( −s   c ),   c = cos θ,   s = sin θ,        (2.2.10)

and satisfy GT G = I and det(G) = c2 + s2 = 1. For a two-dimensional vector v, Gv represents


a clockwise rotation through an angle θ. The matrix representing a two-dimensional rotation in

the plane spanned by the unit vectors e_i and e_k, i < k, is a rank-two modification of the unit matrix I_n, agreeing with I_n except in rows and columns i and k:

            ( 1                      )
            (   ⋱                    )
            (     c   ⋯   s          )   ← row i
    G_ik =  (     ⋮   ⋱   ⋮          )        (2.2.11)
            (    −s   ⋯   c          )   ← row k
            (                ⋱       )
            (                   1    )
Premultiplying a column vector a = (α_1, . . . , α_n)^T by G_ik will affect only the elements in rows i and k:

    ( α_i' )   (  c   s ) ( α_i )   (  cα_i + sα_k )
    ( α_k' ) = ( −s   c ) ( α_k ) = ( −sα_i + cα_k ).        (2.2.12)

Any element in a vector or matrix can be annihilated by a plane rotation. For example, if in (2.2.12) we take

    c = α_i/σ,   s = α_k/σ,   σ = √(α_i^2 + α_k^2) ≠ 0,        (2.2.13)

then α_i' = σ and α_k' = 0. Premultiplication of A ∈ R^{m×n} with a plane rotation G_ik ∈ R^{m×m} requires 6n flops and only affects rows i and k of A. Similarly, postmultiplying A with G_ik ∈ R^{n×n} will only change columns i and k.
A robust algorithm for computing c, s, and σ in a plane rotation G such that G(α, β)T = σe1
to nearly full machine precision is given below. Note that the naive expression σ = (α2 + β 2 )1/2
may produce damaging underflows and overflows even though the data and result are well within
the range of the floating-point number system.
This algorithm requires two divisions, three multiplications, and one square root. No inverse
trigonometric functions are involved. Overflow can only occur if the true value of σ itself were
to overflow.

Algorithm 2.2.2 (Construct a Real Plane Rotation).


function [c,s,sigma] = givrot(alpha,beta)
% GIVROT constructs a 2 by 2 real plane rotation
% -----------------------------------------------
if beta == 0
c = 1.0; s = 0.0;
sigma = alpha;
elseif abs(beta) > abs(alpha)
t = alpha/beta; tt = sqrt(1+t*t);
s = 1/tt; c = t*s; sigma = tt*beta;
else
t = beta/alpha; tt = sqrt(1+t*t);
c = 1/tt; s = t*c; sigma = tt*alpha;
end
end

The standard task of mapping a given vector x ∈ Rm onto a multiple of e1 can be performed
in different ways by a sequence of plane rotations. Let Gik denote a plane rotation in the plane

(i, k) that zeros out the kth component in a vector. Then one solution is to take

    G_{1m} · · · G_{13} G_{12} x = σe_1.

Note that because G_{1k} only affects components 1 and k, previously introduced zeros will not be destroyed later. Another possibility is to take

    G_{12} G_{23} · · · G_{m−1,m} x = σe_1,

where Gk−1,k is chosen to zero the kth component. This flexibility of plane rotations compared
to Householder reflections is particularly useful when operating on sparse matrices.
A plane rotation G (or reflector) can be represented by c and s and need never be explicitly
formed. Even more economical is to store either c or s, whichever is smaller. Stewart [1016,
1976] devised a scheme in which the two cases are distinguished by storing the reciprocal of c.
Then c = 0 has to be treated as a special case. If for the matrix (2.2.10) we define

    ρ = 1            if c = 0,
    ρ = sign(c) s    if |s| < |c|,        (2.2.14)
    ρ = sign(s)/c    if |c| ≤ |s|,

then the numbers c and s can be retrieved up to a common factor ±1 by

    if ρ = 1,      then c = 0,    s = 1;
    if |ρ| < 1,    then s = ρ,    c = (1 − s^2)^{1/2};
    if |ρ| > 1,    then c = 1/ρ,  s = (1 − c^2)^{1/2}.

This scheme is used because the formula (1 − x^2)^{1/2} gives poor accuracy when |x| is close to unity.
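A small MATLAB sketch of the storage scheme (ours, for illustration only), using a rotation constructed by Algorithm 2.2.2:

% Encode a plane rotation as a single number rho and decode it again
% (up to a common factor +-1). Illustrative sketch only.
[c,s] = givrot(3,4);               % example rotation
if c == 0
    rho = 1;
elseif abs(s) < abs(c)
    rho = sign(c)*s;
else
    rho = sign(s)/c;
end
% ... later, recover c and s from the stored rho:
if rho == 1
    c = 0; s = 1;
elseif abs(rho) < 1
    s = rho;   c = sqrt(1 - s^2);
else
    c = 1/rho; s = sqrt(1 - c^2);
end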
An alternative to plane rotations favored by some are plane reflectors of the form

    G̃ = ( c    s )
         ( s   −c ),   c = cos θ,   s = sin θ,        (2.2.15)

for which det(G̃(θ)) = −1. These reflectors are symmetric and orthogonal, G̃^{−1} = G̃ = G̃^T, and represent a plane rotation followed by a reflection about an axis. From trigonometric identities it follows that G̃ equals the 2 × 2 Householder reflector

    G̃ = I − 2uu^T,   u = ( −sin(θ/2) )
                         (  cos(θ/2) ).        (2.2.16)

In the complex case a plane rotation has the form


 
    G = (  c̄   s̄ )
        ( −s   c ),   |c|^2 + |s|^2 = 1,        (2.2.17)

where c̄ and s̄ denote complex conjugates. Then

    G^H G = ( c   −s̄ ) (  c̄   s̄ )   ( |c|^2 + |s|^2        0        )
            ( s    c̄ ) ( −s   c ) = (      0         |c|^2 + |s|^2 ) = I,

which shows that G is unitary. We want to construct G so that

    G ( α )   ( σ )
      ( β ) = ( 0 ).

For efficiency, we want c to be real also in the complex case. If α ≠ 0 and β ≠ 0, this can be achieved by taking

    c = |α|/√(|α|^2 + |β|^2),   s = sign(α) β̄/√(|α|^2 + |β|^2),        (2.2.18)

where σ = sign(α)√(|α|^2 + |β|^2). Here sign(z) = z/|z| is defined for any complex z ≠ 0. If α = 0 and β ≠ 0, we take

    c = 0,   s = sign(β̄),   σ = |β|.

Finally, if α = β = 0, we take c = 1, s = σ = 0. The efficient and reliable computation of σ,


c, and s in real and complex plane rotations is surprisingly complicated; see Bindel et al. [120,
2002].
So-called fast plane rotations were introduced by Gentleman [450, 1973] and Hammar-
ling [563, 1974]. Suppose that we want to perform the Givens transformation
   
    GA = (  γ   σ ) A,        A = ( α   a_2   . . .   a_n )
         ( −σ   γ )               ( β   b_2   . . .   b_n ),        (2.2.19)

where G is constructed to zero the element β in A. By keeping A in scaled form

    A = DA′,   D = ( d_1    0  )
                   (  0    d_2 ),

and updating the two factors separately, the number of multiplications can be reduced. The
transformation (2.2.19) is represented in the factored form

GA = GDA′ = D̃P A′ , GD = D̃P,

where D̃ is a diagonal matrix chosen so that two elements in P are equal to unity. This eliminates
2n multiplications in forming the product PA′. In actual computation, D^2 rather than D is stored
in order to avoid square roots.
Consider first the case |γ| ≥ |σ|, i.e., |θ| ≤ π/4. Then

    GD = (  d_1γ   d_2σ )        (       1            (d_2/d_1)(σ/γ) )
         ( −d_1σ   d_2γ ) = γD ( −(d_1/d_2)(σ/γ)            1        ) = D̃P,

and D̃^2 = γ^2 D^2. Since σ/γ = β/α = (d_2/d_1)(β′/α′), we have

    P = (    1     p_12 )
        ( −p_21     1   ),   p_21 = β′/α′,   p_12 = (d_2/d_1)^2 p_21.        (2.2.20)

Hence we only need the squares of the scale factors d_1 and d_2. The identity γ^2 = (1 + σ^2/γ^2)^{−1} implies that

    d̃_1^2 = d_1^2/t,   d̃_2^2 = d_2^2/t,   t = 1 + σ^2/γ^2 = 1 + p_12 p_21.        (2.2.21)

This eliminates the square root in the plane transformation. Similar formulas are easily derived for the other case |γ| < |σ|, i.e., |θ| > π/4, giving

    P = ( p_11    1   )
        (  −1    p_22 ),   p_22 = α′/β′,   p_11 = (d_1/d_2)^2 p_22,        (2.2.22)

and

    d̃_1^2 = d_2^2/t,   d̃_2^2 = d_1^2/t,   t = 1 + γ^2/σ^2 = p_11 p_22 + 1.        (2.2.23)

Fast plane rotations have the advantage that they reduce the number of multiplications and square
roots. However, when they are applied, the square of the scale factors is always updated by a
factor in the interval [1/2, 1]. Thus after many transformations the elements in D may underflow.
Therefore the size of the scale factors must be carefully monitored to prevent underflow or overflow.
This substantially decreases the efficiency of fast plane rotations.
Anda and Park [21, 1994] developed self-scaling fast rotations, which obviate rescalings.
Four variations of these modified fast rotations are used. The choice among the four variants is
made to diminish the larger diagonal element while increasing the smaller one.
On modern processors the gain in speed of fast plane rotations is modest, due to the nontrivial
amount of monitoring needed. Hence, their usefulness appears to be limited, and LAPACK does
not make use of them.

Example 2.2.1. Let Q ∈ R^{3×3}, det(Q) = 1, be an orthogonal matrix representing an arbitrary pure rotation in three dimensions. A classical problem is to express this as a product of three plane rotations,

    G_{23}(ϕ) G_{12}(θ) G_{23}(ψ) Q = I,        (2.2.24)

where ϕ, θ, and ψ are the Euler angles,

    ( 1   0     0  ) (  c_2   s_2   0 ) ( 1   0     0  ) ( q_11  q_12  q_13 )
    ( 0   c_3   s_3) ( −s_2   c_2   0 ) ( 0   c_1   s_1) ( q_21  q_22  q_23 ) = I.
    ( 0  −s_3   c_3) (   0     0    1 ) ( 0  −s_1   c_1) ( q_31  q_32  q_33 )

The first rotation G23 (ψ) is used to zero the element q31 . Next, G12 (θ) zeros the modified
element q21 . Finally, G23 (ϕ) is used to zero q32 . The angles can always be chosen to make the
diagonal elements positive. Since the final product is orthogonal and upper triangular, it must be
the unit matrix I3 . By orthogonality, we have

Q = G23 (−ψ)G12 (−θ)G23 (−ϕ).

A problem with this representation is that the Euler angles may not depend continuously on the
data. If Q equals the unit matrix plus small terms, then a small perturbation may change an
angle by as much as 2π. A different set of angles, based on zeroing the elements in the order
q21 , q31 , q32 , yields a continuous representation and is preferred. This corresponds to the product

G23 (ϕ)G13 (θ)G12 (ψ)Q = I3 .

For more details, see Hanson and Norris [590, 1981]. An application of Euler angles for solving
the eigenproblem of a symmetric 3×3 matrix is given by Bojanczyk and Lutoborski [167, 1991].

Notes and references


Wilkinson [1120, 1965] proved the backward stability of algorithms based on sequences of
Householder reflectors. Parlett [884, 1998] gives stable formulas for the choice of Householder
reflector corresponding to the inner bisector. Dubrulle [338, 2000] shows that the inner reflec-
tors perform better in some eigenvalue algorithms but can lead to a loss of accuracy in other
algorithms.
Plane rotations seem to have been first used by Jacobi [658, 1845] to achieve diagonal dom-
inance in systems of normal equations. The systematic use of orthogonal transformations to
reduce matrices to simpler form was initiated by Givens [480, 1958] and Householder [644,
1958].

2.2.2 Householder QR Factorization


Any matrix A ∈ Cm×n , m ≥ n, can be reduced to upper triangular form by a sequence of
orthogonal transformations. This QR factorization is one of the most important matrix factoriza-
tions and is extensively used in least squares and eigenvalue problems.

Theorem 2.2.2 (The QR Factorization). For any matrix A ∈ Cm×n , m ≥ n, of full column
rank there exists a factorization
   
    A = Q ( R )   = ( Q_1   Q_2 ) ( R )   = Q_1 R,        (2.2.25)
          ( 0 )                   ( 0 )

where Q ∈ Cm×m is unitary, Q1 ∈ Cm×n , and R ∈ Cn×n is upper triangular with real positive
diagonal elements. The matrices R and Q1 = AR−1 are uniquely determined. Q2 is not unique,
and (2.2.25) holds if we substitute Q2 P , where P ∈ C(m−n)×(m−n) is any unitary matrix. The
corresponding orthogonal projectors

    P_{R(A)} = Q_1 Q_1^H,   P_{R(A)}^⊥ = Q_2 Q_2^H = I − Q_1 Q_1^H        (2.2.26)

are uniquely determined.

Proof. The proof is constructive. Set A(1) = A and compute A(k+1) = Hk A(k) , k = 1, . . . , n.
Here Hk is a Householder reflection chosen to zero the elements below the main diagonal in
column k of A^(k). After step k, A^(k+1) is triangular in its first k columns, i.e.,

    A^(k+1) = ( R_11   R_12    )
              (  0     Ã^(k+1) ),   k = 1, . . . , n,        (2.2.27)

where R_11 ∈ R^{k×k} is upper triangular and (R_11, R_12) are the first k rows of the final factor R. If Ã^(k) = (ã_k^(k), . . . , ã_n^(k)), then H_k is taken as

    H_k = diag(I_{k−1}, H̃_k),   H̃_k = I − u_k u_k^T/γ_k,   k = 1, . . . , n,

where

    H̃_k ã_k^(k) = σ_k e_1,   σ_k = r_kk = ∥ã_k^(k)∥_2.        (2.2.28)

Note that H_k only transforms Ã^(k) and does not destroy zeros introduced in earlier steps. After n steps, we have

    A^(n+1) = Q^T A = ( R )
                      ( 0 ),   Q = H_1 · · · H_n,        (2.2.29)
which is the QR factorization of A, where Q is given as a product of Householder transforma-
tions. If m = n, the last transformation Hn can be skipped.

From (2.2.25) the columns of Q1 and Q2 form orthonormal bases for R(A) and its orthogonal
complement:
R(A) = R(Q1 ), N (AH ) = R(Q2 ). (2.2.30)
The vectors ũ^(k), k = 1, . . . , n, can overwrite the elements on and below the main diagonal of A. Thus all information associated with the factors Q and R fits into the array holding A. The vector (β_1, . . . , β_n) of length n is usually stored separately, but with the normalization u_k(1) = 1 it can also be recomputed from β_k = 2/(1 + ∥û_k∥_2^2), where û_k denotes the part of u_k stored below the diagonal.

For a complex matrix A ∈ Cm×n the QR factorization can be computed similarly by using
a sequence of unitary Householder reflectors. Note that a factor R with real positive diagonal
elements can always be obtained by a unitary scaling:
   
    A = U ( R )   = (U D^{−1}) ( DR )
          ( 0 )                (  0 ),   D = diag(e^{iα_1}, . . . , e^{iα_n}).

The factor Q is usually kept in factored form and accessed through βk and the Householder
vectors ũ(k) , k = 1, . . . , n. In step k the application of the Householder reflector to the active
part of the matrix requires 4(m − k + 1)(n − k) flops. Hence, the total flop count becomes
2(mn2 − n3 /3) or 4n3 /3 flops if m = n.
Algorithm 2.2.3 computes the QR factorization of A ∈ C^{m×n} (m ≥ n) using Householder transformations. Note that the diagonal element r_kk will be positive if the pivot element a_kk^(k) is negative and negative otherwise. Negative diagonal elements may be removed by multiplying the corresponding rows of R and columns of Q by −1.

Algorithm 2.2.3 (Householder QR Factorization).

function [U,R,beta] = houseqr(A)


% HOUSEQR computes the Householder QR factorization
% of the m by n matrix A (m >= n). At return,
% U and beta contain the Householder reflections.
% -----------------------------------------------
[m,n] = size(A);
if (m < n), error('Illegal dimensions of A'); end
u = zeros(m,1); beta = zeros(n,1);
for k = 1:n
if k < m,
% Construct and save k:th Householder reflector
[u(k:m),beta(k),A(k,k)] = houseg(A(k:m,k));
A(k+1:m,k) = u(k+1:m);
% Apply k:th Householder reflector
A(k:m,k+1:n) = A(k:m,k+1:n) - ...
beta(k)*u(k:m)*(u(k:m)'*A(k:m,k+1:n));
end
end
U = eye(m,n) + tril(A,-1);
R = triu(A(1:n,:));
end

Should Q be explicitly required, it can be computed by accumulating the products in

Q = ( Q1 Q2 ) = H1 · · · Hn Im ∈ Rm×m (2.2.31)

from left to right. Since the transformation Hk+1 leaves the first k rows unchanged, it follows
that
qk = H1 · · · Hp ek , k = 1 : m, p = min{k, n}. (2.2.32)

Generating the full matrix Q takes 4(mn(m − n) + n3 /3) flops. Algorithm 2.2.4 generates the
matrix Q1 ∈ Rm×n (m ≥ n) and requires 2(mn2 − n3 /3) flops.

Algorithm 2.2.4 (Accumulating Q1 in Householder QR).

function Q = houseq1(U,beta)
% HOUSEQ1 generates the m by n orthonormal matrix
% Q from a given Householder QR factorization
% -----------------------------------------------
[m,n] = size(U);
Q = eye(m,n);
for k = n:-1:1
uk = U(k:m,k); v = uk'*Q(k:m,k:n);
Q(k:m,k:n) = Q(k:m,k:n) - (beta(k)*uk)*v;
end
end

The matrix Q2 ∈ Rm×(m−n) gives an orthogonal basis for the orthogonal complement
N (AT ) and can be generated in 2n(m − n)(2m − n) flops. Householder QR factorization
is backward stable. The following result is due to Higham [623, 2002, Theorem 19.4].

Theorem 2.2.3. Let R̄ ∈ Rm×n denote the upper trapezoidal matrix computed by the House-
holder QR factorization for A ∈ Rm×n . Then there exists an exactly orthogonal matrix Q ∈
Rm×m such that A + ∆A = QR̄, where

∥∆aj ∥2 ≤ γ̃mn ∥aj ∥2 , j = 1, . . . , n. (2.2.33)

Here the matrix Q is given by Q = Hn · · · H2 H1 , where Hk is the Householder matrix that


corresponds to the exact application of the kth step to the computed matrix produced after k − 1
steps.

Note that the matrix Q̄ computed by the Householder QR factorization is not the exact or-
thogonal matrix Q in Theorem 2.2.3. However, it is very close to this:

∥Q̄ − Q∥F ≤ nγmn . (2.2.34)

Householder QR factorization is invariant under column scaling in the following sense. If


applied to the scaled matrix Ã = AD, where D > 0 is diagonal, it yields the factors Q̃ = Q and R̃ = RD. This invariance holds also in finite-precision arithmetic, provided the scaling is done without introducing rounding errors. The columnwise bound in Theorem 2.2.3 reflects this invariance and gives the weaker bound ∥∆A∥_F ≤ γ̃_{mn} ∥A∥_F.
Let A ∈ Cm×n have the QR factorization A = Q1 R, where R ∈ Cn×n and Q1 ∈ Cm×n .
The following perturbation bounds for the factors when A is perturbed to A + E are given by
Sun [1048, 1991].

Theorem 2.2.4. Let A = QR be the QR factorization of A ∈ Cm×n with rank(A) = n. Let A


be perturbed to A + E, where E ∈ Cm×n satisfies

µ = κ(A)∥E∥2 /∥A∥2 < 1.

Then there is a unique QR factorization A + E = (Q + W)(R + F), where

    ∥F∥_F/∥R∥_p ≤ (√2 κ_2(A)/(1 − µ)) ∥E∥_F/∥A∥_p,
    ∥W∥_F ≤ ((1 + √2) κ_2(A)/(1 − µ)) ∥E∥_F/∥A∥_2,   p = 2, F.        (2.2.35)

An important special case is the computation of the QR factorization of a matrix of the form
 
    A = ( R_1 )
        ( R_2 ),

where R_1, R_2 ∈ R^{n×n} are upper triangular. This “merging” of triangular matrices occurs as a subproblem in parallel QR factorization and in QR factorization of band and other sparse matrices. If the rows of A are permuted in the order 1, n + 1, 2, n + 2, . . . , n, 2n, then standard
Householder QR can be used without introducing extra fill. This is illustrated by the following pattern for n = 4, where × stands for a (potential) nonzero element. After the row permutation the matrix is

    × × × ×
    × × × ×
      × × ×
      × × ×
        × ×
        × ×
          ×
          ×

and the kth Householder reflection, k = 1, . . . , 4, zeros the elements below the diagonal in column k; it combines only rows that already have nonzeros in the trailing columns, so no new nonzero elements (fill) are introduced during the elimination. In practice the reordering of the rows need only be carried out implicitly. The QR factorization requires a total of approximately 2n^3/3 flops if the Householder transformations are not accumulated.
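Using the built-in qr, the merging can be sketched in MATLAB as follows (illustration only; a production code would exploit the triangular structure as described above):

% Merge two n by n upper triangular factors R1 and R2 into a single
% triangular factor R via the interleaved row ordering. Sketch only.
n = 5;
R1 = triu(randn(n)); R2 = triu(randn(n));
p = reshape([1:n; n+1:2*n], 1, []);   % order 1, n+1, 2, n+2, ..., n, 2n
A = [R1; R2];
[~,R] = qr(A(p,:), 0);                % economy-size QR of the permuted stack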
In Givens QR factorization, a sequence of rotations is used to eliminate the elements below
the diagonal of A. An advantage over Householder QR is that the rotations can be adapted to
the nonzero structure of the matrix. For example, in the QR factorization of band matrices, zeros
can be introduced one row at a time. This case is considered further in Section 4.1. Another
important example arises in algorithms for the unsymmetric eigenvalue problem. Here the QR
factorization of a Hessenberg matrix

          ( h_11   h_12   ···   h_{1,n−1}     h_{1,n}   )
          ( h_21   h_22   ···   h_{2,n−1}     h_{2,n}   )
          (        h_32   ···      ⋮             ⋮      )
    H_n = (                ⋱    h_{n−1,n−1}   h_{n−1,n} ) ∈ R^{(n+1)×n}
          (                     h_{n,n−1}     h_{n,n}   )
          (                                   h_{n+1,n} )

is needed. This is obtained using a sequence of n plane rotations,

    Q^T H_n = G_{n,n+1} · · · G_{23} G_{12} H_n = ( R )
                                                  ( 0 ),

where G_{k,k+1} is used to zero the element h_{k+1,k}, k = 1, . . . , n. The total work in Hessenberg QR factorization is about Σ_{k=1}^{n} 6(n − k) ≈ 3n^2 flops. In the special case when H_n is bidiagonal, the flop count for QR factorization is linear in n.
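With the rotations constructed by Algorithm 2.2.2, the Hessenberg QR factorization can be sketched in MATLAB as follows (illustrative only; the function name is ours, and H is assumed to have n + 1 rows as in the display above):

function [R,cs] = hessqr(H)
% HESSQR reduces an (n+1) by n Hessenberg matrix H to upper
% triangular form by n Givens rotations G(k,k+1); cs(k,:) = [c s]
% stores the kth rotation. Illustrative sketch only.
[m,n] = size(H);                     % m = n+1 assumed
cs = zeros(n,2);
for k = 1:n
    [c,s,sigma] = givrot(H(k,k), H(k+1,k));
    cs(k,:) = [c s];
    H(k,k) = sigma;  H(k+1,k) = 0;
    t = c*H(k,k+1:n) + s*H(k+1,k+1:n);
    H(k+1,k+1:n) = -s*H(k,k+1:n) + c*H(k+1,k+1:n);
    H(k,k+1:n) = t;
end
R = H(1:n,:);
end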
As for Householder QR factorization, the factor Q is usually not explicitly formed. To be
able to perform operations with Q it suffices to store the rotations. The rotations Gij (θ) can be
stored in the zeroed part of A using Stewart’s storage scheme described in Section 4.5.3. The
error properties are similar to those for the Householder algorithm. The backward error bound

(2.2.33) holds for any ordering of the rotations in Givens QR factorization. Actual errors grow
even more slowly.
Two plane rotations Gij and Gkl are said to be disjoint if the integers i, j, k, l are disjoint.
Disjoint rotations commute and can be performed in parallel. To increase the efficiency of Givens
QR factorizations, the rotations can be ordered into groups of disjoint operations. An ordering
suggested by Gentleman [451, 1975] is illustrated as follows for a 6 × 5 matrix:
    × × × × ×
    1 × × × ×
    2 3 × × ×
    3 4 5 × ×
    4 5 6 7 ×
    5 6 7 8 9

Here an integer k in position (i, j) denotes that the corresponding element is eliminated in step k. Note that the rotations within one such group are disjoint. For a matrix A ∈ R^{m×n} with m > n, m + n − 2
stages are needed.

2.2.3 Least Squares by Householder QR


Since orthogonal transformations preserve the Euclidean length, the QR factorization is an ideal
tool for solving linear least squares problems.

Theorem 2.2.5. Let A ∈ Rm×n , rank(A) = n, have the QR factorization


 
    A = Q ( R )
          ( 0 ).        (2.2.36)

Then the unique solution x to min_x ∥Ax − b∥_2 and the corresponding residual vector r = b − Ax are given by the solution of the upper triangular system Rx = d_1, where

    ( d_1 )
    ( d_2 ) = Q^T b,   r = Q (  0  )
                             ( d_2 ).        (2.2.37)

The norm of the residual r = b − Ax is ∥r∥_2 = ∥d_2∥_2.

Proof. Since Q is orthogonal, we have

    ∥r∥_2^2 = ∥Q^T(b − Ax)∥_2^2 = ∥ ( d_1 ) − ( Rx ) ∥_2^2 = ∥d_1 − Rx∥_2^2 + ∥d_2∥_2^2.
                                    ( d_2 )   (  0 )

Clearly ∥r∥_2^2 is minimized by taking Rx = d_1. From the orthogonality of Q it follows that b = Ax + r = Qd = Q_1 d_1 + Q_2 d_2. Since Q_1 d_1 = Q_1 Rx = Ax, it follows that r = Q_2 d_2 and ∥r∥_2 = ∥d_2∥_2.

Golub [487, 1965] gives an algorithm using pivoted Householder QR factorization. The factor Q is not explicitly formed but implicitly defined as Q = H_1 H_2 · · · H_n:

    ( d_1 )
    ( d_2 ) = H_n · · · H_2 H_1 b,   r = H_1 H_2 · · · H_n (  0  )
                                                           ( d_2 ).        (2.2.38)

An ALGOL implementation of Golub’s least squares algorithm is given in Businger and Golub
[193, 1965]. This later appeared in Wilkinson and Reinsch [1123, 1971].

Householder QR factorization requires 2n2 (m − n/3) flops, and computing QT b and solving
Rx = d1 require a further 4mn − n2 flops. If one wants not only ∥r∥2 but also r, another
4nm − 2n2 flops are needed. This can be compared to the method of normal equations, which
requires (mn2 + n3 /3) flops for the factorization and 2(nm + n2 ) flops for each right-hand
side. For m = n this is about the same as for the Householder QR method, but for m ≫ n the
Householder method is roughly twice as expensive.
In the following algorithm the QR factorization is applied to the extended matrix ( A  b ),

    ( A   b ) = Q ( R   d_1  )
                  ( 0   ρe_1 ),   Q = H_1 · · · H_n H_{n+1}.        (2.2.39)

Then Rx = d_1, and the residual and its norm are given by

    r = H_1 · · · H_n H_{n+1} (  0   )
                              ( ρe_1 ),   ∥r∥_2 = ρ ≥ 0.        (2.2.40)

Algorithm 2.2.5 (Least Squares Solution by Householder QR).


function [x,r,rho] = housels(A,b);
% HOUSELS computes the solution x, the residual
% r and rho = ||r||_2 to the full-rank linear
% least squares problem min||Ax - b||_2,
% -----------------------------------------------
[m,n] = size(A); % m >= n
[U,S,beta] = houseqr([A, b]);
R = S(1:n,1:n);
d1 = S(1:n,n+1);
x = R\d1;
rho = abs(S(n+1,n+1));
r = zeros(m,1);
r(n+1) = rho;
for k = n+1:-1:1
c = beta(k)*(U(k:m,k)'*r(k:m));
r(k:m) = r(k:m) - c*U(k:m,k);
end
end
The Householder algorithm for least squares problems is normwise backward stable. Theo-
rem 2.2.3 can be extended to give the following columnwise backward error bounds. Essentially
the same result holds for Givens QR factorization.

Theorem 2.2.6. Suppose that the full-rank least squares problem minx ∥Ax − b∥2 , m ≥ n, is
solved by Householder QR factorization. Then the computed solution x̂ is the exact least squares
solution to a slightly perturbed problem
min ∥(A + δA)x − (b + δb)∥2 ,
x

where ∥δaj ∥2 ≤ nγmn ∥aj ∥2 , j = 1, . . . , n, and ∥δb∥2 ≤ γmn ∥b∥2 .

Proof. See Higham [623, 2002, Theorem 20.3].

In some applications it is important that the computed residual vector r̄ be accurately orthog-
onal to R(A). The backward stability of Householder QR means that the residual computed

from (2.2.40) satisfies


(A + E)T r̄ = 0, ∥E∥2 ≤ cu∥A∥2 , (2.2.41)
for some constant c = c(m, n). Hence AT r̄ = −E T r̄, and

∥AT r̄∥2 ≤ cu∥r̄∥2 ∥A∥2 . (2.2.42)

Now assume that the residual is instead computed as r̃ = fl(b − fl(Ax)), where x is the exact least squares solution. Then A^T r = 0, and the error analysis (1.4.11) for inner products gives |A^T r̃| < γ_{n+1} |A^T|(|b| + |A||x|). It follows that

    ∥A^T r̃∥_2 ≤ n^{1/2} γ_{n+1} ∥A∥_2 (∥b∥_2 + n^{1/2} ∥A∥_2 ∥x∥_2).

This bound is much weaker than the bound (2.2.42) valid for the Householder QR method, par-
ticularly when ∥r̄∥2 ≪ ∥b∥2 .
Let A^T y = c be an underdetermined linear system, where A ∈ R^{m×n} has rank n. Then the least-norm solution y ∈ R^m can be computed from the Householder QR factorization of A as follows. We have A^T = ( R^T   0 ) Q^T and

    A^T y = ( R^T   0 ) z = c,   z = Q^T y = ( z_1 )
                                             ( z_2 ).        (2.2.43)

Since ∥y∥_2 = ∥z∥_2, the problem reduces to minimizing ∥z∥_2 subject to R^T z_1 = c. As z_1 is determined by this triangular system, the least-norm solution is obtained by setting z_2 = 0, and

    R^T z_1 = c,   y = Q ( z_1 )
                         (  0  ).        (2.2.44)
The resulting Householder QR algorithm (2.2.44) is backward stable.

Theorem 2.2.7. Assume that the least-norm solution of A^T y = c is computed by Householder QR factorization of A ∈ R^{m×n}, rank(A) = n. Then the computed solution ŷ is the exact least-norm solution of a slightly perturbed system (A + δA)^T y = c + δc, where

    ∥δa_j∥_F ≤ nγ_{mn} ∥a_j∥_F,   j = 1, . . . , n,   ∥δc∥_2 ≤ γ_{mn} ∥c∥_2.        (2.2.45)

Proof. See Higham [623, 2002, Theorem 21.4], where the result was first published; see the Notes and References at the end of Chapter 21 of that book.

When rank(A) = n, the pseudoinverses of A and A^T can be expressed in terms of the QR factorization as

    A^† = R^{−1} Q_1^T,   (A^T)^† = Q_1 R^{−T}.
The Householder QR algorithms for the least squares and minimum norm problems are special cases of a general QR algorithm for solving the augmented system

    ( I     A ) ( y )   ( b )
    ( A^T   0 ) ( x ) = ( c ),   b ∈ R^m,   c ∈ R^n,        (2.2.46)

where A ∈ R^{m×n}, rank(A) = n. From the QR factorization of A we obtain

    y + Q ( R ) x = b,   ( R^T   0 ) Q^T y = c.
          ( 0 )

Multiplying the first equation by Q^T and the second by R^{−T}, we get

    Q^T y + ( R ) x = Q^T b,   ( I_n   0 ) Q^T y = R^{−T} c.
            ( 0 )

The second equation can be used to eliminate the first n components of QT y in the first equation
to solve for x. The last m − n components d2 of QT y are obtained from the last m − n equations
in the first equation. The resulting QR algorithm for solving the augmented system (2.2.46) is
summarized below.

Algorithm 2.2.6 (Augmented System Solution by Householder QR). Compute the Householder QR factorization of A ∈ R^{m×n}, rank(A) = n, and

    z = R^{−T} c,   d = ( d_1 )
                        ( d_2 ) = Q^T b,        (2.2.47)

    x = R^{−1}(d_1 − z),   y = Q (  z  )
                                 ( d_2 ).        (2.2.48)

The algorithm requires triangular solves with R and RT and multiplications of vectors with
Q and QT for a total of 8mn − 2n2 flops. Higham [618, 1991] gives a componentwise error
analysis of this algorithm, which is of importance in the analysis of iterative refinement of least
squares solution; see Section 2.5.3.
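A compact MATLAB sketch of Algorithm 2.2.6, using the built-in qr instead of a stored factored Q (illustration only, with synthetic data):

% Solve the augmented system (2.2.46) by (2.2.47)-(2.2.48).
% Illustrative sketch only.
m = 8; n = 5;
A = randn(m,n); b = randn(m,1); c = randn(n,1);
[Q,R1] = qr(A);  R = R1(1:n,1:n);
z  = R'\c;
d  = Q'*b;  d1 = d(1:n);  d2 = d(n+1:m);
x  = R\(d1 - z);
y  = Q*[z; d2];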

Lemma 2.2.8 (Higham 1991). Let A ∈ Rm×n , rank(A) = n, and suppose the augmented
system is solved by the algorithm in (2.2.47)–(2.2.48) using Householder or Givens QR factor-
ization. Then the computed x̄ and ȳ satisfy
    
    ( I             A + E_1 ) ( ȳ )   ( b + e_1 )
    ( (A + E_2)^T      0    ) ( x̄ ) = ( c + e_2 ),

where

    |E_i| ≤ µ_{m,n} G|A|,   i = 1, 2,
    |e_1| ≤ µ_{m,1}(H_1|b| + H_2|ȳ|),   |e_2| ≤ µ_{m,1}|A^T| H_3|ȳ|,
    ∥G∥_2 ≤ 3mn(1 + θ_{m,n}),   ∥H_1∥ ≤ 3m^{3/2}(1 + θ_{m,1}),
    ∥H_2∥ ≤ 5m^{3/2}(1 + θ_{m,1}),   ∥H_3∥ ≤ 7m^{3/2} µ_{m,1}(1 + θ_{m,1}),

    θ_{m,n} = 2µ_{m,n}(m(n − 1))^{1/2} / (1 − 2µ_{m,n}(m(n − 1))^{1/2}),

where γ_n = nu/(1 − nu) and µ_{m,n} = γ_{am+bn+c}.

2.2.4 Gram–Schmidt QR Factorization


The earliest orthogonalization methods for solving least squares problems used Gram–Schmidt
orthogonalization, where an orthonormal matrix Q ∈ R^{m×n} is formed explicitly as a linear
combination of the columns of A. Gram–Schmidt orthogonalization is a well-known standard
topic in elementary linear algebra textbooks. If used correctly, this has excellent stability proper-
ties. Unfortunately, incorrect descriptions are still found in textbooks and elsewhere, particularly
in statistics.

Gram–Schmidt orthogonalization is a process that, from a linearly independent sequence {x_n} of members of a finite or infinite inner-product space S, forms an orthogonal sequence {q_n} as

    q_1 = x_1,   q_n = x_n − Σ_{k=1}^{n−1} ( ⟨q_k, x_n⟩ / ∥q_k∥^2 ) q_k,        (2.2.49)

where ⟨·,·⟩ denotes the inner product. By construction, ⟨q_j, q_k⟩ = 0, j ≠ k, and span(q_1, . . . , q_k) = span(x_1, . . . , x_k), k ≥ 1. Replacing each q_n by q_n/∥q_n∥ gives an orthonormal sequence.
Having an orthogonal basis for this nested sequence of subspaces simplifies many operations.
Given A = (a_1, a_2, . . . , a_n) ∈ R^{m×n} with linearly independent columns, the Gram–Schmidt process computes a matrix factorization

                                     ( r_11   r_12   . . .   r_1n )
    A = QR = (q_1, q_2, . . . , q_n) (        r_22   . . .   r_2n )
                                     (                 ⋱      ⋮   )        (2.2.50)
                                     (                       r_nn )

such that Q = (q_1, q_2, . . . , q_n) ∈ R^{m×n} is orthonormal and R ∈ R^{n×n} is upper triangular. The
difference between Gram–Schmidt and Householder QR factorizations has been aptly formu-
lated by Trefethen and Bau [1068, 1997, p. 70]: Gram–Schmidt is triangular orthogonalization,
whereas Householder is orthogonal triangularization.
The Classical Gram–Schmidt (CGS) algorithm applied to A = (a_1, . . . , a_n) proceeds in n steps, k = 1, . . . , n. In step k the vector a_k is orthogonalized against Q_{k−1} = (q_1, . . . , q_{k−1}), giving

    â_k = (I − Q_{k−1}Q_{k−1}^T)a_k = a_k − Q_{k−1} r_k,   r_k = Q_{k−1}^T a_k.        (2.2.51)

Then q_k is obtained by normalizing â_k: r_kk = ∥â_k∥_2, q_k = â_k/r_kk. Note that if a_1, . . . , a_k are linearly independent, then r_kk > 0.

Algorithm 2.2.7 (Classical Gram–Schmidt).


function [Q,R] = cgs(A);
% CGS computes the thin QR factorization of A
% by Classical Gram--Schmidt
% ---------------------------------------
[m,n] = size(A);
Q = zeros(m,n); R = zeros(n);
for k = 1:n
qk = A(:,k);
if k > 1
R(1:k-1,k) = Q(:,1:k-1)'*qk;
qk = qk - Q(:,1:k-1)*R(1:k-1,k);
end
R(k,k) = norm(qk);
Q(:,k) = qk/R(k,k);
end
end

In step k of CGS the first k − 1 elements of the kth column of R are computed. CGS is
therefore called a columnwise or left-looking algorithm. The main work in CGS is performed
in two matrix-vector products. By omitting the normalization of â_k, a square-root-free CGS algorithm can be obtained. This gives a factorization A = Q̂R̂, where R̂ is unit upper triangular and Q̂^T Q̂ = D, a positive diagonal matrix. The CGS algorithm requires approximately 2mn^2 flops. This is 2n^3/3 flops more than required for Householder QR.
The modified Gram–Schmidt (MGS) algorithm is a slightly different way to carry out Gram–Schmidt orthogonalization. As soon as a new column q_k has been computed, all remaining columns are orthogonalized against it. This determines the kth row of R. Hence MGS is a row-oriented or right-looking algorithm. At the start of step k the matrix A = A^(1) has been transformed into

    ( Q_{k−1}   A^(k) ) = ( q_1, . . . , q_{k−1}, a_k^(k), . . . , a_n^(k) ),

where the columns of A^(k) are orthogonal to Q_{k−1}. Normalizing the vector a_k^(k) gives

    r_kk = ∥a_k^(k)∥_2,   q_k = a_k^(k)/r_kk.        (2.2.52)

Next a_j^(k), j = k + 1, . . . , n, is orthogonalized against q_k:

    a_j^(k+1) = (I − q_k q_k^T)a_j^(k) = a_j^(k) − r_kj q_k,   r_kj = q_k^T a_j^(k).        (2.2.53)

This determines the remaining elements in the kth row of R.

Algorithm 2.2.8 (Modified Gram–Schmidt).


function [Q,R] = mgs(A);
% Computes the thin QR factorization of A
% by rowwise modified Gram--Schmidt.
% ----------------------------------------
[m,n] = size(A);
Q = A; R = zeros(n);
for k = 1:n
R(k,k) = norm(Q(:,k));
Q(:,k) = Q(:,k)/R(k,k);
qk = Q(:,k);
if k < n
R(k,k+1:n) = qk'*Q(:,k+1:n);
Q(:,k+1:n) = Q(:,k+1:n) - qk*R(k,k+1:n);
end
end

Note that in MGS the orthogonalization of a_k is carried out as the product

    a_k^(k) = (I − q_{k−1} q_{k−1}^T) · · · (I − q_1 q_1^T) a_k.        (2.2.54)

This can be compared to the expression â_k = (I − Q_{k−1}Q_{k−1}^T)a_k used in CGS. If q_1, . . . , q_{k−1} are exactly orthonormal, these two expressions are equivalent. However, this will not be the case in finite-precision arithmetic, and MGS is not numerically equivalent to CGS for n > 2.
As described above, MGS is a rowwise or right-looking algorithm. In applications where
the columns of A are generated one at a time, or when MGS is applied to an additional right-hand
side b of a least squares problem, a columnwise (or left-looking) version of MGS must be used.
In columnwise MGS the same arithmetic operations are performed as in rowwise MGS; only the temporal sequence of the operations is different. The rounding errors are therefore the same, and
the two versions of MGS produce exactly the same numerical results. The difference between

CGS and MGS is subtle and has often gone unnoticed or been misunderstood. Wilkinson [1122,
1971, p. 559] writes “I used the modified process for many years without even explicitly noticing
that I was not performing the classical algorithm.” Columnwise versions of MGS have been used
by Schwarz, Rutishauser, and Stiefel [978, 1968]; see also Gander [437, 1980] and Longley [757,
1981].

Algorithm 2.2.9 (Columnwise MGS).


function [Q,R] = mgsc(A);
% Computes the thin QR factorization of A
% by columnwise modified Gram--Schmidt
% ---------------------------------------
[m,n] = size(A);
Q = zeros(m,n); R = zeros(n);
for k = 1:n
qk = A(:,k);
for i = 1:k-1
R(i,k) = Q(:,i)'*qk;
qk = qk - Q(:,i)*R(i,k);
end
R(k,k) = norm(qk);
Q(:,k) = qk/R(k,k);
end
end

In CGS and rowwise MGS algorithms virtually all operations can be implemented as matrix-
vector operations. These versions can be made to execute more efficiently than columnwise
MGS, which uses vector operations. They also offer more scope for parallel implementations.
These aspects are important in deciding which variant should be used in a particular application.
Rice [926, 1966] was the first to establish the superior stability properties of MGS. Both CGS
and MGS can be shown to accurately reproduce A,
A + E1 = Q̄R̄, ∥E1 ∥2 < c1 u∥A∥2 , (2.2.55)
where c1 = c1 (m, n) is a small constant.
Rounding errors will occur when the orthogonal projections onto previous vectors qi are
subtracted. These errors will propagate to later stages of the algorithm and cause a (sometimes
severe) loss of orthogonality in the computed Q. However, as shown by Björck [125, 1967], for
MGS the loss of orthogonality can be bounded by a factor proportional to κ(A).

Theorem 2.2.9. Let Q̄ denote the orthogonal factor computed by the MGS algorithm. Then,
provided that c2 κ2 (A)u < 1, there is a constant c2 = c2 (m, n) such that
    ∥I − Q̄^T Q̄∥_2 ≤ c_2 κ_2(A) u / (1 − c_2 κ_2(A) u).        (2.2.56)

The loss of orthogonality in CGS can be much more severe. Gander [437, 1980] points out that even Cholesky QR factorization often gives better orthogonality than CGS. For the standard version of CGS, not even a bound proportional to κ(A)^2 holds unless a slightly altered “Pythagorean variant” of CGS is used, in which the diagonal entry r_kk is computed as

    r_kk = (s_k^2 − p_k^2)^{1/2} = (s_k − p_k)^{1/2}(s_k + p_k)^{1/2},        (2.2.57)

where s_k = ∥a_k∥_2 and p_k = (r_{1k}^2 + · · · + r_{k−1,k}^2)^{1/2}. For this variant, Smoktunowicz, Barlow, and Langou [1006, 2006] were able to prove the upper bound

    ∥I − Q̄_1^T Q̄_1∥_2 ≤ c_2(m, n) κ_2(A)^2 u.        (2.2.58)

Example 2.2.10. To illustrate the difference in loss of orthogonality of MGS and CGS, we use
a matrix A ∈ R^{50×10} with singular values σ_i = 10^{1−i}, i = 1 : 10, generated by computing

    A = U D V^T,   D = diag(1, 10^{−1}, . . . , 10^{−9}).

Here U and V are random orthonormal matrices from the Haar distribution generated by an
algorithm of Stewart [1019, 1980] that uses products of Householder matrices with randomly
chosen Householder vectors. Table 2.2.1 shows κ(Ak ), Ak = (a1 , . . . , ak ), and the loss of
orthogonality in CGS and MGS as measured by ∥Ik − QTk Qk ∥2 for k = 1, . . . , 10. As expected,
the loss of orthogonality in the computed factor Q for MGS is proportional to κ(Ak ). For CGS
the loss is much worse. CGS is therefore often used with reorthogonalization; see Section 2.2.7.

Table 2.2.1. Condition number and loss of orthogonality in CGS and MGS.

    k     κ(A_k)       ∥I_k − Q_C^T Q_C∥_2     ∥I_k − Q_M^T Q_M∥_2


1 1.000e+00 1.110e-16 1.110e-16
2 1.335e+01 2.880e-16 2.880e-16
3 1.676e+02 7.295e-15 8.108e-15
4 1.126e+03 2.835e-13 4.411e-14
5 4.853e+05 1.973e-09 2.911e-11
6 5.070e+05 5.951e-08 3.087e-11
7 1.713e+06 2.002e-07 1.084e-10
8 1.158e+07 1.682e-04 6.367e-10
9 1.013e+08 3.330e-02 8.779e-09
10 1.000e+09 5.446e-01 4.563e-08
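
An experiment of this kind is easily set up with the cgs and mgs functions above. In the sketch below (ours, for illustration) the orthonormal factors are generated simply with qr of random matrices rather than with Stewart's Haar algorithm, so the numbers will differ slightly from those in Table 2.2.1.

% Compare loss of orthogonality in CGS and MGS on a matrix with
% geometrically graded singular values. Illustrative sketch only.
m = 50; n = 10;
[U,~] = qr(randn(m,n),0);  [V,~] = qr(randn(n));
A = U*diag(10.^(0:-1:1-n))*V';
[Qc,~] = cgs(A);  [Qm,~] = mgs(A);
for k = 1:n
    fprintf('%2d  %9.2e  %9.2e  %9.2e\n', k, cond(A(:,1:k)), ...
        norm(eye(k)-Qc(:,1:k)'*Qc(:,1:k)), norm(eye(k)-Qm(:,1:k)'*Qm(:,1:k)));
end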

It is important to note that all Gram–Schmidt algorithms are invariant under column scalings. Let D > 0 be a diagonal matrix. Then the Gram–Schmidt algorithms applied to the scaled matrix Ã = AD will yield the factors Q̃ = Q and R̃ = RD. This is true also in floating-point arithmetic, provided that the entries of D are powers of two so that the scaling is done without error. From the invariance under column scaling it follows that κ_2(A) in (2.2.56) can be replaced by

    κ̃_2 = min_{D∈D} κ_2(AD),        (2.2.59)

where D is the set of all positive diagonal matrices. Scaling A so that all column norms in A are
equal will approximately minimize κ2 (AD); see Theorem 2.1.5.

Notes and references

The history of Gram–Schmidt orthogonalization is surveyed by Leon, Björck, and Gander [733,
2013]. What is now called the “classical” Gram–Schmidt (CGS) algorithm appeared in Schmidt
[971, 1907], [972, 1908] in the context of solving linear systems with infinitely many unknowns.
Schmidt remarked that his formulas were similar to those given earlier by J. P. Gram [525, 1883]

in a paper on series expansions of real functions using least squares. Gram used the “modi-
fied” Gram–Schmidt (MGS) algorithm for orthogonalizing a sequence of functions and applied
the results to applications involving integral equations. Gram was influenced by the work of
Chebyshev, and his original orthogonalization procedure was applied to orthogonal polynomials.
The earliest linkage of the names Gram and Schmidt to describe the orthonormalization
process appears to be in a paper by Wong [1130, 1935]. A process similar to MGS had already
been used by Laplace [721, 1816] for solving a least squares problem; see Farebrother [396,
1988] and Langou [720, 2009]. However, Laplace seems not to have recognized the crucial role
of orthogonality. Bienaymé [118, 1853] developed a similar process related to an interpolation
algorithm of Cauchy [212, 1837] that forms the basis of Thiele’s theory of linear estimation.

2.2.5 MGS as a Householder Method


The results in Theorem 2.2.9 on the stability of MGS are quite satisfactory but fall short of
proving backward stability. A surprising observation due to Charles Sheffield is that the MGS
algorithm can be interpreted as Householder QR factorization applied to the matrix A ∈ Rm×n
augmented with a square matrix of zero elements on top. This is not only true in theory but also
holds numerically in the presence of rounding errors. We now outline this relationship in more
detail. We denote the augmented QR factorization by
      
    Ã ≡ ( O_n )   = P ( R̃ )   = ( P_11   P_12 ) ( R̃ )
        (  A  )       ( 0 )     ( P_21   P_22 ) ( 0 ),        (2.2.60)

where P ∈ R^{(n+m)×(n+m)} and P_11 ∈ R^{n×n}. Recall that the Householder transformation Pa = e_1 ρ uses

    P = I − 2vv^T/∥v∥_2^2,   v = a − e_1 ρ,   ρ = ±∥a∥_2

(e_k is the kth column of the unit matrix). If (2.2.60) is obtained using Householder transformations, then

    P^T = P_n · · · P_2 P_1,   P_k = I − 2v̂_k v̂_k^T/∥v̂_k∥_2^2,   k = 1 : n,        (2.2.61)

(1)
where the vectors v̂k are as described below. From MGS applied to A(1) = A, r11 = ∥a1 ∥2
(1)
and a1 = q1′ = q1 r11 . Thus, for the first Householder transformation applied to the augmented
matrix,
   
e(1) ≡ On (1) 0
A , a1 = (1) ,
A(1) a1
e
   
(1) −e1 r11 −e1
v̂1 ≡ = r11 v1 , v1 =
q1′ q1

(and since there can be no cancellation, we take rkk ≥ 0). But ∥v1 ∥22 = 2, giving

P1 = I − 2v̂1 v̂1T /∥v̂1 ∥22 = I − 2v1 v1T /∥v1 ∥22 = I − v1 v1T

and
     
(1) (1) (1) 0 −e1 (1) e1 r1j
P1 e
aj = aj − v1 v1T e
aj = (1) − q1T aj = (2) ,
aj q1
e
aj
2.2. Orthogonalization Methods 65

so
r11 r12 ··· r1n
 
 0 0 ··· 0 
 . .. .. .. 
e(1)
P1 A = .
 . . . . ,
 0 0 ··· 0 
(2) (2)
0 a2 ··· an

where these values are clearly numerically the same as in the first step of MGS on A. The next
(3) (3)
Householder transformation produces the second row of R and a3 , . . . , an , just as in MGS.
We have
   
T 0 R eT = H en · · · H
Q
e = , Q e2He1, (2.2.62)
A 0

where H e k = I − 2v̂k v̂ T /∥v̂k ∥2 , k = 1 : n, are Householder reflectors. Because of the special


k 2
structure of the augmented matrix, the Householder vectors vk have the form
 
−rkk ek
v̂k = , rkk = ∥q̂k ∥2 ,
q̂k

where ek denotes the kth unit vector, and the sign is chosen so that R has a positive diagonal.
With qk = q̂k /rkk it follows that
 
e k = I − vk vkT , −ek
H vk = , ∥vk ∥22 = 2. (2.2.63)
qk

Initially the first n rows are empty. Hence, the scalar products of vk with later columns will only
involve qk , and as is easily verified, the quantities rkj and qk are numerically the same as in the
MGS method. It follows that Householder QR is numerically equivalent to MGS applied to A.
From the backward stability of Householder QR we have the following result.

Theorem 2.2.11. There exists an exactly orthonormal matrix Q̂1 ∈ Rm×n such that for the
computed matrix R̄ in MGS it holds that

A + E = Q̂1 R̄, ∥E∥


e 2 ≤ c3 u∥A∥2 , (2.2.64)

where c3 ≡ c2 (m, n) is a modest constant depending on m and n.

A consequence of this result is that the factor R̄ computed by MGS is as good as the triangular
factor obtained by using Householder or Givens QR. The result can be sharpened to show that R̄
is the exact triangular QR factor of a matrix near A in the columnwise sense; see Higham [623,
2002, Theorem 19.13].
For a matrix Q1 = (q1 , . . . , qn ) with any sequence q1 , . . . , qn of unit 2-norm vectors, the
matrix P = P1 P2 · · · Pn , with

Pk = I − vk vkT , vkT = (−eTk , qkT ),

has a very special structure. The following result holds without recourse to the MGS connection.
As shown by Paige [862, 2009] it can be used to simplify the error analysis of several other
algorithms.
66 Chapter 2. Basic Numerical Methods

Theorem 2.2.12. Let Q1 = (q1 , . . . , qn ) ∈ Rm×n , ∥qk ∥2 = 1, k = 1 : n, and define


 
−ek
Mk = I − qk qkT , Pk = I − vk vkT , vk = ∈ Rm+n . (2.2.65)
qk

Then
 
P11 (I − P11 )QT1
P = P1 P2 · · · P n = (2.2.66)
Q1 (I − P11 ) I − Q1 (I − P11 )QT1

q1T q2 q1T M2 q3 q1T M2 M3 · · · Mn−1 qn q1T M2 M3 · · · Mn


 
0 ···

 0 0 q2T q3 ··· q2T M3 M4 · · · Mn−1 qn q2T M3 M4 · · · Mn 

 .. .. .. .. .. 
=
 . . . ··· . . .

T T
 0
 0 0 ··· qn−1 qn qn−1 Mn 

 0 0 0 ··· 0 qnT 
q1 M1 q 2 M1 M2 q3 ··· M1 M2 · · · Mn−1 qn M1 M2 · · · Mn

The matrix P is orthogonal and depends only on Q1 and the strictly upper triangular matrix
P11 . P11 = 0 if and only if QT1 Q1 is diagonal, and then
 
0 QT1
P = . (2.2.67)
Q1 I − Q1 QT1

Proof. See Björck and Paige [149, 1992, Theorem 4.1].

2.2.6 Least Squares Problems by MGS


Let A = Q1 R be the MGS QR factorization of A ∈ Rm×n , rank(A) = n. We first remark that to
solve the least squares problem minx ∥Ax − b∥2 by forming z = QT1 b and then solving Rx = z
is not a stable way to proceed. Unfortunately, this method is still found in some textbooks. A
backward stable algorithm for computing x is obtained by instead treating the right-hand side b
as an extra column of A and applying MGS to the extended matrix
 
R z
( A b ) = ( Q1 qn+1 ) . (2.2.68)
0 ρ

We can then express the residual as


    
−x R z −x
r = b − Ax = ( A b) = ( Q1 qn+1 )
1 0 ρ 1

= Q1 (z − Rx) + ρqn+1 , ∥qn+1 ∥2 = 1.

If qn+1 ̸= 0 and is orthogonal to Q1 , it follows that ∥Ax − b∥2 is minimized when

Rx = ρqn+1 , s = ∥r∥2 = ρ. (2.2.69)

No assumption about the orthogonality of Q1 is needed for this to be true. However, if ρ ≪ ∥b∥2 ,
then qn+1 fails to be accurately orthogonal to R(A). A backward stable r is obtained by adding
a reorthogonalization step, where the computed r orthogonalized against qn , . . . , q2 , q1 in this
order. The proof of backward stability of this algorithm for computing x and r is by no means
2.2. Orthogonalization Methods 67

obvious. It follows by noting that it is numerically equivalent to solving


   
0 0
min x− (2.2.70)
x A b 2

by Householder QR. Applying the Householder transformations to the right-hand side in (2.2.70)
gives    
d 0
= Hn · · · H1 .
e b
An implementation is given below.

Algorithm 2.2.10 (Linear Least Squares by MGS).

function [x,r] = mgslsq(A,b);


% Computes x and r = b - Ax for the least
% squares problem min_x||b - Ax||_2
% ---------------------------------------
[m,n] = size(A);
[Q,R] = mgs(A);
d = zeros(n,1);
for k = 1:n
d(k) = Q(:,k)'*b;
b = b - d(k)*Q(:,k);
end
x = R\d; r = b;
for k = n:-1:1
r = r - (Q(:,k)'*r)*Q(:,k);
end

The equivalence between MGS and Householder QR factorization can also be used to obtain
a backward stable algorithm for computing the minimum norm solution of an underdetermined
linear system,
min ∥y∥2 subject to AT y = c, (2.2.71)
where A ∈ Rm×n , rank(A) = n; see Björck [134, 1994]. Consider now using Householder QR
factorization to solve the equivalent least-norm problem
   
w T w
min subject to ( 0 A ) = c. (2.2.72)
y 2
y

If we solve RT z = c for z = (ζ1 , . . . , ζn )T , then y is obtained from


   
w z
= H1 · · · Hn . (2.2.73)
y 0

From the special form of the matrices Hk , this leads to the following algorithm: Set y (n) = 0
and
y (k−1) = y (k) − (ωk − ζk )qk , ωk = qkT y (k) , k = n, . . . , 1. (2.2.74)
Then the least-norm solution is y = y (0) . The quantities ωk compensate for the lack of orthog-
onality of Q1 . If Q1 is exactly orthogonal, they are zero.
68 Chapter 2. Basic Numerical Methods

Algorithm 2.2.11 (Least-Norm Solution by MGS).


function y = mgsmnr(A,c);
% Computes the least-norm solution y
% of the linear system A'*y = c
% ---------------------------------------
[m,n] = size(A);
[Q,R] = mgs(A);
z = R'\c;
y = zeros(n,1);
for k = n:-1:1
omega = Q(:,k)'*y;
y = y + (z(k) - omega)*Q(:,k);
end
end

Algorithms 2.2.10 and 2.2.11 are columnwise backward stable in the same sense as the cor-
responding Householder QR factorizations; see Theorem 2.2.6 and Theorem 2.2.7, respectively.
The backward stable algorithm (2.2.47)–(2.2.48) using Householder QR factorization for
solving the augmented system
    
I A y b
= , (2.2.75)
AT 0 x c

where A ∈ Rm×n , with rank(A) = n was given in Section 2.2.3. The interpretation of MGS
as a Householder method shows the strong backward stability property of the following MGS
algorithm for solving augmented systems; see Björck and Paige [150, 1994].

Algorithm 2.2.12 (Augmented System Solution by MGS).


Compute MGS factorization A = Q1 R ∈ Rm×n , where Q1 = (q1 , . . . , qn ).
1. Solve RT z = c for z = (ζ1 , . . . , ζn ).
2. Set b(1) = b and compute d = (δ1 , . . . , δn ), as

δk = qkT b(k) , b(k+1) = b(k) − δk qk k = 1 : n.

3. Set y (n) = b(n+1) and compute y = y (0) by

ωk = qkT y (k) , y (k−1) = y (k) − (ωk − ζk )qk , k = n : (−1) : 1.

4. Solve Rx = d − z for x.

Algorithm 2.2.12 requires 8mn + 2n2 flops and generalizes the previous two algorithms.
It is easily verified that if c = 0, it reduces to Algorithm 2.2.10, and if b = 0, it reduces to
Algorithm 2.2.11. The stability of the MGS algorithm for solving augmented systems is analyzed
by Björck and Paige [150, 1994].

2.2.7 Reorthogonalization
As shown in Section 2.2.4, the loss of orthogonality in the computed Q1 = (q1 , . . . , qn ) as
measured by ∥I − QT1 Q1 ∥2 is proportional to κ(A) for MGS and to κ(A)2 for a variant of CGS.
2.2. Orthogonalization Methods 69

In many applications it is essential that the computed vectors be orthogonal to working accuracy.
In the orthogonal basis problem A is given, and we want to find Q1 and R such that

∥I − QT1 Q1 ∥ ≤ c1 (m, n)u, ∥A − Q1 R∥ ≤ c2 (m, n)u, (2.2.76)

for modest constants c1 (m, n) and c2 (m, n). One important application of reorthogonalization
is subspace projection methods for solving eigenvalue problems.
To study the loss of orthogonalization in an elementary orthogonalization step, let A =
(a1 , a2 ) ∈ Rm×n be two given linearly independent unit vectors. Let q1 = a1 and q2′ =
a2 − r12 q1 , r12 = q1T a2 , be the exact results. The corresponding quantities in floating-point
arithmetic are
r12 = f l(q1T a2 ), q ′2 = f l(a2 − f l(r12 q1 )).
The errors can be bounded by (see Section 1.4.2)

|r12 − r12 | < mu, ∥q ′2 − q2′ ∥2 < (m + 2)u|r12 | < (m + 2)u.

It follows that |q1T q̄2′ | = |q1T (q ′2 − q2′ )| < (m + 2)u, giving

|q1T q ′2 | ∥q ′2 ∥2 < γm+2 /r22 , r̄22 = ∥q ′2 ∥2 .



(2.2.77)

(The errors in the normalization are negligible.) This shows that loss of orthogonality results
when cancellation occurs in the computation of q ′2 . This is the case when r22 = sin(ϕ) is small,
where ϕ is the angle between a1 and a2 . Then the orthogonalization can be repeated:

δr12 = f l(q1T q ′2 ), q ′2 := f l(q ′2 − f l(r12 q1 )).

Often such a reorthogonalization is carried out whenever

r22 = ∥q ′2 ∥2 ≤ α∥a2 ∥2 (2.2.78)



for some parameter α, typically chosen in the range 0.1 ≤ α ≤ 1/ 2. If cancellation occurs
again, the reorthogonalization is repeated. In unpublished notes, Kahan showed that provided A
has full numerical rank, “twice is enough,” i.e., two reorthogonalizations always suffice. This
result is made more precise by Parlett [884, 1998], who showed that

∥q̂2′ − q2′ ∥2 ≤ (1 + α)u∥a2 ∥2 , |q1T q̂2′ | ≤ uα−1 ∥q̂2′ ∥2 . (2.2.79)

Hence for α = 0.5 the computed vector q̂2′ is orthogonal to machine precision. For smaller values
of α, reorthogonalization will occur less frequently, and then the bound (2.2.79) on orthogonality
is less satisfactory.
For A = (a1 , . . . , an ), n > 2, selective reorthogonalization is used in a similar way.
In step k, k = 2, . . . , n, CGS or MGS is applied to make ak orthogonal to an orthonor-
mal Q1 = (q1 , . . . , qk−1 ), giving a computed vector q̄k′ . The vector q̄k′ is accepted provided
r̄kk = ∥q̄k′ ∥2 > α∥ak ∥2 . Otherwise, q̄k′ is reorthogonalized against Q1 . Rutishauser [951, 1970]
performs reorthogonalization when at least one decimal digit of accuracy has been lost due to
cancellation. This corresponds to selective reorthogonalization with α = 0.1. Hoffmann [637,
1989] reports extensive numerical tests with iterated reorthogonalization for CGS and MGS for
a range of values of α = 1/2, 0.1, . . . , 10−10 . The tests show that α = 0.5 makes Q1 orthogonal
to full working precision after one reorthogonalization. Moreover, with α = 0.5, CGS performs
as well as MGS. √
Daniel et al. [285, 1976] recommend using α = 1/ 2. Under certain technical assumptions,
they show that provided A has full numerical rank, iterated reorthogonalization converges to
70 Chapter 2. Basic Numerical Methods

a sufficient level of orthogonality. If failure occurs in step k, one option is to not generate a
new vector qk in this step, set rkk = 0, and proceed to the next column. This will generate a
QR factorization where, after a suitable permutation of columns, Q is m × (n − p) and R is
(n − p) × n upper trapezoidal with nonzero diagonal entries. This factorization can be used to
compute the pseudoinverse solution to a least squares problem.
If full orthogonality is desired, the simplest option is to always perform one reorthogonal-
(0)
ization, i.e., the column vectors ak = ak , k ≥ 2, in A are orthogonalized twice against the
computed basis vectors Qk−1 = (q1 , . . . , qk−1 ):

(i) (i−1) (i−1) (i−1)


ak = (I − Qk−1 QTk−1 )ak = ak − Qk−1 (QTk−1 ak ), i = 1, 2.

(2) (2)
The new basis vector is then given as qk = ak /∥ak ∥2 .

Algorithm 2.2.13 (CGS2).

function [Q,R] = cgs2(A);


% CGS2 computes the compact QR factorization of A
% using CGS with one step of reorthogonalization.
% -------------------------------------------------
[m,n] = size(A);
Q = A; R = zeros(n);
R(1,1) = norm(Q(:,1));
Q(:,1) = Q(:,1)/R(1,1);
for k = 2:n
for i = 1:2
V = Q(:,1:k-1)'*Q(:,k);
Q(:,k) = Q(:,k) - Q(:,1:k-1)*V;
R(1:k-1,k) = R(1:k-1,k) + V;
end
R(k,k) = norm(Q(:,k));
Q(:,k) = Q(:,k)/R(k,k);
end
end

The corrections to the elements in R are in general small and may be omitted. However,
Gander [437, 1980] has shown that including them will give a slightly lower error in the com-
puted residual A − QR. A scheme MGS2 similar to CGS2 can be employed for the columnwise
MGS algorithm. This has the same operation count as CGS2, and both produce basis vectors
with orthogonality close to unit roundoff level. For MGS2 the inner loop is a vector operation,
whereas in CGS2 it is a matrix-vector operation. Hence MGS2 executes slower than CGS2,
which therefore usually is the preferred choice.
Giraud and Langou [479, 2002] analyze a different version of MGS2. Let the initial factor-
ization A = Q1 R be computed by rowwise MGS. MGS is then applied a second time to the
computed Q1 to give Q1 = Q e 1 R.
e Combining this and the first factorizations yields the corrected
factorization
A=Q e 1 R,
b R b = RR.
e

This algorithm can be proved to work under weaker assumptions than those for CGS2. From the
analysis of MGS by Björck [125, 1967] and Björck and Paige [149, 1992], Giraud and Langou
get the following result.
2.3. Rank-Deficient Least Squares Problems 71

Lemma 2.2.13. Assume that A ∈ Rm×n , n ≤ m, satisfies cuκ(A) ≤ 0.1, where

c = 18.53n3/2 , 2.12(m + 1)u ≤ 0.01.

Then for the factorization A = Q1 R computed by MGS, it holds that κ(Q1 ) ≤ 1.3.

From this lemma it can be deduced that Q1 satisfies


e T1 Q
∥I − Q e 1 )u < 40.52un3/2 .
e 1 ∥2 ≤ 1.71cκ(Q

Hence Q
e 1 is orthonormal to machine precision.

Notes and references


Roundoff error analyses for CGS with reorthogonalization are given by Abdelmalek [4, 1971]
and Kiełbasiński [692, 1974]. Giraud, Gratton, and Langou [478, 2004] propose an a posteriori
reorthogonalization technique for MGS based on a rank-k update of the computed vectors. The
level of orthogonality of the set of vectors gets better when k increases and eventually reaches
machine-precision level.
Ruhe [940, 1983] considers iterated reorthogonalization of ak against vectors Q1 =
(q1 , . . . , qk−1 ) that are not accurately orthogonal. He shows that this gives qk = ak − Q1 rk ,
where rk satisfies the least squares problem

QT1 Q1 rk = QT1 ak .

Iterated CGS corresponds to the Jacobi, and iterated MGS corresponds to the Gauss–Seidel iter-
ative method for solving this system; see Section 6.1.4.

2.3 Rank-Deficient Least Squares Problems


2.3.1 Semidefinite Cholesky Factorization
If the columns of A ∈ Rm×n are linearly dependent, rank(A) = r < n, and the matrix of
normal equations C = ATA is positive semidefinite. In this case the Cholesky factor R must
have n − r zero diagonal elements. By using symmetric pivoting in the factorization, these zero
elements can be made to appear last.

Theorem 2.3.1. Let C = ATA ∈ Rn×n be a symmetric positive semidefinite matrix of rank
r < n. Then there is a permutation P such that P T CP has a unique Cholesky factorization of
the form  
R11 R12
P T ATAP = RTR, R = , (2.3.1)
0 0
where R11 ∈ Rr×r is upper triangular with positive diagonal elements.

Proof. The proof is constructive. The algorithm takes C (1) = ATA and computes a sequence of
matrices  
(k) 0 0
C (k) = (cij ) = , k = 1, 2, . . . .
0 S (k)
At the start of step k we select the maximum diagonal element of C (k) ,
(k)
sp = max cii ,
k≤i≤n
72 Chapter 2. Basic Numerical Methods

and interchange rows and columns p and k to bring this into pivot position. This pivot must be
positive for k < r, because otherwise S (k) = 0, which implies that rank(C) < r. Next, the
elements in the permuted C (k) are transformed according to
q
(k) (k)
rkk = ckk , rkj = ckj /rkk , j = k + 1 : n,
(k+1) (k) T
cij = cij − rki rkj , i, j = k + 1 : n.

This is equivalent to subtracting a symmetric rank-one matrix rj rjT from C (k) , where rj = eTj R
is the jth row of R. The algorithm stops when k = r + 1. Then all diagonal elements are zero,
which implies that C (r+1) = 0.

Since all reduced matrices C (k) are symmetric positive semidefinite, their maximum ele-
ments lie on the diagonal. Hence, the pivot selection in the outer product Cholesky algorithm
described above is equivalent to complete pivoting. The algorithm produces a matrix R whose
diagonal elements in R form a nonincreasing sequence r11 ≥ r22 ≥ · · · ≥ rnn . Indeed, the
stronger inequalities
j
X
2 2
rkk ≥ rij , j = k + 1, . . . , n, k = 1 : r, (2.3.2)
i=k

are true; see Section 2.3.3.


Rounding errors can cause negative elements to appear on the diagonal in the Cholesky al-
gorithm even when C is positive semidefinite. Similarly, the computed reduced matrix will in
general be nonzero after r steps even when rank(C) = r. This raises the question of when to
terminate the Cholesky factorization of a semidefinite matrix. One possibility is to stop when
(k)
max cii ≤ 0
k≤i≤n

and set rank(C) = k − 1. But this may cause unnecessary work in eliminating negligible
elements. Taking computational cost into consideration, we recommend the stopping criterion
(k) 2
max aii ≤ cn u r11 , (2.3.3)
k≤i≤n

where cn is a modest constant; see also Higham [623, 2002, Sect. 10.3.2]. Perturbation theory
and error analysis for the Cholesky decomposition of semidefinite matrices are developed by
Higham [617, 1990].
In the rank-deficient case, the permuted normal equations become

RTRe
x = d,
e x = Px
e, de = P T (AT b).

With z = Re
x, we obtain    
T
R11 de1
RT z = T z= ,
R12 de2
where R11 ∈ Rr×r is nonsingular. The triangular system R11
T
z = de1 determines z ∈ Rr . From

e1 = z − R12 x
R11 x e2 ,
T
where xe = (x eT1 x eT2 ) , we can determine xe1 for an arbitrarily chosen x
e2 . This expresses the
fact that a consistent singular system has an infinite number of solutions. Finally, the permuta-
tions are undone to obtain x = P x e.
2.3. Rank-Deficient Least Squares Problems 73

Setting x
e2 = 0 we get a basic solution xb with only r = rank(A) nonzero components in
x, corresponding to the first r columns in AP . This is relevant when a good least squares fit of
b using as few variables as possible is desired. The pseudoinverse solution x† that minimizes
∥x∥2 = ∥e x∥2 is obtained from the full-rank least squares problem
   
S xb −1
min x2 − , S = R11 R12 . (2.3.4)
x2 −In−r 0 2

The basic solution xb can be computed in about r2 (n − r) flops. Note that S can overwrite R12 .
Then x2 can be computed from the normal equations,

(S T S + In−r )x2 = S T xb ,

using a Cholesky factorization of (S T S + In−r ). When x2 has been determined, we have x1 =


xb − Sx2 . This method requires about r(n − r)2 + 31 (n − r)3 flops and has been further studied
by Deuflhard and Sautter [319, 1980].

2.3.2 Rank-Deficient QR Factorization


We now show that there is a column permutation P such that in the QR factorization of AP the
zero diagonal elements appear last.

Theorem 2.3.2. Given A ∈ Rm×n with rank(A) = r < n, there is a permutation matrix P and
an orthogonal matrix Q ∈ Rm×n , such that
 
R11 R12 }r
AP = Q , (2.3.5)
0 0 }m − r

where R11 ∈ Rr×r is upper triangular with positive diagonal elements.

Proof. Since rank(A) = r, we can always choose a permutation matrix P such that AP =
( A1 A2 ), where A1 ∈ Rm×r has linearly independent columns. The QR factorization
 
R11
QTA1 = , Q = ( Q1 Q2 )
0

uniquely determines Q1 ∈ Rm×r and R11 ∈ Rr×r with positive diagonal elements. Then
 
T T T R11 R12
Q AP = ( Q A1 Q A2 ) =
0 R22

has rank r. Here R22 = 0, because R cannot have more than r linearly independent rows. Hence
the factorization must have the form (2.3.5).

From (2.3.5) and orthogonal invariance it follows that the least squares problem minx ∥Ax −
b∥2 is equivalent to     
R11 R12 x
e1 d1
min − , (2.3.6)
x 0 0 x
e2 d2 2

where d =QT band x


e = P x are partitioned conformally. The general solution of (2.3.6) is given
x
e1
by x = P , where
x
e2
e1 = d1 − R12 z,
R11 x (2.3.7)
74 Chapter 2. Basic Numerical Methods

and z = x
e2 can be chosen arbitrarily. For z = 0, we obtain a basic least squares solution
 
x
eb −1
x=P , x eb = R11 d1 , (2.3.8)
0

with at most r = rank(A) nonzero components. The general solution is given by


 
eb − Sz
x −1
x=P , S = R11 R12 . (2.3.9)
z

Here S can be computed in about r2 (n − r) flops by solving the matrix equation R11 S = R12
using back-substitution.
A general approach to resolve rank-deficiency is to seek the solution to the least squares
problem
min ∥Bx∥2 , S = {x | min ∥Ax − b∥2 }. (2.3.10)
x∈S x

Here B can be chosen so that ∥Bx∥2 is a measure of the smoothness of x. Substituting the
general solution (2.3.7), we find that (2.3.10) is equivalent to
   
S x
eb
min B z−B . (2.3.11)
z −In−r 0 2

This is a least squares problem of dimension r × (n − r) that can be solved by QR factorization


in about 2r(n − r)2 flops. In particular, taking B = I minimizes

∥x∥22 = ∥Sz∥22 + ∥z∥22

and gives the pseudoinverse solution. It is easily verified that


 
S
N (AP ) = R (2.3.12)
−In−r

is a (nonorthonormal) basis for N (AP ). QR factorization gives an orthonormal basis for N (AP ).
Note that the unique pseudoinverse solution orthogonal to N (AP ) equals the residual of the least
squares problem (2.3.11) with B = I,
   
† xeb S
x
e = − z. (2.3.13)
0 −In−r

Notice that it has the form of the basic solution minus a correction in the nullspace of AP . Any
particular solution can be substituted for z in (2.3.11).

2.3.3 Pivoted QR Factorization


For many applications it is preferable to use a column pivoted QR factorization (QRP)

AP = QR, A ∈ Rm×n , (2.3.14)

in which the pivot column at step k is chosen to maximize the diagonal element rkk . We first
show how to implement this strategy for MGS. Assume that after (k − 1) steps the nonpivotal
columns have been transformed according to
k−1
(k)
X
aj = aj − rij qi , j = k, . . . , n,
i=1
2.3. Rank-Deficient Least Squares Problems 75

(k)
where aj is orthogonal to R(Ak−1 ) = span {q1 , . . . , qk−1 }. Hence in the kth step we should
determine p, so that
(k) 2
∥a(k) 2
p ∥2 = max ∥aj ∥2 , (2.3.15)
k≤j≤n

and interchange columns k and p. This is equivalent to choosing at the kth step a pivot column
with largest distance to the subspace R(Ak−1 ) = span (ac1 , . . . , ack−1 ), where Ak−1 is the
submatrix formed by the columns corresponding to the first k − 1 selected pivots. We note that
for this pivot strategy to be relevant it is essential that the columns of A be well scaled.
Golub [487, 1965] gave an implementation of the same pivoting strategy for Householder
QR. Assume that after k steps of pivoted QR factorization the reduced matrix is
 
R11 R12
, (2.3.16)
0 A(k)
(k) (k)
where R11 ∈ Rk×k is square upper triangular and A(k) = (a1 , . . . , an ). Let p be the smallest
index such that
(k) (k) (k)
s(k)
p ≥ sj , sj = ∥aj ∥22 , j = k + 1, . . . , n,
(k)
where aj are the columns of the submatrix A(k) . Then before the next step, columns k + 1 and
p in A(k) are interchanged. The pivot column maximizes
(k) (k)
sj = min ∥A(k) y − aj ∥22 , j = k, . . . , n. (2.3.17)
y

(k)
The quantities sj can be updated by formulas similar to (2.3.21) used for MGS, but some care
is necessary to avoid numerical cancellation.
With the column pivoting strategy described above, the diagonal elements in R will form
a nonincreasing sequence r11 ≥ r22 ≥ · · · ≥ rrr . It is not difficult to show that, in fact, the
diagonal elements in R satisfy the stronger inequalities
j
X
2 2
rkk ≥ rij , j = k + 1, . . . , n, k = 1 : r. (2.3.18)
i=k

This implies that if rkk = 0, then rij = 0, i, j ≥ k. In particular,

|r11 | = max {|eT1 Re1 | | AP1,j = QR}, (2.3.19)


1≤j≤n

where P1,j is the permutation matrix that interchanges columns 1 and j. Then ∥A∥2F ≤ nr112
,
which yields upper and lower bounds for σ1 (A),

|r11 | ≤ σ1 (A) ≤ n |r11 |. (2.3.20)
(k)
If the column norms ∥aj ∥2 are recomputed at each stage of MGS, this will increase the opera-
tion count of the QR factorization by 50%. Since these quantities are invariant under orthogonal
transformations, this overhead can be reduced to O(mn) operations by using the recursion
(k+1) 2 (k)
∥aj ∥2 = ∥aj ∥22 − rkj
2
, j = k + 1, . . . , n, (2.3.21)

(k+1)
to update these values. To avoid numerical problems, sj should be recomputed from scratch
(k+1) (k) √
whenever there has been substantial cancellation, e.g., when ∥aj ∥2 ≤ ∥aj ∥2 / 2.
76 Chapter 2. Basic Numerical Methods

If a diagonal element rkk in QRP vanishes, it follows from (2.3.18) that rij = 0, i, j ≥ k.
Assume that at an intermediate stage of QRP the new diagonal element satisfies rk+1,k+1 ≤ δ
for some small δ. Then by (2.3.18),

∥A(k) ∥F ≤ (n − k)1/2 δ,

and setting A(k) = 0 corresponds to a perturbation Ek of A, such that A + Ek has rank-k and
∥Ek ∥F ≤ (n − k)1/2 δ. The matrix

 = Q1 ( R11 R12 ) P T , Q = ( Q1 Q2 ) , (2.3.22)

obtained by neglecting R22 , is the best rank-k approximation to A that differs from AP only in
the last n − k columns. In particular, with k = n − 1 we obtain ∥A − Â∥F = rnn .

Example 2.3.3. The following example by Kahan [680, 1966] shows that QR factorization with
standard pivoting can fail to reveal near singularity of a matrix. The matrix

1 −c −c . . . −c
 
 1 −c . . . −c 
n−1 
 .. .. 
An = diag(1, s, . . . , s ) 1 . .  , 0 ≤ c ≤ 1,

 .. 
 . −c 
1

where s2 + c2 = 1, is already in upper triangular form. Because the inequalities (2.3.18) hold,
An is invariant under QR factorization with column pivoting.4 For n = 100 and c = 0.2 the two
smallest singular values are σn = 3.6781 · 10−9 and σn−1 = 0.1482. However, the two smallest
diagonal elements of Rn are rn−1,n−1 = sn−2 = 0.1299 and rnn = sn−1 = 0.1326, and the
near singularity of An is not revealed.

The column pivoting strategy described is independent of the right-hand side b and may not
be the most appropriate for solving a given least squares problem. For example, suppose b is a
multiple of a column in A. With standard pivoting this may not be detected until the full QR
factorization has been computed. An alternative strategy is to select the pivot column in step
k + 1 as the column for which the current residual norm ∥b − Ax(k) ∥2 is maximally reduced. For
MGS this is achieved by choosing as pivot the column ap that makes the smallest acute angle
(k) (k)
with r(k) . Hence, with γj = (aj )T r(k) , the column is chosen to maximize
(k) (k)
(γj )2 /∥aj ∥22 . (2.3.23)

This quantity is important in statistical applications, such as stepwise variable regression.


The criterion rk+1,k+1 ≤ δ is commonly used for terminating the pivoted QR algorithm.
However, it can greatly overestimate the numerical rank of A. Faddeev, Kublanovskaya, and
Faddeeva [391, 1968] proved the inequality

3|rnn |
σn ≥ √ ≥ 21−n |rnn |. (2.3.24)
4n + 6n − 1
This shows σn can be much smaller than |rnn | for moderately large values of n. Example 2.3.3
shows that the bound in (2.3.24) can almost be attained.
4 Due to roundoff, pivoting actually may occur in floating-point arithmetic. This can be avoided by making a small

perturbation to the diagonal; see Chan [224, 1987].


2.3. Rank-Deficient Least Squares Problems 77

Stewart [1023, 1984] shows that better bounds for σn can be found from QR factorization
using so-called reverse column pivoting. This determines the permutation P1,j so that
|rnn | = min {|eTn Ren | | AP1,j = QR}.
1≤j≤n

Then it holds that √


(1/ n) |rnn | ≤ σn (A) ≤ |rnn |. (2.3.25)
Setting rnn = 0 makes R singular. Hence the upper bound (2.3.25) follows from (1.3.21). As
shown by Chandrasekaran and Ipsen [231, 1994]), reverse pivoting on R is equivalent to using
standard pivoting on R−T .

2.3.4 Complete Orthogonal Decompositions


The QR factorization of a rank-deficient matrix A ∈ Rm×n is
 
R11 R12
AP = ( Q1 Q2 ) ,
0 0
where R11 ∈ Rr×r , r < n, is nonsingular. Here Q1 and Q2 give orthogonal bases for R(A)
and N (AT ). This factorization is less useful for applications that need a basis for N (A). The
elements in R12 can be annihilated by postmultiplying R by a sequence of Householder reflectors
e 0 , Hj = I − γj−1 uj uTj ,

( R11 R12 ) Hk · · · H1 = R
j = r, r − 1, . . . , 1, where uj has nonzero elements only in positions j, r + 1, . . . , n. This is
equivalent to a QL factorization of the transpose of the triangular factor R,
 T   
R11 0 ReT 0
T = Q̂ , (2.3.26)
R12 0 0 0
and requires 2r2 (n − r) flops. The first two steps in the reduction are shown below for n = 6,
r = 4.
×  ×  × 
× ×  × ×  × × 
× × × × × × × × ×
     
H4   ⇒ H3   ⇒ H2  ,...,.
  
× × × × × × × × × × × ×
× × × × × × × ⊗ × × ⊗ ⊗
     
× × × × × × × ⊗ × × ⊗ ⊗
This gives a complete orthogonal decomposition of the form
 
Re 0
AP = Q V T , V = H1 · · · Hk . (2.3.27)
0 0
This decomposition, first described by Hanson and Lawson [589, 1969], gives an explicit orthog-
onal basis for the range and nullspace of A, and a representation for the pseudoinverse. It can be
updated efficiently when A is subject to a change of low rank; see Section 3.3.

Theorem 2.3.4. Assume that we have a complete orthogonal decomposition (2.3.27) of A. Then
if V = ( V1 V2 ), P V2 is an orthogonal basis for the nullspace of dimension (n − r) of A.
Furthermore, the pseudoinverse of A is
 −1 
R 0
A† = P V QT , (2.3.28)
0 0
and x† = P V1 R−1 QT1 b is the pseudoinverse solution of the problem minx ∥Ax − b∥2 .
78 Chapter 2. Basic Numerical Methods

Proof. It is immediately verified that AP V2 = 0, and a dimensional argument shows that P V2


spans the nullspace of A. The expression for the pseudoinverse follows by verifying the Penrose
conditions.

For matrices A ∈ Rm×n that are only close to being rank-deficient with r < n, Stewart
[1025, 1992] introduced the URV decomposition. This has the form
 
R11 R12
AP = U V T , R11 ∈ Rr×r , (2.3.29)
0 R22

with U = ( U1 U2 ) and V = ( V1 V2 ) orthogonal. If the singular values of A are

σ1 ≥ σ2 ≥ · · · ≥ σr ≫ σr+1 ≥ · · · ≥ σn ,

then the decomposition (2.3.29) is said to be rank-revealing if


1/2
σk (R11 ) ≥ σr /c, ∥R12 ∥2F + ∥R22 ∥2F ≤ cσr+1 , (2.3.30)

and c is bounded by a low-degree polynomial in r and n. For Π = I, it follows from (2.3.29)


that  
R12
∥AV2 ∥2 = ≤ cσr+1 .
R22 F
Hence V2 is an orthogonal basis for the approximate nullspace of A. The URV decomposition
is useful in applications, such as subspace tracking in signal processing, where it is desirable to
compute an approximate nullspace and also update the basis as rows are added or deleted from A.
The rank-revealing process of Chan starts from a pivoted QR factorization and determines a
vector w such that ∥Rw∥2 is small. Then a sequence of plane rotations is determined such that

QT wn = GTn−1,n · · · GT12 wn = ∥wn ∥2 en .

Next, an orthogonal matrix P such that P T RQ = P T G12 . . . , Gn−1,n is upper triangular is


determined. When Gj−1,j is applied, a nonzero element is introduced just below the diagonal of
R. To restore the triangular from, a left rotation can be used:

↓ ↓ ↓ ↓
     
r r r r → r r r r r r r r
 ⇒ → ⊕
+ r r r  r r r 0 r r r
  ⇒  
0 0 r r 0 0 r r 0 + r r
0 0 0 r 0 0 0 r 0 0 0 r

↓ ↓
     
r r r r r r r r r r r e
→ 0 r r r ⇒
0 r r r 0 r r e
  ⇒  .
→ 0 ⊕ r r 0 0 r r → 0 0 r e
0 0 0 r 0 0 + r → 0 0 ⊕ e
This process requires O(n2 ) multiplications. We now have

P T Rwn = (P T RQ)(QT wn ) = ∥wn ∥2 Re


e n.

As P is orthogonal it follows that if ∥Rwn ∥2 < |rnn |, then ∥Re


e n ∥2 < δ/∥wn ∥2 . This bounds
the norm for the last column of the transformed matrix R.e If |rn−1,n−1 | is small, this deflation
can be continued on the principal submatrix of order n − 1 of R.
e
2.3. Rank-Deficient Least Squares Problems 79

Stewart [1026, 1993] has suggested a refinement process for the URV decomposition (2.3.29),
which reduces the size of the block R12 and increases the accuracy in the nullspace approxima-
tion. It can be viewed as one step of the zero-shift QR algorithm (7.1.19), and can be iterated,
and will converge quickly if there is a large relative gap between the singular values σk and σk+1 .
Alternatively one can work with the corresponding decomposition of lower triangular form, the
rank-revealing ULV decomposition
 
L11 0
A=U V T. (2.3.31)
L21 L22

For this decomposition with the partitioning V = ( V1 V2 ), ∥AV2 ∥2 = ∥L22 ∥F . Hence the
size of ∥L21 ∥F does not adversely affect the nullspace approximation.
Suppose we have a rank-revealing factorization
 
L11 0
AP = Q ,
L21 L22

where L11 and L22 are lower triangular and σk (L11 ) ≥ σk /c, and ∥L22 ∥2 ≤ cσk+1 for some
constant c. (Such a factorization can be obtained from a rank-revealing QR factorization by
reversing the rows and columns of the R-factor.) Then a rank-revealing ULV decomposition can
be obtained by a similar procedure as shown above for the URV decomposition. Suppose we
have a vector w such that ∥wT L∥2 is small. Then, as before, w is first reduced to the unit vector
en :
QT wn = GTn−1,n · · · GT12 wn = ∥wn ∥2 en .
The sequence of plane rotations are then applied to L from the left, and extra rotations from the
right are used to preserve the lower triangular form.

Notes and references


The use of URV and QR factorizations for solving rank-deficient least squares problems is treated
by Foster [428, 2003]. Foster and Kommu [426, 2006] use a truncated pivoted QR factorization,
where the rank of the trailing diagonal block is estimated by a condition estimator. For problems
of low rank it is an order of magnitude faster than the LAPACK routine xGELSY, which is based
on a complete orthogonal decomposition. Symmetric rank-revealing decompositions are studied
by Hansen and Yalamov [586, 2001].

2.3.5 Rank-Revealing QR Factorizations


As seen from examples in the previous section, QR factorization with standard column pivoting
may fail to reveal the rank-deficiency of nearly singular matrices. The following definition makes
precise what would be desirable in this case.

Definition 2.3.5. Let A ∈ Rm×n , m ≥ n, be a given matrix, and let Πk be a permutation. Then
the QR factorization
 
R11 R12 }k
AΠk = QR = Q , 1 ≤ k < n, (2.3.32)
0 R22 }m − k
is said to be a rank-revealing QR (RRQR) factorization if

σk (R11 ) ≥ σk (A)/c, σ1 (R22 ) ≤ c σk+1 (A), (2.3.33)

where c = c(k, n) > 0 is bounded by a low-degree polynomial in k and n.


80 Chapter 2. Basic Numerical Methods

From the interlacing property of the singular values (Theorem 1.3.5) it follows that

σk (A) ≥ σk (R11 ), σk+1 (A) ≤ σ1 (R22 ). (2.3.34)

The permutation Π should be chosen so that the smallest singular value of the k first columns
of A1 is maximized and the largest singular value of A2 is minimized. Note that an exhaustive
search is not feasible because this has combinatorial complexity. It can be shown that an RRQR
factorization always exists.

Theorem 2.3.6. Let A ∈ Rm×n , (m ≥ n), and let k be a given integer 0 < k < n. Then there
is a permutation matrix Πk that gives an RRQR factorization (2.3.32) with
p
c = k(n − k) + 1. (2.3.35)

Proof. See Hong and Pan [638, 1992, Theorem 2.2].

As pointed out by Stewart [1024, 1992] the sense in which the RRQR algorithms are rank-
revealing is different from that of the SVD. Given A ∈ Rm×n and a value k < n they produce a
permutation Π that reveals if there is a gap between σk and σk+1 . For a different value of k the
permutation may change.
Golub, Klema, and Stewart [496, 1976] (see also Golub and Van Loan [512, 1996, Sect.
12.2]) note that the selection of Π in an RRQR factorization is related to the column subset
selection problem of determining a subset A1 of k < n columns in A ∈ Rm×n such that
∥A − (A1 A†1 )A∥2 is minimized over all possible choices. This is closely related to the selection
of a subset of rows of the matrix of right singular vectors of A corresponding to small singular
values, as explained in the following theorem.

Theorem 2.3.7. Let A = U ΣV T be the SVD of A, and set V = ( V1 V2 ) and Π = ( Π1 Π2 ),


where V1 , Π1 ∈ Rn×k . Hence, V2 is the matrix of right singular vectors corresponding to the
n − k smallest singular values. Let
 
m×n R11 R12
AΠ = QR ∈ R , R= , (2.3.36)
0 R22

R11 ∈ Rk×k , 1 ≤ k < n, be the QR factorization of AΠ. Then


1
σmin (R11 ) ≥ cσk (A), σmax (R22 ) ≤ σk+1 (A), (2.3.37)
c
where c = σmin (ΠT1 V1 ) = σmin (ΠT2 V2 ).

Proof. See Hong and Pan [638, 1992, Theorem 1.5]. The equality (2.3.37) follows by applying
the CS decomposition (see Section 1.2.4) to the orthogonal matrix
 T 
Π1 V1 ΠT1 V2
ΠT V = .
ΠT2 V1 ΠT2 V2

Here the matrix V2 of right singular vectors can be replaced by any orthonormal basis for the
column space of V2 .

This theorem says that AΠ = QR is an RRQR factorization if the permutation matrix Π is


such that σmin (ΠT2 V2 ) is maximum. At the same time, σmin (ΠT1 V1 ) will attain its maximum.
2.3. Rank-Deficient Least Squares Problems 81

It remains to obtain a sufficiently sharp lower bound for c. When k = n − 1, this is easily
√ the right singular vector corresponding to σn . Since ∥vn ∥2 = 1, it follows that
done. Let vn be
∥vn ∥∞ ≥ 1/ n. Hence, taking Π to be the permutation matrix that permutes √ the maximum
element in vn to the last position guarantees that (2.3.37) holds with c = n.
Algorithms for computing an RRQR factorization based on computing the SVD of A are
usually not practical. If the SVD is known, this is already sufficient for most purposes.

Example 2.3.8. Let Rn , n = 100, be the Kahan matrix in Example 2.3.3. The largest element
in the right singular vector vn corresponding to σn is v1,n = 0.553, whereas the element vn,n =
1.60 · 10−8 is very small. Therefore, we perform a cyclical shift of the columns in Rn that puts
the first column last, i.e., in the order 2, 3, . . . , n, 1,
−c −c . . . −c 1
 
 1 −c . . . −c 0 
n−1 
 .. .. .. 
Hn = Rn Π = diag(1, s, . . . , s ) 1 . . . .


 . . . −c 0 

1 0
The matrix Hn has Hessenberg form and can be retriangularized in less than 2n2 flops using
updating techniques. Hence the total cost of this factorization is only slightly larger than that
for the standard QR factorization. In the new R-factor R̄ the last diagonal element r̄n,n =
6.654 · 10−9 is of the same order of magnitude as the smallest singular value 3.678 · 10−9 .
Furthermore, r̄n−1,n−1 = 0.16236. Hence R̄ is rank-revealing.

To obtain a sharp lower bound for c in (2.3.36) when k > 1 is more difficult. Recall that the
volume of a matrix X ∈ Rm×k , m ≥ k, is defined as the product of its singular values:
vol (X) = | det(X)| = σ1 (X) · · · σk (X).
Hong and Pan [638, 1992] show that selecting a permutation Π = ( Π1 Π2 ) such that
vol (ΠT2 V2 ) is maximum among all possible (n − k) by (n − k) submatrices in V2 is sufficient
to give an RRQR factorization.

Lemma 2.3.9. Let the unit vector v, ∥v∥2 = 1, be such that ∥Av∥2 = ϵ. Let Π be a permutation
such that if w = ΠT v, then |zn | = ∥w∥∞ . Then, in the QR factorization of AΠ we have
|rnn | ≤ n1/2 ϵ.

Proof. Since |zn | = ∥w∥∞ and ∥v∥2 = ∥w∥2 = 1, it follows that |zn | ≥ n−1/2 . Furthermore,
QTAv = QTAΠ(ΠT v) = Rw,
where the last component of Rw is rnn zn . Therefore, ϵ = ∥Av∥2 = ∥QTAw∥2 = ∥Rw∥2 ≥
|rnn zn |, from which the result follows.

Chan [224, 1987] gives a more efficient approach for the special case k = n − 1, based on
inverse iteration. Let AΠG = QR be an initial QR factorization using standard pivoting. Then
(ATA)−1 = (RTR)−1 = R−1 R−T ,
for which the dominating eigenvalue is σ1−2 . Each step then requires 2n2 flops for the solution
of the two triangular systems
RT y (k) = w(k−1) , Rw(k) = y (k) . (2.3.38)
82 Chapter 2. Basic Numerical Methods

By a few steps of inverse iteration, an approximation to σn and the corresponding singular vector
vn are obtained. From this a permutation matrix Π is determined as in Lemma 2.3.9. An RRQR
factorization of RΠ = Q̄R̄ can then be computed using updating techniques as in Example 2.3.8.
The above one-dimensional technique can be extended to the case when the approximate
nullspace is larger than one by applying it repeatedly to smaller and smaller leading blocks of R
as described in Algorithm 2.3.1.

Algorithm 2.3.1 (Chan’s RRQR).  


R
Compute a QR factorization AΠ = Q .
0
For k = n, n − 1, . . . do:
1. Partition  
R11 R12 }k
R=
0 R22 }n − k
and determine δk = σmin (R11 ) and the corresponding right singular vector wk .
2. If δk > τ (a user tolerance), then set rank(A) = k and finish.
3. Determine a permutation matrix P such that |(P T wk )k | = ∥P T wk ∥∞ .

4. Compute the QR factorization R11 P = Q


eRe11 and update
     
P 0 Qe 0 R
e11 e T R12
Q
Π := Π , Q := Q , R := .
0 In−k 0 In−k 0 R22

 
wk
5. Assign to the kth column of W and update
0
   
W1 PT 0
W = := W,
W2 0 In−k

where W2 is upper triangular and nonsingular.

A similar algorithm was proposed independently by Foster [424, 1986]. The main difference
between the two algorithms is that Foster’s algorithm only produces a factorization for a subset
of the columns of the original matrix.
By the interlacing property of singular values (Theorem 1.2.9) it follows that the δi are non-
increasing and that the singular values σi of A satisfy δi ≤ σi , k + 1 ≤ i ≤ n. Chan [224, 1987]
proves the following upper and lower bounds.

(i) (i)
Theorem 2.3.10. Let R22 and W2 denote the lower right submatrices of dimension (n − i +
1) × (n − i + 1) of R22 and W2 , respectively. Let δi denote the smallest singular value of the
leading principal i × i submatrices of R. Then, for i = k + 1 : n,
σi (i) √ (i)
√ (i)
≤ δi ≤ σi ≤ ∥R22 ∥2 ≤ σi n − i + 1∥(W2 )−1 ∥2 .
n−i+ 1∥(W2 )−1 ∥2

(i)
Hence, ∥R22 ∥2 are easily computable upper bounds for σi . Further, the outermost bounds
(i) (i)
in the theorem show that if ∥(W2 )−1 ∥2 is not large, then δi and ∥R22 ∥2 are guaranteed to be
tight bounds, and hence the factorization will have revealed the rank. The matrix W determined
2.3. Rank-Deficient Least Squares Problems 83

by the RRQR algorithm satisfies


n
X
∥AΠW ∥22 ≤ ∥AΠW ∥2F = δi2 .
i=k+1

Therefore, R(ΠW ) in the RRQR algorithm is a good approximation to the numerical nullspace
Nk (A). A more accurate and orthogonal basis for Nk (A) can be determined by simultaneous
inverse iteration with RTR starting with W . If R has zero or nearly zero diagonal elements, a
small multiple of the machine unit is substituted. The use of RRQR factorizations for computing
truncated SVD solutions is discussed by Chan and Hansen [227, 1990].
If the matrix A has low rank rather than low rank-deficiency, it is more efficient to build
up the rank-revealing QR factorization from estimates of singular vectors corresponding to the
large singular values. Such algorithms are described by Chan and Hansen [229, 1994]. Chan-
drasekaran and Ipsen [231, 1994] show that many previously suggested pivoting strategies form
a hierarchy of greedy algorithms. They give an algorithm called Hybrid-III that is guaranteed to
find an RRQR factorization that satisfies (2.3.33). Their algorithm works by alternately applying
standard and Stewart’s reverse column pivoting to the leading and trailing diagonal blocks of the
initial QR factorization
 
R11 R12
A = QR = Q , R11 ∈ Rk×k . (2.3.39)
0 R22

It keeps interchanging the “most dependent” of the first k columns with one of the last n − k
columns, and interchanging the “most independent” of the last n − k columns with one of the
first k columns, as long as det(R11 ) strictly increases. This stops after a finite number of steps.
In the worst case the work is exponential in n, but in practice usually only a few refactorizations
are needed.
Pan and Tang [875, 1999] give an RRQR algorithm that uses a similar type of cyclic pivoting.
Given A ∈ Rm×n , m ≥ n, let Πi,j be the permutation matrix such that AΠi,j interchanges
columns i and j of A. They define the pivoted magnitude η(A) of A to be the maximum
magnitude of r11 in the QR factorizations of AΠ1,j , 1 ≤ j ≤ n, i.e.,

η(A) = max {|r11 | | AΠ1,j = QR}. (2.3.40)


1≤j≤n

Clearly, it holds that


η(A) = max ∥Aej ∥2 .
1≤j≤n

Similarly, they define the reverse pivoted magnitude τ (A) to be the minimum magnitude of
|rnn | in the QR factorizations of AΠj,n , 1 ≤ j ≤ n. If A is nonsingular, then as shown by
Stewart [1023, 1984],
τ (A) = 1/ max ∥eTj A−1 ∥2 .
1≤j≤n

In the following, two related submatrices of R in the partitioned QR factorization (2.3.39)


are important: R11 , the (k + 1) × (k + 1) leading principal submatrix, and R22 , the (n − k +
1) × (n − k + 1) trailing principal submatrix. Pan and Tang consider the QR factorizations

AΠj,k = Q(j) R(j) , j = 1, . . . , k,


(j) (j)
and show that if |rkk | = η(R22 ), then the factorization is a k-rank-revealing factorization,
i.e.,the inequalities (2.3.33) are satisfied.
The following algorithm is less expensive than Hybrid-III by avoiding reverse pivoting.
84 Chapter 2. Basic Numerical Methods

Algorithm 2.3.2 (Pan and Tang’s RRQR Algorithm 1).


Compute the QR factorization AΠ with standard pivoting. Set i = k − 1, and while i ̸= 0,
do
1. Compute R-factor RΠi,k .
2. If |rkk | ≥ η(R22 ), then set i = i − 1. Otherwise, perform an exchange as follows. Find ℓ
such that
η(R̄22 ) = ∥(rk,ℓ , rk+1,ℓ , . . . , rn,ℓ )∥2 . (2.3.41)

3. Compute R-factor RΠk,ℓ , and set Π = Π Πk,ℓ and i = k − 1.

It can be shown that whenever an exchange takes place in step 2, the determinant of R11
will strictly increase in magnitude. Therefore an exchange can happen only a finite number of
times. After the last exchange, at most k − 1 iterations can take place. Hence the algorithm must
terminate.
For the RRQR factorization a basis for the numerical nullspace is given by R(W), where
−1
 
−R11 R12
W = .
In−r
−1
If the norm of R11 R12 is large, this cannot be stably evaluated. Gu and Eisenstat [549, 1996]
call an RRQR factorization strong if, apart from (2.3.33) being satisfied, it holds that
−1
(R11 R12 )ij ≤ c2 (k, n) (2.3.42)

for 1 ≤ i ≤ k and 1 ≤ j ≤ n − k, where c2 is bounded by a low-degree polynomial in


k and n. This condition suffices to make R(W) an approximate nullspace of A with a small
residual independent of the condition number of R11 . They give a modification of the Hybrid-III
algorithm that computes both k and a strong RRQR factorization.

2.4 Methods Based on LU Factorization


2.4.1 The Peters–Wilkinson Method
Standard algorithms for solving square nonsymmetric linear systems are usually based on an LU
factorization of A, and many efficient implementations are available for both sparse and dense
systems. Let minx ∥Ax − b∥2 , A ∈ Rm×n of rank(A) = n < m, be an overdetermined least
squares problem. The method of Peters and Wilkinson [892, 1970] starts by computing an LU
factorization with complete pivoting,
   
A1 L1
Π1 AΠ2 = = LU = U, (2.4.1)
A2 L2

where L ∈ Rm×n is unit lower trapezoidal and U ∈ Rn×n is upper triangular and nonsingular.
Π1 and Π2 are permutation matrices reflecting the row and column interchanges. With this
factorization the least squares problem becomes

min ∥Ly − Π1 b∥2 , U ΠT2 x = y. (2.4.2)


y

If complete pivoting is used, then all elements in L are bounded by one in modulus. This ensures
that L is well-conditioned, and any ill-conditioning in A will be reflected in U . Hence, squaring
2.4. Methods Based on LU Factorization 85

the condition number is avoided, and without substantial loss of accuracy, the first subproblem
in (2.4.2) can be solved using the normal equations

LT Ly = LT Π1 b. (2.4.3)

The solution x is then obtained from U ΠT2 x = y by back-substitution. The Peters–Wilkinson


method is particularly suitable for solving weighted least squares problems.
Computing the factorization (2.4.1) requires mn2 − n3 /3 flops. Forming the symmetric
matrix LT L requires mn2 − n/3 flops, and computing its Cholesky factorization needs n3 /3
flops. Neglecting terms of order n2 , the total arithmetic cost to compute the least squares solution
by the Peters–Wilkinson method is 2mn2 − n3 /3 flops. This is always more than that needed
for the method of normal equations applied to the original system, which requires mn2 + n3 /3
flops. Sautter [967, 1978] gives a detailed analysis of stability and rounding errors of the LU
factorization with complete pivoting.

Example 2.4.1. The following ill-conditioned matrix and its normal matrix is considered by
Noble [832, 1976]:
 
1 1  
1 1
A = 1 1 + ϵ, ATA = 3 .
1 1 + 2ϵ2 /3
1 1−ϵ

If ϵ ≤ u, then in floating-point computation, f l(1 + 2ϵ2 /3) = 1, and the computed matrix
ATA is numerically singular. However, in the LU factorization
 
1 0  
1 1
A = 1 1  ≡ LU,
0 ϵ
1 −1

L is orthonormal, and the pseudoinverse can be stably computed as

1 −ϵ−1
   
1/3 0 1 1 1
A† = U −1 (LT L)−1 LT =
0 ϵ−1 0 1/2 0 1 −1
−1 −1
 
1 2 2 − 3ϵ 2 + 3ϵ
= .
6 0 3ϵ−1 −3ϵ−1

A similar method can be developed for solving the underdetermined problem

min ∥y∥2 subject to AT y = c, (2.4.4)

where rank(AT ) = n < m. From (2.4.1) we have AT y = Π2 (LU )T Π1 y. Hence, with z = Π1 y


problem (2.4.4) can be written as

min ∥z∥2 subject to LT z = d, (2.4.5)

where d is obtained from the lower triangular system U T d = ΠT2 c. Problem (2.4.5) is well-
conditioned and can be solved using the normal equations of the second kind:

LT Lz = d, y = ΠT1 (Lz). (2.4.6)

LU factorization with complete pivoting is slow, and when A has full column rank, it is
usually sufficient to use partial pivoting. An attractive alternative is to use a method that, in
86 Chapter 2. Basic Numerical Methods

terms of efficiency and accuracy, lies between partial and complete pivoting; see Foster [425,
1997]. Rook pivoting was introduced by Neal and Poole [824, 1992]. The name is chosen
because the pivot search resembles the movement of the rook piece in chess. A pivot element
is chosen that is the largest in both its row and column. One starts by finding the element of
maximum magnitude in the first column. If this also is of maximum magnitude in its row, it is
accepted as pivot. Otherwise, the element of maximum magnitude in this row is determined and
compared with other elements in its column, etc.
The Peters–Wilkinson method works also for problems where A is rank-deficient. Then, after
deleting zero rows in U and the corresponding columns in L, the LU factorization yields a unit
lower trapezoidal factor L ∈ Rm×r and an upper triangular factor U ∈ Rr×n of full rank. For
example, if m = 6, n = 4, and r = 3, then (in exact computation) the factors obtained after r
elimination steps have the form
 
1
 l21 1   
  u11 u12 u13 u14
 l31 l32 1 
L= , U = u22 u23 u24  ;
 l41 l42 l43 
  u33 u34
 l51 l52 l53 
l61 l62 l63

see Section 2.4.3. A problem that remains is how to determine the rank r reliably.

2.4.2 Nearly Square Least Squares Problems


Many applications lead to linear systems Ax = b that are nearly square, i.e., the number of rows
in A is almost equal to the number of columns. One example is the simulation of multibody
systems, such as robot arms or satellites; see Cardenal, Duff, and Jiménez [208, 1998]. For
such problems substantial savings in operations and storage may be achieved by an algebraic
reformulation of the Peters–Wilkinson method.
Let minx ∥Ax − b∥2 , rank(A) = n, be a nearly square least squares problem, p = m − n,
0 < p ≪ n. LU factorization gives the subproblem miny ∥Ly − b∥2 . With z = L1 y, u = z − b1 ,
the transformed subproblem becomes
   
In 0
min u− , (2.4.7)
z C b2 − Cb1 2

where C = L2 L−1
1 ∈R
p×n
. The normal equations are

(In + C T C)u = C T (b2 − Cb1 ). (2.4.8)

The useful identity


(In + C T C)−1 C T = C T (Ip + CC T )−1

can be proved by multiplying with (In + C T C) from the left and using (C T C)C = C T (CC T ).
It follows that the solution to (2.4.7) can be written u = C T v, where

(Ip + CC T )v = b2 − Cb1 . (2.4.9)


2.4. Methods Based on LU Factorization 87

This only requires the inversion of a matrix of size p × p. The resulting algorithm is summarized
below.
 
1. Compute the factorization A e = Π1 AΠ2 = L1 U .
L2

2. Compute C by solving LT1 C = LT2 .

3. Form and solve (Ip + CC T )v = b2 − Cb1 .

4. Form and solve L1 y = b1 + C T v and U x = y.

Neglecting higher order terms in p, the initial LU factorization requires mn2 − n3 /3 flops
and computing C requires pn2 flops. Forming and solving the normal equations for v takes about
np2 flops. Finally, solving the triangular systems L1 y = C T v + b1 and U x = y takes 2n2 flops.
For p ≪ n the total arithmetic cost mn2 − n3 /3 + pn2 is lower than that required by the normal
equations.
Equation (2.4.9) is the normal equation for the problem
   
Ip b2
min v− (2.4.10)
y
e2 CT −b1 2

and can be solved by a QR factorization of size (p + n) × p.


The least-norm problem

min ∥y∥2 subject to AT y = c


 
L1
can be treated similarly. If A = LU , L = , then
L2
 
y1
y1 + C T y2 = L−T
1 U
−T
c = d, C = L2 L−1
1 , y= .
y2

This shows that y2 is the solution to the damped least squares problem of full rank,
 T  
C d
min y2 − . (2.4.11)
y2 Ip 0 2

The normal equations are (CC T + Ip )y2 = Cf . The algorithm can be summarized as follows:
 
L1
1. Compute the factorization A = Π1 AΠ2 =
e U.
L2

2. Solve LT1 C = LT2 and U1T LT1 d = c for C and d.

3. Form and solve the normal equations (CC T + Ip )y2 = Cd.

4. Compute y1 = −C T y2 .

Cardenal, Duff, and Jiménez [208, 1998] derive similar formulas for quasi-square least
squares problems by using the augmented system formulation. They implement and test the new
methods using both dense LU routines from LAPACK and the sparse LU solver MA48 from the
Harwell Software Library (HSL). Tests carried out confirm that the new method is much faster
than either the normal equations or the original Peters–Wilkinson method. Furthermore, their
88 Chapter 2. Basic Numerical Methods

sparse implementation outperformed the dense solvers on problems as small as a few hundred
equations.
An alternative suggested by Tewarson [1058, 1968] is to solve the least squares subproblem
miny ∥Ly − Π1 b∥2 in (2.4.2) by an orthogonal reduction of the trapezoidal matrix L to lower
triangular form:      
L1 L̄ c1
Q = , QΠ1 b = .
L2 0 c2
Such an algorithm is given by Cline [253, 1973]. The lower trapezoidal structure in L̄ is pre-
served by taking Q as a product of Householder reflectors,
Q = P1 · · · Pn−1 Pn ,
where Pk is chosen to zero the elements ℓn+1,k , . . . , lm,k . This reduction requires 2n2 (m −
n) flops. The least squares solution is then obtained in 2n3 /3 flops from the lower triangular
system L̄y = c1 . Cline’s algorithm is very efficient for least squares problems that are slightly
overdetermined, i.e., p = m − n ≪ n. The total number of flops required for computing the
least squares solution is about 23 n3 + 3pn2 . This is fewer than needed for the method of normal
equations if p ≤ n/3. For p ≤ 2n/3 it is also more efficient than using the Householder QR
factorization of A.
Plemmons [897, 1974] solves the least squares subproblem (2.4.2) by MGS. The trapezoidal
structure in L is preserved by applying MGS in reverse order from the last to the first column
of L. This method requires a total of 43 n3 + 3pn2 flops, which is slightly more than needed for
Cline’s algorithm. Similar orthogonalization methods for the underdetermined case (m < n) are
given by Cline and Plemmons [258, 1976].

Notes and references


The use of LU factorization in least squares algorithms was proposed by Ben-Israel and Wer-
san [102, 1963], Tewarson [1058, 1968], and Noble [831, 1969]. Björck and Duff [139, 1980]
use a slightly modified version of the Peters–Wilkinson method for solving sparse least squares
problems and discuss extensions of the method to handle linear constraints.

2.4.3 Rank-Revealing LU Factorizations


If Gaussian elimination with complete pivoting is applied in exact arithmetic to A ∈ Rm×n ,
m ≥ n, and rank(A) = k < n, this yields after k steps a factorization
  
L11 0 U11 U21
Ā = Π1 AΠ2 = , (2.4.12)
L21 Im−k 0 U22

where L11 ∈ Rk×k is unit lower unit triangular, U11 ∈ Rk×k is upper triangular, and U22 = 0.
Hence the rank of the matrix is revealed. However, when A is nearly rank-deficient, then even
with complete pivoting, ∥U22 ∥ may not be small, as shown by the following example by Peters
and Wilkinson [892, 1970]. For the upper triangular Wilkinson matrix
1 −1 −1 · · · −1
 
 1 −1 · · · −1 
 .. ..  n×n
W =  . ··· . ∈R , (2.4.13)
 1 −1 
1
the smallest singular value of W is of size 2−(n−2) , although no diagonal element is small.
2.4. Methods Based on LU Factorization 89

For matrices having only one small singular value, Chan [223, 1984] shows that there always
exists a permutation such that the near-rank-deficiency is revealed by LU factorization. Further-
more, these permutations are related to the size of elements in A−1 . Let A ∈ Rm×n have the
SVD
Xn
A = U ΣV T = σi ui viT ,
i=1

and assume that σn−1 ≫ σn > 0. Then the last term σn−1 vn uTn in the pseudoinverse A† =
P n −1 T
i=1 σi vi ui will dominate. Let i, j be indices corresponding to the maximum value of
|(un )i (vn )j |. Then permuting aij to the (n, n)th position will produce a rank-revealing LU
(RRLU) factorization with unn = 2−(n−2) ≈ σn (A). For the matrix W in (2.4.13), the largest
element of the inverse

1 1 2 · · · 2n−2
 
n−3
 1 1 ··· 2 
−1
 . . ..  n×n
W = . ··· . ∈R

 1 1 
1

is in the (1, n)th position. Hence a rank-revealing LU factorization should be obtained by


permuting element (n, 1) to position (n, n). Switching the first and last columns of R gives
RΠ1,n = LU , where unn = 2−(n−2) .
Chan suggests the following algorithm. Compute a pivoted LU factorization of A, and then
use inverse iteration to find approximate singular vectors un and vn . Determine (i, j) so that
|(un )i (vn )j | is maximal, permute element aji to the (n, n)th position, and recompute the LU
factorization with a pivoting strategy that is restricted to permuting only the first n − 1 rows and
columns. The cost of this algorithm is at most two LU factorizations and a few back solves.
For the general case, where A may have several small singular values, the existence of a
rank-revealing LU (RRLU) factorization is proved by Pan [873, 2000, Theorem 1.2].

Theorem 2.4.2. Let A ∈ Rn×n have singular values σ1 ≥ · · · ≥ σn ≥ 0. Then for any
given k, 1 ≤ k < n, there exist permutations Π1 and Π2 such that in the factorization (2.4.12),
L11 ∈ Rk×k is unit lower triangular, U11 ∈ Rk×k is upper triangular, and

σmin (U11 ) ≥ σk /(k(n − k) + 1), (2.4.14)


σmax (U22 ) ≤ (k(n − k) + 1)σk+1 . (2.4.15)

The bounds established in (2.4.14) and (2.4.15) are strikingly similar to the bounds estab-
lished in Theorem 2.3.6 for rank-revealing QR factorizations. If ∥U22 ∥2 is sufficiently small,
then a rank-k approximation of A is
 
L11
A= ΠT1 ( U11 U21 ) ΠT2 .
L21

In addition to the generalized LU factorization (2.4.12) we introduce the block LU factorization


  
Ik 0 Ā11 Ā12
Ā = ,
Ā21 Ā−1
11 In−k 0 S

where S = Ā22 − Ā21 Ā−1


11 Ā12 is the Schur complement.
90 Chapter 2. Basic Numerical Methods

If Ā11 = L11 U11 is nonsingular, a rank-k approximation of Ā is


 
Ā11
Ā = Ā−1
11 ( Ā11 Ā12 ) . (2.4.16)
Ā21
If k is small, this approximation preserves a large part of A. When A is large and sparse this can
give large savings of memory and arithmetic. Goreinov, Tyrtyshnikov, and Zamarashkin [519,
1997] call this a pseudoskeleton approximation. They derive tighter bounds using an approach
that relies heavily on information from the SVD.
Pan [873, 2000] used the concept of local maximum volume and showed how this could be
used to find a pivoting strategy that works for finding both RRLU and RRQR factorizations; see
Section 2.4.3.

Definition 2.4.3. Let B be a submatrix of A ∈ Rm×n formed by any k columns (rows) of A.


Then B is said to have local maximum volume if
vol (B) = σ1 (B) · · · σn (B) ≥ vol (B ′ ) (2.4.17)
for any B ′ obtained by replacing one column (row) of B with a column (row) of A not in B.

Note that to determine if B has a local maximum volume, it is only necessary to compare
its volume with the volumes of k(n − k) neighboring submatrices that differ from B in exactly
one column (row). In floating-point computations, (2.4.17) should be replaced by vol (B) ≥
vol (B ′ )/µ, where µ > 1 is a fudge factor.
 
A1
Lemma 2.4.4. Let A = ∈ Rn×k , where A1 ∈ Rk×k , n > k. Then
A2

∥A2 A−1
p
1 ∥2 ≤ k(n − k) + 1 (2.4.18)
provided vol (A1 ) is a local maximum in A.
   
I A1
Proof. Let M = (mij ) = A2 A−1
so that
1 A1 = . Since the submatrix A1 has
M A2  
I
maximum local volume in A, it follows that I has maximum local volume in . For any
M
 ′ 
I
mij ̸= 0, interchange row i of M with row j in I. Denote the new matrix by . Then
M′
vol (I ′ ) ≤ vol (I), which implies that |mij | ≤ 1. Finally, ∥M ∥2 ≤ ∥M ∥F ≤
p
k(n − k).

From this lemma we obtain the following result.


 
A1
Lemma 2.4.5. Let A = ∈ Rn×k , where A1 ∈ Rk×k , n > k. Then
A2
p
σmin (A1 ) ≥ σk (A)/ k(n − k) + 1 (2.4.19)
provided vol (A1 ) is a local maximum in A.
 
Q1
Proof. Assume that the QR factorization of A is A = QR, where Q = . Since
Q2
vol (Q1 )vol (R) = vol (A1 ), it follows that vol (Q1 ) is a local maximum in Q.
2.4. Methods Based on LU Factorization 91

To find a rank-k, 1 ≤ k < n, RRLU factorization of A ∈ Rn×n one first selects a subset of
k columns with local maximum volume in A and then a subset of k rows with local maximum
volume in the selected k columns. Pan [873, 2000] gives two algorithms. Algorithm 1 selects an
m × m submatrix of a matrix A ∈ Rm×n , with rank(A) = m < n. It starts by computing the
LU factorization with partial pivoting

Π1 AΠ2 = LU = L ( U1 U2 ) ,

where Π1 = I and |ukk | ≥ |ukj |, k < j ≤ n. This is followed by a block pivoting phase in
which column j of U1 , j = m−1, . . . , 1, is permuted to the last position in U1 , and the permuted
U1 is retriangularized by Gaussian elimination. If the updated matrix fails to satisfy the condition

|umm | ≥ max |umj |, (2.4.20)


1≤j≤n

a column interchange between U1 and U2 is performed to restore condition (2.4.20). Whenever


such an interchange takes place the procedure of permuting every column in U1 to the last is
repeated. When a complete block pivoting phase has been performed without violating condition
(2.4.20), the product LU1 has maximum local volume in A, and the algorithm stops.
If the interchanges in the block pivoting phase are done by a cyclic shift, the permuted matrix
U1 will have Hessenberg form. For example, for m = 5, permuting columns 2 and 5 produces a
matrix with structure
x × × × ×
× × × ×
× × × .
× ×
×
For stability reasons, row pivoting should be employed in the retriangularization. A complete
block pivoting without column interchanges between U1 and U2 requires about m2 n − (1/3)m3
flops. This algorithm is faster than the corresponding QR approach even without an explicit Q.
The second algorithm of Pan selects k < min{m, n} columns from A ∈ Rm×n by com-
puting a rank-revealing Cholesky factorization. It is initialized with an uncompleted Cholesky
factorization with diagonal pivoting of ATA. Like Algorithm 1 it uses block pivoting to find a
symmetric permutation such that the leading principal submatrix of order k has local maximum
volume. Combined with Algorithm 1 this yields an algorithm for finding an RRLU factorization.

Notes and references

Hwang, Lin, and Yang [652, 1992] were the first to consider RRLU factorizations with numerical
rank-deficiency p > 1. However, their bounds may increase faster than exponentially. Improved
bounds are given in Hwang, Lin, and Pierce [651, 1997]. Miranian and Gu [796, 2003] study
strong rank-revealing LU factorizations for which the elements of Wl = L21 L−1 11 and Wr =
−1
U11 U12 are bounded by some slow growing polynomial in k, m, and n.

2.4.4 Elimination Methods for Augmented Systems


If rank(A) = n, the least squares problem minx ∥b − Ax∥2 is equivalent to the symmetric
indefinite augmented system
    
I A r b
= .
AT 0 x 0
92 Chapter 2. Basic Numerical Methods

Eliminating y using Gaussian elimination without pivoting gives the reduced upper block trian-
gular system     
I A y b
= .
0 −ATA x −AT b
Hence this choice of pivots just leads to the normal equations. To get a more stable method, it is
necessary to choose pivots outside the block I.
Introducing the scaled residual vector s = α−1 r gives the augmented system
   −1   
αI A α r b
= ⇐⇒ Mα zα = dα , (2.4.21)
AT 0 x 0

where we assume that 0 ≤ α ≤ ∥A∥2 = σ1 (A). The scaling parameter α will affect the
conditioning of Mα as well as the choice of pivots and thereby the accuracy of the computed
solution. For sufficiently small values of α, pivots will not be chosen from the (1, 1) block.
However, mixing the unknowns x and r does not make sense physically, because they have
different physical units and may be on vastly different scales numerically.
For a general symmetric indefinite matrix M , a stable symmetric indefinite factorization
ΠM ΠT = LDLT with D diagonal always exists if 2 × 2 symmetric blocks in D are allowed. A
pivoting scheme due to Bunch and Kaufman [187, 1977] guarantees control of element growth
without requiring too much searching. The symmetry constraint allows row and column permu-
tations to bring any diagonal element d1 = brr or any 2 × 2 submatrix of the form
 
brr brs
(brs = bsr )
bsr bss

to the pivot position. Taking a 2×2 submatrix as a pivot is equivalent to a double step of Gaussian
elimination and pivoting first on brs and then on bsr . Such a double step preserves symmetry,
and only elements on and below the main diagonal of the reduced matrix need to be computed.
Ultimately, a factorization A = LDLT is obtained where D is block diagonal with a mixture of
1 × 1 and 2 × 2 blocks. L is unit lower triangular with ℓk+1,k = 0 when B (k) is reduced by a
2 × 2 pivot. The Bunch–Kaufman strategy is to search until two columns r and s are found for
which the common element brs bounds in modulus the other off-diagonal elements in the r and
s columns. Then either a 2 × 2 pivot on these two columns or a 1 × 1 pivot with the largest in
modulus of the two diagonal elements is taken, according to the test

max(|brr |, |bss |) ≥ ρ|brs |, ρ = ( 17 + 1)/8 ≈ 0.6404.

The number ρ has been chosen to minimize the growth per stage of elements of B, allowing for
the fact that two stages are taken by a 2 × 2 pivot. With this choice, element growth is bounded
by gn ≤ (1 + 1/ρ)n−1 < (2.57)n−1 . This bound can be compared to the bound 2n−1 that holds
for Gaussian elimination with partial pivoting.
The above bound for element growth can be achieved with fewer comparisons using a strat-
egy due to Bunch and Kaufman. Let λ = |br1 | = max2≤i≤n |bi1 | be the off-diagonal element of
largest magnitude in the first column. If |b11 | ≥ ρλ, take b11 as a pivot. Otherwise, determine
the largest off-diagonal element in column r:

σ = max |bir |, i ̸= r.
1≤i≤n

If |b11 | ≥ ρλ2 /σ, again take b11 as a pivot; else if |brr | ≥ ρσ, take brr as a pivot. Otherwise, take
the 2×2 pivot corresponding to the off-diagonal element b1r . Note that at most two columns need
2.4. Methods Based on LU Factorization 93

to be searched in each step, and at most n2 comparisons are needed in all. When the factorization
M = LDLT has been obtained, the solution of M z = d is obtained in the three steps
Lv = d, Dw = v, LT z = w.
It has been shown by Higham [621, 1997] that for stability it is necessary to solve the 2 × 2
systems arising in Dw = v using partial pivoting or the explicit 2 × 2 inverse. The proof of this
is nontrivial and makes use of the special relations satisfied by the elements of the 2 × 2 pivots
in the Bunch–Kaufman pivoting scheme.
Bunch–Kaufman pivoting does not in general give a stable method for the least squares prob-
lem, because perturbations introduced by roundoff do not respect the structure of the augmented
system. For the scaled system (2.4.21) with a sufficiently small value of α the Bunch–Kaufman
scheme will introduce 2 × 2 pivots of the form
 
α a1r
,
a1r 0
which may improve the stability. This raises the question of the optimal choice of α for stability.
The eigenvalues λ of Mα can be expressed in terms of the singular values σi , i = 1, . . . , n
of A; see Björck [124, 1967]. If Mα z = λz, z = (s, x)T ̸= 0, then
αs + Ax = λs, AT s = λx,
or, eliminating s, αλx + ATAx = λ2 x. Hence if x ̸= 0, then x is an eigenvector and (λ2 − αλ)
an eigenvalue of ATA. On the other hand, x = 0 implies that AT s = 0, αs = λs, s ̸= 0. It
follows that the m + n eigenvalues of Mα are
 p
2 2
λ = α/2 ± α /4 + σi , i = 1, . . . , n,
α otherwise.
If rank(A) = r ≤ n, then the eigenvalue α has multiplicity (m − r), and 0 is an eigenvalue
√ multiplicity (n − r). From this it−1/2
of is easily deduced that if σn > 0, then minα κ2 (Mα ) ≈
2κ2 (A) is attained for α = α̃ = 2 σn (A). Therefore, α̃ (or σn ) can be used as a nearly
optimal scaling factor in the augmented system method. Minimizing κ2 (Mα ) will minimize the
forward bound for the error in zα ,
 −1 
ϵκ(Mα ) α r
∥z̄α − zα ∥2 ≤ ∥zα ∥2 , zα = .
1 − ϵκ(Mα ) x
However, α also influences the norm in which the error is measured.
Pivoting and stability in the augmented system method is studied by Björck [133, 1992]. A
more refined error analysis is given here that separately minimizes bounds for the errors in x̄ and
ȳ. It is shown that the errors in the computed solution satisfy the upper bounds
   
∥r̄ − r∥2 σ1 (A)
≤ cguf (α) , (2.4.22)
∥x̄ − x∥2 κ2 (A)
where c is a low-degree polynomial, g the growth factor, and
  
α 1
f (α) = 1 + ∥r∥2 + ∥x∥2 .
σn α
 1/2
If x ̸= 0, then f (α) is minimized for α = αopt = σn ∥r∥2 ∥x∥2 . The corresponding
minimum value of f (α) is
 2  2
αopt σn
fmin = 1 + ∥x∥2 = 1 + σn−1 ∥r∥2 . (2.4.23)
σn αopt
94 Chapter 2. Basic Numerical Methods

Taking α = σn we find
 
1
f (σn ) = 2 ∥r∥2 + ∥x∥2 ≤ 2fmin ,
σn
i.e., using α = σn will at most double the error bound.
We recall that an acceptable-error stable algorithm is defined as one that gives a solution
for which the size of the error is never significantly greater than the error bound obtained from
a tight perturbation analysis. It can be shown that the augmented system method is acceptable-
error stable with both α = σn and α = αopt .

2.5 Estimating Condition Numbers and Errors


2.5.1 Condition Estimators
The perturbation bounds for the least squares solution x given in Section 1.3.4 depend critically
on the condition number κ2 (A) = σ1 (A)/σn (A). If the QR factorization A = QR is known,
then we have
κ(A) = κ(R) ≤ ∥R∥F ∥R−1 ∥F .
But computing R−1 takes n3 /3 flops, which usually is too expensive. Combining the estimates
(2.3.20) and (2.3.25) we obtain the lower bound

κ(A) = σ1 (R)/σn (R) ≥ |r11 /rnn |. (2.5.1)

Empirical evidence suggests that, provided column pivoting has been used in the QR factoriza-
tion, it is very rare for the bound in (2.5.1) to differ much from κ(A). In extensive tests on
randomly generated test matrices, the bound usually underestimated the true condition number
by a factor of only 2–3 and never by more than 10. However, as shown by Example 2.3.3, the
bound (2.5.1) can still be a considerable underestimate of κ(A).
Improved estimates of κ(R) can be computed in only O(n2 ) flops by using inverse iteration.
Let the singular values of A be σi , i = 1, . . . , n, where σn < σi , i ̸= n. Then the dominating
eigenvalue σ1−2 of the matrix
C = (ATA)−1 = (RTR)−1
can be computed by applying the power method to (RTR)−1 = R−1 R−T . In each step, two
triangular linear systems

RT y (k) = z (k−1) , Rz (k) = y (k) , k = 1, 2, . . . , (2.5.2)

are solved, which requires 2n2 flops. After normalization, z (k) will converge to the right singular
vector vn corresponding to an eigenvalue, and

σ1−2 = v1T R−T R−1 v1 = ∥R−1 v1 ∥22 . (2.5.3)

Example 2.5.1. Failure to detect near-rank-deficiency of A even in the (unusual) case when this
is not revealed by a small diagonal element in R can lead to a meaningless solution of very large
norm. Inverse iteration will often prevent that. For example, the n×n upper triangular Wilkinson
matrix  
1 −1 · · · −1 −1

 1 · · · −1 −1  
W =
 .. .. ..  (2.5.4)
 . . . 

 1 −1 
1
2.5. Estimating Condition Numbers and Errors 95

has numerical rank n − 1 when n is large. If n = 50 and W is perturbed by changing the z50,1
entry to −2−48 , the new matrix Ŵ will be exactly singular. The smallest singular value of W is
bounded by
σ50 ≤ ∥W − Ŵ ∥F = 2−48 ≈ 7.105·10−15 .
The next smallest singular value is σ49 ≈ 1.5, so there is a well-defined gap between σ49 and σ50 .
But in the QR factorization R = W and gives no indication of the numerical rank-deficiency. (If
column interchanges are employed, the diagonal elements in R indicate rank 49.) Doing a single
inverse iteration on W T W using the MATLAB script

n = 50; W = eye(n) - triu(ones(n,n),1);


z = ones(n,1); x = W\(W'\z);
s = 1/sqrt(max(abs(x)));
gives an approximate smallest singular value s = 1.9323·10−15 . A second inverse iteration gives
a value of 2.3666·10−30 .

The condition estimator given by A. K. Cline et al. [256, 1979], often referred to as the
LINPACK condition estimator, proceeds as follows:

1. Choose a vector d such that ∥y∥/∥d∥ is large, where RT y = d.

2. Solve Rz = y, and estimate ∥R−1 ∥ ≈ ∥z∥/∥y∥ ≤ ∥R−1 ∥.

This is equivalent to one step of the power method with (ATA)−1 . Let R = U ΣV T be the
SVD of R. Expanding d in terms of the right singular vectors V gives
n
X n
X n
X
d= αi vi , y= (αi /σi )ui , z= (αi /σi2 )vi .
i=1 i=1 i=1

Hence provided αn , the component of d along vn , is not very small, the vector z is likely to be
dominated by its component of vn , and

σn−1 ≈ ∥z∥2 /∥y∥2

will usually be a good estimate of σn−1 . In the LINPACK algorithm the 1-norm is used for
normalization. The vector d is chosen as d = (±1, ±1, . . . , ±1)T , where the sign of dj is
determined adaptively; see A. K. Cline et al. [256, 1979].
In practice the LINPACK algorithm performs very reliably and produces good order of mag-
nitude estimates; see Higham [612, 1987]. However, examples of parametrized matrices can be
constructed for which the LINPACK estimate can underestimate the true condition number by
an arbitrarily large factor. In a modification to the LINPACK condition estimator, O’Leary [837,
1980] suggests that ∥R−1 ∥1 be estimated by

max ∥y∥∞ /∥d∥∞ , ∥z∥1 /∥y∥1 ,

This makes use of information from the first step, which can improve the estimate. Another
generalization, due to A. K. Cline, Conn, and Van Loan [255, 1982], of the LINPACK algorithm
incorporates a “look-behind” technique. This allows for the possibility of modifying previously
chosen dj ’s. It gives an algorithm for the 2-norm that requires about 10n2 flops.
Boyd [173, 1974] devised a method for computing a lower bound for an arbitrary Hölder
norm ∥B∥p , assuming only that Bx and B T x can be computed for arbitrary vectors x. In the
96 Chapter 2. Basic Numerical Methods

following, p ≥ 1 and q ≥ 1 are such that 1/p + 1/q = 1. Then ∥ · ∥q is the dual norm to ∥ · ∥p ,
and the Hölder inequality
|xT y| ≤ ∥x∥p ∥y∥q
holds. In the algorithm, dualp (x) is any vector y of unit ℓq -norm such that equality holds for
x and y in the Hölder inequality. A derivation of Boyd’s algorithm is given by Higham [623,
2002, Sect. 15.2]. When p = q = 2, Boyd’s algorithm reduces to the usual power method
applied to B TB.
For the ℓ1 -norm the algorithm was derived independently by Hager [560, 1984]. In this
case the dual norm is the ℓ∞ -norm. Since ∥B∥∞ = ∥B T ∥1 , this algorithm can be used also
to estimate the infinity norm. Hager’s algorithm is based on convex optimization and uses the
observation that ∥B∥1 is the maximal value of the convex function
n
X
f (x) = ∥Bx∥1 = |yi |, y = Bx,
i=1

over the convex set S = {x ∈ Rn | ∥x∥1 ≤ 1}. From convexity results it follows that the
maximum is attained at one of the vertices ej , j = 1, . . . , n, of S. From this observation Hager
derives an algorithm for finding a local maximum that with high probability is also the global
maximum.

Algorithm 2.5.1 (Hager’s Norm Estimator).


Given a matrix B ∈ Rn×n this algorithm computes y = Bx such that γ = ∥y∥1 /∥x∥1 ≤
∥B∥1 . Let e = (1, 1, . . . , 1)T , ej be the jth unit vector, and ξ = sign(y) where ξi = ±1
according to whether yi ≥ 0 or yi < 0.

x = n−1 e;
repeat
y = Bx; ξ = sign(y);
T
z = B ξ;
if ∥z∥∞ ≤ z T x
γ = ∥y∥1 ; break
end
set x = ej ; where |zj | = ∥z∥∞ ;
end

The algorithm tries to maximize f (x) = ∥Bx∥1 subject to ∥x∥1 = 1. The vector z computed
at each step can be shown to be a subgradient of f at x. From convexity properties,

f (±ej ) ≥ f (x) + z T (±ej − x), j = 1, . . . , n.

Hence if |zj | > z T x for some j, then f can be increased by moving from x to the vertex ej of S.
If, however, ∥z∥∞ ≤ z T x, and if yj ̸= 0 for all j, then x can be shown to be a local maximum
point for f over S.
Higham [617, 1990] reports on experience in using Hager’s algorithm. The estimates pro-
duced are generally sharper than those produced by the LINPACK estimator. Its results are
frequently exact, usually good (γ ≥ 0.1∥B∥1 ), but sometimes poor. The algorithm almost al-
ways converges after at most four iterations, and Higham recommends that between two and five
2.5. Estimating Condition Numbers and Errors 97

iterations be used. The average cost for estimating ∥R∥1 of a triangular matrix R is in practice
around 6n2 flops.
An important feature of Hager’s norm estimator is that to estimate ∥B −1 ∥1 we only need to
be able to solve linear systems By = x and B T z = ξ. This feature makes it useful for estimating
the componentwise error bounds given in Section 1.3.4. For the least squares problem the bound
(1.3.56) can be written in the form ∥δx∥∞ ≤ ω cond (A, b)∥x∥∞ , where
 
cond (A, b) ≤ ∥|A† |g1 ∥∞ + ∥|ATA)−1 |g2 ∥∞ /∥x∥∞ (2.5.5)

and
g1 = |A||x| + |b|, g2 = |A|T |r|. (2.5.6)
Hager’s algorithm gives an inexpensive and reliable estimate of cond (A, b). The key idea is to
note that all terms in (2.5.5) are of the form ∥|B|g∥∞ , where g > 0. Following Arioli, Demmel,
and Duff [33, 1989], we take G = diag (g). Then using g = Ge and the properties of the
ℓ∞ -norm, we have

∥|B|g∥∞ = ∥|B|Ge∥∞ = ∥BG∥∞ = ∥|BG|e∥∞ = ∥BG∥∞ .

Hence Hager’s algorithm can be applied to estimate ∥|B|g∥∞ provided matrix-vector products
BGx and GT B T y can be computed efficiently. To estimate cond (A, b) we need to be able to
compute matrix-vector products of the forms A† x, (A† )T y, and (ATA)−1 x. This can be done
efficiently if a QR factorization of A is known.

Notes and references


A survey of condition estimators is given by Higham [612, 1987]. Fortran 77 codes by Higham
[614, 1988] implementing Hager’s condition estimator for the 1-norm of a real or complex matrix
are included in LAPACK; see Anderson et al. [26, 1995]. Further details and comments on the
algorithm are found in Hager [560, 1984] and Higham [616, 1990].

2.5.2 A Posteriori Estimation of Errors


Let x̄ be an approximate solution of a linear least squares problem minx ∥Ax − b∥2 . If x̄ is an
exact solution of the perturbed problem

min ∥(A + E)x − b∥2 (2.5.7)


x

for some E, then ∥E∥2 is called the backward error of x̄. If ∥E∥2 is small compared to the
uncertainty in the data A, then the solution x̄ can be said to be as good as the data warrants. The
forward error ∥x − x̄∥ can be estimated using the perturbation bounds given in Section 1.3.3.
In general, x̄ solves (2.5.7) for an infinite number of perturbations E. The optimal backward
error for a given x̄ is defined as

µ(x̄) = min ∥E∥F subject to ∥(A + E)x̄ − b∥2 = min . (2.5.8)

To find a good estimate of µ(x̄) is important, e.g., for deciding when to stop an iterative solution
method. For a consistent linear system Ax = b. Rigal and Gaches [928, 1967] showed that the
optimal backward error E is given by the rank-one perturbation

E0 = r̄x̄T /∥x̄∥22 = r̄x̄† , r̄ = b − Ax̄. (2.5.9)

Furthermore, ∥E0 ∥2 = ∥E0 ∥F = ∥r̄∥2 /∥x̄∥2 .


98 Chapter 2. Basic Numerical Methods

Finding the optimal backward error for a general least squares problem is more difficult.
Stewart [1018, 1977, Theorem 3.1] gives two simple upper bounds for µ(x̄).

Theorem 2.5.2. Let x̄ be an approximate solution to the least squares problem minx ∥Ax − b∥2 .
Assume that the corresponding residual r̄ = b − Ax̄ ̸= 0. Then x̄ exactly solves minx ∥b − (A +
Ei )x∥2 , where
(r̄ − r)x̄T r̄r̄TA
E1 = , E2 = − = −r̄r̄† , (2.5.10)
∥x̄∥22 ∥r̄∥22

and r = b − Ax is the residual corresponding to the exact solution x. The norms of these
perturbations are
p
∥r̄∥22 − ∥r∥22 ∥r̄∥2 ∥AT r̄∥2
∥E1 ∥2 = ≤ , ∥E2 ∥2 = . (2.5.11)
∥x̄∥2 ∥x̄∥2 ∥r̄∥2

Proof. The result for E2 is proved by showing that x̄ satisfies the normal equations (A+E2 )T (b−
(A + E2 )x̄ = 0. Note that r̄† = r̄T /∥r̄∥2 is the pseudoinverse of r̄ and that r̄r̄† is an orthogonal
projector. From A + E2 = (I − r̄r̄† )A and Ax̄ = b − r̄, we obtain

b − (A + E2 )x̄ = b − (I − r̄r̄† )(b − r̄) = r̄r̄† b.

Hence the normal equations become AT (I − r̄r̄† )r̄r̄† b = 0. The proof for E1 may be found in
Stewart [1017, 1977, Theorem 5.3].

In Theorem 2.5.2 ∥E1 ∥2 is small when r̄ is almost equal to the residual r of the exact solution.
∥E2 ∥2 is small when r̄ is almost orthogonal to the column space of A. However, these are just
upper bounds, and µ(x̄) can be much smaller than either ∥E1 ∥2 or ∥E2 ∥2 .
An exact expression for µ(x̄) was found by Waldén, Karlsson, and Sun [1096, 1995] by char-
acterizing the set E of all possible perturbations E. Their result is summarized in the following
theorem; see also Higham [623, 2002, pp. 404–405].

Theorem 2.5.3. Let r̄ = b − Ax̄ ̸= 0 and η = ∥r̄∥2 /∥x̄∥2 . Then the optimal backward error in
the Frobenius norm is

µ(x̄) ≡ min ∥E∥F = min η, σmin ( A, B ) , (2.5.12)
E∈E

where B = η(I − r̄r̄† ).

Computing the smallest singular value of the matrix (A, B) ∈ Rm×(m+n is too expensive for
most practical purposes. Karlsson and Waldén [685, 1997] proposed an estimate of µ̃ that can be
computed more cheaply. This makes use of a regularized projection of the residual r̄ = b − Ax̄.
The Karlsson–Waldén (KW) estimate can be expressed as µ e = ∥Ky∥2 /∥x̄∥2 , where y solves the
least squares problem
   
A r̄
min ∥Ky − v∥2 , K= , v= . (2.5.13)
y ηI 0

e = ∥QT v∥2 /∥x̄∥2 can be com-


If the compact QR factorization of K = QR of K is known, µ
puted in O(mn) operations.
2.5. Estimating Condition Numbers and Errors 99

Numerical experiments by Grcar, Saunders, and Su [532, 2007] indicate that the KW esti-
mate is very near the true optimal backward error and can be used safely in practice. This was
confirmed by Gratton, Jiránek, and Titley-Peloquin [528, 2012], who proved the lower and upper
bounds
µ p √
1 ≤ ≤ 2 − (∥r∥2 /∥r̄∥2 ) ≤ 2. (2.5.14)
µ
e
The ratio tends to 1 as ∥r̄∥2 → ∥r∥2 .
The following MATLAB script can be used to compute the KW estimate (2.5.13) by sparse
QR factorization without storing Q.

[m,n] = size(A); r = b - A*x;


normx = norm(x); eta = norm(r)/normx;
p = colamd(A);
K = [A(:,p); eta*speye(n)];
v = [r; zeros(n,1)];
[c,R] = qr(K,v,0);
mutilde = norm(c)/normx;

Methods for computing the KW estimate are given by Malyshev and Sadkane [770, 2001].
Optimal backward error bounds for problems with multiple right-hand sides are given by Sun
[1050, 1996], while bounds for underdetermined systems are derived in Sun and Sun [1051,
1997]. The extension of backward error bounds to constrained least squares problems is dis-
cussed by Cox and Higham [271, 1999].
The optimal componentwise backward error of x̄ is the smallest ω ≥ 0 such that x̄ exactly
minimizes ∥(A + E)x̄ − (b + f )∥2 , and

|E| ≤ ω|A|, |f | ≤ ω|b|, (2.5.15)

where the inequalities are to be interpreted componentwise. For a consistent linear system b ∈
R(A), Oettli and Prager [835, 1964] proved the explicit expression

|Ax̄ − b|i
ω = max . (2.5.16)
1≤i≤n (|A||x̄| + |b|)i

Here 0/0 should be interpreted as 0, and ζ/0 (ζ ̸= 0) as infinity. (The latter case means that no
finite ω satisfying (2.5.16) exists.) Together with the perturbation result (1.3.49), (2.5.16) can be
used to compute an a posteriori bound on the error in a given approximate solution x̄.
In Section 1.3.4 we obtained perturbation bounds for the least squares problem subject to
componentwise perturbations. However, no expression for the optimal componentwise back-
ward error is known. Following Björck [132, 1991], we apply the Oettli–Prager bound to the
augmented system (1.1.19), where no perturbations in the diagonal blocks of M or in the zero
vector in the right-hand side are allowed. However, we allow for different perturbations of the
blocks A and AT , as this does not increase the forward error bounds (1.3.55) and (1.3.56).
Hence for an a posteriori error analysis, it makes sense to define the pseudocomponentwise
backward error of a computed solution x̄, r̄ to be the smallest nonnegative number ω such that

|δAi | ≤ ω|A|, i = 1, 2, |δb| ≤ ω|b|,

and     
I A + δA1 r̄ b + δb
= . (2.5.17)
AT + δA2 0 x̄ 0
100 Chapter 2. Basic Numerical Methods

Note that this allows the two blocks of A in the augmented system to be perturbed differently
and hence does not directly correspond to perturbing the data of the least squares problem. From
the result of Oettli and Prager, this backward error for a computed solution r̄ and x̄ becomes
ω(r̄, x̄) = max(ω1 , ω2 ), where

|b − (r̄ + Ax̄)|i |AT r̄|i


ω1 = max , ω2 = max . (2.5.18)
1≤i≤m (|A||x̄| + |b|)i 1≤i≤n (|AT | |r̄|)i

If we only have a computed x̄, it may be feasible to define r̄ = b − Ax̄ and apply the result
above. With this choice we have ω1 = 0 (exactly), and hence

|AT (b − Ax̄)|i
ω(r̄, x̄) = ω2 = max .
1≤i≤n (|AT ||b − Ax̄|)i

If the columns of A and b are scaled, the class of perturbations scales in the same way. Hence ω
is invariant under row and column scaling of A. A bound for the forward error ∥x̄ − x∥2 can be
obtained in terms of ω, which potentially is much smaller than the standard forward error bound
involving κ2 (A).
In the case of a nearly consistent least squares problem, f l(b − Ax̄) will mainly consist of
roundoff and will not be accurately orthogonal to the range of A. Hence although x̄ may have
a small relative backward error, ω2 may not be small. This illustrates a fundamental problem in
computing the backward error: for x̄ to have a small backward error it is sufficient that either
(b − Ax̄) or AT (b − Ax̄) is small, but neither of these conditions is necessary.

2.5.3 Iterative Refinement


Mixed-precision iterative refinement (IR) is a process used for improving the accuracy of a
given approximate solution x to a linear system Ax = b. Typically the initial approximation is
computed using an LU factorization of A computed in working precision of A. Next the residual
b − Ax is computed in a higher precision, the correction equation Ad = r is solved, and an
updated solution x + d is formed. If necessary, the refinement step is repeated. In classical
mixed-precision IR, arithmetic in two different precisions is: the working precision u1 , in which
the data and solution are stored, and a higher precision u2 ≤ u1 used for computing residuals.

Algorithm 2.5.2 (Iterative Refinement).


Solve Ax0 = b in precision u1 using some factorization of A.

for s = 0, 1, 2, . . . ,
compute rs = b − Axs ; in precision u2
round rs to precision u1
solve Aδxs = rs ; in precision u1
xs+1 = xs + δxs ; in precision u1
end;

The process is stopped when δxs /∥xs ∥ no longer shows a steady decrease.

The factorization of A used for computing the initial approximation can also be used for solv-
ing the correction equations. Therefore the cost of each refinement step is quite small. Note that
while the computed solution initially improves with each iteration, this is usually not reflected in
a corresponding decrease in the norm of the residual, which typically stays about the same.
2.5. Estimating Condition Numbers and Errors 101

On many early computers, inner products could be cheaply accumulated at twice the working
precision, and IR was used with u2 = u21 . This traditional version of IR was analyzed for fixed-
point arithmetic by Wilkinson [1118, 1963] and for floating-point arithmetic by Moler [799,
1967]. A more recent error analysis is found in Higham [623, 2002, Chapter 12]. As long as A is
not too ill-conditioned so that the initial solution has a relative error ∥x − x0 ∥/∥x0 ∥ = η, η < 1,
IR behaves roughly as follows. The relative error is decreased by a factor of about η with each
step of refinement until a stage is reached at which the solution is correct to working precision u.
Since most problems involve inexact input data, obtaining a highly accurate solution may
not seem to be justified. Even so, IR offers a useful estimate of the accuracy and reliability of a
computed solution. The correction δ1 also gives a good estimate of the sensitivity of the solution
to small relative perturbations of order u in the data A and b. Furthermore, there are applications
where an accurate solution of very ill-conditioned equations is warranted; see Ma et al. [765,
2017].
Mixed-precision IR was first applied to least squares problems by Businger and Golub [193,
1965]. In their algorithm the QR factorization of A is used to compute x0 and solve for the
corrections δxs . The iterations proceeds as follows.

for s = 0, 1, 2, . . .
compute rs = b − Axs ; in precision u2 .
solve min ∥Aδxs − rs ∥2 ; in precision u1 .
δxs
xs+1 = xs + δxs ; in precision u1 .
end

This works well for small-residual problems, but otherwise it may fail to give solutions correct
to working precision.
To remedy this it is necessary to simultaneously refine both the solution x and the residual
r by applying IR to the augmented system for the least squares problem. Let xs and rs be the
current approximations. In Björck [124, 1967] the new approximations are taken to be

xs+1 = xs + δxs , rs+1 = rs + δrs ,

where the corrections are computed in precision u1 from the augmented system
    
I A δrs fs
= . (2.5.19)
AT 0 δxs gs

Here the residuals


fs = b − rs − Axs , gs = −AT rs

are computed in precision u2 and rounded to precision u1 . The system (2.5.19) can be solved
stably using Algorithm 2.2.6:
 
−T ds
zs = R gs , = QT fs , (2.5.20)
es
 
zs
δrs = Q , δxs = R−1 (ds − zs ). (2.5.21)
es

Alternatively, an MGS QR factorization and Algorithm 2.2.12 can be used. An implementation


that uses Householder QR is given by Björck and Golub [143, 1967].
102 Chapter 2. Basic Numerical Methods

Algorithm 2.5.3 (IR of Augmented System Solution).


set x0 = 0; r0 = 0;
for s = 0, 1, 2, . . . ,
fs = b − rs − Axs ; gs = c − AT rs ; in precision u2
solve for δrs , δxs ; in precision u1
xs+1 = xs + δxs ;
rs+1 = rs + δrs ;
end

This algorithm requires 8mn − 2n2 flops in working precision for computing the QR fac-
torization. Computing the residual takes 4mn flops in extended precision. The initial rate of
convergence can be shown to be linear with rate
ρ = c1 uκ′ , κ′ = min κ2 (AD), (2.5.22)
D>0

where c1 is of modest size. Note that this rate is achieved without actually carrying out the
scaling of A by the optimal D. This rate is similar to that for the linear system case, even
though the conditioning of the least squares problem includes a term proportional to κ2 (A) for
large-residual problems.

Example 2.5.4 (See Björck and Golub [143, 1967]). To illustrate the method of iterative re-
finement we consider the linear least squares problem where A is the last six columns of the
inverse of the Hilbert matrix H8 ∈ R8×8 , which has elements
hij = 1/(i + j − 1), 1 ≤ i, j ≤ 8.
Two right-hand sides b1 and b2 are chosen so that the exact solution is
x = (1/3, 1/4, 1/5, 1/6, 1/7, 1/8)T .
For b = b1 the system Ax = b is consistent; for b = b2 the norm of the residual r = b − Ax is
1.04 · 107 . Hence, for b2 the term proportional to κ2 (A) in the perturbation bound dominates.
The refinement algorithm was run on a computer with a single precision unit roundoff u =
1.46 · 10−11 . The correction equation was solved using Householder QR factorization. Double
precision accumulation of inner products was used for calculating the residuals, but otherwise
all computations were performed in single precision. We give below the first component of the
successive approximations x(s) , r(s) s = 1, 2, 3, . . . , for right-hand sides b1 (left) and b2 (right).
(s) (s)
x1 = 3.33323 25269 · 10−1 x1 = 5.56239 01547 · 10+1
3.33333 35247 · 10−1 3.37777 18060 · 10−1
3.33333 33334 · 10−1 3.33311 57908 · 10−1
3.33333 33334 · 10−1 3.33333 33117 · 10−1
(s) (s)
r1 = 9.32626 24303 · 10−5 r1 = 2.80130 68864 · 106
5.05114 03416 · 10−7 2.79999 98248 · 106
3.65217 71718 · 10−11 2.79999 99995 · 106
−1.95300 70174 · 10−13 2.80000 00000 · 106
A gain of almost three digits accuracy per step in the approximations to x1 and r1 is achieved for
both right-hand sides b1 and b2 . This is consistent with the estimate (2.5.22) because
κ(A) = 5.03 · 108 , uκ(A) = 5.84 · 10−3 .
2.5. Estimating Condition Numbers and Errors 103

(4)
For the right-hand side b1 the approximation x1 is correct to full fixed precision. It is interesting
to note that for the right-hand side b2 the effect of the error term proportional to uκ2 (A) is evident
(1) (4)
in that the computed solution x1 is in error by a factor of 103 . However, x1 has eight correct
(4)
digits, and r1 is close to the true value 2.8 · 106 .

Wampler [1099, 1979] gives two Fortran subroutines L2A and L2B using MGS with itera-
tive refinement for solving weighted least squares problems. These are based on the ALGOL
programs in Björck [126, 1968] and were found to provide the best accuracy in a comparative
evaluation at the National Bureau of Standards; see Wampler [1098, 1970]. Demmel et al. [308,
2009] developed a portable and parallel implementation of the Björck–Golub IR algorithm for
least squares solutions that uses extended precision.
Most descriptions of IR stress the importance of computing the residuals in higher precision.
However, fixed-precision IR with residuals computed in working precision (u2 = u1 ) can also
be beneficial. Jankowski and Woźniakowksi [663, 1977] show that any linear equation solver
can be made backward stable by IR in fixed precision as long as the solver is not too unstable
and A not too ill-conditioned. If the product of cond (A) = ∥ |A−1 ||A| ∥2 and the measure of
ill-scaling
maxi (|A||x|)i
σ(A, x) = (2.5.23)
mini (|A||x|)i
is not too large, Skeel [1001, 1980] proves that LU factorization with partial pivoting combined
with one step of fixed-precision IR becomes stable in a strong sense. Higham [618, 1991] extends
Skeel’s analysis to show that for any solver that is not too unstable, one step of fixed-precision
IR suffices to achieve a solution with a componentwise relative backward error ω < γn+1 u1 . In
particular, this result applies to solving linear systems by QR factorization.
Higham [618, 1991] studies fixed-precision IR for linear systems and least squares prob-
lems of QR factorization methods. He shows that the componentwise backward error ω(r̄, x̄) =
max(ω1 , ω2 ) in (2.5.18) eventually becomes small, although it may take more than one iteration.
In particular, IR mitigates the effect of poor row scaling.
If fs and gs in Algorithm 2.5.3 are evaluated in precision u1 , then the resulting roundoff
errors become more important. A standard backward error analysis shows that

f¯ = b − δb − (A + δA1 )x̄ − (I + δI)r̄, ḡ = −(A + δA2 )T r̄,

where δI is diagonal. Hence, the roundoff errors are equivalent to small componentwise pertur-
bations in the nonzero blocks of the augmented matrix,

|δA1 | ≤ 1.06(n + 3)u|A|, |δb| ≤ 1.06u|b|, (2.5.24)


|δA2 | ≤ 1.06(m + 2)u|A|, |δI| ≤ uI, (2.5.25)

where the inequalities are to be interpreted componentwise. It follows that the computed residu-
als f¯ and ḡ are the exact residuals corresponding to the perturbed system


      
b + δb I + δI A + δA1 r̄
= − ,
ḡ 0 (A + δA2 )T 0 x̄

where the perturbations satisfy the componentwise bounds derived above. A perturbation |δI|
can be considered as a small perturbation in the weights of the rows of Ax − b. Roundoff errors
also occur in solving equations (2.5.20–(2.5.21). However, if the refinement converges, then the
roundoff errors in the solution of the final corrections are negligible.
104 Chapter 2. Basic Numerical Methods

Recently, graphics processing units (GPUs) have been introduced that perform extremely fast
half precision matrix-matrix multiplication accumulated in single IEEE half precision format (see
Section 1.4). This has caused a renewed interest in multiprecision algorithms for applications
such as weather forecasting and machine learning. A survey of linear algebra in mixed precision
is given by Higham and Mary [627, 2022].
Carson and Higham [209, 2018] develop a three-precision iterative refinement algorithm for
solving linear equations. This uses a complete LU factorization in half IEEE precision (u = 4.9×
10−4 ), single precision as working precision, and double precision for computing residuals. The
remaining computations are performed in working precision, and all results are stored in working
precision. A rounding error analysis shows that this obtains full single-precision accuracy as long
as κ(A) ≤ 104 . With lower working precision the likelihood increases that the system being
solved is too ill-conditioned. The authors show that in these cases an improvement is obtained
by using a two-stage iterative refinement approach where the correction equation is solved by
GMRES preconditioned by LU factorization (see Section 6.4.5). For the resulting GMRES-IR
algorithm the above condition can be weakened to κ(A) ≤ 108 .
Carson, Higham, and Pranesh [210, 2020] develop an analogous three-precision iterative
refinement algorithm called GMRES-LSIR for least squares problems. It uses the QR factoriza-
tion of A computed in half IEEE precision. The correction is solved by GMRES applied to the
augmented system using a preconditioner based on Algorithm 2.2.6 and the computed QR factor-
ization. For a wide range of problems this yields backward and forward errors for the augmented
system correct to working precision.

Notes and references

Kielbasiński [693, 1981] studies a version of IR with variable precision called binary-cascade IR
(BCIR) in which several steps of IR are performed for solving a linear system with prescribed
relative accuracy. At each step the process uses the lowest sufficient precision for evaluating the
residuals. A BCIR process for solving least squares problems is developed by Gluchowska and
Smoktunowicz [481, 1990]. Iterative refinement of solutions has many applications to sparse
linear systems and sparse least squares problems. Arioli, Demmel, and Duff [33, 1989] adapt
Skeel’s analysis of fixed-precision IR to the problem of solving sparse linear systems with sparse
backward error. The use of a fixed-precision IR for sparse least squares problems is studied by
Arioli, Duff, and de Rijk [36, 1989]. They note that IR can regain a loss of accuracy caused by
bad scaling of the augmented system.

2.5.4 The Corrected Seminormal Equations


The seminormal equations (SNE) for the least squares problem minx ∥Ax − b∥2 are

RTRx = c, c = AT b. (2.5.26)

In floating-point arithmetic the roundoff errors in computing c can be bounded by ∥δc∥2 ≤


mu∥A∥2 ∥b∥2 . The error leads to a resulting perturbation δx in the solution of (2.5.26) such that

∥δx∥2 ≤ muκ(A)2 ∥b∥2 /∥A∥2 .

It will be of similar size whether R comes from a QR factorization of A or from a Cholesky


factorization of ATA. This error term can be damped by performing a few steps of IR in fixed
precision:
2.5. Estimating Condition Numbers and Errors 105

Algorithm 2.5.4 (IR for Seminormal Equations).

set x0 = 0;
for s = 0, 1, 2, . . . ,
rs = b − Axs ,
solve RT(Rδxs ) = AT rs ,
xs+1 = xs + δxs .
end

Each step of refinement requires matrix-vector multiplication with A and AT and the solution
of two triangular systems. With R from a QR factorization the convergence of this iteration is
linear with a rate that can be shown to be approximately

ρ = cuκ′ (A), κ′ = min κ(AD);


D>0

see Björck [130, 1987]. Note that this holds without actually performing the optimum column
scaling. When R comes from a Cholesky factorization the rate achieved is much worse: only
ρ̄ = cuκ′ (A)2 . Even then, a good final accuracy can be achieved for a large class of problems
by performing several steps of IR.
In the method of corrected seminormal equations (CSNE), a corrected solution xc is
computed by doing just one step of refinement of the SNE solution x̄: Compute the residual
r̄ = f l(b − Ax̄) and solve
RTRδx = AT r̄, xc = x̄ + δx. (2.5.27)
Assuming that R comes from a backward stable Householder QR or MGS factorization, the
computed R is the exact R-factor of a perturbed matrix,

A + E, ∥E∥F ≤ c1 u∥A∥F ,

where c1 is a small constant depending on m and n. From this property the following error bound
for the corrected solution xc can be shown; see Björck [130, 1987, Theorem 3.2].

Theorem 2.5.5. Let x̄c be the computed CSNE solution using R from Householder QR or MGS
of A. If ρ ≡ c1 n1/2 uκ(A) < 1, then neglecting higher order terms in uκ(A), the following
error estimate holds:
   
1/2 ∥r∥2 1/2 ∥b∥2
∥x − x̄c ∥2 ≤ mn uκ ∥x∥2 + κ + σuκ c2 ∥x∥2 + n m , (2.5.28)
∥A∥2 ∥A∥2

where κ = κ(A), σ = c3 uκκ′ , c3 ≤ 2n1/2 (c1 + 2n + m/2), and c1 and c2 are small constants
depending on m and n.

If σ = c3 uκκ′ < 1, the error bound for the CSNE solution is no worse than the error bound
for a backward stable method. This condition is usually satisfied in practical applications and
is roughly equivalent to requiring that x̄ from the seminormal equation have at least one correct
digit.
An important application of CSNE is to sparse least squares problems. In the QR factor-
ization of a sparse matrix A, the factor Q often can be much less sparse that the factor R; see
Gilbert, Ng, and Peyton [470, 1997]. Therefore Q is not saved, which creates a difficulty if addi-
tional right-hand sides b have to be treated. With CSNE, recomputing the QR factorization can
be avoided.
106 Chapter 2. Basic Numerical Methods

Example 2.5.6. Consider a sequence of least squares problems constructed as follows. Let A be
the first five columns of the inverse Hilbert matrix H6−1 of order six. This matrix is moderately
ill-conditioned: κ2 (A) = 4.70 × 106 . Let x = (1, 1/2, 1/3, 1/4, 1/5)T , and let b = Ax be a
consistent right-hand side. Let h satisfy AT h = 0 and κ2 (A)∥h∥2 /(∥A∥2 ∥x∥2 ) = 3.72 × 103 .
Consider a sequence of right-hand sides Ax + 10k h, k = 0, 1, 2, with increasing residual norm.
For these problems it holds that σ = c3 uκκ′ ≪ 1.
Table 2.5.1 shows the average number of correct significant decimal digits of four solution
methods: Normal equations (NE), SNE, QR, and CSNE. As predicted by the error analysis, SNE
gives only about the same accuracy as NE. On the other hand, CSNE is better than QR.

Table 2.5.1. Average number of correct significant decimal digits.

Right-hand side NE SNE QR CSNE


b 3.541 3.308 6.208 7.744
b+h 3.423 4.801 6.232 8.103
b + 10h 4.357 3.797 6.567 7.861
b + 100h 4.575 4.241 5.142 5.814

For more ill-conditioned problems, several refinement steps may be needed. Let xp be the
computed solution after p refinement steps. With R from QR the error initially behaves as ∥x −
xs ∥ ∼ c1 uκκ′ (c1 uκ′ )p . If c1 ≈ 1 and κ′ = κ, an acceptable-error stable level is achieved in p
steps if κ(A) < u−p/(p+1) .

2.6 Blocked Algorithms and Subroutine Libraries


2.6.1 Blocked QR Factorization
A key strategy for obtaining high performance in new algorithms is to group together and reorder
scalar and matrix-vector operations into matrix-matrix operations. Such blocked algorithms have
the same stability properties as their scalar counterparts but reduced communication. Consider
Householder QR of the partitioned matrix A = ( A1 A2 ) ∈ Rm×n , where A1 ∈ Rm×n1 . In
the first step we compute
 
R11
A1 = Q1 , Q1 = P1 P2 · · · Pn1 ∈ Rm×m , (2.6.1)
0

where Pi = I − τi ui uTi , i = 1, . . . , n1 , are Householder reflections. Next, these transformations


are applied to the trailing matrix A2 ∈ Rm×n2 :
 
T R12
Q1 A2 = , QT1 = Pn1 · · · P2 P1 . (2.6.2)
B

To achieve better performance, this sequence of rank-one updates can be aggregated into one
update of rank p.
A stable compact representation of a product of Householder matrices is given by Bischof
and Van Loan [123, 1987]. Here we describe a more storage–efficient version due to Schreiber
and Van Loan [975, 1989]. Let

Qi = I − Ui Ti UiT , i = 1, 2,
2.6. Blocked Algorithms and Subroutine Libraries 107

where Ui ∈ Rm×ni and Ti ∈ Rni ×ni are upper triangular. Then Q = Q1 Q2 = (I − U T U T ),


where
 
T1 −T1 U1T U2 T2
U = ( U1 U2 ) , T = . (2.6.3)
0 T2

Note that U is formed by concatenation, but forming the off-diagonal block of the upper trian-
gular matrix T requires extra operations. In the special case when n1 = k, n2 = 1, U2 = u, and
T2 = τ , (2.6.3) becomes
 
T1 −τ T1 (U1T u)
U = ( U1 u), T = . (2.6.4)
0 τ

The blocked Householder QR algorithm of A = (A1 A2 ) starts by computing the QR


factorization of A1 ∈ Rm×n1 , where n1 ≪ n. Starting with T = τ1 , U = u1 and using (2.6.4)
recursively, Tn1 and Un1 are generated so that

QT1 = Pn1 · · · P2 P1 = I − UnT1 TnT1 Un1 .

This requires about n21 (m + n1 /3) flops. The trailing matrix A2 can then be updated by matrix-
matrix operations as
 
R12
QT1 A2 = A2 − Un1 (TnT1 (UnT1 A2 )) = , R12 ∈ Rn1 ×(n−n1 ) (2.6.5)
B

in about n1 n2 (2m + n1 ) flops. Next, B is partitioned as B = ( B1 B2 ). Proceeding as before,


the QR factorization B1 = Q1 R22 is computed and B2 is updated. All remaining steps are
similar. The process terminates when the columns of A are exhausted. Note that the size of the
successive blocks can be chosen dynamically and need not be fixed in advance. The optimal
choice of block sizes depends on characteristics of the computer. This block Householder QR
algorithm has the same stability as the scalar algorithm and can be shown to be backward stable.
Consider now a block Householder QR factorization for a fixed uniform partitioning A =
(A1 , A2 , . . . , AN ), where N = n/p. For k = 0, . . . , N − 1, do:

1. Compute the Householder QR factorization of a matrix of size (m − kp) × p.

2. Update the upper triangular matrix T ∈ Rkp×kp in the WY representation.

3. Apply the update to the trailing blocks matrix of size (m − kp) × (n − kp).

The QR factorization in step 1 requires a total of less than 2N mp2 = 2mnp flops. The operation
count of step 2 is of similar magnitude. Since the total number of flops for the Householder
QR factorization of A ∈ Rm×n must be greater than 2n2 (m − n/3) flops, all but a fraction of
n/p = 1/N of the operations are spent in the matrix operations of the updating.
The block Householder QR algorithm described above is right-looking, i.e., in step k the full
trailing submatrix of size (m − kp) × (n − kp) is updated. For p = 1 it reduces to the standard
Householder QR algorithm. The data referenced can instead be reduced by using a left-looking
algorithm that in step k applies all previous Householder transformations to the next block of
size (n − kp) × p of the trailing matrix.
108 Chapter 2. Basic Numerical Methods

A blocked form of MGS QR can easily be constructed as follows; see Björck [134, 1994].
Let A = ( A1 A2 ) and

A1 = Q1 R11 ∈ Rm×k , Q1 = (q1 , . . . , qk ), (2.6.6)

be the MGS factorization of A1 , where qiT qj = 1, i = j. Due to rounding errors, there will
be a loss of orthogonality so that qiT qj ̸= 0, i ̸= j. In the next step, the trailing block A2 is
transformed as
B = PkT A2 , Pk = (I − q1 q1T ) · · · (I − qk qkT ), (2.6.7)

where Pk is a product of elementary orthogonal projectors. To perform this updating efficiently,


Pk can be expressed in the form

Pk = I − Qk Tk QTk , Qk = (q1 , . . . , qk ), (2.6.8)

where Tk ∈ Rk×k is a unit upper triangular matrix. To form Tk recursively, set T1 = 1, Q1 = q1 ,


and for i = 2, . . . , k, compute
 
Ti−1 li
Qi = (Qi−1 , qi ), Li = , li = −Ti−1 (QTi−1 qi ). (2.6.9)
0 1

The update in (2.6.7) can then be written in terms of matrix operations as

B = (I − QTk TkT Qk )A2 = A2 − QTk (TkT (Qk A2 )). (2.6.10)

This update requires 2(n − k)k(m + p/4) flops. When k ≪ n this is the main work in the first
step. In the next step, B is partitioned as B = ( B1 B2 ), and the MGS QR factorization of B1
is computed, etc. The resulting block MGS algorithm has the same stability as the scalar MGS
algorithm and can be used to solve least squares problems in a backward stable way.
The following result (Björck [125, 1967, Lemma 5.1]) can be used to improve the efficiency
of column-oriented MGS orthogonalization.

Lemma 2.6.1. Given Qk = (q1 , . . . , qk ), with ∥qj ∥2 = 1, j = 1, . . . , k, define Q


e k = (e
q1 , . . . , qek )
recursively by qe1 = q1 , qek = Pk−1 qk , Pk−1 = (I − q1 q1T ) · · · (I − qk−1 qk−1
T
). Then

e k QTk ,
Pk = I − Q e k (I + LTk ),
Qk = Q (2.6.11)

where Lk ∈ Rk×k is a strictly lower triangular correction matrix with elements lij = qiT qj ,
i > j.

Proof. The lemma is trivially true for k = 1. From the definition of qek we have

Pk = Pk−1 (I − qk qkT ) = Pk−1 − Pk−1 qk qkT = Pk−1 − qek qkT .

e k−1 QT )qk or, equivalently,


Assume that (2.6.11) holds for k − 1. Then qek = (I − Q k−1

qk = qek + (qkT qk−1 )e


qk−1 + · · · + (qjT q1 )e
q1 . (2.6.12)

e k (I + LT ).
But this is the kth column of Qk = Q k
2.6. Blocked Algorithms and Subroutine Libraries 109

From Lemma 2.6.1 it follows that

PkT = I − Qk (I + Lk )−1 QTk . (2.6.13)

Hence the orthogonalization of ak against q1 , . . . , qk−1 in MGS can be written as

T
Pk−1 e Tk−1 )ak ,
ak = (I − Qk−1 Q e Tk−1 = (I + Lk−1 )−1 QTk−1 .
Q

Comparing (2.6.8) and (2.6.13) gives the identity Tk−1 = (I − Lk−1 )−1 . A similar lower tri-
angular inverse is used by Walker [1097, 1988] to obtain a blocked Householder algorithm. A
summary of the compact WY and inverse compact WY for Householder and MGS transforma-
tions and a version of blocked MGS based on (2.6.13) are given by Świrydowicz et al. [1054,
2020].
Several other block Gram–Schmidt algorithms have been suggested. Jalby and Philippe [662,
1991] study a block Gram–Schmidt algorithm in which MGS is used to orthogonalize inside the
blocks, and the trailing matrix is updated as in CGS by multiplication with (I − Qk QTk ). The
stability of this algorithm is shown to lie between that of MGS and CGS. The computed matrix
Q̂ satisfies
∥I − Q̂T Q̂∥2 ≤ ρu max κ(Wk )κ(A),
k

where Wk , k = 1, . . . , N , are the successive panel matrices. The accuracy can be improved
significantly by reorthogonalization of the trailing matrix.
A more challenging problem is the orthogonal basis problem for computing Q1 and R that
satisfy

∥I − QT1 Q1 ∥ ≤ c1 (m, n)u, ∥A − Q1 R∥ ≤ c2 (m, n)∥A∥u, (2.6.14)

where A ∈ Rm×n , and c1 (m, n) and c2 (m, n) are modest constants. Stewart [1031, 2008]
develops a left-looking Gram–Schmidt algorithm with A partitioned into blocks of columns.
Each block is successively orthogonalized and incorporated into Q. In order to maintain full
orthogonality in Q, reorthogonalization is used in all Gram–Schmidt steps. A feature of the
algorithm is that it can handle numerical rank-deficiencies in A. A similar block algorithm based
on CGS2, together with an error analysis that improves some previously given bounds, is given
by Barlow and Smoktunowicz [76, 2013].

Notes and references

Puglisi [906, 1992] gives an improved version of the WY representation of products of House-
holder reflections, which is richer in matrix-matrix operations; see also Joffrain et al. [674, 2006].
Bischof and Quintana-Ortí [121, 122, 1998], use a windowed version of column pivoting aided
by an incremental condition estimator (ICE) to develop an efficient block algorithm for comput-
ing an RRQR factorization. Columns found to be nearly linearly dependent of previously chosen
columns are permuted to the end of the matrix. Numerical tests show that this pivoting strategy
usually correctly identifies the rank of A and generates a well-conditioned matrix R.
Oliveira et al. [842, 2000] analyze pipelined implementations of QR factorization using dif-
ferent partitioning schemes, including block and block-cyclic columnwise schemes. A parallel
implementation of CGS with reorthogonalization is given by Hernandez, Román, and Tomás
[605, 2006]. Rounding error analysis of mixed-precision block Householder algorithms is given
by Yang, Fox, and Sanders [1138, 2020]. Carson et al. [211, 2022] survey block Gram–Schmidt
algorithms and their stability properties.
110 Chapter 2. Basic Numerical Methods

2.6.2 Recursive Cholesky and QR Factorization


A special class of block partitioned algorithms uses a recursive blocking. This improves data
locality and can execute efficiently also on multicore computers. The Cholesky factorization for
a symmetric positive definite matrix A partitioned in 2 × 2 blocks has the form
   T
 
A11 A12 R11 0 R11 R12
= , (2.6.15)
AT12 A22 T
R12 T
R22 0 R22

where R11 and R22 are upper triangular matrices. Equating both sides gives the matrix equations

R11^T R11 = A11,   R11^T R12 = A12,   R22^T R22 = A22 − R12^T R12

for computing the blocks in R and leads to the following algorithm.

1. Compute the Cholesky factorization A11 = R11^T R11.

2. Solve the lower triangular system R11^T R12 = A12 for R12.

3. Form the Schur complement S22 = A22 − R12^T R12 and compute its Cholesky factorization S22 = R22^T R22.

The MATLAB program below computes the two required Cholesky factorizations of size n1 × n1 and n2 × n2 by recursive calls. The recursion is stopped and a standard Cholesky routine is used when n ≤ nmin.

Algorithm 2.6.1 (Recursive Cholesky Factorization).

function L = rchol(C,nmin)
% RCHOL computes the Cholesky factorization
% of C using a divide and conquer method
% -------------------------------------------------
n = size(C,1);
if n <= nmin, L = chol(C,'lower'); % lower triangular Cholesky factor
else
n1 = floor(n/2); n2 = n-n1;
j1 = 1:n1; j2 = n1+1:n;
% Recursive call
L11 = rchol(C(j1,j1),nmin);
% Triangular solve
L21 = (L11\C(j1,j2))';
% Recursive call
L22 = rchol(C(j2,j2) - L21*L21',nmin);
L = [L11, zeros(n1,n2); L21, L22];
end
end
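A small usage check of rchol (illustrative only; the test matrix, block size, and residual check below are not from the text) might look as follows.

C = gallery('moler',400);                    % a symmetric positive definite test matrix
L = rchol(C,32);                             % switch to chol below block size 32
relres = norm(L*L' - C,'fro')/norm(C,'fro')  % should be of the order of roundoff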

The parameter nmin may be tuned according to architecture characteristics. If nmin = 1, the algorithm is purely recursive, and L = sqrt(C) can be substituted for the call to chol. All remaining work is done in triangular solves and matrix multiplication. At
level i, 2^i calls to matrix-matrix operations are made. In going from level i to i + 1, the number
of such calls doubles, and each problem size is halved. Hence the number of flops done at each
level goes down in a geometric progression by a factor of 4. Because the total number of flops
must remain the same, a large part of the calculations are made at low levels. Since the flop rate
goes down with the problem size, the computation time does not quite go down by the factor
1/4, but for large problems this has little effect on the total efficiency.
To develop a recursive QR algorithm, A is partitioned as A = ( A1 A2 ), where A1 consists
of the first ⌊n/2⌋ columns of A. Assume that the QR factorization of A1 has been computed and
A2 updated as follows:
     
Q_1^T A_1 = [ R11 ;  0 ],   Q_1^T A_2 = Q_1^T [ A12 ;  A22 ] = [ R12 ;  B ].

To obtain the QR factorization of A,


 
A = ( Q1  Q2 ) [ R11  R12 ;  0  R22 ],

it remains to compute the QR factorization of B = ( B1 B2 ) giving Q2 and R22 .


Algorithm 2.6.2 uses recursive calls to perform the CGS QR factorization. If n ≤ nmin, where nmin ≥ 1 is a user-selected parameter, a standard scalar CGS routine is used. If nmin = 1, this can be replaced by setting R = (A^T A)^{1/2} and Q = A/R.

Algorithm 2.6.2 (Recursive CGS Factorization).

function [Q,R] = rcgs(A,nmin)


% RCGS computes the CGS QR factorization
% of A using a divide and conquer method
% -------------------------------------------------
[m,n] = size(A);
if n <= nmin, [Q,R] = cgs(A);
else
n1 = floor(n/2); n2 = n-n1;
% Recursive call
[Q1,R11] = rcgs(A(:,1:n1),nmin);
R12 = Q1' * A(:,n1+1:n);
% Recursive call
[Q2,R22] = rcgs(A(:,n1+1:n) - Q1*R12,nmin);
Q = [Q1, Q2];
R = [R11, R12; zeros(n2,n1),R22];
end
end
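The base-case routine cgs called above is not listed. A minimal unblocked classical Gram-Schmidt routine that could serve as that base case is sketched below; it is an illustrative assumption, not taken from the text.

function [Q,R] = cgs(A)
% CGS computes the thin QR factorization of A by unblocked
% classical Gram-Schmidt (no reorthogonalization).
[m,n] = size(A); Q = zeros(m,n); R = zeros(n,n);
for k = 1:n
R(1:k-1,k) = Q(:,1:k-1)'*A(:,k);
v = A(:,k) - Q(:,1:k-1)*R(1:k-1,k);
R(k,k) = norm(v); Q(:,k) = v/R(k,k);
end
end

With cgs and rcgs on the MATLAB path, [Q,R] = rcgs(A,2) for a random A of full column rank should give a residual norm(A - Q*R) of the order of roundoff, while the loss of orthogonality norm(eye(n) - Q'*Q) behaves as for classical Gram-Schmidt, i.e., it grows with the conditioning of A.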

Algorithm 2.6.3 below performs a recursive QR factorization of A ∈ C^{m×n} (m ≥ n). The matrix Q = I − U T U^T is given in aggregated form, where U ∈ C^{m×n} is unit lower trapezoidal and T ∈ C^{n×n} is upper triangular. The function houseg(a) generates a Householder reflector P = I ∓ τ uu^T such that P a = σ e1, σ = −sign(a1) ∥a∥_2.

Algorithm 2.6.3 (Recursive Householder QR Factorization).

function [U,T,R] = recqr(A)


% RECQR computes recursively the QR factorization
% of the m by n matrix A (m >= n). Output is the
% n by n R and Q = (I - UTU') in aggregated form.
% -------------------------------------------------
[m,n] = size(A);
if n == 1, [U,T,R] = houseg(A);
else
n1 = floor(n/2);
n2 = n - n1; j = n1+1;
% Recursive call
[U1,T1,R1]= recqr(A(1:m,1:n1));
B = A(1:m,j:n) - (U1*T1')*(U1'*A(1:m,j:n));
% Recursive call
[U2,T2,R2] = recqr(B(j:m,1:n2));
R = [R1, B(1:n1,1:n2); zeros(n-n1,n1), R2];
U2 = [zeros(n1,n2); U2];
U = [U1, U2];
T = [T1, -T1*(U1'*U2)*T2; zeros(n2,n1), T2];
end
end
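The generator houseg is not listed above. A minimal sketch consistent with the stated conventions (u normalized to have unit first component, so that U is unit lower trapezoidal, and T holding the scalar τ) might look as follows; it is illustrative only and does not treat a = 0.

function [u,tau,sigma] = houseg(a)
% HOUSEG generates a Householder reflector P = I - tau*u*u' with
% u(1) = 1 such that P*a = sigma*e1, sigma = -sign(a(1))*norm(a).
s = sign(a(1)); if s == 0, s = 1; end
sigma = -s*norm(a);
u = a; u(1) = a(1) - sigma;
u = u/u(1);                  % normalize so that u(1) = 1
tau = 2/(u'*u);
end

With this helper, the output of recqr can be checked via QtA = A - U*(T'*(U'*A)), which applies Q^T = I − U T^T U^T to A; the top n × n block of QtA should agree with R and the remaining rows should be negligible.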

A disadvantage of this algorithm is the overhead in storage and operations caused by the T
matrices. At the end of the recursive QR factorization a T -matrix of size n × n is formed and
stored. This can be avoided by using a hybrid of the partitioned and recursive algorithms, where
the recursive QR algorithm is only used to factorize the blocks in the partitioned algorithm; see
Elmroth and Gustavson [385, 2004].

Notes and references

O’Leary and Whitman [841, 1990] analyze algorithms for Householder and MGS QR factoriza-
tions on distributed MIMD machines using rowwise partitioning schemes. Gunter and Van de
Geijn [553, 2005] present parallel algorithms for QR factorizations. A recursive algorithm for
Cholesky factorization of a matrix in packed storage format is given in Andersen, Waśniewski,
and Gustavson [23, 2001]. An incremental parallel QR factorization code is given by Baboulin
et al. [49, 2009]. Algorithms for QR factorization for multicore architectures are developed by
Buttari et al. [196, 2008], Buttari [195, 2013], and Yeralan et al. [1139, 2017]. Communication
avoiding rank-revealing QR factorizations are developed by Demmel et al. [305, 2015]. Recent
developments in hardware and software for large-scale accelerated multicores are surveyed by
Abdelfattah et al. [2, 2016]. The impact of hardware developments on subroutines for computing
the SVD is surveyed by Dongarra et al. [323, 2018].

2.6.3 BLAS and Linear Algebra Libraries


The core of most applications in scientific computing are subroutines implementing algorithms
for matrix computations such as LU, QR, and SVD factorizations. To be efficient, these have to
be continuously adapted as computer architectures change. The first collection of high quality

software for linear algebra appeared in 1971 in the Handbook by Wilkinson and Reinsch [1123,
1971]. It contained eleven subroutines written in ALGOL 60 for linear systems, linear least
squares, and linear programming and eighteen routines for eigenvalue problems.
EISPACK is a collection of Fortran 77 subroutines for computing eigenvalues and/or eigen-
vectors of several different classes of matrices as well as the SVD; see Smith et al. [1005, 1976]
and Garbow et al. [441, 1977]. The subroutines are primarily based on a collection of ALGOL
procedures in the Handbook mentioned above, although some were updated to increase reliability
and accuracy.
In 1979 a set of standard routines called BLAS (Basic Linear Algebra Subprograms) were
introduced to perform frequently occurring operations; see Lawson et al. [728, 1979]. These
included operations such as scalar product β := xT y (Sdot), vector sums y := αx + y (Saxpy),
scaling y = αx (Sscal), and Euclidean norm β = ∥x∥2 (Snrm2). Both single- and double-
precision real and complex operations were provided. BLAS leads to shorter and clearer code
and aids portability. Furthermore, machine-independent optimization can be obtained by using
tuned BLAS provided by manufacturers.
LINPACK is a collection of Fortran subroutines using BLAS that analyzes and solves lin-
ear equations and linear least squares problems. It solves systems whose matrices are general,
banded, symmetric positive definite and indefinite, triangular, or tridiagonal. It uses QR and SVD
for solving least squares problems. These subroutines were developed from scratch and include
several innovations; see Dongarra et al. [322, 1979].
While successful for vector-processing machines, Level 1 BLAS were found to be unsatisfac-
tory for the cache-based machines introduced in the 1980s. This brought about the development
of Level 2 BLAS that involve operations with one matrix and one or several vectors, e.g.,

y := αAx + βy,
y := αAT x + βy,
A := αxy T + A,
x := T x,
x := T −1 x,

where A is a matrix, T is an upper or lower triangular matrix, and x and y are vectors; see
Dongarra et al. [325, 326, 1988]. Level 2 BLAS involve O(mn) data, where m × n is the
dimension of the matrix involved, and the same number of arithmetic operations.
When RISC-type microprocessors were introduced, Level 2 BLAS failed to achieve adequate
performance, due to a delay in getting data to the arithmetic processors. In Level 3 BLAS,
introduced in 1990, the vectors in Level 2 BLAS are replaced by matrices. Some typical Level 3
BLAS are

C := αAB + βC,
C := αAT B + βC,
C := αAB T + βC,
B := αT B,
B := αT −1 B.

For n × n matrices, Level 3 BLAS use O(n2 ) data but perform O(n3 ) arithmetic operations.
This gives a surface-to-volume effect for the ratio of data movement to operations and avoids
excessive data movements between different parts of the memory hierarchy. Level 3 BLAS can
achieve close to optimal performance on a large variety of computer architectures and makes
it possible to write portable high-performance linear algebra software. Formal definitions for

Level 3 BLAS were published in 2001; see Blackford et al. [155, 2002]. Vendor-supplied highly
efficient machine-specific implementations of BLAS libraries are available, such as Intel Math
Kernel Library (MKL), IBM Scientific Subroutine Library (ESSL), and the open-source BLAS
libraries OpenBLAS and ATLAS.
The kernel in the Level 3 BLAS that gets closest to peak performance is the matrix-matrix
multiply routine GEMM. Typically, it will achieve over 90% of peak on matrices of order greater
than a few hundred. The bulk of the computation of other Level 3 BLAS such as symmetric
matrix-matrix multiply (SYMM), triangular matrix-matrix multiply (TRMM), and symmetric
rank-k update (SYRK), can be expressed as calls to GEMM; see Kågström, Ling, and Van Loan
[679, 1998].
The LAPACK library [27, 1999], first released in 1992, was designed to supersede and inte-
grate LINPACK and EISPACK. The subroutines in LAPACK were restructured to achieve greater
efficiency on both vector processors and shared-memory multiprocessors. LAPACK was incor-
porated into MATLAB in the year 2000. LAPACK is continually improved and updated and
available from http://www.netlib.org/lapack/. Different versions and releases are listed
there as well as information on related projects. A number of parallel BLAS libraries can be used
in LAPACK to take advantage of common techniques for shared-memory parallelization such as
pThreads or OpenMP.
The last decade has been marked by the proliferation of multicore processors and hardware
accelerators that present new challenges in algorithm design. On such machines, costs for com-
munication, i.e., moving data between different levels of memory hierarchies and processors, can
exceed arithmetic costs by orders of magnitude; see Graham, Snir, and Patterson [524, 2004].
This gap between computing power and memory bandwidth keeps increasing; see Abdelfattah
et al. [1, 2021]. A key to high efficiency is locality of reference, which requires splitting opera-
tions into carefully sequenced tasks that operate on small portions of data. Iterative refinement is
exploited by Dongarra and his coworkers for accelerating multicore computing; see Abdelfattah
et al. [2, 2016].
There are two costs associated with communication: bandwidth cost (proportional to the
amount of data moved) and latency cost (proportional to the number of messages in which these
data are sent). Ballard et al. [65, 2011] prove bounds on the minimum amount of communication
needed for a wide variety of matrix factorizations including Cholesky and QR factorizations.
These lower bounds generalize earlier bounds by Irony, Toledo, and Tiskin [657, 2004] for matrix
products. New linear algebra algorithms with reduced communication costs are discussed and
examples given that attain these lower bounds.
ScaLAPACK is an extension of LAPACK designed to run efficiently on newer MIMD dis-
tributed memory architectures; see Choi et al. [243, 1996] and Blackford et al. [154, 1997].
ScaLAPACK builds on distributed memory versions of parallel BLAS (PBLAS) and on a set
of Basic Linear Algebra Communication Subprograms (BLACS) for executing communication
tasks. This makes the top level code of ScaLAPACK look quite similar to the LAPACK code.
Matrices are arranged in a two-dimensional block-cyclic layout using a prescribed block size.
New implementations of algorithms are available via the open-source libraries PLASMA and
MAGMA; see Agullo et al. [10, 2009].
Chapter 3

Generalized and Constrained Least Squares

3.1 Generalized Least Squares Problems


3.1.1 Generalized Least Squares
Many applications of least squares involve a general Gauss–Markov linear model

Ax + ϵ = b, V(ϵ) = σ 2 V, (3.1.1)

with Hermitian positive definite error covariance matrix V ̸= I. Then the following result holds.

Theorem 3.1.1. Consider a Gauss–Markov linear model Ax = b + e with A ∈ Cm×n of


rank(A) = n and symmetric positive definite error covariance matrix V(e) = σ 2 V ∈ Cm×m .
Then the best unbiased linear estimate of x is the solution of the generalized least squares (GLS)
problem
min_x (b − Ax)^H V^{-1} (b − Ax).   (3.1.2)

The solution x̂ satisfies the generalized normal equations

A^H V^{-1} A x = A^H V^{-1} b   (3.1.3)

or, equivalently, the orthogonality condition A^H V^{-1} r = 0, where r = b − Ax.


The covariance matrix of the estimate of x is

V(x̂) = σ^2 (A^H V^{-1} A)^{-1} ∈ C^{n×n},   (3.1.4)

and an unbiased estimate of σ^2 is

s^2 = r̂^H V^{-1} r̂ / (m − n),   r̂ = b − A x̂.   (3.1.5)

Proof. Since V is positive definite, the Cholesky factorization V = LL^H exists. Then A^H V^{-1} A = (L^{-1}A)^H L^{-1}A, and problem (3.1.2) can be reformulated as

min_x ∥L^{-1}(Ax − b)∥_2.   (3.1.6)

This is a standard least squares problem min_x ∥Ãx − b̃∥, where Ã = L^{-1}A and b̃ = L^{-1}b. The proof now follows by replacing A and b in Theorem 1.1.4 with Ã and b̃.


In the following we assume that A, b, and V are real. The GLS problem can be solved by first computing V = LL^T and then solving LÃ = A and L b̃ = b. The normal equations Ã^T Ã x = Ã^T b̃ are formed and solved by Cholesky factorization. Alternatively, using the QR factorization

L^{-1} A = Q [ R ;  0 ],   Q = ( Q1  Q2 ),   (3.1.7)

we get the solution x = R^{-1} Q_1^T L^{-1} b.
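A minimal MATLAB sketch of this Cholesky-plus-QR approach (with an arbitrary symmetric positive definite test matrix V; all data below are illustrative):

m = 30; n = 5;
A = randn(m,n); b = randn(m,1);
V = gallery('minij',m);          % a symmetric positive definite covariance matrix
L = chol(V,'lower');             % V = L*L'
At = L\A; bt = L\b;              % At = inv(L)*A, bt = inv(L)*b
[Q1,R] = qr(At,0);               % thin QR factorization, cf. (3.1.7)
x = R\(Q1'*bt);                  % GLS solution x = inv(R)*Q1'*inv(L)*b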


Computing the Cholesky factorization V = LL^T requires about m^3/3 flops for a dense matrix V. Forming Ã = L^{-1}A and b̃ = L^{-1}b requires a further m^2(n + 1) flops. This may be
prohibitive unless V has a favorable structure. When V is a banded matrix with small bandwidth
w, the work in the Cholesky factorization is only about mw(w + 3) flops. Frequently V is
diagonal. Such weighted least squares problems are treated in Section 3.2.1.
For a consistent linear system AT y = c of full row rank, the generalized least-norm (GLN)
problem is
min_y y^T V y   subject to   A^T y = c.   (3.1.8)

The corresponding generalized normal equations of the second kind are

AT V −1 Az = c, y = V −1 Az. (3.1.9)

If V = LL^T is the Cholesky factorization, then y^T V y = ∥L^T y∥_2^2. Hence problem (3.1.8) is equivalent to seeking the minimum ℓ2-norm solution of the system Ã^T ỹ = c, where

Ã = L^{-1} A,   ỹ = L^T y.

Alternatively, using the QR factorization (3.1.7) gives

y = L−T Q1 (R−T c). (3.1.10)

Problems GLS and GLN are special cases of the generalized augmented system
      
M [ y ;  x ] ≡ [ V  A ;  A^T  0 ] [ y ;  x ] = [ b ;  c ].   (3.1.11)

This system is nonsingular if and only if rank(A) = n and

N (V ) ∩ N (AT ) = {0}.

If V is positive definite, then by Sylvester’s law of inertia (see Horn and Johnson [639, 1985]) it
follows that the matrix M ∈ R(m+n)×(m+n) of system (3.1.11) has m positive and n negative
eigenvalues. For this reason, (3.1.11) is called a saddle point system. Eliminating y in (3.1.11)
gives the generalized normal equations for x,

AT V −1 Ax = AT V −1 b − c. (3.1.12)

Such systems represent the equilibrium of a physical system and occur in many applications; see
Strang [1043, 1988].

Theorem 3.1.2. If A ∈ Rm×n has full column rank and V ∈ Rm×m is symmetric positive defi-
nite, the augmented system (3.1.11) is nonsingular and gives the first-order optimality conditions
for the generalized least squares problem

min_{x∈R^n} (1/2) r^T V^{-1} r + c^T x,   r = b − Ax,   (3.1.13)

and for the dual equality-constrained quadratic programming problem

min_{y∈R^m} (1/2) y^T V y − b^T y   subject to   A^T y = c.   (3.1.14)

Proof. System (3.1.11) can be obtained by differentiating (3.1.13). This gives AT V −1 (b−Ax) =
c, where V y = b − Ax. It can also be obtained by differentiating the Lagrangian

L(x, y) = (1/2) y^T V y − b^T y + x^T (A^T y − c)

for (3.1.14) and equating to zero. Here x is the vector of Lagrange multipliers. If c = 0 in
(3.1.13), then x is the GLS solution of (1.1.2). If b = 0 in (3.1.14), y is the GLS solution with
minimum weighted norm ∥y∥V = (y T V y)1/2 of the consistent underdetermined linear system
AT y = c.

It follows that any algorithm for solving the GLS problem (3.1.13) is valid also for the qua-
dratic programming problem (3.1.14) and vice versa. An explicit expression for the inverse of
augmented matrix M is obtained from the Schur–Banachiewicz formula (3.3.6),
−1
V −1 (I − P ) V −1 AS −1
  
−1 V A
M = = , (3.1.15)
AT 0 S −1 AT V −1 −S −1

where
S = AT V −1 A, P = AS −1 (V −1 A)T . (3.1.16)

In terms of the QR factorization (3.1.7), the inverse is

M^{-1} = [ L^{-T} Q2 Q_2^T L^{-1}   L^{-T} Q1 R^{-T} ;  R^{-1} Q_1^T L^{-1}   −R^{-1} R^{-T} ].   (3.1.17)

The solution of the augmented system (3.1.11) becomes y = L^{-T} u, x = R^{-1} v, where

[ u ;  v ] = [ Q2 Q_2^T   Q1 ;  Q_1^T   −I ] [ L^{-1} b ;  R^{-T} c ].   (3.1.18)

Notes and references

In constrained optimization the augmented system is called the Karush–Kuhn–Tucker (KKT)


system. Such systems arise in a wide variety of applications and are often large. Numerous
solution methods, both direct and iterative, have been suggested. An excellent survey is given
by Benzi, Golub, and Liesen [108, 2005]. Solution methods for augmented systems where the
(1,1)-block is indefinite are developed by Golub and Greif [491, 2003].

3.1.2 Oblique Projectors


A matrix P ∈ Rm×m that satisfies P 2 = P and P T ̸= P is an oblique projector. It splits any
vector b ∈ Rm into a sum b = P b + (I − P )b,

P b ∈ R(P),   (I − P)b ∈ N(P).

Consider first the two-dimensional case. Let u and v be unit vectors in R2 such that cos θ =
v^T u > 0. Then

P = u (v^T u)^{-1} v^T = (1/cos θ) u v^T
is the oblique projector onto u along the orthogonal complement of v. Similarly, P T =
v(uT v)−1 uT is the oblique projector onto v along the orthogonal complement of u. If u = v,
then P is an orthogonal projector and cos θ = 1. When v is almost orthogonal to u, then
∥P ∥2 = 1/ cos θ becomes large.
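A small numerical illustration of the two-dimensional case (the angle and vectors are chosen arbitrarily):

theta = pi/3;                      % angle between u and v
u = [1; 0]; v = [cos(theta); sin(theta)];
P = u*v'/(v'*u);                   % oblique projector onto u along the complement of v
[norm(P), 1/cos(theta)]            % both equal 2; P*P equals P up to roundoff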
It is easily verified that if V ̸= I is positive definite, the matrix in (3.1.16),

P = A(AT V −1 A)−1 (V −1 A)T , (3.1.19)

is an oblique projector onto R(A) along the orthogonal complement of R(V^{-1}A).

Theorem 3.1.3. Let X and Y be two complementary subspaces in Cn ,

X ∩ Y = {0},   X + Y = C^n.   (3.1.20)

Let U1 and V1 be orthonormal matrices such that R(U1 ) = X and R(V1 ) = Y ⊥ , where Y ⊥ is
the orthogonal complement of Y. Then the oblique projector onto X along Y is

PX ,Y = U1 (V1T U1 )−1 V1T . (3.1.21)

Similarly, let U2 and V2 be orthonormal matrices such that X ⊥ = R(U2 ) and Y = R(V2 ). Then
PY,X = V2 (U2T V2 )−1 U2T and

PX ,Y + PY,X = I, PXT ,Y = PY ⊥ ,X ⊥ . (3.1.22)

Proof. We have PX2 ,Y = U1 (V1T U1 )−1 V1T U1 (V1T U1 )−1 V1T = PX ,Y . This shows that PX ,Y is a
projector onto X . Similarly, PY,X = V2 (U2T V2 )−1 U2T is the projector onto Y. To prove the first
identity in (3.1.22), we first note that the assumption implies V1T V2 = 0 and U2T U1 = 0. Then

P_{X,Y} + P_{Y,X} = U1 (V_1^T U1)^{-1} V_1^T + V2 (U_2^T V2)^{-1} U_2^T
  = ( U1  V2 ) [ ( V1  U2 )^T ( U1  V2 ) ]^{-1} ( V1  U2 )^T
  = ( U1  V2 ) ( U1  V2 )^{-1} ( V1  U2 )^{-T} ( V1  U2 )^T = I.

The second identity follows from the expression PXT ,Y = V1 (U1T V1 )−1 U1T .

If P is an orthogonal projector, then

∥P v∥2 = ∥QT1 v∥2 ≤ ∥v∥2 ∀ v ∈ Cn , (3.1.23)

where equality holds for all vectors in R(P ). It follows that ∥P ∥2 = 1. The converse is also
true; a projector P is an orthogonal projector only if (3.1.23) holds. The spectral norm of an
oblique projector can be exactly computed.

Lemma 3.1.4. Let P ∈ Rn×n be an oblique projector. Then

∥P ∥2 = ∥I − P ∥2 = 1/c, c = cos θmax (V1T U1 ), (3.1.24)

where θmax is the largest principal angle between R(P ) and N (P )⊥ .

Proof. See Wedin [1110, 1985, Lemma 5.1] or Stewart [1032, 2011].

Notes and references

An excellent introduction to oblique projectors and their representations is given by Wedin [1110,
1985]. Afriat [9, 1957] gives an exposition of orthogonal and oblique projectors. Relations
between orthogonal and oblique projectors are studied in Greville [537, 1974] and Černý [215,
2009]. Numerical properties of oblique projectors are treated by Stewart [1032, 2011]. Szyld
[1055, 2006] surveys different proofs of the equality ∥P ∥ = ∥I − P ∥ for norms of oblique
projectors in Hilbert spaces.

3.1.3 Elliptic MGS and Householder Methods


For a given symmetric positive definite matrix G,

(x, y)G := y T Gx, ∥x∥G = (xT Gx)1/2 (3.1.25)

defines a scalar product and the corresponding norm. Since the unit ball {x | ∥x∥G ≤ 1} is an
ellipsoid, ∥ · ∥G is called an elliptic norm. A generalized Cauchy–Schwarz inequality holds:

|(x, y)G | ≤ ∥x∥G ∥y∥G . (3.1.26)

Two vectors x and y are said to be G-orthogonal if (x, y)G = 0, and a matrix Q ∈ Rm×n is
G-orthonormal if QT GQ = I.
If A = (a1 , . . . , an ) ∈ Rm×n has full column rank, then an elliptic modified Gram–Schmidt
(MGS) algorithm can be used to compute a G-orthonormal matrix Q1 = (q1 , . . . , qn ) and an
upper triangular matrix R such that

A = Q1 R, QT1 GQ1 = In . (3.1.27)

An elementary elliptic projector has the form

P = (I − qq T G), q T Gq = 1 (3.1.28)

and satisfies P 2 = I − 2qq T G + q(q T Gq)q T G = P . It is easily verified that for any vector a,
q T G(P a) = 0, i.e., P a is G-orthogonal to q. Note that P is not symmetric and therefore is an
oblique projector; see Section 3.1.2. Furthermore,

G1/2 P G−1/2 = I − q̃ q̃ T , q̃ = G1/2 q

is an orthogonal projector.

The following row-oriented MGS algorithm computes the factorization (3.1.27).

Algorithm 3.1.1 (Row-Oriented Elliptic MGS).


function [Q,R] = emgs(A,G)
% EMGS computes the thin QR factorization A = Q*R with
% Q'*G*Q = I by row-oriented elliptic MGS.
% -----------------------------------
[m,n] = size(A); Q = A; R = zeros(n,n);
for k = 1:n
qk = Q(:,k); pk = G*qk;
R(k,k) = sqrt(pk'*qk);
qk = qk/R(k,k); Q(:,k) = qk;
pk = pk/R(k,k);
for j = k+1:n
R(k,j) = pk'*Q(:,j);
Q(:,j) = Q(:,j) - R(k,j)*qk;
end
end
end
In addition to the 2mn^2 flops for the standard MGS algorithms, elliptic MGS requires 2m^2 n flops for n matrix-vector products with G. If m ≫ n, these operations can dominate the overall arithmetic cost. If a factorization G = B^T B ∈ R^{m×m} is known, then

∥x∥_G = (x^H B^T B x)^{1/2} = ∥Bx∥_2,

and the operations with G can be replaced by operations with B and B^T. A column-oriented MGS using the minimal number of n matrix-vector products requires storing the n auxiliary vectors p_k = G q_k, k = 1, . . . , n.

Algorithm 3.1.2 (Column-Oriented Elliptic MGS).


function [Q,R] = emgsc(A,G)
% EMGSC computes the thin QR factorization A = Q*R with
% Q'*G*Q = I by column-oriented elliptic MGS.
% -----------------------------------
[m,n] = size(A); Q = A;
R = zeros(n,n); P = zeros(m,n); % P(:,k) stores the auxiliary vector G*Q(:,k)
for k = 1:n
for i = 1:k-1
R(i,k) = P(:,i)'*Q(:,k);
Q(:,k) = Q(:,k) - R(i,k)*Q(:,i);
end
P(:,k) = G*Q(:,k);
R(k,k) = sqrt(P(:,k)'*Q(:,k));
Q(:,k) = Q(:,k)/R(k,k);
P(:,k) = P(:,k)/R(k,k);
end
end
The GLS problem (3.1.2) or, equivalently, minx ∥b − Ax∥G , G = V −1 , can be solved by
an elliptic MGS QR factorization. If applied to the extended matrix ( A b ), this gives the

factorization

( A  b ) = ( Q1  q_{n+1} ) [ R  z ;  0  ρ ].   (3.1.29)
It follows that Ax − b = Q1 (Rx − z) − ρqn+1 , where qn+1 is G-orthogonal to Q1 . Hence
∥b − Ax∥G is minimized when Rx = z, and the solution and residual are obtained from

Rx = z, r = b − Ax = ρqn+1 . (3.1.30)

Extra right-hand sides can be treated later by updating the factorization A = Q1 R using the
column-oriented version of elliptic MGS.
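A minimal sketch of this use of elliptic MGS (assuming the emgs routine above and forming G = V^{-1} explicitly, which is acceptable only for illustration; the data are arbitrary):

m = 20; n = 4;
A = randn(m,n); b = randn(m,1);
V = gallery('lehmer',m);         % a symmetric positive definite covariance matrix
G = inv(V);                      % G = V^{-1}
[Qe,Re] = emgs([A b],G);         % factorization (3.1.29) of the extended matrix
R = Re(1:n,1:n); z = Re(1:n,n+1); rho = Re(n+1,n+1);
x = R\z;                         % GLS solution, cf. (3.1.30)
r = rho*Qe(:,n+1);               % residual b - A*x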
Gulliksson and Wedin [551, 1992] develop an elliptic Householder QR factorization. An
elliptic Householder reflection has the form

H = (I − βuuT G), β = 2/(uT Gu); (3.1.31)

cf. the elementary projection operator (3.1.28). The product of an elliptic Householder reflection
H with a vector a is given by

Ha = (I − βuuT G)a = a − β(uT Ga)u.

It is easily verified that H 2 = I and hence H −1 = H. However, H is neither symmetric nor


G-orthogonal, but
G1/2 HG−1/2 = I − β ũũT , ũ = G1/2 u
is an orthogonal reflection. It is easily verified that

H T GH = (I − βGuuT )G(I − βuuT G) = G. (3.1.32)

Such matrices are called G-invariant. Clearly, the unit matrix I is G-invariant, and a product
of G-invariant matrices H = H1 H2 · · · Hn is again G-invariant. This property characterizes
transformations that leave the G-norm invariant:

∥Hx∥2G = (Hx)T GHx = xT Gx = ∥x∥G .

Hence, minx ∥Ax − b∥G and minx ∥H(Ax − b)∥G have the same solution.
To develop a Householder QR algorithm for solving minx ∥Ax − b∥G , we construct a se-
quence of generalized reflections Hi such that
   
Hn · · · H1 (Ax − b) = [ R ;  0 ] x − [ c1 ;  c2 ],   (3.1.33)

where R is upper triangular and nonsingular. Then an equivalent problem is minx ∥Rx − c1 ∥G
with solution x = R−1 c1 . As in the standard Householder method, this only requires that we
construct a generalized Householder reflection H that maps a given vector a onto a multiple of
the unit vector e1 :
Ha = a − β(uT Ga)u = ±σe1 . (3.1.34)
By the invariance of the G-norm,

σ∥e1 ∥G = ∥a∥G , ∥e1 ∥G = (eT1 Ge1 )1/2 ,

and from (3.1.34) we have u = a ∓ σe1 . Hence β = 2/(uT Gu), where

uT Gu = (a ∓ σe1 )T G(a ∓ σe1 ) = 2(∥a∥2G ∓ σaTGe1 ).

For stability, the sign should be chosen to maximize uT Gu.
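A minimal sketch of this construction (the helper name ghouse and the handling of the sign choice are illustrative assumptions):

function [u,beta,sigma] = ghouse(a,G)
% Constructs an elliptic Householder reflection H = I - beta*u*u'*G
% with H*a = sigma*e1; cf. (3.1.31) and (3.1.34).
e1 = zeros(length(a),1); e1(1) = 1;
sigma = sqrt((a'*G*a)/(e1'*G*e1));   % |sigma| from invariance of the G-norm
if a'*G*e1 > 0, sigma = -sigma; end  % sign chosen to maximize u'*G*u
u = a - sigma*e1;
beta = 2/(u'*G*u);
end

Then H*a = a - beta*(u'*G*a)*u reproduces sigma*e1 up to roundoff.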



Notes and references


Gram–Schmidt orthogonalization in elliptic norms is analyzed by Thomas and Zahar [1059,
1991], [1060, 1992]. The numerical stability of such algorithms is studied by Rozložník et al.
[938, 2012]. Imakura and Yamamoto [656, 2019] compare different variants of elliptic MGS
with respect to efficiency and accuracy.

3.1.4 Generalized QR Factorization


If the covariance matrix is given in factored form

V = BB T , B ∈ Rm×p , p ≤ m, (3.1.35)

the GLS problem minx (b − Ax)T V −1 (b − Ax) can be reformulated as a standard least squares
problem min_x ∥B^{-1}Ax − B^{-1}b∥_2. However, when B is ill-conditioned, computing Ã = B^{-1}A and b̃ = B^{-1}b may lead to a loss of accuracy. Paige [858, 853, 1979] avoids this by using the equivalent formulation
min_{v,x} ∥v∥_2^2   subject to   Ax + Bv = b.   (3.1.36)

Paige’s method can handle rank-deficiency in both A and V , but for simplicity we assume in
the following that A has full column rank n and V is positive definite. Paige’s method starts by
computing the QR factorization of A and applies QT to b and B:
   
Q^T ( A  b ) = [ R  c1 ;  0  c2 ],   Q^T B = [ C1 ;  C2 ],   (3.1.37)

where the first block rows have n rows and the second block rows have m − n rows.

The constraint in (3.1.36) can then be written in partitioned form


     
[ C1 ;  C2 ] v + [ R ;  0 ] x = [ c1 ;  c2 ].   (3.1.38)

For any vector v ∈ Rm , x can always be determined so that the first block of these equations is
satisfied. Next, an orthogonal matrix P ∈ Rm×m is determined such that
 
P^T C_2^T = [ 0 ;  S^T ]  (n and m − n rows, respectively),   (3.1.39)

and S is upper triangular. By the nonsingularity of B it follows that C2 will have linearly inde-
pendent rows, and hence S will be nonsingular. (Note that after rows and columns are reversed,
(3.1.39) is just the QR factorization of C2T .) Now, the second set of constraints in (3.1.38) be-
comes

S u2 = c2,   where   P^T v = u = [ u1 ;  u2 ],  u1 ∈ R^n,  u2 ∈ R^{m−n}.   (3.1.40)
Since P is orthogonal, ∥v∥2 = ∥u∥2 . Hence the minimum in (3.1.36) is achieved by taking

u1 = 0, u2 = S −1 c2 , v = P2 u2 ,

where P = (P1  P2). Finally, x is obtained by solving the triangular system Rx = c1 − C1 v in (3.1.38). It can be shown that this gives an unbiased estimate of x for the model (3.1.36) with covariance matrix σ^2 B^T B, where

B = L R^{-T},   L = P_1^T C_1.   (3.1.41)
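A minimal MATLAB sketch of Paige's method for the case where A has full column rank and B is square and nonsingular (it uses a standard QR factorization of C2^T instead of the row- and column-reversed form (3.1.39); the function name is illustrative):

function x = gls_paige(A,B,b)
% Solves min ||v||_2 subject to A*x + B*v = b, cf. (3.1.36).
[m,n] = size(A);
[Q,RA] = qr(A);                  % full QR factorization of A
R  = RA(1:n,1:n);
c  = Q'*b;  C = Q'*B;            % cf. (3.1.37)
c1 = c(1:n);    C1 = C(1:n,:);
c2 = c(n+1:m);  C2 = C(n+1:m,:);
[P,T] = qr(C2');                 % QR factorization of C2'
L  = T(1:m-n,1:m-n)';            % C2 = L*P(:,1:m-n)' with L lower triangular
v  = P(:,1:m-n)*(L\c2);          % minimum-norm solution of C2*v = c2
x  = R\(c1 - C1*v);              % first block row of (3.1.38)
end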



Paige’s algorithm requires a total of about 4m3 /3 + 2m2 n flops. If m ≫ n, the work in
the QR factorization of C2 dominates. Paige [858, 1979] gives a perturbation analysis for the
generalized least squares problem (3.1.6) by using the formulation (3.1.36). An error analysis
shows that the algorithm is stable. The algorithm can be generalized in a straightforward way to
rank-deficient A and B.
If B has been obtained from the Cholesky factorization of V , it is advantageous to carry out
the two QR factorizations in (3.1.37) and (3.1.39) together, maintaining the lower triangular form
throughout by a careful sequencing of the plane rotations. When there are several problems of
the form (3.1.36) with constant A but variable B, the QR factorization of A can be computed
once and for all. When m = n this reduces the work for solving an additional problem from
(10/3)n3 to 2n3 .
When computing the QR factorization of the product C = BA or quotient C = AB −1 =
QR the explicit formation of C should be avoided in order to obtain backward stable results. The
generalized QR factorization (GQR) of a pair of matrices A ∈ Rm×n and B ∈ Rm×p , intro-
duced by Hammarling [564, 1987], is useful for solving generalized equality constrained least
squares problems and in the preprocessing stage for computing the generalized SVD (GSVD).
When B is nonsingular, GQR implicitly computes

C = B −1 A = QR, A ∈ Rm×n , B ∈ Rm×m (3.1.42)

without forming C. The GQR is defined for any matrices A and B with the same number of
rows. In the general case the construction of the GQR proceeds in two steps. A rank-revealing
QR factorization of A is first computed,
 
Q^T A Π = [ U11  U12 ;  0  0 ]  (r and m − r block rows),   r = rank(A),   (3.1.43)

where Q is orthogonal, Π is a permutation matrix, and U11 ∈ Rr×r is upper triangular and
nonsingular. Then QT is applied to B:
 
Q^T B = [ B1 ;  B2 ]  (r and m − r block rows).

Next, an orthogonal matrix Q̃ is constructed so that


 
Q^T B Q̃ = R = [ R11  R12  0 ;  0  R22  0 ;  0  0  0 ]  (r, q, and m − t block rows),   Q̃ = ( Q1  Q2  Q3 ),   (3.1.44)

where rank(B2 ) = q, t = r + q, R11 ∈ Rr×k1 and R22 ∈ R(n−q)×k2 are upper trapezoidal, and
rank(R11 ) = k1 , rank(R22 ) = k2 . If rank(B) = p, there will be no zero columns. Note that
row interchanges can be performed on the block B2 if Q is modified accordingly.
If B is square (p = m) and nonsingular, then so is R, and from (3.1.43)–(3.1.44) we have
Q̃^T (B^{-1} A) Π = [ R11^{-1} U ;  0 ] = S,   U = ( U11  U12 ),   (3.1.45)

which is the QR factorization of B −1 AΠ. Even in this case one should avoid computing S
because in most applications it is not needed, and it is usually more effective to use R11 and U
separately. Another advantage of keeping R11 and U is that the corresponding decompositions
(3.1.43)–(3.1.44) can be updated by the standard methods when columns or rows of A and B
are added or deleted. Even when S is defined by (3.1.45) it cannot generally be updated in a
stable way.

When B is singular or not square, the GQR can be defined as the QR factorization of B † A,
where B † denotes the pseudoinverse of B. However, as pointed out by Paige [861, 1990], this
does not produce the algebraically correct solution for many applications.
The product QR factorization (PQR) of A ∈ Rm×n and B ∈ Rm×p can be computed in a
similar manner. We use (3.1.43) as the first step and replace (3.1.44) by
 
Q^T B Q̃ = [ L11  0  0 ;  L21  L22  0 ;  0  0  0 ]  (r, q, and m − t block rows),   Q̃ = ( Q1  Q2  Q3 ),   (3.1.46)

where L11 ∈ R^{q×r1}, L22 ∈ R^{(n−q)×r2} and rank(L11) = r1, rank(L22) = r2. This gives the PQR because

Q̃^T B^T A = [ L11^T U ;  0 ] = [ L^T ;  0 ],   (3.1.47)
with LT ∈ Rr1 ×n upper trapezoidal. Again, one should avoid computing LT because it is not
needed in most applications, and more accurate methods are usually obtained if L11 and U are
kept separate. (A trivial example is the case when B = A.)
The GQR factorization is given by
A = QR,   B = Q T Z,   (3.1.48)
where Q ∈ Rm×m and Z ∈ Rp×p are orthogonal, and R and T have one of the forms
 
R = [ R11 ;  0 ]   (m ≥ n),   R = ( R11  R12 )   (m < n)

and

T = ( 0  T12 )   (m ≤ p),   T = [ T11 ;  T21 ]   (m > p).
If B is square and nonsingular, GQR implicitly gives the QR factorization of B −1 A. There is a
similar generalized RQ factorization related to the QR factorization of AB −1 . These generalized
decompositions and their applications are discussed in Anderson, Bai, and Dongarra [25, 1992].

3.1.5 Generalized SVD


The generalized SVD (GSVD) is a generalization of the SVD to a pair of matrices A and B
with the same number of columns (or rows). For B = I the GSVD reduces to the SVD of
A. The GSVD was introduced by Van Loan [1079, 1976], who used it to analyze matrix pencils
ATA−λB TB. Paige and Saunders [856, 1981] extended the GSVD and gave it a computationally
more reliable form.
The GSVD is closely related to the CS decomposition defined in Section 1.2.4. To see this,
let the QR factorization of the matrix M be
   
M = [ A ;  B ] = [ Q11 ;  Q21 ] R,   (3.1.49)
where the matrices are partitioned conformally. For simplicity, assume that both A and B are
square matrices and that rank(M ) = n. These assumptions are satisfied in many applications.
From the CS decomposition in Section 1.2.4 it follows that
   
[ A ;  B ] = [ U1 C V^T R ;  U2 S V^T R ],
where C = diag (c1 , . . . , cn ), S = diag (s1 , . . . , sn ), and C 2 + S 2 = In . This is essentially the
GSVD of the matrix pair (A, B).

Theorem 3.1.5 (Generalized SVD). Let (A, B) with A ∈ R^{m×n} and B ∈ R^{p×n} be a given matrix pair with M = ( A^T  B^T )^T and rank(M) = k. Then there exist orthogonal matrices U ∈ R^{m×m}, V ∈ R^{p×p}, Q ∈ R^{n×n}, and W ∈ R^{k×k} such that

AQ = U ΣA ( Z  0 ),   BQ = V ΣB ( Z  0 ),   (3.1.50)

where the nonzero singular values of Z = W^T R equal those of M, and R ∈ R^{k×k} is upper triangular. Moreover,

ΣA = diag( IA, DA, OA )  (diagonal blocks with r, s, and m − r − s rows),
ΣB = diag( OB, DB, IB )  (diagonal blocks with m − k − r, s, and k − r − s rows)   (3.1.51)
are diagonal matrices,
DA = diag (αr+1 , . . . , αr+s ), 1 > αr+1 ≥ · · · ≥ αr+s ,
DB = diag (βr+1 , . . . , βr+s ), 0 < βr+1 ≤ · · · ≤ βr+s , (3.1.52)
and αi2 + βi2 = 1, i = r + 1, . . . , r + s. Furthermore, IA and IB are square unit matrices, and
OA ∈ R(m−r−s)×(k−r−s) and OB ∈ R(m−k−r)×r are zero matrices with possibly no rows or
no columns.

Proof. See Paige and Saunders [856, 1981].

Note that in (3.1.51) the column partitionings of ΣA and ΣB are the same. We can define
k nontrivial singular value pairs (αi , βi ) of (A, B), where αi = 1, βi = 0, i = 1, . . . , r, and
αi = 0, βi = 1, i = r + s + 1, . . . , k. Perturbation theory for generalized singular values by
Sun [1047, 1983] and Li [742, 1993] shows that as in the SVD, αi and βi are well-conditioned
with respect to perturbations of A and B.
The GSVD algorithm of Bai and Demmel [59, 1993] requires about 2mn2 + 15n3 flops. It
uses a preprocessing step for reducing A and B to upper triangular form and gives a new stable
and accurate 2 × 2 triangular GSVD algorithm. Another approach by Bai and Zha [63, 1993]
starts by extracting a regular pair (A, B), with A and B upper triangular and B nonsingular.
A satisfying aspect of the formulation of GSVD in Theorem 3.1.5 is that A and B are treated
identically, and no assumptions are made on the dimension and rank of A and B. For many
applications this generality is not needed, and the following simplified form similar to that in
Van Loan [1079, 1976] can be used.

Corollary 3.1.6. Let (A, B) with A ∈ R^{m×n} and B ∈ R^{p×n} be a given matrix pair with m ≥ n ≥ p and rank(M) = n, where M = ( A^T  B^T )^T. Then there exist orthogonal matrices U ∈ R^{m×m}, V ∈ R^{p×p} and a nonsingular matrix Z ∈ R^{n×n} with singular values equal to those of M, such that

A = U [ DA ;  0 ] Z,   B = V ( DB  0 ) Z,   (3.1.53)

where DA = diag(α1, . . . , αn), DB = diag(β1, . . . , βp), αi^2 + βi^2 = 1, i = 1, . . . , p, and 0 < α1 ≤ · · · ≤ αn ≤ 1, 1 ≥ β1 ≥ · · · ≥ βp > 0.
The generalized singular values of (A, B) are the ratios σi = αi /βi . Setting W = Z −1 =
(w1 , . . . , wk ), we get from (3.1.53)
Awi = αi ui , i = 1, . . . , n, Bwi = βi vi , i = 1, . . . , p,

and Bwi = 0, i = p + 1, . . . , n. These n − p vectors form a basis for the nullspace of B.


Furthermore, we have the orthogonality relations

(Awi )TAwj = 0, i ̸= j.

When B ∈ Rn×n is square and nonsingular the GSVD of A and B reduces to the SVD of AB −1 ,
also called the quotient SVD (QSVD). Similarly the SVD of AB is the product SVD. If B were
ill-conditioned, then forming AB −1 (or AB) would give unnecessarily large errors in the SVD,
so this approach should be avoided. Note also that when B is not square or is singular, the SVD
of AB † does not always correspond to the GSVD.
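For numerical experiments, MATLAB's gsvd function computes a GSVD of a matrix pair; a small illustration is given below. The exact ordering and scaling conventions of the factors vary between releases, so the residual checks are the safest way to interpret the output.

A = randn(8,4); B = randn(5,4);
[U,V,X,C,S] = gsvd(A,B);               % A = U*C*X', B = V*S*X', C'*C + S'*S = I
sigma = sqrt(diag(C'*C)./diag(S'*S));  % generalized singular values alpha_i/beta_i
[norm(A - U*C*X'), norm(B - V*S*X')]   % both of the order of roundoff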
An algorithm for computing the QSVD of A ∈ Rm×n and B ∈ Rp×n was proposed by
Paige [854, 1986]. In the first phase, A and B are reduced to generalized triangular form by an
RRQR factorization of A with column pivoting P . Next, a QR factorization is performed on BP
in which column pivoting is used on the last p − r rows and n − r columns of B to reveal the
rank q of this block. In the second phase, two n × n upper triangular matrices are computed by
a Kogbetliantz-type algorithm; see Section 7.2.2. Such algorithms can be extended to compute
the GSVD for products and quotients of several matrices.
The generalized linear model where both A and V are allowed to be rank-deficient can be
analysed by the GSVD; see Paige [860, 1985]. We use the model (3.1.36) and assume that
V = BB T is given in factored form, where B ∈ Rm×p , p ≤ m. Since A and B have the same
number of rows, the GSVD is applied to AT and B T .
Let r = rank(A), s = rank(B), k = rank(( A B )), where r ≤ n, s ≤ p, k ≤ r + s. Then
there exist orthogonal matrices U ∈ Rn×n , V ∈ Rp×p and a matrix Z ∈ Rm×k of rank k such
that
   
AU = Z [ 0  0  0 ;  0  DA  0 ;  0  0  I_{k−s} ],   BV = Z [ I_{k−r}  0  0 ;  0  DB  0 ;  0  0  0 ],   (3.1.54)

where the row blocks of both factors have k − r, q, and k − s rows, the column blocks have n − r, q, and k − s columns in the first factor and k − r, q, and p − s columns in the second, q = r + s − k, D_A^2 + D_B^2 = I_q, and

DA = diag(α1 , . . . , αq ) > 0, 0 < α1 ≤ · · · ≤ αq < 1, (3.1.55)


DB = diag(β1 , . . . , βq ) > 0, 1 > β1 ≥ · · · ≥ βq > 0.

Note that the row partitionings in (3.1.54) are the same. Let the orthogonal matrices U =
( U1 U2 U3 ) and V = ( V1 V2 V3 ) be partitioned conformally with the column blocks
on the right-hand sides in (3.1.54). Then AU1 = 0, BV3 = 0, i.e., U1 and V3 span the nullspace
of A and B, respectively. The decomposition (3.1.54) separates out the common column space
of A and B. Since AU2 = ZDA and BV2 = ZDB , we have AU2 DB = BV2 DA , and it follows
that
R(AU2 ) = R(BV2 ) = R(A) ∩ R(B)
and has dimension q. For the special case B = I we have s = k = m and then q = rank(A).
Now let the QR factorization of Z in (3.1.54) be
 
Q^T Z = [ R ;  0 ],   Q = ( Q1  Q2 ),   (3.1.56)

where R ∈ Rk×k is upper triangular and nonsingular. In the model (3.1.36) we make the orthog-
onal transformations
x̃ = U T x, ũ = V T u. (3.1.57)

Then, from (3.1.54) and (3.1.56) the model (3.1.36) becomes


      
[ R ;  0 ] ( [ 0  0  0 ;  0  DA  0 ;  0  0  I_{k−s} ] [ x̃1 ;  x̃2 ;  x̃3 ] + [ I_{k−r}  0  0 ;  0  DB  0 ;  0  0  0 ] [ ũ1 ;  ũ2 ;  ũ3 ] ) = [ Q_1^T b ;  Q_2^T b ].   (3.1.58)
where x̃i = UiT xi and ũi = ViT ui , i = 1, 2, 3.
It immediately follows that the model is correct only if QT2 b = 0, which is equivalent to the
condition b ∈ R(A, B). If this condition is not satisfied, then b could not have come from the
model. The remaining part of (3.1.58) can now be written
    
[ R11  R12  R13 ;  0  R22  R23 ;  0  0  R33 ] [ ũ1 ;  DA x̃2 + DB ũ2 ;  x̃3 ] = [ c1 ;  c2 ;  c3 ]  (block rows of k − r, q, and k − s rows),   (3.1.59)

where we have partitioned R and c = QT1 b conformally with the block rows of the two-block
diagonal matrices in (3.1.58).
We first note that x̃1 has no effect on b and therefore cannot be estimated. The decomposition
x = xn + xe with
xn = U1 x̃1 , xe = U2 x̃2 + U3 x̃3
splits x into a nonestimable part xn and an estimable part xe . Furthermore, x̃3 can be determined
exactly from R33 x̃3 = c3 . Note that x̃3 has dimension k − s = rank(( A B )) − rank(B), so
that this can only occur when rank(B) < m.
The second block row in (3.1.59) gives the linear model
DA x̃2 + DB ũ2 = R22^{-1} (c2 − R23 x̃3),

where from (3.1.57) we have V(ũ2) = σ^2 I. Here the right-hand side is known, and the best linear unbiased estimate of x̃2 is

x̂2 = D_A^{-1} R22^{-1} (c2 − R23 x̃3).   (3.1.60)

Since the error satisfies DA (x̂2 − x̃2) = DB ũ2, the error covariance is

V(x̂2 − x̃2) = σ^2 (D_A^{-1} DB)^2,
and the components are uncorrelated. The random vector ũ3 has no effect on b. The dimension
of ũ3 is p − s = p − rank(B), and so is zero if B has independent columns. Finally, the vector
ũ1 can be obtained exactly from (3.1.59). Since ũ1 has zero mean and covariance matrix σ 2 I, it
can be used to estimate σ 2 . Note that ũ1 has dimension k − r = rank ( A B ) − rank(A).
A QR-like algorithm for computing the SVD of a product or quotient of two or more matrices
is given by Golub, Sølna, and Van Dooren [508, 2000]. Let
C = A_p^{s_p} · · · A_2^{s_2} A_1^{s_1},   s_i = ±1,
be a sequence of products or quotients of matrices Ai of compatible dimensions. To illustrate the
idea, we consider for simplicity the case when p = 2 and si = 1. Then orthogonal matrices Qi ,
i = 0, 1, 2, can be constructed such that
B = QT2 A2 Q1 QT1 A1 Q0
is bidiagonal. The SVD of B can then be found by standard methods. Typically, the bidiagonal
matrix will be graded and will allow small singular values to be computed with high relative
precision by the QR algorithm. The generalization to the product and/or quotient of an arbitrary
number of matrices is obvious.

Notes and references


The first algorithms for computing the GSVD were developed by Stewart [1020, 1982], [1021,
1983] and Van Loan [1080, 1985]. Their algorithms use a QR factorization for the first phase.
Heath et al. [597, 1986] give a generalized Kogbetliantz algorithm for computing the SVD of
the product B TA, which computes very small singular values of the product B TA accurately. A
more general theory for GSVD is developed by De Moor and Zha [299, 1991].
Rao [912, 1973] called the augmented matrix in (3.1.11) the fundamental matrix. When
both A and V are rank-deficient the solution of (3.1.11) can, in theory, be obtained from a
generalized inverse of the fundamental matrix. Some results and proofs for the singular linear
model are given by Searle [992, 1994]. A unified treatment including a perturbation analysis
of generalized and constrained least squares problems is given by Wedin [1110, 1985]. The
systematic use of GQR as a basic conceptual and computational tool is explored by Paige [861,
1990]. Implementation aspects of GQR are considered in Anderson, Bai, and Dongarra [25,
1992], and routines for GQR are included in LAPACK. QR, URV, and SVD decompositions are
generalized to any number of matrices by De Moor and Van Dooren [298, 1992]. Rank-revealing
GQR decompositions are developed by Luk and Qiao [763, 1994].

3.2 Weighted Least Squares


3.2.1 Weighted Least Squares Problems
In many problems, the random errors in the data are uncorrelated but of variable accuracy. In ge-
odetic surveys, recent observations may be much more accurate than older data. Such problems
have diagonal covariance matrix V and are called weighted least squares (WLS) problems,

min ∥W (Ax − b)∥22 , W = V −1/2 = diag (w1 , w2 , . . . , wm ), (3.2.1)


x

where weights wi are such that the weighted residuals ri = wi (b − Ax)i have equal variance.
Note that the solution of (3.2.1) is scaling independent, i.e., it does not change if W is multiplied
by a nonzero scalar. Therefore, without restriction we can assume in the following that wi ≥ 1,
and that the rows of A are normalized so that max1≤j≤n |aij | = 1, i = 1, . . . , m.
The solution to the WLS problem (3.2.1) satisfies the normal equations

AT W 2 Ax = AT W 2 b. (3.2.2)

A more stable solution method is to use the weighted QR factorization W A = QR. The solution
to (3.2.1) is then obtained from
Rx = QT W b, (3.2.3)
and squaring of the weight matrix W is avoided.
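A minimal sketch of (3.2.1)-(3.2.3) with randomly generated data and weights (illustrative only):

m = 15; n = 3;
A = randn(m,n); b = randn(m,1);
w = 10.^(3*rand(m,1));          % positive weights of varying size
W = diag(w);
[Q,R] = qr(W*A,0);              % thin QR of the weighted matrix
x = R\(Q'*(W*b));               % WLS solution, cf. (3.2.3)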
For a consistent underdetermined system AT y = c, the unique solution of the weighted
least-norm problem
min_y ∥W y∥_2   subject to   A^T y = c   (3.2.4)

is obtained from the generalized normal equations of the second kind,

(AT W 2 A)z = c, y = W 2 Az. (3.2.5)

The weighted QR factorization W A = QR gives the more accurate solution

y = W QR−T c. (3.2.6)

In a linear model where some error components have much smaller variance than others, the
weight matrix W = diag (wi ) is ill-conditioned. Then κ(W A) can be large even when A is
well-conditioned. We call a weighted least squares problem stiff if
µ = (max_{1≤i≤m} w_i) / (min_{1≤i≤m} w_i) ≫ 1,   (3.2.7)

by analogy to the terminology used in differential equations. Stiff problems arise, e.g., in barrier
and interior methods for optimization, electrical networks, and certain classes of finite element
problems. In interior methods, W becomes very ill-conditioned when the iterate approaches the
boundary; see Wright [1132, 1995]. Stiff problems occur also when the method of weighting is
used to solve least squares problems with the linear equality constraints A1 x = b1 . Often the
interest is in a sequence of weighted problems where W varies and A is constant.
That a weighted least squares problem is stiff does not in general imply that the problem of
computing x from the data W , A, and b is ill-conditioned. For weighted least squares problems
the componentwise Bauer–Skeel condition number (see Section 1.3.4)

cond(AW ) = ∥ |(AW )† | |AW | ∥

is more relevant. This often depends only weakly on W and when µ → ∞ can tend to a limit
value; see Section 3.2.3.
To illustrate the possible failure of the method of normal equations for the weighted least
squares problem (3.2.2), we consider a problem where only the first p < n equations are
weighted by wi = w ≫ 1, i = 1, . . . , p:
min_x ∥ [ wA1 ;  A2 ] x − [ wb1 ;  b2 ] ∥_2^2,   A1 ∈ R^{p×n}.   (3.2.8)

The weighted normal equations are Cx = d, where

C = w2 AT1 A1 + AT2 A2 , d = w2 AT1 b1 + AT2 b2 .

When w ≫ 1, C and d will be dominated by the first term. If w > u−1/2 , then all information
contained in A2 and b2 will be lost. But since p < n, the solution depends critically on this data.
(The matrix in the Läuchli least squares problem in Example 2.1.1 is of this type.) Hence, the
method of normal equations is generally not well behaved.
The Peters–Wilkinson method (see Section 2.4.1) can be used to solve weighted least squares
problems even when W is severely ill-conditioned. Assume that the rows of A and b are pre-
ordered by decreasing weights, ∞ > w1 ≥ · · · ≥ wm > 0. Compute an LDU factorization with
complete pivoting of A,
Π1 AΠ2 = LDU,

where L ∈ Rm×n is unit lower trapezoidal, |lij | ≤ 1, and U ∈ Rn×n is upper unit triangular. In
the transformed problem

min_y ∥Ly − b̃∥_2,   b̃ = Π1 b,   D U Π_2^T x = y,   (3.2.9)

L and U are usually well-conditioned, and the weight matrix W is reflected only in D. The
transformed problem can often be solved safely by the method of normal equations LT Ly = LT b̃
and back-substitution DU x = Dy.

Example 3.2.1. Consider the least squares problem minx ∥W Ax − W b∥2 ,

W A = [ w  w  w ;  1  0  0 ;  0  1  0 ;  0  0  1 ],   W b = [ w ;  0 ;  0 ;  0 ]

of Läuchli. For w > u−1/2 the normal equations become singular. The Peters–Wilkinson
method computes the factorization W A = LDU ,

L = [ 1  0  0 ;  w^{-1}  1  0 ;  0  −1  1 ;  0  0  −1 ],   U = [ 1  1  1 ;  0  −1  −1 ;  0  0  −1 ],

where D = diag (w, 1, 1) and L and U are well-conditioned. With y = DU x the problem
becomes miny ∥Ly − W b∥2 . The solution can be accurately computed from

LT Ly = LT W b, DU x = y.

The weight w only appears in the diagonal scaling of y in the last step. Alternatively, L can be
transformed into lower triangular form by Householder reflections; see A. K. Cline [253, 1973].

3.2.2 Weighted QR Factorization


Householder QR factorization can give unnecessarily poor accuracy for weighted least squares
problems because of the large growth of intermediate elements during the factorization, even
when column pivoting is used. This was first observed by Powell and Reid [903, 1969], who
considered the least squares problem for

A = [ 0  2  1 ;  w  w  0 ;  w  0  w ;  0  1  1 ],   b = [ 3 ;  2w ;  2w ;  2 ],   w ≫ 1,

with exact solution x = (1, 1, 1)^T. The first step of Householder QR factorization produces the reduced matrix

[ −w√2   −w/√2   −w/√2 ;  0   w/2 − √2   −w/2 − 1/√2 ;  0   −w/2 − √2   w/2 − 1/√2 ;  0   1   1 ].

If w > 2√2 u^{-1}, then in floating-point arithmetic, the terms −√2 and −1/√2 in the second and third rows of Ã^(2) are lost. But this means that all information present in the first row of A is lost.
This is disastrous, because the number of rows in A containing large elements is less than the
number of components in x. Hence there is a substantial dependence of the solution x on both
the first and fourth rows of A. Still, this is better than the method of normal equations, which
fails when w > u−1/2 for this problem. Van Loan [1081, 1985] gives further examples where
Householder QR without row interchanges gives poor accuracy for stiff least squares problems.
Note that the insensitivity of fixed-precision IR to poor row scaling of A can make it possible to
relax the need to sort rows of large norm to the top.

For the first example above, failure can easily be avoided by interchanging the first two rows
of A before performing the QR factorization. More generally, rowwise stability can be achieved
by using Householder QR with complete pivoting. By this we mean that in each step, first
a pivot column of maximum norm is selected and then an element of largest absolute value is
permuted into the pivot position. (Note the importance of interchanging columns before rows.)
With complete pivoting, the backward errors in QR factorization can be bounded rowwise in
terms of element growth in each row. If Â^(k) = (a_ij^(k)) is the computed matrix after k steps, then

ω_i = max_{j,k} |a_ij^(k)| ≤ (1 + √2)^{n−1} max_j |a_ij|,   i = 1, . . . , m.   (3.2.10)

This upper bound can nearly be attained, although in practice, usually ωi ≈ 1. The following
rowwise stability result is due to Powell and Reid, but a more accessible derivation is given by
Cox and Higham [273, 1998].

Theorem 3.2.2. Let R̂ ∈ Rm×n denote the computed upper triangular matrix in the Householder
QR factorization of A ∈ Rm×n with complete pivoting. Let Π be the permutation matrix that
describes the column permutations. Then there exists an exactly orthogonal matrix Q ∈ Rm×m
such that

(A + ∆A)Π = Q [ R̂ ;  0 ],   |∆a_i^T| ≤ γ̃_m (1, . . . , n^2) ω_i,   i = 1, . . . , m,   (3.2.11)
where γ̃m is defined as in (1.4.9).

Householder QR factorization with complete pivoting is expensive and not available in stan-
dard software. Björck [135, 1996, p. 169] conjectured that if the rows of A are presorted by
decreasing row ∞-norm, i.e., so that d1 ≥ d2 ≥ · · · ≥ dm , where di = max1≤j≤n |wi aij |, then
the rowwise backward error bound holds for Householder QR with standard column pivoting.
This conjecture was later proved by Cox [272, 1997]. Then standard software can be used for
stably solving strongly weighted least squares problems.
In contrast to the Householder QR method, Gram–Schmidt QR factorization is numerically
invariant under row interchanges, except for second-order effects derived from different summa-
tion orders in inner products. However, numerical results for highly stiff problems show a loss
of accuracy also for MGS.

Example 3.2.3. The stability of algorithms using QR factorization for stiff problems can be
enhanced by iterative refinement. As test problems we take
A = V D ∈ R^{21×6},   v_ij = (i − 1)^{j−1},

with D chosen so that the columns of A have unit 2-norm. The right-hand side is taken to be

b = Ax + θh,   x = D^{-1}(10^5, 10^4, . . . , 1),
where AT h = 0 and h is normalized so that κ2 (A)∥h∥2 = 1.5∥A∥2 ∥x∥2 . Problems with
widely different row norms are obtained by taking A_w = D_w A, b_w = D_w b, h_w = D_w^{-1} h,
and Dw = diag (wi ), where wi = w for i = 1, 11, 21 and wi = 1 otherwise. The tests were
run on a UNIVAC 1108 with single precision equal to 2^{-26} = 10^{-7.83} and double precision 2^{-62} = 10^{-18.66}. Mixed-precision iterative refinement was carried out with three different QR
factorizations: Modified Gram–Schmidt (MGS), Householder QR (HQR), and Householder QR
with the weighted rows permuted to the top (HQRP). Table 3.2.1 shows the initial and final
average numbers of correct significant decimal digits in the solution for w = 1, 2^7, 2^14. The
numbers in parentheses indicate refinements carried out.

Table 3.2.1. Average number of correct significant decimal digits in the solution before and after
iterative refinement with various QR factorizations. The number of refinement steps is shown in parentheses.

Algorithm      w = 1        w = 2^7      w = 2^14


MGS 5.92 5.15 2.90
14.15 (3) 18.66 (4) 18.66 (9)
HQR 5.75 3.79 -0.44
15.74 (4) 12.86 (3) 18.66 (8)
HQRP 4.94 4.22 3.16
14.68 (5) 18.66 (4) 11.31 (4)

Anda and Park [22, 1996] apply their fast self-scaling plane rotations to the QR factorization
of stiff least squares problems. Their results show that regardless of row sorting, these produce
accurate results even for extremely stiff least squares problems. No significant difference in ac-
curacy is observed between different rotation orderings. This makes self-scaling plane rotations
a method of choice for solving stiff problems.

3.2.3 Limit Behavior of Weighted Least Squares


Let xW be the unique solution to the weighted least squares problem

min ∥W (Ax − b)∥2 , A ∈ Rm×n , rank(A) = n,


x

where W = diag (w1 , . . . , wm ) is positive definite. Stewart [1022, 1984] showed that there is a
finite number χA not depending on W such that the weighted least squares solution xW satisfies

∥xW ∥ ≤ χA ∥b∥. (3.2.12)

This implies a bound for the perturbation of the solution that results from a perturbation of b.
However, due to the combinatorial aspect of the bound χA , it cannot be computed in polynomial
time.
Ben-Tal and Teboulle [103, 1990] show, using a formula derived from Cramer’s rule and the
Binet–Cauchy formula, that xW is a convex combination of the basic solutions to nonsingular
square subsystems of the original overdetermined system.

Theorem 3.2.4. Let F be the set of all subsets I = {i1 , i2 , . . . , in } from {1, 2, . . . , m}. Denote
by AI and bI the submatrix of A and b whose rows correspond to I. Furthermore, let

F + = {I ∈ F | rank(AI ) = n},

and let xI be the solution of AI xI = bI . Then


x_W = Σ_{I∈F+} λ_I x_I,   λ_I > 0,   Σ_{I∈F+} λ_I = 1.

It follows that xW ∈ conv {xI | I ∈ F + }.

This result is generalized to diagonally dominant symmetric and positive semidefinite weight
matrices by Forsgren [420, 1996]. He also shows that it does not hold for general symmetric
definite weight matrices.

Notes and references


Vavasis [1091, 1994] surveys equilibrium problems with extremely large weights and gives stable
direct and iterative algorithms for their solution. Hough and Vavasis [643, 1997] develop a
special method based on a complete orthogonal decomposition that satisfies a strong type of
stability. Iterative methods for accurate solution of weighted least squares are studied by Bobrov-
nikova and Vavasis [156, 2001].

3.2.4 Indefinite Linear Least Squares


For A ∈ Rm×n with m ≥ n, and b ∈ Rm , the indefinite least squares (ILS) problem is
   
min_x (b − Ax)^T J(b − Ax),   A = [ A1 ;  A2 ],   b = [ b1 ;  b2 ],   (3.2.13)

where A1 ∈ Rm1 ×n , A2 ∈ Rm2 ×n , m1 + m2 = m, and


 
J = [ I_{m1}  0 ;  0  −I_{m2} ]   (3.2.14)

is the signature matrix. Note that J −1 = J. This problem arises in downdating, total least
squares problems, and H ∞ -smoothing; see Chandrasekaran, Gu, and Sayed [232, 1998]. A
necessary condition for x to solve (3.2.13) is that the gradient be zero:

AT J(b − Ax) = 0. (3.2.15)

Hence the residual vector r = b−Ax is J-orthogonal to R(A). Equivalently, x solves the normal
equations ATJAx = AT Jb. The indefinite least squares problem has a unique solution if and
only if AT JA is positive definite. This implies that m1 ≥ n and that A1 (and A) has full column
rank; see Bojanczyk, Higham, and Patel [166, 2003]. In the following we assume that AT JA is
positive definite.
In the method of normal equations for problem ILS the Cholesky factorization of A^T J A = R^T R is computed, and the solution is obtained by solving two triangular systems R^T (Rx) = c,

where c = AT Jb.
If accuracy is important, an algorithm based on QR factorization is to be preferred. A back-
ward stable algorithm was given by Chandrasekaran, Gu, and Sayed [232, 1998]. The first step
is to compute the compact QR factorization
   
A1 Q1
A= = R = QR, Q1 ∈ Rm1 ×n , Q2 ∈ Rm2 ×n ,
A2 Q2

where QT Q = QT1 Q1 + QT2 Q2 = In . Substituting this into (3.2.15) gives the equation

(QT1 Q1 − QT2 Q2 )Rx = QT Jb. (3.2.16)

Note that the orthogonality of Q is not needed for this to hold. By computing the Cholesky
factorization
QT1 Q1 − QT2 Q2 = LLT ,
where L ∈ Rn×n is lower triangular, this linear system becomes

LLT Rx = QT Jb.

The total cost for this QR-Cholesky algorithm is about (5m − n)n2 flops. Although QT1 Q1 −
QT2 Q2 can be very ill-conditioned, this method can be shown to be backward stable.
134 Chapter 3. Generalized and Constrained Least Squares

Bojanczyk, Higham, and Patel [166, 2003] give a perturbation analysis of the ILS problem
and develop an alternative ILS algorithm based on so-called hyperbolic QR factorization. For
the perturbation analysis the normal equations can be rewritten in symmetric augmented form as
    
J A s b
= , (3.2.17)
AT 0 x 0

where s = Jr = J(b − Ax). The inverse of the augmented matrix is (cf. (3.1.15))
−1
J(I − P ) JAS −1
  
J A
= , (3.2.18)
AT 0 S −1 AT J −S −1

where S = AT JA and P = AS −1 AT J is a projector. Consider perturbations that satisfy the


componentwise bounds |δA| ≤ ωE, and |δb| ≤ ωf . Then using (3.2.18) and proceeding as in
Section 1.3.4 we obtain the first-order bound

|δx| ≤ ω |S −1 AT |(f + E|x|) + |S −1 |E T |r| .



(3.2.19)

A matrix Q ∈ Rn×n is said to be J-orthogonal if QT JQ = J. Multiplying this by QJ and


using J 2 = I gives QJQT J = I and QJQT = J. If Q1 and Q2 are J-orthogonal, then

QT2 QT1 JQ1 Q2 = QT2 JQ2 = J,

i.e., a product of J-orthogonal matrices is J-orthogonal. J-orthogonal matrices are useful in


the treatment of problems with an underlying indefinite inner product and also play a role in the
solution of certain structured eigenvalue problems.
From rT Jr = rT QT JQr it follows that the ILS problem (3.2.13) is invariant under J-
orthogonal transformations. A hyperbolic QR factorization has the form
   
A1 R
QT A = QT = (3.2.20)
A2 0

where R is upper triangular and Q is J-orthogonal. Then it holds that

(b − Ax)T J(b − Ax) = (b − Ax)T QJQT (b − Ax)


 T  
c1 − Rx c1 − Rx
= J = ∥c1 − Rx∥22 − ∥c2 ∥22 ,
c2 c2

and the solution to the ILS problem (3.2.13) is obtained by solving Rx = c1 .


The simplest J-orthogonal matrices are hyperbolic rotations of the form
 
T c −s
H=H = , c2 − s2 = 1.
−s c

They are so named because |c| = cosh θ and s = sinh θ for some θ. It is easily verified that

H T JH = J, J = diag (1, −1).

Like plane rotations, hyperbolic rotations can be used to zero a selected component in a vector,
   
x1 σ
H = ,
x2 0
3.2. Weighted Least Squares 135

which requires cx2 = sx1 . Provided |x1 | > |x2 |, the solution is
q q
s = x2 / x21 − x22 , c = x1 / x21 − x22 . (3.2.21)

The elements of a hyperbolic rotation H are unbounded and therefore must be used with care. As
shown by Chambers [217, 1971] direct computation of y = Hx is not stable. Instead, a mixed
form that combines a hyperbolic rotation with a plane rotation should be used. This is based on
the equivalence
         
x1 y1 y1 x1 c −s
H = ⇔ G = , G= .
x2 y2 x2 y2 s c

First, y1 = Hx1 is computed using the hyperbolic rotation, and then y2 = Gx2 is computed
from a plane rotation, i.e.,

y1 = (x1 − sx2 )/c, y2 = cx2 − sy1 . (3.2.22)

An error analysis of Chambers’ algorithm is give by Bojanczyk et al. [164, 1987].


Bojanczyk, Higham, and Patel [166, 2003] give an algorithm for solving the ILS problem
that combines standard Householder reflectors and hyperbolic rotations. In the first step, two
Householder reflectors are used: P1,1 zeros the elements in rows 2, . . . , m1 of the first column
of A1 , and P1,2 zeros the elements in rows m1 + 1, . . . , m in A2 . The remaining element in the
first column of A2 is annihilated by a hyperbolic rotation in the plane (1, m1 + 1). The steps in
this reduction to triangular form are shown below for the case n = m1 = m2 = 3.
     
× × × × × × × × ×

 ⊗ × × 


 × × 


 × × 

P1,1 
 ⊗ × × 


 × × 
 P2,1 
 ⊗ × 


 ⊗ × ×  ⇒ H1,5 
  × ×  ⇒


 ⊗ × 

P1,2 
 × × × 

 ⊗
 × × 
 P2,2 
 × × 

 ⊗ × ×   × ×   ⊗ × 
⊗ × × × × ⊗ ×
     
× × × × × × × × ×

 × × 


 × × 


 × × 


 × 
 P3,1 
 × 


 × 

⇒ H2,5 
 ×  ⇒


 ⊗  ⇒ H3,5 
 



 ⊗ × 
 P3,2 
 × 


 ⊗ 

 ×   ⊗   
× ⊗
The remaining steps are similar. In step k, k = 1, . . . , n, two Householder reflections are used
to zero the last m1 − k elements in the kth column of A1 and the last m2 − 1 elements in the
kth column of A2 . Next, a hyperbolic rotation in the plane (k, m1 + 1) is then used to zero the
remaining element in the kth column of A2 . If the problem is positive definite, the process will
not break down and terminates with an upper triangular matrix R after n steps. The reduction
can be combined with column interchanges so that at step k the diagonal element rkk in R is
maximized. The interchanges in the remaining steps are determined from the reduced matrices
in a similar way. Since the operation count for the required n hyperbolic rotations used in the
reduction is O(n2 ) flops, the total cost is about the same as for the standard Householder QR
factorization. The algorithm can be shown to be forward stable.
A special case arises in total least squares problems (see Section 4.4), where the factorization
ATA − w2 In = RTR is required without forming the cross-product. This can be achieved by
136 Chapter 3. Generalized and Constrained Least Squares

taking A1 = A and A2 = wIn in the above reduction. Since the hyperbolic rotations will cause
the rows in A2 to fill in, the reduction is almost as expensive as when A2 is a full upper triangular
matrix. There are two different ways to perform the reduction: process In either top down, which
will use more memory but allows the use of Householder reflectors, or bottom up, which avoids
fill-in but uses more hyperbolic rotations.
Bojanczyk, Higham, and Patel [165, 2003] give a similar algorithm for the equality con-
strained ILS problem
min(b − Ax)T J(b − Ax) subject to Bx = d, (3.2.23)
x

where A ∈ R(p+q)×n , B ∈ Rs×n , and J is the signature matrix. It is assumed that rank(B) = s
and that xT (AT JA)x > 0 for all x > 0 in N (B). These conditions imply that p ≥ n − s. The
solution to problem (3.2.23) satisfies
    
0 0 B λ d
 0 J A s  = b, (3.2.24)
B T AT 0 x 0
where s = Jr = J(b − Ax).

Algorithm 3.2.1 (Generalized Householder QR Method).


1. Compute the Householder QR factorization of B T : BQ = B ( Q1 Q2 ) = ( RT 0)
and form C2 = AQ2 .
2. Solve the lower triangular system RT y1 = d.
T
3. Compute the hyperbolic QL factorization HC2 = ( 0 LT22 ) , where L22 is lower trian-
gular and H is J-orthogonal, using Chambers’ method.
T
4. Compute f = ( f1T f2T ) = H(b − AQ1 y1 ) using the factored form of H.
5. Solve the lower triangular system L22 y2 = f2 and set x = Qy.

J-orthogonal matrices can be constructed from orthogonal matrices as follows. Consider the
partitioned linear system
    
Q11 Q12 x1 y1
Qx = = , (3.2.25)
Q21 Q22 x2 y2
where Q is orthogonal. Solving the first equation for x1 and substituting in the second equation
will exchange x1 and y1 . This can be written as
   
x1 y1
= exc (Q) ,
y2 x2
where
Q−1 −Q−1
 
exc (Q) = 11 11 Q12 (3.2.26)
Q21 Q−1
11 Q22 − Q21 Q−1
11 Q12
is the exchange operator. The (2, 2) block in (3.2.26) is the Schur complement of Q11 in
Q. An early reference to the exchange operator is in network analysis; see the survey by Tsat-
someros [1069, 2000].

Theorem 3.2.5. Let Q ∈ Rn×n be partitioned as in (3.2.25) with Q11 nonsingular. If Q is


orthogonal, then exc (Q) is J-orthogonal. If Q is J-orthogonal, then exc (Q) is orthogonal.

Proof. See Pan and Plemmons [874, 1989, Lemma 1].


3.3. Modified Least Squares Problems 137

Notes and references


Golub [489, 1969] gave an early version of hyperbolic downdating. He noted that deleting a row
wT in a least squares problem is formally equivalent to adding a row iwT , where i2 = −1. The
resulting algorithm can be expressed entirely in real arithmetic; see Lawson and Hanson [727,
1995, pp. 229–231]. The use of hyperbolic rotations in signal processing for downdating least
squares solutions is studied by Alexander, Pan, and Plemmons [16, 1988]. Rader and Stein-
hardt [908, 1988] introduced hyperbolic Householder reflectors to zero several components at a
time in a vector. Higham [624, 2003] gives a systematic treatment of J-orthogonal matrices and
their many applications. Linear algebra with an indefinite inner product and applications thereof
are treated by Gohberg, Lancaster, and Rodman [483, 2005]. An algorithm using orthogonal
transformations to reduce an indefinite symmetric matrix to block antitriangular form is given by
Mastronardi and Van Dooren [782, 2013]. By applying this to (3.2.24) they derive in [783, 2014]
a stable algorithm for solving constrained indefinite least squares problems.

3.3 Modified Least Squares Problems


3.3.1 Modified Linear Systems
Consider the 2 × 2 block matrix  
A B
M= . (3.3.1)
C D
If A is nonsingular, then M can be factored into a product of a block lower and a block upper
triangular matrix as
    
A B I 0 A B
M= = , S = D − CA−1 B. (3.3.2)
C D CA−1 I 0 S
This factorization is equivalent to block Gaussian elimination and is easily verified by performing
the product. The matrix S is called the Schur complement of A in M . From (3.3.2) and the
product rule for determinants it follows that
det(M ) = det(A) det(S). (3.3.3)
In particular, for the 2 × 2 block triangular matrices
   
L11 0 U11 U12
L= , U=
L21 L22 0 U22
with square diagonal blocks, it holds that
det(L) = det(L11 ) det(L22 ), det(U ) = det(U11 ) det(U22 ). (3.3.4)
Hence L and U are nonsingular if and only if their diagonal blocks are nonsingular. Furthermore,
L−1
   −1 −1 −1

−1 11 0 −1 U11 −U11 U12 U22
L = , U = . (3.3.5)
−L−1 −1
22 L21 L11 L−1
22 0 −1
U22

These formulas can easily be verified by forming the products L−1 L and U −1 U and using the
rule for multiplying partitioned matrices.
Let M in (3.3.1) have a nonsingular block A. Then from M −1 = (LU )−1 = U −1 L−1 and
(3.3.5) the Schur–Banachiewicz formula follows:
 −1
A + A−1 BS −1 CA−1 −A−1 BS −1

−1
M = , (3.3.6)
−S −1 CA−1 S −1
138 Chapter 3. Generalized and Constrained Least Squares

where S is the Schur complement of A in M . Similarly, if D is nonsingular, then M can be


factored into a product of a block upper and a block lower triangular matrix as

I BD−1
  
T 0
M= , T = A − BD−1 C, (3.3.7)
0 I C D

where T is the Schur complement of D in M . This is equivalent to block Gaussian elimination


in reverse order. From this we obtain the expression

T −1 −T −1 BD−1
 
M −1 = . (3.3.8)
−D−1 CT −1 D−1 + D−1 CT −1 BD−1

Woodbury [1131, 1950]5 gave a formula for the inverse of a square nonsingular matrix A
after it has been modified by a matrix of rank p. This is very useful when p ≪ n.

Theorem 3.3.1 (The Woodbury Formula). Let A ∈ Rn×n and D = Ip , p < n. If A and
S = D − CA−1 B ∈ Rp×p are nonsingular, then

(A − BC)−1 = A−1 + A−1 BS −1 CA−1 . (3.3.9)

which is the Woodbury formula.

Proof. Equate the (1, 1) blocks in the two expressions (3.3.6) and (3.3.8) for the inverse
M −1 .

If A = In , then (3.3.9) simplifies to

(In − BC)−1 = In + B(Ip − CB)−1 C, (3.3.10)

which is sometimes called the matrix inversion lemma. Higham [626, 2017] shows that this
formula follows directly from the associative law for matrix multiplications.
Let Ax = b be a given system with known solution x = A−1 b, and let x̂ satisfy the system

(A − BC)x̂ = b, B, C T ∈ Rn×p , (3.3.11)

where A has been modified by a matrix of rank p. Then the Woodbury formula gives

x̂ = x + W (Ip − CW )−1 Cx, W = A−1 B. (3.3.12)

If an LU factorization of A is already available, then computing W = A−1 B requires 2n2 p flops.


Forming (Ip − CW ) and its LU factorization takes 2(np2 + p3 /3) flops. Finally, 4np + 2p2 flops
are needed for computing Cx and the correction to x. When p ≪ n this is computationally
advantageous compared to solving the modified system from scratch. For symmetric A and
C = B T , the Woodbury formula can be further simplified.
Let p = 1, BC = σuv T , where u, v ∈ Rn and σ ̸= 0. Then the Woodbury formula (3.3.9)
becomes

(A − σuv T )−1 = A−1 + ρA−1 uv T A−1 , ρ = 1/(σ −1 − v T A−1 u), (3.3.13)

which is also known as the Sherman–Morrison formula. Note that if σ −1 = v T A−1 u, then
modified matrix A − σuv T is singular. Otherwise, the solution x̂ of the modified linear system
5 The same formula appeared in several papers before Woodbury.
3.3. Modified Least Squares Problems 139

(A − σuv T )x̂ = b can be expressed as

vT x
x̂ = x + w, w = A−1 u. (3.3.14)
σ −1− vT w
The evaluation of this expression only requires the solution of two linear systems Ax = b and
Aw = u. If the LU factorization of A is known, the arithmetic cost is only about 2n2 .
The following related result was first proved by Egerváry [361, 1960] and is included as
Example 1.34 in the seminal book by Householder [645, 1975]. The sufficient part appeared
earlier in Wedderburn [1105, 1964].

Theorem 3.3.2. Let A ∈ Rm×n be an arbitrary matrix and u ∈ Rm and v ∈ Rn be vectors.


Then the rank of B = A − σuv T is less than that of A if and only if there are vectors x ∈ Rn
and y ∈ Rm such that u = Ax, v = AT y, and σ −1 = y T Ax, in which case

rank(B) = rank(A) − 1.

Note that the Woodbury and Sherman–Morrison formulas do not always lead to numerically
stable algorithms, and therefore they should be used with some caution. Stability is a problem
whenever the unmodified problem is conditioned in a worse way than the modified problem.

Notes and references

The history of the Woodbury and similar updating formulas and their applications is surveyed by
Henderson and Searle [601, 1981]. Chu, Funderlic, and Golub [248, 1995] explore extensions
of the Wedderburn rank reduction formula that lead to various matrix factorizations. Explicit
expressions for the pseudoinverse (A + uv H )† are given by Meyer [793, 1973]. Depending
on which of the three conditions u ∈ R(A), v ∈ R(AH ), and 1 + v H A† u ̸= 0 are satisfied,
there are no less than six different cases. Generalizations of Meyer’s results to perturbations of
higher rank are not known. Some results for the pseudoinverse of sums of matrices are given by
R. E. Cline [257, 1965]. For rectangular or square singular matrices, no formulas similar to the
Woodbury formula (3.3.9) with A−1 replaced by the pseudoinverse A† seem to exist.

3.3.2 Modifying QR Factorizations


Many applications require the solution of a least squares problem after the data have been modi-
fied by adding or deleting variables and/or observations. Examples arise in regression problems,
nonlinear and constrained least squares, optimization, signal processing, and prediction in control
theory. Algorithms for updating least squares solutions date back to Gauss; see Farebrother [397,
1999]. The first systematic use of algorithms for modifying matrix factorization seems to have
been in optimization; see Gill et al. [472, 1974].
We consider mainly algorithms for updating the QR factorization
 
R
A=Q , Q = ( Q1 Q2 ) ∈ Rm×m , (3.3.15)
0

of A ∈ Rm×n , m ≥ n. Note that only R and Q1 are uniquely determined. Because no efficient
way to update a product of Householder reflections is known, we assume that Q ∈ Rm×m is
explicitly stored. Primarily, we consider algorithms for modifying a QR factorization when A
is subject to a low-rank change. The important special cases of adding or deleting a column or
a row of A are considered separately. Such updating algorithms require O(m2 ) multiplications.
140 Chapter 3. Generalized and Constrained Least Squares

One application where such algorithms are needed is stepwise regression. This is a greedy
technique for selecting a suitable subset of variables in a linear regression model

Ax ≈ b, A = (a1 , a2 , . . . , an ) ∈ Rm×n .

The regression model is built sequentially by adding or deleting one variable at a time. Initially,
set x(0) = 0 and r(0) = b. Assume that at the current step, k variables have entered the regres-
sion, and the current residual is r(k) = b − Ax(k) . In the next step, the column ap to add is
chosen so that the residual norm is maximally decreased or, equivalently, the column that makes
the smallest acute angle with the residual r(k) . Hence,

aTj r(k)
cos(aj , r(k) ) =
∥aj ∥2 ∥rk) ∥2

is maximized for j = p over all variables not yet in the model. After a new variable has entered
the regression, it may be that the contribution of some other variable included in the regression
is no longer significant. This variable is then deleted from the regression model using a similar
technique.
Efroymson [360, 1960] gave an algorithm for stepwise regression based on Gauss–Jordan
elimination on the normal equations. This is sensitive to perturbations and not numerically stable.
Miller [794, 2002] surveys subset selection in regression and emphasizes the computational and
conceptual advantages of using methods based on QR factorization rather than normal equations.
Eldén [366, 1972] describes a backward stable method that uses Householder QR factorization
that has the drawback of needing storage for a square matrix Q ∈ Rm×m . A stable implementa-
tion of stepwise regression based on MGS with reorthogonalization is given by Gragg, LeVeque,
and Trangenstein [522, 1979] and essentially uses only the storage needed for A and b.
Methods for updating least squares solutions are closely related to methods for modifying
matrix factorizations. If A ∈ Rm×n has full column rank, the least squares solution is obtained
from the extended QR factorization
 
R z
(A b) = Q , Q ∈ Rm×m , (3.3.16)
0 ρe1

where the right-hand side b has been appended to A as a last column. The solution is then
obtained by solving the triangular system Rx = z, and the residual norm ∥Ax − b∥2 equals ρ.
The upper triangular matrix R in (3.3.16) is the Cholesky factor of
   
AT ATA AT b
(A b) = .
bT bTA bT b

By applying algorithms for updating a QR factorization to the extended QR factorization (3.3.16)


of ( A b ), we obtain algorithms for updating the solution of least squares problems.
The algorithms given in the following for modifying the full QR factorization are minor
modifications of those given in LINPACK; see Dongarra et al. [322, 1979, p. 10.2]. It is fairly
straightforward to extend these to cases where a block of rows/columns is added or deleted. Such
block algorithms are more amenable to efficient implementation on modern computers.
We point out that adding rows and deleting columns are inherently stable operations in the
sense that the smallest singular value cannot decrease. On the other hand, when adding col-
umns or deleting rows the smallest singular value can decrease, and a solution may not exist. In
particular, deleting a row can be a very ill-conditioned problem.
3.3. Modified Least Squares Problems 141

Appending a Row

Without loss of generality, we assume that a row v T is appended to A ∈ Rm×n after the last row.
Then we have  
    R
A Q 0  T
= v .
vT 0 1
0

Hence the problem is equivalent to appending v T as the (n + 1)th row to R. Now, plane rotations
Gk,n+1 , k = 1 : n, are determined to annihilate v T , giving
   
R R
e
G = , G = Gn,n+1 · · · G1,n+1 .
vT 0

This requires 3n2 flops. The updated factor becomes


 
Qe = Q 0 In+1,m+1 GT
0 1

and can be computed in 6mn flops. Note that R can be updated without Q being available. From
the interlacing property of the singular values it follows that the updating does not decrease the
singular values of R. By the general rounding error analysis of plane rotations and Householder
transformations, this updating algorithm is backward stable.

General Rank-One Change

Given the QR factorization of A ∈ Rm×n and vectors u ∈ Rm and v ∈ Rn , we want to compute


QR factors of  
e = A + uv T = Q e R ,
e
A (3.3.17)
0
 
w1
where A is perturbed by a rank-one matrix. We first compute w = QT u. Then with w = ,
w2
w1 ∈ Rn , it holds for any orthogonal P that
    
T T R w1
A + uv = QP P + vT . (3.3.18)
0 w2

Determine H as a Householder reflection such that Hw2 = βe1 . Next, let Gk,n+1 , k =
n, n − 1, . . . , 1, be a sequence of plane transformations that zeros the elements in w1 from the
bottom up and creates a nonzero row below the matrix R. Taking P1 as the combination of these
transformations, we have
 
     R
R w1
P1 + v T =  z T + βv T  , β = ±∥w2 ∥2 . (3.3.19)
0 w2
0

Finally, a product P2 of plane rotations G


e k,k+1 (ϕk ) is determined that zeros the elements in row
n + 1 as described in the algorithm for adding a row. This gives the factor R. e The orthogonal
T T
factor is Q = QP1 P2 .
e
The work needed for updating R is 6n2 flops. Applying the Householder transformations to
Q takes 4m(m−n) flops, and applying the plane transformations from steps 2 and 3 takes 12mn
142 Chapter 3. Generalized and Constrained Least Squares

flops. This gives a total of 4m2 + 8nm + 4n2 flops. The algorithm can be shown to be mixed
backward stable.

Remark 3.3.1. In another version of this algorithm the matrix R was modified into a Hessenberg
matrix by using a sequence of rotations Gk,k+1 , k = n, n − 1, . . . , 1, in step 3. The version given
here is easier to implement since the modified row can be held in a vector. This becomes even
more important for large sparse problems.

Deleting a Column

We first observe that deleting the last column of A in the QR factorization


 
R̃ u
A = ( A1 an ) = Q
0 rnn

is trivial. The QR factorization of A1 is obtained simply by deleting the trailing column from the
decomposition. Suppose now that we want to compute the QR factorization

A
e = (a1 , . . . , ak−1 , ak+1 , . . . , an ),

where the kth column of A is deleted, k < n. From the above observation it follows that this
decomposition can readily be obtained from the QR factorization of the matrix

APL = (a1 , . . . , ak−1 , ak+1 , . . . , an , ak ), (3.3.20)

where PL is a permutation matrix that performs a left circular shift of the columns ak , . . . , an .
The matrix RPL is upper Hessenberg, but the matrix PLT RPL is upper triangular, except in its
last row. For example, if k = 3, n = 6, then it has the structure
 
× × × × × ×

 0 × × × × × 

T
 0 0 × × × 0 
PL RPL =  .

 0 0 0 × × 0 

 0 0 0 0 × 0 
0 0 × × × ×

The task has now been reduced to constructing plane rotations Gi,n , i = k : n − 1, that zero out
the off-diagonal elements in the last row. Only the trailing principal submatrix of order n − k + 1
in PLT RPL , which has the form
 
R22 0
,
v T rnn

participates in this transformation. After the last column is deleted , the remaining update of R22
is precisely the same as already described for adding a row. The updated Q factor is

e = QPL GT · · · GT
Q k,n n−1,n .

By an obvious extension of the above algorithm, we obtain the QR factorization of the matrix
resulting from a left circular shift applied to a set of columns (a1 , . . . , ak−1 , ak+1 , . . . , ap , ak ,
ap+1 , . . . , an ).
3.3. Modified Least Squares Problems 143

Inserting a Column
Assume that the QR factorization
 
R
A = (a1 , . . . , ak−1 , ak+1 , . . . , an ) = Q , k ̸= n,
0

is known. We want to compute the QR factorization of A


e = (a1 , . . . , an ), where the kth column
ak is inserted. We assume that ak ∈
/ R(A), because otherwise A
e is singular. Then
 
u
w = QT ak = , u ∈ Rn ,
v

where γ = ∥v∥2 ̸= 0. Let Hn be a Householder reflector such that HnT v = γe1 . Then we have
the QR factorization
     
e R , Q e = Q In 0 e= R u .
e
( A ak ) = Q , R
0 0 Hn 0 γ
Let PR be the permutation matrix that performs a right circular shift on the columns ak+1 , . . . ,
an , ak , so that
 
  R 11 u1 R 12
e RPR , RP
e
A e = ( A a k ) PR = Q e R= 0 u2 R22  ,
0
0 γ 0

where R11 ∈ R(k−1)×(k−1) and R22 ∈ R(n−k)×(n−k) are upper triangular. For example, for
k = 4, n = 6, we have  
× × × × × ×
 0 × × × × × 
 
e R= 0 0 × × × ×
 
RP  0 0 0 ×
.
 × × 

 0 0 0 × 0 × 
0 0 0 × 0 0
Now determine plane rotations Gi−1,i , i = n : −1 : k, to zero the last n − k elements in the kth
column of RP
e R . Then
 
u2 R22
Gk−1,k · · · Gn−1,n =R e22
γ 0
is upper triangular, and the updated factors are
 
R11 R e12
e Tn−1,n · · · Gk−1,k ,
R= e22 , Q = QG (3.3.21)
0 R

where R
e12 = ( u1 R12 ).
The above method easily generalizes to computing QR factors of

(a1 , . . . , ak−1 , ap , ak , . . . , ap−1 , ap+1 , . . . , an+1 ),

i.e., of the matrix resulting from a right circular shift of the columns ak , . . . , ap . Note that when
a column is deleted, the new R-factor can be computed without Q being available. However,
when a column is added, it is essential that Q be known.
The algorithms given for appending and deleting a column correspond to the MATLAB func-
tions qrinsert(Q,R,k,ak) and qrdelete(Q,R,k).
144 Chapter 3. Generalized and Constrained Least Squares

Deleting a Row
Suppose we are given the QR factorization of A ∈ Rm×n ,
 T  
a1 R
A= = Q , (3.3.22)
Ae 0

e where the first row aT of A is deleted. This is known


and want to find the QR factorization of A, 1
T T
as the downdating problem. If q = e1 Q = ( q1T q2T ) is the first row of Q, then
 T   
a1 1 R q1
Ae 0 = Q 0 q2 . (3.3.23)

Let H be a Householder reflection such that Hq 2  = γe1 , γ = ±∥q2 ∥2 . Next, let G =


q1
G1,2 · · · Gn,n+1 be plane rotations such that G = αe1 . Since q is a unit vector, we
γ
must have |α| = 1. It is no restriction to assume that α = 1. Then
 T 
   v 1
In 0 R q1
G =R e 0, (3.3.24)
0 H 0 q2
0 0

where R e ∈ Rn×n is upper triangular, and the row vector v T = aT has been generated. To find
1
the downdated factor Q̃ we need not consider the transformation H, because it does not affect
its first n columns. The matrix QGT is orthogonal, and by (3.3.23) its first row must equal eT1 .
Therefore it must have the form  
T 1 0
QG =
0 Qe

with Q
e orthogonal. It then follows that
 
vT 1
aT1
   
1 1 0
= Re 0,
Ae 0 0 Q e
0 0
 
Re
which shows that a1 = v. The desired factorization A = Q
e e is now obtained by deleting
0
the first row and last column on both sides of the equation. Note the essential role played by the
first row of Q in this algorithm.
In downdating, the singular values of R can decrease, and R
e can become singular. Paige [859,
1980] has proved that the above downdating algorithm is mixed stable, i.e., the computed R e is
close to the corresponding exact factor of a nearby matrix A e + E, where ∥E∥ < cu.

Block Downdating a QR Factorization


The algorithms for modifying the QR factorization can be extended to modifications of rank
k > 1. This applies in particular to cases when blocks of rows or columns are added or deleted.
Compared to repeated applications of the corresponding rank-one algorithm, such algorithms can
be implemented using matrix-matrix operations and hence should execute more efficiently.
We describe a downdating algorithm when k > 1 rows are to be deleted. The other algorithms
can be similarly modified. Let the QR factorization of A ∈ Rm×n be
      
A1 R Q11 Q12 R
A= =Q = , (3.3.25)
A2 0 Q21 Q22 0
3.3. Modified Least Squares Problems 145

where Q11 ∈ Rk×n and Q12 ∈ Rk×(m−n) . We want to find the QR factors of A2 , where the first
block of k rows A1 ∈ Rk×n is deleted. From QQT = Im it follows that
   
A1 Ik R QT11
=Q . (3.3.26)
A2 0 0 QT12

Compute the QR factorization of QT12 by finding a product H of Householder reflections such


that HQT12 = U ∈ Rk×k is upper triangular. Next, let G be a product of plane rotations such that
   T    T 
P Q11 In 0 Q11
=G =G ,
0 U 0 H QT12

with P ∈ Rk×k upper triangular with positive diagonal elements. By orthogonality,


 T 
T Q11
P P = ( Q11 Q12 ) = Ik .
QT12

Hence P is orthogonal, and because it is triangular, it must equal Ik . Then


 
  T
 V Ik
In 0 R Q11
G = R e 0 , (3.3.27)
0 H 0 QT12
0 0

e ∈ Rn×n is upper triangular, and V has been generated. To find the downdated factor Q̃
where R
we need not consider the transformation H, which does not affect its first n columns. The matrix
QGT is orthogonal, and from (3.3.27) its k first rows must equal ( Ik 0 ). It follows that
 
T Ik 0
QG = e ,
0 Q

where Q
e is orthogonal, and
 
    V Ik
A1 Ik Ik 0
= R
e 0 .
A2 0 0 Q
e
0 0

This shows that V = A1 . Equating the last block of rows, we obtain the desired downdated
factorization  
R
e
A2 = Q e .
0

3.3.3 Downdating the Cholesky Factorization


In the downdating problem of Section 3.3.2 we are given the QR factorization of
 T
z
A= (3.3.28)
Ae

and want to determine the QR factors of A.


e The downdating transformations are determined by
m×m
the first row of the full matrix Q ∈ R . In many applications, storing and modifying Q is
too costly. Then one would like to modify the upper triangular R ∈ Rn×n without knowledge of
Q. This is the Cholesky downdating problem: Given R and z, determine R e such that

eT R
R e = RTR − zz T .
146 Chapter 3. Generalized and Constrained Least Squares

The Cholesky factor R is mathematically the same as R in the QR factorization. Any downdating
algorithm that uses only R and not Q or the original data A relies on less information and cannot
be expected to give full accuracy.
In the LINPACK algorithm due to Saunders [963, 1972], one seeks to recover the necessary
information in Q using the original data from A. The first row of the QR factorization can be
written  
T T R
e1 A = q , q T = eT1 Q = ( q1T q2T ) , (3.3.29)
0
giving z T = q1T R. Thus q1 ∈ Rn can be found from A by forward substitution in the lower
triangular system RT q1 = AT e1 = z. Furthermore, ∥q∥22 = ∥q1 ∥22 + ∥q2 ∥22 = 1, and hence
γ = ∥q2 ∥2 = (1 − ∥q1 ∥22 )1/2 . (3.3.30)

This allows the downdated factor R e to be computed as described previously in Section 3.3.2 by
a sequence of plane rotations Gk,n+1 , k = n, n − 1, . . . , 1, constructed so that
 
q1
G1,n+1 · · · Gn,n+1 = αen+1 , α = 1,
γ
and    
R Re
G1,n+1 · · · Gn,n+1 = .
0 vT
Then, as in (3.3.24), R
e is the downdated factor, and v = z. As described, the LINPACK al-
gorithm requires about 3n2 flops. By interleaving the two phases, Pan [871, 1990] gives an
implementation that uses 40% fewer multiplications.
We can write
RTR − zz T = RT (In − q1 q1T )R = R̃T R̃. (3.3.31)
If we put I − q1 q1T = LLT , then R̃ = LT R. The matrix I − q1 q1T has n − 1 eigenvalues equal
to 1 and one equal to γ 2 = 1 − ∥q1 ∥22 ≤ 1. Hence, σn (L) = γ. If γ is small, there will be
severe cancellation in the computation of 1 − ∥q1 ∥22 . When γ ≈ u1/2 the LINPACK algorithm
can break down. In Saunders [963, 1972] the downdate was used only on a square matrix A.
Then we know that γ = 0, and there is no danger of breakdown. However, Example 3.3.3 below
shows the danger of not having Q.
Downdating the Cholesky factor R is an inherently less stable operation than downdating
both Q and R in the QR factorization. The best we can expect is that the computed downdated
factor R
e is the exact Cholesky factor of

(R + E)T (R + E) − (z + f )(z + f )T ,
where ∥E∥2 and ∥f ∥2 are modest constants times machine precision.
Pan [872, 1993] gives a first-order perturbation analysis that shows that the normwise relative
sensitivity of the Cholesky downdating problem can be measured by
ξ(R, z) = κ(R) + κ2 (R)(1 − γ 2 )/γ 2 , (3.3.32)
where γ is defined as in (3.3.30). Hence an ill-conditioned downdating problem is signaled by a
small value of γ, but the condition number of R also plays a role. Sun [1049, 1995] derives two
different condition numbers,
κ(R, z) = κ(R)/γ 3 , e 2.
c(R, z) = κ(R)/γ (3.3.33)
Numerical tests show that in most cases, c(R, z) is the smallest. Note that the suggested condition
numbers can be cheaply estimated using a standard condition estimator.
3.3. Modified Least Squares Problems 147

Example 3.3.3. Consider the least squares problem min ∥Ax − b∥2 , where

   
τ 1
A= , b= , τ = 1/ u,
1 1
and u is the unit roundoff. We may think of the first row of A as an outlier. The QR factorization
of A, correctly rounded to single precision, is
    
τ 1 −ϵ τ
A= = ,
1 ϵ 1 0

where ϵ = 1/τ . The LINPACK algorithm computes q1 = τ /τ = 1, giving γ 2 = 1 − 1 = 0, and


G1 = I. Hence it gives the downdated factor R e = 0, and the downdated least squares solution
is not defined. It is easily verified that if we downdate using Q, we get the correct result R
e=1
and the downdated solution x = 1.

An alternative downdating method that uses both R and A but not Q is given by Björck,
Eldén, and Park [151, 1994]. Let v be the solution of minv ∥Av − e1 ∥2 . Then the R-factor of
( A e1 ) is  
R q1
, q1 = Rv. (3.3.34)
0 γ
The connected seminormal equations (CSNE) downdating algorithm first computes v from the
so-called seminormal equations (SNE) RTRv = AT e1 . A corrected solution v + δv is then
determined from
r = Av, RTRδv = r, v := v + δv, (3.3.35)
giving
q1 = Rv, γ = ∥Av − e1 ∥2 .
The update of R then proceeds as in the LINPACK algorithm. A similar procedure can be used
to downdate the augmented R-factor (3.3.16) by solving the least squares problem
 
x
min ( A b ) − e1
x,ϕ ϕ 2

using the CSNE method. This leads to an accurate downdating algorithm for least squares prob-
lems. However, the modifications are not trivial, partly because the condition number of the
augmented R-factor is large when ρ is small. An error analysis of the CSNE method is given in
Section 2.5.4.
The CSNE downdating algorithm requires three more triangular solves than the LINPACK
algorithm and an additional four matrix-vector products. Thus, a hybrid algorithm is preferable
in which the CSNE algorithm is used only when the downdating problem is ill-conditioned.
It is often required to find the downdated Cholesky factor after a modification of rank k > 1.
This can be performed as a sequence of k rank-one modifications. However, block methods using
matrix-matrix and matrix-vector operations can execute more efficiently. Let R in AT A = RT R
and Z ∈ Rk×n be given. The Cholesky block downdating problem seeks R e such that

eT R
R e = ATA − Z T Z.

The LINPACK algorithm can be generalized as follows. Suppose the first k rows Z in A are to
be deleted. We have
      
Z R Q11 Q12 R
A= e =Q = , (3.3.36)
A 0 Q21 Q22 0
148 Chapter 3. Generalized and Constrained Least Squares

where Q ∈ Rm×m has been partitioned so that Q11 ∈ Rk×n . It follows that
 
R
Z = ( Ik 0 ) A = ( Ik 0 ) Q = Q11 R. (3.3.37)
0

Hence Q11 ∈ Rk×n can be determined as in LINPACK by solving the triangular matrix equation
RT QT11 = Z. Furthermore, using the orthogonality of Q, we have
 T 
Q11
Ik = ( Q11 Q12 ) ,
QT12

which shows that Q12 QT12 = Ik − Q11 QT11 . Hence we can take Q12 = ( L 0 ) ∈ Rk×(m−n) ,
where L is the lower triangular Cholesky factor of Ik − Q11 QT11 . The downdating can then
proceed as in block downdating of the QR factorization described in Section 3.3.2.
Algorithms for block downdating the Cholesky factorization using hyperbolic transforma-
tions are given by Bojanczyk and Steinhardt [168, 1991] and Liu [756, 2011]. They proceed in
n steps to compute    
Z 0
Pn · · · P1 = e ,
R R
where each transformation Pi consists of a Householder reflection followed by a hyperbolic
rotation; see Section 3.2.4. In step i, i = 1, . . . , n, a Householder reflection Hi is used to zero
all elements in the ith column of Z, except zk,i which then is zeroed by a hyperbolic rotation
Gk,k+i acting on rows k, k + i. If the problem is positive definite, the process will not break
down. The first two steps in the reduction are shown below for the case n = 4, k = 3.
     
⊗ × × × × × × ⊗ × ×

 ⊗ × × × 


 × × × 


 ⊗ × × 

 × × × ×   ⊗ × × ×   ⊗ × × 
H1   ⇒ G3,4
 
 ⇒ G3,5 H2
  

 × × × × 
 ×
 × × × 
 ×
 × × × 


 × × × 


 × × × 


 × × × 

 × ×   × ×   × × 
× × ×

Notes and references


First-order perturbation analysis for the block downdating problem for the Cholesky factor is
given by Eldén and Park [378, 1994], Sun [1049, 1995], and Chang and Paige [234, 1998].
Olszanskyj, Lebak, and Bojanczyk [843, 1994] extend the LINPACK, Gram–Schmidt, and CSNE
downdating algorithms to block downdating and provide experimental results comparing their
numerical accuracy. LAPACK-style codes for updating the QR factorization after a block of
rows or columns are added or deleted are given by Hammarling et al. [565, 2006].

3.3.4 Recursive Least Squares Methods


In signal processing, control systems, and communication, data often arrive continuously in real
time for processing. Assume that a sequence of least squares problems minx ∥Ax − b∥2 is to be
solved, where new equations are added and old ones deleted. If A ∈ Rm×n has full column rank,
the unique solution is x = CAT b, where C = (ATA)−1 is the covariance matrix apart from a
scaling factor. Suppose now that a block of equations Bx = c, B ∈ Rp×n , is appended. The
updated normal equations are

(ATA + B TB)e
x = AT b + B T c.
3.3. Modified Least Squares Problems 149

Adding B TBx to both sides of the original normal equations ATAx = AT b and subtracting gives
(ATA + B TB)(e
x − x) = B T rp , rp = c − Bx, (3.3.38)
where rp is the predicted residual for the added equations. Hence the updated solution becomes

x e T rp ,
e = x + CB e = (ATA + B TB)−1 ,
C (3.3.39)

where Ce is the updated covariance matrix. From the Woodbury formula (3.3.9) we obtain the
expression
e = C − U (Ip + BU )−1 U T , U = CB T .
C (3.3.40)
In particular, adding a single equation v T x = γ gives ρ = γ − v T x, u = Cv, and
e = C − uuT /(1 + v T u),
C x
e = x + ρCv,
e

where ρ is the predicted residual and u


e = Cv
e is the so-called Kalman gain vector. With slight
modifications these equations can be used also for deleting an equation wT x = γ. Provided that
1 − wT u > 0, we obtain
e = C + uuT /(1 − wT u),
C e = x − (γ − wT x)Cw.
x e (3.3.41)
The simplicity of this recursive least squares updating algorithm has made it popular for many
applications. Because C = (RTR)−1 , such schemes are called square root methods in the
signal processing literature. However, such methods lack stability because they are based on the
normal equation. When accuracy is important, an algorithm that updates the factor R from a
QR factorization should be preferred. Adding a single equation v T x = γ leads to the following
recursive least squares (RLS) algorithm.
1. Compute the predicted residual ρ = γ − v T x and v T u = v T Cv = ∥R−T v∥22 .
 
R
2. Compute the updated QR factorization of to obtain R.
e
vT

3. Compute the updated solution x


e = x + ρe
u, where u e−1 (R
e=R e−T v).
Pan and Plemmons [874, 1989] have developed alternative algorithms that instead update or
downdate the inverse Cholesky factor R−1 . Such schemes can be more easily parallelized, be-
cause no back-substitutions are involved. The vector ue can be computed by matrix-vector multi-
plication, and the covariance matrix (RTR)−1 = R−1 R−T is more readily recovered. These are
important considerations for applications in signal processing.
In nonstationary time series calculations it is necessary to suppress older data. Often a
sliding window moving over the data is used in which a new observation is added and the
oldest observation is deleted. Another method to suppress older data is called exponential
windowing. Let β ∈ (0, 1) be a forgetting factor. Then the rows in the current data matrix
Xm = (Am , bm ) ∈ Rm×(n+1) are weighted with
Dm = diag (β m−1 , . . . , β, 1),
and the QR factorization Dm Xm = Qm Rm is computed. Hence as m increases, the older data
influence the solution less and less. Then Rm+1 can be computed as the QR factorization of
 
βRm
.
xm+1
Exponential windowing makes it possible to look at arbitrarily long sequences of data. Stew-
art [1029, 1995] shows that both the LINPACK and Chambers downdating algorithms are rela-
tionally stable in the sense that old rounding errors are damped along with the data.
150 Chapter 3. Generalized and Constrained Least Squares

3.3.5 Modifying the Gram–Schmidt QR Factorization


For least squares problems with m ≫ n it is usually not feasible to store and update the full
square matrix Q ∈ Rm×m in the QR factorization. By instead modifying the compact Gram–
Schmidt QR factorization

A = Q1 R, Q1 = (q1 , . . . , qn ) ∈ Rm×n , (3.3.42)

storage and operation counts for a rank-one modification are reduced from O(m2 ) to O(mn).
Such algorithms for adding and deleting rows and columns in the compact QR factorization are
developed by Daniel et al. [285, 1976]. Their algorithms use Gram–Schmidt QR with reorthog-
onalization. Reichel and Gragg [917, 1990] give optimized Fortran subroutines implementing
similar methods.
Adding a column in the last position in the QR factorization is straightforward and equal to
an intermediate step in a columnwise Gram–Schmidt. Similarly, deleting the last column of A in
the factorization A = Q1 R is trivial. Inserting or deleting a column in another position requires
computing QR factors of a permuted triangular matrix. This can be done by a series of plane
rotations as described for updating the full QR factorization; see Section 3.3.2. Adding a row in
the QR factorization can also be performed similarly by a series of plane rotations.
We now describe an algorithm for a general rank-one update. Given A = Q1 R, with or-
thonormal Q1 ∈ Rm×n , we seek the compact QR factorization of the modified matrix A e =
A + vuT , where v ∈ Rm , and u ∈ Rn . We then have
 
Ae = ( Q1 v ) RT . (3.3.43)
u

The first step is to make v orthogonal to Q1 using Gram–Schmidt and, if necessary, reorthogo-
nalization. This produces vectors r and a unit vector q, ∥q∥2 = 1, such that

v = Q1 r + ρq, QT1 q = 0.

We then have     
R r
A
e = ( Q1 q) + uT . (3.3.44)
0 ρ
The remaining step uses a sequence of plane rotations as in the algorithm for modifying the
full QR factorization. With one reorthogonalization this rank-one update algorithm requires
approximately 20mn + 6n2 flops.
A similar algorithm is used for downdating the compact QR factorization when the first row
z T is deleted. Let q1T = eT1 Q1 be the first row in Q1 . Then we have
 T  T
z q1
A= e = Q̂1 R, (3.3.45)
A

and appending the column e1 = (1, 0, . . . , 0)T to Q1 gives


 T  T  
z q1 1 R
= . (3.3.46)
Ae Q̂1 0 0

We now use the Gram–Schmidt process—if necessary with reorthogonalization—to orthogonal-


ize e1 = (1, 0, . . . , 0)T to Q1 . With QT1 Q1 = In , this gives

v = e1 − Q1 (QT1 e1 ) = e1 − Q1 q1 .
3.3. Modified Least Squares Problems 151


If ∥v∥2 < 1/ 2, then v is reorthogonalized; otherwise, v is accepted. Because of the special
form of e1 , the result has the form
 T   T    
q1 1 q1 γ In q1 γ
= , = v/∥v∥2 . (3.3.47)
Q̂1 0 Q̂1 h 0 γ h

Using (3.3.46) gives     


zT q1T γ R
= . (3.3.48)
Ae Q̂1 h 0
Next, a sequence of plane rotations Gk,n+1 , k = n, n − 1, . . . , 1, is determined such that
 T   
q1 γ 0 τ
G= e e , G = Gn,n+1 · · · G1,n+1 . (3.3.49)
Q̂1 h Q1 h

Since orthogonal transformations preserve length, we must have |τ | = 1. Because the trans-
h = 0 in (3.3.49), and
formed matrix also must have orthonormal columns, it follows that e
 T    
z 0 1 T R
Ae = Q e1 0 G 0
,

where GT = GT1,n+1 · · · GTn,n+1 . This gives


    
zT 0 1 Re
= , (3.3.50)
Ae Q
e1 0 wT

where R e is upper triangular. Thus wT = z T , and the downdated QR factorization is Ae=Q e 1 R.


e
With one reorthogonalization, this downdating algorithm requires about 7mn + 2.5n2 flops. The
storage requirement is about mn + 0.5n2 for Q1 and R.
In the above downdating algorithm the orthonormality of Q1 plays a decisive role. A more ac-
curate downdating algorithm called Householder Gram–Schmidt downdating (HGSD) for MGS
has been devised by Yoo and Park [1140, 1996]. This makes use of the numerical equivalence
of MGS and Householder QR applied to A augmented with an n × n zero matrix on top; see
Section 2.2.6. Let A = Q1 R, Q1 = (q1 , . . . , qn ) be the computed MGS QR factorization. This
is equivalent to the Householder QR decomposition
   
0n×n R
=P , P = P1 · · · Pn , (3.3.51)
A 0

where  
−ek
Pk = I − uk uTk , uk = . (3.3.52)
qk
The equivalence is true also numerically. The matrix P is orthogonal by construction and fully
determined by Q1 and the strictly upper triangular matrix P11 ∈ Rn×n ,
   
P11 P12 P11 (I − P11 )Q̄T1
P = = . (3.3.53)
P21 P22 Q̄1 (I − P11 ) I − Q̄1 (I − P11 )Q̄T1

In particular, it can be shown that (see Theorem 2.2.12)

P21 = (q1 , M1 q2 , . . . , M1 · · · Mn−1 qn ) ∈ Rm×n , (3.3.54)

P22 = M1 M2 · · · Mn ∈ Rm×m , (3.3.55)


152 Chapter 3. Generalized and Constrained Least Squares

where Mi = I − qi qiT . Yoo and Park note that downdating the MGS QR decomposition when
the first row in A is deleted is equivalent to downdating the corresponding Householder decom-
position (3.3.51) when row (n + 1) is deleted. This can be done stably provided the (n + 1)th
row in P is available. The HGSD algorithm starts by using (3.3.55) to recover the first row g T
of P22 :
g = eT1 P21 = ((eT1 M1 )M2 ) · · · Mn . (3.3.56)
This gives the recursion g T = eT1 , g T = g T − (g Tqk )qkT , k = 1, . . . , n. Next, a Householder
reflection H such that g T H = (∥g∥2 , 0, . . . , 0) is determined, and the first column v of P22 H is
computed from
v = P22 He1 = (M1 · · · (Mn (He1 ))), (3.3.57)
giving the recursion v = He1 , v = v − qk (qkT v), k = n, n − 1, . . . , 1. These steps replace
the steps for orthogonalization of e1 to Q1 in the previous algorithm and yield γ = v1 and
h = (v2 , . . . , vm ). The first row f of P21 could be recovered similarly from (3.3.54), but it is
much cheaper to use
eTn+1 P = (((eTn+1 P1 )P2 ) · · · Pn ), (3.3.58)
where en+1 ∈ R(n+m) is a unit vector with one in its (n + 1)th position. This leads to the
recursion f = en+1 , f T = f T − (f T uk )uk , k = 1, . . . , n, where uk is given by (3.3.52).
The remaining steps of the algorithm are similar to the steps (3.3.49)–(3.3.50) in the previous
Gram–Schmidt downdating algorithm. An orthogonal matrix G is determined as a product of
plane rotations so that  T   
f γ 1 0
G= f1 .
0 Q̂1 0 Q
Finally, the upper triangular matrix R is modified:
   T
R z
GT = e .
0 R
A complete pseudocode of this Householder–MGS downdating algorithm is given by Yoo and
Park [1140, 1996]. The HGSD algorithm uses 4mn flops for computing g and v in (3.3.56) and
(3.3.57). The total arithmetic work is approximately 20mn + 4n2 flops.

Notes and references


Barlow, Smoktunowicz, and Erbay [77, 2005] develop a new family of more robust Gram–
Schmidt downdating algorithms and give new bounds on the loss of orthogonality after downdat-
ing. They show that the HGSD algorithm of Yoo and Park can be obtained as a special case of this
family and corresponds to taking two MGS steps, where the second goes through the columns in
reverse order. Examples are given where the proposed new algorithms have a dramatic impact
upon the accuracy of the downdated factorization. Barlow and Smoktunowicz [76, 2013] give a
block downdating algorithm using classical GS with reorthogonalization. Block Gram–Schmidt
downdating is further studied by Barlow in [69, 2014] and [70, 2019].

3.3.6 Modifying URV Decompositions


Treating rank-deficient least squares problems requires a rank-revealing decomposition such as
the SVD or URV decomposition. The high cost of updating the SVD (see Section 7.2.4) makes
rank-revealing URV decompositions attractive. These have the form
 
R F
A=U V T, (3.3.59)
0 G
3.3. Modified Least Squares Problems 153

where U and V are orthogonal and R ∈ Rk×k , G ∈ R(m−k)×(n−k) are upper triangular. Let
σ1 ≥ σ2 ≥ · · · ≥ σn be the singular values of A, and assume that for some k < n it holds that
σk ≫ σk+1 ≤ δ, where δ is a given tolerance. Then the numerical δ-rank of A equals k. Also, if
1 1/2
σk (R) ≥ σk , ∥F ∥2F + ∥G∥2F ≤ cσk+1
c
for some constant c, the decomposition (3.3.59) exhibits the rank and nullspace of A. The URV
decomposition can be updated in O(n2 ) operations when a row is added to A. Following the
algorithm given by Stewart [1025, 1992], we write
 
 T   R F
U 0 A
V = 0 G , (3.3.60)
0 1 wT
xT y T

where wT V = (xT y T ) and (∥F ∥2F + ∥G∥2F )1/2 = ν ≤ δ. In the simplest case the inequality
q
ν 2 + ∥y∥22 ≤ δ (3.3.61)

is satisfied. Then it suffices to reduce the matrix in (3.3.60) to upper triangular form by a sequence
of left plane rotations. Note that the updated matrix R cannot become effectively rank-deficient
because its singular values cannot decrease.
If (3.3.61) is not satisfied, we first reduce y T in (3.3.60) so that it becomes proportional to
T
e1 , while keeping the upper triangular form of G. This can be done by a sequence of right and
left plane rotations as illustrated below. (Note that here the f ’s represent entire columns of F .)
↓ ↓ ↓ ↓
     
f f f f f f f f f f f f
g g g g g g g g g g g g
     
0 g g g 0 g g g 0 g g g
  ⇒   ⇒   ⇒
0
 0 g g →0 0 g g
0
 + g g
0 0 + g → 0 0 ⊕ g 0 0 0 g
y y y 0 y y y 0 y y 0 0

↓ ↓
     
f f f f f f f f f f f f
g
 g g g
g g g
 g →g g g g
→
0 g g g ⇒
+ g g
 g ⇒ → ⊕
 g g g.
→
0 ⊕ g g
0 ⊕ g
 g
0
 0 0 g
0 0 0 g 0 0 0 g  0 0 g g
y y 0 0 σ 0 0 0 σ 0 0 0
In this part of the reduction, R and xT are not involved. The system now has the form
 
R f F̃
 0 g G̃  .
xT σ 0

This matrix is now reduced to triangular form using plane rotations from the left, and k is in-
creased by 1. Finally, the new R is checked for degeneracy and possibly reduced by deflation.
The complete update takes O(n2 ) flops.
Stewart [1027, 1993] has pointed out that although the decomposition (3.3.59) is very satis-
factory for recursive least squares problems, it is less suited for applications where an approx-
imate nullspace is to be recursively updated. Let U = ( U1 U2 ) and V = ( V1 V2 ) be
154 Chapter 3. Generalized and Constrained Least Squares

partitioned conformally with (3.3.59). Then we have


 
F
∥AV2 ∥2 = . (3.3.62)
<G 2

Hence the orthogonal matrix V2 can be taken as an approximation to the numerical nullspace Nk .
On the other hand, we have ∥U2T A∥2 = ∥G∥2 , and therefore the last n − k singular values of A
are less than or equal to ∥G∥2 .
Because F is involved in the bound (3.3.62), V2 is not the best available approximate nullspace.
This problem can be resolved by working instead with the corresponding rank-revealing ULV
decomposition  
L 0
A=U V T, (3.3.63)
H E
where L and E have lower triangular form, and
1 1/2
σk (L) ≥ σk , ∥H∥2F + ∥E∥2F = ν ≤ δ.
c
For this decomposition, ∥AV2 ∥2 = ∥E∥F , where V = (V1 , V2 ) is a conformal partitioning of
V . Hence the size of ∥H∥2 does not affect the nullspace approximation.
Stewart [1027, 1993] has presented an updating scheme for the decomposition (3.3.63). With
wT V = ( xT y T ), the problem reduces to updating
 
L 0
 H E .
xT y T

We first reduce y T to ∥y∥2 eT1 by right rotations while keeping the lower triangular form of E. At
the end of this reduction the matrix will have the form
l 0 0 0 0
l l 0 0 0
l l l 0 0
 
.
h h h e 0

h h h e e
 
x x x y 0

This last row is annihilated by a sequence of left rotations, and k is increased by 1. (For the case
above we would use Q = G16 G26 G36 G46 .) If there has been no effective increase in rank, a
deflation process has to be applied. If (3.3.61) is satisfied, the rank cannot increase. Then the
reduction is performed, but the first rotation G46 is skipped. This gives us a matrix of the form
l 0 0 y 0
l l 0 y 0
l l l y 0
 
.
h h h e 0

h h h e e
 
0 0 0 y 0

The y elements above the main diagonal can be eliminated using right rotations. This fills out the
last row again, but with elements the same size as y. Now the last row can be reduced by the pro-
cedure described above without destroying the rank-revealing structure; see again Stewart [1027,
1993]. The main difference compared to the scheme for updating the URV decomposition is that
there is not the same simplification when (3.3.61) is satisfied.
3.4. Equality Constrained Problems 155

A complication with the above updating algorithm is that when m ≫ n, the extra storage for
U ∈ Rm×m may be prohibitive. If only V and the triangular factor are stored, then we must use
methods like the Saunders algorithm, possibly stabilized with the CSNE method. Alternatively,
hyperbolic rotations may be used; see Section 7.2.4. Such methods will not be as satisfactory as
methods using Q or Gram–Schmidt-based methods using Q1 ; see Sections 3.3.2 and 3.3.5.

Notes and references

Downdating algorithms for rank-revealing URV decompositions are also treated by Park and
Eldén [880, 1995] and Barlow, Yoon, and Zha [79, 1996]. MATLAB templates for computing
RRQR and UTV decompositions are given by Fierro, P. C. Hansen, and P. S. K. Hansen [408,
1999] and Fierro and Hansen [407, 2005]. Algorithms for modifying and maintaining ULV
decompositions are given in Barlow [67, 2003], Barlow and Erbay [72, 2009], and Barlow, Erbay,
and Slapnic̆ar [73, 2005]. Stewart and Van Dooren [1034, 2000] give updating schemes for
quotient-type generalized URV decomposition. Methods for computing and updating product
and quotient ULV decompositions are developed by Simonsson [999, 2006].

3.4 Equality Constrained Problems


3.4.1 Method of Direct Elimination
In least squares problems the solution is often subject to linear equality constraints.
Problem LSE
min ∥Ax − b∥2 subject to Bx = d, (3.4.1)
x

where A ∈ Rm×n , B ∈ Rp×n , p ≤ n, m + p ≥ n. Such problems arise when models require


some equations to be satisfied exactly. Applications include constrained optimization, surface
fitting, signal processing, and various geodetic problems. Least squares problems with inequality
constraints can often be reduced to solving a sequence of LSE problems; see Section 3.4. In
beam-forming or spatial filtering the LSE problem has to be solved for fixed A and many different
B matrices.
Assuming that the constraints Bx = d are consistent, problem LSE has a unique solution if
and only if  
A
rank = n. (3.4.2)
B
This condition is equivalent to N (A) ∩ N (B) = {0}, i.e., the nullspaces of A and B only
intersect trivially. If this is not true, there is a vector z ̸= 0 such that Az = Bz = 0. Hence if
x solves (3.4.1), then x + αz is a different solution. The solution to problem LSE satisfies the
augmented system     
0 0 B λ d
 0 Im−p A   r  =  b  , (3.4.3)
BT AT 0 x 0
where λ are Lagrange multipliers for the constraint Bx = d.
A robust algorithm for problem LSE should check for a possible inconsistency in the con-
straints if rank(B) < p. If the constraints are inconsistent, problem LSE can be reformulated as
a sequential least squares problem,

min ∥Ax − b∥2 , S = {x | ∥Bx − d∥2 = min}. (3.4.4)


x∈S
156 Chapter 3. Generalized and Constrained Least Squares

This problem always has a unique solution of least-norm. Most of the methods described in the
following for solving problem LSE can, with small modifications, be adapted to solve (3.4.4).
A natural way to solve problem LSE is to derive an equivalent unconstrained least squares
problem of lower dimension. There are two different ways to perform this reduction: by direct
elimination or by using the nullspace method. The method of direct elimination starts by
reducing the matrix B to upper trapezoidal form. It is essential that column pivoting be used
in this step. To solve the more general problem (3.4.4) a QR factorization of B with column
pivoting can be used:
 
R11 R12 }r
QTB BΠB = , r = rank(B) ≤ p, (3.4.5)
0 0 }p − r

where QB ∈ Rp×p is orthogonal, R11 is upper triangular and nonsingular, and ΠB is a permuta-
tion matrix. With x̄ = ΠTB x the constraints become

 
( R11 R12 ) x̄ = d¯1 , d¯ = QTB d = ¯1 , (3.4.6)
d2

where d¯2 = 0 if the constraints are consistent. Applying the permutation ΠB to the columns of
A and partitioning the resulting matrix conformally with (3.4.5) gives
 
x̄1
Ax = Āx̄ = ( Ā1 Ā2 ) , (3.4.7)
x̄2
−1 ¯
where Ā = AΠB . If (3.4.6) is used to eliminate x̄1 = R11 (d1 − R12 x̄2 ) from (3.4.7), we obtain
Ax − b = A2 x̄2 − b, where
b b
b2 = Ā2 − Ā1 R−1 R12 ,
A bb = b − Ā1 R−1 d¯1 . (3.4.8)
11 11

This reduction can be interpreted as performing r steps of Gaussian elimination on the system
d¯1
    
R11 R12 x̄1
= .
Ā1 Ā2 x̄2 b
Then x̄2 is determined by solving the reduced unconstrained least squares problem

min ∥A
b2 x̄2 − bb∥2 , b2 ∈ Rm×(n−r) .
A (3.4.9)
x̄2

We now show that if (3.4.2) holds, then rank(Â_2) = n − r, so that (3.4.9) has a unique solution.
For if rank(Â_2) < n − r, there is a vector v ≠ 0 such that

    Â_2 v = Ā_2 v − Ā_1 R_11^{-1} R_12 v = 0.

If we let u = −R_11^{-1} R_12 v, then R_11 u + R_12 v = 0 and Ā_1 u + Ā_2 v = 0. Hence

    w = Π_B [ u ] ≠ 0
            [ v ]

is a null vector of both B and A. But this contradicts the assumption (3.4.2).
The solution to (3.4.9) can be obtained from the QR factorization
   
    Q_A^T Â_2 = [ R_22 ] ,      Q_A^T b̂ = [ c_1 ] ,
                [  0   ]                   [ c_2 ]

where R22 ∈ R(n−r)×(n−r) is upper triangular and nonsingular. Then x̄ is obtained from the
triangular system
    [ R_11  R_12 ] x̄ = [ d̄_1 ] ,                                             (3.4.10)
    [  0    R_22 ]     [ c_1 ]

and x = ΠB x̄ solves problem LSE. The coding of the direct elimination algorithm can be kept
remarkably compact, as shown by the ALGOL program for Householder QR in Björck and
Golub [143, 1967]. Cox and Higham [271, 1999] obtain a similar stable elimination method by
taking the analytic limit of the weighted least squares problem

    min_x ∥ [ωB; A] x − [ωd; b] ∥_2^2

when ω tends to infinity and by rescaling the rows.


The set of vectors x = ΠB x̄, where x̄ satisfies (3.4.6), is exactly the set of vectors that
minimize ∥Bx − d∥2 . Thus, the algorithm outlined above solves the more general problem
(3.4.4). If condition (3.4.2) is not satisfied, the reduced problem (3.4.9) does not have a unique
solution. Then column permutations are needed also in the QR factorization of A b2 . In this case
we can compute either a least-norm solution or a basic solution to (3.4.4).
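To make the steps above concrete, the following MATLAB sketch implements the direct elimination method
under the simplifying assumptions that B has full row rank (r = p) and that condition (3.4.2) holds. The
function and variable names are illustrative only and not part of the text.

function x = lse_elim(A, b, B, d)
% Direct elimination for min ||A*x - b||_2 subject to B*x = d (sketch).
[p, n] = size(B);
[QB, RB, PB] = qr(B);                  % B*PB = QB*RB, QR with column pivoting
R11 = RB(1:p, 1:p);  R12 = RB(1:p, p+1:n);
dbar = QB'*d;
Abar = A*PB;  A1 = Abar(:, 1:p);  A2 = Abar(:, p+1:n);
A2hat = A2 - A1*(R11\R12);             % reduced matrix, cf. (3.4.8)
bhat  = b  - A1*(R11\dbar);            % reduced right-hand side
x2 = A2hat\bhat;                       % reduced problem (3.4.9), solved by QR
x1 = R11\(dbar - R12*x2);              % back-substitution as in (3.4.10)
x  = PB*[x1; x2];                      % undo the column permutation
end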

3.4.2 The Nullspace Method


In the nullspace method for solving problem LSE, an orthogonal basis for the nullspace of the
constraint matrix B is used. Cox and Higham [274, 1999] note that three different versions of
this method are known. The first (used in LAPACK) is based on the generalized QR factorization
of A and B; see Section 3.1.5. This is slightly less efficient than the method described by Lawson
and Hanson [727, 1995] and Golub and Van Loan [512, 1996]. Here we describe yet another
method, which has the lowest computational cost of the three versions. It starts by computing
the QR factorization of B T or, equivalently,

BQB = ( LB 0), (3.4.11)

where LB ∈ Rp×p is lower triangular and QB = ( Q1 Q2 ) ∈ Rn×n , with

Q1 ∈ Rn×p , Q2 ∈ Rn×(n−p) .

Here Q2 gives an orthogonal basis for the nullspace of B, i.e., N (B) = R(Q2 ). If rank(B) = p,
then LB is nonsingular. Then any vector x ∈ Rn such that Bx = d can be represented as
x = x1 + Q2 y2 , where
    x_1 = B^† d = Q_1 L_B^{-1} d ,      B^† = Q_1 L_B^{-1} ,                  (3.4.12)
and y2 ∈ R(n−p) is arbitrary. It remains to solve the reduced least squares problem

    min_{y_2} ∥ (A Q_2) y_2 − (b − A x_1) ∥_2 .                               (3.4.13)

Let y2 = (AQ2 )† (b − Ax1 ) be the minimum-length solution to (3.4.13). Since x1 ⊥ Q2 y2 , it


follows that ∥x∥22 = ∥x1 ∥22 + ∥Q2 y2 ∥22 = ∥x1 ∥22 + ∥y2 ∥22 . Hence, x is the least-norm solution to
problem LSE. From (3.4.12) it follows that x can be expressed as

x = B † d + Q2 (AQ2 )† (b − AB † d)
= (I − Q2 (AQ2 )† A)B † d + Q2 (AQ2 )† b. (3.4.14)

Assuming that N (A) ∩ N (B) = {0}, we have


   
    C = [ B ] Q_B = [ L_B     0    ] ,
        [ A ]       [ A Q_1  A Q_2 ]

and C must have rank n. Hence rank(AQ2 ) = n − p, and the QR factorization


 
    Q_A^T (A Q_2) = [ R_A ]
                    [  0  ]

exists with RA upper triangular and nonsingular. It follows that the unique solution to problem
LSE is x = x1 + Q2 y2 , where
 
    R_A y_2 = c_1 ,      c = [ c_1 ] = Q_A^T (b − A x_1).                     (3.4.15)
                             [ c_2 ]

The method of direct elimination and all three nullspace methods are numerically stable and
should give almost identical results. The method of direct elimination, which uses Gaussian
elimination to derive the reduced unconstrained system, has the lowest operation count.
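A corresponding MATLAB sketch of the nullspace method described above, assuming rank(B) = p and
rank([A; B]) = n; the function and variable names are again only illustrative.

function x = lse_nullspace(A, b, B, d)
% Nullspace method for min ||A*x - b||_2 subject to B*x = d (sketch).
[p, n] = size(B);
[QB, R] = qr(B');                 % B' = QB*R, so that B*QB = ( L_B  0 )
LB = R(1:p, 1:p)';                % lower triangular p-by-p factor
Q1 = QB(:, 1:p);  Q2 = QB(:, p+1:n);
x1 = Q1*(LB\d);                   % particular solution with B*x1 = d, cf. (3.4.12)
y2 = (A*Q2)\(b - A*x1);           % reduced problem (3.4.13)
x  = x1 + Q2*y2;
end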
When A is large and sparse, the nullspace method has the drawback that fill-in can be ex-
pected when forming the matrix AQ2 . When rank(B) = p, and BΠ = ( B1 B2 ) with B1
square and nonsingular, the nullspace matrix

    Z = Π [ −B_1^{-1} B_2 ]
          [       I       ]

satisfies BZ = 0. In the reduced gradient method, Z is used to solve problem LSE. This is
both an elimination method and a nullspace method. The reduced gradient method is potentially
more efficient because it can work with a sparse LU factorization of B1 .

3.4.3 Perturbation Theory


A perturbation theory for problem LSE using representation (3.4.14) is given by Leringe and
Wedin [734, 1970], generalizing results given in Section 1.3.4. Their bounds show that problem
LSE is well-conditioned if κ(B) and κ(AQ2 ) are small. Note that these two condition numbers
can be small even when κ(A) is large. Gulliksson and Wedin [552, 2000] use the augmented
system (3.4.3) to obtain perturbation bounds and condition numbers for rank-deficient, weighted,
and constrained problems.
A more complete perturbation theory for problem LSE is given by Eldén [368, 1980], [369,
1982]. This is based on a generalized pseudoinverse related to the least squares problem

    min_{x∈S} ∥x∥_L ,      S = {x | ∥Ax − b∥_M is minimum},                   (3.4.16)

where ∥ · ∥L and ∥ · ∥M are the elliptic seminorms,

    ∥x∥_L^2 = x^H L^H L x ,      ∥y∥_M^2 = y^H M^H M y ,                      (3.4.17)

for some matrices L and M that are allowed to be rectangular.

Theorem 3.4.1. The solution of problem (3.4.16) can be written x = A^†_{ML} b, where

    A^†_{ML} = (I − (L P)^† L)(M A)^† M ,      P = I − (M A)^† M A.           (3.4.18)

Here A^†_{ML} is the ML-weighted pseudoinverse of A. The solution is unique if and only if

N (M A) ∩ N (L) = {0}. (3.4.19)

Proof. See Eldén [369, 1982, Theorem 2.1] and Mitra and Rao [798, 1974].


The solution to problem LSE can be expressed in terms of the weighted pseudoinverse B^†_{IA}
as follows.

Theorem 3.4.2. If the constraints are consistent, the least-norm solution of problem LSE is given
by

    x = B^†_{IA} d + (A P)^† b ,      P = I − B^† B .                         (3.4.20)

Proof. See Eldén [369, 1982, Theorem 3.1].

3.4.4 The Method of Weighting


In the method of weighting for solving problem LSE, the constraints are multiplied by a large
weight w ≫ 1 and appended at the top of the weighted least squares problem,

    min_x ∥ [wB; A] x − [wd; b] ∥_2^2 ,                                       (3.4.21)

where A ∈ Rm×n , B ∈ Rp×n . If w is sufficiently large, the residual Bx − d becomes negligible,


and the solution x(w) is a good approximation to the LSE solution xLSE . This allows standard
subroutines for unconstrained least squares problems to be used. This is particularly attractive
when A and B are sparse. However, a large weight may be required to give acceptable accuracy
even when the LSE problem is well-conditioned.
Van Loan [1081, 1985] gives a method that allows a moderately large weight w to be used.
First, compute a solution x(w) to (3.4.21) from the QR factorization
   
    [ wB  wd ] = Q [ R  c_1 ] .                                               (3.4.22)
    [ A    b ]     [ 0  c_2 ]

Next, apply the following iterative refinement scheme: Set x^{(1)} = x(w), and for k = 1, 2, . . . ,

    1. s_1^{(k)} = d − B x^{(k)} ;
    2. solve min_{Δx^{(k)}} ∥ [wB; A] Δx^{(k)} − [s_1^{(k)}; 0] ∥_2 ;
    3. x^{(k+1)} = x^{(k)} + Δx^{(k)} .

No further QR factorization is needed to compute the corrections. The vectors x^{(k)} generated
can be shown to converge to x_LSE with linear rate ρ_w = w_p^2/(w_p^2 + w^2), where w_p is
the largest generalized singular value of {A, B}. With the default value w = u^{−1/2}, where u is
the unit roundoff, the method converges quickly unless the problem is ill-conditioned.
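The scheme is easy to prototype. The MATLAB sketch below uses the weight w = u^{−1/2} (that is,
sqrt(1/eps)) and performs a fixed number of refinement steps; the function name and the input nsteps
are illustrative assumptions, not part of the text.

function x = lse_weighted(A, b, B, d, nsteps)
% Method of weighting with iterative refinement for problem LSE (sketch).
m = size(A, 1);
w = sqrt(1/eps);                          % moderately large weight
[Q, R] = qr([w*B; A], 0);                 % QR factorization (3.4.22), economy size
x = R \ (Q'*[w*d; b]);                    % first approximation x(w)
for k = 1:nsteps
    s = d - B*x;                          % residual of the constraints
    dx = R \ (Q'*[w*s; zeros(m, 1)]);     % correction, reusing the same R
    x = x + dx;
end
end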
The method of weighting can be analyzed using the GSVD. In the following we assume that
rank(B) = p and rank ( AT B T ) = n, which ensures that the weighted problem (3.4.21) has
a unique solution. By Theorem 3.1.5 the GSVD of {A, B} can be written
 
    A = U [ Σ_A ] Z ,      B = V ( Σ_B   0 ) Z ,                              (3.4.23)
          [  0  ]

where U and V are orthogonal, Z is nonsingular, and

ΣA = diag (α1 , . . . , αn ) > 0, ΣB = diag (β1 , . . . , βp ) > 0.



The generalized singular values of {A, B} are σ_i = α_i/β_i, with α_i = 1 for i > p. From (3.4.23)
the normal equations (A^TA + w^2 B^TB)x = A^T b + w^2 B^T d are transformed into diagonal form,

    ( Σ_A^2 + w^2 [ Σ_B^2  0 ] ) y = ( Σ_A   0 ) c + w^2 [ Σ_B ] f ,          (3.4.24)
                  [   0    0 ]                            [  0  ]

where Zx = y, c = U^T b, f = V^T d. It follows that

    y_i = (α_i c_i + w^2 β_i f_i)/(α_i^2 + w^2 β_i^2) ,   i = 1, . . . , p,
    y_i = c_i ,                                           i = p + 1, . . . , n.          (3.4.25)

Problem LSE is transformed into

    min_y ∥ [Σ_A; 0] y − c ∥_2^2      subject to   ( Σ_B   0 ) y = f          (3.4.26)

with solution x_LSE = Z^{-1} y, where

    y_i = f_i/β_i ,   i = 1, . . . , p,
    y_i = c_i ,       i = p + 1, . . . , n.                                   (3.4.27)

From (3.4.25) and (3.4.27) it follows that limw→∞ x(w) = xLSE .


The residual for the constraints is s = V^T(d − Bx) = f − ( Σ_B   0 ) y. From (3.4.25) we obtain

    s_i = f_i − β_i (α_i c_i + w^2 β_i f_i)/(α_i^2 + w^2 β_i^2)
        = α_i (α_i f_i − β_i c_i)/(α_i^2 + w^2 β_i^2) ,   i = 1, . . . , p.
For large values of w we have s_i ≈ σ_i(σ_i f_i − c_i)/w^2. Hence the residual norm ∥d − Bx(w)∥_2
is proportional to w^{−2}. We conclude that the criterion for choosing w should have the form
w > c u^{−1/2}.

Notes and references

Several methods for solving problem LSE are described by Lawson and Hanson [727, 1995];
a detailed analysis of the method of weighting for LSE is given in Chapter 22 of that book. Wedin [1110,
1985] gives perturbation bounds for LSE based on the augmented system formulation. Cox and
Higham [274, 1999] analyze the accuracy and stability of three different nullspace methods for
problem LSE, give a perturbation theory, and derive practical error bounds. Rank-deficient LSE
problems are studied by Wei [1113, 1992]. An MGS algorithm for weighted and constrained
problems is presented by Gulliksson [550, 1995]. Barlow and Handy [74, 1988] compare the
method of weighting to that in Björck [126, 1968]. Reid [921, 2000] surveys the use of implicit
scaling for linear least squares problems. Barlow and Vemulapati [78, 1992] give a slightly
modified improvement scheme for the weighting method.

3.5 Inequality Constrained Problems


3.5.1 Classification of Problems
Often, the solution x ∈ Rn of a least squares problem is subject to a set of linear inequality
constraints cTi x ≤ di , i = 1, . . . , p, and has the following form:

Problem LSI
    min_x ∥Ax − b∥_2 subject to Cx ≤ d.                                       (3.5.1)

Here C ∈ Rp×n is a matrix with ith row cTi , and the inequalities are to be interpreted compo-
nentwise. A solution to problem LSI exists only if the set of points satisfying Cx ≤ d is not
empty. Problem LSI is equivalent to the quadratic programming problem

    min_x (1/2) x^T B x + c^T x      subject to   Cx ≤ d                      (3.5.2)

with B = ATA, c = −AT b. This arises as a subproblem in general nonlinear programming


algorithms and has been studied extensively. Algorithms for quadratic programming that form
B = ATA explicitly should be avoided if possible.
If A has full column rank, then problem LSI is a strictly convex optimization problem, and if
the constraints are feasible, there is a unique solution. In this case, problem LSI is known to be
solvable in polynomial time. The Lagrangian for problem LSI is

    L(x, y) = ∥Ax − b∥_2^2 + y^T (Cx − d),                                    (3.5.3)

where y ∈ R^p is the vector of Lagrange multipliers. The first-order optimality conditions are
given by the Karush–Kuhn–Tucker (KKT) conditions.

Theorem 3.5.1 (Karush–Kuhn–Tucker). Let x* be a local minimizer of problem LSI (3.5.1)
and assume that the Jacobian of the active constraints at x* has full rank. Then there exist
Lagrange multipliers y* such that ∇_x L(x*, y*) = 0, d − Cx* ≥ 0, and the KKT complementarity
condition holds:

    y^T (Cx − d) = 0,      y ≥ 0.                                             (3.5.4)

From the nonnegativity of d − Cx and y it follows that either y_i = 0 or the ith constraint is
binding: c_i^T x − d_i = 0. The vector y in the KKT conditions is called the dual solution.

An important special case of LSI is when the constraints are upper and lower bounds.
Problem BLS
    min_x ∥Ax − b∥_2 subject to l ≤ x ≤ u.                                    (3.5.5)

This is an LSI problem with C = (In , −In )T , d = (u, −l)T . Bound-constrained least squares
(BLS) problems arise in many practical applications, e.g., reconstruction problems in geodesy
and tomography, contact problems for mechanical systems, and modeling of ocean circulation.
It can be argued that the linear model is only realistic when the variables are constrained within
meaningful intervals. For computational efficiency it is essential that such constraints be consid-
ered separately from more general constraints, such as those in (3.5.1).
If only one-sided bounds on x are specified in BLS, it is no restriction to assume that these
are nonnegativity constraints. Then we have a linear nonnegative least squares problem.
Problem NNLS
    min_x ∥Ax − b∥_2 subject to x ≥ 0.                                        (3.5.6)
This problem arises when x represents quantities such as amounts of material, chemical concen-
trations, and pixel intensities. Applications include geodesy, tomography, and contact problems
for mechanical systems. The KKT conditions to be satisfied at an optimal NNLS solution are

y = AT (Ax − b), xT y = 0, x ≥ 0, y ≥ 0, (3.5.7)



where y is the gradient of (1/2)∥Ax − b∥_2^2. This is also known as a (monotone) linear complementar-
ity problem (LCP). LSI appears to be a more general problem than NNLS, but this is not the case.
Hanson and Haskell [588, 1982] give a number of ways any LSI problem can be transformed into
the form of NNLS.
If A is rank-deficient, the set M of optimal solutions may be an infinite manifold, although the
optimal objective value is unique. Then we can seek the unique solution of least norm, i.e., the
solution of min ∥x∥_2, x ∈ M. This can be formulated as a least distance problem.
Problem LSD
    min_x ∥x∥_2 subject to g ≤ Gx.                                            (3.5.8)

The solution to problem LSD can be obtained by an appropriate normalization of the resid-
ual in a related NNLS problem. The following result is shown by Lawson and Hanson [727,
1995, Chap. 23].

Theorem 3.5.2. Let u ∈ R^{m+1} solve the NNLS problem

    min_u ∥Eu − f∥_2 subject to u ≥ 0,                                        (3.5.9)

where

    E = [ G^T ] ,      f = [ 0 ]  } n
        [ g^T ]            [ 1 ]  } 1 .
Let r = (r1 , . . . , rn+1 )T = f − Eu be the residual corresponding to the NNLS solution, and set
σ = ∥r∥2 . If σ ̸= 0, then the unique solution to problem LSD is

x = (x1 , . . . , xn )T , xj = rj /rn+1 , j = 1, . . . , n. (3.5.10)

If σ = 0, the constraints g ≤ Gx are inconsistent, and problem LSD has no solution.
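Theorem 3.5.2 translates directly into a few lines of MATLAB using the built-in lsqnonneg for the NNLS
subproblem. The sketch below is illustrative; in particular, the exact zero test on the residual norm
would be replaced by a tolerance in practice.

function x = lsd_via_nnls(G, g)
% Least distance problem min ||x||_2 subject to g <= G*x via NNLS (sketch).
n = size(G, 2);
E = [G'; g'];  f = [zeros(n, 1); 1];      % data of the NNLS problem (3.5.9)
u = lsqnonneg(E, f);
r = f - E*u;                              % residual of the NNLS solution
if norm(r) == 0
    error('The constraints g <= G*x are inconsistent.');
end
x = r(1:n)/r(n+1);                        % solution formula (3.5.10)
end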

Two-step algorithms for solving the more general LSD problem,

    min_x ∥x_1∥_2 subject to g ≤ Gx,                                          (3.5.11)

where x = ( x_1 , x_2 )^T, are given by A. K. Cline [254, 1975], Haskell and Hanson [594, 1981],
and Hanson [587, 1986]. Each step requires the solution of a problem of type NNLS, with the
first step having additional linear equality constraints.
Haskell and Hanson [594, 1981] implement two subprograms, LSEI and WNNLS, for solv-
ing linearly constrained least squares problems with equality constraints (LSEI). This is more
general than problem LSI in that some of the equations in Ax = b are to be satisfied exactly.
A user’s guide to the two subroutines LSEI and WNNLS is given in Haskell and Hanson [593,
1979]. WNNLS is based on solving a differentially weighted least squares problem. A penalty
function minimization that is implemented in a numerically stable way is used.
Gradient projection methods and interior methods for problem NNLS are treated in Sec-
tions 8.3.1 and 8.3.2, respectively. General nonlinear optimization software can also be used to
solve nonlinear least squares problems with inequality constraints; see, e.g., Schittkowski [970,
1985] and Mahdavi-Amiri [768, 1981].

3.5.2 Active-Set Methods


Methods for solving problems with linear inequality constraints are, in general, iterative in na-
ture. A point x is called a feasible point if it satisfies all constraints Cx ≤ d in (3.5.1). The

constraints that are satisfied with equality at a feasible point are called active. In an active-set
algorithm a working set of constraints is defined to be a linearly independent subset of the active
set at the current approximation. In each iteration the value of the objective function is decreased,
and the optimum is reached in finitely many steps (assuming the constraints are independent).
A first useful step in solving problem LSI (3.5.1) is to transform A to upper triangular form.
With standard column pivoting, a rank-revealing QR factorization

    Q^T A P = R = [ R_11  R_12 ]  } r
                  [  0      0  ]  } m − r                                     (3.5.12)

is computed with R11 upper triangular and nonsingular. The numerical rank r of A is determined
using some specified tolerance, as discussed in Section 1.3.3. After the last (m − r) rows in R
and c are deleted, the objective function in (3.5.1) becomes

    ∥ ( R_11   R_12 ) x − c_1 ∥_2 ,      c = [ c_1 ] = Q^T b.
                                             [ c_2 ]

By further orthogonal transformations from the right in (3.5.12), a complete orthogonal decomposition

    Q^T A P V = [ T  0 ]  } r
                [ 0  0 ]  } m − r
is obtained, where T is upper triangular and nonsingular. With x = P V y, problem LSI (3.5.1)
is then equivalent to

    min_{y_1} ∥T y_1 − c_1∥_2      subject to   E y ≤ d,

where E = C P V, y = ( y_1 ; y_2 ), and y_1 ∈ R^r.
In general, a feasible point from which to start the algorithm is not known. An exception
is the case when all constraints are simple bounds, as in BLS and NNLS. In a first phase of
an active-set algorithm, a feasible point is determined by minimizing the sum of infeasibilities
∑ (c_i^T x − d_i) over the violated constraints.
Let xk be a feasible point that satisfies the working set of nk linearly independent constraints
with associated matrix Ck . The constraints in the working set are temporarily treated as equality
constraints. An optimum solution to the corresponding problem exists because the least squares
objective is bounded below. To solve this we take

xk+1 = xk + αk pk ,

where pk is a search direction and αk a nonnegative step length. The search direction is chosen
so that Ck pk = 0, which will cause the constraints in the working set to remain satisfied for all
values of αk . If moving toward the solution encounters an inactive constraint, this constraint is
added to the active set, and the process is repeated.
To satisfy the condition Ck pk = 0, a decomposition

Ck Qk = ( 0 Tk ) , Tk ∈ Rnk ×nk , (3.5.13)

is computed, where Tk is triangular and nonsingular, and Qk is a product of orthogonal transfor-


mations. (This is essentially the QR factorization of C_k^T.) If Q_k is partitioned conformally,

    Q_k = ( Z_k   Y_k ),      Z_k ∈ R^{n×(n−n_k)},   Y_k ∈ R^{n×n_k},         (3.5.14)

the n − nk columns of Zk form a basis for the nullspace of Ck . The condition Ck pk = 0 is


satisfied if we take pk = Zk qk , qk ∈ Rn−nk . Now qk is determined so that xk + Zk qk minimizes
the objective function. Hence qk solves the unconstrained least squares problem

    min_{q_k} ∥A Z_k q_k − r_k∥_2 ,      r_k = b − A x_k .                    (3.5.15)

To simplify the discussion we assume AZk has full rank, so that (3.5.15) has a unique solu-
tion. To compute this solution we need the QR factorization of AZk . This is obtained from the
QR factorization of A Q_k, where

    P_k^T A Q_k = P_k^T ( A Z_k   A Y_k ) = [ R_k  S_k ]  } n − n_k
                                            [  0   U_k ]  } n_k               (3.5.16)
                                            [  0    0  ]

Computing this larger decomposition has the advantage that the orthogonal matrix Pk need not
be saved and can be discarded after being applied also to the residual vector rk . The solution qk
to (3.5.15) can now be computed from

    R_k q_k = c_k ,      P_k^T r_k = [ c_k ]  } n − n_k
                                     [ d_k ]  } n_k .

Denote by ᾱ the maximum nonnegative step along pk for which xk+1 = xk + αk pk remains
feasible with respect to the constraints not in the working set. If ᾱ ≤ 1, then we take αk = ᾱ
and add the constraint that determines ᾱ to the working set for the next iteration. If ᾱ > 1, then
we set αk = 1. In this case, xk+1 will minimize the objective function when the constraints in
the working set are treated as equalities, and the orthogonal projection of the gradient onto the
subspace of feasible directions will be zero:

ZkT gk+1 = 0, gk+1 = −AT rk+1 .

In this case we check the optimality of xk+1 by computing Lagrange multipliers for the con-
straints in the working set. At xk+1 these are defined by the equation

CkT λ = gk+1 = −AT rk+1 . (3.5.17)

The residual vector r_{k+1} for the new unconstrained problem satisfies

    P_k^T r_{k+1} = [ 0   ] .
                    [ d_k ]

Hence, multiplying (3.5.17) by Q_k^T and using (3.5.13), we obtain

    Q_k^T C_k^T λ = [   0   ] λ = −Q_k^T A^T P_k [  0  ] ,
                    [ T_k^T ]                    [ d_k ]

so from (3.5.16),

    T_k^T λ = − ( U_k^T   0 ) d_k .
The Lagrange multiplier λi for the constraint cTi x ≥ di in the working set is said to be
optimal if λi ≥ 0. If all multipliers are optimal, an optimal point has been found. Otherwise, the
objective function can be decreased if we delete the corresponding constraint from the working
set. When more than one multiplier is not optimal, it is normal to delete the constraint whose
multiplier deviates the most from optimality.

At each iteration, the working set of constraints is changed, which leads to a change in Ck .
If a constraint is dropped, the corresponding row in Ck is deleted; if a constraint is added, a new
row is introduced in Ck . An important feature of an active-set algorithm is efficient solution of
the sequence of unconstrained problems (3.5.15). Techniques described in Section 7.2 can be
used to update the matrix decompositions (3.5.13) and (3.5.16). In (3.5.13), Qk is modified by
a sequence of orthogonal transformations from the right. Factorization (3.5.16) and the vector
PkT rk+1 are updated accordingly.
If xk+1 = xk , Lagrange multipliers are computed to determine if an improvement is possible
by moving away from one of the active constraints (by deleting it from the working set). In each
iteration, the value of the objective function is decreased until the KKT conditions are satisfied.
Active-set algorithms usually restrict the change in dimension of the working set by dropping
or adding only one constraint at each iteration. For large-scale problems this implies many
iterations when the set of active constraints at the starting point is far from the working set at the
optimal point. Hence, unless a good approximation to the final set of active constraints is known,
an active-set algorithm will require many iterations to converge.
In the rank-deficient case it can happen that the matrix AZk in (3.5.15) is rank-deficient,
and hence Rk is singular. Note that if some Rk is nonsingular, it can only become singular
during later iterations when a constraint is deleted from the working set. In this case only its last
diagonal element can become zero. This simplifies the treatment of the rank-deficient case. To
make the initial Rk nonsingular one can add artificial constraints to ensure that the matrix AZk
has full rank.
A possible further complication is that the working set of constraints can become linearly
dependent. This can cause possible cycling in the algorithm, so that its convergence cannot be
ensured. A simple remedy that is often used is to enlarge the feasible region of the offending
constraint by a small quantity.
If A has full column rank, the active-set algorithm for problem LSI described here is essen-
tially identical to an algorithm given by Stoer [1039, 1971]. LSSOL by Gill et al. [473, 1986] is a
set of Fortran subroutines for solving a class of convex quadratic programming problems that includes LSI.
It handles rank-deficiency in A, a combination of simple bounds, and general linear constraints.
It allows for a linear term in the objective function and uses a two-phase active-set method. The
minimizations in both phases are performed by the same subroutines.
For problems BLS and NNLS, active-set methods simplify. We outline an active-set algo-
rithm for problem BLS in its upper triangular form
    min_x ∥Rx − c∥_2 subject to l ≤ x ≤ u.                                    (3.5.18)

Divide the index set of x according to {1, 2, . . . , n} = F ∪ B, where i ∈ F if xi is a free


variable, and i ∈ B if xi is fixed at its lower or upper bound. We assume F and B are ordered
sets with indices in increasing order. To this partitioning corresponds a permutation matrix P =
(EF , EB ), where EF and EB consist of the columns ei of the unit matrix for which i ∈ F
and i ∈ B, respectively. Choose an initial solution x(0) satisfying l < x(0) < u, and take
F0 = {1, 2, . . . , n}, RB0 = ∅, so that RF0 = R. (The reason, as will become apparent, is that
it is cheaper and more stable to fix a free variable than the opposite operation.) Let x(k) be the
iterate at the kth step (k = 0, 1, . . .) and write
    x_{F_k}^{(k)} = E_{F_k}^T x^{(k)} ,      x_{B_k}^{(k)} = E_{B_k}^T x^{(k)}          (3.5.19)
for the free and fixed parts of the solution. The unconstrained problem (3.5.18) with the variables
x_{B_k}^{(k)} fixed becomes

    min_{x_{F_k}} ∥R_{F_k} x_{F_k} − c_k∥_2 ,      c_k = c − R_{B_k} x_{B_k}^{(k)} ,      (3.5.20)

where RPk = (REFk , REBk ) = (RFk , RBk ). To simplify the discussion we assume RFk has
full column rank, so that (3.5.20) has a unique solution. This is always the case if rank(A) = n.
To solve (3.5.20) we need the QR factorization of RFk . We obtain this by considering the
first block of columns of the QR factorization,
   
    Q_k^T ( R_{F_k} , R_{B_k} ) = [ U_k  S_k ] ,      Q_k^T c = [ d_k ] .     (3.5.21)
                                  [  0   V_k ]                  [ e_k ]

The solution to (3.5.20) is given by U_k x_{F_k}^{(k)} = d_k − S_k x_{B_k}^{(k)}, and we take

    x^{(k+1)} = x^{(k)} + α (z^{(k)} − x^{(k)}) ,

where z^{(k)} = E_{F_k} x_{F_k}^{(k)} + E_{B_k} x_{B_k}^{(k)} and α is a nonnegative step length. (Note that, assuming
rank(R) = n, z (0) is just the solution to the unconstrained problem.)
Let ᾱ be the maximum value of α for which x(k+1) remains feasible. There are now two
possibilities:

• If ᾱ < 1, then z^{(k)} is not feasible. We take α = ᾱ and move all indices q ∈ F_k for which
  x_q^{(k+1)} = l_q or u_q from F_k to B_k. Thus, the free variables that hit their lower or upper
  bounds will be fixed for the next iteration step.

• If ᾱ ≥ 1, we take α = 1. Then x^{(k+1)} = z^{(k)} is the unconstrained minimum when the
  variables x_{B_k} are kept fixed. The Lagrange multipliers are checked to see if the objective
  function can be decreased further by freeing one of the fixed variables. If not, we have
  found the global minimum.

At each iteration, the sets Fk and Bk change. If a constraint is dropped, a column from RBk is
moved to RFk ; if a constraint is added, a column is moved from RFk to RBk . Solution of the
sequence of unconstrained problems (3.5.20) and computation of the corresponding Lagrange
multipliers can be efficiently achieved, provided the QR factorization (3.5.21) can be updated.
In a similar active-set algorithm for problem NNLS, the index set of x is divided as
{1, 2, . . . , n} = F ∪ B, where i ∈ F if xi is a free variable and i ∈ B if xi is fixed at zero.
Ck now consists of the rows ei , i ∈ B, of the unit matrix In . We let Ck = EBT , and if EF is
similarly defined, then Pk = (EF EB ), Tk = Ink . Since Pk is a permutation matrix, the product

APk = (AEF AEB ) = (AF AB )

corresponds to a permutation of the columns of A.


To drop the bound corresponding to xq , we take APk+1 = APk PR (k, q), where the permu-
tation matrix PR (k, q), q > k + 1, performs a right circular shift of the columns

k + 1, . . . , q − 1, q ⇒ q, k + 1, . . . , q − 1.

Similarly, to add the bound corresponding to xq to the working set we take AQk+1 =
AQk PL (q, k), where PL (q, k), q < k − 1, is a permutation matrix that performs a left circular
shift of the columns
q, q + 1, . . . , k ⇒ q + 1, . . . , k, q.

Equation (3.5.17) for the Lagrange multipliers simplifies for NNLS to λ = −EBTAT rk+1 ,
where −AT rk+1 is the gradient vector. As an initial feasible point we take x = 0 and set F = ∅.

The least squares subproblems need not be solved from scratch. Instead, the QR factorization

    P^T A ( E_F , E_B ) = [ R  S ] ,      P^T b = [ c ] ,
                          [ 0  U ]                [ d ]

is updated after a right or left circular shift, using the algorithms described in Section 3.3.2; cf.
stepwise regression.
The pseudocode below is based on the NNLS algorithm given by Lawson and Hanson [727,
1995, Chapter 23]. The algorithm cannot cycle and terminates after a finite number of steps.
However, the number of iterations needed can be large and cannot be estimated a priori.

Algorithm 3.5.1 (Active-Set Algorithm for NNLS).

Initialization:
    F = ∅;  B = {1, 2, . . . , n};
    x = 0;  w = A^T b;
Main loop:
    while B ≠ ∅ and max_{i∈B} w_i > 0
        p = argmax_{i∈B} w_i;
        Move index p from B to F, i.e., free the variable x_p;
        Let z solve min_z ∥Az − b∥_2 subject to z_B = 0;
        while min_{i∈F} z_i ≤ 0
            Let α_i = x_i/(x_i − z_i) for all i ∈ F such that z_i ≤ 0;
            Find an index q such that α_q = min_{i∈F} α_i;  x = x + α_q (z − x);
            Move all indices i ∈ F for which x_i = 0 from F to B;
            Let z solve min_z ∥Az − b∥_2 subject to z_B = 0;
        end
        x = z;  w = A^T (b − Ax);
    end
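A direct MATLAB transcription of the algorithm is given below. It is only a sketch: the least squares
subproblems are re-solved from scratch with backslash instead of being updated, and the zero tolerance
is an ad hoc choice.

function x = nnls_activeset(A, b)
% Active-set NNLS, a straightforward transcription of Algorithm 3.5.1 (sketch).
n = size(A, 2);
F = false(n, 1);                          % logical mask of free variables
x = zeros(n, 1);
w = A'*b;                                 % negative gradient of (1/2)||Ax-b||^2 at x = 0
tol = 10*eps*norm(A, 1);
while any(~F) && max(w(~F)) > tol
    idx = find(~F);
    [~, j] = max(w(idx));  F(idx(j)) = true;   % free the bound variable with largest w_i
    z = zeros(n, 1);  z(F) = A(:, F)\b;        % unconstrained minimum over the free set
    while min(z(F)) <= 0
        Q = F & (z <= 0);
        alpha = min(x(Q)./(x(Q) - z(Q)));
        x = x + alpha*(z - x);
        F = F & (x > tol);                     % fix variables that hit their bound
        z = zeros(n, 1);  z(F) = A(:, F)\b;
    end
    x = z;
    w = A'*(b - A*x);                          % negative gradient at the new iterate
end
end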

NNLS is also available in MATLAB as the function lsqnonneg. Fortran implementations
of algorithms for the NNLS and LSD problems are also given in Appendix C of Lawson and
Hanson [727, 1995].
Perturbation bounds for the linear least squares problem subject to linear inequality con-
straints are given by Lötstedt [758, 1983]. For solving a sequence of constrained problems
with a slowly changing matrix A, iterative methods are particularly attractive. Lötstedt [759,
1984] gives an active-set algorithm for solving time-dependent simulation of contact problems
in mechanical systems. His algorithm uses preconditioned CG with a preconditioner that is kept
constant for several time steps.

Notes and references


A survey of algorithms for enforcing nonnegativity constraints in scientific least squares com-
putation is given by Chen and Plemmons [239, 2007]. The paper also surveys important appli-
cations in science and engineering. Bro and de Jong [181, 1997] modify the standard NNLS
algorithm by precomputing some cross-product terms in the normal equations. Their algo-
rithm, called fast NNLS, gives significant speed-up for problems with multiple right-hand sides,

and targets applications such as multiway decomposition methods for tensor arrays; see Sec-
tion 4.3.5. In a recent variant due to Van Benthem and Keenan [1072, 2004], the performance is
improved by identifying and grouping together observations at each stage that share a common
pseudoinverse.

3.5.3 Quadratically Constrained Least Squares


Least squares problems with a quadratic inequality constraint (LSQI) of the form
    min_x ∥Ax − b∥_2 subject to ∥Bx − d∥_2 ≤ γ,                               (3.5.22)

where A ∈ R^{m×n}, B ∈ R^{p×n}, arise in many applications. For the solution to be unique it is
necessary that the nullspaces of A and B intersect trivially, i.e.,

    N(A) ∩ N(B) = {0}   ⇔   rank [ A ] = n.                                   (3.5.23)
                                 [ B ]
Conditions for existence and uniqueness of solutions to problem LSQI and the related problem
LSQE with equality constraint ∥Bx−d∥2 = γ are given by Gander [438, 1981]. For a solution to
(3.5.22) to exist the set {x : ∥Bx − d∥2 ≤ γ} must not be empty. Furthermore, if ∥BxB − d∥2 <
γ, where x_B solves

    min_{x∈S} ∥Bx − d∥_2 ,      S = {x ∈ R^n | ∥Ax − b∥_2 = min},             (3.5.24)

the constraint is not binding.

Theorem 3.5.3. Assume that the constraint set {x : ∥Bx − d∥_2 = γ} is not empty. Then the
solution to problem LSQE equals the solution x(λ) of the normal equations (A^TA + λB^TB)x =
A^T b + λB^T d or, equivalently, of the least squares problem

    min_x ∥ [A; √λ B] x − [b; √λ d] ∥_2 ,                                     (3.5.25)
where λ ≥ 0 solves the secular equation
f (λ) = ∥Bx(λ) − d∥2 − γ = 0. (3.5.26)

Proof. By the method of Lagrange multipliers, we minimize ψ(x, λ), where

    ψ(x, λ) = ∥Ax − b∥_2^2 + λ(∥Bx − d∥_2^2 − γ^2).
Only positive values of λ are of interest. A necessary condition for a minimum is that the gradient
of ψ(x, λ) with respect to x equals zero. For λ ≥ 0 this shows that x(λ) solves (3.5.25). It can
be shown that f (λ) is a monotone decreasing function of λ. Hence the secular equation has a
unique positive solution.

The standard case of LSQE is obtained by taking B = I and d = 0 in (3.5.22):


    min_x ∥Ax − b∥_2 subject to ∥x∥_2 = γ.                                    (3.5.27)

Then x(λ) solves the regularized or damped least squares problem


   
    min_x ∥ [A; √λ I_n] x − [b; 0] ∥_2 ,                                      (3.5.28)
where λ > 0 satisfies the secular equation
f (λ) = ∥x(λ)∥2 − γ = 0. (3.5.29)

Damped least squares problems were used by Levenberg [736, 1944] in the solution of non-
linear least squares problems. Problem (3.5.28) can be solved by Householder QR factorization;
see Golub [487, 1965]. The structure of the initial and the transformed matrices after k = 2 steps
of Householder QR factorization is shown below for m = n = 4.
× × × × × × × ×
× × × × ⊗ × × ×
× × × × ⊗ ⊗ × ×
   
× × × × ⊗ ⊗ × ×
   
=⇒
× ⊗ ⊕ + +
   

× ⊕ + +
   
  
× ×
   
× ×
Only the first two rows of √λ I_n have filled in. In all steps, precisely n elements in the current
column are annihilated. Hence the Householder QR factorization requires 2mn^2 flops, which is
2n^3/3 flops more than for the QR factorization of A. A similar increase in arithmetic operations
occurs for MGS.
The standard form of problem LSQE can also be formulated as a least squares problem on
the unit sphere:
    min_{∥x∥_2 = γ} ∥Ax − b∥_2 .                                              (3.5.30)

This is a problem on the Stiefel (or equivalently, Grassman) manifold. Newton-type algorithms
for such problems are developed by Edelman, Arias, and Smith [358, 1999]. The application of
such methods to the regularized least squares problem has been studied by Eldén [374, 2002].

3.5.4 Solving Secular Equations


In order to solve the secular equation (3.5.29) numerically, we need to evaluate the function f (λ)
and preferably its derivative. This requires the solution of (3.5.28) for a sequence of values of
λ. As shown by Reinsch [923, 1971], instead of applying Newton’s method to (3.5.29), faster
convergence can be obtained by applying it to the convex function

    g(λ) = 1/∥x(λ)∥_2 − 1/γ = 0.                                              (3.5.31)
If the initial approximation satisfies 0 < λ0 < λ∗ , where λ∗ is the solution, then the iterates
λk can be shown to converge monotonically from below. (In the optimization literature this
observation is often credited to Hebden [598, 1973].) The asymptotic rate of convergence for
Newton’s method is quadratic. If derivatives are difficult to evaluate, the secant method can be
used. Provided that the two initial iterates are nonnegative, this also converges monotonically.
The derivative with respect to λ of g(λ) in (3.5.31) is

    dg(λ)/dλ = − (x^T(λ)/∥x(λ)∥_2^3) dx(λ)/dλ .
From x(λ) = C(λ)^{-1} A^T b, C(λ) = A^TA + λI, and the formula for the derivative of an inverse
matrix, we obtain

    x(λ)^T dx(λ)/dλ = −x(λ)^T (A^TA + λI)^{-1} x(λ) ≡ −∥z(λ)∥_2^2 .

Thus Newton's method for solving (3.5.31) becomes

    λ_{k+1} = λ_k + ( ∥x(λ_k)∥_2/γ − 1 ) ∥x(λ_k)∥_2^2/∥z(λ_k)∥_2^2 .          (3.5.32)

Each Newton iteration requires the solution of the damped least squares problem (3.5.28) for a
new value of λ. Hence a new QR decomposition must be computed. These QR factorizations
account for the main cost of an iteration.

Algorithm 3.5.2 (Reinsch’s Algorithm).


function [x,nx] = reinsch(A,b,gamma,p)
% REINSCH performs <= p iterations to solve
%    min_x ||A x - b||_2 subject to ||x||_2 = gamma
% ---------------------------------------------------
[m,n] = size(A);
rlam = m*eps*norm(A,1);  lam = rlam^2;     % small initial value of lambda
for k = 1:p
    % Compute the compact QR factorization of the damped matrix (3.5.28).
    [Q,R] = qr([A; rlam*eye(n)], 0);
    c = Q'*[b; zeros(n,1)];
    x = R\c;  nx = norm(x);
    if nx <= gamma, break, end
    % Perform the Newton step (3.5.32).
    z = R'\x;  nz = norm(z);
    lam = lam + (nx/gamma - 1)*(nx/nz)^2;
    rlam = sqrt(lam);
end
end
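For illustration, the routine above can be exercised on randomly generated test data; the problem setup
below is only an example and not taken from the text.

% Example call of reinsch on an ill-conditioned random test problem.
rng(0);
A = gallery('randsvd', [30 10], 1e8);   % 30-by-10 matrix with condition number 1e8
b = randn(30, 1);
gamma = 0.1*norm(A\b);                  % make the norm constraint active
[x, nx] = reinsch(A, b, gamma, 25);
fprintf('||x||_2 = %.3e, target gamma = %.3e\n', nx, gamma);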
 
An initial QR factorization

    ( A   b ) = Q [ R  c ]
                  [ 0  d ]

reduces the problem to

    min_x ∥ [R; √λ I_n] x − [c; 0] ∥_2 .                                      (3.5.33)

For any fixed value of λ this problem can be reduced by a second QR factorization to
minx ∥R(λ)x − c1 (λ)∥2 . Then x(λ) and z(λ) can be computed from the two triangular sys-
tems
R(λ)x(λ) = c1 (λ), R(λ)T z(λ) = x(λ).
Eldén [367, 1977] shows that further savings in the Newton iterations can be obtained by an
initial transformation

    U^T A V = [ B ] ,      U^T b = [ g_1 ] ,                                  (3.5.34)
              [ 0 ]                [ g_2 ]
where B is upper bidiagonal. Computing B and V requires 4mn^2 flops; see Section 4.2.1. With
x = V y, the least squares problem (3.5.28) is transformed into

    min_y ∥ [B; √λ I_n] y − [g_1; 0] ∥_2 .                                    (3.5.35)
To solve this for a given value of λ, two sequences of plane rotations,

    G_k = R_{k,n+k} ,  k = 1, . . . , n,      J_k = R_{n+k,n+k+1} ,  k = 1, . . . , n − 1,

are determined that transform the matrix in (3.5.35) to upper bidiagonal form B_λ:

    G_n J_{n−1} · · · G_2 J_1 G_1 [ B        g_1 ] = [ B_λ  z_1 ] .           (3.5.36)
                                  [ √λ I_n    0  ]   [  0   z_2 ]

Here G1 zeros the element in position (n + 1, 1) and creates a new nonzero element in position
(n + 2, 2). This is annihilated by a second plane rotation J1 that transforms rows n + 1 and n + 2.
All remaining steps proceed similarly. The solution is then obtained as

Bλ y(λ) = z1 , x(λ) = V y(λ). (3.5.37)

The QR factorization in (3.5.36) and the computation of y(λ) take about 23n flops. Eldén [367,
1977] gives a more detailed operation count and also shows how to compute the derivatives used
in Newton's method for the equation f(λ) = ∥y(λ)∥_2 − γ = 0.

3.6 Regularized Least Squares


3.6.1 Discrete Ill-Posed Linear Systems
In many different branches of physics and engineering one tries to determine, e.g., the structure
of a physical system from its behavior. Some important areas are medical imaging, geophysical
prospecting, image deblurring, and deconvolution of signals. Such problems are called inverse
problems and can often be modeled as a Fredholm integral equation of the first kind,
    ∫_Ω k(s, t) f(t) dt = g(s),      s, t ∈ Ω,                                (3.6.1)

where f and g are assumed to be real functions in the Hilbert space L2 (Ω), and the kernel
k(· , ·) ∈ L^2(Ω × Ω). Let K : L^2(Ω) → L^2(Ω) be the continuous linear operator defined by

    Kf = ∫_Ω k(· , t) f(t) dt.                                                (3.6.2)

By the Riemann–Lebesgue lemma there are rapidly oscillating functions f that come arbitrarily
close to being annihilated by K. Hence the inverse of K cannot be a continuous operator, and the
solution f does not depend continuously on the data g. Therefore (3.6.1) is called an ill-posed
problem, a term introduced by Hadamard.
A compact operator K admits a singular value expansion

Kvi = σi ui , K T ui = σi v i , i = 1, 2, . . . ,

where the functions u_i and v_i are orthonormal with respect to the inner product

    ⟨u, v⟩ = ∫_Ω u(t) v(t) dt ,      ∥u∥ = ⟨u, u⟩^{1/2} .

The infinitely many singular values σ_i decay quickly with i and cluster at zero. Therefore
(3.6.1) has a solution f ∈ L^2(Ω) only for special right-hand sides g. A necessary and sufficient
condition (Groetsch [539, 1984, Theorem 1.2.6]) is that g satisfies the Picard condition

    ∑_{i=1}^{∞} |u_i^T g/σ_i|^2 < ∞.                                          (3.6.3)

In most practical applications the kernel K of integral equation (3.6.1) is usually given ex-
actly by the mathematical model, while g consists of measured quantities known with a certain
accuracy at a finite set of points s1 , . . . , sn . To solve the integral equation (3.6.1) numerically, it
must first be reduced to a finite-dimensional matrix equation by discretization. This can be done

in several ways, e.g., by quadrature or collocation methods. Let s_i = −1 + ih, t_j = −1 + jh,
i, j = 0, 1, . . . , n + 1, be a uniform mesh on Ω = [−1, 1] with step size h = 2/(n + 1). With the
trapezoidal rule, the integral in (3.6.1) can be approximated by

    h ∑_{j=0}^{n} w_j K(s_i, t_j) f(t_j) = g(s_i),   i = 0 : m + 1,           (3.6.4)

where w_j = 1, j ≠ 0, n, and w_0 = w_n = 1/2. Taking m = n and s_i = t_i gives a linear system
Kf = g for the unknowns f_j = f(t_j), with

    K_ij = h w_j K(s_i, t_j),      g_i = g(s_i).

In the Galerkin method, an approximation f = ∑_{i=1}^{n} f_i φ_i in an n-dimensional subspace
V_n = span{φ_1, . . . , φ_n} of L^2(Ω) is determined by the condition

    ψ_j^T (Kf − g) = 0,   j = 1, . . . , n,

where W_n = span{ψ_1, . . . , ψ_n} is a second n-dimensional subspace of L^2(Ω). This leads again
to a finite-dimensional linear system Kf = g for the vector f = (f_1, . . . , f_n)^T, where

    K_ij = ψ_j^T K φ_i ,      g_j = ψ_j^T g.

The discretized system Kf = g or, more generally, the least squares problem min_f ∥Kf −
g∥_2, will inherit many properties of the integral equation (3.6.1). In the singular value decomposition

    K = U Σ V^T = ∑_{i=1}^{n} σ_i u_i v_i^T ,

the singular values σi will decay rapidly and cluster near zero with no evident gap between any
two consecutive singular values. Such matrices have an ill-determined numerical rank, and the
corresponding problem is a discrete ill-posed problem. The solution usually depends mainly
on a few larger singular values σ1 , . . . , σp , p ≪ n. The effective condition number for the exact
right-hand side g,
κe = σ1 /σp ≪ κ(K), p ≪ n, (3.6.5)
is usually small. The concept of effective condition number was introduced by Varah [1086,
1973]; see also Chan and Foulser [226, 1988].
Example 3.6.1. Consider the Fredholm integral equation (3.6.1) with kernel K(s, t) = e^{−(s−t)^2}.
Let Kf = g be the system of linear equations obtained by discretization with the trapezoidal rule
on a square mesh with m = n = 100. The singular values of K are displayed in logarithmic
scale in Figure 3.6.1. They decay toward zero, and there is no distinct gap anywhere in the spec-
trum. For i > 30, σi are close to roundoff level, and in double precision the numerical rank of
K certainly is smaller than 30.
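The computation in this example is easy to reproduce. The MATLAB sketch below discretizes the kernel
with the trapezoidal rule as in (3.6.4) and inspects the singular values; the mesh details are
illustrative and need not match the example exactly.

% Discretize the kernel exp(-(s-t)^2) on [-1,1] and plot the singular values.
n = 100;  h = 2/(n + 1);
t = -1 + (0:n+1)'*h;  s = t;                 % uniform mesh, collocation at s_i = t_i
w = [0.5; ones(n, 1); 0.5];                  % trapezoidal weights
K = h*exp(-(s - t').^2).*w';                 % K(i,j) = h*w_j*exp(-(s_i - t_j)^2)
sv = svd(K);
semilogy(sv, 'o'), xlabel('k'), ylabel('\sigma_k')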

The discretized linear system Kf = g will only have a meaningful solution for right-hand
sides g that satisfy a discrete version of the Picard condition; see Hansen [573, 1990]. If g is
affected by noise, then the exact solution for the noisy right-hand side g = gexact + e will
bear no resemblance to the noise-free true solution. A consequence of the Picard condition for
the continuous problem is that the coefficients ci = uTi g for gexact in the SVD solution of the
discretized system

    f = K^† g_exact = ∑_{i=1}^{n} c_i σ_i^{-1} v_i                            (3.6.6)


Figure 3.6.1. Singular values σi of a discretized integral operator. Used with permission of
Springer International Publishing; from Numerical Methods in Matrix Computations, Björck, Åke, 2015;
permission conveyed through Copyright Clearance Center, Inc.

must eventually decrease faster than σi . However, when gexact is contaminated with errors, any
attempt to solve the discrete ill-posed problem numerically without restriction of the solution
space will give a meaningless result.
In many applications the kernel k(s, t) depends only on the difference s − t, and the integral
equation has the form
    ∫_0^1 h(s − t) f(t) dt = g(s),      0 ≤ s, t ≤ 1.                         (3.6.7)

The problem of computing f given h and g is a deconvolution problem. An example is gravity


surveying, where one wants to determine the unknown mass distribution f underground from
measurements of the vertical gravity field g at the surface; see Hansen [577, 2002]. Another
example is the inverse heat equation; see Eldén [373, 1995]. If in the discretization the quadrature
points t_j and the collocation points s_i are identical and equidistantly spaced, then K = (k_ij) is
a Toeplitz matrix with constant entries along each diagonal, i.e., k_ij = t_{j−i}.
This allows very efficient solution algorithms to be developed.

3.6.2 Truncated SVD


The purpose of a regularization method is to diminish the effect of noise in the data and produce
a good approximation of the noise-free solution to an ill-conditioned linear system Ax = b. In
truncated SVD (TSVD) the approximate solution is taken to be

xk = A†k b = Vk Σ†k UkT b,

where Ak = Uk Σk VkT is the SVD expansion (3.6.6) of A ∈ Rm×n truncated to k ≪ n terms.


Recall that Ak is the best rank-k approximation of A. Furthermore, for some tolerance δk it
holds that
∥A − Ak ∥2 = ∥AVk⊥ ∥2 ≤ δk , Vk⊥ = (vk+1 , . . . , vn ),
where the columns of Vk⊥ span an approximate nullspace of A. The number of terms to include
in the SVD expansion depends on the noise level in the data. It should be chosen so that a

large reduction in the norm of the residual b − Axk is achieved without causing the norm of the
approximate solution xk to become too large. The TSVD method is widely used as a general-
purpose method for small to medium-sized ill-posed problems. For many ill-posed problems the
solution can be well approximated by the TSVD solution with a small number of terms. Such
problems are called effectively well-conditioned.
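A TSVD solution is computed in a few lines of MATLAB; the function name is illustrative and the
truncation index k is assumed to be supplied by the user.

function xk = tsvd(A, b, k)
% Truncated SVD solution x_k = V_k * inv(Sigma_k) * U_k' * b (sketch).
[U, S, V] = svd(A, 'econ');
s = diag(S);
c = U(:, 1:k)'*b;                 % coefficients c_i = u_i'*b, i = 1:k
xk = V(:, 1:k)*(c./s(1:k));       % truncated SVD expansion
end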
In statistical literature, TSVD is known as principal component regression (PCR) and often
formulated in terms of the eigenvectors of ATA instead of the SVD of A; see Massy [781, 1965].
The Gauss–Markov theorem (Theorem 1.1.4) states that the least squares solution is the best
unbiased linear estimator of x, in the sense that it has minimum variance. If A is ill-conditioned,
this minimum variance is still large. In regularization the variance can be substantially decreased
by allowing the estimator to be biased.
In TSVD the components selected from the SVD expansion correspond to the k largest sin-
gular values. The right-hand side b can have larger projections on some singular vectors corre-
sponding to smaller singular values. In such a case, one could take into account also the size of
the coefficients ci = uTi b when choosing which components of the SVD expansion to include in
the approximation.
Hansen, Sekii, and Shibahashi [585, 1992] introduce the modified TSVD (MTSVD) solution x_{B,k}
that solves the least squares problem

    min_{x∈S} ∥Bx∥_2 ,      S = { x | ∥A_k x − b∥_2 = min},                   (3.6.8)

where B ∈ Rp×n is a matrix that penalizes solutions that are not smooth. The TSVD solution
is obtained by taking B = I. The MTSVD problem has the same form as (2.3.10) used in
Section 2.3.2 to resolve rank-deficiency. The solution can be written in the form

    x_{B,k} = x_k − V_k⊥ z ,      z ∈ R^{n−k} ,                               (3.6.9)

where x_k = V_k Σ_k^{-1} U_k^T b is the TSVD solution, and the columns of V_k⊥ span the nullspace of A_k.
The vector z can be computed from the QR factorization of B V_k⊥ ∈ R^{p×(n−k)} as the solution to
the least squares problem

    min_z ∥(B V_k⊥) z − B x_k∥_2 .                                            (3.6.10)
It is often desired to compute a sequence of MTSVD solutions for decreasing values of k.
When k is decreased, more columns are added to the left of B V_k⊥. This makes it costly to update
the QR factorization of B V_k⊥. It is more efficient to work with

    Ṽ_k⊥ = V_k⊥ P_k = (v_n , . . . , v_{n−k}),

where P_k is the permutation matrix that reverses the columns in V_k⊥. Then in the sequence of
QR factorizations of Ṽ_k⊥, k = 1, 2, . . . , columns are added to the right,

    v_n ,  (v_n , v_{n−1}),  (v_n , v_{n−1} , v_{n−2}),  . . . .
Hansen et al. [585, 1992] illustrate the use of MTSVD for a problem in helioseismology in
astrophysics. The fundamental equation is a Fredholm equation, and the regularizing operator is
an approximation to the second-derivative operator. While the TSVD solution shows unrealistic
oscillations, the MTSVD solution behaves much better.

Notes and references


Methods for computing truncated SVD solutions by RRQR (see Section 2.3.5) are studied by
Chan and Hansen [227, 1990], [228, 1992]. Hansen [576, 1998] surveys the use of RRQR factor-
izations for solving discrete ill-posed problems. He shows that when the ratio σk+1 /σk is small,

the subspaces spanned by the selected columns for such methods are almost identical to those
from TSVD.

3.6.3 Tikhonov Regularization


Tikhonov regularization is the most widely used regularization method for ill-posed problems.
In its general form the regularized solution is taken to be the solution of a least squares problem

    min_x { ∥Ax − b∥_2^2 + λ ∥Lx∥_2^2 }.                                      (3.6.11)

The parameter λ > 0 governs the balance between a small-residual norm and the regularity
of the solution as measured by ∥Lx∥2 . Attaching Tikhonov’s name to the method is moti-
vated by the groundbreaking work of Tikhonov [1063, 1963] and Tikhonov and Arsenin [1062,
1977]. Early works by other authors on Tikhonov regularization are surveyed by Hansen [579,
2010, Appendix C]. Regularization methods of the form (3.6.11) have been used by many other
authors for smoothing noisy data and in methods for nonlinear least squares; see Levenberg [736,
1944]. In statistics the method is known as ridge regression.
Often L is chosen as a discrete approximation of some derivative operator. Typical choices
are

    L_1 = [ 1  −1           ]          L_2 = [ −1  2  −1           ]
          [     ⋱    ⋱      ] ,              [      ⋱   ⋱    ⋱     ] ,        (3.6.12)
          [         1   −1  ]                [         −1  2  −1   ]

or a combination of these. These operators have a smoothing effect on the solution and are
called smoothing-norm operators; see Hanke and Hansen [570, 1993]. Note that L1 and L2
are banded matrices with small bandwidth and full row rank. Their nullspaces are explicitly
known:

N (L1 ) = w1 = (1, 1, . . . , 1)T ,


N (L2 ) = span (w1 , w2 ), w2 = (1, 2, . . . , n)T .

Clearly, any component of the solution in N (L) will not be affected by the regularization term
λ2 ∥Lx∥2 . Since the nullspaces are spanned by very smooth vectors, it will not be necessary to
regularize this part of the solution.
Any combination of the matrices L1 , L2 and the unit matrix can also be used. This corre-
sponds to a discrete approximation of a Sobolev norm. It is no restriction to assume that L ∈
Rp×n , p ≤ n. If p > n, then a QR factorization of L can be performed so that ∥Lx∥2 = ∥R2 x∥2 ,
where R2 = QT L has at most n rows. Then R2 can be substituted for L in (3.6.11).
The solution of (3.6.11) satisfies the normal equations

(ATA + λLT L)x = AT b. (3.6.13)

If A and L have no nullspace in common and λ > 0, there is a unique solution. Writing the
normal equations as

    A^T r − λ L^T L x = 0 ,      r = b − Ax,

shows that they are equivalent to the augmented system

    [ I     A       ] [ r ]   [ b ]
    [ A^T  −λ L^T L ] [ x ] = [ 0 ] .                                         (3.6.14)

Forming the normal equations can be avoided by noting that problem (3.6.11) is equivalent
to the least squares problem

    min_x ∥ [A; √λ L] x − [b; 0] ∥_2 ,                                        (3.6.15)

which can be solved by QR factorization.
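In MATLAB the stacked formulation (3.6.15) can be handed directly to backslash, which factors the
stacked matrix by QR; a minimal sketch, with the function name and the parameter lambda assumed to be
supplied by the user:

function x = tikhonov(A, b, L, lambda)
% Tikhonov regularization in general form, via the stacked problem (3.6.15).
p = size(L, 1);
x = [A; sqrt(lambda)*L] \ [b; zeros(p, 1)];
end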


The standard form of Tikhonov regularization is obtained for L = I. Then condition
(3.5.23) is trivially satisfied, and the normal equations are

(ATA + λI)x = AT b. (3.6.16)

The solution of (3.6.16) can be written as the filtered sum

    x(λ) = ∑_{i=1}^{n} f_i(λ) (u_i^T b/σ_i) v_i ,      f_i = σ_i^2/(σ_i^2 + λ),          (3.6.17)

where σi are the singular values and ui , vi are the singular vectors of A. The quantities fi ,
0 ≤ fi < 1, are called filter factors (in statistics, shrinkage factors) and are decreasing functions
of λ. If λ ≪ σ_i^2, then f_i ≈ 1, and the corresponding terms are almost the same as without
regularization. On the other hand, if λ ≫ σ_i^2, then f_i ≪ 1, and the corresponding terms are
damped. For a suitably chosen λ, x(λ) is approximately the same as the TSVD solution.
Often the Tikhonov problem (3.6.11) arises with A and L banded upper triangular matrices.
For example, in fitting a polynomial smoothing spline of degree 2p − 1 to m data points, the
half-bandwidth will be p and p + 1, respectively; see Reinsch [923, 1971]. Eldén [370, 1984]
shows how to reduce such regularized problems to the form
   
    min_x ∥ [R_1; √λ R_2] x − [d_1; 0] ∥_2 ,                                  (3.6.18)

where R1 and R2 are banded. For a fixed value of λ the regularized problem can be reduced
further to upper triangular form. Recall that the order in which the rows are processed in this
QR factorization is important for efficiency; see Section 4.1. Unnecessary fill-in is avoided
by first sorting the rows so the matrices are in standard form; see Section 4.1.4. The rows
are then processed from the top down using plane rotations to give a matrix of banded upper
triangular form. For a given value of λ this requires n(w_1 + w_2 − 1) rotations and 4n(w_1^2 + w_2^2)
flops. It can easily be generalized to problems involving several upper triangular band matrices
λ1 R1 , λ2 R2 , λ3 R3 , etc.

Example 3.6.2. Consider the banded regularized least squares problem (3.6.18) where R1 and
R2 are banded upper triangular matrices with bandwidth w1 = 3 and w2 = 2, respectively. First,
let Givens QR factorization be carried out without reordering the rows. Below right is shown the
reduced matrix after the first three columns have been brought into upper triangular form. Note
that the upper triangular part R2 has completely filled in:
   
    [nonzero structure diagrams: left, the stacked matrix with the rows of R_1 followed by those of
    √λ R_2; right, the partially reduced matrix after the first three columns have been brought into
    upper triangular form, in which the triangular part originating from R_2 has filled in completely]

For a similar problem of size (2n−1, n) and bandwidth w the complete QR factorization requires
n(n + 1)/2 plane rotations and about 2n(n + 1)w flops.
Consider now the application of Givens QR factorization after the rows have been preordered
to put A in standard form as shown below left. To the right is shown the matrix after the first
three columns have been brought into upper triangular form. Here the algorithm is optimal in the
sense that no new nonzero elements are created, except in the final (uniquely determined) R. For
a similar matrix of size (2n − 1) × n the factorization only requires n(w + 1)/2 Givens rotations,
and a total of approximately 2n(w + 1)w flops, an improvement by a factor of n/w:
   
    [nonzero structure diagrams: left, the same matrix with its rows preordered to standard form;
    right, the matrix after the first three columns have been reduced, where no new nonzero elements
    are created outside the final (uniquely determined) R]

Groetsch [539, 1984] has shown that when the exact solution is very smooth and in the
presence of noisy data, Tikhonov regularization cannot reach the optimal solution that the data
allows. In those cases the solution can be improved by using iterated Tikhonov regularization,
suggested by Riley [929, 1956]. In this method a sequence of improved approximate solutions is
computed by
x(0) = 0, x(q+1) = x(q) + δx(q) ,
where δx^{(q)} solves the regularized least squares problem

    min_{δx} ∥ [A; √λ I] δx − [r^{(q)}; 0] ∥_2 ,      r^{(q)} = b − A x^{(q)} .          (3.6.19)

This iteration may be implemented very effectively because only one QR factorization is needed;
see Golub [487, 1965]. A related scheme is suggested by Rutishauser [952, 1968]. The convergence
can be expressed in terms of the SVD of A as

    x^{(q)}(λ) = ∑_{i=1}^{n} (c_i f_i^{(q)}/σ_i) v_i ,      f_i^{(q)} = 1 − ( λ/(σ_i^2 + λ) )^q .          (3.6.20)

Thus, for q = 1 we have the standard regularized solution, and x(q) → A† b as q → ∞. When
iterated sufficiently many times, Tikhonov regularization will reach an accuracy that cannot be
improved significantly by any other method; see Hanke and Hansen [570, 1993].

3.6.4 Determining the Regularization Parameter


Determining a suitable value of the regularization parameter in TSVD or Tikhonov regularization
is often a major difficulty. If the “noise level” η in the data b is known, the discrepancy principle
of Morozov [814, 1984] can be used. In TSVD the expansion is then truncated when the residual
satisfies
∥b − Axk ∥2 ≤ η. (3.6.21)
Similarly, in Tikhonov regularization,

    min_x { ∥Ax − b∥_2^2 + λ ∥x∥_2^2 },      A ∈ R^{m×n} ,                    (3.6.22)

the parameter λ is chosen as the smallest value for which (3.6.21) is satisfied. The attained
accuracy can be very sensitive to the value of η. It has been observed that the discrepancy
principle tends to give a slightly oversmoothed solution. This means that not all the information
present in the data is recovered.
When no prior information about the noise level in the data is available, a great number of
different methods have been proposed. All of them have the common property that they require
the solution for many values of the regularization parameter λ. The L-curve method was first
proposed by Lawson and Hanson [727, 1995, Chapter 26]. It derives its name from a plot in a
doubly logarithmic scale of the curve (∥b − Axλ ∥2 , ∥xλ ∥2 ), which typically is shaped like the
letter L. Choosing λ near the “corner” of this L-curve represents a compromise between a small
residual and a small solution. The L-curve method is further studied and refined by Hansen [574,
1992]. Hansen and O’Leary [584, 1993] propose choosing λ more precisely as the point on the
L-curve where the curvature has the largest magnitude. Advantages and shortcomings of this
method are discussed by Hansen [576, 1998]. For large-scale problems it may be too expensive
to compute sufficiently many points on the L-curve. Calvetti et al. [199, 2002] show how to
compute cheap upper and lower bounds in this case.
In generalized cross-validation (GCV) the parameter in Tikhonov regularization is esti-
mated directly from the data. The underlying statistical model is that the components of b are
subject to random errors of zero mean and covariance matrix σ 2 Im , where σ 2 may or may not
be known. The predicted values of b are written as A x_λ = P_λ b, where

    P_λ = A A_λ^† ,      A_λ^† = (A^TA + λI)^{-1} A^T                         (3.6.23)

is the symmetric influence matrix. When σ^2 is known, Craven and Wahba [278, 1979] suggest
that λ should be chosen to minimize an unbiased estimate of the expected true mean square error.
When m is large, this minimizer is asymptotically the same as for the GCV function

    G_λ = ∥b − A x_λ∥_2^2 / ( trace(I_m − P_λ) )^2                            (3.6.24)
(see Golub, Heath, and Wahba [493, 1979]). The GCV method can also be used in other ap-
plications, such as truncated SVD methods and subset selection. The GCV function is invariant
under orthogonal transformations of A. It can be very flat around the minimum, and localizing
the minimum numerically can be difficult; see Varah [1088, 1983].
Ordinary cross-validation is based on the following idea; see Allen [17, 1974]. Let xλ,i be
the solution of the regularized problem when the ith equation is left out. If this solution is a
good approximation, then the error in the prediction of the ith component of the right-hand side
should be small. This is true for all i = 1 : m. Generalized cross-validation is a rotation-invariant
version of ordinary cross-validation.
For standard Tikhonov regularization the GCV function (3.6.24) can be expressed in terms of the SVD A = U\Sigma V^T as

P_\lambda = A A_\lambda^\dagger = U \begin{pmatrix} \Omega & 0 \\ 0 & 0 \end{pmatrix} U^T,   (3.6.25)
where \Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n), \omega_i = \sigma_i^2/(\sigma_i^2 + \lambda). An easy calculation shows that

\|(I_m - P_\lambda)b\|_2^2 = \sum_{i=1}^{n} \Bigl(\frac{\lambda c_i}{\sigma_i^2 + \lambda}\Bigr)^2 + \sum_{i=n+1}^{m} c_i^2,   (3.6.26)

where U^Tb = (c_1, c_2, \ldots, c_m)^T. Since the \omega_i are the eigenvalues of P_\lambda, we further have

\mathrm{trace}\,(I_m - P_\lambda) = m - \sum_{i=1}^{n} \omega_i = m - n + \sum_{i=1}^{n} \frac{\lambda}{\sigma_i^2 + \lambda}.   (3.6.27)

For the general case B ̸= I, formulas similar to (3.6.26) and (3.6.27) can be derived from the
GSVD, or a transformation to the standard case can be used; see Section 3.6.5.
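For the standard-form case, the formulas (3.6.26)–(3.6.27) can be used directly to evaluate the GCV function on a grid of values of λ. A minimal MATLAB sketch of such a computation is given below; the function name gcv_svd and its calling sequence are illustrative only, and the sketch assumes m ≥ n and that computing the SVD of A is affordable.

function [lambda_opt, G] = gcv_svd(A, b, lambdas)
% GCV_SVD evaluates the GCV function (3.6.24) on a grid of
% regularization parameters, using the SVD formulas
% (3.6.26)-(3.6.27), and returns the minimizer on the grid.
% Assumes m >= n and that the SVD of A is affordable.
[m, n] = size(A);
[U, S, ~] = svd(A, 0);                  % thin SVD, U is m-by-n
sigma = diag(S);
c = U'*b;                               % c_i = u_i'*b, i = 1:n
res0 = max(norm(b)^2 - norm(c)^2, 0);   % sum_{i>n} c_i^2
G = zeros(size(lambdas));
for j = 1:numel(lambdas)
    lam = lambdas(j);
    f = lam./(sigma.^2 + lam);          % filter factors 1 - omega_i
    G(j) = (sum((f.*c).^2) + res0)/(m - n + sum(f))^2;
end
[~, jmin] = min(G);
lambda_opt = lambdas(jmin);
end

As the text notes, the GCV function can be very flat near its minimum, so in practice the grid of λ values should span several orders of magnitude.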

Example 3.6.3 (Golub and Van Loan [512, 1989, Problem 12.1.5]). For A = (1, 1, . . . , 1)T ∈
Rm×1 and b ∈ Rm the GCV function becomes
G_\lambda = \frac{m(m-1)s^2 + \nu^2 m^2 \bar{b}^2}{(m-1+\nu)^2}, \qquad \nu = \frac{\lambda}{m+\lambda},

where

\bar{b} = \frac{1}{m}\sum_{i=1}^{m} b_i, \qquad s^2 = \frac{1}{m-1}\sum_{i=1}^{m} (b_i - \bar{b})^2.

It can be readily verified that G_\lambda is minimized for \nu = s^2/(m\bar{b}^2), and the optimal value of \lambda is

\lambda_{\mathrm{opt}} = \bigl((\bar{b}/s)^2 - 1/m\bigr)^{-1}.

The GCV function can also be computed from the QR factorization

(\, A \;\; b \,) = Q \begin{pmatrix} R & c_1 \\ 0 & c_2 \end{pmatrix}.

The regularized problem (3.6.22) is equivalent to \min_x \|Rx - c_1\|_2^2 + \lambda\|x\|_2^2. By a second QR factorization this can be reduced further to

Q_\lambda^T \begin{pmatrix} R & c_1 \\ \sqrt{\lambda}\, I_n & 0 \end{pmatrix} = \begin{pmatrix} R_\lambda & d_1 \\ 0 & d_2 \end{pmatrix},   (3.6.28)

where R_\lambda is upper triangular and R^TR + \lambda I_n = R_\lambda^T R_\lambda. The solution of the regularized problem is then obtained by solving R_\lambda x(\lambda) = d_1, and the residual norm is given by

\|b - Ax_\lambda\|_2^2 = \|c_2\|_2^2 + \|d_2\|_2^2.   (3.6.29)

To compute the trace term in the GCV function we first note that AA†λ = AMλ−1 AT , where

Mλ−1 = (ATA + λI)−1 = (RλT Rλ )−1

equals the covariance matrix. From elementary properties of the trace function,

\mathrm{trace}\,(I_m - A M_\lambda^{-1} A^T) = m - \mathrm{trace}\,(M_\lambda^{-1} A^TA)   (3.6.30)
  = m - \mathrm{trace}\,(I_n - \lambda M_\lambda^{-1}) = m - n + \lambda\, \mathrm{trace}\,(M_\lambda^{-1}).   (3.6.31)

Hence, the trace computation is reduced to computing the sum of the diagonal elements of the
covariance matrix C = (RλT Rλ )−1 . An efficient algorithm for this is given by Eldén [371, 1984].
By reducing A to upper bidiagonal form U T AV = B and setting y = V T x, a regularization
problem of bidiagonal form is obtained. This can be further reduced by plane rotations to upper
triangular form

Q^T \begin{pmatrix} B & g_1 \\ \sqrt{\lambda}\, I_n & 0 \end{pmatrix} = \begin{pmatrix} B_\lambda & z_1 \\ 0 & z_2 \end{pmatrix}   (3.6.32)

with B_\lambda upper bidiagonal,

B_\lambda = \begin{pmatrix} \rho_1 & \theta_2 & & \\ & \rho_2 & \ddots & \\ & & \ddots & \theta_n \\ & & & \rho_n \end{pmatrix}.

If b_i^T denotes the ith row of B_\lambda^{-1}, then \mathrm{trace}\,((B_\lambda^T B_\lambda)^{-1}) = \sum_{i=1}^{n} \|b_i\|_2^2. From the identity B_\lambda B_\lambda^{-1} = I we obtain the recursion

\rho_n b_n = e_n, \qquad \rho_i b_i = e_i - \theta_{i+1} b_{i+1}, \quad i = n-1, \ldots, 2, 1.

Because B_\lambda^{-1} is upper triangular, b_{i+1} is orthogonal to e_i. Hence

\|b_n\|_2^2 = 1/\rho_n^2, \qquad \|b_i\|_2^2 = \bigl(1 + \theta_{i+1}^2 \|b_{i+1}\|_2^2\bigr)/\rho_i^2, \quad i = n-1, n-2, \ldots, 1.   (3.6.33)
This algorithm for computing the trace term requires only O(n) flops.
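A minimal MATLAB sketch of this O(n) recursion is given below; the function name trace_inv_btb and the argument names are illustrative, with rho and theta holding the diagonal and superdiagonal of B_\lambda.

function t = trace_inv_btb(rho, theta)
% TRACE_INV_BTB evaluates trace((B_lam'*B_lam)^{-1}) in O(n) flops
% for an upper bidiagonal B_lam with diagonal rho(1:n) and
% superdiagonal theta(2:n) (theta(1) is not used), using the
% recursion (3.6.33).
n = length(rho);
nb = zeros(n,1);           % nb(i) = ||b_i||_2^2, b_i' the ith row of inv(B_lam)
nb(n) = 1/rho(n)^2;
for i = n-1:-1:1
    nb(i) = (1 + theta(i+1)^2*nb(i+1))/rho(i)^2;
end
t = sum(nb);
end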
Hutchinson and de Hoog [649, 1985] give a similar method for computing the GCV function
for smoothing noisy data with polynomial spline functions of degree 2p − 1. It is based on the
observation that only the elements in the central 2p + 1 bands of the inverse of the influence
function Pλ (3.6.23) are needed. These elements can be computed efficiently from the Cholesky
factor of Pλ . Their algorithm fully exploits the banded structure of the problem and only re-
quires O(p2 m) operations. A Fortran implementation for p = 2 is given in Hutchinson and
de Hoog [650, 1986].

3.6.5 Transformation to Standard Form


Most methods for determining the regularization parameter require repeated solution of the reg-
ularized problem for a sequence of values of λ. For generalized Tikhonov regularization
\min_x \ \|Ax - b\|_2^2 + \lambda\|Lx\|_2^2,   (3.6.34)

this can be expensive. Great savings can be achieved by an initial transformation into a problem
of standard form. If L has full column rank, such a reduction can easily be made as follows. Let L = Q_L R_L be the (thin) QR factorization. Then R_L \in \mathbb{R}^{n\times n} is nonsingular, and with y = R_L x, problem (3.6.34) becomes

\min_y \ \|\tilde{A}y - \tilde{b}\|_2^2 + \lambda\|y\|_2^2,   (3.6.35)

where \tilde{A} = A R_L^{-1}, \tilde{b} = b, and x = R_L^{-1} y.
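A minimal MATLAB sketch of this reduction for the full-column-rank case follows; the function name tikhonov_general is illustrative, and the transformed problem is solved here with a single augmented least squares solve rather than the more efficient schemes discussed above.

function [x, y] = tikhonov_general(A, b, L, lambda)
% TIKHONOV_GENERAL solves min ||A*x-b||_2^2 + lambda*||L*x||_2^2
% for L with full column rank via the standard-form
% transformation (3.6.35).
n = size(A, 2);
[~, RL] = qr(L, 0);                    % thin QR of L; RL is n-by-n
At = A/RL;                             % A_tilde = A*inv(RL)
y = [At; sqrt(lambda)*eye(n)] \ [b; zeros(n,1)];
x = RL\y;                              % x = inv(RL)*y
end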
If rank(L) < n as for the smoothing matrices L1 and L2 in (3.6.12), the above transforma-
tion does not work, but one can proceed as in Eldén [367, 1977]. Let

L^T = (\, W_1 \;\; W_2 \,) \begin{pmatrix} R_L \\ 0 \end{pmatrix} = W_1 R_L   (3.6.36)

be the QR factorization of L^T. (When L is banded, this QR factorization only requires O(n) flops.) Then L^\dagger = W_1 R_L^{-T} is the pseudoinverse of L, and W_2 \in \mathbb{R}^{n\times t} is orthonormal and spans
N (L). With
x = L† y + W2 z (3.6.37)
the solution x is split into two orthogonal components with residual vector r = b − AL† y −
(AW_2)z. If A and L have no nullspace in common, it follows that rank(AW_2) = t. Computing
AW_2 and its QR factorization gives

AW_2 = Q \begin{pmatrix} U \\ 0 \end{pmatrix}, \qquad Q = (Q_1, Q_2),   (3.6.38)

where U \in \mathbb{R}^{t\times t} is nonsingular. Then
Q^T(Ax - b) = \begin{pmatrix} Q_1^T(AL^\dagger y - b) - Uz \\ Q_2^T(AL^\dagger y - b) \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \end{pmatrix},

where for any given y, z can be determined so that r1 = 0. Thus, the generalized problem is
reduced to the standard form (3.6.35) with
\tilde{A} = Q_2^T(AL^\dagger), \qquad \tilde{b} = Q_2^T b.

Finally, z is found from U z = QT1 (AL† y − b) and x is retrieved from (3.6.37).


The solution to (3.6.34) can also be expressed in terms of the GSVD (3.1.53) of the matrix
pair (A, L); see Varah [1087, 1979]. If rank(L) = p, the GSVD has the form

AW = U \begin{pmatrix} \mathrm{diag}(\alpha_i) & 0 \\ 0 & I_{n-p} \\ 0 & 0 \end{pmatrix}, \qquad LW = V \bigl(\, \mathrm{diag}(\beta_i) \;\; 0 \,\bigr),

where U and V are orthogonal and αi2 + βi2 = 1, i = 1, . . . , p. The columns of W =


(w1 , . . . , wn ) are ATA-orthogonal, i.e.,

(Awi )T (Awj ) = 0, i ̸= j.

The generalized normal equations (A^TA + \lambda L^TL)x = A^Tb simplify to

\biggl[ \begin{pmatrix} \mathrm{diag}(\alpha_i^2) & 0 \\ 0 & I_{n-p} \end{pmatrix} + \lambda \begin{pmatrix} \mathrm{diag}(\beta_i^2) & 0 \\ 0 & 0 \end{pmatrix} \biggr] y = \begin{pmatrix} \mathrm{diag}(\alpha_i) & 0 \\ 0 & I_{n-p} \end{pmatrix} U^Tb,   (3.6.39)

where x = Wy. The solution to (3.6.34) can then be written x = x_1 + x_2, where


x_1 = \sum_{i=1}^{p} \frac{(u_i^Tb)\,\sigma_i}{\sigma_i^2 + \lambda}\, w_i, \qquad x_2 = \sum_{i=p+1}^{n} (u_i^Tb)\, w_i,   (3.6.40)

and σi = αi /βi . The second term x2 ∈ N (L) is the unregularized part of the solution. This
GSVD splitting resembles expansion (3.6.17) of the solution for a problem in standard form and
has the property that Ax1 ⊥ Ax2 .
The two components in (3.6.40) are solutions to two independent least squares problems

\min_{A^TAx_1 \perp \mathcal{N}(L)} \|Ax_1 - b\|_2^2 + \lambda\|Lx_1\|_2^2, \qquad \min_{x_2 \in \mathcal{N}(L)} \|Ax_2 - b\|_2,   (3.6.41)

where the second problem is independent of \lambda. In the QR factorization (3.6.36) of L^T, the orthonormal columns of W_2 span \mathcal{N}(L), and the solution can be obtained from

x2 = W2 z, z = (AW2 )† b. (3.6.42)

From the QR factorization of AW2 (3.6.38), we obtain

x2 = W2 U −1 (QT1 b). (3.6.43)

Usually, the dimension of N (L) is very small and the cost of computing x2 is negligible. The first
problem in (3.6.41) can be transformed into standard form using the A-weighted pseudoinverse
of L introduced by Eldén [369, 1982],

L†A = (I − P )L† , P = (A(I − L† L))† A, (3.6.44)

where I − L† L = PN (L) . Setting x1 = L†A y, we have Lx1 = LL†A y = y, and the first problem
in (3.6.41) becomes
\min_y \ \|A L_A^\dagger y - b\|_2^2 + \lambda\|y\|_2^2.   (3.6.45)

Because W2 and AW2 have full column rank, it follows that

(APN (L) )† = (AW2 W2† )† = W2 (AW2 )† ,

and hence
L†A = (I − P )L† , P = W2 (AW2 )† A. (3.6.46)
It can be verified (Hansen [580, 2013]) that P^2 = P and P^T \neq P. Hence P is an oblique
projector onto N (L) along the A-orthogonal complement of N (L) in Rn ; cf. Theorem 3.1.3. It
can also be shown that L†A satisfies four conditions similar to the Penrose conditions in Theo-
rem 1.2.10.
The amount of work in the above reduction to standard form is often negligible compared
to the amount required for solving the resulting standard form problem. The use of such a
reduction in direct and iterative regularization methods is studied by Hanke and Hansen [570,
1993], where a slightly different implementation is used. The use of transformation to standard
form in iterative regularization methods is treated in Section 6.4.

Notes and references


Hansen [576, 1998] gives an excellent survey of numerical aspects of solving rank-deficient and
discrete ill-posed problems. Direct and iterative algorithms for discrete inverse problems are
treated and illustrated by tutorial examples in Hansen [579, 2010] as well as Hanke [569, 2017].
Regularization methods for large-scale ill-posed problems are given by Hanke and Hansen [570,
1993]. Regularization methods for nonlinear ill-posed problems are treated by Engl, Hanke, and
Neubauer [386, 1996]. Hansen, Nagy, and O’Leary [583, 2006] consider applications to deblur-
ring and filtering images. The use of RRQR factorizations for solving discrete ill-posed problems
is analyzed by Hansen [576, 1998]. The development of the MATLAB regularization toolbox
is described in Hansen [575, 1994, 578, 2007]. The current version can be downloaded from
www.mathworks.com/matlabcentral/. A so-called trust-region method for regularization of
large-scale discrete ill-posed problems is described by Rojas and Sorensen [932, 2002], with a
MATLAB implementation given by Rojas, Santos, and Sorensen [931, 2008].
Chapter 4

Special Least Squares Problems

4.1 Band Least Squares Problems


4.1.1 Properties of Band Matrices
Band matrices occur frequently in many algorithms for computing eigenvalues and singular val-
ues as well as in problems of approximations and differential equations. We define the band-
width of a square matrix A ∈ Rn×n to be

w = \max_{a_{ij} \neq 0} |i - j|,   (4.1.1)

i.e., all nonzero elements in each row of A lie in at most w contiguous positions. If w ≪ n,
then only a small proportion of the n2 elements are nonzero, and they are located in a band
centered along the principal diagonal. Band linear systems and least squares problems with
small bandwidth w ≪ n arise in applications where each variable xi is coupled to only a few
other variables xj such that |j − i| is small. Clearly, the bandwidth of a matrix depends on the
ordering of its rows and columns.
The lower bandwidth r and upper bandwidth s are the smallest integers such that

aij = 0 if j < i − r or j > i + s, i = 1, . . . , n. (4.1.2)

In other words, the number of nonzero diagonals below and above the main diagonal are r and
s, respectively. The maximum number of nonzero elements in any row is w = r + s + 1. For
example, the matrix
\begin{pmatrix}
a_{11} & a_{12} & & & & \\
a_{21} & a_{22} & a_{23} & & & \\
a_{31} & a_{32} & a_{33} & a_{34} & & \\
 & a_{42} & a_{43} & a_{44} & a_{45} & \\
 & & a_{53} & a_{54} & a_{55} & a_{56} \\
 & & & a_{64} & a_{65} & a_{66}
\end{pmatrix}   (4.1.3)
has r = 2, s = 1, and w = 4. If A is symmetric, then r = s. Several frequently occurring classes
of band matrices have special names, e.g., a matrix for which r = s = 1 is called tridiagonal. If
r = 0, s = 1 (r = 1, s = 0), the matrix is called upper (lower) bidiagonal.
To avoid storage of zero elements, the diagonals of a band matrix can be stored either as
columns in an array of dimension n × w or as rows in an array of dimension w × n. For example,

the matrix in (4.1.3) above can be stored as

\begin{pmatrix}
* & * & a_{11} & a_{12} \\
* & a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{42} & a_{43} & a_{44} & a_{45} \\
a_{53} & a_{54} & a_{55} & a_{56} \\
a_{64} & a_{65} & a_{66} & *
\end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix}
* & a_{12} & a_{23} & a_{34} & a_{45} & a_{56} \\
a_{11} & a_{22} & a_{33} & a_{44} & a_{55} & a_{66} \\
a_{21} & a_{32} & a_{43} & a_{54} & a_{65} & * \\
a_{31} & a_{42} & a_{53} & a_{64} & * & *
\end{pmatrix}.
Except for a few elements indicated by asterisks in the initial and final rows, only nonzero ele-
ments of A are stored. Passing along a column in the first storage scheme moves along a diagonal
of the matrix, and the rows are aligned. Some elements in the lower right corner are not used.
How to perform the subscript computations efficiently in algorithms where the matrix is stored
in band mode is described in “Contribution I/4” of Wilkinson and Reinsch [1123, 1971] and in
Chapter 4 of Dongarra et al. [322, 1979].
It is convenient to use the following MATLAB notation for manipulating band matrices.

Definition 4.1.1. If a ∈ Rn is a vector, then A = diag (a, k) is a square matrix of order n + |k|
with the elements of a on its kth diagonal, where k = 0 is the main diagonal, k > 0 is above
the main diagonal, and k < 0 is below the main diagonal. If A is a square matrix of order n,
then diag (A, k) \in \mathbb{R}^{n-|k|}, |k| < n, is the column vector consisting of the elements of the kth
diagonal of A.

For example, diag (A, 0) is the main diagonal of A, and if 0 ≤ k < n, the kth superdiagonal
and subdiagonal of A are

diag (A, k) = (a1,k+1 , a2,k+2 , . . . , an−k,n )T ,


diag (A, −k) = (ak+1,1 , ak+2,2 , . . . , an,n−k )T .

Clearly, the product of two diagonal matrices D1 and D2 is another diagonal matrix whose
elements are equal to the elementwise product of the diagonals. The following elementary but
very useful result shows which diagonals in the product of two square band matrices are nonzero.

Theorem 4.1.2. Let A \in \mathbb{R}^{n\times n} and B \in \mathbb{R}^{n\times n} have lower bandwidths r_1 and r_2 and upper bandwidths s_1 and s_2, respectively. Then the products AB and BA have lower bandwidth at most r_1 + r_2 and upper bandwidth at most s_1 + s_2.

Proof. The elements of C = AB are c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}. By definition, a_{ik} = 0 if k < i - r_1 and b_{kj} = 0 if k > j + r_2. It follows that a_{ik} b_{kj} = 0 unless i - r_1 \leq k \leq j + r_2, which requires i \leq j + (r_1 + r_2), i.e., C has lower bandwidth at most r_1 + r_2. The statement for the upper bandwidth follows from the observations that if a matrix has upper bandwidth s, then its transpose has lower bandwidth s, and that (AB)^T = B^TA^T.
Note that Theorem 4.1.2 holds also for negative values of the bandwidths. For example, a
strictly upper triangular matrix A can be said to have lower bandwidth r = −1. It follows that
A2 has lower bandwidth r = −2, etc., and An = 0.
When the bandwidths of A \in \mathbb{R}^{n\times n} and B \in \mathbb{R}^{n\times n} are small compared to n, the usual algorithms for forming the product AB are not effective on vector computers. Instead, the product can be formed by writing A and B as sums of their diagonals and multiplying crosswise. For example, if A and B are tridiagonal, then by Theorem 4.1.2 the product C = AB has upper and lower bandwidths two. The five nonzero diagonals of C can be computed by 3^2 = 9 pointwise vector multiplications, independent of n.
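A minimal MATLAB sketch of this crosswise multiplication for two tridiagonal matrices is given below; the function name tridiag_mult is illustrative, and the matrices are passed as ordinary n-by-n arrays so that the diagonals can be extracted with diag as in Definition 4.1.1.

function C = tridiag_mult(A, B)
% TRIDIAG_MULT forms C = A*B for tridiagonal A and B by
% crosswise multiplication of their diagonals (cf. Theorem 4.1.2).
% Nine pointwise vector products give the five nonzero diagonals
% of C, independent of n.
n = size(A,1);
am1 = diag(A,-1); a0 = diag(A,0); ap1 = diag(A,1);
bm1 = diag(B,-1); b0 = diag(B,0); bp1 = diag(B,1);
c0 = a0.*b0;                                   % main diagonal
c0(2:n)   = c0(2:n)   + am1.*bp1;
c0(1:n-1) = c0(1:n-1) + ap1.*bm1;
cp1 = a0(1:n-1).*bp1 + ap1.*b0(2:n);           % diag(C,1)
cm1 = am1.*b0(1:n-1) + a0(2:n).*bm1;           % diag(C,-1)
cp2 = ap1(1:n-2).*bp1(2:n-1);                  % diag(C,2)
cm2 = am1(2:n-1).*bm1(1:n-2);                  % diag(C,-2)
C = diag(c0) + diag(cp1,1) + diag(cm1,-1) + diag(cp2,2) + diag(cm2,-2);
end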

Definition 4.1.3. A square matrix A ∈ Rn×n , n ≥ 2, is said to be reducible if there is a


partitioning of the index set {1, 2, . . . , n} into two nonempty disjoint subsets S and T such that
aij = 0 whenever i ∈ S and j ∈ T . Otherwise, A is called irreducible. Equivalently, A is
reducible if there is a permutation matrix P such that

P^TAP = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix},   (4.1.4)

where A11 and A22 are nonempty square submatrices.

If A is reducible, then after a permutation the linear system Ax = b is reduced to P T AP y =


P T b = c, y = P T x, or

A11 y1 + A12 y2 = c1 , A22 y2 = c2 .

The original system has been reduced to two smaller sets of equations. Hence, only the diagonal
blocks A11 and A22 need to be factorized. If again A11 or A22 is reducible, then such a reduction
can be carried out again. This can be continued until a triangular block form with irreducible
diagonal blocks is obtained. This observation motivates the term reducible.
It is well known (see Duff, Erisman, and Reid [344, 1986]) that the inverse of any irreducible
matrix A is structurally full, i.e., it is always possible to find numerical values such that all entries
in A−1 will be nonzero. In particular, the inverse of an irreducible band matrix in general has no
zero elements. Therefore, it is important to avoid computing the inverse explicitly. Even storing
the elements of A−1 may be infeasible. However, if the LU factorization of A can be carried out
without pivoting, then the band structure in A is preserved in the LU factors.

Theorem 4.1.4. Let A have lower bandwidth r and upper bandwidth s, and assume that the factorization A = LU exists, i.e., that it can be carried out without row or column permutations. Then L has lower bandwidth r and U has upper bandwidth s.

Proof. The proof is by induction. Assume that the first k − 1 columns of L and rows of U have
bandwidths r and s. Then for p = 1 : k − 1,

lip = 0, i > p + r, upj = 0, j > p + s. (4.1.5)

The assumption is trivially true for k = 1. Since akj = 0 for j > k + s, (4.1.5) yields
u_{kj} = a_{kj} - \sum_{p=1}^{k-1} l_{kp} u_{pj} = 0 - 0 = 0, \qquad j > k + s.

Similarly, it follows that lik = 0, i > k + r, which completes the induction step.
An important but hard problem is to find a reordering of the rows and columns of A that
minimizes the bandwidth of the LU or Cholesky factors. However, there are heuristic algorithms
that give almost optimal results; see Section 5.1.5.
When A is tridiagonal,

A = \begin{pmatrix}
a_1 & c_2 & & & \\
b_2 & a_2 & c_3 & & \\
 & \ddots & \ddots & \ddots & \\
 & & b_{n-1} & a_{n-1} & c_n \\
 & & & b_n & a_n
\end{pmatrix},   (4.1.6)

L can be taken to be lower unit bidiagonal with subdiagonal elements γ2 , . . . , γn , and U to be


upper bidiagonal with diagonal d1 , . . . , dn and superdiagonal c2 , . . . , cn . Equating elements in
A and LU shows that the upper diagonal in U equals that in A. The other elements in L and U
are obtained by the following recursion: d1 = a1 ,

γk = bk /dk−1 , dk = ak − γk ck , k = 2 : n. (4.1.7)

Here γk and dk can overwrite bk and ak , respectively. The solution to the system Ax = L(U x) =
g is then obtained by solving Ly = g by forward substitution: y_1 = g_1,

yi = gi − γi yi−1 , i = 2, . . . , n, (4.1.8)

and then solving U x = y by back-substitution: xn = yn /dn ,

xi = (yi − ci+1 xi+1 )/di , i = n − 1, . . . , 2, 1. (4.1.9)

The total number of flops for the factorization is about 3n and is 2.5n for the solution of the
triangular systems. Note that the divisions in the substitution can be avoided if (4.1.7) is modified
to compute d−1k . This may be more efficient because on many computers a division takes more
time than a multiplication.
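A minimal MATLAB sketch of the recursions (4.1.7)–(4.1.9) is given below; the function name tridiag_solve is illustrative, the vectors a, b, c hold the diagonals of A as in (4.1.6) with b(1) and c(1) unused, and no pivoting is performed, so A is assumed to admit an LU factorization (for example, if A is diagonally dominant).

function x = tridiag_solve(a, b, c, g)
% TRIDIAG_SOLVE solves A*x = g for tridiagonal A with diagonal
% a(1:n), subdiagonal b(2:n), and superdiagonal c(2:n), using the
% LU recursions (4.1.7)-(4.1.9). No pivoting is performed.
n = length(a);
d = zeros(n,1); gamma = zeros(n,1); y = zeros(n,1); x = zeros(n,1);
d(1) = a(1);
for k = 2:n                          % factorization (4.1.7)
    gamma(k) = b(k)/d(k-1);
    d(k) = a(k) - gamma(k)*c(k);
end
y(1) = g(1);                         % forward substitution (4.1.8)
for i = 2:n
    y(i) = g(i) - gamma(i)*y(i-1);
end
x(n) = y(n)/d(n);                    % back-substitution (4.1.9)
for i = n-1:-1:1
    x(i) = (y(i) - c(i+1)*x(i+1))/d(i);
end
end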

4.1.2 Band Cholesky Factorization


Taking advantage of a band structure of A ∈ Rm×n in a least squares problem minx ∥Ax − b∥2
can lead to significant savings. The matrix of normal equations can be formed rowwise as
C = A^TA = \sum_{i=1}^{m} a_i a_i^T

in about mw^2 flops, where a_i^T, i = 1, \ldots, m, denote the rows of A.

Lemma 4.1.5. Assume that A ∈ Rm×n has row bandwidth w. Then ATA has upper and lower
bandwidths s = r = w − 1.

Proof. From the definition (4.1.1) of bandwidth it follows that aij aik ̸= 0 ⇒ |j − k| < w. This
implies that
|j - k| \geq w \;\Rightarrow\; (A^TA)_{jk} = \sum_{i=1}^{m} a_{ij} a_{ik} = 0,

and hence s ≤ w − 1.

Theorem 4.1.6. Let C = LLT be the Cholesky factorization of the symmetric positive definite
band matrix C. Then the symmetric matrix L + LT inherits the band structure of C.

Proof. The proof is similar to that of Theorem 4.1.4.

The next algorithm computes the Cholesky factor L of a symmetric (Hermitian) positive
definite matrix C using a column sweep ordering. Recall that no pivoting is needed for stability.
Only the lower triangular part of A is used.

Algorithm 4.1.1 (Band Cholesky Algorithm).


function L = bcholf(A,r)
% BCHOLF computes the lower triangular Cholesky
% factor L of a positive definite Hermitian
% matrix A of upper and lower bandwidth r.
% --------------------------------------------
n = size(A,1);
for j = 1:n
    p = min(j+r,n); q = max(1,j-r);
    ik = q:j-1; jn = j+1:p;
    A(j,j) = sqrt(A(j,j) - A(j,ik)*A(j,ik)');
    A(jn,j) = (A(jn,j) - A(jn,ik)*A(j,ik)')/A(j,j);
end
L = tril(A);

If r ≪ n, then this algorithm requires about nr(r+3) flops and n square roots to compute the
Cholesky factor L. When r ≪ n this is much less than the n^3/3 flops required in the full case.
In the semidefinite case, diagonal pivoting is required, which can destroy the band structure.
The least squares solution is obtained by solving the triangular systems R^Ty = c = A^Tb and Rx = y, where R = L^T, by forward and back substitution:

y_i = \Bigl(c_i - \sum_{j=p}^{i-1} r_{ji} y_j\Bigr)\Big/ r_{ii}, \quad i = 1, \ldots, n, \quad p = \max(1, i-r),

x_i = \Bigl(y_i - \sum_{j=i+1}^{q} r_{ij} x_j\Bigr)\Big/ r_{ii}, \quad i = n, \ldots, 1, \quad q = \min(i+r, n).

Efficient band versions of band forward and back-substitution can be derived. Each requires
about 2(2n − r)(r + 1) ≈ 4nr flops and can be organized so that y and, finally, x overwrite
c in storage. Thus, if full advantage is taken of the band structure of the matrices involved, the
solution of a least squares problem where A has bandwidth w ≪ n requires a total of about
(m + n)w2 + 4nw flops.
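Putting the pieces together, a minimal MATLAB sketch of the normal-equation approach for a band least squares problem is given below; the function name band_lsq_ne is illustrative, it calls bcholf from Algorithm 4.1.1, and, since it forms A'*A explicitly, it is only appropriate for well-conditioned problems.

function x = band_lsq_ne(A, b, w)
% BAND_LSQ_NE solves min ||A*x - b||_2 via the normal equations
% when A has row bandwidth w. The band Cholesky factor is computed
% with bcholf (Algorithm 4.1.1) and the two triangular systems are
% solved by band forward and back substitution.
n = size(A, 2); r = w - 1;          % bandwidth of A'*A (Lemma 4.1.5)
C = A'*A; c = A'*b;
L = bcholf(C, r);                   % lower triangular band factor
y = zeros(n,1); x = zeros(n,1);
for i = 1:n                         % forward substitution, L*y = c
    p = max(1, i-r);
    y(i) = (c(i) - L(i,p:i-1)*y(p:i-1))/L(i,i);
end
for i = n:-1:1                      % back-substitution, L'*x = y
    q = min(i+r, n);
    x(i) = (y(i) - L(i+1:q,i)'*x(i+1:q))/L(i,i);
end
end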
Let A be symmetric positive definite and tridiagonal as in (4.1.6) with ci = bi , i = 2, . . . , n,
and write the Cholesky factorization in symmetric form as A = LDLT , D = diag (d1 , . . . , dn ).
Then the elements in D and L are obtained as follows. Set d1 = a1 , and

γk = bk /dk−1 , dk = ak − γk bk , k = 2, . . . , n. (4.1.10)

Eliminating γk gives
dk = ak − b2k /dk−1 , k = 2, . . . , n. (4.1.11)
Sometimes it is more convenient to set LD = U T and determine the factorization A = U T D−1 U .

4.1.3 Computing Elements of the Covariance Matrix


Methods for computing the covariance matrix

Cx = (ATA)−1 = (RT R)−1 = R−1 R−T (4.1.12)

are given in Section 2.1.5. In the case when R is banded, or generally sparse, an algorithm by
Golub and Plemmons [505, 1980] can be used with great savings to compute all elements of Cx
in positions where R has nonzero elements. This includes the diagonal elements of Cx that give
the variances of the least squares solution x. We denote by K the index set K = \{(i, j) : r_{ij} \neq 0\}.
Note that because R is the Cholesky factor of ATA, its structure is such that (i, j) ∈ K and
(i, k) ∈ K imply that (j, k) ∈ K if j < k and (k, j) ∈ K if j > k. From (4.1.12) it follows that

RCx = R−T , (4.1.13)

where R−T is lower triangular with diagonal elements 1/rii , i = 1, . . . , n. Equating the last
columns of the identity (4.1.13) gives
R c_n = r_{nn}^{-1} e_n, \qquad e_n = (0, \ldots, 0, 1)^T,

where c_n is the last column of C_x. From this equation the elements in c_n can be computed by back-substitution, giving c_{nn} = r_{nn}^{-2} and

c_{in} = -r_{ii}^{-1} \sum_{\substack{j=i+1 \\ (i,j)\in K}}^{n} r_{ij}\, c_{jn}, \quad i = n-1, \ldots, 1.   (4.1.14)

By symmetry c_{ni} = c_{in}, i = 1, \ldots, n-1, so the last row of C_x is also determined. We only need to save the elements whose row indices are greater than or equal to the row index of the first nonzero element in the kth column of R.
Now assume that the elements cij , (i, j) ∈ K, in columns j = n, . . . , k + 1 have been
computed. Then

c_{kk} = r_{kk}^{-1} \Bigl[\, r_{kk}^{-1} - \sum_{\substack{j=k+1 \\ (k,j)\in K}}^{n} r_{kj}\, c_{kj} \Bigr].   (4.1.15)

Similarly, for i = k-1, \ldots, f_k,

c_{ik} = -r_{ii}^{-1} \Bigl[ \sum_{\substack{j=i+1 \\ (i,j)\in K}}^{k} r_{ij}\, c_{jk} + \sum_{\substack{j=k+1 \\ (i,j)\in K}}^{n} r_{ij}\, c_{kj} \Bigr], \qquad (i,k) \in K.   (4.1.16)

If the Cholesky factor R has bandwidth p, then the elements of C_x in the p + 1 bands of the upper triangular part of C_x can be computed by the above algorithm in about \frac{2}{3} np(p+1) flops. An important particular case is when R is bidiagonal. Computing the two diagonals of C_x then requires only about \frac{4}{3} n flops.
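For the bidiagonal case, a minimal MATLAB sketch of these recursions is given below; the function name bidiag_covariance is illustrative, with d and e holding the diagonal and superdiagonal of R.

function [cdiag, csup] = bidiag_covariance(d, e)
% BIDIAG_COVARIANCE computes the diagonal cdiag(i) = C(i,i) and the
% superdiagonal csup(i) = C(i,i+1) of the covariance matrix
% C = inv(R'*R) for an upper bidiagonal R with diagonal d(1:n) and
% superdiagonal e(1:n-1), using the recursions (4.1.14)-(4.1.16).
n = length(d);
cdiag = zeros(n,1); csup = zeros(n-1,1);
cdiag(n) = 1/d(n)^2;
for k = n-1:-1:1
    csup(k)  = -(e(k)/d(k))*cdiag(k+1);            % c_{k,k+1}
    cdiag(k) = (1/d(k))*(1/d(k) - e(k)*csup(k));   % c_{k,k}
end
end

For example, for R = [1 1; 0 1] this returns the diagonal (2, 1) and superdiagonal (-1) of inv(R'*R), which is easily checked directly.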

4.1.4 QR Factorization of a Band Matrix


We now consider how to use the more stable Householder and Givens QR factorizations to solve
banded least squares problems. In the standard Householder QR, we set A(1) = A and

A(k+1) = Pk A(k) , k = 1, . . . , n,

where the sequence of Householder reflections Pk is chosen to annihilate the subdiagonal ele-
ments in the kth column of A(k) . As shown by Reid [919, 1967] this will cause each column in
the remaining unreduced part of the matrix that has a nonzero inner product with the column be-
ing reduced to take on the sparsity pattern of their union. Hence, even though the final R retains
the bandwidth of A, large intermediate fill can take place with consequent cost in operations and
storage. Thus, Householder QR factorization of a banded matrix A ∈ Rm×n of bandwidth w


can require as much as 2mnw flops, and the intermediate storage required can exceed by a large
amount that needed for the final factors. Note also that forming Q explicitly should be avoided,
because this factor may contain an order of magnitude more nonzero elements than either A or
R. This rules out the use of MGS (modified Gram–Schmidt) for banded matrices.
For QR factorization of a band matrix, the order in which the rows are processed is critical
for efficiency. We say that a band matrix A is in standard form if its rows are ordered so that
fi (A), i = 1, 2, . . . , m, form a nondecreasing sequence

i ≤ k ⇒ fi (A) ≤ fk (A).

A band matrix A in standard form can be written in partitioned form as

A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_q \end{pmatrix}, \qquad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_q \end{pmatrix},   (4.1.17)

where A_k \in \mathbb{R}^{m_k \times n} and m = m_1 + \cdots + m_q. Here A_k consists of all rows of A for which
the first nonzero element is in column k, k = 1, . . . , q. The row ordering within the blocks Ak
may be specified by sorting the rows so that the column indices li (Ak ) within each block form a
nondecreasing sequence. In many applications, A is naturally given in standard form.
The structure of ATA, and therefore of R, can be generated as follows.

Theorem 4.1.7. Assume that A ∈ Rm×n is a band matrix in standard form partitioned as in
(4.1.17), and let wk be the bandwidth of the block Ak . Then the band structure of the upper
triangular Cholesky factor R is given by l1 (R) = w1 ,

lk (R) = max{wk + k − 1, lk−1 (R)}, k = 2, . . . , q. (4.1.18)

Proof. The proof is by induction.

The first efficient QR algorithm for band matrices was given in Chapter 27 of Lawson and
Hanson [727, 1995]. It uses Householder transformations and assumes that A is partitioned in the
form (4.1.17). First, R = R0 is initialized to be an empty upper triangular matrix of bandwidth
w. The QR factorization proceeds in q steps, k = 1, 2, . . . , q. In step k the kth block Ak is
merged as follows into the previously computed upper triangular (Rk−1 , dk−1 ) by an orthogonal
transformation:

Q_k^T \begin{pmatrix} R_{k-1} \\ A_k \end{pmatrix} = \begin{pmatrix} R_k \\ 0 \end{pmatrix},
where Qk is a product of Householder reflections giving an upper trapezoidal Rk . Note that
this and later steps do not involve the first k − 1 rows and columns of Rk−1 . Hence, at the
beginning of step k the first k − 1 rows of Rk−1 are rows in the final matrix R. At termination
we have obtained (Rq , dq ) such that Rq x = dq . This Householder algorithm uses a total of about
2w(w + 1)(m + 3n/2) flops.

Example 4.1.8. The least squares approximation of a discrete set of data by a linear combination
of cubic B-splines is often used for smoothing noisy data values observed at m distinct points; see Reinsch [922, 1967] and Craven and Wahba [278, 1979]. Let s(t) = \sum_{j=1}^{n} x_j B_j(t), where B_j(t), j = 1, \ldots, n, are normalized cubic B-splines, and let (y_i, t_i), i = 1, \ldots, m, be given data points. Determine x to minimize

\sum_{i=1}^{m} \bigl(s(t_i) - y_i\bigr)^2 = \|Ax - y\|_2^2.

The only B-splines with nonzero values for t ∈ [λk−1 , λk ] are Bj , j = k, k + 1, k + 2, k + 3.


Hence A will be a band matrix with w = 4. For a problem with m = 17, n = 10, A consists of
blocks Ak , k = 1, . . . , 7. The Householder QR factorization is illustrated in Figure 4.1.1, where
A is shown after the first three blocks have been reduced by Householder reflections P1 , . . . , P9 .
Elements zeroed by Pj are denoted by j and fill elements by +.


Figure 4.1.1. The matrix A after reduction of the first k = 3 blocks using Householder reflections.

We now describe an alternative algorithm using plane rotations for the QR factorization of a
band matrix.

Algorithm 4.1.2 (Sequential Givens QR Factorization).


Let A ∈ Rm×n be a band matrix in standard form with row bandwidth w. Initialize R = R0
to be an upper triangular matrix of bandwidth w with zero elements. For i = 1, . . . , m, plane
rotations are used to merge the ith row with Ri−1 , giving Ri .

for i = 1, 2, \ldots, m
    for j = f_i(A), \ldots, \min\{f_i(A) + w - 1, n\}
        if a_{ij} \neq 0 then
            [c, s] = givrot(r_{jj}, a_{ij});
            \begin{pmatrix} r_j^T \\ a_i^T \end{pmatrix} := \begin{pmatrix} c & s \\ -s & c \end{pmatrix} \begin{pmatrix} r_j^T \\ a_i^T \end{pmatrix};
        end
    end
end

The reduction is shown schematically in Figure 4.1.1. The ith step only involves the w × w
upper triangular part of Ri−1 formed by rows and columns fi (A) to li (A). If at some stage
rii = 0, then the whole ith row in Ri−1 must be zero, and the remaining part of the current row
aTi can just be copied into row i of Ri−1 .
If A has constant bandwidth and is in standard form, then at step i the last (n−li (A)) columns
of R have not been touched and are still zero as initialized. Furthermore, at this stage the first
(fi (A) − 1) rows are final rows of R and can be saved on secondary storage. Since primary
storage is only needed for the active triangular part, shown in Figure 4.1.2, very large problems
can be handled. Clearly, the number of plane rotations needed to process the ith row is at most
min(i-1, w) and requires at most 4w^2 flops. Hence, the complete factorization requires 4mw^2 flops and \frac{1}{2}w(w+3) locations of storage. We remark that if A is not in standard form, then the
operation count can only be bounded by 4mnw flops.

Figure 4.1.2. Reduction of a band matrix.

The solution to a band least squares problem minx ∥Ax − b∥2 is obtained by applying QR
factorization to the extended matrix ( A b ). The solution is then obtained from a band upper
triangular system Rx = c1 . In order to treat additional right-hand sides not available at the time
of factorization, the plane rotations from the QR factorization need to be saved. As described
in Section 2.2.1, a plane rotation can be represented by a single floating-point number. Since at
most w rotations are needed to process each row, it follows that Q can be stored in no more space
than that allocated for A.
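As an illustration of Algorithm 4.1.2, the MATLAB sketch below merges the rows of (A b) one at a time using plane rotations. The function name band_givens_qr is illustrative; the built-in function planerot is used in place of the givrot routine above, and R is held as a dense n-by-n array for simplicity, whereas a production code would store R by diagonals and save the rotations. For A in standard form with row bandwidth w, at most w rotations act on each row.

function [R, c] = band_givens_qr(A, b)
% BAND_GIVENS_QR: sequential Givens QR factorization applied to
% (A b), merging one row at a time into an upper triangular R,
% in the spirit of Algorithm 4.1.2. Returns R and c such that the
% least squares solution satisfies R*x = c.
[m, n] = size(A);
R = zeros(n, n); c = zeros(n, 1);
for i = 1:m
    ai = A(i, :); bi = b(i);
    fi = find(ai, 1);                    % first nonzero column in row i
    if isempty(fi), continue, end
    for j = fi:n
        if ai(j) == 0, continue, end
        if R(j, j) == 0                  % row j of R still empty: copy row
            R(j, :) = ai; c(j) = bi;
            ai = zeros(1, n); bi = 0;
            break
        end
        G = planerot([R(j, j); ai(j)]);  % 2-by-2 rotation zeroing ai(j)
        T = G*[R(j, :); ai];
        R(j, :) = T(1, :); ai = T(2, :);
        t = G*[c(j); bi];
        c(j) = t(1); bi = t(2);
    end
end
end

The least squares solution is then obtained from the band upper triangular system Rx = c.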

4.2 Bidiagonalization
4.2.1 Bidiagonal Decomposition
For a general m×n matrix A, the closest-to-diagonal form that can be achieved in a finite number
of operations is a bidiagonal form. Golub and Kahan [495, 1965] show that this form can be
obtained by a sequence of two-sided Householder reflections B = U TAV , where U ∈ Rm×m
and V ∈ Rn×n are orthogonal. This preserves the singular values of A, and the singular vectors
of B are closely related to those of A. Householder bidiagonalization (HHBD) is often the first
step toward computing the SVD (singular value decomposition); see Section 7.1.1.

Theorem 4.2.1 (HHBD). For any matrix A ∈ Rm×n , orthogonal matrices U = (u1 , . . . , um ) ∈
Rm×m and V = (v1 , . . . , vn ) ∈ Rn×n can be found such that U TAV is upper bidiagonal. If
m ≥ n, then

U^TAV = \begin{pmatrix} B \\ 0 \end{pmatrix}, \qquad B = \begin{pmatrix}
\beta_1 & \alpha_2 & & & \\
 & \beta_2 & \alpha_3 & & \\
 & & \ddots & \ddots & \\
 & & & \beta_{p-1} & \alpha_p \\
 & & & & \beta_p
\end{pmatrix} \in \mathbb{R}^{p\times p},   (4.2.1)

where βi ≥ 0, αi ≥ 0, and p = n. If m < n, then U TAV = ( B 0 ), where p = m,


B ∈ Rp×(p+1) , and Bep+1 = αp+1 ep .

Proof. U and V are constructed as products of Householder reflectors Qk ∈ Rm×m from the
left and Pk ∈ Rn×n , k = 1, . . . , p, from the right, applied alternately to A. Here Qk is chosen to
zero out all elements in the kth column below the main diagonal, and Pk is chosen to zero out all
elements in the kth row to the right of B. First, Q1 is applied to A to zero out all elements, except
one in the first column of Q1 A. When P1 is next applied to zero out the elements in the first row
not in B, the first column in Q1 A is left unchanged. These two transformations determine β1
and α2 . All later steps are similar. With A(1) = A, set A(k+1) = (Qk A(k) )Pk . This determines
not only the bidiagonal elements βk and αk+1 in the kth row of B but also the kth columns in U
and V :

uk = U ek = Q1 . . . Qk ek , vk = V ek = P1 . . . Pk ek , k = 1, . . . , p. (4.2.2)

Here, some of the transformations may be skipped. For example, if m = n, then Qn = In and
Pn−1 = Pn = In . Similarly, a complex matrix A ∈ Cm×n can be reduced to real bidiagonal
form using a sequence of complex Householder reflections.

By applying the HHBD algorithm of Theorem 4.2.1 to AT , any matrix A ∈ Rm×n can be
transformed into lower bidiagonal form. (This is equivalent to starting the reduction of A with a
right transformation P1 instead of with Q1 .)
HHBD requires approximately 4(mn^2 - \frac{1}{3}n^3) flops, or roughly twice as many as needed for Householder QR factorization. If the matrices U_p = (u_1, \ldots, u_p) and/or V_p = (v_1, \ldots, v_p) are wanted, the corresponding products of Householder reflections can be accumulated at a cost of 2(mn^2 - \frac{1}{3}n^3) and \frac{4}{3}n^3 flops, respectively. For a square matrix, these counts are \frac{8}{3}n^3 for the reduction and \frac{4}{3}n^3 for computing each of the matrices U and V.
HHBD is a backward stable algorithm, i.e., the computed matrix B is the exact result for a
matrix A + E, where
∥E∥F ≤ cn2 u∥A∥F , (4.2.3)

and c is a constant of order unity. Further, if the stored Householder vectors are used to generate
U and V explicitly, the computed matrices are close to the exact matrices U and V that reduce
A + E. This guarantees that the computed singular values of B and the transformed singular
vectors of B are those of a matrix close to A.
An alternative two-step reduction to bidiagonal form was suggested by Lawson and Han-
son [727, 1995] and later analyzed by T. Chan [222, 1982]. This first computes the pivoted QR
factorization

AP = Q \begin{pmatrix} R \\ 0 \end{pmatrix}

and then transforms the upper triangular R ∈ Rn×n into bidiagonal form. Column pivoting in
the initial QR factorization makes the two-step algorithm potentially more accurate. Householder
bidiagonalization cannot take advantage of the triangular structure of R, but with plane rotations
it can be done. In the first step the elements r1n , . . . , r13 in the first row are annihilated in this
order. To zero out element r1j a plane rotation Gj−1,j is first applied from the right. This
introduces a new nonzero element rj,j−1 , which is annihilated by a rotation G̃j−1,j from the left.
The first few rotations in the process are pictured below. (Recall that ⊕ denotes the element to
be zeroed, and fill-in elements are denoted by +.)

[Schematic of the first few rotations omitted.]

Two plane rotations are needed to zero each element, at a cost of at most 6n flops. The total operation count for the reduction of R to bidiagonal form is about 2n^3 flops, which is lower than for the Householder
Householder reduction requires less work; see Chan [222, 1982]. A similar “zero chasing” tech-
nique can be used to reduce the bandwidth of an upper triangular band matrix while preserving
the band structure; see Golub, Luk, and Overton [498, 1981].
The bidiagonal decomposition is a powerful tool for analyzing and solving least squares prob-
lems. Let U T AV = C, where U and V are square orthogonal matrices. Then the pseudoinverses
of A and C are related as A† = V C † U T , and the pseudoinverse solution is

x = A† b = V (C † (U T b)). (4.2.4)

In particular, consider the HHBD algorithm with A \in \mathbb{R}^{m\times n}, m \geq n = rank(A). Setting x = Vy and using the orthogonal invariance of the 2-norm, we have

\|Ax - b\|_2^2 = \|U^TAVy - U^Tb\|_2^2 = \Bigl\| \begin{pmatrix} B \\ 0 \end{pmatrix} y - \begin{pmatrix} c \\ d \end{pmatrix} \Bigr\|_2^2 = \|By - c\|_2^2 + \|d\|_2^2.   (4.2.5)

Clearly, the minimum of ∥Ax − b∥22 is obtained when By = c. The least squares solution
x = V y = P1 · · · Pn−2 y is computed by forming U T b = Qn · · · Q1 b and solving By = c by
back-substitution:

y_n = c_n/\beta_n, \qquad y_k = (c_k - \alpha_{k+1} y_{k+1})/\beta_k, \quad k = n-1 : -1 : 1.   (4.2.6)

This takes 2(mn + n^2) flops. Computing the residual vector

r = Q_1 \cdots Q_n \begin{pmatrix} 0 \\ d \end{pmatrix}

needs an additional 4mn − 2n2 flops. Computing r = b − Ax directly requires mn flops.
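A minimal MATLAB sketch of the back-substitution (4.2.6) for the upper bidiagonal B of (4.2.1) follows; the function name upper_bidiag_solve and the argument names are illustrative, with beta holding the diagonal and alpha the superdiagonal of B.

function y = upper_bidiag_solve(beta, alpha, c)
% UPPER_BIDIAG_SOLVE solves B*y = c for the upper bidiagonal B in
% (4.2.1), with diagonal beta(1:n) and superdiagonal alpha(2:n)
% (alpha(1) is not used), by the back-substitution (4.2.6).
n = length(beta);
y = zeros(n,1);
y(n) = c(n)/beta(n);
for k = n-1:-1:1
    y(k) = (c(k) - alpha(k+1)*y(k+1))/beta(k);
end
end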


To be efficient on modern computers, a blocked version of Householder bidiagonalization is
needed. This poses new difficulties that are not present in the blocked Householder QR factor-
ization described in Section 2.6.1. A blocked implementation of Householder bidiagonalization
is given by Dongarra, Sorensen, and Hammarling [327, 1989]; see also Dongarra et al. [323,
2018, Algorithm 1]. Householder reflectors Hk = I − τk vk vkT are used to eliminate elements
below the diagonal in column k, while Gk = I − πk uk uTk eliminate elements to the right of the
superdiagonal in row k:

A^{(k)} = H_k A^{(k-1)} G_k = (I - \tau_k v_k v_k^T) A^{(k-1)} (I - \pi_k u_k u_k^T) = A^{(k-1)} - v_k y_k^T - z_k u_k^T,   (4.2.7)

where yk = τk (A(k−1) )T vk and zk = πk (A(k−1) uk − (ykT uk )vk ). Blocking together p applica-


tions gives
A(p) = Hp · · · H1 AG1 · · · Gp = A − Vp YpT − Xp UpT .
To determine Hk it suffices to update column k of the trailing matrix A(k−1) , and Gk can be
determined after row k of H_k A^{(k-1)} is updated. The remaining update of the trailing matrix
can be delayed, but each step still requires two matrix-vector products involving the full trailing
matrix. If m = n, then about one half of the total 8n3 /3 flops in the decomposition will be
matrix-vector and the other half matrix-matrix operations.

Notes and references


Ralha [909, 2003] developed a one-sided bidiagonalization algorithm that starts by computing
the factorization F = AV . Here V is a product of Householder reflections chosen so that
T = V TATAV is tridiagonal but without explicitly forming ATA. Next, a Gram–Schmidt QR
of F is used to compute F = AV = U B, where B is upper bidiagonal. Ralha’s algorithm
uses fewer arithmetic operations than the two-sided Householder algorithm but lacks numerical
stability. A modified backward stable version of Ralha’s algorithm was developed by Barlow,
Bosner, and Drmač [71, 2005].

4.2.2 Core Problems by Bidiagonalization


Consider the least squares problem

min ∥Ax − b∥2 . (4.2.8)


x

Suppose U \in \mathbb{R}^{m\times m} and V \in \mathbb{R}^{n\times n} can be found so that

U^T \bigl(\, b \;\; AV \,\bigr) = \begin{pmatrix} c_1 & A_{11} & 0 \\ 0 & 0 & A_{22} \end{pmatrix}.   (4.2.9)

Setting x = Vy splits problem (4.2.8) into two independent approximation subproblems

\min_{y_1} \|A_{11} y_1 - c_1\|_2, \qquad \min_{y_2} \|A_{22} y_2\|_2, \qquad y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix},   (4.2.10)

where A11 and A22 may be rectangular. Clearly, the solution to the second subproblem is y2 = 0.
Hence, all information needed for solving the original approximation problem is contained in the
first subproblem. The pseudoinverse solution x† of the original problem (4.2.8) is related to the
pseudoinverse solution y † of the transformed problem (4.2.10) by

x† = V1 y1† , V = ( V1 V2 ) . (4.2.11)

Paige and Strakoš [868, 2006] state the following definition.



Definition 4.2.2. The subproblem miny ∥A11 y − c1 ∥2 in (4.2.10) is said to be a core problem
of minx ∥Ax − b∥2 if A11 is minimally dimensioned for all orthogonal U and V that give the
form (4.2.9).

We now show that a core subproblem can be found by bidiagonalization of the compound
matrix ( b A ), where A ∈ Rm×n , m > n. (Note that here b is placed in front of A, con-

trary to our previous practice!) This yields an upper bidiagonal decomposition U T b A Ve .
Because Ve does not act on the first column b, we can write
 
T
 1 0
= β1 e1 U T AV .

U b A (4.2.12)
0 V

Hence U^Tb = \beta_1 e_1, so that \beta_1 = \|b\|_2 and u_1 = Ue_1 = b/\beta_1. Further, we have the lower bidiagonal decomposition

U^TAV = \begin{pmatrix} B \\ 0 \end{pmatrix}, \qquad B = \begin{pmatrix}
\alpha_1 & & & \\
\beta_2 & \alpha_2 & & \\
 & \ddots & \ddots & \\
 & & \beta_n & \alpha_n \\
 & & & \beta_{n+1}
\end{pmatrix} \in \mathbb{R}^{(n+1)\times n}.   (4.2.13)

The bidiagonalization process is terminated when the first zero bidiagonal element in B is en-
countered. Then a core problem has been found. There are two possible cases. If αk > 0 but
\beta_{k+1} = 0, k \leq n, then the transformed matrix splits as follows:

U^T \bigl(\, b \;\; AV \,\bigr) = \begin{pmatrix} \beta_1 e_1 & B_{11} & 0 \\ 0 & 0 & A_{22} \end{pmatrix},   (4.2.14)

where

B_{11} = \begin{pmatrix}
\alpha_1 & & & \\
\beta_2 & \alpha_2 & & \\
 & \ddots & \ddots & \\
 & & \beta_k & \alpha_k
\end{pmatrix} \in \mathbb{R}^{k\times k}   (4.2.15)

has full row rank, and A22 is the part of A that is not yet in bidiagonal form. The pseudoinverse
solution x = V y is now found by solving the square nonsingular lower bidiagonal core system

B11 y1 = β1 e1 (4.2.16)

and setting x† = V1 y1 ; see (4.2.11). Because the residual for (4.2.16) is zero, it follows that the
original system Ax = b must be consistent.
If instead the bidiagonalization terminates with βk+1 > 0 and αk+1 = 0, then the first
k columns of the transformed matrix split with a rectangular lower bidiagonal matrix of full
column rank,

B_{11} = \begin{pmatrix}
\alpha_1 & & & \\
\beta_2 & \alpha_2 & & \\
 & \ddots & \ddots & \\
 & & \beta_k & \alpha_k \\
 & & & \beta_{k+1}
\end{pmatrix} \in \mathbb{R}^{(k+1)\times k}.   (4.2.17)

The least squares core problem is then

\min_{y_1} \|B_{11} y_1 - \beta_1 e_1\|_2.   (4.2.18)

The solution can be found efficiently by a QR factorization of B11 . Proof that systems (4.2.16)
and (4.2.18) are minimally dimensioned, and therefore core problems, is given by Paige and
Strakoš [868, 2006, Theorems 3.2 and 3.3].

Notes and references

Core problems play an important role in a unified treatment of the scaled TLS (STLS) problem;
see Paige and Strakoš [867, 2002], [868, 2006]. Hnětynková et al. [630, 2011] give a new clas-
sification of multivariate TLS problems AX ≈ B that reveals subtleties not captured previously.
The definition of the core problem is extended to the multiple right-hand side case AX ≈ B by
Plešinger [899, 2008] and Hnětynková, Plešinger, and Strakoš [632, 2013], [633, 2015].

4.2.3 Golub–Kahan–Lanczos Bidiagonalization


When A is large and sparse, Householder bidiagonalization is not efficient because any sparsity
in A is usually destroyed in the early intermediate steps. Golub and Kahan [495, 1965] give an
alternative algorithm that generates the elements in B and the columns of U and V sequentially
and only uses matrix-vector products with A and AT . We call this algorithm Golub–Kahan–
Lanczos bidiagonalization (GKLBD) because of its close relationship to symmetric Lanczos
tridiagonalization of the symmetric matrix
 
0 A
AT 0

with special starting vectors.


By Theorem 4.2.1, there exist orthogonal U \in \mathbb{R}^{m\times m} and V \in \mathbb{R}^{n\times n} such that U^TAV is upper or lower bidiagonal. Then we have

A (v_1, v_2, v_3, \ldots) = (u_1, u_2, u_3, \ldots) \begin{pmatrix} B & 0 \\ 0 & 0 \end{pmatrix},   (4.2.19)

A^T (u_1, u_2, u_3, \ldots) = (v_1, v_2, v_3, \ldots) \begin{pmatrix} B^T & 0 \\ 0 & 0 \end{pmatrix},   (4.2.20)

where some of the zero blocks may be empty. Assume now that B is lower bidiagonal,

B = \begin{pmatrix}
\alpha_1 & & & \\
\beta_2 & \alpha_2 & & \\
 & \beta_3 & \ddots & \\
 & & \ddots & \alpha_n \\
 & & & \beta_{n+1}
\end{pmatrix}.

Equating columns on both sides of equations (4.2.19)–(4.2.20) gives α1 v1 = AT u1 ,

Avk = βk+1 uk+1 + αk uk , (4.2.21)


AT uk+1 = αk+1 vk+1 + βk+1 vk , k = 1, . . . , n. (4.2.22)

Given an initial unit vector u1 ∈ Rm , this yields an iterative process for generating U and V
columnwise and B rowwise from

βk+1 uk+1 = Avk − αk uk , (4.2.23)


αk+1 vk+1 = AT uk+1 − βk+1 vk , k = 1, 2, . . . , (4.2.24)

where βk+1 and αk+1 are normalization constants. As long as no zero bidiagonal element occurs,
the choice of u1 uniquely determines this process. It terminates with a core subproblem when
the first zero bidiagonal element is encountered. After k steps the recurrence relations (4.2.23)–
(4.2.24) can be summarized in matrix form as

AVk = Uk+1 Bk , (4.2.25)


AT Uk+1 = Vk BkT + αk+1 vk+1 eTk+1 . (4.2.26)

The Bidiag1 algorithm of Paige and Saunders [866, 1982] is obtained for the special starting
vector
u1 = b/β1 , β1 = ∥b∥2 . (4.2.27)
In exact arithmetic, it generates the same lower bidiagonal decomposition of A as HHBD, trans-
forming ( b A ) into upper bidiagonal form. Bidiag1 is the basis of the important iterative least
squares algorithm LSQR of Paige and Saunders [866, 1982].
A similar GKLBD process can be devised for the case when U T AV = R is upper bidiagonal,

R = \begin{pmatrix}
\rho_1 & \gamma_2 & & & \\
 & \rho_2 & \gamma_3 & & \\
 & & \rho_3 & \ddots & \\
 & & & \ddots & \gamma_n \\
 & & & & \rho_n
\end{pmatrix}.   (4.2.28)
Given an initial unit vector v1 ∈ Rn , equating columns in (4.2.19)–(4.2.20) yields Av1 = ρ1 u1 ,

AT uk = ρk vk + γk+1 vk+1 ,
Avk = γk uk−1 + ρk uk , k = 1, 2, . . . , n − 1.

These equations can be used to generate not only successive columns in U and V but also the
columns of R. As long as no zero bidiagonal element is encountered, we have

γk+1 vk+1 = AT uk − ρk vk , (4.2.29)


ρk+1 uk+1 = Avk+1 − γk+1 uk , k = 1, 2, . . . . (4.2.30)

These equations uniquely determine vk+1 and uk+1 as well as γk+1 and ρk+1 . After k steps
this process has generated Uk ∈ Rm×k and Vk ∈ Rn×k and a square upper bidiagonal matrix
Rk ∈ Rk×k such that

AVk = Uk Rk , (4.2.31)
T
A Uk = Vk RkT + γk+1 vk+1 eTk . (4.2.32)

Taking the initial vector to be v1 = AT b/γ1 , γ1 = ∥AT b∥2 , gives the Bidiag2 algorithm of Paige
and Saunders [866, 1982]. In exact arithmetic, this generates the same quantities as HHBD,
transforming ( AT b AT ) into upper bidiagonal form.

In finite-precision arithmetic there can be a severe loss of orthogonality in Uk and Vk com-


puted by both forms of GKLBD. One effect is that the finite termination property of HHBD is
lost; see Section 6.2.5.

4.2.4 Bidiagonalization and Krylov Subspaces


GKLBD generates vectors by repeated matrix-vector products with a fixed matrix. Such vectors
span subspaces of a particular structure called Krylov subspaces.6 Let C ∈ Rn×n be a given
square matrix, and let v ∈ Rn be a given nonzero vector. The sequence of vectors v, Cv, C 2 v,
C 3 v, . . . is called a Krylov sequence and is easily generated recursively by v1 = v, vk = Cvk−1 ,
k = 2, 3, . . . . The subspace spanned by the first k vectors is called a Krylov subspace, denoted
by
Kk (C, v) = span {v, Cv, . . . , C k−1 v}. (4.2.33)

Such subspaces with C = AT A or C = AAT play a fundamental role in methods for solving
large-scale least squares problems.
There can be at most n linearly independent vectors in Rn . Hence, in any Krylov sequence
v, Cv, C 2 v, C 3 v, . . . there is a first vector C p+1 v, p ≤ n, that is a linear combination of the
preceding ones. Then
Kp+1 (C, v) = Kp (C, v), (4.2.34)

and the Krylov sequence terminates. The maximum dimension p of Kk (C, v) is called the grade
of v with respect to C. From (4.2.34) it follows that at termination, Kp (C, v) is an invariant
subspace of C. Conversely, if the vector v lies in an invariant subspace of C of dimension p, its
Krylov sequence terminates for k = p.
Krylov subspaces satisfy the following easily verified invariance properties:

1. Scaling: Km (βC, αv) = Km (C, v), α ̸= 0, β ̸= 0.

2. Translation: Km (C − µI, v) = Km (C, v).

3. Similarity: Km (U H CU, U H v) = U H Km (C, v) for any unitary matrix U .

Theorem 4.2.3. As long as no zero bidiagonal element is encountered, the orthonormal vectors
(u1 , . . . , uk ) and (v1 , . . . , vk ) generated by Bidiag1 are bases for the Krylov subspaces

R(Uk ) = Kk (AAT, b), R(Vk ) = Kk (ATA, AT b), k = 1, 2, . . . . (4.2.35)

Proof. We have β1 u1 = b and α1 v1 = AT u1 . If β1 α1 ̸= 0, then (β1 α1 )v1 = AT b. Hence


(4.2.35) holds for k = 1. Assume now that (4.2.35) is true for some k ≥ 1. If βk+1 αk+1 ̸= 0, it
follows from (4.2.23)–(4.2.24) that

R(Uk+1 ) = R(Uk ) ∪ AKk (ATA, AT b) = Kk+1 (AAT , b),


R(Vk+1 ) = R(Vk ) ∪ AT Kk (AAT , b) = Kk+1 (ATA, AT b).

Hence (4.2.35) holds for k + 1. The theorem now follows by induction in k.

6 Named after the Russian mathematician Aleksey N. Krylov (1863–1945).



By Theorem 4.2.3 the matrices Uk and Vk can also be obtained by Gram–Schmidt orthogo-
nalization of the Krylov sequences

u1 , AAT u1 , (AAT )2 u1 , . . . , v1 , ATAv1 , (ATA)2 v1 , . . . .

Hence, the uniqueness of the bases is a consequence of the uniqueness (up to a diagonal scaling
with elements ±1) of the QR factorization of a real matrix of full column rank.
Bidiag1 terminates for some k ≤ rank(A) when the subspaces Kk (ATA, AT b) have reached
maximum rank. Then the subspaces

Kk (AAT , AAT b) = AKk (ATA, AT b)

have also reached maximal rank. At termination the pseudoinverse solution xk = Vk yk is ob-
tained for some yk ∈ Rk , where R(Vk ) = Kk (ATA, AT b).
Bidiag2 terminates with βk+1 = 0 for some k ≤ rank(A) when the subspaces Kk (ATA, AT b)
have reached maximum rank. Then the subspaces Kk (AAT , AAT b) = AKk (ATA, AT b) have
also reached maximal rank. Bidiag2 generates vectors giving orthogonal bases for

R(Vk ) = Kk (ATA, AT b), R(Uk ) = Kk (AAT , AAT b). (4.2.36)

Bidiag1 and Bidiag2 generate the same Vk , but Uk differs. The upper bidiagonal matrix Rk in
Bidiag2 is the same as the matrix obtained by QR factorization of the lower bidiagonal Bk in
Bidiag1; see Paige and Saunders [866, 1982].
Collecting previous results we can now state the following.

Theorem 4.2.4 (Krylov Subspace Approximations). Let p be the grade of AT b with respect to
ATA. Then the projected least squares problems

\min_{x_k} \|Ax_k - b\|_2 \quad \text{subject to} \quad x_k \in \mathcal{K}_k(A^TA, A^Tb), \quad 1 \leq k \leq p,   (4.2.37)

have full rank. The solutions xk are uniquely determined, and the residuals satisfy rk = b −
Axk ⊥ Kk (AAT , AAT b). Independent of b and the size or rank of A, the Krylov subspace
approximations terminate with the pseudoinverse solution xp = A† b for some p ≤ rank(A).

From the nested property of the Krylov subspaces Kk (ATA, AT b), k = 1, 2, . . . , it follows
that the sequence of residual norms ∥rk ∥2 , k = 1, 2, . . . , are monotonically decreasing. For
k < p the Krylov subspace approximations depend nonlinearly on b in a highly complicated
way.
At termination, all bidiagonal elements in B and R of Bidiag1 and Bidiag2 are positive. Such
bidiagonal matrices are said to be unreduced and have the following property.

Lemma 4.2.5. For any unreduced bidiagonal matrix B, all singular values σi must be distinct.

Proof. If B is unreduced, then T = B^TB is symmetric tridiagonal with positive off-diagonal elements \rho_i\gamma_{i+1}. Such a tridiagonal matrix is also called unreduced and, by Parlett [884,
1998, Lemma 7.7.1], has distinct eigenvalues λi . The lemma now follows from the fact that
λi = σi2 .

Theorem 4.2.6. Let A have p distinct, possibly multiple, nonzero singular values. Then, Bidiag1
and Bidiag2 terminate with an unreduced bidiagonal matrix Bk = UkT AVk of rank k ≤ p.

Proof. Since Bk is unreduced, it follows from Lemma 4.2.5 that all its singular values must
be distinct: σ1 > σ2 > · · · > σk > 0. But these are also singular values of A, and hence
k ≤ p.

Theorem 4.2.6 states that if A has singular values of multiplicity greater than one, then
Bidiag1 and Bidiag2 terminate in less than rank(A) steps. For example, let square A = In +uv T ,
u, v ̸= 0, and rank(A) = n. The singular values of A are the square roots of the eigenvalues
of ATA = In + uv T + vuT + (uT u)vv T . AT A has one eigenvalue equal to 1 of multiplicity
(n − 2) corresponding to eigenvectors x that are orthogonal to u and v. Hence, A has at most
three distinct singular values, and, in exact arithmetic, bidiagonalization must terminate after at
most three steps.
If b is orthogonal to the left singular subspaces corresponding to some of the singular values,
Bidiag1 terminates in k ≤ p steps; see Björck [137, 2014, Lemma 2.1].

4.2.5 Partial Least Squares Algorithms


The NIPALS (Nonlinear Iterative PArtial Least Squares) method was devised by H. Wold [1127,
1966] to model relations between sets of observed variables and a number of latent (not directly
observed) variables. Initially, the main applications were prediction and cause-effect inference
in statistics and economics. S. Wold et al. [1128, 1984] developed the recursive NIPALS-PLS
algorithm for use in chemometrics and pointed out its equivalence to Krylov subspace approx-
imations. Today, partial least squares (PLS) is a widely used multivariate regression technique
with a broad spectrum of applications in research and industry, such as bioinformatics, food
science, medicine, pharmacology, social sciences, and physiology; see Vinzi et al. [1093, 2010].
NIPALS-PLS. Set A0 = A, b0 = b, and for k = 1, 2, . . . ,

v̂k = ATk−1 bk−1 , vk = v̂k /∥v̂k ∥2 , (4.2.38)


ûk = Ak−1 vk , uk = ûk /∥ûk ∥2 , (4.2.39)
pTk = uTk Ak−1 , ζk = uTk bk−1 , (4.2.40)
(Ak , bk ) = (Ak−1 , bk−1 ) − uk (pTk , ζk ). (4.2.41)

The kth PLS approximation is xk = Vk yk , where yk satisfies

(P_k^T V_k) y_k = z_k, \qquad z_k = (\zeta_1, \ldots, \zeta_k)^T,   (4.2.42)

Vk = (v1 , . . . , vk ), Uk = (u1 , . . . , uk ), and Pk = (p1 , . . . , pk ). Note that in (4.2.41) the


orthogonal projections of Ak−1 and bk−1 onto uk are subtracted:

(Ak , bk ) = (I − uk uTk )(Ak−1 , bk−1 ).

The process terminates when either ∥v̂k ∥2 or ∥ûk ∥2 is zero. Note that if uTk Ak−1 vk ̸= 0, the
rank of Ak is exactly one less than that of Ak−1 .
In the PLS literature, uk are called score vectors, vk loading weights, and pk loading vec-
tors. Summing the equations in (4.2.41) gives

A = U_k P_k^T + A_k, \qquad b = U_k z_k + b_k.   (4.2.43)


Then U_k P_k^T = \sum_{i=1}^{k} u_i p_i^T is a rank-k approximation to the data matrix A. The data residuals
are given by
Ak = A − Uk PkT = (I − Uk UkT )A. (4.2.44)

Algorithm 4.2.1 (NIPALS-PLS Algorithm).

function [x,U,P,V] = nipalspls(A,b,k)
% NIPALSPLS computes the first k PLS factors for the
% least squares problem min ||Ax - b||_2.
% -------------------------------------------------
[m,n] = size(A); U = zeros(m,k);
V = zeros(n,k); P = V; z = zeros(k,1);
for i = 1:k
    v = A'*b; nv = norm(v);
    if nv == 0, break, end
    v = v/nv;
    u = A*v; nu = norm(u);
    if nu == 0, break, end
    u = u/nu;
    % Compute loadings and deflate A and b
    p = A'*u; z(i) = b'*u;
    A = A - u*p'; b = b - u*z(i);    % Deflate data
    V(:,i) = v; U(:,i) = u; P(:,i) = p;
end
x = V*((P'*V)\z);                    % Regression coefficients
end
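A small synthetic test of Algorithm 4.2.1 might look as follows; the data are hypothetical and only meant to exercise the code.

m = 200; n = 50; rng(1);
A = randn(m,n)*diag(1./(1:n));      % synthetic ill-conditioned data matrix
xtrue = ones(n,1);
b = A*xtrue + 1e-3*randn(m,1);
k = 8;                              % number of PLS factors
x_pls = nipalspls(A, b, k);
rel_res = norm(A*x_pls - b)/norm(b)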

NIPALS-PLS uses three matrix-vector products and one rank-one deflation that together re-
quire 8mn flops per PLS factor. The flop counts for the additional scalar products and the final
back-substitution for solving the upper triangular systems are negligible in comparison. For
k ≪ min(m, n) this is the same number of flops per step as required using Householder bidiag-
onalization.
In exact arithmetic, NIPALS-PLS computes the sequence of Krylov subspace approximations
defined in Theorem 4.2.4. The following important result is due to Eldén [375, 2004].

Theorem 4.2.7. The vectors {v1 , . . . , vk } and {u1 , . . . , uk } generated by NIPALS-PLS form
orthonormal bases for Kk (ATA, AT b) and Kk (AAT , AAT b), respectively. In exact arithmetic,
these vectors are the same as the columns of Vk and Uk from the Bidiag2 algorithm. It also
follows that PkT Vk is an upper bidiagonal matrix, and xk is the kth Krylov subspace approx-
imation.

In floating-point arithmetic there will be a progressive loss of orthogonality in Uk and Vk


in NIPALS-PLS, and the computed off-bidiagonal elements in PkT Vk will not be negligible.
The loss of orthogonality is approximately proportional to the condition number κ(PkT Vk ); see
Björck [137, 2014]. However, relations (4.2.43) still hold to working precision. Although in
theory the matrix PkT Vk is upper bidiagonal, this will not be the case in floating-point compu-
tation. It is not clear if neglecting the off-bidiagonal elements in PkT Vk improves the accuracy
of the solution. In the implementation above we follow the original NIPALS-PLS algorithm
and treat PkT Vk as a full matrix. This increases the arithmetic cost of solving the subproblem
(4.2.42) from 2k flops to 2k^3/3 flops. Tests show only marginal differences in accuracy for
these options.
The high accuracy of NIPALS-PLS seems to be partly due to the fact that the augmented
data matrix (A, b) is deflated before the next step of orthogonalization is carried out. In exact
arithmetic, the vectors u1 , . . . , uk are mutually orthogonal. Then ζk = uTk bk−1 = uTk b and
NIPALS-PLS can be “simplified” by omitting the deflation of b. In floating-point arithmetic this
omission can substantially increase the loss of orthogonality in Uk and Vk and cause a loss of
accuracy in the computed xk . It is unfortunate that omitting the deflation of b seems to have be-
come the norm rather than the exception. For example, Manne [771, 1987] proposes a marginally
faster PLS algorithm by omitting the deflation of b. The implementation of NIPALS tested by
Andersson [29, 2009] also omits this deflation. Unfortunately, this practice has spread even to
some commercial statistical software packages.
In the context of iterative methods for least squares problems, the Bidiag2 process is known
to lead to a less stable algorithm than Bidiag1; see Paige and Saunders [866, 1982]. For PLS
it is more direct to use the Bidiag2 process. In the implementation below, orthogonality in the
computed basis vectors uk and vk is preserved by reorthogonalizing the new basis vectors uk+1
and vk+1 against all previous basis vectors. Then there is no difference in stability between
Bidiag1 and Bidiag2, as confirmed by tests in Björck and Indahl [148, 2017]. The additional
cost of reorthogonalizing is about 4(m + n)k^2 flops for k factors. Unless k is very large, this
overhead is acceptable.

Algorithm 4.2.2 (Bidiag2-PLS).

[x,U,B,V] = bidiag2pls(A,b,k)
% BIDIAG2-PLS computes the first k PLS factors using
Golub--Kahan bidiagonalization
% -----------------------------------------------------
[m,n] = size(A);
B = zeros(k,2); % B stored by diagonals
x = zeros(n,k); U = zeros(m,k); V = zeros(n,k);
v = A'*b; v = v/norm(v);
w = A*v; rho = norm(w); w = (1/rho)*w;
V(:,1) = v; U(:,1) = w; B(1,1) = rho;
z = v/rho; x(:,1) = (w'*b)*z;
for i = 2:k % Bidiagonalization
v = A'*w - rho*v;
v = v - V*(V'*v); % Reorthogonalize
gamma = norm(v); B(i-1,2) = gamma;
v = (1/gamma)*v; V(:,i) = v;
w = A*v - gamma*w;
w = w - U*(U'*w); % Reorthogonalize
rho = norm(w); B(i,1) = rho;
if rho == 0, break, end
w = (1/rho)*w; U(:,i) = w;
z = (1/rho)*(v - gamma*z); % Update solution
x(:,i) = x(:,i-1) + (w'*b)*z;
end
end

Example 4.2.8. Like TSVD (truncated SVD) approximations, the PLS approximations are or-
thogonal projections of the pseudoinverse solution onto a nested sequence of subspaces of di-
mension k ≪ n; see Section 3.6.2. Both sequences have regularization properties. However,
because PLS only requires a partial bidiagonalization of A, it is much less expensive to compute
than TSVD. Further, for PLS the subspaces depend (nonlinearly) on the right-hand side b and are
tailored to the specific right-hand side. Therefore, the minimum error for an ill-conditioned prob-
lem is often achieved with a lower dimensional subspace for PLS rather than for TSVD. Consider
the discretized problem Kf = g in Example 3.6.1 for the linear operator K(s, t) = e^{−(s−t)^2}
with m = n = 100. Let f be known, set g = Kf, and let ĝ = g + e be a perturbed right-hand


side, where e is normally distributed with zero mean and variance 10^{-4}. Figure 4.2.1 shows the
relative error ∥fk − f ∥2 /∥f ∥2 and the relative residual norm ∥Kfk − g∥2 /∥g∥2 for the PLS and
TSVD solutions as a function of k. The results are almost identical. For PLS the smallest error
occurs for k = 10 and for TSVD for k = 12. For larger values of k the error increases
rapidly, but the residual norm is almost constant.


Figure 4.2.1. Relative error ∥fk −f ∥2 /∥f ∥2 (solid line) and relative residual ∥Kfk −g∥2 /∥g∥2
(dashed line) after k steps. PLS (left) and TSVD (right). Used with permission of Springer International
Publishing; from Numerical Methods in Matrix Computations, Björck, Åke, 2015; permission conveyed
through Copyright Clearance Center, Inc.

Many algorithms that differ greatly in speed and accuracy have been proposed for PLS. Both
NIPALS-PLS and HHBD perform transformations on A that require A to be explicitly available.
This makes them less suitable for large-scale problems. Andersson [29, 2009] tested nine dif-
ferent PLS algorithms on a set of contrived benchmark problems. The only provably backward
stable method among them, HHBD, was much slower than NIPALS-PLS, which also was one of
the most accurate, even though it was used here without deflation of b. The version of Bidiag2-
PLS used did not employ reorthogonalization and gave poor accuracy. In further tests by Björck
and Indahl [148, 2017] the most accurate algorithms were HHBD, NIPALS-PLS with deflation,
and Bidiag2-PLS with reorthogonalization. For large-scale problems, Bidiag2-PLS is the method
of choice. On some large simulated data sets of size 30,000 × 10,000 and 100 extracted com-
ponents, Bidiag2-PLS was about seven times faster than NIPALS. Notably, the popular SIMPLS
algorithm by de Jong [295, 1993], used by the MATLAB function plsregress, gave poor ac-
curacy. This was true even when it was improved by reorthogonalization, as suggested by Faber
and Ferré [390, 2008].

Notes and references


The close relationship between PLS and bidiagonalization is explored by Manne [771, 1987]
and Helland [599, 1988]. Many extensions of the basic PLS algorithm treated here have been
devised; see S. Wold, Sjöström, and Eriksson [1129, 2001]. An overview of advances in PLS is
given by Rosipal and Krämer [937, 2005].
Simon and Zha [996, 2000] show that in many applications it may suffice to reorthogo-
nalize either uk or vk . However, tests by Björck and Indahl [148, 2017] show that such one-
sided reorthogonalization could lead to a substantial loss of precision compared to full
reorthogonalization.

4.3 Some Structured Problems


4.3.1 Two-Block Least Squares Problems
It is sometimes useful to partition a least squares problem into two blocks as
 
\min_{y,z} ∥ ( A   B ) \begin{pmatrix} y \\ z \end{pmatrix} − b ∥_2^2 ,    (4.3.1)

where A ∈ Rm×n1 , B ∈ Rm×n2 . One example is periodic spline approximation, which leads to
a problem of augmented band form, where A is a band matrix and B is a full matrix with a small
number of columns.
If z is given, then y must solve the problem

min ∥Ay − (b − Bz)∥2 . (4.3.2)


y

Substituting the solution y = A† (b − Bz) into (4.3.2) to eliminate y, we see that z solves the
problem
min ∥(I − AA† )(Bz − b)∥2 , (4.3.3)
z

where I − AA† = PN (AT ) is the orthogonal projector onto N (AT ). Thus (4.3.1) has been split
into two separate least squares problems (4.3.3) and (4.3.2).
One advantage of this is that different methods can be used to solve the two least squares
subproblems for z and y. Moreover, the subproblem for z is always better conditioned than the
original problem. Hence, z can sometimes be computed with sufficient accuracy by the method
of normal equations; see Foster [427, 1991]. Another application is when n2 ≫ n1 and the
subproblem for z can be solved by an iterative method; see Section 6.3.6.
The normal equations of (4.3.1) are
\begin{pmatrix} A^T A & A^T B \\ B^T A & B^T B \end{pmatrix} \begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} A^T b \\ B^T b \end{pmatrix} .

Eliminating y gives Sz = s, where S is the Schur complement of AT A. This can be written

S = B T PN (AT ) B, s = B T PN (AT ) b, (4.3.4)

where PN (AT ) = I − A(ATA)−1 AT . When z ∈ Rn2 has been determined, we obtain y ∈ Rn1
from
ATAy = AT (b − Bz). (4.3.5)
For better stability, methods based on orthogonal factorizations should be used. After n1
steps, a partial Householder QR factorization reduces the first block in ( A B ) to the upper
triangular form,  
Q_1^T ( A   B   b ) = \begin{pmatrix} R_{11} & R_{12} & c_1 \\ 0 & A_{22} & c_2 \end{pmatrix} ,    (4.3.6)
where Q1 = P1 · · · Pn1 . By the orthogonal invariance of the 2-norm,
    
∥Ax − b∥_2 = ∥ \begin{pmatrix} R_{11} & R_{12} \\ 0 & A_{22} \end{pmatrix} \begin{pmatrix} y \\ z \end{pmatrix} − \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} ∥_2 .

This gives for z and y the subproblems

\min_z ∥A_{22} z − c_2∥_2 ,    R_{11} y = c_1 − R_{12} z.    (4.3.7)
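The following MATLAB sketch (not from the book; the function name twoblock_ls and the data A, B, b are assumptions, with ( A  B ) of full column rank) illustrates how the subproblems (4.3.7) can be obtained from a full Householder QR factorization of A:

function [y,z] = twoblock_ls(A,B,b)
% Solve min || (A B)(y;z) - b ||_2 by eliminating y; cf. (4.3.6)-(4.3.7).
[m,n1] = size(A);
[Q,R] = qr(A);                 % full QR factorization of A
R11 = R(1:n1,1:n1);
C = Q'*B; c = Q'*b;            % transform the remaining data
R12 = C(1:n1,:); c1 = c(1:n1);
A22 = C(n1+1:m,:); c2 = c(n1+1:m);
z = A22\c2;                    % least squares subproblem for z
y = R11\(c1 - R12*z);          % back substitution for y
end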

Similarly, after n1 steps of MGS we have obtained the partial factorization


 
( A   b ) = ( Q_1   A^{(n_1+1)}   b^{(n_1+1)} ) \begin{pmatrix} R_{11} & R_{12} & c_1 \\ 0 & I & 0 \\ 0 & 0 & 1 \end{pmatrix} ,

where Q_1^T A^{(n_1+1)} = 0. The original problem then decomposes into (compare (4.3.7))

\min_z ∥A^{(n_1+1)} z − b^{(n_1+1)}∥_2 ,    R_{11} y = c_1 − R_{12} z.

The following lemma gives an alternative formulation without explicitly referring to orthog-
onal projections.

Lemma 4.3.1. Let w ∈ R^{n_1} and W ∈ R^{n_1×n_2} be the solutions to the least squares problems

\min_w ∥Aw − b∥_2 ,    \min_W ∥AW − B∥_F .    (4.3.8)

Then the solution to (4.3.1) is given by y = w − Wz, where z solves

\min_z ∥(B − AW)z − (b − Aw)∥_2 .    (4.3.9)

Proof. If w and W solve (4.3.8), then Aw = P_A b and AW = P_A B. Hence the least squares
problems (4.3.9) and (4.3.3) are equivalent. Further, y = w − Wz solves (4.3.2), because Ay =
A(w − Wz) = P_A(b − Bz).

A common practice in linear regression is to preprocess the data by centering, i.e., subtracting
out the means. This can be interpreted as a simple case of a two-block least squares problem, where

( e   B ) \begin{pmatrix} ξ \\ z \end{pmatrix} = b ,    e = (1, . . . , 1)^T .    (4.3.10)

Multiplying B ∈ R^{m×n} and b ∈ R^m with the projection matrix (I − ee^T/m) gives

B̄ = B − \frac{1}{m} e(e^T B) ,    b̄ = b − \frac{e^T b}{m} e .    (4.3.11)

This makes the columns of B̄ and b̄ orthogonal to e. The reduced least squares problem becomes
\min_z ∥B̄z − b̄∥_2 . After solving the reduced problem for z, we obtain ξ = e^T(b − Bz)/m.
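As an illustration, a minimal MATLAB sketch of this centering step (assuming the data B and b are given) reads:

m = size(B,1); e = ones(m,1);
Bbar = B - e*((e'*B)/m);       % B centered as in (4.3.11)
bbar = b - e*((e'*b)/m);       % b centered
z  = Bbar\bbar;                % reduced least squares problem
xi = (e'*(b - B*z))/m;         % intercept xi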

Example 4.3.2. The Hald cement data (see [561, 1952, p. 647]) are used in Draper and Smith
[331, 1998, Appendix B] and several other books as an example of regression analysis. The
right-hand side consists of m = 13 observations of the heat evolved in cement during hardening.
The explanatory variables are four different ingredients of the mix and a constant term:
   
A = \begin{pmatrix}
1 &  7 & 26 &  6 & 60 \\
1 &  1 & 29 & 15 & 52 \\
1 & 11 & 56 &  8 & 20 \\
1 & 11 & 31 &  8 & 47 \\
1 &  7 & 52 &  6 & 33 \\
1 & 11 & 55 &  9 & 22 \\
1 &  3 & 71 & 17 &  6 \\
1 &  1 & 31 & 22 & 44 \\
1 &  2 & 54 & 18 & 22 \\
1 & 21 & 47 &  4 & 26 \\
1 &  1 & 40 & 23 & 34 \\
1 & 11 & 66 &  9 & 12 \\
1 & 10 & 68 &  8 & 12
\end{pmatrix} , \qquad
b = \begin{pmatrix}
78.5 \\ 74.3 \\ 104.3 \\ 87.6 \\ 95.9 \\ 109.2 \\ 102.7 \\ 72.5 \\ 93.1 \\ 115.9 \\ 83.8 \\ 113.3 \\ 109.4
\end{pmatrix} .    (4.3.12)

For the least squares problem ∥Ax − b∥2 , κ(A) ≈ 3.245 · 10^3 indicates that about six digits
may be lost when using the normal equations. The first column of ones in A = (e, B) is added
to extract the mean values. The first variable ξ in x = (ξ, y) can be eliminated by setting

B̄ = B − epT , c = b − βe,

where p = (e^T B)/m, β = e^T b/m, and ξ = β − p^T y. The reduced problem \min_y ∥B̄y − c∥2
is much better conditioned: κ(B̄) = 23.0. Normalizing the columns of B to have unit length
decreases the condition number by only a small amount to κ(BD) = 19.6.

4.3.2 Block-Angular Problems


As noted by Rice [925, 1983], many large sparse least squares problems possess a natural multi-
level block structure. An early example is geodetic survey problems. A technique for breaking
down such problems into geographically defined subproblems was introduced by Helmert [600,
1880]; see Golub and Plemmons [505, 1980] ). Other examples are photogrammetry (Golub,
Luk, and Pagano [499, 1979] ), Doppler radar positioning (Manneback, Murigande, and Toint
[773, 1985] ), and economic models (Duchin and Szyld [339, 1979]).
The substructuring can reflect a “local connection” in the underlying physical problem. For
example, a geodetic position network consists of geodetic stations connected through observa-
tions. To each station corresponds a set of unknown coordinates to be determined. This data may
naturally be arranged by counties, then by states, then by countries. Let B be a set of stations
that separates the other stations into p blocks A1 , . . . , Ap in such a way that station variables in
Ai are not connected by observations to station variables in Aj if i ̸= j. Order the variables so
that those in A1 , . . . , Ap appear first and those in B last. The dissection can also be continued
by dissecting each of the regions A1 , . . . , Ap into separate subregions, and so on in a recursive
fashion. The blocking of the region for one and two levels of dissection for p = 2 is pictured
in Figure 4.3.1. In such a nested dissection, it is advantageous to perform the dissection in
such a way that in each stage of the dissection, the numbers of variables in the two partitions
are roughly the same. Also, the number of variables in the separator nodes should be as small
as possible.

Figure 4.3.1. One and two levels of dissection of a region.

The block structure in the matrix of the corresponding linear system induced by one and two
levels of dissection for p = 2 is shown here:
A = \begin{pmatrix} A_1 & & B_1 \\ & A_2 & B_2 \end{pmatrix} , \qquad
A = \begin{pmatrix} A_1 & & B_1 & & & & D_1 \\ & A_2 & B_2 & & & & D_2 \\ & & & A_3 & & C_3 & D_3 \\ & & & & A_4 & C_4 & D_4 \end{pmatrix} .    (4.3.13)
There is a finer structure in A not shown here. For example, in one level of dissection most of
the equations involve variables in A1 or A2 only but not in B.

For a least squares problem \min_x ∥Ax − b∥_2 arising from a one-level dissection into p
regions, the matrix has a bordered block diagonal or block-angular form,

A = \begin{pmatrix} A_1 & & & & B_1 \\ & A_2 & & & B_2 \\ & & \ddots & & \vdots \\ & & & A_p & B_p \end{pmatrix} ,    x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \\ x_{p+1} \end{pmatrix} ,    b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix} ,    (4.3.14)

where
Ai ∈ Rmi ×ni , Bi ∈ Rmi ×np+1 , i = 1, . . . , p,
and m = m1 + m2 + · · · + mp , n = n1 + n2 + · · · + np+1 . For some problems, the blocks
Ai and/or Bi are themselves large sparse matrices, often of the same general sparsity pattern as
A. There is a wide variation in the number and size of blocks. Some problems have large blocks
with p of moderate size (10–100), while others have many more but smaller blocks.
The normal matrix A^T A, with A given as in (4.3.14), has a doubly bordered block diagonal form

A^T A = \begin{pmatrix} A_1^T A_1 & & & & A_1^T B_1 \\ & A_2^T A_2 & & & A_2^T B_2 \\ & & \ddots & & \vdots \\ & & & A_p^T A_p & A_p^T B_p \\ B_1^T A_1 & B_2^T A_2 & \cdots & B_p^T A_p & C \end{pmatrix} ,

where C = \sum_{i=1}^{p} B_i^T B_i . The right-hand side f = (f_1 , . . . , f_{p+1}) of the normal equations is

f_i = A_i^T b_i ,  i = 1, . . . , p,    f_{p+1} = \sum_{i=1}^{p} B_i^T b_i .

If rank(A) = n, the upper triangular Cholesky factor R of ATA exists and has a block structure
similar to that of A:
R = \begin{pmatrix} R_1 & & & & S_1 \\ & R_2 & & & S_2 \\ & & \ddots & & \vdots \\ & & & R_p & S_p \\ & & & & R_{p+1} \end{pmatrix} .    (4.3.15)
Equating the blocks in RTR = ATA gives

RiT Ri = ATiAi , RiT Si = ATi Bi , i = 1, . . . , p, (4.3.16)


R_{p+1}^T R_{p+1} = C − \sum_{i=1}^{p} S_i^T S_i ,    (4.3.17)

where the Cholesky factors Ri ∈ Rni ×ni and Si can be computed independently and in parallel.
The least squares solution is then obtained from the two-block triangular systems RT z = AT b =
f and Rx = z. This amounts to first solving the lower triangular systems
R_i^T z_i = A_i^T b_i ,  i = 1, . . . , p,    R_{p+1}^T z_{p+1} = f_{p+1} − \sum_{i=1}^{p} S_i^T z_i ,    (4.3.18)

and then the upper triangular systems

Rp+1 xp+1 = zp+1 , Ri xi = zi − Si xp+1 , i = 1, . . . , p. (4.3.19)

Again, nearly all of these systems can be solved in parallel.
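A minimal MATLAB sketch of this process is given below. It assumes the blocks in (4.3.14) are stored in (hypothetical) cell arrays Ablk, Bblk, bblk and that the A_i and the Schur complement are well enough conditioned for the normal equations to be adequate.

p = numel(Ablk); np1 = size(Bblk{1},2);
C = zeros(np1); g = zeros(np1,1);
R = cell(p,1); S = cell(p,1); z = cell(p,1);
for i = 1:p
R{i} = chol(Ablk{i}'*Ablk{i});              % (4.3.16)
S{i} = R{i}'\(Ablk{i}'*Bblk{i});
z{i} = R{i}'\(Ablk{i}'*bblk{i});            % first part of (4.3.18)
C = C + Bblk{i}'*Bblk{i} - S{i}'*S{i};      % accumulate (4.3.17)
g = g + Bblk{i}'*bblk{i} - S{i}'*z{i};      % right-hand side for z_{p+1}
end
Rp1 = chol(C);                              % R_{p+1}
xp1 = Rp1\(Rp1'\g);                         % x_{p+1} from (4.3.18)-(4.3.19)
x = cell(p,1);
for i = 1:p
x{i} = R{i}\(z{i} - S{i}*xp1);              % (4.3.19)
end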


A more accurate algorithm based on QR factorization for solving block-angular least squares
problems is given by Golub, Luk, and Pagano [499, 1979]. This proceeds in the following three
steps:

1. Reduce the diagonal block Ai to upper triangular form by a sequence of orthogonal trans-
formations applied to (Ai , Bi ) and the right-hand side bi , yielding
   
Q_i^T (A_i , B_i) = \begin{pmatrix} R_i & S_i \\ 0 & T_i \end{pmatrix} ,    Q_i^T b_i = \begin{pmatrix} c_i \\ d_i \end{pmatrix} ,  i = 1, . . . , p.    (4.3.20)
Any sparse structure in the blocks Ai should be exploited.
2. Form
T = \begin{pmatrix} T_1 \\ \vdots \\ T_p \end{pmatrix} ,    d = \begin{pmatrix} d_1 \\ \vdots \\ d_p \end{pmatrix} ,

and compute the QR factorization

Q_{p+1}^T ( T   d ) = \begin{pmatrix} R_{p+1} & c_{p+1} \\ 0 & d_{p+1} \end{pmatrix} .    (4.3.21)

The residual norm is given by ρ = ∥dp+1 ∥2 .


3. Solve xp+1 , xp , . . . , x1 from the upper triangular systems

Rp+1 xp+1 = cp+1 , Ri xi = ci − Si xp+1 , i = 1, . . . , p.

There are several ways to organize this algorithm. In steps 1 and 3 the computations can be
performed in parallel on the p subsystems. It is then advantageous to continue the reduction in
step 1 so that the matrices Ti , i = 1, . . . , p, are brought into upper trapezoidal form.
Large problems may require too much memory, even if we take into account the block-
angular structure. Cox [275, 1990] suggests two modifications by which the storage requirement
can be reduced. By merging steps 1 and 2, it is not necessary to hold all blocks Ti simultaneously
in memory. Even more storage can be saved by discarding Ri and Si after they have been
computed in step 1 and recomputing them for step 3. Indeed, only Ri needs to be recomputed,
because after y has been computed in step 2, xi is the solution to the least squares problems

\min_{x_i} ∥A_i x_i − g_i∥_2 ,    g_i = b_i − B_i y.

Hence, to determine xi we only need to (re)compute the QR factorizations of (Ai , gi ), i =


1, . . . , p. In some practical problems this modification can reduce the storage requirement by
an order of magnitude, while recomputation of Ri only increases the operation count by a small
percentage.
From the structure of the R-factor in (4.3.15), the diagonal blocks of the covariance matrix
C = (RTR)−1 = R−1 R−T can be written as (see Golub, Plemmons, and Sameh [506, 1988])
C_{p+1,p+1} = R_{p+1}^{-1} R_{p+1}^{-T} ,    C_{i,i} = R_i^{-1} (I + W_i^T W_i) R_i^{-T} ,    W_i^T = S_i R_{p+1}^{-1} ,  i = 1, . . . , p.    (4.3.22)

Hence, if we compute the QR factorizations


   
Q_i \begin{pmatrix} W_i \\ I \end{pmatrix} = \begin{pmatrix} U_i \\ 0 \end{pmatrix} ,  i = 1, . . . , p,

we have I + W_i^T W_i = U_i^T U_i and

C_{i,i} = (U_i R_i^{-T})^T (U_i R_i^{-T}) ,  i = 1, . . . , p.

This method assumes that all the matrices Ri and Si have been retained. For a discussion of how
to compute variances and covariances when the storage saving algorithm is used, see Cox [275,
1990].
In some applications the matrices Ri will be sparse, but a lot of fill occurs in the blocks Bi
in step 1. Then the triangular matrix Rp+1 will be full and expensive to compute. For such
problems a block-preconditioned iterative method may be more efficient; see Section 6.3. Then
an iterative method, such as CGLS or LSQR, is applied to the problem

\min_y ∥(AM^{-1})y − b∥_2 ,    y = Mx,

where a suitable preconditioner is M = diag (R1 , . . . , Rp , Rp+1 ); see Golub, Manneback, and
Toint [500, 1986].

Notes and references

Dissection and orthogonal decompositions in geodetic survey problems are treated by Golub and
Plemmons [505, 1980]. Avila and Tomlin [44, 1979] discuss parallelism in the solution of very
large least squares problems by nested dissection and the method of normal equations. Weil and
Kettler [1116, 1971] give a heuristic algorithm for permuting a general sparse matrix into block-
angular form. The dissection procedure described above is a variation of the nested dissection
orderings developed for general sparse positive definite systems; see Section 5.1.5.

4.3.3 Kronecker Product Structure


Sometimes least squares problems occur with a highly regular block structure. Here we consider
least squares problems of the form

\min_x ∥(A ⊗ B)x − f∥_2 ,    (4.3.23)

where A ⊗ B is the Kronecker product of A ∈ Rm×n and B ∈ Rp×q . This product is the
mp × nq block matrix

A ⊗ B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix} .

Problems with Kronecker structure arise in several application areas, including signal and image
processing, photogrammetry, and multidimensional approximation; see Fausett and Fulton [399,
1994]. Grosse [540, 1980] describes a tensor factorization algorithm and how it applies to least
squares fitting of multivariate data on a rectangular grid. Such problems can be solved with great

savings in storage and operations. These savings are essential for problems where A and B are
large. It is not unusual to have several hundred thousand equations and unknowns.
The Kronecker product and its relation to linear matrix equations, such as Lyapunov’s equa-
tion, are treated in Horn and Johnson [640, 1991, Chapter 4]. See also Henderson and Searle [602,
1981] and Van Loan [1083, 2000]. We now state some elementary facts about Kronecker prod-
ucts that follow from its definition:

(A + B) ⊗ C = (A ⊗ C) + (B ⊗ C),
A ⊗ (B + C) = (A ⊗ B) + (A ⊗ C),
A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C,
(A ⊗ B)T = AT ⊗ B T .

A further important relation, which is not so obvious, is given next.

Lemma 4.3.3. If the ordinary products AC and BD are defined, then

(A ⊗ B)(C ⊗ D) = AC ⊗ BD. (4.3.24)

Proof. See Lancaster and Tismenetsky [713, 1985, Chap. 12.1].

As a corollary of this lemma we obtain the identity

(A1 ⊗ B1 )(A2 ⊗ B2 ) · · · (Ap ⊗ Bp ) = (A1 A2 · · · Ap ) ⊗ (B1 B2 · · · Bp ),

assuming all the products are defined. We can also conclude that if P and Q are orthogonal n×n
matrices, then P ⊗ Q is an orthogonal n^2 × n^2 matrix. Furthermore, if A and B are square and
nonsingular, it follows that A ⊗ B is nonsingular and

(A ⊗ B)−1 = A−1 ⊗ B −1 .

This generalizes to pseudoinverses, as shown in the following lemmas.

Lemma 4.3.4. Let A† and B † be the pseudoinverses of A and B. Then

(A ⊗ B)† = A† ⊗ B † .

Proof. The lemma follows by verifying that X = A† ⊗ B † satisfies the four Penrose conditions
in Theorem 1.2.11.

We now introduce a function, closely related to the Kronecker product, that converts a matrix
into a vector. For a matrix C = (c1 , c2 , . . . , cn ) ∈ Rm×n we define

vec (C) = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix} .    (4.3.25)

Hence vec (C) is the vector formed by stacking the columns of C into one column vector of
length mn. We now state a result that shows how the vec-function is related to the Kronecker
product.

Lemma 4.3.5. If A ∈ Rm×n , B ∈ Rp×q , and F ∈ Rq×n , then

(A ⊗ B)vec (F ) = vec (BF AT ). (4.3.26)

By Lemma 4.3.5, we can write the solution to the Kronecker least squares problem (4.3.23)
as
x = (A ⊗ B)† f = (A† ⊗ B † )f = vec (B † F (A† )T ), (4.3.27)
where f = vec (F ). This allows a great reduction in the cost of solving (4.3.23). For example,
if both A and B are m × n matrices, the cost of computing the least squares solution is reduced
from O(m^2 n^4) to O(mn^2).
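A minimal MATLAB sketch of (4.3.27), assuming A ∈ R^{m×n} and B ∈ R^{p×q} have full column rank and f = vec(F) with F of size p × m, shows how the solution is obtained without ever forming A ⊗ B:

[QA,RA] = qr(A,0); [QB,RB] = qr(B,0);   % economy QR factorizations
F = reshape(f,size(B,1),size(A,1));     % f = vec(F)
X = RB\(QB'*F*QA)/RA';                  % X = B^+ * F * (A^+)^T
x = X(:);                               % x = vec(X) solves (4.3.23)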
In some areas, the most common approach to computing the least squares solution to (4.3.23)
is from normal equations. If we assume that both A and B have full column rank, we can use the
expressions
A† = (ATA)−1 AT , B † = (B TB)−1 B T .
However, because of the instability associated with the explicit formation of ATA and B TB, an
approach based on orthogonal decompositions should generally be preferred. From the complete
QR factorizations of A and B,
   
AΠ_1 = Q_1 \begin{pmatrix} R_1 & 0 \\ 0 & 0 \end{pmatrix} V_1^T ,    BΠ_2 = Q_2 \begin{pmatrix} R_2 & 0 \\ 0 & 0 \end{pmatrix} V_2^T ,

with R_1 , R_2 upper triangular and nonsingular, we obtain

A^† = Π_1 V_1 \begin{pmatrix} R_1^{-1} & 0 \\ 0 & 0 \end{pmatrix} Q_1^T ,    B^† = Π_2 V_2 \begin{pmatrix} R_2^{-1} & 0 \\ 0 & 0 \end{pmatrix} Q_2^T .

These expressions can be used in (4.3.27) to compute the pseudoinverse solution of problem
(4.3.23), even in the rank-deficient case.
From Lemma 4.3.5, the following simple expression for the singular values and vectors of
the Kronecker product A⊗B, in terms of the singular values and vectors of A and B, is obtained.

Lemma 4.3.6. Let A and B have the SVDs A = U1 Σ1 V1T and B = U2 Σ2 V2T . Then

A ⊗ B = (U1 ⊗ U2 )(Σ1 ⊗ Σ2 )(V1 ⊗ V2 )T

is the SVD of A ⊗ B.

Barrlund [84, 1998] develops an efficient solution method for constrained least squares prob-
lems with Kronecker structure:

\min_x ∥(A_1 ⊗ A_2)x − f∥_2  subject to  (B_1 ⊗ B_2)x = g.    (4.3.28)

With vec (X) = x, vec (F ) = f , and vec (G) = g, this becomes

\min_X ∥A_2 X A_1^T − F∥_F  subject to  B_2 X B_1^T = G.    (4.3.29)

This problem can be solved by a nullspace method; cf. Section 3.4.2. By a change of variables
the unknowns are split into two sets. The first set is determined by the constraints, and the other
set belongs to the nullspace of the constraints.

4.3.4 Strongly Rectangular Systems


The least squares problem

\min_{x∈S} ∥x∥_2 ,    S = {x ∈ R^n | ∥b − Ax∥_2 = min},    (4.3.30)

where A ∈ Rm×n , is said to be strongly overdetermined if m ≫ n and strongly underdetermined


if m ≪ n. Candès et al. [205, 2011] mention a case in stationary video background subtraction,
where n is about 10^3 and m can exceed 10^6. Examples from other application areas, such
as seismology, natural language processing, and analysis of the human genome, are given by
Meng [788, 2014].
Demmel et al. [306, 2012] give a family of stable, efficient, and communication-reducing
algorithms called TSQR for computing the QR factorization of a strongly overdetermined matrix
A ∈ Rm×n , m ≫ n. (Such matrices are also called “tall-and-skinny.”) In TSQR a rowwise
partitioning of A into blocks Ai , i = 0, . . . , N − 1, is used. In the first stage, factorizations

Ai = Qi Ri , i = 0, . . . , N − 1,

are computed. Subsequent stages merge pairs of the resulting upper triangular matrices in a
divide-and-conquer fashion until a single factor R has been obtained. This requires about log2 N
stages. The algorithm is exemplified below for N = 4. After the first step we have obtained

A = \begin{pmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \end{pmatrix} = \begin{pmatrix} Q_0 R_0 \\ Q_1 R_1 \\ Q_2 R_2 \\ Q_3 R_3 \end{pmatrix} = \begin{pmatrix} Q_0 & & & \\ & Q_1 & & \\ & & Q_2 & \\ & & & Q_3 \end{pmatrix} \begin{pmatrix} R_0 \\ R_1 \\ R_2 \\ R_3 \end{pmatrix} .    (4.3.31)

In the next two steps the QR factorizations of the stacked upper triangular factors are merged
into one upper triangular matrix
 
\begin{pmatrix} R_0 \\ R_1 \\ R_2 \\ R_3 \end{pmatrix} = \begin{pmatrix} Q_{01} R_{01} \\ Q_{23} R_{23} \end{pmatrix} ,    \begin{pmatrix} R_{01} \\ R_{23} \end{pmatrix} = Q_{0,1,2,3} R.    (4.3.32)

The representation of the factor Q in TSQR is different from the standard Householder QR
factorization. It is implicitly given by a tree of smaller Householder transformations

Q = \begin{pmatrix} Q_0 & & & \\ & Q_1 & & \\ & & Q_2 & \\ & & & Q_3 \end{pmatrix} \begin{pmatrix} Q_{01} & \\ & Q_{23} \end{pmatrix} Q_{0,1,2,3} .    (4.3.33)

In general, the combination process in TSQR forms a tree with the row blocks Ai as leaves and
the final R as a root. The version pictured above corresponds to a binary tree. The tree shape can
be chosen to minimize either the communication between processors or the volume of memory
traffic between the main memory and the cache memory of each processor.
The initial Householder QR factorizations of the N blocks A_i ∈ R^{p×n} require 2Nn^2(p − n/3) flops.
Merging two triangular QR factorizations of dimension n × n takes 2n^3/3 flops. The total
arithmetic cost of TSQR is higher than that for the direct Householder QR factorization of A,
but for strongly rectangular systems most of the arithmetic work is spent in computing the QR
factorizations of the submatrices, which can be done in parallel.
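The following MATLAB sketch (a serial illustration only; in practice the block factorizations are distributed over processors, and the block names A0, . . . , A3 are assumed given) carries out the TSQR reduction (4.3.31)-(4.3.32) for N = 4 row blocks of a tall-and-skinny matrix:

[~,R0] = qr(A0,0); [~,R1] = qr(A1,0);   % leaf factorizations
[~,R2] = qr(A2,0); [~,R3] = qr(A3,0);
[~,R01] = qr([R0;R1],0);                % first merge level
[~,R23] = qr([R2;R3],0);
[~,R]   = qr([R01;R23],0);              % final triangular factor R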

An implementation of TSQR using the message passing interface (MPI) operation AllReduce
for multiple processors is given by Langou [719, 2007]. Experiments show that the AllReduce
QR algorithm obtains nearly linear speed-up. As shown by Mori et al. [811, 2012], although the
number of floating-point operations is larger, the bounds for the backward error and the deviation
from orthogonality are smaller for the AllReduce QR algorithm than for standard Householder
QR.
Another communication-avoiding QR algorithm suitable for tall-and-skinny matrices is the
Cholesky QR algorithm. Let A ∈ Rm×n have full column rank and ATA = RTR be its Cholesky
factorization. The Cholesky QR algorithm then computes Q1 = AR−1 by block forward substi-
tution, giving
A = Q1 R; (4.3.34)
see Section 1.2.1. The arithmetic cost of this algorithm is 2mn^2 + n^3/3 flops. The Cholesky QR
algorithm is ideal from the viewpoint of high performance computing. It requires only one global
reduction between parallel processing units, and most of the arithmetic work can be performed
as matrix-matrix operations. However, the loss of orthogonality ∥I − QT1 Q1 ∥F can only be
bounded by the squared condition number of A.
Yamamoto et al. [1136, 2015] suggest a modified Cholesky QR2 algorithm, where R and
Q1 from Cholesky QR are refined as follows. First, compute E = QT1 Q1 and its Cholesky
factorization E = S T S. The refined factorization is taken to be A = P U , where

P = Q_1 S^{-1} ,    U = SR.    (4.3.35)

This updating step doubles the arithmetic cost. The Cholesky QR2 algorithm has good stability
properties provided the initial Cholesky factorization does not break down. However, the QR2
algorithm may fail for matrices with a condition number roughly greater than u^{-1/2}.
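A minimal MATLAB sketch of the Cholesky QR and Cholesky QR2 steps (4.3.34)-(4.3.35), assuming A has full column rank and that both Cholesky factorizations run to completion, is:

R  = chol(A'*A);        % Cholesky factor of the normal matrix
Q1 = A/R;               % Q1 = A*R^{-1}, so A = Q1*R  (Cholesky QR)
S  = chol(Q1'*Q1);      % refinement step of Cholesky QR2
P  = Q1/S; U = S*R;     % A = P*U with improved orthogonality of P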
Yamazaki, Tomov, and Dongarra [1137, 2015] extend the applicability of the Cholesky QR2
algorithm as follows. An initial Cholesky factorization of RT R = AT A + sI is computed,
where s ≥ 0 is a shift that guarantees that the factorization runs to completion. Further, some
intermediate results are accumulated in higher precision. The resulting Cholesky QR3 algorithm
uses three Cholesky QR steps and yields a computed Q with loss of orthogonality ∥I − QT1 Q1 ∥F
and residual ∥A − QR∥F /∥A∥F of order u. See also Fukaya et al. [435, 2020].

4.3.5 Multilinear and Tensor Calculus


In many science and engineering applications, the data encountered have a multidimensional
structure with more than two dimensions. Consider vector spaces X1 , X2 , . . . , Xd and Y , and
let xν ∈ Xν . A function A: X1 × X2 × · · · × Xd → Y is called d-multilinear if it is linear
in each of its arguments xi separately. For example, the expression (P x1 )T Qx2 + (Rx3 )T Sx4
defines a four-linear function, mapping or operator, provided the constant matrices P , Q, R, S
have appropriate size. If d = 2, the term bilinear function is used.
Let Xν = Rnν , ν = 1, 2, . . . , d, Y = Rm , and let eji be one of the basis vectors of Xi .
Then superscripts can be used to denote coordinates in these spaces. For example, aij1 ,j2 ,...,jd
denotes the ith coordinate of A(ej1 , ej2 , . . . , ejd ). Because of the linearity, the ith coordinate of
A(x1 , x2 , . . . , xd ), xν ∈ Xν , reads
\sum_{j_1=1}^{n_1} \sum_{j_2=1}^{n_2} \cdots \sum_{j_d=1}^{n_d} a^i_{j_1,j_2,...,j_d} x_1^{j_1} x_2^{j_2} \cdots x_d^{j_d} .    (4.3.36)

The following sum convention is often used. If an index occurs as both a subscript and a su-
perscript, the product should be summed over the range of this index. For example, the ith

coordinate of A(x_1 , x_2 , . . . , x_d) is written a^i_{j_1,j_2,...,j_d} x_1^{j_1} x_2^{j_2} \cdots x_d^{j_d} . (Remember the superscripts
are not exponents.)
Suppose Xi = X, i = 1, 2, . . . , d. Then the set of d-linear mappings from X k to Y is itself
a linear space, denoted by Lk (X, Y ). For k = 1, we have the space of linear functions. Linear
functions can, of course, be described in vector-matrix notation as a set of matrices L(Rn , Rm ) =
Rm×n . Matrix notation can also be used for each coordinate of a bilinear function. Norms of
multilinear operators are defined analogously to subordinate matrix norms. For example,

∥A(x1 , x2 , . . . , xk )∥∞ ≤ ∥A∥∞ ∥x1 ∥∞ ∥x2 ∥∞ . . . ∥xk ∥∞ ,

where
∥A∥_∞ = \max_{i=1}^{m} \sum_{j_1=1}^{n_1} \sum_{j_2=1}^{n_2} \cdots \sum_{j_k=1}^{n_k} |a^i_{j_1,j_2,...,j_k}| .    (4.3.37)

A multilinear function A is called symmetric if A(x1 , x2 , . . . , xk ) is symmetric with respect to


its arguments. In the cases mentioned above, where matrix notation can be used, the matrix
becomes symmetric if the multilinear function is symmetric.
Multidimensional data can be represented by a tensor or, using a coordinate representation,
a hypermatrix. Mathematical theory and computational methods for tensor problems are still
being developed, and notation may vary among papers. In the following, we denote tensors by
calligraphic letters; e.g., we refer to

A = (ai1 ,...,id ) ∈ Rn1 ×···×nd (4.3.38)

as a d-mode tensor, d > 2. The case d = 2 corresponds to matrices. In the following we


emphasize the case d = 3 because the main difference between matrices and hypermatrices
comes in the transition from d = 2 to 3. Subarrays are formed by keeping a subset of the indices
constant. A 3-mode tensor (4.3.38) can be thought of as being built up by matrix slices in three
ways by fixing one of the indices, e.g.,

(a:,:,j ) ∈ Rn1 ×n2 , j = 1 : n3 .

Similarly, by fixing any two indices, we get a vector or fiber

(a:,j,k ) ∈ Rn1 , j = 1 : n2 , k = 1 : n3 .

A tensor is said to be symmetric if its elements are equal under any permutations of the indices,
i.e., for a 3-mode tensor,

ai,j,k = ai,k,j = aj,k,i = aj,i,k = ak,i,j = ak,j,i ∀ i, j, k;

see Comon et al. [263, 2008]. A tensor is diagonal if ai1 ,i2 ,...,id ̸= 0 only if i1 = i2 = · · · = id .
Elementwise addition and scalar multiplication trivially extend to hypermatrices of arbitrary
order. The tensor or outer product is denoted by ◦ (not to be confused with the Hadamard
product of matrices). For example, if A = (aij ) ∈ Rm×n and B = (bkl ) ∈ Rp×q are matrices,
then
C = A ◦ B = (ai,j,k,l )
is a 4-mode tensor. The 1-mode contraction product of two 3-mode hypermatrices A =
(ai,j,k ) ∈ Rn×n2 ×n3 and B = (bi,l,m ) ∈ Rn×m2 ×m3 with conforming first dimension is the
4-mode tensor C ∈ Rn2 ×n3 ×m2 ×m3 defined as
C = ⟨A, B⟩_1 ,    c_{j,k,l,m} = \sum_{i=1}^{n} a_{i,j,k} b_{i,l,m} .    (4.3.39)

Contractions need not be restricted to one pair of indices at a time. The inner product of two
3-mode tensors of the same size and the Frobenius norm of a tensor are defined as
⟨A, B⟩ = \sum_{i,j,k} a_{ijk} b_{ijk} ,    ∥A∥_F^2 = ⟨A, A⟩ = \sum_{i,j,k} a_{ijk}^2 .    (4.3.40)

The matrix Hölder norm for p = 1, ∞ is similarly generalized.


The columns of A can be stacked or unfolded into a column vector by the operation vec(A).
A second way would be to unfold its rows into a row vector. Similarly, a 3-mode tensor A can
be unfolded or matricized by stacking in some order the matrix slices obtained by fixing one of
its three modes. Following Eldén and Savas [380, 2009], we use the notation

A_{(1)} = (A_{:,1,:} , A_{:,2,:} , . . . , A_{:,n_2,:}) ∈ R^{n_1×n_2 n_3} ,
A_{(2)} = (A^T_{:,:,1} , A^T_{:,:,2} , . . . , A^T_{:,:,n_3}) ∈ R^{n_2×n_1 n_3} ,    (4.3.41)
A_{(3)} = (A^T_{1,:,:} , A^T_{2,:,:} , . . . , A^T_{n_1,:,:}) ∈ R^{n_3×n_1 n_2} ,

where a colon indicates all elements of a mode. Different papers sometimes use different order-
ings of the columns. The specific permutation is not important as long as it is consistent.
A matrix C ∈ Rp×q can be multiplied from the left and right by other matrices X ∈ Rm×p
and Y ∈ Rn×q , and we write
A = XCY^T ,    a_{ij} = \sum_{α=1}^{p} \sum_{β=1}^{q} x_{iα} y_{jβ} c_{αβ} .

The corresponding tensor-matrix multiplication of a 3-mode tensor C ∈ Rp×q×r by three matri-


ces X ∈ Rl×p , Y ∈ Rm×q , and Z ∈ Rn×r transforms C into the 3-mode tensor A ∈ Rl×m×n
with entries
a_{ijk} = \sum_{α=1}^{p} \sum_{β=1}^{q} \sum_{γ=1}^{r} x_{iα} y_{jβ} z_{kγ} c_{αβγ} .    (4.3.42)

A notation for this operation suggested by de Silva and Lim [993, 2008] is

A = (X, Y, Z) · C,    (4.3.43)

where the mode of each multiplication is understood from the ordering of the matrices. It
is convenient to use a separate notation for multiplication by transposed matrices. For C =
(X T , Y T , Z T ) · A we also write
C = A · (X, Y, Z).

For a matrix A ∈ Rm×n there are three ways to define the rank r, all of which yield the same
value. The rank is equal to the dimension of the subspace of Rm spanned by its columns and the
dimension of the subspace of Rn spanned by its rows. Also, the rank is the minimum number of
terms in the expansion of A as a sum of rank-one matrices; cf. the SVD expansion. For a tensor
of mode d > 2, these three definitions yield different results.
The column rank and row rank of a matrix are generalized as follows. For a 3-mode tensor
A ∈ Rn1 ×n2 ×n3 , let r1 be the dimension of the subspace of Rn1 spanned by the n2 n3 vectors
with entries a:,i2 ,i3 , i2 = 1 : n2 , i3 = 1 : n3 . In other words, r1 (A) = rank(A(1) ), with similar
interpretations for r2 and r3 . The triple (r1 , r2 , r3 ) is called the multirank of A, and r1 , r2 , r3
can all be different.

The outer product of vectors x ∈ Rℓ , y ∈ Rm , and z ∈ Rn is the 3-mode hypermatrix

T = x ◦ y ◦ z ∈ Rl×m×n , ti1 i2 i3 = xi1 yi2 zi3 . (4.3.44)

If nonzero, we call this a rank-one tensor. The tensor rank of A is the smallest number r such
that A may be written as a sum of rank-one hypermatrices:
A = \sum_{p=1}^{r} x_p ◦ y_p ◦ z_p .    (4.3.45)

When d = 2 this definition agrees with the usual definition of the rank of a matrix. Generalization
of this definition of rank to higher order tensors is straightforward. However, for d ≥ 3 there is no
algorithm for determining the rank of a given tensor, and this problem is NP-hard. Furthermore,
de Silva and Lim [993, 2008] show that the problem of finding the best rank-p approximation in
general has no solution, even for d = 3.
Tensor decompositions originated with Hitchcock [629, 1927] and much later were taken
up and used to analyze data in psychometrics (Tucker [1070, 1966]). In the last decades the use
of tensor methods has spread to other fields, such as chemometrics (Bro [180, 1997]), signal and
image processing, data mining, and pattern recognition (Eldén [376, 2019]). Tensor decomposi-
tions are used in machine learning and parameter estimation.
Low-rank approximations of a given two-dimensional array of data can be found from the
SVD of a matrix. In many applications one would like to approximate a given tensor A with a
sum of rank-one tensors to minimize
∥ A − \sum_{i=1}^{p} λ_i x_i ◦ y_i ◦ z_i ∥_F .    (4.3.46)

Weights λi are introduced to let us assume that vectors xi , yi , and zi are normalized to have
length one. Hillar and Lim [628, 2013] have shown that this problem (and indeed, most other
tensor problems) are NP-hard. Therefore, we assume that the number p < r of factors is fixed. A
popular algorithm for computing such an approximate decomposition is alternating least squares
(ALS). First, the vectors yi and zi are fixed, and xi is determined to minimize (4.3.46). Next,
xi , zi are fixed, and we solve for yi . Finally, xi , yi are fixed, and we solve for zi . Define the
matrices

X = (x1 , . . . , xp ) ∈ Rn1 ×p , Y = (y1 , . . . , yp ) ∈ Rn2 ×p , Z = (z1 , . . . , zp ) ∈ Rn3 ×p .

With yi , zi fixed, the minimizing problem can be written in matrix form as


\min_{X̂} ∥A_{(1)} − X̂(Z ⊙ Y)^T∥_F ,

where A_{(1)} ∈ R^{n_1×n_2 n_3} is the matrix obtained by unfolding A along the first mode, and

Z ⊙ Y = (z1 ⊗ y1 , . . . , zp ⊗ yp ) ∈ Rn2 n3 ×p

is the matching columnwise Kronecker product, also called the Khatri–Rao product, of Z and Y .
The solution can be written
X̂ = A(1) [(Z ⊙ Y )T ]† ,
and then the columns of X̂ are normalized to give X̂ = Xdiag (λi ). Because of the special form
of the Khatri–Rao product, the solution can also be written as

X̂ = A_{(1)} (Z ⊙ Y)(Z^T Z .∗ Y^T Y)^† ,

where .∗ is the Hadamard (elementwise) matrix product. This version is not always suitable
because of the squared condition number. Similar formulas for the two other modes are easily
derived. At each inner iteration a pseudoinverse must be computed. ALS can take many iterations
and is not guaranteed to converge to a global minimum. Also, the solution obtained depends on
the starting point.
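As an illustration, one ALS update of the first factor matrix might look as follows in MATLAB. Here A1 denotes the mode-1 unfolding A_(1) (its column ordering must match the Khatri-Rao product), Y and Z are the current factor matrices with p columns, and all variable names are assumptions for this sketch.

KR = zeros(size(Z,1)*size(Y,1),size(Y,2));
for j = 1:size(Y,2)
KR(:,j) = kron(Z(:,j),Y(:,j));     % Khatri-Rao product of Z and Y
end
Xhat = A1*pinv(KR');               % Xhat = A_(1)*[(Z kr Y)^T]^+
lambda = sqrt(sum(Xhat.^2,1));     % weights lambda_i
X = Xhat./lambda;                  % normalized factor matrix X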
The idea of expressing a tensor as a sum of rank-one tensors has been proposed under differ-
ent names by several authors. In psychometrics it was called CANDECOMP (canonical decom-
position) and PARAFAC (parallel factors); see Kolda and Bader [705, 2009]. Here, following
Leurgans, Ross, and Abel [735, 1993], we call it the CP decomposition. In matrix computations,
the SVD expansion
A = UΣV^T = \sum_{i=1}^{r} σ_i u_i v_i^T ∈ R^{m×n} ,  r ≤ min{m, n},    (4.3.47)

expresses a matrix A of rank r as the weighted sum of rank-one matrices ui viT , where ui ∈ Rm
and vi ∈ Rn , i = 1 : r, are mutually orthogonal. This expansion has the desirable property
that for any unitarily invariant norm, the best approximation of A by a matrix of rank r < n is
obtained by truncating the expansion; see the Eckart–Young–Mirsky Theorem 1.3.8.
The high-order SVD (HOSVD) is a generalization of the SVD to 3-mode hypermatrices
A = (U, V, W ) · C,
where U, V , and W are square and orthogonal, and C has the same size as A. Further, the
different matrix slices of C are mutually orthogonal (with respect to the standard inner product
on matrix spaces) and with decreasing Frobenius norm. Because of the imposed orthogonality
conditions, the HOSVD of A is essentially unique. It is rank-revealing in the sense that if A has
multirank (r1 , r2 , r3 ), then the last n1 − r1 , n2 − r2 , and n3 − r3 slices along the different modes
of the core tensor C are zero matrices. Algorithms for computing the HOSVD are described by
Lathauwer, De Moor, and Vandewalle [296, 2000]. The matrix U is obtained from the SVD of
the l × mn matrix obtained from unfolding A. V and W are obtained similarly. Since U , V , and
W are orthogonal, C = (cijk ) is easily computed from C = (U T , V T , W T ) · A.
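A minimal MATLAB sketch of this construction for a tensor stored as an n1-by-n2-by-n3 array A (using SVDs of the three unfoldings and assuming each unfolding has at least as many columns as rows, so that U, V, W come out square) is:

[n1,n2,n3] = size(A);
[U,~,~] = svd(reshape(A,n1,[]),'econ');                    % mode-1 unfolding
[V,~,~] = svd(reshape(permute(A,[2 1 3]),n2,[]),'econ');   % mode-2 unfolding
[W,~,~] = svd(reshape(permute(A,[3 1 2]),n3,[]),'econ');   % mode-3 unfolding
C = reshape(U'*reshape(A,n1,[]),n1,n2,n3);                 % multiply mode 1 by U'
C = ipermute(reshape(V'*reshape(permute(C,[2 1 3]),n2,[]),n2,n1,n3),[2 1 3]);
C = ipermute(reshape(W'*reshape(permute(C,[3 1 2]),n3,[]),n3,n1,n2),[3 1 2]);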
Suppose we want to approximate tensor A by another tensor B of lower multirank. Then we
want to solve
\min_{rank(B)=(p,q,r)} ∥A − B∥_F ,    (4.3.48)

where the Frobenius tensor norm is defined as in (4.3.40). This is the basis of the Tucker
model [1070, 1966]. Unlike the matrix case, this problem cannot be solved by truncating the
HOSVD of A. It is no restriction to assume that B = (U, V, W ) · C, where U ∈ Rn1 ×p ,
V ∈ Rn2 ×q , and W ∈ Rn3 ×p are orthogonal matrices. Because of the orthogonal invariance of
the Frobenius norm, U , V , and W are only determined up to a rotation. With the core tensor C
eliminated, problem (4.3.48) can be rewritten as a maximization problem with objective function
Φ(U, V, W) = \frac{1}{2} ∥(U^T , V^T , W^T) · A∥_F^2
subject to U T U = I, V T V = I, and W T W = I (compare with the corresponding matrix prob-
lem for d = 2). This can be formulated and solved as an optimization problem on a Grassmann
manifold; see Eldén and Savas [380, 2009] and Savas and Lim [968, 2010].

Notes and references


An extensive survey of tensor methods is given by Kolda and Bader [705, 2009]. The theory
of tensors and hypermatrices is surveyed by Lim [747, 2013]. An introduction to the theory

and computation of tensors is given by Wei and Ding [1114, 2016]. Tensor rank problems
are studied by de Silva and Lim [993, 2008] and Comon et al. [264, 2009]. A tutorial on
CP decomposition and its applications is given by Bro [180, 1997]. The N-way Toolbox for
MATLAB (Andersson and Bro [28, 2000]) for analysis of multiway data can be downloaded
from http://www.models.kvl.dk/source/. MATLAB tools for tensor computations have
also been developed by Bader and Kolda [52, 2006], [53, 2007]. A MATLAB Tensor toolbox
supported by Sandia National Labs and MathSci.ai is available on the web. Hankel tensors arise
from signal processing and data fitting; see Papy, De Lauthauwer, and Van Huffel [878, 2005].
Tensors with Cauchy structure are also of interest; see Chen, Li, and Qi [240, 2016].

4.4 Total Least Squares


4.4.1 Errors-in-Variables Models
In the standard linear model it is assumed that the observed vector b ∈ Rm is related to the
unknown parameter vector x by the linear equation Ax = b + e, where A ∈ Rm×n is known and
e is a vector of random errors. If the components of e are uncorrelated and have zero means and
the same variance, then by the Gauss–Markov theorem (Theorem 1.1.4) the best linear unbiased
estimate of x is given by the solution of the least squares problem

\min_r ∥r∥_2  subject to  Ax = b + r.    (4.4.1)

The assumption that all errors are confined to b is frequently unrealistic. Sampling or modeling
errors will often affect both A and b. In the errors-in-variables model it is assumed that

(A + E)x = b + f, (4.4.2)

where the rows of the error matrix ( E f ) are independently and identically distributed with
zero mean and the same variance. This model has independently been developed in statistics,
where it is known as “latent root regression.” The optimal estimates of the parameters x in this
model satisfy the total least squares7 (TLS) problem

\min_{E, f} ∥ ( E   f ) ∥_F  subject to  (A + E)x = b + f,    (4.4.3)

where ∥ · ∥F denotes the Frobenius matrix norm. The TLS problem is equivalent to finding
the “nearest” consistent linear system, where the distance is measured in the Frobenius norm of
( E f ). When a minimizing perturbation has been found, any x satisfying (4.4.2) is said to
solve the TLS problem.
A complete and rigorous treatment of both theoretical and computational aspects of the TLS
problem is developed in the monograph by Van Huffel and Vandewalle [1077, 1991]. They find
that in typical applications, gains of 10–15% in accuracy can be obtained by using TLS instead
of standard least squares methods.
The TLS solution depends on the relative scaling of the data A and b. Paige and Strakoš [867,
2002] study the scaled TLS (STLS) problem

\min_{E, f} ∥ ( E   γf ) ∥_F  subject to  (A + E)x = b + f,    (4.4.4)

where γ is a given positive scaling parameter. For small values of γ, perturbations in b will
be favored. In the limit when γ → 0 in (4.4.4), the solution equals the ordinary least squares
7 The term “total least squares problem” was coined by Golub and Van Loan [511, 1980].

solution. On the other hand, large values of γ favor perturbations in A. In the limit when
1/γ → 0, we obtain the data least squares (DLS) problem

\min_E ∥E∥_F  subject to  (A + E)x = b,    (4.4.5)

where perturbations are restricted to the data matrix A.

4.4.2 Total Least Squares and SVD


Writing the constraint (A + E)x = b + f as
 
( A + E   b + f ) \begin{pmatrix} x \\ −1 \end{pmatrix} = 0    (4.4.6)

shows that the matrix ( A + E   b + f ) is rank-deficient and that ( x^T   −1 )^T is a right singular
vector corresponding to a zero singular value. The TLS problem can be analyzed in terms of the
SVD
C = ( A   b ) = UΣV^T = \sum_{i=1}^{n+1} σ_i u_i v_i^T .    (4.4.7)

Assume that σn+1 > 0. Then, by the Eckart–Young–Mirsky theorem (Theorem 1.3.8) the unique
perturbation ( E f ) of minimum Frobenius norm that makes (A + E)x = b + f consistent is
the rank-one perturbation
( E   f ) = −σ_{n+1} u_{n+1} v_{n+1}^T ,    (4.4.8)
and minE, f ∥ ( E f ) ∥F = σn+1 . Multiplying (4.4.8) from the right with vn+1 and using
(4.4.7) gives
(E f ) vn+1 = −σn+1 un+1 = − ( A b ) vn+1 , (4.4.9)
i.e., ( A + E b + f ) vn+1 = 0. If vn+1,n+1 ̸= 0, the problem is called generic, and the TLS
solution is obtained by scaling vn+1 so that its last component is −1:
 
\begin{pmatrix} x \\ −1 \end{pmatrix} = −\frac{1}{γ} v_{n+1} ,    γ = e_{n+1}^T v_{n+1} .    (4.4.10)

Otherwise, the TLS problem has no solution and is called nongeneric.
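For the generic case, the computation is summarized by the following MATLAB sketch (assuming A ∈ R^{m×n} with m > n and v_{n+1,n+1} ≠ 0):

[m,n] = size(A);
[~,~,V] = svd([A b],0);        % only the right singular vectors are needed
v = V(:,n+1);                  % singular vector for sigma_{n+1}
x_tls = -v(1:n)/v(n+1);        % scale so the last component equals -1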


Let σ̂i , i = 1, . . . , n, be the singular values of A. By Theorem 1.3.5 these interlace the
singular values of ( A b ):

σ1 ≥ σ̂1 ≥ · · · ≥ σn ≥ σ̂n ≥ σn+1 .

The condition σ̂_n > σ_{n+1} ensures that A^T A − σ_{n+1}^2 I is symmetric positive definite and that the
TLS problem has a unique solution.
If A is rank-deficient, then so is ( A b ), and σ̂n = σn+1 = 0. Assume now that σp >
σp+1 = · · · = σn+1 for some p < n. Let V2 = (vp+1 , . . . , vn+1 ) be the corresponding right
singular vectors. Then the minimum is attained for any rank-one perturbation of the form

( E   f ) = −( A   b ) vv^T ,    v = V_2 z.

Assume that a unit vector z can be found such that


 
V_2 z = \begin{pmatrix} y \\ γ \end{pmatrix} ,    γ ̸= 0.    (4.4.11)

Then x = −γ −1 y is a TLS solution. In this case the TLS solution is not unique. A unique TLS
solution of least-norm can be found as follows. Since V2 z has unit length, minimizing ∥x∥2 is
equivalent to choosing the unit vector z ∈ Rn−p+1 to maximize γ in (4.4.11). Set z = Qe1 ,
where Q is a Householder reflector such that
 
V_2 Q = \begin{pmatrix} y & V̂_2 \\ γ & 0 \end{pmatrix} .

Then the least-norm TLS solution is x = −γ −1 y. If eTn+1 V2 = 0, then the TLS problem is
nongeneric. This case can only occur when σ̂n = σn+1 . By an argument similar to that for
p = n, it then holds that b ⊥ uj , j = p : n. Nongeneric TLS problems can be treated by adding
constraints on the solution; see Van Huffel and Vandewalle [1077, 1991].
From the relationship between the SVD of Ã = ( A   b ) and the eigendecomposition of the
symmetric matrix Ã^T Ã (see Section 1.2.2) it follows that the TLS solution x can be characterized by

\begin{pmatrix} A^T A & A^T b \\ b^T A & b^T b \end{pmatrix} v = σ_{n+1}^2 v ,    v = \begin{pmatrix} x \\ −1 \end{pmatrix} ,    (4.4.12)

where σ_{n+1}^2 is the smallest eigenvalue of the matrix Ã^T Ã and v is a corresponding eigenvector.
From (4.4.12) it follows that

(A^T A − σ_{n+1}^2 I_n) x = A^T b ,    b^T(b − Ax) = σ_{n+1}^2 .    (4.4.13)

In the first equation of (4.4.13) a positive multiple of the unit matrix is subtracted from the matrix
of normal equations ATAx = AT b. This shows that TLS can be considered as a procedure for
deregularizing the LS problem. (Compare with Tikhonov regularization, where a multiple of the
unit matrix is added to improve the conditioning; see Section 3.5.3.) From a statistical point
of view, TLS can be interpreted as removing bias by subtracting the error covariance matrix
estimated by σ_{n+1}^2 I from the data covariance matrix A^T A.
Because of the nonlinear dependence of xTLS on the data A, b, a strict analysis of the sen-
sitivity and conditioning of the TLS problem is more complicated than for the least squares
problem. Golub and Van Loan [511, 1980] show that an approximate condition number for the
TLS problem is
κ_{TLS}(A, b) = κ(A) \frac{σ̂_n}{σ̂_n − σ_{n+1}} .    (4.4.14)
This shows that the condition number for the TLS problem will be much larger than κ(A) when
the relative distance 1 − σn+1 /σ̂n between σn+1 and σ̂n is small. Subtracting the normal equa-
tions from (4.4.13), we obtain
x_{TLS} − x_{LS} = σ_{n+1}^2 (A^T A − σ_{n+1}^2 I)^{-1} x_{LS} ,    (4.4.15)

where xTLS and xLS denote the TLS and LS solutions. Taking norms in (4.4.15), we obtain the
upper bound
∥x_{TLS} − x_{LS}∥_2 ≤ \frac{σ_{n+1}^2}{σ̂_n^2 − σ_{n+1}^2} ∥x_{LS}∥_2 ≤ \frac{σ_{n+1}}{2(σ̂_n − σ_{n+1})} ∥x_{LS}∥_2 ,    (4.4.16)
where the last inequality follows from

σ̂_n^2 − σ_{n+1}^2 = (σ̂_n + σ_{n+1})(σ̂_n − σ_{n+1}) ≥ 2σ_{n+1}(σ̂_n − σ_{n+1}).

From this we deduce that when the errors in A and b are small the difference between the LS and
TLS solutions is small. Otherwise, the solutions can differ considerably.

In many parameter estimation problems, some of the columns of A are known exactly. It is
no loss of generality to assume that the n1 error-free columns are the first in A = ( A1 A2 ) ∈
Rm×n , n = n1 + n2 . The mixed LS–TLS model is
 
( A_1   A_2 + E_2 ) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = b + f ,    A_1 ∈ R^{m×n_1} ,

where the rows of the errors ( E2 f ) are independently and identically distributed with zero
mean and the same variance. The problem can then be expressed as
 
\min_{E_2, f} ∥ ( E_2   f ) ∥_F ,    ( A_1   A_2 + E_2 ) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = b + f.    (4.4.17)
When A2 is empty, this reduces to solving an ordinary least squares problem with multiple
right-hand sides. When A1 is empty, this is the standard TLS problem. When the columns
of A1 are linearly independent, the mixed LS–TLS can be solved as a two-block problem; see
Section 4.3.1. First, compute the QR factorization
 
Q^T ( A_1   A_2   b ) = \begin{pmatrix} R_{11} & R_{12} & c_1 \\ 0 & R_{22} & c_2 \end{pmatrix} ,

where R11 ∈ Rn1 ×n1 is upper triangular and R22 ∈ R(m−n1 )×n2 . Next, compute x2 as the
solution to the TLS problem

\min_{E, g} ∥ ( E   g ) ∥_F ,    (R_{22} + E)x_2 = c_2 + g.    (4.4.18)

Finally, x1 is obtained from the triangular system R11 x1 = c1 − R12 x2 .
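A minimal MATLAB sketch of this two-step procedure (assuming A1 has full column rank, m ≥ n1 + n2 + 1, and a generic reduced TLS problem) is:

n1 = size(A1,2); n2 = size(A2,2);
[~,T] = qr([A1 A2 b],0);                  % economy QR of (A1 A2 b)
R11 = T(1:n1,1:n1); R12 = T(1:n1,n1+1:n1+n2); c1 = T(1:n1,end);
R22 = T(n1+1:end,n1+1:n1+n2); c2 = T(n1+1:end,end);
[~,~,V] = svd([R22 c2]);                  % TLS step for the reduced problem
v = V(:,end);
x2 = -v(1:n2)/v(end);                     % TLS solution x2
x1 = R11\(c1 - R12*x2);                   % back substitution for x1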

4.4.3 Multidimensional TLS


In this section we consider some generalizations of the TLS problem. The errors-in-variable
model has been used in statistics for a long time. Multivariate problems were treated much later
in the statistical literature. In the multidimensional TLS problem with multiple right-hand sides
B = (b1 , . . . , bd ) ∈ Rm×d , d > 1, the rows of ( A B ) are assumed to be independently and
identically distributed with zero mean and the same variance:

\min_{E, F} ∥ ( E   F ) ∥_F ,    (A + E)X = B + F.    (4.4.19)

Hence we consider perturbations ( E F ) such that


 
( A + E   B + F ) \begin{pmatrix} X \\ −I_d \end{pmatrix} = 0.
Note that A is similarly perturbed for all right-hand sides. Hence, the multidimensional TLS
problem is different from separately solving d one-dimensional TLS problems with right-hand
sides b1 , . . . , bd . This gives improved predictive power of the multidimensional TLS solution.
The solution to the multidimensional TLS problem can be expressed in terms of the SVD

C = (A B ) = U ΣV T = U1 Σ1 V1T + U2 Σ2 V2T , (4.4.20)

where Σ1 = diag (σ1 , . . . , σn ), Σ2 = diag (σn+1 , . . . , σn+d ). By the Eckart–Young–Mirsky


theorem, ∥ ( E F ) ∥2F is minimized for the perturbation

(E F ) = −U2 Σ2 V2T = −CV2 V2T , (4.4.21)



and the minimum equals \sum_{i=n+1}^{n+d} σ_i^2 . If σ_n > σ_{n+1} , this is the unique minimizer. If V_2 is partitioned as

V_2 = \begin{pmatrix} V_{12} \\ V_{22} \end{pmatrix}

and V_{22} ∈ R^{d×d} is nonsingular, then the TLS solution is

X = −V_{12} V_{22}^{-1} = (A^T A − σ_{n+1}^2 I)^{-1} A^T B ∈ R^{n×d} .    (4.4.22)

The last formula generalizes (4.4.13). For d = 1, we recover the previous expression (4.4.10) for
the TLS solution.
We now show that if σ̂n > σn+1 , then V22 is nonsingular. From (4.4.20) it follows that

AV12 + BV22 = U2 Σ2 .

If V22 is singular, then V22 z = 0 for some unit vector z, and hence U2 Σ2 z = AV12 z. From

V_2^T V_2 = V_{12}^T V_{12} + V_{22}^T V_{22} = I

it follows that V_{12}^T V_{12} z = z and ∥V_{12} z∥_2 = 1. But then

σ_{n+1} ≥ ∥U_2 Σ_2 z∥_2 = ∥A V_{12} z∥_2 ≥ σ̂_n .

This is a contradiction, and hence V22 is nonsingular.


A unique solution to the multidimensional TLS problem exists if σ̂n > σn+1 . If this condi-
tion is not satisfied, then the TLS problem can still have a solution, but it is no longer unique. As
for the case d = 1, we then seek a solution of minimum norm ∥X∥F . Wei [1112, 1992] shows
that a sufficient condition for a least-norm multidimensional TLS solution to exist is that

σ̂p > σp+1 = · · · = σn = · · · = σn+d .

When this condition is satisfied, the following extension of the classical SVD algorithm computes
the least-norm solution; see Van Huffel and Vandewalle [1077, 1991, Section 3.6.1]. For d = 1
and p = n the algorithm coincides with the classical SVD algorithm described earlier.

Algorithm 4.4.1.
Given a data matrix A ∈ Rm×n and an observation matrix B ∈ Rm×d , do the following:
1. Compute the SVD of the extended matrix C = ( A B ) ∈ Rm×(n+d) :
C = UΣV^T = \sum_{i=1}^{n+d} σ_i u_i v_i^T .    (4.4.23)

2. Suppose p ≤ min{n, rank(C)}, and partition V so that

V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} ∈ R^{(n+d)×(n+d)} ,    (4.4.24)

where V11 ∈ Rn×p and V22 ∈ Rd×q , q = n − p + d.


3. If V22 = 0, then the problem is nongeneric. Otherwise, if V22 has full row rank, compute
the least-norm TLS solution as
X = −V_{12} V_{22}^† = (V_{11}^†)^T V_{21}^T .    (4.4.25)

From the CS decomposition it follows that if V22 has full column rank, then so has V11 .
For a proof of the equivalence of the formulas in (4.4.25), see Van Huffel and Vandewalle [1077,
1991, Theorem 3.10]. The second expression for X only requires computation of the p largest
right singular vectors of ( A   b ) and is advantageous when p is small.
Algorithm 4.4.1 for solving the multidimensional TLS problem only requires a small part
of the full SVD, namely the d ≪ n right singular vectors corresponding to the smallest singu-
lar values. For this purpose Van Huffel, Vandewalle, and Haegemans [1078, 1987] developed
a modified partial QRSVD (PSVD) algorithm. In the Householder bidiagonalization phase the
singular vectors U and V are initialized by the accumulated products of the Householder reflec-
tions. During the QRSVD iterations, plane rotations are applied to U and V to generate the left
and right singular values. A great amount of work is saved in the PSVD algorithm by delaying
the initializing of U and V until the end of the diagonalizing phase. The Householder reflections
are then applied only to the (small number of) desired singular vectors of the bidiagonal matrix.
A second modification in PSVD is to perform only a partial diagonalization of the bidiagonal
matrix. The iterative process is stopped as soon as convergence has occurred to all desired singu-
lar values. Assume that at the ith step of the diagonalization phase, we have the block bidiagonal
form

B^{(i)} = \begin{pmatrix} B_1^{(i)} & & & \\ & B_2^{(i)} & & \\ & & \ddots & \\ & & & B_s^{(i)} \end{pmatrix} ,

where B_j^{(i)} , j = 1, . . . , s, are unreduced upper bidiagonal matrices. Suppose that a basis for a
singular subspace corresponding to the singular values σi ≤ θ is desired. Then spectrum slicing
(see Section 7.2.1) can be used to partition these blocks into three classes:

C1 = {Bj | all singular values > θ},

C2 = {Bj | all singular values ≤ θ},

C3 = {Bj | at least one singular value > θ and at least one ≤ θ}.

If C3 is empty, then the algorithm stops. Otherwise, one QR iteration is applied to each block in
C3 , and the partition is reclassified. If no bound on the singular values can be given but, instead,
the dimension p of the desired subspace is known, then a bound θ can be computed with the
bisection method from Section 7.2.1. A complete description of the PSVD algorithm is given in
Van Huffel and Vandewalle [1077, 1991, Section 4.3]. A Fortran 77 implementation of PSVD is
available from Netlib.

4.4.4 Solving Large-Scale TLS Problems


For large or structured TLS problems, computing the PSVD may be prohibitive because the
sparsity or structure of the matrix is lost in the reduction to bidiagonal form. Then an iterative
algorithm may be advantageous. Similarly, this applies when a sequence of slowly varying TLS
problems has to be solved. In this case the solution of the previous problem can be used as an
initial approximation.
Let σ_{n+1}^2 be the smallest eigenvalue of Ã^T Ã , where Ã = ( A   b ). If ( x^T   −1 )^T is a
corresponding eigenvector, then

Ã^T Ã \begin{pmatrix} x \\ −1 \end{pmatrix} = σ_{n+1}^2 \begin{pmatrix} x \\ −1 \end{pmatrix} .    (4.4.26)

This characterization of the TLS solution suggests the following alternative formulation of the
TLS problem. The unique minimum σ_{n+1}^2 of the Rayleigh quotient ρ(x) = v^T (Ã^T Ã) v / ∥v∥_2^2 , where

Ã = ( A   b ) ,    v = \begin{pmatrix} x \\ −1 \end{pmatrix} ,

is attained for x = x_{TLS} . Hence the TLS problem is equivalent to \min_x ρ(x), where

ρ(x) = \frac{∥r∥_2^2}{∥x∥_2^2 + 1} ,    r = −Ã \begin{pmatrix} x \\ −1 \end{pmatrix} = b − Ax.    (4.4.27)
In the algorithm of Björck, Heggernes, and Matstoms [147, 2000] the TLS solution x is
computed by applying inverse iteration to the symmetric eigenvalue problem (4.4.26). Let x(k)
be the current approximation. Then x(k+1) and the scalars βk , k = 0, 1, 2, . . . , are computed by
solving

Ã^T Ã \begin{pmatrix} x^{(k+1)} \\ −1 \end{pmatrix} = β_k \begin{pmatrix} x^{(k)} \\ −1 \end{pmatrix} .    (4.4.28)
If the compact QR factorization
 
Ã = ( A   b ) = Q \begin{pmatrix} R & c \\ 0 & η \end{pmatrix} ,    Q ∈ R^{m×(n+1)} ,    (4.4.29)

is known, then the solution of (4.4.28) is obtained by solving the two triangular systems

\begin{pmatrix} R^T & 0 \\ c^T & η \end{pmatrix} \begin{pmatrix} z^{(k)} \\ −γ_k \end{pmatrix} = \begin{pmatrix} x^{(k)} \\ −1 \end{pmatrix} ,    \begin{pmatrix} R & c \\ 0 & η \end{pmatrix} \begin{pmatrix} x^{(k+1)} \\ −1 \end{pmatrix} = β_k \begin{pmatrix} z^{(k)} \\ −γ_k \end{pmatrix} .

After eliminating γ_k , this becomes

x^{(k+1)} = x_{LS} + β_k R^{-1}(R^{-T} x^{(k)}) ,    β_k = η^2/(1 + x_{LS}^T x^{(k)}) .    (4.4.30)

The iterations are initialized by taking

x^{(0)} = x_{LS} = R^{-1} c ,    β_0 = η^2/(1 + ∥x_{LS}∥_2^2) .    (4.4.31)
From classical convergence results for symmetric inverse iteration it follows that
    \|x^{(k)} - x_{TLS}\|_2 = O((\sigma_{n+1}/\sigma_n)^{2k}), \qquad |\rho(x^{(k)}) - \sigma_{n+1}^2| = O((\sigma_{n+1}/\sigma_n)^{4k}).
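The following is a minimal dense sketch of the iteration (4.4.28)-(4.4.31), assuming NumPy and an economy-size QR factorization of (A b); the iteration limit and stopping test are arbitrary illustrative choices, not part of the published algorithm.

    import numpy as np

    def tls_inverse_iteration(A, b, maxit=50, tol=1e-12):
        """Sketch of inverse iteration (4.4.28)-(4.4.31) for the TLS solution.
        Each step costs two triangular solves once R, c, eta are available."""
        m, n = A.shape
        _, Rfull = np.linalg.qr(np.column_stack([A, b]))   # compact QR of (A b), eq. (4.4.29)
        R, c, eta = Rfull[:n, :n], Rfull[:n, n], Rfull[n, n]
        x_ls = np.linalg.solve(R, c)                       # x^(0) = x_LS, eq. (4.4.31)
        x = x_ls.copy()
        for _ in range(maxit):
            beta = eta**2 / (1.0 + x_ls @ x)               # beta_k in (4.4.30)
            x_new = x_ls + beta * np.linalg.solve(R, np.linalg.solve(R.T, x))
            if np.linalg.norm(x_new - x) <= tol * np.linalg.norm(x_new):
                return x_new
            x = x_new
        return x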
By using Rayleigh-quotient iteration (RQI), a better rate of convergence can be obtained. For
properties of the Rayleigh quotient of symmetric matrices, see Section 7.3.1.
Fasino and Fazzi [398, 2018] note that by the characterization (4.4.27) the TLS problem is
equivalent to the nonlinear least squares problem minx ∥f (x)∥2 , where
f (x) = µ(x)(b − Ax), µ(x) = (1 + xT x)−1/2 , (4.4.32)
which can be solved by a Gauss–Newton method (see Section 8.1.2). If x_k is the current approximation,
this requires the solution of a sequence of linear least squares problems

    \min_{h_k} \|f(x_k) + J(x_k) h_k\|_2, \qquad x_{k+1} = x_k + h_k,

where J(x_k) = \mu(x_k)\bigl(A + \mu(x_k)^2 r_k x_k^T\bigr), with r_k = b - Ax_k, is the Jacobian of f(x) at x_k.
Since A + µ(xk )2 rk xTk is a rank-one modification of A, its QR factorization can be cheaply
computed by modifying the QR factorization of A; see Section 3.3.2. In fact, this method is
closely related to inverse iteration and has a rate of convergence similar to that shown by Peters
and Wilkinson [891, 1979].
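Below is a minimal dense sketch of such a Gauss-Newton iteration for f(x) = mu(x)(b - Ax). For simplicity it solves each subproblem with a fresh least squares solve instead of updating a QR factorization of A, and it uses the analytic Jacobian of f, which equals the matrix quoted above up to sign. A natural starting point is x0 = x_LS.

    import numpy as np

    def tls_gauss_newton(A, b, x0, maxit=20, tol=1e-12):
        """Sketch of Gauss-Newton applied to min_x ||mu(x)(b - A x)||_2,
        which by (4.4.27) is equivalent to the TLS problem."""
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(maxit):
            r = b - A @ x
            mu = 1.0 / np.sqrt(1.0 + x @ x)
            f = mu * r
            J = -mu * (A + mu**2 * np.outer(r, x))         # Jacobian of f at x
            h, *_ = np.linalg.lstsq(J, -f, rcond=None)     # step: min_h ||f + J h||_2
            x = x + h
            if np.linalg.norm(h) <= tol * (1.0 + np.linalg.norm(x)):
                break
        return x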

4.4.5 Regularized TLS


When A is nearly rank-deficient, regularization is required to stabilize the TLS solution. Similar
to TSVD, truncated TLS (TTLS) reduces an ill-conditioned problem to an exactly rank-deficient
problem by treating small singular values of (A, b) as zeros. Fierro and Bunch [405, 1994] show
that under certain conditions, TTLS is superior to TSVD in suppressing noise in A and b; see
also [406, 1996].
Let k ≤ rank(A) be the number of singular values to be retained. Then the TTLS solution

    x_k = -V_{12} V_{22}^{\dagger} = (V_{11}^T)^{\dagger} V_{21}^T    (4.4.33)

can be computed by Algorithm 4.4.1. The second expression for x_k is better to use when k is
small. The norm of x_k is \|x_k\|_2 = \bigl(\|V_{22}\|_2^{-2} - 1\bigr)^{1/2}. It increases with k, while the norm of the
residual matrix

    \| (A \;\; b) - (\tilde{A} \;\; \tilde{b}) \|_F = \bigl(\sigma_{k+1}^2 + \cdots + \sigma_{n+1}^2\bigr)^{1/2}    (4.4.34)
decreases with k. The TTLS solution xk can be written as a filtered sum

    x_k = \sum_{i=1}^{n} f_i\, \frac{\hat{u}_i^T b}{\hat{\sigma}_i}\, \hat{v}_i,

where A = Û Σ̂V̂ T is the SVD of A. Fierro et al. [404, 1997] show that when ûTi b ̸= 0 and
i ≤ k, the filter factors fi are close to one, and for i > k they are small.
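A minimal dense sketch of truncated TLS based directly on formula (4.4.33) and the SVD of (A b) is given below; Algorithm 4.4.1 itself is not reproduced. It assumes m >= n + 1 and the generic case V22 != 0.

    import numpy as np

    def ttls(A, b, k):
        """Truncated TLS solution x_k of (4.4.33), retaining k singular values."""
        m, n = A.shape
        U, s, Vt = np.linalg.svd(np.column_stack([A, b]), full_matrices=False)
        V = Vt.T                     # (n+1) x (n+1) right singular vectors (m >= n+1 assumed)
        V12 = V[:n, k:]              # first n components of v_{k+1}, ..., v_{n+1}
        V22 = V[n:, k:]              # last components, a 1 x (n+1-k) row vector
        # x_k = -V12 V22^+, where V22^+ = V22^T / ||V22||_2^2 for a nonzero row vector
        return (-V12 @ V22.T / (V22 @ V22.T)).ravel()

For k = n this reduces to the ordinary TLS solution; decreasing k increases the residual norm (4.4.34) but gives a smaller, more stable solution.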
Another approach to regularization is to restrict the TLS solution by a quadratic constraint.
The RTLS problem is

    \min_{E,\,f} \| (E \;\; f) \|_F \quad \text{subject to} \quad (A + E)x = b + f, \quad \|Lx\|_2 \le \delta,    (4.4.35)

where δ > 0 is a regularization parameter, and the matrix L ∈ Rp×n defines a seminorm on the
solution space. In practice, the parameter δ is usually not exactly specified but has to be estimated
from the given data using the techniques discussed for the TLS problem in Section 3.6.4.
The optimal solution of the RTLS problem is different from xTLS only when the quadratic
constraint is active, i.e.,
∥LxTLS ∥2 > δ.
In this case the constraint in (4.4.35) holds with equality at the optimal solution, and the RTLS
solution can be characterized by the following first-order optimality conditions for (4.4.35); see
Golub, Hansen, and O’Leary [492, 1999].

Theorem 4.4.1. The solution to the regularized TLS problem (4.4.35) with the inequality con-
straint replaced by equality is a solution to the eigenvalue problem

    (A^T\!A - \lambda_I I + \lambda_L L^T L)\,x = A^T b,    (4.4.36)

where

    \lambda_I = \phi(x) = \frac{\|b - Ax\|_2^2}{1 + \|x\|_2^2}, \qquad \lambda_L = \frac{1}{\delta^2}\bigl(b^T(b - Ax) - \phi(x)\bigr).    (4.4.37)

Sima, Van Huffel, and Golub [994, 2004] give methods for solving the RTLS problem using
an iterative method that in each step solves a quadratic eigenvalue problem. In practice, very few
steps are required. Their method can be applied to large problems using existing fast methods
based on projection onto Krylov subspaces for solving quadratic eigenvalue problems.
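The sketch below is not the quadratic eigenvalue method just cited; it is only a naive fixed-point illustration of Theorem 4.4.1, in which the current iterate is used to evaluate lambda_I and lambda_L from (4.4.37) and the linear system (4.4.36) is then re-solved. Convergence is not guaranteed, and the starting guess x0 (e.g., a regularized LS solution) is left to the user.

    import numpy as np

    def rtls_fixed_point(A, b, L, delta, x0, maxit=100, tol=1e-10):
        """Naive fixed-point iteration on the RTLS optimality conditions
        (4.4.36)-(4.4.37); for illustration only."""
        x = np.asarray(x0, dtype=float).copy()
        AtA, Atb, LtL = A.T @ A, A.T @ b, L.T @ L
        n = A.shape[1]
        for _ in range(maxit):
            r = b - A @ x
            lam_I = (r @ r) / (1.0 + x @ x)           # phi(x) in (4.4.37)
            lam_L = (b @ r - lam_I) / delta**2        # lambda_L in (4.4.37)
            x_new = np.linalg.solve(AtA - lam_I * np.eye(n) + lam_L * LtL, Atb)
            if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x_new)):
                return x_new
            x = x_new
        return x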

The first-order conditions for RTLS given in Theorem 4.4.1 are the same as for the con-
strained minimization problem

    \min_{x} \frac{\|b - Ax\|_2^2}{1 + \|x\|_2^2} \quad \text{subject to} \quad \|Lx\|_2 \le \delta.    (4.4.38)

This Rayleigh quotient formulation of the RTLS problem is closely related to Tikhonov regular-
ization of TLS,

    \min_{E,\,f} \| (E \;\; f) \|_F^2 + \rho\|Lx\|_2^2 \quad \text{subject to} \quad (A + E)x = b + f,    (4.4.39)

and equivalent to the nonconvex minimization problem

    \min_{x} \left\{ \frac{\|b - Ax\|_2^2}{1 + \|x\|_2^2} + \rho\|Lx\|_2^2 \right\}.    (4.4.40)

Beck and Ben-Tal [96, 2006] show that this problem can be reduced to a sequence of trust-region
problems and give detailed algorithms. Algorithms for large-scale Tikhonov regularization of
TLS are developed by Lampe and Voss [712, 2013].
Guo and Renaut [554, 2002] show that the eigenvalue problem in Theorem 4.4.1 can be
formulated as

    (M + \lambda_L N)\begin{pmatrix} x \\ -1 \end{pmatrix} = \lambda_I \begin{pmatrix} x \\ -1 \end{pmatrix},    (4.4.41)

where \lambda_L is given as in (4.4.37), and

    M = \begin{pmatrix} A^T\!A & A^T b \\ b^T\!A & b^T b \end{pmatrix}, \qquad N = \begin{pmatrix} L^T L & 0 \\ 0 & -\delta^2 \end{pmatrix}.    (4.4.42)

Based on this formulation, they suggest an algorithm using shifted inverse iteration to solve the
eigenvalue problem (4.4.41). As an initial solution, the corresponding regularized LS solution
x^{(0)} = x_{RLS} is used. An additional complication is that the matrix M + \lambda_L N depends on the RTLS
solution. Their algorithm is further analyzed and refined in Renaut and Guo [924, 2005]. Lampe
and Voss [711, 2008] develop a related but faster method that uses a nonlinear Arnoldi process
(see Section 6.4.5) and a modified root-finding method.

Notes and references

M. Wei [1111, 1992] gives algebraic relations between the total least squares and least squares
problems with more than one solution. Several papers study the sensitivity and condition-
ing of the TLS problem and give bounds for the condition number; see Zhou et al. [1149,
2009], Baboulin and Gratton [51, 2011], Jia and Li [670, 2013], and Xie, Xiang, and Y. Wei
[1134, 2017]. A perturbation analysis of TTLS is given by Gratton, Titley-Peloquin, and
Ilunga [529, 2013].
De Moor [297, 1993] studies more general structured and weighted TLS problems and con-
siders applications in systems and control theory. These problems can be solved via a nonlinear
GSVD. A review of developments and extensions of the TLS method to weighted and struc-
tured approximation is given by Markovsky and Van Huffel [777, 2007]. A recent bibliography
on total least squares is given by Markovsky [776, 2010]. Standard TLS methods may not be

appropriate when A has a special structure such as Toeplitz or Vandermonde. The structured
total least norm (STLN) approach preserves structure and can also minimize errors in the ℓ1-norm and
other norms; see Rosen, Park, and Glick [936, 1996] and Van Huffel, Park, and Rosen [1076, 1996].

4.5 Least Squares Problems with Special Bases


4.5.1 Approximation by Orthogonal Polynomials
It is frequently desired to model a given function y = f (x) by a linear combination of basis
functions:
    \bar{f} = \sum_{k=0}^{n} c_k \phi_k(x).    (4.5.1)

It is often convenient to choose the n + 1 basis functions as a triangle family of polynomials,


where

    \phi_k(x) = \sum_{j=0}^{k} a_{kj} x^j

is a polynomial of degree k with nonzero leading coefficient akk . The coefficients of such a
family form a nonsingular lower triangular matrix A = (ai,j ), 0 ≤ j ≤ i ≤ n. Then the
monomials x^k, k = 0, \ldots, n, can be expressed recursively and uniquely as linear combinations
x^k = b_{k,0}\phi_0 + b_{k,1}\phi_1 + \cdots + b_{k,k}\phi_k, where the associated matrix is B = (b_{i,j}) = A^{-1}.

Definition 4.5.1. Let the real-valued functions f and g be defined on a finite grid G = \{x_i\}_{i=0}^{m}
of distinct points. Then the inner product (f, g) is defined by

    (f, g) = \sum_{i=0}^{m} f(x_i) g(x_i) w_i,    (4.5.2)

where \{w_i\}_{i=0}^{m} are given positive weights. The norm of f is \|f\| = (f, f)^{1/2}.

An important consideration is the choice of a proper basis for the space of approximating
functions. The functions ϕ0 , ϕ1 , . . . , ϕn are said to form an orthogonal system if (ϕi , ϕj ) = 0
for i ̸= j and ∥ϕi ∥ ̸= 0 for all i. If, in addition, ∥ϕi ∥ = 1 for all i, then the sequence is
called an orthonormal system. We remark that the notation used is such that the results can be
generalized with minor changes to cover the least squares approximation when f is approximated
by an infinite sequence of orthogonal functions ϕ0 , ϕ1 , ϕ2 , . . . .
We study the least squares approximation problem to determine coefficients c0 , c1 , . . . , cn in
(4.5.1) such that the weighted Euclidean norm

    \|f - \bar{f}\|^2 = \sum_{i=0}^{m} w_i\, |f(x_i) - \bar{f}(x_i)|^2

of the error is minimized. Note that interpolation is a special case (n = m) of this problem. By
a family of orthogonal polynomials we mean here a triangle family of polynomials, which is an
orthogonal system with respect to the inner product (4.5.2) for some given weights.

Theorem 4.5.2. Let ϕ0 , ϕ1 , . . . , ϕn be linearly independent functions. Then the least squares
approximation problem has the unique solution

    f^* = \sum_{j=0}^{n} c_j \phi_j,

which is characterized by the orthogonality property (f^* - f, \phi_j) = 0, j = 0, 1, \ldots, n. The co-
efficients c_j, called orthogonal coefficients or Fourier coefficients, satisfy the normal equations

    \sum_{j=0}^{n} (\phi_j, \phi_k) c_j = (f, \phi_k), \qquad k = 0, \ldots, n.    (4.5.3)

If the functions form an orthogonal system, the coefficients are given by

cj = (f, ϕj )/(ϕj , ϕj ), j = 0, . . . , n. (4.5.4)

Expansions of functions in terms of orthogonal polynomials are easy to manipulate, have


good convergence properties, and give a well-conditioned representation (with the exception of
weight distributions on certain grids). We now prove some results from the general theory of
orthogonal polynomials.

Theorem 4.5.3. Let \{x_i\}_{i=0}^{m} \in (a, b) be distinct points, and let \{w_i\}_{i=0}^{m} be a set of weights. Then
there is an associated triangle family of orthogonal polynomials ϕ0 , ϕ1 , . . . , ϕm . The family is
uniquely determined apart from the fact that the leading coefficients a0 , a1 , a2 , . . . can be given
arbitrary nonzero values. The orthogonal polynomials satisfy a three-term recursion formula,
ϕ−1 (x) = 0, ϕ0 (x) = a0 ,

ϕn+1 (x) = αn (x − βn )ϕn (x) − γn ϕn−1 (x), n ≥ 0, (4.5.5)

where αn = an+1 /an and

    \beta_n = \frac{(x\phi_n, \phi_n)}{\|\phi_n\|^2}, \qquad \gamma_n = \frac{\alpha_n\, \|\phi_n\|^2}{\alpha_{n-1}\, \|\phi_{n-1}\|^2} \quad (n > 0).    (4.5.6)

If the weight distribution is symmetric about x = β, then βn = β for all n.

Proof. Suppose that the ϕj have been constructed for 0 ≤ j ≤ n, ϕj ̸= 0 (n ≥ 0). We now seek
a polynomial of degree n + 1 with leading coefficient an+1 that is orthogonal to ϕ0 , ϕ1 , . . . , ϕn .
For a triangle family of polynomials {ϕj }nj=0 , we can write

    \phi_{n+1} = \alpha_n x \phi_n - \sum_{i=0}^{n} c_{n,i} \phi_i.    (4.5.7)

Hence ϕn+1 is orthogonal to ϕj , 0 ≤ j ≤ n, if and only if


    \alpha_n (x\phi_n, \phi_j) - \sum_{i=0}^{n} c_{n,i} (\phi_i, \phi_j) = 0, \qquad j = 0, 1, \ldots, n.

Since (ϕi , ϕj ) = 0, i ̸= j, it follows that

cn,j ∥ϕj ∥2 = αn (xϕn , ϕj ), 0 ≤ j ≤ n,



which determines the coefficients cn,j . From the definition of inner product it follows that
(xϕn , ϕj ) = (ϕn , xϕj ). But xϕj is a polynomial of degree j + 1 and is therefore orthogonal to
ϕn if j + 1 < n. So cnj = 0, j < n − 1, and thus

ϕn+1 = αn xϕn − cn,n ϕn − cn,n−1 ϕn−1 .

This has the same form as (4.5.5) if we set βn = cn,n /αn , γn = cn,n−1 . To get the expression
in (4.5.6) for γn , we take the inner product of equation (4.5.7) with ϕn+1 . From orthogonal-
ity it follows that (ϕn+1 , ϕn+1 ) = αn (ϕn+1 , xϕn ). Decreasing all indices by one, we obtain
(ϕn , xϕn−1 ) = ∥ϕn ∥2 /αn−1 , n ≥ 1. Substituting this in the expression for γn gives the desired
result.

The proof of the above theorem shows a way to construct βn , γn , and the values of the
polynomials ϕn at the grid points for n = 1, 2, 3, . . . . This is called the Stieltjes procedure.
For n = m, the constructed polynomial must be equal to am+1 (x − x0 )(x − x1 ) · · · (x − xm ),
because this polynomial is zero at all the grid points and thus orthogonal to all functions on the
grid. Since ∥ϕm+1 ∥ = 0, the construction stops at n = m. This is natural because there cannot
be more than m + 1 orthogonal (or even linearly independent) functions on a grid with m + 1
points.
Given the coefficients in an orthogonal expansion, the values of f can be efficiently computed
using the following algorithm. (The proof is left as a somewhat difficult exercise.)
Theorem 4.5.4 (Clenshaw’s Formula). Let p_n = \sum_{j=0}^{n} c_j \phi_j, where the \phi_j(x) are orthogonal
polynomials satisfying the recursion (4.5.5). Then p_n(x) = a_0 y_0, where y_{n+2} = y_{n+1} = 0 and

yk = αk (x − βk )yk+1 − γk+1 yk+2 + ck , k = n, n − 1, . . . , 0. (4.5.8)
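A short sketch of this recursion is given below, assuming the recurrence coefficients of (4.5.5) are supplied as arrays (alpha[k], beta[k], gamma[k] correspond to alpha_k, beta_k, gamma_k; gamma_0 is never used) and a0 is the constant value of phi_0.

    import numpy as np

    def clenshaw(x, c, alpha, beta, gamma, a0):
        """Evaluate p_n(x) = sum_{j=0}^n c[j]*phi_j(x) by Clenshaw's formula (4.5.8),
        where phi_{k+1} = alpha[k]*(x - beta[k])*phi_k - gamma[k]*phi_{k-1}, phi_0 = a0."""
        n = len(c) - 1
        g = np.append(np.asarray(gamma, dtype=float), 0.0)  # padded entry only multiplies a zero
        y2, y1 = 0.0, c[n]                                   # y_{n+1} = 0 gives y_n = c_n
        for k in range(n - 1, -1, -1):
            y = alpha[k] * (x - beta[k]) * y1 - g[k + 1] * y2 + c[k]
            y2, y1 = y1, y
        return a0 * y1                                       # p_n(x) = a_0 * y_0

For the Chebyshev polynomials of Section 4.5.2, for example, beta_k = 0, gamma_k = 1, a0 = 1, and alpha_0 = 1, alpha_k = 2 for k > 0.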

From Theorem 4.5.2 follows the important result that the coefficients in the best approximat-
ing polynomial pk of degree k are independent of k and given by cj = (f, ϕj )/(ϕj , ϕj ). Hence
approximations of increasing degree can be recursively generated as follows. Assume that ϕi ,
i = 1, . . . , k, and pk have been computed. In the next step the coefficients βk , γk are computed
from (4.5.6) and ϕk+1 by (4.5.5). The next approximation of f is then given by

pk+1 = pk + ck+1 ϕk+1 , ck+1 = (f, ϕk+1 )/∥ϕk+1 ∥2 . (4.5.9)

The coefficients {βk , γk } in the recursion formula (4.5.5) and the orthogonal functions ϕj
at the grid points are computed using the Stieltjes procedure together with the orthogonal co-
efficients {cj } for j = 1, 2, . . . , n. The total work required is about 4mn flops, assuming unit
weights and that the grid is symmetric. If there are differing weights, then about mn additional
operations are needed; similarly, mn additional operations are required if the grid is not sym-
metric. If the orthogonal coefficients are determined simultaneously for several functions on the
same grid, then only about mn additional operations per function are required. (In the above, we
assume that m ≫ 1, n ≫ 1.) Hence, the procedure is much more economical than the general
methods based on normal equations or QR factorization, which all require O(mn2 ) flops.
In practice the computed ϕk+1 will gradually lose orthogonality to the previously computed
ϕj . Since ϕTk+1 pk = 0, an alternative expression for the new coefficient is

ck+1 = (rk , ϕk+1 )/∥ϕk+1 ∥2 , rk = f − pk . (4.5.10)

This expression, which involves the residual rk , will give better accuracy in the computed coef-
ficients. Indeed, when using the classical formula, one sometimes finds that the residual norm

increases when the degree of the approximation is increased! Note that the difference between
the two variants discussed here is similar to the difference between CGS and MGS.
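A compact sketch of the Stieltjes procedure combined with the residual-based coefficient formula (4.5.10) follows. Monic polynomials are assumed (all alpha_k = 1, so (4.5.5) reads phi_{k+1} = (x - beta_k)phi_k - gamma_k phi_{k-1}), the degree must satisfy n <= m, and all quantities are carried only as values on the grid.

    import numpy as np

    def stieltjes_fit(x, f, w, n):
        """Weighted least squares fit of degree n on the grid x by orthogonal
        polynomials generated with the Stieltjes procedure (monic normalization).
        Returns the recurrence coefficients, the Fourier coefficients, and the
        values of the fitted polynomial on the grid."""
        x, f, w = (np.asarray(v, dtype=float) for v in (x, f, w))
        ip = lambda u, v: np.sum(w * u * v)                 # discrete inner product (4.5.2)
        phi_prev, phi = np.zeros_like(x), np.ones_like(x)   # phi_{-1} = 0, phi_0 = 1
        p = np.zeros_like(x)                                # current approximation p_k
        beta, gamma, c = [], [0.0], []
        norm2 = ip(phi, phi)
        for k in range(n + 1):
            c.append(ip(f - p, phi) / norm2)                # coefficient via the residual, (4.5.10)
            p = p + c[k] * phi
            if k == n:
                break
            beta.append(ip(x * phi, phi) / norm2)           # beta_k from (4.5.6)
            phi_next = (x - beta[k]) * phi - gamma[k] * phi_prev
            next_norm2 = ip(phi_next, phi_next)
            gamma.append(next_norm2 / norm2)                # gamma_{k+1} = ||phi_{k+1}||^2/||phi_k||^2
            phi_prev, phi, norm2 = phi, phi_next, next_norm2
        return np.array(beta), np.array(gamma[1:]), np.array(c), p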
The Stieltjes procedure may be sensitive to propagation of roundoff errors. An alternative
procedure for computing the recurrence coefficients in (4.5.5) and the values of the orthogonal
polynomials has been given by Gragg and Harrod [523, 1984]; see also Boley and Golub [169,
1987]. In this procedure these quantities are computed from an inverse eigenvalue problem for
a certain symmetric tridiagonal matrix. Reichel [916, 1991] compares this scheme with the
Stieltjes procedure and shows that the Gragg–Harrod procedure generally yields better accuracy.
Expansions in orthogonal polynomials also have the very important advantage of avoiding
the difficulties with ill-conditioned systems of equations that occur even for moderate n when the
coefficients in a polynomial \sum_{j=0}^{n} c_j x^j are sought and the function values are given on an equidistant grid.
For equidistant data, the Gram polynomials \{P_{n,m}\}_{n=0}^{m}, which are orthogonal with respect to
the inner product

    (f, g) = \sum_{i=0}^{m} f(x_i) g(x_i), \qquad x_i = -1 + 2i/m,

are relevant. These satisfy the recursion formula P_{-1,m}(x) = 0, P_{0,m} = (m+1)^{-1/2},

    P_{n+1,m}(x) = \alpha_{n,m}\, x\, P_{n,m}(x) - \gamma_{n,m} P_{n-1,m}(x), \qquad n \ge 0,    (4.5.11)

where the coefficients are given by (n < m)

    \alpha_{n,m} = \frac{m}{n+1} \left( \frac{4(n+1)^2 - 1}{(m+1)^2 - (n+1)^2} \right)^{1/2}, \qquad \gamma_{n,m} = \frac{\alpha_{n,m}}{\alpha_{n-1,m}}.    (4.5.12)

When n ≪ m1/2 , these polynomials are well behaved. However, when n ≫ m1/2 , they have
very large oscillations between the grid points, and a large maximum norm in [−1, 1]. Related
to this is the fact that when fitting a polynomial to equidistant data, one should never choose n
larger than about 2m1/2 .
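A small sketch that evaluates the Gram polynomials from (4.5.11)-(4.5.12) is shown below; evaluating the rows for n approaching m at points between the grid points makes the oscillations mentioned above easy to observe.

    import numpy as np

    def gram_polynomials(m, nmax, t):
        """Values of the Gram polynomials P_{0,m}, ..., P_{nmax,m} (nmax <= m) at the
        points t, generated by the recursion (4.5.11) with coefficients (4.5.12)."""
        t = np.atleast_1d(np.asarray(t, dtype=float))
        P = np.zeros((nmax + 1, t.size))
        P[0] = 1.0 / np.sqrt(m + 1.0)                        # P_{0,m}
        P_old = np.zeros_like(t)                             # P_{-1,m} = 0
        a_old = None
        for n in range(nmax):
            a = (m / (n + 1.0)) * np.sqrt((4.0 * (n + 1) ** 2 - 1.0)
                                          / ((m + 1.0) ** 2 - (n + 1.0) ** 2))
            g = 0.0 if a_old is None else a / a_old          # gamma_{n,m}; n = 0 term hits P_{-1,m} = 0
            P[n + 1] = a * t * P[n] - g * P_old
            P_old, a_old = P[n].copy(), a
        return P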
One of the motivations for the method of least squares is that it effectively reduces the in-
fluence of random errors in measurements. Suppose that the values of a function have been
measured at points x0 , x1 , . . . , xm . Let f (xp ) be the measured value, and let f¯(xp ) be the
“true” (unknown) function value, which is assumed to be the same as the expected value of the
measured value. Thus, no systematic errors are assumed to be present. Suppose further that
the errors in measurement at the various points are statistically independent. Then we have
f (xp ) = f¯(xp ) + ϵ, where
E(ϵ) = 0, V(ϵ) = s2 I, (4.5.13)

and E denotes expected value and V variance.


The problem is to use the measured data to estimate
the coefficients in the series f(x) = \sum_{j=0}^{n} c_j \phi_j(x).
According to the Gauss–Markov theorem (Theorem 1.1.4) the least squares estimates c∗j
have a smaller variance than the values one gets by any other linear unbiased estimation method.
This minimum property holds not only for estimates of the coefficients cj but also for every
linear function of the coefficients, such as the estimate of the value f (α) at an arbitrary point
α. By Lemma 1.1.1, the covariance matrix of the estimates c∗j equals s2 I, and c∗j and c∗k are
uncorrelated if j ̸= k and the variance of the estimate c∗j is s2 . From this it follows that


    V\{f_n^*(\alpha)\} = V\Bigl\{ \sum_{j=0}^{n} c_j^* \phi_j(\alpha) \Bigr\} = \sum_{j=0}^{n} V\{c_j^*\}\, |\phi_j(\alpha)|^2 = s^2 \sum_{j=0}^{n} |\phi_j(\alpha)|^2.

As an average, taken over the grid of measurement points, the variance of the smoothed function
values is

    \frac{1}{m+1} \sum_{i=0}^{m} V\{f_n^*(x_i)\} = \frac{s^2}{m+1} \sum_{j=0}^{n} \sum_{i=0}^{m} |\phi_j(x_i)|^2 = s^2\, \frac{n+1}{m+1}.

Between the grid points, however, the variance can, in many cases, be significantly larger. For
j ≫ m1/2 the Gram polynomials can be much larger between the grid points. Set

    s_I^2 = s^2 \sum_{j=0}^{n} \frac{1}{2} \int_{-1}^{1} |\phi_j(\alpha)|^2\, d\alpha.

Thus, s2I is an average variance for fn∗ (α) taken over the entire interval [−1, 1]. The following
values were obtained for the ratio k between s2I and s2 (n + 1)/(m + 1) when m = 41; see
Dahlquist and Björck [283, 1974, Section 4.4.5]:

n 5 10 15 20 25 30 35
k 1.0 1.1 1.7 26 7 · 103 1.7 · 107 8 · 1011

These results are related to the recommendation that one should choose n < 2m1/2 when fitting
a polynomial to equidistant data. This recommendation seems to contradict the Gauss–Markov
theorem, but in fact it only means that one gives up the requirement that the estimate be unbi-
ased. Still, it is remarkable that this can lead to such a drastic reduction of the variance of the
estimates fn∗ .

4.5.2 Chebyshev Interpolation


The most important family of orthogonal polynomials is perhaps the Chebyshev polynomials.8
The easily verified formula

cos(n + 1)ϕ + cos(n − 1)ϕ = 2 cos ϕ cos(nϕ), n ≥ 1,

can be used recursively to express cos(nϕ) as a polynomial in cos ϕ. If we set x = cos ϕ, then
ϕ = arccos x, and we obtain the Chebyshev polynomials for −1 ≤ x ≤ 1 by the formula
Tn (x) = cos(n arccos x), n ≥ 0. From trigonometric formulas it follows that the Chebyshev
polynomials satisfy the recursion formula

T0 (x) = 1, T1 (x) = x, Tn+1 (x) = 2xTn (x) − Tn−1 (x), n ≥ 0. (4.5.14)

The leading coefficient of T_n(x) is 2^{n-1} for n ≥ 1 and 1 for n = 0. The symmetry property
Tn (−x) = (−1)n Tn (x) also follows from the recurrence formula.

Theorem 4.5.5. Tn (x) has n zeros in [−1, 1] given by the Chebyshev abscissae,
    x_k = \cos\left( \frac{2k+1}{n}\, \frac{\pi}{2} \right), \qquad k = 0, 1, \ldots, n-1,    (4.5.15)
and n + 1 extrema Tn (x′k ) = (−1)k attained at x′k = cos(kπ/n), k = 0, . . . , n. These results
follow directly by noting that | cos(nϕ)| has maxima for ϕ′k = kπ/n, and cos(nϕk ) = 0 for
ϕk = (2k + 1)π/(2n).
8 Pafnuty Lvovich Chebyshev (1821–1894) was a Russian mathematician and a pioneer in approximation theory.

The Chebyshev polynomials T0 , T1 , . . . , Tn−1 are orthogonal with respect to the inner prod-
uct

    (f, g) = \sum_{k=0}^{n-1} f(x_k) g(x_k),

where \{x_k\} are the Chebyshev abscissae (4.5.15) for T_n. If i \ne j, then (T_i, T_j) = 0, 0 \le i, j < n, and

    (T_i, T_j) = \begin{cases} n/2 & \text{if } i = j \ne 0, \\ n & \text{if } i = j = 0. \end{cases}    (4.5.16)
If one intends to approximate a function in the entire interval [−1, 1] by a polynomial and
can choose the points at which the function is computed or measured, one should choose the
Chebyshev abscissae. With these points, interpolation is a fairly well-conditioned problem in the
entire interval, and one can conveniently fit a polynomial of lower degree than m if one wishes
to smooth errors in measurement. The risk of disturbing surprises between the grid points is
insignificant.
Let p(x) denote the interpolation polynomial of a function f (x) at the Chebyshev abscissae
xk (4.5.15). From Theorem 4.5.3 we get

    p(x) = \sum_{j=0}^{n-1} c_j T_j(x), \qquad c_i = \frac{1}{\|T_i\|^2} \sum_{k=0}^{n-1} f(x_k) T_i(x_k),

where ∥Ti ∥2 is given as in (4.5.16).
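A small sketch of this construction follows; the coefficients come directly from the discrete orthogonality (4.5.16), and numpy.polynomial.chebyshev is used only to evaluate the resulting expansion in the example.

    import numpy as np
    from numpy.polynomial import chebyshev as C

    def cheb_interp_coeffs(f, n):
        """Coefficients c_0, ..., c_{n-1} of the polynomial interpolating f at the
        n Chebyshev abscissae (4.5.15), computed via (4.5.16)."""
        k = np.arange(n)
        theta = (2 * k + 1) * np.pi / (2 * n)                # arccos of the abscissae
        fk = f(np.cos(theta))
        j = np.arange(n)
        c = (2.0 / n) * np.cos(np.outer(j, theta)) @ fk      # (f, T_j) divided by n/2
        c[0] *= 0.5                                          # ||T_0||^2 = n instead of n/2
        return c

    # Example: interpolate exp(x) at 8 Chebyshev points and evaluate at x = 0.3.
    c = cheb_interp_coeffs(np.exp, 8)
    print(C.chebval(0.3, c), np.exp(0.3))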


Expansions in terms of Chebyshev polynomials are an important aid in the study of functions
on the interval [−1, 1]. If one is working in terms of a parameter t that varies in the interval [a, b],
then one should make the substitution

    t = \tfrac{1}{2}(a + b) + \tfrac{1}{2}(a - b)x

(t \in [a, b] \Leftrightarrow x \in [-1, 1]) and work with the Chebyshev points

    t_k = \tfrac{1}{2}(a + b) + \tfrac{1}{2}(a - b)x_k, \qquad k = 0, \ldots, n-1.
The remainder term in interpolation using the values of the function f at the points xi , i =
0, 1, . . . , n − 1, is equal to


    \frac{f^{(n)}(\xi)}{n!}\, (x - x_0)(x - x_1) \cdots (x - x_{n-1}).

Here ξ depends on x, but one can say that the error curve behaves for the most part like a polyno-
mial curve y = a(x − x0 )(x − x1 ) · · · (x − xn−1 ). A similar oscillating curve is typical for error
curves arising from least squares approximation. The zeros of the error are then about the same
as the zeros for the first neglected term in the orthogonal expansion. This contrasts sharply with
the error curve for Taylor approximation at x0 , whose usual behavior is described approximately
by the formula y = a(x − x0 )n−1 . From the min-max property of the Chebyshev polynomi-
als it follows that placing the interpolation points of the Chebyshev abscissae will minimize the
maximum magnitude of

q(x) = (x − x0 )(x − x1 ) · · · (x − xn−1 )

in the interval [−1, 1]. This corresponds to choosing q(x) = T_n(x)/2^{n-1}.

For computing p(x), one can use Clenshaw’s recursion formula; see the previous section.
(Note that αk = 2 for k > 0, but α0 = 1.) Occasionally, one is interested in the partial
sums of the expansion. To smooth errors in measurement, it can be advantageous to break off
the summation before the last term. If the values of the function are afflicted with statistically
independent errors in measurement with standard deviation s, then (see the next section) the
series can be broken off when, for the first time,
    \Bigl\| f - \sum_{j=0}^{k} c_j T_j \Bigr\| < s\, n^{1/2}.

If the measurement points are the Chebyshev abscissae, then no difficulties arise in fitting a
polynomial to the data. In this case the Chebyshev polynomials have a magnitude between the
grid points that is not much larger than their magnitude at the grid points. The average variance
for fn∗ (α) becomes the same both on the interval [−1, 1] and the grid of measurement points:
s2 (n + 1)/(m + 1).
The choice of n, when m is given, is a question of compromising between taking into account
the systematic error, i.e., the truncation error (which decreases when n increases) and taking into
account the random errors (which grow as n increases). In the Chebyshev case, |cj | decreases
quickly with j if f is a sufficiently smooth function, while the part of c∗j that comes from errors
in measurement varies randomly with magnitude about s(2/(m + 1))1/2 . The expansion should
then be broken off when the coefficients begin to “behave randomly.” The coefficients in an ex-
pansion in terms of the Chebyshev polynomials can hence be used for filtering away the “noise”
from the signal, even when s is initially unknown.

4.5.3 Discrete Fourier Analysis


According to a mathematical theorem first given by Fourier (1768–1830), every periodic function
f (t) with period 2π/ω can, under certain very general conditions, be expanded into a series of
the form

    f(t) = \sum_{k=0}^{\infty} (a_k \cos k\omega t + b_k \sin k\omega t),    (4.5.17)

where ak , bk are real constants. Fourier analysis is one of the most useful and valuable tools
in applied mathematics. It has applications also to problems that are not a priori periodic. One
important area of application is in digital signal processing, e.g., in interpreting radar and sonar
signals. Another application is statistical time series, which arise in communications theory,
control theory, and the study of turbulence.
An expansion of the form (4.5.17) can be expressed in several equivalent ways. Another
form, more convenient for manipulations, is

    f(t) = \sum_{k=-\infty}^{\infty} c_k e^{ik\omega t},    (4.5.18)

where c0 = a0 , ck = (ak − ibk )/2, c−k = (ak + ibk )/2, k > 0. This form allows the function to
have complex values. A function f with period 2π can be approximated by partial sums of these
series. We call these finite sums trigonometric polynomials. If a function of t has period p,
then the substitution x = 2πt/p transforms the function into a function of x with period 2π.
When the functions to be modeled are known only on a discrete equidistant set of sampling
points, a discrete version of the Fourier analysis can be used. The discrete inner product of two

complex-valued functions f and g of period 2π is defined as follows (the bar over g indicates
complex conjugation):
    (f, g) = \sum_{\beta=0}^{N-1} f(x_\beta) \bar{g}(x_\beta), \qquad x_\beta = 2\pi\beta/N.    (4.5.19)

Theorem 4.5.6. With inner product (4.5.19) the following orthogonality relations hold for the
functions ϕj (x) = eijx , j = 0, ±1, ±2, . . .:

    (\phi_j, \phi_k) = \begin{cases} N & \text{if } (j-k)/N \text{ is an integer}, \\ 0 & \text{otherwise}. \end{cases}

Proof. With h = 2π/N , xβ = hβ, we have


    (\phi_j, \phi_k) = \sum_{\beta=0}^{N-1} e^{ijx_\beta} e^{-ikx_\beta} = \sum_{\beta=0}^{N-1} e^{i(j-k)h\beta}.

This is a geometric series with ratio q = ei(j−k)h . If (j − k)/N is an integer, then q = 1, and
the sum is N . Otherwise, q ̸= 1, but q N = ei(j−k)2π = 1. From the summation formula of a
geometric series, (ϕj , ϕk ) = (q N − 1)/(q − 1) = 0.
If f has an expansion of the form f = \sum_{j=a}^{b} c_j \phi_j with 0 \le b - a \le N - 1, then

    (f, \phi_k) = \sum_{j=a}^{b} c_j (\phi_j, \phi_k) = c_k (\phi_k, \phi_k), \qquad a \le k \le b,

because (ϕj , ϕk ) = 0 for j ̸= k. Thus, changing k to j, we have


    c_j = \frac{(f, \phi_j)}{(\phi_j, \phi_j)} = \frac{1}{N} \sum_{\beta=0}^{N-1} f(x_\beta) e^{-ijx_\beta}.    (4.5.20)

Note that the calculations required to compute the coefficients c_j according to (4.5.20), called
Fourier analysis, are of essentially the same type as the calculations needed to tabulate f^*(x)
for x = 2\pi\beta/N, \beta = 0, 1, \ldots, N - 1, when the expansion in (4.5.21) is known, the so-called
Fourier synthesis.

Theorem 4.5.7. Every function f (x) defined on the grid xβ = 2πβ/N , β = 0, . . . , N − 1, can
be interpolated by a trigonometric polynomial

    f(x) = \sum_{j=-k}^{k+\theta} c_j e^{ijx},    (4.5.21)

where

    \theta = \begin{cases} 1 & \text{if } N \text{ even}, \\ 0 & \text{if } N \text{ odd}, \end{cases} \qquad k = \begin{cases} N/2 - 1 & \text{if } N \text{ even}, \\ (N-1)/2 & \text{if } N \text{ odd}. \end{cases}
If the sum in (4.5.21) is terminated when j < k + θ, one obtains the trigonometric polynomial
that is the best least squares approximation, among all trigonometric polynomials with the same
number of terms, to f on the grid.

Proof. The expression for cj was formally derived previously (see (4.5.20)). Because

e−i(N −j)xβ = eijxβ , c−j = cN −j ,


f(x) coincides on the grid with the function f^*(x) = \sum_{j=0}^{N-1} c_j e^{ijx}. However, between the grid
points, f and f^* are not identical. Functions of several variables can be treated analogously by
taking one variable at a time.

By (4.5.20), the discrete Fourier coefficients cj for a function



    f(x) = \sum_{j=0}^{N-1} c_j e^{ijx},

whose values are known at the points x = 2πβ/N , β = 0, . . . , N − 1, are


    c_j = \frac{1}{N} \sum_{\beta=0}^{N-1} f(x_\beta)\, \omega^{j\beta}, \qquad f_\beta = f(x_\beta), \quad j = 0, \ldots, N-1,    (4.5.22)

where \omega = e^{-2\pi i/N} is an Nth root of unity (\omega^N = 1). Hence c_j is a polynomial of degree N - 1
in \omega^j. Let F_N \in \mathbb{C}^{N \times N} be the Fourier matrix with elements
(F_N)_{j\beta} = \omega^{j\beta}, j, \beta = 0, \ldots, N-1.
It follows that the discrete Fourier transform can be expressed as a matrix-vector multiplica-
tion c = FN f , where the discrete Fourier transform (DFT) matrix FN is a complex symmetric
Vandermonde matrix. Furthermore,

    \frac{1}{N} F_N^H F_N = I,    (4.5.23)

i.e., N^{-1/2} F_N is a unitary matrix, and the inverse transform is

    f = \frac{1}{N} F_N^H c.
If implemented in a naive way, the DFT will take N 2 operations to compute all cj (here,
one operation equals one complex addition and one complex multiplication). The application of
discrete Fourier analysis to large-scale problems became feasible only with the invention of the
so-called fast Fourier transform (FFT) that reduces the computational complexity to O(N log N ).
The FFT, developed in 1965 by Cooley and Tukey [269, 1965], is based on the divide-and-
conquer strategy. Consider the special case when N = 2p and set

    \beta = \begin{cases} 2\beta_1 & \text{if } \beta \text{ even}, \\ 2\beta_1 + 1 & \text{if } \beta \text{ odd}, \end{cases} \qquad 0 \le \beta_1 \le \tfrac{1}{2}N - 1.
Then the sum in (4.5.22) can be split into an even part and an odd part:

    c_j = \sum_{\beta_1=0}^{N/2-1} f_{2\beta_1} (\omega^2)^{j\beta_1} + \sum_{\beta_1=0}^{N/2-1} f_{2\beta_1+1} (\omega^2)^{j\beta_1} \omega^j.    (4.5.24)

Let \alpha be the quotient and j_1 the remainder when j is divided by N/2, i.e., j = \alpha N/2 + j_1. Then,
because \omega^N = 1,

    (\omega^2)^{j\beta_1} = (\omega^2)^{\alpha N\beta_1/2} (\omega^2)^{j_1\beta_1} = (\omega^N)^{\alpha\beta_1} (\omega^2)^{j_1\beta_1} = (\omega^2)^{j_1\beta_1}.

Thus, if for j_1 = 0, 1, \ldots, N/2 - 1 we set

    \phi(j_1) = \sum_{\beta_1=0}^{N/2-1} f_{2\beta_1} (\omega^2)^{j_1\beta_1}, \qquad \psi(j_1) = \sum_{\beta_1=0}^{N/2-1} f_{2\beta_1+1} (\omega^2)^{j_1\beta_1},

where (\omega^2)^{N/2} = 1, then by (4.5.24),

    c_j = \phi(j_1) + \omega^j \psi(j_1), \qquad j = 0, 1, \ldots, N - 1.
The two sums on the right are elements of the DFTs of length N/2 applied to the parts of f with
odd and even subscripts. The DFT of length N is obtained by combining these two DFTs. Since
\omega_N^{N/2} = -1, it follows that

    y_{j_1} = \phi_{j_1} + \omega_N^{j_1} \psi_{j_1},    (4.5.25)
    y_{j_1+N/2} = \phi_{j_1} - \omega_N^{j_1} \psi_{j_1}, \qquad j_1 = 0, \ldots, N/2 - 1.    (4.5.26)
These expressions are called the butterfly relations because of the data flow pattern. The com-
putation of ϕj1 and ψj1 is equivalent to two Fourier transforms with m = N/2 terms instead of
one with N terms. If N/2 is even, the same idea can be applied to these two Fourier transforms.
One then gets four Fourier transforms, each of which has N/4 terms. If N = 2p , this reduction
can be continued recursively until we get N DFTs with one term. Each step involves an even–
odd permutation. In the first step the points with last binary digit equal to 0 are ordered first, and
those with last digit equal to 1 are ordered last. In the next step the two resulting subsequences
of length N/2 are reordered according to the second binary digit, etc.
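A minimal recursive sketch built directly on the butterfly relations (4.5.25)-(4.5.26) is given below; it returns the unscaled sums over f_beta * omega^(j*beta) for N a power of two and is meant to show the divide-and-conquer structure, not to compete with library FFTs.

    import numpy as np

    def fft_radix2(f):
        """Recursive radix-2 FFT for len(f) = 2^p, combining the half-length
        transforms with the butterfly relations (4.5.25)-(4.5.26)."""
        f = np.asarray(f, dtype=complex)
        N = f.size
        if N == 1:
            return f
        phi = fft_radix2(f[0::2])                        # DFT of the even-indexed part
        psi = fft_radix2(f[1::2])                        # DFT of the odd-indexed part
        w = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # omega_N^{j_1}
        return np.concatenate([phi + w * psi,            # (4.5.25)
                               phi - w * psi])           # (4.5.26)

    # Agrees with numpy's FFT (same sign convention, no 1/N scaling):
    x = np.random.default_rng(0).standard_normal(16)
    print(np.allclose(fft_radix2(x), np.fft.fft(x)))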
The number of complex operations (one multiplication and one addition) required to compute
\{y_j\} from the butterfly relations when \{\phi_{j_1}\} and \{\psi_{j_1}\} have been computed is 2^p, assuming that
the powers of \omega are precomputed and stored. If we denote by q_p the total number of operations
needed to compute the DFT when N = 2^p, we have q_p \le 2q_{p-1} + 2^p, p \ge 1. Since q_0 = 0, it
follows by induction that

    q_p \le p\, 2^p = N \log_2 N.

Hence, when N is a power of two, the FFT solves the problem with at most N \log_2 N complex
operations. For example, when N = 2^{20} = 1,048,576 the FFT algorithm is theoretically a factor
of 84,000 faster than the “conventional” O(N^2) algorithm. The FFT algorithm not only uses
fewer operations to evaluate the DFT, it also is more accurate. For the conventional method, the
roundoff error is proportional to N . For the FFT algorithm, the roundoff error is proportional to
log2 N .
Most implementations of FFT avoid explicit recursion and instead use two stages.
• A reordering stage in which the data vector f is permuted in bit-reversal order.
• A second stage in which first N/2 FFT transforms of length 2 are computed on adjacent
elements, followed by N/4 transforms of length 4, etc., until the final result is obtained by
merging two FFTs of length N/2.
It is not difficult to see that the combined effect of the reordering in the first stage is a bit-
reversal permutation of the data points. For i = 0 : N − 1, let the index i have the binary
expansion i = b0 + b1 · 2 + · · · + bt−1 · 2t−1 , and set
r(i) = bt−1 + · · · + b1 · 2t−2 + b0 · 2t−1 .
That is, r(i) is the index obtained by reversing the order of the binary digits. If i < r(i), then
exchange fi and fr(i) . We denote the permutation matrix performing the bit-reversal ordering

by PN . Note that if an index is reversed twice, we end up with the original index. This means
that PN−1 = PNT = PN , i.e., PN is symmetric. The permutation can be carried out “in place” by
a sequence of pairwise interchanges or transpositions of the data points. For example, for N =
16, the pairs (1,8), (2,4), (3,12), (5,10), (7,14), and (11,13) are interchanged. The bit-reversal
permutation can take a substantial fraction of the total time to do the FFT. Which implementation
is best depends strongly on the computer architecture.
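A short sketch of r(i) and of the exchange rule just described; for N = 16 it reproduces the pairs listed above.

    def bit_reversal_pairs(N):
        """Bit-reversal permutation r and the pairs (i, r(i)) with i < r(i) that an
        in-place reordering would swap; N must be a power of two."""
        t = N.bit_length() - 1                           # N = 2^t
        r = [int(format(i, "0{}b".format(t))[::-1], 2) for i in range(N)]
        swaps = [(i, r[i]) for i in range(N) if i < r[i]]
        return r, swaps

    print(bit_reversal_pairs(16)[1])
    # [(1, 8), (2, 4), (3, 12), (5, 10), (7, 14), (11, 13)]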
When N = 2p the FFT algorithm can be interpreted as a sparse factorization of the DFT
matrix,
    F_N = A_k \cdots A_2 A_1 P_N,    (4.5.27)

where P_N is the bit-reversal permutation matrix, and A_1, \ldots, A_k are block diagonal matrices:

    A_q = \mathrm{diag}(\underbrace{B_L, \ldots, B_L}_{r}), \qquad L = 2^q, \quad r = N/L.    (4.5.28)

Here the matrix B_L \in \mathbb{C}^{L \times L} is the radix-2 butterfly matrix defined by

    B_L = \begin{pmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{pmatrix},    (4.5.29)

    \Omega_{L/2} = \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L/2-1}), \qquad \omega_L = e^{-2\pi i/L}.    (4.5.30)
This is usually referred to as the Cooley–Tukey FFT algorithm.
The discrete cosine transform (DCT) was discovered in 1974 by Ahmed, Natarajan, and
Rao [11, 1974]; see also Rao and Yio [913, 1990]. The DCT has real entries as opposed to
the complex entries of the FFT matrix. Depending on the type of boundary condition (Dirichlet
or Neumann, centered at a mesh point or midpoint) there are different variants. The DCT-2
transform is used extensively in image processing. It uses the real basis vectors

    v_{i,j} = \sqrt{\frac{2}{N}}\, \cos\Bigl( (j-1)\bigl(i + \tfrac{1}{2}\bigr)\frac{\pi}{N} \Bigr),    (4.5.31)

divided by 2 if j = 1. Strang [1044, 1999] surveys the four possible variants of cosine trans-
forms DCT-1, . . . , DCT-4 and their use for different boundary conditions.
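For experiments, SciPy provides all four variants. A quick check that the DCT-2 with orthonormal scaling is an orthogonal transform, consistent with a suitable normalization of the basis vectors (4.5.31), might look as follows.

    import numpy as np
    from scipy.fft import dct, idct

    x = np.random.default_rng(1).standard_normal(8)
    y = dct(x, type=2, norm="ortho")                          # DCT-2, orthonormal scaling
    print(np.allclose(idct(y, type=2, norm="ortho"), x))      # the inverse recovers x
    print(np.allclose(np.linalg.norm(y), np.linalg.norm(x)))  # norms are preserved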

Notes and references


The computational advantage of the Stieltjes approach for polynomial approximation was pointed
out by Forsythe [422, 1956]. Van Loan [1082, 1992] gives a unified treatment of FFT algorithms
based on the factorization of the Fourier matrix FN into a product of sparse matrix factors. Ideas
related to the FFT can be found in many previous works; see Cooley [267, 1990]. Excellent
surveys of the use of the discrete Fourier transform are given by Cooley, Lewis, and Welch [268,
1969] and Henrici [604, 1979].

4.5.4 Vandermonde Systems


A Vandermonde matrix has the form

    V = V(x_0, x_1, \ldots, x_n) = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_n \\ \vdots & \vdots & & \vdots \\ x_0^n & x_1^n & \cdots & x_n^n \end{pmatrix},    (4.5.32)

where {xk }nk=0 is a sequence of n + 1 distinct real numbers. Vandermonde matrices arise in
many applications, such as interpolation and approximation of linear functionals. Consider first
the problem of constructing a polynomial

p(x) = a0 + a1 x + · · · + an xn

that interpolates data


(xi , fi ), i = 0, 1, . . . , n.
It is easily shown that the coefficients a = (a_0, a_1, \ldots, a_n)^T satisfy the dual Vandermonde system
V^T a = f. The primal system Vw = b arises from the problem of determining the weights
w_0, w_1, \ldots, w_n in a quadrature formula when given the moments

    w_0 x_0^i + w_1 x_1^i + \cdots + w_n x_n^i = b_i, \qquad i = 0, 1, \ldots, n.

Vandermonde systems are often extremely ill-conditioned because they correspond to an


interpolation problem with a monomial basis. An accurate and fast algorithm for solving pri-
mal or dual Vandermonde systems in O(n^2) multiplications and O(n) storage is given by Björck and
Pereyra [152, 1970]. This algorithm corresponds to decomposition of the inverse V −1 into a
product
V −1 = U0 · · · Un−1 Ln−1 · · · L0
of upper and lower bidiagonal matrices. These algorithms are generalized to confluent Vander-
monde systems by Björck and Elfving [140, 1973]. The corresponding dual system then is a
Hermite interpolation problem.
The fast Björck–Pereyra algorithms often achieve better accuracy in the computed solution
than standard (and more expensive) methods like Gaussian elimination with partial pivoting. In-
deed, some problems connected with Vandermonde systems that, traditionally, have been thought
to be too ill-conditioned to be attacked, can be solved with good precision.
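A sketch of the dual solve V^T a = f (i.e., computing the monomial coefficients of the interpolating polynomial) in the two-stage Björck-Pereyra style is given below: Newton divided differences followed by conversion of the Newton form to monomial coefficients.

    import numpy as np

    def bjorck_pereyra_dual(x, f):
        """Solve the dual Vandermonde system V(x)^T a = f in O(n^2) operations;
        a holds the coefficients of the interpolating polynomial a_0 + a_1 t + ... + a_n t^n."""
        x = np.asarray(x, dtype=float)
        a = np.array(f, dtype=float)
        n = len(x) - 1
        # Stage 1: Newton divided differences.
        for k in range(n):
            for i in range(n, k, -1):
                a[i] = (a[i] - a[i - 1]) / (x[i] - x[i - k - 1])
        # Stage 2: convert the Newton form to monomial coefficients.
        for k in range(n - 1, -1, -1):
            for i in range(k, n):
                a[i] -= x[k] * a[i + 1]
        return a

    # Data sampled from 1 + 2t - t^2 is reproduced exactly up to roundoff:
    t = np.array([0.0, 1.0, 2.0, 3.0])
    print(bjorck_pereyra_dual(t, 1 + 2 * t - t**2))   # approximately [1, 2, -1, 0]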
The determinant of a square submatrix of a matrix A is called a minor of A. If all minors
of A are positive, then A is called totally positive. Let the Vandermonde matrix V have points
{xk }nk=0 that are positive and monotonically ordered: 0 < x0 < x1 < · · · < xn . Then from the
well-known formula

    \det(V) = \prod_{i>j} (x_i - x_j)    (4.5.33)

it follows that V = V (x0 , x1 , . . . , xn ) has positive determinant and positive leading principal
minors. More generally, it is known that V is totally positive. For such Vandermonde systems,
Higham [612, 1987] shows that if the right-hand side is sign-interchanging, (−1)k bk ≥ 0, then
the error in the solution computed by the Björck–Pereyra algorithm can be bounded by a quantity
independent of κ(V ).
A Vandermonde-like matrix V = (vij ) has elements vij = ϕi (xj ), 0 ≤ i, j ≤ n, where
{ϕi }n0 is a family of orthonormal polynomials that satisfy a three-term recurrence of the form
(4.5.5). Such matrices generally have much smaller condition numbers than the classical Van-
dermonde matrices. Higham [613, 1988] and Reichel [916, 1991] give fast algorithms of Björck–
Pereyra type for such systems. Demmel and Koev [311, 2005] prove that Björck–Pereyra-type
algorithms exist not only for such systems but also for any totally positive linear system for which
the initial minors can be computed accurately.
Let V ∈ Rm×n be a rectangular Vandermonde matrix consisting of the first n < m columns
of V (x0 , x1 , . . . , xm ). It is natural to ask whether fast methods exist for solving the primal
Vandermonde least squares problem

    \min_x \|Vx - b\|_2.    (4.5.34)

Demeure [301, 1989], [302, 1990] has given an algorithm of complexity O(mn) for computing
the QR factorization of V , which can be used to solve problem (4.5.34). However, because
this algorithm forms V T V , it is likely to be unstable. A fast algorithm based on the Gragg–
Harrod scheme [523, 1984] for computing the QR factorization of transposed Vandermonde-like
matrices is given by Reichel [916, 1991]. This algorithm can be used to solve overdetermined
dual Vandermonde-like systems in the least squares sense in O(mn) operations.

Notes and references

A survey of properties of totally positive matrices is given by Fallat [393, 2001]. Higham [623,
2002, Chapter 22], surveys algorithms for Vandermonde systems. The remarkable numerical
stability obtained for Vandermonde systems has counterparts for other classes of structured ma-
trices. Boros, Kailath, and Olshevsky [171, 1999] derive fast parallel Björck–Pereyra-type algo-
rithms for solving Cauchy linear systems Cx = d,

    C = \begin{pmatrix} \dfrac{1}{x_0 - y_0} & \cdots & \dfrac{1}{x_0 - y_n} \\ \vdots & \ddots & \vdots \\ \dfrac{1}{x_n - y_0} & \cdots & \dfrac{1}{x_n - y_n} \end{pmatrix}.

This class of systems includes Hilbert linear systems with sign-interchanging right-hand side.
Martínez and Peña [780, 1998] discuss algorithms of similar type for Cauchy–Vandermonde
matrices of the form ( C V ).

4.5.5 Toeplitz and Hankel Least Squares Problems


A Toeplitz matrix T = (tij ) is a matrix with constant entries along each diagonal parallel to the
main diagonal. A rectangular Toeplitz matrix

    T = \begin{pmatrix} t_0 & t_1 & \cdots & t_n \\ t_{-1} & t_0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & t_1 \\ t_{-n} & \cdots & t_{-1} & t_0 \\ \vdots & \ddots & \ddots & \vdots \\ t_{-m} & t_{-m+1} & \cdots & t_{-m+n} \end{pmatrix} \in \mathbb{R}^{(m+1) \times (n+1)}    (4.5.35)

is defined by its n + m + 1 elements t−m , . . . , t0 , . . . , tn in the first row and column. Toeplitz
matrices arise from discretization of convolution-type integral equations and play a fundamental
role in signal processing, time-series analysis, econometrics, and image deblurring; see Hansen,
Nagy, and O’Leary [583, 2006].
The BBH algorithm of Bojanczyk, Brent, and de Hoog [162, 1986] is a fast algorithm for
computing the QR factorization of a Toeplitz matrix. It is related to a classical algorithm of
Schur and requires O(mn + n2 ) instead of O(mn2 ) operations. The basic idea of the BBH
algorithm is to partition T in two different ways,

    T = \begin{pmatrix} t_0 & u^T \\ v & T_0 \end{pmatrix} = \begin{pmatrix} T_0 & \tilde{u} \\ \tilde{v}^T & t_{m-n} \end{pmatrix},    (4.5.36)

where T0 is a submatrix of T , u and ṽ are n − 1 dimensional vectors, v and ũ are m − 1 vectors,


and t_0 and t_{m-n} are scalars. Let R be the Cholesky factor of T^T T, and partition R as

    R = \begin{pmatrix} r_{11} & z^T \\ 0 & R_b \end{pmatrix} = \begin{pmatrix} R_t & \tilde{z} \\ 0 & r_{nn} \end{pmatrix},    (4.5.37)
where r11 and rnn are scalars. The factor R in the QR factorization of T is implicitly derived
from the cross-product T^T T. Setting R^T R = T^T T and using the partitioning (4.5.36) and
(4.5.37), we get

    \begin{pmatrix} r_{11}^2 & r_{11} z^T \\ r_{11} z & zz^T + R_b^T R_b \end{pmatrix} = \begin{pmatrix} t_0^2 + v^T v & t_0 u^T + v^T T_0 \\ t_0 u + T_0^T v & uu^T + T_0^T T_0 \end{pmatrix}    (4.5.38)

and

    \begin{pmatrix} R_t^T R_t & R_t^T \tilde{z} \\ \tilde{z}^T R_t & \tilde{z}^T \tilde{z} + r_{nn}^2 \end{pmatrix} = \begin{pmatrix} T_0^T T_0 + \tilde{v}\tilde{v}^T & T_0^T \tilde{u} + t_{m-n} \tilde{v} \\ \tilde{u}^T T_0 + t_{m-n} \tilde{v}^T & \tilde{u}^T \tilde{u} + t_{m-n}^2 \end{pmatrix}.    (4.5.39)
From (4.5.38) and (4.5.39) we see that
zz T + RbT Rb = uuT + T0T T0 , RtT Rt = T0T T0 + ṽṽ T .
Eliminating the term T0T T0 , we obtain
RbT Rb = RtT Rt + uuT − ṽṽ T − zz T , (4.5.40)
which is the basic equality used by the BBH algorithm. This shows that if Rt were known, Rb
would be computed by one Cholesky updating step and two Cholesky downdating steps; see
Section 3.3.4. Moreover, because updating and downdating can proceed by rows, the first k rows
of R_b can be obtained from the first k rows of R_t. But the kth row of R_b defines the (k + 1)th
row of R_t, and the first row of R can be obtained from (4.5.38):

    r_{11} = \sqrt{t_0^2 + v^T v}, \qquad z^T = (t_0 u^T + v^T T_0)/r_{11}.
It follows that (4.5.40) provides a method for computing R one row at a time.
The BBH algorithm requires mn2 +6n2 multiplications and is more efficient than the O(mn2 )
methods when n > 256. Nagy [818, 1993] modified the BBH algorithm to compute R−1 , QT b,
and the solution x using a linear amount of storage and 2mn + 14n2 multiplications. Another
possibility is to use the corrected seminormal equations to obtain x. This can also be imple-
mented in linear storage (see Nagy [818, 1993]). Another fast algorithm by Chun, Kailath, and
Lev-Ari [249, 1987] is essentially equivalent to the BBH algorithm.
For ill-conditioned problems the BBH algorithm does not perform well, because it uses the
explicit cross-product matrix. Park and Eldén [881, 1997] give a forward error analysis that
allows the conditioning of the problem to be monitored. For ill-conditioned problems their algo-
rithm uses the corrected seminormal equations to produce more accurate triangular factors than
those of other fast algorithms.
Discretization of a convolution-type Volterra integral equation of the first kind,

    \int_0^t K(t - s) f(s)\, ds = g(t), \qquad 0 \le t \le T,
gives a linear system with an upper triangular Toeplitz matrix

    T = \begin{pmatrix} t_1 & t_2 & \cdots & t_{n-1} & t_n \\ & t_1 & t_2 & & t_{n-1} \\ & & \ddots & \ddots & \vdots \\ & & & t_1 & t_2 \\ & & & & t_1 \end{pmatrix}.    (4.5.41)

For the Tikhonov regularization problem

    \min_x \|Tx - b\|_2^2 + \lambda^2 \|Lx\|_2^2,    (4.5.42)

where both T and L are upper triangular and Toeplitz, Eldén [370, 1984] gives an algorithm that
only requires 9n2 flops for computing the solution for a given value of λ. His algorithm can be
modified to handle also the case when T and L have a few nonzero diagonals below the main
diagonal. Eldén’s algorithm uses n2 /2 storage locations. A modification that only uses O(n)
storage locations is given by Bojanczyk and Brent [161, 1986].
An alternative to direct methods for Toeplitz least squares problems is to use iterative meth-
ods, such as the preconditioned conjugate gradient method; see Section 6.2.2. This requires
in each iteration step one matrix-vector multiplication with T and T H . Such products can be
implemented in O(m log m) operations and O(m + n) storage using FFT; see Section 6.3.7.
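A short sketch of such a product by circulant embedding follows: the Toeplitz matrix, given by its first column and first row, is embedded in a circulant matrix of order m + n whose action is a circular convolution evaluated with the FFT. Real data are assumed, and scipy.linalg.toeplitz is used only to build a dense reference matrix for the check.

    import numpy as np
    from scipy.linalg import toeplitz

    def toeplitz_matvec(col, row, v):
        """Compute T @ v for the Toeplitz matrix T with first column `col` and
        first row `row` (row[0] == col[0]) via circulant embedding and the FFT."""
        m, n = len(col), len(row)
        # First column of a circulant matrix whose leading m-by-n block is T.
        c = np.concatenate([col, np.zeros(1), row[:0:-1]])   # length m + n
        u = np.concatenate([v, np.zeros(len(c) - n)])        # zero-padded input vector
        y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(u))[:m]
        return y.real                                        # real data assumed

    col, row = np.array([1.0, 2.0, 3.0, 4.0, 5.0]), np.array([1.0, -1.0, -2.0, -3.0])
    v = np.ones(4)
    print(np.allclose(toeplitz(col, row) @ v, toeplitz_matvec(col, row, v)))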
Let T = (tij ) be a Toeplitz matrix, and let J denote the reverse permutation matrix. Then
H = T J has constant entries along each antidiagonal, i.e., H is a Hankel matrix. For example,
if n = 2, m = 3, then

    H = TJ = \begin{pmatrix} t_2 & t_1 & t_0 \\ t_1 & t_0 & t_{-1} \\ t_0 & t_{-1} & t_{-2} \\ t_{-1} & t_{-2} & t_{-3} \end{pmatrix}.
Since the reverse permutation matrix satisfies J −1 = J, we have HJ = T . Hence, methods
discussed in this section for solving Toeplitz least squares problems apply to Hankel least squares
problems as well.

Notes and references


There is an extensive literature on Toeplitz matrices. Bunch [186, 1985] investigates the stability
properties of classical and fast algorithms for solving Toeplitz systems. Surveys can be found
in Brent [179, 1988] and Bojanczyk, Brent, and de Hoog [163, 1993]. An O(mn) algorithm
for QR factorization of Toeplitz and block Toeplitz matrices was given by Sweet [1053, 1984]
but was later shown to be unstable. A different approach is used by Cybenko [282, 1987], who
gives an algorithm for computing R−1 and Q ∈ Rm×n based on the so-called lattice algorithm.
This algorithm requires that all submatrices T:,1:i , i = 1, . . . , n, be well-conditioned, which is
not always the case in applications. Hansen and Gesmar [581, 1993] give a modification of
Cybenko’s algorithm in which a block step is used to skip over linearly dependent columns.
For solving symmetric positive definite Toeplitz linear systems, a “superfast” algorithm that only
requires O(n \log^2 n) operations has been proposed; see Ammar and Gragg [20, 1988].
Chapter 5

Direct Methods for Sparse Problems

A sparse matrix is any matrix with enough zeros that it pays to take advantage of
them.
—J. H. Wilkinson

5.1 Tools for Sparse Matrix Computations


5.1.1 Introduction
A band matrix, whose nonzero elements are concentrated near the main diagonal, is a simple
example of a sparse matrix. Methods for solving band least squares problems have been treated
in Section 4.1. A significant part of large-scale scientific computing applications involves sparse
matrices with more irregular sparsity patterns. By only storing and operating on nonzero ele-
ments in sparse matrices, large savings in computing time and memory can be achieved and can
make otherwise intractable problems practical to solve. Examples include the fields of structural
analysis, geodetic surveying, management science, and power systems analysis. This chapter
treats effective direct methods for the solution of sparse least squares problems using Cholesky
and QR factorizations. As the computation proceeds, one must try to minimize fill, which
is the term used to denote the creation of new nonzero elements within sparse factors of the
sparse matrix.
We initially assume that A ∈ Rm×n has full column rank, although problems where rank(A)
= m < n or rank(A) < min(m, n) do occur in practice. Other issues, such as dimension,
sparsity structure, and conditioning, should be considered when choosing the method to be used.
Occurrence of weighted rows, the number of right-hand sides, and constraints on the solution
must also be considered. If only part of the solution vector x is needed, savings can be achieved.
It may also be possible to take advantage of a sparse right-hand side b.
An example of a symmetric irregular pattern arising from a structural problem in aerospace
is illustrated in Figure 5.1.1. Other application areas can give patterns with quite different char-
acteristics.
Solving a sparse least squares problem min ∥Ax − b∥2 of full column rank using the method
of normal equations is usually performed in several steps as follows.

1. A symbolic phase in which a column permutation Pc is determined that approximately
minimizes the number of nonzero elements in the Cholesky factor R of (APc)^T(APc).


Figure 5.1.1. Nonzero pattern of a matrix arising from a structural problem (nz = 7557) and its
Cholesky factor (nz = 9350). [Two spy plots over the index range 0-400; the image itself is not
reproduced here.]

This step only works on the nonzero structure of A and also generates a storage structure
for R.

2. A numerical phase in which C = (APc )T (APc ) is formed numerically and its Cholesky
factor R is computed and stored in the data structure generated in the symbolic phase.

3. For the given right-hand side b form c = PcT AT b, solve RT z = c, Ry = z, and set
x = Pc y.
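A rough SciPy sketch of these three steps is shown below. Reverse Cuthill-McKee is used only as a convenient stand-in for a fill-reducing column ordering (production codes typically use minimum degree or nested dissection), and because SciPy has no sparse Cholesky, the symmetric positive definite matrix C is factorized with splu with its internal column ordering disabled, the triangular factors playing the role of R.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee
    from scipy.sparse.linalg import splu

    def sparse_normal_equations(A, b):
        """Sketch of the normal-equations approach for a sparse full-column-rank A."""
        A = sp.csc_matrix(A)
        C = (A.T @ A).tocsc()
        # Step 1: symbolic phase -- a column ordering from the structure of A^T A.
        perm = reverse_cuthill_mckee(C, symmetric_mode=True)
        Ap = A[:, perm]
        # Step 2: numerical phase -- form and factorize C = (A Pc)^T (A Pc).
        Cp = (Ap.T @ Ap).tocsc()
        lu = splu(Cp, permc_spec="NATURAL")
        # Step 3: solve with the factor and undo the column permutation.
        y = lu.solve(Ap.T @ b)
        x = np.empty_like(y)
        x[perm] = y
        return x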

Sparse QR factorization can give substantial improvements in accuracy compared to Cholesky


factorization. As the factor R is mathematically the same for both factorizations, the symbolic
phase from the Cholesky factorization can be used. However, as for band least squares problems,
the intermediate fill and arithmetic cost for QR depend strongly on the row ordering.
There are several collections of sparse matrices arising from a wide range of applications that
are used for development, testing, and performance evaluation for sparse matrix computations.
One of the first was the Harwell–Boeing collection by Duff, Grimes, and Lewis [346, 1989].
This was followed by the Matrix Market collection by Boisvert et al. [160, 1997], which has now
been incorporated into the larger SuiteSparse collection by Davis and Hu [291, 2011]. Whereas
the largest matrix in Matrix Market has dimension 90,449 with 2.5 million nonzeros, the largest
in the SuiteSparse collection has dimension 28 million with 760 million nonzeros!

Notes and references

The literature on sparse matrix algorithms is extensive. George and Liu [457, 1981] is an early
classical text on sparse Cholesky factorization. Graph theory and its connections to sparse matrix
computations are treated in George, Gilbert, and Liu [454, 1993]. An excellent survey of the
state of the art of direct methods for sparse linear systems is given by Davis [289, 2006]; see also
Davis, Rajamanickam, and Sid-Lakhdar [292, 2016] and Duff, Erisman, and Reid [345, 2017].
A recent addition that complements theory with detailed outlines of algorithms and emphasizes
the importance of sparse matrix factorizations for constructing preconditioners for iterative least
squares methods is Scott and Tůma [991, 2023].

5.1.2 Storage and Operations on Sparse Matrices


The main objective of storage schemes for sparse vectors and matrices is to economize on stor-
age while facilitating subsequent operations. An important distinction is between static storage
structures that remain fixed and a dynamic structure that can accommodate fill. A static struc-
ture like the general sparse storage scheme can be used when the number and location of the
nonzeros in the matrix factors can be predicted. Otherwise, the data structure for the factors
must dynamically allocate sufficient space for fill during the elimination.
It is useful to be able to predict fill that will occur in a matrix factorization so that storage
structures can be fixed in advance. To do this, one usually appeals to a no-cancellation assump-
tion: when two nonzero numerical quantities are added or subtracted, the result is assumed to be
nonzero. The structural rank of a matrix is defined to be the maximum rank of all numerical
matrices with the same nonzero pattern. For example, the matrix
 
c11 0 c13
 0 0 0 
c31 0 c33

has structural rank two, but if c11 c33 − c13 c31 = 0, the numerical rank is one.
In the following we denote the number of nonzero elements in a sparse matrix (or vector) by
nnz. A sparse vector x can be stored in compressed form as the triple (xC, ix, nnz). Its nnz
nonzero elements are stored in the vector xC. The indices of the elements in xC are stored in
the integer vector ix, so that

xCk = xix(k) , k = 1, . . . , nnz.

For example, in compressed form the vector x = (0, 4, 0, 0, 1, 0, 0, 0, 6, 0) is stored as

xC = (1, 4, 6), ix = (5, 2, 9), nnz = 3.

Note that it is not necessary to store the nonzero elements in any particular order in xC. This
makes it easy to add further nonzero elements in x—they can be appended in the last positions
of xC and ix.
In the general sparse storage scheme for a sparse matrix A, nonzero elements are stored in
coordinate form (AC, ix, jx, nnz), that is, as an unordered one-dimensional array AC with two
integer vectors ix and jx containing the corresponding row and column indices,

AC(k) = aij , i = ix(k), j = jx(k), k = 1, . . . , nnz,

where nnz is the number of nonzero elements in A. This scheme is very convenient for the
initial representation of a general sparse matrix because additional nonzero elements can be
easily added to the structure. A drawback is that it is difficult to access A by rows or columns,
as is usually needed in later operations. Also, the storage overhead is large because two integer
vectors of length nnz are required. This overhead can be decreased by using a clever compressed
scheme due to Sherman; see George and Liu [457, 1981, pp. 139–142].
In compressed storage by rows (CSR) a sparse matrix is stored as the collection of its sparse
rows. For each row, the nonzeros are stored in AC in compressed form. The corresponding
column subscripts are stored in integer vector ja, so that the column subscript of element AC(k)
is in ja(k). A third vector ip gives the position in AC of the first element in the ith row of A.

In CSR the matrix

    A = \begin{pmatrix} a_{11} & 0 & a_{13} & 0 & 0 \\ a_{21} & a_{22} & 0 & a_{24} & 0 \\ 0 & 0 & a_{33} & 0 & 0 \\ 0 & a_{42} & 0 & a_{44} & 0 \\ 0 & 0 & 0 & a_{54} & a_{55} \\ 0 & 0 & 0 & 0 & a_{65} \end{pmatrix}    (5.1.1)
0 0 0 0 a65
would be stored as

AC = (a11 , a13 | a22 , a21 a24 | a33 | a42 , a44 | a54 , a55 | a65 ),
ja = (1, 3, 2, 1, 4, 3, 2, 4, 4, 5, 5),
ip = (1, 3, 6, 7, 9, 11).

The components in each row need not be ordered. To access a nonzero aij there is no direct way
to calculate the corresponding index in the vector AC; some testing is needed on the subscripts
in ja. In CSR storage form, a complete row of A can be retrieved efficiently, but elements in
a particular column cannot be retrieved without a search of nearly all elements. Alternatively, a
similar compressed storage by columns (CSC) form can be used.
For a sparse symmetric matrix it suffices to store either its upper or its lower triangular part,
including the main diagonal. If the matrix is positive definite, then all its diagonal elements are
positive, and it may be convenient to store them in a separate vector.
For problems in which all the nonzero matrix elements lie along a few subdiagonals the
compressed diagonals storage mode is suitable. A matrix A is then stored in a rectangular
array AD and a vector la of pointers. The array AD has n rows and nd columns, where nd is
the number of diagonals. AD contains the diagonals of A that have at least one nonzero element,
and la contains the corresponding diagonal numbers. The superdiagonals are padded to length
n with k trailing zeros, where k is the diagonal number. The subdiagonals are padded to length
n with |k| leading zeros. This storage scheme is particularly useful if the matrices come from a
finite element or finite discretization on a tensor product grid. The matrix A in (5.1.1) would be
stored as
    AD = \begin{pmatrix} a_{11} & a_{13} & a_{21} & 0 \\ a_{22} & a_{24} & 0 & a_{42} \\ a_{33} & 0 & 0 & 0 \\ a_{44} & 0 & a_{54} & 0 \\ a_{55} & 0 & a_{65} & 0 \end{pmatrix}, \qquad la = (0, \; 2, \; -1, \; -2).
Operations on sparse vectors are simplified if one of the vectors is first uncompressed, i.e.,
stored as a dense vector. This can be done in time proportional to the number of nonzeros, and
it allows direct random access to specified elements in the vector. Vector operations, such as
adding a multiple of a sparse vector x to an uncompressed sparse vector y or computing an inner
product xT y, can then be performed in constant time per nonzero element. For example, assume
x is held in compressed form and y in a full-length array. Then the operation y := a ∗ x + y may
be expressed as

for k = 1, . . . , nnz
i = ix(k); y(i) := a ∗ x(k) + y(i);
end

Efficient implementations of sparse matrix-vector products such as Av and AT u are impor-


tant in both direct and iterative methods for sparse linear least squares problems. The storage
structure for A should be chosen so that hardware features like vector registers can be exploited.

Some of the more common structures are described below. Let u ∈ Rm and v ∈ Rn be uncom-
pressed vectors, and let A be stored in CSR mode. Then the matrix-vector product u = Av can
be implemented as

for i = 1 : m,
u(i) = 0;
for k = ip(i) : ip(i + 1) − 1,
u(i) = u(i) + AC(k) ∗ v(ja(k));
end
end

For the product v = AT u, an analogous column-oriented code would have to access the elements of A
column by column, which is very inefficient in CSR storage. The product is better formed row by row as

v(1 : n) = 0;
for i = 1 : m
for k = ip(i) : ip(i + 1) − 1,
j = ja(k); v(j) = v(j) + AC(k) ∗ u(i);
end
end
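The two loops above are easily collected into a single routine. A minimal MATLAB-style sketch
(not from the book; the function name is illustrative), operating directly on the CSR arrays
AC, ja, ip with ip(m + 1) = nnz + 1:

    function [u, w] = csr_products(AC, ja, ip, v, u0)
    % u = A*v computed row by row; w = A'*u0 computed by scattering each row.
    m = length(ip) - 1;
    n = length(v);
    u = zeros(m, 1);
    for i = 1:m
        for k = ip(i) : ip(i+1)-1
            u(i) = u(i) + AC(k)*v(ja(k));
        end
    end
    w = zeros(n, 1);
    for i = 1:m
        for k = ip(i) : ip(i+1)-1
            j = ja(k);
            w(j) = w(j) + AC(k)*u0(i);
        end
    end
    end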

A proposal for standard computational kernels (BLAS) aimed at iterative solvers is given
by Duff et al. [349, 1997]. The sparse BLAS interface in the standard of the BLAS Technical
Forum is designed to shield the user from concerns about specific storage schemes; see Duff,
Heroux, and Pozo [347, 2002].

5.1.3 Graphs and Sparse Matrices


A graph G = (V, E) consists of a set of vertices or nodes V = {v1 , . . . , vn } and a set of
edges E. In an undirected graph an edge is an unordered pair of its vertices (vi , vj ). A graph is
directed if the edges (vi , vj ) are directed from vi to vj . Two vertices vi and vj in a graph G are
said to be adjacent if there is an edge (vi , vj ) ∈ E. The edge is incident to the vertices vi and
vj .
The nonzero structure of a symmetric matrix C ∈ Rn×n can be modeled by an undirected
adjacency graph G(C) = (V, E) with n vertices V = {1, . . . , n} and edges

E = {(i, j) | cij = cji ̸= 0, i ̸= j}.

The structure of a square unsymmetric matrix C can be represented by a directed graph G =


(X, E), where (vi , vj ) is a directed edge from vi to vj if and only if cij ̸= 0.
The adjacency set of a vertex vi in G is defined as

AdjG (vi ) = {vj ∈ V | vi and vj are adjacent}.

The degree of a vertex vi is the number of vertices adjacent to vi and is denoted by |AdjG (vi )|.
A graph Ḡ = (V̄ , Ē) is a subgraph of G = (V, E) if V̄ ⊂ V and Ē ⊂ E. A walk of length k
in an undirected graph is an ordered sequence of k + 1 vertices v1 , . . . , vk+1 such that

(vi , vi+1 ) ∈ E, i = 1, . . . , k.
A walk is a path if all edges are distinct. If v1 = vk+1 , the walk is closed and called a cycle.
Two nodes are said to be connected if there is a path between them. The distance between two
vertices in a graph is the number of edges in the shortest path connecting them. A graph is said
to be connected if every pair of distinct vertices is connected by a path. A graph that is not
connected consists of at least two separate connected subgraphs. If G = (V, E) is a connected
graph, then Y ⊂ V is called a separator if G becomes disconnected after the vertices in Y are
removed. A directed graph is strongly connected if there is a path between every pair of distinct nodes
along directed edges.
Graphs that do not contain cycles are called acyclic. A connected acyclic graph is called a
tree; see Figure 5.1.2. In a tree, any pair of vertices is connected by exactly one path. A tree
has at least two vertices of degree 1. Such vertices are called leaf vertices. If a graph G is
connected, then a spanning tree is a subgraph of G that is a tree. In a tree an arbitrary vertex r
can be specified as the root of the tree. Then the tree becomes a directed rooted tree. An edge
(vi , vj ) ∈ G in a directed rooted tree is a directed edge from vi to vj if there is a path from vi to
r such that the first edge of this path is from vi to vj . If there is a directed edge from vi to vj ,
then vj is called the parent of vi , and vi is said to be a child of vj .
A labeling (or ordering) of a graph G with n vertices is a mapping of the integers {1, 2, . . . , n}
onto the vertices of G. The integer i assigned to a vertex is called the label (or number) of that
vertex. A labeling of the adjacency graph of a symmetric matrix C corresponds to a particular
ordering of its rows and columns. For example, the graph of the structurally symmetric matrix

    [ ×  ×     ×  ×
      ×  ×  ×
         ×  ×        ×  ×
      ×        ×                                                       (5.1.2)
      ×           ×
            ×        ×
            ×           × ]
is given in Figure 5.1.2.

Figure 5.1.2. The labeled graph G(C) of the matrix in (5.1.2).

The matrix of normal equations can be written as

    C = ATA = Σ_{i=1}^{m} ai aTi ,                                     (5.1.3)

where aTi is the ith row of A. This expresses ATA as the sum of m matrices of rank one. By
the no-cancellation assumption, the nonzero structure of ATA is the direct sum of the nonzero
elements of the matrices ai aTi , i = 1, . . . , m. It follows that the graph G(ATA) is the direct sum
of the graphs G(ai aTi ), i = 1, . . . , m. That is, its edges are the union of all edges not counting
multiple edges. In G(ai aTi ) all pairs of vertices are connected. Such a graph is called a clique
and corresponds to a dense submatrix in ATA. Clearly, the structure of ATA is not changed by
dropping any row of A whose nonzero structure is a subset of another row.
By the no-cancellation assumption, an alternative characterization of G(ATA) is that
(ATA)jk ≠ 0 if and only if columns j and k intersect, i.e., if aij ≠ 0 and aik ≠ 0 for at
least one row i ∈ {1, . . . , m}. Therefore, G(ATA) is also known as the column intersection
graph of A.
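In MATLAB the column intersection structure can be formed directly from the pattern of A; a
minimal sketch (not from the book), valid under the no-cancellation assumption:

    B = spones(A);          % nonzero pattern of A
    C = spones(B'*B);       % C(j,k) = 1 exactly when columns j and k intersect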

5.1.4 A Graph Model of Cholesky Factorization


The use of graphs to symbolically model the Cholesky factorization of a sparse symmetric matrix
C is due to Parter [886, 1961], although the details were later given by Rose [935, 1972]. In
sparse Cholesky factorization each row of the Cholesky factor R depends on only a subset of the
previous rows. The fill in the factorization can be analyzed by recursively forming a sequence of
elimination graphs
G0 = G(C), Gi (C), i = 1, . . . , n − 1,
where Gi (C) is formed from Gi−1 (C) by removing vertex i and its incident edges and adding
fill edges that make adjacent vertices in Gi−1 (C) pairwise adjacent. The sequence of elimination
graphs for the matrix in (5.1.2) is pictured in Figure 5.1.3. It can be verified that the number of
fill elements (edges) is 10.

Figure 5.1.3. Sequence of elimination graphs of the matrix in (5.1.2).

The filled graph GF (C) is a graph with n vertices and edges corresponding to all graphs
G0 = G, Gi , i = 1, . . . , n − 1. It bounds the structure of the Cholesky factor R:

G(RT + R) ⊂ GF (C). (5.1.4)

Under a no-cancellation assumption, (5.1.4) holds with equality.

Theorem 5.1.1. Let G(C) = (V, E) be the undirected graph of the symmetric matrix C. Then
(vi , vj ) is an edge of the filled graph GF (C) if and only if (vi , vj ) ∈ E, or there is a path in
G(C) from vertex i to vertex j passing only through vertices vk with k < min(i, j).

The filled graph GF (C) can be characterized by an elimination tree T (C) that captures the
row dependencies in the Cholesky factorization.

Definition 5.1.2. The elimination tree of a symmetric matrix C ∈ Rn×n with Cholesky factor R
is a directed rooted tree T (C) with n vertices labeled from 1 to n, where vertex p is the parent
of node i if and only if
    p = min{ j > i | rij ≠ 0 }.
T (C) is uniquely represented by the parent vector PARENT[i], i = 1, . . . , n, of the
vertices of the tree. For example, the parent vector of the elimination tree in Figure 5.1.4 is
(5, 5, 6, 6, 7, 7, 8, 9, 0).
T (C) can be obtained from the filled graph GF (C) by introducing directed edges from lower
to higher numbered vertices. A directed edge from node j to node i > j indicates that row i
depends on row j. This row dependency is exhibited by the following transitive reduction of
GF (C). If there is a directed path from j to i of length greater than one, the edge from j to
i is redundant and thus removed. The elimination tree is generated by the removal of all such
redundant edges. For the 12 × 9 matrix A in (5.2.4), which arises from a 3 × 3 mesh problem, the
filled graph GF (ATA) equals G(ATA) with an added edge between vertices 7 and 9. The result of
the transitive reduction and the elimination tree is shown in Figure 5.1.4.

Figure 5.1.4. The transitive reduction and elimination tree T (ATA).

The node ordering of an elimination tree is such that children vertices are numbered before
their parent node. Such orderings are called topological orderings. All topological orderings
of the elimination tree are equivalent in the sense that they give the same triangular factor R. A
postordering is a topological ordering in which a parent node j always has node j − 1 as one of
its children. For example, the ordering of the vertices in the tree in Figure 5.1.4 can be made into
a postordering by exchanging labels 3 and 5. Postorderings can be determined by a depth-first
search; see Liu [755, 1990]. An important advantage of using a postordering is that it simplifies
data management, and the storage is reduced.
Elimination trees play a fundamental role in symmetric sparse matrix factorization and pro-
vide, in compact form, all information about the row dependencies. Schreiber [974, 1982] was
the first to note this and to define elimination trees formally. In the excellent review of the
generation and use of elimination trees by Liu [755, 1990] an efficient algorithm is given that
determines the elimination tree of C in time proportional to nnz(R) and in storage proportional
to nnz(C). When C = ATA it is possible to predict the structure of R directly without forming
C; see Gilbert, Moler, and Schreiber [469, 1992].
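In MATLAB both the elimination tree and the predicted column counts of R can be obtained from A
in this way; a minimal sketch (not from the book), using the built-in functions etree and symbfact:

    parent = etree(A, 'col');      % parent vector of the elimination tree T(A'*A)
    colcnt = symbfact(A, 'col');   % predicted number of nonzeros per column of R
    nnzR   = sum(colcnt);          % predicted nnz(R)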

5.1.5 Minimizing Fill in Cholesky Factorization


The sparsity of the Cholesky factor R of AT A depends on the ordering of the columns of A
but not on the ordering of the rows. In the symbolic phase of sparse Cholesky factorization a
column permutation P is determined that tries to limit the fill in R. It would be desirable to find
a column ordering that minimizes the number of nonzeros in R. However, this is known to be an
NP-complete problem, i.e., it cannot be solved in polynomial time. Even the problem of finding
an ordering that minimizes the bandwidth of R is NP-complete; see Papadimitriou [876, 1976].
Hence we have to rely on heuristic methods. A simple algorithm is to order the columns in A
by increasing number of nonzero elements in the columns. More effective methods can be
obtained by using information from successively reduced submatrices.
We first consider ordering methods that have the objective of minimizing the bandwidth of
ATA or rather the number of elements in the envelope of C = ATA. Recall that by Theo-
rem 4.1.6, zeros outside the envelope of C will not suffer fill in the Cholesky factorization. Such
ordering methods often perform well for matrices that come from one-dimensional problems or
problems that are in some sense tall and thin. The most widely used such method is the Cuthill–
McKee (CM) algorithm [281, 1969]. It starts by finding a peripheral vertex r in G(C) and
numbering it as vertex 1. It then performs a breadth-first search using a level structure
rooted at r. This partitions the vertices into disjoint levels

    L1 (r) = {r}, L2 (r), . . . , Lh (r),

where Li (r), i ≤ h, is the set of all vertices adjacent to Li−1 (r) that are not in Lj (r), j =
1, . . . , i − 1. The search ends when all vertices are numbered.

Algorithm 5.1.1 (Cuthill–McKee Ordering).

1. Select a peripheral vertex r, and set L1 = (r).

2. for i = 1, 2, . . . until all vertices are numbered do:

        Find all unnumbered vertices in Adj (Li ) and number them
        in order of increasing degree.

George [453, 1971] observed that the ordering obtained by reversing the Cuthill–McKee
(RCM) ordering gives the same bandwidth and usually results in less fill. An implementation of
the RCM algorithm with run-time proportional to the number of nonzeros in the matrix is given
by Chan and George [230, 1980]. The performance of the RCM ordering strongly depends on the
choice of the starting peripheral node. George and Liu [457, 1981] recommend a strategy where
a node of maximal or nearly maximal eccentricity is chosen as a starting node. The eccentricity
of a node is defined as
    ℓ(x) = max_{y∈X} d(x, y),

where d(x, y) is the length of the shortest path between the two vertices x and y in G = (X, E).
A substantially faster algorithm for bandwidth minimization is the GPS algorithm by Gibbs,
Poole, and Stockmeyer [467, 1976]. Lewis [738, 1982], [737, 1982] describes some improve-
ments to the implementation of the GPS and other profile reduction algorithms. Sloan [1004,
1986] gives a new algorithm that generally results in a smaller profile. Among other proposed
methods for envelope reduction, we mention the spectral method of Barnard, Pothen, and Si-
mon [81, 1995]. This uses the eigenvector of the smallest positive eigenvalue of the Laplacian
matrix corresponding to the given matrix.
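A minimal MATLAB sketch (not from the book) of the effect of such an envelope-reducing ordering,
using the built-in reverse Cuthill–McKee function symrcm on a symmetric matrix C:

    [i, j] = find(C);
    bw0 = max(abs(i - j));         % bandwidth with the original ordering
    p = symrcm(C);                 % reverse Cuthill-McKee permutation
    [i, j] = find(C(p, p));
    bw1 = max(abs(i - j));         % bandwidth after reordering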
The minimum degree (MD) ordering is the most widely used heuristic method for limiting
fill. The name “minimum degree” comes from the graph-theoretic formulation of the Cholesky
algorithm first given by Rose [935, 1972]; see Section 5.1.3. MD is a greedy method that in
each step of the Cholesky factorization selects the next pivot as the sparsest row and column.
It is a symmetric analogue of an ordering algorithm proposed by Markowitz [778, 1957] for
unsymmetric matrices with applications to linear programming. This algorithm was adapted for
symmetric matrices by Tinney and Walker [1064, 1967]. The MD ordering minimizes the local
arithmetic and amount of fill that occurs but in general will not minimize the global arithmetic or
fill. However, it has proved to be very effective in reducing both of these objectives. If the graph
of C = ATA is a tree, then the MD ordering results in no fill.

Algorithm 5.1.2 (Minimum Degree Ordering).


Set G(0) = G(C) and determine the degrees of its vertices.
for i = 1, . . . , n − 1 do
1. Select a vertex of minimal degree in G(i−1) as pivot.
2. Construct the elimination graph G(i) and update the degrees of its vertices.

For the matrix C in (5.1.2) the initial elimination order indicated gives 10 fill-in elements.
A minimum degree ordering such as 4, 5, 6, 7, 1, 2, 3 will result in no fill-in. Because several
vertices in the graph have degree 1, the minimum degree ordering is not unique.
If there is more than one vertex of minimum degree at a particular step, a tie-breaking rule is
needed to choose the pivot among them. The way this tie-breaking is done can be important.
Examples are known for which the minimum degree ordering will give poor results if the tie-
breaking is systematically done poorly. It is an open question how good the orderings are if ties
are broken randomly. A matrix for which the minimum degree algorithm is not optimal is given
by Duff, Erisman, and Reid [344, 1986]. If the minimum degree node 5 is eliminated first, fill
will occur in position (4, 6). But with the elimination ordering shown in Figure 5.1.5, there is
no fill. In the multiple minimum degree algorithm (MMD) by George and Liu [459, 1989] a
refinement of the elimination graph model is used to decrease the number of degree updates. The
vertices Y = {y1 , . . . , yp } are called indistinguishable if they have the same closed adjacency
sets (the adjacency set together with the vertex itself), i.e.,

    Adj (yi ) ∪ {yi } = Adj (yj ) ∪ {yj },    1 ≤ i, j ≤ p.

If one of these vertices is eliminated, the degrees of the remaining vertices in the set will de-
crease by one, and they all will become of minimum degree. Hence, all vertices in Y can be
eliminated simultaneously, and the graph transformation needs to be updated only once. Indeed,
indistinguishable vertices can be merged and treated as one supernode. For example, the graph
in Figure 5.1.5 contains two sets {1, 2, 3} and {7, 8, 9} of supervertices.

Figure 5.1.5. The graph of a matrix for which minimum degree is not optimal.

The structure of each row in A ∈ Rm×n corresponds to a clique in the graph of C = ATA.
Thus, the generalized element approach can be used to represent C as a sequence of cliques.
This allows the minimum degree algorithm for ATA to be implemented directly from A without
forming the structure of C = ATA, with resulting savings in work and storage.
The most costly part of the minimum degree algorithm is recomputation of the degree of the
vertices adjacent to the new pivot column. In the approximate minimum degree (AMD) algo-
rithm an upper bound on the exact minimum degree is used instead, which leads to substantial
savings in run-time, especially for very irregularly structured matrices. It has no effect on the
quality of the ordering; see Amestoy, Davis, and Duff [18, 2004].
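A minimal MATLAB sketch (not from the book) comparing the fill predicted for the Cholesky factor
of ATA under different column orderings of A; colperm and colamd are built-in, and symbfact
performs the symbolic factorization:

    n0 = sum(symbfact(A, 'col'));             % natural column order
    p1 = colperm(A);                          % columns sorted by nonzero count
    n1 = sum(symbfact(A(:, p1), 'col'));
    p2 = colamd(A);                           % approximate minimum degree ordering
    n2 = sum(symbfact(A(:, p2), 'col'));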
5.2 Sparse QR Factorization


5.2.1 Predicting Structure in Sparse QR
The factor R in the QR factorization of A is the same as the Cholesky factor of ATA. Therefore,
it may seem that the same symbolic algorithm can be used to determine the structure of R in
both Cholesky and QR factorizations. If A contains at least one dense row, then by the no-
cancellation assumption it follows that ATA is full, and we conclude that the Cholesky factor R
is full. However, this may considerably overestimate the number of nonzeros in R. For example,
consider the matrices with nonzero structures,

    A1 = [ ×  ×  ×  ×  ×             A2 = [ ×  ×  ×  ×  ×
           ×  ×                                ×  ×
              ×  ×                                ×  ×                 (5.2.1)
                 ×  ×                                ×  ×
                    ×  ×                                × ]
                       × ]
For both A1 and A2 the matrix of normal equations is full. But A2 is already upper triangular,
and hence R = A2 is sparse. For A2 there will be cancellation in the Cholesky factorization of
AT2 A2 that will occur irrespective of the numerical values of the nonzero elements. This is called
structural cancellation, in contrast to numerical cancellation that occurs only for certain values
of the nonzero elements. Coleman, Edenbrandt, and Gilbert [262, 1986] prove that if A has the
so-called strong Hall property, then structural cancellation cannot occur.

Definition 5.2.1. A matrix A ∈ Rm×n , m ≥ n, is said to have the strong Hall property if for
every subset of k columns, 0 < k < n, the corresponding submatrix has nonzeros in at least
k + 1 rows. That is, when m > n, every subset of k ≤ n columns has the required property, and
when m = n, every subset of k < n columns has the property.

Theorem 5.2.2. Let A ∈ Rm×n , m ≥ n, have the strong Hall property. Then the structure of
ATA correctly predicts that of R, excluding numerical cancellations.

Note that A2 in (5.2.1) does not have the strong Hall property because the first column has
only one nonzero element. However, the matrix à obtained by deleting the first column has this
property.
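A minimal MATLAB sketch (not from the book) of this overestimate for a pattern like A2 above:
the symbolic factorization of AT2 A2 predicts a completely full R, while the computed R has no
fill at all. The numerical values used here are arbitrary placeholders:

    n  = 5;
    A2 = sparse(eye(n) + diag(ones(n-1, 1), 1));  % upper bidiagonal pattern
    A2(1, :) = 1;                                 % dense first row
    predicted = sum(symbfact(A2, 'col'));         % n*(n+1)/2: a full triangular factor
    [Q, R] = qr(A2);                              % sparse QR factorization
    actual = nnz(R);                              % equals nnz(A2); no fill occurs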
The structure of R in QR factorization can also be predicted by performing the Givens or
Householder QR algorithm symbolically. In Givens QR factorization, the intermediate fill can
be modeled using the bipartite graph G(A) = {R, C, E}. Here the two sets of vertices

    R = {r1 , . . . , rm },    C = {c1 , . . . , cn }

correspond to the rows and columns of A. E is a set of edges {ri , cj } connecting a
vertex in R to one in C, and {ri , cj } ∈ E if and only if aij is nonzero; see George, Liu, and Ng
[465, 1984] and Ostrouchov [851, 1987]. The following result is due to George and Heath [455,
1980].

Theorem 5.2.3. The structure of R as predicted by a symbolic factorization of ATA includes the
structure of R as predicted by the symbolic Givens method.
Manneback [772, 1985] proves that the structure predicted by a symbolic Householder algo-
rithm is strictly included in the structure predicted from ATA. Symbolic Givens and Householder
factorizations can also overestimate the structure of R. An example where structural cancellation
occurs for the Givens rule is shown by Gentleman [452, 1976].

5.2.2 The Row Sequential QR Algorithm


One of the first algorithms for sparse QR factorization was given by George and Heath [455,
1980]. In this, a symbolic factorization of ATA is used to form a static data structure for R. Then
A is merged into this data structure row by row using Givens rotations, avoiding intermediate
fill-in.
Although the final factor R in QR factorization is independent of the ordering of the rows
in A, the row ordering can significantly affect the intermediate fill and the number of Givens
rotations needed to compute the factorization. This fact was stressed in the treatment of QR
algorithms for band matrices; see Section 4.1. An extreme example is shown below, where the
cost for factorizing A is reduced from O(mn^2) to O(n^2):

    A = [ ×  ×  ×  ×  ×              P A = [ ×  ×
          ×  ×                               ×  ×
          ×  ×                               ⋮  ⋮                      (5.2.2)
          ⋮  ⋮                               ×  ×
          ×  × ]                             ×  ×  ×  ×  × ]
Several heuristic algorithms for determining a row ordering have been suggested. The fol-
lowing is an extension of the row ordering recommended for band sparse matrices.

Algorithm 5.2.1 (Row-Ordering Algorithm).


Denote the column index for the first and last nonzero elements in the ith row of A by fi (A)
and ℓi (A), respectively.

1. Sort the rows by increasing fi (A), so that fi (A) ≤ fk (A) if i < k.

2. For each group of rows with fi (A) = k, k = 1, . . . , maxi fi (A), sort all the rows by
increasing ℓi (A).
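A minimal MATLAB sketch (not from the book) of this row ordering, assuming A has no zero rows:

    [m, ~] = size(A);
    f = zeros(m, 1);  l = zeros(m, 1);
    for i = 1:m
        cols = find(A(i, :));          % column indices of the nonzeros in row i
        f(i) = cols(1);  l(i) = cols(end);
    end
    [~, prow] = sortrows([f, l]);      % sort by f(i), then by l(i)
    PA = A(prow, :);                   % reordered matrix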

If Algorithm 5.2.1 is applied to A in (5.2.2), the good row ordering P A is obtained. The
algorithm will not in general determine a unique ordering. One way to resolve ties is to consider
the cost of symbolically rotating a row aTi into all other rows with a nonzero element in column
ℓi (A). By cost we mean the total number of new nonzero elements created. The rows are then
ordered according to ascending cost. With this ordering it follows that rows 1, . . . , fi (A) − 1 in
Ri−1 will not be affected when the remaining rows are processed. Therefore these rows are the
final first fi (A) − 1 rows in R.
An alternative that has been found to work well in some contexts is to order the rows by
increasing values of ℓi (A). When row aTi is processed using this ordering, all previous rows
processed have nonzeros only in columns up to at most ℓi (A). Hence, only columns fi (A) to
ℓi (A) of Ri−1 will be involved, and Ri−1 has zeros in columns ℓi (A) + 1, . . . , n. No fill will be
generated in row aTi in these columns.
Liu [754, 1986] introduced the notion of row merge tree for structuring this operation. Let
R0 be initialized to have the structure of the final R with all elements equal to zero. Denote
by Rk−1 ∈ Rn×n the upper triangular matrix obtained after processing the first k − 1 rows.
At step k the kth row of A is first uncompressed into a full vector aTk = (ak1 , ak2 , . . . , akn ).
The nonzero elements akj ̸= 0 are annihilated from left to right by plane rotations involving
rows j < k in Rk−1 . This may create new nonzeros in both Rk−1 and in the current row aTk .
Note that if rjj = 0 in Rk−1 , this means that this row in Rk−1 has not yet been touched by any
rotation, and hence the entire jth row must be zero. When this occurs the remaining part of row k
can just be inserted as the jth row in Rk−1 . The algorithm is illustrated below using an example
from George and Ng [460, 1983].
Assume that the first k − 1 rows of A have been processed to generate Rk−1 . Nonzero elements
of Rk−1 are denoted by ×, nonzeros introduced into Rk−1 and aTk during the elimination of aTk are
denoted by +, and elements involved in the elimination of aTk are circled; thus ⊗ marks an original
nonzero and ⊕ a new nonzero taking part in the elimination. Nonzero elements created in aTk during
the elimination are ultimately annihilated. The sequence of row indices involved in the elimination
is {2, 4, 5, 7, 8}, where 2 is the column index of the first nonzero in aTk . Note that unlike in the
Householder method, intermediate fill only takes place in the row being processed:
    [ Rk−1 ]     [ ×  0  ×  0  0  ×  0  0  0  0 ]
    [      ]  =  [    ⊗  0  ⊕  ⊗  0  0  0  0  0 ]
    [ aTk  ]     [       ×  0  ×  0  0  0  ×  0 ]
                 [          ⊗  ⊕  0  ⊗  0  0  0 ]
                 [             ⊗  0  ⊕  0  0  0 ]                      (5.2.3)
                 [                               ]
                 [                   ⊗  ⊗  0  0 ]
                 [                      ⊗  0  0 ]
                 [                         ×  × ]
                 [                            × ]
                 [ 0  ⊗  0  ⊗  ⊕  0  ⊕  ⊕  0  0 ]
From Theorem 5.2.3 it follows that if the structure of R has been predicted from that of
ATA, any intermediate matrix Ri−1 will fit into the predicted structure. The plane rotations can
be applied simultaneously to a right-hand side b to form QT b. In the original implementation the
Givens rotations are discarded after use. Hence, only enough storage to hold the final R and a
few extra vectors for the current row and right-hand side(s) is needed in main memory.
Gilbert et al. [468, 2001] give an algorithm to predict the structure of R working directly from
G(A). This algorithm runs in time proportional to nnz(A) and makes the step of determining the
structure of ATA redundant.
Variable row pivoting methods are studied by Gentleman [450, 1973], Duff [340, 1974], and
Zlatev [1151, 1982]. These schemes have never become very popular because they require a
dynamic storage structure and are complicated to implement.

5.2.3 Multifrontal Methods


The multifrontal method, introduced by Duff and Reid [352, 1983], is a method for solving
a linear system Ax = b that organizes the factorization of a sparse matrix into a sequence of
partial factorizations of smaller independent dense subproblems that can be solved in parallel.
This gives good data locality and lower communication costs. The following theorem, which is
the basis for the multifrontal method, is due to Duff [343, 1986].
Theorem 5.2.4. Let T [j] denote the subtree rooted in node j. Then if k ̸∈ T [j], columns k and j
can be eliminated independently of each other.

If T [i] and T [j] are two disjoint subtrees of T (C), columns s ∈ T [i] and t ∈ T [j] can be
eliminated in any order. The elimination tree prescribes an order relation for the elimination of
columns in the QR factorization, i.e., a column associated with a child node must be eliminated
before the parent column. On the other hand, columns associated with different subtrees of T (C)
are independent and can be eliminated in parallel.
Liu [754, 1986] developed a multifrontal QR algorithm that generalizes the row-oriented
Givens QR algorithm by using submatrix rotations. It achieves a significant reduction in time
at the cost of a modest increase in working storage. A modified version of this algorithm that
uses Householder reflections is given by George and Liu [458, 1987]. Supervertices and other
essential modifications of multifrontal methods are treated by Liu [755, 1990].
Nested dissection orderings have been discussed in Section 4.3.2. Such orderings for solving
general sparse positive definite systems have been analyzed by George, Poole, and Voigt [462,
1978] and George and Liu [457, 1981]. The use of such orderings for sparse least squares
problems is treated in George, Heath, and Plemmons [464, 1981] and George and Ng [460,
1983]. A planar graph is a graph that can be drawn in the plane without two edges crossing.
Planar graphs are known to have small balanced separators. Lipton, Rose, and Tarjan [753, 1979]
show that for any planar graph with n vertices there exists a separator with O(√n) vertices whose
removal leaves subgraphs with at most 2n/3 vertices each.
We illustrate the multifrontal QR factorization by the small 12 × 9 matrix

        [ ×           ×     ×  ×
          ×           ×     ×  ×
          ×           ×     ×  ×
             ×        ×        ×  ×
             ×        ×        ×  ×
             ×        ×        ×  ×
    A =         ×        ×  ×  ×                                       (5.2.4)
                ×        ×  ×  ×
                ×        ×  ×  ×
                   ×     ×     ×  ×
                   ×     ×     ×  ×
                   ×     ×     ×  × ]
This matrix arises from a 3 × 3 mesh problem using a nested dissection ordering of the nine grid
points.
First, a QR factorization of rows 1–3 is performed. These rows have nonzeros only in columns
{1, 5, 7, 8}. With the zero columns omitted, this operation can be carried out as a QR factoriza-
tion of a small dense matrix of size 3 × 4. The resulting first row equals the first of the final R
of the complete matrix and can be stored away. The remaining two rows form an update matrix
F1 and will be processed later. The other three block rows 4–6, 7–9, and 10–12 can be reduced
similarly in parallel. After this first stage the matrix has the form

        [ ×           ×     ×  ×
                      ×     ×  ×
                            ×  ×
             ×        ×        ×  ×
                      ×        ×  ×
                               ×  ×
                ×        ×  ×  ×
                         ×  ×  ×
                            ×  ×
                   ×     ×     ×  ×
                         ×     ×  ×
                               ×  × ] .
The first row in each of the four blocks is a final row in R and can be removed, which leaves four
upper trapezoidal update matrices F1 –F4 . In the second stage, F1 , F2 and F3 , F4 are simultaneously
merged into two upper trapezoidal matrices by eliminating columns 5 and 6. In merging F1 and F2 ,
only the set of columns {5, 7, 8, 9} needs to be considered. Reordering the rows by the index of the
first nonzero element
and performing a QR decomposition, we get

         [ ×  ×  ×     ]   [ ×  ×  ×  × ]
    QT   [ ×     ×  ×  ] = [    ×  ×  × ] .
         [    ×  ×     ]   [       ×  × ]
         [       ×  ×  ]   [          × ]
The first row in each reduced matrix is a final row in R and is removed. The merging of F3 and F4
is performed similarly. In the final stage the
remaining two upper trapezoidal (in this example, triangular) matrices are merged, giving the
final factor R. This corresponds to eliminating columns 7, 8, and 9.
In the multifrontal method the vertices in the elimination tree are visited in turn given by the
ordering. Each node xj in the tree is associated with a frontal matrix Fj that consists of the set
of rows Aj in A, with the first nonzero in location j, together with one update matrix contributed
by each child node of xj . After variable j in the frontal matrix is eliminated, the first row in the
reduced matrix is the jth row of the upper triangular factor R. The remaining rows form a new
update matrix Uj that is stored in a stack until needed.

For j = 1, . . . , n do

1. Form the frontal matrix Fj by combining the set of rows Aj and the update matrix Us for
each child xs of the node xj in the elimination tree T (ATA).

2. By an orthogonal transformation, eliminate variable xj in Fj to get Ūj . Remove the first


row in Ūj , which is the jth row in the final matrix R. The rest of the matrix is the update
matrix Uj .
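A minimal MATLAB sketch (not from the book) of step 2 for the first node of the 12 × 9 example
(5.2.4), where the three rows with first nonzero in column 1 live on the columns {1, 5, 7, 8}:

    cols  = [1 5 7 8];
    front = full(A(1:3, cols));     % 3-by-4 dense frontal matrix for node 1
    [~, T] = qr(front);             % small dense QR factorization
    Rrow1  = T(1, :);               % row 1 of the final R (on the columns in cols)
    U1     = T(2:3, 2:4);           % update matrix (called F1 in the example above)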

The frontal matrices in the multifrontal method are often too small for the efficient use of
vector processors and matrix-vector operations in the solution of the subproblems. Therefore, a
useful modification of the multifrontal method is to amalgamate several vertices into one large
supernode. Instead of eliminating one column in each node, the decomposition of the frontal
matrices then involves the elimination of several columns, and it may be possible to use Level 2
or even Level 3 BLAS; see Dongarra et al. [328, 1990]. Vertices can be grouped together to
form a supernode if they correspond to a block of contiguous columns in the Cholesky factor,
where the diagonal block is fully triangular, and these rows all have identical off-block diago-
nal column structures. Because of the computational advantages of having large supervertices,
it is advantageous to relax this condition and also amalgamate vertices that satisfy this con-
dition if some local zeros are treated as nonzeros. A practical restriction is that if too many
vertices are amalgamated, then the frontal matrices become sparse. Note also that nonnumerical
operations often make up a large part of the total decomposition time, which limits the possi-
ble gain. For a discussion of supervertices and other modifications of the multifrontal method,
see Liu [755, 1990].
For a K × K grid problem with n = K^2, m = s(K − 1)^2 it is known that nnz(R) =
O(n log n), but Q has O(n√n) nonzeros; see George and Ng [461, 1988]. Hence if Q is needed,
it should not be stored explicitly but represented by the Householder vectors of the frontal or-
thogonal transformations. Lu and Barlow [761, 1996] show that these require only O(n log n)
storage. In many implementations the orthogonal transformations are not stored. Then the cor-
rected seminormal equations (see Section 2.5.4) can be used to treat additional right-hand sides.

5.2.4 Software for Sparse Least Squares Problems


Early software, such as YSMP by Eisenstat et al. [362, 1982] and SPARSPAK by George and
Ng [463, 1984], focused on the Cholesky algorithm. A Fortran subroutine LLSS01 that performs
sparse QR factorization by fast Givens rotations was developed by Zlatev and Nielsen [1152,
1979]. The orthogonal matrix Q is not stored, and elements in R smaller than a user-specified
tolerance are dropped. Other early QR factorization codes were developed by Lewis, Pierce, and
Wah [739, 1989], Matstoms [784, 1992], and Puglisi [907, 1993]. Matstoms [785, 1994] de-
velops multifrontal concepts for QR factorization such as the use of supernode elimination tree
and node amalgamation for increasing the efficiency. Sun [1046, 1996] gives a parallel multi-
frontal algorithm for sparse QR factorization suitable for distributed-memory multiprocessors. A
multifrontal sparse rank-revealing QR factorization/least squares solution module by Pierce and
Lewis [894, 1997] is included in the commercial software package BCSLIB-EXT from Boeing
Information and Support Services. This library is also available to researchers in laboratories and
academia for testing.
The SuiteSparse collection by Davis [290, 2011] includes a “state-of-the-art” sparse Cholesky
factorization and a multithreaded multifrontal sparse QR algorithm and is available at www.
netlib.org. The Ceres nonlinear least squares solver for three-dimensional imagery in Google
Earth relies on the sparse Cholesky factorization in SuiteSparse. The multithreaded multifrontal
sparse QR factorization QR MUMPS by Buttari [195, 2013] builds on earlier implementations
of Puglisi and Matstoms; see Amestoy, Duff, and Puglisi [19, 1996].
The design of sparse matrix storage and computations for MATLAB is described by Gilbert,
Moler, and Schreiber [469, 1992]. A matrix is stored as either full or sparse. Conversion between
full and sparse storage modes is done by the inbuilt functions sparse and full. A sparse matrix
is stored in CSC format as the concatenation of the sparse vectors representing its columns. This
makes it efficient to scan the columns of a sparse matrix but very inefficient to scan its rows.
To facilitate a basic sparse operation, a sparse accumulator (SPA) that allows random access
to the currently active column or row of a matrix is used. The SPA consists of a dense vector
of real (or complex) values, a dense vector of true/false “occupied flags,” and an unordered
list of the indices whose occupied flags are true. Almost all sparse operations are performed
as operations between the SPA and a sparse vector. This allows many vector operations to be
carried out in time proportional to the number of nonzero elements in the vector. Factorizations
such as LU, Cholesky, and QR of a sparse matrix yield sparse results but, otherwise, behave as
the corresponding dense MATLAB operations.
Several column reorderings are available in MATLAB for making the Cholesky
and QR factors more sparse. MATLAB stores a permutation as a vector p containing a per-
mutation of 1, 2, . . . , n such that A(:,p) is the matrix with permuted columns. The function
p = colperm(A) computes a permutation that sorts the columns so that they have increas-
ing nonzero count. A column minimum degree ordering is given by
p = colmmd(A).
To solve a sparse least squares problem using a minimum degree ordering of the columns of
A and the corrected seminormal equations (CSNE; see Section 2.5.4), one writes the following
in MATLAB.

Algorithm 5.2.2 (Sparse Least Squares Solution by CSNE).

function [x,r] = sparselsq(A,b);
% ------------------------------------------------------
q = colmmd(A);           % Minimum degree ordering of the columns of A
A = A(:,q);              % Reorder columns of A
R = chol(A'*A);          % Sparse Cholesky factor of A'*A
x = R\(R'\(A'*b));       % Least squares solution from the normal equations
r = b - A*x;             % Residual
x = x + R\(R'\(A'*r));   % Correction step (CSNE)
r = b - A*x;             % Corrected residual
x(q) = x;                % Permute solution back to the original ordering

SuiteSparseQR is also available in MATLAB. This allows large-scale sparse least squares
problems to be solved. The function [Q,R,p] = qr(A,'vector') returns the m × n factor R
and m × m factor Q such that A(:,p) = Q*R. Because Q often is not very sparse, a better choice
is to solve one or more least squares problems min ∥AX − B∥ using [C,R,p] = qr(A,B,0)
and X(p,:) = R\C.
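A minimal sketch of the usage just described, solving min ∥Ax − b∥2 without forming Q explicitly:

    [C, R, p] = qr(A, b, 0);        % C = Q'*b, sparse R, and column permutation p
    x = zeros(size(A, 2), 1);
    x(p) = R \ C;                   % back-substitution and inverse permutation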

Example 5.2.5. We study the effect of two different column orderings in the QR factorization of
a sparse matrix A arising in a study of substance transport in rivers by Elfving and Skoglund [384,
2007]. Figures 5.2.1 and 5.2.2 show the location of nonzero elements in AP and R using two
different column orderings available in MATLAB. With colperm the columns are ordered by
increasing nonzero count, which gives nnz(R) = 32,355. The colamd ordering gives nnz(R) =
15,903, a great improvement.

5.2.5 Rank-Revealing Sparse QR Factorization


For dense matrices A, rank-deficient problems are handled by column pivoting in the QR fac-
torization of A; see Section 2.3.2. In sparse QR factorization the initial column ordering is
chosen to produce a sparse R-factor in advance of any numerical computation. If the col-
umns are reordered, the updated R will in general not fit into the previously generated storage
structure.
If the row sequential QR algorithm is applied in exact arithmetic to a matrix A of rank r < n,
a row is inserted into R only when it makes the diagonal entry nonzero. Hence the resulting R-
factor must have n − r zero diagonal elements. Processing additional rows can only increase the
absolute value of the diagonal elements. Hence if the final R has a zero diagonal element, all
elements in this row are zero, and the final R will have the form depicted in Figure 5.2.3. By
permuting the zero rows of R to the bottom, and the columns of R corresponding to the zero
diagonal elements to the right, we obtain R in rank-revealing form.

Figure 5.2.1. Nonzero pattern of a sparse matrix A and the factor R in its QR factorization
using the MATLAB colperm reordering. Used with permission of Springer International Publishing; from
Numerical Methods in Matrix Computations, Björck, Åke, 2015; permission conveyed through Copyright
Clearance Center, Inc.


Figure 5.2.2. Nonzero pattern of a sparse matrix A and the factor R in its QR factorization using
the MATLAB colamd column ordering. Used with permission of Springer International Publishing; from
Numerical Methods in Matrix Computations, Björck, Åke, 2015; permission conveyed through Copyright
Clearance Center, Inc.
Figure 5.2.3. Structure of upper triangular matrix R for a rank-deficient matrix.

In finite-precision, the computed R usually will not have any zero diagonal element, even
when rank(A) < n. Although the rank is often revealed by the presence of small diagonal
elements, this does not imply that the rest of the rows are negligible. Heath [596, 1982] suggests
the following postprocessing of R. Starting from the top, the diagonal of R is examined for
small elements. In each row whose diagonal element falls below a certain tolerance, the diagonal
element is set to zero. The rest of the row is then reprocessed, zeroing out all its other nonzero
elements. This might increase some previously small diagonal elements in rows below, which
is why one has to start from the top. We again end up with a matrix of the form shown in
Figure 5.2.3. However, it may be that R is numerically rank-deficient yet has no small diagonal
element.
Pierce and Lewis [894, 1997] develop a rank-revealing algorithm for sparse QR factorizations
based on techniques similar to those in Section 2.3.5. The factorization proceeds by columns,
and inverse iteration is used to determine ill-conditioning. Let Rj = (r1 , . . . , rj ) be the matrix
formed by the first j columns of the final R. Assume that Rj is not too ill-conditioned, but
Rj+1 = ( Rj rj+1 ) is found to be almost rank-deficient. Then column rj+1 is permuted to the
last position, and the algorithm is continued. This may happen several times during the numerical
factorization. At the end we obtain a QR factorization

    ( A1   A2 ) = Q [ R11  R12 ]
                    [  0    S  ] ,
where R11 ∈ Rr×r is well-conditioned. In general R12 and S will be dense, but provided n − r ≪ n,
this is often acceptable. An important fact stated in the theorem below is that R11 will always
fit into the storage structure predicted for R. The following theorem is implicit in Foster [424,
1986].

Theorem 5.2.6. Let A = (a1 , a2 , . . . , an ), and let

AFk = [aj1 , aj2 , . . . , ajr ], 1 ≤ j1 < j2 < · · · < jr ≤ n

be a submatrix of A. Denote the Cholesky factors of ATA and ATFk AFk by R and RFk , respec-
tively. Then the nonzero structure of RFk is included in the nonzero structure predicted for R
under the no-cancellation assumption.

Proof. Let G = G(X, E) be the ordered graph of ATA. The ordered graph GFk = GFk (XFk ,
EFk ) of ATFkAFk is obtained by deleting all vertices in G not in Fk = [j1 , j2 , . . . , jr ] and all
edges leading to the deleted vertices. Then (RF )ij ̸= 0 only if there exists a path in GFk from
node i to node j (i < j) through vertices with numbers less than i. If such a path exists in GFk ,
it must exist also in G, and hence we will have predicted Rij ̸= 0.
5.3 Special Topics


5.3.1 Mixed Sparse-Dense Least Squares Problems
In some applications, least squares problems arise in which A ∈ Rm×n is sparse, except for
a small number of dense rows representing additional coupling terms. If the dense rows are
ordered last, such problems have the form

                            [ As ]     [ bs ]
    min ∥Ax − b∥2 = min  ∥  [ Ad ] x − [ bd ] ∥  ,                     (5.3.1)
     x               x                         2

where As ∈ Rms ×n is sparse, Ad ∈ Rmd ×n is dense, and m = ms + md with md ≪ n. Then


ATA = ATsAs +ATdAd , and its Cholesky factor will be dense. For large-scale mixed sparse-dense
least squares problems, standard sparse Cholesky and QR methods are not practical because of
their high memory and computing demands. We remark that finding a good partitioning between
sparse and dense equations can be a nontrivial problem.
We now describe an updating method by Heath [596, 1982] in which dense rows are first
withheld from the QR factorization. The solution (not the factorization) to the sparse subproblem
is then updated to incorporate the dense rows. Let the QR factorization of the sparse subproblem
be
                        [ Rs  cs ]
    ( As   bs ) = Qs    [  0  ds ] ,                                   (5.3.2)
where Qs need not be formed or saved. If rank(As ) = n, then Rs is nonsingular, and the
solution y to the sparse subproblem can be obtained from Rs y = cs . Setting x = y + z and
noting that cs − Rs y = 0, we can write the residual of the sparse subproblem as

                              [ −Rs z ]
    rs = bs − As (y + z) = Qs [  ds   ] .
It follows that z solves

          [ Rs ]     [ 0  ]
    min ∥ [ Ad ] z − [ rd ] ∥  ,      rd = bd − Ad y .                 (5.3.3)
     z                       2

By the change of variables

    u = Rs z ,    v = rd − Cd u ,    Cd = Ad Rs−1 ∈ Rmd×n ,

this problem is seen to be that of finding the least-norm solution of the linear system

                   [ v ]
    ( Imd   Cd )   [ u ] = rd .                                        (5.3.4)
This problem is the same as the small least squares problem

          [ Imd ]     [ rd ]
    min ∥ [ CdT ] v − [ 0  ] ∥  ,      u = CdT v ,                     (5.3.5)
     v                        2

which can be solved by QR factorization of the (md + n) × md matrix. Note that for both
problems the normal equations for v are

    (Cd CdT + Imd ) v = rd .

Finally, the solution x to problem (5.3.1) is x = y + z, where z is found by solving Rs z = u.


The algorithm is summarized below.
Algorithm 5.3.1 (Solving a Sparse-Dense Least Squares Problem by Updating).

    1. Compute the sparse QR factorization (5.3.2) of ( As  bs ), giving Rs , cs , and ds .

    2. Solve Rs y = cs for y and form the residual rd = bd − Ad y.

    3. Compute Cd ∈ Rmd×n from RsT CdT = ATd .

    4. Solve the full-rank least squares problem (5.3.5) for v and form u = CdT v.

    5. Form the solution x = y + z, where Rs z = u.
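A minimal MATLAB sketch (not from the book; the function name is illustrative) of Algorithm 5.3.1,
assuming rank(As) = n and that As is not too ill-conditioned:

    function x = sparse_dense_lsq(As, bs, Ad, bd)
    n = size(As, 2);   md = size(Ad, 1);
    [cs, Rs, ps] = qr(As, bs, 0);      % sparse QR of As; ps is a column permutation
    Adp = Ad(:, ps);                   % apply the same column permutation to Ad
    y   = Rs \ cs;                     % solution of the sparse subproblem
    rd  = bd - Adp*y;                  % residual of the dense rows
    Cd  = (Rs' \ Adp')';               % Cd = Adp*inv(Rs), of size md-by-n
    v   = [eye(md); Cd'] \ [rd; zeros(n, 1)];   % small dense problem (5.3.5)
    u   = Cd'*v;
    z   = Rs \ u;
    x   = zeros(n, 1);
    x(ps) = y + z;                     % permute back to the original column order
    end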

A fundamental difficulty arises with this updating method if As is ill-conditioned or singular.


If rank(As ) < n, then Rs is singular, and the algorithm breaks down. If σmin (As ) ≪ σmin (A),
then Algorithm 5.3.1 is not stable.
It is not unusual in practice for As to have n2 ≪ n null columns. If the null columns are
permuted to the last positions and x is partitioned conformally, the problem becomes

                        [ As1    0  ]        [ x1 ]
    A = ( A1   A2 ) =   [ Ad1   Ad2 ] ,  x = [ x2 ] ,                  (5.3.6)

where A1 has full column rank, x1 ∈ Rn1 and x2 ∈ Rn2 , and n = n1 + n2 . The previous
updating scheme is generalized to this case by Heath [596, 1982]. Let z ∈ Rn1 and W ∈ Rn1 ×n2
be the solutions of the least squares problems

    min ∥A1 z − b∥2 ,      min ∥A1 W − A2 ∥2 ,                         (5.3.7)
     z                      W

respectively. In both least squares problems (5.3.7), A1 has full column rank and only md dense
rows; thus they can be solved by the previous algorithm. Then x2 is obtained as the
solution of the small dense least squares problem

    min ∥(A2 − A1 W ) x2 − (b − A1 z)∥2                                (5.3.8)
     x2

of size m × n2 , and x1 = z − W x2 ; see Lemma 4.3.1.


Another approach for treating mixed sparse-dense least squares problems is matrix stretch-
ing; see Grcar [530, 1990] and Adlers and Björck [8, 2000]. In this approach, dense rows are
replaced by several much sparser rows connected by additional linking variables. The strength
and limitations of this approach are studied by Scott and Tůma [990, 2019]. They propose a new
stretching strategy that better limits the fill in the normal matrix and its Cholesky factorization.
Avron, Ng, and Toledo [45, 2009] propose using an iterative method, such as LSQR, precon-
ditioned by Rs from the sparse QR factorization of As . This requires As to have full column
rank and be not too ill-conditioned. A Schur complement approach to preconditioning mixed
sparse-dense problems is studied by Scott and Tůma [989, 2018]. In image reconstruction and
certain other inverse problems, A may be fairly sparse in all rows and columns, but ATA may
be practically dense. Such problems are usually solved by preconditioned iterative methods; see
Scott and Tůma [988, 2017].

5.3.2 Canonical Block Triangular Form


Sometimes it can be helpful to permute A into the canonical block upper triangular form discov-
ered by Dulmage and Mendelsohn [353, 1958], [354, 1959], [355, 1963], and Johnson, Dulmage,
and Mendelsohn [675, 1963]:

               [ Ah   Uhs   Uhv ]
    P A Q  =   [      As    Usv ] .                                    (5.3.9)
               [            Av  ]

Here the block Ah is underdetermined (has more columns than rows), As is square, and Av is
overdetermined (has more rows than columns). All three blocks have a nonzero diagonal, and the
submatrices Av and ATh both have the strong Hall property. One or two of the diagonal blocks
may be absent in the decomposition. The example below shows the coarse block triangular
decomposition of a matrix A ∈ R11×10 with structural rank 8; the entries marked ⊗ belong to the
nonzero diagonals of the blocks.
× × ⊗ × ×
× ⊗ × × ×
⊗ × ×
× ⊗ ×
⊗ ×
× ⊗ ×
⊗ ×

×
× ×
×

It may be possible to further decompose the blocks Ah and Av in the coarse decomposition
(5.3.9) so that

          [ Ah1         ]             [ Av1         ]
    Ah =  [      ⋱      ] ,     Av =  [      ⋱      ] ,
          [         Ahp ]             [         Avq ]

where each Ah1 , . . . , Ahp is underdetermined and each Av1 , . . . , Avq is overdetermined. The
submatrix As may be decomposable into block upper triangular form

          [ As1   U12   . . .   U1,t ]
          [       As2   . . .   U2,t ]
    As =  [              ⋱       ⋮   ]                                 (5.3.10)
          [                     Ast  ]

with square diagonal blocks As1 , . . . , Ast that have nonzero diagonal elements. The resulting
fine decomposition can be shown to be essentially unique. Any block triangular form can be
obtained from any other by applying row permutations that involve the rows of a single block
row, column permutations that involve the columns of a single block column, and symmetric
permutations that reorder the blocks.
A square matrix that can be permuted to the form (5.3.10), with t > 1, is said to be reducible;
otherwise, it is called irreducible; see Definition 4.1.3. All the diagonal blocks As1 , . . . , Ast in
the fine decomposition are irreducible. This implies that they have the strong Hall property; see
Coleman, Edenbrandt, and Gilbert [262, 1986]. A two-stage algorithm for permuting a square
and structurally nonsingular matrix A to block upper triangular is given by Tarjan [1057, 1972];
see also Gustavson [555, 1976], and Duff [341, 1977], [342, 1981]. The algorithm depends
on the concept of matching in the bipartite graph of A. This is a subset of its edges with no
common end points and corresponds to a subset of nonzeros in A such that no two belong to the
same row or column. A maximum matching is one with the largest possible number of edges; this
number equals the structural rank r(A).
Pothen and Fan [902, 1990], [901, 1984] give an algorithm for the general case. The program
MC13D by Duff and Reid [351, 1978] in the HSL Mathematical Software Library implements
the fine decomposition of As . It proceeds in three steps:
1. Find a maximum matching in the bipartite graph G(A) with row set R and column set C.
2. According to the matching, partition R into sets VR, SR, HR and C into sets VC, SC,
HC corresponding to the vertical, square, and horizontal blocks.
3. Find the diagonal blocks of the submatrices Av and Ah from the connected components
in the subgraphs G(Av ) and G(Ah ). Find the block upper triangular form of As from
the strongly connected components in the associated directed subgraph G(As ), with edges
directed from columns to rows.

In MATLAB the algorithm is available through the function [p,q,r,s,cc,rr] = dmperm(A).


The output consists of row and column permutation vectors p and q, such that A(p, q) has
Dulmage–Mendelsohn block triangular form. The vectors r and s are index vectors indicat-
ing the block boundaries for the fine decomposition, while the vectors cc and rr indicate the
boundaries of the coarse decomposition.
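A minimal sketch of its use (not from the book):

    [p, q, r, s, cc, rr] = dmperm(A);
    B = A(p, q);                 % Dulmage-Mendelsohn block upper triangular form
    nblocks = length(r) - 1;     % number of fine blocks (boundaries in r and s)
    spy(B);                      % visualize the block structure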
The reordering to block triangular form in a preprocessing phase can save work and interme-
diate storage in solving least squares problems. If A has structural rank n, the first block row in
(5.3.9) must be empty, and the original least squares problem can, after a reordering, be solved
by a form of block back-substitution. First, partition x̃ = QT x and b̃ = P b conformally with
P AQ in (5.3.9) and compute the solution x̃v of

min ∥Av x̃v − b̃v ∥2 . (5.3.11)


x̃v

Next, compute the remaining parts of the solution x̃k , . . . , x̃1 from

    Asi x̃i = b̃i − Σ_{j=i+1}^{k} Uij x̃j ,      i = k, . . . , 2, 1.    (5.3.12)

Finally, set x = Qx̃. We can solve subproblems (5.3.11) and (5.3.12) by computing the QR fac-
torizations of Av and As,i , i = 1, . . . , k. As As1 , . . . , Ask and Av have the strong Hall property,
the structures of the matrices Ri are correctly predicted by the structures of the corresponding
normal matrices.
If A has structural rank n but is numerically rank-deficient, it will not be possible to fac-
torize all the diagonal blocks in (5.3.10). In this case the block triangular structure given by
the Dulmage–Mendelsohn form cannot be preserved, or some blocks may become severely ill-
conditioned. If the structural rank is less than n, there is an underdetermined block Ah . In this
case we can still obtain the form (5.3.10) with a square block A11 by permuting the extra col-
umns in the first block to the end. The least squares solution is then not unique, but there is a
unique solution of minimum length.
The block triangular form of the matrices in the Harwell–Boeing test collection (Duff,
Grimes, and Lewis [346, 1989]) and the time required to compute them are given in Pothen
and Fan [902, 1990].
Chapter 6

Iterative Methods

6.1 Basic Iterative Methods


6.1.1 Iterative versus Direct Methods
Direct methods for solving linear equations Ax = b and least squares problems min ∥Ax − b∥2
are reliable when they can be used. For huge problems, direct methods can be prohibitively
expensive in terms of storage and operations. Then it becomes essential to use iterative solvers, in
which an initial approximation is successively improved until an acceptable accuracy is achieved.
An important feature of iterative methods is that A itself need not be stored; it suffices to
be able to compute matrix-vector products Av for arbitrary vectors v. Hence, iterative meth-
ods automatically speed up when A is a sparse matrix or a fast linear operator. (Most iterative
methods for least squares problems also need products AT u for arbitrary vectors u.) The main
weakness of iterative methods is their unpredictable robustness and range of applicability. Often,
a particular iterative solver may be efficient for a specific class of problems, but for other cases it
may be excessively slow or break down. The rate of convergence depends in a complex way on
the spectrum of A (or sometimes the singular values of A) and can be prohibitively slow when A
is ill-conditioned. Usually, it is essential to use a preconditioner M such that AM −1 or M −1 A
is better conditioned and systems M w = c can be solved efficiently with arbitrary vectors c.
For least squares problems, any iterative method for solving symmetric positive definite sys-
tems can be applied to the normal equations ATAx = AT b. However, explicit formation of ATA
and AT b can and should be avoided by using the factored form

AT r = 0, r = b − Ax. (6.1.1)

Working with A and AT separately has important advantages. As emphasized earlier, small
perturbations in ATA, e.g., by roundoff, may change the solution much more than perturbations
of similar size in A itself. Working with AT b instead of b as input data can also cause a loss
of accuracy. Fill that can occur in the formation of ATA is also avoided (although occasionally
ATA is more sparse than A).
Iterative methods can also be applied to the least-norm problem

min ∥y∥2 subject to AT y = c. (6.1.2)

If A has full column rank, the unique solution is y = Az, where z satisfies the normal equations of
the second kind (see (1.1.17)):
AT Az = c. (6.1.3)

Again, explicit formation of the cross-product matrix ATA should be avoided.

Example 6.1.1. A problem where A is sparse but ATA is significantly more dense is shown in
Figure 6.1.1. In such a case the Cholesky factor will in general also be nearly dense. This rules
out the use of sparse direct methods based on QR decomposition of A. Consider the case when
A has a random sparsity structure such that an element aij is nonzero with probability p < 1.
Ignoring numerical cancellation, it follows that (ATA)jk ≠ 0 with probability

    q = 1 − (1 − p^2)^m ≈ 1 − e^{−mp^2} .

Therefore, ATA will be almost dense when mp ≈ m^{1/2}, i.e., when the average number of nonzero
elements in a column is about m^{1/2}. This type of structure is common in reconstruction
problems. An example is the inversion problem for the velocity structure for the Central Califor-
nia Microearthquake Network. In 1980 this generated a matrix A with dimensions m = 500,000,
n = 20,000, and about 107 nonzero elements. The nonzero structure of A is very irregular and
ATA is almost dense. Today similar problems of much higher dimensions are common.
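A minimal MATLAB sketch (not from the book; the dimensions are arbitrary) illustrating the estimate:

    m = 2000;  n = 500;
    p = 1/sqrt(m);                       % about sqrt(m) nonzeros per column of A
    A = sprand(m, n, p);
    densityA   = nnz(A)/(m*n);           % approximately p
    densityAtA = nnz(A'*A)/n^2;          % close to 1 - exp(-m*p^2) = 1 - 1/e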


Figure 6.1.1. Structure of a sparse matrix A (left) and ATA (right) for a simple image recon-
struction problem. Used with permission of Springer International Publishing; from Numerical Methods
in Matrix Computations, Björck, Åke, 2015; permission conveyed through Copyright Clearance Center, Inc.

Another approach is to apply an iterative method to the augmented system

    [ I    A ] [ y ]   [ b ]
    [ AT   0 ] [ x ] = [ c ] .                                         (6.1.4)

This system combines both kinds of normal equations. Taking c = 0 gives the least squares
problem. Taking b = 0 gives the least-norm problem with z = −x. The augmented system is
symmetric and indefinite, which makes its solution more challenging. Recall that the condition
of (6.1.4) can be improved by working with

    [ αI   A ] [ y/α ]   [  b  ]
    [ AT   0 ] [  x  ] = [ c/α ] ,                                     (6.1.5)

where α ≈ σmin (A); see Section 2.4.4.


Notes and references

A good survey of iterative methods for solving linear systems is the comprehensive text by
Saad [957, 2003]. The main research developments of the 20th century are surveyed by Saad and
van der Vorst [960, 2000]. Other notable textbooks on iterative methods include Axelsson [48,
1994], Greenbaum [534, 1997], and van der Vorst [1075, 2003]. Templates for implementation
of iterative methods for linear systems are found in Barrett et al. [82, 1994]. A PDF file of an
unpublished second edition of this book can be downloaded from
http://www.netlib.org/templates/templates.pdf. Among older textbooks, we mention Varga [1090, 1962] and
Hageman and Young [559, 2004].

6.1.2 Stationary Iterative Methods


The idea of solving systems of linear equations Ax = b by an iterative method dates at least
as far back as Gauss (1823). In the days of “hand” computations, rather unsophisticated re-
laxation methods were used. For a given approximation xk , an equation i with a residual of
large magnitude is picked, and the corresponding component of xk is adjusted so that this equa-
tion is satisfied. (This is always possible if A is symmetric positive definite.) On computers,
cyclic relaxation methods are more suitable, because the search for the largest residual is too
time-consuming.
For a linear system Ax = b, A ∈ Rn×n , a stationary iterative method has the general form

M xk+1 = N xk + b, k = 0, 1, . . . , (6.1.6)

where x0 is an initial approximation. Here A = M − N is a splitting of A, with M nonsingular.


For the iteration to be practical, linear systems with M should be easy to solve. To analyze the
convergence, we rewrite (6.1.6) as

xk+1 = Gxk + d, G = M −1 N = I − M −1 A, d = M −1 b, (6.1.7)

where G is the iteration matrix.
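As an illustration, the general iteration (6.1.6) can be written in a few lines of MATLAB. The sketch below is ours and assumes only that systems with M are cheap to solve (e.g., M diagonal or triangular); the simple relative-change stopping test is added for illustration only.

    function x = stationary_iter(M,N,b,x0,maxit,tol)
    % Sketch of the stationary iteration M*x_{k+1} = N*x_k + b
    % for a splitting A = M - N (illustrative only).
    x = x0;
    for k = 1:maxit
        xnew = M\(N*x + b);      % one step; M assumed cheap to solve with
        if norm(xnew - x) <= tol*norm(xnew), x = xnew; break; end
        x = xnew;
    end
    end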


The iterative method (6.1.6) is said to be convergent if the sequence {xk } converges for all
initial vectors x0 . If the method converges and limk→∞ xk = x, then the limit x satisfies x = Gx + d and hence Ax = b. Subtracting x = Gx + d from (6.1.7) gives

xk+1 − x = G(xk − x) = · · · = Gk+1 (x0 − x). (6.1.8)

It follows that the iteration method is convergent if and only if limk→∞ Gk = 0.

Theorem 6.1.2. The stationary iterative method xk+1 = Gxk + d is convergent for all initial
vectors x0 if and only if the spectral radius of G satisfies

    ρ(G) = max_{1≤i≤n} |λ_i(G)| < 1,   (6.1.9)

where λi are the eigenvalues of G.

For any consistent matrix norm, we have ρ(G) ≤ ∥G∥. Thus a sufficient condition for
convergence is that ∥G∥ < 1 holds for some consistent matrix norm. From (6.1.8) it follows that
for any consistent matrix norm,

∥xk − x∥ ≤ ∥Gk ∥ ∥x0 − x∥ ≤ ∥G∥k ∥x0 − x∥.



Definition 6.1.3. Assume that the iterative method (6.1.6) is convergent. The average rate
Rk (G) and asymptotic rate R∞ (G) of convergence are defined as

    R_k(G) = −(1/k) ln ∥G^k∥,    R_∞(G) = − ln ρ(G),

respectively, where ∥ · ∥ is any consistent matrix norm.

To reduce the norm of the error by a fixed factor δ, at most k iterations are needed where
∥Gk ∥ ≤ δ or, equivalently, k satisfies

k ≥ − ln δ/Rk (G).

It is desirable for the iteration matrix G = I − M −1 A to have real eigenvalues. This will be
the case if the iterative method is symmetrizable.

Definition 6.1.4. The stationary iterative method (6.1.6) is said to be symmetrizable if there is
a nonsingular matrix W such that

W (I − G)W −1 = W M −1 AW −1

is symmetric positive definite.

If A and the splitting matrix M are both symmetric positive definite, then the corresponding
stationary method is symmetrizable. To show this, let R be the Cholesky factor of A and set
W = R. Then
R(M −1 A)R−1 = RM −1 RTRR−1 = RM −1 RT .

If M is symmetric positive definite, then so is M −1 and also RM −1 RT .

Notes and references

The convergence of (6.1.6) when rank(A) < n is investigated by Keller [690, 1965] and
Young [1141, 2003]. Dax [293, 1990] investigates the convergence properties of stationary iter-
ative methods, with emphasis on properties that hold for singular and possibly inconsistent systems with a square matrix A. Tanabe [1056, 1971] considers stationary iterative methods of the form (6.1.10) for computing more general solutions x = A−b, where A− is any generalized inverse of A such that AA−A = A. He shows that the iteration can always be written in the form

xk+1 = xk + B(b − Axk )

for some matrix B, and characterizes the solution in terms of R(AB) and N (BA).
The concept of splitting has been extended to rectangular matrices by Plemmons [896, 1972].
Berman and Plemmons [112, 1974] define A = M − N to be a proper splitting if the ranges
and nullspaces of A and M are equal. They show that for a proper splitting, the iteration

xk+1 = M † (N xk + b) (6.1.10)

converges to the pseudoinverse solution x = A† b for every x0 if and only if the spectral radius
ρ(M † N ) < 1. The iterative method (6.1.10) avoids explicit use of the normal system.

6.1.3 Richardson’s Method for the Normal Equations


A stationary iteration method for solving the normal equations ATAx = AT b has the form
M xk+1 = N xk + AT b, where ATA = M − N is a splitting. Explicit use of ATA can be avoided
by noting that N = M − ATA and rewriting the iteration as

xk+1 = xk + M −1 AT (b − Axk ), k = 0, 1, . . . . (6.1.11)

This iteration is symmetrizable if A has full column rank and M is symmetric positive definite.
For solving the minimum norm problem (6.1.2), the same splitting is applied to solve ATAz =
c, giving the iteration zk+1 = zk + M −1 (c − ATAzk ). After multiplying with A and using
yk = Azk , we obtain

yk+1 = yk + AM −1 (c − AT yk ), k = 0, 1, . . . . (6.1.12)

In the following we assume that ATA is positive definite. In (6.1.11), the particular choice
M = ω −1 I gives Richardson’s first-order method

xk+1 = xk + ωAT (b − Axk ), k = 0, 1, 2, . . . , (6.1.13)

where ω > 0 is a relaxation parameter. Richardson’s method is often used for solving least
squares problems originating from discretized ill-posed problems. In this context (6.1.13) is
also known as Landweber's method; see Section 6.4.1. If x0 ∈ R(AT ) (e.g., x0 = 0), then by construction xk ∈ R(AT ) for all k > 0. Hence, in exact arithmetic, Richardson's method converges to the pseudoinverse solution when A is rank-deficient.
For the iteration (6.1.13) the error satisfies

xk − x = G(xk−1 − x), G = I − ωATA. (6.1.14)

The eigenvalues of the iteration matrix G are λi (G) = 1 − ωσi2 , i = 1, . . . , n, where σi are the
singular values of A.

Theorem 6.1.5. Assume that the singular values σi of A satisfy 0 < a ≤ σi2 ≤ b, i = 1, . . . , n.
Then Richardson’s method converges if and only if 0 < ω < 2/b.

Proof. By the assumption, 1 − ωb ≤ λi (G) ≤ 1 − ωa for all i. Hence, if 1 − ωa < 1 and


1 − ωb > −1, then ρ(G) < 1. Since a > 0, the first condition is satisfied for all ω > 0, while
the second condition is satisfied if ω < 2/b.

To maximize the asymptotic rate of convergence, ω should be chosen so that the spectral
radius
ρ(G) = max{|1 − ωa|, |1 − ωb|}
is minimized. The optimal ω lies in the intersection of the graphs of |1 − ωa| and |1 − ωb|,
ω ∈ (0, 2/b). Setting 1 − ωa = ωb − 1, we obtain

ωopt = 2/(b + a), ρopt (G) = (b − a)/(b + a).

Since κ_2(A)^2 = b/a, we have

    ρ_opt(G) = (κ_2(A)^2 − 1)/(κ_2(A)^2 + 1) = 1 − 2/(κ_2(A)^2 + 1).   (6.1.15)
This illustrates a typical property of iterative methods: they converge more slowly for ill-conditioned systems.
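A minimal MATLAB sketch (ours) of Richardson's first-order method (6.1.13) is given below. For illustration, ω is set to the optimal value 2/(b + a) computed from the exact extreme squared singular values, which in practice would only be estimated.

    function x = richardson_ne(A,b,x0,maxit)
    % Sketch of Richardson's (Landweber's) iteration (6.1.13) for
    % the normal equations A'A x = A'b, with omega = 2/(b + a).
    s = svd(A);
    lo = s(end)^2; hi = s(1)^2;       % bounds a and b on sigma_i^2
    omega = 2/(hi + lo);              % optimal relaxation parameter
    x = x0;
    for k = 1:maxit
        x = x + omega*(A'*(b - A*x));
    end
    end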

Richardson’s nonstationary method is

xk+1 = xk + ωk AT (b − Axk ), k = 0, 1, . . . , (6.1.16)

where ωk > 0. Sufficient conditions for convergence of this iteration are known, which we state
below without proof.

Theorem 6.1.6. The iterates (6.1.16) converge for all vectors b to a least squares solution
x̂ = arg minx ∥Ax − b∥2 if for some ϵ > 0 it holds that

0 < ϵ ≤ ωk ≤ (2 − ϵ)/σ12 ∀k,

where σ1 is the largest singular value of A. If x0 ∈ R(AH ), then x̂ is the unique least-norm
solution.

6.1.4 The Jacobi and Gauss–Seidel Methods


In the following we use the standard splitting

ATA = L + D + LT , D = diag (ATA), (6.1.17)

where A = (a1 , . . . , an ) ∈ Rm×n , D = diag (d1 , . . . , dn ) is diagonal, and L is strictly lower


triangular. In Jacobi’s method for the normal equations ATAx = AT b the splitting M = D is
used, giving the iteration
    xk+1 = xk + D−1 AT (b − Axk ).   (6.1.18)
The minimum norm solution of AT y = c is y = Az, where z satisfies ATAz = c. Applying
Jacobi’s method to these equations gives

zk+1 = zk + D−1 (c − AT Azk ).

Multiplying by A and setting Azk = yk , we obtain the iteration

yk+1 = yk + AD−1 (c − AT yk ). (6.1.19)

Since M = D is symmetric positive definite, Jacobi’s method is symmetrizable. If the columns


of A are scaled to have unit norm, then D = I, and Jacobi’s method becomes Richardson’s
method with ω = 1.
The Gauss–Seidel method for the normal equations ATAx = AT b uses the splitting M =
L + D, where D is the diagonal and L is the strictly lower triangular part of ATA. In matrix
form the iteration becomes

xk+1 = xk + (L + D)−1 AT b − Axk ). (6.1.20)

For the minimum norm problem, the Gauss–Seidel iteration method is

yk+1 = yk + A(L + D)−1 (c − AT yk ), k = 0, 1, . . . . (6.1.21)

The method is not symmetrizable. Each step requires the solution of a lower triangular system
by forward substitution and splits into n minor steps. Note that the ordering of the columns of A
will influence the convergence. To implement the Gauss–Seidel method, the key observation is that it differs from the Jacobi method only in the following respect: as soon as a new component of xk+1 has been computed, it is used in computing the remaining components of xk+1 .

Björck and Elfving [141, 1979] show that the Gauss–Seidel method applied to the normal
equations of the first and second kinds are special cases of two classes of projection methods for
square nonsingular linear systems studied by Householder and Bauer [646, 1960]. In the first
class of methods for Ax = b, let p1 , p2 , . . . ̸∈ N (A) be a sequence of n-vectors. Let x(1) be an
initial approximation, and for j = 1, 2, . . . , compute

x(j+1) = x(j) + αj pj ,    qj = Apj ,    αj = qjT r(j) /∥qj ∥22 ,   (6.1.22)

where r(j) = b − Ax(j) is the residual vector. Then r(j+1) ⊥ qj and

∥r(j+1) ∥22 = ∥r(j) ∥22 − |αj |2 ∥qj ∥22 ≤ ∥r(j) ∥22 .

This shows that (6.1.22) is a residual-reducing iteration method. This method was originally
devised by de la Garza [442, 1951].
One step of the Gauss–Seidel method for ATAx = AT b is obtained by taking pj in (6.1.22) to
be the unit vectors ej ∈ Rn , j = 1, . . . , n, in cyclic order. Then qj = Aej = aj , and one iteration
splits into n minor steps as follows. Let x_k^{(1)} = x_k be the current iterate and r_k^{(1)} = b − Ax_k be the corresponding residual. The Gauss–Seidel iteration becomes the following: For j = 1, . . . , n, compute

    x_k^{(j+1)} = x_k^{(j)} + δ_j e_j ,    δ_j = a_j^T r_k^{(j)} /∥a_j∥_2^2 ,   (6.1.23)
    r_k^{(j+1)} = r_k^{(j)} − δ_j a_j .

Then x_{k+1} = x_k^{(n+1)} and r_{k+1} = r_k^{(n+1)} .
changed, and only the jth column aj = Aej is accessed. The iteration simplifies if the columns
are prescaled to have unit norm.
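One major Gauss–Seidel iteration in the column-oriented form (6.1.23) can be coded as follows (our sketch; x and r = b − Ax are the current iterate and residual):

    function [x,r] = gs_sweep(A,x,r)
    % One Gauss-Seidel sweep (6.1.23) for the normal equations
    % A'A x = A'b, given x and r = b - A*x (sketch).
    n = size(A,2);
    for j = 1:n
        aj = A(:,j);
        delta = (aj'*r)/(aj'*aj);   % delta_j = a_j'*r / ||a_j||_2^2
        x(j) = x(j) + delta;        % only component j is changed
        r = r - delta*aj;           % update the residual
    end
    end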
The second class of projection methods finds the minimum norm solution of AT y = c = (c1 , . . . , cn )T . Let p1 , p2 , . . . ̸∈ N (A) be a sequence of n-vectors, and set qj = Apj . Compute

y (j+1) = y (j) + δj qj , δj = pTj (c − AT y (j) )/∥qj ∥22 , j = 1, 2, . . . . (6.1.24)

By construction we have d(j+1) ⊥ qj , where d(j) = y − y (j) denotes the error. It follows that

∥d(j+1) ∥22 = ∥d(j) ∥22 − |δj |2 ∥qj ∥22 ≤ ∥d(j) ∥22 .

Hence this class of methods is error-reducing.


The Gauss–Seidel method for the minimum norm solution of AT y = c is obtained by taking
pj in (6.1.24) to be the unit vectors ej in cyclic order. Then p_j^T A^T = a_j^T and p_j^T c = c_j . Let the current iterate be y_k ∈ R(A), and set y_k^{(1)} = y_k . The Gauss–Seidel method is as follows: For j = 1, . . . , n, compute

    y_k^{(j+1)} = y_k^{(j)} + δ_j a_j ,    δ_j = (c_j − a_j^T y_k^{(j)})/d_j ,   (6.1.25)

and set y_{k+1} = y_k^{(n+1)} . For a square matrix A, this method was originally devised by Kaczmarz [678, 1937].
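In its classical form, Kaczmarz's method sweeps cyclically over the rows of A and projects the current iterate onto the hyperplane defined by each equation. The following MATLAB fragment is our sketch of one such sweep for a consistent system Ax = b:

    function x = kaczmarz_sweep(A,b,x)
    % One cyclic Kaczmarz sweep for Ax = b (sketch).  Each minor step
    % projects x orthogonally onto the hyperplane a_i'*x = b_i.
    m = size(A,1);
    for i = 1:m
        ai = A(i,:)';                          % i-th row of A
        x = x + ((b(i) - ai'*x)/(ai'*ai))*ai;  % orthogonal projection
    end
    end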
Cimmino [250, 1938] devised another notable error-reducing iterative method for the approx-
imate solution of a linear system Ax = b, where initially A ∈ Rn×n is assumed to be square and
nonsingular; see Benzi [107, 2005] for an English translation. Cimmino notes that the unique
solution x = A−1 b lies on the intersection of the n hyperplanes

aTi x = bi , i = 1, . . . , n, (6.1.26)

where a_i^T = e_i^T A is the ith row of A. Given an initial approximation x^{(0)} , he forms

    x_i^{(0)} = x^{(0)} + 2 a_i (b_i − a_i^T x^{(0)})/∥a_i∥_2^2 ,    i = 1, . . . , n.   (6.1.27)

This has a nice geometrical interpretation. Subtracting x from both sides of (6.1.27) and using b_i = a_i^T x gives

    x_i^{(0)} − x = P_i (x^{(0)} − x),    P_i = I − 2(a_i a_i^T)/∥a_i∥_2^2 .   (6.1.28)
This shows that the points x_i^{(0)} , i = 1, . . . , n, are the orthogonal reflections of x^{(0)} with respect to the hyperplanes (6.1.26). It follows that

    ∥x_i^{(0)} − x∥_2 = ∥x^{(0)} − x∥_2 ,    i = 1, . . . , n.

Hence the initial point x^{(0)} and its n reflections x_i^{(0)} all lie on a hypersphere. If A is square and nonsingular, the center of this hypersphere is the unique solution of Ax = b. The next iterate x^{(1)} in Cimmino's method is taken as the center of gravity of the mass system formed by placing n positive masses w_i at the points x_i^{(0)} , i = 1, . . . , n:

    x^{(1)} = (1/µ) \sum_{i=1}^{n} w_i x_i^{(0)} ,    µ = \sum_{i=1}^{n} w_i .   (6.1.29)

Because the center of gravity of the system of masses w_i must fall inside this hypersphere, Cimmino's method is error-reducing, i.e., ∥x^{(1)} − x∥_2 < ∥x^{(0)} − x∥_2 . In matrix form Cimmino's method can be written

    x^{(k+1)} = x^{(k)} + (2/µ) A^T D^T D (b − Ax^{(k)}),   (6.1.30)

where

    D = diag (d_1 , . . . , d_n ),    d_i = \sqrt{w_i} / ∥a_i∥_2 .   (6.1.31)

Cimmino notes that if rank(A) > 2, the iterates converge even when A ∈ R^{n×n} is singular and the linear system is inconsistent. Then Cimmino's method will converge to a solution of the weighted least squares problem min_x ∥D(Ax − b)∥_2 . If w_i = ∥a_i∥_2^2 , then D = I, and (6.1.30) is Richardson's method with ω = 2/µ.
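A MATLAB sketch (ours) of Cimmino's iteration in the matrix form (6.1.30) with the weights w_i = ∥a_i∥_2^2 is given below; for this choice D = I and each step reduces to a Richardson step with ω = 2/µ.

    function x = cimmino(A,b,x0,maxit)
    % Sketch of Cimmino's method (6.1.30) with w_i = ||a_i||_2^2,
    % so that D = I and mu = sum_i ||a_i||_2^2 = ||A||_F^2.
    mu = norm(A,'fro')^2;
    x = x0;
    for k = 1:maxit
        x = x + (2/mu)*(A'*(b - A*x));
    end
    end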

Notes and references


Kaczmarz’s method was rediscovered by Gordon, Bender, and Herman [518, 1970] and given the
name ART (Algebraic Reconstruction Technique). ART was successfully used in the first com-
puterized tomography (CT) scanner patented by Hounsfield in 1972; see Censor and Zenios [214,
1997]. It is still extensively used for this purpose. Iterative methods such as Kaczmarz’s and
Cimmino’s that require access to only one row of A at each minor step are sometimes called
“row-action methods.” A survey of this class of methods is given by Censor [213, 1981].

6.1.5 Successive Overrelaxation Methods


The rate of convergence of the Gauss–Seidel method can be improved by introducing a relax-
ation parameter ω > 1. The successive overrelaxation method (SOR) for the normal equations
ATAx = AT b is obtained simply by changing δj to ωδj in (6.1.23). Similarly, SOR for the normal
equations of the second kind AT y = c, y = Az, is obtained by changing δj to ωδj in (6.1.25).
The symmetric SOR (SSOR) method is obtained by following each step of SOR with another

SOR step where the columns of A are taken in reverse order, j = n, . . . , 2, 1. The SSOR iter-
ation is symmetrizable. SOR and SSOR share with the Gauss–Seidel method the advantages of
simplicity and small storage requirements.
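For illustration (ours), one forward SOR sweep is the Gauss–Seidel sweep with each correction δ_j multiplied by ω; appending a second sweep over the columns in reverse order gives one SSOR iteration.

    function [x,r] = sor_sweep(A,x,r,omega)
    % One SOR sweep for A'A x = A'b: Gauss-Seidel minor steps with
    % the correction delta_j replaced by omega*delta_j (sketch).
    n = size(A,2);
    for j = 1:n
        aj = A(:,j);
        delta = omega*(aj'*r)/(aj'*aj);
        x(j) = x(j) + delta;
        r = r - delta*aj;
    end
    end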
For positive definite ATA, SOR converges if and only if 0 < ω < 2. The parameter ω in
SOR should, if possible, be chosen to maximize the asymptotic rate of convergence. If ATA has
the following special property, the optimal ω is known.

Definition 6.1.7. A square matrix A with the decomposition A = D(I + L + U ), where D is


nonsingular and L (U ) are strictly lower (upper) triangular, is said to be consistently ordered
if it has the property that the eigenvalues of J(α) ≡ αL + α−1 U , α ̸= 0, are independent of α.

A block tridiagonal matrix A whose diagonal blocks are nonsingular diagonal matrices can
be shown to be consistently ordered. In particular, this is true if there exists a permutation matrix
P such that P ATAP T has the form

    \begin{pmatrix} D_1 & U_1 \\ L_1 & D_2 \end{pmatrix},   (6.1.32)

where D1 and D2 are nonsingular diagonal matrices. Such a matrix is said to have property A.
The following result is due to Young [1141, 2003].

Theorem 6.1.8. Let A be a consistently ordered matrix, and assume that the eigenvalues µ of
the Jacobi iteration matrix GJ = L + U are real, and its spectral radius satisfies ρ(GJ ) < 1.
Then the optimal relaxation parameter ω in SOR is given by
    ω_opt = 2/(1 + \sqrt{1 − ρ_J^2}),    ρ(G_{ω_opt}) = ω_opt − 1.   (6.1.33)

If A is consistently ordered, then using ωopt in SOR gives a great improvement in the rate of
convergence. Otherwise, SOR may not be effective for any choice of ω. In contrast to SOR, the
rate of convergence of SSOR is not very sensitive to the choice of ω. Taking ω = 1, i.e., using
the symmetric Gauss–Seidel method, is often close to optimal; see Axelsson [48, 1994].

Notes and references


Bramley and Sameh [175, 1992] develop row projection methods related to Kaczmarz’s method
for large unsymmetric linear systems. For a three-dimensional grid problem with n3 unknowns,
each iteration can be split into n2 /9 subproblems. Arioli et al. [35, 1992], [38, 1995] develop
a block projection method for accelerating the block Cimmino method. A robust and efficient
solver for elliptic equations by Gordon and Gordon [517, 2008] is based on a similar technique.

6.1.6 The Chebyshev Semi-iterative Method


Consider a stationary iterative method x̃0 = x0 ,

x̃k+1 = x̃k + M −1 AT (b − Ax̃k ), k = 0, 1, . . . , (6.1.34)

for solving the normal equations ATAx = AT b, associated with the splitting ATA = M −N with
M symmetric positive definite. Then the eigenvalues {λi }ni=1 of M −1 ATA are real. Assume that
lower and upper bounds are known such that

0 < a ≤ λi (G) < b, i = 1, . . . , n, (6.1.35)



where G = I − M −1 ATA is the iteration matrix. Then the eigenvalues {ρi }ni=1 of G are real
and satisfy
1 − b = c < ρi ≤ d = 1 − a < 1. (6.1.36)

(Note that we allow c ≤ −1, even though then ρ(G) ≥ 1, and the iteration (6.1.34) is divergent!)
To attempt to accelerate convergence of the basic iteration we take a linear combination of
the first k iterations,
    x_0 = x̃_0 ,    x_k = \sum_{i=0}^{k} c_{ki} x̃_i ,    k = 1, 2, . . . ,   (6.1.37)

where, for consistency, we require that \sum_{i=0}^{k} c_{ki} = 1. The resulting iteration is known as a
semi-iterative method; see Varga [1090, 1962]. The error equation for the basic iteration method
(6.1.34) is
x̃k − x = Gk (x̃0 − x), (6.1.38)

and from (6.1.37) it follows for the semi-iterative method that

xk − x = Pk (G)(x0 − x), (6.1.39)

where

    P_k(t) = \sum_{i=0}^{k} c_{ki} t^i ,    P_k(1) = 1,

is a polynomial of degree k. Such a polynomial is called a residual polynomial. Hence, a


measure of the rate of convergence for the accelerated sequence (6.1.37) is given by the spectral
radius ρ(Pk (G)) ≤ maxt∈[c,d] |Pk (t)|. To minimize this quantity, we seek the polynomial that
solves
    \min_{P_k ∈ Π_k^1} \max_{t∈[c,d]} |P_k(t)|,   (6.1.40)

where Π_k^1 denotes the set of all polynomials of degree k such that P_k(1) = 1.
The solution to the minimization problem (6.1.40) can be expressed in terms of the Cheby-
shev polynomials Tk (z) of the first kind; see Section 4.5.2. These are defined by the three-term
recurrence relation T0 (z) = 1, T1 (z) = z, and

Tk+1 (z) = 2zTk (z) − Tk−1 (z), k ≥ 1. (6.1.41)

By induction it follows that the leading coefficient of Tk (z) is 2k−1 . Tk (z) may also be expressed
explicitly as

    T_k(z) = { cos(kϕ), z = cos ϕ, if |z| ≤ 1;    cosh(kγ), z = cosh γ, if |z| > 1 }.   (6.1.42)

Thus, |T_k(z)| ≤ 1 for |z| ≤ 1. For |z| ≥ 1 we have z = (1/2)(w + w^{−1}), where w = e^γ . By solving a quadratic equation in w, we get

    T_k(z) = (1/2)(w^k + w^{−k}),    w = z ± \sqrt{z^2 − 1} > 1.   (6.1.43)

This shows that outside the interval [−1, 1], Tk (z) grows exponentially with k. The Chebyshev
polynomial Tk (z) has the following extremal property.

Theorem 6.1.9. Let µ be any fixed number such that µ > 1. If we let Pk (z) = Tk (z)/Tk (µ),
then Pk (µ) = 1 and
    \max_{−1≤z≤1} |P_k(z)| = 1/T_k(µ).   (6.1.44)

Moreover, if Q(z) is any polynomial of degree k or less such that Q(µ) = 1 and \max_{−1≤z≤1} |Q(z)| ≤ 1/T_k(µ), then Q(z) = P_k(z).

Proof. See Young [1141, 2003, Theorem 3.1].

From this result it follows that the solution to the minmax problem (6.1.40) is a scaled and
shifted Chebyshev polynomial. Let

z(t) = (2t − (d + c))/(d − c)

be the linear transformation that maps the interval t ∈ [c, d] onto z ∈ [−1, 1]. Then the solution
to the minimization problem (6.1.40) is given by

    P_k(t) = T_k(z(t))/T_k(z(1)),    µ = z(1) = (2 − (d + c))/(d − c) = (b + a)/(b − a),   (6.1.45)
where we have used the facts that a = 1 − d and b = 1 − c; see (6.1.36). It follows that a bound
for the spectral radius of Pk (G) is given by

ρ(Pk (G)) ≤ 1/Tk (µ) = 1/ cosh(kγ), cosh γ = µ. (6.1.46)

If the splitting matrix M is symmetric positive definite, then κ = b/a > 1 is an approximate upper bound for the spectral condition number of M −1 ATA. From (6.1.35) it follows that

    µ = (κ + 1)/(κ − 1),    κ = b/a.

An elementary calculation shows that

    w = µ + \sqrt{µ^2 − 1} = (κ + 1)/(κ − 1) + 2\sqrt{κ}/(κ − 1) = (\sqrt{κ} + 1)/(\sqrt{κ} − 1) > 1.

From (6.1.43) it follows that ρ(P_k(G)) ≤ 1/T_k(µ) < 2e^{−kγ}, where

    γ = log((\sqrt{κ} + 1)/(\sqrt{κ} − 1)) > 2/\sqrt{κ}.   (6.1.47)
Hence, to reduce the error norm by at least a factor of δ < 1 it suffices to perform k iterations, where

    k > (1/2)\sqrt{κ} \log(2/δ).   (6.1.48)

Hence, the number of iterations required for the Chebyshev accelerated method to achieve a certain accuracy is proportional to \sqrt{κ} rather than κ as for Richardson's method; see Section 6.1.3. This is a great improvement but assumes that the upper and lower bounds in (6.1.35) for the eigenvalues are sufficiently accurate.
The Chebyshev Semi-iterative (CSI) method by Golub and Varga [513, 1961] is an efficient
and stable way to implement Chebyshev acceleration. It can be applied to accelerate any station-
ary iterative method for the normal equations

xk+1 = xk + M −1 AT (b − Axk ), k = 0, 1, . . . ,

provided it is symmetrizable. CSI also has the advantage that the number of iterations need not
be fixed in advance. CSI uses a clever rewriting of the three-term recurrence relation for the
Chebyshev polynomials to compute x(k) directly.

Algorithm 6.1.1 (The Chebyshev Semi-iterative Method).


Set α = 2/(a + b), µ = (b − a)/(b + a), and let

rk = b − Axk , sk = M −1 AT rk , k ≥ 0.

Take x1 = x0 + αs0 , ω1 = 2, and for k ≥ 1 compute

    x_{k+1} = x_{k−1} + ω_{k+1}(α s_k + x_k − x_{k−1}),   (6.1.49)

where ω_{k+1} = (1 − (µ^2/4) ω_k)^{−1}.

Each iteration requires two matrix-vector multiplications Axk and AT rk , and the solution of M sk = AT rk . The second-order Richardson method can also be described by (6.1.49) with α and µ as above, and

    ω_k = ω̂ = 2/(1 + \sqrt{1 − µ^2}).   (6.1.50)

It can be shown that in the CSI method, ω_k → ω̂ as k → ∞.
For SOR, the eigenvalues of the iteration matrix Bωopt are all complex with modulus ωopt − 1.
In this case, convergence acceleration is of no use; see Young [1141, 2003]. On the other hand,
for SSOR, Chebyshev acceleration often achieves a substantial gain in convergence rate.
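A MATLAB sketch (ours) of Algorithm 6.1.1 for the case M = I is given below; the bounds a and b on the squared singular values of A are assumed to be available.

    function x = csi_ne(A,rhs,a,b,x0,maxit)
    % Sketch of the Chebyshev semi-iterative method (Algorithm 6.1.1)
    % for A'A x = A'b with M = I.  a and b are lower/upper bounds on
    % the squared singular values of A (assumed known).
    alpha = 2/(a + b); mu = (b - a)/(b + a);
    x = x0; s = A'*(rhs - A*x);
    xold = x; x = x + alpha*s;               % x_1 = x_0 + alpha*s_0
    omega = 2;                               % omega_1 = 2
    for k = 1:maxit
        s = A'*(rhs - A*x);                  % s_k = A'(b - A*x_k)
        omega = 1/(1 - (mu^2/4)*omega);      % omega_{k+1}
        xnew = xold + omega*(alpha*s + x - xold);
        xold = x; x = xnew;
    end
    end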

6.2 Krylov Subspace Methods


6.2.1 The Conjugate Gradient Method
Iterative methods that minimize an error functional over a sequence of Krylov subspaces play a
fundamental role in solving large-scale linear systems and least squares problems. Basic prop-
erties of Krylov subspaces are given in Section 4.2.3. The prototype for such methods is the
conjugate gradient (CG) method for solving symmetric positive definite linear systems Ax = b,
A ∈ Cn×n . It was developed independently by Hestenes and Stiefel and published in their joint
seminal paper [608, 1952].
Given an initial approximation x0 , CG generates a sequence of approximate solutions xk as
follows. Set p0 = r0 = b − Ax0 , and for k = 1, 2, . . . , compute

xk+1 = xk + αk pk , rk+1 = rk − αk Apk , pk+1 = rk+1 + βk pk , (6.2.1)

where αk and βk are parameters to be chosen. A simple induction argument using (6.2.1) shows
that rk and pk lie in the Krylov subspaces

Kk (A, r0 ) = span {r0 , Ar0 , . . . , Ak−1 r0 }, k = 1, 2, . . . . (6.2.2)

In CG, α_k is chosen to make r_{k+1} orthogonal to p_k . From p_k^H (r_k − α_k Ap_k) = 0 we obtain

    α_k = p_k^H r_k / (p_k^H Ap_k).   (6.2.3)

The parameter βk is chosen to make pk+1 A-orthogonal or conjugate to pk ,

(pk+1 , pk )A = 0, (6.2.4)

where the A-inner product and A-norm, also called energy norm, are defined by

(u, v)A = uH Av, ∥u∥A = (uH Au)1/2 . (6.2.5)

This explains the name “conjugate gradient method.” Multiplying (6.2.1) by p_k^H A, we obtain

    β_k = − p_k^H A r_{k+1} / (p_k^H Ap_k).   (6.2.6)

Equations (6.2.1), (6.2.3), and (6.2.6) fully define the CG method.

Theorem 6.2.1. In CG, the residual vector rk is orthogonal to all previous direction vectors and
residual vectors:
rkH pj = 0, rkH rj = 0, j = 0 : k − 1. (6.2.7)
The direction vectors are mutually A-conjugate:

    p_k^H Ap_j = 0,    j = 0 : k − 1.   (6.2.8)

Proof. We prove (6.2.7) and (6.2.8) jointly by induction. The choice of αk ensures that rk is
orthogonal to pk−1 , and (6.2.4) shows that (6.2.8) holds also for j = k − 1. Hence these relations
are true for k = 1. Assume now that the relations are true for some k > 1. From p_k^H r_{k+1} = 0, changing the index and taking the scalar product with p_j , 0 ≤ j < k, we get

    r_{k+1}^H p_j = r_k^H p_j − α_k p_k^H Ap_j .

This is zero by the induction hypothesis, and because r_{k+1}^H p_k = 0, it follows that (6.2.7) holds for k + 1. From (6.2.1), the induction hypothesis, and (6.2.8), we find that

    p_{k+1}^H Ap_j = r_{k+1}^H Ap_j + β_k p_k^H Ap_j = α_j^{−1} r_{k+1}^H (r_j − r_{j+1})
                  = α_j^{−1} r_{k+1}^H (p_j − β_{j−1} p_{j−1} − p_{j+1} + β_j p_j).

By (6.2.7), this is zero for 0 < j < k. For j = 0 we use b = p0 in forming the last line of
the equation. For j = k we use (6.2.4), which yields (6.2.8). Since the vectors r0 , . . . , rk−1
and p0 , . . . , pk−1 span the same Krylov subspace Kk (A, b), the second orthogonality relation in
(6.2.7) also holds.

We now use these orthogonality properties to derive alternative expressions for αk and βk .
From (6.2.1), we have rkH pk = rkH (rk + βk−1 pk−1 ) = rkH rk , giving

rkH rk
αk = . (6.2.9)
pH
k Apk

Similarly, r_{k+1}^H r_{k+1} = r_{k+1}^H (r_k − α_k Ap_k) = −α_k r_{k+1}^H Ap_k . Now from (6.2.6) we get r_{k+1}^H Ap_k = −β_k p_k^H Ap_k , and (6.2.9) gives

    β_k = r_{k+1}^H r_{k+1} / (r_k^H r_k).   (6.2.10)
Theorem 6.2.1 and the property rk ∈ Kk (A, b) imply that in theory the residual vectors r0 , r1 ,
r2 , . . . are the vectors that would be obtained from the sequence b, Ab, A2 b, . . . by Gram–Schmidt

orthogonalization. The vectors p0 , p1 , p2 , . . . may be constructed similarly from the conjugacy


relation (6.2.8).
The orthogonality relations (6.2.7) ensure that in exact arithmetic, CG terminates with rk = 0
after at most n steps. Indeed, suppose rk ̸= 0, k = 0 : n. Then by (6.2.7) these n + 1 nonzero
vectors in Cn are mutually orthogonal and hence linearly independent. Since this is impossible,
we have a contradiction.

Theorem 6.2.2. In CG the vector xk minimizes the energy norm

E(x) = (x − x∗ )H A(x − x∗ ) = ∥x − x∗ ∥2A (6.2.11)

of the error over all vectors x ∈ x0 + Kk (A, r0 ), where r0 = b − Ax0 and x∗ = A−1 b is the
exact solution.

We have x_k = x_0 + Q_{k−1}(A) r_0 for some polynomial Q_{k−1} of degree k − 1. Substituting b = Ax and subtracting x from both sides gives

    x − x_k = (I − Q_{k−1}(A)A)(x − x_0) = P_k(A)(x − x_0),

where P_k(0) = 1. As long as r_k ̸= 0, the “energy” norm ∥x_k − x^∗∥_A is strictly decreasing. It


can be shown (Hestenes and Stiefel [608, 1952, Theorem 6.3]) that the error norm ∥xk − x∗ ∥2
is also strictly decreasing. However, the residual norm ∥b − Axk ∥2 typically oscillates and may
increase initially.
CG also works for Hermitian semidefinite consistent systems Ax = b, b ∈ R(A). With
x0 = 0, it follows that xk ∈ R(A), k > 0. Then CG converges to the unique least-norm
solution. A more general result is shown by Kammerer and Nashed [683, 1972].

An implementation of CG is given by the following MATLAB function.

Algorithm 6.2.1 (CG).


function [x,r] = cg(A,b,x0,maxit)
% CG performs at most maxit CG iterations
% on the linear system Ax = b.
% -----------------------------------------
x = x0; r = b - A*x;
p = r; nrm = r'*r;
for k = 1:maxit
    if nrm == 0, break; end
    q = A*p;
    alpha = nrm/(q'*p);
    x = x + alpha*p;
    r = r - alpha*q;
    nrmold = nrm; nrm = r'*r;
    beta = nrm/nrmold;
    p = r + beta*p;
end
end

The computational requirements for each iteration of CG are constant. Each step requires
one matrix-vector multiplication with A, two inner products, and two vector updates of length n.
Storage is needed for four n-vectors x, r, p, q.

Originally, the CG method was viewed primarily as a direct method; see Householder [645,
1964, Sect. 5.7]. It soon became evident that the finite termination property is valid only in
exact arithmetic. In floating-point computation it could take many more than n iterations before
convergence occurred. This led to a widespread disregard of the method for more than a decade
after its publication. Interest was renewed when Reid [920, 1971] showed that it could be highly
efficient if used as an iterative method for solving large, sparse, well-conditioned linear systems.

6.2.2 CGLS and Related Methods


A Krylov subspace method CGLS for solving the least squares problem

    \min_x ∥Ax − b∥_2 ,    A ∈ R^{m×n},

is obtained by applying CG to the normal equations ATAx = AT b. These are Hermitian positive
definite or semidefinite. From Theorem 6.2.1 it follows that the residual vectors

sk = AT rk = AT (b − Axk )

in CGLS are mutually orthogonal. In exact arithmetic, the CGLS iterations terminate with sk = 0
after at most rank(A) steps. Let x† denote the pseudoinverse solution, and let r† = b − Ax† . If
ATA is positive definite, it follows from Theorem 6.2.2 that xk minimizes the error norm

E(x) = ∥x − x† ∥2ATA = ∥A(x − x† )∥22 = ∥r − r† ∥22 (6.2.12)

over the Krylov subspace x − x0 ∈ Kk (ATA, AT r0 ), r0 = b − Ax0 . Hence, both ∥r† − rk ∥2


and ∥x† − xk ∥2 decrease monotonically. The error functional (6.2.12) can be written

E(x) = ∥r − r† ∥22 = ∥r∥22 − ∥r† ∥22 , (6.2.13)

where the last equality follows from the Pythagorean theorem and the identity

r = (r − r† ) + r† , r − r† ⊥ r† .

Thus ∥rk ∥2 also decreases monotonically. However, the residual norm ∥sk ∥2 = ∥AT rk ∥2 will
usually oscillate, especially when A is ill-conditioned.
Consider a straightforward application of CG to the normal equation ATAx = AT b with
x0 = 0. The only information about b available is from the initialization s = AT b because no
more reference to b is made in the iterative phase. Hence the bound on the achievable accuracy
will include a term of size
    |δx| ≤ m u κ(A) |A†| |b|,   (6.2.14)

coming from the roundoff error in computing AT b (here u denotes the unit roundoff). If |r| ≪ |b|, this term is much larger than for
perturbations of A and b. For reasons of numerical stability, the following two simple algebraic
rearrangements should be performed:
1. Explicit formation of the matrix ATA should be avoided.
2. The residual r = b − Ax should be recurred instead of the residual s = AT r of the normal
equations. This is crucial for stability because of the cancellation that occurs in r before
multiplication by AT .
The resulting method, here called CGLS, appeared as Algorithm (10:2) in the original paper by
Hestenes and Stiefel [608, 1952].9
9 The same method is called CGNR by Saad [956, 1996] and GCG-LS by Axelsson [48, 1994].

Algorithm 6.2.2 (CGLS).


function [x,r,sts] = cgls(A,b,x0,maxit)
% CGLS performs at most maxit CG iterations
% for the normal equations A'Ax = A'b.
% -----------------------------------------
x = x0; r = b - A*x;
s = A'*r; p = s; sts = s'*s;
for k = 0:maxit
    if sts == 0, break; end
    q = A*p;
    alpha = sts/(q'*q);
    x = x + alpha*p;
    r = r - alpha*q;
    s = A'*r;
    stsold = sts; sts = s'*s;
    beta = sts/stsold;
    p = s + beta*p;
end
end

Each iteration of CGLS requires two matrix-vector products, one with A and the other with
AT , as well as two inner products or vector updates of length m and three of length n. Storage
is needed for two m-vectors r, q and two n-vectors x, p. (Note that s can share storage with q.)
When rank(A) < n the least squares solution is not unique. However, it is easily verified
that if x0 ∈ R(AT ), e.g., x0 = 0, then in CGLS xk ∈ R(AT ), k = 0, 1, 2, . . . . Hence, in exact
arithmetic, CGLS terminates with the pseudoinverse solution x† = A† b ∈ R(AT ). We conclude
that in theory CGLS works for least squares problems of any rank and shape, overdetermined
as well as underdetermined. A version of CGLS that solves the regularized normal equations
(AT A + µ2 I)x = AT b is given in Section 6.4.2.
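As a usage illustration (ours), the cgls function above can be applied to a small random overdetermined problem and compared with the backslash solution:

    % Usage sketch for cgls (assumes the function above is on the path).
    m = 500; n = 50;
    A = randn(m,n); b = randn(m,1);
    [x,r,sts] = cgls(A,b,zeros(n,1),60);
    norm(x - A\b)/norm(A\b)       % should be close to machine precision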
For a linear model Ax = b + e with positive definite covariance matrix σ 2 V ∈ Rm×m , the
generalized normal equations in factored form are AT V −1 (b − Ax) = 0. These can be solved
by the following generalized version of CGLS.

Algorithm 6.2.3 (GCGLS).


function [x,r] = gcgls(A,V,b,x0,maxit)
% GCGLS performs at most maxit CG iterations for the
% generalized normal equations A'*inv(V)*(b - A*x) = 0
% with SPD covariance matrix V.
% -----------------------------------------
x = x0; r = b - A*x;
y = V\r; s = A'*y; p = s;
sts = s'*s;
for k = 0:maxit
    if sts == 0, break; end
    q = A*p; t = V\q;
    alpha = sts/(q'*t);
    x = x + alpha*p;
    r = r - alpha*q;        % keep the returned residual r = b - A*x updated
    y = y - alpha*t;
    s = A'*y;
    stsold = sts; sts = s'*s;
    beta = sts/stsold;
    p = s + beta*p;
end
end

For an underdetermined system Ax = b of full row rank, the problem

    \min_x ∥x∥_2^2    subject to Ax = b   (6.2.15)

has a unique solution that satisfies

AAT z = b, x = AT z. (6.2.16)

By assumption, AAT is symmetric positive definite, so CG can be applied to AAT z = b. With


z0 = 0 this generates approximations

xk = AT zk , zk ∈ Kk (AAT , b).

Eliminating z_k , we obtain algorithm CGME expressed in terms of x_k . The generated approximations minimize

    ∥z† − z_k∥_{AA^T} = ∥A^T(z† − z_k)∥_2 = ∥x† − x_k∥_2 .   (6.2.17)

Note that x_k lies in the same Krylov subspace as the x_k in CGLS but minimizes a different error norm, namely ∥x_k − x†∥_2 .
By construction, the error norm ∥xk −x† ∥2 for CGME decreases monotonically, but the resid-
ual norm ∥rk ∥2 can oscillate. Because the stopping criterion for consistent systems is usually
based on the size of ∥rk ∥2 , it may often be preferable to use CGLS also for underdetermined sys-
tems. When applied to an inconsistent system Ax = b, CGME may break down; see Choi [244,
2006, Sect. 2.2.1].

Algorithm 6.2.4 (CGME).

function [x,r] = cgme(A,b,x0,maxit)
% CGME performs at most maxit steps of Craig's
% algorithm on a consistent linear system Ax = b.
% -----------------------------------------------
x = x0; r = b - A*x;
nrm = r'*r; p = r;
for k = 1:maxit
    if nrm == 0, break; end
    q = A'*p;
    alpha = nrm/(q'*q);
    x = x + alpha*q;
    r = r - alpha*(A*q);
    nrmold = nrm; nrm = r'*r;
    beta = nrm/nrmold;
    p = r + beta*p;
end
end

For A ∈ Rm×n , CGME needs storage for two vectors x and q of length n and two vectors r
and p of length m. Three inner products or vector updates of length n and two of length m are
required per step.
The vector p in CGME can be eliminated. Then the algorithm becomes identical to an al-
gorithm due to Craig [277, 1955]; see Saad [956, 1996, Sect. 8.3.2]. We prefer to keep the
algorithm in the form given above, because this makes it possible to include a regularization
term; see Section 6.4.2.
In exact arithmetic, CGLS generates the sequence of Krylov subspace approximations xk ,
k = 1, 2, . . . , defined in Section 4.2.3 for solving ATAx = ATb. By Theorem 4.2.4, CGLS
terminates after at most r = rank(A) steps with the pseudoinverse solution x† = A† b. More
precisely, if A has p distinct (possibly multiple) nonzero singular values σ1 > σ2 > · · · > σp ,
then in exact arithmetic, CGLS terminates after p steps. For example, if A is the unit matrix plus
a matrix of rank p, at most p+1 steps are needed. Even fewer steps are required if b is orthogonal to some of the left singular vectors u_i corresponding to σ_i . If the original system is such that the exact solution x = \sum_{i=1}^{n} (c_i/σ_i) v_i , c_i = u_i^T b, has small projections c_i onto singular vectors u_i , i > p, then p steps can be expected to give good approximations. However, the intermediate
Krylov subspace approximations xk depend nonlinearly on A and b in a highly complicated way.
We now derive an upper bound for the number of iterations needed to reduce the residual
norm by a certain amount. We assume exact arithmetic, but the bound holds also for finite-
precision computation. The residual of the normal equation can be written

sk = I − ATAPk−1 (ATA) AT r0 = Rk (ATA)AT r0 ,




Pk
where Pk−1 is a polynomial of degree k − 1, and Rk (λ) = i=0 cki λi is a residual polynomial,
i.e., Rk (1) = 1. Let S contain all the nonzero singular values σ of A, and assume that for some
residual polynomial R̃k we have

max |R̃k (σ 2 )| ≤ Mk .
σ∈S

Then from the minimum property (6.2.12) of xk , it follows that


n
X
∥sk ∥2(ATA)−1 2
= ∥r − rk ∥ ≤ Mk2 γi2 σi−2 = Mk2 ∥s0 ∥2(ATA)−1 .
i=1

We can now select a set S on the basis of some assumption regarding the singular value distribu-
tion of A and seek a polynomial R̃k ∈ Π̃1k such that Mk = maxσ∈S |R̃k (σ 2 )| is small. A simple
choice is to take S to be the interval [σn2 , σ12 ] and seek the polynomial R̃k ∈ Π̃1k that minimizes

    \max_{σ_n^2 ≤ σ^2 ≤ σ_1^2} |R̃_k(σ^2)|.

The solution to this problem is given by the shifted Chebyshev polynomials introduced in the analysis of the CSI method in Section 6.1.6. This gives the following upper bound for the norm of the residual error after k steps:

    ∥r − r_k∥_2 ≤ 2 ((κ(A) − 1)/(κ(A) + 1))^k ∥r − r_0∥_2 ,    k = 0, 1, 2, . . . .   (6.2.18)
It follows that an upper bound on the number of iterations k needed to reduce the relative error by a factor ϵ is given by

    ∥r − r_k∥_2 / ∥r − r_0∥_2 < ϵ    if    k > (1/2) κ(A) log(2/ϵ).   (6.2.19)

This is the same as for the CSI method and the second-order Richardson method. However, these
methods require that accurate lower and upper bounds for the singular values of A be known.
Furthermore, the estimate (6.2.19) tends to be sharp asymptotically for CSI, while for CGLS the
error usually decreases much faster. On the other hand, the inner products in CGLS can be a
bottleneck when implemented on parallel computers.
Although useful in the analysis of many model problems, the bound (6.2.18) in terms of κ(A)
cannot be expected to describe the highly nonlinear complexity of the convergence behavior of
CGLS. The convergence depends on the distribution of all singular values of A, as well as on
the projection of the right-hand side b onto the left singular vectors of A. In practice the rate of
convergence often accelerates as the number of steps increases.
In floating-point arithmetic the finite termination property no longer holds, and it can take
much more than n steps before the desired final accuracy is reached; see Section 6.2.6.

Notes and references


The theory of the CG method was published in the seminal paper by Hestenes and Stiefel [608,
1952]. The story of its development is recounted by Hestenes [607, 1990]. One of the first
applications of CGLS to a least squares problem was for solving geodetic network problems; see
Stiefel [1035, 1952]. The early history of Krylov subspace methods is documented by Golub and
O’Leary [502, 1989]. Early discussions of CG methods for least squares problems are given by
Lawson [726, 1973] and Chen [242, 1975]. A more recent survey of Krylov subspace methods
is given by van der Vorst [1075, 2003]. Modern treatments of the Lanczos and CG methods are
found in Greenbaum [534, 1997], Meurant [791, 2006], Meurant and Strakoš [792, 2006], and
Liesen and Strakoš [746, 2012].

6.2.3 Preconditioned Iterative Methods


The rate of convergence of iterative least squares methods can be very slow, and the use of a
preconditioner to accelerate convergence is often essential. For a linear system Ax = b, let
A = B − E be a splitting such that B is nonsingular and ∥E∥ is small. Then the linear systems
B −1 Ax = (I − B −1 E)x = c, Bc = b, (6.2.20)

AB −1 y = (I − EB −1 )Bx = b, Bx = y, (6.2.21)
are the left- and right-preconditioned systems. If B is chosen so that B −1 A or AB −1 is better
conditioned than A, faster convergence may be expected when the iterative method is applied to
one of the preconditioned systems. Note that the products B −1 A and AB −1 need not be formed
explicitly, since iterative methods only require that matrix-vector products such as B −1 (Ax)
and A(B −1 y) can be formed for arbitrary x and y. This is possible if linear systems with B
can be solved. Preconditioned iterative methods may be viewed as a compromise between a
direct and an iterative solver. To be efficient, a preconditioner should have the following, partly
contradictory properties:
• The norm of the defect matrix E should be small, and AB −1 (B −1 A) should be better
conditioned than A and have well clustered eigenvalues.
• Linear systems with matrices B and B T should be cheap to solve, and B should not have
many more nonzero elements than A.
For solving a least squares problem minx ∥Ax − b∥2 the right-preconditioned problem
    \min_y ∥AB^{−1} y − b∥_2 ,    y = Bx,   (6.2.22)

should be used, because a left-preconditioner would change the objective function. This is equiv-
alent to a symmetric preconditioning of the normal equations:
B −TATAB −1 y = B −TAT b, x = B −1 y. (6.2.23)
For the preconditioned CGLS (PCGLS) method the approximations minimize the error func-
tional ∥rk ∥22 , rk = b − Axk , over the Krylov subspace
xk − x0 ∈ Kk (B −T ATAB −1 , B −T AT r0 ). (6.2.24)
For PCGLS, the required matrix-vector product u = AB −1 y is computed by solving Bw = y
and forming u = Aw. Similarly, v = B −T AT z is computed by solving B T v = AT z. Hence,
the extra work is one solve with B and one with B T . Below we give an implementation of
preconditioned CGLS (PCGLS).

Algorithm 6.2.5 (PCGLS).


function [x,r,s] = pcgls(A,b,B,maxit)
% PCGLS solves min ||Ax - b||_2 using CGLS with
% right preconditioner B (y = B*x).
% -----------------------------------
x = 0; r = b;
s = B'\(A'*r);
p = s; sts = s'*s;
for k = 0:maxit
    t = B\p; q = A*t;
    alpha = sts/(q'*q);
    x = x + alpha*t;
    r = r - alpha*q;
    s = B'\(A'*r);
    stsold = sts; sts = s'*s;
    beta = sts/stsold;
    p = s + beta*p;
end
end

A simple preconditioner for the normal equations is the diagonal matrix B = diag (d1 , . . . , dn ), where dj = ∥aj ∥2 . Then the columns of the preconditioned matrix AB −1 will have
unit length. By Theorem 2.1.2 this preconditioner approximately minimizes κ2 (AD−1 ) over all
diagonal D > 0. Using this preconditioner can significantly improve the convergence rate with
almost no cost in terms of time and memory. The column norms can be obtained cheaply if A is
a sparse matrix (stored columnwise or rowwise). For CGLS the iterations are usually terminated
when ∥AT rk ∥2 /(∥A∥2 ∥rk ∥2 ) ≤ η, where η is a small tolerance (see stopping criteria (6.2.59)
and (6.2.60)). This guarantees a backward stable solution. In PCGLS, ∥rk ∥2 and ∥AB −1 ∥ can
be estimated, but usually not ∥A∥2 . Instead, we use the stopping criterion

    ∥(AB^{−1})^T r_k∥_2 / (∥AB^{−1}∥_2 ∥r_k∥_2) ≤ η.
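The column-scaling preconditioner can be built in a few lines; the fragment below (our illustration) forms B = diag(∥a_1∥_2 , . . . , ∥a_n∥_2) for a sparse A and passes it to pcgls:

    % Build the diagonal (column-scaling) preconditioner and run PCGLS.
    % Illustrative fragment; A is assumed to have no zero columns.
    n = size(A,2);
    d = full(sqrt(sum(A.^2,1)))';      % d_j = ||a_j||_2
    B = spdiags(d,0,n,n);              % diagonal preconditioner
    [x,r,s] = pcgls(A,b,B,100);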
To solve a consistent underdetermined problem minx ∥x∥2 subject to Ax = b, we can apply
CGME (Craig’s method) to the left-preconditioned problem
    \min_x ∥x∥_2    subject to B^{−1}Ax = B^{−1}b.   (6.2.25)

This iteration method minimizes the error functional ∥x − x_k∥_2 over the Krylov subspaces

    x − x_k ∈ K_k(A^T(BB^T)^{−1}A, A^T(BB^T)^{−1}b).   (6.2.26)
Note that although the residual vectors are transformed, the algorithm can be formulated in terms
of the original residuals rk = b − Axk .

Algorithm 6.2.6 (PCGME).


function [x,r] = pcgme(A,B,b,maxit)
% PCGME solves the consistent system Ax = b (minimum norm
% solution) with left preconditioner B.
% -----------------------------------
x = 0; r = b;
z = B\r; nrm = z'*z;
p = z;
for k = 1:maxit
    if nrm == 0, break; end
    q = A'*(B'\p);
    alpha = nrm/(q'*q);
    x = x + alpha*q;
    r = r - alpha*(A*q);
    z = B\r;
    nrmold = nrm; nrm = z'*z;
    beta = nrm/nrmold;
    p = z + beta*p;
end
end

Preconditioners for least squares problems are treated in Section 6.3. Benzi [106, 2002]
gives an excellent survey of preconditioning techniques for the iterative solution of large linear
systems, with a focus on algebraic methods for general sparse matrices. Wathen [1095, 2015] and
Pearson and Pestana [888, 2020] survey a range of preconditioners for use with partial differential
equations and optimization problems and for other purposes as well.

6.2.4 The Lanczos Process


Lanczos [715, 1950] developed a method for computing selected eigenvalues and eigenvectors
of a Hermitian matrix A ∈ Cn×n . The method is based on the Lanczos process, a recursive
algorithm for reducing a Hermitian matrix A ∈ Cn×n to tridiagonal form by a unitary similarity,
    U_n^H A U_n = T_n = \begin{pmatrix} α_1 & β_2 & & & \\ β_2 & α_2 & β_3 & & \\ & β_3 & \ddots & \ddots & \\ & & \ddots & α_{n−1} & β_n \\ & & & β_n & α_n \end{pmatrix} ∈ R^{n×n},   (6.2.27)
where Un = (u1 , u2 , . . . , un ). (Note that αk and βk are not the same as in the CG method.)
Taking β1 = βn+1 = 0 and equating columns in AUn = Un Tn gives
Auk = βk uk−1 + αk uk + βk+1 uk+1 , k = 1 : n. (6.2.28)
The requirement that u_k ⊥ u_{k−1} and u_{k+1} ⊥ u_k yields α_1 = u_1^H Au_1 and

    α_k = u_k^H v_k ,    v_k = Au_k − β_k u_{k−1} ,    k = 2, . . . , n.   (6.2.29)

Solving (6.2.28) for uk+1 gives βk+1 uk+1 = rk and rk = vk − αk uk . If rk = 0, the process
stops. Otherwise,
βk+1 = ∥rk ∥2 , uk+1 = rk /βk+1 . (6.2.30)
Thus, as long as all βk ̸= 0, the elements in the tridiagonal matrix Tk and the unitary matrix
Uk+1 ∈ Kk+1 (A, u1 ) are uniquely determined. Furthermore, it holds that

    AU_k = U_k T_k + β_{k+1} u_{k+1} e_k^T = U_{k+1} T̂_k ,    T̂_k = \begin{pmatrix} T_k \\ β_{k+1} e_k^T \end{pmatrix},   (6.2.31)
which is the Lanczos decomposition. It is easy to verify that Uk is an orthonormal basis in the
Krylov subspace Kk (A, b).
The Lanczos process requires storage for Tk and three n-vectors uk−1 , uk , and uk+1 . The
eigenvalues of Tk are approximations to the eigenvalues of A. The process stops when βk+1 = 0.
Then by (6.2.31), AUk = Uk Tk , i.e., the columns of Uk span an invariant subspace of A.
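A MATLAB sketch (ours) of the Lanczos process (6.2.28)–(6.2.30) is given below; it returns the tridiagonal matrix T_k and the basis U_k, stopping early if some β_{k+1} vanishes. Reorthogonalization, which is needed in practice to counteract rounding errors, is omitted.

    function [U,T] = lanczos(A,b,k)
    % Sketch of the Lanczos process for a Hermitian matrix A:
    % A*U_k = U_k*T_k + beta_{k+1}*u_{k+1}*e_k'.
    n = length(b);
    U = zeros(n,k); T = zeros(k,k);
    u = b/norm(b); uold = zeros(n,1); beta = 0;
    for j = 1:k
        U(:,j) = u;
        v = A*u - beta*uold;           % v_j = A*u_j - beta_j*u_{j-1}
        alpha = u'*v; T(j,j) = alpha;
        r = v - alpha*u;               % r_j = v_j - alpha_j*u_j
        beta = norm(r);
        if j < k
            if beta == 0, U = U(:,1:j); T = T(1:j,1:j); return; end
            T(j,j+1) = beta; T(j+1,j) = beta;
            uold = u; u = r/beta;
        end
    end
    end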
Lanczos [716, 1952] noted that his process can be used to solve positive definite systems of
linear equations by a method he called the method of minimized iterations. With
β1 = ∥b∥2 , u1 = b/β1 ,
an approximate solution xk = Uk yk ∈ Kk (A, b) is determined by the Galerkin condition
UkH rk = 0, rk = b − Axk .
The Lanczos decomposition (6.2.31) gives
rk = β1 u1 − AUk yk = Uk+1 (β1 e1 − T̂k yk ). (6.2.32)
If y_k is determined from T_k y_k = β_1 e_1 , it follows that r_k = −(e_k^T y_k) β_{k+1} u_{k+1} and U_k^H r_k = 0
as required. Because A is positive definite, so is Tk . Hence the Cholesky factorization Tk =
Lk LTk exists, with
    L_k = \begin{pmatrix} l_{11} & & & & \\ l_{21} & l_{22} & & & \\ & l_{32} & l_{33} & & \\ & & \ddots & \ddots & \\ & & & l_{k,k−1} & l_{kk} \end{pmatrix}.
Because Lk is the k × k principal submatrix of Lk+1 , the Cholesky factorization can be cheaply
updated. The equation Tk yk = β1 e1 is equivalent to the bidiagonal equations
    L_k z_k = β_1 e_1 ,    L_k^T y_k = z_k .

It follows that

    z_k = \begin{pmatrix} z_{k−1} \\ ξ_k \end{pmatrix},    ξ_k = −l_{k,k−1} ξ_{k−1} /l_{kk} .

If we define P_k from L_k P_k^T = U_k^T , then

    x_k = U_k y_k = P_k L_k^T y_k = P_k z_k ,

and l_{k,k−1} p_{k−1} + l_{kk} p_k = u_k . Hence,

    x_k = x_{k−1} + ξ_k p_k ,    p_k = (u_k − l_{k,k−1} p_{k−1})/l_{kk}
can be obtained without saving all the vectors u1 , . . . , uk or computing yk .
In exact arithmetic, this method computes the same sequence of approximations xk ∈
Kk (A, b) as CG and is therefore often called the Lanczos-CG method. The residual vectors
r0 , . . . , rk−1 in CG are mutually orthogonal and form a basis for the Krylov space Kk (A, b).
Hence by uniqueness, the columns of Uk in the Lanczos process equal the residual vectors nor-
malized to unit length.

Notes and references


The Lanczos process is closely connected to the theory of orthogonal polynomials, Gauss–
Christoffel quadrature, and the Stieltjes moment problem; see Golub and Meurant [501, 1994]
and Liesen and Strakoš [746, 2012]. Let Pn−1 be the space of polynomials of degree at most
n − 1, equipped with the inner product

⟨p, q⟩u1 = (p(A)u1 )H q(A)u1 . (6.2.33)

Then the Lanczos process is just the Stieltjes algorithm for computing the corresponding se-
quence of orthogonal polynomials. The vectors uk are of the form qk−1 (A)u1 , and the orthog-
onality of these vectors translates into the orthogonality of the polynomials with respect to the
inner product (6.2.33).

6.2.5 LSQR and Related Algorithms


The LSQR algorithm of Paige and Saunders [866, 1982] is a Lanczos-type algorithm for solv-
ing least squares problems. LSQR uses the Bidiag1 variant of GKL bidiagonalization; see Sec-
tion 4.2.3. Starting from
β1 u1 = b, α1 v1 = AT u1 ,
Bidiag1 generates two orthonormal sequences of unit vectors u1 , u2 , . . . in Rm and v1 , v2 , . . .
in Rn ,
βk+1 uk+1 = Avk − αk uk , (6.2.34)
T
αk+1 vk+1 = A uk+1 − βk+1 uk , k = 1, 2, . . . , (6.2.35)

where the scalars αk and βk are normalization constants. The recurrences can be summarized as

AVk = Uk+1 Bk , AT Uk+1 = Vk BkT + αk+1 vk+1 eTk+1 , (6.2.36)

where Vk = (v1 , . . . , vk ), Uk+1 = (u1 , . . . , uk+1 ), and

    B_k = \begin{pmatrix} α_1 & & & \\ β_2 & α_2 & & \\ & β_3 & \ddots & \\ & & \ddots & α_k \\ & & & β_{k+1} \end{pmatrix} ∈ R^{(k+1)×k}   (6.2.37)

is lower bidiagonal. The columns of Uk and Vk are orthonormal bases for the Krylov subspaces

Kk (AAT, b) and Kk (ATA, AT b), k = 1, 2, . . . .

Approximations xk = Vk yk ∈ Kk (ATA, AT b) are obtained as follows. By construction, b =


β1 u1 = β1 Uk+1 e1 . With xk = Vk yk and using (6.2.36) we get

rk = β1 u1 − AVk yk = Uk+1 tk+1 , tk+1 = β1 e1 − Bk yk . (6.2.38)

From the orthogonality of Uk+1 and Vk it follows that ∥rk ∥2 = ∥tk+1 ∥2 is minimized over all
xk ∈ span(Vk ) if yk solves the bidiagonal least squares subproblem

    \min_{y_k} ∥B_k y_k − β_1 e_1∥_2 .   (6.2.39)

The special form of the right-hand side β1 e1 holds because the starting vector was taken as
b. By uniqueness, it follows that in exact arithmetic, this generates the same Krylov subspace
approximations xk ∈ Kk (ATA, AT b) as CGLS.
Subproblem (6.2.39) can be solved by the QR factorization

    Q_k (B_k    β_1 e_1) = \begin{pmatrix} R_k & f_k \\ 0 & ϕ̄_{k+1} \end{pmatrix},   (6.2.40)

where

    R_k = \begin{pmatrix} ρ_1 & γ_2 & & & \\ & ρ_2 & γ_3 & & \\ & & \ddots & \ddots & \\ & & & ρ_{k−1} & γ_k \\ & & & & ρ_k \end{pmatrix},    f_k = \begin{pmatrix} ϕ_1 \\ ϕ_2 \\ \vdots \\ ϕ_{k−1} \\ ϕ_k \end{pmatrix}.   (6.2.41)
The matrix Qk = Gk,k+1 Gk−1,k · · · G12 is a product of plane rotations chosen to eliminate the
subdiagonal elements β2 , . . . , βk+1 of Bk . The solution yk satisfies

Rk yk = fk . (6.2.42)

In exact arithmetic, the LSQR approximations xk = Vk yk are the same as for CGLS.
The factorization (6.2.40) can be computed by a recurrence relation. Assume that we have
computed the QR factorization of Bk−1 , so that

    Q_{k−1} \begin{pmatrix} 0 & β_1 e_1 \\ α_k & 0 \\ β_{k+1} & 0 \end{pmatrix} = \begin{pmatrix} γ_k e_{k−1} & f_{k−1} \\ ρ̄_k & ϕ̄_k \\ β_{k+1} & 0 \end{pmatrix}.   (6.2.43)

To eliminate β_{k+1} in the kth column of B_k , a plane rotation G_{k,k+1} is determined:

    G_{k,k+1} \begin{pmatrix} ρ̄_k & ϕ̄_k \\ β_{k+1} & 0 \end{pmatrix} = \begin{pmatrix} ρ_k & ϕ_k \\ 0 & ϕ̄_{k+1} \end{pmatrix}.   (6.2.44)

(Note that the previous rotations Gk−2,k−1 , . . . , G12 do not act on the kth column.)
If xk were formed as xk = Vk yk , it would be necessary to save (or recompute) the vectors
v1 , . . . , vk . This can be avoided by defining Zk from the triangular system Zk Rk = Vk , so that

xk = Zk Rk yk ≡ Zk fk .

The columns of Zk = (z1 , . . . , zk ) can be found successively by identifying columns in Zk Rk =


Vk with Rk as in (6.2.41). This yields

v1 = ρ1 z1 , vk = γk zk−1 + ρk zk , k > 1.

If ρ_k ̸= 0, we obtain the recursion z_0 = 0, x_0 = 0,

    z_k = (1/ρ_k)(v_k − γ_k z_{k−1}),    x_k = x_{k−1} + ϕ_k z_k .
Only one extra n-vector needs to be stored. Some work can be saved by using the vectors
wk ≡ ρk zk instead of zk . A basic MATLAB implementation of the LSQR is given below. Note
that all quantities needed to update the approximation xk−1 are computed by forward recurrence
relations, where only the last term needs to be saved.

For use in stopping criteria, LSQR computes the estimates


∥rk ∥2 = ϕ̄k+1 = β1 sk sk−1 · · · s1 , ∥AT rk ∥2 = ϕ̄k+1 αk+1 |ck |. (6.2.45)
These are obtained essentially for free. An estimate of ∥xk ∥2 can be obtained from a QR factor-
ization of RkT ; see Paige and Saunders [866, 1982, Section 5.2].

Algorithm 6.2.7 (LSQR).


function [x,nr,ns] = lsqr(A,b,k)
% LSQR for solving Ax = b or min||Ax - b||_2.
% -------------------------------------------
[m,n] = size(A); x = zeros(n,1);
beta = norm(b); u = (1/beta)*b;
v = A'*u; alpha = norm(v);
v = (1/alpha)*v; w = v;
rhobar = alpha; phibar = beta;
for i = 1:k
    % Continue bidiagonalization
    u = A*v - alpha*u;
    beta = norm(u); u = (1/beta)*u;
    v = A'*u - beta*v;
    alpha = norm(v); v = (1/alpha)*v;
    % Construct and apply i:th plane rotation
    rho = norm([rhobar,beta]);
    c = rhobar/rho; s = beta/rho;
    gamma = s*alpha; rhobar = -c*alpha;
    phi = c*phibar; phibar = s*phibar;
    % Update the solution and residual norms
    x = x + (phi/rho)*w;
    w = v - (gamma/rho)*w;
    nr = phibar;
    ns = nr*alpha*abs(c);
end
end

LSQR requires 3m + 5n multiplications and storage of two m-vectors u and Av and three
n-vectors x, v, and w. This can be compared to CGLS, which requires 2m + 3n multiplications,
two m-vectors, and two n-vectors. Unlike in CGLS, the residual vector rk = Uk+1 tk+1 is not
computed in LSQR.
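As a usage illustration (ours), the function above can be run on a small random least squares problem; note that it shadows MATLAB's built-in lsqr when placed on the path.

    % Usage sketch for the lsqr function listed above.
    m = 500; n = 50;
    A = randn(m,n); b = randn(m,1);
    [x,nr,ns] = lsqr(A,b,60);
    norm(x - A\b)/norm(A\b)       % should be close to machine precision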
Benbow [104, 1999] generalized GKL lower bidiagonalization and LSQR to solve the gen-
eralized normal equations AT M −1 Ax = AT M −1 b, where M = LLT . When vectors ũi = Lui
are introduced into LSQR, only matrix-vector operations with A, AT , and M −1 are needed:
Avi and AT L−T L−1 ui+1 = AT M −1 ũi+1 .
Algorithm LSME of Paige [852, 1974] is an algorithm for solving consistent systems Ax = b
using the same bidiagonalization process as LSQR. Let
    L_k = \begin{pmatrix} α_1 & & & & \\ β_2 & α_2 & & & \\ & β_3 & α_3 & & \\ & & \ddots & \ddots & \\ & & & β_k & α_k \end{pmatrix} ∈ R^{k×k}   (6.2.46)

be the lower bidiagonal matrix consisting of the first k rows of Bk in (6.2.37). Then the bidiag-
onalization recurrence relations (6.2.36) can be written as
AVk = Uk Lk + βk+1 uk+1 eTk , AT Uk = Vk LTk . (6.2.47)
The LSME approximations x_k are given by

    x_k = V_k z_k ,    z_k = β_1 L_k^{−1} e_1 ,   (6.2.48)
where zk and xk are obtained by the recurrences
ζk = −(βk /αk )ζk−1 , xk = xk−1 + ζk vk . (6.2.49)
Since the increments v_k form an orthogonal set, the step lengths are bounded by |ζ_k| ≤ ∥x_k∥_2 ≤ ∥x∥_2 . The LSME approximations x_k lie in K_k(ATA, AT b) and minimize ∥x† − x_k∥_2 . By uniqueness, LSME is mathematically equivalent to Craig's method and CGME.

Algorithm 6.2.8 (LSME).


function [x,nr] = lsme(A,b,maxit)
% LSME for solving a consistent system Ax = b.
% nr(k) holds the residual norm |zeta_k*beta_{k+1}|.
% -----------------------------------
x = 0; beta = norm(b);
u = (1/beta)*b;
v = A'*u; alpha = norm(v);
v = (1/alpha)*v; zeta = -1;
for k = 1:maxit
    zeta = -(beta/alpha)*zeta;
    x = x + zeta*v;
    u = A*v - alpha*u;
    beta = norm(u); u = (1/beta)*u;
    v = A'*u - beta*v;
    alpha = norm(v); v = (1/alpha)*v;
    nr(k) = abs(zeta*beta);
end
end

The residual vector rk = b − Axk is obtained from


rk = b − AVk zk = −ζk βk+1 uk+1 , ζk = eTk zk .
LSME needs storage for two n-vectors x, v and one m-vector u. It requires five scalar prod-
ucts or vector updates of length n and three of length m per step. For LSME (and Craig’s method)
the error functionals ∥rk ∥2 and ∥AT rk ∥2 will not decrease monotonically. LSME has the advan-
tage over CGME that it is easily possible to transfer to the LSQR approximation at any step; see
Saunders [965, 1995]. As for CGLS and LSQR, almost no additional work or storage is required
to apply LSME to regularized problems.
For LSQR the residual norms ∥rk ∥2 decrease monotonically, but ∥AT rk ∥2 in general oscil-
lates. The oscillations can be large for ill-conditioned problems. Such behavior is undesirable,
because practical stopping criteria are based on ∥AT rk ∥2 ; see Section 6.2.6. The LSMR algo-
rithm by Fong and Saunders [417, 2011] uses the same Bidiag1 process as LSQR but computes
the Krylov subspace approximations that solve
    \min_{x_k} ∥AT(b − Ax_k)∥_2 ,    x_k = V_k y_k .   (6.2.50)

Hence for LSMR, ∥AT rk ∥2 is monotonically decreasing by design. This allows the iterations to
be terminated more safely. The residual norm ∥rk ∥2 of LSQR will be smaller but, usually, not
by much.
We now describe how subproblems (6.2.50) can be solved efficiently. After k steps, we have

AVk = Uk+1 Bk , AT Uk+1 = Vk+1 LTk+1 ,

where Bk is given as in (6.2.37), and Lk+1 = ( Bk αk+1 ek+1 ). Also,

rk = b − AVk yk = Uk+1 (β1 e1 − Bk yk ).

From b = β_1 u_1 and AT u_1 = α_1 v_1 , we obtain AT b = β_1 AT u_1 = β̄_1 v_1 , where β̄_k ≡ α_k β_k . Hence Ax_k = AV_k y_k = U_{k+1} B_k y_k , and

    AT A V_k y_k = AT U_{k+1} B_k y_k = V_{k+1} L_{k+1}^T B_k y_k   (6.2.51)
                 = V_{k+1} \begin{pmatrix} B_k^T \\ α_{k+1} e_{k+1}^T \end{pmatrix} B_k y_k .   (6.2.52)

Because e_{k+1}^T B_k = β_{k+1} e_k^T , the LSMR subproblem becomes

    \min_{y_k} ∥AT r_k∥_2 = \min_{y_k} \left\| β̄_1 e_1 − \begin{pmatrix} B_k^T B_k \\ β̄_{k+1} e_k^T \end{pmatrix} y_k \right\|_2 .   (6.2.53)

To solve this, a sequence of plane rotations is first used to compute the QR factorization

    Q_k B_k = \begin{pmatrix} R_k \\ 0 \end{pmatrix}   (6.2.54)

as in LSQR. Here R_k is upper bidiagonal, and R_k^T R_k = B_k^T B_k . If R_k^T q_k = β̄_{k+1} e_k , then q_k = (β̄_{k+1}/ρ_k) e_k ≡ φ_k e_k . If we define t_k = R_k y_k , subproblem (6.2.53) can be written

    \min_{y_k} ∥AT r_k∥_2 = \min_{t_k} \left\| β̄_1 e_1 − \begin{pmatrix} R_k^T \\ φ_k e_k^T \end{pmatrix} t_k \right\|_2 ,   (6.2.55)

Next, perform the QR factorization

    Q̄_k \begin{pmatrix} R_k^T & β̄_1 e_1 \\ φ_k e_k^T & 0 \end{pmatrix} = \begin{pmatrix} R̄_k & z_k \\ 0 & ζ_{k+1} \end{pmatrix},   (6.2.56)

where R̄_k is upper bidiagonal. The subproblem can now be written

    \min_{y_k} ∥AT r_k∥_2 = \min_{t_k} \left\| \begin{pmatrix} z_k \\ ζ_{k+1} \end{pmatrix} − \begin{pmatrix} R̄_k \\ 0 \end{pmatrix} t_k \right\|_2 .   (6.2.57)

The solution tk is obtained by solving R̄k tk = zk , and ∥AT rk ∥2 = |ζk+1 | is monotonically decreasing.
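In MATLAB, each of the individual elimination steps used to update these QR factors can be carried out with the built-in function planerot; a minimal sketch, where rho denotes the current diagonal element and beta the subdiagonal element to be annihilated (both names are illustrative):

[G, y] = planerot([rho; beta]);   % G*[rho; beta] = [sqrt(rho^2+beta^2); 0]
rho_new = y(1);                   % updated diagonal element; G is applied to the remaining row entries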

Let Wk = (w1 , . . . , wk ) and W̄k = (w̄1 , . . . , w̄k ) be computed by forward substitution from
RkT WkT = VkT and R̄kT W̄kT = WkT . Then, from

xk = Vk yk , Rk yk = tk , R̄k tk = zk ,

we have
xk = Wk Rk yk = Wk tk = W̄k R̄k tk = W̄k zk = xk−1 + ζk w̄k .
As for LSQR, all quantities needed to update the approximation xk−1 can be computed by for-
ward recurrence relations, where only the last term needs to be saved. Also, ∥rk ∥2 and ∥xk ∥2
can be obtained by using formulas that can be updated cheaply. For details see Fong and Saun-
ders [417, 2011], where a detailed pseudocode for LSMR is given. LSMR shares with LSQR the
property that for rank-deficient problems, it terminates with the least-norm solution.
LSMR requires storage for one m-vector and seven n-vectors. The number of floating-point
multiplications is 8n per iteration step. The corresponding figures for LSQR are two m-vectors,
three n-vectors and 3m + 5n, respectively. LSMR is easily extended to solve regularized least
squares problems (6.4.8).
The three algorithms LSME, LSQR, and LSMR all use the same bidiagonalization algorithm
and generate approximations in the same Krylov subspaces xk ∈ Kk (ATA, AT b). The algo-
rithms minimize different error quantities, ∥xk − x∥, ∥rk − r∥, and ∥AT (rk − r)∥, respectively
(where the last two are equivalent to minimizing ∥rk ∥2 and ∥AT rk ∥2 ). LSME can only be used
for consistent systems. LSMR is the only algorithm for which all three error measures are mono-
tonically decreasing. This makes LSMR the method of choice in terms of stability. However,
at any given iteration, LSQR has a slightly smaller residual norm. It has been suggested that
LSMR should be used until the iterations can be terminated. Then a switch can be made to the
corresponding LSQR solution.
Similar comments apply to CGME and CGLS. A CGLS-type algorithm, CRLS, correspond-
ing to LSMR that minimizes ∥AT rk ∥2 is derived by Björck [129, 1979] by using a modified
inner product in CGLS. The same algorithm is derived by Fong [416, 2011] by applying the
conjugate residual (CR) algorithm to the normal equations ATAx = AT b. Tests have shown that
CRLS achieves much lower final accuracy in xk than CGLS and LSMR, and therefore its use
cannot be recommended.
Fortran, C, Python, and MATLAB implementations of LSQR, LSMR, and other iterative
solvers are available from Systems Optimization Laboratory, Stanford University. Julia imple-
mentations are available at http://dpo.github.io/software/.

6.2.6 Effects of Finite Precision


Due to loss of orthogonality in the computed basis vectors in LSQR and the residual vectors
sk = AT rk in CGLS, the finite termination property is not valid in floating-point arithmetic. In
many practical applications the desired accuracy can be obtained in far less than n iterations.
However, when A is ill-conditioned and b has substantial components along singular vectors
corresponding to small singular values, many more than n iterations may be needed. The first
analysis of the properties of the symmetric Lanczos process in finite-precision arithmetic was
given in the pioneering thesis of Paige [857, 1971]. This predicts that the main effect of finite
precision is a delay in convergence without loss of stability.
In CGLS the residual vector r = b − Ax is recursively computed as rk = rk−1 − αk−1 Apk−1 at
each iteration step. In finite-precision arithmetic, this residual will differ from the true resid-
ual. Greenbaum [534, 1997] gives estimates of the attainable accuracy of recursively computed
residual methods. Extending these results, the limiting accuracy of different implementations of
CGLS- and LSQR-type methods is analyzed by Björck, Elfving, and Strakoš [142, 1998]. They

find that in the limit the computed solution’s accuracy is at least as good as for a backward stable
method.
Greenbaum [533, 1989] observed that Krylov subspace algorithms seem to behave in finite-
precision like the exact algorithm applied to a larger problem Bx̂ = c, where B has many
eigenvalues distributed in tiny intervals about the eigenvalues of A. Hence κ(B) ≈ κ(A), which
could explain why the error bound (6.2.18) has been observed to apply also in finite precision.
Some theoretical properties of CGLS and LSQR, such as monotonic decrease of ∥rk ∥2 , remain
valid in floating-point arithmetic.
Krylov subspace methods, such as CGLS and LSQR, typically “converge” in three phases as
follows (see, e.g., Axelsson [48, 1994]):
1. An initial phase in which convergence depends essentially on the initial vector b and can
be rapid.
2. A middle phase, where the convergence is linear with a rate depending on the spectral
condition number κ(A).
3. A final phase, where the method converges superlinearly, i.e., the rate of convergence
accelerates as the number of steps increases. This may take place after considerably less
than n iterations.
In a particular case, any of these phases can be absent or appear repeatedly. The behavior can
be supported by heuristic arguments about the spectrum of A; see Nevanlinna [829, 1993]. For
example, superlinear convergence in the third phase can be explained by the effects of the smaller
and larger singular values of A being eliminated.
A typical behavior is shown in Figure 6.2.1, which plots ∥x† −xk ∥ and ∥AT rk ∥ for LSQR and
CGLS applied to the sparse least squares problem ILLC1850 from Matrix Market (Boisvert et al.
[160, 1997]). This problem originates from a surveying application and has dimensions 1850 ×
712 with 8636 nonzeros. The condition number is κ(A) = 1.405 × 10^3 , and the inconsistent
right-hand side b is scaled so that the true solution x has unit norm and ∥r∥2 = 0.79 × 10^{−4} .
Paige and Saunders [866, 1982] remark that LSQR is often numerically somewhat more reliable
than CGLS when A is ill-conditioned and many iterations have to be carried out. In Figure 6.2.1

Figure 6.2.1. ∥x† − xk ∥ and ∥AT rk ∥ for problem ILLC1850: LSQR (blue and green solid lines)
and CGLS (black and magenta dash-dot lines).

the plots are similar, and both CGLS and LSQR converge after about 2500 ≈ 3.5n iterations to
a final error ∥x − xk ∥ < 10^{−14} . The superlinear convergence phase is clearly visible. Note that
the oscillations in ∥AT rk ∥ are not caused by the finite precision.
It might be tempting to restart CGLS or LSQR after a certain number of iterations from a
very good approximation to the solution and an accurately computed residual. However, after
such a restart, any previously achieved superlinear convergence is lost and often not recovered
until after many additional iterations.
Ideally, the iterations should be stopped when the backward error in xk is sufficiently small.
Evaluating the expression given in Section 2.5.2 for the optimal backward error is generally too
expensive. In practice, stopping criteria are based on the two upper bounds

∥E1 ∥2 = ∥r − r̄∥2 /∥x̄∥2 , ∥E2 ∥2 = ∥AT r̄∥2 /∥r̄∥2 , (6.2.58)

where x̄ and r̄ are the current approximate solution and residual. This motivates terminating the
iterations as soon as one of the following two conditions are satisfied:

S1 : ∥r̄k ∥2 < ηA ∥A∥∥xk ∥2 + ηb ∥b∥2 , (6.2.59)

S2 : ∥AT r̄k ∥2 < ηA ∥A∥∥rk ∥2 , (6.2.60)

where ηA and ηb are user-specified tolerances. S1 is relevant for consistent problems; otherwise,
S2 is used. Note that it is possible for xk to be a solution of a slightly perturbed least squares
problem and yet for both ∥E1 ∥2 and ∥E2 ∥2 to be orders of magnitude larger than the norm of the
optimal perturbation bound. LSQR and LSMR also terminate if an estimate of κ2 (A) exceeds
a specified limit. This option is useful when A is ill-conditioned or rank-deficient. Because the
stopping criterion S2 is normally used for inconsistent least squares problems, the oscillations in
∥AT rk ∥ that occur for LSQR and CGLS are undesirable. This is one reason to use LSMR, for
which this quantity is monotonically decreasing.
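A minimal MATLAB sketch of how S1 and S2 might be tested inside an iterative solver is given below; normA is assumed to be a user-supplied estimate of ∥A∥, and etaA, etab are the tolerances of (6.2.59)–(6.2.60). All names are illustrative.

% Stopping tests (6.2.59)-(6.2.60) for an approximate solution x.
r  = b - A*x;                                        % current residual
s  = A'*r;
S1 = norm(r) < etaA*normA*norm(x) + etab*norm(b);    % relevant for consistent systems
S2 = norm(s) < etaA*normA*norm(r);                   % used for inconsistent least squares problems
if S1 || S2
    converged = true;
end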
Figure 6.2.2 compares LSQR and LSMR applied to the same problem as before. The fi-
nal accuracy is similar for both algorithms. As predicted by theory, ∥AT rk ∥ is monotonically

Figure 6.2.2. ∥x† − xk ∥ and ∥AT rk ∥ for problem ILLC1850: LSQR (blue and green solid lines)
and LSMR (black and magenta dashed lines).

decreasing for LSMR and always smaller than for LSQR. (In many cases the difference is much
more obvious.) Hence criterion S2 will terminate the iterations sooner than for LSQR. On the
other hand, both ∥x − xk ∥2 and ∥rk ∥2 are typically larger for LSMR than for LSQR.
Fong and Saunders [417, 2011] tested LSQR and LSMR on 127 different least squares
problems of widely varying origin, structure, and size from the SuiteSparse matrix collection
(Davis and Hu [291, 2011]). They make similar observations about the accuracy achieved. With
η2 = 10^{−8} in stopping rule S2, the iterations terminated sooner for LSMR. They suggest the
strategy of running LSMR until termination, then transferring to the LSQR solution. In tests
by Gould and Scott [520, 2017] on 921 problems from the same collection, LSMR had fewer
failures with a given iteration limit and faster execution than LSQR.
If rank(A) < n, CGLS and LSQR in theory will converge to the pseudoinverse solution
provided x0 ∈ R(AT ) (e.g., x0 = 0). In floating-point arithmetic, rounding errors will introduce
a growing component in N (A) in the iterates xk . Initially this component will remain small,
but eventually divergence sets in, and ∥xk ∥ will grow quite rapidly. The vectors xk /∥xk ∥ then
become increasingly close to a null-vector of A; see Paige and Saunders [866, 1982, Section 6.4].
Figure 6.2.3 shows ∥x† − xk ∥ and ∥rk ∥ for CGME and LSME applied to a consistent
underdetermined system with the transpose of the matrix ILLC1850. The algorithms perform
almost identically. A final accuracy of ∥x − xk ∥ = 6.2 × 10^{−14} is achieved after about 2500
iterations. Superlinear convergence sets in slightly earlier for CGME. As predicted, ∥x† − xk ∥
converges monotonically, while ∥rk ∥ oscillates. For consistent systems Ax = b with stopping
rule S1, the oscillation in ∥rk ∥ that occurs for CGME and LSME is not an attractive feature. For
CGLS and LSQR, ∥rk ∥ converges monotonically. These methods apply also to underdetermined
systems, and, unlike CGME and LSME, they will not break down if Ax = b turns out to be
inconsistent.

Figure 6.2.3. Underdetermined consistent problem with the transpose of ILLC1850: ∥x† − xk ∥
and ∥rk ∥; CGME (blue and red solid lines); LSME (black and green dashed lines).

As shown in Figure 6.2.4 the convergence of ∥x† − xk ∥ for CGLS is only marginally slower
than for CGME. The residual norm is always smaller and behaves monotonically. Thus, CGLS
achieves similar final accuracy with only slightly more work and storage than CGME and can
be terminated earlier. Therefore, CGLS or LSQR are usually preferred also for underdetermined
systems.

Figure 6.2.4. Overdetermined consistent problem ILLC1850: ∥x† − xk ∥ and ∥rk ∥; CGME (blue
and red solid lines) and CGLS (black and green dash-dot lines).

Although the two sets of basis vectors Uk and Vk generated by GKL bidiagonalization are
theoretically orthogonal, this property is lost in floating-point arithmetic. The algorithm be-
havior therefore deviates from the theoretical. Reorthogonalization maintains a certain level of
orthogonality and accelerates convergence at the expense of more arithmetic and storage; see
Simon [995, 1984]. In full reorthogonalization, newly computed vectors uk+1 and vk+1 are
reorthogonalized against all previous basis vectors. If Uk and Vk are orthonormal to working
accuracy, this involves computing

uk+1 − Uk (UkT uk+1 ), vk+1 − Vk (VkT vk+1 ) (6.2.61)

and normalizing the resulting vectors. Uk and Vk must be saved, and the cost is roughly 4k(m +
n) flops. After k steps the accumulated cost is about 2k^2 (m + n) flops. Thus, full reorthogonal-
ization is not practical for larger values of k. The cost can be limited by using local reorthogo-
nalization, where uk+1 and vk+1 are reorthogonalized against the p last vectors, where p ≥ 0 is
an input parameter.
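A sketch of one full reorthogonalization step in MATLAB, assuming the previously computed basis vectors are stored as the columns of Uk and Vk (names illustrative):

% Full reorthogonalization of the new vectors against all previous ones, cf. (6.2.61).
u = u - Uk*(Uk'*u);  u = u/norm(u);   % reorthogonalize and renormalize u_{k+1}
v = v - Vk*(Vk'*v);  v = v/norm(v);   % reorthogonalize and renormalize v_{k+1}
Uk = [Uk, u];  Vk = [Vk, v];          % append to the stored bases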
Because Uk and Vk are generated by coupled two-term recurrences, these two sets of vectors
have closely related orthogonality. Simon and Zha [996, 2000] show that keeping Vk orthonormal
to machine precision u will make Uk roughly orthonormal to a precision of at least O(√u) and
vice versa if Uk is orthogonalized. Such one-sided reorthogonalization saves at least half the
cost of full reorthogonalization. For strongly overdetermined systems (m ≫ n) the savings are
highest if only Vk is orthogonalized.
Tests by Fong and Saunders [417, 2011] show that for more difficult problems, reorthog-
onalization can make a huge difference in the number of iterations needed by LSMR. When
comparing full reorthogonalization and the two versions of one-sided reorthogonalization, they
(unexpectedly) found that the results from these three versions were indistinguishable in all test
cases. The current implementation of LSMR includes the option of local one-sided reorthogo-
nalization.
Barlow [68, 2013] showed that one-sided reorthogonalization is a very effective strategy for
preserving the desired behavior of GKL bidiagonalization. In exact arithmetic, GKL generates
the factorization
A = U BV T ∈ Rm×n , m ≥ n,

with orthonormal U ∈ R^{m×n} and orthogonal V ∈ R^{n×n} . In finite precision, orthogonality is lost.
Assume now that

η = (1/2)∥I − V^T V ∥F ,
where V is the matrix to be reorthogonalized. Then GKL produces Krylov subspaces generated
by a nearby matrix A + δA, where

∥δA∥F = O(u + η)∥A∥F .

A key result is that the singular values of the bidiagonal matrix Bk produced after k steps of this
procedure are the exact result for the Rayleigh–Ritz approximation of a nearby matrix.

Notes and references


Arioli, Duff, and Ruiz [37, 1992] developed possible stopping criteria for iterative methods.
Strakoš and Tichý [1040, 2002] consider theoretical error estimates in CG methods that work in
finite precision. Meurant and Strakoš [792, 2006] give error bounds for CG to justify stop-
ping criteria. Estimates of the optimal backward error have been studied by Chang, Paige,
and Titley-Peloquin [235, 2009], Grcar, Saunders, and Su [532, 2007], and Jiranek and Titley-
Péloquin [673, 2010]. The occurrence of superlinear convergence in general Krylov subspace
methods is analyzed by Simoncini and Szyld [997, 2005]. Composite convergence bounds for
finite-precision CG computations are derived by Gergelits and Strakoš [466, 2014]. Paige [863,
2019] proves the highly accurate behavior of the finite-precision Lanczos process for symmet-
ric matrices. The theory of block Krylov subspace methods for linear systems with multiple
right-hand sides is discussed by Gutknecht [556, 2007].

6.2.7 The MINRES and MINARES Algorithms


For linear systems Ax = b, where A is symmetric and nonsingular but possibly indefinite, the CG
method cannot be used. The MINRES (minimal residual) algorithm of Paige and Saunders [855,
1975] computes minimum residual approximations xk , k = 1, 2, . . . , as the solution of the
problem
min_{xk} ∥Axk − b∥2 ,   xk ∈ Kk (A, b).   (6.2.62)

The Lanczos process is defined for any symmetric matrix A. It generates an orthogonal basis
Uk = (u1 , u2 , . . . , uk ) for the Krylov subspaces Kk (A, b). Setting xk = Uk yk , we obtain from
(6.2.31) that
b − Axk = β1 u1 − AUk yk = Uk+1 (β1 e1 − T̂k yk ). (6.2.63)
By the orthogonality of Uk+1 , the least squares problem (6.2.62) is seen to be equivalent to

min_{yk} ∥β1 e1 − T̂k yk ∥2 .

This subproblem is best solved using the QR factorization

Gk,k+1 · · · G23 G12 T̂k = \begin{pmatrix} Rk \\ 0 \end{pmatrix} ∈ R^{(k+1)×k} ,   Rk = \begin{pmatrix} γ1 & δ2 & ϵ3 & & \\ & γ2 & δ3 & \ddots & \\ & & γ3 & \ddots & ϵk \\ & & & \ddots & δk \\ & & & & γk \end{pmatrix} ,   (6.2.64)

where the plane rotations Gj,j+1 annihilate the subdiagonal elements in T̂k . The same rotations
are applied to the right-hand side, giving
Gk,k+1 · · · G23 G12 β1 e1 = \begin{pmatrix} tk \\ τ̄k+1 \end{pmatrix} ,

where tk = (τ1 , . . . , τk )T . The factorization (6.2.64) can be updated easily, as we now show. In
the next step, a row and a column are added to T̂k . The only new nonzero elements, βk+1 , αk+1 ,
and βk+2 , are in column k + 1 and rows k, k + 1, and k + 2. To get column k + 1 in Rk+1 , the two last
rotations from the previous steps are first applied to column k + 1 and rows k − 1, k, k + 1. The
element βk+2 is then annihilated by a new rotation Gk+1,k+2 .
We have xk = Uk yk , where yk satisfies the upper triangular system Rk yk = tk . To compute
xk without having to save all of Uk , we define Dk = (d1 , . . . , dk ) from RkT DkT = UkT . This
yields the recurrence relation (d0 = d−1 = 0)

γk dk = uk − δk dk−1 − ϵk dk−2 , k ≥ 1. (6.2.65)

Hence, xk can be updated using

xk = Uk yk = Dk Rk yk = ( Dk−1   dk ) \begin{pmatrix} tk−1 \\ τk \end{pmatrix} = xk−1 + τk dk .   (6.2.66)
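In code, the recurrences (6.2.65)–(6.2.66) amount to a short update involving two saved direction vectors; a sketch with illustrative variable names (d1, d2 hold d_{k−1} and d_{k−2}, zero vectors during the first two steps):

% One MINRES solution update, cf. (6.2.65)-(6.2.66).
d = (u - delta*d1 - eps_k*d2)/gamma;   % new direction d_k from u_k and saved directions
x = x + tau*d;                         % x_k = x_{k-1} + tau_k*d_k
d2 = d1;  d1 = d;                      % shift the saved directions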

When A is singular, MINRES computes a least squares solution but not, in general, the
minimum-length solution

min_x ∥x∥2 subject to x ∈ arg min ∥Ax − b∥2 ;   (6.2.67)

see Choi [245, 2006]. If the system is inconsistent, then MINRES converges to a least squares
solution but not, in general, to the solution of minimum length. Choi, Paige, and Saunders [245,
2011] develop MINRES-QLP for that purpose. MINRES-QLP uses a QLP decomposition of
the tridiagonal matrix from the Lanczos process and converges to the pseudoinverse solution.
MINRES-QLP requires slightly more operations per iteration step but can give more accurate
solutions than MINRES on ill-conditioned problems. An implementation is given by Choi and
Saunders [246, 2014].
Applying MINRES to the augmented system could be an alternative approach for solving
least squares problems. However, as shown by Fischer, Silvester, and Wathen [409, 1998], MIN-
RES makes progress only in every second iteration, and LSQR and CGLS converge twice as fast.
Indeed, applying the Lanczos process to the augmented system leads to the GKL process (Paige
and Saunders [866, 1982, Section 2.1]) and hence to LSQR.
MINARES by Montoison, Orban, and Saunders [802, 2023] completes the family of Krylov
methods based on the symmetric Lanczos process (CG, SYMMLQ, MINRES, MINRES-QLP).
For any symmetric system Ax = b, MINARES minimizes ∥Ark ∥22 , within the kth Krylov sub-
space Kk (A, b). As this quantity converges to zero even if A is singular, MINARES is well suited
to singular symmetric systems.

At iteration k, MINARES solves the least squares subproblem

min_{xk ∈ Kk (A,b)} ∥Ark ∥_2^2 ,   rk = b − Axk .

The quantities ∥Ark ∥22 , ∥rk ∥22 , ∥xk − x∥22 , and ∥xk − x∥2A all decrease monotonically. On
consistent systems, the number of iterations is similar to that for MINRES and MINRES-QLP when the
stopping criterion is based on ∥rk ∥22 , and significantly smaller when the stopping criterion is based
on ∥Ark ∥22 .
LSMR is a more general solver for the least squares problem minx ∥b − Ax∥22 of square or
rectangular systems. If A were symmetric, LSMR would minimize ∥Ark ∥22 and appear at first
glance to be equivalent to MINARES. However, rk is defined within different Krylov subspaces,
and LSMR would require two matrix-vector products Av and Au per iteration. (LSMR solving
symmetric Ax = b is equivalent to MINRES solving A2 x = Ab.) MINARES is more reliable
than MINRES or MINRES-QLP on singular systems and more efficient than LSMR.

6.2.8 The GMRES Algorithm


The generalized minimal residual (GMRES) method of Saad and Schultz [959, 1986] is one of
the most widely used Krylov subspace methods. GMRES computes a sequence of approximate
solutions to an unsymmetric nonsingular system Ax = b. It is based on the extension by
Arnoldi [39, 1951] of the Lanczos process to non-Hermitian matrices. Given a unit vector v1 ∈ Cn ,
the Arnoldi process computes an orthonormal basis {v1 , . . . , vk } for the sequence of Krylov
subspaces
Kk (A, v1 ) = span {v1 , Av1 , . . . , Ak−1 v1 }, k = 1, 2, . . . .

Assume that the process has generated Vk = (v1 , v2 , . . . , vk ) with orthonormal columns. In the
next step the vector wk = Avk is formed and orthogonalized against v1 , v2 , . . . , vk . Using MGS
we compute

wk = Avk , hik = wkT vi , wk := wk − hik vi , i = 1, . . . , k. (6.2.68)

If hk+1,k = ∥wk ∥2 > 0, we set vk+1 = wk /hk+1,k . Otherwise, the process terminates. In
matrix form this process yields the Arnoldi decomposition

AVk = Vk+1 Hk+1,k , (6.2.69)

where Hk+1,k is the upper Hessenberg matrix

Hk+1,k = \begin{pmatrix} h11 & h12 & \cdots & h1k \\ h21 & h22 & \cdots & h2k \\ & \ddots & \ddots & \vdots \\ & & hk,k−1 & hkk \\ & & & hk+1,k \end{pmatrix} ∈ C^{(k+1)×k} .   (6.2.70)

In GMRES the Arnoldi process is applied to

v1 = r0 /β, β = ∥r0 ∥2 , (6.2.71)

where x0 is a given starting approximation, and r0 = b − Ax0 . For k = 1, 2, . . . , GMRES



determines an approximation xk = x0 + Vk yk , with Vk yk ∈ Kk (A, r0 ), of minimum residual norm
∥rk ∥2 , rk = b − Axk . From the Arnoldi decomposition (6.2.69) we have

rk = r0 − AVk yk = βv1 − Vk+1 Hk+1,k yk = Vk+1 (βe1 − Hk+1,k yk ),

where Vk+1 has orthonormal columns. It follows that the minimum of ∥rk ∥2 is obtained when
yk solves the small Hessenberg least squares subproblem

min_{yk} ∥βe1 − Hk+1,k yk ∥2 .   (6.2.72)

Since all Vk and Hk+1,k must be stored, this is solved only at the final step of GMRES by
determining a product of plane rotations such that

Qk Hk+1,k = \begin{pmatrix} Rk \\ 0 \end{pmatrix} ,   \begin{pmatrix} gk \\ γk \end{pmatrix} = β Qk e1 .   (6.2.73)

Then the solution is obtained by solving Rk yk = gk and forming xk = x0 + Vk yk . The plane
rotations can be used and then discarded.
GMRES terminates at step k if hk+1,k = 0. Then from (6.2.69), AVk = Vk Hk,k and
rank(AVk ) = rank(Vk ) = k. Hence Hk,k must be nonsingular, and

rk = Vk (βe1 − Hk,k yk ) = 0,   Hk,k yk = βe1 .

It follows that xk = x0 + Vk yk is the exact solution of Ax = b. In exact arithmetic, GMRES termi-


nates if and only if the exact solution has been found.
With all vectors v1 , . . . , vk being stored, the memory is proportional to the number of steps.
This limits the number of steps that can be taken. Usually GMRES is restarted after a fixed
number of steps, typically between 10 and 30. Restarting usually slows down the convergence,
and it can happen that restarted GMRES never reaches an accurate solution.
If the orthogonalization in GMRES is performed by MGS, as described above, a loss of or-
thogonality in the computed vector vk+1 will occur at step k if |hk+1,k | is small. This motivated Walker [1097,
1988] to develop an implementation HH–GMRES using the more expensive Householder orthog-
onalization. However, Greenbaum, Rozložník, and Strakoš [535, 1997] observe that the conver-
gence behavior and, ultimately, attainable precision of HH–GMRES do not differ from MGS–
GMRES. In the MGS–GMRES version, orthogonality will be lost completely only when the
residual norm ∥rk ∥2 has been reduced close to its final level. Paige, Rozložník, and Strakoš [865,
2006] were finally able to prove the backward stability of MGS–GMRES.
The rate of convergence of GMRES is closely related to the behavior of the sequence
{hk+1,k }. In practice GMRES usually shows superlinear convergence. Unlike CG, upper bounds
for the convergence of GMRES cannot be derived in terms of the condition number of A. The
situation is more complicated because an unsymmetric matrix may have complex eigenvalues
and, in general, no orthogonal eigensystem. The residuals can be expressed as
rk = Pk (A)r0 , where Pk is a polynomial such that Pk (0) = 1. GMRES implicitly generates the
polynomials for which ∥rk ∥2 is minimal. If A is diagonalizable, A = XΛX −1 , Λ = diag (λj ),
then

∥rk ∥2 = min_{Pk (0)=1} ∥Pk (A)r0 ∥2 ≤ ∥X∥2 ∥X^{−1} ∥2 min_{Pk (0)=1} max_j |Pk (λj )| ∥r0 ∥2 .   (6.2.74)

However, this upper bound is not very useful because the minimization of a polynomial with
Pk (0) = 1 over a set of complex numbers is an unsolved problem. Information about the eigen-
values alone does not suffice for determining the rate of convergence.

GMRES usually needs a good preconditioner in order to work. A right preconditioner,

AM −1 u = b, x = M −1 u, (6.2.75)

has the advantage that the residuals of the preconditioned system are identical to the original
residuals. For a right-preconditioned system the GMRES algorithm constructs an orthogonal
basis for the subspace {r0 , AM −1 r0 , . . . , (AM −1 )m−1 r0 }. We summarize the algorithm below.

Algorithm 6.2.9 (GMRES with Right Preconditioning).

r0 = b − Ax0 ; β = ∥r0 ∥; v1 = r0 /β;


for j = 1, . . . , m
    zj = M −1 vj ; w = Azj ;
    for i = 1, . . . , j
        hij = wT vi ; w = w − hij vi ;
    end
    hj+1,j = ∥w∥2 ; vj+1 = w/hj+1,j ;
end
Vm = (v1 , . . . , vm );
ym = argminy ∥βe1 − H̄m ym ∥;
xm = x0 + M −1 Vm ym ;
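A compact MATLAB sketch of Algorithm 6.2.9 is given below. It only mirrors the steps above under simplifying assumptions: the preconditioner is passed as a function handle Minv that applies M−1 to a vector, and the small Hessenberg least squares problem is solved with backslash rather than with updated plane rotations. The name gmres_right and all variables are illustrative.

function x = gmres_right(A, b, Minv, x0, m)
% Minimal sketch of GMRES with right preconditioning (Algorithm 6.2.9).
r0 = b - A*x0;  beta = norm(r0);
n = length(b);
V = zeros(n, m+1);  H = zeros(m+1, m);
V(:,1) = r0/beta;
for j = 1:m
    w = A*Minv(V(:,j));                  % w = A*M^{-1}*v_j
    for i = 1:j                          % modified Gram-Schmidt orthogonalization
        H(i,j) = w'*V(:,i);
        w = w - H(i,j)*V(:,i);
    end
    H(j+1,j) = norm(w);
    if H(j+1,j) == 0, m = j; break; end  % exact solution found (in exact arithmetic)
    V(:,j+1) = w/H(j+1,j);
end
e1 = zeros(m+1,1);  e1(1) = beta;
y = H(1:m+1,1:m) \ e1;                   % y_m = argmin ||beta*e1 - H_{m+1,m} y||_2
x = x0 + Minv(V(:,1:m)*y);               % x_m = x_0 + M^{-1} V_m y_m
end

For a sparse A one could, for example, take Minv = @(v) U\(L\v) with [L,U] = ilu(A); this is only one possible choice of preconditioner.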

For many applications an effective fixed preconditioner M is not available. Then one would
like to be able to use a preconditioner defined as an arbitrary number of steps of another iterative
method applied to solve M zj = vj . For example, another Krylov subspace based on the normal
equations could be used. It is desirable that the preconditioner be allowed to change without
restarting GMRES so that zj = Mj−1 vj . Flexible GMRES is a simple modification of standard
GMRES by Saad [955, 1993] that allows the use of variable preconditioning.

Algorithm 6.2.10 (FGMRES with Variable Right Preconditioning).

r0 = b − Ax0 ; β = ∥r0 ∥; v1 = r0 /β;


for j = 1, . . . , m
    zj = Mj−1 vj ; w = Azj ;
    for i = 1, . . . , j
        hij = wT vi ; w = w − hij vi ;
    end
    hj+1,j = ∥w∥2 ; vj+1 = w/hj+1,j ;
end
Zm = (z1 , . . . , zm );
ym = argminy ∥βe1 − H̄m ym ∥;
xm = x0 + Zm ym ;

The only difference from the standard version is that we now must save Zm = (z1 , . . . , zm )
and use it for the update of x0 . This doubles the storage cost, but the arithmetic cost is the same.
Note that in FGMRES the vectors z1 , . . . , zm do not, in general, span a Krylov subspace.

By Propositions 2.1 and 2.2 in Saad [955, 1993], xm obtained at step m in flexible GMRES
minimizes the residual norm ∥b − Axm ∥2 over x0 + span (Zm ). Assuming that j − 1 steps of
FGMRES have been successfully performed and that Hj is nonsingular, xj is the exact solution
if and only if hj+1,j = 0. Note that the nonsingularity of Hj is no longer implied by the
nonsingularity of A.

6.2.9 Orthogonal Tridiagonalization


The simultaneous solution of a linear system and its adjoint,

Ax ≈ b, AH y ≈ c, (6.2.76)

has interesting applications in design optimization, aeronautics, weather prediction, and signal
processing; see Giles and Süli [471, 2002] and Montoison and Orban [800, 2020]. More gener-
ally, let A ∈ Rm×n , m ≥ n, be rectangular. Applying two sequences of Householder transfor-
mations alternately on the left and right, we can compute an orthogonal tridiagonalization
\begin{pmatrix} 1 & 0 \\ 0 & U^H \end{pmatrix} \begin{pmatrix} & c^H \\ b & A \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & V \end{pmatrix} = \begin{pmatrix} & γ1 e_1^T \\ β1 e1 & U^H AV \end{pmatrix} .   (6.2.77)

The first transformation in each sequence is chosen to reduce b and c, respectively, to a multiple
of e1 , so that
U H b = β1 e 1 , cH V = γ1 eT1 ,
and hence b = β1 U e1 and c = γ1 V e1 . Later transformations are chosen to reduce A to tridiago-
nal form

U^H AV = \begin{pmatrix} Tn+1,n \\ 0 \end{pmatrix} ,   Tn+1,n = \begin{pmatrix} α1 & γ2 & & & \\ β2 & α2 & γ3 & & \\ & \ddots & \ddots & \ddots & \\ & & βn−1 & αn−1 & γn \\ & & & βn & αn \\ & & & & βn+1 \end{pmatrix} ∈ R^{(n+1)×n} ,   (6.2.78)
with nonnegative off-diagonal elements. (If m = n, the last row of Tn+1,n is void.) Note that
this transformation preserves the singular values of A.
Knowing the existence of factorization (6.2.78), we can derive recurrence relations to com-
pute the nonzero elements of the tridiagonal matrix Tn+1,n and the columns of U and V . We
already have b = β1 u1 and c = γ1 v1 , so that u1 = b/∥b∥2 and v1 = c/∥c∥2 . Following Golub,
Stoll, and Wathen [509, 2008], we write

A(v1 , . . . , vk ) = (u1 , . . . , uk , uk+1 ) Tk+1,k ,
A^H (u1 , . . . , uk ) = (v1 , . . . , vk , vk+1 ) T_{k+1,k}^H .

Comparing the last columns on both sides and solving for vectors uk+1 and vk+1 , respectively,
gives

βk+1 uk+1 = Avk − αk uk − γk uk−1 , (6.2.79)


γk+1 vk+1 = AH uk − αk vk − βk vk−1 . (6.2.80)

Orthogonality gives αk = u_k^H Avk , and the elements βk+1 > 0 and γk+1 > 0 are determined as
normalization constants.
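A direct MATLAB transcription of one step of the coupled recurrences (6.2.79)–(6.2.80), for real data, might look as follows (a sketch only; breakdowns βk+1 = 0 or γk+1 = 0 are not handled, and u_old, v_old, beta_k, gamma_k denote the quantities from the previous step, zero for k = 1):

w = A*v;                                   % A*v_k
alpha = u'*w;                              % alpha_k = u_k'*A*v_k
u_new = w - alpha*u - gamma_k*u_old;       % recurrence (6.2.79)
beta_next = norm(u_new);  u_new = u_new/beta_next;
v_new = A'*u - alpha*v - beta_k*v_old;     % recurrence (6.2.80)
gamma_next = norm(v_new); v_new = v_new/gamma_next;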

Approximate solutions of Ax = b and AH y = c can be obtained in the form


xk = Vk zk ,   yk = Uk wk .

The norms of the residuals rk = b − AVk zk and sk = c − A^H Uk wk at step k are

∥rk ∥2 = ∥b − Uk+1 Tk+1,k zk ∥2 = ∥β1 e1 − Tk+1,k zk ∥2 ,   (6.2.81)
∥sk ∥2 = ∥c − Vk+1 T_{k+1,k}^H wk ∥2 = ∥γ1 e1 − T_{k,k+1}^H wk ∥2 .   (6.2.82)
Determining zk and wk so that these residual norms are minimized is equivalent to solving
two tridiagonal least squares problems. This ensures that the residual norms for both the pri-
mal and dual systems are monotonically decreasing. The QR factorizations of Tk+1,k and T_{k,k+1}^H
are computed using a sequence of plane rotations to eliminate off-diagonal elements.
Let Tk+1,k = Qk R̂k , where R̂k is upper triangular with three nonzero diagonals. The QR fac-
tors can be updated so that only one plane rotation is needed in each step. The rotations are
applied also to the right-hand side. The solution zk can then be obtained by back-substitution.
Storing the whole bases Uk and Vk can be avoided. Let

Qk ( Tk+1,k   β1 e1 ) = \begin{pmatrix} R̂k & fk \\ 0 & ϕ̄k+1 \end{pmatrix} ,

so that R̂k zk = fk , and define Ck from the triangular system Ck R̂k = Vk . Then the solution can
be updated as

xk = Vk zk = Ck R̂k zk = Ck fk = xk−1 + ϕk ck .

The adjoint system T_{k,k+1}^H wk = γ1 e1 is treated similarly.
From the recursion formulas (6.2.79) and (6.2.80) it can be seen that the vectors uk and vk
lie in the union of two Krylov subspaces:
u2k ∈ Kk (AAH , b) ∪ Kk (AAH , Ac), u2k+1 ∈ Kk+1 (AAH , b) ∪ Kk (AAH , Ac),

v2k ∈ Kk (AHA, c) ∪ Kk (AHA, AH b), v2k+1 ∈ Kk+1 (AHA, c) ∪ Kk (AHA, AH b).


The tridiagonalization process is defined as long as no element βk or γk becomes zero. If βk = 0,
this signals that the solution of Ax = b can be recovered from the computed partial decompo-
sition. Similarly, the solution of AH y = c can be obtained if γk = 0. Indeed, the process can
be continued simply as follows. If βk = 0, then in recurrence (6.2.80) we set βj = 0, j > k.
Similarly, if γk = 0, then in (6.2.79) we set γj = 0, j > k. Finally, note that if c = 0, then all γj
are zero, Tn,n+1 is lower bidiagonal, and the process equals Bidiag1. Similarly, if b = 0, all βj
are zero, and Tn,n+1 is upper bidiagonal.

Notes and references


Parlett observed in 1987 that orthogonal tridiagonalization can be interpreted as a block Lanczos
process for the Jordan–Wielandt matrix with initial vectors ( 0   b )^T and ( c   0 )^T .
tridiagonalization of a square matrix A with starting vectors b and c originated with Saunders,
Simon, and Yip. This led to the iterative solvers USYMQR and USYMLQ for solving square
Ax = b and AH = c simultaneously given in Saunders, Simon, and Yip [966, 1988]. Reichel
and Ye [918, 2008] recognized that orthogonal tridiagonalization also applies to rectangular A
and named their associated solver GLSQR. For square Ax = b they show that the special choice
c=x b, where xb is an approximation to x, can improve convergence.
There are many possible generalizations of orthogonal tridiagonalization, with applications,
e.g., to least squares problems with multiple right-hand sides and block bidiagonal decompo-
sitions. Plešinger [899, 2008] and Hnětynková, Plešinger, and Strakoš [633, 2015] consider
applications to core system theory and TLS problems.

6.3 Preconditioners for Least Squares Problems


6.3.1 Gauss–Seidel and SSOR Preconditioners
Finding good preconditioners for least squares problems is often difficult. Problems arise from
a wide variety of sources and lack properties that make some techniques successful for linear
systems arising from partial differential equations. Surveys of preconditioners for least squares
problems are given by Bru et al. [183, 2014] and Gould and Scott [520, 2017].
Some simple preconditioners can be derived from some classic iterative methods. In Section
6.1.6 it was shown that the Chebyshev Semi-iterative (CSI) method can be used to accelerate
the convergence of SSOR. A different interpretation of this is to view SSOR as a preconditioner
for CSI. Let
ATA = L + D + LT (6.3.1)
be the standard splitting with L strictly lower triangular. The SSOR preconditioner corresponds
to taking
B = D−1/2 (D + ωLT ), 0 ≤ ω < 2. (6.3.2)
Note that ω = 0 corresponds to diagonal scaling such that A has columns of unit norm. The
SSOR preconditioner can be implemented without explicitly forming ATA or L. The vectors
t = (t1 , . . . , tn )T = B −1 p and q = AB −1 p = q0 can be computed simultaneously as follows.
Set qn = 0, and for j = n, n − 1, . . . , 1, compute

tj = ( d_j^{1/2} pj − ω a_j^T qj ) / dj ,   qj−1 = qj + tj aj .   (6.3.3)

The vector s = (s1 , . . . , sn )T = B −T AT r is computed as follows. Set h1 = r, and for j = 1, 2, . . . , n compute

sj = a_j^T hj / d_j^{1/2} ,   hj+1 = hj − ω ( d_j^{−1/2} sj ) aj .   (6.3.4)

Hence to apply the SSOR preconditioner, only one column of A is needed at a time. The number
of arithmetic operations per step approximately doubles when ω ̸= 0, compared to diagonal
scaling (ω = 0). Evans [389, 1968] reports SSOR preconditioning for solving symmetric positive
definite systems Ax = b as very promising. Jennings and Malik [665, 1978] also consider Jacobi
and SSOR-preconditioned CG methods.
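A MATLAB sketch of the two sweeps (6.3.3)–(6.3.4) is given below; here d(j) = ∥aj∥_2^2 are the diagonal elements of ATA, and p, r, omega and all other names are assumed to be supplied by the caller.

% Forward sweep: t = B^{-1}*p and q = A*B^{-1}*p, cf. (6.3.3).
q = zeros(m,1);  t = zeros(n,1);
for j = n:-1:1
    aj = A(:,j);
    t(j) = (sqrt(d(j))*p(j) - omega*(aj'*q))/d(j);
    q = q + t(j)*aj;
end
% Backward sweep: s = B^{-T}*A'*r, cf. (6.3.4).
h = r;  s = zeros(n,1);
for j = 1:n
    aj = A(:,j);
    s(j) = (aj'*h)/sqrt(d(j));
    h = h - omega*(s(j)/sqrt(d(j)))*aj;
end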
Theory and numerical experiments indicate that the choice ω = 1 is often close to optimal.
However, the gain is usually small and often upset by the increased complexity of each itera-
tion. Convergence may be affected by reordering the columns. Also, reordering may be used to
introduce parallelism.
In many sparse least squares problems arising from multidimensional models, A has a natural
column block structure,

A = ( A1 A2 ... AN ) , Aj ∈ Rm×nj , (6.3.5)

where n1 + · · · + nN = n. A special example of such block structure is the block-angular


form described in Section 4.3.2. For problems with the structure (6.3.5), block versions of the
preconditioners (6.3.2) are particularly suitable. Let the QR factorizations of the blocks be

Aj = Qj Rj , Qj ∈ Rm×nj , j = 1, . . . , N. (6.3.6)

Then (6.3.2) with ω = 0 corresponds to the block diagonal preconditioner

B = RB = diag (R1 , R2 , . . . , RN ). (6.3.7)



For this choice we have AB −1 = (Q1 , Q2 , . . . , QN ), i.e., the columns of each block are or-
thonormal. If ATA is split according to ATA = LB + DB + LTB , where LB is strictly lower
block triangular, the block SSOR preconditioner becomes
B = R_B^{−T} (DB + ωL_B^T );   (6.3.8)

see Björck [128, 1979]. As with the corresponding point preconditioner, it can be implemented
without forming ATA.
If x and y = Bx are partitioned conformally with (6.3.7),

x = (x1 , x2 , . . . , xN )T , y = (y1 , y2 , . . . , yN )T ,

then Jacobi’s method (6.1.18) applied to the preconditioned problem (6.2.22) becomes

y_j^{(k+1)} = y_j^{(k)} + Q_j^T (b − AB^{−1} y^{(k)} ),   j = 1, . . . , N,

or in terms of the original variables,

x_j^{(k+1)} = x_j^{(k)} + R_j^{−1} Q_j^T (b − Ax^{(k)} ),   j = 1, . . . , N.   (6.3.9)

This is the block Jacobi method for the normal equations. Note that the correction zj = x_j^{(k+1)} − x_j^{(k)} solves the problem

min_{zj} ∥Aj zj − r^{(k)} ∥2 ,   r^{(k)} = b − Ax^{(k)} ,   (6.3.10)

and these corrections can be computed in parallel. Often Qj is not available and we have to
use Qj = Aj Rj−1 . This is equivalent to using the method of seminormal equations (2.5.26) for
solving (6.3.10). It can lead to some loss of accuracy, and a correction step is recommended
unless all the blocks Aj are well-conditioned.
A block SOR method for the normal equations can be derived similarly. Let r_1^{(k)} = b − Ax^{(k)} ,
and for j = 1, . . . , N compute

x_j^{(k+1)} = x_j^{(k)} + ω z_j^{(k)} ,   r_{j+1}^{(k)} = r_j^{(k)} − ω Aj z_j^{(k)} ,   (6.3.11)

where z_j^{(k)} solves min_{zj} ∥Aj zj − r_j^{(k)} ∥2 . Taking ω = 1 in (6.3.11) gives the block Gauss–Seidel
method.
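One block Jacobi sweep (6.3.9)–(6.3.10) can be sketched in MATLAB as follows, using the seminormal equations for the block subproblems. The cell arrays Ablk and Rblk (with Aj = Ablk{j} and its R factor Rblk{j}) and the column index sets idx{j} are assumed to be set up by the caller; all names are illustrative.

% One block Jacobi sweep via the seminormal equations R_j'*R_j*z_j = A_j'*r.
r = b - A*x;                                   % current residual, used for all blocks
for j = 1:N
    zj = Rblk{j} \ (Rblk{j}' \ (Ablk{j}'*r));  % solves min ||A_j z - r||_2
    x(idx{j}) = x(idx{j}) + zj;                % corrections can be computed in parallel over j
end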
To use the block SSOR preconditioner (6.3.8) for the conjugate gradient method, we have
to be able to compute vectors q = AB −1 p and s = B −T AT r efficiently, given p and r. The
following algorithms for this generalize the point SSOR algorithms (6.3.3) and (6.3.4):
• Set q (N ) = 0 and solve B(z1 , . . . , zN )T = p. For j = N, . . . , 2, 1, solve

Rj zj = pj − Rj−T ATj q (j) , q (j−1) = q (j) − Aj zj . (6.3.12)

• Set r(1) = r and compute s = (s1 , . . . , sN )T . For j = 1, 2, . . . , N , solve

RjT sj = ATj r(j) , r(j+1) = r(j) − Aj Rj−1 sj . (6.3.13)

The choice of partitioning A into blocks is important for the storage and computational ef-
ficiency of the methods. An important criterion is that it should be possible to compute the
factorizations Aj = Qj Rj (or at least the factors Rj ) without too much fill. Note that if Aj is

block diagonal, the computation of zj in SOR splits into independent subproblems. This makes
it possible to achieve efficiency through parallelization.
The case N = 2, A = ( A1 A2 ) is of special interest. For the block diagonal preconditioner
(6.3.7) we have AB^{−1} = ( Q1   Q2 ), and the matrix of normal equations for the preconditioned
system becomes

(AB^{−1})^T AB^{−1} = \begin{pmatrix} I & K \\ K^T & I \end{pmatrix} ,   K = Q_1^T Q2 .   (6.3.14)
This matrix is consistently ordered. Hence, it is possible to reduce the work per iteration by
approximately half for many iterative methods. This preconditioner is also called the cyclic
Jacobi preconditioner.
For consistently ordered matrices, the SOR theory holds. Hence, as shown by Elfving [383,
1980], the optimal ω in the block SOR method (6.3.11) for N = 2 is

ωopt = 2/(1 + sin θmin ), cos θmin = σmax (QT1 Q2 ),

where θmin is the smallest principal angle between R(A1 ) and R(A2 ). Block SOR with ωopt
reduces the number of iterations by a factor of 2/ sin θmin compared to ω = 1.
For N = 2, the preconditioner (6.3.8) with ω = 1 has special properties; see Golub, Man-
neback, and Toint [500, 1986]. From
B = \begin{pmatrix} R1 & ωQ_1^T A2 \\ 0 & R2 \end{pmatrix} ,

it follows that for ω = 1,

( A1 A2 ) B −1 = ( Q1 (I − P1 )Q2 ) , (6.3.15)

where P1 = Q1 QT1 is the orthogonal projector onto Range(A1 ). Hence the two blocks in
(6.3.15) are mutually orthogonal, and the preconditioned problem (6.2.22) can be split into

y1 = Q_1^T b,   min_{y2} ∥(I − P1 )Q2 y2 − b∥2 .   (6.3.16)

This effectively reduces the original problem to one of size n2 . Hence, this preconditioner is also
called the reduced system preconditioner. The matrix of normal equations is
(AB^{−1})^T AB^{−1} = \begin{pmatrix} I & 0 \\ 0 & Q_2^T (I − P1 )Q2 \end{pmatrix} = I − \begin{pmatrix} 0 & 0 \\ 0 & K^T K \end{pmatrix} ,

where K = QT1 Q2 . This reduction of variables corresponding to the first block of columns can
also be performed when N > 2.
Manneback [772, 1985] shows that for N = 2 the optimal choice with respect to the number
of iterations is ω = 1, i.e., the reduced system preconditioning. Further, as shown by Hageman,
Luk, and Young [558, 1980], the reduced system preconditioning is equivalent to cyclic Jacobi
preconditioning (ω = 0) for Chebyshev semi-iteration and the conjugate gradient method. The
reduced system preconditioning essentially generates the same approximations in half the num-
ber of iterations. Since the work per iteration is about doubled for ω ̸= 0, this means that cyclic
Jacobi preconditioning is optimal for CG in the class of SSOR preconditioners.
The use of SSOR preconditioners for Krylov subspace methods was first proposed by Ax-
elsson [47, 1972]. SSOR-preconditioned CG methods for the least squares and least-norm prob-
lems are developed by Björck and Elfving [141, 1979]. Experimental results for block SSOR

preconditioning with N > 2 are given by Björck [128, 1979]. Tests show that the number of
iterations required is nearly constant for values around ω = 1. For certain grid problems, a high
degree of parallelism can be achieved. Kamath and Sameh [682, 1989] give a scheme for a three-
dimensional n × n × n mesh and a seven-point difference star, for which N = 9 and each block
consists of n2 /9 separate subblocks of columns. Hence each subproblem can be solved with a
parallelism of n2 /9.

Notes and references


Ordering algorithms based on graph theory are given by Dennis and Steihaug [317, 1986]. Golub,
Manneback, and Toint [500, 1986] apply block SSOR-preconditioned CG to the Doppler posi-
tioning problem, for which the matrix has block-angular form. Morikuno and Hayami [812,
2013], [813, 2015] use one or several steps of SOR and SSOR as an inner iteration precondi-
tioner for GMRES applied to the normal equations (see Section 4.3.2).

6.3.2 Incomplete Cholesky and QR


Incomplete Cholesky (IC) factorizations are an important class of preconditioners for solving
large sparse symmetric positive definite linear systems Cx = d. In these methods, some fill ele-
ments that would occur in the exact Cholesky factorization C = LLT are dropped. The resulting
sparse approximate Cholesky factor is used as a preconditioner. This idea has yielded very ef-
fective solvers, especially for symmetric positive definite systems arising from finite-difference
stencils. Note that if L is the exact Cholesky factor, L−T CL−1 = I, and CG converges in one
step.
For least squares problems minx ∥Ax − b∥2 , IC factorizations are used as right-precondi-
tioners, and CGLS, LSQR, or LSMR is applied to miny ∥AL^{−T} y − b∥2 . Let L be a lower
triangular IC factor of C = ATA such that

C = LLT − E, ∥E∥2 /∥A∥22 < ϵ ≪ 1,

where E is the residual matrix. It follows that ∥(AL^{−T})^T (AL^{−T}) − I∥2 < ϵκ(A)^2 , and we can expect rapid
convergence. However, since ATA is often significantly denser than A, it can be difficult to find
a sufficiently sparse and effective IC preconditioner for least squares problems.
In the pioneering paper by Meijerink and van der Vorst [786, 1977], an IC factorization for
the class of symmetric M-matrices is shown to exist. More generally, IC factorizations exist
when C is an H-matrix but may fail for a general symmetric positive definite matrix because of a
zero or negative pivot. Numerical instabilities can be expected if pivots have small magnitudes.
To avoid breakdown, Manteuffel [774, 1980] proposed factorizing a diagonally shifted matrix
C + αI for some sufficiently large α > 0. This modification can be very effective, but its success
depends critically on the choice of α. In general the only way to find a suitable α is by trial and
error.
During the last fifty years, many variations of IC preconditioners have been developed. They
differ in the strategies used to determine which elements are dropped. In a level-based IC
method, the nonzero pattern of L is based on the nonzero pattern of C and prescribed in ad-
vance. A symbolic factorization is used to assign each fill entry a level. In an IC(ℓ) incomplete
factorization, a nonzero element in L is kept in the numerical factorization only if its level is
at most ℓ. This has the advantage that the memory required for the preconditioner L is known
in advance. In an IC(0) factorization, the nonzero structure of L is the same as for the lower
triangular part of C. An IC(1) factorization also includes any nonzeros directly introduced by
the elimination of the level-zero elements. Higher-level incomplete factorizations are defined
recursively. However, memory requirements may grow quickly as ℓ is increased. An improved

strategy by Scott and Tůma [984, 2011] is to consider individual matrix elements and restrict
contributions of small elements to fewer levels of fill than for larger elements.
Another widely used class of IC factorizations is called incomplete threshold IC(τ ) fac-
torization. In these, elements in the computed IC factor whose magnitude falls below a preset
threshold τ are discarded. A choice of τ = 0 will retain all elements, giving a complete Cholesky
factorization of C. It can be shown that a choice of τ = 1 will cause all off-diagonal elements to
be rejected and give a diagonal preconditioner. In practice, an intermediate value of τ in the in-
terval [0.01, 0.02] is recommended by Ajiz and Jennings [13, 1984]. Several of the above classes
of IC factorizations are available in the ichol function supplied by MATLAB.
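As an illustration, an incomplete threshold factor of ATA computed with ichol can be passed to MATLAB's lsqr as a right preconditioner. This is only a sketch, assuming A is sparse with full column rank; the drop tolerance is an arbitrary choice.

% IC(tau) preconditioning of LSQR via MATLAB's ichol and lsqr.
C = A'*A;                                            % normal equations matrix (often much denser than A)
L = ichol(C, struct('type','ict','droptol',1e-2));   % incomplete threshold Cholesky, C ~ L*L'
[x,flag,relres,iter] = lsqr(A, b, 1e-8, 500, L');    % LSQR applied to A*inv(L'), with y = L'*x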
A suitable symmetric permutation P CP T can improve the performance of an IC factoriza-
tion. When a drop tolerance is used, good orderings for direct methods, such as the minimum
degree algorithm, can be expected to perform well, because with these orderings fewer elements
need to be dropped; see Munksgaard [815, 1980]. Duff and Meurant [350, 1989] study the effect
of different ordering strategies on the convergence of CG when it is preconditioned by IC. They
conclude that the rate of convergence is not related to the number of fill-ins that are dropped but
rather almost directly related to the norm of the residual matrix ∥E∥. They show that several
orderings that give a small number of fill-ins do not perform well when used with a level-zero or
level-one incomplete factorization.
An alternative strategy to avoid breakdown of an IC factorization is proposed by Ajiz and
Jennings [13, 1984]. To compensate for dropped off-diagonal elements, corrections are added
to the diagonal elements. To delete the element cij , i ̸= j, a residual matrix Eij with nonzero
elements
 
\begin{pmatrix} cii & −cij \\ −cji & cjj \end{pmatrix}

is added, where cii cjj − c2ij ≥ 0. Then Eij is positive semidefinite, and the eigenvalues of the
modified matrix C + Eij cannot be smaller than those of C. Hence if C is positive definite and
E is the sum of such modifications, it follows that C + E is positive definite, and the incomplete
factorization cannot break down. In the algorithm of Ajiz and Jennings, modifications to cii and
cjj of equal relative magnitude are made,

cii = cii + ρ|cij |,   cjj = cjj + (1/ρ)|cij |,

where ρ = \sqrt{cii /cjj } . After all the off-diagonal elements in column i have been computed, all
additions are made to cii , and

rii = \Big( cii − \sum_{k=1}^{i−1} r_{ki}^2 \Big)^{1/2} ,   rij = c^∗_{ij} /rii ,   j > i.

A difficulty with threshold Cholesky factorization is that the amount of storage needed to hold
the factorization for a given τ cannot be determined in advance. One solution is to stop and
restart with a larger value of τ if the allocated memory does not suffice. Alternatively, only the
p largest off-diagonal elements in each column of L can be kept, for some parameter p. Lin and
Moré [748, 1999] use no drop tolerance and retain the nj + p largest elements in the jth column
of L.
Tismenetsky [1065, 1991] proposes a different modification scheme. Intermediate memory
is used during construction of the preconditioner L but then discarded. A decomposition of the
form
C = (L + R)(L + R)T = LLT + LRT + RLT + E (6.3.17)

is used, where L is lower triangular with positive diagonal elements, and R is strictly lower
triangular. The matrix L is used as a preconditioner, and R is used to stabilize the factorization
process. The residual matrix has the positive semidefinite form E = RRT . At step j, the first
column of the jth Schur complement can be decomposed as the sum lj + rj , where ljT rj = 0.
In a right-looking implementation, the Schur complement is updated by subtracting Ej = (lj +
rj )(lj + rj )T , where rj is not retained in the incomplete factorization. Hence, at step j the
positive semidefinite modification rj rjT is implicitly added to A, which prevents breakdowns.
Tismenetsky takes rj as the vector of off-diagonal elements that are smaller than a chosen drop
tolerance. The good performance of Tismenetsky’s preconditioner is partly explained by the
form of the error matrix that depends on the square of the elements in R. The fill in L can be
controlled by the choice of drop tolerance. The most serious drawback is that the total memory
requirement needed to compute L can be prohibitively high.
Kaporin [684, 1998] modifies Tismenetsky’s method in several respects. A left-looking al-
gorithm is used, and the memory requirement is controlled by using two tolerances. Elements
larger than τ1 are kept in L, and those smaller than τ2 are dropped from R. The error matrix now
has the structure

E = RRT + F + F T ,

where F is a strictly lower triangular matrix that is not computed. Kaporin’s method is not
breakdown-free and has to be stabilized, e.g., by restarting the factorization after a diagonal shift
A := A + αI. More than one restart may be required.
Further developments of the Tismenetsky–Kaporin method are proposed by Scott and Tůma
[986, 2014]. Memory is limited by using two extra parameters lsize and rsize to control the
maximal number of fill elements in each column of L and R, respectively. The lsize largest
elements are kept in lj provided they are at least τ1 in magnitude, and the rsize largest elements
are in rj provided they are at least τ2 in magnitude. An implementation MI28 is described in
Scott and Tůma [985, 2014], where extensive tests are described on a large set of problems from
the SuiteSparse collection. The code is available as part of the HSL Mathematical Software
Library; see www.hsl.rl.ac.uk.
As described in Section 2.5.3, iterative refinement (IR) can be regarded as a preconditioned
iterative method, where the preconditioner is the full factor R̄ computed from Cholesky of ATA
(or a QR factorization of A), possibly in lower precision. The iteration method in IR is the
simple power method. This can require several iterations to converge, and often some other
iterative solver such as CGLS can be used with advantage. Zhang and Wu [1147, 2019] use a
QR factorization in IEEE half precision as a preconditioner for CGLS to achieve high accuracy
least squares solutions on GPUs.
By Theorem 2.1.2, the computed full Cholesky factorization satisfies

R̄^T R̄ = ATA + E,   ∥E∥2 < 2.5 n^{3/2} u ∥A∥_2^2 ,

provided 2n^{3/2} u κ(A)^2 < 0.1. Hence,

∥(AR̄^{−1})^T (AR̄^{−1}) − I∥2 < 2.5 n^{3/2} u κ(A)^2 ,

and with R̄ as a preconditioner, CGLS or LSQR will converge rapidly.
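The idea can be sketched in MATLAB as follows, here with the R factor computed in single rather than half precision (an assumption made for portability; this is not the implementation of [1147], and all names are illustrative):

% R factor of A computed in lower (single) precision, used as a right preconditioner for LSQR.
[~, R] = qr(single(full(A)), 0);     % economy-size QR in single precision
R = double(R);
x = lsqr(A, b, 1e-12, 100, R);       % LSQR effectively solves A*inv(R)*y = b, with y = R*x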


Bellavia, Gondzio, and Morini [97, 2013] discuss a new class of limited-memory precondi-
tioners for CGLS for solving weighted large-scale least squares problems arising in optimiza-
tion. The largest eigenvalues of the symmetric positive definite normal equation matrix H are

identified by a partial Cholesky factorization P = LLT that uses only a few columns corre-
sponding to the largest diagonal elements of H. This is used as a preconditioner to reduce the
condition number of H. The smallest eigenvalues of H are handled by the deflated CG algorithm
of Saad et al. [961, 2000]; see Section 6.4.6. This requires computing approximate eigenvectors
corresponding to some of the smallest eigenvalues of the preconditioned matrix P −1 H by a
Rayleigh–Ritz procedure.
Myre et al. [816, 2018] use CGLS preconditioned with the computed complete Cholesky
factorization to solve dense least squares problems. They call their algorithm TNT because it is
a “dynamite method”(!). For problems in rock magnetism with tens of thousands of variables it
outperformed other tested methods, including dense QR factorization.
Alternatively, an incomplete factor R can be generated by modifying a QR factorization
of A. This normally involves more computation but is less subject to the effect of rounding
errors. Jennings and Ajiz [664, 1984] describe an incomplete modified Gram–Schmidt (IMGS)
factorization in which the magnitude of each off-diagonal element rij is compared against a
chosen drop tolerance τ , scaled by the norm of the corresponding column in A = (a1 , . . . , an ).
That is, elements in R such that

|rij | < τ dj , dj = ∥aj ∥2 , j = 1, . . . , n,

are dropped. If τ = 0, all elements in R are retained, and the MGS process is complete. If τ = 1,
all off-diagonal elements in R are dropped, thus making R diagonal. In IMGS factorization the
preconditioner R is formed by the coefficients in a series of vector orthogonalizations, and A is
converted into Q.

Algorithm 6.3.1 (IMGS Factorization).

for i = 1 : n
    rii = ∥ai ∥2 ; qi := ai /rii ;
    for j = i + 1 : n
        rij := qiT aj ;
        if |rij | < τ dj then rij := 0;
        else aj := aj − rij qi ;
    end
end

If A has full column rank, the IMGS algorithm cannot break down. Column aj is only mod-
ified by subtracting a linear combination of previous columns a1 , . . . , aj−1 and cannot vanish.
Therefore, we have at each stage A = Q̂R̂, where Q̂ is orthogonal, and upper triangular R̂ has
positive diagonal elements. Normalization will give a nonzero qj , and the process can be con-
tinued. A drawback of IMGS is that for a sparse A, the intermediate storage requirement can be
much larger than for the final preconditioner R̂.
Wang [1101, 1993] (see also Wang, Gallivan, and Bramley [1102, 1997]) gives a com-
pressed algorithm (CIMGS) for computing the IMGS preconditioner. CIMGS is similar to
an incomplete Cholesky factorization. In exact arithmetic, it can be shown to produce the
same incomplete factor R for C = ATA as IMGS. Thus it inherits the robustness of IMGS.
CIMGS is also equivalent to Tismenetsky’s IC decomposition applied to the matrix ATA; see Bru
et al. [183, 2014].

Algorithm 6.3.2 (CIMGS).


for i = 1, 2, . . . , n
    rii = \sqrt{cii} ;
    for j = i + 1, . . . , n
        cij = cij /rii ;
        if (i, j) ̸∈ P then rij = 0 else rij = cij ; end
    end
    for j = i + 1, . . . , n
        for k = i + 1, . . . , n
            if (i, j) ∈ P or (i, k) ∈ P then
                ckj = ckj − cik ∗ cij ;
            end
        end
    end
end

Jennings and Ajiz [664, 1984] also consider an incomplete Givens QR factorization. The
rows of A are processed sequentially. The nonzero elements in the ith row (ai1 , ai2 , . . . , ain ) are
scanned, and each nonzero element aij is annihilated by a plane rotation involving row j in R.
A rotation to eliminate an element in A is skipped if
|aij | < τ ∥aj ∥2 ,
where aj is the jth column of A. If such an element aij were simply discarded, the final in-
complete factor R would become singular. Instead, these elements are rotated into the diagonal
element rjj by setting rjj = \sqrt{ r_{jj}^2 + a_{ij}^2 } .

This guarantees that R is nonsingular and that the residual matrix E = ATA − RTR has zero
diagonal elements.
Zlatev and Nielsen [1153, 1988] compute sparse incomplete QR factors of A by discarding
computed elements that are smaller than a drop tolerance τ ≥ 0. The initial tolerance is succes-
sively reduced if the iterative solver converges too slowly. This approach can be very efficient
for some classes of problems, especially when storage is a limiting factor.
A different dropping criterion suggested by Saad [953, 1988] is to keep the pR largest ele-
ments in a row of R and the pQ largest elements in a column of Q. The sparsity structure of R
can also be limited to a prescribed index set P , as in the incomplete Cholesky algorithm. This
version can be obtained from Algorithm 6.3.1 by modifying it so that rij = 0 when (i, j) ̸∈ P .
A multilevel incomplete Gram–Schmidt QR (MIQR) preconditioner is given by Li and Saad
[741, 2006]. This exploits the fact that when a matrix is sparse, many of its columns will be
orthogonal because of their structure. The algorithm first finds a set of structurally orthogonal
columns in A and permutes them into the first positions A1 = (a1 , . . . , ak ). Normalizing these
columns gives A1 = Q1 D1 , with Q1 orthogonal. The remaining columns A2 are then orthogo-
nalized against the first set, giving B = A2 − Q1 F1 and the partial QR factorization
AP_1^T = ( Q1   B ) \begin{pmatrix} D1 & F1 \\ 0 & I \end{pmatrix} ,
where F1 is usually sparse and has structurally independent columns. Hence, the process can
be repeated recursively on B until the reduced matrix is small enough or no longer sufficiently

sparse. This orthogonalization process can be turned into an incomplete QR factorization by


relaxing the orthogonality and applying dropping strategies.

6.3.3 Approximate Inverse Preconditioners


Preconditioners based on incomplete LU or Cholesky factorization have the disadvantage that
they are implicit, i.e., their application requires the solution of a linear system. Therefore they
can be difficult to implement efficiently. For a nonsingular system Ax = b, an alternative is to
compute an explicit sparse approximate inverse (SPAI) preconditioner M such that M ≈ A−1 .
Application of such a preconditioner is a matrix-vector operation and therefore amenable to
parallelization.
It is not clear if a good sparse approximate inverse of a sparse matrix A exists, given that the
inverse of a sparse irreducible matrix in general has no zero elements. For example, the inverse of
an irreducible band matrix A ∈ Rn×n is dense. However, if A is strongly diagonally dominant,
an SPAI preconditioner consisting of the main diagonal and a few other diagonals can be very
efficient.
An SPAI preconditioner can be found by considering the constrained optimization problem

    \min_{M \in G} \| I - AM \|_F ,        (6.3.18)

where M is allowed to have nonzero elements only in a subset G of indices (i, j), 1 ≤ i, j ≤ n,
and ∥ · ∥F is the Frobenius matrix norm. If mj is the jth column of M , then

    \| I - AM \|_F^2 = \sum_{j=1}^{n} \| e_j - A m_j \|_2^2 .        (6.3.19)

The optimization problem reduces to solving n independent least squares subproblems
min ∥ej − Amj ∥2 for mj , subject to the sparsity constraints on mj . Rows of A that are zero in
the columns permitted by the sparsity pattern can be discarded. Thus, when M is constrained to
be a sparse matrix, the least squares subproblems are of small dimension. A simple method for
solving the subproblems is coordinate descent on the function

    (1/2) ∥rj ∥_2^2 ,    rj = ej − A mj .
Chow and Saad [247, 1998] reduce the cost of computing the SPAI by using a few steps of an
iterative method to reduce the residuals for each column.
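For illustration, the following MATLAB sketch constructs a static-pattern approximate inverse
column by column by solving the small subproblems above with a dense QR-based solve. The
function name and the representation of the pattern G as a sparse logical matrix are assumptions
made for this example, not part of any particular published code.

function M = spai_static(A, G)
% Minimal sketch: static-pattern sparse approximate inverse of a square A.
% G is an n-by-n sparse logical matrix marking the allowed nonzeros of M;
% column j of M solves the small least squares subproblem restricted to G(:,j).
n = size(A,2);
M = sparse(n,n);
for j = 1:n
    J = find(G(:,j));                 % allowed nonzero positions of m_j
    I = find(any(A(:,J), 2));         % rows of A touched by these columns
    e = zeros(length(I),1);
    e(I == j) = 1;                    % restriction of e_j to the rows I
    M(J,j) = full(A(I,J)) \ e;        % small dense least squares solve
end
end

With G = (A ~= 0), this corresponds to the common choice, mentioned below, of giving M the
same sparsity structure as A.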
For matrices with a general sparsity pattern it is difficult to prescribe a good nonzero pat-
tern for M . For A ∈ Rn×n a common choice is to let M have the same sparsity structure as
A. Therefore, adaptive strategies have been developed. These start with a simple initial guess
for the sparsity pattern (for example, a diagonal structure) and successively augment this until
some criterion is satisfied. The algorithm by Grote and Huckle [544, 1997] is one of the most
successful of these. A detailed discussion is given in Benzi [106, 2002].
For a least squares problem minx ∥Ax − b∥2 of full column rank, we could seek an SPAI
for the normal equations matrix C = ATA. Several algorithms for computing a SPAI for posi-
tive definite systems have been suggested. Two basic types can be distinguished, depending on
whether the preconditioner is expressed as a single matrix M ≈ C −1 or as the product of two or
more matrices. For use with CGLS a symmetric positive definite preconditioner M is required.
Symmetry can be achieved by using the symmetric part (1/2)(M^T + M) of M . Regev and Saun-
ders [915, 2022] give a modified PCGLS method that detects indefiniteness or near singularity
of M and restarts PCGLS with a more positive definite preconditioner.
Another way of computing an approximate inverse is by a procedure related to the bicon-
jugation algorithm of Fox [429, 1964]. Given a symmetric positive definite matrix C and a
set of linearly independent vectors w1 , w2 , . . . , wn ∈ Rn , the AINV algorithm constructs a set
of C-conjugate vectors z1 , z2 , . . . , zn ∈ Rn using a modified Gram–Schmidt orthogonalization
process; see Benzi, Meyer, and Tůma [109, 1996]. The algorithm sets z_i^{(0)} = w_i and then iterates

    z_i^{(j)} := z_i^{(j-1)} - \frac{(z_j^{(j-1)})^T C z_i^{(j-1)}}{(z_j^{(j-1)})^T C z_j^{(j-1)}}\, z_j^{(j-1)} ,        (6.3.20)
where j = 1 : n − 1 and i = j + 1 : n. With z_i = z_i^{(i-1)} and Z = (z1 , z2 , . . . , zn ), we now have
z_i^T C z_j = 0, i ̸= j, and
Z T CZ = D = diag (d1 , . . . , dn ), (6.3.21)
where
dj = zjT Czj > 0, 1 ≤ j ≤ n.
In exact arithmetic, the process can be completed without encountering zero divisors if and only
if all the leading principal minors of C are positive. In this case the matrix
Z T = L−1 is unit lower triangular. By uniqueness, C = LDLT is the square-root-free Cholesky
factorization.
Applications to sparse least squares problems are considered by Benzi and Tůma [110, 2003],
[111, 2003]. Setting C = ATA, we have
zjT Czi = (Azj )T (Azi ).
Hence (6.3.20) does not require C explicitly. For a typical sparse A, most of the inner products
will be structurally zero (Azj and Azi will have no nonzeros in the same positions). Incompleteness
can be imposed by dropping elements in Z. The elements of z_i^{(j)} are scanned after
each update, and those smaller in absolute value than a drop tolerance τ ∈ (0, 1) are discarded.
Alternatively, a prescribed nonzero structure on Z can be enforced to give a factorized SPAI of
the form
    (ATA)^{-1} ≈ Z̄ D̄^{-1} Z̄^T ,    D̄ diagonal.

Since the elements d̄j = ∥Az̄j ∥_2^2 are positive, such a preconditioner is always symmetric positive
definite.
The process (6.3.20) produces not only L−1 but also L at no extra cost. It holds that
    l_{ij} = \frac{(z_j^{(j-1)})^T C z_i^{(j-1)}}{(z_j^{(j-1)})^T C z_j^{(j-1)}} ,    i > j.        (6.3.22)
The vector zi is discarded as soon as it has been used to form the corresponding parts of L =
(lij ). Recall that in the first version, the multipliers lij are discarded and the zi are kept. Dropping
elements of z_i^{(j)} as above gives an incomplete Cholesky factorization
C = ATA ≈ L̄D̄−1 L̄T . (6.3.23)
If A has full column rank, the pivots d̄j = ∥Az̄j ∥_2^2 are guaranteed to be positive. Hence the
preconditioner (6.3.23) is positive definite.

Notes and references


Preconditioners for augmented systems are developed by Scott and Tůma [985, 2014]. A survey
of methods based on incomplete QR factorization is given by Bai, Duff, and Wathen [64, 2001].
Papadopoulos, Duff, and Wathen [877, 2005] discuss implementations and give many practical
results. Scott and Tůma [987, 2016] develop preconditioners by robust incomplete factorization
(RIF). Cholesky-based factorizations for rank-deficient problems are given by Scott [983, 2017].

6.3.4 Submatrix Preconditioners


An important class of preconditioners for the least squares problem minx ∥Ax − b∥2 is based on
selecting a subset of rows from A ∈ Rm×n forming a submatrix A1 of full column rank. If the
selected rows are permuted to the top, the least squares problem becomes
   
    \min_x \left\| \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} x - \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \right\|_2 , \qquad A_1 \in R^{m_1 \times n} .        (6.3.24)

Läuchli [724, 1961] was the first to use CGLS with A1 as a preconditioner for solving (6.3.24).
He took A1 to be square and nonsingular and solved the preconditioned problem
min_y ∥A A_1^{-1} y − b∥_2 , y = A_1 x, or, equivalently,

    \min_y \left\| \begin{pmatrix} I \\ C \end{pmatrix} y - \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \right\|_2 , \qquad C = A_2 A_1^{-1} \in R^{m_2 \times n} ,        (6.3.25)

with normal equations


    (I_n + C^T C) y = b_1 + C^T b_2 .        (6.3.26)
Läuchli used Gauss–Jordan elimination with complete pivoting to form C explicitly. This can be
avoided by computing the LU factorization of A with an efficient sparse LU factorization code
and then implementing the matrix-vector products as

    Cp = A_2 (A_1^{-1} p) ,    C^T t = A_1^{-T} (A_2^T t) .        (6.3.27)

Algorithm 6.3.3 (LU Preconditioned CGLS I).


Initialize: r0 = b1 , t0 = b2 , p0 = s0 = r0 + C T t0 , γ0 = ∥s0 ∥22 .
for k = 0, 1, 2, . . . , while γk > τ do
qk = Cpk ,
αk = γk /(∥pk ∥22 + ∥qk ∥22 ),
rk+1 = rk − αk pk ,
tk+1 = tk − αk qk ,
sk+1 = rk+1 + C T tk+1 ,
γk+1 = ∥sk+1 ∥22 ,
βk = γk+1 /γk ,
pk+1 = sk+1 + βk pk ;
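A minimal MATLAB sketch of this algorithm, with the products (6.3.27) realized by solves with
A1 (here written with backslash; in practice precomputed sparse LU factors of A1 would be
reused), might look as follows. The function name and interface are illustrative assumptions,
not code from the book.

function x = lu_pcgls(A1, A2, b1, b2, tol, maxit)
% Minimal sketch of Algorithm 6.3.3 (assumed interface). Products with
% C = A2*inv(A1) and C' are applied via solves with A1 as in (6.3.27).
r = b1;  t = b2;
s = r + (A1' \ (A2'*t));            % s = r + C'*t
p = s;  gamma = s'*s;
for k = 1:maxit
    if gamma <= tol, break; end
    q = A2*(A1 \ p);                % q = C*p
    alpha = gamma/(p'*p + q'*q);
    r = r - alpha*p;
    t = t - alpha*q;
    s = r + (A1' \ (A2'*t));
    gammaold = gamma;  gamma = s'*s;
    beta = gamma/gammaold;
    p = s + beta*p;
end
x = A1 \ (b1 - t);                  % retrieve x from A1*x = b1 - t
end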

At termination, xk is retrieved from A1 xk = b1 − tk . The convergence of Algorithm 6.3.3


has been studied by Freund [433, 1987]. The eigenvalues of In + C T C are λi = 1 + σi2 (C),
i = 1, . . . , n, where σi are the singular values of C. It follows that
    κ(A A_1^{-1}) ≤ \sqrt{\,1 + σ_1^2(C)\,} ,

where σ_1(C) = ∥C∥_2 = ∥A_2 A_1^{-1}∥_2 . From (6.2.18) we obtain the following upper bound for
the rate of convergence:

    \| A(x - x_k) \|_2 \le 2 \left( \frac{\sigma_1}{1 + \sqrt{1 + \sigma_1^2}} \right)^{2k} \| A(x - x_0) \|_2 .        (6.3.28)

Fast convergence is obtained when σ1 is small, e.g., when A1 is well-conditioned and ∥A2 ∥2 is
small. Because C has at most m2 = m − n distinct singular values, the iterations will terminate
in at most m2 steps, assuming exact arithmetic. Rapid convergence can be expected if m2 is
small.
Subset preconditioners can also be constructed from QR factorization. Let A1 ∈ Rm1 ×n ,
m1 ≥ n, be a subset of rows in A ∈ Rm×n such that A1 has full column rank. Assume that the
QR factorization

    Q_1^T ( A_1 P \;\; b_1 ) = \begin{pmatrix} R_1 & c_1 \\ 0 & c_2 \end{pmatrix}        (6.3.29)
is known, where P is a sparsity-preserving column permutation. Then R1 ∈ Rn×n can be used
as a preconditioner to solve minx ∥Ax − b∥2 . The least squares problem is equivalent to
   
    \min_x \left\| \begin{pmatrix} R_1 \\ A_2 P \end{pmatrix} x - \begin{pmatrix} c_1 \\ b_2 \end{pmatrix} \right\|_2 .

Setting x = R1−1 y and suppressing the column permutation gives the preconditioned problem
   
    \min_y \left\| \begin{pmatrix} I_n \\ C \end{pmatrix} y - \begin{pmatrix} c_1 \\ b_2 \end{pmatrix} \right\|_2 , \qquad C = A_2 R_1^{-1} \in R^{m_2 \times n} ,        (6.3.30)

which has the same form as (6.3.25).


The choice of rows forming A1 is decisive for the efficiency of subset preconditioning. When
A is sparse, a good basis A1 may be obtained from a sparse LU factorization of A (where it is
important that L be well-conditioned). A preliminary pass through the rows of A can be made
to select a subset with maximal diagonal elements; see Duff and Koster [348, 2001]. In some
problems a natural selection may follow from the structure of A. Plemmons [898, 1979] shows
that for some geodetic applications, A1 can be chosen during collection of the data.
In the Peters–Wilkinson method (Section 2.2.6), LU factorization with row pivoting is used
to compute a factorization
   
    \Pi A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} = L U = \begin{pmatrix} L_1 \\ L_2 \end{pmatrix} U ,        (6.3.31)

where U ∈ Rn×n is upper triangular, and L ∈ Rm×n is unit lower trapezoidal with bounded
off-diagonal elements. (If A is sparse, the row permutation Π and also a column permutation
preserve the sparsity of L and U while bounding the elements of L.) With A1 as right precondi-
tioner, we have
    C = A_2 A_1^{-1} = L_2 L_1^{-1} .        (6.3.32)

A matrix-vector multiply can be performed as Cv = L_2 (L_1^{-1} v).
If the pivoting strategy maintains |Lij | ≤ τ for some τ ∈ [1, 4], say, any ill-conditioning in
A will usually be reflected in U . Hence U can be used as a right preconditioner. This approach,
suggested by Björck [127, 1976], has the advantage that the lower triangular factor L need not
be stored, and the additional work per iteration depends only on the density of U . Saunders [964,
1979] used a rowwise elimination with a preliminary pass through the rows to select a triangular
subset with maximal diagonal elements. Subsequent use of the operator AU −1 involves only
back-substitutions with U and multiplications with A. When A has many more rows than col-
umns it may be preferable to factorize only A1 = L1 U1 and operate with C = A2 (L1 U1 )−1
in CGLS.
For sparse problems, a standard pivoting strategy in LU factorization is to choose a pivot aij
that minimizes the product of the number of nonzeros in its row and column. The product is
called the Markowitz merit function and bounds the number of fill-ins that can occur during an
elimination. For the purpose of stability, aij is required to satisfy

    |a_{ij}| ≥ u \max_k |a_{kj}| ,

where u is a threshold parameter in the range 0 < u ≤ 1. Taking u in the range 0.1 ≤ u ≤ 0.9
(not too small) normally keeps L well-conditioned while promoting some degree of sparsity. This
is threshold partial pivoting. Threshold rook pivoting and threshold complete pivoting are also
implemented in LUSOL (Gill et al. [475, 2005]) to balance stability and sparsity more carefully
for demanding cases.
In a related approach, Howell and Baboulin [647, 2016] use LU factorization with partial
pivoting and apply CGLS to
min ∥Ly − b∥, U x = y.
The problem is often sufficiently well-conditioned to give rapid convergence. In their MIQR
algorithm, Li and Saad [741, 2006] further precondition L using incomplete QR factors.
Problem (6.3.25) can be written in augmented form as

r1 + y = b1 , r2 + Cy = b2 , r1 + C T r2 = 0. (6.3.33)

Eliminating r1 from the first and third equations gives y = b1 + C T r2 , and then using the second
equation yields the symmetric positive definite system

(CC T + Im−n )r2 = b2 − Cb1 (6.3.34)

of size (m − n) × (m − n). This can be interpreted as the normal equations for the least squares
problem    
−C T b1
min r2 − . (6.3.35)
r2 Im2 b2
When m2 is sufficiently small, problem (6.3.35) can be solved by QR factorization. Applying
CGLS to (6.3.35) yields the following algorithm; see Björck and Yuan [153, 1999].

Algorithm 6.3.4 (LU Preconditioned CGLS II).


Initialize: r0 = b1 , t0 = b2 , p0 = s0 = t0 − Cr0 , γ0 = ∥s0 ∥22 .
for k = 0, 1, 2, . . . , while γk > τ do

qk = −C T pk ,
αk = γk /(∥pk ∥22 + ∥qk ∥22 ),
rk+1 = rk − αk qk ,
tk+1 = tk − αk pk ,
sk+1 = tk+1 − Crk+1 ,
γk+1 = ∥sk+1 ∥22 ,
βk = γk+1 /γk ,
pk+1 = sk+1 + βk pk .

At termination xk is retrieved from A1 xk = rk = b1 + C T (b2 − tk ).

This requires about the same storage and work per step as Algorithm 6.3.3. However, as
shown by Yuan [1142, 1993], the last formulation is advantageous for generalized least squares
problems with a covariance matrix V ≠ I. The generalized normal equations for problem
(6.3.25) are

    ( I_n \;\; C^T )\, V^{-1} \left( \begin{pmatrix} I_n \\ C \end{pmatrix} y - b \right) = 0 , \qquad A_1 x = y .        (6.3.36)
On the other hand, the generalized problem for (6.3.35) only involves V :
   
    ( -C \;\; I_{m_2} )\, V \begin{pmatrix} -C^T \\ I_{m_2} \end{pmatrix} r_2 = b_2 - C b_1 , \qquad y = b_1 - ( I_n \;\; 0 )\, V \begin{pmatrix} -C^T \\ I_{m_2} \end{pmatrix} r_2 ,        (6.3.37)

Notes and references


Arioli and Duff [34, 2015] discuss several aspects of submatrix preconditioning and describe a
wealth of experiments with real least squares problems. Submatrix preconditioners for equality
constrained least squares problems are given by Barlow, Nichols, and Plemmons [75, 1988].

6.3.5 Preconditioners from Randomized Algorithms


Randomized algorithms have become indispensable in areas such as combinatorial optimization,
cryptography, and machine learning. Recently, fast randomized algorithms have been developed
that act as preconditioners for very large strongly over- or underdetermined systems that arise
in geophysics, genetics, natural language processing, and high-frequency trading. An excellent
introduction to randomization and low-rank matrix factorization is given by Halko, Martinsson,
and Tropp [562, 2011].
Drineas et al. [332, 2011] introduce two randomized algorithms in which the rows of A
and b are first preprocessed by a randomized Hadamard transform (also known as the Walsh–
Hadamard transform). A Hadamard matrix is a symmetric orthogonal 2^m × 2^m matrix H = H_m
recursively defined by H_0 = 1 and

    H_{m+1} = \frac{1}{\sqrt{2}} \begin{pmatrix} H_m & H_m \\ H_m & -H_m \end{pmatrix} , \qquad m \ge 0.

A matrix-vector product with the Hadamard matrix can be computed in n log n additions and
subtractions, where n = 2^m. The randomized
Hadamard transform is the product HD, where D is a diagonal matrix formed by setting its ele-
ments to +1 or −1 with equal probability. When applied to a vector it has the useful property of
“spreading out its energy,” in the sense of providing a bound for its infinity norm. The Hadamard
transform is used in data encryption as well as many signal processing and data compression
algorithms.
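As an illustration, the following MATLAB sketch forms H_m by the recursion above and applies
the randomized Hadamard transform HD to a matrix A with 2^m rows. The dense construction
of H is an assumption made only for clarity; a fast transform would be used in practice.

% Minimal sketch, assuming A has 2^m rows.
H = 1;
for j = 1:m
    H = [H  H; H  -H]/sqrt(2);      % recursion for the normalized H_m
end
d = sign(randn(2^m,1));             % random +/-1 signs with equal probability
B = H*(d.*A);                       % randomized Hadamard transform H*D*A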
In their first algorithm, Drineas et al. form a smaller s × n subproblem (s ≪ m) from
a uniform random sampling of the preprocessed system. In their second algorithm, a random
projection G = randn(s, m) ∈ Rs×m is applied, whose elements are independent random
normal variables following the standard normal distribution. In both cases, the solution of the
smaller system is shown to be a good approximation to the solution of the full problem.
Rokhlin and Tygert [934, 2008] describe a related algorithm based on random transforma-
tions. They use a row-mixing method that consists of random Givens transformations, a random
diagonal scaling, and a fast Fourier transform (FFT). A preconditioner for a Krylov subspace
method, such as CGLS, is then obtained from QR factors of a smaller subsystem obtained by
random sampling. They report that for s = 4n the condition number of the preconditioned sys-
tem is practically always less than 3. A drawback is that the solver must work with complex
numbers.
Coakley, Rokhlin, and Tygert [259, 2011] introduce the algorithm CRT11 for orthogonal
projection also based on random normal projection G ∈ R(n+4)×m . It solves the overdetermined
least squares problem as an intermediate step. CRT11 requires 3n + 6 matrix-vector products


with A or AT . It is very reliable on a broad range of problems because the condition of the
preconditioned system is limited to about 10^3 for full-rank problems.
Avron, Maymounkov, and Toledo [46, 2010] develop a least squares algorithm called
Blendenpik. They note that a uniform random sample of rows of A gives a good subset pre-
conditioner only when the coherence or statistical leverage score µ(A) of A is small. If Q forms
an orthonormal basis for the column space of A, then

    µ(A) ≡ \max_i ∥Q_{i,1:n}∥_2^2 .

To achieve low coherence, a random-mixing preprocessing phase is performed before the random
sampling. The rows of A are first multiplied by a diagonal matrix D with random elements +1
or −1. Next, a fast transform is applied to each column of DA and Db. This can be a Walsh–
Hadamard transform (WHT), a discrete cosine transform (DCT), or a discrete Hartley transform
(DHT); see Hartley [592, 1942] and Bracewell [174, 1984]. For example, the DCT can be
achieved by the following MATLAB script:
D = spdiags(sign(randn(m,1)),0,m,m);
B = dct(D*A);  B(1,:) = B(1,:)/sqrt(2);
With high probability the coherence of the resulting row-mixed matrix B is small. After the row-
mixing step, a random sample B1 of s > γn rows of B is taken, where γ > 1 is an oversampling
factor. The QR factorization B1 = Q1 R1 then gives a preconditioner R1 for LSQR. With a
suitable sample size s, R(Q1 ) is a good approximation of R(A), and LSQR converges rapidly.
With one preprocessing phase, γ = 4 was found to be near-optimal for a large range of problems.
Since DHT needs less memory than WHT and works better than DCT, this is the preferred choice.
A solver for underdetermined systems is included. Blendenpik often beats LAPACK’s DGELS
on dense highly overdetermined problems.
The iterative solver LSRN by Meng, Saunders, and Mahoney [789, 2014] is based on random
normal projection. LSRN works for both highly over- and underdetermined systems and can
handle rank-deficient systems. For an overdetermined least squares problem

    \min_{x \in S} ∥x∥_2 ,    S = { x ∈ R^n | ∥b − Ax∥_2 = min } ,

with A ∈ Rm×n and rank(A) = r ≤ n < m, LSRN performs the following steps:
1. Choose an oversampling factor γ > 1 and set s = γn.
2. Compute Ã = GA, where G ∈ R^{s×m} is a random matrix whose elements are independent
random variables following the standard normal distribution.

3. Compute the compact SVD Ã = Ũ Σ̃ Ṽ^T , where Σ̃ ∈ R^{r×r} (Ũ is not needed).

4. Set N = Ṽ Σ̃^{-1} and compute the least-norm solution of min_y ∥A N y − b∥_2 using a Krylov-type
method such as LSQR. Return x = N y.
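As an illustration, a minimal MATLAB sketch of this construction for the overdetermined case
might look as follows. The function name, the choice γ = 2, the tolerance used to estimate the
rank, and the use of MATLAB's svd and lsqr are assumptions for this sketch, not features of
LSRN's reference implementation.

function x = lsrn_sketch(A, b, gamma)
% Minimal sketch of the LSRN preconditioner construction (steps 1-4 above)
% for A in R^{m x n} with m >> n.
[m, n] = size(A);
s = ceil(gamma*n);
G = randn(s, m);                    % random normal projection
[~, S, V] = svd(G*A, 'econ');       % compact SVD of the sketch (U not needed)
r = sum(diag(S) > eps*S(1));        % estimated numerical rank
N = V(:,1:r) / S(1:r,1:r);          % right preconditioner N = V*Sigma^{-1}
AN = @(z, flag) apply(A, N, z, flag);
y = lsqr(AN, b, 1e-10, 200);        % solve min ||A*N*y - b||_2
x = N*y;                            % recover x
end

function w = apply(A, N, z, flag)
% Operator for the preconditioned matrix A*N and its transpose.
if strcmp(flag, 'notransp'), w = A*(N*z); else, w = N'*(A'*z); end
end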
A similarly structured algorithm works for strongly underdetermined systems. Note that A
is used by LSRN only for matrix-vector and matrix-matrix operations. Hence LSRN is effi-
cient if A is sparse or a fast linear operator. LSRN can easily be extended to handle Tikhonov
regularization.
A reasonable choice for γ in step 1 is 2.0. The random normal projection in step 2 takes
O(mn2 ) time. This is more than the fast transforms used by some of the other methods. How-
ever, the random normal projection scales well in parallel environments. An important property
of LSRN is that the singular values of AN are the same as those of the random matrix (GU )† of
size s × n and are independent of the spectrum of A; see Theorem 4.2 in Meng, Saunders, and Mahoney
[789, 2014]. The spectrum of such a random matrix is a well-studied problem in random matrix
theory, and it is possible to give strong probability bounds on the condition number of AN . To
reach precision 10−14 , the maximum number of iterations needed by LSQR is ≈ 66/ log(s/r).
Thus the running time for LSRN is fully predictable.
The LSRN package can be downloaded from http://www.stanford.edu/group/SOL/software/lsrn.html.
On dense overdetermined problems with n = 10^3 , LSRN is compared
with solvers DGELSD and DGELSY from LAPACK and Blendenpik. For full-rank problems,
Blendenpik is the overall winner and LSRN the runner-up. Blendenpik is not designed for rank-
deficient problems, while LSRN can take advantage of rank-deficiency. For underdetermined
problems, the LAPACK solvers run much slower, while LSRN works equally well. On sparse
problems, LSRN is also compared to SPQR from SuiteSparseQR. On generated sparse test prob-
lems, SPQR works well for m < 10^5 . For larger problems, LSRN is the fastest solver. The
advantage of LSRN becomes greater for underdetermined systems.

6.3.6 Two-Level Preconditioners


In two-level subspace preconditioners for solving minx ∥Ax − b∥2 , the solution is split as

x = V x1 + W x2 , V ∈ Rn×k , W ∈ Rn×(n−k) ,

where rank ( V W ) = n. With A1 ≡ AV and A2 ≡ AW , this is a two-block least squares


problem of the form

    \min_{x_1, x_2} \left\| A\, ( V \;\; W ) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - b \right\|_2^2 ,        (6.3.38)

studied in Section 4.3.1. Such methods for solving Tikhonov regularized problems were pro-
posed by Hanke and Vogel [572, 1999]. Usually, k ≪ n, and a direct method is used to compute
x1 . For x2 a Krylov subspace method, such as LSQR, is used that acts in the space comple-
mentary to R(V ). The rate of convergence is determined by the singular values of W , and the
reduced condition number is

    κ = \max_{x \in R(W)} \|Ax\|_2 \Big/ \min_{x \in R(W)} \|Ax\|_2 .

An efficient implementation of such a method is given by Jacobsen, Hansen, and Saun-


ders [661, 2003]. First, the Householder QR factorization
    A V = ( Q_1 \;\; Q_2 ) \begin{pmatrix} R \\ 0 \end{pmatrix} , \qquad \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = \begin{pmatrix} Q_1^T b \\ Q_2^T b \end{pmatrix}

is computed, where R ∈ Rk×k is upper triangular. The matrices Q1 and Q2 are not explicitly
formed but (as usual) are represented by the k corresponding Householder vectors. The problem
then becomes

    \min_{x_1, x_2} \left\| \begin{pmatrix} R & Q_1^T A W \\ 0 & Q_2^T A W \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} \right\|_2 .        (6.3.39)
The subproblem for x2 ,
    \min_{x_2} ∥Q_2^T A W x_2 − c_2 ∥_2 ,        (6.3.40)

is independent of x1 and can be solved by LSQR or CGLS. Then x1 is found from

Rx1 = c1 − QT1 AW x2 . (6.3.41)
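For illustration, a dense MATLAB sketch of (6.3.39)–(6.3.41) is given below. Forming Q
explicitly is an assumption made only for clarity; as noted above, in an efficient implementation
Q1 and Q2 are kept in factored (Householder) form.

% Minimal sketch: A, b, V, W assumed given, x = V*x1 + W*x2.
k  = size(V, 2);
[Q, RV] = qr(A*V);                   % Householder QR of A*V
c  = Q'*b;   c1 = c(1:k);   c2 = c(k+1:end);
B2 = Q(:,k+1:end)'*(A*W);            % Q2'*A*W
x2 = lsqr(B2, c2, 1e-10, 200);       % subproblem (6.3.40)
x1 = RV(1:k,1:k) \ (c1 - Q(:,1:k)'*(A*W*x2));   % back-substitution (6.3.41)
x  = V*x1 + W*x2;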


If operations with W and W T are expensive, an alternative is to set p = W x2 in (6.3.40)–
(6.3.41) and solve for p. This has the drawback that QT2 A has a nontrivial nullspace spanned
by the k columns of V . Although LSQR still works for singular least squares problems, a
divergent component will arise as an effect of rounding errors; see Paige and Saunders [866,
1982, Sect. 6.2]. An effective stopping criterion is needed to terminate the iterations before di-
vergence sets in.
The generalized Tikhonov regularization problem is

    \min_x ∥Ax − b∥_2^2 + λ^2 ∥Lx∥_2^2 .

When L ∈ Rp×n , p < n, and has a nontrivial nullspace spanned by the matrix W2 , this can be
transformed into standard form as follows; see Section 3.6.5. Let Ā = AL†A , where

L†A = (I − P )L† , P = W2 (AW2 )† A, (6.3.42)

is the A-weighted pseudoinverse (3.6.46) of L. The solution can be split into two parts as

x = L†A y + z, z = (AW2 )† b, (6.3.43)

where z ∈ N (L) is the unregularized part of the solution. An iterative method is used to solve
for y, where L†A acts as a right preconditioner. The matrix L†A is not formed explicitly but kept
in the factored form (6.3.42). The implementation of this two-level method is discussed in more
detail by Hansen [576, 1998] and Hansen and Jensen [582, 2006].
When L = I the optimal choice of columns for V consists of the right singular vectors
corresponding to the k largest singular values of A. This will minimize the condition number of
the reduced subproblem (6.3.40). The choice is usually not practical; instead, singular vectors
from a related simpler problem of reduced size can be used to form V . Another possibility is to
perform products by V and V T with fast transforms. Two examples are the cosine transform DC-
2 and the wavelet transform. Extensive numerical experiments with two-level preconditioners are
given by Jacobsen [660, 2000].
Bunse-Gerstner, Guerra-Ones, and de La Vega [189, 2006] give a modification of the two-
level LSQR algorithm that makes it considerably less expensive when the solution, as is often
the case, is needed for a large number of regularization parameters.

6.3.7 Preconditioners for Toeplitz Problems


Least squares problems minx ∥T x − b∥2 , where T is a rectangular Toeplitz matrix
 
    T = \begin{pmatrix} t_0 & t_1 & \cdots & t_{n-1} \\ t_{-1} & t_0 & \cdots & t_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ t_{1-m} & t_{2-m} & \cdots & t_{n-m} \end{pmatrix} \in R^{m \times n} , \qquad m \ge n ,        (6.3.44)

of very large dimensions arise, e.g., in signal restoration, seismic explorations, and image pro-
cessing.
A matrix-vector product T x is essentially a discrete convolution operation. As will be shown
in the following, by embedding the Toeplitz matrix into a circulant matrix, a matrix-vector
product T x can be computed via the fast Fourier transform in O(n log n) operations. Provided a
good preconditioner can be found, iterative methods such as CGLS or LSQR can be competitive
with the fast direct methods given in Section 4.5.5.

A circulant matrix is square and has the form


    C_n = circ(c_0 , c_1 , \ldots , c_{n-1}) = \begin{pmatrix} c_0 & c_1 & \cdots & c_{n-1} \\ c_{n-1} & c_0 & \cdots & c_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ c_1 & c_2 & \cdots & c_0 \end{pmatrix} \in R^{n \times n} .        (6.3.45)
It is defined by the elements in its first row. Each column in Cn is a cyclic up-shifted version of
the previous column. If ei is the ith unit vector and
 
    P_n = \begin{pmatrix} 0 & I_{n-1} \\ e_1^T & 0 \end{pmatrix}        (6.3.46)

is the circulant shift matrix, then Pn e1 = en , Pn e2 = e1 , . . . , Pn en = en−1 , and it follows that


P_n^n = I. Hence, the eigenvalues of Pn are the n roots of unity,

ωj = e−2πij/n , j = 0 : n − 1,

and the eigenvectors are the columns of the discrete Fourier matrix,
    F = (f_{jk}) ,    f_{jk} = \frac{1}{\sqrt{n}}\, e^{2\pi i jk/n} ,    0 ≤ j, k ≤ n − 1,        (6.3.47)

where i = \sqrt{-1}. The circulant matrix Cn can be written as a polynomial C_n = \sum_{k=0}^{n-1} c_k P_n^k in
Pn . Hence it has the same eigenvectors as Pn , and its eigenvalues are given by

    F (c_0 , c_{n-1} , \ldots , c_1 )^T = (\lambda_1 , \lambda_2 , \ldots , \lambda_n )^T .        (6.3.48)

The matrix Cn can thus be factorized as

Cn = F ΛF H , Λ = diag (λ1 , λ2 . . . , λn ). (6.3.49)

It follows that operations with a circulant matrix Cn can be performed in O(n log n) operations
using the FFT.
We now show how any Toeplitz matrix T can be expanded into a circulant matrix. For
illustration, set m = n = 3, and define
    C_T = \begin{pmatrix} T & V \\ V & T \end{pmatrix} = \begin{pmatrix} t_0 & t_1 & t_2 & 0 & t_{-2} & t_{-1} \\ t_{-1} & t_0 & t_1 & t_2 & 0 & t_{-2} \\ t_{-2} & t_{-1} & t_0 & t_1 & t_2 & 0 \\ 0 & t_{-2} & t_{-1} & t_0 & t_1 & t_2 \\ t_2 & 0 & t_{-2} & t_{-1} & t_0 & t_1 \\ t_1 & t_2 & 0 & t_{-2} & t_{-1} & t_0 \end{pmatrix} \in R^{6 \times 6} .        (6.3.50)

A similar construction works for rectangular Toeplitz matrices. For the Toeplitz matrix (6.3.44),
the circulant

CT = circ(t0 , . . . , tn−1 , 0, t−m+1 , . . . , t−1 ) ∈ R(n+m)×(n+m)

can be used. To form y = T x ∈ R^m for arbitrary x ∈ R^n , x is padded with zeros to length
n + m, and we calculate

    z = C_T \begin{pmatrix} x \\ 0 \end{pmatrix} = F \Lambda F^H \begin{pmatrix} x \\ 0 \end{pmatrix} , \qquad y = ( I_m \;\; 0 )\, z .        (6.3.51)

This can be done with two FFTs and one multiplication with a diagonal matrix. The cost is
O(n log2 n) multiplications. A similar scheme enables fast computation of T H y.
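A minimal MATLAB sketch of this fast Toeplitz matrix-vector product is given below. The
function name and interface (first column c, first row r) are assumptions for this example.

function y = toepmv(c, r, x)
% Minimal sketch: y = T*x for the m-by-n Toeplitz matrix T with first
% column c and first row r, via circulant embedding and the FFT.
c = c(:); r = r(:); x = x(:);
m = length(c); n = length(r);
col = [c; 0; r(n:-1:2)];            % first column of the embedding circulant C_T
lam = fft(col);                     % its eigenvalues
z   = ifft(lam .* fft([x; zeros(m,1)]));
y   = real(z(1:m));                 % top block of C_T*[x; 0] equals T*x (real data)
end

For real data, toepmv(T(:,1), T(1,:), x) agrees with T*x up to rounding, at a cost of two FFTs
of length m + n and one diagonal scaling.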
Strang [1042, 1986] obtained a circulant matrix as a preconditioner for symmetric positive
definite Toeplitz systems by copying the central diagonals of T and “bringing them around.”
He showed that the eigenvalues of T C −1 cluster around 1, except for the largest and smallest
eigenvalues. T. Chan [225, 1988] gave an improved circulant preconditioner that is optimal in
the sense of minimizing ∥C − T ∥F .

Theorem 6.3.1. Let T ∈ Rn×n be a square (not necessarily symmetric positive definite) Toeplitz
matrix. Then the circulant matrix C = circ(c_0 , c_1 , . . . , c_{n-1} ) with

    c_i = \frac{ i\, t_{-(n-i)} + (n - i)\, t_i }{ n } ,    i = 0 : n − 1,        (6.3.52)
minimizes ∥C − T ∥F .

The best approximation C has a simple structure. It is obtained by averaging the correspond-
ing diagonal of T extended to length n by wraparound. For a Toeplitz matrix of order n = 4 we
obtain
    T = \begin{pmatrix} t_0 & t_1 & t_2 & t_3 \\ t_{-1} & t_0 & t_1 & t_2 \\ t_{-2} & t_{-1} & t_0 & t_1 \\ t_{-3} & t_{-2} & t_{-1} & t_0 \end{pmatrix} , \qquad C = \begin{pmatrix} t_0 & c_1 & c_2 & c_3 \\ c_3 & t_0 & c_1 & c_2 \\ c_2 & c_3 & t_0 & c_1 \\ c_1 & c_2 & c_3 & t_0 \end{pmatrix} ,

where

    c_1 = (t_{-3} + 3 t_1)/4 , \quad c_2 = (t_{-2} + t_2)/2 , \quad c_3 = (3 t_{-1} + t_3)/4 .
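For illustration, the eigenvalues of this optimal circulant preconditioner can be computed with
the following MATLAB sketch (the function name is an assumption; the result agrees with
(6.3.48) up to the normalization and sign convention of the Fourier matrix in (6.3.47)).

function lam = chan_circulant_eig(T)
% Minimal sketch: eigenvalues of T. Chan's optimal circulant C for a
% square Toeplitz matrix T, using formula (6.3.52).
n = size(T,1);
c = zeros(n,1);
c(1) = T(1,1);                                   % c_0 = t_0
for i = 1:n-1
    c(i+1) = (i*T(n-i+1,1) + (n-i)*T(1,i+1))/n;  % i*t_{-(n-i)} + (n-i)*t_i, divided by n
end
col = [c(1); c(end:-1:2)];                       % first column (c_0, c_{n-1}, ..., c_1)
lam = fft(col);                                  % eigenvalues of the circulant
end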
The convergence rate of CGLS applied to a preconditioned Toeplitz system T C −1 y = b de-
pends on the distribution of the singular values of T C −1 . R. Chan, Nagy, and Plemmons [219,
1994] show that if the generating functions of the blocks Tj are 2π-periodic continuous func-
tions, and if one of these functions has no zeros, then the singular values of the preconditioned
matrix T C −1 are clustered around 1, and PCGLS converges very quickly. The class of 2π-
periodic continuous functions contains a class of functions that arises in many signal processing
applications.
Similar ideas can be applied to problems where the least squares matrix T has a general
Toeplitz block or block Toeplitz structure; see Nagy [817, 1991] and R. Chan, Nagy, and Plem-
mons [218, 1993]. Hence the method can be applied also to multidimensional problems. Con-
sider a least squares problem
    \min_x \|T x - b\|_2 , \qquad T = \begin{pmatrix} T_1 \\ \vdots \\ T_q \end{pmatrix} \in R^{m \times n} ,        (6.3.53)
where each block Tj , j = 1, . . . , q, is a square Toeplitz matrix. (Note that if T itself is a rec-
tangular Toeplitz matrix, then each block Tj is necessarily Toeplitz.) In the first step, a circulant
approximation Cj is constructed for each block Tj . Each circulant matrix Cj , j = 1, . . . , q, is
then diagonalized by the Fourier matrix F : Cj = F Λj F H . The eigenvalues Λj can be found
from the first column of Cj ; cf. (6.3.48). Hence, the spectrum of Cj , j = 1, . . . , q, can be
computed in O(m log n) operations using the FFT.
The preconditioner for T is then defined as a square circulant matrix C such that
    C^T C = \sum_{j=1}^{q} C_j^T C_j = F^H \Big( \sum_{j=1}^{q} \Lambda_j^H \Lambda_j \Big) F .

Thus, C T C is also circulant, and its spectrum can be computed in O(m log n) operations. Now
C is taken to be the Hermitian positive definite matrix
    C \equiv F^H \Big( \sum_{j=1}^{q} \Lambda_j^H \Lambda_j \Big)^{1/2} F .        (6.3.54)

Then CGLS with right preconditioner C is applied to solve minx ∥T x − b∥2 . Note that to use C
as a preconditioner we need only know its eigenvalues, because the factorization (6.3.54) can be
used to solve linear systems involving C and C T . The generalization to block Toeplitz matrices
is straightforward.

Notes and references


Construction of circulant preconditioners for constrained and weighted Toeplitz least squares
problems is studied by Jin [672, 1996]. Iterative methods for solving Toeplitz problems are
surveyed by R. Chan and Ng [220, 1996] and R. Chan and Jin [221, 2007].

6.4 Regularization by Iterative Methods


6.4.1 Landweber’s Method
Discrete ill-posed linear systems Ax = b are characterized by A having a large group of numer-
ically zero singular values with a sizeable gap to the rest of the spectrum. Furthermore, b has
small projections onto the right singular vectors associated with the small singular values. Mul-
tidimensional ill-posed problems lead to large-scale discrete ill-posed systems with structured or
sparse matrices that are well suited for iterative solution methods. Many of these methods have
intrinsic regularization properties, where the number of iterations k plays the role of regulariza-
tion parameter.
One of the earliest iterative regularization methods for ill-posed linear systems is to apply
Richardson’s method (see Section 6.1.2) to the normal equations AT (Ax − b) = 0,

xk = xk−1 + ωAT (b − Axk−1 ), k = 1, 2, . . . , (6.4.1)

where ω is chosen so that ω ≈ 1/σ1 (A)2 . In this context the method is known as Landweber’s
method [718, 1951]. From the standard theory of stationary iterative methods it follows that the
error in xk satisfies

x − xk = (I − ωATA)(x − xk−1 ) = (I − ωATA)k (x − x0 ). (6.4.2)

Taking x_0 = 0 and expanding the error in terms of the SVD (singular value decomposition)
A = \sum_{i=1}^{n} \sigma_i u_i v_i^T shows that (6.4.2) can be written as

    x_k = \sum_{i=1}^{n} \varphi_k(\sigma_i^2)\, \frac{u_i^T b}{\sigma_i}\, v_i , \qquad \varphi_k(\sigma^2) = 1 - (1 - \omega \sigma^2)^k .        (6.4.3)

It follows that the effect of terminating the iteration with xk is to damp the component of the so-
lution along vi by the factor φk (σi2 ), where φk (σ 2 ) is the filter factor for Landweber’s method;
see Section 3.5.3. After k iterations, only the components of the solution corresponding to
σi ≥ 1/k^{1/2} have converged. If the noise level in b is known, the discrepancy principle can
be used as a stopping criterion; see Section 3.6.4.
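A minimal MATLAB sketch of Landweber's method (6.4.1) is given below. The function name
is an assumption, and ω is assumed to satisfy 0 < ω < 2/σ1(A)^2, e.g., ω ≈ 1/∥A∥_2^2.

function [x, resnrm] = landweber(A, b, omega, maxit)
% Minimal sketch of the Landweber iteration x_k = x_{k-1} + omega*A'*(b - A*x_{k-1}).
x = zeros(size(A,2),1);
resnrm = zeros(maxit,1);
for k = 1:maxit
    r = b - A*x;
    x = x + omega*(A'*r);
    resnrm(k) = norm(r);
    % In practice the iteration is stopped by the discrepancy principle,
    % i.e., as soon as norm(r) falls below the estimated noise level.
end
end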
When an iterative method is applied to an ill-posed problem, the error in xk will initially
decrease, but eventually the unwanted irregular part of the solution will grow and cause the
process to diverge. Such behavior is called semiconvergence. The iterations should be stopped
before divergence starts. Terminating the Landweber method after k iterations gives roughly
the same result as using truncated SVD (see Section 3.6.2) where components corresponding to
σi ≤ µ ∼ k^{-1/2} are dropped. The square root means that usually many iterations are required.
For this reason, Landweber’s method cannot in general be recommended; see Hanke [566, 1991].
If A ∈ Rm×n is rank-deficient, xk in Landweber’s method can be split into orthogonal
components:
xk = yk + zk , yk ∈ R(AT ), zk ∈ N (A).
The orthogonal projection of xk − x0 onto N (A) can then be shown to be zero. Hence, in exact
arithmetic, the iterates converge to the unique least squares solution that minimizes ∥x − x0 ∥2 .
Strand [1041, 1974] analyzed the more general iteration
xk+1 = xk + p(ATA)AT (b − Axk ), (6.4.4)
where p(λ) is a polynomial of order d in λ. A special case is the iteration suggested by Riley [929,
1956]: xk+1 = xk + ∆xk , where
   
    \min_{\Delta x_k} \left\| \begin{pmatrix} A \\ \mu I \end{pmatrix} \Delta x_k - \begin{pmatrix} r_k \\ 0 \end{pmatrix} \right\|_2 , \qquad r_k = b - A x_k .

This corresponds to taking p(λ) = (λ + µ2 )−1 . Riley’s method is sometimes called the iterated
Tikhonov method.
Iteration (6.4.4) can be performed more efficiently as follows. If
    1 - \lambda p(\lambda) = \prod_{j=1}^{d} (1 - \gamma_j \lambda)

is the factorized polynomial, then one iteration step can be performed in d minor steps of a
nonstationary Landweber method:
xj+1 = xj + γj AT (b − Axj ), j = 0, 1, . . . , d − 1. (6.4.5)
Assume that σ1 = β^{1/2} and that the aim is to compute an approximation to the truncated sin-
gular value solution with a cut-off for singular values σi ≤ σc = α1/2 . Then, as shown by
Rutishauser [948, 1959], in a certain sense the optimal parameters to use in (6.4.5) are γj = 1/ξj ,
where ξj are the zeros of the Chebyshev polynomial of degree d on the interval [α, β]:
    \xi_j = \tfrac{1}{2}(\alpha + \beta) + \tfrac{1}{2}(\alpha - \beta)\, x_j , \qquad x_j = \cos\Big( \frac{\pi}{2}\, \frac{2j + 1}{d} \Big) ,        (6.4.6)
j = 0, 1, . . . , d − 1. This choice leads to a filter function R(t) of degree d with R(0) = 0, and
of least maximum deviation from 1 on [α, β]. Note that α must be chosen in advance, but the
regularization can be varied by using a decreasing sequence α = α1 > α2 > · · · .
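For illustration, one sweep of (6.4.5) with the Chebyshev parameters (6.4.6) can be written in
MATLAB as follows, with A, b, x, d, alpha, and beta assumed given. As noted below, in
practice the parameters must be reordered (or a three-term recursion used) to avoid roundoff
instability.

% Minimal sketch of one sweep (6.4.5) with parameters (6.4.6), gamma_j = 1/xi_j.
j  = 0:d-1;
xi = (alpha + beta)/2 + (alpha - beta)/2 .* cos(pi*(2*j + 1)/(2*d));
for i = 1:d
    x = x + (1/xi(i))*(A'*(b - A*x));   % nonstationary Landweber step
end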
From standard results for Chebyshev polynomials it can be shown that if α ≪ β, then k steps
of iteration (6.4.5)–(6.4.6) reduce the regular part of the solution by the factor
    \delta_k \approx 2\, e^{-2k\, (\alpha/\beta)^{1/2}} .        (6.4.7)
Thus, the cut-off σc for this method is related to j in (6.4.5) as j ≈ 1/σc . This is a great
improvement over the standard Landweber’s method, for which the number of steps needed is
k ≈ (1/σc )2 .
Iteration (6.4.5) with parameters (6.4.6) suffers severely from roundoff errors. This instability
can be overcome by a reordering of the parameters ξj ; see Anderson and Golub [24, 1972].
Alternatively, (6.4.5)–(6.4.6) can be written as a three-term recursion, as in the CSI method of
Section 6.1.6.

6.4.2 Regularized CGLS and CGME


For ill-conditioned least squares and least-norm problems it is often beneficial to include a reg-
ularization term. In the least squares case the standard regularized problem (defining Tikhonov
regularization) is
    \min_x ∥b − Ax∥_2^2 + µ^2 ∥x∥_2^2 ,        (6.4.8)
where µ > 0 is a suitably chosen parameter. This problem has a unique solution that satisfies the
regularized normal equations (AT A + µ2 In )x = AT b or, in factored form,
AT r − µ2 x = 0, r = b − Ax.
The regularized least squares problem (6.4.8) can be solved by applying CGLS to the least
squares problem min_x ∥Ãx − b̃∥_2 , where

    \tilde{A} = \begin{pmatrix} A \\ \mu I_n \end{pmatrix} \in R^{(m+n) \times n} , \qquad \tilde{b} = \begin{pmatrix} b \\ 0 \end{pmatrix} \in R^{m+n} .
However, it is preferable to use the modified version of CGLS given below, where the follow-
ing changes have been made: alpha = sts/(q'*q) is changed to alpha = sts/(q'*q +
mu2*p'*p), and s = A'*r is changed to s = A'*r - mu2*x.
RCGLS requires one more scalar product than CGLS per step but no extra storage.

Algorithm 6.4.1 (RCGLS).


function [x,r,nrm] = rcgls(A,b,mu2,x0,maxit)
% RCGLS performs at most maxit CG iterations
% for the normal equations A'(b - Ax) = mu2 x.
% --------------------------------------------
x = x0; r = b - A*x;
s = A'*r - mu2*x;
nrm = s'*s; p = s;
for k = 0:maxit
if nrm == 0, break, end
q = A*p;
alpha = nrm/(q'*q + mu2*(p'*p));
x = x + alpha*p;
r = r - alpha*q;
s = A'*r - mu2*x;
nrmold = nrm; nrm = s'*s;
beta = nrm/nrmold;
p = s + beta*p;
end
end
The regularized least-norm problem is
    \min_{x,y} \left\| \begin{pmatrix} x \\ y \end{pmatrix} \right\|_2^2 \quad \text{subject to} \quad ( A \;\; \mu I_m ) \begin{pmatrix} x \\ y \end{pmatrix} = b .        (6.4.9)
The linear system Ax + µy = b is always consistent if µ ̸= 0. Its solution is x = AT z, y = µz,
where z solves the normal equations
    ( A \;\; \mu I_m ) \begin{pmatrix} A^T \\ \mu I_m \end{pmatrix} z = (A A^T + \mu^2 I_m)\, z = b .

This can be solved by setting


 
    \tilde{A} = ( A \;\; \mu I_m ) \in R^{m \times (n+m)} , \qquad \tilde{x} = \begin{pmatrix} x \\ y \end{pmatrix} \in R^{n+m}

and applying CGME to the least-norm problem

    \min_{\tilde{x}} \|\tilde{x}\|_2 \quad \text{subject to} \quad \tilde{A} \tilde{x} = b .

Since y is not of interest, the regularized version RCGME of CGME given below is to be pre-
ferred. This needs only a small amount of extra arithmetic work and no extra storage.

Algorithm 6.4.2 (RCGME).


function [x,r] = rcgme(A,b,mu2,maxit)
% RCGME performs at most maxit steps of Craig's
% algorithm on the regularized system Ax + mu y = b.
% ---------------------------------------------------
x = 0; r = b; p = r;
nrm = r'*r;
for k = 1:maxit
if nrm == 0, break, end
q = A'*p;
alpha = nrm/(q'*q + mu2*(p'*p));
x = x + alpha*q;
r = r - alpha*(A*q + mu2*p);
nrmold = nrm; nrm = r'*r;
beta = nrm/nrmold;
p = r + beta*p;
end
end
From the convergence analysis of CG (Section 6.2.2) it follows that the convergence of
RCGLS and RCGME depends on the distribution of the nonzero eigenvalues of (ATA + µ2 In )
and (A AT + µ2 Im ), respectively. These are λi = σi2 + µ2 , where σi are the nonzero singular
values of A.
As noted by Saunders [965, 1995], for µ > 0 and A of arbitrary dimensions, the regularized
least-norm problem (6.4.9) is the same as the regularized least squares problem (6.4.8). From
(6.4.9) we have µy = b − Ax = r. Using this to eliminate y we see that both problems are
equivalent to the augmented system
    
    \begin{pmatrix} I_m & A \\ A^T & -\mu^2 I_n \end{pmatrix} \begin{pmatrix} r \\ x \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix} .        (6.4.10)
Saunders derives regularized versions of LSQR and LSME that require little additional work or
storage. The bidiagonalizations of
 
    \tilde{A} = \begin{pmatrix} A \\ \mu I_n \end{pmatrix} \quad \text{and} \quad \tilde{A} = ( A \;\; \mu I_m )
can be obtained efficiently from the bidiagonalization of A used in LSQR and LSME. In regu-
larized LSQR the bidiagonal subproblem defining xk = Vk yk is
   
    \min_{y_k} \left\| \begin{pmatrix} B_k \\ \mu I \end{pmatrix} y_k - \begin{pmatrix} \beta_1 e_1 \\ 0 \end{pmatrix} \right\|_2 ,        (6.4.11)

where Bk is lower bidiagonal. Orthogonal matrices Q̃k can be constructed from 2k − 1 plane
rotations so that    
B k β1 e 1 R̃k fk
Q̃k = ,
µI 0 0 ϕk+1
where R̃k ∈ Rk×k is upper bidiagonal. The basis matrices Uk+1 and Vk are modified accord-
ingly. For LSME the regularized bidiagonal subproblem is
  2  
yk yk
min subject to ( Lk µI ) = β1 e 1 .
tk 2
tk

Orthogonal matrices Q̂k are constructed so that

( Lk µI ) Q̂k = ( L̂k 0),

where Lk ∈ Rk×k and L̃k are lower bidiagonal. Numerical tests indicate that regularized LSQR
is more reliable and efficient than regularized LSME.

6.4.3 Symmetric Quasi-Definite Systems


Linear systems of the form     
M A y b
= (6.4.12)
AT −N x c
with M ∈ Rm×m and N ∈ Rn×n symmetric positive definite are called symmetric quasi-
definite or SQD systems. These systems give the first-order optimality conditions for the two
dual convex quadratic problems

    \min_{x \in R^n} \|Ax - b\|_{M^{-1}}^2 + \|x\|_N^2 + 2 c^T x ,        (6.4.13)

    \min_{y \in R^m} \|A^T y - c\|_{N^{-1}}^2 + \|y\|_M^2 - 2 b^T y ,        (6.4.14)

where quantities like ∥x∥2N = xT N x are elliptic norms. SQD systems arise in sequential qua-
dratic programming and interior-point methods for convex optimization. Another source is in
stabilized mixed finite element methods.
The SQD matrix in (6.4.12) is indefinite and has m positive and n negative eigenvalues.
It is nonsingular, and its inverse is also SQD. The following properties of SQD matrices are
established by Vanderbei [1085, 1995] and George, Ikramov, and Kucherov [456, 2000]:
1. An SQD matrix K is strongly factorizable, i.e., for every permutation matrix P there exists
a diagonal matrix D and a unit lower triangular matrix L such that

P KP T = LDLT , (6.4.15)

where D may have both positive and negative diagonals.


2. For any SQD matrix K, the unsymmetric matrix
   
    \tilde{K} = KJ = \begin{pmatrix} M & -A \\ A^T & N \end{pmatrix} , \qquad J = \begin{pmatrix} I_m & 0 \\ 0 & -I_n \end{pmatrix} ,

has a positive definite symmetric part (1/2)(K̃ + K̃ T ). Gill, Saunders, and Shinnerl [477,
1996] analyze the stability of an indefinite Cholesky-type factorization using the results of
Golub and Van Loan [486, 1979] on LU factorization of positive definite matrices.

One approach to solving the SQD system (6.4.12) is to use Krylov methods for indefinite
systems such as S YMMLQ and MINRES. However, as shown by Fischer et al. [409, 1998], these
make progress only in every second iteration and do not exploit the structure of SQD systems.
Eliminating either y ∈ Rm or x ∈ Rn in (6.4.12) gives the Schur complement equations

(ATM −1 A + N )x = ATM −1 b − c, (6.4.16)

(AN −1 AT + M )y = AN −1 c + b. (6.4.17)

Both of these systems are symmetric positive definite and hence can be solved by CG. After x or
y is computed, the remaining part of the solution can be recovered from

y = M −1 (b − Ax) or x = N −1 (AT y − c). (6.4.18)

The algorithm ECGLS below solves the Schur complement equation (6.4.16). The iterates
are mathematically the same as those generated by the standard CG method applied to (6.4.16).
Better numerical stability is obtained by not forming ATM −1 A and instead using

ATy = N x + c, y = M −1 (b − Ax). (6.4.19)

Only matrix-vector products with A, AT , N , and solves with M are required.

Algorithm 6.4.3 (Extended CGLS).


function [x,y,sts] = ecgls(A,M,N,b,c,x0,maxit)
% ECGLS performs maxit steps of extended CGLS on the
% Schur complement system A'*M\(b-A*x) = Nx + c.
% --------------------------------------------------
x = x0; y = M\(b - A*x);
s = A'*y - c - N*x;
p = s; nrm = s'*s;
for k = 1:maxit
if nrm == 0, break; end
q = A*p; t = M\q;
alpha = nrm/(q'*t + p'*(N*p));
x = x + alpha*p;
y = y - alpha*t;
s = A'*y - c - N*x;
nrmold = nrm; nrm = s'*s;
beta = nrm/nrmold;
p = s + beta*p;
end
end

From the convergence analysis of CG it follows that the convergence of ECGLS is governed
by the distribution of the nonzero eigenvalues of (AT M −1 A + N ). If A has full column rank,
ECGLS works also for N = 0. For M = I, N = µ2 I, µ ̸= 0, and c = 0, ECGLS is equal to
RCGLS. Hence, ECGLS can be viewed as an extended version of CGLS.
The Schur complement equations (6.4.17) have a similar structure to (6.4.16) and can be
obtained from (6.4.16) by making the substitutions

    A ⇆ AT ,    M ⇆ N,    x ⇆ y,    b → c,    c → −b.

Hence (6.4.17) can be solved by


[y, x] = ecgls(A', N, M, c, -b, y0, maxit).
The convergence is then governed by the distribution of the nonzero eigenvalues of (AN −1 AT +
M ). The SQD system (6.4.12) can be transformed as
    \begin{pmatrix} L^{-1} & 0 \\ 0 & R^{-T} \end{pmatrix} \begin{pmatrix} M & A \\ A^T & -N \end{pmatrix} \begin{pmatrix} L^{-T} & 0 \\ 0 & R^{-1} \end{pmatrix} \begin{pmatrix} L^T y \\ R x \end{pmatrix} = \begin{pmatrix} L^{-1} b \\ R^{-T} c \end{pmatrix} ,

where M = LL^T and N = R^T R are the Cholesky factorizations. With new variables ỹ = L^T y
and x̃ = Rx, the transformed system becomes

    \begin{pmatrix} I_m & \tilde{A} \\ \tilde{A}^T & -I_n \end{pmatrix} \begin{pmatrix} \tilde{y} \\ \tilde{x} \end{pmatrix} = \begin{pmatrix} \tilde{b} \\ \tilde{c} \end{pmatrix} ,        (6.4.20)

where

    \tilde{A} = L^{-1} A R^{-1} , \qquad \tilde{b} = L^{-1} b , \qquad \tilde{c} = R^{-T} c .        (6.4.21)
For c = 0, problem (6.4.20) can be solved by Algorithm RCGLS or RCGME of Section 6.4.2.
Each iteration requires triangular solves with L, LT , RT , and R. The rate of convergence de-
pends on the eigenvalues λi = 1 + σi2 , where σi are the singular values of L−1 AR−1 . Arioli [31,
2013] calls these elliptic singular values. They are the critical points of the functional
    \min_{x,y}\; y^T A x \quad \text{subject to} \quad ∥x∥_N = ∥y∥_M = 1 .        (6.4.22)

Note that since ∥x̃∥_2^2 = ∥x∥_N^2 and ∥ỹ∥_2^2 = ∥y∥_M^2 , the convergence rates for x and y are measured
in the corresponding elliptic error norm.
Arioli [31, 2013] develops elliptic versions of upper and lower GKL bidiagonalization that
generalize results by Benbow [104, 1999]. These generate left and right basis vectors ui and vi
that are orthonormal with respect to the inner products defined by M and N , respectively. Each
step of the bidiagonalizations requires solves with both M and N . Based on these bidiagonaliza-
tion processes, Arioli and Orban [844, 2017] derive versions of LSQR and related algorithms for
solving SQD systems with either c = 0 or b = 0. When both b ̸= 0 and c ̸= 0, it is necessary to
either shift the right-hand side to obtain one of these special cases or solve two special systems
and add the solutions.

Notes and references


In some applications the Cholesky factorization of M and N is not available, but matrix-vector
products with M −1 can be performed efficiently. Friedlander and Orban [434, 2012] give such an
example from interior methods for nonlinear programs, where a limited-memory approximation
is used to represent M −1 . For the special case when A has full column rank and N = 0, Calandra
et al. [198, 2020] derive an algorithm similar to ECGLS and analyze its stability. Montoison and
Orban [801, 2021] develop two new iterative methods, T RI CG and T RI MR, that handle SQD
systems with general b and c. These methods are based on the orthogonal tridiagonalization
process of Saunders, Simon, and Yip [966, 1988].

6.4.4 Hybrid Regularization and LSQR


Krylov subspace methods, such as CGLS and LSQR, applied to an (unregularized) least squares
problem minx ∥Ax − b∥2 often tend to converge quickly to an approximate solution corre-
sponding to the dominating singular values of A. Hence they are well suited to ill-posed prob-
lems. Theoretical results predict similar bounds for the optimal accuracy of LSQR to those for
truncated SVD (TSVD); see Example 4.2.8. Because the approximations in LSQR are tailored
to the specific right-hand side b, the minimum error is often achieved with a subspace of much
lower dimension compared to TSVD. For LSQR the iterations also diverge more quickly after
the optimal accuracy has been reached. For very ill-conditioned problems, partial reorthogonal-
ization of the u and v vectors in LSQR (or LSMR) may help to preserve stability. This is costly
in terms of storage and operations but may be acceptable when the number of iterations is small;
see Section 6.2.6.
For iterative regularization methods, it is essential to terminate the iterations before diver-
gence starts. Using a hybrid method that combines the iterative method with an inner regular-
ization can be an effective solution; see Hanke [568, 2001]. For example, the iterative method
can be applied to the Tikhonov regularized problem
   
    \min_x \left\| \begin{pmatrix} A \\ \lambda L \end{pmatrix} x - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|_2 .        (6.4.23)
Such hybrid methods require two regularization parameters: the number of iterations k and λ.
With an appropriate choice of L, difficulties related to semiconvergence may be overcome. The
iterations can be continued until all relevant information available from the data has been cap-
tured.
Taking L = I in (6.4.23) gives the standard Tikhonov regularization. Then it can be verified
that the iterate xk,λ obtained by LSQR is the same as that obtained by first performing k steps of
LSQR on the unregularized problem (λ = 0) and then solving the regularized subproblem
   
    \min_{y_{k,\lambda}} \left\| \begin{pmatrix} B_k \\ \lambda I_k \end{pmatrix} y_{k,\lambda} - \begin{pmatrix} \beta_1 e_1 \\ 0 \end{pmatrix} \right\|_2        (6.4.24)

and setting xk,λ = Vk yk,λ . In other words, for L = I, iteration and regularization commute.
This observation allows λ to be varied without restarting LSQR. The subproblems (6.4.24) can
be solved in O(k) operations using plane rotations, and yk,λ can be determined for several values
of λ at each step. However, to get the corresponding xk,λ the vectors v1 , . . . , vk are needed. If
there is not enough space to store these vectors, they can be generated anew by running the
bidiagonalization again for each λ.
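For example, for a single value of λ the subproblem (6.4.24) can be solved directly as follows;
this is a minimal sketch with Bk, Vk, beta1, k, and lambda assumed available from the
bidiagonalization, and backslash standing in for the plane-rotation update mentioned above.

% Minimal sketch: solve (6.4.24) and form x_{k,lambda}.
e1 = zeros(k+1,1);  e1(1) = 1;
yklam = [Bk; lambda*eye(k)] \ [beta1*e1; zeros(k,1)];
xklam = Vk*yklam;        % requires the stored (or regenerated) vectors v_1,...,v_k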
Several alternative regularization methods besides Tikhonov regularization have been pro-
posed for LSQR’s bidiagonal subproblem
min ∥Bk yk − β1 e1 ∥2 , Bk ∈ R(k+1)×k . (6.4.25)
yk

O’Leary and Simmons [840, 1981] use a TSVD solution to (6.4.25). At each step of the bidi-
agonalization process, the SVD of ( Bk β1 e1 ) is computed. This can be done by standard
methods in O(k 2 ) operations. Computational details of this and similar schemes are considered
by Björck [131, 1988]. When no a priori information about the solution is available, generalized
cross-validation (GCV) can be used to determine the number of terms in the TSVD solution as
suggested by Björck, Grimme, and Van Dooren [145, 1994].
When L ̸= I, iteration and regularization no longer commute. Restarting LSQR when λ
is changed is usually too demanding computationally, and an initial transformation to standard
form is to be preferred. When L ∈ Rn×n is nonsingular, this is achieved by setting y = Lx and
Ā = AL−1 . Otherwise, if L ∈ Rp×n and p < n, L has a nontrivial nullspace. Then we take
Ā = AL†A , where
L†A = (I − (A(I − L† L))† A)L†
is the A-weighted (oblique) pseudoinverse; see Section 3.6.5. The standard form problem then
becomes
min ∥AL†A y − b∥22 + λ2 ∥y∥22 , x(λ) = L†A yλ + x2 , (6.4.26)
y

where x2 ∈ N (L) is the unregularized component of the solution. For several frequently used
regularization matrices, this transformation can be implemented so that the extra work is negli-
gible. Assume that L has full row rank n − p, and compute the two QR factorizations
 
    L^T = ( W_1 \;\; W_2 ) \begin{pmatrix} R \\ 0 \end{pmatrix} = W_1 R , \qquad A W_2 = Q_1 U ,        (6.4.27)

where W2 gives an orthogonal basis for N (L). If p ≪ n, the work in the QR factorization of
AW2 is negligible. Then,

L†A = (I − P )W1 R−1 , P = W2 (AW2 )† A = W2 U −1 QT1 A, (6.4.28)

and
    x2 = W2 (AW2 )† b = W2 U^{-1} Q_1^T b .        (6.4.29)
For several discrete smoothing norms, L can be partitioned as L = ( L1 L2 ), where L1 is
square and of full rank. Then the computationally simpler expression
    L_A^{\dagger} = (I - P) \begin{pmatrix} L_1^{-1} \\ 0 \end{pmatrix}        (6.4.30)
can be used; see Hanke and Hansen [570, 1993].
In high-dimensional problems, e.g., when L is a sum of Kronecker products, the matrix AL†A
may become too complicated to work with. For this case, an alternative projection approach that
only uses products with L and LT has been suggested by Kilmer, Hansen, and Espanõl [695,
2007]. Inspired by work of Zha [1145, 1996], this uses a joint bidiagonalization of QA and QL
in the QR factorization    
A QA
= R.
L QL

6.4.5 Regularization with GMRES and MINRES


The regularizing properties of GMRES and MINRES (see Section 6.2.7) are not well understood.
The large eigenvalues of A are usually approximated quickly by those of the Hessenberg matrices
Hk+1,k in GMRES (for small values of k). In discrete ill-posed problems, the spectrum of A is
characterized by a sizeable gap between a large group of numerically zero eigenvalues and the
rest of the spectrum. As shown by Calvetti, Lewis, and Reichel [202, 2002], GMRES equipped
with a suitable stopping rule can deliver better approximations than TSVD. For severely ill-
conditioned problems with singular values σk = O(e−αk ) it holds that hk+1,k = O(nσk ).
For symmetric matrices, GMRES and MINRES have regularization properties similar to CGLS;
see Jensen and Hansen [666, 2007]. However, for nonsymmetric matrices the regularization
properties of GMRES are highly problem dependent. For some problems, GMRES does not
produce regularized solutions.
In the Arnoldi–Tikhonov regularization method, the regularized problem

min ∥Ax − b∥22 + λ2 ∥Lx∥22 (6.4.31)


x

is approximated by taking xk = Vk yk ∈ Kk (A, b) and using (6.2.69) to obtain the projected


problem

    \min_{y_k \in R^k} \left\| \begin{pmatrix} H_{k+1,k} \\ \lambda L V_k \end{pmatrix} y_k - \begin{pmatrix} \beta_1 e_1 \\ 0 \end{pmatrix} \right\|_2 .        (6.4.32)
For the standard case L = I, this method was introduced by Calvetti et al. [203, 2000]. The
regularization term simplifies because ∥LVk yk ∥22 = ∥yk ∥22 . Then (6.4.32) is equivalent to an
inner regularization of the projected problem and may be used in a hybrid method. This simplifies
the use of parameter choice rules for λ, such as the GCV and L-curve criteria.
The choice of a regularization term with L ̸= I in (6.4.31) is known to be potentially much
more useful. If L ∈ Rp×n , the system has dimension (k + p) × k, where often p ≫ k. To
solve subproblem (6.4.32) one may first compute the compact QR factorization LVk = Qk Rk ,
Rk ∈ Rk×k , and use the identity ∥LVk yk ∥2 = ∥Rk yk ∥2 . The reduced subproblem is then solved
by a sequence of Givens transformations.
GMRES applied to a singular system can break down before a solution has been found. The
following property is proved by Brown and Walker [182, 1997].

Theorem 6.4.1. For all b and starting approximations x0 , GMRES determines a least squares
solution x∗ of a singular system Ax = b without breakdown if and only if N (A) = N (AT ).
Furthermore, if the system is consistent and x0 ∈ R(AT ), then x∗ is a least-norm solution.

A variant of MINRES called MR-II and introduced by Hanke [567, 1995] has starting vector
Ab and generates approximations xk ∈ K(A, Ab). The multiplication with A acts as a smoothing
operator and dampens high-frequency noise in b. Range-restricted GMRES is a similar method
for the nonsymmetric case due to Calvetti, Lewis, and Reichel [200, 2000]. This is based on
the Arnoldi process and also generates approximations xk ∈ Kk (A, Ab). These methods some-
times provide better regularized solutions than GMRES and MINRES but can cause a loss of
information in the data b; see Calvetti, Lewis, and Reichel [201, 2001].
If rank(A) < n, it is no restriction to assume the structure
    
    \begin{pmatrix} A_{11} & A_{12} \\ 0 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} ,        (6.4.33)

where A11 ∈ Rr×r is nonsingular. Then the condition N (A) = N (AT ) in Theorem 6.4.1
is equivalent to A12 = 0, and the system is consistent if and only if b2 = 0. If these conditions
are satisfied, applying GMRES to (6.4.33) is equivalent to applying GMRES to the nonsingular
system A11 x1 = b1 . In practice it is usually the case that A12 and b2 are nonzero but small.
A common approach is to choose M as an approximation to A in which the small eigenvalues
are replaced by ones. Eldén and Simoncini [382, 2012] show that a similar effect is obtained by
taking M to be a singular preconditioner equal to a low-rank approximation to A and applying
GMRES to
AM † y = b, x = M † y. (6.4.34)

In the initial iterations the residual components corresponding to large eigenvalues will be re-
duced in norm. This approach is particularly suitable for ill-posed problems from partial differen-
tial equations in two or three dimensions, such as the Cauchy problem with variable coefficients.
A fast solver for a nearby problem can then be used as a singular preconditioner.

Notes and references

Surveys of methods for regularization of large-scale problems are given by Hanke and Hansen
[570, 1993] and Hansen [579, 2010]. Nemirovskii [826, 1986] gives a strict proof of the regular-
izing properties of CG methods and shows that CGLS and LSQR reach about the same accuracy
as Landweber’s method before divergence starts. Hanke [568, 2001] compares the regularizing
properties of CGLS and CGME, and Jia [669, 2020] studies the regularizing effects of CGME,
LSQR, and LSMR. Wei, Xie, and Zhang [1115, 2016] propose combining Tikhonov regulariza-
tion with a randomized algorithm for truncated GSVD.

Fierro et al. [404, 1997] propose using GKL bidiagonalization for computing truncated TLS
(TTLS) solutions. The use of bidiagonalization in Tikhonov regularization of large linear prob-
lems is further analyzed by Calvetti and Reichel [204, 2003]. The choice of regularization param-
eters in iterative methods is studied by Kilmer and O’Leary [696, 2001]. Hnětynková, Plešinger,
and Strakoš [631, 2009] use bidiagonalization to estimate the noise level in the data.
Novati and Russo [834, 2014] give theoretical results on convergence properties of the
Arnoldi–Tikhonov method with L ̸= I. Gazzola, Novati, and Russo [449, 2015] survey hy-
brid Krylov projection methods for Tikhonov regularized problems. They observe experimen-
tally that the method is very efficient for discrete ill-posed problems where the singular values
cluster at zero. They also investigate use of the GCV criterion within the Arnoldi–Tikhonov
method. A MATLAB package of iterative regularization methods called IR Tools is implemented
by Gazzola, Hansen, and Nagy [448, 2018]. This package also contains a set of large-scale test
problems.

6.4.6 Augmented and Deflated CGLS


Deflation and augmentation techniques have been used in a variety of contexts to accelerate the
convergence of Krylov subspace methods. In augmentation, a subspace is added to the Krylov
subspace. For example, for linear systems with multiple right-hand sides, the Krylov subspace
can be augmented to include information from previously solved systems. In deflation, the sys-
tem to be solved is multiplied by a projection that removes certain parts from the operator. For
example, components corresponding to small singular values that may slow down convergence
can be removed. This reduces the effective condition number and can significantly improve con-
vergence. Often, deflation is combined with augmentation to compensate for the singularity of
the operator.
Let Cx = d be a symmetric positive definite linear system, and let

W = (w1 , w2 , . . . , wp ) ∈ Rn×p

be a set of p linearly independent vectors that span a subspace to be added or removed. Then
both C and W T CW are symmetric positive definite. The deflated Lanczos process is obtained
by applying the standard Lanczos process to the auxiliary matrix

B = CH, H = I − W (W T CW )−1 (CW )T , (6.4.35)

where B is symmetric positive semidefinite and satisfies W^TB = 0. Furthermore, H^2 = H, and H is the C-orthogonal oblique projection onto W^{⊥_C}. The matrix

H T = I − CW (W T CW )−1 W T

is the C −1 -orthogonal projection onto W ⊥ . It is easily verified that

B = CH = H T C = H T CH. (6.4.36)

Let v1 be a unit vector such that W T v1 = 0. Then the standard Lanczos process applied to
B with starting vector v1 generates a sequence {vj } of mutually orthogonal unit vectors vj such
that
vj+1 ⊥ span {W } + Kj (C, v1 ) ≡ Kj (C, W, v1 ). (6.4.37)
The generated vectors satisfy

BVj = Vj Tj + βj+1 vj+1 eTj , (6.4.38)



where Vj = (v1 , v2 , . . . , vj ) and


$$T_j = \begin{pmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{j-1} & \alpha_{j-1} & \beta_j \\ & & & \beta_j & \alpha_j \end{pmatrix}.$$
From (6.4.35) it follows that
βj+1 vj+1 = Bvj − βj vj−1 − αj vj . (6.4.39)
By induction, noting that W^T B = 0, we see that W^T v_{j+1} = 0, j = 1, 2, . . . .
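A minimal MATLAB sketch of the deflated Lanczos process, assuming a symmetric positive definite C and a full-rank W are given, applies the ordinary three-term recurrence to the operator B = CH of (6.4.35) without forming H explicitly:

n = size(C,1);
CW = C*W;  G = W'*CW;                     % G = W'*C*W is symmetric positive definite
Bop = @(v) C*(v - W*(G\(CW'*v)));         % v -> B*v = C*H*v, cf. (6.4.35)
v = randn(n,1);  v = v - W*((W'*W)\(W'*v));   % enforce W'*v1 = 0
v = v/norm(v);  vold = zeros(n,1);  beta = 0;
for j = 1:5                               % a few steps of the recurrence (6.4.39)
    w = Bop(v) - beta*vold;
    alpha = v'*w;  w = w - alpha*v;
    beta = norm(w);  vold = v;  v = w/beta;
end
disp(norm(W'*v))                          % stays at roundoff level, cf. (6.4.37)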
Saad et al. [961, 2000] derive a deflated CG method for solving a symmetric positive definite
system Cx = d from the deflated Lanczos process. The resulting method is defined by the
conditions
xj+1 − xj ∈ Kj (C, W, r0 ), (6.4.40)
where
r0 = d − Cx0 ⊥ W, rj ⊥ Kj (C, W, r0 ). (6.4.41)
An x0 such that W^T r0 = 0 is given by
x0 = x−1 + W (W T CW )−1 W T r−1 , r−1 = d − Cx−1 , (6.4.42)
where x−1 is arbitrary. In particular, setting x−1 = 0 gives x0 = W (W T CW )−1 W T d. The
residual vectors rj in deflated CG satisfy rj/∥rj∥2 = vj, where the vectors vj are from the Lanczos process applied to B. It follows that the residuals rj are mutually orthogonal. The descent directions pj are C-orthogonal to each other and also C-orthogonal to all vectors wi, i.e., pi^T C pj = 0 for i ≠ j, and W^T C Pj = 0.
The deflated CG method can be viewed as a preconditioned CG (PCG) method with singular
preconditioner HH T . In the taxonomy defined by Ashby, Manteuffel, and Saylor [40, 1990], it
is equivalent to the version Omin(C, HH T , C) of PCG started with r0 ⊥ W .
A deflated version of CGLS is obtained by taking C = ATA, d = AT b and applying the
deflated CG method to ATAx = AT b, with the special starting point x0 = P b, where
P = W (W T ATAW )−1 (AW )T
is a projector. Each iteration requires one application of P to a vector. This operation can be
sped up by initially computing the product AW and its thin QR factorization AW = QR, giving
P = W R−1 QT . The following properties of the deflated CGLS algorithm follow from results
shown by Saad et al. [961, 2000]; see Theorem 4.2 and Theorem 4.3.

Theorem 6.4.2. Let A ∈ R^{m×n} and W ∈ R^{n×p} have full column rank. Let x∗ be the exact
solution of the least squares problem minx ∥Ax − b∥2 . Then the deflated CGLS algorithm will
not break down at any step. The approximate solution xk is the unique minimizer of the error
norm ∥rk − r∗ ∥2 , rk = b − Axk , over the affine solution space x0 + Kp,k (ATA, W, r0 ). Further,
an upper bound for the residual error after k iterations is given by
$$\|r_k - r_*\|_2 \le 2\left(\frac{\kappa(AH)-1}{\kappa(AH)+1}\right)^{k}\|r_* - r_0\|_2, \qquad (6.4.43)$$
where H is the oblique projector defined in (6.4.35).
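A minimal MATLAB sketch of the projected starting point, assuming A, b, and a full-rank W are given, uses the thin QR factorization of AW as described above:

[Q,R] = qr(A*W,0);            % thin QR factorization A*W = Q*R
Pfun = @(v) W*(R\(Q'*v));     % action of P = W*R^{-1}*Q' on a vector
x0 = Pfun(b);                 % starting point x0 = P*b
r0 = b - A*x0;
disp(norm(W'*(A'*r0)))        % W'*A'*r0 = 0 up to roundoff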

Notes and references


Nicolaides [830, 1987] derives a deflated CG method that uses the three-term recurrence of Lanc-
zos. The idea of augmenting a Krylov subspace method with vectors to improve the convergence
is discussed by Morgan [810, 1995]. Variations and applications of deflated and augmented
Krylov subspace methods are surveyed by Simoncini and Szyld [998, 2007, Section 9]. A gen-
eral framework for augmented and deflated Krylov subspace methods is given by Gaul et al. [443,
2013]. Baglama, Reichel, and Richmond [58, 2013] describe an LSQR algorithm augmented
with harmonic Ritz vectors associated with small singular values for accelerating ill-posed least
squares problems.
Chapter 7

SVD Algorithms and Matrix Functions

7.1 The QRSVD Algorithm


7.1.1 The LR and QR Iterations
Suppose that A ∈ Cn×n has the LU factorization A = LU . Then U = L−1 A, and multiplying
the factors in reverse order performs the similarity transformation
$$\widetilde A = UL = L^{-1}AL.$$
Hence A and $\widetilde A$ have the same eigenvalues, and the eigenvectors are related by $\widetilde X = L^{-1}X$. In
the LR algorithm of Rutishauser [947, 1958] this process is iterated. Setting A1 = A and

Ak = Lk Uk , Ak+1 = Uk Lk , k = 1, 2, . . . , (7.1.1)

generates a sequence of similar matrices such that

$$A_k = L_{k-1}^{-1}\cdots L_2^{-1}L_1^{-1}\, A\, L_1 L_2\cdots L_{k-1}, \qquad k = 2, 3, \ldots. \qquad (7.1.2)$$

Define the lower and upper triangular matrices

Tk = L1 · · · Lk−1 Lk , Sk = Uk Uk−1 · · · U1 , k = 1, 2, . . . . (7.1.3)

Then Tk−1 Ak = ATk−1 , and forming the product Tk Sk , we obtain

Tk Sk = Tk−1 (Lk Uk )Sk−1 = Tk−1 Ak Sk−1 = ATk−1 Sk−1 .

By induction it follows that


$T_k S_k = A^k$, k = 1, 2, . . . . (7.1.4)
This exhibits the close relationship between the LR algorithm and the power method. The power method is one of the oldest methods for computing eigenvalues and eigenvectors of a matrix. It is particularly suitable for finding a few extreme eigenvalues and the corresponding eigenvectors; see
Section 7.3.1. It is also directly or indirectly the basis for many other algorithms for singular
values and vectors.
When A is real symmetric and positive definite the Cholesky factorization A = LLT can be
used in the LR algorithm. Then the algorithm becomes

Ak = Lk LTk , Ak+1 = LTk Lk , k = 1, 2, . . . . (7.1.5)


Clearly, Ak+1 is again symmetric and positive definite, and therefore the recurrence is well
defined. Repeated application of (7.1.5) gives
$$A_k = T_{k-1}^{-1} A_1 T_{k-1} = T_{k-1}^{T} A_1 (T_{k-1}^{-1})^{T}, \qquad (7.1.6)$$

where Tk = L1 L2 · · · Lk . Further, we have

Ak = (L1 L2 · · · Lk )(LTk · · · LT2 LT1 ) = Tk TkT . (7.1.7)

Under certain restrictions the sequence of matrices Ak converges to a diagonal matrix whose
elements are the eigenvalues of A.
The QR algorithm is similar to the LR algorithm but uses orthogonal similarity transforma-
tions
$A_k = Q_k R_k$, \quad $A_{k+1} = R_k Q_k = Q_k^H A_k Q_k$, \quad k = 1, 2, . . . . (7.1.8)
The resulting matrix Ak+1 is similar to A1 = A. Successive iterates of the QR algorithm satisfy
relations similar to those derived for the LR algorithm. By repeated application of (7.1.8) it
follows that
Pk Ak+1 = APk , Pk = Q1 Q2 · · · Qk . (7.1.9)
Furthermore, setting Uk = Rk · · · R2 R1 , we have

Pk Uk = Pk−1 (Qk Rk )Uk−1 = Pk−1 Ak Uk−1 = APk−1 Uk−1 ,

and by induction,
$P_k U_k = A^k$, k = 1, 2, . . . . (7.1.10)
For the QR algorithm we have $A_k^T = A_k = R_k^T Q_k^T$ and hence
$$A_k^T A_k = A_k^2 = R_k^T Q_k^T Q_k R_k = R_k^T R_k, \qquad (7.1.11)$$

i.e., $R_k^T$ is the lower triangular Cholesky factor of $A_k^2$. For the Cholesky LR algorithm we have from (7.1.7) that
$$A_k^2 = L_k L_{k+1}(L_k L_{k+1})^T. \qquad (7.1.12)$$
By uniqueness, the Cholesky factorizations (7.1.11) and (7.1.12) of $A_k^2$ must be the same, and therefore $R_k^T = L_k L_{k+1}$. Thus
$$A_{k+1} = R_k Q_k = R_k A_k R_k^{-1} = L_{k+1}^T L_k^T A_k (L_{k+1}^T L_k^T)^{-1}.$$

Comparing this with (7.1.6) shows that two steps of the Cholesky LR algorithm are equivalent to
one step in the QR algorithm.
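A few lines of MATLAB illustrate the basic unshifted iteration (7.1.8) on a random symmetric matrix; without shifts the convergence to diagonal form is slow but visible:

n = 6;  A0 = randn(n);  A0 = (A0 + A0')/2;
A = A0;
for k = 1:300
    [Q,R] = qr(A);
    A = R*Q;                             % A_{k+1} = R_k*Q_k = Q_k'*A_k*Q_k
end
disp(norm(A - diag(diag(A)),'fro'))      % size of the remaining off-diagonal part
disp(sort(diag(A)) - sort(eig(A0)))      % diagonal approximates the eigenvalues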

7.1.2 Reduction to Compact Form


For a matrix A ∈ Rn×n , one step of the symmetric QR algorithm requires O(n3 ) flops, which is
too much and makes the algorithm impractical. We start by noting that the orthogonal similarity
of a real symmetric matrix A is again real symmetric. Furthermore, the QR iteration preserves
the upper and lower bandwidths of a band matrix. In particular, the QR algorithm preserves the
real symmetric tridiagonal form
$$Q^H A Q = T = \begin{pmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{n-1} & \alpha_{n-1} & \beta_n \\ & & & \beta_n & \alpha_n \end{pmatrix}. \qquad (7.1.13)$$

Matrix shapes invariant under symmetric QR algorithms are studied by Arbentz and Golub [30,
1995]. An initial reduction to real tridiagonal form reduces the arithmetic cost per step in the
QR algorithm to O(n) flops. The reduction can be carried out by a sequence of Householder
reflections
P = I − βuuH , β = 2/uH u.
In the kth step, A(k+1) = Pk A(k) Pk , where Pk is chosen to zero the last n − k − 1 elements in
the kth column. Dropping the subscripts k, we write
$$PAP = A - up^H - pu^H + \beta u^Hp\, uu^H = A - uq^H - qu^H, \qquad (7.1.14)$$
where p = βAu, q = p − γu, and γ = βuH p/2. The operation count for this reduction is about
2n3 /3 flops. A complex Hermitian matrix A ∈ Cn×n can be reduced to real tridiagonal form T
by a sequence of similarity transformations with complex Householder reflections.
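A minimal MATLAB sketch of this reduction, assuming a real symmetric A is given and using the rank-two update (7.1.14) directly, is:

n = size(A,1);
for k = 1:n-2
    x = A(k+1:n,k);  s = norm(x);
    if s == 0, continue, end
    u = x;
    if u(1) >= 0, u(1) = u(1) + s; else, u(1) = u(1) - s; end   % Householder vector
    u = [zeros(k,1); u];                  % embed in R^n
    beta = 2/(u'*u);
    p = beta*(A*u);
    g = beta*(u'*p)/2;
    q = p - g*u;
    A = A - u*q' - q*u';                  % P*A*P as in (7.1.14)
end
A = triu(tril(A,1),-1);                   % discard rounding-level fill outside the band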
The reduction to symmetric tridiagonal form is normwise backward stable. This ensures that
the larger eigenvalues will be computed with high relative accuracy. However, if the reduction is
performed starting from the top row, then the matrix should be ordered so that the larger elements
occur in the top left corner. This ensures the errors in the orthogonal reduction correspond to
small relative errors in the elements of A, and the small eigenvalues will not be destroyed.
If the reduction to tridiagonal form is carried out for a symmetric band matrix A in a similar
way, then the band structure will be destroyed in the intermediate matrices. By annihilating pairs
of elements using plane rotations in an ingenious order, the reduction can be performed without
increasing the intermediate bandwidth; see Rutishauser [949, 1963] and Schwarz [977, 1968].
For computing the SVD of a matrix A ∈ Cm×n it is advantageous to reduce it initially to
real bidiagonal form:
$$A = Q_B\begin{pmatrix} B \\ 0 \end{pmatrix}P_B^T, \qquad B = \begin{pmatrix} \rho_1 & \gamma_2 & & & \\ & \rho_2 & \gamma_3 & & \\ & & \ddots & \ddots & \\ & & & \rho_{n-1} & \gamma_n \\ & & & & \rho_n \end{pmatrix} \in \mathbb{R}^{n\times n}. \qquad (7.1.15)$$
As described in Section 4.2.1, this can be achieved by taking P and Q as products of Householder
matrices. The resulting matrix B has the same singular values as A, and the singular vectors of
B and A are simply related. Note that both B TB and BB T are tridiagonal.
The QR factorization of A ∈ Rm×n requires 2(mn2 − n3 /3) flops or twice as many if Q
is needed explicitly. This cost usually dominates the total cost of computing the SVD. If only
the singular values are required, then the cost of bidiagonalization typically is 90% of the total
cost. If singular vectors are wanted, then the explicit transformation matrices are needed, but the
reduction still accounts for more than half the total cost.
The errors from the bidiagonal reduction may often account for most of the errors in the
computed singular values. To minimize these errors the reduction should preferably be done as a
two-step procedure. In the first step a QR factorization with column pivoting of A is performed:
$$A\Pi = Q\begin{pmatrix} R \\ 0 \end{pmatrix}, \qquad R \in \mathbb{R}^{n\times n}. \qquad (7.1.16)$$

Next, R is reduced to upper bidiagonal form, which takes 8n3 /3 flops, or twice as many if the left
and right transformation matrices are wanted explicitly. Presorting the rows of A by decreasing
norms before the QR factorization can also reduce the relative errors in the singular values; see
Higham [622, 2000] and Drmač [335, 2017].
Note that a bidiagonal matrix with complex elements can always be transformed into real form by a sequence of unitary diagonal scalings from the left and right. In the first step, D1 = diag(e^{iα1}, 1, . . . , 1) is chosen to make the (1, 1) element in D1 B real. Next, D2 = diag(1, e^{iα2}, 1, . . . , 1) is chosen to make the (1, 2) element in (D1 B)D2 real, and so on.
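A minimal MATLAB sketch of this scaling, assuming a complex upper bidiagonal B is given, is:

n = size(B,1);  B0 = B;
Dl = eye(n);  Dr = eye(n);
for k = 1:n
    ph = exp(-1i*angle(B(k,k)));          % row scaling makes the (k,k) element real
    B(k,:) = ph*B(k,:);  Dl(k,k) = ph;
    if k < n
        ph = exp(-1i*angle(B(k,k+1)));    % column scaling makes the (k,k+1) element real
        B(:,k+1) = ph*B(:,k+1);  Dr(k+1,k+1) = ph;
    end
end
disp(norm(imag(B)))                       % B = Dl*B0*Dr is real to roundoff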
Let σi , i = 1 : n, be the singular values, and let ui and vi be the corresponding left and
right singular vectors of the upper bidiagonal matrix B in (7.1.15). Then the eigenvalues and
eigenvectors of the Jordan–Wielandt matrix W are given by
$$W\begin{pmatrix} u_i \\ \pm v_i \end{pmatrix} = \pm\sigma_i\begin{pmatrix} u_i \\ \pm v_i \end{pmatrix}, \qquad W = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}.$$
By an odd–even permutation the matrix W can be brought into the special real symmetric tridi-
agonal form
$$G = PWP^T = \begin{pmatrix} 0 & \rho_1 & & & & & \\ \rho_1 & 0 & \gamma_2 & & & & \\ & \gamma_2 & 0 & \rho_2 & & & \\ & & \rho_2 & \ddots & \ddots & & \\ & & & \ddots & \ddots & \gamma_n & \\ & & & & \gamma_n & 0 & \rho_n \\ & & & & & \rho_n & 0 \end{pmatrix} \in \mathbb{R}^{2n\times 2n}, \qquad (7.1.17)$$
with zero diagonal elements first considered by Golub and Kahan [495, 1965].
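A few lines of MATLAB illustrate the construction of G from an upper bidiagonal B (assumed given) by the odd–even permutation and verify that the eigenvalues come in pairs ±σi:

n = size(B,1);
W = [zeros(n) B; B' zeros(n)];            % Jordan-Wielandt matrix
p = reshape([n+1:2*n; 1:n],1,[]);         % odd-even permutation
G = W(p,p);                               % tridiagonal with zero diagonal, cf. (7.1.17)
disp(norm(sort(abs(eig(G))) - sort([svd(B); svd(B)])))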

7.1.3 Zero-Shift QRSVD Algorithm


Let A = QR ∈ Rm×n be the QR factorization of A. Set R0 = R, and compute the sequence
R1 , R2 , . . . of upper triangular matrices by
$$R_{k-1}^T = Q_k R_k, \qquad k = 1, 2, \ldots. \qquad (7.1.18)$$
That is, the upper triangular $R_k$ is obtained from the QR factorization of the lower triangular matrix $R_{k-1}^T$. Combining two steps of (7.1.18) gives $R_{k+1} = Q_{k+1}^T R_k^T = Q_{k+1}^T R_{k-1} Q_k$. Hence $R_{k+1}$ has the same singular values as $R_{k-1}$ and
$$R_{k+1}^T R_{k+1} = R_k R_k^T = Q_k^T (R_{k-1}^T R_{k-1}) Q_k. \qquad (7.1.19)$$
This amounts to one step of the QR algorithm applied to $R_{k-1}^T R_{k-1}$. Similarly,
$$R_{k+1} R_{k+1}^T = Q_{k+1}^T (R_k^T R_k) Q_{k+1} = Q_{k+1}^T (R_{k-1} R_{k-1}^T) Q_{k+1}, \qquad (7.1.20)$$
which shows that also one step of the classical QR algorithm on $R_{k-1} R_{k-1}^T$ has been performed.
Notably, this has been achieved without explicitly forming either $A^TA$ or $AA^T$, which could have resulted in a loss of accuracy.


One step of the iteration (7.1.18) requires about O(n3 ) flops. This is too much to make it a
practical algorithm. By initially reducing A to bidiagonal form, the iteration (7.1.18) becomes $B_0 = B$,
$$B_{k-1}^T = Q_k B_k, \qquad k = 1, 2, \ldots. \qquad (7.1.21)$$
This can be performed by applying a sequence of plane rotations to the lower bidiagonal matrix $B_{k-1}^T$ of (7.1.15) to produce an upper bidiagonal matrix. In the first step the off-diagonal element
in the first column is zeroed, giving
    
$$\begin{pmatrix} c & s \\ -s & c \end{pmatrix}\begin{pmatrix} \rho_1 & 0 \\ \gamma_2 & \rho_2 \end{pmatrix} = \begin{pmatrix} \hat\rho_1 & \hat\gamma_2 \\ 0 & \hat\rho_2 \end{pmatrix}, \qquad (7.1.22)$$
where c = ρ1 /ρ̂1 , s = γ2 /ρ̂1 . The new elements are
$$\hat\rho_1 = \sqrt{\rho_1^2 + \gamma_2^2}, \qquad \hat\gamma_2 = \gamma_2(\rho_2/\hat\rho_1), \qquad \hat\rho_2 = \rho_1(\rho_2/\hat\rho_1). \qquad (7.1.23)$$

The remaining steps are similar. Note that s and c in the plane rotations are not needed and
that two successive steps of the algorithm will transform a lower bidiagonal matrix back into
lower bidiagonal form. The work in one step of the bidiagonal zero-shift SVD algorithm is 4n
multiplications, n divisions, and n square roots. The algorithm uses no subtractions. Therefore no cancellation can take place, and each entry of the transformed matrix is computed
to high relative accuracy. By merging the two steps, we obtain the zero-shift algorithm used by
Demmel and Kahan [310, 1990].
The repeated transformation from lower to upper triangular form, or flipping of a triangu-
lar matrix, was first analyzed by Faddeev, Kublanovskaya, and Faddeeva [392, 1968]; see also
Chandrasekaran and Ipsen [233, 1995].
The following remarkably compact MATLAB function by Fernando and Parlett [403, 1994]
is simpler and more efficient. It performs one step of the unshifted QRSVD algorithm on a lower
or upper bidiagonal matrix B whose diagonal and off-diagonal elements are stored in rho(1:n) and gam(2:n).

Algorithm 7.1.1 (Zero-Shift Bidiagonal QRSVD).

function [rho,gam] = bidqr(rho,gam)
% One step of the unshifted (zero-shift) QRSVD algorithm applied to a
% bidiagonal matrix with diagonal rho(1:n) and off-diagonal gam(2:n),
% cf. (7.1.22)-(7.1.23).
n = length(rho);
for i = 1:n-1
    old = rho(i);                       % current value at position i
    rho(i) = norm([old, gam(i+1)]);     % hat(rho_i) = sqrt(rho_i^2 + gam_{i+1}^2)
    t = rho(i+1)/rho(i);
    gam(i+1) = gam(i+1)*t;              % hat(gam_{i+1}) = gam_{i+1}*rho_{i+1}/hat(rho_i)
    rho(i+1) = old*t;                   % intermediate value passed to the next step
end
end

If some element γi = 0, where i < n, then the bidiagonal matrix splits into a direct sum of
two smaller bidiagonal matrices
 
$$B = \begin{pmatrix} B_1 & 0 \\ 0 & B_2 \end{pmatrix},$$
which the algorithm can treat separately. In particular, if γn = 0, then σ = ρn is a singular value.
If a diagonal element ρi = 0, i < n, then B is singular and must have a zero singular value.
Then in the next iteration the algorithm will drive this zero element to the last position, giving
γn = 0.
Demmel and Kahan [310, 1990] show that the singular values of a bidiagonal matrix are
determined to full relative accuracy by their elements, independent of their magnitudes, while
the error bounds for the associated singular vectors depend on the relative gap γi between σi
and other singular values.

Theorem 7.1.1. Let B and B̄ = B + δB, |δB| ≤ ω|B|, be upper bidiagonal matrices in Rn×n ,
with singular values σ1 ≥ · · · ≥ σn and σ̄1 ≥ · · · ≥ σ̄n , respectively. If η = (2n − 1)ω < 1,
then for i = 1, . . . , n,
$$|\bar\sigma_i - \sigma_i| \le \frac{\eta}{1-\eta}\,|\sigma_i|, \qquad (7.1.24)$$
$$\max\bigl(\sin\theta(u_i,\bar u_i),\, \sin\theta(v_i,\bar v_i)\bigr) \le \frac{2\eta(1+\eta)}{\gamma_i - \eta}, \qquad \gamma_i = \min_{j\ne i}\frac{|\sigma_i - \sigma_j|}{\sigma_i + \sigma_j}. \qquad (7.1.25)$$

More generally, Demmel et al. [307, 1999] show that high relative accuracy in the computed
SVD can be achieved for matrices that are diagonal scalings of a well-conditioned matrix. They
consider rank-revealing decompositions of the form

A = XDY T , X ∈ Rm×r , Y ∈ Rr×n , (7.1.26)

where X and Y are well-conditioned and D is diagonal. Such a decomposition can be obtained,
e.g., using Gaussian elimination with rook or complete pivoting.
In the zero-shift QRSVD algorithm the diagonal elements of B will converge to the singular
values σi arranged in order of decreasing absolute magnitude. The superdiagonal elements will
behave asymptotically like cij (σi /σj )2k for some constants cij . Hence, the rate of convergence
is slow unless there is a substantial gap between the singular values; see Theorem 7.3.4. The
remedy is to introduce suitable chosen shifts in the QR algorithm. However, to do this stably is
a nontrivial task, and hence it is the subject of the next section.

7.1.4 The Implicitly Shifted QR Algorithm


The rate of convergence of the QR algorithm for a real symmetric tridiagonal matrix T ∈ Rn×n
depends on the ratios λi+1/λi of the eigenvalues λ1 ≥ · · · ≥ λn of T. By introducing shifts in
the algorithm, the convergence rate can be improved. The shifted matrix T − τ I has the same
invariant subspaces as T , and the eigenvalues are λi − τ , i = 1, . . . , n. With variable shifts the
shifted QR algorithm becomes T1 = T ,

Tk − τ I = Qk Rk , Rk Qk + τ I = Tk+1 , k = 1, 2, . . . . (7.1.27)

Since the shift is restored, each iteration is an orthogonal similarity transformation, and it holds
that Tk+1 = QTk Tk Qk . Further, the eigenvectors of T can be found by accumulating the product
Pk = Q1 · · · Qk , k = 1, 2, . . . . If the shift is chosen to approximate a simple eigenvalue λ of T ,
convergence of the QR algorithm to this eigenvalue will be fast.
Performing the shift τ in (7.1.27) explicitly will affect the accuracy of the smaller eigenvalues
for which |λi | ≪ |τ |. This is avoided in the implicitly shifted QR algorithm due to Francis [431,
1961], [432, 1961], where algorithmic details for performing the shifts implicitly are described.
A crucial role is played by the following theorem.

Theorem 7.1.2 (Implicit Q Theorem). Let A ∈ R^{n×n} and an orthogonal matrix Q = (q1, . . . , qn) be given such that H = Q^TAQ is upper Hessenberg with real positive subdiagonal elements. Then H and Q are uniquely determined by the first column q1 = Qe1.

Proof. Assume that the first k columns q1 , . . . , qk in Q and the first k−1 columns in H have been
computed. (Since q1 is known, this assumption is valid for k = 1.) Equating the kth columns in
QH = AQ gives

h1,k q1 + · · · + hk,k qk + hk+1,k qk+1 = Aqk , k = 1 : n − 1.

Multiplying this by qiH and using the orthogonality of Q gives hik = qiH Aqk , i = 1 : k. Since H
is unreduced, $h_{k+1,k} \ne 0$ and
$$q_{k+1} = h_{k+1,k}^{-1}\Bigl(Aq_k - \sum_{i=1}^{k} h_{ik}q_i\Bigr), \qquad \|q_{k+1}\|_2 = 1.$$

This and the condition that hk+1,k is real positive determine qk+1 uniquely.

In the implicit shift tridiagonal QR algorithm, the QR step (7.1.27) is performed as follows.
The first plane rotation P1 = G12 is chosen so that
P1 t1 = ±∥t1 ∥2 e1 , t1 = (α1 − τ, β2 , 0, . . . , 0)T ,
where t1 is the first column in Tk − τk I. The result of applying this transformation is pictured
below (for n = 5):
$$P_1^T T_k = \begin{pmatrix} \times & \times & + & & \\ \times & \times & \times & & \\ & \times & \times & \times & \\ & & \times & \times & \times \\ & & & \times & \times \end{pmatrix}, \qquad P_1^T T_k P_1 = \begin{pmatrix} \times & \times & + & & \\ \times & \times & \times & & \\ + & \times & \times & \times & \\ & & \times & \times & \times \\ & & & \times & \times \end{pmatrix}.$$
To preserve the tridiagonal form, a transformation P2 = G23 is used to zero out the new nonzero
elements:
$$P_2^T(P_1^T T P_1)P_2 = \begin{pmatrix} \times & \times & 0 & & \\ \times & \times & \times & + & \\ 0 & \times & \times & \times & \\ & + & \times & \times & \times \\ & & & \times & \times \end{pmatrix}.$$
This creates two new nonzero elements, which in turn are moved further down the diagonal with
plane rotations. This process is known as chasing. Eventually, the nonzeros outside the diagonal
will disappear outside the border. By Theorem 7.1.3 the resulting symmetric tridiagonal matrix
QT Tk Q must equal Tk+1 , because the first column in Qk is P1 P2 · · · Pn−1 e1 = P1 e1 .
The shift τ in the QR algorithm is usually taken to be the eigenvalue of the trailing principal
2 × 2 submatrix of T,
$$\begin{pmatrix} \alpha_{n-1} & \beta_n \\ \beta_n & \alpha_n \end{pmatrix}, \qquad (7.1.28)$$
closest to αn , the so-called Wilkinson shift. In the case of a tie (αn−1 = αn ) the smaller
eigenvalue αn − |βn | is chosen. Wilkinson [1121, 1968] shows that, neglecting rounding errors,
this shift guarantees global convergence and that local convergence is nearly always cubic. A
stable formula for computing the shift is
$$\tau = \alpha_n - \operatorname{sign}(\delta)\,\beta_n^2\big/\bigl(|\delta| + \sqrt{\delta^2 + \beta_n^2}\bigr), \qquad \delta = (\alpha_{n-1}-\alpha_n)/2;$$

see Parlett [884, 1998, Section 8.9].


The QRSVD algorithm for computing the SVD is obtained by applying the implicitly shifted
QR algorithm to the symmetric positive definite tridiagonal matrix
$$T = B^TB = \begin{pmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{n-1} & \alpha_{n-1} & \beta_n \\ & & & \beta_n & \alpha_n \end{pmatrix} \in \mathbb{R}^{n\times n}, \qquad (7.1.29)$$
where B is the bidiagonal matrix (7.1.15). Then $\alpha_1 = \rho_1^2$,
$$\alpha_i = \rho_i^2 + \gamma_i^2, \qquad \beta_i = \rho_{i-1}\gamma_i, \qquad i = 2, \ldots, n.$$
If B T B is explicitly formed, the accuracy of small singular values is destroyed. Furthermore, it
is not clear how to stably obtain the left singular vectors ui . However, the implicit QR iterations
on B TB can be performed without explicitly forming B TB by using the following result.

Theorem 7.1.3. Let Q = (q1 , . . . , qn ) and V = (v1 , . . . , vn ) be orthogonal matrices such that
QT M Q = T and V T M V = S are real, symmetric, and tridiagonal. If v1 = q1 and T is
unreduced, then vi = ±qi , i = 2, . . . , n.

For a shift τ , let t1 be the first column of B TB − τ I, and determine the plane rotation
T1 = R12 so that

$$T_1^T t_1 = \pm\|t_1\|_2 e_1, \qquad t_1 = (\rho_1^2 - \tau,\ \rho_1\gamma_2,\ 0, \ldots, 0)^T. \qquad (7.1.30)$$

Next, apply a sequence of plane rotations to make


$$T_{n-1}^T \cdots T_2^T T_1^T\, B^TB\, T_1 T_2 \cdots T_{n-1}$$

tridiagonal. To do this implicitly, start by applying the transformation T1 to B. This gives (take
n = 5)
$$BT_1 = \begin{pmatrix} \times & \times & & & \\ + & \times & \times & & \\ & & \times & \times & \\ & & & \times & \times \\ & & & & \times \end{pmatrix}.$$

Next, premultiply by a plane rotation S1T = R12 to zero out the + element. This creates a new
nonzero element in the (1, 3) position. To preserve the upper bidiagonal form, choose a rotation
T2 = R23 to zero out the element +:

$$S_1^T BT_1 = \begin{pmatrix} \times & \times & + & & \\ 0 & \times & \times & & \\ & & \times & \times & \\ & & & \times & \times \\ & & & & \times \end{pmatrix}, \qquad S_1^T BT_1T_2 = \begin{pmatrix} \times & \times & 0 & & \\ & \times & \times & & \\ & + & \times & \times & \\ & & & \times & \times \\ & & & & \times \end{pmatrix}.$$

Then continue to chase the element + down, with transformations alternately from the right and
left until a new upper bidiagonal matrix
$$\hat B = S_{n-1}^T \cdots S_1^T\, B\, T_1 \cdots T_{n-1} = U^TBP$$
is obtained. But then $\hat T = \hat B^T\hat B = P^TB^TUU^TBP = P^TTP$ is a tridiagonal matrix, where the first column of P equals the first column of $T_1$. If T is unreduced, $\hat T$ must be the result of a QR iteration on T with shift equal to τ.
When the shift τ approximates an eigenvalue of B TB, the element γn in the last row and last
column of T in (7.1.29) will approach zero very quickly. The Wilkinson shift is determined from
the trailing 2 × 2 submatrix of B T B, namely
 
$$\begin{pmatrix} \rho_{n-1}^2 + \gamma_{n-1}^2 & \rho_{n-1}\gamma_n \\ \rho_{n-1}\gamma_n & \rho_n^2 + \gamma_n^2 \end{pmatrix}. \qquad (7.1.31)$$

When |γn| ≤ δ, where δ is a prescribed tolerance, ρn is accepted as a singular value, and the
order of the matrix B is reduced by one. This automatic deflation is an important property of the
QR algorithm.

In practice, after each QR step the convergence criterion

|γi | ≤ 0.5u(|ρi−1 | + |ρi |), i = 2, . . . , n, (7.1.32)

is checked. If this is satisfied for some i < n, the matrix splits into a direct sum of two smaller
bidiagonal matrices B1 and B2 for which the QR iterations can be continued independently.
Furthermore, if ρi = 0 for some i ≤ n, then B must have at least one singular value equal to
zero. Therefore also a second convergence criterion

|ρi | ≤ 0.5u(|γi | + |γi+1 |), i = 2, . . . , n, (7.1.33)

is checked. If this is satisfied for some i < n, then the ith row can be zeroed out by a sequence of
plane rotations Gi,i+1 , Gi,i+2 , . . . , Gi,n applied from the left to B. The new elements generated
in the ith column can be discarded without introducing an error in the singular values that is larger
than some constant times u∥B∥2 . Then the matrix B again splits into two smaller bidiagonal
matrices B1 and B2 .
The criteria (7.1.32)–(7.1.33) ensure backward stability of the QRSVD algorithm in the
normwise sense, i.e., the computed singular values σ̄k are the exact singular values of a nearby
matrix B + δB, where ∥δB∥2 ≤ c(n) · uσ1. Here c(n) is a constant depending on n, and u is the machine unit. Thus, if B is nearly rank-deficient, this will always be revealed
by the computed singular values. The penalty for not spotting a negligible element is not loss
of accuracy but a slowdown of convergence. However, the smaller singular values may not be
computed with high relative accuracy. When all off-diagonal elements in B have converged to
zero, we have $Q_S^T B T_S = \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$. The left and right singular vectors of B are
given by accumulating the product of transformations in the QRSVD iterations.
Each QRSVD iteration requires 14n multiplications and 2n calls to givrot. If singular vectors
are desired, accumulating the rotations into U and V requires 6mn and 6n2 flops, respectively,
and the overall cost goes up to O(mn2 ). Usually less than 2n QR iterations are needed. When
singular vectors are desired, the number of QR iterations can be reduced by first computing the
singular values without accumulating singular vectors. Then the QRSVD algorithm is run a
second time with shifts equal to the computed singular values, the so-called perfect shifts. Then
convergence occurs in at most n iterations. This may reduce the cost of the overall computations
by about 40%.
A variant of the QRSVD algorithm is proposed by Chan [222, 1982]. This differs in that a
QR factorization is performed before the bidiagonalization. In Table 7.1.1 operation counts are
shown for standard QRSVD and Chan’s version. Four different cases are considered depending
on whether U1 ∈ Rm×n and V ∈ Rn×n are explicitly required or not. Only the highest order
terms in m and n are kept. It is assumed that the iterative phase takes on average two complete
QR iterations per singular value and that standard plane rotations are used. Case (a) arises in
the computation of the pseudoinverse, case (c) in least squares applications, and case (d) in the
estimation of condition numbers and rank determination.

Table 7.1.1. Comparison of multiplications for SVD algorithms.

Case   Required     Golub–Reinsch SVD      Chan SVD

(a)    Σ, U1, V     12mn^2 + 22n^3/3       6mn^2 + 16n^3
(b)    Σ, U1        12mn^2 − 2n^3          6mn^2 + 26n^3/3
(c)    Σ, V         4mn^2 + 8n^3           2mn^2 + 28n^3/3
(d)    Σ            4mn^2 − 4n^3/3         2mn^2 + 2n^3

The QL algorithm is a variant of the QR algorithm based on the iteration

Ak = Qk Lk , Lk Qk = Ak+1 , k = 1, 2, . . . , (7.1.34)

with Lk lower triangular. This is merely a reorganization of the QR algorithm. Let J ∈ Rn×n be
the symmetric permutation matrix J = (en , . . . , e2 , e1 ). Then JA reverses the rows of A, AJ
reverses the columns of A, and JAJ reverses both rows and columns. If R is upper triangular,
then JRJ is lower triangular. It follows that if A = QR is the QR factorization of A, then
JAJ = (JQJ)(JRJ) is the QL factorization of JAJ. Hence, the QR algorithm applied to A is
the same as the QL algorithm applied to JAJ. Therefore the convergence theory is essentially
the same for both algorithms. But in the QL algorithm, inverse iteration is taking place in the top
left corner of A, and direct iteration in the lower right corner.
A bidiagonal matrix is said to be graded if the elements are large at one end and small at the
other. If the bidiagonalization uses an initial QR factorization with column pivoting, then the
matrix is usually graded from large at upper left to small at lower right, as illustrated here:

$$\begin{pmatrix} 1 & 10^{-1} & & \\ & 10^{-2} & 10^{-3} & \\ & & 10^{-4} & 10^{-5} \\ & & & 10^{-6} \end{pmatrix}.$$

This is advantageous for the QR algorithm, which tries to converge to the singular values from
smallest to largest and “chases the bulge” from top to bottom. Convergence will usually be fast
if the matrix is graded this way. However, if B is graded the opposite way, the QR algorithm
may require many more steps, and the QL algorithm should be used instead. Alternatively, the
rows and columns of B could be reversed. When the matrix breaks up into diagonal blocks that
are graded in different ways, the bulge should be chased in the appropriate direction.
The QRSVD algorithm by Demmel and Kahan [310, 1990] is substantially improved com-
pared to the Golub–Reinsch algorithm. It computes the smallest singular values to maximal
relative accuracy and the others to maximal absolute accuracy. This is achieved by using the
zero-shift QR algorithm on any submatrix whose condition number κ = σmax /σmin is so large
that the shifted QR algorithm would make unacceptably large changes in the computed σmin . Al-
though the zero-shift algorithm has only a linear rate of convergence, it converges quickly when
σmin /σmax is very small. The zero-shift algorithm uses only about a third of the operations per
step as the shifted version. This makes the Demmel–Kahan algorithm faster and, occasionally,
much faster than the original Golub–Reinsch algorithm. Other important features of the new al-
gorithm are stricter convergence criteria and the use of a more accurate algorithm for computing
singular values and vectors of an upper triangular 2 × 2 matrix; see Section 7.2.2.

Notes and references

The QR algorithm was independently discovered by Kublanovskaya [709, 1961]. The story
of the QR algorithm and its later developments is told by Golub and Uhlig [510, 2009]. An
exposition of Francis’s work on the QR algorithm is given in Watkins [1104, 2011]. A two-
stage bidiagonalization algorithm where the matrix is first reduced to band form is developed by
Großer and Lang [541, 1999].
Initially, Golub [488, 1968] applied the Francis implicit QR algorithm to the special symmet-
ric tridiagonal matrix K in (7.1.17), whose eigenvalues are ±σi . If double QR steps with shifts
±τi are taken, then the zero diagonals in K are preserved. This makes it possible to remove
the redundancy caused by the doubling of the dimensions. The resulting algorithm is outlined
also in the Stanford CS report of Golub and Businger [485, 1967], which contains an ALGOL
implementation by Businger.
The algorithm given by Golub and Reinsch [507, 1971] for computing the SVD is one of the
most elegant and reliable in numerical linear algebra and has been cited over 4600 times (as of
2023). The FORTRAN program for the SVD of a complex matrix of Businger and Golub [194,
1969] is an adaptation of the same code. The LINPACK implementation of the QRSVD al-
gorithm (see Dongarra et al. [322, 1979, Chap. 11]) follows the Handbook algorithm, except it
determines the shift from (7.1.31).
The QRSVD algorithm can be considered as a special instance of a product eigenvalue
problem, where two matrices A and B are given, and one wishes to find the eigenvalues of a
product matrix C = AB or quotient matrix C = AB −1 . For stability reasons, one wants to
operate on the factors A and B separately, without forming AB or AB −1 explicitly; see Heath
et al. [597, 1986]. The relationship between the product eigenvalue problem and the QRSVD
algorithm is discussed by Kressner [707, 2005], [706, 2005]. An overview of algorithms and
software for computing eigenvalues and singular values is given by Bai et al. [61, 2000].

7.2 Alternative SVD Algorithms


7.2.1 Bisection-Type Methods
The number of eigenvalues of a symmetric tridiagonal matrix

$$T = \begin{pmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \beta_3 & \ddots & \ddots & \\ & & \ddots & \alpha_{n-1} & \beta_n \\ & & & \beta_n & \alpha_n \end{pmatrix} \in \mathbb{R}^{n\times n} \qquad (7.2.1)$$

that are greater than or less than a specified value can be determined by the method of bisection or
spectrum slicing. Early implementations of such methods were based on computing the leading principal minors $p_k(\lambda) = \det(T_k - \lambda I)$ of the shifted matrix $T - \lambda I$. Expanding the determinant along the last row and defining $p_0 = 1$ gives

$$p_1(\lambda) = (\alpha_1 - \lambda)p_0, \qquad p_k(\lambda) = (\alpha_k - \lambda)p_{k-1}(\lambda) - \beta_k^2 p_{k-2}(\lambda), \quad k = 2, \ldots, n. \qquad (7.2.2)$$

For a given numerical value of λ, the so-called Sturm sequence p1 (λ), . . . , pn (λ) can be evalu-
ated in 3n flops using (7.2.2).

Lemma 7.2.1. If the tridiagonal matrix T is irreducible, i.e., βi ̸= 0, i = 2, . . . , n, then the zeros
of pk−1 (λ) strictly separate those of pk (λ).

Proof. By Cauchy’s interlacing theorem, the eigenvalues of any leading principal minor of a
Hermitian matrix A ∈ Rn×n interlace the eigenvalues of A. In particular, the zeros of each
pk−1 (λ) separate those of pk (λ), at least in the weak sense. Suppose now that µ is a zero
of both pk (λ) and pk−1 (λ). Since βk ̸= 0, it follows from (7.2.2) that µ is also a zero of
pk−2 (λ). Continuing in this way shows that µ is a zero of p0 . This is a contradiction because
p0 = 1.

Theorem 7.2.2. Let s(τ ) be the number of agreements in sign of consecutive members in
the Sturm sequence p1 (τ ), p2 (τ ), . . . , pn (τ ). If pi (τ ) = 0, the sign is taken to be opposite
that of pi−1 (τ ). (Note that two consecutive pi (τ ) cannot be zero.) Then s(τ ) is the number of
eigenvalues of T strictly greater than τ.

Proof. See Wilkinson [1120, 1965, pp. 300–301].

Bisection can be used to locate an individual eigenvalue λk independent of any of the others
and is therefore suitable for parallel computing. The Sturm sequence algorithm is very stable
when carried out in IEEE floating-point arithmetic but is susceptible to underflow and overflow
and other numerical problems. There are ways to overcome these problems as shown by Barth,
Martin, and Wilkinson [93, 1971].
More recent implementations of bisection methods are developments of the inertia algorithm
analyzed by Kahan [680, 1966]; see Fernando [402, 1998]. The inertia of a symmetric matrix A
is defined as the triple of the numbers of positive, negative, and zero eigenvalues of A. Sylvester's law (Horn and Johnson [639, 1985]) says that the inertia is preserved under congruence transformations. If symmetric Gaussian elimination is carried out for A − τI, it yields the factorization
$$A - \tau I = LDL^T, \qquad D = \operatorname{diag}(d_1, \ldots, d_n), \qquad (7.2.3)$$
where L is unit lower triangular. Since A − τI is congruent to
D, it follows from Sylvester’s law that the number of eigenvalues of A smaller than τ equals
the number of negative elements π(D) in the sequence d1 , . . . , dn . Applied to a symmetric and
tridiagonal matrix T − τ I = LDLT , this procedure becomes particularly efficient and reliable.
A remarkable fact is that provided over- or underflow is avoided, element growth will not affect
the accuracy. For example, the LDLT factorization
     
$$A - I = \begin{pmatrix} 1 & 2 & \\ 2 & 2 & -4 \\ & -4 & -6 \end{pmatrix} = \begin{pmatrix} 1 & & \\ 2 & 1 & \\ & 2 & 1 \end{pmatrix}\begin{pmatrix} 1 & & \\ & -2 & \\ & & 2 \end{pmatrix}\begin{pmatrix} 1 & 2 & \\ & 1 & 2 \\ & & 1 \end{pmatrix}$$
shows that A has two eigenvalues greater than 1.
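A minimal MATLAB sketch of this count, assuming a symmetric tridiagonal T with diagonal alpha(1:n) and off-diagonal beta(2:n) and a shift tau are given, is:

d = alpha(1) - tau;  count = (d < 0);
for i = 2:n
    d = (alpha(i) - tau) - beta(i)^2/d;   % diagonal d_i of T - tau*I = L*D*L'
    count = count + (d < 0);
end
% count now equals the number of eigenvalues of T smaller than tau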
The bisection method can be used to locate singular values of a bidiagonal matrix B by
applying it to compute the eigenvalues of one of the tridiagonal matrices B^TB and BB^T. This can
be done without forming these matrix products explicitly. However, the best option is to apply
bisection to the special symmetric tridiagonal matrix T of Golub–Kahan form (7.1.17) with zero
diagonal and eigenvalues ±σi . It gives the highest relative accuracy in the computed singular
value.
By applying the bisection procedure to the special symmetric tridiagonal matrix G ∈ R2n×2n
in
$$G = \begin{pmatrix} 0 & \rho_1 & & & & & \\ \rho_1 & 0 & \gamma_2 & & & & \\ & \gamma_2 & 0 & \rho_2 & & & \\ & & \rho_2 & \ddots & \ddots & & \\ & & & \ddots & \ddots & \gamma_n & \\ & & & & \gamma_n & 0 & \rho_n \\ & & & & & \rho_n & 0 \end{pmatrix} \in \mathbb{R}^{2n\times 2n} \qquad (7.2.4)$$
with zero diagonal, we obtain a method for computing selected singular values σi of an irre-
ducible bidiagonal matrix Bn with elements ρ1 , . . . , ρn and γ2 , . . . , γn . Recall that G is per-
mutationally equivalent to the Jordan–Wielandt matrix and has eigenvalues equal to ±σi (Bn ),
i = 1, . . . , n.

Following Fernando [402, 1998], the diagonal elements in the LDLT factorization of G − τ I
are obtained by Gaussian elimination as

$$d_1 = -\tau, \qquad d_i = -\tau - z_i/d_{i-1}, \quad i = 2, \ldots, 2n, \qquad (7.2.5)$$
where $z_i = \rho_i^2$ if i is odd, $z_i = \gamma_i^2$ if i is even, and $z_0 = 0$.

Algorithm 7.2.1 (Bisection for Singular Values).


Let ρ1 , γ2 , ρ2 , γ3 , . . . , ρn be the off-diagonal elements of the tridiagonal matrix G in (7.2.4).
Set zi = ρ2i , if i is odd and zi = γi2 , if i is even. On exit π is the number of singular values of B
that are less than τ > 0.

π := 0; d := −τ ;
if d < 0 then π = 1;
for i = 1 : 2n − 1
d := −τ − zi /d;
if d < 0 then π := π + 1;
end

One step in Algorithm 7.2.1 requires 2n flops, and only the elements $d_k$ need be stored. The number of multiplications can be halved by precomputing the squares $z_k$, but this may cause unnecessary over- or underflow. To prevent breakdown of the recursion, the algorithm should be modified so that a small $|d_k|$ is replaced by $\sqrt{\omega}$, where ω is the underflow threshold.
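A minimal MATLAB sketch of bisection based on this count, assuming rho(1:n), gam(2:n), an index k, and a tolerance tol are given, brackets the kth smallest singular value; note that the Golub–Kahan matrix also has the n eigenvalues −σi below any τ > 0, so n is subtracted from the count of negative pivots:

z = zeros(2*n-1,1);
z(1:2:end) = rho(:).^2;  z(2:2:end) = gam(2:n).^2;   % squared off-diagonals of G
lo = 0;  hi = sqrt(sum(z));                          % sigma_max <= ||B||_F
while hi - lo > tol*hi
    tau = (lo + hi)/2;
    d = -tau;  cnt = 1;                              % d_1 = -tau < 0
    for i = 1:2*n-1
        d = -tau - z(i)/d;
        cnt = cnt + (d < 0);
    end
    if cnt - n >= k, hi = tau; else, lo = tau; end   % cnt - n singular values below tau
end
sig_k = (lo + hi)/2;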
Kahan [680, 1966] gives a detailed error analysis of the bisection algorithm. Assuming that
no over- or underflow occurs, he proves the monotonicity of the inertia counts in IEEE floating-
point arithmetic. He shows that the computed number π̄ is the exact number of singular values
greater than $\sigma'$ of a tridiagonal matrix $T'$, where
$$|\sigma' - \sigma| \le u\sigma, \qquad |\alpha_k' - \alpha_k| \le 2u|\alpha_k|,$$

a very satisfactory backward error bound. Combined with Theorem 7.1.1, this shows that the
bisection algorithm computes singular values of a bidiagonal matrix B with small relative errors.
The bisection algorithm is related to the famous quotient difference (qd) algorithm of
Rutishauser [946, 1954] for finding roots of polynomials or the poles of meromorphic functions;
see Henrici [603, 1958]. The differential qd (dqds) algorithm for computing singular values of
a bidiagonal matrix is due to Fernando and Parlett [403, 1994]. This algorithm evolved from
trying to find a faster square-root-free version of the Demmel–Kahan zero-shift bidiagonal QR
Algorithm 7.1.1. Recall that one step of the zero-shift Demmel–Kahan QR algorithm applied to
a bidiagonal matrix B with elements $q_i$, $e_{i+1}$ gives another bidiagonal matrix $\hat B$ with elements $\hat q_i$, $\hat e_{i+1}$ such that $BB^T = \hat B^T\hat B$. Equating the (k, k) and (k, k + 1) elements on both sides of this equation gives
$$q_k^2 + e_k^2 = \hat e_{k-1}^2 + \hat q_k^2, \qquad e_k q_{k+1} = \hat q_k \hat e_k.$$
These are similar to the rhombus rules of the qd algorithm and connect the four elements
$$\begin{array}{ccc} & \hat q_k^2 & \\ \hat e_{k-1}^2 & & e_k^2 \\ & q_{k+1}^2 & \end{array}.$$
To keep the high relative accuracy property, Fernando and Parlett had to use the so-called differential form of the progressive dqds algorithm. This version also allows a stable way to introduce explicit shifts in the algorithm. One step of dqds with shift $\tau \le \sigma_{\min}(B)$ computes a bidiagonal $\hat B$ such that
$$\hat B^T\hat B = BB^T - \tau^2 I. \qquad (7.2.6)$$
The choice of τ ensures that $\hat B$ exists. A nonrestoring orthogonal similarity transformation can be performed without forming $BB^T - \tau^2 I$, using a hyperbolic QR factorization (see Section 3.2.4). Alternatively, if
$$Q\begin{pmatrix} B^T \\ 0 \end{pmatrix} = \begin{pmatrix} \hat B \\ \tau I \end{pmatrix} \in \mathbb{R}^{2n\times n}$$
with Q orthogonal, then $BB^T = \hat B^T\hat B + \tau^2 I$ as required. In the first step, a plane rotation is constructed that affects only rows (1, n + 1) and makes the (n + 1, 1) element equal to τ. This is possible because $\tau \le \sigma_{\min}(B) \le q_1$, and it changes the first diagonal element to $t_1 = \sqrt{q_1^2 - \tau^2}$. Next, a rotation in rows (1, 2) is used to annihilate $e_2$, giving
$$\hat q_1 = \sqrt{q_1^2 - \tau^2 + e_2^2}$$
and changing $q_2$ to $\tilde q_2$. The first column and row now have their final form:
$$\begin{pmatrix} q_1 & & & \\ e_2 & q_2 & & \\ & \ddots & \ddots & \\ & & e_n & q_n \\ 0 & 0 & \cdots & 0 \end{pmatrix} \Rightarrow \begin{pmatrix} t_1 & & & \\ e_2 & q_2 & & \\ & \ddots & \ddots & \\ & & e_n & q_n \\ \tau & 0 & \cdots & 0 \end{pmatrix} \Rightarrow \begin{pmatrix} \hat q_1 & \hat e_2 & & \\ 0 & \tilde q_2 & & \\ & \ddots & \ddots & \\ & & e_n & q_n \\ \tau & 0 & \cdots & 0 \end{pmatrix}.$$
All remaining steps are similar. The kth step only acts on the last n−k +1 rows and columns and
will produce an element τ in position (n + k, k). One can show that this algorithm does not
introduce large relative errors in the singular values. By working instead with squared quantities,
square roots can be eliminated. More details are given in Fernando and Parlett [403, 1994] and
Parlett [883, 1995]. The dqds algorithm is available in LAPACK as the routine DLASQ and is
considered to be the fastest SVD algorithm when only singular values are required. The error
bounds for dqds are significantly smaller than those for the Demmel–Kahan QRSVD algorithm.
A further benefit is that it can be implemented in either parallel or pipelined format.
The multiple relatively robust representation (MRRR or MR3 ) algorithm by Dhillon [320,
1997] and Dhillon and Parlett [321, 2004] accurately computes the eigenvalue decomposition
of a symmetric tridiagonal matrix M ∈ Rn×n in only O(n2 ) operations. It overcomes some
difficulties with the dqds algorithm for computing the eigenvectors. Applying the MR3 algorithm
to compute the eigenvalue decompositions of B TB and BB T separately gives a fast algorithm
for computing the full SVD of a bidiagonal matrix B. Großer and Lang [542, 2003] show that
this may lead to poor results regarding the residual ∥BV − U Σ∥ and give a coupling strategy
that resolves this difficulty. The resulting algorithm is analyzed in Großer and Lang [543, 2005].
An implementation of the MR3 algorithm for the bidiagonal SVD is given by Willems, Lang,
and Vömel [1125, 2007]. Later developments of the bidiagonal MR3 algorithm are described in
Willems and Lang [1124, 2012].

7.2.2 Jacobi-Type Methods


Jacobi’s method [659, 1846] is one of the oldest methods for solving the eigenvalue problem
for a real symmetric (or Hermitian) matrix A of order n. Jacobi’s method solves the eigenvalue
problem by performing a sequence of similarity transformations

A0 = A, Ak+1 = JkT Ak Jk , k = 0, 1, 2, . . . , (7.2.7)



such that Ak , k = 1, 2, . . . , converges to a diagonal matrix. Here Jk = Gpq (θ) is chosen as a


rotation in the plane (p, q), p < q. The elements c = cos θ and s = sin θ are determined so that
 ′     
$$\begin{pmatrix} a'_{pp} & 0 \\ 0 & a'_{qq} \end{pmatrix} = \begin{pmatrix} c & -s \\ s & c \end{pmatrix}\begin{pmatrix} a_{pp} & a_{pq} \\ a_{pq} & a_{qq} \end{pmatrix}\begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \qquad (7.2.8)$$
i.e., the off-diagonal elements apq = aqp are reduced to zero.
There are special situations when Jacobi’s method is very efficient, e.g., when A is nearly
diagonal or when the eigenvalue problems for a sequence of matrices differ only slightly from one
another. After the QR algorithm was introduced, Jacobi’s method fell out of favor for a time. It
was revived when Demmel and Veselić [312, 1992] showed that with a proper stopping criterion,
Jacobi’s method computes the eigenvalues of symmetric positive definite matrices with uniformly
better relative accuracy than any algorithm that first reduces the matrix to tridiagonal form;
see also Dopico, Koev, and Molera [330, 2009]. Newer implementations of Jacobi’s method
for computing the SVD were then developed that could also compete in speed with the QR
algorithm.
Only the elements in rows and columns p and q of A will change. Since symmetry is pre-
served, only the upper triangular part of each A needs to be computed. The 2 by 2 symmetric
eigenvalue problem (7.2.8) is a key subproblem in Jacobi’s method. Equating the off-diagonal
elements gives
(app − aqq )cs + apq (c2 − s2 ) = 0. (7.2.9)
If apq ̸= 0, we obtain τ ≡ cot 2θ = (aqq − app )/(2apq ). From (7.2.9) and the trigonometric
formula tan 2θ = 2 tan θ/(1 − tan2 θ), it follows that t = tan θ is a root of the quadratic
equation t2 + 2τ t − 1 = 0. Choosing the root of smallest modulus
$$t = \tan\theta = \operatorname{sign}(\tau)\big/\bigl(|\tau| + \sqrt{1 + \tau^2}\bigr) \qquad (7.2.10)$$
ensures that −π/4 < θ ≤ π/4 and minimizes the difference ∥A′ − A∥F . Note that a′pp + a′qq =
trace (A). The eigenvalues are

a′pp = app − t apq , a′qq = aqq + t apq , (7.2.11)



and the eigenvectors are defined by $c = 1/\sqrt{1 + t^2}$ and $s = tc$. The computed transformation is
also applied to the remaining elements in rows and columns p and q of the full matrix A. With
r = s/(1 + c) = tan(θ/2) and j ̸= p, q, these are obtained from

a′jp = a′pj = capj − saqj = apj − s(aqj + rapj ),


a′jq = a′qj = sapj + caqj = aqj + s(apj − raqj ).

These formulas are chosen to reduce roundoff errors; see Rutishauser [950, 1971]. If symmetry
is exploited, then one Jacobi transformation takes about 8n flops. Note that an off-diagonal
element made zero at one step will in general become nonzero at some later stage. The Jacobi
method also destroys any band structure in A.
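A minimal MATLAB sketch of one Jacobi rotation, assuming a symmetric A and indices p < q with A(p,q) ≠ 0 are given, is:

tau = (A(q,q) - A(p,p))/(2*A(p,q));
if tau == 0
    t = 1;                                        % theta = pi/4 when a_pp = a_qq
else
    t = sign(tau)/(abs(tau) + sqrt(1 + tau^2));   % (7.2.10)
end
c = 1/sqrt(1 + t^2);  s = t*c;  r = s/(1 + c);
j = setdiff(1:size(A,1),[p q]);
ajp = A(j,p) - s*(A(j,q) + r*A(j,p));             % updates in rows/columns p and q
ajq = A(j,q) + s*(A(j,p) - r*A(j,q));
app = A(p,p) - t*A(p,q);  aqq = A(q,q) + t*A(p,q);    % (7.2.11)
A(j,p) = ajp;  A(p,j) = ajp';
A(j,q) = ajq;  A(q,j) = ajq';
A(p,p) = app;  A(q,q) = aqq;
A(p,q) = 0;    A(q,p) = 0;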
The convergence of the Jacobi method depends on the fact that in each step the Frobenius
norm of the off-diagonal elements
$$S(A) = \sum_{i\ne j} a_{ij}^2 = \|A - D\|_F^2 \qquad (7.2.12)$$

is reduced. To see this, note that because the Frobenius norm is orthogonally invariant and
a′pq = 0, it holds that
$$S(A') = S(A) - 2a_{pq}^2.$$

For simplicity of notation we set in the following A = Ak and A′ = Ak+1 . There are
various strategies for choosing the order in which the off-diagonal elements are annihilated. In
the classical Jacobi method the off-diagonal element of largest magnitude is annihilated—the
optimal choice. Then 2a2pq ≥ S(Ak )/N , N = n(n − 1)/2, and

S(Ak+1 ) ≤ (1 − 1/N )S(Ak ).

This shows that for the classical Jacobi method, Ak+1 converges at least linearly with rate
1 − 1/N to a diagonal matrix. It can be shown that ultimately the rate of convergence is qua-
dratic, i.e., for k large enough, S(Ak+1 ) < cS(Ak )2 for some constant c. The iterations are
repeated until S(Ak ) < δ∥A∥F , where δ is a tolerance that can be chosen equal to the unit
roundoff u. Then it follows from the Bauer–Fike theorem that the diagonals of Ak approximate
the eigenvalues of A with an error less than δ∥A∥F .
In the classical Jacobi method, a large amount of effort is spent on searching for the largest
off-diagonal element. Even though it is possible to reduce this time by taking advantage of the
fact that only two rows and columns are changed at each step, the classical Jacobi method is
almost never used. Instead a cyclic Jacobi method is used, where the N = 12 n(n − 1) off-
diagonal elements are annihilated in some predetermined order. Each element is rotated exactly
once in any sequence of N rotations, called a sweep. Convergence of any cyclic Jacobi method
can be guaranteed if any rotation (p, q) is omitted for which

$$|a_{pq}| < \mathrm{tol}\,(a_{pp}a_{qq})^{1/2}$$
for some threshold tol; see Forsythe and Henrici [423, 1960]. To ensure a good rate of convergence, tol should be successively decreased after each sweep. For sequential computers, the most
popular cyclic ordering is rowwise, i.e., the rotations are performed in the order

(1, 2), (1, 3), . . . , (1, n),
(2, 3), . . . , (2, n),
. . .
(n − 1, n).          (7.2.13)

Jacobi’s method is very suitable for parallel computation because rotations (pi , qi ) and (pj , qj )
can be performed simultaneously when pi , qi are distinct from pj , qj . If n is even, n/2 trans-
formations can be performed simultaneously, and a sweep needs at least n − 1 such parallel
steps. Several parallel schemes that use this minimum number of steps have been constructed;
see Eberlein and Park [356, 1990]. A possible choice is the round-robin ordering, illustrated here
for n = 8:
(1, 2), (3, 4), (5, 6), (7, 8),
(1, 4), (2, 6), (3, 8), (5, 7),
(1, 6), (4, 8), (2, 7), (3, 5),
(p, q) = (1, 8), (6, 7), (4, 5), (2, 3),
(1, 7), (8, 5), (6, 3), (4, 2),
(1, 5), (7, 3), (8, 2), (6, 4),
(1, 3), (5, 2), (7, 4), (8, 6).

The rotations associated with each such row can be computed simultaneously.
Convergence of any cyclic Jacobi method can be guaranteed if rotations are omitted when
the off-diagonal element is smaller in magnitude than some threshold. To ensure a good rate
of convergence, the threshold should be successively decreased after each sweep. It has been
shown that the rate of convergence is ultimately quadratic, so that for k large enough, we have
S(Ak+1 ) < cS(Ak )2 for some constant c. The iterations are repeated until S(Ak ) < δ∥A∥F ,
where δ is a tolerance, which can be chosen equal to the unit roundoff u. The Bauer–Fike
theorem (see Golub and Van Loan [512, 1996, Theorem 7.2.2]) shows that the diagonal elements
of Ak then approximate the eigenvalues of A with an error less than δ∥A∥F . About 4n3 flops
are required for one sweep. In practice, the cyclic Jacobi method needs no more than about 3–5
sweeps to obtain eigenvalues of more than single precision accuracy, even when n is large. The
number of sweeps grows approximately as O(log n). About 10n3 flops are needed to compute
all the eigenvalues of A. This is about 3–5 times more than required for the QR algorithm.
An orthogonal system X = limk→∞ Xk , of eigenvectors of A is obtained by accumulating
the product of all Jacobi transformations Jk :

X0 = I, Xk = Xk−1 Jk , k = 1, 2, . . . . (7.2.14)

For each rotation Jk the associated columns p and q of Xk−1 are modified, which requires 8n
flops. Hence, computing the eigenvectors doubles the operation count.
Hestenes [606, 1958] gave a one-sided Jacobi-type method for computing the SVD. It uses
a sequence of plane rotations from the right to find an orthogonal matrix V such that AV = U Σ
has orthogonal columns. From this the SVD A = U ΣV T is readily obtained. Hestenes’s method
is mathematically equivalent to applying Jacobi’s method to ATA. In a basic step of the method,
two columns in A are rotated,
 
$$(\hat a_p \ \ \hat a_q) = (a_p \ \ a_q)\begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \qquad p < q. \qquad (7.2.15)$$

The rotation parameters c, s are determined so that the rotated columns are orthogonal or, equiv-
alently, so that
$$\begin{pmatrix} c & s \\ -s & c \end{pmatrix}^T\begin{pmatrix} \|a_p\|_2^2 & a_p^Ta_q \\ a_q^Ta_p & \|a_q\|_2^2 \end{pmatrix}\begin{pmatrix} c & s \\ -s & c \end{pmatrix}$$

is diagonal. This is a 2 × 2 symmetric eigenproblem and can be solved as in Jacobi’s method.


However, because parts of the matrix are squared, this approach can lead to numerical error.
Instead, first the m × 2 QR factorization
 
$$(a_p \ \ a_q) = (q_1 \ \ q_2)\begin{pmatrix} r_{pp} & r_{pq} \\ 0 & r_{qq} \end{pmatrix}$$

is computed, and then the 2 × 2 SVD


     
$$\begin{pmatrix} c_l & s_l \\ -s_l & c_l \end{pmatrix}\begin{pmatrix} r_{pp} & r_{pq} \\ 0 & r_{qq} \end{pmatrix}\begin{pmatrix} c_r & -s_r \\ s_r & c_r \end{pmatrix} = \begin{pmatrix} \sigma_p & 0 \\ 0 & \sigma_q \end{pmatrix}. \qquad (7.2.16)$$

The singular values of R are

$$\sigma_p = \tfrac{1}{2}\Bigl(\sqrt{(r_{pp}+r_{qq})^2 + r_{pq}^2} + \sqrt{(r_{pp}-r_{qq})^2 + r_{pq}^2}\Bigr), \qquad \sigma_q = |r_{pp}r_{qq}|/\sigma_p. \qquad (7.2.17)$$
The right singular vector $(-s_r, c_r)$ in (7.2.16) is parallel to $(r_{pp}^2 - \sigma_p^2,\ r_{pp}r_{pq})$. The left singular vectors are determined by $(c_l, s_l) = (r_{pp}c_r - r_{pq}s_r,\ r_{qq}s_r)/\sigma_p$. These expressions suffer from
possible over- or underflow in the squared subexpressions but can be reorganized to provide
results with nearly full machine precision; see the MATLAB code below.

Algorithm 7.2.2 (SVD of 2 by 2 Upper Triangular Matrix).


function [cu,su,cv,sv,sig1,sig2] = svd22(r11,r12,r22)
% SVD22 computes the SVD of an upper triangular
% 2 by 2 matrix with abs(r11) >= abs(r22).
% ---------------------------------------------------
q = (abs(r11) - abs(r22))/abs(r11);
m = r12/r11; t = 2 - q;
s = sqrt(t*t + m*m); r = sqrt(q*q + m*m);
a = (s + r)/2;
sig1 = abs(r11)*a; sig2 = abs(r22)/a;
t = (1 + a)*(m/(s + t) + m/(r + q));
q = sqrt(t*t + 4);
cv = 2/q; sv = -t/q;
cu = (cv - sv*m)/a;
su = sv*(r22/r11)/a;
A Fortran program based on the same formula that guards against overflow and underflow
and always gives high relative accuracy in both singular values and vectors is given by Demmel
and Kahan [310, 1990]. An error analysis is sketched in the appendix of Bai and Demmel [59,
1993]. There also is a special subroutine SLAS2 in the BLAS for the accurate computation of
the singular values of the 2 × 2 bidiagonal matrix.
By construction, the SVD A = U ΣV T produced by Hestenes’s method gives U orthogonal
to working accuracy. A loss of orthogonality may occur in V , and the columns of V should
therefore be reorthogonalized at the end. Convergence of Hestenes’s method is related to the fact
that each step reduces the sum of squares of the off-diagonal elements
X
S(C) = c2ij , C = ATA.
i̸=j

The strategies for choosing the order in which the off-diagonal elements are annihilated are
similar to those for Jacobi’s method. A sequence of N = n(n − 1)/2 rotations in which each
column is rotated exactly once is called a sweep. No more than about five sweeps are needed to
obtain singular values of more than single precision accuracy, even when n is large.
To apply Hestenes’s method to a real m × n matrix when m > n, an initial QR factorization
with column pivoting and row sorting of A is first performed, and the algorithm is applied to R ∈
Rn×n . This tends to speed up convergence and simplify the transformations and is recommended
also when A is square. Hence, without restriction we can assume in the following that m = n.
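A minimal MATLAB sketch of one sweep of Hestenes's method, assuming a square matrix A is given, is shown below; repeated sweeps make the columns of A mutually orthogonal so that AV = UΣ:

n = size(A,2);  V = eye(n);
for p = 1:n-1
    for q = p+1:n
        alpha = A(:,p)'*A(:,p);  beta = A(:,q)'*A(:,q);  gam = A(:,p)'*A(:,q);
        if gam ~= 0
            zeta = (beta - alpha)/(2*gam);
            if zeta == 0
                t = 1;
            else
                t = sign(zeta)/(abs(zeta) + sqrt(1 + zeta^2));
            end
            c = 1/sqrt(1 + t^2);  s = c*t;
            G = [c s; -s c];
            A(:,[p q]) = A(:,[p q])*G;        % rotate columns p and q, cf. (7.2.15)
            V(:,[p q]) = V(:,[p q])*G;        % accumulate the right singular vectors
        end
    end
end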
Initial implementations of Jacobi’s method were slower than the QR algorithm but were able
to compute singular values of a general matrix more accurately. With further improvements by
Drmač [333, 1997] and Drmač and Veselić [336, 2008], [337, 2008], Jacobi’s method becomes
competitive also in terms of speed.
In the method of Kogbetliantz (see Kogbetliantz [702, 1955]) for computing the SVD of a
square matrix A, the off-diagonal elements of A are successively reduced in size by a sequence
of two-sided plane rotations
A′ = JpqT
(ϕ)AJpq (ψ), (7.2.18)
where Jpq (ϕ) and Jpq (ψ) are determined so that a′pq = a′qp = 0. Note that only rows and
columns p and q in A are affected by the transformation. The rotations Jpq (ϕ) and Jpq (ψ) are
determined by computing the SVD of a 2 × 2 submatrix
 
$$A_{pq} = \begin{pmatrix} a_{pp} & a_{pq} \\ a_{qp} & a_{qq} \end{pmatrix}, \qquad a_{pp} \ge 0, \quad a_{qq} \ge 0.$$

The assumption of nonnegative diagonal elements is no restriction because the sign of these
elements can be changed by premultiplication with an orthogonal matrix diag (±1, ±1). From
the invariance of the Frobenius norm under orthogonal transformations it follows that
S(A′ ) = S(A) − (a2pq + a2qp ), S(A) = ∥A − D∥2F .
This is the basis for a proof that the matrices generated by Kogbetliantz’s method converge to a
diagonal matrix containing the singular values of A. Orthogonal sets of left and right singular
vectors can be obtained by accumulating the product of all the transformations. Convergence is
analyzed in Paige and Van Dooren [869, 1986] and Fernando [401, 1989].
Kogbetliantz’s method should not be applied directly to A but to the triangular matrix R
obtained by an initial pivoted QR factorization. It can be shown that one sweep of the row
cyclic algorithm (7.2.13) applied to an upper triangular matrix generates a lower triangular matrix
and vice versa. The annihilation of the elements in the first row for n = 4 by plane rotations
(1, 2), (1, 3), (1, 4) from the left is pictured below:
       
$$\begin{pmatrix} x & a_0 & b_0 & c_0 \\ 0 & x & d_0 & e_0 \\ 0 & 0 & x & f_0 \\ 0 & 0 & 0 & x \end{pmatrix} \Rightarrow \begin{pmatrix} x & 0 & b_1 & c_1 \\ 0 & x & d_1 & e_1 \\ 0 & 0 & x & f_0 \\ 0 & 0 & 0 & x \end{pmatrix} \Rightarrow \begin{pmatrix} x & 0 & 0 & c_2 \\ g_0 & x & d_2 & e_1 \\ 0 & 0 & x & f_1 \\ 0 & 0 & 0 & x \end{pmatrix} \Rightarrow \begin{pmatrix} x & 0 & 0 & 0 \\ g_1 & x & d_2 & e_2 \\ h_0 & 0 & x & f_2 \\ 0 & 0 & 0 & x \end{pmatrix}.$$
The switching between upper and lower triangular format can be avoided by a simple permutation
scheme; see Fernando [401, 1989]. This makes it possible to reorganize the algorithm so that at
each stage of the recursion one only needs to store and process a triangular matrix. The resulting
algorithm is highly suitable for parallel computing. The reorganization of the row cyclic scheme
is achieved by the following algorithm (see also Luk [762, 1986] and Charlier, Vanbegin, and
Van Dooren [236, 1988]):
for i = 1 : n − 1
for ik = 1 : n − i
A = Pik Jik ,ik +1 (ϕk )AJiTk ,ik +1 (ψk )PiTk
end
end
where Pik denotes a permutation matrix that interchanges rows ik and ik + 1. The permutations
will shuffle the rows and columns of Ak so that each index pair (ik , jk ) in the row cyclic scheme
becomes an adjacent pair of type (ik , ik +1) when it is its turn to be processed. The permutations
involved are performed simultaneously with the rotations at no extra cost. In this scheme, only
rotations on adjacent rows and columns occur.
Below we picture the annihilation of the elements in the first row for n = 4 for the reorga-
nized scheme. After elimination of a0 , the first and second rows and columns are interchanged.
Element b1 is now in the first superdiagonal and can be annihilated. Again, by interchanging the
third and fourth rows and columns, c2 is brought to the superdiagonal and can be eliminated. The
resulting matrix is still upper triangular:
\begin{pmatrix} \times & a_0 & b_0 & c_0\\ 0 & \times & d_0 & e_0\\ 0 & 0 & \times & f_0\\ 0 & 0 & 0 & \times \end{pmatrix} \Rightarrow
\begin{pmatrix} \times & 0 & d_1 & e_1\\ 0 & \times & b_1 & c_1\\ 0 & 0 & \times & f_0\\ 0 & 0 & 0 & \times \end{pmatrix} \Rightarrow
\begin{pmatrix} \times & d_2 & g_0 & e_1\\ 0 & \times & 0 & f_1\\ 0 & 0 & \times & c_2\\ 0 & 0 & 0 & \times \end{pmatrix} \Rightarrow
\begin{pmatrix} \times & d_2 & e_2 & g_1\\ 0 & \times & f_2 & h_0\\ 0 & 0 & \times & 0\\ 0 & 0 & 0 & \times \end{pmatrix}.
Because of its simplicity, Kogbetliantz’s algorithm has been adapted to computation of the
generalized singular value decomposition. Further developments of the Kogbetliantz SVD algo-
rithm are given by Hari and Veselić [591, 1987]. Bujanović and Drmač [184, 2012] study the
convergence and practical applications of the block version of Kogbetliantz’s method.
7.2.3 Divide-and-Conquer Methods
The first divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem is due to
Cuppen [280, 1981]. The basic idea is to split the tridiagonal matrix
T = \begin{pmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \beta_3 & \ddots & \ddots & \\ & & \ddots & \alpha_{n-1} & \beta_n \\ & & & \beta_n & \alpha_n \end{pmatrix} \in \mathbb{R}^{n\times n}
into two smaller symmetric tridiagonal matrices T1 and T2 by a symmetric rank-one modification
chosen to annihilate the elements βk in positions (k, k + 1) and (k + 1, k). This is achieved by
setting
T = \begin{pmatrix} T_1 & \beta_k e_k e_1^T \\ \beta_k e_1 e_k^T & T_2 \end{pmatrix}
  = \begin{pmatrix} \hat T_1 & 0 \\ 0 & \hat T_2 \end{pmatrix} + \beta_k \begin{pmatrix} e_k \\ e_1 \end{pmatrix} ( e_k^T \;\; e_1^T ),    (7.2.19)
where 1 ≤ k ≤ n. The kth diagonal element of T1 and the first diagonal element of T2 are
modified to give T̂1 and T̂2 . If the eigenvalue decompositions of T̂1 and T̂2 are known, the
eigenvalue decomposition of T could be found by finding the eigensystem of a diagonal matrix
modified by a symmetric rank-one matrix. In the divide-and-conquer algorithm, this idea is
recursively applied to T1 and T2 until the subproblem sizes are sufficiently small. This requires
at most log2 n steps and gives a fully parallel algorithm.
After modifications given by Dongarra and Sorensen [324, 1987] and Gu and Eisenstat [547,
1995], the divide-and-conquer algorithm is competitive in terms of speed and accuracy with the
QR algorithm. A divide-and-conquer algorithm for the bidiagonal SVD was given by Jessup
and Sorensen [667, 1994]. In the following we describe an improved variant by Gu and Eisen-
stat [546, 1995].
The SVD of a square upper bidiagonal matrix B ∈ Rn×n can be divided into two independent
subproblems as follows:
B = \begin{pmatrix} q_1 & r_1 & & & \\ & q_2 & r_2 & & \\ & & \ddots & \ddots & \\ & & & q_{n-1} & r_{n-1} \\ & & & & q_n \end{pmatrix}
  = \begin{pmatrix} B_1 & 0 \\ q_k e_k^T & r_k e_1^T \\ 0 & B_2 \end{pmatrix},    (7.2.20)
where B1 ∈ R(k−1)×k and B2 ∈ R(n−k)×(n−k) . Substituting the SVD
B1 = U1 ( D1 0 ) V1T , B2 = U2 D2 V2T
into (7.2.20) gives
B = \begin{pmatrix} U_1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & U_2 \end{pmatrix}
    \begin{pmatrix} D_1 & 0 & 0 \\ q_k l_1^T & q_k\lambda_1 & r_k f_2^T \\ 0 & 0 & D_2 \end{pmatrix}
    \begin{pmatrix} V_1 & 0 \\ 0 & V_2 \end{pmatrix}^T \equiv U C V^T,    (7.2.21)
where ( l1T λ1 ) = eTk V1 is the last row of V1 , and f2T = eT1 V2 is the first row of V2 . If a
permutation matrix P_k interchanges row k and the first block row, then
P_k C P_k^T = M = \begin{pmatrix} q_k\lambda_1 & q_k l_1^T & r_k f_2^T \\ 0 & D_1 & 0 \\ 0 & 0 & D_2 \end{pmatrix}.    (7.2.22)
Let M = XΣY T be the SVD of M . Then the SVD of B is
B = Q\Sigma W^T, \qquad Q = U P_k^T X, \quad W = V P_k^T Y.    (7.2.23)
The matrix in (7.2.22) has the form
M = \begin{pmatrix} z_1 & z_2 & \cdots & z_n \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{pmatrix} = D + e_1 z^T,    (7.2.24)
where D = diag (d1 , d2 , . . . , dn ) contains the elements of D1 and D2 , and d1 = 0 is introduced
to simplify the notation. We further assume that 0 = d1 ≤ d2 ≤ d3 ≤ · · · ≤ dn , which can be
achieved by a reordering of rows and columns.
To compute the SVD M = D + e1 z T = XΣY T , we use the fact that the square of the
singular values Σ2 are the eigenvalues and the right singular vectors Y are the eigenvectors of
M T M = XΣ2 X T = D2 + zeT1 e1 z T = D2 + zz T .
If yi is a right singular vector, then M yi is a vector in the direction of the corresponding left
singular vector.
We note that if zi = 0, or di = di+1 for some i ∈ [2, n − 1], then di is a singular value of M ,
and the degree of the characteristic equation of M T M may be reduced by one. We can therefore
assume that |zi | ̸= 0, i = 1 : n, and that di ̸= di+1 , i = 1 : n − 1. In practice, the assumptions
above must be replaced by
dj+1 − dj ≥ τ ∥M ∥2 , |zj | ≥ τ ∥M ∥2 ,
where τ depends on the unit roundoff.
The above facts give the following characterization of the singular values and vectors of M ;
see Jessup and Sorensen [667, 1994].

Lemma 7.2.3. Let the SVD of the matrix in (7.2.24) be M = XΣY T , with
X = (x1 , . . . , xn ), Σ = diag (σ1 , . . . , σn ), Y = (y1 , . . . , yn ).
Then the singular values have the interlacing property
0 = d1 < σ1 < d2 < σ2 < · · · < dn < σn < dn + ∥z∥2 ,
where z = (z1 , . . . , zn )T , and they are roots of the characteristic equation
f(\sigma) = 1 + \sum_{k=1}^{n} \frac{z_k^2}{d_k^2 - \sigma^2} = 0.
The characteristic equation can be solved efficiently and accurately by the algorithm of
Li [743, 1994]. The singular values of M are always well-conditioned. The singular vectors
are xi = x̃i /∥x̃i ∥2 , yi = ỹi /∥ỹi ∥2 , i = 1 : n, where
   
z1 zn d2 z2 dn zn
ỹi = ,..., 2 , x̃i = −1, 2 ,..., 2 ,
d21 − σi2 dn − σi2 d2 − σi2 dn − σi2
and
n n
X zj2 X (dj zj )2
∥ỹi ∥22 = , ∥x̃i ∥22 =1+ .
j=1
(dj − σi2 )2
2
j=2
(d2j − σi2 )2
The singular vectors can be extremely sensitive to the presence of close singular values. To
get accurately orthogonal singular vectors without resorting to extended precision is a difficult
problem; see Gu and Eisenstat [546, 1995].
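Lemma 7.2.3 is easy to check numerically. The sketch below (variable names ours) builds M = D + e_1 z^T with d_1 = 0, computes its singular values with numpy, and verifies the interlacing property and that the σ_i are roots of the secular function f.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 6
    d = np.sort(rng.uniform(0.5, 3.0, size=n))
    d[0] = 0.0                                   # 0 = d_1 <= d_2 <= ... <= d_n
    z = rng.uniform(0.5, 1.5, size=n)            # all z_i nonzero

    M = np.diag(d) + np.outer(np.eye(n)[0], z)   # M = D + e_1 z^T
    sigma = np.sort(np.linalg.svd(M, compute_uv=False))

    def f(s):
        # Secular function f(sigma) = 1 + sum_k z_k^2 / (d_k^2 - sigma^2).
        return 1.0 + np.sum(z**2 / (d**2 - s**2))

    print(np.all(d < sigma) and np.all(sigma[:-1] < d[1:]))   # interlacing d_i < sigma_i < d_{i+1}
    print(max(abs(f(s)) for s in sigma))                      # the sigma_i are (near) roots of f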
Notes and references
Van Zee, van de Geijn, and Quintana-Ortí [1084, 2014] describe ways to restructure the plane
rotations in the QRSVD algorithm to achieve better parallel performance. Dongarra et al. [323,
2018] survey the history of SVD algorithms. In particular, they review reformulations designed
to take advantage of new computer architectures, such as cache-based memory hierarchies and
distributed computing.

7.2.4 Modifying the SVD


Suppose the SVD of a matrix A ∈ Rm×n , m ≥ n,
A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T, \qquad \Sigma = \mathrm{diag}(\sigma_1,\ldots,\sigma_n),    (7.2.25)

with orthogonal U ∈ Rm×m and V ∈ Rn×n is known. In many applications it is desirable to


update the SVD when A is modified by a matrix of low rank to incorporate new data. Exam-
ples are subspace tracking in signal processing and latent semantic indexing; see Moonen, Van
Dooren, and Vandewalle [803, 1992] and Zha and Simon [1146, 1999].
The goal is to take advantage of knowledge of the SVD of A to reduce the work required.
However, many of the proposed updating schemes for the SVD can be as expensive as computing
the SVD from scratch. The costliest part of the updating is the rotation and reorthogonalization
of U and V , and most updating algorithms for the SVD require O(mn2 ) flops. Although this is
the same order of complexity as for recomputing the SVD from scratch, there can still be a gain
if the order constant is less.
In the SVD, rows and columns are treated the same. Hence, appending or deleting a column
in A can be treated by appending or deleting a row in AT . This simplifies the updating problem
in that only modifications of rows need be considered. (Of course, in the least squares updating,
there is a lack of symmetry.) In particular, if the SVD of A is to be used for solving the least
squares problem minx ∥Ax − b∥2 via
x = V Σ† c, c = U T b,
then we would like to update U , V , Σ, and c in order to update x.
Given the SVD A = U ΣV T , consider the problem of computing the SVD
\tilde A = \begin{pmatrix} A \\ w^T \end{pmatrix}, \qquad w \in \mathbb{R}^n,    (7.2.26)

when a row wT is appended to A. From the relationship between the SVD of A and the symmet-
ric eigenvalue problem for ATA we have
ÃT Ã = ATA + wwT = V Σ2 V T + wwT = V (Σ2 + ρ2 zz T )V T = Ṽ Σ̃2 Ṽ T ,
where z = (ζ1 , . . . , ζn ) = V T w/ρ and ρ = ∥w∥2 . Hence Σ̃2 and Ṽ are the solution to a
symmetric eigenvalue problem modified by a perturbation of rank one. Such problems can be
solved by using the observation (see Golub [490, 1973]) that the eigenvalues λ1 ≥ λ2 ≥ · · · ≥
λn of
C = D + ρ2 zz T , D = diag (d1 , d2 , . . . , dn ), ∥z∥2 = 1, (7.2.27)
where d1 ≥ d2 ≥ · · · ≥ dn , are the values of λ for which
g(\lambda) = 1 + \rho^2 \sum_{j=1}^{n} \frac{\zeta_j^2}{d_j - \lambda} = 0.    (7.2.28)
Good initial approximations to the roots can be obtained from the interlacing property (see The-
orem 1.3.5)
λ1 ≥ d1 ≥ λ2 ≥ · · · ≥ dn−1 ≥ λn ≥ dn .
To solve equation (7.2.28) a method based on rational approximation safeguarded with bisection
is used. The subtle details in a stable implementation of such an algorithm are treated by Li [743,
1994].
When the modified eigenvalues d˜i = σ̃i2 have been calculated, the corresponding eigenvec-
tors are found by solving
(D_i + \rho^2 zz^T)x_i = 0, \qquad D_i = D - \tilde d_i I.
Provided Di is nonsingular (this can be ensured by an initial deflation), we have xi = Di−1 z/
∥Di−1 z∥2 . (Note that forming Di−1 z explicitly should be avoided in practice; see Bunch and
Nielsen [188, 1978].) The updated right singular vectors are Ṽ = V X, where X = (x1 , . . . , xn ).
If A (or Ã) is still available, the updated left singular vectors Ũ can be computed from Ũ =
ÃṼ Σ̃−1 .
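A numerical sketch of this updating step is given below. For brevity it uses numpy's symmetric eigensolver on Σ^2 + ρ^2 zz^T in place of a dedicated secular-equation solver, and all identifiers are ours; it then recovers Ũ from Ã as described above and compares with a direct SVD of Ã.

    import numpy as np

    rng = np.random.default_rng(2)
    m, n = 8, 4
    A = rng.standard_normal((m, n))
    U, s, Vt = np.linalg.svd(A, full_matrices=False)      # known SVD: A = U diag(s) Vt
    w = rng.standard_normal(n)                            # row to be appended

    rho = np.linalg.norm(w)
    z = Vt @ w / rho                                      # z = V^T w / rho
    lam, X = np.linalg.eigh(np.diag(s**2) + rho**2 * np.outer(z, z))
    order = np.argsort(lam)[::-1]
    s_new = np.sqrt(lam[order])                           # updated singular values
    V_new = Vt.T @ X[:, order]                            # updated right singular vectors
    A_new = np.vstack([A, w])
    U_new = A_new @ V_new / s_new                         # updated left singular vectors

    print(np.allclose(np.sort(s_new),
                      np.sort(np.linalg.svd(A_new, compute_uv=False))))
    print(np.allclose((U_new * s_new) @ V_new.T, A_new))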
An alternative approach for appending a row is given by Businger [192, 1970]. We have
\begin{pmatrix} U^T & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} A \\ w^T \end{pmatrix} V
   = \Pi_{n+1,m+1} \begin{pmatrix} L \\ 0 \end{pmatrix}, \qquad L = \begin{pmatrix} \Sigma \\ w^T V \end{pmatrix},
where Πn+1,m+1 denotes a permutation matrix interchanging rows n + 1 and m + 1, and L is
a special lower triangular matrix. Businger’s updating algorithm consists of two major phases.
The first phase is a finite process that transforms L ∈ R(n+1)×n into upper bidiagonal form using
plane rotations from left and right:
G_1 L G_2 = \begin{pmatrix} \tilde B \\ 0 \end{pmatrix}, \qquad \tilde B \in \mathbb{R}^{n\times n}.
The second phase is an implicit QR diagonalization of B̃ (see Section 7.1.1) that reduces B̃ to
diagonal form Σ̃.
In the kth step of phase 1, the kth element of wT V (k = 1 : n − 1) is eliminated using plane
rotations and a chasing scheme on rows and columns. This is pictured below for n = 5 and
k = 3:
[Wilkinson diagrams omitted: for n = 5 and k = 3, the rotation that annihilates the marked element ⊕ of the bottom row creates a fill-in element +, which is chased out by alternating row and column rotations until the leading part is again bidiagonal and the bottom row has one more zero.]
Phase 1 uses n(n − 1)/2 row and column rotations. Most of the work is used to apply these
rotations to U and V . This requires 2n2 (m + n) flops if standard plane rotations are used. For
updating least squares solutions we only need to update V , Σ, and c = U T b. The dominating
term is then reduced to 2n3 flops. Zha [1144, 1992] shows that the work can be halved by using
a two-way chasing scheme in the reduction to bidiagonal form. Phase 2 typically requires about
3n3 flops. Note that Σ and V can be updated without U being available. From the interlacing
property (Theorem 1.3.5) it follows that the smallest singular value will increase. Hence the rank
cannot decrease.
When the SVD is to be modified by deleting a row, with no loss of generality we can assume
that the first row of A is to be deleted. Then we wish to determine the SVD of à ∈ R(m−1)×n
when the SVD
A = \begin{pmatrix} z^T \\ \tilde A \end{pmatrix} = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T    (7.2.29)
is known. This problem can be reduced to a modified eigenvalue problem of the form

C = D − ρ2 zz T , D = diag (d1 , d2 , . . . , dn ), ∥z∥2 = 1. (7.2.30)

The interlacing property now gives d1 ≥ d˜1 ≥ d2 ≥ · · · ≥ d˜n−1 ≥ dn ≥ d˜n ≥ 0. Hence the
Bunch–Nielsen scheme is readily adapted to solving this problem.
Park and Van Huffel [882, 1995] give a backward stable algorithm based on finding the SVD
of (e1 , A), where e1 is an added dummy column. Then
U^T (e_1, A) \begin{pmatrix} 1 & 0 \\ 0 & V \end{pmatrix} = \begin{pmatrix} u_1 & \Sigma \\ u_2 & 0 \end{pmatrix},

where (uT1 , uT2 ) is the first row of U . First, determine left and right plane rotations G1 and G2 so
that
G_1 \begin{pmatrix} u_1 & \Sigma \\ u_2 & 0 \end{pmatrix} G_2 = \begin{pmatrix} 1 & w^T \\ 0 & \tilde B \\ 0 & 0 \end{pmatrix},    (7.2.31)

with B̃ upper bidiagonal. This can be achieved by a chasing scheme similar to that used when
adding a row. The desired bidiagonal form is built from bottom to top, while nonzeros are chased
into the lower-right corner. The reduction is pictured below for k = 3, n = 4:

[Wilkinson diagrams omitted: for k = 3, n = 4, the bidiagonal form is built from bottom to top; each rotation annihilates the element marked ⊕, and the fill-in elements, marked +, are chased into the lower-right corner.]
A total of (n − 1)^2 + 1 plane rotations are needed to make the first column of G_1 U^T equal to
e_1 . From orthogonality it follows that this matrix must have the form
G_1 U^T = \begin{pmatrix} 1 & 0 \\ 0 & \bar U^T \end{pmatrix},
with Ū orthogonal. Since no rotation from the right involves the first column, the transformed
matrix has the form
\begin{pmatrix} 1 & 0 \\ 0 & V \end{pmatrix} G_2 = \begin{pmatrix} \alpha & 0 \\ 0 & \bar V \end{pmatrix}.
It now follows that
\begin{pmatrix} 1 & 0 \\ 0 & \bar U^T \end{pmatrix}
\begin{pmatrix} 1 & z^T \\ 0 & \tilde A \end{pmatrix}
\begin{pmatrix} \alpha & 0 \\ 0 & \bar V \end{pmatrix}
 = \begin{pmatrix} \alpha & w^T \\ 0 & \tilde B \\ 0 & 0 \end{pmatrix},
 

which gives \bar U^T \tilde A \bar V = \begin{pmatrix} \tilde B \\ 0 \end{pmatrix}. In the second phase, the implicit QRSVD is used to reduce B̃ to
diagonal form Σ̃. Simultaneously Ū and V̄ are updated.

Notes and references


Bunch and Nielsen [188, 1978] develop updating methods related to updating symmetric eigen-
value decompositions. The technique used by Businger to update the SVD is related to that for
updating the QR decomposition; see Barlow, Zha, and Yoon [80, 1993]. An approximate updat-
ing algorithm for the SVD is developed by Moonen, Van Dooren, and Vandewalle [803, 1992].
In the first step, a row is appended or deleted, and the resulting matrix is reduced to triangular
form. Jacobi-type sweeps are then applied to restore approximate diagonal form. Gu and Eisen-
stat [545, 1993], [548, 1995] reduce the updating and downdating of the SVD to the problem of
computing the SVD of a matrix of simple structure that can be solved by computing the roots of
a secular equation. Brand [176, 2006] develops an efficient scheme for low-rank modification of
the thin SVD of streaming data.

7.3 Computing Selected Singular Triplets


7.3.1 Shifted Inverse Iteration
In many applications, e.g., low-rank approximations, information retrieval, and seismic tomography,
a few singular values σi and the corresponding left and right singular vectors ui ∈ Cm and
vi ∈ Cn of a large matrix A are required. A related problem is to compute selected
eigenpairs λi = σi2 of the Hermitian matrices

AHAvi = λvi and AAH ui = λui . (7.3.1)

In general it is not necessary, or even advisable, to form AHA or AAH . The squaring of the
singular values is a drawback, as it will force the clustering of small singular values. Instead, one
may consider the equivalent Hermitian eigenvalue problem
\begin{pmatrix} 0 & A \\ A^H & 0 \end{pmatrix} \begin{pmatrix} u \\ \pm v \end{pmatrix} = \pm\sigma \begin{pmatrix} u \\ \pm v \end{pmatrix}.    (7.3.2)

This yields the singular values and both the left and right singular vectors. If r = rank(A),
then the matrix in (7.3.2) has 2r nonzero eigenvalues, ±σ1 (A), . . . , ±σr (A). Here the small singular values of A
correspond to interior eigenvalues of the Hermitian matrix.
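The correspondence between the eigenvalues of the matrix in (7.3.2) and ±σ_i(A) is easily verified for a small example (an illustrative numpy check; names are ours):

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 5, 3
    A = rng.standard_normal((m, n))                # full rank r = n with probability one

    C = np.block([[np.zeros((m, m)), A],
                  [A.T, np.zeros((n, n))]])        # augmented Hermitian matrix of order m + n
    eig = np.sort(np.linalg.eigvalsh(C))
    sv = np.sort(np.linalg.svd(A, compute_uv=False))

    print(np.allclose(eig[-n:], sv))               # n largest eigenvalues are +sigma_i
    print(np.allclose(eig[:n], -sv[::-1]))         # n smallest are -sigma_i
    print(np.allclose(eig[n:m], 0.0))              # the remaining m - n eigenvalues are zero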
Let A ∈ Cn×n be a Hermitian matrix with eigenpairs λi , xi , i = 1, . . . , n. Given a unit initial
vector z (0) , the power method forms the vector sequence z (k) = Ak z (0) using the recursion

z (k) = Az (k−1) , k = 1, 2, . . . . (7.3.3)


This only requires the ability to form products Az for given vectors z. If the eigenvalues
satisfy |λ1 | > |λ2 | ≥ · · · ≥ |λn |, expanding z (0) along the eigenvectors gives z^{(0)} = \sum_{j=1}^{n} \alpha_j x_j and
z^{(k)} = A^k z^{(0)} = \sum_{j=1}^{n} \lambda_j^k \alpha_j x_j
        = \lambda_1^k \Big( \alpha_1 x_1 + \sum_{j=2}^{n} \Big(\frac{\lambda_j}{\lambda_1}\Big)^k \alpha_j x_j \Big),    (7.3.4)

k = 1, 2, . . . . If α1 ̸= 0 and |λj |/|λ1 | < 1 (j ̸= 1), it follows from (7.3.4) that z (k) converges
with linear rate |λ2 |/|λ1 | to the normalized eigenvector x1 as k → ∞. To avoid overflow or
underflow, recursion (7.3.3) should be modified to
\hat z^{(k)} = A z^{(k-1)}, \qquad z^{(k)} = \hat z^{(k)} / \|\hat z^{(k)}\|_2, \qquad k = 1, 2, \ldots .    (7.3.5)
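A direct transcription of (7.3.5) in Python (a sketch with our naming; convergence requires |λ_2|/|λ_1| < 1):

    import numpy as np

    def power_method(A, z0, maxit=500, tol=1e-10):
        # Normalized power iteration (7.3.5); returns the Rayleigh quotient and iterate.
        z = z0 / np.linalg.norm(z0)
        lam = z @ A @ z
        for _ in range(maxit):
            zhat = A @ z
            z = zhat / np.linalg.norm(zhat)
            lam_new = z @ A @ z                  # Rayleigh quotient of the current iterate
            if abs(lam_new - lam) <= tol * abs(lam_new):
                return lam_new, z
            lam = lam_new
        return lam, z

    rng = np.random.default_rng(4)
    B = rng.standard_normal((6, 6))
    A = (B + B.T) / 2                            # real symmetric test matrix
    lam, x = power_method(A, rng.standard_normal(6))
    print(abs(lam), np.max(np.abs(np.linalg.eigvalsh(A))))   # eigenvalue of largest magnitude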

Let x ∈ Cn be a given approximate eigenvector for a Hermitian matrix A ∈ Cn×n . Then


(λ, x) is an exact eigenpair of A if and only if the residual r = Ax − λx = 0. By continuity, we
can expect that ∥r∥2 can be used as a measure of the accuracy of the eigenpair. Hence it makes
sense to determine λ as the solution to the linear least squares problem minλ ∥Ax − λx∥2 . The
solution has the property that Ax − λx ⊥ x and is given by the Rayleigh quotient
\lambda = \frac{x^H A x}{x^H x}.    (7.3.6)

Theorem 7.3.1. Let x be a given unit vector and A be a Hermitian matrix. Then (µ, x), where
µ = x^H A x is the Rayleigh quotient, is an exact eigenpair of \tilde A = A + E, where
E = -(r x^H + x r^H), \qquad r = Ax - \mu x, \qquad \|E\|_2 = \|r\|_2.    (7.3.7)

Proof. Since r is orthogonal to x, it follows that Ex = −r and

(A + E)x = Ax − r = µx.

Hence (µ, x) is an exact eigenpair of A + E. Furthermore, ∥E∥22 = ∥E H E∥2 is the largest


eigenvalue of the rank-two matrix

E H E = rrH + ∥r∥22 xxH .

This shows that r and x are orthogonal eigenvectors of E H E, with both eigenvalues equal to
rH r = ∥r∥22 . The other eigenvalues are zero, and hence ∥E∥2 = ∥r∥2 .

The gradient of the Rayleigh quotient is
\frac{1}{2}\nabla\mu(x) = \frac{Ax}{x^H x} - \frac{x^H A x}{(x^H x)^2}\, x = \frac{1}{x^H x}(Ax - \lambda x).    (7.3.8)

Hence the Rayleigh quotient µ(x) is stationary if and only if x is an eigenvector of A. Therefore,
µ(x) is usually a far more accurate approximate eigenvalue than x is an approximate eigenvector.
If we apply the Rayleigh quotient to the Hermitian system (7.3.2) we obtain
\mu(u, v) = \frac{1}{2} (u^T, \pm v^T) \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} u \\ \pm v \end{pmatrix} = \pm u^T A v,    (7.3.9)

where u and v are unit vectors. Here sign(v) can be chosen to give a real nonnegative value of
µ(u, v). Given approximate right and left singular vectors of A ∈ Rm×n , the Rayleigh quotient
approximations to the dominant singular value are

\mu^{1/2}(v) = \|Av\|_2 / \|v\|_2, \qquad \mu^{1/2}(u) = \|A^T u\|_2 / \|u\|_2,

respectively. Theorem 7.3.1 implies the following residual error bound.

Theorem 7.3.2. For any scalar α and unit vectors u, v, there is a singular value σ of A such that
|\sigma - \alpha| \le \frac{1}{\sqrt 2} \left\| \begin{pmatrix} Av - u\alpha \\ A^T u - v\alpha \end{pmatrix} \right\|_2.    (7.3.10)
For fixed u, v this error bound is minimized by taking α equal to the Rayleigh quotient given in
(7.3.9).

The power method computes approximate eigenvectors of a Hermitian matrix A for the ei-
genvalue of largest magnitude. Approximations at the other end of the spectrum can be obtained
by applying the power method to A−1 . Given an initial unit vector v (0) , the inverse power
method computes the normalized sequence v (1) , v (2) , v (3) , . . . , by the recursion

A \hat v^{(k)} = v^{(k-1)}, \qquad v^{(k)} = \hat v^{(k)} / \|\hat v^{(k)}\|_2, \qquad k = 1, 2, \ldots .    (7.3.11)

Here v^{(k)} will converge to a unit eigenvector corresponding to the Rayleigh quotient
\mu_n^{-1} \approx (v^{(k-1)})^H A^{-1} v^{(k-1)} = (v^{(k-1)})^H \hat v^{(k)}.

This gives an approximation of the eigenvalue λn of A of smallest magnitude.


The inverse power method (7.3.11) assumes that an appropriate factorization of A is known
so that the linear system for vb(k) can be solved. If the QR factorization A = QR is known, the
inverse power method applied to A^T A simplifies to
R^T w^{(k)} = z^{(k-1)}, \qquad R z^{(k)} = w^{(k)}.    (7.3.12)

Each step requires two triangular solves.
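In Python, with scipy's triangular solver, the two solves in (7.3.12) look as follows (an illustrative sketch for approximating the smallest singular value; names are ours):

    import numpy as np
    from scipy.linalg import qr, solve_triangular

    rng = np.random.default_rng(5)
    A = rng.standard_normal((20, 6))
    Q, R = qr(A, mode='economic')                 # A = QR with R upper triangular

    z = rng.standard_normal(6)
    z /= np.linalg.norm(z)
    for _ in range(50):                           # inverse iteration with A^T A = R^T R
        w = solve_triangular(R, z, trans='T')     # R^T w = z
        zhat = solve_triangular(R, w)             # R z_hat = w
        mu = z @ zhat                             # Rayleigh quotient of (A^T A)^{-1}
        z = zhat / np.linalg.norm(zhat)

    print(1.0 / np.sqrt(mu),
          np.linalg.svd(A, compute_uv=False).min())   # ~ smallest singular value of A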


More generally, for a shift µ ̸= λi the power method can be applied to the matrix (A−µI)−1 ,
with eigenvalues

θi = 1/(λi − µ), λi = µ + 1/θi , i = 1, . . . , n. (7.3.13)

By this spectral transformation, eigenvalues close to the shift µ are transformed into large and
well-separated eigenvalues of (A−µI)−1 ; see Figure 7.3.1. Given an initial vector v0 , the shifted
inverse power method computes the sequence of vectors

(A - \mu I)\hat v_k = v_{k-1}, \qquad k = 1, 2, \ldots .    (7.3.14)

The corresponding Rayleigh quotient approximation of σi becomes
\sigma_i^2 \approx \mu + 1/(v_{k-1}^H \hat v_k).    (7.3.15)
An a posteriori error bound is \|r_k\|_2 / \|\hat v_k\|_2, where
r_k = A\hat v_k - \big(\mu + 1/(v_{k-1}^H \hat v_k)\big)\hat v_k = v_{k-1} - \hat v_k/(v_{k-1}^H \hat v_k).

Shifted inverse iteration is usually attributed to Wielandt [1117, 1944] but can be traced back
to Jacobi’s work in 1844. It is a powerful method for computing an eigenvalue in a neighborhood
of the shift µ but requires computing a factorization of the shifted matrix A − µI.
Figure 7.3.1. Spectral transformation with shift µ = 1. Used with permission of Springer
International Publishing; from Numerical Methods in Matrix Computations, Björck, Åke, 2015; permission
conveyed through Copyright Clearance Center, Inc.

So far we have considered inverse iteration with a fixed shift µ. In Rayleigh-quotient it-
eration (RQI) a variable shift is used equal to the Rayleigh quotient of the current eigenvector
approximation.

Algorithm 7.3.1 (Rayleigh-Quotient Iteration).


Let v0 ∈ Cn be an initial vector of unit length. Set µ0 = v0HAv0 , and for k = 0, 1, 2, . . . ,

1. If A − µk I is singular, then solve (A − µk I)vk+1 = 0 for unit vector vk+1 and stop.
Otherwise solve (A − µk I)v = vk .

2. Compute η = ∥v∥2 and set vk+1 = v/η, µk+1 = (v H Av)/η 2 .

3. If η is sufficiently large, accept eigenpair (µk+1 , vk+1 ) and stop.

Note that Av = µk v + vk allows the Rayleigh quotient in step 2 to be updated as
\mu_{k+1} = (v^H A v)/\eta^2 = \mu_k + (v_{k+1}^H v_k)/\eta.
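A compact Python version of Algorithm 7.3.1 for a real symmetric matrix is sketched below (illustrative only; the exactly singular shift in step 1 is handled simply by returning the current pair):

    import numpy as np

    def rqi(A, v0, maxit=20, tol=1e-12):
        # Rayleigh-quotient iteration for a symmetric matrix A.
        v = v0 / np.linalg.norm(v0)
        mu = v @ A @ v
        for _ in range(maxit):
            try:
                w = np.linalg.solve(A - mu * np.eye(A.shape[0]), v)
            except np.linalg.LinAlgError:         # shift is an exact eigenvalue
                return mu, v
            eta = np.linalg.norm(w)
            v = w / eta
            mu = v @ A @ v                        # updated Rayleigh quotient
            if np.linalg.norm(A @ v - mu * v) <= tol * np.linalg.norm(A, 2):
                break
        return mu, v

    rng = np.random.default_rng(6)
    B = rng.standard_normal((8, 8))
    A = (B + B.T) / 2
    mu, v = rqi(A, rng.standard_normal(8))
    print(min(abs(np.linalg.eigvalsh(A) - mu)))   # distance to the nearest eigenvalue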

If A is Hermitian, the Rayleigh quotient is stationary at eigenvectors, and the local rate of conver-
gence is cubic; see Parlett [884, 1998, Theorem 4.7.1]. This ensures that the number of correct
digits in vk triples at each step for k large enough.
The norm of the residual rk = Avk − µk vk is the best measure of the accuracy of (µk , vk )
as an eigenpair. A key fact in the global analysis of RQI is that for a Hermitian A the residual
norms decrease.

Theorem 7.3.3. For a Hermitian matrix A, the residual norms in RQI are monotonically de-
creasing: ∥rk+1 ∥2 ≤ ∥rk ∥2 . Equality holds only if µk+1 = µk and vk is an eigenvector of
(A − µk I)2 .

Proof. See Parlett [884, 1998, Theorem 4.8.1].


In the Hermitian case it is not necessary to assume that RQI converges to an eigenvector
corresponding to a simple eigenvalue. Either the iterates vk will converge cubically to an eigen-
vector of A, or the odd and even iterates will converge linearly to the bisectors of a pair of
eigenvectors of A. The latter situation is unstable under small perturbations, so RQI converges
from any starting vector; see Parlett [884, 1998, Sect. 4.9]. Note that RQI may not converge to
an eigenvalue closest to µ(v0 ). It is not in general obvious how to choose the starting vector to
make RQI converge to a particular eigenvalue.
Rayleigh-quotient iteration requires a new factorization of the shifted matrix A − µk I for
each iteration. It is therefore considerably more costly than inverse iteration. For a dense matrix
the cost for a factorization is O(n3 ) operations. For problems where A is large and sparse it
may not be feasible. Then (A − µk I)v = vk can be solved inexactly using an iterative solution
method.

7.3.2 Subspace Iteration


Let S = (s1 , . . . , sp ) ∈ Rn×p be a given initial matrix of rank p > 1. In subspace iteration a
sequence {Zk } of matrices is generated by

Z0 = S, Zk = M Zk−1 , k = 1, 2, . . . , (7.3.16)

where M ∈ Rn×n is a given symmetric matrix. Then it holds that

Zk = M k S = (M k s1 , . . . , M k sp ).

In applications, M is often a very large sparse matrix, and p ≪ n. If M has a dominant eigen-
value λ1 , then all columns of Zk will converge to a scalar multiple of the dominant eigenvec-
tor x1 . Therefore, Zk will be close to a matrix of numerical rank one, and it is not clear that
much will be gained. If S = span (S), subspace iteration is actually computing a sequence of
subspaces M k S = span (M k S). The problem is that Zk = M k S becomes an increasingly
ill-conditioned basis for M k S. To avoid this, orthogonality can be maintained between the basis
columns as follows. Orthogonal iteration starts with an orthonormal matrix Q0 and computes

Zk = M Qk−1 , Zk = Qk Rk , k = 1, 2, . . . . (7.3.17)

Here Rk plays the role of a normalizing matrix, and Q1 = Z1 R1−1 = M Q0 R1−1 . By induction,
it can be shown that
Qk Rk · · · R1 = M k Q0 . (7.3.18)
Hence the iterations (7.3.16) and (7.3.17) generate the same sequence of subspaces, R(M k Q0 ) =
R(Qk ). Since iteration (7.3.16) is less costly, it is sometimes preferable to perform the orthogo-
nalization in (7.3.17) only occasionally as needed. Bauer [94, 1957] suggests a procedure called
treppen-iteration (staircase iteration) to maintain linear independence of the basis vectors. This
is similar to orthogonal iteration but uses LU instead of QR factorizations.
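Orthogonal iteration (7.3.17) is only a few lines of Python; the sketch below (our naming) measures the distance from R(Q_k) to the dominant invariant subspace by the largest principal angle:

    import numpy as np

    rng = np.random.default_rng(7)
    B = rng.standard_normal((30, 30))
    M = (B + B.T) / 2                                  # symmetric test matrix
    p = 4
    Q = np.linalg.qr(rng.standard_normal((30, p)))[0]  # orthonormal Q_0

    for _ in range(200):
        Z = M @ Q                                      # Z_k = M Q_{k-1}
        Q, R = np.linalg.qr(Z)                         # Z_k = Q_k R_k

    lam, X = np.linalg.eigh(M)
    U1 = X[:, np.argsort(-np.abs(lam))[:p]]            # basis of the dominant invariant subspace
    # sin of the largest principal angle between R(Q) and R(U1);
    # it decays like |lambda_{p+1}/lambda_p|^k
    print(np.linalg.norm(Q - U1 @ (U1.T @ Q), 2))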
Orthogonal iteration overcomes several disadvantages of the power method. Provided
|λp+1 /λp | is small, it can be used to determine the invariant subspace corresponding to the
dominant p eigenvalues. Assume that the eigenvalues of M satisfy

|λ1 | ≥ · · · ≥ |λp | > |λi |, i = p + 1, . . . , n,

and let
\begin{pmatrix} U_1^H \\ U_2^H \end{pmatrix} M (U_1 \; U_2) = \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix}    (7.3.19)
be a Schur decomposition of M , where diag (T11 ) = (λ1 , . . . , λp )H . Then the subspace U1 =


R(U1 ) is a dominant invariant subspace of M . It can be shown that in orthogonal iteration, the
subspaces R(Qk ) almost always converge to U1 as k → ∞.
The accuracy of an invariant subspace is measured by the distance to the exact invariant
subspace; see Definition 1.2.14.

Theorem 7.3.4. Let U1 = R(U1 ) be a dominant invariant subspace of M , as defined in (7.3.19).


Let S be a p-dimensional subspace of Cn such that S ∩ U1⊥ = {0}. Then there exists a constant
C such that
\theta_{\max}(M^k S, U_1) \le C\, |\lambda_{p+1}/\lambda_p|^k,
where θmax (X , Y) denotes the largest angle between the two subspaces.

Proof. See Golub and Van Loan [512, 1996, p. 333].

Subspace iteration on p vectors simultaneously performs subspace iteration on the nested


sequence of subspaces span (s1 ), span (s1 , s2 ), . . . , span (s1 , s2 , . . . , sp ). This is also true for
orthogonal iteration, because the property is not changed by the orthogonalization procedure.
Hence Theorem 7.3.4 shows that whenever |λq+1 /λq | is small for some q ≤ p, convergence to
the corresponding dominant invariant subspace of dimension q will be fast. There is a duality
between direct and inverse subspace iteration.

Lemma 7.3.5 (Watkins [1103, 1982]). Let S and S ⊥ be orthogonal complementary subspaces
of Cn . Then for all integers k the spaces M k S and (M H )−k S ⊥ are also orthogonal.

Proof. Let x ∈ S and y ∈ S ⊥ . Then (M k x)H (M H )−k y = xH y = 0, and thus M k x ⊥


(M H )−k y.

This duality property means that the two sequences of subspaces

S, M S, M 2 S, . . . and S ⊥ , (M H )−1 S ⊥ , (M H )−2 S ⊥ , . . .

are equivalent in the sense that the orthogonal complement of a subspace in one sequence equals
the corresponding subspace in the other. This result is important for understanding convergence
properties of the QR algorithm. A geometric theory for QR and LR iterations is given by Parlett
and Poole [885, 1973].

7.3.3 The Rayleigh–Ritz Procedure


Let M ∈ Cn×n be a given Hermitian matrix with eigenvalues λi and eigenvectors xi , i = 1 : n,
and let Sk be a k-dimensional subspace of Cn . An approximate eigenpair (θ, y) of M with
y ∈ Sk can be determined by imposing the Galerkin condition

M y − θy ⊥ Sk . (7.3.20)

Let Sk = R(Qk ) for some orthonormal matrix Qk , and set y = Qk z. Then condition (7.3.20)
can be written
Q_k^H (M - \theta I) Q_k z = 0
or, equivalently, as the projected eigenvalue problem
(H_k - \theta I) z = 0, \qquad H_k = Q_k^H M Q_k.    (7.3.21)
The matrix Hk ∈ Ck×k is Hermitian and is the matrix Rayleigh quotient of M . Note that the
condition of this projected eigenvalue problem is not degraded. In practice, M is often large and
sparse, and one is only interested in approximating part of its spectrum. If k ≪ n, the Hermitian
eigenvalue problem (7.3.21) is small and can be solved by a standard method, such as the QR
algorithm. The solution yields k approximate eigenvalues and eigenvectors of M as described in
the procedure below.

Algorithm 7.3.2 (The Rayleigh–Ritz Procedure).

Let M ∈ Cn×n be a given Hermitian matrix, and let Qk = (q1 , . . . , qk ) be an orthonormal


basis for a given k-dimensional subspace of Cn , k ≪ n.

1. Compute the matrix M Qk = (M q1 , . . . , M qk ) and the matrix Rayleigh quotient

H_k = Q_k^H (M Q_k) \in \mathbb{C}^{k\times k}.    (7.3.22)

2. Compute the Ritz values (the eigenvalues of Hk ) and select from them p ≤ k desired
approximate eigenvalues θi , i = 1, . . . , p. Then compute the corresponding eigenvectors
zi :
Hk zi = θi zi , i = 1, . . . , p. (7.3.23)

3. Compute the Ritz vectors yi = Qk zi , i = 1, . . . , p, which are approximate eigenvectors


of M .
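The three steps of the procedure translate directly into Python. The following sketch (identifiers ours) uses a random orthonormal basis Q_k and checks the backward error bound |θ_i − λ| ≤ ∥r_i∥_2 discussed below:

    import numpy as np

    rng = np.random.default_rng(8)
    B = rng.standard_normal((100, 100))
    M = (B + B.T) / 2                         # Hermitian (here real symmetric) matrix
    k = 10
    Qk = np.linalg.qr(rng.standard_normal((100, k)))[0]   # orthonormal basis of a subspace

    MQ = M @ Qk                               # step 1: M Q_k and the matrix Rayleigh quotient
    Hk = Qk.T @ MQ
    theta, Z = np.linalg.eigh(Hk)             # step 2: Ritz values and eigenvectors of H_k
    Y = Qk @ Z                                # step 3: Ritz vectors

    R = MQ @ Z - Y * theta                    # residuals r_i = M y_i - theta_i y_i
    res = np.linalg.norm(R, axis=0)
    lam = np.linalg.eigvalsh(M)
    # each Ritz value lies within ||r_i||_2 of some eigenvalue of M
    print(all(np.min(np.abs(lam - t)) <= r + 1e-10 for t, r in zip(theta, res)))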

Backward error bounds for the approximate eigenvalues θi , i = 1 : p, are obtained from the
residuals
ri = M yi − yi θi = (M Qk )zi − yi θi , i = 1 : p. (7.3.24)

The Ritz value θi is an exact eigenvalue for a matrix M + Ei , with ∥Ei ∥2 ≤ ∥ri ∥2 . The
corresponding forward error bound is |θi − λi | ≤ ∥ri ∥2 . The Rayleigh–Ritz procedure is optimal
in the sense that the residual norm ∥M Qk −Qk Hk ∥ is minimized for all unitarily invariant norms
by taking Hk equal to the matrix Rayleigh quotient (7.3.22).
No bound for the error in a Ritz vector yi can be given without more information. This is to be
expected, because if another eigenvalue is close to the Ritz value, the eigenvector is very sensitive
to perturbations. If the Ritz value θi is known to be well separated from other eigenvalues of M
except the closest one, then a bound on the error in the Ritz vector and also an improved error
bound for the Ritz value θi can be obtained. If λi is the eigenvalue of M closest to θi , then

|\theta_i - \lambda_i| \le \|r_i\|_2^2 / \mathrm{gap}(\theta_i), \qquad \mathrm{gap}(\theta_i) = \min_{j \ne i} |\lambda_j - \theta_i|.    (7.3.25)

Furthermore, if xi is an eigenvector of M associated with λi , then

sin ∠(yi , xi ) ≤ ∥ri ∥2 /gap (θi ). (7.3.26)

When some of the intervals [θi − ∥ri ∥2 , θi + ∥ri ∥2 ], i = 1, . . . , k, overlap, we cannot be sure
of having an eigenvalue of M in each of these intervals. When the Ritz values are clustered, the
following theorem provides useful bounds for individual eigenvalues of M .
Theorem 7.3.6. Let M ∈ Cn×n be Hermitian, let Qk ∈ Cn×k be any orthonormal matrix, and
set
B = Q_k^H M Q_k, \qquad R = M Q_k - Q_k B.
Then to the eigenvalues θ1 , . . . , θk of B there correspond eigenvalues λ1 , . . . , λk of M such that
|λi − θi | ≤ ∥R∥2 , i = 1 : k. Furthermore, there are eigenvalues λi of M such that
\sum_{i=1}^{k} (\lambda_i - \theta_i)^2 \le 2\|R\|_F^2.

Proof. See Parlett [884, 1998, Sect. 11.5].

Unless the Ritz values are well separated, there is no guarantee that the Ritz vectors are
good approximations to an eigenvalue of M . This difficulty arises because B may have spurious
eigenvalues bearing no relation to the spectrum of M . This problem can be resolved by using a
refined Ritz vector as introduced by Jia [668, 2000]. This is the solution y to the problem
\min_{\|y\|_2 = 1} \|M y - \theta y\|_2, \qquad y = Q_k z,    (7.3.27)

where θ is a computed Ritz value. This is equivalent to
\min_{z} \|(M Q_k - \theta Q_k) z\|_2 \quad \text{subject to} \quad \|z\|_2 = 1.

The solution is given by a right singular vector z corresponding to the smallest singular value of
M Qk − θQk . Since M Qk must be formed anyway in the Rayleigh–Ritz procedure, the extra
cost is only that of computing the SVD of a matrix of size n × k. In the Hermitian case the
Ritz vectors can be chosen so that Z = (z1 , . . . , zk ) is unitary and the projected matrix B is
Hermitian. Then, for each Ritz value θi there is an eigenvalue λi of A such that

|\theta_i - \lambda_i| \le \|r_i\|_2, \qquad i = 1 : k.    (7.3.28)

For determining interior and small eigenvalues of M it is more appropriate to use the har-
monic Ritz values introduced by Paige, Parlett, and van der Vorst [864, 1995]. Given the sub-
space span (Qk ), the harmonic projection method requires that

(M − θI)Qk z ⊥ span (M Qk ). (7.3.29)

This is a generalized symmetric eigenvalue problem, and the eigenvalues are the harmonic Ritz
values. If the basis matrix Qk is chosen so that Vk = M Qk is orthonormal, then (7.3.29) becomes
(M Qk )H (M Qk − θQk )z = 0, or because Qk = M −1 Vk ,

(θ−1 I − VkH M −1 Vk )z = 0. (7.3.30)

This is a standard eigenvalue problem for M −1 .


The Lanczos process [715, 1950] for reducing a Hermitian matrix M to tridiagonal form
(Section 6.2.4) is a natural way to realize the Rayleigh–Ritz procedure on a sequence of Krylov
subspaces; see Parlett [884, 1998, Chapter 13]. It is a matrix-free algorithm, i.e., it only requires
the ability to form matrix-vector products with M . An implementation (irbleigs) is given
by Baglama, Calvetti, and Reichel [54, 2003]. This can be used to compute a few selected
singular values and associated vectors of A by applying it to one of the equivalent Hermitian
eigenproblems in (7.3.1) or (7.3.2).
More directly, the GKL bidiagonalization (Section 4.2.3) of a rectangular matrix A ∈ Rm×n ,
m ≥ n, can be used to implement the Rayleigh–Ritz procedure. Starting with a unit vector
v1 ∈ Rn , this computes u1 = Av1 /∥Av1 ∥2 ∈ Rm , and for i = 1, 2, . . . ,
γi+1 vi+1 = AT ui − ρi vi , (7.3.31)
ρi+1 ui+1 = Avi+1 − γi+1 ui . (7.3.32)
Here γi+1 and ρi+1 are nonnegative scalars chosen so that ui+1 , vi+1 are unit vectors. With
Uk = (u1 , . . . , uk ), Vk = (v1 , . . . , vk ), the recurrence relations can be summarized as
AVk = Uk Bk , AT Uk = Vk BkT + γk+1 vk+1 eTk , (7.3.33)
where
B_k = \begin{pmatrix} \rho_1 & \gamma_2 & & & \\ & \rho_2 & \gamma_3 & & \\ & & \ddots & \ddots & \\ & & & \rho_{k-1} & \gamma_k \\ & & & & \rho_k \end{pmatrix} \in \mathbb{R}^{k\times k}
is upper bidiagonal. Note that
γk+1 = ∥rk+1 ∥2 , rk+1 = AT uk − ρk vk .
If γk+1 = 0, it follows from (7.3.33) that the singular values of Bk are singular values of A, and
the associated singular vectors can be obtained from the SVD of Bk and Uk and Vk .
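The recurrences (7.3.31)–(7.3.32) are sketched below in Python, with full reorthogonalization of both U_k and V_k for simplicity (the function and variable names are ours). The first relation in (7.3.33) is verified, and the singular values of B_k are compared with the largest singular values of A:

    import numpy as np

    def gkl_bidiag(A, v1, k):
        # k steps of Golub-Kahan-Lanczos bidiagonalization, full reorthogonalization.
        m, n = A.shape
        U = np.zeros((m, k)); V = np.zeros((n, k + 1))
        rho = np.zeros(k); gamma = np.zeros(k + 1)
        V[:, 0] = v1 / np.linalg.norm(v1)
        for i in range(k):
            u = A @ V[:, i] - (gamma[i] * U[:, i - 1] if i > 0 else 0.0)
            u -= U[:, :i] @ (U[:, :i].T @ u)              # reorthogonalize
            rho[i] = np.linalg.norm(u); U[:, i] = u / rho[i]
            v = A.T @ U[:, i] - rho[i] * V[:, i]
            v -= V[:, :i + 1] @ (V[:, :i + 1].T @ v)      # reorthogonalize
            gamma[i + 1] = np.linalg.norm(v); V[:, i + 1] = v / gamma[i + 1]
        B = np.diag(rho) + np.diag(gamma[1:k], 1)         # upper bidiagonal B_k
        return U, V[:, :k], B

    rng = np.random.default_rng(9)
    A = rng.standard_normal((60, 20))
    U, V, B = gkl_bidiag(A, rng.standard_normal(20), 8)
    print(np.linalg.norm(A @ V - U @ B))                  # first relation in (7.3.33)
    print(np.linalg.svd(B, compute_uv=False)[:3])         # Ritz values ...
    print(np.linalg.svd(A, compute_uv=False)[:3])         # ... approximate the largest sigma(A)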
The columns of Vk and Uk form orthonormal bases for the Krylov subspaces Kk (ATA, v1 )
and Kk (AAT , Av1 ), respectively. From (7.3.33) the factorization for the equivalent Hermitian
problem (7.3.2) is
\begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} U_k & 0 \\ 0 & V_k \end{pmatrix}
 = \begin{pmatrix} U_k & 0 \\ 0 & V_k \end{pmatrix} \begin{pmatrix} 0 & B_k \\ B_k^T & 0 \end{pmatrix}
 + \begin{pmatrix} 0 \\ \gamma_{k+1} v_{k+1} e_k^T \end{pmatrix}.    (7.3.34)
To avoid spurious singular values caused by loss of orthogonality in Uk and Vk in floating-
point arithmetic, a selective reorthogonalization scheme can be used. As shown by Simon and
Zha [996, 2000] it may suffice to reorthogonalize either Vk or Uk , with considerable savings in
storage and operations.
After k steps of the bidiagonalization process the projected Rayleigh quotient matrix is given
by Bk = UkT AVk . The Rayleigh–Ritz procedure for the Krylov subspaces Kk (ATA, v1 ) and
Kk (AAT , Av1 ) computes the SVD
Bk = Pk Ωk QTk , Ω = diag (ω1 , . . . , ωk ), (7.3.35)
to obtain Ritz values ωi and left/right Ritz vectors v̂i = Vk Qk ei and ûi = Uk Pk ei . The largest
singular values of A tend to be quite well approximated by ωi for k ≪ n. Hochstenbach [635,
2004] shows that for nested subspaces the Ritz values approach the largest singular values mono-
tonically from above.
Small singular values are approached irregularly, but harmonic Ritz values converge to the
smallest singular values from above. Different extraction methods for singular values and vec-
tors are compared by Hochstenbach [635, 2004]. His numerical experiments confirm that for
extracting large singular values, the standard method works well. For interior or small singular
values, harmonic Ritz values perform better. The harmonic Ritz values θi satisfy the generalized
eigenproblem
\begin{pmatrix} 0 & B_k \\ B_k^T & 0 \end{pmatrix} \begin{pmatrix} s_i \\ w_i \end{pmatrix}
 = \frac{1}{\theta_i} \begin{pmatrix} B_k B_k^T + \gamma_{k+1}^2 e_k e_k^T & 0 \\ 0 & B_k^T B_k \end{pmatrix} \begin{pmatrix} s_i \\ w_i \end{pmatrix},    (7.3.36)
where Bk is nonsingular or, equivalently,
(B_k B_k^T + \gamma_{k+1}^2 e_k e_k^T) s_i = \theta_i^2 s_i, \qquad w_i = \theta_i B_k^{-1} s_i;    (7.3.37)

see Jia and Niu [671, 2010]. This result can also be obtained from similar formulas for the
Lanczos method given by Baglama, Calvetti, and Reichel [54, 2003]. It follows that the harmonic
Ritz value θi and Ritz vectors si , wi can be obtained more simply from the singular values and
right singular vectors of the lower bidiagonal matrix
\begin{pmatrix} B_k^T \\ \gamma_{k+1} e_k^T \end{pmatrix}
 = \begin{pmatrix} \rho_1 & & & \\ \gamma_2 & \rho_2 & & \\ & \gamma_3 & \ddots & \\ & & \ddots & \rho_{k-1} \\ & & & \gamma_k & \rho_k \\ & & & & \gamma_{k+1} \end{pmatrix} \in \mathbb{R}^{(k+1)\times k}.

With s̃i = si /∥si ∥2 and w̃i = wi /∥wi ∥2 , the Ritz vectors are

ũi = Uk s̃i , ṽi = Vk w̃i .

To improve convergence and reliability, Jia recommends that the Rayleigh quotient ρi = s̃i^T Bk w̃i
be used as an approximation of σi rather than θi .
The computed Ritz vectors may exhibit slow and irregular convergence even though the Ritz
value has converged. Jia and Niu [671, 2010] propose a refined strategy that combines harmonic
extraction with the refined projection principle. After using harmonic extraction to compute
ρi = s̃i^T Bk w̃i , ũi , and ṽi it computes the smallest singular value σmin and the corresponding
right singular vector zi = (xTi , yiT )T of the matrix
\begin{pmatrix} 0 & B_k \\ B_k^T & 0 \\ \gamma_{k+1} e_k^T & 0 \end{pmatrix} - \rho_i \begin{pmatrix} I & 0 \\ 0 & I \\ 0 & 0 \end{pmatrix}.    (7.3.38)

Then the new left and right approximate singular vectors are taken to be

ûi = Uk x̃i , v̂i = Vk ỹi ,

where x̃i = xi /∥xi ∥2 and ỹi = yi /∥yi ∥2 .


Because of the storage and arithmetic costs, the number of steps in the bidiagonalization
process must be limited. To enhance convergence to the desired part of the SVD spectrum, the
process can be restarted with a new initial vector (ATA − µI)v1 , where µ is a shift. The goal
of restarting is to replace the initial vector v1 = Vk e1 with a vector that is as near as possible
to a linear combination of the right singular vectors associated with the desired singular values.
Given the unique upper bidiagonal decomposition (7.3.31)–(7.3.32), we want to generate a new
bidiagonal decomposition corresponding to the starting vector (ATA − µ2 I)v1 . Combining the
equations in (7.3.33) and using αk = eTk Bk gives

(ATA)Vk = Vk (BkT Bk ) + ρk γk+1 vk+1 eTk , (7.3.39)

where the matrix Tk = BkT Bk is symmetric and tridiagonal. Hence, the implicitly restarted
Lanczos algorithm by Sorensen [1011, 1992] could be applied to T̂k = BkT Bk − µ2 I.
Björck, Grimme, and Van Dooren [145, 1994] show that forming BkT Bk can be avoided by
applying Golub–Reinsch QRSVD steps to Bk directly; see Section 7.1.4. First, a Givens rotation
G_l^{(1)} is determined so that
G_l^{(1)} \begin{pmatrix} \rho_1^2 - \mu^2 \\ \rho_1 \gamma_2 \end{pmatrix} = \begin{pmatrix} * \\ 0 \end{pmatrix}.
This creates in G_l^{(1)} B_k^T an unwanted nonzero element in position (1, 2). Next, the bidiagonal
form of G_l^{(1)} B_k^T is restored using k − 1 additional left and right Givens rotations to chase out the
unwanted nonzero element, giving
\hat B_k^T = G_l^{(k)} \cdots G_l^{(2)} G_l^{(1)} B_k^T G_r^{(2)} \cdots G_r^{(k)} = P_k^T B_k^T Q_k.

With Ubk ≡ Uk Qk and Vbk ≡ Vk Pk , the bidiagonal relations AVk = Uk Bk and AT Uk = Vk B T +


k
γk+1 vk+1 eTk become

AVbk = U
bk B
bk , AT U bkT + γk+1 vk+1 eTk Qk .
bk = Vbk B (7.3.40)

However, the last relation is not a valid relation for the bidiagonalization algorithm because the
residual term takes on the invalid form

rk = γk+1 vk+1 ( 0 · · · 0 q(k+1,k) q(k+1,k+1) ) , (7.3.41)

where q(i,j) is the (i, j)th element in Qk . This can be dealt with by sacrificing one step. Equating
the first (k − 1) columns of the second relation in (7.3.40), we obtain
A^T \hat U_{k-1} = \hat V_{k-1} \hat B_{k-1}^T + \hat r_k e_k^T, \qquad \hat r_k = \hat\gamma_k v_k + \gamma_{k+1} q_{(k+1,k)} v_{k+1}.    (7.3.42)

Similarly, taking the first k−1 columns of the first relation in (7.3.40) gives the restarted analogue
AVbk−1 = U bk−1 Bbk−1 . It can be shown that Ubk−1 , Vbk−1 , and B bk−1 are what would have been
obtained after k − 1 steps of bidiagonalization with a unit starting vector proportional to

vb1 = (ATA − µ2 I)v1 .

It follows that if repeated restarts are performed with p shifts µ1 , . . . , µp , a bidiagonalization of


size k − p corresponding to a unit starting vector proportional to
p
Y
v̂1 = (ATA − µ2i I)v1
i=1

is obtained. An efficient strategy for restarting the Arnoldi or Lanczos process, proposed by
Sorensen [1011, 1992] and Lehoucq, Sorensen, and Yang [731, 1998], is to use unwanted Ritz
values as shifts to cause the resulting subspaces to contain more information about the desired
singular values. For example, to compute the p largest (smallest) singular triples, the shifts are
chosen as the k − p smallest (largest) singular values of Bk .
The standard implicitly restarted Lanczos tridiagonalization can suffer from numerical insta-
bilities caused by propagated round-off errors; see Lehoucq, Sorensen, and Yang [735, 1998].
An alternative is to perform the implicit restarts by augmenting the Krylov subspaces by certain
Ritz vectors. This process is mathematically equivalent to standard implicit restarts but is more
stable. A description of how to restart the bidiagonalization process by this method is found in
Baglama and Reichel [56, 2005].
PROPACK is a software package that uses bidiagonalization to compute selected singular
triplets. The initial work on PROPACK is described by Larsen [722, 1998]. Later versions
include implicit restarts and partial reorthogonalization; see Larsen [723, 2000]. An overview
of PROPACK versions is found at http://soi.stanford.edu/~rmunk/PROPACK/. The al-
gorithm IRLANB of Kokiopoulou, Bekas, and Gallopoulos [703, 2004] computes a few of the
smallest singular values. It uses an implicitly restarted bidiagonalization process with partial
reorthogonalization and harmonic Ritz values. A refinement process is applied to converged
singular vectors. Deflation is applied directly on the bidiagonalization process. The implic-
itly restarted block-Lanczos algorithm irbleigs of Baglama, Calvetti, and Reichel [55, 2003]
computes a few eigenpairs of a Hermitian matrix. It can be used to obtain singular triplets by
applying it to an equivalent Hermitian eigenproblem. The algorithm irlba of Baglama and Re-
ichel [56, 2005] is directly based on the bidiagonalization process with standard or harmonic
Ritz values. A block bidiagonalization version is given by Baglama and Reichel [57, 2006].

7.3.4 Jacobi–Davidson Methods


Let (θk , yk ) be a Ritz pair over a subspace Uk of dimension k, approximating an eigenpair of
interest of a matrix A. The method of Davidson proposed in [287, 1975] is a projection algorithm
in which the space Uk is enlarged until it contains an acceptable approximation to the desired
eigenvalue. Let
rk = Ayk − θk yk
be the residual of the current Rayleigh–Ritz approximation. Then Uk is enlarged by a vector
determined by the (diagonal) linear system
(D − θk I)v = rk , D = diag (A). (7.3.43)
The new vector uk+1 is taken to be the projection of v orthogonal to Uk . New Rayleigh–Ritz ap-
proximations are then computed using the extended subspace Uk+1 spanned by u1 , . . . , uk , uk+1 .
Davidson’s method originated in computational chemistry, where it was used to find dominant
eigenvalues of large symmetric, diagonally dominant matrices. For this class of problems it
frequently works well, but on other problems it can fail completely.
It is tempting to view D − θk I in (7.3.43) as an approximation to A − θk I. However,
attempts to improve the method by using a better approximation will in general not work. This
is not surprising, because using the exact inverse (A − θk I)−1 will map rk to the vector yk and
will not expand the subspace.
The Jacobi–Davidson method, proposed in 1996 by Sleijpen and van der Vorst [1002, 1996],
is a great improvement over Davidson’s method. In this method the vector v is required to lie in
the orthogonal complement of the last Ritz vector yk . (The idea to restrict the expansion of the
current subspace to vectors orthogonal to yk was used in a method by Jacobi; see [659, 1846].)
The basic equation for determining the update v now uses the orthogonal projection of A onto
the subspace yk⊥ . This leads to the equation
(I − yk ykH )(A − θk I)(I − yk ykH )v = −rk , v ⊥ yk , (7.3.44)
where, as before, rk = Ayk − θk yk is the residual of the current Rayleigh–Ritz approximation.
If θk is a good approximation to a simple eigenvalue, then A − θk I is almost singular but the
projected matrix in (7.3.44) is not. Since v ⊥ yk , we have (I − yk ykH )v = v, and hence
(I − yk ykH )(A − θk I)v = −rk . It follows that the update satisfies
v = (A − θk I)−1 (αyk − rk ), (7.3.45)
where α = ykH (A − θk I)v can be determined using the condition v ⊥ yk . An approximate
solution \tilde v to (7.3.45) orthogonal to yk can be constructed as follows. Let M ≈ A − θk I be an
approximation, and take
\tilde v = \alpha M^{-1} y_k - M^{-1} r_k,    (7.3.46)
where the condition \tilde v ⊥ yk gives
\alpha = \frac{y_k^H M^{-1} r_k}{y_k^H M^{-1} y_k}.    (7.3.47)
ykH M −1 yk
If M = A − θk I, then (7.3.46) reduces to v = α(A − θk I)−1 yk − yk . Since v is made
orthogonal to yk , the last term can be discarded. Hence, this choice is mathematically equivalent
to the RQI. Since (A − θk I)−1 yk may make a very small angle with yk , it is not worthwhile to
accelerate it further in the manner of Davidson.
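A small Python sketch of the one-preconditioner-solve expansion (7.3.46)–(7.3.47) is given below; the diagonal preconditioner and all names are chosen for illustration only. By construction the expansion vector is orthogonal to y_k:

    import numpy as np

    def jd_expansion(M, y, r):
        # v = alpha * M^{-1} y - M^{-1} r with alpha from (7.3.47), so that v is orthogonal to y.
        My = np.linalg.solve(M, y)
        Mr = np.linalg.solve(M, r)
        alpha = (y @ Mr) / (y @ My)
        return alpha * My - Mr

    rng = np.random.default_rng(10)
    B = rng.standard_normal((12, 12))
    A = (B + B.T) / 2
    y = rng.standard_normal(12); y /= np.linalg.norm(y)
    theta = y @ A @ y                                   # Rayleigh quotient
    r = A @ y - theta * y                               # residual of the Ritz pair
    M = np.diag(np.diag(A)) - theta * np.eye(12)        # Davidson-like diagonal preconditioner
    v = jd_expansion(M, y, r)
    print(abs(y @ v))                                   # ~ 0: the new direction is orthogonal to y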
Another approach is to use a preconditioned iterative method to solve (7.3.44). Let M ≈
A − θk I be a preconditioner, and let

Md = (I − yk ykH )M (I − yk ykH )

be the corresponding projected matrix. Then in each step of the iteration an equation of the form
Md z = u, where z, u ⊥ yk , has to be solved. This can be done as in (7.3.46) by computing
ykH M −1 u
z = αM −1 yk − M −1 u, α= .
ykH M −1 yk

Here M −1 yk and ykH M −1 yk need only be computed in the first iteration step. Only one appli-
cation of the preconditioner M is needed in later steps.
The Jacobi–Davidson method is among the most effective methods for computing a few in-
terior eigenvalues of a large sparse matrix, particularly when a preconditioner is available or
generalized eigenvalue problems are considered. Other methods, such as the “shift-and-invert”
variants of Lanczos and Arnoldi, require factorization of the shifted matrix. Moreover, the re-
sulting linear systems need to be solved accurately. Therefore, these methods are not well suited
to combinations with iterative methods as solvers for the linear systems. In Jacobi–Davidson
methods, such expensive factorizations can be avoided. Efficient preconditioned iterative solvers
can be used in inner iterations.
The Jacobi–Davidson method was introduced by Sleijpen and van der Vorst [1002, 1996],
[1003, 2000]. For a survey of variations and applications of this method, see Hochstenbach
and Notay [636, 2006]. Jacobi–Davidson algorithms for the generalized eigenvalue problem
are given in Fokkema, Sleijpen, and van der Vorst [415, 1998]. Variable preconditioners for
eigenproblems are studied by Eldén and Simoncini [381, 2002].
ARPACK is an implementation of the implicitly restarted Arnoldi method. It has become the
most successful and best known public domain software package for solving large-scale eigen-
value problems. ARPACK can be used for finding a few eigenvalues and eigenvectors of large
symmetric or unsymmetric standard or generalized eigenvalue problems; see the users’ guide of
Lehoucq, Sorensen, and Yang [731, 1998]. In MATLAB the eigs function is an interface to
ARPACK. The block Lanczos code of Grimes, Lewis, and Simon [538, 1994] and its updates
are often used for structural analysis problems in industrial applications. A selection of other
software packages freely available are listed in Sorensen [1012, 2002]. An ARPACK-based iter-
ative method for solving large-scale quadratic problems with a quadratic constraint is developed
in Rojas, Santos, and Sorensen [931, 2008].

Notes and references


Numerical methods for large-scale eigenvalue problems are treated in Saad [958, 2011], which
is a much modified revision of [954, 1992]. Bai et al. [61, 2000] give surveys and templates for
the solution of different eigenvalue problems.
An account of the historical development of the Rayleigh–Ritz method and its relation to
variational calculus is given by Gander and Wanner [436, 2012]. Ritz gave a complete description
of his method in [930, 1908]. Lord Rayleigh incorrectly claimed in 1911 that all the ideas in
Ritz’s work were present in his earlier paper [914, 1899].
Golub, Luk, and Overton [498, 1981] develop a block Lanczos method for computing se-
lected singular values and vectors of a matrix. Other Krylov subspace algorithms for computing
singular triplets are given by Cullum, Willoughby, and Lake [279, 1983]. Codes for partial sin-
gular value decompositions of sparse matrices for application to information retrieval problems
and seismic tomography are given by Berry [113, 1992], [114, 1993], [115, 1994]. Sleijpen and
van der Vorst [1002, 1996] develop an alternative Jacobi–Davidson algorithm for the partial Her-
mitian eigenproblem. A similar algorithm called JDSVD for computing singular triplets is given
by Hochstenbach [634, 2001].
Traditional inverse iterations use several Rayleigh quotient shifts for each singular value,
or just one factorization, and apply bidiagonalization on the shifted and inverted problem. In
Ruhe [941, 1998] and a series of other papers the Rational Krylov subspace methods are devel-
oped, which attempt to combine the virtues of these two approaches. Ruhe iterates with several
shifts to build up one basis from which several singular values can be computed.
Okša, Yamamoto, and Vajteršic [836, 2022] show the convergence to singular triplets for a
two-sided block-Jacobi method with dynamic ordering.

7.4 Matrix Functions and SVD


7.4.1 Basic Definitions
Let f (z) be a scalar function of a complex variable z ∈ C. Suppose the expansion
f(z) = \sum_{k=0}^{\infty} a_k z^k    (7.4.1)

has radius of convergence r ∈ (0, ∞). Then (7.4.1) converges uniformly for any |z| < r and
diverges for any |z| > r. In the interior of the circle of convergence, formal operations such
as termwise differentiation and integration with respect to z are valid. Consider now the related
matrix power series
f(A) = \sum_{k=0}^{\infty} a_k A^k.    (7.4.2)

If A is diagonalizable as A = XDX^{-1} , then A^k = XD^k X^{-1} . If the spectral radius ρ(A) < r,
the series (7.4.2) converges and defines a matrix function
f(A) = X f(D) X^{-1}.    (7.4.3)

Furthermore, Af (A) = f (A)A, i.e., f (A) commutes with A. An important example of a matrix
function is the matrix exponential eA . This can be defined by its series expansion
e^A = I + A + \frac{1}{2!} A^2 + \frac{1}{3!} A^3 + \cdots
for any matrix A ∈ Cn×n . Other examples are the matrix square root and sign functions, which
are treated next.
The previous assumption that A is diagonalizable is not necessary. Any matrix A ∈ Cn×n is
similar to a block diagonal matrix with almost diagonal matrices, which reveals its algebraic
properties. This is the Jordan canonical form named after the French mathematician Marie
Ennemond Camille Jordan (1838–1922).
Theorem 7.4.1 (Jordan Canonical Form). Any matrix A ∈ Cn×n is similar to the block
diagonal matrix
A = XJX^{-1} = X \,\mathrm{diag}\big( J_{m_1}(\lambda_1), \ldots, J_{m_t}(\lambda_t) \big) X^{-1},    (7.4.4)

where
J_{m_i}(\lambda_i) = \begin{pmatrix} \lambda_i & 1 & & \\ & \lambda_i & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{pmatrix} = \lambda_i I + S_i \in \mathbb{C}^{m_i \times m_i}, \qquad i = 1 : t,    (7.4.5)
are Jordan blocks and Si are shift matrices. The numbers m1 , . . . , mt are unique and \sum_{i=1}^{t} m_i =
n. The form (7.4.4) is called the Jordan canonical form and is unique up to the ordering of the
Jordan blocks.

A proof of this fundamental theorem is given in Horn and Johnson [639, 1985, Sect. 3.1]. It is
quite long and is therefore omitted here. The following result follows from an explicit expression
of the powers of a single Jordan block.

Theorem 7.4.2. Let A have the Jordan canonical form (7.4.4). Assume that f (λ) and its first
mk − 1 derivatives are defined for λ = λk , k = 1 : t. Then the function f (A) is said to be defined
on the spectrum of A, and
f(A) = X \,\mathrm{diag}\big( f(J_{m_1}(\lambda_1)), \ldots, f(J_{m_t}(\lambda_t)) \big) X^{-1},    (7.4.6)
where
f(J_{m_k}) = f(\lambda_k) I + \sum_{p=1}^{m_k-1} \frac{1}{p!} f^{(p)}(\lambda_k) S^p
 = \begin{pmatrix} f(\lambda_k) & f'(\lambda_k) & \cdots & \dfrac{f^{(m_k-1)}(\lambda_k)}{(m_k-1)!} \\ & f(\lambda_k) & \ddots & \vdots \\ & & \ddots & f'(\lambda_k) \\ & & & f(\lambda_k) \end{pmatrix}.    (7.4.7)

If f is a multivalued function, and a repeated eigenvalue of A occurs in more than one Jordan
block, then the same branch of f and its derivatives is usually taken. This choice gives a primary
matrix function that is expressible as a polynomial in A. In the following it is assumed that f (A)
is a primary matrix function unless stated otherwise. Then the Jordan canonical form definition
(7.4.6) does not depend on the ordering of the Jordan blocks.
There are several equivalent ways to define a function of a matrix. One definition, due to
Sylvester (1883), uses polynomial interpolation. Denote by λ1 , . . . , λt the distinct eigenvalues
of A, and let mk be the index of λk , i.e., the order of the largest Jordan block containing λk .
Assume that the function is defined on the spectrum Λ(A) of A. Then f (A) = p(A), where p is
the unique Hermite interpolating polynomial of degree less than n^* = \sum_{k=1}^{t} m_k that satisfies the
interpolating conditions
p^{(j)}(\lambda_k) = f^{(j)}(\lambda_k), \qquad j = 0 : m_k - 1, \quad k = 1 : t.    (7.4.8)


Note that the coefficients of the interpolating polynomial depend on A and that f (A) commutes
with A. It is well known that this interpolating polynomial exists and can be computed by
Newton’s interpolation formula
f(A) = f(\lambda_1) I + \sum_{j=1}^{n^*-1} f(\lambda_1, \lambda_2, \ldots, \lambda_{j+1}) (A - \lambda_1 I) \cdots (A - \lambda_j I),    (7.4.9)

where λj , j = 1 : n∗ are the distinct eigenvalues of A, each counted with the same multiplicity
as in the minimal polynomial. Thus, n∗ is the degree of the minimal polynomial of A. Formulas
for complex Hermite interpolation are given in Dahlquist and Björck [284, 2008, Sect. 4.3.2].
The definitions by the Jordan canonical form and polynomial interpolation can be shown to be
equivalent. Theory and computation of matrix functions are admirably surveyed in the seminal
monograph of Higham [625, 2008].

7.4.2 Matrix Square Root and Sign Function


A matrix X is called a square root of A ∈ Cn×n if it satisfies

X 2 = A. (7.4.10)

If A has no eigenvalues on the closed negative real axis, then there is a unique principal square
root such that −π/2 < arg(λ(X)) < π/2. The principal square root is denoted by A1/2 . When
it exists, it is a polynomial in A. If A is Hermitian and positive definite, then the principal square
root is the unique Hermitian and positive definite square root. If A is real and has a square root,
then A1/2 is real.
The square root of a matrix may not exist. For example, it is easy to verify that
A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}

cannot have a square root. To ensure that a square root exists, it suffices to assume that A has at
most one zero eigenvalue. If A is nonsingular and has s distinct eigenvalues, then it has precisely
2s square roots that are expressible as polynomials in the matrix A.
The principal square root of A can be computed directly using only the Schur decomposition
A = QT QH with T upper triangular; see Björck and Hammarling [146, 1983]. Then

A1/2 = QSQT , S = T 1/2 .

From T = S^2 we obtain t_{ii} = s_{ii}^2 and
t_{ij} = \sum_{k=i}^{j} s_{ik} s_{kj}, \qquad 1 \le i < j \le n.    (7.4.11)

Starting with s_{ii} = t_{ii}^{1/2} , i = 1 : n, the off-diagonal elements of S can be computed one diagonal
at a time from these relations in n^3/3 flops:
s_{ij} = \Big( t_{ij} - \sum_{k=i+1}^{j-1} s_{ik} s_{kj} \Big) \Big/ (s_{ii} + s_{jj}), \qquad 1 \le i < j \le n.    (7.4.12)

If tii = tjj we take sii = sjj , so this recursion does not break down. (Recall that we have
assumed that at most one diagonal element of T is zero.) The arithmetic cost of this algorithm is
dominated by the 25n3 flops required for computing Q and T in the Schur decomposition. When
A is a normal matrix (AHA = AAH ), T is diagonal. In this case, S is diagonal and the flop count
is reduced to 9n3 . A modified algorithm by Higham [611, 1987] avoids complex arithmetic for
real matrices with some complex eigenvalues by using the real Schur decomposition.
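A hedged Python sketch of the Schur recursion is given below (function name ours). scipy's complex Schur form is used; the strictly upper triangular part of S is computed column by column, which is equivalent to the diagonal-by-diagonal order of (7.4.12):

    import numpy as np
    from scipy.linalg import schur

    def sqrtm_schur(A):
        # Principal square root via the Schur method and recursion (7.4.12).
        T, Q = schur(A, output='complex')           # A = Q T Q^H with T upper triangular
        n = T.shape[0]
        S = np.zeros_like(T)
        np.fill_diagonal(S, np.sqrt(np.diag(T)))    # s_ii = t_ii^(1/2), principal branch
        for j in range(1, n):                       # column by column
            for i in range(j - 1, -1, -1):          # entries below in column j already known
                s = T[i, j] - S[i, i + 1:j] @ S[i + 1:j, j]
                S[i, j] = s / (S[i, i] + S[j, j])
        return Q @ S @ Q.conj().T

    rng = np.random.default_rng(11)
    A = rng.standard_normal((6, 6)) + 6 * np.eye(6)     # eigenvalues off the negative real axis
    X = sqrtm_schur(A)
    print(np.linalg.norm(X @ X - A))                    # ~ 0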
In applications where it is too costly to compute the Schur decomposition, an iterative method
can be used. Assume that A ∈ Cn×n has a principal square root, and let Xk be an approximation
to A1/2 . If Xk+1 = Xk + Hk , then

A = (Xk + Hk )2 = Xk2 + Xk Hk + Hk Xk + Hk2 .

Ignoring the term Hk2 gives the Newton iteration

Xk+1 = Xk + Hk , Xk Hk + Hk Xk = A − Xk2 . (7.4.13)

To solve for the correction Hk requires solving the Sylvester equation (7.4.13), which is expen-
sive. If the initial approximation X0 in (7.4.13) is chosen as a polynomial in A, e.g., X0 = I or
X0 = A, then all subsequent iterates Xk commute with A. Then (7.4.13) simplifies to
X_{k+1} = \frac{1}{2}\big( X_k + A X_k^{-1} \big),    (7.4.14)
which is the matrix version of the well-known scalar iteration zk+1 = (zk + a/zk )/2 for the
square root of a. Unfortunately, iteration (7.4.14) is unstable and converges only if A is very well-
conditioned. Divergence is caused by rounding errors that make the computed approximation Xk
fail to commute with A; see Higham [610, 1986].
Several stable modifications of the simplified Newton iteration (7.4.14) have been suggested;
see Iannazzo [653, 2003]. Denman and Beavers [313, 1976] rewrite (7.4.14) as
    Xk+1 = (1/2)(Xk + A^{1/2} Xk^{-1} A^{1/2}).

With Yk = A^{-1/2} Xk A^{-1/2} , this gives the coupled iteration: X0 = A, Y0 = I,

    Xk+1 = (1/2)(Xk + Yk^{-1}),    Yk+1 = (1/2)(Yk + Xk^{-1}).                       (7.4.15)
This iteration is stable with a quadratic rate of convergence, and limk→∞ Xk = A1/2 , limk→∞ Yk
= A−1/2 . Another stable modification of Newton’s iteration due to Meini [787, 2004] can be
written: X0 = A, H0 = (1/2)(I − A),

    Xk+1 = Xk + Hk ,    Hk+1 = −(1/2) Hk Xk+1^{-1} Hk ,    k = 0, 1, 2, . . . .      (7.4.16)
The convergence rate of Meini’s iteration is quadratic and can be improved by scaling. Similar
Newton methods for computing the pth root of a matrix A1/p for p > 2 can be developed; see
Iannazzo [654, 2006].
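
As a concrete illustration, a minimal NumPy sketch of the coupled iteration (7.4.15) is given below; the function name, tolerance, and iteration cap are our choices, and a real matrix with no eigenvalues on the closed negative real axis is assumed.

import numpy as np

def sqrtm_db(A, tol=1e-12, maxit=50):
    # Denman-Beavers iteration (7.4.15): X_k -> A^(1/2), Y_k -> A^(-1/2).
    X = np.array(A, dtype=float)
    Y = np.eye(A.shape[0])
    for _ in range(maxit):
        Xn = 0.5 * (X + np.linalg.inv(Y))
        Yn = 0.5 * (Y + np.linalg.inv(X))
        if np.linalg.norm(Xn - X, 1) <= tol * np.linalg.norm(Xn, 1):
            return Xn, Yn
        X, Y = Xn, Yn
    return X, Y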
Newton-type methods need the inverse of Xk or its LU (or Cholesky) factorization in each
iteration. Another possibility is to use an inner iteration for computing the needed inverses. The
Schulz iteration [976, 1933] for computing A−1 is

X0 = A, Xk+1 = Xk + (I − AXk )Xk , k = 0, 1, 2, . . . . (7.4.17)

It can be shown that if A ∈ Cn×n is nonsingular and if

X0 = α0 AT , 0 < α0 < 2/∥A∥22 ,



then limk→∞ Xk = A^{-1} . Convergence is ultimately quadratic: Ek+1 = Ek^2 , where Ek =
I − AXk . About 2 log2 κ2 (A) iterations are needed for convergence; see Söderström and Stew-
art [1010, 1974]. In general, the Schulz iteration cannot compete with direct methods for dense
matrices. However, performing a few steps of the iteration (7.4.17) can be used to improve an
approximate inverse.
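
For example, a sketch of such a refinement step in NumPy (the function name and the default number of steps are our choices) is:

import numpy as np

def schulz_refine(A, X0, steps=2):
    # A few Schulz steps (7.4.17) applied to an approximate inverse X0 of A.
    # Quadratic convergence requires the spectral radius of I - A X0 to be < 1,
    # e.g. X0 = alpha0 * A.T with 0 < alpha0 < 2/||A||_2^2.
    n = A.shape[0]
    X = np.array(X0, dtype=float)
    for _ in range(steps):
        X = X + (np.eye(n) - A @ X) @ X
    return X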
The sign function is defined by

    sign(z) = { −1 if ℜz < 0,   +1 if ℜz > 0 }                                       (7.4.18)

for all z ∈ C not on the imaginary axis. We assume in the following that A ∈ Cn×n is a
matrix with no eigenvalues on the imaginary axis. Its Jordan canonical form can be written
A = Xdiag (J+ , J− )X −1 , where the eigenvalues of J+ lie in the open right-hand plane and
those of J− lie in the open left-hand plane. Then
 
    S = sign(A) = X diag (Ik , −In−k ) X^{-1}                                        (7.4.19)

is diagonalizable with eigenvalues equal to ±1. If S is defined, then S 2 = I and hence S −1 = S.


Furthermore, S commutes with A (so SA = AS), and if A is real, so is S. From (7.4.19) it
follows that
A = SN, S = A(A2 )−1/2 , N = (A2 )1/2 , (7.4.20)
which is the matrix sign decomposition; see Higham [619, 1994]. Note that if A is Hermitian,
then A2 = AH A, and the polar and the sign decompositions
 are the same. The sign decomposi-
tion generalizes the scalar identity sign(z) = z/(z^2)^{1/2} . It is easy to verify that

    sign( [ 0  A ; I  0 ] ) = [ 0  A^{1/2} ; A^{-1/2}  0 ].                          (7.4.21)

If z does not lie on the imaginary axis, then

    sign(z) = z/(z^2)^{1/2} = z(1 − ξ)^{-1/2} ,    ξ = 1 − z^2 .                     (7.4.22)

An important property of the sign function of A is that

    P− = (1/2)(I − S),    P+ = (1/2)(I + S)
are the spectral projectors onto the invariant subspaces associated with the eigenvalues of A
in the left and right half-planes, respectively. That is, if the leading columns of an orthogonal
matrix Q span the column space of P+ , then
 
    Q^H A Q = [ A11  A12 ; 0  A22 ].

It follows that the eigenvalues of A in the right half-plane equal Λ(A11 ) and those in the left
half-plane are Λ(A22 ). This can be used to design spectral divide-and-conquer algorithms
for computing eigenvalue decompositions and other fundamental matrix decompositions via the
matrix sign function. The problem is recursively decoupled into two smaller subproblems by
using the sign function to compute an invariant subspace for a subset of the spectrum. This
type of algorithm can achieve more parallelism and have lower communication costs than other
standard eigenvalue algorithms; see Bai and Demmel [60, 1998].
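
A schematic NumPy/SciPy sketch of one such divide step is given below. It evaluates the sign function with scipy.linalg.signm; the helper name split_spectrum is ours, and a real A with no purely imaginary eigenvalues is assumed.

import numpy as np
from scipy.linalg import signm, qr

def split_spectrum(A):
    # One spectral divide-and-conquer step based on S = sign(A).
    n = A.shape[0]
    S = signm(A)
    Pplus = 0.5 * (np.eye(n) + S)          # projector onto the right-half-plane subspace
    k = int(round(np.trace(Pplus)))        # its dimension
    Q, R, piv = qr(Pplus, pivoting=True)   # first k columns of Q span range(P+)
    T = Q.T @ A @ Q                        # block upper triangular up to roundoff
    return Q, T[:k, :k], T[k:, k:]         # eigenvalues split by half-plane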

For Hermitian (and real symmetric) matrices A ∈ Cn×n , the eigenvalue decomposition
can be written A = V diag (Λ+ , Λ− )V H , where the diagonal matrices Λ+ and Λ− contain the
positive and negative eigenvalues, respectively. Then,

    A = V diag (Ik , −In−k )V^H · V diag (Λ+ , |Λ− |)V^H ≡ P H,                      (7.4.23)

where A = P H is the polar decomposition. If the unitary polar factor P is known, then
    P + I = ( V1  V2 ) diag (2Ik , 0) ( V1  V2 )^H = 2V1 V1^H .                      (7.4.24)

It follows that the symmetric matrix

    C = (1/2)(P + I) = V1 V1^H

is an orthogonal projector onto the subspace corresponding to the positive eigenvalues of A.


Nakatsukasa and Higham [822, 2013] develop a technique for computing the eigenvalue
decomposition of Hermitian matrices that can be used also for computing the SVD. The first
step computes the polar decomposition A = P H, where P ∈ Cm×n is unitary (P H P = I)
and H ∈ Cn×n is Hermitian positive semidefinite. In the second step the symmetric eigenvalue
decomposition H = V ΣV H is computed. The desired SVD is then

A = (P V )ΣV H = U ΣV H .

The matrix sign function can be computed by a scaled version of the Newton iteration for
X^2 = I:

    X0 = A,    Xk+1 = (1/2)(Xk + Xk^{-1}),    k = 0, 1, 2, . . . .                   (7.4.25)

The corresponding scalar iteration λ_{k+1} = (λk + λk^{-1})/2 is Newton’s iteration for the square root
of 1. It converges quadratically to 1 if ℜ(λ0 ) > 0, and to −1 if ℜ(λ0 ) < 0. The matrix iteration
(7.4.25) is globally and quadratically convergent to sign (A), provided A has no eigenvalues on
the imaginary axis. From the Jordan canonical form it follows that the eigenvalues are decoupled
and obey the scalar iteration with λj^{(0)} = λj (A). Ill-conditioning of a matrix Xk can destroy the
convergence or cause misconvergence.
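
A bare-bones NumPy sketch of the (unscaled) iteration (7.4.25), with a simple relative-change stopping test of our choosing, reads:

import numpy as np

def sign_newton(A, tol=1e-12, maxit=100):
    # Newton iteration (7.4.25) for sign(A); assumes no eigenvalues of A
    # lie on the imaginary axis.
    X = np.array(A, dtype=float)
    for _ in range(maxit):
        Xn = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(Xn - X, 1) <= tol * np.linalg.norm(Xn, 1):
            return Xn
        X = Xn
    return X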
Higher order iterative methods for sign (A) can be derived from matrix analogues of Taylor
or Padé approximations of the function h(ξ) = (1 − ξ)^{-1/2} . The Padé approximations of a
function f (z) are rational functions

    rℓ,m (z) = Pℓ,m (z)/Qℓ,m (z) ≡ ( Σ_{j=0}^{ℓ} pj z^j ) / ( Σ_{j=0}^{m} qj z^j ),  (7.4.26)

with numerator of degree at most ℓ and denominator of degree at most m, such that

f (z) − rℓ,m (z) = Rz ℓ+m+1 + O(z ℓ+m+2 ), z → 0. (7.4.27)

For the function h(ξ) = (1 − ξ)^{-1/2} these are explicitly known. For ℓ = m − 1 and ℓ = m
they are called principal Padé approximations and have the special property that

    rℓ,m = ( (1 + z)^p − (1 − z)^p ) / ( (1 + z)^p + (1 − z)^p ),

where p = ℓ + m + 1. That is, the numerator and denominator are, respectively, the odd and even
parts of (1 + z)p . This makes it easy to write down the corresponding rational approximations.
The principal Padé approximations have the following properties; see Kenney and Laub [691,
1991, Theorem 5.3].

Theorem 7.4.3. If A has no purely imaginary eigenvalues, then a Padé approximation with
ℓ = m or ℓ = m − 1 gives the rational iteration X0 = A,

    Xk+1 = Xk pℓ,m (I − Xk^2) [ qℓ,m (I − Xk^2) ]^{-1} ,    k = 0, 1, 2, . . . .     (7.4.28)

This converges to S = sign (A), and

    (S − Xk )(S + Xk )^{-1} = [ (S − A)(S + A)^{-1} ]^{(ℓ+m+1)^k} .

In particular, taking ℓ = m = 1, we have (1 − z)^3 = 1 − 3z + 3z^2 − z^3 , −z p11 = −z(3 + z^2 ),
q11 = 1 + 3z^2 . This gives the iteration

    X0 = A,    Xk+1 = Xk (3I + Xk^2)(I + 3Xk^2)^{-1} ,    k = 0, 1, 2, . . . ,       (7.4.29)

which is Halley’s method for sign (A) and has cubic convergence rate.

Notes and references


Early work on spectral dichotomy has been done by Godunov [482, 1986] and Malyshev [769,
1993]. Bai, Demmel, and Gu [62, 1997] develop an inverse-free spectral divide-and-conquer
algorithm for the generalized eigenvalue problem that uses only rank-revealing QR factorization
and multiplication. The algorithm of Bai and Demmel [60, 1998] is based on the matrix sign
function and a scaled Newton iteration. Divide-and-conquer algorithms for Hermitian matrices
have been developed in the PRISM project by Zhang, Zha, and Ying [1148, 2007]. Nakatsukasa
and Freund [821, 2016] give fast methods for computing the matrix sign function based on opti-
mal rational approximations of very high order due to Zolotarev [1154, 1877].

7.4.3 Polar Decomposition


Although the factors of the polar decomposition of a matrix A are not matrix functions, the
decomposition has strong connections to the matrix square root and sign function.

Theorem 7.4.4 (Polar Decomposition). Suppose A ∈ Cm×n with m ≥ n. There exists a


matrix P ∈ Cm×n with orthogonal columns and a unique Hermitian positive semidefinite matrix
H ∈ Cn×n such that
A = P H, P H P = I. (7.4.30)
The Hermitian polar factor H is unique for any A. If r = rank(A) = n, then H is positive
definite and the polar factor P is uniquely determined.

Proof. Let A have the singular value decomposition

A = U ΣV H = (U V H )(V ΣV H ),

where U ∈ Cm×n has orthonormal columns and V ∈ Cn×n is unitary. Then (7.4.30) holds with

A = P H, P = UV H, H = V ΣV H . (7.4.31)

For a square nonsingular matrix, the polar decomposition was first given by Autonne [41,
1902]. The factor P in the polar decomposition is the orthogonal (unitary) matrix closest to A.

Theorem 7.4.5. Let A ∈ Cm×n (m ≥ n) have the polar decomposition A = P H. Then, for
any unitarily invariant norm,

    ∥A − P ∥ = min_{Q^H Q = In} ∥A − Q∥.                                             (7.4.32)

If rank(A) = n, the minimizer is unique for the Frobenius norm, and

    ∥A − P ∥F = ( Σ_{i=1}^{n} (1 − σi )^2 )^{1/2} .

Theorem 7.4.5 suggests that computing the polar factor P is the “optimal orthogonalizing”
of a given matrix. In contrast to other orthogonalization methods it treats the columns of A
symmetrically, i.e., if the columns of A are permuted, the same P with permuted columns is
obtained. In quantum chemistry this orthogonalization method was pioneered by Löwdin [760,
1970] and is called Löwdin orthogonalization; see Bhatia and Mukherjea [117, 1986]. Other
applications of the polar decomposition arise in aerospace computations, factor analysis, satellite
tracking, and the Procrustes problem; see Section 7.4.4.
The Hermitian polar factor H also has a certain optimal property. Let A ∈ Cn×n be a Hermit-
ian matrix with at least one negative eigenvalue. Consider the problem of finding a perturbation
E such that A + E is positive semidefinite.

Theorem 7.4.6. Let A ∈ Cn×n be Hermitian, A = P H be its polar decomposition, and


    B = (1/2)(A + H),    E = (1/2)(A − H).
Then ∥A − B∥2 = ∥E∥2 ≤ ∥A − X∥2 for any positive semidefinite Hermitian matrix X.

Proof. See Higham [610, 1986].

The theorem was proved for m = n by Fan and Hoffman [395, 1955]. For the generalization
to m > n, see Higham [625, 2008, Theorem 8.4].
The polar decomposition can be regarded as a generalization of the polar decomposition
z = eiθ |z| of a complex number z. Thus

eiθ = z/|z| = z(|z|2 )−1/2 = z(1 − ξ)−1/2 , ξ = 1 − |z|2 . (7.4.33)

Expanding h(ξ) = (1 − ξ)^{-1/2} in a Taylor series and terminating the series after the term of
degree p gives

    e^{iθ} = z( 1 + (1/2)ξ + (3/8)ξ^2 + · · · + (−1)^p \binom{−1/2}{p} ξ^p ).        (7.4.34)

This series is convergent for |ξ| < 1.
A family of iterative methods for computing the unitary polar factor P is derived by Björck
and Bowie [138, 1971]. By a well-known analogy between matrices and complex numbers, we
get
    P = A(A^H A)^{-1/2} = A(I − E)^{-1/2} ,    E = I − A^H A.

The matrix series corresponding to (7.4.34),

    P = A( I + (1/2)E + (3/8)E^2 + · · · + (−1)^p \binom{−1/2}{p} E^p ),             (7.4.35)

converges to P if the spectral radius ρ(E) < 1. Terminating the expansion after the term of
order p gives an iterative method of order p + 1 for computing P . For p = 1 the following simple
iteration is obtained: P0 = A,

    Pk+1 = Pk (I + (1/2)Ek ),    Ek = I − Pk^H Pk ,    k = 0, 1, 2, . . . .          (7.4.36)

This only uses matrix-matrix products. If σmax (A) < √3, then Pk converges to P with quadratic
rate. In applications where A is already close to an orthogonal matrix, sufficient accuracy will
be obtained after just a few iterations.
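
A sketch of (7.4.36) in NumPy (real A assumed; the fixed number of steps is an arbitrary choice for the example) is:

import numpy as np

def polar_orthogonalize(A, steps=4):
    # Iteration (7.4.36): P_{k+1} = P_k (I + E_k/2), E_k = I - P_k^T P_k.
    # Intended for A already close to having orthonormal columns.
    P = np.array(A, dtype=float)
    I = np.eye(A.shape[1])
    for _ in range(steps):
        E = I - P.T @ P
        P = P @ (I + 0.5 * E)
    return P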
There are more rapidly converging iterative methods that work even when A is far from
orthogonal. Newton’s method applied to the equation P^H P = I yields the iteration

    P0 = A,    Pk+1 = (1/2)(Pk + Pk^{-H}),    k = 0, 1, 2, . . . .                   (7.4.37)
This converges globally to the unitary polar factor P of A with quadratic rate:

    ∥Pk+1 − P ∥2 ≤ (1/2) ∥Pk^{-1}∥2 ∥Pk − P ∥2^2 ;

see Higham [625, 2008, Theorem 8.12]. The iteration (7.4.37) cannot be applied to a rectan-
gular matrix A. This is easily dealt with by first computing the QR factorization A = QR,
Q ∈ Rm×n (preferably with column pivoting). Apply the Newton iteration (7.4.37) with initial
approximation P0 = R to compute the polar factor P of R. Then QP is the unitary polar factor
of A.
If A is ill-conditioned, the convergence of the Newton iteration can be very slow initially.
Convergence can be accelerated by taking advantage of the fact that the orthogonal polar factor
of the scaled matrix γA, γ ̸= 0, is the same as for A. The scaled Newton iteration is

    P0 = A,    Pk+1 = (1/2)(γk Pk + γk^{-1} Pk^{-H}),    k = 0, 1, 2, . . . ,        (7.4.38)

where γk are scale factors. Scale factors that minimize ∥Pk+1 − P ∥2 are determined by the
condition that γk σ1 (Pk ) = 1/(γk σn (Pk )), i.e.,

    γk = (σ1 (Pk )σn (Pk ))^{-1/2} .

Because the singular values of Pk are not known, the cheaply computable approximations γk =
(αk /βk )^{-1/2} , where

    αk = ( ∥Pk ∥1 ∥Pk ∥∞ )^{1/2} ,    βk = ( ∥Pk^{-1}∥1 ∥Pk^{-1}∥∞ )^{1/2} ,

are used instead; see Higham [610, 1986]. The resulting iteration converges in at most nine
iterations to full IEEE double precision of 10−16 even for matrices with a condition number as
large as κ2 (A) = 1016 ; see Higham [625, 2008, Section 8.9]. Kielbasiński and Zietak [694,
2003] show that using the suboptimal scale factors
    γ0 = 1/√(ab),    γ1 = ( 2√(ab)/(a + b) )^{1/2} ,    γk+1 = 1/( (1/2)(γk + 1/γk ) )^{1/2} ,    k = 1, 2, . . . ,

where a = ∥A−1 ∥2 and b = ∥A∥2 , works nearly as well.
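
A minimal NumPy sketch of the scaled Newton iteration (7.4.38) with the (1, ∞)-norm scale factors described above (square nonsingular real A assumed; the stopping test is our choice) is:

import numpy as np

def polar_newton(A, tol=1e-12, maxit=30):
    # Scaled Newton iteration (7.4.38) for the orthogonal polar factor of A.
    P = np.array(A, dtype=float)
    for _ in range(maxit):
        Pi = np.linalg.inv(P)
        alpha = (np.linalg.norm(P, 1) * np.linalg.norm(P, np.inf)) ** 0.5
        beta = (np.linalg.norm(Pi, 1) * np.linalg.norm(Pi, np.inf)) ** 0.5
        gamma = (alpha / beta) ** -0.5        # approximates (sigma_1*sigma_n)^(-1/2)
        Pn = 0.5 * (gamma * P + Pi.T / gamma)
        if np.linalg.norm(Pn - P, 1) <= tol * np.linalg.norm(Pn, 1):
            return Pn
        P = Pn
    return P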


Iterative methods of higher order for the polar decomposition can be derived from Padé
approximations of the hypergeometric function h(ξ) = (1 − ξ)^{-1/2} . In particular, Halley’s
method becomes P0 = A,

    Pk+1 = Pk (3I + Pk^H Pk )(I + 3Pk^H Pk )^{-1} .                                  (7.4.39)

The initial rate of convergence of this iteration is very slow when κ(A) is large. A dynamically
weighted Halley (QDWH) algorithm, where X0 = A/∥A∥2 ,

    Xk+1 = Xk (ak I + bk Xk^H Xk )(I + ck Xk^H Xk )^{-1} ,                           (7.4.40)

is proposed by Nakatsukasa, Bai, and Gygi [820, 2010]. The singular values of Xk+1 are given
by σi (Xk+1 ) = gk (σi (Xk )), where

    gk (x) = x (ak + bk x^2)/(1 + ck x^2).
Ideally, the weighting parameters ak , bk , and ck should be chosen to maximize lk+1 , where
[lk+1 , 1] contains all singular values of Xk+1 . A suboptimal choice makes the function gk satisfy
the bounds 0 < gk (x) ≤ 1 for x ∈ [lk , 1] and attain the max-min

    max_{ak ,bk ,ck} { min_{x ∈ [lk ,1]} gk (x) }.

The solution of this optimization problem is given in Appendix A of [820].


An inverse-free implementation of the QDWH iteration can be obtained by first rewriting it
as

    Xk+1 = (bk /ck ) Xk + (ak − bk /ck ) Xk (I + ck Xk^H Xk )^{-1} .                 (7.4.41)

From the QR factorization

    [ √ck Xk ; I ] = [ Q1 ; Q2 ] R,

we have I + ck Xk^H Xk = R^H R. Examining the blocks of the QR factorization, we see that
√ck Xk = Q1 R and I = Q2 R. Hence the matrix in the second term of (7.4.41) can be computed
as

    √ck Xk (I + ck Xk^H Xk )^{-1} = Q1 R R^{-1} R^{-H} = Q1 Q2^H .

The QDWH method has the advantage that it requires at most six iterations for convergence to the
unitary polar factor of A to full IEEE double precision 10−16 for any matrix with κ(A) ≤ 1016 . A
proof of the backward stability of the QDWH method is given by Nakatsukasa and Higham [822,
2013].
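
As an illustration of the inverse-free step (7.4.41), the following NumPy sketch performs one iteration for given weights a, b, c; their optimal choice, described in [820], is omitted here, and the function name is ours.

import numpy as np

def qdwh_step(X, a, b, c):
    # One inverse-free QDWH step: the QR factorization of [sqrt(c) X; I] gives
    # X (I + c X^T X)^{-1} = Q1 Q2^T / sqrt(c).  Real X assumed.
    m, n = X.shape
    Z = np.vstack([np.sqrt(c) * X, np.eye(n)])
    Q, _ = np.linalg.qr(Z)                   # thin QR, Q has n columns
    Q1, Q2 = Q[:m, :], Q[m:, :]
    return (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)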
The sensitivity of the factors in the polar decomposition to perturbations in A has been studied
by Barrlund [83, 1990] and Chaitin-Chatelin and Gratton [216, 2000]. The absolute condition
number in the Frobenius norm for the orthogonal factor P is 1/σn (A). If A is real and m = n,
this can be improved to
2/(σn (A) + σn−1 (A)).

For the Hermitian factor H, an upper bound on the condition number is 2.

7.4.4 The Procrustes Problem


Given A and B in Rm×n , the orthogonal Procrustes problem is10

    min_{Q^T Q=I} ∥A − BQ∥F .                                                        (7.4.42)

10 In Greek mythology, Procrustes was a rogue smith and bandit who seized travelers, tied them to an iron bed, and

either stretched them or cut off their legs to make them fit.

The solution to this problem can be computed from the polar decomposition of B TA, as shown
by the following generalization of Theorem 7.4.5.

Theorem 7.4.7 (Schönemann [973, 1966]). Let Mm×n be the set of all matrices in Rm×n ,
m ≥ n, with orthogonal columns. Let A and B be given matrices in Rm×n such that rank(B TA) =
n. Then
∥A − BQ∥F ≥ ∥A − BP ∥F
for any matrix Q ∈ Mm×n , where B TA = P H is the polar decomposition.

Proof. From ∥A∥2F = trace (ATA) and trace (X T Y ) = trace (Y X T ) and the orthogonality of
Q, it follows that

∥A − BQ∥2F = trace (ATA) + trace (B TB) − 2 trace (QT B TA).

It follows that problem (7.4.42) is equivalent to maximizing trace (QT B TA). From the SVD
B TA = U ΣV T , set Q = U ZV T with Z orthogonal. Then ∥Z∥2 = 1, and the diagonal elements
of Z must satisfy |zii | ≤ 1, i = 1 : n. Hence,

    trace (Q^T B^TA) = trace (V Z^T U^T B^TA) = trace (Z^T U^T B^TA V )
                     = trace (Z^T Σ) = Σ_{i=1}^{n} zii σi ≤ Σ_{i=1}^{n} σi ,

where Σ = diag (σ1 , . . . , σn ). The upper bound is obtained for Q = U V T . If rank(A) = n,


this solution is unique.

The orthogonal Procrustes problem arises in factor analysis and multidimensional scaling
methods in statistics; see Cox and Cox [276, 1994]. In these applications the matrices A and
B represent sets of experimental data, and the question is whether these are identical up to a
rotation. Another application is in determining rigid body movements. Let a1 , a2 , . . . , am be
measured positions of m ≥ n landmarks of a rigid body in Rn , and let b1 , b2 , . . . , bm be the
measured positions after the body has been rotated. We seek an orthogonal matrix Q ∈ Rn×n
representing the rotation of the body; see Söderkvist and Wedin [1009, 1994]. This has important
applications in radiostereometric analysis (Söderkvist and Wedin [1008, 1993]) and subspace
alignment in molecular dynamics simulation of electronic structures.
In many applications it is important that Q correspond to a pure rotation, i.e., det(Q) = 1. If
det(U V T ) = 1, the optimal Q = U V T as before. Otherwise, if det(U V T ) = −1, the optimal
solution can be shown to be (see Hanson and Norris [590, 1981])

    Q = U ZV^T ,    Z = diag (1, . . . , 1, −1),

with det(Q) = +1. For this choice, Σ_{i=1}^{n} zii σi = trace (Σ) − 2σn . In both cases the optimal
solution can be written

Q = U ZV T , Z = diag (1, . . . , 1, det(U V T )).

In the analysis of rigid body movements, a translation vector c ∈ Rn is also involved. We
then have the model A = BQ + ec^T , e = (1, 1, . . . , 1)^T ∈ Rm . To estimate c ∈ Rn we solve
the problem

    min_{Q,c} ∥A − BQ − ec^T ∥F    subject to    Q^T Q = I,  det(Q) = 1.             (7.4.43)

For any Q, including the optimal Q not yet known, the best least squares estimate of c is charac-
terized by the condition that the residual be orthogonal to e. Multiplying by e^T we obtain

    e^T (A − BQ − ec^T ) = e^TA − (e^T B)Q − mc^T = 0,

where e^TA/m and e^T B/m are the mean values of the rows in A and B. Hence the optimal
translation is

    c = (1/m)( A^T e − Q^T (B^T e) ).                                                (7.4.44)

Substituting into (7.4.43) gives the problem min_Q ∥Ã − B̃Q∥F , where

    Ã = A − (1/m) e(e^TA),    B̃ = B − (1/m) e(e^T B),

which is a standard orthogonal Procrustes problem.
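
The whole procedure is summarized by the following NumPy sketch; the function name rigid_fit is ours, and B̃^T Ã is assumed to have full rank.

import numpy as np

def rigid_fit(A, B):
    # Fit A ~ B Q + e c^T with det(Q) = +1, following (7.4.42)-(7.4.44).
    At = A - A.mean(axis=0)                      # centered data A~
    Bt = B - B.mean(axis=0)                      # centered data B~
    U, s, Vt = np.linalg.svd(Bt.T @ At)          # SVD of B~^T A~
    Z = np.eye(U.shape[0])
    Z[-1, -1] = np.sign(np.linalg.det(U @ Vt))   # force a pure rotation
    Q = U @ Z @ Vt
    c = (A - B @ Q).mean(axis=0)                 # optimal translation (7.4.44)
    return Q, c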
A perturbation analysis of the orthogonal Procrustes problem is given by Söderkvist [1007,
1993]. If A ∈ Rm×n , B ∈ Rm×l , m > l, in (7.4.42), then the Procrustes problem is called
unbalanced. In this case, Q ∈ Rm×l is rectangular with orthonormal columns and no longer
satisfies trace (QTATAQ) = trace (ATA). Algorithms for this more difficult problem are given
by Park [879, 1991] and Eldén and Park [379, 1999]. Several other generalizations are treated in
the monograph by Gower and Dijksterhuis [521, 2004].
Chapter 8

Nonlinear Least Squares Problems

Anyone who deals with nonlinear problems knows that everything works sometimes
and nothing works every time.
—John E. Dennis, Jr.

8.1 Newton-Type Methods


8.1.1 Vector Space Calculus
Consider a function f : X → Y , where X and Y are normed vector spaces. The function is
continuous at the point x0 ∈ X if ∥f (x) − f (x0 )∥ → 0 as x → x0 . It satisfies a Lipschitz
condition in a domain D ⊂ X if a constant α, called a Lipschitz constant, can be chosen so that
∥f (x) − f (y)∥ ≤ α∥x − y∥
for all points x, y ∈ D. The function f is differentiable at x0 , in the sense of Fréchet, if there
exists a linear mapping A such that
∥f (x) − f (x0 ) − A(x − x0 )∥ = o(∥x − x0 ∥), x → x0 .
This linear mapping is called the Fréchet derivative of f at x0 , and we write A = f ′ (x0 ) or
A = fx (x0 ). Similar definitions apply to infinite-dimensional spaces. In the finite-dimensional
case, the Fréchet derivative is represented by the Jacobian, a matrix whose elements are the
partial derivatives ∂f i /∂xj . If vector-matrix notation is used, it is important to note that the
derivative g ′ of a real-valued function g is a row vector, because
g(x) = g(x0 ) + g ′ (x0 )(x − x0 ) + o(∥x − x0 ∥).
The transpose of g ′ (x) is called the gradient, or grad g.
Many results from elementary calculus carry over to vector space calculus, such as the rules
for the differentiation of products. The proofs are in principle the same. If z = f (x, y) with
x ∈ Rk , y ∈ Rℓ , z ∈ Rm , then the partial derivatives fx = ∂f /∂x, fy = ∂f /∂y with respect
to the vectors x, y are defined by the differential formula
dz = fx dx + fy dy ∀dx ∈ Rk , dy ∈ Rℓ . (8.1.1)
If x and y are functions of s ∈ Rn , then a general version of the chain rule reads
f ′ (x(s), y(s)) = fx x′ (s) + fy y ′ (s), (8.1.2)


where f ′ denotes the first derivative. It can be derived in the same way as for real-valued vari-
ables, and the extension to longer chains is straightforward.
Let f : Rk → Rk , k > 1, be a function, and consider the equation x = f (y). By formal
differentiation, dx = f ′ (y)dy, and we obtain dy = (f ′ (y))^{-1} dx, provided that the Jacobian
matrix f ′ (y) with elements (∂xi /∂yj ), 1 ≤ i, j ≤ k, is nonsingular. If f (x, y) = 0, then by
(8.1.2), fx dx + fy dy = 0. If fy (x0 , y0 ) is a nonsingular matrix, then y becomes, under certain
additional conditions, a differentiable function of x in a neighborhood of (x0 , y0 ), and we obtain
dy = −(fy )−1 fx dx; hence
y ′ (x) = −(fy )−1 fx |y=y(x) .
One can also show that

    lim_{ϵ→+0} ( f (x0 + ϵv) − f (x0 ) ) / ϵ = f ′ (x0 )v.
There are, however, functions f for which such a directional derivative exists for any v, but f is
not a linear function of v for some x0 . An important example is f (x) = ∥x∥∞ , where x ∈ Rn .
(Look at the case n = 2.)
Consider the set of k-linear mappings from vector spaces Xi = X, i = 1, . . . , k, which we
also write as X k , to Y . This is itself a linear space, which we here denote by Lk (X, Y ). For
k = 1 we write it more briefly as L(X, Y ). If f ′ (x) is a differentiable function of x at the
point x0 , its derivative is denoted by f ′′ (x0 ). This is a linear function that maps X into the space
L(X, Y ) of mappings from X to Y that contains f ′′ (x0 ), i.e., f ′′ (x0 ) ∈ L(X, L(X, Y )). This
space may be identified in a natural way with the space L2 (X, Y ) of bilinear mappings X 2 → Y .
If A ∈ L(X, L(X, Y )), then the corresponding Ā ∈ L2 (X, Y ) is defined by (Au)v = Ā(u, v)
for all u, v ∈ X. In the following it is not necessary to distinguish between A and Ā, so

f ′′ (x0 )(u, v) ∈ Y, f ′′ (x0 )u ∈ L(X, Y ), f ′′ (x0 ) ∈ L2 (X, Y ).

It can be shown that f ′′ (x0 ): X 2 → Y is a symmetric bilinear mapping, i.e., f ′′ (x0 )(u, v) =
f ′′ (x0 )(v, u). The second-order partial derivatives are denoted fxx , fxy , fyx , fyy . We can show
that fxy = fyx .
If X = Rn , Y = Rm , m > 1, then f ′′ (x0 ) reads f^p_{ij}(x0 ) = f^p_{ji}(x0 ) in tensor notation.
It is thus characterized by a three-dimensional array, which one rarely needs to store or write.
Fortunately, most of the numerical work can be done on a lower level, e.g., with directional
derivatives. For each fixed value of p, we obtain a symmetric n × n matrix H(x0 ) called the
Hessian matrix; note that f ′′ (x0 )(u, v) = uT H(x0 )v. The Hessian can be looked upon as the
derivative of the gradient. An element of the Hessian is, in multilinear mapping notation, the pth
coordinate of the vector f ′′ (x0 )(ei , ej ).
Higher derivatives are recursively defined. If f (k−1) (x) is differentiable at x0 , its derivative at
x0 is denoted by f (k) (x0 ) and called the kth derivative of f at x0 . One can show that f (k) (x0 ) :
X k → Y is a symmetric k-linear mapping. Taylor’s formula then reads, when a, u ∈ X,
f :X →Y,
    f (a + u) = f (a) + f ′ (a)u + (1/2) f ′′ (a)u^2 + · · · + (1/k!) f^{(k)} (a)u^k + Rk+1 ,    (8.1.3)

    Rk+1 = ∫_0^1 ( (1 − t)^k / k! ) f^{(k+1)} (a + ut) dt  u^{k+1} .

Here we have used u2 , uk , . . . as abbreviations for the lists of input vectors (u, u), (u, u, . . . , u),
. . . , etc. It follows that
    ∥Rk+1 ∥ ≤ max_{0≤t≤1} ∥ f^{(k+1)} (a + ut) ∥  ∥u∥^{k+1} / (k + 1)! ,

where norms of multilinear operators are defined analogously to subordinate matrix norms; see
(4.3.37). Such simplifications are often convenient to use. The mean value theorem of differential
calculus and Lagrange’s form for the remainder of Taylor’s formula do not hold, but in many
places they can be replaced by the above integral form of the remainder. All this holds in complex
vector spaces too.

8.1.2 The Gauss–Newton Method


The unconstrained nonlinear least squares (NLS) problem is to find a global minimizer of
    ϕ(x) = (1/2) Σ_{i=1}^{m} fi (x)^2 = (1/2) f (x)^T f (x),    m ≥ n,               (8.1.4)

where f (x) ∈ Rm and x ∈ Rn . Such problems arise, e.g., when fitting given data (yi , ti ),
i = 1, . . . , m, to a nonlinear model function g(x, t). If only yi are subject to errors, and the
values ti of the independent variable t are exact, we take

fi (x) = yi − g(x, ti ), i = 1, . . . , m. (8.1.5)

The choice of the least squares measure is justified here, as for the linear case, by statistical
considerations; see Bard [66, 1974]. The case when there are errors in both yi and ti is discussed
in Section 8.2.3.
The NLS problem (8.1.4) is a special case of the general unconstrained optimization problem
in Rn . Although ϕ(x) is bounded below, it is usually convex only near a local minimum. Hence,
solution methods will in general not be globally convergent. The methods are iterative in nature.
Starting from an initial guess x0 , a sequence of approximations x1 , x2 , . . . is generated that
ideally converges to a solution. Each iteration step usually requires the solution of a related
linear or quadratic subproblem.
In the following we assume that f (x) is twice continuously differentiable. Because of the
special structure of ϕ(x) in (8.1.4), the gradient ∇ϕ(x) of ϕ(x) has the special structure

∇ϕ(x) = (∂ϕ/∂x1 , . . . , ∂ϕ/∂xn )T = J(x)T f (x), (8.1.6)

where J(x) ∈ Rm×n is the Jacobian of f (x) with elements (∂fi (x)/∂xj ). Furthermore, the
Hessian of ϕ(x) is
    ∇^2 ϕ(x) = J(x)^T J(x) + G(x),    G(x) = Σ_{i=1}^{m} fi (x)Gi (x),               (8.1.7)

where Gi (x) ∈ Rn×n , i = 1, . . . , m, is the Hessian of fi (x), i.e., the symmetric matrix with
elements (∂ 2 fi (x)/∂xj ∂xk ).
A necessary condition for x∗ to be a local minimum of ϕ(x) is that ∇ϕ(x∗ ) = J(x∗ )T f (x∗ )
= 0. Such a point x∗ is called a critical point. Finding a critical point is equivalent to solving
the system of nonlinear algebraic equations

F (x) ≡ J(x)T f (x) = 0. (8.1.8)

The basic method for solving such a system is Newton’s method for nonlinear equations:

F ′ (xk )pk = −F (xk ), xk+1 = xk + pk . (8.1.9)



Newton’s method can attain a quadratic rate of convergence and is invariant under a linear trans-
formation of variables x = Sz; see Dennis and Schnabel [316, 1996]). With F (x) given by
(8.1.8), pk solves

    ( J(xk )^T J(xk ) + G(xk ) ) pk = −J(xk )^T f (xk ).                             (8.1.10)
The method can also be derived by using a quadratic approximation for the function ϕ(x) =
(1/2)∥f (x)∥2^2 and taking pk as the minimizer of

    ϕk (xk + p) = ϕ(xk ) + ∇ϕ(xk )^T p + (1/2) p^T ∇^2 ϕ(xk )p.                      (8.1.11)
This is Newton’s method for unconstrained optimization. It has several attractive properties. In
particular, if the Hessian ∇^2 ϕ(x) is positive definite at xk , then pk is a descent direction for
ϕ(x).
Newton’s method is seldom used for NLS because the mn2 second derivatives in the term
G(xk ) in (8.1.10) are rarely available at a reasonable cost. An exception is in curve-fitting
problems where the function values fi (x) = yi −g(x, ti ) and derivatives can be obtained from the
single function g(x, t). If g(x, t) is composed of simple exponential and trigonometric functions,
for example, then second derivatives can sometimes be computed cheaply.
In the Gauss–Newton method, f (x) is approximated in a neighborhood of xk by the linear
function f (x) = f (xk ) + J(xk )(x − xk ). Then the condition that x be a critical point can be
written
J(xk )T (f (xk ) + J(xk )(x − xk )) = 0.
The next approximation is taken as xk+1 = xk + pk , where pk solves the linear least squares
problem
    min_p ∥f (xk ) + J(xk )p∥2 .                                                     (8.1.12)

If J(xk ) has full column rank, then pk is uniquely determined by the condition

J(xk )T (J(xk )pk + f (xk )) = 0.

If xk is not a critical point, then pk = −J(xk )† f (xk ) is a descent direction. Then, for sufficiently
small α > 0, ∥f (xk + αpk )∥2 < ∥f (xk )∥2 . To verify this, note that

∥f (xk + αpk )∥22 = ∥f (xk )∥22 − 2α∥PJk f (xk )∥22 + O(α2 ), (8.1.13)

where PJk = J(xk )J † (xk ) is the orthogonal projector onto the range of J(xk ). Moreover, since
xk is not a critical point, it follows that PJk f (xk ) ̸= 0.
As the following simple example shows, the Gauss–Newton method may not be locally con-
vergent. Consider minimizing f1^2 (x) + f2^2 (x), where

    f1 (x) = x + 1,    f2 (x) = λx^2 + x − 1,                                        (8.1.14)

and λ is a parameter. The minimizer is x∗ = 0. The Gauss–Newton method gives xk+1 =
λxk + O(xk^2 ). Hence this method diverges when |λ| > 1. To ensure global convergence, Newton’s
method can be used with a line search: xk+1 = xk + αk pk , where pk is the Newton step (8.1.10).
The Gauss–Newton method is first-order invariant under a change of parametrization of the
least squares problem

    min_{z∈R^n} (1/2)∥f (x(z))∥2^2 ,

where x(z) is a function from R^n to R^n with a nonsingular Jacobian. To verify this property,
note that if px = −J^† (x)f (x) is the Gauss–Newton step in the original variables, then pz =
−(J(x)x′ )^† f (x) is the step after the change of variables, where x′ is the Jacobian of x(z). Then
px = x′ pz , which is the desired relation.
The Gauss–Newton method can also be thought of as arising from neglecting the term G(x)
in Newton’s method (8.1.10). This term is small if the quantities

|fi (xk )| ∥Gi (xk )∥, i = 1, . . . , m,

are small, i.e., if either the residuals fi (xk ), i = 1, . . . , m, are small or fi (x) is only mildly
nonlinear at xk . In this case the behavior of the Gauss–Newton method can be expected to be
similar to that of Newton’s method.
The Gauss–Newton method can be written as a fixed-point iteration

xk+1 = F (xk ), F (x) = x − J(x)† f (x). (8.1.15)

Assume there exists an x∗ such that J(x∗ )T f (x∗ ) = 0 and J(x∗ ) has full column rank. Then

F (x) = x − (J(x)T J(x))−1 J(x)T f (x),

and using the chain rule gives

    ∂F/∂xi = I − (J^T J)^{-1} J^T (∂f /∂xi ) − (J^T J)^{-1} (∂(J^T )/∂xi ) f − (∂((J^T J)^{-1})/∂xi ) J^T f,

i = 1, . . . , n. Here the second and third terms cancel, and the last term vanishes at x∗ because
J(x∗ )^T f (x∗ ) = 0. We obtain

    ∇F (x∗ ) = −(J(x∗ )^T J(x∗ ))^{-1} Σ_{i=1}^{m} fi (x∗ )Gi (x∗ ).                 (8.1.16)

Sufficient conditions for local convergence and error bounds for the Gauss–Newton method can
be obtained from Ostrowski’s fixed-point theorem; see Pereyra [890, 1967], Ortega and Rhein-
boldt [845, 2000, Theorem 10.1.3], and Dennis and Schnabel [316, 1996, Theorem 10.2.1].

Theorem 8.1.1. Let f (x) be a twice continuously Fréchet differentiable function. Assume there
exists x∗ such that J(x∗ )T f (x∗ ) = 0 and the Jacobian J(x∗ ) has full rank. Then the Gauss–
Newton iteration converges locally to x∗ if the spectral radius

ρ = ρ(∇F (x∗ )) < 1.

The asymptotic convergence is linear with rate bounded by ρ. In particular, the local convergence
rate is superlinear if f (x∗ ) = 0.

Wedin [1107, 1972] gives a geometrical interpretation of this convergence result. Minimizing
ϕ(x) = 21 ∥f (x)∥22 is equivalent to finding a point on the n-dimensional surface y = f (x) in Rm
closest to the origin. The normalized vector

w = −f (x∗ )/γ, γ = ∥f (x∗ )∥2 , (8.1.17)

is orthogonal to the tangent plane of the surface y = f (x∗ ) + J(x∗ )h, h ∈ Rn , at x∗ . The
normal curvature matrix of the surface with respect to w is the symmetric matrix

    K = (J(x∗ )^† )^T Gw (x∗ )J(x∗ )^† ,    Gw = Σ_{i=1}^{m} wi Gi (x∗ ).            (8.1.18)

Denote the eigenvalues of K by κ1 ≥ κ2 ≥ · · · ≥ κn . The quantities 1/κi , κi ̸= 0, are


the principal radii of curvature of the surface y = f (x) with respect to the normal w; see
Willmore [1126, 1959]. Since

(J(x∗ )T J(x∗ ))−1 = J(x∗ )† (J(x∗ )† )T ,

the matrix ∇F (x∗ ) has the same nonzero eigenvalues as γK, where γ is given as in (8.1.17). It
follows that
ρ = ρ(∇F (x∗ )) = γ max(κ1 , −κn ) = γ∥K∥2 . (8.1.19)
This relation indicates that the local convergence of the Gauss–Newton method is invariant under
a local transformation of the nonlinear least squares problem.
If J(x∗ ) has full column rank, then ∇^2 ϕ(x∗ ) = J^T (I − γK)J, where J = J(x∗ ). It follows
that x∗ is a local minimum of ϕ(x) if and only if I − γK is positive definite at x∗ . This is the
case when
1 − γκ1 > 0 (8.1.20)
or, equivalently, γ < 1/κ1 . Furthermore, if 1 − γκ1 ≤ 0, then ϕ(x) has a saddle point at x∗ ;
if 1 − γκn < 0, then ϕ(x) has a local maximum at x∗ . If x∗ is a saddle point, then κ1 ≥ 1.
Hence using the Gauss–Newton method, one is generally repelled from a saddle point, which is
an excellent property.

PJ (x) = J(x)J(x)† = J(x)(J(x)T J(x))−1 J(x)T

is the orthogonal projection onto the range space of J(x). Hence at a critical point x∗ it holds
that PJ (x∗ )f (x∗ ) = 0. The rate of convergence for the Gauss–Newton method can be estimated
during the iterations from

∥PJ (xk+1 )fk+1 ∥2 /∥PJ (xk )fk ∥2 ≤ ρ + O(∥xk − x∗ ∥22 ), (8.1.21)

where ρ is defined as in Theorem 8.1.1. (Recall that limxk →x∗ PJ (xk )fk = 0.) Since

PJ (xk )fk = J(xk )J(xk )† fk = −J(xk )pk ,

the cost of computing this estimate is only one matrix-vector multiplication. When the estimated
ρ is greater than 0.5 (say), one should consider switching to a method using second-derivative
information.
If the radius of curvature at a critical point satisfies 1/|κi | ≪ ∥f (x∗ )∥2 , the nonlinear least
squares problems will be ill-behaved, and many insignificant local minima may exist. Poor
performance of Gauss–Newton methods often indicates poor quality of the underlying model
or insufficient accuracy in the observed data. Then it would be better to improve the model
rather than use more costly methods of solution. Wedin [1109, 1974] shows that the estimate
(8.1.21) of the rate of convergence of the Gauss–Newton method is often a good indication of the
quality of the underlying model. Deuflhard and Apostolescu [318, 1980] call problems for which
divergence occurs “inadequate problems.” Many numerical examples leading to poor behavior
of Gauss–Newton methods are far from realistic; see Hiebert [609, 1981] and Fraley [430, 1989].
As the following simple example shows, the Gauss–Newton method may not even be locally
convergent. Consider minimizing f1^2 (x) + f2^2 (x), where

    f1 (x) = x + 1,    f2 (x) = λx^2 + x − 1,                                        (8.1.22)

and λ is a parameter. The minimizer is x∗ = 0. The Gauss–Newton method gives xk+1 =
λxk + O(xk^2 ). Hence this method diverges when |λ| > 1.

To ensure global convergence, a useful modification of the Gauss–Newton method is to use


a line search in which the next approximation is taken as

xk+1 = xk + αk pk ,

where αk > 0 is a step length to be determined. There are two common algorithms for choosing
αk ; see Ortega and Rheinboldt [845, 2000] and Gill, Murray, and Wright [476, 1981].
1. Armijo–Goldstein line search, where αk is taken to be the largest number in the sequence
1, 1/2, 1/4, . . . for which the inequality

    ∥f (xk )∥2^2 − ∥f (xk + αk pk )∥2^2 ≥ (1/2) αk ∥J(xk )pk ∥2^2

holds (notice that −J(xk )pk = PJk f (xk )).
2. “Exact” line search, i.e., taking αk as the solution to the one-dimensional minimization
problem

    min_{α>0} ∥f (xk + αpk )∥2^2 .                                                   (8.1.23)

Note that a solution αk may not exist, or there may be a number of local solutions.
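
A compact NumPy sketch of a damped Gauss–Newton iteration with the Armijo–Goldstein rule of item 1 is given below; the callables f and jac, returning the residual vector and the Jacobian, are assumed to be supplied by the user, and the tolerances are our choices.

import numpy as np

def gauss_newton(f, jac, x0, tol=1e-10, maxit=100):
    # Damped Gauss-Newton: solve (8.1.12) for p_k, then halve alpha until the
    # Armijo-Goldstein inequality holds.
    x = np.array(x0, dtype=float)
    for _ in range(maxit):
        r, J = f(x), jac(x)
        if np.linalg.norm(J.T @ r) <= tol:
            break
        p, *_ = np.linalg.lstsq(J, -r, rcond=None)
        alpha, Jp2 = 1.0, np.linalg.norm(J @ p) ** 2
        while (r @ r - f(x + alpha * p) @ f(x + alpha * p) < 0.5 * alpha * Jp2
               and alpha > 1e-12):
            alpha *= 0.5
        x = x + alpha * p
    return x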
A theoretical analysis of the Gauss–Newton method with exact line search has been given by
Ruhe [939, 1979]. The asymptotic rate of convergence is shown to be

ρe = γ(κ1 − κn )/(2 − γ(κ1 + κn )). (8.1.24)

A comparison with (8.1.19) shows that ρe = ρ if κn = −κ1 , and ρe < ρ otherwise. We also have
that γκ1 < 1 implies ρe < 1, i.e., we always get convergence close to a local minimum. This is
in contrast to the Gauss–Newton method, which may fail to converge to a local minimum.
Lindström and Wedin [751, 1984] develop an alternative line-search algorithm. In this, α is
chosen to minimize ∥p(α)∥2 , where p(α) approximates the curve f (α) = f (xk + αpk ) ∈ Rm .
One possibility is to choose p(α) to be the unique circle (in the degenerate case, a straight line)
determined by the conditions

    p(0) = f (0),    ∇p(0) = ∇f (0),    p(α0 ) = f (α0 ),

where α0 is a guess of the step length.


Ruhe also develops a way to obtain second-derivative information by using a nonlinear con-
jugate gradient acceleration of the Gauss–Newton method with exact line searches. This method
achieves quadratic convergence and often gives much faster convergence on difficult problems.
When exact line search is used, conjugate gradient acceleration amounts to a negligible amount
of extra work. However, for small-residual problems, exact line search is a waste of time, and
then a simpler damped Gauss–Newton method is superior.

8.1.3 Regularization and Trust-Region Methods


In practice it often happens that J(x) is numerically rank-deficient at an intermediate point xk .
Then a natural choice is to take pk as the pseudoinverse solution pk = −J † (xk )f (xk ) of (8.1.12);
see Ben-Israel [99, 1967], Fletcher [410, 1968], and Boggs [157, 1976]. Such a situation is
usually complicated by the difficulty in making decisions about the rank. The following example
shows that this may be critical. Let
   
    J = [ 1  0 ; 0  ϵ ],    f = [ f1 ; f2 ],

where 0 < ϵ ≪ 1 and f1 and f2 are of order unity. If J is considered to be of rank two, then the
search direction is pk = s1 , whereas if the assigned rank is one, pk = s2 , where
   
    s1 = −[ f1 ; f2 /ϵ ],    s2 = −[ f1 ; 0 ].

Clearly the two directions s1 and s2 are almost orthogonal, and s1 is almost orthogonal to the
gradient vector J T f . A strategy for estimating the rank of J(xk ) can be based on QR factor-
ization or SVD of J(xk ); see Section 2.5. Usually an underestimate of the rank is preferable,
except when f (x) is close to an ill-conditioned quadratic function.
An alternative approach that avoids the rank determination is to take xk+1 = xk + pk , where
pk is the unique solution to

    min_p ∥f (xk ) + J(xk )p∥2^2 + µk^2 ∥p∥2^2 ,                                     (8.1.25)

and µk > 0 is a regularization parameter. Then pk is defined even when Jk is rank-deficient. This
method was first used by Levenberg [736, 1944] and later modified by Marquardt [779, 1963]
and is therefore called the Levenberg–Marquardt method. The solution to problem (8.1.25)
satisfies
    ( J(xk )^T J(xk ) + µk^2 I ) pk = −J(xk )^T f (xk )

or, equivalently,

    min_p ∥ [ J(xk ) ; µk I ] p + [ f (xk ) ; 0 ] ∥2 ,                               (8.1.26)
and can be solved stably by QR factorization (or CGLS or LSQR).
The regularized Gauss–Newton method always takes descent steps. Hence it is locally con-
vergent on almost all nonlinear least squares problems, provided an appropriate line search is
carried out. We remark that as µk → 0+ , pk → −J(xk )^† fk , the pseudoinverse step. For large val-
ues of µ the direction pk becomes parallel to the steepest descent direction −J(xk )T fk . Hence
pk interpolates between the Gauss–Newton and steepest descent direction. This property makes
the Levenberg–Marquardt method preferable to damped Gauss–Newton for many problems.
A useful modification of the Levenberg–Marquardt algorithm is to change the penalty term
in (8.1.25) to ∥µk Dp∥2^2 , where D is a diagonal scaling matrix. A frequently used choice is D^2 =
diag (J(x0 )T J(x0 )). This choice makes the Levenberg–Marquardt algorithm scaling invariant,
i.e., it generates the same iterations if applied to f (Dx) for any nonsingular diagonal matrix D.
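
For illustration, a single Levenberg–Marquardt step computed from the augmented problem (8.1.26), with an optional diagonal scaling D, might be sketched in NumPy as follows (the function name and defaults are ours):

import numpy as np

def lm_step(J, r, mu, D=None):
    # Solve min_p || [J; mu*D] p + [r; 0] ||_2, cf. (8.1.26); D defaults to I.
    n = J.shape[1]
    D = np.eye(n) if D is None else D
    Jaug = np.vstack([J, mu * D])
    raug = np.concatenate([r, np.zeros(n)])
    p, *_ = np.linalg.lstsq(Jaug, -raug, rcond=None)
    return p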
From the discussion of problem LSQI in Section 3.5.3, it follows that the regularized least
squares problem (8.1.25) is equivalent to the least squares problem with a quadratic constraint,

    min_p ∥f (xk ) + J(xk )p∥2    subject to    ∥Dp∥2 ≤ ∆k ,                         (8.1.27)

for some ∆k > 0. If the constraint in (8.1.27) is binding, then pk is a solution to (8.1.25) for
some µk > 0. The set of feasible vectors p, ∥Dp∥2 ≤ ∆k in (8.1.27) can be thought of as a
region of trust for the linear model f (x) ≈ f (xk ) + J(xk )p, p = x − xk . There has been much
research on so-called trust-region methods based on the formulation (8.1.27) as an alternative
to a line-search strategy. Combined with a suitable active-set strategy, such a technique can be
extended to handle problems with nonlinear inequality constraints; see Lindström [749, 1983].
Several versions of the Levenberg–Marquardt algorithm have been proved to be globally con-
vergent; see, e.g., Fletcher [411, 1971] and Osborne [847, 1976]. A general description of scaled
trust-region methods for nonlinear optimization is given by Moré [806, 1978], [807, 1983]. Moré
proves that the algorithm will converge to a critical point x∗ if f (x) is continuously differentiable,
J(x) is uniformly continuous in a neighborhood of x∗ , and J(xk ) remains bounded.

Algorithm 8.1.1 (Trust-Region Method).


Given x0 , D, ∆, and β ∈ (0, 1)
for k = 0, 1, 2, . . . ,

1. Determine pk as a solution to the subproblem

    min_p ∥f (xk ) + J(xk )p∥2    subject to    ∥Dp∥2 ≤ ∆.

2. Compute the ratio between the actual and predicted reduction:

    ρk = ( ∥f (xk )∥2^2 − ∥f (xk + pk )∥2^2 ) / ( ∥f (xk )∥2^2 − ∥f (xk ) + J(xk )pk ∥2^2 ).

3. If ρk > β, set xk+1 = xk + pk ; otherwise, set xk+1 = xk .


4. Update the scaling matrix D and the trust-region radius ∆. If ρk ≥ 3/4, then ∆ is
increased by a factor of 2; if ρk < 1/4, then ∆ is reduced by a factor of 2.

To compute ρk stably in step 2, note that because pk satisfies (8.1.26), the predicted re-
duction can be computed from

    ∥f (xk )∥2^2 − ∥f (xk ) + J(xk )pk ∥2^2 = −2pk^T J(xk )^T f (xk ) − ∥J(xk )pk ∥2^2
                                            = 2µk^2 ∥Dpk ∥2^2 + ∥J(xk )pk ∥2^2 .

The ratio ρk measures the agreement between the linear model and the nonlinear function. If Jk
has full rank, then ρk → 1 as ∥pk ∥2 → 0. The parameter β in step 3 can be chosen quite small,
e.g., β = 0.0001.
Assume that the Jacobian J(x) is rank-deficient with constant rank r < n in a neighborhood
of a local minimum x∗ . Then the appropriate formulation of the problem is

    min_x ∥x∥2^2    subject to    ∥f (x)∥2^2 = min .                                 (8.1.28)

Boggs [157, 1976] notes that the choice pk = −J(xk )† f (xk ) gives the least-norm correction to
the linearized problem. Instead, pk should be taken as the least-norm solution to the linearized
problem
    min_p ∥xk + p∥2^2    subject to    ∥f (xk ) + J(xk )p∥2^2 = min .                (8.1.29)

This has the solution


pk = −J(xk )† f (xk ) − PN (Jk ) xk , (8.1.30)
where PN (Jk ) = I − J(xk )† J(xk ) is the orthogonal projector onto the nullspace of J(xk ).
Eriksson et al. [388, 2005] derive necessary and sufficient optimality conditions for the above
method. Applying Tikhonov regularization to problem (8.1.28) gives min_x ∥f (x)∥2^2 + µ^2 ∥x∥2^2
or, equivalently,

    min_x ∥ [ f (x) ; µx ] ∥2^2 .                                                    (8.1.31)
Linearization of (8.1.31) at xk gives the subproblem

    min_p ∥ [ J(xk ) ; µk I ] p + [ f (xk ) ; µk xk ] ∥2^2 .                         (8.1.32)

For µk > 0 this is a full-rank linear least squares problem. The unique solution pk^{Tik} can
be computed by QR factorization.

Lemma 8.1.2. The search directions pk^{Tik} computed in the Gauss–Newton method are related to
those in (8.1.30) by

    lim_{µk →0+} pk^{Tik} = pk .

This result implies that if µk → 0+ , the two Gauss–Newton methods have the same local
convergence properties.

8.1.4 Quasi-Newton Methods


Convergence of the Gauss–Newton method and trust-region methods can be slow for large-
residual problems and strongly nonlinear problems. These methods may also have difficulty
at points where the Jacobian is rank-deficient. Several methods have been suggested for partially
taking the second derivatives into account, either explicitly or implicitly.
In quasi-Newton methods an approximation to the second-derivative matrix is built up suc-
cessively from evaluations of the gradient. Many of those methods are known to possess superlin-
ear convergence. Let Sk−1 be a symmetric approximation to the Hessian at step k. The updated
Sk is required to approximate the curvature of f along xk − xk−1 , i.e., Sk (xk − xk−1 ) = yk ,
where

yk = ∇f (xk ) − ∇f (xk−1 ) = J(xk )T f (xk ) − J(xk−1 )T f (xk−1 ). (8.1.33)

This is called the quasi-Newton relation. We further require Sk to differ from Sk−1 by a matrix
of small rank. The search direction pk for the next step is then computed from

    Sk pk = −g(xk ),    g(xk ) = J(xk )^T f (xk ).                                   (8.1.34)

As a starting approximation, S0 = J(x0 )T J(x0 ) is usually recommended.


Ramsin and Wedin [910, 1977] gave the following rule for the choice between Gauss–Newton
and quasi-Newton methods based on the observed rate (8.1.21) of convergence ρ for the Gauss–
Newton method:
1. For ρ ≤ 0.5, Gauss–Newton is better.
2. For globally simple problems, quasi-Newton is better for ρ > 0.5.
3. For globally difficult problems, Gauss–Newton is much faster for ρ ≤ 0.7. But for larger
values of ρ, quasi-Newton is safer.
This can be used to construct a hybrid method with automatic switching between the two meth-
ods.
The application of quasi-Newton methods to the NLS problem as outlined above has not
been very efficient in practice. One reason is that often J(xk )T J(xk ) is the dominant part of
∇2 f (xk ), and this information is disregarded. A more successful approach, used by Dennis,
Gay, and Welsch [315, 1981], is to estimate ∇2 f (xk ) by J(xk )T J(xk ) + Bk , where Bk is a
symmetric quasi-Newton approximation to B(xk ). The quasi-Newton relation then becomes

(JkT Jk + Bk )(xk − xk−1 ) = yk ,

but it is preferable to use the alternative formula

Bk (xk − xk−1 ) = zk , zk = (J(xk ) − J(xk−1 ))T f (xk ). (8.1.35)



The solution to (8.1.35) that minimizes ∥Bk − Bk−1 ∥F is given by the rank-two update formula

    Bk = Bk−1 + ( wk zk^T + zk wk^T ) / (zk^T sk ) − (wk^T sk ) zk zk^T / (zk^T sk )^2 ,    (8.1.36)
where sk = xk − xk−1 and wk = zk − Bk−1 sk ; see Dennis and Schnabel [316, 1996, pp. 231–
232]. In some cases the update (8.1.36) gives inadequate results. This motivates the inclusion
of a “sizing” in which Bk is replaced by τk Bk , where τk = min{ |sk^T zk |/|sk^T Bk sk |, 1 }. This
heuristic ensures that Bk converges to zero for zero-residual problems, which has been shown to
improve the convergence behavior. Note that in the quasi-Newton step, Jk^T Jk + Bk may need to
be modified to be positive definite so that Cholesky factorization can be used.
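
In NumPy, the update (8.1.36) itself is only a few lines; the following sketch (function name ours) includes no sizing or positive definiteness safeguards.

import numpy as np

def dgw_update(B, s, z):
    # Rank-two update (8.1.36): input B = B_{k-1}, s = x_k - x_{k-1}, z = z_k.
    w = z - B @ s
    zs = z @ s
    return B + (np.outer(w, z) + np.outer(z, w)) / zs - (w @ s) * np.outer(z, z) / zs**2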
The update (8.1.36) is used by the popular subroutine NL2SOL by Dennis, Gay, and Welsch
[315, 1981] to maintain the approximation Bk and adaptively decide whether to use it or a Gauss–
Newton method. In each iteration, NL2SOL computes the reduction predicted by both models
and compares it with the observed actual reduction f (xk+1 ) − f (xk ). The next step uses the
model whose predicted reduction best approximates the actual reduction. Usually this causes
NL2SOL to use Gauss–Newton steps until the information in Bk becomes useful. To achieve
global convergence, a trust-region strategy is used.
In the quasi-Newton method of Gill and Murray [474, 1978], J(xk )T J(xk ) is regarded as a
good estimate of the Hessian in the invariant subspace corresponding to the large singular values
of J(xk ). The second-derivative term B(xk ) is taken into account only in the complementary
subspace. Let the SVD of Jacobian J = J(xk ) be partitioned as
  T 
Σ1 0 V1
J = ( U1 U2 ) , (8.1.37)
0 Σ2 V2T

where Σ1 = diag (σ1 , . . . , σp ) contains the large singular values of J and Σ2 = diag (σp+1 , . . . ,
σn ). Set B = B(xk ), and let s̄ denote the first n components of the vector s = U T f (xk ).
Equation (8.1.10) for the Newton direction can then be split into two sets. The first p equations
are
    (Σ1^2 + V1^T BV1 )q1 + V1^T BV2 q2 = −Σ1 s̄1 .

If the terms involving B = Bk can be neglected compared to Σ1^2 q1 , we get q1 = −Σ1^{-1} s̄1 .
Substituting this into the last (n − p) equations, we obtain

    (Σ2^2 + V2^T BV2 )q2 = −Σ2 s̄2 + V2^T BV1 Σ1^{-1} s̄1 .

The approximate Newton direction is taken to be pk = V1 q1 + V2 q2 . The split of the singular


values is updated at each iteration, and p is maintained close to n as long as sufficient progress is
made. A finite-difference approximation to V2T Bk V2 is obtained as follows. Let vj be a column
of V2 and h be a small positive scalar. Then, by differentiating the gradient along the columns of
V2 ,
(∇fi (xk + hvj ) − ∇fi (xk ))/h = vjT Gi (xk ) + O(h).
The vector on the left-hand side is the ith row of (J(xk + hvj ) − J(xk ))/h. Multiplying by
fi (xk ) and adding gives
m
X
f (xk )T (J(xk + hvj ) − J(xk ))/h = vjT fi (xk )Gi (xk ) + O(h)
i=1

= vjT Bk + O(h), j = p + 1, . . . , n.

This gives an approximation for V2T Bk , and then (V2T Bk )V2 can be formed.

Various other possibilities for hybrid Gauss-Newton/quasi-Newton methods are considered


by Al-Baali and Fletcher [14, 1985]. They use an approach in which the choice of method and
parameters is made by estimating the error in an inverse Hessian approximation. A Newton-like
search direction is computed from (8.1.34), where Sk is a symmetric positive definite approxi-
mation to the Hessian. This ensures that pk is a descent direction and makes it possible to use a
simpler line-search technique rather than a trust-region approach.
Al-Baali and Fletcher [15, 1986] determine αk to satisfy the conditions

    f (xk + αk pk ) ≤ f (xk ) + ραk g(xk )^T pk ,    ρ ∈ (0, 1/2),
    |g(xk + αk pk )^T pk | ≤ −σ g(xk )^T pk ,    σ ∈ (ρ, 1),
and suggest an algorithm to find such a point. Fletcher and Xu [413, 1987] develop a hybrid
method called HY2 in which a quasi-Newton step is taken when (f (xk ) − f (xk+1 ))/f (xk ) < ϵ;
otherwise, a Gauss–Newton step is taken. The quasi-Newton step uses (8.1.36) to update the
approximate Hessian but includes a safeguard to maintain positive definiteness. HY2 is superlin-
early convergent under mild conditions. A review of quasi-Newton methods for nonlinear least
squares is given by Eriksson [387, 1999].
Extending the quasi-Newton method to large sparse optimization problems has proved to be
difficult. For certain types of large, “partially separable” nonlinear least squares problems, a
promising approach is suggested by Toint [1066, 1987]. A typical case is when every function
fi (x) only depends on a small subset of the set of n variables. Then the Jacobian J(x) and
the element Hessian matrices Gi (x) are sparse, and it may be possible to store approximations
to all Gi (x), i = 1, . . . , m. An implementation is available as the Fortran subroutine VE10
in the Harwell Software Library; see Toint [1067, 1987]. Another subroutine suitable for such
problems is LANCELOT by Conn, Gould, and Toint [265, 1991], [266, 1992].
In a more general setting the solution to nonlinear least squares problems may be subject to
nonlinear equality constraints,

    min_x (1/2)∥f (x)∥2^2    subject to    h(x) = 0,                                 (8.1.38)

where x ∈ Rn , f (x) ∈ Rm , h ∈ Rp , and p < n. The Gauss–Newton method can be generalized


to constrained problems by linearization of (8.1.38) at a point xk . A search direction pk is then
computed as a solution to the linearly constrained problem

    min_p ∥f (xk ) + J(xk )p∥2    subject to    h(xk ) + C(xk )p = 0,                (8.1.39)

where J and C are the Jacobians for f (x) and h(x), respectively. This problem can be solved
by the methods described in Section 4.5. The search direction pk obtained from (8.1.39) can be
shown to be a descent direction for the merit function
ψ(x, µ) = ∥f (x)∥22 + µ∥h(x)∥22
at the point xk , provided µ is chosen large enough.

8.1.5 Inexact Gauss–Newton Methods


In large-scale applications, the Gauss–Newton linear subproblems

    min_p ∥J(xk )p + f (xk )∥2                                                       (8.1.40)

may be too costly to solve accurately. In any case, far from the solution x∗ , it may not be worth
solving these subproblems to high accuracy. To solve (8.1.40) for pk , a truncated inner iterative

method such as CGLS or LSQR can be applied. A class of inexact Newton methods for solving
a system of nonlinear equations F (x) = 0 is studied by Dembo, Eisenstat, and Steihaug [300,
1982]. A sequence {xk } of approximations is generated as follows. Given an initial guess x0 ,
set xk+1 = xk + pk , where pk satisfies

    F ′ (xk )pk = −F (xk ) + rk ,    ∥rk ∥2 / ∥F (xk )∥2 ≤ ηk < 1.                   (8.1.41)

Here {ηk } is a nonnegative forcing sequence used to control the level of accuracy. Taking
ηk = 0 gives the Newton method. Note that the requirement ηk < 1 is natural because ηk ≥ 1
would allow pk = 0.

Theorem 8.1.3 (Dembo, Eisenstat, and Steihaug [300, 1982, Theorem 3]). Assume that there
exists an x∗ such that F (x∗ ) = 0 with F ′ (x∗ ) nonsingular. Let F be continuously differentiable
in a neighborhood of x∗ , and let ηk ≤ η̂ < t < 1. Then there exists ϵ > 0 such that if
∥x∗ − x0 ∥2 ≤ ϵ, the sequence {xk } generated by (8.1.41) converges linearly to x∗ in the sense
that
    ∥xk+1 − x∗ ∥∗ ≤ t ∥xk − x∗ ∥∗ ,                                                  (8.1.42)
where the norm is defined by ∥y∥∗ ≡ ∥F ′ (x∗ )y∥2 .

First, we note that the exact Gauss–Newton method can be considered as an incomplete
Newton method for the equation F (x) = J(x)T f (x) = 0. This is of the form (8.1.41), where
pk satisfies J(xk )T J(xk )pk = −J(xk )T f (xk ) and

fk = (J(xk )T J(xk ) + G(xk ))pk + J(xk )T f (xk ) = G(xk )pk ,

where Gi (xk ) is the Hessian of fi (xk ). By Theorem 8.1.3 a sufficient condition for convergence
is that

    ∥ G(xk ) ( J(xk )^T J(xk ) )^{-1} ∥2 = tk ≤ t < 1.                               (8.1.43)

This is more restrictive than the condition given in Theorem 8.1.1.


A class of inexact Gauss–Newton methods can be defined as follows. Given an initial guess
x0 , set xk+1 = xk + pk , where pk satisfies

fk = J(xk )T (J(xk )pk + f (xk )), ∥fk ∥2 ≤ βk ∥J(xk )T f (xk )∥2 . (8.1.44)

The condition on ∥fk ∥2 is a natural stopping condition on an iterative method for solving the
linear subproblem. Gratton, Lawless, and Nichols [526, 2007] give conditions on the sequence
of tolerances {βk } needed to ensure convergence and investigate the use of such methods for
variational data assimilation in meteorology.

Theorem 8.1.4 (Gratton, Lawless, and Nichols [526, 2007, Theorem 5]). Let f (x) be a
twice continuously Fréchet differentiable function. Assume that there exists an x∗ such that
J T (x∗ )f (x∗ ) = 0 and J(x∗ ) has full column rank. Assume tk < β̂ < 1, where tk is given as in
(8.1.43). Assume βk , k = 0, 1, . . . , are chosen so that

0 ≤ βk ≤ (β̂ − tk )/(1 + tk ). (8.1.45)

Then there exists ϵ > 0 such that if ∥x∗ − x0 ∥2 ≤ ϵ, the sequence {xk } of the inexact Gauss–
Newton method (8.1.44) converges linearly to x∗ .

Note that the requirement tk < β̂ < 1 is the sufficient condition for convergence given for the
exact Gauss–Newton method. The more highly nonlinear the problem, the larger tk and the more
accurate the linear subproblems to be solved. Accelerated schemes for inexact Newton methods
using GMRES for large systems of nonlinear equations are given by Fokkema, Sleijpen, and van
der Vorst [414, 1998].

Notes and references


Dennis [314, 1977] gives an early insightful survey of methods for solving nonlinear least squares
and equations. Excellent standard references on numerical methods for optimization and sys-
tems of nonlinear equations are Dennis and Schnabel [316, 1996] and Nocedal and Wright [833,
2006]. Ortega and Rheinboldt [845, 2000] give a general treatment of theory and algorithms
for solving systems of nonlinear equations. General treatments of methods for nonlinear opti-
mization problems are given by Gill, Murray, and Wright [476, 1981] and Fletcher [412, 2000].
Schwetlick [979, 1992] surveys models and algorithms for general nonlinear parameter estima-
tion. Continuation methods are an alternative approach for solving difficult large-residual NLS
problems. Salane [962, 1987] develops such algorithms and shows they can be competitive with
trust-region algorithms.
Methods and software for the NLS problem are described by Wedin and Lindström [752,
1988]. A useful guide to optimization software is available at the NEOS Guide website. It is
constantly updated to reflect new packages and changes to existing software. A trust-region
algorithm that has proven to be very successful in practice is included in the Fortran software
package MINPACK available from netlib. For a user’s guide, see Moré, Garbow, and Hillstrom [808,
1980]. See also the PROC NLP Library of the SAS Institute. The Ceres nonlinear least squares
solver by Agarwal et al. (see http://ceres-solver.org) has been used since 2010 by Google
for the construction of three-dimensional models in its street-view sensor fusion.

8.2 Separable Least Squares Problems


8.2.1 Variable Projection Method
Suppose we want to solve a nonlinear least squares problem of the form

min_{y,z} ∥g − Φ(z)y∥_2^2,   Φ(z) ∈ R^{m×p},   (8.2.1)

where y ∈ R^p and z ∈ R^q are unknown parameters, and g ∈ R^m are given data.
For a fixed value of the nonlinear parameter z, problem (8.2.1) is a linear least squares problem
in y. Such least squares problems are called separable and arise frequently in applications. One
example is when one wants to approximate given data by a linear combination of given nonlinear
functions ϕj (z), j = 1, . . . , p.
A simple method for solving separable problems is the alternating least squares (ALS) al-
gorithm. Let z1 be an initial approximation, and solve the linear problem miny ∥g − Φ(z1 )y∥2
to obtain y1 . Next, solve the nonlinear problem minz ∥g − Φ(z)y1 ∥2 to obtain z2 . Repeat both
steps until convergence. The rate of convergence of ALS is linear and can be very slow. It does
not always converge.
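A bare-bones version of the ALS iteration can be written in a few lines of Python. The routine below assumes a user-supplied function Phi(z) returning the m × p matrix Φ(z) and uses a generic solver for the nonlinear z-step; it is a sketch for illustration, not a robust implementation.

    import numpy as np
    from scipy.optimize import least_squares

    def alternating_ls(Phi, g, z0, iters=20):
        """Alternate between the linear step in y and the nonlinear step in z."""
        z = np.asarray(z0, dtype=float)
        y = None
        for _ in range(iters):
            y, *_ = np.linalg.lstsq(Phi(z), g, rcond=None)          # solve for y
            z = least_squares(lambda zz: g - Phi(zz) @ y, z).x       # solve for z
        return y, z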
For a fixed value of z, the least-norm least squares solution of (8.2.1) can be expressed as

y(z) = Φ(z)† g, (8.2.2)

where Φ† is the pseudoinverse of Φ(z). In the variable projection method of Golub and
Pereyra [503, 1973], this is used to eliminate the linear parameters y, giving a reduced nonlinear

least squares problem


min_z ∥g − PΦ(z) g∥2,   PΦ(z) = Φ(z)Φ(z)†,   (8.2.3)

where PΦ(z) is the orthogonal projector onto the column space of Φ(z). This is a pure nonlinear
problem of reduced dimension. An important advantage is that initial values are only needed for
the nonlinear parameters z.
In order to solve (8.2.3) using a Gauss–Newton–Marquardt method, a formula for the gradi-

ent of the function f(z) = (I − PΦ(z))g = P⊥Φ(z) g in (8.2.3) is needed. The following lemma

gives an expression for the Fréchet derivative of the orthogonal projection matrix PΦ(z) . It must
be assumed that the rank of Φ(z) is locally constant, because otherwise the pseudoinverse would
not be differentiable.

Lemma 8.2.1 (Golub and Pereyra [503, 1973, Lemma 4.1]). Let Φ(z) ∈ Rm×p be a matrix
of local constant rank r and Φ† be its pseudoinverse. Denote by PΦ = ΦΦ† the orthogonal
projector onto R(Φ), and set P⊥Φ = I − PΦ. Then, using the product rule for differentiation, we
get
d/dz (PΦ) = −d/dz (P⊥Φ) = P⊥Φ (dΦ/dz) Φ† + ( P⊥Φ (dΦ/dz) Φ† )^T.   (8.2.4)
More generally, Golub and Pereyra show that (8.2.4) is valid for any generalized inverse that
satisfies ΦΦ− Φ = Φ and (ΦΦ− )T = ΦΦ− . Note that the derivative dΦ/dz in (8.2.4) is a three-
dimensional tensor with elements ∂ϕij /∂zk . The transposition in (8.2.4) is done in the (i, j)
directions for fixed k.
In many applications, each component function ϕj (z) depends on only a few of the parame-
ters z1 , . . . , zq . Hence the derivative will often contain many zeros. Golub and Pereyra develop a
storage scheme that avoids waste storage and computations. Let E = (eij ) be a q × p incidence
matrix such that eij = 1 if function ϕj depends on the parameter P zi , and 0 otherwise. This
incidence matrix can be retrieved from an m × p array B, p = i,j e(i, j), in which the nonzero
derivatives are stored sequentially.

Example 8.2.2. A problem of great importance is the fitting of a linear combination of exponen-
tial functions with different time constants,
g(t) = y0 + Σ_{j=1}^p yj e^{zj t},   (8.2.5)

to observations gi = g(ti )+ϵi , i = 0 : m. Since g(t) in (8.2.5) depends on p+1 linear parameters
yj and p nonlinear parameters zj , at least m = 2p + 1 observations for t0 , . . . , tm are needed.
Clearly this problem is separable, and ϕi,j (z; t) = ezj ti , j = 1, . . . , p. Here the number of
nonvanishing derivatives is p.
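For the exponential model (8.2.5), the variable projection idea is easy to sketch in Python: for each trial z the linear parameters are eliminated by a least squares solve, and only the reduced residual (I − PΦ(z))g is handed to a nonlinear solver. The Jacobian of the reduced problem is left to finite differences here, so the sketch corresponds to neither the full Golub–Pereyra Jacobian nor Kaufman's simplification; it only illustrates the elimination of the linear parameters.

    import numpy as np
    from scipy.optimize import least_squares

    def varpro_exponentials(t, g, z0):
        """Fit g(t) ~ y0 + sum_j y_j exp(z_j t) by variable projection."""
        def basis(z):
            return np.column_stack([np.ones_like(t)] + [np.exp(zj * t) for zj in z])

        def reduced_residual(z):
            Phi = basis(z)
            y, *_ = np.linalg.lstsq(Phi, g, rcond=None)   # eliminate linear parameters
            return g - Phi @ y                            # (I - P_Phi) g

        z = least_squares(reduced_residual, z0).x
        y, *_ = np.linalg.lstsq(basis(z), g, rcond=None)
        return y, z                                       # linear and nonlinear parameters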

The quantities required to solve the reduced nonlinear problem can be expressed in terms of
a complete QR decomposition
 
Φ = U [R 0; 0 0] V^T,   U = ( U1  U2 ),   (8.2.6)
where R ∈ Rr×r (r = rank(Φ)) is upper triangular and nonsingular, and U and V are orthogo-
nal. The solution to the linear least squares problem (8.2.1) is then y = Φ† g, where
Φ† = V [R^{-1}; 0] U1^T   (8.2.7)

is the generalized inverse. The orthogonal projectors onto the range of Φ and its orthogonal
complement are
PΦ = ΦΦ† = U1 U1T , PΦ⊥ = I − PΦ = U2 U2T . (8.2.8)
The least squares residual is r = PΦ⊥ g = U2 (U2T g), and its norm is
 
∥r∥2 = ∥U2(U2^T g)∥2 = ∥c2∥2,   U^T g = [c1; c2].   (8.2.9)

Denote by Dk = dΦ/dzk the m × p matrix of derivatives of Φ with respect to the single parameter zk. By Lemma 8.2.1, the kth column of the Jacobian of f(z) is

−( P⊥Φ Dk Φ† + (Φ†)^T Dk^T P⊥Φ ) g,   k = 1, . . . , q.

The kth column of the first part of the Jacobian becomes

−PΦ⊥ Dk Φ† g = −U2 (U2T (Dk y)), k = 1, . . . , q. (8.2.10)

The second part becomes

−(P⊥Φ Dk Φ†)^T g = −(Φ†)^T Dk^T P⊥Φ g = −U1 ( R^{-T} ( V^T (Dk^T r) ) ).   (8.2.11)

Both parts of each column can be computed using matrix–vector products and triangular solves. The second
part is somewhat more expensive to compute.
The variable projection approach reduces the dimension of the parameter space and leads to a
more well-conditioned problem. Furthermore, because no starting values have to be provided for
the linear parameters, convergence to a global minimum is more likely. Krogh [708, 1974] re-
ports that at the Jet Propulsion Laboratory (JPL) the variable projection algorithm solved several
problems that methods not using separability could not solve.
Kaufman [686, 1975] gave a simplified version of the variable projection algorithm that uses
an approximate Jacobian matrix obtained by dropping the second term in (8.2.4). This simplifica-
tion was motivated by the observation that the second part of the Jacobian is negligible when the
residual r is small. Kaufman’s simplification reduces the arithmetic cost per iteration by about
25%, with only a marginal increase in the number of iterations. Kaufman and Pereyra [688, 1978]
extend the simplified scheme to problems with separable equality constraints. Their approach is
further refined by Corradi [270, 1981].
Ruhe and Wedin [942, 1980] consider variable projection methods for more general separable
problems, where one set of variables can be easily eliminated. They show that the asymptotic rate
of convergence of the variable projection method is essentially the same as for the Gauss–Newton
method applied to the full problem. Both converge quadratically for zero-residual problems,
whereas ALS always converges linearly.
Several implementations of the variable projection method are available. An improved ver-
sion of the original program VARPRO was given by John Bolstad in 1977. It uses Kaufman’s
modification, allows for weights on the observations, and also computes the covariance ma-
trix. A version called VARP2 by LeVeque handles multiple right-hand sides. Both VARPRO
and VARP2 are available in the public domain at http://www.netlib.org/opt/. Another
implementation was written by Linda Kaufman and David Gay for the Port Library. A well-
documented implementation in MATLAB written by O’Leary and Rust [839, 2013] uses the full
Jacobian as in the original Golub–Pereyra version. The motivation is that in many current appli-
cations, the increase in the number of function evaluations in Kaufman’s version outweighs the
savings gained from using an approximate Jacobian.

Notes and references


The variable projection method is an extension of ideas first presented by Scolnik [982, 1972]
and Guttman, Pereyra, and Scolnik [557, 1973]. Osborne [846, 1975] showed how to eliminate
the linear parameters in separable problems. Golub and LeVeque [497, 1979] extend variable
projection algorithms to the case when several data sets are to be fitted to a model with the
same nonlinear parameter vector; see also Kaufman and Sylvester [689, 1992]. A review of
developments and applications of the variable projection approach for separable nonlinear least
squares problems is given by Golub and Pereyra [504, 2003].

8.2.2 Bilinear Least Squares Problems


Given data Ai ∈ Rp×q and bi , i = 1, . . . , m, the bilinear least squares (BLSQ) problem is to
determine parameters x ∈ Rp and y ∈ Rq that minimize
Σ_{i=1}^m (x^T Ai y − bi)^2,   m ≥ p + q.   (8.2.12)

This is a nonlinear least squares problem with objective function

f(x, y) = ∥r(x, y)∥_2^2,   ri(x, y) = bi − x^T Ai y,   i = 1, . . . , m,   (8.2.13)

that is separable in each of the variables x and y. In system theory, the identification of a Ham-
merstein model leads to a BLSQ problem; see Wang, Zhang, and Ljung [1100, 2009]. Related
multilinear problems are used in statistics, chemometrics, and tensor regression.
The data matrices Ai form a three-dimensional tensor A ∈ Rm×p×q with elements ai,j,k .
Slicing the tensor in the two other possible directions, we obtain matrices

Rj ∈ Rm×q , j = 1, . . . , p, Sk ∈ Rm×p , k = 1, . . . , q. (8.2.14)

The BLSQ problem is linear in each of the variables x and y. If (x, y) is a solution to problem
BLSQ, x solves the linear least squares problem
min_x ∥Bx − b∥2,   B = Σ_{k=1}^q yk Sk ∈ R^{m×p},   (8.2.15)

where the matrices Sk are given as in (8.2.14). Similarly, y solves


min_y ∥Cy − b∥2,   C = Σ_{j=1}^p xj Rj ∈ R^{m×q}.   (8.2.16)

We deduce that the Jacobian of r(x, y) is

J(x, y) = ( Jx Jy ) = ( B C ) ∈ Rm×(p+q) . (8.2.17)

For α ̸= 0, the residuals ri (x, y) = bi − xTAi y of the bilinear problem (8.2.12) are invariant
under the scaling (αx, α−1 y) of the variables. This shows that the bilinear problem (8.2.12) is
singular. The singularity can be handled by imposing a quadratic constraint ∥x∥2 = 1. Alterna-
tively, a linear constraint xi = ei^T x = 1 for some i, 1 ≤ i ≤ p, can be used. For convenience,
we assume in the following that i = 1 (x1 = 1) and use the notation
 
x = [1; x̄] ∈ R^p,   x̄ ∈ R^{p−1}.


The residual r̄(x̄, y) of the constrained problem is r(x, y) with x given as above. The Jacobian
of the constrained problem is

J̄ = J̄(x̄, y) = ( J̄x̄  Jy ) ∈ R^{m×(p+q−1)},

where J¯x̄ is Jx , with the first column deleted. The Hessian is


H(x, y) = J̄^T J̄ + [0  Ār; Ār^T  0],   Ār = Σ_{i=1}^m ri(x, y) Āi ∈ R^{(p−1)×q},   (8.2.18)

where Āi is Ai with the first row deleted.


In some applications the following simple alternating least squares (ALS) algorithm is often
used. For a chosen initial approximation y, the matrix B in (8.2.15) is formed, and the subprob-
lem for x is solved and normalized by setting ∥x∥2 = 1. Next, using the computed approximation
for x, the matrix C in (8.2.16) is formed, and the subproblem is solved for y. This is repeated
until convergence. The linear subproblems are solved by QR or SVD. The arithmetic cost per
iteration for ALS is 4pqm flops for forming B and C, and about 2m(p2 + q 2 ) flops for solving
the two linear problems. In general, the ALS algorithm converges slowly and is not guaranteed
to converge to a local minimizer; see Ruhe and Wedin [942, 1980].
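The ALS iteration for (8.2.12) is easily sketched in Python if the data matrices Ai are stored as a tensor of shape (m, p, q); the normalization ∥x∥2 = 1 is used to fix the scaling. This is an illustrative sketch only and inherits the slow convergence discussed above.

    import numpy as np

    def blsq_als(A, b, x0, y0, iters=50):
        """ALS for min sum_i (x^T A_i y - b_i)^2 with A stored as (m, p, q) tensor."""
        x, y = np.asarray(x0, float), np.asarray(y0, float)
        for _ in range(iters):
            B = np.einsum('ipq,q->ip', A, y)          # row i of B is (A_i y)^T
            x, *_ = np.linalg.lstsq(B, b, rcond=None)
            x /= np.linalg.norm(x)                    # fix the scaling invariance
            C = np.einsum('ipq,p->iq', A, x)          # row i of C is (A_i^T x)^T
            y, *_ = np.linalg.lstsq(C, b, rcond=None)
        return x, y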
Several nonlinear least squares algorithms can be used to solve BLSQ problems, such as the
damped Gauss–Newton method. There are two possible versions of the variable projection (VP)
method. In VPx, x is eliminated, and a nonlinear least squares problem for y is solved. In VPy,
the roles of x and y are interchanged. Newton’s method with the Hessian given by (8.2.18) can
also be applied. The VP methods have ultimately quadratic convergence, and the arithmetic cost
per iteration is about the same as for ALS.
Hybrid algorithms can also be considered. For example, because Newton’s method is sen-
sitive to the initial approximation, it is best used in combination with a Gauss–Newton method.
Switching to Newton’s method in the later stages can considerably improve the rate of conver-
gence. The ALS method converges slowly but may be used to initialize other methods.
Different nonlinear least squares algorithms for solving BLSQ problems are analyzed and
evaluated by Eldén and Ahmadi-Asl [377, 2018]. They note that the choice of which component
of x (or y) to set equal to 1 can make a significant difference in the speed of convergence,
especially if the problem is ill-conditioned. Based on a perturbation analysis, they recommend
that the constraint be chosen so that the condition number of the Jacobian J¯x is minimized. Their
numerical experiments indicate that the VPx method, possibly combined with Newton’s method
for problems with slower convergence rate, is the method of choice.

8.2.3 Orthogonal Distance Regression


Given a curve described by the parametric function

y = f (x, β) ∈ Rq , x ∈ Rn , β ∈ Rp , (8.2.19)

let yi and xi , i = 1, . . . , m > p, be observations of points on the curve subject to independent


random errors ϵ̄i and δ̄i with zero mean and variance σ 2 ,

yi + ϵ̄i = f (xi + δ̄i , β), i = 1, . . . , m. (8.2.20)

In orthogonal distance regression (ODR) the parameters β are determined by minimizing the sum
of squares of the orthogonal distances from the observations (xi , yi ) to the curve y = f (x, β); see
Figure 8.2.1. If δi and ϵi do not have constant covariance matrices, then weighted norms should
be substituted above. If the errors in the independent variables are small, then ignoring these
errors will not seriously degrade the estimates of x. Independent of statistical considerations, the
orthogonal distance measure has natural applications in fitting data to geometrical elements.

Figure 8.2.1. Orthogonal distance regression for q = n = 1.
In linear orthogonal distance regression one wants to fit m > n given points yi ∈ Rn ,
i = 1 : m, to a hyperplane
M = {z | cT z = d}, z, c ∈ Rn , ∥c∥2 = 1, (8.2.21)
where c is the unit normal vector of M , and |d| is the orthogonal distance from the origin to the
plane in such a way that the sum of squares of the orthogonal distances from the given points
to M is minimized. For n = 1 this problem was studied in 1878 by Adcock [7, 1878]. The
orthogonal projection zi of the point yi onto M is given by
zi = yi − (cT yi − d)c. (8.2.22)
It is readily verified that zi lies on M and that the residual zi − yi is parallel to c and hence or-
thogonal to M . Hence, the problem of minimizing the sum of squares of the orthogonal distances
is equivalent to minimizing
Σ_{i=1}^m (c^T yi − d)^2   subject to   ∥c∥2 = 1.

If we put Y T = (y1 , . . . , ym ) ∈ Rn×m and e = (1, . . . , 1)T ∈ Rm , this problem can be written
in matrix form as

min_{c,d} ∥ ( −e  Y ) [d; c] ∥_2   subject to   ∥c∥2 = 1.   (8.2.23)
For a fixed c, this expression is minimized when the residual vector Y c − de is orthogonal to e,
i.e., when eT (Y c − de) = eT Y c − deT e = 0. Since eT e = m, it follows that
d = (1/m) c^T Y^T e = c^T ȳ,   ȳ = (1/m) Y^T e,   (8.2.24)
where ȳ is the mean of the given points yi. Hence d is determined by the condition that the mean
ȳ lies on the optimal plane M . Note that this property is shared by the solution to the usual linear
regression problem.
We now subtract the mean value ȳ from each given point and form the matrix
Ȳ T = (ȳ1 , . . . , ȳm ), ȳi = yi − ȳ, i = 1 : m.

From (8.2.24) it follows that


 
( −e  Y ) [d; c] = Y c − e ȳ^T c = (Y − e ȳ^T) c = Ȳ c.
Hence problem (8.2.23) is equivalent to
min_c ∥Ȳ c∥2   subject to   ∥c∥2 = 1.

By the min-max characterization of the singular values, a solution is c = vn , where vn is a right


singular vector of Ȳ corresponding to the smallest singular value σn . We further have
d = vn^T ȳ,   Σ_{i=1}^m (vn^T yi − d)^2 = σn^2.   (8.2.25)

Orthogonalizing the shifted points ȳi against vn and adding back the mean value, we get the
fitted points
zi = ȳi − (vnT ȳi )vn + ȳ ∈ M. (8.2.26)
The linear orthogonal regression problem always has a solution. The solution is unique when
σn−1 > σn , and the minimum sum of squares is σn2 . Moreover, σn = 0 if and only if the
given points yi , i = 1 : m, all lie on the hyperplane M . In the extreme case when all points
coincide, then Ȳ = 0, and any plane going through ȳ is a solution. The above method solves
the problem of fitting an (n − 1)-dimensional linear manifold to a given set of points in Rn . It
is readily generalized to fitting an (n − p)-dimensional linear manifold by orthogonalizing the
shifted points ȳi against the p right singular vectors of Ȳ corresponding to the p smallest singular
values.
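The computation just described takes only a few lines in Python; the sketch below returns the unit normal c, the offset d, and the minimal sum of squared distances, with the data points stored as the rows of Y.

    import numpy as np

    def fit_hyperplane(Y):
        """Orthogonal regression hyperplane c^T z = d for points in the rows of Y."""
        ybar = Y.mean(axis=0)
        _, s, Vt = np.linalg.svd(Y - ybar, full_matrices=False)
        c = Vt[-1]                    # right singular vector of the smallest singular value
        d = c @ ybar                  # the plane passes through the mean, cf. (8.2.24)
        return c, d, s[-1]**2         # normal, offset, minimal sum of squares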
In the nonlinear case (8.2.19), we first assume that x ∈ Rn , y ∈ R. Then the parameters
β ∈ Rp should be chosen as the solution to
min_{β,ϵ,δ} Σ_{i=1}^m (ϵi^2 + δi^2)   subject to   yi + ϵi = f(xi + δi, β),   i = 1, . . . , m.

This is a constrained least squares problem of special form. By using the constraints to eliminate
ϵi, the ODR problem can be formulated as an NLS problem in the parameters δ = (δ1, . . . , δm)
and β:
min_{β,δ} Σ_{i=1}^m ( (f(xi + δi, β) − yi)^2 + δi^2 ).   (8.2.27)

Note that even when f (x, β) is a linear function of β, this is a nonlinear least squares problem.
If we define the residual vector r = (r1T , r2T )T by
r1i (δ, β) = f (xi + δi , β) − yi , r2i (δ) = δi ,
then (8.2.27) is a standard NLS problem of the form minx,δ ∥r(x, δ)∥22 . The corresponding
Jacobian matrix has the block structure
 
J = [D  0; V  J] ∈ R^{(mn+m)×(mn+p)}.   (8.2.28)
Here D > 0 is a diagonal matrix of order mn reflecting the variance in δi ,
V = diag (v1^T, . . . , vm^T) ∈ R^{m×mn}, and

vi^T = ∂f(xi + δi, β)/∂xi,   Jij = ∂f(xi + δi, β)/∂βj,   i = 1, . . . , m,   j = 1, . . . , p.

Note that J is sparse and highly structured. In applications, usually mn ≫ p, and then account-
ing for the errors δi in xi will considerably increase the size of the problem. Therefore, the use
of standard NLS software to solve orthogonal distance problems is not efficient or even feasible.
By taking the special structure of (8.2.27) into account, the work in ODR can be reduced to only
slightly more than for an ordinary least squares fit of β.
In the Gauss–Newton method, corrections ∆δk and ∆βk to the current approximations δk
and βk are obtained from the linear least squares problem
   
min_{∆δ,∆β} ∥ J [∆δ; ∆β] − [r1; r2] ∥_2,   (8.2.29)

where J , r1 , and r2 are evaluated at δk and βk . A stable way to solve this problem is to compute
the QR decomposition of J . First, a sequence of plane rotations is used to zero the (2,1) block
V of J . The rotations are also applied to the right-hand side vector. We obtain
     
U K r1 t
Q1 J = , Q1 = , (8.2.30)
0 J¯ r2 r̄2

where U is a block diagonal matrix with m upper triangular blocks of size n × n. Problem
(8.2.29) now decouples, and ∆β is determined as the solution to
min_{∆β} ∥J̄ ∆β − r̄2∥2,

where J̄ ∈ R^{m×p}. This second step has the same complexity as computing the Gauss–Newton
correction in the classical NLS problem. When ∆β has been determined, ∆δ is obtained by
back-substitution:
U ∆δ = u1 , u1 = t − K∆β. (8.2.31)
More generally, when y ∈ Rq and x ∈ Rn , the ODR problem has the form
min_{β,δ} Σ_{i=1}^m ( ∥f(xi + δi, β) − yi∥_2^2 + ∥δi∥_2^2 ).   (8.2.32)

When q > 1, the Jacobian (8.2.28) has a similar structure with

V = diag (V1T , V2T , . . . , VmT ), ViT ∈ Rq×n .

After an orthogonal reduction to upper triangular form, the reduced Jacobian has the same block
structure as in (8.2.30). However, now the (1,1) block is block diagonal of order mn with upper
triangular matrices of size n × n on the diagonal, K ∈ R^{mn×p}, and J̄ ∈ R^{mq×p}.
In practice a trust-region stabilization of the Gauss–Newton step should be used, and then we
need to solve (8.2.29) with
Jµ = [D  0; V  J; µT  0; 0  µS],   r = [r1; r2; 0; 0],   (8.2.33)
with several different values of the parameter µ, where S and T are nonsingular diagonal ma-
trices. An orthogonal reduction to upper triangular form would result in a matrix with the same
block structure as in (8.2.30), where D ∈ Rmn×mn is block diagonal with upper triangular ma-
trices of size n × n on the diagonal. Therefore, if done in a straightforward way, this reduction
does not take full advantage of the structure in the blocks and is not efficient.

For the case q = 1, Boggs, Byrd, and Schnabel [158, 1987] show how the computations in a
trust-region step can be carried out so that the cost is of the same order as for a standard least
squares fit of β. Computing a QR factorization of Jµ from scratch would require O((mn + p)^2)
operations for each value of µ. For this reason the ∆δ variables are instead eliminated as outlined
above by using the normal equations, combined with the Woodbury formula (3.3.9). The reduced
normal equations for ∆β can then be interpreted as the normal equations of the least squares
problem min ∥J̃ ∆β − r̃2∥2, where

J̃ = [M J; µS],   r̃2 = [−M(r1 − V E^{-2} D r2); 0],

E^2 = D^2 + µ^2 T^2, and M is a diagonal matrix for which an explicit expression in terms of E
and V is given in [158, 1987]. A software package ODRPACK for orthogonal distance regression
based on this algorithm is described in Boggs et al. [159, 1989].
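ODRPACK is accessible from Python through the scipy.odr module, which handles explicit models y = f(x, β) with errors in both variables. The quadratic test model, the noise levels, and the starting values in the sketch below are illustrative assumptions.

    import numpy as np
    from scipy.odr import ODR, Model, RealData

    def f(beta, x):                          # model y = b0 + b1*x + b2*x^2
        return beta[0] + beta[1] * x + beta[2] * x**2

    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 40) + 0.01 * rng.standard_normal(40)   # noisy x
    y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.05 * rng.standard_normal(40)   # noisy y

    out = ODR(RealData(x, y), Model(f), beta0=[1.0, 1.0, 0.0]).run()
    print(out.beta)                          # estimated parameters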
Schwetlick and Tiller [980, 1985], [981, 1989] develop an ODR algorithm based on the
Gauss–Newton method with a special Marquardt-type regularization for which similar simplifi-
cations can be achieved. The path
p(µ) := −(∆δ(µ)^T, ∆β(µ)^T)^T
is shown to be equivalent to a trust-region path defined by a nonstandard scaling matrix, and the
step is controlled in trust-region style.
An algorithm for computing the trust-region step based on QR factorization of Jµ is given
by Björck [136, 2002]. Because only part of U and Q1 need to be computed and saved, this
has the same leading arithmetic cost as the normal equation algorithm by Boggs et al. By taking
advantage of the special structure of Jµ , only part of the nonzero elements in the factors Q and
R need to be stored. In the first step of the factorization, plane rotations are used to merge the
two diagonal blocks D and µT . After a permutation of the last two block rows, we obtain
   
G [D  0  r1; V  J  r2; 0  µS  0; µT  0  0] = [Dµ  0  r̃1; V  J  r2; 0  µS  0; 0  0  r4],   (8.2.34)

where Dµ = (D2 + µ2 T 2 )1/2 is diagonal. The rotations are also applied to the right-hand side.
This step does not affect the second block column and the last block row in Jµ . The key step is
orthogonal triangularization of the first block column in (8.2.34). This will not affect the last two
block rows and can be carried out efficiently if full advantage is taken of the structure of blocks
Dµ and V .
A nonlinear least squares problem with structure similar to ODR arises in structured total
least squares problems. Given the data matrix A and the vector b, the TLS problem is to find E
and x to solve
min ∥(E r)∥F such that r = b − (A + E)x. (8.2.35)

If A is sparse, we may want E to have the same sparsity structure. Rosen, Park, and Glick [936,
1996] impose an affine structure on E by defining a matrix X such that Xδ = Ex, where δ ∈ Rq
is a vector containing the nonzero elements of E, and the elements of X ∈ Rm×q consist of the
elements of the vector x with a suitable repetition. Then the problem can be written as a nonlinear
least squares problem
 
min_{δ,x} ∥ [δ; r(δ, x)] ∥_2,   r(δ, x) = Ax − b + Xδ.   (8.2.36)

When E is general and sparse, the structure of the Jacobian of r with respect to δ will be similar
to that in the ODR problem.

8.2.4 Fitting of Circles and Ellipses


A special nonlinear least squares problem that often arises is to fit given data points to a geo-
metrical element, which may be defined implicitly. We have already discussed fitting data to an
affine linear manifold such as a line or a plane. Gander and von Matt [440, 1997] show how to
fit rectangles and squares using simple generalizations of this algorithm. The problem of fitting
circles, ellipses, spheres, and cylinders arises in applications in computer graphics, coordinate
meteorology, and the aircraft industry.
Least squares algorithms to fit a curve in the x, y-plane implicitly defined by the scalar func-
tion f (x, y, p) = 0 can be divided into two classes. In algebraic fitting, the least squares
functional
ϕ(p) = Σ_{i=1}^m f(xi, yi, p)^2   (8.2.37)

is minimized and it directly involves the function f (x, y, p); see Pratt [905, 1987]. In geometric
fitting, a least squares functional is minimized and involves the geometric distances from the
data points to the curve defined by f (x, y, p) = 0,
min_p Σ_{i=1}^m di^2(p),   di^2(p) = min_{f(x,y,p)=0} ( (x − xi)^2 + (y − yi)^2 ),   (8.2.38)

where di (p) is the orthogonal distance from the data point (xi , yi ) to the curve. This is similar
to orthogonal distance regression described for an explicitly defined function y = f (x, β) in
Section 8.2.3. However, for implicitly defined functions the calculation of the distance function
di (p) is more complicated.
Algebraic fitting often leads to a simpler problem, in particular when f is linear in the pa-
rameters p. Methods for geometrical fitting are slower but give better results both conceptually
and visually. Implicit curve fitting problems, where a model h(y, x, t) = 0 is to be fitted to
observations (yi , ti ), i = 1, . . . , m, can be formulated as a special least squares problem with
nonlinear constraints:

min_{x,z} ∥z − y∥_2^2   subject to   h(zi, x, ti) = 0.

This is a special case of problem (8.1.38). It has n + m unknowns x and z, but the sparsity of
the Jacobian matrices can be taken advantage of; see Lindström [750, 1984].
We first discuss algebraic fitting of circles and ellipses in two dimensions. A circle has three
degrees of freedom and can be represented algebraically by

f (x, y, p) = a(x2 + y 2 ) + b1 x + b2 y − c = 0,

where p = (a b1 b2 c)T . Let S be an m × 4 matrix with rows

sTi = (x2i + yi2 xi yi − 1), i = 1, . . . , m.

The algebraic fitting problem (8.2.37) of a circle can then be formulated as

min_p ∥Sp∥_2^2   subject to   p^T e1 = a = 1.   (8.2.39)

The constraint is added because p is only defined up to a constant multiple. This problem is
equivalent to the linear least squares problem

min_q ∥S2 q − s1∥2,   S = ( s1  S2 ).   (8.2.40)

For plotting, the circle can more conveniently be represented in parametric form,
     
[x(θ); y(θ)] = [xc; yc] + r [cos θ; sin θ],

where the center (xc yc )T and radius r of the circle can be expressed in terms of p as
   
[xc; yc] = −(1/(2a)) [b1; b2],   r = (1/(2a)) √( ∥b∥_2^2 + 4ac ).   (8.2.41)
From this expression we see that the constraint 2ar = 1 can be written as a quadratic constraint
b1^2 + b2^2 + 4ac = p^T Cp = 1, where

p = [a; b1; b2; c],   C = [0 0 0 2; 0 1 0 0; 0 0 1 0; 2 0 0 0].
This will guarantee that a circle is fitted. (Note that equality can be used in the constraint be-
cause of the free scaling.) The matrix C is symmetric but not positive definite (its eigenvalues
are −2, 1, 1, 2). We discuss the handling of such quadratic constraints later when dealing with
ellipses.
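With the linear constraint a = 1, the algebraic circle fit amounts to one dense least squares solve followed by the conversion (8.2.41). The short Python sketch below works directly with the representation x^2 + y^2 + b1 x + b2 y − c = 0 and is meant only as an illustration.

    import numpy as np

    def fit_circle_algebraic(x, y):
        """Algebraic circle fit with the normalization a = 1."""
        S2 = np.column_stack([x, y, -np.ones_like(x)])
        b1, b2, c = np.linalg.lstsq(S2, -(x**2 + y**2), rcond=None)[0]
        xc, yc = -b1 / 2.0, -b2 / 2.0                 # center, cf. (8.2.41) with a = 1
        r = 0.5 * np.sqrt(b1**2 + b2**2 + 4.0 * c)    # radius
        return xc, yc, r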
An ellipse in the x, y-plane can be represented algebraically by
     
f(x, y, p) = ( x  y ) A [x; y] + ( b1  b2 ) [x; y] − c = 0,   A = [a11 a12; a12 a22].   (8.2.42)

Following Varah [1089, 1996], we define

p = (a11  a12  a22  b1  b2  c)^T,   si^T = (xi^2  2xi yi  yi^2  xi  yi  −1).

Then the objective function is Φ(p) = ∥Sp∥22 , where S is an m×6 matrix with rows sTi . Because
the parameter vector p is only determined up to a constant factor, the problem formulation must
be completed by including some constraint on p. Three such constraints have been considered.

(a) SVD constraint:

min_p ∥Sp∥_2^2   subject to   ∥p∥2 = 1.   (8.2.43)

The solution of this constrained problem is the right singular vector corresponding to the smallest
singular value of S.

(b) Linear constraint:

min_p ∥Sp∥_2^2,   where d^T p = 1,   (8.2.44)

where d is a fixed vector with ∥d∥2 = 1. Let H be an orthogonal matrix such that Hd = e1 . Then
the constraint becomes dT p = dT H T Hp = eT1 (Hp) = 1, so we can write Sp = (SH T )(Hp),
where Hp = (1 q T )T . Now, if we form SH T = S̃ = [s̃1 S̃2 ], we arrive at the unconstrained
linear least squares problem
 
min_q ∥S̃2 q + s̃1∥_2^2   subject to   p = H^T [1; q].   (8.2.45)

(c) Quadratic constraint:

min_p ∥Sp∥_2^2   subject to   p^T Cp = 1.   (8.2.46)

For general symmetric C, problem (8.2.46) reduces to the generalized eigenvalue problem

(C − λS T S)p = 0.

If C is well-conditioned, this can be reformulated as a standard eigenvalue problem C −1 S T Sp =


λ−1 p but can potentially result in a loss of accuracy if S T S is formed explicitly. When C is
positive definite, we can write C = B TB, and the constraint becomes ∥Bp∥2 = 1. The solution
is related to the generalized eigenvalue problem

(B TB − λS T S)p = 0. (8.2.47)

Since λ∥Sp∥2 = ∥Bp∥2 = 1, we want the eigenvector p corresponding to the largest eigenvalue
λ or, equivalently, the largest eigenvalue of (S T S)−1 B TB. If S = QR is the QR factorization
of S with R nonsingular, the eigenvalue problem (8.2.47) can be written

λq = (BR−1 )T BR−1 q, q = Rp.

Hence, q is the right singular vector corresponding to the largest singular value of BR−1 . Of
special interest is the choice B = (0 I). In this case, the constraint can be written ∥p2 ∥22 = 1,
where p = (p1 p2 ). With R conformally partitioned with p, problem (8.2.46) is equivalent to the
generalized total least squares problem
min_p ∥ [R11  R12; 0  R22] [p1; p2] ∥_2^2   subject to   ∥p2∥2 = 1.   (8.2.48)

For any p2 we can determine p1 so that R11 p1 + R12 p2 = 0. Hence p2 solves minp2 ∥R22 p2 ∥2
subject to ∥p2 ∥2 = 1 and can be obtained from the SVD of R22 .
If σmin (S) = 0, the data exactly fit the ellipse. If σmin (S) is small, the different constraints
above give similar solutions. However, when errors are large, the different constraints can lead
to very different solutions; see Varah [1089, 1996].
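Of the three constraints, the SVD constraint (8.2.43) is the simplest to realize: the minimizer is just a right singular vector. A minimal Python sketch, returning the algebraic parameters p = (a11, a12, a22, b1, b2, c) normalized to unit length, is given below for illustration.

    import numpy as np

    def fit_conic_svd(x, y):
        """Algebraic conic fit with the SVD constraint ||p||_2 = 1, cf. (8.2.43)."""
        S = np.column_stack([x**2, 2*x*y, y**2, x, y, -np.ones_like(x)])
        _, _, Vt = np.linalg.svd(S, full_matrices=False)
        return Vt[-1]      # right singular vector of the smallest singular value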
A desirable property of a fitting algorithm is that when the data are translated and rotated,
the fitted ellipse should be transformed in the same way. It can be shown that to have this invari-
ance, the constraint must involve only symmetric functions of the eigenvalues of A. The SVD
constraint does not have this property. For a linear constraint, the choice dT = (1 0 1 0 0 0)
gives the desired invariance:

dT p = a11 + a22 = trace(A) = λ1 + λ2 = 1. (8.2.49)



The quadratic constraint proposed by Bookstein [170, 1979],

∥A∥2F = a211 + 2a212 + a222 = λ21 + λ22 = 1, (8.2.50)

also leads to this kind of invariance. Note that the Bookstein constraint can be put in the form
∥Bp∥2 = 1, B = (0  I), by permuting the variables and scaling by √2:

p = (b1, b2, c, a11, √2 a12, a22)^T,   si^T = (xi, yi, −1, xi^2, √2 xi yi, yi^2).

Another useful quadratic constraint is pT Cp = a11 a22 −a212 = λ1 λ2 = 1. This has the advantage
of guaranteeing that an ellipse is generated rather than another conic section. Note that the matrix
C corresponding to this constraint is not positive definite, so it leads to a generalized eigenvalue
problem.
To plot the ellipse, it is convenient to convert the algebraic form (8.2.42) to the parametric
form
       
[x(θ); y(θ)] = [xc; yc] + G(α) [a cos θ; b sin θ],   G(α) = [cos α  sin α; −sin α  cos α],   (8.2.51)

where G(α) is a rotation with angle α. The new parameters (xc , yc , a, b, α) can be obtained from
the algebraic parameters p by the eigenvalue decomposition A = G(α)ΛG(α)T as follows. We
assume that Λ = diag (λ1 , λ2 ). If a12 = 0, take G(α) = I and Λ = A. Otherwise, compute
t = tan(α) from
τ = (a22 − a11)/(2a12),   t = sign(τ)/( |τ| + √(1 + τ^2) ).

This ensures that |α| < π/4. Then cos α = 1/√(1 + t^2), sin α = t cos α, λ1 = a11 − ta12, and
λ2 = a22 + ta12 . The center of the ellipse is given by
 
[xc; yc] = −(1/2) A^{-1} b = −(1/2) G(α) Λ^{-1} G(α)^T b,   (8.2.52)
and the axes are (a, b) = ( √(c̃/λ1), √(c̃/λ2) ), where

c̃ = c − (1/2) b^T [xc; yc] = c + (1/4) b^T G(α) Λ^{-1} G(α)^T b.   (8.2.53)

The algorithms described here can be generalized for fitting conic sections in three dimen-
sions. In some applications, e.g., lens-making, it is required to fit a sphere to points representing
only a small patch of the sphere surface. For a discussion of this case, see Forbes [419, 1989].
Other fitting problems, such as fitting a cylinder, circle, or cone in three dimensions, are consid-
ered in Forbes [418, 1989].
In geometric fitting of circles and ellipses, the sum of orthogonal distances from each data
point to the curve is minimized. When the curve admits a parametrization, such as in the case of
fitting a circle and ellipse, this simplifies. We first consider fitting of a circle written in parametric
form as

f(x, y, p) = [x − xc − r cos θ; y − yc − r sin θ] = 0,
where p = (xc , yc , r)T . The problem can be written as a nonlinear least squares problem

min_{p,θ} ∥r(p, θ)∥_2^2,   θ = (θ1, . . . , θm)^T,   (8.2.54)

where r is a vector of length 2m with components


 
ri = [xi − xc − r cos θi; yi − yc − r sin θi],   i = 1, . . . , m.

As an initial approximation we can take xc , yc , r from an algebraically fitted circle. We


determine initial approximations for θi from tan θi = (yi − yc )/(xi − xc ). To apply the Gauss–
Newton method to (8.2.54), we need the Jacobian of r. We get
       
∂ri/∂θi = r [sin θi; −cos θi],   ∂ri/∂r = −[cos θi; sin θi],   ( ∂ri/∂xc  ∂ri/∂yc ) = −[1 0; 0 1].

After reordering of rows, the Jacobian associated with this problem has the form
 
J = [rS  A; −rC  B],

where S = diag (sin θ1 , . . . , sin θm ) and C = diag (cos θ1 , . . . , cos θm ). The first m columns
of J correspond to the parameters θ and are mutually orthogonal. Multiplying from the left with
the orthogonal matrix Q^T = [S  −C; C  S], we obtain
 
Q^T J = [rI  SA − CB; 0  CA + SB].

Hence to compute the QR factorization of J and the Gauss–Newton search direction for problem
(8.2.54), we only need to compute the QR factorization of the m × 3 matrix CA + SB. A
trust-region stabilization can easily be added.
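If the structured QR factorization is not exploited, the geometric circle fit (8.2.54) can still be prototyped by handing the full residual vector to a generic trust-region least squares solver, with starting values taken from an algebraic fit. The following Python sketch illustrates this simpler, less efficient route.

    import numpy as np
    from scipy.optimize import least_squares

    def fit_circle_geometric(x, y, xc0, yc0, r0):
        """Geometric (orthogonal distance) circle fit via a generic NLS solver."""
        theta0 = np.arctan2(y - yc0, x - xc0)         # initial angles
        v0 = np.concatenate([[xc0, yc0, r0], theta0])

        def resid(v):
            xc, yc, r, theta = v[0], v[1], v[2], v[3:]
            return np.concatenate([x - xc - r * np.cos(theta),
                                   y - yc - r * np.sin(theta)])

        sol = least_squares(resid, v0)
        return sol.x[:3]                              # xc, yc, r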
For the geometrical fitting of an ellipse we proceed similarly. We now have the parametric
form (8.2.51),
     
f(x, y, p) = [x − xc; y − yc] − G(α) [a cos θ; b sin θ],   G(α) = [cos α  sin α; −sin α  cos α],

where p = (xc , yc , a, b, α)T . The problem can be written as a nonlinear least squares problem
(8.2.54) if we define r as a vector of length 2m with
   
ri = [xi − xc; yi − yc] − G(α) [a cos θi; b sin θi].

As for the circle, we can take as initial approximation for p the values for an algebraically
fitted ellipse. To obtain initial values for θi we note that for a point (x(θ), y(θ)) on the ellipse,
we have from (8.2.51)

[a cos θ; b sin θ] = G^T(α) [x(θ) − xc; y(θ) − yc].
As an initial approximation one can take θi = arctan(vi /ui ), where
   
[ui; vi] = G^T(α) [(xi − xc)/a; (yi − yc)/b],   i = 1, . . . , m.

To evaluate the Jacobian we need the partial derivatives


   
∂ri/∂θi = G(α) [a sin θi; −b cos θi],   ( ∂ri/∂a  ∂ri/∂b ) = −G(α) [cos θi  0; 0  sin θi],

and

∂ri/∂α = −G′(α) [a cos θi; b sin θi],   G′(α) = [−sin α  cos α; −cos α  −sin α].
To simplify, multiply the Jacobian from the left by the 2m × 2m block diagonal orthogonal
matrix diag (GT (α), . . . , GT (α)), noting that
   
G(α)^T ∂ri/∂α = [−b sin θi; a cos θi],   G(α)^T G′(α) = [0  1; −1  0].

With rows reordered, the structure of the transformed Jacobian is similar to the circle case,
 
J = [aS  A; −bC  B],   S = diag (sin θi),   C = diag (cos θi),

where A and B are now m × 5 matrices. The first block column can easily be triangularized
using the diagonal form of S and C. The main work is the final triangularization of the resulting
(2,2) block. If a = b, the sum of the first m columns of J is zero. In this case the parameter α is
not well determined, and it is essential to use some kind of regularization of α.
The fitting of a sphere or an ellipsoid can be treated analogously. The sphere can be repre-
sented in parametric form as
 
f(x, y, z, p) = [x − xc − r cos θ cos ϕ; y − yc − r cos θ sin ϕ; z − zc − r sin θ] = 0,   (8.2.55)

where p = (xc , yc , zc , r)T . We get 3m nonlinear equations in 2m + 4 unknowns. The first 2m


columns of the Jacobian can easily be brought into upper triangular form.
When the data cover only a small arc of the circle or a small patch of the sphere, the fitting
problem can be ill-conditioned. An important application involving this type of data is the fitting
of a spherical lens. Likewise, the fitting of a sphere or an ellipsoid to near planar data is an
ill-conditioned problem. For a more detailed description and tests of algorithms, see Gander,
Golub, and Strebel [439, 1994]. Algorithms for three-dimensional fitting of surfaces are treated
by Sourlier [1013, 1995]. Tests and comparisons of the software packages Funke and ODRPACK
are given by Strebel, Sourlier, and Gander [1045, 1997].

8.3 Nonnegativity Constrained Problems


8.3.1 Gradient Projection Methods for NNLS
The active-set algorithms given in Section 3.5.2 for the NNLS (nonnegative least squares)
problem
min_{x≥0} ∥Ax − b∥2

are competitive for problems of small to medium size. For applications where A is large and/or
sparse, other algorithms are to be preferred. For example, consider calculating the optimal
amount of material to be removed in the polishing of large optics. Here the nonnegativity con-
straints come in because polishing can only remove material from the surface. A typical problem
might have 8,000 to 20,000 rows and the same number of unknowns, with only a small per-
centage of the matrix elements being nonzero. In general, the problem is rank-deficient, and
the nonnegativity constraints are active for a significant fraction of the elements of the solution
vector. Applications in data mining and machine learning (where the given data, such as images
and text, are required to be nonnegative) give rise to problems of even larger size.

In gradient projection methods for problem BLS (bound-constrained least squares), a step
in the direction of the negative gradient is followed by projection onto the feasible set. For
example, the projected Landweber method for an NNLS problem is
x^{(k+1)} = P( x^{(k)} + ω A^T(b − Ax^{(k)}) ),   0 < ω < 2/σ1(A)^2,
where P is the projection onto the set x ≥ 0. These simple methods have the disadvantage of
slow convergence, and σ1 (A) may not be known.
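For reference, a minimal Python version of the projected Landweber iteration reads as follows; the step length is computed here from the largest singular value, which is affordable only for small dense problems.

    import numpy as np

    def projected_landweber(A, b, iters=500):
        """Projected Landweber iteration for min ||Ax - b||_2 subject to x >= 0."""
        x = np.zeros(A.shape[1])
        omega = 1.0 / np.linalg.norm(A, 2)**2          # step length below 2/sigma_1(A)^2
        for _ in range(iters):
            x = np.maximum(0.0, x + omega * A.T @ (b - A @ x))
        return x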
For problem NNLS, an equivalent unconstrained problem can be obtained by introducing the
parametrization xi = ezi , i = 1, . . . , n. In image restoration, this has a physical interpretation—
see Hanke, Nagy, and Vogel [571, 2000]. By the chain rule, the gradient g of Φ(x) = 21 ∥Ax−b∥2
with respect to z is
X = diag (x) ≥ 0, y = AT (Ax − b) ≥ 0, gz = Xy. (8.3.1)
Setting gz = 0, we recover the KKT first-order optimality conditions for NNLS. The correspond-
ing modified residual norm steepest descent iterative method is
x(k+1) = x(k) + αk Xk AT (b − Ax(k) ), Xk = diag (x(k) ), (8.3.2)
where the step is in the direction of the negative gradient. This iteration can be interpreted as a
nonlinear Landweber method in which Xk acts as a (variable) preconditioner. The step length
αk in (8.3.2) is restricted to ensure nonnegativity of the x(k+1) .
In certain problems in astronomy and medical imaging, such as positron emission tomogra-
phy (PET), data is subject to noise with a Poisson distribution. Statistical considerations justify
computing a nonnegative minimizer of the maximum likelihood functional
n
X n
X
Ψ(x) = yi − bi log yi , y ≡ Ax; (8.3.3)
i=1 i=1

see Kaufman [687, 1993]. With the same parametrization x = ez as above, the gradient of Φ
with respect to z is
gz = Xgx = XAT Y −1 (Ax − b),
where X = diag (x) and Y = diag (y). Assume now that A is nonnegative and all column
sums of A are 1, i.e., AT e = e, where e = (1, 1, . . . , 1)T . This assumption can be interpreted
as an energy conservation property of A and can always be satisfied by an initial scaling of the
columns. Then XAT Y −1 Ax = XAT e = x, and the gradient becomes gz = x − XAT Y −1 b.
Setting the gradient equal to zero leads to the fixed-point iteration
x(k+1) = Xk AT Yk−1 b, Xk = diag (x(k) ), Yk = diag (Ax(k) ). (8.3.4)
This is the basis for the expectation maximization (EM) algorithm, which is popular in astron-
omy. Note that nonnegativity of the iterates in (8.3.4) is ensured if b ≥ 0.
The choice of starting point is important in iterative NNLS algorithms. Typical applications
require the solution of an underdetermined linear system, which has no unique solution. Different
initial points will converge to different local optima. The EM algorithm is very sensitive to the
initial guess and does not allow x(0) = 0. Dax [294, 1991] shows that the use of Gauss–Seidel
iterations to obtain a good initial point is likely to give large gains in efficiency.
Steepest descent methods for nonnegative least squares have only a linear rate of conver-
gence, even when a line search is used. They tend to take very small steps whenever the level
curves are ellipsoidal. Kaufman [687, 1993] considers acceleration schemes based on the conju-
gate gradient (CG) method. In the inner iterations of CG, the scaling matrix X is kept fixed. CG
is restarted with a new scaling matrix Xk whenever a new constraint becomes active. Nagy and
Strakoš [819, 2000] consider a variant of this algorithm and show that it is more accurate
and efficient than unconstrained Krylov subspace methods.

8.3.2 Interior Methods for NNLS


The KKT first-order optimality conditions for problem NNLS can be written as a system of
nonlinear equations

F1 (x, y) = Xy = 0, X = diag (x) ≥ 0, y = AT (Ax − b) ≥ 0. (8.3.5)

This is the basis of a primal-dual interior method. It uses the Newton directions for the nonlinear
system

F2(x, y) = [Xy; A^T(Ax − b) − y] = 0,
where the iterates are not forced to satisfy the linear constraint y = A^T(Ax − b). A sequence
of points {xk > 0} is computed by

(xk+1 , yk+1 ) = (xk , yk ) + θk (uk , vk ),

where θk is a positive step size, and (uk , vk ) satisfies the linear system
    
[Yk  Xk; A^T A  −I] [uk; vk] = [−Xk yk + µk e; A^T rk + yk],   (8.3.6)

where rk = b−Axk , Xk = diag (xk ), Yk = diag (yk ), and µk ≥ 0 is a centralization parameter;


see Lustig, Marsten, and Shanno [764, 1991]. The step size θk is chosen to satisfy
θk ≤ θk^max   if uk^T vk ≤ 0,      θk ≤ min(θk^max, θ̂k)   if uk^T vk > 0,

where θkmax is the largest value such that xk+1 ≥ 0, yk+1 ≥ 0, and

θ̂k = (xTk yk − nµk )/(uTk vk ).

Choosing θk in this way can be shown to guarantee a monotonic decrease of g(x, y) = xT y in


each iteration. Computational experience has shown that the condition θkmax < θ̂k /2 is usually
satisfied, and in practice one takes θk = 0.99995·θkmax .
From (8.3.6) it can be seen that uk solves the least squares problem
   
min_{uk} ∥ [A; (Xk^{-1} Yk)^{1/2}] uk − [rk; (Xk Yk)^{-1/2} µk e] ∥_2.   (8.3.7)

After uk has been calculated, vk is determined from the first block equation in (8.3.6):

vk = −yk + Xk−1 (µk e − Yk uk ).

This approach can be improved by using a predictor-corrector scheme that determines a


first direction (uk , vk ) by taking µk = 0 in (8.3.6). This direction is corrected by (zk , wk ) =
(uk , vk ) + (ūk , v̄k ), where (ūk , v̄k ) satisfies
    
[Yk  Xk; A^T A  −I] [ūk; v̄k] = [−Uk Vk e + µk e; 0],   (8.3.8)

with Uk = diag (uk ) and Vk = diag (vk ). When uk and vk have been computed, zk can be
found as the solution of the least squares problem
   
min_{zk} ∥ [A; (Xk^{-1} Yk)^{1/2}] zk − [rk; (Xk Yk)^{-1/2} (µk I − Uk Vk) e] ∥_2.

Finally, wk is found from

wk = −yk + Xk−1 (µk e − Yk zk − Uk vk ).

Following [764, 1991], the parameter µk is taken as

µk = (xk + θk uk )T (yk + θk vk )/n2

with θk as above. This choice does not guarantee a decrease in g(x, y) but seems to work well in
practice. Subproblems (8.3.7) and (8.3.8) must be solved from scratch at each iteration because
no reliable updating methods are available.
Portugal, Júdice, and Vicente [900, 1994] discuss implementation issues and present com-
putational experience with a predictor-corrector algorithm for problem NNLS. They find this
method gives high accuracy even when the subproblems are solved by forming the normal equa-
tion.
An interior method for large-scale nonnegative regularization is given by Rojas and Stei-
haug [933, 2002]. Surveys of interior methods are given by Wright [1133, 1997] and Forsgren,
Gill, and Wright [421, 2002]. The theory of interior methods for convex optimization is devel-
oped in the monumental work by Nesterov and Nemirovski [828, 1994]. The state of the art of
interior methods for optimization is surveyed by Nemirovski and Todd [825, 2008].

8.3.3 Nonnegative Matrix Factorization


An important application of NNLS occurs in nonnegative matrix factorization (NNMF). Given
a nonnegative matrix A ∈ Rm×n , we want to find nonnegative factors W ∈ Rm×k and H ∈
Rn×k that solve
min_{W,H} ∥A − W H^T∥_F^2   subject to   W ≥ 0,   H ≥ 0.   (8.3.9)

The NNMF problem has received much attention; see Kim and Park [697, 2008]. Applications
include analysis of image databases, data mining, machine learning, and other retrieval and clus-
tering operations. Vavasis [1092, 2009] shows that the NNMF problem is equivalent to a problem
in polyhedral combinatorics and is NP-hard.
If either of factors H or W is kept fixed, then computing the other factor in problem NNMF
is a standard NNLS problem with multiple right-hand sides. It can be solved independently by
an NNLS algorithm. For example, if H is fixed, then
min_{W≥0} ∥HW^T − A^T∥_F^2 = Σ_{i=1}^m min_{wi≥0} ∥H wi^T − ai^T∥_2^2,

where wi and ai are the ith rows of W and A, respectively. Given an initial guess H (1) , the
alternating NNLS method (ANLS) is

for k = 1, 2, . . . , (8.3.10)
(k) T
min ∥H W − AT ∥2F , giving W (k)
, (8.3.11)
W ≥0

min ∥W (k) H T − A∥2F , giving H (k+1) . (8.3.12)


H≥0

The two NNLS subproblems are solved alternately until a convergence criterion is satisfied. It
can be shown that every limit point attained by ANLS is a stationary point of (8.3.9). If A is
rank-deficient, a unique least-norm solution is computed in a second stage.
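A straightforward, if slow, realization of the ANLS iteration (8.3.10)–(8.3.12) uses a dense active-set NNLS solver row by row, as sketched below in Python; practical NNMF codes use block or projected-gradient solvers instead, and the random initialization is purely illustrative.

    import numpy as np
    from scipy.optimize import nnls

    def anls_nnmf(A, k, iters=30, seed=0):
        """ANLS for A ~ W H^T with W >= 0, H >= 0 (illustrative sketch)."""
        m, n = A.shape
        H = np.random.default_rng(seed).random((n, k))
        W = np.zeros((m, k))
        for _ in range(iters):
            for i in range(m):                 # rows of W: min ||H w - a_i||, w >= 0
                W[i], _ = nnls(H, A[i])
            for j in range(n):                 # rows of H: min ||W h - A[:, j]||, h >= 0
                H[j], _ = nnls(W, A[:, j])
        return W, H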

A problem with the ANLS method is that convergence is often slow, and the solution reached
is not guaranteed to be a global optimum. Finding a good initial approximation is important.
We know that the best rank-k approximation of A can be found from the first k singular triplets
σi ui vi^T of A. How to obtain these triplets is discussed in Section 7.3.3. Boutsidis and
Gallopoulos [172, 2008] show that good initial values for ANLS can be obtained by replacing all
negative elements in the product ui viT by zeros.
Surveys of algorithms for nonnegative matrix factorizations are given by Berry et al. [116,
2007] and Kim, He, and Park [699, 2014]. Nonnegative tensor factorizations are studied by Kim,
Park, and Eldén [698, 2007].

Notes and references


For large-scale NNLS problems, modern optimization algorithms for minimizing a general ob-
jective function f (x) subject to bound-constraints often outperform older specialized algorithms.
In the gradient projection algorithm L-BFGS-B by Byrd et al. [197, 1995], a limited-memory
BFGS approximation to the Hessian of f is updated at each iteration and used to define a qua-
dratic model of f . A search direction is then computed by the gradient projection method with
a line search. For an open source Fortran implementation, see Zhu et al. [1150, 1997]. Some
improvements are given by Morales and Nocedal [805, 2011]. Algorithms that combine the
standard active-set strategy with the gradient projection method are given by Bierlaire, Toint,
and Tuyttens [119, 1991] and Moré and Toraldo [809, 1989]. In these, the number of active
constraints can change significantly at each iteration.

8.4 Robust Regression and Related Topics


There are several alternative statistical models to least squares fitting such as maximum likeli-
hood methods and other Bayesian models. These often lead to nonlinear least squares problems
but lie outside the main scope of this book. A good overview of these and other alternative mod-
els for learning from data is given in the book by Hastie, Tibshirani, and Friedman [595, 2009].
In this section we discuss a selection of interesting recent developments.

8.4.1 Minimizing the ℓ1 and ℓ∞ Residual Norm


In some applications it might be appropriate to minimize the ℓp -norm ∥r∥p of the residual vector
r = b − Ax, A ∈ Rm×n for some p ̸= 2. This problem has the same solution as

min_x ψp(x),   ψp(x) = ∥r(x)∥_p^p,   (8.4.1)

and the latter expression is easier to use. For 1 ≤ p < ∞, problem (8.4.1) is strictly convex if
rank(A) = n and therefore has a unique solution. For 0 < p < 1, ψp (x) is not convex, and ∥ · ∥p
is not actually a norm, though d(x, y) = ∥x − y∥p is a metric. For p = 1 and p → ∞, where
∥r∥1 = Σ_{i=1}^m |ri|,   ∥r∥∞ = max_{1≤i≤m} |ri|,   (8.4.2)

the minimization is complicated by the fact that these norms are only piecewise differentiable.
Already in 1799, Laplace was using the principle of minimizing the sum of the absolute errors
with the added condition that the sum of the errors be zero. He showed that this implies that the
solution x must satisfy exactly n out of the m equations. The effect on errors of using different
ℓp -norms is visualized in Figure 8.4.1.


Figure 8.4.1. The penalizing effect using the ℓp -norm for p = 0.1, 1, 2, 10.

Example 8.4.1. Consider the problem of estimating the scalar γ from m observations y ∈ Rm .
This is equivalent to
min_{γp} ∥γp e − y∥_p^p,   e = (1, 1, . . . , 1)^T.   (8.4.3)

If y1 ≥ y2 ≥ · · · ≥ ym , the solutions for some different values p are


γ1 = y(m+1)/2 , (m odd),
γ2 = (y1 + y2 + · · · + ym )/m,
γ∞ = (y1 + ym )/2.
These estimates correspond to the median, mean, and midrange, respectively. Note that the
estimate γ1 is insensitive to extreme values of yi . A small number of isolated large errors usually
will not change the ℓ1 solution. This property carries over to general ℓ1 problems. On the other
hand, the estimate γ∞ only depends on the two extreme observations.

Minimization in the ℓ1 and ℓ∞ norms can be posed as linear programs. For the ℓ1 -norm,
define nonnegative variables r+ , r− ∈ Rm such that r = r+ − r− . Let eT = (1, 1, . . . , 1) be a
row vector of all ones. Then (8.4.1) is equivalent to
min (eT r+ + eT r− ) subject to Ax + r+ − r− = b, r+ , r− ≥ 0. (8.4.4)
The matrix ( A I −I ) has rank m and column dimension n + 2m. From standard results in
linear programming theory it follows that there exists an optimal ℓ1 solution such that at least m
ri+ or ri− are zero and at least n − rank(A) xi are zero; see Barrodale and Roberts [86, 1970].
An initial feasible basic solution is available immediately by setting x = 0 and ri+ = bi or
ri− = −bi .
Barrodale and Roberts [87, 1973] use a modified simplex method to solve linear program
(8.4.4). It takes advantage of the fact that the variables ri+ and ri− cannot simultaneously be in the
basis. The simplex iterations can be performed within a condensed simplex array of dimensions
(m+2)×(n+2). Implementions of this algorithm are given in Barrodale and Roberts [88, 1974]
and Bartels and Conn [89, 1980]. The latter can handle additional linear equality and inequality
constraints. An alternative algorithm by Abdelmalek [5, 1980], [6, 1980] is based on the dual of
linear program (8.4.4).
The ℓ∞ problem, also called Chebyshev or minimax approximation, is to minimize
max_{1≤i≤m} |ri| or, equivalently,

min ζ subject to − ζe ≤ Ax − b ≤ ζe, ζ ≥ 0. (8.4.5)



Stiefel [1036, 1959] gave a so-called exchange algorithm for Chebyshev approximation. This is
based on the following property of the optimal solution: the maximum error is attained at n + 1
points if rank(A) = n. In a later paper he showed his algorithm to be equivalent to the simplex
method applied to a suitable linear program. Problem (8.4.5) can be formulated as
    
min ζ   subject to   [A  e; −A  e] [x; ζ] ≥ [b; −b],   ζ ≥ 0.   (8.4.6)
This linear program has 2m linear constraints in n + 1 variables, and only ζ has a finite simple
bound. Osborne and Watson [850, 1967] recommended the dual program of (8.4.6),
max ( b^T  −b^T ) w   subject to   [A^T  −A^T; e^T  e^T] w = [0; 1],   w ≥ 0,   (8.4.7)
which has only n+1 rows. To use a modern mathematical programming system, such as CPLEX
or Gurobi (especially when A is sparse), problem (8.4.6) can be rewritten as
      
min ζ   subject to   [b; 0; −∞] ≤ [A  I  0; 0  I  e; 0  I  −e] [x; r; ζ] ≤ [b; ∞; 0],   ζ ≥ 0,   (8.4.8)
which is larger but has only one A and is very sparse.
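With a modern LP solver, formulation (8.4.5) can be coded directly. The Python sketch below uses scipy.optimize.linprog with the inequality constraints Ax − ζe ≤ b and −Ax − ζe ≤ −b; it is a dense illustration rather than the sparse formulation (8.4.8).

    import numpy as np
    from scipy.optimize import linprog

    def chebyshev_fit(A, b):
        """l_infinity (Chebyshev) approximation via linear programming, cf. (8.4.5)."""
        m, n = A.shape
        e = np.ones((m, 1))
        A_ub = np.block([[A, -e], [-A, -e]])       # A x - zeta e <= b, -A x - zeta e <= -b
        b_ub = np.concatenate([b, -b])
        c = np.zeros(n + 1); c[-1] = 1.0           # minimize zeta
        bounds = [(None, None)] * n + [(0, None)]  # x free, zeta >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:n], res.x[-1]                # solution and maximum residual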
If the assumptions in a regression model are violated and data are contaminated with outliers,
these can have a large effect on the solution. In robust regression, possible outliers among the
data points are identified and given less weight. Huber’s M-estimator (Huber [648, 1981]) is
a compromise between ℓ2 and ℓ1 estimations. It uses the least squares estimator for “normal”
data but the ℓ1 -norm estimator for data points that disagree more with the normal picture. More
precisely, Huber’s M-estimate minimizes the objective function
ψH(x) = Σ_{i=1}^m ρ(ri(x)/σ),   (8.4.9)

where σ is a scaling factor that depends on the data, and


ρ(t) = { t^2/2 if |t| ≤ γ,   γ|t| − γ^2/2 if |t| > γ }   (8.4.10)
for some threshold parameter γ > 0. In the following we assume σ = 1 in (8.4.9). For large
values of γ, Huber’s M-estimator will be close to the least squares estimator, and for small values
of γ it is close to the ℓ1 estimator. The statistical model behind this estimate is that the errors
come from a distribution of the form (1 − ϵ)N + ϵP , where N denotes the standard normal
distribution, and P is an unknown symmetric perturbing distribution with the same center as N .
For a general treatment of robust statistical procedures, see Huber [648, 1981].
The Newton step s for minimizing the Huber objective (8.4.9) satisfies A^T DAs = A^T y, where
yi = ρ′ (ri ), D = diag (ρ′′ (ri )), i = 1, . . . , m.
This is similar to (8.4.15) for ℓp approximation. It is advantageous to start the iterations with
a large value of the threshold parameter γ and then decrease it to the desired value. This helps
prevent the occurrence of rank-deficient Hessians H.
A finite algorithm for ℓ1 estimation is given by Madsen and Nielsen [767, 1993]. At each
iteration the nondifferentiable function ∥r∥1 is replaced by the Huber function (8.4.9) for some
parameter γ. The parameter is successively reduced until it is small enough for the ℓ1 solution
to be detected. The method is significantly faster than the simplex-type method of Barrodale and
Roberts [88, 1974].

Notes and references


Späth [1014, 1992] gives Fortran programs implementing several of the early algorithms cited
above for computing ℓ1 and ℓ∞ estimators. For ℓ1 approximation the Barrodale–Roberts al-
gorithm uses the least storage and is among the fastest. For ℓ∞ approximation the Barrodale–
Phillips algorithm is the fastest and has the best reliability. Bartels and Golub [91, 1968], [92,
1968] give a more stable implementation of Stiefel’s exchange algorithm for ℓ∞ minimization.
The algorithm of Barrodale and Phillips [85, 1975] is based on the dual linear program (8.4.7).
Bartels, Conn, and Sinclair [90, 1978] give algorithms based on projected gradient techniques.
They use a descent method to explicitly find the correct subset of zero residuals when p = 1 and
maximum residuals when p = ∞. Coleman and Li [260, 1992], [261, 1992] propose a globally
convergent Newton algorithm for the ℓ1 and the ℓ∞ problems.

8.4.2 Iteratively Reweighted Least Squares


The method of Iteratively Reweighted Least Squares (IRLS) is based on the observation that
the minimization problem min_x ∥r(x)∥_p^p, r(x) = b − Ax, can be rewritten as

    ∥r(x)∥_p^p = Σ_{i=1}^{m} |r_i(x)|^p = Σ_{i=1}^{m} |r_i(x)|^{p−2} r_i(x)².        (8.4.11)

This is a weighted linear least squares problem


    min_x ∥W(|r|) r∥_2²,   W(|r|) = diag(|r|)^{(p−2)/2},        (8.4.12)

where diag (|r|) denotes the diagonal matrix with ith component |ri |. Here the diagonal weight
matrix W depends on r and hence on the unknown x.
In IRLS a sequence of weighted least squares problems is solved, where the weights for the
next iteration are obtained from the current solution. The iterations are initialized by computing
the unweighted least squares solution x(1) and setting W1 = W1 (|r(1) |), where r(1) = b−Ax(1) .
In step k, one solves
    min_{δx} ∥W_k (r^{(k)} − A δx)∥_2,   W_k = diag(|r_i^{(k)}|^{(p−2)/2}),

and sets x(k+1) = x(k) +δx(k) . It can be shown that any fixed point of the IRLS iteration satisfies
the necessary conditions for a minimum of ψ(x) = ∥r(x)∥pp .
The first study of IRLS appeared in Lawson [725, 1961], where it was applied with p = 1
for ℓ1 minimization. It was extended to 1 ≤ p ≤ ∞ by Rice and Usow [927, 1968]. Cline
[252, 1972] proved that the local rate of convergence of IRLS is linear. Osborne [848, 1985]
gives a comprehensive analysis of IRLS and proves convergence of the basic IRLS method for
1 < p < 3. For p = 1, his main conclusion is that IRLS converges with linear rate provided the
ℓ1 approximation problem has a unique nondegenerate solution. More recently, IRLS has been
used with p < 1 for computing sparse solutions.
The IRLS method is attractive because methods for solving weighted least squares are gen-
erally available. In the simplest implementations the Cholesky factorization of AT W 2 A or the
QR factorization of W A is recomputed in each step. IRLS is closely related to Newton’s method
for minimizing the nonlinear function ψ(x) = ∥b − Ax∥pp . The first and second derivatives of
ψ(x) are
    ∂ψ(x)/∂x_j = −p Σ_{i=1}^{m} a_{ij} |r_i|^{p−2} r_i,    ∂²ψ(x)/∂x_j ∂x_k = p(p − 1) Σ_{i=1}^{m} a_{ij} |r_i|^{p−2} a_{ik}.        (8.4.13)

Hence the gradient and Hessian of ψ(x) can be written


    ∇ψ = g(x) = −p A^T W² r,    ∇²ψ = H(x) = p(p − 1) A^T W² A,        (8.4.14)

where W = W (|r|) is given as in (8.4.12). Newton’s method for solving the nonlinear equation
g(x) = 0 becomes

    x^{(k+1)} = x^{(k)} + δx^{(k)},    A^T W² A δx = q A^T W² r,   q = 1/(p − 1).        (8.4.15)

These equations are the normal equations for the weighted linear least squares problem

    min_{δx} ∥W A δx − q W r∥_2.

Thus the Newton step for minimizing ψ(x) differs from the IRLS step only by the factor q = 1/(p − 1). In particular, the IRLS step is a descent direction provided the Hessian is positive definite.
Since IRLS does not take a full Newton step, it is at best only linearly convergent. Taking the
full Newton step gives asymptotic quadratic convergence but makes the initial convergence less
robust when p < 2. In the implementation below, the full Newton step is taken only when p > 2.
For p < 2 we have (p − 2)/2 < 0, and a zero residual r_i^{(k)} = 0 gives an infinite weight w_i^{(k)}. Then
r_i = 0 also in the next step, and the zero residual will persist in all subsequent steps. Therefore,
when p < 2 it is customary to modify the weights by adding a small number:

    w_i = |r_i|^{(p−2)/2} + ϵ,   i = 1, . . . , m;        (8.4.16)

see Merle and Späth [790, 1974]. Below, we take ϵ = 10−6 initially and halve its value at each
iteration. We remark that x can be eliminated from the loop by using r = (I − PA )b, where PA
is the orthogonal projector onto the column space of A.

Algorithm 8.4.1 (IRLS for Overdetermined Systems).

function [x,r,Nrm] = irls(A,b,p,kmax)
% IRLS for minimizing ||Ax - b||_p, where
% 0 < p < infty, and A is m by n, m > n.
% -----------------------------------------
[m,n] = size(A);
x = A\b; r = b - A*x;            % start from the unweighted LS solution
weps = 1.0e-6*ones(m,1);         % regularization for zero residuals
Nrm = norm(r,p);
for k = 1:kmax
    w = abs(r).^((p-2)/2);
    weps = weps/2;
    if p < 2, w = w + weps; end  % avoid infinite weights, cf. (8.4.16)
    % Sort weights and reorder rows accordingly
    [w,perm] = sort(w);
    A = A(perm,:); b = b(perm,:); r = r(perm,:);
    W = diag(w);
    dx = (W*A)\(W*r);            % weighted least squares correction
    if p < 2, x = x + dx;        % IRLS step
    else x = x + dx/(p-1); end   % full Newton step, cf. (8.4.15)
    r = b - A*x;
    Nrm = [Nrm,norm(r,p)];
end
end
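A hypothetical call of the function above for an ℓ_1.5 fit of noisy overdetermined data could look as follows; the data and the number of iterations are illustrative only.

m = 100; n = 5;
A = randn(m,n); b = A*randn(n,1) + 0.1*randn(m,1);
[x,r,Nrm] = irls(A,b,1.5,20);   % 20 IRLS iterations with p = 1.5
semilogy(Nrm)                   % monitor ||r||_p over the iterations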

The global convergence of IRLS can be improved by using a line search. O’Leary [838,
1990] compares several strategies for implementing an efficient and reliable line search.
A useful modification of IRLS is to apply continuation. Starting with p = 2, p is successively
increased or decreased for a number of iterations until the desired value is reached. This improves
the range of values of p that give convergence and also significantly improves the rate of conver-
gence. A similar idea is to use a large initial value of ϵ in (8.4.16) and reduce it successively in
later iterations.
When p is close to unity, the convergence of the IRLS method can be extremely slow. This is
related to the fact that for p = 1 the solution has zero residuals. Li [744, 745, 1993] develops a
globalized Newton algorithm using the complementary slackness condition for the ℓ1 problem.
Far from the solution it behaves like IRLS with line search, but close to the solution it is similar
to Newton’s method for an extended nonlinear system of equations. The problem of unbounded
second derivatives is handled by a simple technique connected to the line search.

Notes and references


Ekblom [363, 1973] uses IRLS with a damped least squares method for ℓp minimization. IRLS
has been used extensively in signal processing for filter design; see Burrus, Barreto, and Se-
lesnick [191, 1994]. Newton-like methods for computing Huber’s M-estimator are given by
Clark and Osborne [251, 1986], Ekblom [364, 1988], Ekblom and Madsen [365, 1989], and
Madsen and Nielsen [766, 1990]. A comparison of some of these implementations is given by
O’Leary [838, 1990]. Daubechies et al. [286, 2010] outline a reweighting scheme for IRLS ℓ1
approximation that avoids infinite weights. The rate of convergence of their algorithm is similar
to that for interior-point algorithms for direct ℓ1 minimization.

8.4.3 LASSO and Least-Angle Regression


The minimization problem
    min_x ∥Ax − b∥_2² + µ∥x∥_1,        (8.4.17)

where µ > 0 is a regularization parameter, is similar to Tikhonov regularization, except for


using the ℓ1 -norm instead of the ℓ2 -norm. The objective function in (8.4.17) is convex but not
differentiable. Hence this problem always has a solution, although it need not be unique. The use
of the ℓ1 -norm in the regularization term has a big influence on the character of the solution. In
contrast to standard Tikhonov regularization, the solution is not a linear function of b, and there
is no analytic formula for the solution. In the limit µ → 0, the solution tends to the least squares
solution of minimum ℓ1-norm. The related ℓ1 trust-region problem,

    min_x ∥Ax − b∥_2   subject to   ∥x∥_1 ≤ µ,        (8.4.18)

was proposed for variable selection in least squares problems by Tibshirani [1061, 1996]. He
gave it the colorful name LASSO, which stands for “Least Absolute Shrinkage and Selection
Operator.” For a fixed value of the regularization parameter µ > 0, the objective function in
LASSO is strictly convex over a convex feasible region. Therefore problem (8.4.18) has a unique
minimizer. Let µLS = ∥xLS ∥1 , where xLS is the unconstrained least squares solution. The trajec-
tory of the LASSO solution for µ ∈ [0, µLS ] is a piecewise linear function of µ. An algorithm for
computing the LASSO trajectory based on standard methods for convex programming is given
by Osborne, Presnell, and Turlach [849, 2000].
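For a single fixed value of µ, the penalized problem (8.4.17) can also be handed to a general quadratic programming solver through the standard splitting x = u − v with u, v ≥ 0. The following minimal sketch assumes MATLAB's quadprog from the Optimization Toolbox and a given µ > 0; it is not the trajectory algorithm discussed above.

% Sketch: (8.4.17) as a bound-constrained QP in z = [u; v], x = u - v.
[m,n] = size(A);
G  = A'*A;  c = A'*b;
H  = 2*[G, -G; -G, G];               % quadratic term (positive semidefinite)
f  = [mu - 2*c; mu + 2*c];           % linear term
lb = zeros(2*n,1);
z  = quadprog(H, f, [], [], [], [], lb, []);
x  = z(1:n) - z(n+1:2*n);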
Efron et al. [359, 2004] gave a more intuitive algorithm called Least-Angle Regression
(LARS). By construction, the trajectory of the solution in LARS is piecewise linear with n break
points. In many cases, this trajectory coincides with that of the ℓ1 constrained least squares

problem. Indeed, with the following small modification the LARS algorithm can be used for
solving the LASSO problem: When a nonzero variable becomes zero and is about to change
sign, the variable is removed from the active set, and the least squares direction of change is
recomputed.

Theorem 8.4.2. Let x(µ) be the solution of the ℓ1 constrained least squares problem (8.4.18).
Then there exists a finite set of break points 0 = µ_0 ≤ µ_1 ≤ · · · ≤ µ_p = µ_LS such that x(µ) is a piecewise linear function

    x(µ) = x(µ_k) + (µ − µ_k)(x(µ_{k+1}) − x(µ_k)),   µ_k ≤ µ ≤ µ_{k+1}.        (8.4.19)

The ℓ1 -constrained least squares solution can be computed with about the same arithmetic
cost as a single least squares problem. As in stepwise regression, the QR factorization of the
columns in the active set is modified in each step. When a new variable is added, the factorization
is updated by adding the new column. In the case when a variable becomes zero, the factorization
is downdated by deleting a column. Although unlikely, the possibility of a multiple change in the
active set cannot be excluded. A crude method for coping with this is to make a small random
change in the right-hand side. It follows from continuity that no index set σ can be repeated in
the algorithm. There are only a finite number of steps in the algorithm, usually not much more
than min{m, n}.
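A minimal sketch of this updating and downdating of the active-set factorization, using MATLAB's qrinsert and qrdelete with hypothetical index variables act, jnew, and kdel, is shown below.

[Q,R] = qr(A(:,act));                        % QR of the current active columns
% variable jnew enters the active set: append its column
[Q,R] = qrinsert(Q,R,size(R,2)+1,A(:,jnew));
act   = [act, jnew];
% the variable in position kdel leaves (about to change sign): delete its column
[Q,R] = qrdelete(Q,R,kdel);
act(kdel) = [];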
Statistical aspects of variable selection in least squares problems using LASSO and related
techniques are discussed by Hastie, Tibshirani, and Friedman [595, 2009]. In image processing a
related technique has been used, where the ℓ1 -norm in LASSO is replaced by the Total-Variation
(TV) norm, giving in one dimension the related problem
    min_x ∥Ax − b∥_2² + µ∥Lx∥_1,        (8.4.20)

where (Lx)_i = |x_{i+1} − x_i|. For a two-dimensional N × N array x_{i,j}, 1 ≤ i, j ≤ N, the TV
norm is defined as

    ∥x∥_TV = Σ_{i=1}^{N−1} Σ_{j=1}^{N} |x_{i+1,j} − x_{i,j}| + Σ_{i=1}^{N} Σ_{j=1}^{N−1} |x_{i,j+1} − x_{i,j}|.        (8.4.21)
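For an N × N array X stored as a MATLAB matrix, the discrete TV norm (8.4.21) can be evaluated with first differences along each dimension, as in the following one-line sketch.

tv = sum(sum(abs(diff(X,1,1)))) + sum(sum(abs(diff(X,1,2))));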

8.4.4 Basis Pursuit and Compressed Sensing


Let Ax = b, A ∈ Rm×n , m ≪ n, be a consistent underdetermined linear system. Such a
system does not allow the determination of x unless some additional information on x is utilized.
In many applications, such as medical image reconstruction, machine learning, and artificial
intelligence, the desired solution is known to be sparse, i.e., x has at most s ≪ n nonzero
components. For example, in a biology experiment, one may measure changes of expressions
in 30,000 genes and expect at most a few hundred genes with a different expression level. Note
that, after a transformation of variables, this includes cases where the solution can be expressed
by a few Fourier components or a few wavelets. For such problems, we would like to minimize
the number of nonzero elements in x, which is formally equivalent to minimizing the ℓ0 “norm”
    ∥x∥_0 = dim{i | x_i ≠ 0}.        (8.4.22)
This is a nonconvex optimization problem that is known to be computationally intractable, be-
cause it usually requires a combinatorial search. In the Basis Pursuit (BP) problem, sparse
solutions are constructed by solving the ℓ1 minimization problem
    min_x ∥x∥_1   subject to   Ax = b.        (8.4.23)

The BP objective can be interpreted as the closest convex approximation to (8.4.22). The region
∥x∥1 ≤ µ is a diamond-shaped polyhedron with many sharp corners, edges, and faces at which
one or several parameters are zero. This structure favors solutions with few nonzero coefficients.
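Like the ℓ1 regression problems above, (8.4.23) is equivalent to a linear program through the splitting x = u − v with u, v ≥ 0. A minimal sketch assuming MATLAB's linprog:

% Sketch: Basis Pursuit (8.4.23) as a linear program.
[m,n] = size(A);
f   = ones(2*n,1);            % sum(u) + sum(v) = ||x||_1 at the optimum
Aeq = [A, -A];                % A*(u - v) = b
lb  = zeros(2*n,1);
z   = linprog(f, [], [], Aeq, b, lb);
x   = z(1:n) - z(n+1:2*n);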
Chen, Donoho, and Saunders [241, 2001] use BP to decompose a signal into a sparse combina-
tion of elements of a highly overcomplete dictionary, e.g., consisting of wavelets. To allow for
noise in the signal b, they also propose the BP Denoising (BPDN) problem
    min_{x,r} λ∥x∥_1 + (1/2) r^T r   subject to   Ax + r = b,        (8.4.24)
where λ > 0 again encourages sparsity in x. BPDN is solved by the primal-dual interior method
PDCO [887, 2018] using LSMR to compute search directions (because A is a fast linear operator
rather than an explicit matrix).
Compressed sensing is a term coined by Donoho [329, 2006] for the problem of recovering
a signal from a small number of compressive measurements. Candès, Romberg, and Tao [206,
2006] prove that sparse solutions can with high probability be reconstructed exactly from re-
markably few measurements by compressed sensing, provided these satisfy a certain coherence
property. It has been established that compressed sensing is robust in the sense that it can deal
with measurement noise and with cases where the signal is only approximately sparse. One of several
important applications is Magnetic Resonance Imaging (MRI), for which the use of compressed
sensing has improved performance by a factor of 10.
The convex optimization problem for a consistent underdetermined linear system Ax = b is
    min_x ∥x∥_p^p   subject to   Ax = b,   1 ≤ p ≤ 2,        (8.4.25)

and can be solved by IRLS. For p = 2, the solution of (8.4.25) is the pseudoinverse solution
x = A†b. For 1 ≤ p ≤ 2, the ℓp-norm is rewritten as a weighted ℓ2-norm

    ∥x∥_p^p = Σ_{i=1}^{n} (|x_i|/w_i)²,   w_i = |x_i|^{1−p/2}.        (8.4.26)

With W = diag (wi ), problem (8.4.25) is equivalent to the weighted least-norm problem
    min_x ∥W^{−1} x∥_2²   subject to   Ax = b.        (8.4.27)
From the weighted normal equations of the second kind, we obtain
    x = W² A^T (A W² A^T)^{−1} b.        (8.4.28)
A more stable alternative is to use the (compact) QR factorization
    W A^T = QR,   x = W Q (R^{−T} b).        (8.4.29)
If possible, the rows of W A^T should be presorted by decreasing row norms; see Section 3.2.2.
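The two alternatives (8.4.28) and (8.4.29) are compared in the following sketch, where the weight vector w is assumed given and W = diag(w) is nonsingular.

W  = diag(w);
x1 = W^2*(A'*((A*W^2*A')\b));    % normal equations of the second kind, cf. (8.4.28)
[Q,R] = qr(W*A',0);              % compact QR factorization of W*A'
x2 = W*(Q*(R'\b));               % (8.4.29); equals x1 in exact arithmetic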
Since the weights wi are well defined for any xi , in principle no regularization is needed.
However, to prevent the matrix being inverted in (8.4.28) from becoming too ill-conditioned, the
weights should be regularized as
    w_i = |x_i|^{1−p/2} + ϵ,

where ϵ = 10^{−6} initially, and each iteration decreases ϵ by a factor of 2.
The MATLAB implementation below follows Burrus [190, 2012]. The iterations are started
with the minimum ℓ2-norm solution, corresponding to p_1 = 2. A continuation strategy is used for p,
where the current value p_k is decreased by a fixed amount dp at each iteration until the target value p
is reached.

Algorithm 8.4.2 (Compressed Sensing by IRLS).

function [x,Nrm] = irls(A,b,p,dp,kmax)
% IRLS for minimizing ||x||_p subject to Ax = b,
% where 0 < p < 2, and A is m by n, m < n.
% ---------------------------------------
[m,n] = size(A);
x = A'*((A*A')\b);               % minimum 2-norm solution (p = 2)
epsw = 1.0e-6*ones(n,1);         % regularization of the weights
Nrm = norm(x,p); pk = 2;
for k = 1:kmax
    pk = max([p,pk - dp]);       % continuation: decrease p toward its target
    w = abs(x).^(1 - pk/2) + epsw;   % regularized weights, cf. (8.4.26)
    epsw = epsw/2;
    % Sort weights and reorder columns (decreasing row norms of W*A')
    [w,j] = sort(w,'descend');
    W = diag(w); Aj = A(:,j);
    [Q,R] = qr(W*Aj',0);         % weighted least-norm solution via (8.4.29)
    x(j) = W*(Q*(R'\b));
    Nrm = [Nrm,norm(x,p)];
end
end
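A hypothetical test of the function above on a randomly generated sparse recovery problem could look as follows; the sizes, sparsity level, and parameters are illustrative only.

m = 50; n = 200; k = 8;
A = randn(m,n);
x0 = zeros(n,1); x0(randperm(n,k)) = randn(k,1);   % k-sparse signal
b = A*x0;
[x,Nrm] = irls(A,b,1,0.1,30);    % target p = 1, continuation step dp = 0.1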

It is possible to improve the ability of ℓ1 minimization to recover sparse solutions by choosing


p = 0 as target value in IRLS; see Chartrand [237, 2007], Chartrand and Yin [238, 2008], and
Yagle [1135, 2008]. Each iteration solves a convex optimization problem, but overall the iterative
algorithm attempts to find a local minimum of the nonconvex problem

    min_x ∥x∥_0   subject to   Ax = b.        (8.4.30)

Candès, Wakin, and Boyd [207, 2008] propose an iterative algorithm similar to IRLS in
which each iteration solves the weighted ℓ1 problem

    min_x ∥W^{−1} x∥_1   subject to   Ax = b,        (8.4.31)

and the weights are updated. This problem can be rewritten as a linear program like the unweighted
ℓ1 problem. As in IRLS, the weights are initially w_i^{(1)} = 1 and then updated as

    w_i^{(k+1)} = |x_i^{(k)}| + ϵ.
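A minimal sketch of this reweighted ℓ1 iteration, solving each weighted subproblem (8.4.31) as a linear program with MATLAB's linprog; the names kmax and eps_w are illustrative, and w starts as a vector of ones.

[m,n] = size(A);
for k = 1:kmax
    f   = [1./w; 1./w];             % minimize ||W^{-1}x||_1 with x = u - v, u,v >= 0
    Aeq = [A, -A];  lb = zeros(2*n,1);
    z   = linprog(f, [], [], Aeq, b, lb);
    x   = z(1:n) - z(n+1:2*n);
    w   = abs(x) + eps_w;           % weight update
end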

The weighted ℓ1 minimization problem (8.4.31) is viewed as a relaxation of the weighted ℓ0


minimization problem
    min_x ∥W^{−1} x∥_0   subject to   Ax = b.        (8.4.32)

Solving the weighted ℓ1 problem (8.4.31) is more complex than solving the weighted least
squares problems by IRLS. Candès, Wakin, and Boyd [207, 2008] use the primal-dual log-barrier
interior software package ℓ1-MAGIC. There is a marked improvement in the recovery of sparse
signals compared to unweighted ℓ1 minimization. The number of iterations needed is typically
less than ten, but each iteration is computationally more costly than for IRLS. The primal-dual
interior method PDCO [887, 2018] can also be applied as for Basis Pursuit or Basis Pursuit
Denoising.

Notes and references


An early use of reweighted ℓ1 minimization was for matrix rank minimization and portfolio op-
timization; see Fazel [400, 2002]. A matrix analogue is to minimize the nuclear norm ∥A∥_* =
Σ_{i=1}^{min{m,n}} σ_i(A). Interior methods for large-scale ℓ1-regularized least squares are surveyed by
Kim et al. [700, 2007]. The use of nonconvex minimization problems for signal reconstruction
was suggested by Rao and Kreutz-Delgado [911, 1999]. Nesterov and Nemirovski [827, 2013]
review first-order methods for ℓ1 - and nuclear norm minimization. They show that such meth-
ods can have only sublinear convergence but that the rate of convergence is nearly dimension-
independent.
Algorithms for computing sparse solutions of underdetermined linear systems with applica-
tions to matrix completion, graph clustering, and phase retrieval are given by Lai and Wang in
the book [710].
Bibliography

[1] Ahmad Abdelfattah, Hartwig Anzt, Erik G. Boman, Erin Carson, Terry Cojean, Jack Dongarra,
Alyson Fox, Mark Gates, Nicolas J. Higham, Xiaoye S. Li, Jennifer Loe, Piotr Luszczek, Srikara
Pranesh, Siva Rajamanickam, Tobias Ribizel, Barry F. Smith, Kasia Świrydowicz, Stephen Thomas,
Stanimire Tomov, Yaohung M. Tsai, and Ulrike Meier Yang. A survey of numerical linear alge-
bra methods utilizing mixed-precision arithmetic. Internat. J. High Performance Comput. Appl.,
35:344–369, 2021. (Cited on p. 114.)
[2] Ahmad Abdelfattah, Hartwig Anzt, Jack Dongarra, Mark Gates, Azzam Haidar, Jakub Kurzak,
Piotr Luszczek, Stanimire Tomov, Ichitaro Yamazaki, and A. YarKhan. Linear algebra software for
large-scale accelerated multicore computing. Acta Numer., 25:1–160, 2016. (Cited on pp. 112,
114.)
[3] N. N. Abdelmalek. On the solution of the linear least squares problems and pseudoinverses. Com-
puting, 13:215–228, 1971. (Cited on p. 31.)
[4] N. N. Abdelmalek. Roundoff error analysis for Gram–Schmidt method and solution of linear least
squares problems. BIT Numer. Math., 11:345–368, 1971. (Cited on p. 71.)
[5] N. N. Abdelmalek. Algorithm 551: A Fortran subroutine for the l1 solution of overdetermined
linear systems of equations. ACM Trans. Math. Softw., 6:228–230, 1980. (Cited on p. 421.)
[6] N. N. Abdelmalek. l1 solution of overdetermined linear systems of equations. ACM Trans. Math.
Softw., 6:220–227, 1980. (Cited on p. 421.)
[7] R. J. Adcock. A problem in least squares. The Analyst, 5:53–54, 1878. (Cited on p. 407.)
[8] Mikael Adlers and Åke Björck. Matrix stretching for sparse least squares problem. Numer. Linear
Algebra Appl., 7:51–65, 2000. (Cited on p. 263.)
[9] S. N. Afriat. Orthogonal and oblique projectors and the characteristics of pairs of vector spaces.
Proc. Cambridge Philos. Soc., 53:800–816, 1957. (Cited on p. 119.)
[10] E. Agullo, James W. Demmel, Jack Dongarra, B. Hadri, Jakub Kurzak, Julien Langou, H. Ltaief,
P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: The PLASMA
and MAGMA projects. J. Phys. Conf. Ser., 180:012037, 2009. (Cited on p. 114.)
[11] N. Ahmed, T. Natarajan, and K. R. Rao. Discrete cosine transform. IEEE Trans. Comput., C23:90–
93, 1974. (Cited on p. 237.)
[12] Alexander Craig Aitken. On least squares and linear combinations of observations. Proc. Roy. Soc.
Edinburgh, 55:42–48, 1934/1936. (Cited on p. 5.)
[13] M. A. Ajiz and Alan Jennings. A robust incomplete Cholesky-conjugate gradient algorithm. Int. J.
Numer. Meth. Eng., 20:949–966, 1984. (Cited on pp. 310, 310.)
[14] M. Al-Baali and Roger Fletcher. Variational methods for non-linear least squares. J. Oper. Res.
Soc., 36:405–421, 1985. (Cited on p. 400.)
[15] M. Al-Baali and R. Fletcher. An efficient line search for nonlinear least-squares. J. Optim. Theory
Appl., 48:359–377, 1986. (Cited on p. 400.)


[16] S. T. Alexander, Ching-Tsuan Pan, and Robert J. Plemmons. Analysis of a recursive least squares
hyperbolic rotation algorithm for signal processing. Linear Algebra Appl., 98:3–40, 1988. (Cited
on p. 137.)
[17] D. M. Allen. The relationship between variable selection and data augmentation and a method for
prediction. Technometrics, 16:125–127, 1974. (Cited on p. 178.)
[18] Patrick R. Amestoy, Timothy A. Davis, and Iain S. Duff. Algorithm 837: AMD, an approximate
minimum degree ordering algorithm. ACM Trans. Math. Softw., 30:381–388, 2004. (Cited on
p. 252.)
[19] Patrick R. Amestoy, I. S. Duff, and C. Puglisi. Multifrontal QR factorization in a multiprocessor
environment. Numer. Linear Algebra Appl., 3:275–300, 1996. (Cited on p. 258.)
[20] Greg S. Ammar and William B. Gragg. Superfast solution of real positive definite Toeplitz systems.
SIAM J. Matrix Anal. Appl., 9:61–76, 1988. (Cited on p. 241.)
[21] A. A. Anda and Haesun Park. Fast plane rotations with dynamic scaling. SIAM J. Matrix Anal.
Appl., 15:162–174, 1994. (Cited on p. 51.)
[22] A. Anda and Haesun Park. Self-scaling fast rotations for stiff least squares problems. Linear
Algebra Appl., 234:137–162, 1996. (Cited on p. 132.)
[23] Bjarne S. Andersen, Jerzy Waśniewski, and Fred G. Gustavson. A recursive formulation of
Cholesky factorization for a packed matrix. ACM Trans. Math. Softw., 27:214–244, 2001. (Cited
on p. 112.)
[24] R. S. Andersen and Gene H. Golub. Richardson’s Non-stationary Matrix Iterative Procedure. Tech.
Report STAN-CS-72-304, Computer Science Department, Stanford University, CA, 1972. (Cited
on p. 326.)
[25] Edward Anderson, Zhaojun Bai, and Jack J. Dongarra. Generalized QR factorization and its ap-
plications. Linear Algebra Appl., 162/164:243–271, 1992. (Cited on pp. 124, 128.)
[26] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammar-
ling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users’ Guide. SIAM, Philadelphia,
second edition, 1995. (Cited on p. 97.)
[27] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. W. Demmel, J. Dongarra, J. Du Croz, A. Green-
baum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. SIAM, Philadel-
phia, third edition, 1999. (Cited on p. 114.)
[28] Claus A. Andersson and Rasmus Bro. The N -way toolbox for MATLAB. Chemom. Intell. Lab.
Syst., 52:1–4, 2000. (Cited on p. 218.)
[29] Martin Andersson. A comparison of nine PLS1 algorithms. J. Chemometrics, 23:518–529, 2009.
(Cited on pp. 202, 203.)
[30] Peter Arbentz and Gene H. Golub. Matrix shapes invariant under symmetric QR algorithm. Numer.
Linear Algebra Appl., 2:87–93, 1995. (Cited on p. 341.)
[31] M. Arioli. Generalized Golub–Kahan bidiagonalization and stopping criteria. SIAM J. Matrix Anal.
Appl., 34:571–592, 2013. (Cited on pp. 331, 331.)
[32] Mario Arioli, Marc Baboulin, and Serge Gratton. A partial condition number for linear least squares
problems. SIAM J. Matrix Anal. Appl., 29:413–433, 2007. (Cited on p. 29.)
[33] M. Arioli, J. W. Demmel, and I. S. Duff. Solving sparse linear systems with sparse backward error.
SIAM J. Matrix Anal. Appl., 10:165–190, 1989. (Cited on pp. 97, 104.)
[34] Mario Arioli and Iain S. Duff. Preconditioning linear least-squares problems by identifying a basis
matrix. SIAM J. Sci. Comput., 37:S544–S561, 2015. (Cited on p. 319.)
[35] Mario Arioli, Iain Duff, Joseph Noailles, and Daniel Ruiz. A block projection method for sparse
matrices. SIAM J. Sci. Comput., 13:47–70, 1992. (Cited on p. 275.)
[36] Mario Arioli, Iain S. Duff, and Peter P. M. de Rijk. On the augmented system approach to sparse
least-squares problems. Numer. Math., 55:667–684, 1989. (Cited on pp. 31, 104.)

[37] Mario Arioli, Iain Duff, and Daniel Ruiz. Stopping criteria for iterative solvers. SIAM J. Matrix
Anal. Appl., 13:138–144, 1992. (Cited on p. 299.)
[38] Mario Arioli, Iain S. Duff, Daniel Ruiz, and Miloud Sadkane. Block Lanczos techniques for ac-
celerating the block Cimmino method. SIAM J. Sci. Comput., 16:1478–1511, 1995. (Cited on
p. 275.)
[39] W. E. Arnoldi. The principle of minimized iteration in the solution of the matrix eigenvalue prob-
lem. Quart. Appl. Math., 9:17–29, 1951. (Cited on p. 301.)
[40] Stephen F. Ashby, Thomas A. Manteuffel, and Paul E. Saylor. A taxonomy for conjugate gradient
methods. SIAM J. Numer. Anal., 27:1542–1568, 1990. (Cited on p. 336.)
[41] Léon Autonne. Sur les groupes linéaires réels et orthogonaux. Bull. Soc. Math. France, 30:121–134,
1902. (Cited on p. 383.)
[42] Léon Autonne. Sur les matrices hypohermitiennes et les unitaires. C. R. Acad. Sci. Paris, 156:858–
860, 1913. (Cited on p. 13.)
[43] Léon Autonne. Sur les matrices hypohermitiennes et sur les matrices unitaires. Ann. Univ. Lyon
(N.S.), 38:1–77, 1915. (Cited on p. 13.)
[44] J. K. Avila and J. A. Tomlin. Solution of very large least squares problems by nested dissection on
a parallel processor. In J. F. Gentleman, editor, Proceedings of the Computer Science and Statistics
12th Annual Symposium on the Interface. University of Waterloo, Canada, 1979. (Cited on p. 209.)
[45] Haim Avron, Esmond Ng, and Sivan Toledo. Using perturbed QR factorizations to solve linear
least-squares problems. SIAM J. Matrix Anal. Appl., 31:674–693, 2009. (Cited on p. 263.)
[46] H. Avron, P. Maymounkov, and S. Toledo. Blendenpik: Supercharging LAPACK’s least squares
solver. SIAM J. Sci. Comput., 32:1217–1236, 2010. (Cited on p. 320.)
[47] Owe Axelsson. A generalized SSOR method. BIT Numer. Math., 12:443–467, 1972. (Cited on
p. 308.)
[48] Owe Axelsson. Iterative Solution Methods. Cambridge University Press, Cambridge, 1994. (Cited
on pp. 269, 275, 281, 295.)
[49] Marc Baboulin, Luc Giraud, Serge Gratton, and Julien Langou. Parallel tools for solving incremen-
tal dense least squares. Applications to space geodesy. J. Algorithms Comput. Tech., 3:117–133,
2009. (Cited on pp. 3, 112.)
[50] Marc Baboulin and Serge Gratton. Computing the conditioning of the components of a linear
least-squares solution. Numer. Linear Algebra Appl., 16:517–533, 2009. (Cited on p. 31.)
[51] Marc Baboulin and Serge Gratton. A contribution to the conditioning of the total least squares
problem. SIAM J. Matrix Anal. Appl., 32:685–699, 2011. (Cited on p. 226.)
[52] Brett W. Bader and Tamara G. Kolda. Algorithm 862: MATLAB tensor classes for fast algorithm
prototyping. ACM Trans. Math. Softw., 32:455–500, 2006. (Cited on p. 218.)
[53] Brett W. Bader and Tamara G. Kolda. Efficient MATLAB computations with sparse and factored
tensors. SIAM J. Sci. Comput., 30:205–231, 2007. (Cited on p. 218.)
[54] J. Baglama, D. Calvetti, and L. Reichel. IRBL: An implicitly restarted block-Lanczos method
for large-scale Hermitian eigenproblems. SIAM J. Sci. Comput., 24:1650–1677, 2003. (Cited on
pp. 370, 372.)
[55] James Baglama, Daniela Calvetti, and Lothar Reichel. Algorithm 827: irbleigs: A MATLAB
program for computing a few eigenpairs of a large sparse Hermitian matrix. ACM Trans. Math.
Softw., 29:337–348, 2003. (Cited on p. 374.)
[56] James Baglama and Lothar Reichel. Augmented implicitly restarted Lanczos bidiagonalization
methods. SIAM J. Sci. Comput., 27:19–42, 2005. (Cited on pp. 373, 374.)
[57] James Baglama and Lothar Reichel. Restarted block Lanczos bidiagonalization methods. Numer.
Algor., 43:251–272, 2006. (Cited on p. 374.)
434 Bibliography

[58] James Baglama, Lothar Reichel, and Daniel J. Richmond. An augmented LSQR method. Numer.
Algor., 64:263–293, 2013. (Cited on p. 337.)
[59] Zhaojun Bai and James W. Demmel. Computing the generalized singular value decomposition.
SIAM J. Sci. Comput., 14:1464–1486, 1993. (Cited on pp. 125, 356.)
[60] Zhaojun Bai and James Demmel. Using the matrix sign function to compute invariant subspaces.
SIAM J. Matrix Anal. Appl., 19:205–225, 1998. (Cited on pp. 380, 382.)
[61] Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst, editors. Tem-
plates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia,
2000. (Cited on pp. 349, 375.)
[62] Zhaojun Bai, James W. Demmel, and Ming Gu. An inverse free parallel spectral divide and conquer
algorithm for nonsymmetric eigenproblems. Numer. Math., 76:279–308, 1997. (Cited on p. 382.)
[63] Zhaojun Bai and Hongyuan Zha. A new preprocessing algorithm for the computation of the gen-
eralized singular value decomposition. SIAM J. Sci. Comput., 14:1007–1012, 1993. (Cited on
p. 125.)
[64] Zhong-Zhi Bai, Iain S. Duff, and Andrew J. Wathen. A class of incomplete orthogonal factorization
methods I: Methods and theories. BIT Numer. Math., 41:53–70, 2001. (Cited on p. 315.)
[65] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Minimizing communication in
numerical linear algebra. SIAM J. Matrix Anal. Appl., 32:866–901, 2011. (Cited on p. 114.)
[66] Y. Bard. Nonlinear Parameter Estimation. Academic Press, New York, 1974. (Cited on p. 391.)
[67] Jesse L. Barlow. Modification and maintenance of ULV decompositions. In Zlatko Drmač, Vjeran
Hari, Luka Sopta, Zvonimir Tutek, and Krěsimir Veselić, editors, Applied Mathematics and Scien-
tific Computing, pages 31–62. Springer, Boston, MA, 2003. (Cited on p. 155.)
[68] Jesse L. Barlow. Reorthogonalization for the Golub–Kahan–Lanczos bidiagonal reduction. Numer.
Math., 124:237–278, 2013. (Cited on p. 298.)
[69] Jesse L. Barlow. Block Gram–Schmidt downdating. ETNA, 43:163–187, 2014. (Cited on p. 152.)
[70] Jesse L. Barlow. Block modified Gram–Schmidt algorithm and their analysis. SIAM J. Matrix Anal.
Appl., 40:1257–1290, 2019. (Cited on p. 152.)
[71] Jesse L. Barlow, Nela Bosner, and Zlatko Drmač. A new stable bidiagonal reduction algorithm.
Linear Algebra Appl., 397:35–84, 2005. (Cited on p. 194.)
[72] Jesse L. Barlow and Hasan Erbay. A modifiable low-rank approximation of a matrix. Numer. Linear
Algebra Appl., 16:833–860, 2009. (Cited on p. 155.)
[73] Jesse L. Barlow, Hasan Erbay, and Ivan Slapničar. An alternative algorithm for the refinement of
ULV decompositions. SIAM J. Matrix Anal. Appl., 27:198–211, 2005. (Cited on p. 155.)
[74] Jesse L. Barlow and Susan L. Handy. The direct solution of weighted and equality constrained
least-squares problems. SIAM J. Sci. Statist. Comput., 9:704–716, 1988. (Cited on p. 160.)
[75] J. L. Barlow, N. K. Nichols, and R. J. Plemmons. Iterative methods for equality-constrained least
squares problems. SIAM J. Sci. Statist. Comput., 9:892–906, 1988. (Cited on p. 319.)
[76] Jesse L. Barlow and Alicja Smoktunowicz. Reorthogonalized block classical Gram–Schmidt. Nu-
mer. Math., 123:395–423, 2013. (Cited on pp. 109, 152.)
[77] Jesse L. Barlow, Alicja Smoktunowicz, and Hasan Erbay. Improved Gram–Schmidt downdating
methods. BIT Numer. Math., 45:259–285, 2005. (Cited on p. 152.)
[78] Jesse L. Barlow and Udaya B. Vemulapati. A note on deferred correction for equality constrained
least squares problems. SIAM J. Numer. Anal., 29:249–256, 1992. (Cited on p. 160.)
[79] Jesse L. Barlow, P. A. Yoon, and Hongyuan Zha. An algorithm and a stability theory for downdating
the ULV decomposition. BIT Numer. Math., 36:14–40, 1996. (Cited on p. 155.)
[80] Jesse L. Barlow, Hongyuan Zha, and P. A. Yoon. Stable Chasing Algorithms for Modifying Com-
plete and Partial Singular Value Decompositions. Tech. Report CSE-93-19, Department of Com-
puter Science, The Pennsylvania State University, State College, PA, 1993. (Cited on p. 363.)

[81] S. T. Barnard, Alan Pothen, and Horst D. Simon. A spectral algorithm for envelope reduction of
sparse matrices. Numer. Algor., 2:317–334, 1995. (Cited on p. 251.)
[82] Richard Barret, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Victor
Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst. Templates for the Solution of
Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, 1994. (Cited on
p. 269.)
[83] Anders Barrlund. Perturbation bounds on the polar decomposition. BIT Numer. Math., 30:101–113,
1990. (Cited on p. 385.)
[84] Anders Barrlund. Efficient solution of constrained least squares problems with Kronecker product
structure. SIAM J. Matrix Anal. Appl., 19:154–160, 1998. (Cited on p. 211.)
[85] Ian Barrodale and C. Phillips. Algorithm 495: Solution of an overdetermined system of linear
equations in the Chebyshev norm. ACM Trans. Math. Softw., 1:264–270, 1975. (Cited on p. 423.)
[86] Ian Barrodale and F. D. K. Roberts. Applications of mathematical programming to lp approxima-
tion. In J. B. Rosen, O. L Mangasarian, and K. Ritter, editors, Nonlinear Programming, pages
447–464. Academic Press, New York, 1970. (Cited on p. 421.)
[87] I. Barrodale and F. D. K. Roberts. An improved algorithm for discrete l1 linear approximation.
SIAM J. Numer. Anal., 10:839–848, 1973. (Cited on p. 421.)
[88] Ian Barrodale and F. D. K. Roberts. Algorithm 478: Solution of an overdetermined system of
equations in the ℓ1 norm. Comm. ACM, 17:319–320, 1974. (Cited on pp. 421, 422.)
[89] Richard H. Bartels and Andrew R. Conn. Algorithm 563: A program for linearly constrained
discrete ℓ1 problems. ACM Trans. Math. Softw., 6:609–614, 1980. (Cited on p. 421.)
[90] Richard H. Bartels, Andrew R. Conn, and J. W. Sinclair. Minimization techniques for piecewise
differentiable functions: The l1 solution to an overdetermined linear system. SIAM J. Numer. Anal.,
15:224–241, 1978. (Cited on p. 423.)
[91] Richard H. Bartels and Gene H. Golub. Chebyshev solution to an overdetermined system. Algo-
rithm 328. Comm. ACM, 11:428–430, 1968. (Cited on p. 423.)
[92] Richard H. Bartels and Gene H. Golub. Stable numerical methods for obtaining the Chebyshev
solution to an overdetermined system. Comm. ACM, 11:401–406, 1968. (Cited on p. 423.)
[93] W. Barth, R. S. Martin, and James H. Wilkinson. Calculation of the eigenvalues of a symmetric
tridiagonal matrix by the method of bisection. In F. L. Bauer et al., editors, Handbook for Automatic
Computation. Vol. II, Linear Algebra, pages 249–256. Springer, New York, 1971. Prepublished in
Numer. Math., 9:386–393, 1967. (Cited on p. 350.)
[94] Friedrich L. Bauer. Das Verfahren der Treppeniteration und verwandte Verfahren zur Lösung alge-
braischer Eigenwertprobleme. Z. Angew. Math. Phys., 8:214–235, 1957. (Cited on p. 367.)
[95] Friedrich L. Bauer. Genauigkeitsfragen bei der Lösung linearer Gleichungssysteme. Z. Angew.
Math. Mech., 46:409–421, 1966. (Cited on pp. 31, 44.)
[96] Amir Beck and Aharon Ben-Tal. On the solution of the Tikhonov regularization of the total least
squares problem. SIAM J. Optim., 17:98–118, 2006. (Cited on p. 226.)
[97] Stefania Bellavia, Jacek Gondzio, and Benedetta Morino. A matrix-free preconditioner for sparse
symmetric positive definite systems and least-squares problems. SIAM J. Sci. Comput., 35:A192–
A211, 2013. (Cited on p. 311.)
[98] Eugenio Beltrami. Sulle funzioni bilineari. Giorn. Mat. ad Uso degli Studenti Delle Universita,
11:98–106, 1873. (Cited on p. 13.)
[99] Adi Ben-Israel. On iterative methods for solving non-linear least squares problems over convex
sets. Israel J. Math., 5:211–224, 1967. (Cited on p. 395.)
[100] Adi Ben-Israel. The Moore of the Moore–Penrose inverse. Electronic J. Linear Algebra, 9:150–
157, 2002. (Cited on p. 16.)

[101] Adi Ben-Israel and Thomas N. E. Greville. Generalized Inverses: Theory and Applications.
Springer-Verlag, New York, second edition, 2003. (Cited on p. 16.)
[102] Adi Ben-Israel and S. J. Wersan. An elimination method for computing the generalized inverse of
an arbitrary matrix. J. Assoc. Comput. Mach., 10:532–537, 1963. (Cited on p. 88.)
[103] Aharon Ben-Tal and Marc Teboulle. A geometric property of the least squares solution of linear
equations. Linear Algebra Appl., 139:165–170, 1990. (Cited on p. 132.)
[104] Steven J. Benbow. Solving generalized least-squares problems with LSQR. SIAM J. Matrix Anal.
Appl., 21:166–177, 1999. (Cited on pp. 291, 331.)
[105] Commandant Benoit. Note sur une méthode de résolution des équations normales provenant de
l’application de la méthode des moindres carrés a un système d’équations linéaires en nombre
inférieur a celui des inconnues. Application de la méthode a la résolution d’un système defini
d’équations linéaires. (Procédé du Commandant Cholesky.) Bull. Géodésique, 2:67–77, 1924.
(Cited on p. 41.)
[106] Michele Benzi. Preconditioning techniques for large linear systems: A survey. J. Comput. Phys.,
182:418–477, 2002. (Cited on pp. 287, 314.)
[107] Michele Benzi. Gianfranco Cimmino’s contribution to Numerical Mathematics. In Ciclo di Con-
ferenze in Ricordo di Gianfranco Cimmino, pages 87–109, Bologna, 2005. Tecnoprint. (Cited on
p. 273.)
[108] Michele Benzi, Gene H. Golub, and Jörg Liesen. Numerical solution of saddle point problems.
Acta Numer., 14:1–138, 2005. (Cited on p. 117.)
[109] Michele Benzi, Carl D. Meyer, and Miroslav Tůma. A sparse approximate inverse preconditioner
for the conjugate gradient method. SIAM J. Sci. Comput., 17:1135–1149, 1995. (Cited on p. 315.)
[110] Michele Benzi and Miroslav Tůma. A robust incomplete factorization preconditioner for positive
definite matrices. Numer. Linear Algebra Appl., 10:385–400, 2003. (Cited on p. 315.)
[111] Michele Benzi and Miroslav Tůma. A robust preconditioner with low memory requirements for
large sparse least squares problems. SIAM J. Sci. Comput., 25:499–512, 2003. (Cited on p. 315.)
[112] Abraham Berman and Robert J. Plemmons. Cones and iterative methods for best least squares
solutions of linear systems. SIAM J. Numer. Anal., 11:145–154, 1974. (Cited on p. 270.)
[113] Michael W. Berry. A Fortran-77 Software Library for the Sparse Singular Value Decomposition.
Tech. Report CS-92-159, University of Tennessee, Knoxville, TN, 1992. (Cited on p. 376.)
[114] Michael W. Berry. SVDPACKC: Version 1.0 User’s Guide. Tech. Report CS-93-194, University of
Tennessee, Knoxville, TN, 1993. (Cited on p. 376.)
[115] Michael W. Berry. A survey of public-domain Lanczos-based software. In J. D. Brown, M. T. Chu,
D. C. Ellison, and Robert J. Plemmons, editors, Proceedings of the Cornelius Lanczos International
Centenary Conference, Raleigh, NC, Dec. 1993, pages 332–334. SIAM, Philadelphia, 1994. (Cited
on p. 376.)
[116] Michael W. Berry, M. Browne, A. Langville, V. C. Pauca, and Robert J. Plemmons. Algorithms
and applications for approximate nonnegative matrix factorizations. Comput. Data Statist. Anal.,
21:155–173, 2007. (Cited on p. 420.)
[117] Rajendra Bhatia and Kalyan K. Mukherjea. On weighted Löwdin orthogonalization. Int. J. Quan-
tum Chemistry, 29:1775–1778, 1986. (Cited on p. 383.)
[118] I. J. Bienaymé. Remarques sur les différences qui distinguent l’interpolation de M. Cauchy de la
méthode des moindre carrés et qui assurent la supériorité de cette méthode. C. R. Acad. Sci. Paris,
37:5–13, 1853. (Cited on p. 64.)
[119] M. Bierlair, Philippe Toint, and D. Tuyttens. On iterative algorithms for linear least-squares prob-
lems with bound constraints. Linear Algebra Appl., 143:111–143, 1991. (Cited on p. 420.)
[120] David Bindel, James W. Demmel, William M. Kahan, and Osni Marques. On computing Givens
rotations reliably and efficiently. ACM Trans. Math. Softw., 28:206–238, 2002. (Cited on p. 50.)

[121] Christian H. Bischof and Gregorio Quintana-Ortí. Algorithm 782: Codes for rank-revealing QR
factorizations of dense matrices. ACM Trans. Math. Softw., 24:254–257, 1998. (Cited on p. 109.)
[122] Christian H. Bischof and Gregorio Quintana-Ortí. Computing rank-revealing QR factorizations of
dense matrices. ACM Trans. Math. Softw., 24:226–253, 1998. (Cited on p. 109.)
[123] Christian Bischof and Charles Van Loan. The WY representation for products of Householder
matrices. SIAM J. Sci. Statist. Comput., 8:s2–s13, 1987. (Cited on p. 106.)
[124] Åke Björck. Iterative refinement of linear least squares solutions I. BIT Numer. Math., 7:257–278,
1967. (Cited on pp. 93, 101.)
[125] Åke Björck. Solving linear least squares problems by Gram–Schmidt orthogonalization. BIT Nu-
mer. Math., 7:1–21, 1967. (Cited on pp. 28, 62, 70, 108.)
[126] Åke Björck. Iterative refinement of linear least squares solutions II. BIT Numer. Math., 8:8–30,
1968. (Cited on pp. 103, 160.)
[127] Åke Björck. Methods for sparse least squares problems. In J.R. Bunch and D. J. Rose, editors,
Sparse Matrix Computations, pages 177–199. Academic Press, New York, 1976. (Cited on p. 317.)
[128] Åke Björck. SSOR preconditioning methods for sparse least squares problems. In J. F. Gentle-
man, editor, Proceedings of the Computer Science and Statistics 12th Annual Symposium on the
Interface, pages 21–25. University of Waterloo, Canada, 1979. (Cited on pp. 307, 309.)
[129] Åke Björck. Use of conjugate gradients for solving linear least squares problems. In I. S. Duff,
editor, Conjugate Gradient Methods and Similar Techniques, pages 49–71. Computer Science and
Systems Division, Harwell, AERE- R 9636, 1979. (Cited on p. 294.)
[130] Åke Björck. Stability analysis of the method of semi-normal equations for least squares problems.
Linear Algebra Appl., 88/89:31–48, 1987. (Cited on pp. 105, 105.)
[131] Åke Björck. A bidiagonalization algorithm for solving ill-posed systems of linear equations. BIT
Numer. Math., 28:659–670, 1988. (Cited on p. 332.)
[132] Åke Björck. Component-wise perturbation analysis and error bounds for linear least squares solu-
tions. BIT Numer. Math., 31:238–244, 1991. (Cited on pp. 31, 99.)
[133] Åke Björck. Pivoting and stability in the augmented system method. In D. F. Griffiths and G. A.
Watson, editors, Numerical Analysis 1991: Proceedings of the 14th Dundee Biennal Conference,
June 1991, Pitman Research Notes Math. Ser. 260, pages 1–16. Longman Scientific and Technical,
Harlow, UK, 1992. (Cited on p. 93.)
[134] Åke Björck. Numerics of Gram–Schmidt orthogonalization. Linear Algebra Appl., 197/198:297–
316, 1994. (Cited on pp. 67, 108.)
[135] Åke Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996. (Cited
on p. 131.)
[136] Åke Björck. QR factorization of the Jacobian in some structured nonlinear least squares problem. In
Sabine Van Huffel and Philippe Lemmerling, editors, Total Least Squares and Errors-in-Variables
Modeling. Analysis, Algorithms and Applications, pages 225–234. Kluwer Academic Publishers,
Dordrecht, 2002. (Cited on p. 410.)
[137] Åke Björck. Stability of two direct methods for bidiagonalization and partial least squares. SIAM
J. Matrix Anal. Appl., 35:279–291, 2014. (Cited on pp. 200, 201.)
[138] Åke Björck and C. Bowie. An iterative algorithm for computing the best estimate of an orthogonal
matrix. SIAM J. Numer. Anal., 8:358–364, 1971. (Cited on p. 383.)
[139] Åke Björck and Iain S. Duff. A direct method for the solution of sparse linear least squares prob-
lems. Linear Algebra Appl., 34:43–67, 1980. (Cited on p. 88.)
[140] Åke Björck and Tommy Elfving. Algorithms for confluent Vandermonde systems. Numer. Math.,
21:130–137, 1973. (Cited on p. 238.)
[141] Åke Björck and Tommy Elfving. Accelerated projection methods for computing pseudoinverse
solutions of linear systems. BIT Numer. Math., 19:145–163, 1979. (Cited on pp. 273, 308.)
438 Bibliography

[142] Åke Björck, Tommy Elfving, and Zdeněk Strakoš. Stability of conjugate gradient and Lanczos
methods for linear least squares problems. SIAM J. Matrix Anal. Appl., 19:720–736, 1998. (Cited
on p. 294.)
[143] Åke Björck and Gene H. Golub. Iterative refinement of linear least squares solution by Householder
transformation. BIT Numer. Math., 7:322–337, 1967. (Cited on pp. 101, 102, 157.)
[144] Åke Björck and Gene H. Golub. Numerical methods for computing angles between subspaces.
Math. Comp., 27:579–594, 1973. (Cited on pp. 17, 19.)
[145] Åke Björck, Eric Grimme, and Paul Van Dooren. An implicit shift bidiagonalization algorithm for
ill-posed systems. BIT Numer. Math., 34:510–534, 1994. (Cited on pp. 332, 373.)
[146] Åke Björck and Sven J. Hammarling. A Schur method for the square root of a matrix. Linear
Algebra Appl., 52/53:127–140, 1983. (Cited on p. 378.)
[147] Åke Björck, Pinar Heggernes, and Pontus Matstoms. Methods for large scale total least squares
problems. SIAM J. Matrix Anal. Appl., 22:413–429, 2000. (Cited on p. 224.)
[148] Åke Björck and Ulf G. Indahl. Fast and stable partial least squares modelling: A benchmark study
with theoretical comments. J. Chemom., 31:e2898, 2017. (Cited on pp. 202, 203, 203.)
[149] Å. Björck and C. C. Paige. Loss and recapture of orthogonality in the modified Gram–Schmidt
algorithm. SIAM J. Matrix Anal. Appl., 13:176–190, 1992. (Cited on pp. 66, 70.)
[150] Åke Björck and Christopher C. Paige. Solution of augmented linear systems using orthogonal
factorizations. BIT Numer. Math., 34:1–26, 1994. (Cited on pp. 68, 68.)
[151] Å. Björck, H. Park, and L. Eldén. Accurate downdating of least squares solutions. SIAM J. Matrix
Anal. Appl., 15:549–568, 1994. (Cited on p. 147.)
[152] Åke Björck and Victor Pereyra. Solution of Vandermonde system of equations. Math. Comp.,
24:893–903, 1970. (Cited on p. 238.)
[153] Åke Björck and Jin Yun Yuan. Preconditioners for least squares problems by LU factorization.
ETNA, 8:26–35, 1999. (Cited on p. 318.)
[154] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. W. Demmel, I. Dhillon, J. Dongarra, S.
Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users’
Guide. SIAM, Philadelphia, 1997. ISBN 0-89871-397-8. (Cited on p. 114.)
[155] L. Susan Blackford, James W. Demmel, Jack Dongarra, Iain Duff, Sven J. Hammarling, Greg
Henry, Michael Heroux, Linda Kaufman, Andrew Lumsdaine, Antoine Petitet, Roldan Pozo, Karin
Remington, and R. Clint Whaley. An updated set of Basic Linear Algebra Subprograms (BLAS).
ACM Trans. Math. Softw., 28:135–151, 2002. (Cited on p. 114.)
[156] Elena Y. Bobrovnikova and Stephen A. Vavasis. Accurate solution of weighted least squares by
iterative methods. SIAM J. Matrix Anal. Appl., 22:1153–1174, 2001. (Cited on p. 133.)
[157] Paul T. Boggs. The convergence of the Ben-Israel iteration for nonlinear least squares problems.
Math. Comp., 30:512–522, 1976. (Cited on pp. 395, 397.)
[158] Paul T. Boggs, Richard H. Byrd, and Robert B. Schnabel. A stable and efficient algorithm for
nonlinear orthogonal distance regression. SIAM J. Sci. Statist. Comput., 8:1052–1078, 1987. (Cited
on pp. 410, 410, 410.)
[159] P. T. Boggs, J. R. Donaldson, R. H. Byrd, and R. B. Schnabel. ODRPACK software for weighted
orthogonal distance regression. ACM Trans. Math. Softw., 15:348–364, 1989. (Cited on p. 410.)
[160] R. F. Boisvert, Roldan Pozo, K. Remington, R. Barret, and Jack J. Dongarra. Matrix Market: A
web resource for test matrix collections. In R. F. Boisvert, editor, Quality of Numerical Software,
Assessment and Enhancement, pages 125–137. Chapman & Hall, London, UK, 1997. (Cited on
pp. 244, 295.)
[161] Adam W. Bojanczyk and Richard P. Brent. Parallel solution of certain Toeplitz least-squares prob-
lems. Linear Algebra Appl., 77:43–60, 1986. (Cited on p. 241.)

[162] Adam W. Bojanczyk, Richard P. Brent, and Frank R. de Hoog. QR factorization of Toeplitz matri-
ces. Numer. Math., 49:81–94, 1986. (Cited on p. 239.)
[163] Adam W. Bojanczyk, Richard P. Brent, and Frank R. de Hoog. A Weakly Stable Algorithm for Gen-
eral Toeplitz Matrices. Tech. Report TR-CS-93-15, Cornell University, Ithaca, NY, 1993. (Cited
on p. 241.)
[164] A. W. Bojanczyk, R. P. Brent, P. van Dooren, and F. de Hoog. A note on downdating the Cholesky
factorization. SIAM J. Sci. Statist. Comput., 8:210–221, 1987. (Cited on p. 135.)
[165] Adam W. Bojanczyk, Nicholas J. Higham, and Harikrishna Patel. The equality constrained indefi-
nite least squares problem: Theory and algorithms. BIT Numer. Math., 43:505–517, 2003. (Cited
on p. 136.)
[166] Adam Bojanczyk, Nicholas J. Higham, and Harikrishna Patel. Solving the indefinite least squares
problem by hyperbolic QR factorization. SIAM J. Matrix Anal. Appl., 24:914–931, 2003. (Cited
on pp. 133, 134, 135.)
[167] Adam W. Bojanczyk and Adam Lutoborski. Computation of the Euler angles of a symmetric 3 × 3
matrix. SIAM J. Matrix Anal. Appl., 12:41–48, 1991. (Cited on p. 51.)
[168] Adam W. Bojanczyk and Allan O. Steinhardt. Stability analysis of a Householder-based algorithm
for downdating the Cholesky factorization. SIAM J. Sci. Statist. Comput., 12:1255–1265, 1991.
(Cited on p. 148.)
[169] D. Boley and Gene H. Golub. A survey of matrix inverse eigenvalue problems. Inverse Problems,
3:595–622, 1987. (Cited on p. 230.)
[170] F. L. Bookstein. Fitting conic sections to scattered data. Comput. Graphics Image Process., 9:56–
71, 1979. (Cited on p. 414.)
[171] Tibor Boros, Thomas Kailath, and Vadim Olshevsky. A fast parallel Björck–Pereyra-type algorithm
for parallel solution of Cauchy linear systems. Linear Algebra Appl., 302/303:265–293, 1999.
(Cited on p. 239.)
[172] C. Boutsidis and E. Gallopoulos. SVD-based initialization: A head start for nonnegative matrix
factorization. Pattern Recognition, 41:1350–1362, 2008. (Cited on p. 420.)
[173] David W. Boyd. The power method for ℓp norms. Linear Algebra Appl., 9:95–101, 1974. (Cited
on p. 95.)
[174] R. N. Bracewell. The fast Hartley transform. Proc. IEEE., 72:1010–1018, 1984. (Cited on p. 320.)
[175] R. Bramley and A. Sameh. Row projection methods for large nonsymmetric linear systems. SIAM
J. Sci. Statist. Comput., 13:168–193, 1992. (Cited on p. 275.)
[176] Matthew Brand. Fast low-rank modification of the thin singular value decomposition. Linear
Algebra Appl., 415:20–30, 2006. (Cited on p. 363.)
[177] Richard P. Brent. Algorithm 524: A Fortran multiple-precision arithmetic package. ACM Trans.
Math. Softw., 4:71–81, 1978. (Cited on p. 33.)
[178] Richard P. Brent. A Fortran multiple-precision arithmetic package. ACM Trans. Math. Softw.,
4:57–70, 1978. (Cited on p. 33.)
[179] Richard P. Brent. Old and new algorithms for Toeplitz systems. In Franklin T. Luk, editor, Advanced
Algorithms and Architectures for Signal Processing III, SPIE Proceeding Series, Washington, pages
2–9, 1988. (Cited on p. 241.)
[180] Rasmus Bro. PARAFAC. Tutorial and applications. Chemom. Intell. Lab. Syst., 38:149–171, 1997.
Special Issue: 2nd International Conference in Chemometrics (INCINC’96). (Cited on pp. 216,
218.)
[181] Rasmus Bro and Sijmen de Jong. A fast non-negativity-constrained least squares algorithm. J.
Chemometrics, 11:393–401, 1997. (Cited on p. 167.)
[182] Peter N. Brown and Homer F. Walker. GMRES on (nearly) singular systems. SIAM J. Matrix Anal.
Appl., 18:37–51, 1997. (Cited on p. 334.)

[183] Rafael Bru, José Marín, José Mas, and Miroslav Tůma. Preconditioned iterative methods for solving
linear least squares problems. SIAM J. Sci. Comput., 36:A2002–A2022, 2014. (Cited on pp. 306,
312.)
[184] Zvonimir Bujanović and Zlatko Drmač. A contribution to the theory and practice of the block
Kogbetliantz method for computing the SVD. BIT Numer. Math., 52:827–849, 2012. (Cited on
p. 357.)
[185] J. R. Bunch. The weak and strong stability of algorithms in numerical linear algebra. Linear
Algebra Appl., 88/89:49–66, 1987. (Cited on p. 37.)
[186] James R. Bunch. Stability of methods for solving Toeplitz systems of equations. SIAM J. Sci.
Statist. Comput., 6:349–364, 1985. (Cited on p. 241.)
[187] James R. Bunch and Linda Kaufman. Some stable methods for calculating inertia and solving
symmetric linear systems. Math. Comp., 31:163–179, 1977. (Cited on p. 92.)
[188] James R. Bunch and C. P. Nielsen. Updating the singular value decomposition. Numer. Math.,
31:111–129, 1978. (Cited on pp. 361, 363.)
[189] Angelika Bunse-Gerstner, Valia Guerra-Ones, and Humberto Madrid de La Vega. An improved
preconditioned LSQR for discrete ill-posed problems. Math. Comput. Simul., 73:65–75, 2006.
(Cited on p. 322.)
[190] C. S. Burrus. Iterative Reweighted Least Squares. OpenStax-CNX Module m45285, 2012. (Cited
on p. 427.)
[191] C. S. Burrus, J. A. Barreto, and I. W. Selesnick. Iterative reweighted least-squares design of FIR
filters. IEEE Trans. Signal Process., 42:2926–2936, 1994. (Cited on p. 425.)
[192] Peter Businger. Updating a singular value decomposition. BIT Numer. Math., 10:376–385, 1970.
(Cited on p. 361.)
[193] P. Businger and G. H. Golub. Linear least squares solutions by Householder transformations. Nu-
mer. Math., 7:269–276, 1965. Also published as Contribution I/8 in Handbook for Automatic Com-
putation, Vol. 2, F. L. Bauer, ed., Springer, Berlin, 1971. (Cited on pp. 56, 101.)
[194] Peter Businger and Gene H. Golub. Algorithm 358: Singular value decomposition of a complex
matrix. Comm. ACM, 12:564–565, 1969. (Cited on p. 349.)
[195] Alfredo Buttari. Fine-grained multithreading for the multifrontal QR factorization of sparse matri-
ces. SIAM J. Sci. Comput., 35:C323–C345, 2013. (Cited on pp. 112, 258.)
[196] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization
for multicore architectures. Concurrency and Computation: Practice and Experience, 20:1573–
1590, 2008. (Cited on p. 112.)
[197] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for
bound constrained optimization. SIAM J. Sci. Comput., 16:1190–1208, 1995. (Cited on p. 420.)
[198] Henri Calandra, Serge Gratton, Elisa Riccietti, and Xavier Vasseur. On iterative solution of the
extended normal equations. SIAM J. Matrix Anal. Appl., 41:1571–1589, 2020. (Cited on p. 331.)
[199] Daniela Calvetti, Per Christian Hansen, and Lothar Reichel. L-curve curvature bounds via Lanczos
bidiagonalization. ETNA, 14:20–35, 2002. (Cited on p. 178.)
[200] Daniela Calvetti, Bryan Lewis, and Lothar Reichel. GMRES-type methods for inconsistent systems.
Linear Algebra Appl., 316:157–169, 2000. (Cited on p. 334.)
[201] Daniela Calvetti, Bryan Lewis, and Lothar Reichel. On the choice of subspace for iterative methods
for linear discrete ill-posed problems. Int. J. Appl. Math. Comput. Sci., 11:1060–1092, 2001. (Cited
on p. 334.)
[202] Daniela Calvetti, Bryan Lewis, and Lothar Reichel. On the regularization properties of the GMRES
method. Numer. Math., 91:605–625, 2002. (Cited on p. 333.)
[203] Daniela Calvetti, Serena Morigi, Lothar Reichel, and Fiorella Sgallari. Tikhonov regularization
and the L-curve for large discrete ill-posed problems. J. Comput. Appl. Math., 123:423–446, 2000.
(Cited on p. 333.)

[204] Daniela Calvetti and Lothar Reichel. Tikhonov regularization of large linear problems. BIT Numer.
Math., 43:283–484, 2003. (Cited on p. 335.)
[205] Emmanuel J. Candès, Xiadong Li, Yi Ma, and John Wright. Robust principal component analysis?
J. ACM, 58:11, 2011. (Cited on p. 212.)
[206] Emmanuel J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal recon-
struction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52:335–371,
2006. (Cited on p. 427.)
[207] Emmanuel J. Candès, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted l1 minimiza-
tion. J. Fourier Anal. Appl., 14:877–905, 2008. (Cited on pp. 428, 428.)
[208] Jesús Cardenal, Iain S. Duff, and José M. Jiménez. Solution of sparse quasi-square rectangular
systems by Gaussian elimination. IMA J. Numer. Anal., 18:165–177, 1998. (Cited on pp. 86, 87.)
[209] Erin Carson and Nicholas J. Higham. Accelerating the solution of linear systems by iterative re-
finement in three precisions. SIAM J. Sci. Comput., 40:A817–A847, 2018. (Cited on p. 104.)
[210] Erin Carson, Nicholas J. Higham, and Srikara Pranesh. Three-precision GMRES-based iterative
refinement for least squares problems. SIAM J. Sci. Comput., 42:A4063–A4083, 2020. (Cited on
p. 104.)
[211] Erin Carson, Kathryn Lund, Miroslav Rozložník, and Stephen Thomas. Block Gram–Schmidt
algorithms and their stability properties. Linear Algebra Appl., 638:150–195, 2022. (Cited on
p. 109.)
[212] A. Cauchy. Mémoire sur l’interpolation. J. Math. Pures Appl., 2:193–205, 1837. (Cited on p. 64.)
[213] Yair Censor. Row-action methods for huge and sparse systems and their applications. SIAM Rev.,
23:444–466, 1981. (Cited on p. 274.)
[214] Yair Censor and S. A. Zenios. Parallel Optimization. Theory, Algorithms, and Applications. Oxford
University Press, Oxford, 1997. (Cited on p. 274.)
[215] Aleš Černý. Characterization of the oblique projector U (V U )† V with applications to constrained
least squares. Linear Algebra Appl., 431:1564–1570, 2009. (Cited on p. 119.)
[216] Frančoise Chaitin-Chatelin and Serge Gratton. On the condition numbers associated with the polar
factorization of a matrix. Numer. Linear Algebra Appl., 7:337–354, 2000. (Cited on p. 385.)
[217] J. M. Chambers. Regression updating. J. Amer. Statist. Assoc., 66:744–748, 1971. (Cited on
p. 135.)
[218] Raymond H. Chan, James G. Nagy, and Robert J. Plemmons. FFT-based preconditioners for
Toeplitz-block least squares problems. SIAM J. Numer. Anal., 30:1740–1768, 1993. (Cited on
p. 324.)
[219] Raymond H. Chan, James G. Nagy, and Robert J. Plemmons. Circulant preconditioned Toeplitz
least squares iterations. SIAM J. Matrix Anal. Appl., 15:80–97, 1994. (Cited on p. 324.)
[220] Raymond H. Chan and Michael K. Ng. Conjugate gradient methods for Toeplitz systems. SIAM
Rev., 38:427–482, 1996. (Cited on p. 325.)
[221] Raymond H.-F. Chan and Xiao-Qing Jin. An Introduction to Iterative Toeplitz Solvers. SIAM,
Philadelphia, 2007. (Cited on p. 325.)
[222] Tony F. Chan. An improved algorithm for computing the singular value decomposition. ACM
Trans. Math. Softw., 8:72–83, 1982. (Cited on pp. 192, 193, 347.)
[223] Tony F. Chan. On the existence and computation of LU factorizations with small pivots. Math.
Comput., 42:535–547, 1984. (Cited on p. 89.)
[224] Tony F. Chan. Rank revealing QR factorizations. Linear Algebra Appl., 88/89:67–82, 1987. (Cited
on pp. 76, 81, 82.)
[225] Tony F. Chan. An optimal circulant preconditioner for Toeplitz systems. SIAM J. Sci. Statist.
Comput., 9:766–771, 1988. (Cited on p. 324.)
[226] Tony F. Chan and D. E. Foulser. Effectively well-conditioned linear systems. SIAM J. Sci. Statist.
Comput., 9:963–969, 1988. (Cited on p. 172.)
[227] Tony F. Chan and Per Christian Hansen. Computing truncated singular value decomposition least
squares solutions by rank revealing QR-factorizations. SIAM J. Sci. Statist. Comput., 11:519–530,
1990. (Cited on pp. 83, 174.)
[228] Tony F. Chan and Per Christian Hansen. Some applications of the rank revealing QR factorization.
SIAM J. Sci. Statist. Comput., 13:727–741, 1992. (Cited on p. 174.)
[229] Tony F. Chan and Per Christian Hansen. Low-rank revealing QR factorizations. Numer. Linear
Algebra Appl., 1:33–44, 1994. (Cited on p. 83.)
[230] W. M. Chan and Alan George. A linear time implementation of the reverse Cuthill–McKee algo-
rithm. BIT Numer. Math., 20:8–14, 1980. (Cited on p. 251.)
[231] Shivkumar Chandrasekaran and Ilse C. F. Ipsen. On rank-revealing factorizations. SIAM J. Matrix
Anal. Appl., 15:592–622, 1994. (Cited on pp. 77, 83.)
[232] S. Chandrasekaran, M. Gu, and A. H. Sayed. A stable and efficient algorithm for the indefinite
linear least-squares problem. SIAM J. Matrix Anal. Appl., 20:354–362, 1998. (Cited on pp. 133,
133.)
[233] S. Chandrasekaran and I. C. F. Ipsen. Analysis of a QR algorithm for computing singular values.
SIAM J. Matrix Anal. Appl., 16:520–535, 1995. (Cited on p. 343.)
[234] Xiao-Wen Chang and Christopher C. Paige. Perturbation analysis for the Cholesky downdating
problem. SIAM J. Matrix Anal. Appl., 19:429–443, 1998. (Cited on p. 148.)
[235] X.-W. Chang, C. C. Paige, and D. Titley-Peloquin. Stopping criteria for the iterative solution of
linear least squares problems. SIAM J. Matrix Anal. Appl., 31:831–852, 2009. (Cited on p. 299.)
[236] J. Charlier, M. Vanbegin, and P. Van Dooren. On efficient implementations of Kogbetliantz’s algo-
rithm for computing the singular value decomposition. Numer. Math., 52:279–300, 1988. (Cited
on p. 357.)
[237] Rick Chartrand. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal
Process. Lett., 14:707–710, 2007. (Cited on p. 428.)
[238] R. Chartrand and Wotao Yin. Iteratively reweighted algorithms for compressive sensing. In Pro-
ceedings IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), pages 3869–3872, 2008.
(Cited on p. 428.)
[239] Donghui Chen and Robert J. Plemmons. Nonnegativity constraints in numerical analysis. In Pro-
ceedings of the Symposium on the Birth of Numerical Analysis, pages 109–140, Katholieke Univer-
siteit, Leuven, 2007. (Cited on p. 167.)
[240] Haibin Chen, Guoyin Li, and Liqun Qi. Further results on Cauchy tensors and Hankel tensors.
Appl. Math. Comput., 275:50–62, 2016. (Cited on p. 218.)
[241] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis
pursuit. SIAM Rev., 43:129–159, 2001. (Cited on p. 427.)
[242] Y. T. Chen. Iterative Methods for Linear Least Squares Problems. Tech. Report CS-75-04, Depart-
ment of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 1975. (Cited on
p. 285.)
[243] J. Choi, James W. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley,
D. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for distributed mem-
ory computers—Design issues and performance. Comput. Phys. Comm., 97:1–15, 1996. (Cited on
p. 114.)
[244] Sou-Cheng T. Choi. Iterative Methods for Singular Linear Equations and Least Squares Problems.
Ph.D. thesis, ICME, Stanford University, Stanford, CA, 2006. (Cited on p. 283.)
[245] Sou-Cheng T. Choi, Christopher C. Paige, and Michael A. Saunders. MINRES-QLP: A Krylov
subspace method for indefinite or singular symmetric matrices. SIAM J. Sci. Comput., 33:1810–
1836, 2011. (Cited on pp. 300, 300.)
[246] Sou-Cheng T. Choi and Michael A. Saunders. Algorithm 937: MINRES-QLP for symmetric and
Hermitian linear equations and least-squares problems. ACM Trans. Math. Softw., 40:16:1–12,
2014. (Cited on p. 300.)
[247] Edmond Chow and Yousef Saad. Approximate inverse preconditioners via sparse-sparse iterations.
SIAM J. Sci. Comput., 19:995–1023, 1998. (Cited on p. 314.)
[248] Moody T. Chu, Robert E. Funderlic, and Gene H. Golub. A rank-one reduction formula and its
application to matrix factorization. SIAM Rev., 37:512–530, 1995. (Cited on p. 139.)
[249] J. Chun, T. Kailath, and H. Lev-Ari. Fast parallel algorithms for QR and triangular factorizations.
SIAM J. Sci. Statist. Comput., 8:899–913, 1987. (Cited on p. 240.)
[250] Gianfranco Cimmino. Calcolo approssimato per le soluzioni dei sistemi di equazioni lineari. Ricerca
Sci. II, 9:326–333, 1938. (Cited on p. 273.)
[251] D. I. Clark and M. R. Osborne. Finite algorithms for Huber's M-estimator. SIAM J. Sci. Statist.
Comput., 7:72–85, 1986. (Cited on p. 425.)
[252] Alan K. Cline. Rate of convergence of Lawson’s algorithm. Math. Comp., 26:167–176, 1972.
(Cited on p. 423.)
[253] A. K. Cline. An elimination method for the solution of linear least squares problems. SIAM J.
Numer. Anal., 10:283–289, 1973. (Cited on pp. 88, 130.)
[254] Alan K. Cline. The Transformation of a Quadratic Programming Problem into Solvable Form.
Tech. Report ICASE 75-14, NASA, Langley Research Center, Hampton, VA, 1975. (Cited on
p. 162.)
[255] Alan K. Cline, A. R. Conn, and Charles F. Van Loan. Generalizing the LINPACK condition esti-
mator. In J. P. Hennart, editor, Numerical Analysis, volume 909 of Lecture Notes in Mathematics,
pages 73–83. Springer-Verlag, Berlin, 1982. (Cited on p. 95.)
[256] A. K. Cline, C. B. Moler, G. W. Stewart, and J. H. Wilkinson. An estimate for the condition number
of a matrix. SIAM J. Numer. Anal., 16:368–375, 1979. (Cited on pp. 95, 95.)
[257] Randell E. Cline. Representations for the generalized inverse of sums of matrices. SIAM J. Numer.
Anal., 2:99–114, 1965. (Cited on p. 139.)
[258] R. E. Cline and R. J. Plemmons. l2-solutions to underdetermined linear systems. SIAM Rev.,
18:92–106, 1976. (Cited on p. 88.)
[259] E. S. Coakley, V. Rokhlin, and M. Tygert. A fast randomized algorithm for orthogonal projection.
SIAM J. Sci. Comput., 33:849–868, 2011. (Cited on p. 319.)
[260] T. F. Coleman and Y. Li. A global and quadratically convergent affine scaling method for linear l1
problems. Math. Program., 56:189–222, 1992. (Cited on p. 423.)
[261] Thomas F. Coleman and Yuying Li. A global and quadratically convergent method for linear l∞
problems. SIAM J. Numer. Anal., 29:1166–1186, 1992. (Cited on p. 423.)
[262] Tom F. Coleman, Alan Edenbrandt, and John R. Gilbert. Predicting fill for sparse orthogonal
factorization. J. Assoc. Comput. Mach., 33:517–532, 1986. (Cited on pp. 253, 264.)
[263] Pierre Comon, Gene H. Golub, Lek-Heng Lim, and Bernard Mourrain. Symmetric tensors and
symmetric rank. SIAM J. Matrix Anal. Appl., 30:1254–1279, 2008. (Cited on p. 214.)
[264] P. Comon, J. M. F. ten Berge, Lieven De Lathauwer, and J. Castaing. Generic and typical ranks of
multiway arrays. Linear Algebra Appl., 430:2997–3007, 2009. (Cited on p. 218.)
[265] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. A globally convergent augmented
Lagrangian algorithm for optimization with general constraints and simple bounds. SIAM J. Numer.
Anal., 28:545–572, 1991. (Cited on p. 400.)
[266] Andy R. Conn, Nick I. M. Gould, and Philippe L. Toint. LANCELOT: A Fortran Package for Large-
Scale Nonlinear Optimization. (Release A). Springer-Verlag, Berlin, 1992. (Cited on p. 400.)
[267] J. W. Cooley. How the FFT gained acceptance. In S. G. Nash, editor, A History of Scientific
Computing, pages 133–140. Addison-Wesley, Reading, MA, 1990. (Cited on p. 237.)
[268] J. W. Cooley, P. A. W. Lewis, and P. D. Welch. The fast Fourier transform and its application. IEEE
Trans. Education, E-12:27–34, 1969. (Cited on p. 237.)
[269] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier
series. Math. Comp., 19:297–301, 1965. (Cited on p. 235.)
[270] Corrado Corradi. A note on the solution of separable nonlinear least-squares problems with separa-
ble nonlinear equality constraints. SIAM J. Numer. Anal., 18:1134–1138, 1981. (Cited on p. 404.)
[271] A. J. Cox and Nicholas J. Higham. Backward error bounds for constrained least squares problems.
BIT Numer. Math., 39:210–227, 1999. (Cited on pp. 99, 157.)
[272] Anthony J. Cox. Stability of Algorithms for Solving Weighted and Constrained Least Squares
Problems. Ph.D. thesis, University of Manchester, Department of Mathematics, Manchester, UK,
1997. (Cited on p. 131.)
[273] Anthony J. Cox and Nicholas J. Higham. Stability of Householder QR factorization for weighted
least squares problems. In D. F. Griffiths, D. J. Higham, and G. A. Watson, editors, Numerical
Analysis 1997: Proceedings of the 17th Dundee Biennial Conference, Pitman Research Notes Math.
Ser. 380, pages 57–73. Longman Scientific and Technical, Harlow, UK, 1998. (Cited on p. 131.)
[274] Anthony J. Cox and Nicholas J. Higham. Accuracy and stability of the null space method for
solving the equality constrained least squares problem. BIT Numer. Math., 39:34–50, 1999. (Cited
on pp. 157, 160.)
[275] Maurice G. Cox. The least-squares solution of linear equations with block-angular observation
matrix. In Maurice G. Cox and Sven J. Hammarling, editors, Reliable Numerical Computation,
pages 227–240. Oxford University Press, UK, 1990. (Cited on pp. 208, 209.)
[276] Trevor F. Cox and Michael A. A. Cox. Multidimensional Scaling. Chapman and Hall, London,
1994. (Cited on p. 386.)
[277] E. J. Craig. The N-step iteration procedure. J. Math. Phys., 34:65–73, 1955. (Cited on p. 284.)
[278] P. Craven and Grace Wahba. Smoothing noisy data with spline functions. Numer. Math., 31:377–
403, 1979. (Cited on pp. 178, 189.)
[279] Jane Cullum, Ralph A. Willoughby, and Mark Lake. A Lanczos algorithm for computing singular
values and vectors of large matrices. SIAM J. Sci. Statist. Comput., 4:197–215, 1983. (Cited on
p. 376.)
[280] J. J. M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numer.
Math., 36:177–195, 1981. (Cited on p. 358.)
[281] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In ACM ’69:
Proc. 24th Nat. Conf., pages 157–172, New York, 1969. ACM. (Cited on p. 251.)
[282] George Cybenko. Fast Toeplitz orthogonalization using inner products. SIAM J. Sci. Statist. Com-
put., 8:734–740, 1987. (Cited on p. 241.)
[283] Germund Dahlquist and Åke Björck. Numerical Methods. Prentice-Hall Inc., Englewood Cliffs,
NJ, 1974. Reprinted in 2003 by Dover Publications, Mineola, NJ. (Cited on p. 231.)
[284] Germund Dahlquist and Åke Björck. Numerical Methods in Scientific Computing, Volume I. SIAM,
Philadelphia, 2008. (Cited on pp. 33, 378.)
[285] James W. Daniel, William B. Gragg, Linda Kaufman, and G. W. Stewart. Reorthogonalization and
stable algorithms for updating the Gram-Schmidt QR factorization. Math. Comp., 30:772–795,
1976. (Cited on pp. 69, 150.)
[286] Ingrid Daubechies, Ronald DeVore, Massimo Fornasier, and C. Sinan Güntürk. Iteratively re-
weighted least squares minimization for sparse recovery. Comm. Pure Appl. Math., 63:1–38, 2010.
(Cited on p. 425.)
[287] E. R. Davidson. The iterative calculation of a few of the lowest eigenvalues and corresponding
eigenvectors of large real symmetric matrices. J. Comput. Phys., 17:87–94, 1975. (Cited on
p. 374.)
[288] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J.
Numer. Anal., 7:1–46, 1970. (Cited on p. 19.)
[289] Timothy A. Davis. Direct Methods for Sparse Linear Systems, volume 2 of Fundamental of Algo-
rithms. SIAM, Philadelphia, 2006. (Cited on p. 244.)
[290] Timothy A. Davis. Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing
sparse QR factorization. ACM Trans. Math. Softw., 38:8:1–8:22, 2011. (Cited on p. 258.)
[291] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Trans.
Math. Softw., 38:1:1–1:25, 2011. (Cited on pp. 244, 297.)
[292] Timothy A. Davis, Sivasankaran Rajamanickam, and Wissam M. Sid-Lakhdar. A survey of direct
methods for sparse linear systems. Acta Numer., 25:383–566, 2016. (Cited on p. 244.)
[293] Achiya Dax. The convergence of linear stationary iterative processes for solving singular unstruc-
tured systems of linear equations. SIAM Rev., 32:611–635, 1990. (Cited on p. 270.)
[294] Achiya Dax. On computational aspects of bounded linear least squares problems. ACM Trans.
Math. Softw., 17:64–73, 1991. (Cited on p. 417.)
[295] Sijmen de Jong. SIMPLS: An alternative approach to partial least squares regression. Chemom.
Intell. Lab. Syst., 18:251–263, 1993. (Cited on p. 203.)
[296] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decompo-
sition. SIAM J. Matrix Anal. Appl., 21:1253–1278, 2000. (Cited on p. 217.)
[297] Bart De Moor. Structured total least squares and l2 approximation problems. Linear Algebra Appl.,
188/189:163–206, 1993. (Cited on p. 226.)
[298] Bart De Moor and Paul Van Dooren. Generalizations of the singular value and QR decompositions.
SIAM J. Matrix Anal. Appl., 13:993–1014, 1992. (Cited on p. 128.)
[299] Bart De Moor and H. Zha. A tree of generalizations of the ordinary singular value decomposition.
Linear Algebra Appl., 147:469–500, 1991. (Cited on p. 128.)
[300] Ron S. Dembo, Stanley C. Eisenstat, and Trond Steihaug. Inexact Newton methods. SIAM J.
Numer. Anal., 19:400–408, 1982. (Cited on pp. 401, 401.)
[301] C. J. Demeure. Fast QR factorization of Vandermonde matrices. Linear Algebra Appl.,
122/123/124:165–194, 1989. (Cited on p. 239.)
[302] C. J. Demeure. QR factorization of confluent Vandermonde matrices. IEEE Trans. Acoust. Speech
Signal Process., 38:1799–1802, 1990. (Cited on p. 239.)
[303] James W. Demmel. On Floating Point Errors in Cholesky. Tech. Report CS-89-87, Department
of Computer Science, University of Tennessee, Knoxville, TN, 1989. LAPACK Working Note 14.
(Cited on p. 43.)
[304] James W. Demmel. The componentwise distance to the nearest singular matrix. SIAM J. Matrix
Anal. Appl., 13:10–19, 1992. (Cited on p. 31.)
[305] James W. Demmel, Laura Grigori, Ming Gu, and Hua Xiang. Communication avoiding rank reveal-
ing QR factorization with column pivoting. SIAM J. Matrix Anal. Appl., 36:55–89, 2015. (Cited
on p. 112.)
[306] James W. Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou. Communication-optimal
parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput., 34:A206–A239, 2012.
(Cited on p. 212.)
[307] James W. Demmel, Ming Gu, Stanley Eisenstat, Ivan Slapnic̆ar, Krěsimir Veselić, and Zlatko
Drmač. Computing the singular value decomposition with high relative accuracy. Linear Alge-
bra Appl., 299:21–80, 1999. (Cited on p. 344.)
[308] James W. Demmel, Yozo Hida, E. Jason Riedy, and Xiaoye S. Li. Extra-precise iterative refinement
for overdetermined least squares problems. ACM Trans. Math. Softw., 35:29:1–32, 2009. (Cited
on p. 103.)
[309] James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy. Non-negative Diagonals and
High Performance on Low-Profile Matrices from Householder QR. LAPACK Working Note 203,
Tech. Report UCB/EECS-2008-76, UCB/EECS, Berkeley, CA, 2008. (Cited on p. 47.)
[310] James Demmel and W. Kahan. Accurate singular values of bidiagonal matrices. SIAM J. Sci. Statist.
Comput., 11:873–912, 1990. (Cited on pp. 343, 343, 348, 356.)
[311] James Demmel and Plamen Koev. The accurate and efficient solution of a totally positive gener-
alized Vandermonde linear system. SIAM J. Matrix Anal. Appl., 27:142–152, 2005. (Cited on
p. 238.)
[312] James Demmel and Krešimir Veselić. Jacobi’s method is more accurate than QR. SIAM J. Matrix
Anal. Appl., 13:1204–1245, 1992. (Cited on p. 353.)
[313] Eugene D. Denman and Alex N. Beavers. The matrix sign function and computations in systems.
Appl. Math. Comput., 2:63–94, 1976. (Cited on p. 379.)
[314] John E. Dennis. Nonlinear least squares and equations. In D. A. H. Jacobs, editor, The State of the
Art in Numerical Analysis, pages 269–312. Academic Press, New York, 1977. (Cited on p. 402.)
[315] John E. Dennis, David M. Gay, and R. E. Welsch. An adaptive nonlinear least-squares algorithm.
ACM Trans. Math. Softw., 7:348–368, 1981. (Cited on pp. 398, 399.)
[316] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Unconstrained Optimization
and Nonlinear Equations, volume 16 of Classics in Applied Math., SIAM, Philadelphia, 1996.
Originally published in 1983 by Prentice-Hall, Englewood Cliffs, NJ. (Cited on pp. 392, 393,
399, 402.)
[317] J. E. Dennis, Jr. and Trond Steihaug. On the successive projections approach to least-squares
problems. SIAM J. Numer. Anal., 23:717–733, 1986. (Cited on p. 309.)
[318] Peter Deuflhard and V. Apostolescu. A study of the Gauss–Newton algorithm for the solution of
nonlinear least squares problems. In J. Frehse, D. Pallaschke, and U. Trottenberg, editors, Special
Topics of Applied Mathematics, pages 129–150. North-Holland, Amsterdam, 1980. (Cited on
p. 394.)
[319] Peter Deuflhard and W. Sautter. On rank-deficient pseudoinverses. Linear Algebra Appl., 29:91–
111, 1980. (Cited on p. 73.)
[320] Inderjit S. Dhillon. A New O(n²) Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector
Problem. Ph.D. thesis, University of California, Berkeley, CA, 1997. (Cited on p. 352.)
[321] Inderjit S. Dhillon and Beresford N. Parlett. Orthogonal eigenvectors and relative gaps. SIAM J.
Matrix Anal. Appl., 25:858–899, 2004. (Cited on p. 352.)
[322] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. LINPACK Users’ Guide. SIAM,
Philadelphia, 1979. (Cited on pp. 113, 140, 184, 349.)
[323] Jack Dongarra, Mark Gates, Azzam Haidar, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, and
Ichitaro Yamazaki. The singular values decomposition: Anatomy of optimizing an algorithm for
extreme scale. SIAM Rev., 60:808–865, 2018. (Cited on pp. 112, 194, 360.)
[324] J. J. Dongarra and D. C. Sorensen. A fully parallel algorithm for the symmetric eigenvalue prob-
lem. SIAM J. Sci. Statist. Comput., 8:s139–s154, 1987. (Cited on p. 358.)
[325] Jack Dongarra, Jeremy Du Croz, Sven J. Hammarling, and Richard J. Hanson. Algorithm 656. An
extended set of FORTRAN Basic Linear Algebra Subprograms: Model implementation and test
programs. ACM Trans. Math. Softw., 14:18–32, 1988. (Cited on p. 113.)
[326] Jack Dongarra, Jeremy Du Croz, Sven J. Hammarling, and Richard J. Hanson. An extended set of
FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 14:1–17, 1988. (Cited
on p. 113.)
[327] Jack Dongarra, Danny Sorensen, and Sven J. Hammarling. Block reduction of matrices to con-
densed form for eigenvalue computation. J. Comput. Appl. Math., 27:215–227, 1989. (Cited on
p. 194.)
[328] Jack J. Dongarra, Jeremy Du Croz, Iain S. Duff, and S. Hammarling. A set of level 3 basic linear
algebra subprograms. ACM Trans. Math. Softw., 16:1–17, 1990. (Cited on p. 257.)
[329] D. L. Donoho. For most large underdetermined systems of linear equations the minimal 1-norm
solution is also the sparsest solution. Comm. Pure. Appl. Math., 59:797–829, 2006. (Cited on
p. 427.)
[330] Froilán M. Dopico, Plamen Koev, and Juan M. Molera. Implicit standard Jacobi gives high relative
accuracy. Numer. Math., 113:519–553, 2009. (Cited on p. 353.)
[331] N. R. Draper and H. Smith. Applied Regression Analysis. Wiley, New York, third edition, 1998.
(Cited on p. 205.)
[332] Petros Drineas, Michael W. Mahoney, S. Muthukrishnan, and Tamás Sarlós. Faster least squares
approximation. Numer. Math., 117:219–249, 2011. (Cited on p. 319.)
[333] Zlatko Drmač. Implementation of Jacobi rotations for accurate singular value computation in float-
ing point arithmetic. SIAM J. Sci. Comput., 18:1200–1222, 1997. (Cited on p. 356.)
[334] Zlatko Drmač. On principal angles between subspaces of Euclidean space. SIAM J. Matrix Anal.
Appl., 22:173–194, 2000. (Cited on p. 17.)
[335] Zlatko Drmač. A QR-preconditioned QR SVD method for computing SVD with high accuracy.
ACM Trans. Math. Softw., 44:11:1–11:30, 2017. (Cited on p. 341.)
[336] Zlatko Drmač and Krešimir Veselić. New fast and accurate Jacobi SVD algorithm. I. SIAM J.
Matrix Anal. Appl., 29:1322–1342, 2008. (Cited on p. 356.)
[337] Zlatko Drmač and Krešimir Veselić. New fast and accurate Jacobi SVD algorithm. II. SIAM J.
Matrix Anal. Appl., 29:1343–1362, 2008. (Cited on p. 356.)
[338] A. A. Dubrulle. Householder transformations revisited. SIAM J. Matrix Anal. Appl., 22:33–40,
2000. (Cited on p. 51.)
[339] F. Duchin and Daniel B. Szyld. Application of sparse matrix techniques to inter-regional input-
output analysis. Econ. Planning, 15:147–167, 1979. (Cited on p. 206.)
[340] Iain S. Duff. Pivot selection and row orderings in Givens reduction on sparse matrices. Computing,
13:239–248, 1974. (Cited on p. 255.)
[341] Iain S. Duff. On permutations to block triangular form. J. Inst. Math. Appl., 19:339–342, 1977.
(Cited on p. 264.)
[342] Iain S. Duff. On algorithms for obtaining a maximum transversal. ACM Trans. Math. Softw.,
7:315–330, 1981. (Cited on p. 264.)
[343] Iain S. Duff. Parallel implementation of multifrontal schemes. Parallel Comput., 3:193–204, 1986.
(Cited on p. 255.)
[344] Iain S. Duff, A. M. Erisman, and John K. Reid. Direct Methods for Sparse Matrices. Oxford
University Press, London, 1986. (Cited on pp. 185, 252.)
[345] Iain S. Duff, A. M. Erisman, and John K. Reid. Direct Methods for Sparse Matrices. Oxford
University Press, London, second edition, 2017. (Cited on p. 244.)
[346] Iain S. Duff, Roger G. Grimes, and John G. Lewis. Sparse matrix test problems. ACM Trans. Math.
Softw., 15:1–14, 1989. (Cited on pp. 244, 265.)
[347] Iain S. Duff, Michael A. Heroux, and Roldan Pozo. An overview of the sparse basic linear alge-
bra subprograms: The new standard from the BLAS technical forum. ACM Trans. Math. Softw.,
28:239–257, 2002. (Cited on p. 247.)
[348] Iain S. Duff and J. Koster. On algorithms for permuting large entries to the diagonal of a sparse
matrix. SIAM J. Matrix Anal. Appl., 22:973–996, 2001. (Cited on p. 317.)
[349] Iain S. Duff, M. Marrone, G. Radicati, and C. Vittoli. Level 3 Basic Linear Algebra Subprograms
for sparse matrices: A user level interface. ACM Trans. Math. Softw., 23:379–401, 1997. (Cited
on p. 247.)
[350] Iain S. Duff and Gérard A. Meurant. The effect of ordering on preconditioned conjugate gradients.
BIT Numer. Math., 29:635–657, 1989. (Cited on p. 310.)
[351] Iain S. Duff and John K. Reid. An implementation of Tarjan’s algorithm for the block triangular-
ization of a matrix. ACM Trans. Math. Softw., 4:137–147, 1978. (Cited on p. 265.)
[352] Iain S. Duff and John K. Reid. The multifrontal solution of indefinite sparse symmetric linear
systems. ACM Trans. Math. Softw., 9:302–325, 1983. (Cited on p. 255.)
[353] A. L. Dulmage and N. S. Mendelsohn. Coverings of bipartite graphs. Canad. J. Math., 10:517–534,
1958. (Cited on p. 263.)
[354] A. L. Dulmage and N. S. Mendelsohn. A structure theory of bipartite graphs of finite exterior
dimension. Trans. Roy. Soc. Canada Sect. III, 53:1–13, 1959. (Cited on p. 263.)
[355] A. L. Dulmage and N. S. Mendelsohn. Two algorithms for bipartite graphs. J. Soc. Indust. Appl.
Math., 11:183–194, 1963. (Cited on p. 263.)
[356] P. J. Eberlein and Haesun Park. Efficient implementation of Jacobi algorithms and Jacobi sets on
distributed memory machines. J. Parallel Distrib. Comput., 8:358–366, 1990. (Cited on p. 354.)
[357] Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psy-
chometrika, 1:211–218, 1936. (Cited on p. 24.)
[358] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality con-
straints. SIAM J. Matrix Anal. Appl., 20:303–353, 1998. (Cited on p. 169.)
[359] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Ann.
Statist., 32:407–499, 2004. (Cited on p. 425.)
[360] M. A. Efroymson. Multiple regression analysis. In Anthony Ralston and Herbert S. Wilf, editors,
Mathematical Methods for Digital Computers. Volume I, pages 191–203. Wiley, New York, 1960.
(Cited on p. 140.)
[361] E. Egerváry. On rank-diminishing operators and their applications to the solution of linear equa-
tions. Z. Angew. Math. Phys., 11:376–386, 1960. (Cited on p. 139.)
[362] Stanley C. Eisenstat, M. C. Gursky, Martin H. Schultz, and Andrew H. Sherman. The Yale sparse
matrix package, 1. The symmetric code. Internat. J. Numer. Methods Engrg., 18:1145–1151, 1982.
(Cited on p. 258.)
[363] Håkan Ekblom. Calculation of linear best lp-approximation. BIT Numer. Math., 13:292–300, 1973.
(Cited on p. 425.)
[364] Håkan Ekblom. A new algorithm for the Huber estimator in linear models. BIT Numer. Math.,
28:123–132, 1988. (Cited on p. 425.)
[365] Håkan Ekblom and Kaj Madsen. Algorithms for nonlinear Huber estimation. BIT Numer. Math.,
29:60–76, 1989. (Cited on p. 425.)
[366] Lars Eldén. Stepwise Regression Analysis with Orthogonal Transformations. Tech. Report LiTH-
MAT-R-1972-2, Department of Mathematics, Linköping University, Sweden, 1972. (Cited on
p. 140.)
[367] Lars Eldén. Algorithms for the regularization of ill-conditioned least squares problems. BIT Numer.
Math., 17:134–145, 1977. (Cited on pp. 170, 171, 180.)
[368] Lars Eldén. Perturbation theory for the least squares problem with linear equality constraints. SIAM
J. Numer. Anal., 17:338–350, 1980. (Cited on p. 158.)
[369] Lars Eldén. A weighted pseudoinverse, generalized singular values, and constrained least squares
problems. BIT Numer. Math., 22:487–502, 1982. (Cited on pp. 158, 158, 159, 181.)
[370] Lars Eldén. An efficient algorithm for the regularization of ill-conditioned least squares problems
with triangular Toeplitz matrix. SIAM J. Sci. Statist. Comput., 5:229–236, 1984. (Cited on pp. 176,
241.)
[371] Lars Eldén. A note on the computation of the generalized cross-validation function for ill-
conditioned least squares problems. BIT Numer. Math., 24:467–472, 1984. (Cited on p. 179.)
[372] Lars Eldén. Algorithms for the computation of functionals defined on the solution of a discrete
ill-posed problem. BIT Numer. Math., 30:466–483, 1990. (Cited on p. 29.)
[373] Lars Eldén. Numerical solution of the sideways heat equation. In H. W. Engl and W. Rundell,
editors, Inverse Problems in Diffusion Processes, pages 130–150. SIAM, Philadelphia, PA, 1995.
(Cited on p. 173.)
[374] Lars Eldén. Solving quadratically constrained least squares problems using a differential-geometric
approach. BIT Numer. Math., 42:323–335, 2002. (Cited on p. 169.)
[375] Lars Eldén. Partial least-squares vs. Lanczos bidiagonalization—I: Analysis of a projection method
for multiple regression. Comput. Statist. Data Anal., 46:11–31, 2004. (Cited on p. 201.)
[376] Lars Eldén. Matrix Methods in Data Mining and Pattern Recognition. Fundamentals of Algorithms.
SIAM, Philadelphia, second edition, 2019. (Cited on p. 216.)
[377] Lars Eldén and Salman Ahmadi-Asl. Solving bilinear tensor least squares problems and application
to Hammerstein identification. Numer. Linear Algebra Appl., 26:e2226, 2019. (Cited on p. 406.)
[378] L. Eldén and H. Park. Block downdating of least squares solutions. SIAM J. Matrix Anal. Appl.,
15:1018–1034, 1994. (Cited on p. 148.)
[379] Lars Eldén and Haesun Park. A Procrustes problem on the Stiefel manifold. Numer. Math., 82:599–
619, 1999. (Cited on p. 387.)
[380] Lars Eldén and Berkant Savas. A Newton–Grassman method for computing the best multilinear
rank-(r1 , r2 , r3 ) approximation of a tensor. SIAM J. Matrix Anal. Appl., 31:248–271, 2009. (Cited
on pp. 215, 217.)
[381] Lars Eldén and Valeria Simoncini. Inexact Rayleigh quotient-type methods for eigenvalue compu-
tations. BIT Numer. Math., 42:159–182, 2002. (Cited on p. 375.)
[382] Lars Eldén and Valeria Simoncini. Solving ill-posed linear systems with GMRES and a singular
preconditioner. SIAM J. Matrix Anal. Appl., 33:1369–1394, 2012. (Cited on p. 334.)
[383] T. Elfving. Block-iterative methods for consistent and inconsistent linear equations. Numer. Math.,
35:1–12, 1980. (Cited on p. 308.)
[384] Tommy Elfving and Ingegerd Skoglund. A direct method for a regularized least-squares problem.
Numer. Linear Algebra Appl., 16:649–675, 2009. (Cited on p. 259.)
[385] Erik Elmroth and Fred G. Gustavson. Applying recursion to serial and parallel QR factorization
leads to better performance. IBM J. Res. Develop., 44:605–624, 2004. (Cited on p. 112.)
[386] Heinz W. Engl, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems. Kluwer
Academic Press, Dordrecht, The Netherlands, 1996. (Cited on p. 182.)
[387] Jerry Eriksson. Quasi-Newton methods for nonlinear least squares focusing on curvature. BIT
Numer. Math., 39:228–254, 1999. (Cited on p. 400.)
[388] Jerry Eriksson, Per-Åke Wedin, Mårten E. Gulliksson, and Inge Söderkvist. Regularization methods
for uniformly rank-deficient nonlinear least squares methods. J. Optim. Theory Appl., 127:1–26,
2005. (Cited on p. 397.)
[389] D. J. Evans. The use of pre-conditioning in iterative methods for solving linear systems with
symmetric positive definite matrices. J. Inst. Math. Appl., 4:295–314, 1968. (Cited on p. 306.)
[390] Nicolaas M. Faber and Joan Ferré. On the numerical stability of two widely used PLS algorithms.
J. Chemom., 22:101–105, 2008. (Cited on p. 203.)
[391] D. K. Faddeev, V. N. Kublanovskaya, and V. N. Faddeeva. Solution of linear algebraic systems with
rectangular matrices. Proc. Steklov Inst. Math., 96:93–111, 1968. (Cited on p. 76.)
[392] D. K. Faddeev, V. N. Kublanovskaya, and V. N. Faddeeva. Sur les systèmes linéaires algébriques
de matrices rectangulaires et mal-conditionnées. In Programmation en Mathématiques Numériques,
pages 161–170. Editions Centre Nat. Recherche Sci., Paris VII, 1968. (Cited on p. 343.)
[393] Shaun M. Fallat. Bidiagonal factorizations of totally nonnegative matrices. Amer. Math. Monthly,
108:697–712, 2001. (Cited on p. 239.)
[394] Ky Fan. On a theorem by Weyl concerning the eigenvalues of linear transformations. Proc. Nat.
Acad. Sci. USA, 35:652–655, 1949. (Cited on p. 22.)
[395] Ky Fan and Alan J. Hoffman. Some metric inequalities in the space of matrices. Proc. Amer. Math.
Soc., 6:111–116, 1955. (Cited on p. 383.)
[396] R. W. Farebrother. Linear Least Squares Computations. Marcel Dekker, New York, 1988. (Cited
on p. 64.)
[397] Richard W. Farebrother. Fitting Linear Relationships. A History of the Calculus of Observations
1750–1900. Springer Series in Statistics. Springer, Berlin, 1999. (Cited on pp. 2, 139.)
[398] Dario Fasino and Antonio Fazzi. A Gauss–Newton iteration for total least squares problems. BIT
Numer. Math., 58:281–299, 2018. (Cited on p. 224.)
[399] Donald W. Fausett and Charles T. Fulton. Large least squares problems involving Kronecker prod-
ucts. SIAM J. Matrix Anal. Appl., 15:219–227, 1994. (Cited on p. 209.)
[400] M. Fazel. Matrix Rank Minimization with Applications. Ph.D. thesis, Stanford University, Stanford,
CA, 2002. (Cited on p. 429.)
[401] K. Vince Fernando. Linear convergence of the row cyclic Jacobi and Kogbetliantz methods. Numer.
Math., 56:73–91, 1989. (Cited on pp. 357, 357.)
[402] K. V. Fernando. Accurately counting singular values of bidiagonal matrices and eigenvalues of
skew-symmetric tridiagonal matrices. SIAM J. Matrix Anal. Appl., 20:373–399, 1998. (Cited on
pp. 350, 351.)
[403] K. Vince Fernando and Beresford N. Parlett. Accurate singular values and differential qd algo-
rithms. Numer. Math., 67:191–229, 1994. (Cited on pp. 343, 351, 352.)
[404] R. D. Fierro, G. H. Golub, P. C. Hansen, and D. P. O’Leary. Regularization by truncated total least
squares. SIAM J. Sci. Comput., 18:1223–1241, 1997. (Cited on pp. 225, 335.)
[405] Ricardo D. Fierro and James R. Bunch. Collinearity and total least squares. SIAM J. Matrix Anal.
Appl., 15:1167–1181, 1994. (Cited on p. 225.)
[406] Ricardo D. Fierro and James R. Bunch. Perturbation theory for orthogonal projection methods with
applications to least squares and total least squares. Linear Algebra Appl., 234:71–96, 1996. (Cited
on p. 225.)
[407] Ricardo D. Fierro and Per Christian Hansen. UTV Expansion Pack: Special-purpose rank-revealing
algorithms. Numer. Algorithms, 40:47–66, 2005. (Cited on p. 155.)
[408] Ricardo D. Fierro, Per Christian Hansen, and Peter Søren Kirk Hansen. UTV Tools: MATLAB
templates for rank-revealing UTV decompositions. Numer. Algorithms, 20:165–194, 1999. (Cited
on p. 155.)
[409] B. Fischer, A. Ramage, D. J. Silvester, and A. J. Wathen. Minimum residual methods for augmented
systems. BIT Numer. Math., 38:527–543, 1998. (Cited on pp. 300, 330.)
[410] Roger Fletcher. Generalized inverse methods for the best least squares solution of systems of non-
linear equations. Comput. J., 10:392–399, 1968. (Cited on p. 395.)
[411] Roger Fletcher. A Modified Marquardt Subroutine for Nonlinear Least Squares. Tech. Report
R6799, Atomic Energy Research Establishment, Harwell, UK, 1971. (Cited on p. 396.)
[412] Roger Fletcher. Practical Methods of Optimization. Wiley, New York, second edition, 2000. (Cited
on p. 402.)
[413] Roger Fletcher and C. Xu. Hybrid methods for nonlinear least squares. IMA J. Numer. Anal.,
7:371–389, 1987. (Cited on p. 400.)
[414] Diederik R. Fokkema, Gerard L. G. Sleijpen, and Henk A. van der Vorst. Accelerated inexact
Newton schemes for large systems of nonlinear equations. SIAM J. Sci. Comput., 19:657–674,
1998. (Cited on p. 402.)
[415] Diederik R. Fokkema, Gerard L. G. Sleijpen, and Henk A. van der Vorst. Jacobi–Davidson style
QR and QZ algorithms for the reduction of matrix pencils. SIAM J. Sci. Comput., 20:94–125, 1998.
(Cited on p. 375.)
[416] David Chin-Lung Fong. Minimum-Residual Methods for Sparse Least-Squares Using Golub–
Kahan Bidiagonalization. Ph.D. thesis, Stanford University, Stanford, CA, 2011. (Cited on p. 294.)
[417] David Chin-Lung Fong and Michael Saunders. LSMR: An iterative algorithm for sparse least-
squares problems. SIAM J. Sci. Comput., 33:2950–2971, 2011. (Cited on pp. 292, 294, 297,
298.)
[418] A. B. Forbes. Least-Squares Best-Fit Geometric Elements. Tech. Report NPL DITC 140/89, Na-
tional Physical Laboratory, Teddington, UK, 1989. (Cited on p. 414.)
[419] A. B. Forbes. Robust Circle and Sphere Fitting by Least Squares. Tech. Report NPL DITC 153/89,
National Physical Laboratory, Teddington, UK, 1989. (Cited on p. 414.)
[420] Anders Forsgren. On linear least-squares problems with diagonally dominant weight matrices.
SIAM J. Matrix Anal. Appl., 17:763–788, 1996. (Cited on p. 132.)
[421] Anders Forsgren, Philip E. Gill, and Margaret H. Wright. Interior methods for nonlinear optimiza-
tion. SIAM Rev., 44:525–597, 2002. (Cited on p. 419.)
[422] George E. Forsythe. Generation and use of orthogonal polynomials for data-fitting with a digital
computer. J. Soc. Indust. Appl. Math., 5:74–88, 1957. (Cited on p. 237.)
[423] George E. Forsythe and Peter Henrici. The cyclic Jacobi method for computing the principal values
of a complex matrix. Trans. Amer. Math. Soc., 94:1–23, 1960. (Cited on p. 354.)
[424] Leslie Foster. Rank and null space calculations using matrix decomposition without column inter-
changes. Linear Algebra Appl., 74:47–71, 1986. (Cited on pp. 82, 261.)
[425] Leslie Foster. The growth factor and efficiency of Gaussian elimination with rook pivoting. J.
Comput. Appl. Math., 86:177–194, 1997. (Cited on p. 86.)
[426] Leslie Foster and Rajesh Kommu. Algorithm 853. An efficient algorithm for solving rank-deficient
least squares problems. ACM Trans. Math. Softw., 32:157–165, 2006. (Cited on p. 79.)
[427] Leslie V. Foster. Modifications of the normal equations method that are numerically stable. In
Gene H. Golub and P. Van Dooren, editors, Numerical Linear Algebra, Digital Signal Processing
and Parallel Algorithms, NATO ASI Series, pages 501–512. Springer-Verlag, Berlin, 1991. (Cited
on p. 204.)
[428] Leslie V. Foster. Solving rank-deficient and ill-posed problems using UTV and QR factorizations.
SIAM J. Matrix Anal. Appl., 25:582–600, 2003. (Cited on p. 79.)
[429] Leslie Fox. An Introduction to Numerical Linear Algebra. Clarendon Press, Oxford, 1964. xi+328
pp. (Cited on p. 314.)
[430] C. Fraley. Computational behavior of Gauss–Newton methods. SIAM J. Sci. Statist. Comput.,
10:515–532, 1989. (Cited on p. 394.)
[431] John G. F. Francis. The QR transformation: A unitary analogue to the LR transformation. I. Com-
puter J., 4:265–271, 1961. (Cited on p. 344.)
[432] John G. F. Francis. The QR transformation. II. Computer J., 4:332–345, 1961. (Cited on p. 344.)
[433] Roland W. Freund. A note on two block SOR methods for sparse least squares problems. Linear
Algebra Appl., 88/89:211–221, 1987. (Cited on p. 316.)
[434] M. P. Friedlander and Dominique Orban. A primal-dual regularized interior-point method for con-
vex quadratic programs. Math. Prog. Comput., 4:71–107, 2012. (Cited on p. 331.)
[435] Takeshi Fukaya, Ramaseshan Kannan, Yuji Nakatsukasa, Yusaku Yamamoto, and Yuka Yanagi-
sawa. Shifted Cholesky QR for computing the QR factorization of ill-conditioned matrices. SIAM
J. Sci. Comput., 42:A477–A503, 2020. (Cited on p. 213.)
[436] Martin J. Gander and Gerhard Wanner. From Euler, Ritz, and Galerkin to modern computing. SIAM
Rev., 54:627–666, 2012. (Cited on p. 375.)
[437] Walter Gander. Algorithms for the QR-Decomposition. Research Report 80–02, Seminar für Ange-
wandte Mathematik, ETHZ, Zürich, Switzerland, 1980. (Cited on pp. 62, 62, 70.)
[438] Walter Gander. Least squares with a quadratic constraint. Numer. Math., 36:291–307, 1981. (Cited
on p. 168.)
[439] Walter Gander, Gene H. Golub, and Rolf Strebel. Least squares fitting of circles and ellipses. BIT
Numer. Math., 34:558–578, 1994. (Cited on p. 416.)
[440] Walter Gander and Urs von Matt. Some least squares problems. In W. Gander and J. Hřebíček, edi-
tors, Solving Problems in Scientific Computing Using Maple and MATLAB, pages 83–102. Springer-
Verlag, Berlin, third edition, 1997. (Cited on p. 411.)
[441] B. S. Garbow, J. M. Boyle, Jack J. Dongarra, and G. W. Stewart. Matrix Eigensystems Routines:
EISPACK Guide Extension, volume 51 of Lecture Notes in Computer Science. Springer, New York,
1977. (Cited on p. 113.)
[442] A. de la Garza. An Iterative Method for Solving Systems of Linear Equations. Report K-731, Oak
Ridge Gaseous Diffusion Plant, Oak Ridge, TN, 1951. (Cited on p. 273.)
[443] André Gaul, Martin H. Gutknecht, Jörg Liesen, and Reinhard Nabben. A framework for deflated
and augmented Krylov subspace methods. SIAM J. Matrix Anal. Appl., 34:495–518, 2013. (Cited
on p. 337.)
[444] C. F. Gauss. Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections,
C. H. Davis, translator, Dover, New York, 1963. First published in 1809. (Cited on p. 2.)
[445] Carl Friedrich Gauss. Theoria combinationis observationum erroribus minimis obnoxiae, pars prior.
In Werke, IV, pages 1–26. Königlichen Gesellschaft der Wissenschaften zu Göttingen, 1880. First
published in 1821. (Cited on p. 3.)
[446] Carl Friedrich Gauss. Theoria combinationis observationum erroribus minimis obnoxiae, pars pos-
terior. In Werke, IV, pages 27–53. Königlichen Gesellschaft der Wissenschaften zu Göttingen,
1880. First published in 1823. (Cited on p. 3.)
[447] Carl Friedrich Gauss. Theory of the Combination of Observations Least Subject to Errors. Part 1,
Part 2, Supplement, G. W. Stewart, Translator, volume 11 of Classics in Applied Math., SIAM,
Philadelphia, 1995. (Cited on p. 3.)
[448] Silvia Gazzola, Per Christian Hansen, and James G. Nagy. IR Tools: A MATLAB package of
iterative regularization methods and large-scale test problems. Numer. Algorithms, 81:773–811,
2018. (Cited on p. 335.)
[449] Silvia Gazzola, Paolo Novati, and Maria Rosario Russo. On Krylov projection methods and
Tikhonov regularization. ETNA, 44:83–123, 2015. (Cited on p. 335.)
[450] W. Morven Gentleman. Least squares computations by Givens transformations without square
roots. J. Inst. Math. Appl., 12:329–336, 1973. (Cited on pp. 50, 255.)
[451] W. Morven Gentleman. Error analysis of QR decompositions by Givens transformations. Linear
Algebra Appl., 10:189–197, 1975. (Cited on p. 56.)
[452] W. Morven Gentleman. Row elimination for solving sparse linear systems and least squares prob-
lems. In G. A. Watson, editor, Proceedings of the Dundee Conference on Numerical Analysis
1975, volume 506 of Lecture Notes in Mathematics, pages 122–133. Springer-Verlag, Berlin, 1976.
(Cited on p. 254.)
[453] Alan George. Computer Implementation of the Finite-Element Method. Ph.D. thesis, Stanford
University, CA, 1971. (Cited on p. 251.)
[454] Alan George, John R. Gilbert, and Joseph W. H. Liu, editors. Graph Theory and Sparse Matrix
Computation, volume 56 of The IMA Volumes in Mathematics and Applications. Springer, 1993.
(Cited on p. 244.)
[455] Alan George and Michael T. Heath. Solution of sparse linear least squares problems using Givens
rotations. Linear Algebra Appl., 34:69–83, 1980. (Cited on pp. 253, 254.)
[456] Alan George, Kh. D. Ikramov, and A. B. Kucherov. Some properties of symmetric quasi-definite
matrices. SIAM J. Matrix Anal. Appl., 21:1318–1323, 2000. (Cited on p. 329.)
[457] Alan George and Joseph W. H. Liu. Computer Solution of Large Sparse Positive Definite Systems.
Prentice-Hall, Englewood Cliffs, NJ, 1981. (Cited on pp. 244, 245, 251, 256.)
[458] Alan George and Joseph W. H. Liu. Householder reflections versus Givens rotations in sparse
orthogonal decomposition. Linear Algebra Appl., 88/89:223–238, 1987. (Cited on p. 256.)
[459] Alan George and Joseph W. H. Liu. The evolution of the minimum degree ordering algorithm.
SIAM Rev., 31:1–19, 1989. (Cited on p. 252.)
[460] Alan George and Esmond Ng. On row and column orderings for sparse least squares problems.
SIAM J. Numer. Anal., 20:326–344, 1983. (Cited on pp. 255, 256.)
[461] Alan George and Esmond G. Ng. On the complexity of sparse QR and LU factorization of finite-
element matrices. SIAM J. Sci. Statist. Comput., 9:849–861, 1988. (Cited on p. 258.)
[462] Alan George, William G. Poole, Jr., and Robert G. Voigt. Incomplete nested dissection for solving
n by n grid problems. SIAM J. Numer. Anal., 15:662–673, 1978. (Cited on p. 256.)
[463] J. A. George and E. G. Ng. SPARSPAK: Waterloo Sparse Matrix Package User’s Guide for
SPARSPAK-B. Res. Report CS-84-37, Department of Computer Science, University of Waterloo,
Canada, 1984. (Cited on p. 258.)
[464] J. A. George, M. T. Heath, and R. J. Plemmons. Solution of large-scale sparse least squares prob-
lems using auxiliary storage. SIAM J. Sci. Statist. Comput., 2:416–429, 1981. (Cited on p. 256.)
[465] J. Alan George, Joseph W. H. Liu, and Esmond G. Ng. Row-ordering schemes for sparse Givens
transformations I. Bipartite graph model. Linear Algebra Appl., 61:55–81, 1984. (Cited on p. 253.)
[466] Tomáš Gergelits and Zdeněk Strakoš. Composite convergence bounds based on Chebyshev polyno-
mials and finite precision conjugate gradient computations. Numer. Algorithms, 65:759–782, 2014.
(Cited on p. 299.)
[467] Norman E. Gibbs, William G. Poole, Jr., and Paul K. Stockmeyer. An algorithm for reducing the
bandwidth and profile of a sparse matrix. SIAM J. Numer. Anal., 13:236–250, 1976. (Cited on
p. 251.)
[468] John R. Gilbert, Xiaoye S. Li, Esmond G. Ng, and Barry W. Peyton. Computing row and column
counts for sparse QR and LU factorization. BIT Numer. Math., 41:693–710, 2001. (Cited on
p. 255.)
[469] John R. Gilbert, Cleve Moler, and Robert Schreiber. Sparse matrices in MATLAB: Design and
implementation. SIAM J. Matrix Anal. Appl., 13:333–356, 1992. (Cited on pp. 250, 258.)
[470] John R. Gilbert, Esmond Ng, and B. W. Peyton. Separators and structure prediction in sparse
orthogonal factorization. Linear Algebra Appl., 262:83–97, 1997. (Cited on p. 105.)
[471] M. B. Giles and E. Süli. Adjoint methods for PDEs: A posteriori error analysis and postprocessing
by duality. Acta Numer., 11:145–236, 2002. (Cited on p. 304.)
[472] Philip E. Gill, Gene H. Golub, Walter Murray, and Michael A. Saunders. Methods for modifying
matrix factorizations. Math. Comp., 28:505–535, 1974. (Cited on p. 139.)
[473] Philip E. Gill, Sven J. Hammarling, Walter Murray, Michael A. Saunders, and Margaret H. Wright.
User’s Guide for LSSOL (Version 1.0): A Fortran Package for Constrained Linear Least-Squares
and Convex Quadratic Programming. Report SOL, Department of Operations Research, Stanford
University, CA, 1986. (Cited on p. 165.)
[474] Philip E. Gill and Walter Murray. Algorithms for the solution of the nonlinear least-squares prob-
lem. SIAM J. Numer. Anal., 15:977–992, 1978. (Cited on p. 399.)
[475] Philip E. Gill, Walter Murray, and Michael A. Saunders. SNOPT: An SQP algorithm for large-scale
constrained optimization. SIAM Rev., 47:99–131, 2005. (Cited on p. 318.)
[476] Philip E. Gill, Walter Murray, and Margaret H. Wright. Practical Optimization. Academic Press,
London, UK, 1981. (Cited on pp. 395, 402.)
[477] Philip E. Gill, Michael A. Saunders, and Joseph R. Shinnerl. On the stability of Cholesky factor-
ization for symmetric quasi-definite systems. SIAM J. Matrix Anal. Appl., 17:35–46, 1996. (Cited
on p. 329.)
[478] Luc Giraud, Serge Gratton, and Julien Langou. A rank-k update procedure for reorthogonalizing
the orthogonal factor from modified Gram–Schmidt. SIAM J. Matrix Anal. Appl., 25:1163–1177,
2004. (Cited on p. 71.)
[479] Luc Giraud and Julien Langou. When modified Gram–Schmidt generates a well-conditioned set of
vectors. IMA J. Numer. Anal., 22:4:521–528, 2002. (Cited on p. 70.)
[480] Wallace Givens. Computation of plane unitary rotations transforming a general matrix to triangular
form. J. Soc. Indust. Appl. Math., 6:26–50, 1958. (Cited on pp. 47, 51.)
[481] J. Gluchowska and Alicja Smoktunowicz. Solving the linear least squares problem with very high
relative accuracy. Computing, 45:345–354, 1990. (Cited on p. 104.)
[482] Sergei K. Godunov. Problem of the dichotomy of the spectrum of a matrix. Siberian Math. J.,
27:649–660, 1986. (Cited on p. 382.)
[483] Israel Gohberg, Peter Lancaster, and Leiba Rodman. Indefinite Linear Algebra and Applications.
Birkhäuser, Boston, 2005. (Cited on p. 137.)
[484] Herman H. Goldstine. A History of Numerical Analysis from the 16th through the 19th Century.
Stud. Hist. Math. Phys. Sci., Vol. 2. Springer-Verlag, New York, 1977. (Cited on p. 2.)
[485] G. H. Golub and Peter Businger. Least Squares, Singular Values and Matrix Approximations;
an ALGOL Procedure for Computing the Singular Value Decomposition. Tech. Report CS-73,
Computer Science Department, Stanford University, CA, 1967. (Cited on p. 349.)
[486] G. H. Golub and C. F. Van Loan. Unsymmetric positive definite linear systems. Linear Algebra
Appl., 28:85–97, 1979. (Cited on p. 329.)
[487] Gene H. Golub. Numerical methods for solving least squares problems. Numer. Math., 7:206–216,
1965. (Cited on pp. 56, 75, 169, 177.)
[488] Gene H. Golub. Least squares, singular values and matrix approximations. Apl. Mat., 13:44–51,
1968. (Cited on p. 348.)
[489] Gene H. Golub. Matrix decompositions and statistical computing. In Roy C. Milton and John A.
Nelder, editors, Statistical Computation, pages 365–397. Academic Press, New York, 1969. (Cited
on p. 137.)
[490] Gene H. Golub. Some modified matrix eigenvalue problems. SIAM Rev., 15:318–334, 1973. (Cited
on p. 360.)
[491] Gene H. Golub and Chen Greif. On solving block-structured indefinite linear systems. SIAM J. Sci.
Comput., 24:2076–2092, 2003. (Cited on p. 117.)
[492] Gene H. Golub, Per Christian Hansen, and Dianne P. O’Leary. Tikhonov regularization and total
least squares. SIAM J. Matrix Anal. Appl., 21:185–194, 1999. (Cited on p. 225.)
[493] Gene H. Golub, Micheal T. Heath, and Grace Wahba. Generalized cross-validation as a method for
choosing a good ridge parameter. Technometrics, 21:215–223, 1979. (Cited on p. 178.)
[494] Gene H. Golub, Alan Hoffman, and G. W. Stewart. A generalization of the Eckart–Young matrix
approximation theorem. Linear Algebra Appl., 88/89:317–327, 1987. (Cited on p. 24.)
[495] Gene H. Golub and William Kahan. Calculating the singular values and pseudo-inverse of a matrix.
SIAM J. Numer. Anal., 2:205–224, 1965. (Cited on pp. 191, 196, 342.)
[496] Gene H. Golub, Virginia Klema, and G. W. Stewart. Rank Degeneracy and Least Squares Problems.
Tech. Report STAN-CS-76-559, Computer Science Department, Stanford University, Stanford, CA,
1976. (Cited on p. 80.)
[497] Gene H. Golub and Randall J. LeVeque. Extensions and uses of the variable projection algorithm
for solving nonlinear least squares problems. In Proceedings of the 1979 Army Numerical Analysis
and Computers Conf., ARO Report 79-3, pages 1–12. White Sands Missile Range, White Sands,
NM, 1979. (Cited on p. 405.)
[498] Gene H. Golub, Franklin T. Luk, and Michael L. Overton. A block Lanczos method for computing
the singular values and corresponding singular vectors of a matrix. ACM Trans. Math. Softw.,
7:149–169, 1981. (Cited on pp. 193, 376.)
[499] Gene H. Golub, Franklin T. Luk, and M. Pagano. A sparse least squares problem in photogramme-
try. In J. F. Gentleman, editor, Proceedings of the Computer Science and Statistics: 12th Annual
Symposium on the Interface, pages 26–30. University of Waterloo, Ontario, Canada, 1979. (Cited
on pp. 206, 208.)
[500] Gene H. Golub, P. Manneback, and Ph. L. Toint. A comparison between some direct and iterative
methods for large scale geodetic least squares problems. SIAM J. Sci. Statist. Comput., 7:799–816,
1986. (Cited on pp. 209, 308, 309.)
[501] Gene H. Golub and Gérard Meurant. Matrices, moments and quadrature. In D. F. Griffiths and G. A.
Watson, editors, Numerical Analysis 1993: Proceedings of the 13th Dundee Biennial Conference,
volume 228 of Pitman Research Notes Math., pages 105–156. Longman Scientific and Technical,
Harlow, UK, 1994. (Cited on p. 289.)
[502] Gene H. Golub and Dianne P. O’Leary. Some history of the conjugate gradient and Lanczos algo-
rithms: 1948–1976. SIAM Rev., 31:50–102, 1989. (Cited on p. 285.)
[503] Gene H. Golub and Victor Pereyra. The differentiation of pseudo-inverses and nonlinear least
squares problems whose variables separate. SIAM J. Numer. Anal., 10:413–432, 1973. (Cited on
pp. 26, 402, 403.)
[504] Gene H. Golub and Victor Pereyra. Separable nonlinear least squares: The variable projection
method and its application. Inverse Problems, 19:R1–R26, 2003. (Cited on p. 405.)
[505] Gene H. Golub and R. J. Plemmons. Large-scale geodetic least-squares adjustment by dissection
and orthogonal decomposition. Linear Algebra Appl., 34:3–28, 1980. (Cited on pp. 187, 206,
209.)
[506] Gene H. Golub, Robert J. Plemmons, and Ahmed H. Sameh. Parallel block schemes for large-
scale least-squares computations. In High-Speed Computing, Scientific Applications and Algorithm
Design, pages 171–179. University of Illinois Press, 1988. (Cited on p. 208.)
[507] Gene H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. In F. L.
Bauer et al., editors, Handbook for Automatic Computation. Vol. II, Linear Algebra, pages 134–151.
Springer, New York, 1971. Prepublished in Numer. Math., 14:403–420, 1970. (Cited on pp. 11,
349.)
[508] Gene H. Golub, Knut Sølna, and Paul Van Dooren. Computing the SVD of a general matrix
product/quotient. SIAM J. Matrix Anal. Appl., 22:1–19, 2000. (Cited on p. 127.)
[509] Gene H. Golub, Martin Stoll, and Andy Wathen. Approximation of the scattering amplitude and
linear systems. ETNA, 31:178–203, 2008. (Cited on p. 304.)
[510] Gene H. Golub and Frank Uhlig. The QR algorithm: 50 years later its genesis by John Francis
and Vera Kublanovskaya and subsequent developments. IMA J. Numer. Anal., 29:467–485, 2009.
(Cited on p. 348.)
[511] Gene H. Golub and Charles F. Van Loan. An analysis of the total least squares problem. SIAM J.
Numer. Anal., 17:883–893, 1980. (Cited on pp. 218, 220.)
[512] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press,
Baltimore, MD, third edition, 1996. (Cited on pp. 17, 80, 157, 179, 355, 368.)
[513] Gene H. Golub and Richard S. Varga. Chebyshev semi-iteration and second-order Richardson
iterative methods. Parts I and II. Numer. Math., 3:147–168, 1961. (Cited on p. 277.)
[514] Gene H. Golub and James H. Wilkinson. Note on the iterative refinement of least squares solutions.
Numer. Math., 9:139–148, 1966. (Cited on p. 27.)
[515] Gene H. Golub and Hongyuan Zha. Perturbation analysis of the canonical correlation of matrix
pairs. Linear Algebra Appl., 210:3–28, 1994. (Cited on p. 17.)
[516] Gene H. Golub and Hongyuan Zha. The canonical correlations of matrix pairs and their numerical
computation. In A. Bojanczyk and G. Cybenko, editors, Linear Algebra for Signal Processing,
volume 69 of The IMA Volumes in Mathematics and Its Applications, pages 27–49. Springer-Verlag,
1995. (Cited on p. 17.)
[517] Dan Gordon and Rachel Gordon. CGMN revisited: Robust and efficient solution of stiff linear
systems derived from elliptic partial differential equations. ACM Trans. Math. Softw., 35:18:1–
18:27, 2008. (Cited on p. 275.)
[518] Rachel Gordon, R. Bender, and Gabor T. Herman. Algebraic reconstruction techniques (ART) for
three-dimensional electron microscopy and X-ray photography. J. Theor. Biology, 29:471–481,
1970. (Cited on p. 274.)
[519] S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin. A theory of pseudo-skeleton approxi-
mations. Linear Algebra Appl., 261:1–21, 1997. (Cited on p. 90.)
[520] Nicholas I. M. Gould and Jennifer A. Scott. The state-of-the-art of preconditioners for sparse least-
squares problems. ACM Trans. Math. Softw., 43:36-1–36-35, 2017. (Cited on pp. 297, 306.)
[521] John C. Gower and Garmt B. Dijksterhuis. Procrustes Problems. Oxford University Press, Oxford,
UK, 2004. (Cited on p. 387.)
[522] William B. Gragg, Randall J. LeVeque, and John A. Trangenstein. Numerically stable methods for
updating regressions. J. Amer. Statist. Assoc., 74:161–168, 1979. (Cited on p. 140.)
[523] William B. Gragg and W. J. Harrod. The numerically stable reconstruction of Jacobi matrices from
spectral data. Numer. Math., 44:317–335, 1984. (Cited on pp. 230, 239.)
[524] S. L. Graham, M. Snir, and C. A. Patterson, editors. Getting up to Speed. The Future of Supercom-
puting. The National Academies Press, Washington, DC, 2004. (Cited on p. 114.)
[525] Jørgen P. Gram. Ueber die Entwickelung reeller Functionen in Reihen mittelst der Methode der
kleinsten Quadrate. J. Reine Angew. Math., 94:41–73, 1883. (Cited on p. 63.)
[526] S. Gratton, A. S. Lawless, and N. K. Nichols. Approximate Gauss–Newton methods for nonlinear
least squares problems. SIAM J. Optim., 18:106–132, 2007. (Cited on pp. 401, 401.)
[527] Serge Gratton. On the condition number of the least squares problem in a weighted Frobenius norm.
BIT Numer. Math., 36:523–530, 1996. (Cited on p. 31.)
[528] Serge Gratton, Pavel Jiránek, and David Titley-Peloquin. On the accuracy of the Karlson–Waldén
estimate of the backward error for linear least squares problems. SIAM J. Matrix Anal. Appl.,
33:822–836, 2012. (Cited on p. 99.)
[529] Serge Gratton, David Titley-Peloquin, and Jean Tshimanga Ilunga. Sensitivity and conditioning of
the truncated total least squares solution. SIAM J. Matrix Anal. Appl., 34:1257–1276, 2013. (Cited
on p. 226.)
[530] Joseph F. Grcar. Matrix Stretching for Linear Equations. Tech. Report SAND 90-8723, Sandia
National Laboratories, Albuquerque, NM, 1990. (Cited on p. 263.)
[531] Joseph F. Grcar. Spectral condition numbers of orthogonal projections and full rank linear least
squares residuals. SIAM J. Matrix Anal. Appl., 31:2934–2949, 2010. (Cited on p. 31.)
[532] Joseph F. Grcar, Michael A. Saunders, and Z. Su. Estimates of Optimal Backward Perturbations for
Linear Least Squares Problems. Tech. Report SOL 2007-1, Department of Management Science
and Engineering, Stanford University, Stanford, CA, 2007. (Cited on pp. 99, 299.)
[533] Anne Greenbaum. Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences.
Linear Algebra Appl., 113:7–63, 1989. (Cited on p. 295.)
[534] Anne Greenbaum. Iterative Methods for Solving Linear Systems, volume 17 of Frontiers in Applied
Mathematics. SIAM, Philadelphia, 1997. (Cited on pp. 269, 285, 294.)
[535] Anne Greenbaum, Miroslav Rozložník, and Zdeněk Strakoš. Numerical behavior of the modified
Gram–Schmidt GMRES process and related algorithms. BIT Numer. Math., 37:706–719, 1997.
(Cited on p. 302.)
[536] T. N. E. Greville. Note on the generalized inverse of a matrix product. SIAM Rev., 8:518–521, 1966.
(Cited on p. 15.)
[537] T. N. E. Greville. Solutions of the matrix equation XAX = X, and relations between oblique and
orthogonal projectors. SIAM J. Appl. Math., 26:828–832, 1974. (Cited on p. 119.)
[538] Roger G. Grimes, John G. Lewis, and Horst D. Simon. A shifted block Lanczos algorithm for
solving sparse symmetric generalized eigenproblems. SIAM J. Matrix Anal. Appl., 15:228–272,
1994. (Cited on p. 375.)
[539] C. W. Groetsch. The Theory of Tikhonov Regularization for Fredholm Integral Equations of the
First Kind. Pitman, Boston, MA, 1984. (Cited on pp. 171, 177.)
[540] Eric Grosse. Tensor spline approximations. Linear Algebra Appl., 34:29–41, 1980. (Cited on
p. 209.)
[541] Benedikt Großer and Bruno Lang. Efficient parallel reduction to bidiagonal form. Parallel Comput.,
25:969–986, 1999. (Cited on p. 348.)
[542] Benedikt Großer and Bruno Lang. An O(n^2) algorithm for the bidiagonal SVD. Linear Algebra
Appl., 358:45–70, 2003. (Cited on p. 352.)
[543] Benedikt Großer and Bruno Lang. On symmetric eigenproblems induced by the bidiagonal SVD.
SIAM J. Matrix Anal. Appl., 26:599–620, 2005. (Cited on p. 352.)
[544] Marcus J. Grote and Thomas Huckle. Parallel preconditioning with sparse approximate inverses.
SIAM J. Sci. Comput., 18:838–853, 1997. (Cited on p. 314.)
[545] Ming Gu and Stanley C. Eisenstat. A Stable and Fast Algorithm for Updating the Singular Value
Decomposition. Tech. Report RR-939, Department of Computer Science, Yale University, New
Haven, CT, 1993. (Cited on p. 363.)
[546] Ming Gu and Stanley C. Eisenstat. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM
J. Matrix Anal. Appl., 16:79–92, 1995. (Cited on pp. 358, 359.)
[547] Ming Gu and Stanley C. Eisenstat. A divide-and-conquer algorithm for the symmetric tridiagonal
eigenproblem. SIAM J. Matrix Anal. Appl., 16:172–191, 1995. (Cited on p. 358.)
[548] Ming Gu and Stanley C. Eisenstat. Downdating the singular value decomposition. SIAM J. Matrix
Anal. Appl., 16:793–810, 1995. (Cited on p. 363.)
[549] Ming Gu and Stanley C. Eisenstat. Efficient algorithms for computing a strong rank-revealing QR
factorization. SIAM J. Sci. Comput., 17:848–869, 1996. (Cited on p. 84.)
[550] Mårten E. Gulliksson. On the modified Gram–Schmidt algorithm for weighted and constrained
linear least squares problems. BIT Numer. Math., 35:453–468, 1995. (Cited on p. 160.)
[551] Mårten Gulliksson and Per-Åke Wedin. Modifying the QR-decomposition to constrained and
weighted linear least squares. SIAM J. Matrix Anal. Appl., 13:1298–1313, 1992. (Cited on p. 121.)
[552] Mårten E. Gulliksson and Per-Åke Wedin. Perturbation theory for generalized and constrained
linear least squares. Numer. Linear Algebra Appl., 7:181–196, 2000. (Cited on p. 158.)
[553] B. C. Gunter and Robert A. Van de Geijn. Parallel out-of-core computation and updating of the QR
factorization. ACM Trans. Math. Softw., 31:60–78, 2005. (Cited on p. 112.)
[554] Hongbin Guo and Rosemary A. Renaut. A regularized total least squares algorithm. In Sabine Van
Huffel and P. Lemmerling, editors, Total Least Squares and Errors-in-Variables Modeling, pages
57–66. Kluwer Academic Publishers, Dordrecht, 2002. (Cited on p. 226.)
[555] Fred G. Gustavson. Finding the block lower triangular form of a matrix. In J. R. Bunch and D. J.
Rose, editors, Sparse Matrix Computations, pages 275–289. Academic Press, New York, 1976.
(Cited on p. 264.)
[556] Martin H. Gutknecht. Block Krylov space methods for linear systems with multiple right-hand
sides: An introduction. In A. H. Siddiqi, I. S. Duff, and O. Christensen, editors, Modern Math-
ematical Models, Methods, and Algorithms for Real World Systems, pages 420–447. Anamaya
Publishers, New Delhi, India, 2007. (Cited on p. 299.)
[557] Irwin Guttman, Victor Pereyra, and Hugo D. Scolnik. Least squares estimation for a class of
nonlinear models. Technometrics, 15:309–318, 1973. (Cited on p. 405.)
[558] L. A. Hageman, Franklin T. Luk, and David M. Young. On the equivalence of certain iterative
acceleration methods. SIAM J. Numer. Anal., 17:852–873, 1980. (Cited on p. 308.)
[559] Louis A. Hageman and David M. Young. Applied Iterative Methods. Dover, Mineola, NY, 2004.
Unabridged republication of the work first published by Academic Press, New York, 1981. (Cited
on p. 269.)
[560] William W. Hager. Condition estimates. SIAM J. Sci. Statist. Comput., 5:311–316, 1984.
(Cited on pp. 96, 97.)
[561] Anders Hald. Statistical Theory with Engineering Applications. Wiley, New York, 1952. Translated
by G. Seidelin. (Cited on p. 205.)
[562] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic
algorithms for constructing approximate matrix decompositions. SIAM Rev., 53:217–288, 2011.
(Cited on p. 319.)
[563] Sven J. Hammarling. A note on modifications to the Givens plane rotations. J. Inst. Math. Appl.,
13:215–218, 1974. (Cited on p. 50.)
[564] Sven J. Hammarling. The numerical solution of the general Gauss–Markoff linear model. In
T. S. Durrani et al., editors, Mathematics in Signal Processing, pages 441–456. Clarendon Press,
Oxford University Press, New York, 1987. (Cited on p. 123.)
[565] Sven J. Hammarling, Nicholas J. Higham, and Craig Lucas. LAPACK-style codes for pivoted
Cholesky and QR updating. In Bo Kågström, Erik Elmroth, Jack J. Dongarra, and J. Waśniewski,
editors, Applied Parallel Computing: State of the Art in Scientific Computing. Proceedings from the
Eighth International Workshop, PARA 2006, pages 137–146, 2006. (Cited on p. 148.)
[566] Martin Hanke. Accelerated Landweber iteration for the solution of ill-posed equations. Numer.
Math., 60:341–373, 1991. (Cited on p. 326.)
[567] Martin Hanke. Conjugate Gradient Type Methods for Ill-Posed Problems, volume 327 of Pitman
Research Notes in Math. Longman Scientific and Technical, Harlow, UK, 1995. (Cited on p. 334.)
[568] Martin Hanke. On Lanczos based methods for the regularization of discrete ill-posed problems.
BIT Numer. Math., 41:1008–1018, 2001. (Cited on pp. 332, 334.)
[569] Martin Hanke. A Taste of Inverse Problems: Basic Theory and Examples. SIAM, Philadelphia,
2017. (Cited on p. 182.)
[570] Martin Hanke and Per Christian Hansen. Regularization methods for large-scale problems. Surveys
Math. Indust., 3:253–315, 1993. (Cited on pp. 175, 177, 182, 182, 333, 334.)
[571] Martin Hanke, James G. Nagy, and Curtis Vogel. Quasi-Newton approach to nonnegative image
restorations. Linear Algebra Appl., 316:223–236, 2000. (Cited on p. 417.)
[572] Martin Hanke and Curtis R. Vogel. Two-level preconditioners for regularized inverse problems I.
Theory. Numer. Math., 83:385–402, 1999. (Cited on p. 321.)
[573] Per Christian Hansen. The discrete Picard condition for discrete ill-posed problems. BIT Numer.
Math., 30:658–672, 1990. (Cited on p. 172.)
[574] Per Christian Hansen. Analysis of discrete ill-posed problems by means of the L-curve. SIAM Rev.,
34:561–580, 1992. (Cited on p. 178.)
[575] Per Christian Hansen. Regularization tools: A MATLAB package for analysis and solution of
discrete ill-posed problems. Numer. Algorithms, 6:1–35, 1994. (Cited on p. 182.)
[576] Per Christian Hansen. Rank-Deficient and Discrete Ill-Posed Problems. Numerical Aspects of Lin-
ear Inversion. SIAM, Philadelphia, 1998. (Cited on pp. 174, 178, 182, 182, 322.)
[577] Per Christian Hansen. Deconvolution and regularization with Toeplitz matrices. Numer. Algorithms,
29:323–378, 2002. (Cited on p. 173.)
[578] Per Christian Hansen. Regularization tools version 4.0 for MATLAB 7.3. Numer. Algorithms,
46:189–194, 2007. (Cited on p. 182.)
[579] Per Christian Hansen. Discrete Inverse Problems. Insight and Algorithms, volume 7 of Fundamen-
tals of Algorithms. SIAM, Philadelphia, 2010. (Cited on pp. 175, 182, 334.)
[580] Per Christian Hansen. Oblique projections and standard form transformations for discrete inverse
problems. Numer. Linear Algebra Appl., 20:250–258, 2013. (Cited on p. 182.)
[581] Per Christian Hansen and H. Gesmar. Fast orthogonal decomposition of rank deficient Toeplitz
matrices. Numer. Algorithms, 4:151–166, 1993. (Cited on p. 241.)
[582] Per Christian Hansen and Toke Koldborg Jensen. Smoothing-norm preconditioning for regularizing
minimum-residual methods. SIAM J. Matrix Anal. Appl., 29:1–14, 2006. (Cited on p. 322.)
[583] Per Christian Hansen, James G. Nagy, and Dianne P. O’Leary. Deblurring Images. Matrices, Spec-
tra, and Filtering. SIAM, Philadelphia, 2006. (Cited on pp. 182, 239.)
[584] Per Christian Hansen and Dianne Prost O’Leary. The use of the L-curve in the regularization of
discrete ill-posed problems. SIAM J. Sci. Comput., 14:1487–1503, 1993. (Cited on p. 178.)
[585] Per Christian Hansen, T. Sekii, and H. Shibahashi. The modified truncated SVD method for reg-
ularization in general form. SIAM J. Sci. Statist. Comput., 13:1142–1150, 1992. (Cited on pp. 174,
174.)
[586] Per Christian Hansen and Plamen Y. Yalamov. Computing symmetric rank-revealing decompo-
sitions via triangular factorizations. SIAM J. Matrix Anal. Appl., 23:443–458, 2001. (Cited on
p. 79.)
[587] Richard J. Hanson. Linear least squares with bounds and linear constraints. SIAM J. Sci. Statist.
Comput., 7:826–834, 1986. (Cited on p. 162.)
[588] Richard J. Hanson and K. H. Haskell. Algorithm 587: Two algorithms for the linearly constrained
least squares problem. ACM Trans. Math. Softw., 8:323–333, 1982. (Cited on p. 162.)
[589] Richard J. Hanson and Charles L. Lawson. Extensions and applications of the Householder al-
gorithm for solving linear least squares problems. Math. Comp., 23:787–812, 1969. (Cited on
p. 77.)
[590] Richard J. Hanson and Michael J. Norris. Analysis of measurements based on the singular value
decomposition. SIAM J. Sci. Statist. Comput., 2:363–373, 1981. (Cited on pp. 51, 386.)
[591] Vjeran Hari and Krešimir Veselić. On Jacobi’s methods for singular value decompositions. SIAM
J. Sci. Statist. Comput., 8:741–754, 1987. (Cited on p. 357.)
[592] R. V. L. Hartley. A more symmetrical Fourier analysis applied to transmission problems. Proc.
IRE, 30:144–150, 1942. (Cited on p. 320.)
[593] K. H. Haskell and Richard J. Hanson. Selected Algorithm for the Linearly Constrained Least
Squares Problem—A User’s Guide. Tech. Report SAND78–1290, Sandia National Laboratories,
Albuquerque, NM, 1979. (Cited on p. 162.)
[594] K. H. Haskell and Richard J. Hanson. An algorithm for linear least squares problems with equality
and nonnegativity constraints. Math. Prog., 21:98–118, 1981. (Cited on pp. 162, 162.)
[595] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.
Data Mining, Inference, and Prediction. Springer-Verlag, Berlin, second edition, 2009. (Cited on
pp. 420, 426.)
[596] Michael T. Heath. Some extensions of an algorithm for sparse linear least squares problems. SIAM
J. Sci. Statist. Comput., 3:223–237, 1982. (Cited on pp. 261, 262, 263.)
[597] M. T. Heath, A. J. Laub, C. C. Paige, and R. C. Ward. Computing the singular value decomposition
of a product of two matrices. SIAM J. Sci. Statist. Comput., 7:1147–1149, 1986. (Cited on pp. 128,
349.)
[598] M. D. Hebden. An Algorithm for Minimization Using Exact Second Derivatives. Tech. Report T. P.
515, Atomic Energy Research Establishment, Harwell, UK, 1973. (Cited on p. 169.)
[599] I. S. Helland. On the structure of partial least squares regression. Comm. Statist. Theory Methods
Ser. B, 17:581–607, 1988. (Cited on p. 203.)
[600] F. R. Helmert. Die Mathematischen und Physikalischen Theorieen der höheren Geodäsie. Ein-
leitung und 1 Teil: Die mathematischen Theorieen. Druck und Verlag von B. G. Teubner, Leipzig,
1880. (Cited on p. 206.)
[601] H. V. Henderson and S. R. Searle. On deriving the inverse of a sum of matrices. SIAM Rev.,
23:53–60, 1981. (Cited on p. 139.)
[602] H. V. Henderson and S. R. Searle. The vec-permutation matrix, the vec operator and Kronecker
products: A review. Linear Multilinear Algebra, 9:271–288, 1980/1981. (Cited on p. 210.)
[603] Peter Henrici. The quotient-difference algorithm. Nat. Bur. Standards Appl. Math. Ser., 49:23–46,
1958. (Cited on p. 351.)
[604] Peter Henrici. Fast Fourier methods in computational complex analysis. SIAM Rev., 21:481–527,
1979. (Cited on p. 237.)
[605] V. Hernandez, J. E. Román, and A. Tomás. A parallel variant of the Gram–Schmidt process with
reorthogonalization. In G. R. Joubert, W. E. Nagel, F. J. Peters, O. G. Plata, and E. L. Zapata,
editors, Parallel Computing: Current & Future Issues in High-End Computing, volume 33 of John
von Neumann Institute for Computing Series, pages 221–228. Central Institute for Applied Mathe-
matics, Jülich, Germany, 2006. (Cited on p. 109.)
[606] Magnus R. Hestenes. Inversion of matrices by biorthogonalization and related results. J. Soc.
Indust. Appl. Math., 6:51–90, 1958. (Cited on p. 355.)
[607] Magnus R. Hestenes. Conjugacy and gradients. In Stephen G. Nash, editor, A History of Scientific
Computing, volume 60 of IMA Series in Mathematics and Its Applications, pages 167–179. ACM
Press, New York, 1990. (Cited on p. 285.)
[608] Magnus R. Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems.
J. Res. Nat. Bur. Standards, Sect. B, 49:409–436, 1952. (Cited on pp. 278, 280, 281, 285.)
[609] K. L. Hiebert. An evaluation of mathematical software that solves nonlinear least squares problems.
ACM Trans. Math. Softw., 7:1–16, 1981. (Cited on p. 394.)
[610] Nicholas J. Higham. Computing the polar decomposition—with applications. SIAM J. Sci. Statist.
Comput., 7:1160–1174, 1986. (Cited on pp. 379, 383, 384.)
[611] Nicholas J. Higham. Computing real square roots of a real matrix. Linear Algebra Appl.,
88/89:405–430, 1987. (Cited on p. 379.)
[612] Nicholas J. Higham. Error analysis of the Björck–Pereyra algorithm for solving Vandermonde
systems. Numer. Math., 50:613–632, 1987. (Cited on pp. 95, 97, 238.)
[613] Nicholas J. Higham. Fast solution of Vandermonde-like systems involving orthogonal polynomials.
IMA J. Numer. Anal., 8:473–486, 1988. (Cited on p. 238.)
[614] Nicholas J. Higham. Fortran codes for estimating the one-norm of a real or complex matrix, with
applications to condition estimation. ACM Trans. Math. Softw., 14:381–396, 1988. (Cited on
p. 97.)
[615] Nicholas J. Higham. The accuracy of solutions to triangular systems. SIAM J. Numer. Anal.,
26:1252–1265, 1989. (Cited on p. 43.)
[616] Nicholas J. Higham. Analysis of the Cholesky decomposition of a semi-definite matrix. In Mau-
rice G. Cox and Sven J. Hammarling, editors, Reliable Numerical Computation, pages 161–185.
Clarendon Press, Oxford, UK, 1990. (Cited on p. 97.)
[617] Nicholas J. Higham. How accurate is Gaussian elimination? In D. F. Griffiths and G. A. Watson,
editors, Numerical Analysis 1989: Proceedings of the 13th Dundee Biennial Conference, volume
228 of Pitman Research Notes in Math., pages 137–154. Longman Scientific and Technical, Harlow,
UK, 1990. (Cited on pp. 72, 96.)
[618] Nicholas J. Higham. Iterative refinement enhances the stability of QR factorization methods for
solving linear equations. BIT Numer. Math., 31:447–468, 1991. (Cited on pp. 59, 103, 103.)
[619] Nicholas J. Higham. The matrix sign decomposition and its relation to the polar decomposition.
Linear Algebra Appl., 212/213:3–20, 1994. (Cited on p. 380.)
[620] Nicholas J. Higham. A survey of componentwise perturbation theory in numerical linear algebra.
In Walter Gautschi, editor, Mathematics of Computation 1943–1953: A Half-Century of Compu-
tational Mathematics. Mathematics of Computation, 50th Anniversary Symposium, August 9–13,
1993, Vancouver, BC, volume 48 of Proceedings of Symposia in Applied Mathematics, pages 49–
78. AMS, Providence, RI, 1994. (Cited on pp. 29, 31.)
[621] Nicholas J. Higham. Stable iterations for the matrix square root. Numer. Algorithms, 15:227–242,
1997. (Cited on p. 93.)
[622] Nicholas J. Higham. QR factorization with complete pivoting and accurate computation of the
SVD. Linear Algebra Appl., 309:153–174, 2000. (Cited on p. 341.)
[623] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, second
edition, 2002. (Cited on pp. 34, 37, 39, 44, 54, 57, 58, 65, 72, 96, 98, 101, 239.)
[624] Nicholas J. Higham. J-orthogonal matrices: Properties and generation. SIAM Rev., 45:504–519,
2003. (Cited on pp. 19, 137.)
[625] Nicholas J. Higham. Functions of Matrices. Theory and Computation. SIAM, Philadelphia, 2008.
(Cited on pp. 378, 383, 384, 384.)
[626] Nicholas J. Higham. The world’s most fundamental matrix equation decomposition. SIAM News,
Dec.:1–3, 2017. (Cited on p. 138.)
[627] Nicholas J. Higham and Theo Mary. Mixed precision algorithms in numerical linear algebra. Acta
Numer., 31:347–414, 2022. (Cited on p. 104.)
[628] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. J. ACM, 60:Article
45, 2013. (Cited on p. 216.)
[629] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. Math. Phys.,
7:164–189, 1927. (Cited on p. 216.)
[630] Iveta Hnětynková, Martin Plešinger, Diana Maria Sima, Zdeněk Strakoš, and Sabine Van Huffel.
The total least squares problem in AX ≈ B: A new classification with the relationship to the
classical works. SIAM J. Matrix Anal. Appl., 32:748–770, 2011. (Cited on p. 196.)
[631] Iveta Hnětynková, Martin Plešinger, and Zdeněk Strakoš. The regularization effect of the Golub–
Kahan iterative bidiagonalization and revealing the noise level in the data. BIT Numer. Math.,
49:669–696, 2009. (Cited on p. 335.)
[632] Iveta Hnětynková, Martin Plešinger, and Zdeněk Strakoš. The core problem within a linear approxi-
mation problem AX ≈ B with multiple right-hand sides. SIAM J. Matrix Anal. Appl., 34:917–931,
2013. (Cited on p. 196.)
[633] Iveta Hnětynková, Martin Plešinger, and Zdeněk Strakoš. Band generalization of the Golub–Kahan
bidiagonalization, generalized Jacobi matrices, and the core problem. SIAM J. Matrix Anal. Appl.,
36:417–434, 2015. (Cited on pp. 196, 305.)
[634] Michiel E. Hochstenbach. A Jacobi–Davidson type SVD method. SIAM J. Sci. Comput., 23:606–
628, 2001. (Cited on p. 376.)
[635] Michiel E. Hochstenbach. Harmonic and refined extraction methods for the singular value problem,
with applications in least squares problems. BIT Numer. Math., 44:721–754, 2004. (Cited on
pp. 371, 371.)
[636] Michiel E. Hochstenbach and Y. Notay. The Jacobi–Davidson method. GAMM Mitt., 29:368–382,
2006. (Cited on p. 375.)
[637] Walter Hoffmann. Iterative algorithms for Gram–Schmidt orthogonalization. Computing, 41:335–
348, 1989. (Cited on p. 69.)
[638] Y. P. Hong and C.-T. Pan. Rank revealing QR decompositions and the singular value decomposition.
Math. Comp., 58:213–232, 1992. (Cited on pp. 80, 80, 81.)
[639] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge,
UK, 1985. (Cited on pp. 116, 350, 377.)
[640] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press,
Cambridge, UK, 1991. (Cited on pp. 13, 210.)
[641] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge,
UK, second edition, 2012. (Cited on pp. 12, 20.)
[642] H. Hotelling. Relations between two sets of variates. Biometrika, 28:322–377, 1936. (Cited on
p. 19.)
[643] Patricia D. Hough and Stephen A. Vavasis. Complete orthogonal decomposition for weighted least
squares. SIAM J. Matrix Anal. Appl., 18:369–392, 1997. (Cited on p. 133.)
[644] Alston S. Householder. Unitary triangularization of a nonsymmetric matrix. J. Assoc. Comput.
Mach., 5:339–342, 1958. (Cited on pp. 46, 51.)
[645] Alston S. Householder. The Theory of Matrices in Numerical Analysis. Dover, Mineola, NY, 1975.
Corrected republication of work first published in 1964 by Blaisdell Publ. Co., New York. (Cited
on pp. 139, 281.)
[646] Alston S. Householder and Friedrich L. Bauer. On certain iterative methods for solving linear
systems. Numer. Math., 2:55–59, 1960. (Cited on p. 273.)
[647] Gary W. Howell and Marc Baboulin. LU preconditioning for overdetermined sparse least squares
problems. In R. Wyrzykowski, E. Deelman, J. Dongarra, K. Kurczewiski, J. Kitowski, and K. Wiatr,
editors, 11th Internat. Conf. Parallel Proc. and Appl. Math., 2015, Krakow, Poland, volume 9573
of Lecture Notes in Computer Science, pages 128–137, Springer, Heidelberg, 2016. (Cited on
p. 318.)
[648] P. J. Huber. Robust Statistics. Wiley, New York, 1981. (Cited on pp. 422, 422.)
[649] M. F. Hutchinson and F. R. de Hoog. Smoothing noisy data with spline functions. Numer. Math.,
47:99–106, 1985. (Cited on p. 180.)
[650] M. F. Hutchinson and F. R. de Hoog. A fast procedure for calculating minimum cross-validation
cubic smoothing splines. ACM Trans. Math. Softw., 12:150–153, 1986. (Cited on p. 180.)
[651] Tsung-Min Hwang, Wen-Wei Lin, and Dan Pierce. Improved bounds for rank revealing LU factor-
izations. Linear Algebra Appl., 261:173–186, 1997. (Cited on p. 91.)
[652] Tsung-Min Hwang, Wen-Wei Lin, and Eugene K. Yang. Rank revealing LU factorizations. Linear
Algebra Appl., 175:115–141, 1992. (Cited on p. 91.)
[653] Bruno Iannazzo. A note on computing the matrix square root. Calcolo, 40:273–283, 2003. (Cited
on p. 379.)
[654] Bruno Iannazzo. On the Newton method for the matrix pth root. SIAM J. Matrix Anal. Appl.,
28:503–523, 2006. (Cited on p. 379.)
[655] IEEE Standard for Floating-Point Arithmetic. In IEEE Standard 754-2019 (Revision of IEEE Stan-
dard 754-2008), pages 1–84, 2019. (Cited on p. 32.)
[656] Akira Imakura and Yusaku Yamamoto. Efficient implementations of the modified Gram–Schmidt
orthogonalization with a non-standard inner product. Japan J. Indust. Appl. Math., 36:619–641,
2019. (Cited on p. 122.)
[657] D. Irony, Sivan Toledo, and A. Tiskin. Using perturbed QR factorizations to solve linear least-
squares problems. J. Parallel Distrib. Comput., 64:1017–1026, 2004. (Cited on p. 114.)
[658] Carl Gustav Jacob Jacobi. Über eine neue Auflösungsart der bei der Methode der kleinsten Quadrate
vorkommenden lineären Gleichungen. Astron. Nachr., 22:297–306, 1845. (Cited on p. 51.)
[659] Carl Gustav Jacob Jacobi. Über ein leichtes Verfahren die in der Theorie der Säcularstörungen vork-
ommenden Gleichungen numerisch aufzulösen. J. Reine Angew. Math., 30:51–94, 1846. (Cited on
pp. 352, 374.)
[660] M. Jacobsen. Two-grid Iterative Methods for Ill-Posed Problems. Master’s thesis, Technical Uni-
versity of Denmark, Kongens Lyngby, Denmark, 2000. (Cited on p. 322.)
[661] M. Jacobsen, Per Christian Hansen, and Michael A. Saunders. Subspace preconditioned LSQR for
discrete ill-posed problems. BIT Numer. Math., 43:975–989, 2003. (Cited on p. 321.)
[662] W. Jalby and B. Philippe. Stability analysis and improvement of the block Gram–Schmidt algo-
rithm. SIAM J. Sci. Statist. Comput., 12:1058–1073, 1991. (Cited on p. 109.)
[663] Marcus Jankowski and Henryk Woźniakowski. Iterative refinement implies numerical stability. BIT
Numer. Math., 17:303–311, 1977. (Cited on p. 103.)
[664] A. Jennings and M. A. Ajiz. Incomplete methods for solving A^T Ax = b. SIAM J. Sci. Statist.
Comput., 5:978–987, 1984. (Cited on pp. 312, 313.)
[665] Alan Jennings and G. M. Malik. The solution of sparse linear equations by the conjugate gradient
algorithm. Int. J. Numer. Methods Engrg., 12:141–158, 1978. (Cited on p. 306.)
[666] Toke Koldborg Jensen and Per Christian Hansen. Iterative regularization with minimum-residual
methods. BIT Numer. Math., 47:103–120, 2007. (Cited on p. 333.)
[667] E. R. Jessup and D. C. Sorensen. A parallel algorithm for computing the singular value decompo-
sition of a matrix. SIAM J. Matrix Anal. Appl., 15:530–548, 1994. (Cited on pp. 358, 359.)
[668] Zhongxiao Jia. A refined subspace iteration algorithm for large sparse eigenproblems. Appl. Numer.
Math., 32:35–52, 2000. (Cited on p. 370.)
[669] Zhongxiao Jia. Regularization properties of Krylov iterative solvers CGME and LSMR for linear
discrete ill-posed problems with an application to truncated randomized SVDs. Numer. Algorithms,
85:1281–1310, 2020. (Cited on p. 334.)
[670] Zhongxiao Jia and Bingyu Li. On the condition number of the total least squares problem. Numer.
Math., 125:61–87, 2013. (Cited on p. 226.)
[671] Zhongxiao Jia and Datian Niu. A refined harmonic Lanczos bidiagonalization method and an
implicitly restarted algorithm for computing the smallest singular triplets of large matrices. SIAM
J. Sci. Comput., 32:714–744, 2010. (Cited on pp. 372, 372.)
[672] X.-Q. Jin. A preconditioner for constrained and weighted least squares problems with Toeplitz
structure. BIT Numer. Math., 36:101–109, 1996. (Cited on p. 325.)
[673] Pavel Jiránek and David Titley-Peloquin. Estimating the backward error in LSQR. SIAM J. Matrix
Anal. Appl., 31:2055–2074, 2010. (Cited on p. 299.)
[674] Thierry Joffrain, Tze Meng Low, Enrique S. Quintana-Ortí, Robert van de Geijn, and Field G. Van
Zee. Accumulating Householder transformations, revisited. ACM Trans. Math. Softw., 32:169–179,
2006. (Cited on p. 109.)
[675] D. M. Johnson, A. L. Dulmage, and N. S. Mendelsohn. Connectivity and reducibility of graphs.
Canad. J. Math., 14:529–539, 1963. (Cited on p. 264.)
[676] Camille Jordan. Mémoire sur les formes bilinéaires. J. Math. Pures Appl., 19:35–54, 1874. (Cited
on pp. 12, 13.)
[677] Camille Jordan. Essai sur la géométrie à n dimensions. Bull. Soc. Math. France, 3:103–174, 1875.
(Cited on p. 16.)
[678] S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Internat. Acad.
Polon. Sci. Lett., 35:355–357, 1937. (Cited on p. 273.)
[679] Bo Kågström, Per Ling, and Charles F. Van Loan. GEMM-based level 3 BLAS high performance
model implementation and performance evaluation benchmarks. ACM Trans. Math. Softw., 24:268–
302, 1998. (Cited on p. 114.)
[680] W. M. Kahan. Accurate Eigenvalues of a Symmetric Tridiagonal Matrix. Tech. Report CS-41,
Computer Science Department, Stanford University, CA, 1966. Revised June 1968. (Cited on
pp. 76, 350, 351.)
[681] W. M. Kahan. Numerical linear algebra. Canad. Math. Bull., 9:757–801, 1966. (Cited on p. 25.)
[682] C. Kamath and Ahmed Sameh. A projection method for solving nonsymmetric linear systems on
multiprocessors. Parallel Computing, 9:291–312, 1989. (Cited on p. 309.)
[683] W. J. Kammerer and M. Z. Nashed. On the convergence of the conjugate gradient method for
singular linear operator equations. SIAM J. Numer. Anal., 9:165–181, 1972. (Cited on p. 280.)
[684] Igor E. Kaporin. High quality preconditioning of a general symmetric positive definite matrix based
on its U^T U + U^T R + R^T U-decomposition. Numer. Linear Algebra Appl., 5:483–509, 1998.
(Cited on p. 311.)
[685] Rune Karlsson and Bertil Waldén. Estimation of optimal backward perturbation bounds for the
linear least squares problem. BIT Numer. Math., 37:862–869, 1997. (Cited on p. 98.)
[686] Linda Kaufman. Variable projection methods for solving separable nonlinear least squares prob-
lems. BIT Numer. Math., 15:49–57, 1975. (Cited on p. 404.)
[687] Linda Kaufman. Maximum likelihood, least squares, and penalized least squares for PET. IEEE
Trans. Med. Imaging, 12:200–214, 1993. (Cited on pp. 417, 417.)
[688] Linda Kaufman and Victor Pereyra. A method for separable nonlinear least squares problems with
separable nonlinear equality constraints. SIAM J. Numer. Anal., 15:12–20, 1978. (Cited on p. 404.)
[689] Linda Kaufman and Garrett Sylvester. Separable nonlinear least squares with multiple right-hand
sides. SIAM J. Matrix Anal. Appl., 13:68–89, 1992. (Cited on p. 405.)
[690] Herbert B. Keller. On the solution of singular and semidefinite linear systems by iteration. SIAM J.
Numer. Anal., 2:281–290, 1965. (Cited on p. 270.)
[691] Charles Kenney and Alan J. Laub. Rational iterative methods for the matrix sign function. SIAM J.
Matrix Anal. Appl., 12:273–291, 1991. (Cited on p. 382.)
[692] Andrzej Kielbasiński. Analiza numeryczna algorytmu ortogonalizacji Grama–Schmidta. Matem-
atyka Stosowana, 2:15–35, 1974. (Cited on p. 71.)
[693] Andrzej Kielbasiński. Iterative refinement for linear systems in variable-precision arithmetic. BIT
Numer. Math., 21:97–103, 1981. (Cited on p. 104.)
[694] Andrzej Kielbasiński and Krystyna Zietak. Numerical behavior of Higham’s scaled method for
polar decomposition. Numer. Algorithms, 32:105–140, 2003. (Cited on p. 384.)
[695] Misha E. Kilmer, Per Christian Hansen, and Malena I. Español. A projection based approach to
general-form Tikhonov regularization. SIAM J. Sci. Comput., 29:315–330, 2007. (Cited on p. 333.)
[696] Misha E. Kilmer and Dianne P. O’Leary. Choosing regularization parameters in iterative methods
for ill-posed problems. SIAM J. Matrix Anal. Appl., 22:1204–1221, 2007. (Cited on p. 335.)
[697] Hyunsoo Kim and Haesun Park. Nonnegative matrix factorization based on alternating nonnegativ-
ity constrained least squares and active set method. SIAM J. Matrix Anal. Appl., 30:713–730, 2008.
(Cited on p. 419.)
[698] Hyunsoo Kim, Haesun Park, and Lars Eldén. Nonnegative tensor factorization based on alternat-
ing large-scale nonnegativity-constrained least squares. In Proceedings of IEEE 7th International
Conference on Bioinformatics and Bioengineering, volume 2, pages 1147–1151, 2007. (Cited on
p. 420.)
[699] Jingu Kim, Yunlong He, and Haesun Park. Algorithms for nonnegative matrix and tensor factoriza-
tion: A unified view based on block coordinate descent framework. J. Glob. Optim., 58:285–319,
2014. (Cited on p. 420.)
[700] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitri Gorinevsky. An
interior-point method for large-scale ℓ1-regularized least squares. IEEE J. Selected Topics Signal
Process., 1:606–617, 2007. (Cited on p. 429.)
[701] Andrew V. Knyazev and Merico E. Argentati. Principal angles between subspaces in an A-based
scalar product: Algorithms and perturbation estimates. SIAM J. Sci. Comput., 23:2008–2040, 2002.
(Cited on p. 19.)
[702] E. G. Kogbetliantz. Solution of linear equations by diagonalization of coefficients matrix. Quart.
Appl. Math., 13:123–132, 1955. (Cited on p. 356.)
[703] E. Kokiopoulou, C. Bekas, and Efstratios Gallopoulos. Computing smallest singular value triplets
with implicitly restarted Lanczos bidiagonalization. Appl. Numer. Math., 49:39–61, 2004. (Cited
on p. 374.)
[704] G. B. Kolata. Geodesy: Dealing with an enormous computer task. Science, 200:421–422, 1978.
(Cited on p. 3.)
[705] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Rev., 51:455–
500, 2009. (Cited on pp. 217, 217.)
[706] Daniel Kressner. Numerical Methods for General and Structured Eigenvalue Problems. Volume 46
of Lecture Notes in Computational Science and Engineering. Springer, Berlin, 2005. (Cited on
p. 349.)
[707] Daniel Kressner. The periodic QR algorithm is a disguised QR algorithm. Linear Algebra Appl.,
417:423–433, 2005. (Cited on p. 349.)
[708] F. T. Krogh. Efficient implementation of a variable projection algorithm for nonlinear least squares.
Comm. ACM, 17:167–169, 1974. (Cited on p. 404.)
[709] Vera N. Kublanovskaya. On some algorithms for the solution of the complete eigenvalue problem.
Z. Vychisl. Mat. i Mat. Fiz., 1:555–570, 1961. In Russian. English translation in USSR Comput.
Math. Phys., 1:637–657, 1962. (Cited on p. 348.)
[710] Ming-Jun Lai and Yang Wang. Sparse Solutions of Underdetermined Linear Systems and Their
Applications. SIAM, Philadelphia, 2021. (Cited on p. 429.)
[711] Jörg Lampe and Heinrich Voss. A fast algorithm for solving regularized total least squares prob-
lems. ETNA, 31:12–24, 2008. (Cited on p. 226.)
[712] Jörg Lampe and Heinrich Voss. Large-scale Tikhonov regularization of total least squares. J.
Comput. Appl. Math., 238:95–108, 2013. (Cited on p. 226.)
[713] Peter Lancaster and M. Tismenetsky. The Theory of Matrices. With Applications. Academic Press,
New York, 1985. (Cited on p. 210.)
[714] C. Lanczos. Linear Differential Operators. D. Van Nostrand, London, UK, 1961. (Cited on p. 13.)
[715] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear dif-
ferential and integral operators. J. Res. Nat. Bur. Standards, Sect. B, 45:255–281, 1950. (Cited on
pp. 287, 370.)
[716] Cornelius Lanczos. Solution of systems of linear equations by minimized iterations. J. Res. Nat.
Bur. Standards, Sect. B, 49:33–53, 1952. (Cited on p. 288.)
[717] Cornelius Lanczos. Linear systems in self-adjoint form. Amer. Math. Monthly, 65:665–671, 1958.
(Cited on p. 13.)
[718] L. Landweber. An iterative formula for Fredholm integral equations of the first kind. Amer. J.
Math., 73:615–624, 1951. (Cited on p. 325.)
[719] Julien Langou. AllReduce algorithms: Application to Householder QR factorization. In Precon-
ditioning, July 9–12, 2007, CERFACS, Toulouse, 2007. http://www.precond07/enseeih.fr/
Talks/langou/langou.pdf. (Cited on p. 213.)
[720] Julien Langou. Translation and Modern Interpretation of Laplace’s Théorie Analytique des Prob-
abilités, pages 505–512, 516–520. Tech. Report 280, UC Denver CCM, Albuquerque, NM and
Livermore, CA, 2009. (Cited on p. 64.)
[721] P. S. Laplace. Théorie analytique des probabilités. Premier supplément, Courcier, Paris, third
edition, 1816. (Cited on p. 64.)
[722] Rasmus Munk Larsen. Lanczos Bidiagonalization with Partial Reorthogonalization. Tech. Report
DAIMI PB-357, Department of Computer Science, Aarhus University, Denmark, 1998. (Cited on
p. 373.)
[723] Rasmus Munk Larsen. PROPACK: A Software Package for the Singular Value Problem Based
on Lanczos Bidiagonalization with Partial Reorthogonalization. http://soi.stanford.edu/
~rmunk/PROPACK/, SCCM, Stanford University, Stanford, CA, 2000. (Cited on p. 374.)
[724] Peter Läuchli. Jordan-Elimination und Ausgleichung nach kleinsten Quadraten. Numer. Math.,
3:226–240, 1961. (Cited on pp. 40, 316.)
[725] Charles L. Lawson. Contributions to the Theory of Linear Least Maximum Approximation. Ph.D.
thesis, University of California, Los Angeles, 1961. (Cited on p. 423.)
[726] Charles L. Lawson. Sparse Matrix Methods Based on Orthogonality and Conjugacy. Tech. Mem.
33-627, Jet Propulsion Laboratory, Cal. Inst. of Tech., Pasadena, CA, 1973. (Cited on p. 285.)
[727] Charles L. Lawson and Richard J. Hanson. Solving Least Squares Problems, volume 15 of Classics
in Applied Math., SIAM, Philadelphia, 1995. Unabridged, revised republication of the work first
published by Prentice-Hall, Inc., Englewood Cliffs, NJ, 1974. (Cited on pp. 137, 157, 160, 162,
167, 167, 178, 189, 192.)
[728] Charles L. Lawson, Richard J. Hanson, D. R. Kincaid, and Fred T. Krogh. Basic Linear Algebra
Subprograms for Fortran usage. ACM Trans. Math. Softw., 5:308–323, 1979. (Cited on p. 113.)
[729] Adrien-Marie Legendre. Nouvelles méthodes pour la détermination des orbites des comètes.
Courcier, Paris, 1805. (Cited on p. 2.)
[730] R. B. Lehoucq. Implicitly restarted Arnoldi methods and subspace iteration. SIAM J. Matrix Anal.
Appl., 23:551–562, 2001.
[731] R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users’ Guide: Solution of Large-Scale
Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia, 1998. (Cited
on pp. 373, 375.)
[732] Richard B. Lehoucq. The computation of elementary unitary matrices. ACM Trans. Math. Softw.,
22:393–400, 1996. (Cited on p. 47.)
[733] Steven J. Leon, Åke Björck, and Walter Gander. Gram–Schmidt orthogonalization: 100 years and
more. Numer. Linear Algebra Appl., 20:492–532, 2013. (Cited on p. 63.)
[734] Ö. Leringe and Per-Å. Wedin. A Comparison between Different Methods to Compute a Vector x
Which Minimizes ∥Ax−b∥2 When Gx = h. Tech. Report, Department of Computer Science, Lund
University, Lund, Sweden, 1970. (Cited on p. 158.)
[735] S. E. Leurgans, R. T. Ross, and R. B. Abel. A decomposition for three-way arrays. SIAM J. Matrix
Anal. Appl., 14:1064–1083, 1993. (Cited on pp. 217, 373.)
[736] K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quart.
Appl. Math., 2:164–168, 1944. (Cited on pp. 169, 175, 396.)
[737] J. G. Lewis. Algorithm 582: The Gibbs–Poole–Stockmeyer and Gibbs–King algorithms for re-
ordering sparse matrices. ACM Trans. Math. Softw., 8:190–194, 1982. (Cited on p. 251.)
[738] J. G. Lewis. Implementation of the Gibbs–Poole–Stockmeyer and Gibbs–King algorithms. ACM
Trans. Math. Softw., 8:180–189, 1982. (Cited on p. 251.)
[739] J. G. Lewis, D. J. Pierce, and D. K. Wah. Multifrontal Householder QR Factorization. Tech. Report
ECA-TR-127-Revised, Boeing Computer Services, Seattle, WA, 1989. (Cited on p. 258.)
[740] Chi-Kwong Li and Gilbert Strang. An elementary proof of Mirsky’s low rank approximation theo-
rem. Electronic J. Linear Algebra, 36:347–414, 2020. (Cited on p. 24.)
[741] Na Li and Yousef Saad. MIQR: A multilevel incomplete QR preconditioner for large sparse least-
squares problems. SIAM J. Matrix Anal. Appl., 28:524–550, 2006. (Cited on pp. 313, 318.)
[742] Ren-Cang Li. Bounds on perturbations of generalized singular values and of associated subspaces.
SIAM J. Matrix Anal. Appl., 14:195–234, 1993. (Cited on p. 125.)
[743] Ren-Cang Li. Solving Secular Equations Stably and Efficiently. Tech. Report UCB/CSD-94-851,
Computer Science Department, University of California, Berkeley, CA, 1994. (Cited on pp. 359,
361.)
[744] Yuying Li. A globally convergent method for lp problems. SIAM J. Optim., 3:609–629, 1993.
(Cited on p. 425.)
[745] Yuying Li. Solving lp Problems and Applications. Tech. Report CTC93TR122, 03/93, Advanced
Computing Research Institute, Cornell University, Ithaca, NY, 1993. (Cited on p. 425.)
[746] Jörg Liesen and Zdeněk Strakoš. Krylov Subspace Methods: Principles and Analysis. Oxford
University Press, Oxford, UK, 2012. (Cited on pp. 285, 289.)
[747] Lek-Heng Lim. Tensors and hypermatrices. In Leslie Hogben, editor, Handbook of Linear Algebra,
pages 15.1–15.30. Chapman & Hall/CRC Press, Boca Raton, FL, second edition, 2013. (Cited on
p. 217.)
[748] Chih-Jen Lin and Jorge J. Moré. Newton's method for large bound-constrained optimization prob-
lems. SIAM J. Optim., 9:1100–1127, 1999. (Cited on p. 310.)
[749] Per Lindström. A General Purpose Algorithm for Nonlinear Least Squares Problems with Nonlin-
ear Constraints. Tech. Report UMINF–102.83, Institute of Information Processing, University of
Umeå, Sweden, 1983. (Cited on p. 396.)
[750] Per Lindström. Two User Guides, One (ENLSIP) for Constrained — One (ELSUNC) for Uncon-
strained Nonlinear Least Squares Problems. Tech. Report UMINF–109.82 and 110.84, Institute of
Information Processing, University of Umeå, Sweden, 1984. (Cited on p. 411.)
[751] Per Lindström and Per-Å. Wedin. A new linesearch algorithm for unconstrained nonlinear least
squares problems. Math. Program., 29:268–296, 1984. (Cited on p. 395.)
[752] Per Lindström and Per-Å. Wedin. Methods and Software for Nonlinear Least Squares Problems.
Tech. Report UMINF–133.87, Institute of Information Processing, University of Umeå, Sweden,
1988. (Cited on p. 402.)
[753] Richard J. Lipton, Donald J. Rose, and Robert E. Tarjan. Generalized nested dissection. SIAM J.
Numer. Anal., 16:346–358, 1979. (Cited on p. 256.)
[754] Joseph W. H. Liu. On general row merging schemes for sparse Givens transformations. SIAM J.
Sci. Statist. Comput., 7:1190–1211, 1986. (Cited on pp. 255, 256.)
[755] Joseph W. H. Liu. The role of elimination trees in sparse factorization. SIAM J. Matrix Anal. Appl.,
11:134–172, 1990. (Cited on pp. 250, 250, 256, 258.)
[756] Qiaohua Liu. Modified Gram–Schmidt-based methods for block downdating the Cholesky factor-
ization. J. Comput. Appl. Math., 235:1897–1905, 2011. (Cited on p. 148.)
[757] James W. Longley. Modified Gram–Schmidt process vs. classical Gram–Schmidt. Comm. Statist.
Simulation Comput., 10:517–527, 1981. (Cited on p. 62.)
[758] Per Lötstedt. Perturbation bounds for the linear least squares problem subject to linear inequality
constraints. BIT Numer. Math., 23:500–519, 1983. (Cited on p. 167.)
[759] Per Lötstedt. Solving the minimal least squares problem subject to bounds on the variables. BIT
Numer. Math., 24:206–224, 1984. (Cited on p. 167.)
[760] P.-O. Löwdin. On the non-orthogonality problem. Adv. Quantum Chemistry, 5:185–199, 1970.
(Cited on p. 383.)
[761] Szu-Min Lu and Jesse L. Barlow. Multifrontal computation with the orthogonal factors of sparse
matrices. SIAM J. Matrix Anal. Appl., 17:658–679, 1996. (Cited on p. 258.)
[762] Franklin T. Luk. A rotation method for computing the QR-decomposition. SIAM J. Sci. Statist.
Comput., 7:452–459, 1986. (Cited on p. 357.)
[763] Franklin T. Luk and S. Qiao. A new matrix decomposition for signal processing. Automatica,
30:39–43, 1994. (Cited on p. 128.)
[764] I. Lustig, R. Marsten, and D. Shanno. Computational experience with a primal-dual interior point
method for linear programming. Linear Algebra Appl., 152:191–222, 1991. (Cited on pp. 418,
419.)
[765] D. Ma, L. Yang, R. M. T. Fleming, I. Thiele, B. O. Palsson, and M. A. Saunders. Reliable and
efficient solution of genome-scale models of metabolism and macromolecular expression. Sci.
Rep., 7:40863, 2017. (Cited on p. 101.)
[766] Kaj Madsen and Hans Bruun Nielsen. Finite algorithms for robust linear regression. BIT Numer.
Math., 30:682–699, 1990. (Cited on p. 425.)
[767] Kaj Madsen and Hans Bruun Nielsen. A finite smoothing algorithm for linear ℓ1 estimation. SIAM
J. Optim., 3:223–235, 1993. (Cited on p. 422.)
[768] N. Mahdavi-Amiri. Generally Constrained Nonlinear Least Squares and Generating Test Prob-
lems: Algorithmic Approach. Ph.D. thesis, The Johns Hopkins University, Baltimore, MD, 1981.
(Cited on p. 162.)
[769] Alexander N. Malyshev. Parallel algorithms for solving spectral problems of linear algebra. Linear
Algebra Appl., 188:489–520, 1993. (Cited on p. 382.)
[770] Alexander N. Malyshev and Miloud Sadkane. Computation of optimal backward perturbation
bounds for large sparse linear least squares problems. BIT Numer. Math., 41:739–747, 2001. (Cited
on p. 99.)
[771] Rolf Manne. Analysis of two partial-least-squares algorithms for multivariate calibration. Chemom.
Intell. Lab. Syst., 2:187–197, 1987. (Cited on pp. 202, 203.)
[772] P. Manneback. On Some Numerical Methods for Solving Large Sparse Linear Least Squares Prob-
lems. Ph.D. thesis, Facultés Universitaires Notre-Dame de la Paix, Namur, Belgium, 1985. (Cited
on pp. 254, 308.)
[773] P. Manneback, C. Murigande, and Philippe L. Toint. A modification of an algorithm by Golub and
Plemmons for large linear least squares in the context of Doppler positioning. IMA J. Numer. Anal.,
5:221–234, 1985. (Cited on p. 206.)
[774] Thomas A. Manteuffel. An incomplete factorization technique for positive definite linear systems.
Math. Comp., 34:473–497, 1980. (Cited on p. 309.)
[775] A. A. Markov. Wahrscheinlichkeitsrechnung. Liebmann, Leipzig, second edition, 1912. (Cited on
p. 3.)
[776] Ivan Markovsky. Bibliography on total least-squares and related methods. Statist. Interface, 3:1–6,
2010. (Cited on p. 226.)
[777] Ivan Markovsky and Sabine Van Huffel. Overview of total least-squares methods. Signal Process.,
87:2283–2302, 2007. (Cited on p. 226.)
[778] Harry M. Markowitz. The elimination form of the inverse and its application to linear programming.
Management Sci., 3:255–269, 1957. (Cited on p. 251.)
[779] Donald W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. J. Soc.
Indust. Appl. Math., 11:431–441, 1963. (Cited on p. 396.)
[780] J. J. Martínez and J. M. Peña. Fast parallel algorithm of Björck–Pereyra type for solving Cauchy–
Vandermonde linear systems. Appl. Numer. Math., 26:343–352, 1998. (Cited on p. 239.)
[781] W. F. Massy. Principal components regression in exploratory statistical research. J. Amer. Statist.
Assoc., 60:234–246, 1965. (Cited on p. 174.)
[782] Nicola Mastronardi and Paul Van Dooren. The antitriangular factorization of symmetric matrices.
SIAM J. Matrix Anal. Appl., 34:173–196, 2013. (Cited on p. 137.)
[783] Nicola Mastronardi and Paul Van Dooren. An algorithm for solving the indefinite least squares
problem with equality constraints. BIT Numer. Math., 54:201–218, 2014. (Cited on p. 137.)
[784] Pontus Matstoms. QR27—Specification Sheet. Tech. Report, Department of Mathematics,
Linköping University, Sweden, 1992. (Cited on p. 258.)
[785] Pontus Matstoms. Sparse QR factorization in MATLAB. ACM Trans. Math. Softw., 20:136–159,
1994. (Cited on p. 258.)
[786] J. A. Meijerink and Henk A. van der Vorst. An iterative solution method for linear systems of
which the coefficient matrix is a symmetric M-matrix. Math. Comp., 31:148–162, 1977. (Cited on
p. 309.)
[787] Beatrice Meini. The matrix square root from a new functional perspective: Theoretical results and
computational issues. SIAM J. Matrix Anal. Appl., 26:362–376, 2004. (Cited on p. 379.)
[788] Xiangrui Meng. Randomized Algorithms for Large-Scale Strongly Over-Determined Linear Re-
gression Problems. Ph.D. thesis, Stanford University, Stanford, CA, 2014. (Cited on p. 212.)
[789] Xiangrui Meng, Michael A. Saunders, and Michael W. Mahoney. LSRN: A parallel iterative solver
for strongly over- or underdetermined systems. SIAM J. Sci. Comput., 36:C95–C118, 2014. (Cited
on pp. 320, 321.)
[790] G. Merle and Helmut Späth. Computational experience with discrete Lp approximation. Comput-
ing, 12:315–321, 1974. (Cited on p. 424.)
[791] Gérard Meurant. The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Preci-
sion Computations, volume 19 of Software, Environments, and Tools. SIAM, Philadelphia, 2006.
(Cited on p. 285.)
[792] Gérard Meurant and Zdeněk Strakoš. The Lanczos and conjugate gradient algorithms in finite
precision arithmetic. Acta Numer., 15:471–542, 2006. (Cited on pp. 285, 299.)
[793] Carl D. Meyer, Jr. Generalized inversion of modified matrices. SIAM J. Appl. Math., 24:315–323,
1973. (Cited on p. 139.)
[794] Alan J. Miller. Subset Selection in Regression, volume 25 of Monograph on Statistics and Applied
Probability. Chapman & Hall/CRC Press, Boca Raton, FL, second edition, 2002. (Cited on p. 140.)
[795] Kenneth S. Miller. Complex linear least squares. SIAM Rev., 15:706–726, 1973. (Cited on p. 5.)
[796] Luiza Miranian and Ming Gu. Strong rank-revealing LU factorizations. Linear Algebra Appl.,
367:1–16, 2003. (Cited on p. 91.)
[797] L. Mirsky. Symmetric gauge functions and unitarily invariant norms. Quart. J. Math. Oxford,
11:50–59, 1960. (Cited on p. 24.)
[798] S. K. Mitra and C. R. Rao. Projections under seminorms and generalized Moore–Penrose inverses.
Linear Algebra Appl., 9:155–167, 1974. (Cited on p. 158.)
[799] Cleve B. Moler. Iterative refinement in floating point. J. Assoc. Comput. Mach., 14:316–321, 1967.
(Cited on p. 101.)
[800] Alexis Montoison and Dominique Orban. BiLQ: An iterative method for nonsymmetric linear
systems with a quasi-minimum error property. SIAM J. Matrix Anal. Appl., 41:1145–1166, 2020.
(Cited on p. 304.)
[801] Alexis Montoison and Dominique Orban. TriCG and TriMR: Two iterative methods for symmet-
ric quasi-definite systems. SIAM J. Sci. Comput., 43:A2502–A2525, 2021. (Cited on p. 331.)
[802] Alexis Montoison, Dominique Orban, and Michael Saunders. MINARES: An Iterative Solver for
Symmetric Linear Systems. Tech. Report GERAD G-2023-40, École Polytechnique Montreal,
2023. (Cited on p. 300.)
[803] Marc Moonen, Paul Van Dooren, and Joos Vandewalle. A singular value decomposition updating
algorithm for subspace tracking. SIAM J. Matrix Anal. Appl., 13:1015–1038, 1992. (Cited on
pp. 360, 363.)
[804] E. H. Moore. On the reciprocal of the general algebraic matrix. Bull. Amer. Math. Soc., 26:394–395,
1920. (Cited on p. 16.)
[805] José Morales and Jorge Nocedal. Remark on Algorithm 778: L-BFGS-B: Fortran subroutines for
large-scale bound-constrained optimization. ACM Trans. Math. Softw., 38:Article 7, 2011. (Cited
on p. 420.)
[806] Jorge J. Moré. The Levenberg–Marquardt algorithm: Implementation and theory. In G. A. Watson,
editor, Numerical Analysis Proceedings Biennial Conference Dundee 1977, volume 630 of Lecture
Notes in Mathematics, pages 105–116. Springer-Verlag, Berlin, 1978. (Cited on p. 396.)
[807] Jorge J. Moré. Recent developments in algorithms and software for trust region methods. In
A. Bachem, M. Grötschel, and B. Korte, editors, Mathematical Programming. The State of the Art,
Proceedings Bonn 1982, pages 258–287. Springer-Verlag, Berlin, 1983. (Cited on p. 396.)
[808] Jorge J. Moré, B. S. Garbow, and K. E. Hillstrom. Users’ Guide for MINPACK-1. Tech. Report
ANL-80-74, Applied Math. Div., Argonne National Laboratory, Argonne, IL, 1980. (Cited on
p. 402.)
[809] Jorge J. Moré and G. Toraldo. Algorithms for bound constrained quadratic programming problems.
Numer. Math., 55:377–400, 1989. (Cited on p. 420.)
[810] Ronald B. Morgan. A restarted GMRES method augmented with eigenvectors. SIAM J. Matrix
Anal. Appl., 16:1154–1171, 1995. (Cited on p. 337.)
[811] Daisuke Mori, Yusaku Yamamoto, Shao-Liang Zhang, and Takeshi Fukaya. Backward error analy-
sis of the AllReduce algorithm for Householder QR decomposition. Japan J. Indust. Appl. Math.,
29:111–130, 2012. (Cited on p. 213.)
[812] Keiichi Morikuni and Ken Hayami. Inner-iteration Krylov subspace methods for least squares
problems. SIAM J. Matrix Anal. Appl., 34:1–22, 2013. (Cited on p. 309.)
[813] Keiichi Morikuni and Ken Hayami. Convergence of inner-iteration GMRES methods for rank-
deficient least squares problems. SIAM J. Matrix Anal. Appl., 36:225–250, 2015. (Cited on p. 309.)
[814] V. A. Morozov. Methods for Solving Incorrectly Posed Problems. Springer, New York, 1984.
(Cited on p. 177.)
[815] N. Munksgaard. Solving sparse symmetric sets of linear equations by preconditioned conjugate
gradients. ACM Trans. Math. Softw., 6:206–219, 1980. (Cited on p. 310.)
[816] Joseph M. Myre, Erich Frahm, David J. Lilja, and Martin O. Saar. TNT: A solver for large dense
least-squares problems that takes conjugate gradient from bad in theory, to good in practice. In
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Work-
shops, pages 987–995. IEEE, 2018. (Cited on p. 312.)
[817] J. G. Nagy. Toeplitz Least Squares Computations. Ph.D. thesis, North Carolina State University,
Raleigh, NC, 1991. (Cited on p. 324.)
[818] James G. Nagy. Fast inverse QR factorization for Toeplitz matrices. SIAM J. Sci. Comput., 14:1174–
1193, 1993. (Cited on pp. 240, 240.)
[819] James G. Nagy and Zdeněk Strakoš. Enforcing nonnegativity in image reconstruction algorithms. In
Mathematical Modeling, Estimation, and Imaging, pages 182–190. Proc. SPIE 4121, Bellingham,
WA, 2000. (Cited on p. 417.)
[820] Yuji Nakatsukasa, Zhaojun Bai, and François Gygi. Optimizing Halley's iteration for computing
the matrix polar decomposition. SIAM J. Matrix Anal. Appl., 31:2700–2720, 2010. (Cited on
pp. 385, 385.)
[821] Yuji Nakatsukasa and Roland W. Freund. Computing fundamental matrix decompositions accu-
rately via the matrix sign function in two iterations: The power of Zolotarev’s functions. SIAM
Rev., 58:461–493, 2016. (Cited on p. 382.)
[822] Yuji Nakatsukasa and Nicholas J. Higham. Stable and efficient spectral divide and conquer algo-
rithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput., 35:A1325–
A1349, 2013. (Cited on pp. 381, 385.)
[823] M. Zuhair Nashed. Generalized Inverses and Applications. Academic Press, New York, 1976.
(Cited on pp. 15, 16.)
[824] Larry Neal and George Poole. A geometric analysis of Gaussian elimination. II. Linear Algebra
Appl., 173:239–264, 1992. (Cited on p. 86.)
[825] Arkadi Nemirovski and Michael J. Todd. Interior point methods for optimization. Acta Numer.,
17:191–234, 2008. (Cited on p. 419.)
[826] A. S. Nemirovskii. The regularization properties of the adjoint gradient method in ill-posed prob-
lems. USSR Comput. Math. Math. Phys., 26:7–16, 1986. (Cited on p. 334.)
[827] Yurii Nesterov and Arkadi Nemirovski. On first-order algorithms for ℓ1/nuclear norm minimization.
Acta Numer., 22:509–575, 2013. (Cited on p. 429.)
[828] Yurii Nesterov and Arkadi Nemirovskii. Interior Point Polynomial Algorithms in Convex Program-
ming, volume 13 of Studies in Applied Mathematics. SIAM, Philadelphia, 1994. (Cited on p. 419.)
[829] Olavi Nevanlinna. Convergence of Iterations for Linear Equations. Lectures in Mathematics ETH
Zürich. Birkhäuser, Basel, 1993. (Cited on p. 295.)
[830] R. A. Nicolaides. Deflation of conjugate gradients with applications to boundary values problems.
SIAM J. Numer. Anal., 24:355–365, 1987. (Cited on p. 337.)
[831] Ben Noble. Applied Linear Algebra. Prentice-Hall, Englewood Cliffs, NJ, 1969. (Cited on
p. 88.)
[832] Ben Noble. Method for computing the Moore-Penrose generalized inverse and related matters.
In M. Zuhair Nashed, editor, Generalized Inverses and Applications, Proceedings of an Advanced
Seminar, The University of Wisconsin–Madison, October 8–10, 1973, Publication of the Mathe-
matics Research Center, No. 32, pages 245–302, Academic Press, New York, 1976. (Cited on
p. 85.)
[833] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations
Research and Financial Engineering. Springer, New York, second edition, 2006. (Cited on p. 402.)
[834] Paolo Novati and Maria Rosario Russo. A GCV based Arnoldi–Tikhonov regularization method.
BIT Numer. Math., 54:501–521, 2014. (Cited on p. 335.)
[835] W. Oettli and W. Prager. Compatibility of approximate solution of linear equations with given error
bounds for coefficients and right-hand sides. Numer. Math., 6:404–409, 1964. (Cited on p. 99.)
[836] Gabriel Okša, Yusaku Yamamoto, and Marián Vajteršic. Convergence to singular triplets in the two-
sided block-Jacobi SVD algorithm with dynamic ordering. SIAM J. Matrix Anal. Appl., 43:1238–
1262, 2022. (Cited on p. 376.)
[837] Dianne P. O’Leary. The block conjugate gradient algorithm and related methods. Linear Algebra
Appl., 29:293–322, 1980. (Cited on p. 95.)
[838] Dianne P. O’Leary. Robust regression computation using iteratively reweighted least squares. SIAM
J. Matrix Anal. Appl., 11:466–480, 1990. (Cited on pp. 425, 425.)
[839] Dianne P. O’Leary and Bert W. Rust. Variable projection for nonlinear least squares problems.
Comput. Optim. Appl., 54:579–593, 2013. (Cited on p. 404.)
[840] Dianne P. O’Leary and John A. Simmons. A bidiagonalization-regularization procedure for large
scale discretizations of ill-posed problems. SIAM J. Sci. Statist. Comput., 2:474–489, 1981. (Cited
on p. 332.)
[841] Dianne P. O’Leary and P. Whitman. Parallel QR factorization by Householder and modified Gram-
Schmidt algorithms. Parallel Comput., 16:99–112, 1990. (Cited on p. 112.)
[842] S. Oliveira, L. Borges, M. Holzrichter, and T. Soma. Analysis of different partitioning schemes
for parallel Gram–Schmidt algorithms. Internat. J. Parallel Emergent Distrib. Syst., 14:293–320,
2000. (Cited on p. 109.)
[843] Serge J. Olszanskyj, James M. Lebak, and Adam W. Bojanczyk. Rank-k modification methods for
recursive least squares problems. Numer. Algorithms, 7:325–354, 1994. (Cited on p. 148.)
[844] Dominique Orban and Mario Arioli. Iterative Solution of Symmetric Quasi-Definite Linear Systems.
SIAM, Philadelphia, 2017. (Cited on p. 331.)
[845] James M. Ortega and Werner C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several
Variables, volume 30 of Classics in Applied Math., SIAM, Philadelphia, 2000. Unabridged repub-
lication of the work first published by Academic Press, New York and London, 1970. (Cited on
pp. 393, 395, 402.)
[846] M. R. Osborne. Some special nonlinear least squares problems. SIAM J. Numer. Anal., 12:571–592,
1975. (Cited on p. 405.)
[847] Michael R. Osborne. Nonlinear least squares—the Levenberg algorithm revisited. J. Austr. Math.
Soc. Series B, 19:342–357, 1976. (Cited on p. 396.)
[848] Michael R. Osborne. Finite Algorithms in Optimization and Data Analysis. John Wiley & Sons,
New York, 1985. (Cited on p. 423.)
[849] Michael R. Osborne, Brett Presnell, and B. A. Turlach. A new approach to variable selection in
least squares problems. IMA J. Numer. Anal., 20:389–404, 2000. (Cited on p. 425.)
[850] Michael R. Osborne and G. Alistair Watson. On the best linear Chebyshev approximation. Comput. J., 10:172–177, 1967. (Cited on p. 422.)
[851] George Ostrouchov. Symbolic Givens reduction and row-ordering in large sparse least squares
problems. SIAM J. Sci. Statist. Comput., 8:248–264, 1987. (Cited on p. 253.)
[852] C. C. Paige. Bidiagonalization of matrices and solution of linear equations. SIAM J. Numer. Anal.,
11:197–209, 1974. (Cited on p. 291.)
[853] C. C. Paige. Fast numerically stable computations for generalized least squares problems. SIAM J.
Numer. Anal., 16:165–171, 1979. (Cited on p. 122.)
[854] C. C. Paige. Computing the generalized singular value decomposition. SIAM J. Sci. Statist. Comput., 7:1126–1146, 1986. (Cited on p. 126.)
[855] C. C. Paige and M. A. Saunders. Solution of sparse indefinite systems of linear equations. SIAM J.
Numer. Anal., 12:617–629, 1975. (Cited on p. 299.)
[856] C. C. Paige and M. A. Saunders. Towards a generalized singular value decomposition. SIAM J.
Numer. Anal., 18:398–405, 1981. (Cited on pp. 18, 19, 124, 125.)
[857] Christopher C. Paige. The Computation of Eigenvalues and Eigenvectors of Very Large Sparse
Matrices. Ph.D. thesis, University of London, UK, 1971. (Cited on p. 294.)
[858] Christopher C. Paige. Computer solution and perturbation analysis of generalized linear least
squares problems. Math. Comp., 33:171–184, 1979. (Cited on pp. 122, 123.)
[859] Christopher C. Paige. Error analysis of some techniques for updating orthogonal decompositions.
Math. Comp., 34:465–471, 1980. (Cited on p. 144.)
[860] Christopher C. Paige. The general linear model and the generalized singular value decomposition.
Linear Algebra Appl., 70:269–284, 1985. (Cited on p. 126.)
[861] Christopher C. Paige. Some aspects of generalized QR factorizations. In M. G. Cox and Sven J.
Hammarling, editors, Reliable Numerical Computation, pages 71–91. Clarendon Press, Oxford,
UK, 1990. (Cited on pp. 124, 128.)
[862] Christopher C. Paige. A useful form of a unitary matrix obtained from any sequence of unit 2-norm
n-vectors. SIAM J. Matrix Anal. Appl., 31:565–583, 2009. (Cited on p. 65.)
[863] Christopher C. Paige. Accuracy of the Lanczos process for the eigenproblem and solution of equa-
tions. SIAM J. Matrix Anal. Appl., 40:1371–1398, 2019. (Cited on p. 299.)
[864] Christopher C. Paige, Beresford N. Parlett, and Henk A. van der Vorst. Approximate solutions and
eigenvalue bounds from Krylov subspaces. Numer. Linear Algebra Appl., 2:115–133, 1995. (Cited
on p. 370.)
[865] Christopher C. Paige, Miroslav Rozložník, and Zdeněk Strakoš. Modified Gram–Schmidt (MGS),
least squares, and backward stability of MGS-GMRES. SIAM J. Matrix Anal. Appl., 28:264–284,
2006. (Cited on p. 302.)
[866] Christopher C. Paige and Michael A. Saunders. LSQR: An algorithm for sparse linear equations
and sparse least squares. ACM Trans. Math. Softw., 8:43–71, 1982. (Cited on pp. 197, 197, 197,
199, 202, 289, 291, 295, 297, 300, 322.)
[867] Christopher C. Paige and Zdeněk Strakoš. Unifying least squares, total least squares, and data
least squares. In Sabine Van Huffel and P. Lemmerling, editors, Total Least Squares and Errors-
in-Variables Modeling, pages 25–34. Kluwer Academic Publishers, Dordrecht, 2002. (Cited on
pp. 196, 218.)
[868] Christopher C. Paige and Zdeněk Strakoš. Core problems in linear algebraic systems. SIAM J.
Matrix Anal. Appl., 27:861–875, 2006. (Cited on pp. 194, 196, 196.)
[869] Christopher C. Paige and P. Van Dooren. On the quadratic convergence of Kogbetliantz’s algorithm
for computing the singular value decomposition. Linear Algebra Appl., 77:301–313, 1986. (Cited
on p. 357.)
[870] Christopher C. Paige and Musheng Wei. History and generality of the CS decomposition. Linear
Algebra Appl., 208/209:303–326, 1994. (Cited on pp. 19, 19.)
[871] Ching-Tsuan Pan. A modification to the LINPACK downdating algorithm. BIT Numer. Math.,
30:707–722, 1990. (Cited on p. 146.)
[872] Ching-Tsuan Pan. A perturbation analysis on the problem of downdating a Cholesky factorization.
Linear Algebra Appl., 183:103–116, 1993. (Cited on p. 146.)
[873] Ching-Tsuan Pan. On the existence and computation of rank revealing LU factorizations. Linear
Algebra Appl., 316:199–222, 2000. (Cited on pp. 89, 90, 91.)
[874] Ching-Tsuan Pan and Robert J. Plemmons. Least squares modifications with inverse factorizations:
Parallel implementations. J. Comput. Appl. Math., 27:109–127, 1989. (Cited on pp. 136, 149.)
[875] Ching-Tsuan Pan and Ping Tak Peter Tang. Bounds on singular values revealed by QR factorization.
BIT Numer. Math., 39:740–756, 1999. (Cited on p. 83.)
[876] C. H. Papadimitriou. The NP-completeness of the bandwidth minimization problem. Computing, 16:263–270, 1976. (Cited on p. 250.)
[877] A. T. Papadopoulos, Iain S. Duff, and Andrew J. Wathen. A class of incomplete orthogonal factorization methods II: Implementation and results. BIT Numer. Math., 45:159–179, 2005. (Cited on p. 315.)
[878] J. M. Papy, Lieven De Lathauwer, and Sabine Van Huffel. Exponential data fitting using multilinear algebra: The single-channel and multi-channel case. Numer. Linear Algebra Appl., 12:809–826, 2005. (Cited on p. 218.)
[879] Haesun Park. A parallel algorithm for the unbalanced orthogonal Procrustes problem. Parallel
Comput., 17:913–923, 1991. (Cited on p. 387.)
[880] Haesun Park and Lars Eldén. Downdating the rank-revealing URV decomposition. SIAM J. Matrix
Anal. Appl., 16:138–155, 1995. (Cited on p. 155.)
[881] Haesun Park and Lars Eldén. Stability analysis and fast algorithms for triangularization of Toeplitz
matrices. Numer. Math., 76:383–402, 1997. (Cited on p. 240.)
[882] Haesun Park and Sabine Van Huffel. Two-way bidiagonalization scheme for downdating the sin-
gular value decomposition. Linear Algebra Appl., 222:23–40, 1995. (Cited on p. 362.)
[883] Beresford N. Parlett. The new QD algorithms. Acta Numer., 4:459–491, 1995. (Cited on p. 352.)
[884] Beresford N. Parlett. The Symmetric Eigenvalue Problem, volume 20 of Classics in Applied Math.,
SIAM, Philadelphia, 1998. Unabridged republication of the work first published by Prentice-Hall,
Englewood Cliffs, NJ, 1980. (Cited on pp. 51, 69, 199, 345, 366, 366, 367, 370, 370.)
[885] Beresford N. Parlett and W. G. Poole, Jr. A geometric theory for the QR, LU and power iteration. SIAM J. Numer. Anal., 10:389–412, 1973. (Cited on p. 368.)
[886] S. V. Parter. The use of linear graphs in Gauss elimination. SIAM Rev., 3:119–130, 1961. (Cited
on p. 249.)
[887] PDCO: MATLAB Convex Optimization Software. http://stanford.edu/group/SOL/software/pdco, 2018. (Cited on pp. 427, 428.)
[888] John W. Pearson and Jennifer Pestana. Preconditioners for Krylov subspace methods: An overview.
GAMM Mitt., 43:1–35, 2020. (Cited on p. 287.)
[889] Roger Penrose. A generalized inverse for matrices. Proc. Cambridge Philos. Soc., 51:406–413,
1955. (Cited on p. 14.)
[890] Victor Pereyra. Iterative methods for solving nonlinear least squares problems. SIAM J. Numer.
Anal., 4:27–36, 1967. (Cited on p. 393.)
[891] G. Peters and J. H. Wilkinson. Inverse iteration, ill-conditioned equations and Newton’s method.
SIAM Rev., 21:339–360, 1979. (Cited on p. 224.)
[892] G. Peters and James H. Wilkinson. The least squares problem and pseudo-inverses. Comput. J.,
13:309–316, 1970. (Cited on pp. 84, 88.)
[893] Émile Picard. Quelques remarques sur les équations intégrales de première espèce et sur certains problèmes de physique mathématique. C. R. Acad. Sci. Paris, 148:1563–1568, 1909. (Cited on p. 13.)
[894] Daniel J. Pierce and John G. Lewis. Sparse multifrontal rank revealing QR factorization. SIAM J.
Matrix Anal. Appl., 18:159–180, 1997. (Cited on pp. 258, 261.)
[895] R. L. Plackett. The discovery of the method of least squares. Biometrika, 59:239–251, 1972. (Cited
on p. 2.)
[896] Robert J. Plemmons. Monotonicity and iterative approximations involving rectangular matrices.
Math. Comp., 26:853–858, 1972. (Cited on p. 270.)
[897] Robert J. Plemmons. Linear least squares by elimination and MGS. J. Assoc. Comput. Mach.,
21:581–585, 1974. (Cited on p. 88.)
[898] Robert J. Plemmons. Adjustment by least squares in geodesy using block iterative methods for
sparse matrices. In Proceedings of the 1979 Army Numerical Analysis and Computer Conference,
pages 151–186, El Paso, TX, 1979. (Cited on p. 317.)
[899] Martin Plešinger. The Total Least Squares Problem and Reduction of Data in AX ≈ B. Ph.D.
thesis, Technical University of Liberec, Czech Republic, 2008. (Cited on pp. 196, 305.)
[900] L. F. Portugal, J. J. Júdice, and L. N. Vicente. A comparison of block pivoting and interior-point
algorithms for linear least squares problems with nonnegative variables. Math. Comp., 63:625–643,
1994. (Cited on p. 419.)
[901] A. Pothen. Sparse Null Bases and Marriage Theorems. Ph.D. thesis, Cornell University, Ithaca,
NY, 1984. (Cited on p. 265.)
[902] Alex Pothen and C. J. Fan. Computing the block triangular form of a sparse matrix. ACM Trans. Math. Softw., 16:303–324, 1990. (Cited on pp. 265, 265.)
[903] M. J. D. Powell and J. K. Reid. On applying Householder’s method to linear least squares problems.
In A. J. H. Morell, editor, Proceedings of the IFIP Congress 68, pages 122–126. North-Holland,
Amsterdam, 1969. (Cited on p. 130.)
[904] Srikara Pranesh. Low precision floating-point formats: The wild west of computer arithmetic.
SIAM News, 52:12, 2019. (Cited on p. 32.)
[905] Vaughan Pratt. Direct least-squares fitting of algebraic surfaces. ACM SIGGRAPH Comput. Graph-
ics, 21:145–152, 1987. (Cited on p. 411.)
[906] Chiara Puglisi. Modification of the Householder method based on the compact WY representation.
SIAM J. Sci. Statist. Comput., 13:723–726, 1992. (Cited on p. 109.)
[907] Chiara Puglisi. QR Factorization of Large Sparse Overdetermined and Square Matrices with the
Multifrontal Method in a Multiprocessing Environment. Ph.D. thesis, Institut National Polytech-
nique de Toulouse, Toulouse, France, 1993. (Cited on p. 258.)
[908] Charles M. Rader and Allen O. Steinhardt. Hyperbolic Householder transforms. SIAM J. Matrix
Anal. Appl., 9:269–290, 1988. (Cited on p. 137.)
[909] Rui Ralha. One-sided reduction to bidiagonal form. Linear Algebra Appl., 358:219–238, 2003.
(Cited on p. 194.)
[910] Håkan Ramsin and Per-Å. Wedin. A comparison of some algorithms for the nonlinear least squares
problem. BIT Numer. Math., 17:72–90, 1977. (Cited on p. 398.)
[911] Bhaskar D. Rao and Kenneth Kreutz-Delgado. An affine scaling methodology for best basis selection. IEEE Trans. Signal Process., 47:187–200, 1999. (Cited on p. 429.)
[912] C. R. Rao. Linear Statistical Inference and Its Applications. John Wiley, New York, second edition, 1973. (Cited on p. 128.)
[913] K. R. Rao and P. Yip. Discrete Cosine Transforms. Academic Press, New York, 1990. (Cited on p. 237.)
[914] Lord Rayleigh. On the calculation of the frequency of vibration of a system in its gravest mode with an example from hydrodynamics. Philos. Mag., 47:556–572, 1899. (Cited on p. 376.)
[915] Shaked Regev and Michael A. Saunders. Ssai: A Symmetric Approximate Inverse Preconditioner
for the Conjugate Gradient Methods PCG and PCGLS. Tech. Report, Working Paper, SOL and
ICME, Stanford University, Stanford, CA, 2022. (Cited on p. 314.)
[916] Lothar Reichel. Fast QR decomposition of Vandermonde-like matrices and polynomial least
squares approximation. SIAM J. Matrix Anal. Appl., 12:552–564, 1991. (Cited on pp. 230, 238,
239.)
[917] Lothar Reichel and William B. Gragg. FORTRAN subroutines for updating the QR decomposition.
ACM Trans. Math. Softw., 16:369–377, 1990. (Cited on p. 150.)
[918] Lothar Reichel and Qiang Ye. A generalized LSQR algorithm. Numer. Linear Algebra Appl.,
15:643–660, 2008. (Cited on p. 305.)
[919] John K. Reid. A note on the least squares solution of a band system of linear equations by Householder reductions. Comput. J., 10:188–189, 1967. (Cited on p. 188.)
[920] John K. Reid. A note on the stability of Gaussian elimination. J. Inst. Math. Appl., 8:374–375,
1971. (Cited on p. 281.)
[921] John K. Reid. Implicit scaling of linear least squares problems. BIT Numer. Math., 40:146–157,
2000. (Cited on p. 160.)
[922] Christian H. Reinsch. Smoothing by spline functions. Numer. Math., 10:177–183, 1967. (Cited on
p. 189.)
[923] Christian H. Reinsch. Smoothing by spline functions II. Numer. Math., 16:451–454, 1971. (Cited
on pp. 169, 176.)
[924] Rosemary A. Renaut and Hongbin Guo. Efficient algorithms for solution of regularized total least
squares problems. SIAM J. Matrix Anal. Appl., 26:457–476, 2005. (Cited on p. 226.)
[925] J. R. Rice. PARVEC Workshop on Very Large Least Squares Problems and Supercomputers. Tech.
Report CSD-TR 464, Purdue University, West Lafayette, IN, 1983. (Cited on p. 206.)
[926] John R. Rice. A theory of condition. SIAM J. Numer. Anal., 3:287–310, 1966. (Cited on p. 62.)
[927] John R. Rice and Karl H. Usow. The Lawson algorithm and extensions. Math. Comp., 24:118–127,
1968. (Cited on p. 423.)
[928] J. L. Rigal and J. Gaches. On the compatibility of a given solution with the data of a linear system.
J. Assoc. Comput. Mach., 14:543–548, 1967. (Cited on p. 97.)
[929] J. D. Riley. Solving systems of linear equations with a positive definite symmetric but possibly
ill-conditioned matrix. Math. Tables Aids. Comput., 9:96–101, 1956. (Cited on pp. 177, 326.)
[930] Walter Ritz. Über eine neue Methode zur Lösung gewisser Variationsprobleme der mathematischen
Physik. J. Reine Angew. Math., 136:1–61, 1908. (Cited on p. 376.)
[931] Marielba Rojas, Sandra A. Santos, and Danny C. Sorensen. Algorithm 873: LSTRS: MATLAB software for large-scale trust-region subproblems and regularization. ACM Trans. Math. Softw., 34:11:1–11:28, 2008. (Cited on pp. 182, 375.)
[932] Marielba Rojas and Danny C. Sorensen. A trust-region approach to the regularization of large-
scale discrete forms of ill-posed problems. SIAM J. Sci. Comput., 23:1842–1860, 2002. (Cited on
p. 182.)
[933] Marielba Rojas and Trond Steihaug. An interior-point trust-region-based method for large-scale
non-negative regularization. Inverse Problems, 18:1291–1307, 2002. (Cited on p. 419.)
[934] Vladimir Rokhlin and Mark Tygert. A fast randomized algorithm for overdetermined linear least
squares regression. Proc. Natl. Acad. Sci. USA, 105:13212–13217, 2008. (Cited on p. 319.)
[935] D. J. Rose. A graph-theoretic study of the numerical solution of sparse positive definite systems of
linear equations. In R. C. Read, editor, Graph Theory and Computing, pages 183–217, Academic
Press, New York, 1972. (Cited on pp. 249, 251.)
[936] J. Ben Rosen, Haesun Park, and John Glick. Total least norm formulation and solution for structured problems. SIAM J. Matrix Anal. Appl., 17:110–126, 1996. (Cited on pp. 227, 411.)
[937] Roman Rosipal and Nicole Krämer. Overview and recent advances in partial least squares. In C.
Saunders et al., eds., Proceedings of International Statistics and Optimization Perspectives Work-
shop, “Subspace, Latent Structure and Feature Selection,” volume 3940 of Lecture Notes in Com-
puter Science, pages 34–51. Springer, Berlin, 2006. (Cited on p. 203.)
[938] Miroslav Rozložník, Alicja Smoktunowicz, Miroslav Tůma, and Jiří Kopal. Numerical stability of orthogonalization methods with a non-standard inner product. BIT Numer. Math., 52:1035–1058, 2012. (Cited on p. 122.)
[939] Axel Ruhe. Accelerated Gauss–Newton algorithms for nonlinear least squares problems. BIT
Numer. Math., 19:356–367, 1979. (Cited on p. 395.)
[940] Axel Ruhe. Numerical aspects of Gram–Schmidt orthogonalization of vectors. Linear Algebra
Appl., 52/53:591–601, 1983. (Cited on p. 71.)
[941] Axel Ruhe. Rational Krylov: A practical algorithm for large sparse nonsymmetric matrix pencils. SIAM J. Sci. Comput., 19:1535–1551, 1998. (Cited on p. 376.)
[942] Axel Ruhe and Per Åke Wedin. Algorithms for separable nonlinear least squares problems. SIAM
Rev., 22:318–337, 1980. (Cited on pp. 404, 406.)
[943] Siegfried M. Rump. INTLAB - INTerval LABoratory. In Tibor Csendes, editor, Developments
in Reliable Computing, pages 77–104. Kluwer Academic Publishers, Dordrecht, 1999. (Cited on
p. 34.)
[944] Siegfried M. Rump. Fast and parallel interval arithmetic. BIT Numer. Math., 39:534–554, 1999.
(Cited on p. 34.)
[945] Siegfried M. Rump. Ill-conditioned matrices are componentwise near to singularity. SIAM Rev., 41:102–112, 1999. (Cited on p. 31.)
[946] Heinz Rutishauser. Der Quotienten-Differenzen-Algorithmus. Z. Angew. Math. Phys., 5:233–251,
1954. (Cited on p. 351.)
[947] Heinz Rutishauser. Solution of eigenvalue problems with the LR-transformation. Nat. Bur. Stan-
dards Appl. Math. Ser., 49:47–81, 1958. (Cited on p. 339.)
[948] Heinz Rutishauser. Theory of gradient methods. In M. Engeli, Th. Ginsburg, H. Rutishauser,
and E. Stiefel, editors, Refined Methods for Computation of the Solution and the Eigenvalues of
Self-Adjoint Boundary Value Problems, pages 24–50. Birkhäuser, Basel/Stuttgart, 1959. (Cited on
p. 326.)
[949] Heinz Rutishauser. On Jacobi rotation patterns. In Proceedings of Symposia in Applied Math-
ematics, Vol. XV: Experimental Arithmetic, High Speed Computing and Mathematics. American
Mathematical Society, Providence, RI, pages 219–239, 1963. (Cited on p. 341.)
[950] Heinz Rutishauser. The Jacobi method for real symmetric matrices. In F. L. Bauer et al., editors,
Handbook for Automatic Computation. Vol. II, Linear Algebra, pages 201–211. Springer, New
York, 1971. Prepublished in Numer. Math., 9:1–10, 1966. (Cited on p. 353.)
[951] Heinz Rutishauser. Description of ALGOL 60. Handbook for Automatic Computation. Vol. I, Part
a. Springer-Verlag, Berlin, 1967. (Cited on p. 69.)
[952] Heinz Rutishauser. Once again: The least squares problem. Linear Algebra Appl., 1:479–488, 1968. (Cited on p. 177.)
[953] Yousef Saad. Preconditioning techniques for nonsymmetric and indefinite linear systems. J. Com-
put. Appl. Math., 24:89–105, 1988. (Cited on p. 313.)
[954] Yousef Saad. Numerical Methods for Large Eigenvalue Problems. Halsted Press, New York, 1992. (Cited on p. 375.)
[955] Yousef Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci. Statist. Com-
put., 14:461–469, 1993. (Cited on pp. 303, 304.)
[956] Yousef Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, Boston, MA, 1996. (Cited on pp. 281, 284.)
[957] Yousef Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second edition,
2003. (Cited on p. 269.)
[958] Yousef Saad. Numerical Methods for Large Eigenvalue Problems, volume 66 of Classics in Applied
Math., SIAM, Philadelphia, revised edition, 2011. Updated edition of the work first published by
Manchester University Press, 1992. (Cited on p. 375.)
[959] Youcef Saad and Martin H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 7:856–869, 1986. (Cited on p. 301.)
[960] Yousef Saad and Henk A. van der Vorst. Iterative solution of linear systems in the 20th century. J.
Comput. Appl. Math., 123:1–33, 2000. (Cited on p. 269.)
[961] Y. Saad, M. Yeung, J. Erhel, and F. Guyomarc’h. A deflated version of the conjugate gradient
algorithm. SIAM J. Sci. Comput., 21:1909–1926, 2000. (Cited on pp. 312, 336, 336.)
[962] Douglas E. Salane. A continuation approach for solving large-residual nonlinear least squares
problems. SIAM J. Sci. Statist. Comput., 8:655–671, 1987. (Cited on p. 402.)
[963] Michael A. Saunders. Large-Scale Linear Programming Using the Cholesky Factorization. Tech.
Report CS252, Computer Science Department, Stanford University, Stanford, CA, 1972. (Cited
on pp. 146, 146.)
[964] Michael A. Saunders. Sparse least squares by conjugate gradients: A comparison of precondition-
ing methods. In J. F. Gentleman, editor, Proc. Computer Science and Statistics 12th Annual Sym-
posium on the Interface, pages 15–20. University of Waterloo, Canada, 1979. (Cited on p. 317.)
[965] Michael A. Saunders. Solution of sparse rectangular systems using LSQR and Craig. BIT Numer.
Math., 35:588–604, 1995. (Cited on pp. 292, 328.)
[966] M. A. Saunders, H. D. Simon, and E. L. Yip. Two conjugate-gradient-type methods for unsymmet-
ric systems. SIAM J. Numer. Anal., 25:927–940, 1988. (Cited on pp. 305, 331.)
[967] Werner Sautter. Error analysis of Gauss elimination for the best least squares solution. Numer.
Math., 30:165–184, 1978. (Cited on p. 85.)
[968] Berkant Savas and Lek-Heng Lim. Quasi-Newton methods on Grassmannians and multilinear
approximations of tensors. SIAM J. Sci. Comput., 32:3352–3393, 2010. (Cited on p. 217.)
[969] Robert Schatten. Norm Ideals of Completely Continuous Operators. Ergebnisse der Mathematik
und ihrer Grenzgebiete, Neue Folge. Springer Verlag, Berlin, 1960. (Cited on p. 22.)
[970] K. Schittkowski. Solving constrained nonlinear least squares problems by a general purpose SQP-
method. In K.-H. Hoffmann, J. B. Hiriart-Urruty, C. Lemaréchal, and J. Zowe, editors, Trends in
Mathematical Optimization, volume 84 of International Series of Numerical Mathematics, pages
49–83. Birkhäuser-Verlag, Basel, Switzerland, 1985. (Cited on p. 162.)
[971] Erhard Schmidt. Zur Theorie der linearen und nichtlinearen Integralgleichungen. 1 Teil. Entwick-
lung willkürlicher Funktionen nach Systemen vorgeschriebener. Math. Ann., 63:433–476, 1907.
(Cited on p. 63.)
[972] Erhard Schmidt. Über die Auflösung linearer Gleichungen mit unendlich vielen Unbekannten.
Rend. Circ. Mat. Palermo. Ser. 1, 25:53–77, 1908. (Cited on p. 63.)
[973] P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31:1–10, 1966. (Cited on p. 386.)
[974] Robert S. Schreiber. A new implementation of sparse Gaussian elimination. ACM Trans. Math.
Softw., 8:256–276, 1982. (Cited on p. 250.)
[975] Robert Schreiber and Charles Van Loan. A storage efficient WY representation for products of
Householder transformations. SIAM J. Sci. Statist. Comput., 10:53–57, 1989. (Cited on p. 106.)
[976] Günther Schulz. Iterative Berechnung der reziproken Matrix. Z. Angew. Math. Mech., 13:57–59, 1933. (Cited on p. 379.)
[977] H. R. Schwarz. Tridiagonalization of a symmetric band matrix. Numer. Math., 12:231–241, 1968.
Also appears in [1123, pp. 273–283]. (Cited on p. 341.)
[978] H. R. Schwarz, Hans Rutishauser, and Eduard Stiefel. Matrizen-Numerik. Teubner Verlag, Stuttgart,
1986. (Cited on p. 62.)
[979] Hubert Schwetlick. Nonlinear parameter estimation: Models, criteria and estimation. In D. F.
Griffiths and G. A. Watson, editors, Numerical Analysis 1991. Proceedings of the 14th Dundee
Conference on Numerical Analysis, volume 260 of Pitman Research Notes in Mathematics, pages
164–193. Longman Scientific and Technical, Harlow, UK, 1992. (Cited on p. 402.)
[980] Hubert Schwetlick and V. Tiller. Numerical methods for estimating parameters in nonlinear models
with errors in the variables. Technometrics, 27:17–24, 1985. (Cited on p. 410.)
[981] Hubert Schwetlick and Volker Tiller. Nonstandard scaling matrices for trust region Gauss–Newton
methods. SIAM J. Sci. Statist. Comput., 10:654–670, 1989. (Cited on p. 410.)
[982] Hugo D. Scolnik. On the solution of non-linear least squares problems. In C. V. Freiman, J. E.
Griffith, and J. L. Rosenfeld, editors, Proc. IFIP Congress 71. Vol. 2, pages 1258–1265. North-
Holland, Amsterdam, 1972. (Cited on p. 405.)
[983] Jennifer Scott. On using Cholesky-based factorizations and regularization for solving rank-deficient
sparse linear least-squares problems. SIAM J. Sci. Comput., 39:C319–C339, 2017. (Cited on
p. 315.)
[984] Jennifer A. Scott and Miroslav Tůma. The importance of structure in incomplete factorization
preconditioners. BIT Numer. Math., 51:385–404, 2011. (Cited on p. 310.)
[985] Jennifer A. Scott and Miroslav Tůma. HSL_MI28: An efficient and limited-memory incomplete
Cholesky factorization code. ACM Trans. Math. Softw., 40:Article 24, 2014. (Cited on pp. 311,
315.)
[986] Jennifer Scott and Miroslav Tůma. On positive semidefinite modification schemes for incomplete
Cholesky factorization. SIAM J. Sci. Comput., 36:A609–A633, 2014. (Cited on p. 311.)
[987] Jennifer Scott and Miroslav Tůma. Preconditioning of linear least squares by robust incomplete
factorization for implicitly held normal equations. SIAM J. Sci. Comput., 38:C603–C623, 2016.
(Cited on p. 315.)
[988] Jennifer Scott and Miroslav Tůma. Solving mixed sparse-dense linear least-squares problems by
preconditioned iterative methods. SIAM J. Sci. Comput., 39:A2422–A2437, 2017. (Cited on
p. 263.)
[989] Jennifer A. Scott and Miroslav Tůma. A Schur complement approach to preconditioning sparse
least squares with some dense rows. Numer. Algor., 79:1147–1168, 2018. (Cited on p. 263.)
[990] Jennifer A. Scott and Miroslav Tůma. Sparse stretching for solving sparse-dense linear least-
squares problems. SIAM J. Sci. Comput., 41:A1604–A1625, 2019. (Cited on p. 263.)
[991] Jennifer A. Scott and Miroslav Tůma. Algorithms for Sparse Linear Systems. Nečas Center Series. Birkhäuser, Cham, 2023. (Cited on p. 244.)
[992] Shayle R. Searle. Extending some results and proofs for the singular linear model. Linear Algebra
Appl., 210:139–151, 1994. (Cited on p. 128.)
[993] V. de Silva and Lek-Heng Lim. Tensor rank and the ill-posedness of the best low rank approxima-
tion. SIAM J. Matrix Anal. Appl., 30:1084–1127, 2008. (Cited on pp. 215, 216, 218.)
[994] Diana Sima, Sabine Van Huffel, and Gene H. Golub. Regularized total least squares based on quadratic eigenvalue solvers. BIT Numer. Math., 44:793–812, 2004. (Cited on p. 225.)
[995] Horst D. Simon. Analysis of the symmetric Lanczos algorithm with reorthogonalization methods.
Linear Algebra Appl., 61:101–131, 1984. (Cited on p. 298.)
[996] Horst D. Simon and Hongyuan Zha. Low-rank matrix approximation using the Lanczos bidiagonal-
ization process with applications. SIAM J. Sci. Comput., 21:2257–2274, 2000. (Cited on pp. 203,
298, 371.)
[997] Valeria Simoncini and Daniel B. Szyld. On the occurrence of superlinear convergence of exact and
inexact Krylov subspace methods. SIAM Rev., 47:247–272, 2005. (Cited on p. 299.)
[998] Valeria Simoncini and Daniel B. Szyld. Recent computational developments in Krylov subspace
methods for linear systems. Numer. Linear Algebra Appl., 14:1–59, 2007. (Cited on p. 337.)
[999] Lennart Simonsson. Subspace Computations via Matrix Decompositions and Geometric Optimiza-
tion. Ph.D. thesis, Linköping Studies in Science and Technology No. 1052, Linköping, Sweden,
2006. (Cited on p. 155.)
[1000] Robert D. Skeel. Scaling for numerical stability in Gaussian elimination. J. Assoc. Comput. Mach.,
26:494–526, 1979. (Cited on p. 31.)
[1001] Robert D. Skeel. Iterative refinement implies numerical stability for Gaussian elimination. Math.
Comp., 35:817–832, 1980. (Cited on p. 103.)
[1002] Gerard L. G. Sleijpen and Henk A. van der Vorst. A Jacobi–Davidson iteration method for linear
eigenvalue problems. SIAM J. Matrix Anal. Appl., 17:401–425, 1996. (Cited on pp. 374, 375,
376.)
[1003] Gerard L. G. Sleijpen and Henk A. van der Vorst. A Jacobi–Davidson iteration method for linear
eigenvalue problems. SIAM Rev., 42:267–293, 2000. (Cited on p. 375.)
[1004] S. W. Sloan. An algorithm for profile and wavefront reduction of sparse matrices. Int. J. Numer.
Methods Eng., 23:239–251, 1986. (Cited on p. 251.)
[1005] B. T. Smith, J. M. Boyle, Jack J. Dongarra, B. S. Garbow, Y. Ikebe, Virginia C. Klema, and Cleve B. Moler. Matrix Eigensystem Routines—EISPACK Guide, volume 6 of Lecture Notes in Computer Science. Springer, New York, second edition, 1976. (Cited on p. 113.)
[1006] Alicja Smoktunowicz, Jesse L. Barlow, and Julien Langou. A note on the error analysis of the
classical Gram–Schmidt. Numer. Math., 105:299–313, 2006. (Cited on p. 63.)
[1007] Inge Söderkvist. Perturbation analysis of the orthogonal Procrustes problem. BIT Numer. Math.,
33:687–694, 1993. (Cited on p. 387.)
[1008] Inge Söderkvist and Per-Åke Wedin. Determining the movements of the skeleton using well-
configured markers. J. Biomech., 26:1473–1477, 1993. (Cited on p. 386.)
[1009] Inge Söderkvist and Per-Åke Wedin. On condition numbers and algorithms for determining a rigid
body movement. BIT Numer. Math., 34:424–436, 1994. (Cited on p. 386.)
[1010] Torsten Söderström and G. W. Stewart. On the numerical properties of an iterative method for
computing the Moore–Penrose generalized inverse. SIAM J. Numer. Anal., 11:61–74, 1974. (Cited
on p. 380.)
[1011] D. C. Sorensen. Implicit application of polynomial filters in a k-step Arnoldi method. SIAM J.
Matrix Anal. Appl., 13:357–385, 1992. (Cited on pp. 372, 373.)
[1012] Danny C. Sorensen. Numerical methods for large eigenvalue problems. Acta Numer., 11:519–584,
2002. (Cited on p. 375.)
[1013] David Sourlier. Three-Dimensional Feature-Independent Bestfit in Coordinate Metrology. Ph.D.
dissertation, Swiss Federal Institute of Technology, Zürich, 1995. (Cited on p. 416.)
[1014] Helmuth Späth. Mathematical Algorithms for Linear Regression. Academic Press, Boston, 1992.
(Cited on p. 423.)
[1015] G. W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973. (Cited on
p. 22.)
[1016] G. W. Stewart. The economical storage of plane rotations. Numer. Math., 25:137–138, 1976. (Cited
on p. 49.)
[1017] G. W. Stewart. On the perturbation of pseudo-inverses, projections and linear least squares prob-
lems. SIAM Rev., 19:634–662, 1977. (Cited on pp. 19, 19, 31, 98.)
[1018] G. W. Stewart. Research, development, and LINPACK. In J. R. Rice, editor, Mathematical Software
III, pages 1–14. Academic Press, New York, 1977. (Cited on pp. 25, 98.)
[1019] G. W. Stewart. The efficient generation of random orthogonal matrices with an application to
condition estimators. SIAM J. Numer. Anal., 17:403–409, 1980. (Cited on p. 63.)
[1020] G. W. Stewart. Computing the CS decomposition of a partitioned orthogonal matrix. Numer. Math.,
40:297–306, 1982. (Cited on pp. 19, 128.)
[1021] G. W. Stewart. A method for computing the generalized singular value decomposition. In
B. Kågström and Axel Ruhe, editors, Matrix Pencils. Proceedings, Pite Havsbad, 1982, volume
973 of Lecture Notes in Mathematics, pages 207–220. Springer-Verlag, Berlin, 1983. (Cited on
p. 128.)
[1022] G. W. Stewart. On the asymptotic behavior of scaled singular value and QR decompositions. Math.
Comp., 43:483–489, 1984. (Cited on p. 132.)
[1023] G. W. Stewart. Rank degeneracy. SIAM J. Sci. Statist. Comput., 5:403–413, 1984. (Cited on pp. 77,
83.)
[1024] G. W. Stewart. Determining rank in the presence of errors. In Mark S. Moonen, Gene H. Golub,
and Bart L. M. De Moor, editors, Large Scale and Real-Time Applications, pages 275–292. Kluwer
Academic Publishers, Dordrecht, 1992. (Cited on p. 80.)
[1025] G. W. Stewart. An updating algorithm for subspace tracking. IEEE Trans. Signal Process.,
40:1535–1541, 1992. (Cited on pp. 78, 153.)
[1026] G. W. Stewart. On the early history of the singular value decomposition. SIAM Rev., 35:551–566,
1993. (Cited on pp. 13, 79.)
[1027] G. W. Stewart. Updating a rank-revealing ULV decomposition. SIAM J. Matrix Anal. Appl., 14:494–
499, 1993. (Cited on pp. 153, 154, 154.)
[1028] G. W. Stewart. Gauss, statistics, and Gaussian elimination. J. Comput. Graphical Statistics, 4:1–11,
1995. (Cited on p. 39.)
[1029] G. W. Stewart. On the stability of sequential updates and downdates. IEEE Trans. Signal Process.,
43:1643–1648, 1995. (Cited on pp. 8, 149.)
[1030] G. W. Stewart. Matrix Algorithms Volume I: Basic Decompositions. SIAM, Philadelphia, 1998.
(Cited on p. 39.)
[1031] G. W. Stewart. Block Gram–Schmidt orthogonalization. SIAM J. Sci. Comput., 31:761–775, 2008.
(Cited on p. 109.)
[1032] G. W. Stewart. On the numerical analysis of oblique projectors. SIAM J. Matrix Anal. Appl.,
32:309–348, 2011. (Cited on pp. 119, 119.)
[1033] G. W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, New York, 1990.
(Cited on pp. 21, 25, 31.)
[1034] Michael Stewart and Paul Van Dooren. Updating a generalized URV decomposition. SIAM J.
Matrix Anal. Appl., 22:479–500, 2000. (Cited on p. 155.)
[1035] Eduard Stiefel. Ausgleichung ohne Aufstellung der Gaußschen Normalgleichungen. Wiss. Z. Tech.
Hochsch. Dresden, 2:441–442, 1952/53. (Cited on p. 285.)
[1036] Eduard Stiefel. Über diskrete und lineare Tschebyscheff-Approximation. Numer. Math., 1:1–28,
1959. (Cited on p. 422.)
[1037] S. M. Stigler. An attack on Gauss, published by Legendre in 1820. Hist. Math., 4:31–35, 1977.
(Cited on p. 2.)
[1038] S. M. Stigler. Gauss and the invention of least squares. Ann. Statist., 9:465–474, 1981. (Cited on
pp. 2, 2.)
[1039] Joseph Stoer. On the numerical solution of constrained least-squares problems. SIAM J. Numer.
Anal., 8:382–411, 1971. (Cited on p. 165.)
[1040] Zdeněk Strakoš and Petr Tichý. On error estimation in the conjugate gradient method and why it
works in finite precision computations. ETNA, 13:56–80, 2002. (Cited on p. 299.)
[1041] O. N. Strand. Theory and methods related to the singular-function expansion and Landweber’s
iteration for integral equations of the first kind. SIAM J. Numer. Anal., 11:798–825, 1974. (Cited
on p. 326.)
[1042] Gilbert Strang. A proposal for Toeplitz matrix computations. Stud. Appl. Math., 74:171–176, 1986.
(Cited on p. 324.)
[1043] Gilbert Strang. A framework for equilibrium equations. SIAM Rev., 30:283–297, 1988. (Cited on
p. 116.)
[1044] Gilbert Strang. The discrete cosine transform. SIAM Rev., 41:135–147, 1999. (Cited on p. 237.)
[1045] Rolf Strebel, David Sourlier, and Walter Gander. A comparison of orthogonal least squares fitting
in coordinate metrology. In Sabine Van Huffel, editor, Proceedings of the Second International
Workshop on Total Least Squares and Errors-in-Variables Modeling, Leuven, Belgium, August 21–
24, 1996, pages 249–258. SIAM, Philadelphia, 1997. (Cited on p. 416.)
[1046] Chunguang Sun. Parallel sparse orthogonal factorization on distributed-memory multiprocessors.
SIAM J. Sci. Comput., 17:666–685, 1996. (Cited on p. 258.)
[1047] Ji-guang Sun. Perturbation theorems for generalized singular values. J. Comput. Math., 1:233–242,
1983. (Cited on p. 125.)
[1048] Ji-guang Sun. Perturbation bounds for the Cholesky and QR factorizations. BIT Numer. Math.,
31:341–352, 1991. (Cited on p. 54.)
[1049] Ji-guang Sun. Perturbation analysis of the Cholesky downdating and QR updating problems. SIAM
J. Matrix Anal. Appl., 16:760–775, 1995. (Cited on pp. 146, 148.)
[1050] Ji-guang Sun. Optimal backward perturbation bounds for the linear least-squares problem with
multiple right-hand sides. IMA J. Numer. Anal., 16:1–11, 1996. (Cited on p. 99.)
[1051] Ji-guang Sun and Zheng Sun. Optimal backward perturbation bounds for underdetermined systems.
SIAM J. Matrix Anal. Appl., 18:393–402, 1997. (Cited on p. 99.)
[1052] Brian D. Sutton. Computing the complete CS decomposition. Numer. Algor., 50:33–65, 2009.
(Cited on p. 19.)
[1053] D. R. Sweet. Fast Toeplitz orthogonalization. Numer. Math., 43:1–21, 1984. (Cited on p. 241.)
[1054] Katarzyna Świrydowicz, Julien Langou, Shreyas Ananthan, Ulrike Yang, and Stephen Thomas. Low synchronization Gram–Schmidt and generalized minimum residual algorithms. Numer. Linear Algebra Appl., 28:1–20, 2020. (Cited on p. 109.)
[1055] Daniel B. Szyld. The many proofs of an identity on the norm of an oblique projection. Numer.
Algorithms, 42:309–323, 2006. (Cited on p. 119.)
[1056] Kunio Tanabe. Projection method for solving a singular system of linear equations and its applica-
tions. Numer. Math., 17:203–214, 1971. (Cited on p. 270.)
[1057] Robert Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1:146–160, 1972.
(Cited on p. 264.)
[1058] R. P. Tewarson. A computational method for evaluating generalized inverses. Comput. J., 10:411–
413, 1968. (Cited on pp. 88, 88.)
[1059] Stephen J. Thomas and R. V. M. Zahar. Efficient orthogonalization in the M-norm. Congr. Numer., 80:23–32, 1991. (Cited on p. 122.)
[1060] Stephen J. Thomas and R. V. M. Zahar. An analysis of orthogonalization in elliptic norms. Congr.
Numer., 86:193–222, 1992. (Cited on p. 122.)
[1061] Robert Tibshirani. Regression shrinkage and selection via the LASSO. J. Royal Statist. Soc. B, 58:267–288, 1996. (Cited on p. 425.)
[1062] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. Winston, Washington D.C., 1977. (Cited on p. 175.)
[1063] Andrei N. Tikhonov. Solution of incorrectly formulated problems and the regularization method.
Soviet Math. Dokl., 4:1035–1038, 1963. (Cited on p. 175.)
[1064] W. F. Tinney and J. W. Walker. Direct solution of sparse network equations by optimally ordered
triangular factorization. Proc. IEEE, 55:1801–1809, 1967. (Cited on p. 251.)
[1065] M. Tismenetsky. A new preconditioning technique for solving large sparse linear systems. Linear
Algebra Appl., 154/156:331–353, 1991. (Cited on p. 310.)
[1066] Ph. L. Toint. On large scale nonlinear least squares calculations. SIAM J. Sci. Statist. Comput.,
8:416–435, 1987. (Cited on p. 400.)
[1067] Philippe L. Toint. VE10AD a Routine for Large-Scale Nonlinear Least Squares. Harwell Subroutine
Library, AERE Harwell, Oxfordshire, UK, 1987. (Cited on p. 400.)
[1068] Lloyd N. Trefethen and David Bau, III. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
(Cited on p. 60.)
[1069] Michael J. Tsatsomeros. Principal pivot transforms. Linear Algebra Appl., 307:151–165, 2000.
(Cited on p. 136.)
[1070] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279–311, 1966. (Cited on pp. 216, 217.)
[1071] Madeleine Udell and Alex Townsend. Why are big data matrices approximately low rank? SIAM J. Math. Data Sci., 1:144–160, 2019. (Cited on p. 24.)
[1072] M. H. van Benthem and M. R. Keenan. A fast non-negativity-constrained least squares algorithm.
J. Chemometrics, 18:441–450, 2004. (Cited on p. 168.)
[1073] A. van der Sluis. Stability of the solutions of linear least squares problems. Numer. Math., 23:241–
254, 1975. (Cited on p. 27.)
[1074] A. van der Sluis and G. Veltkamp. Restoring rank and consistency by orthogonal projection. Linear
Algebra Appl., 28:257–278, 1979. (Cited on p. 26.)
[1075] Henk A. van der Vorst. Iterative Krylov Methods for Large Linear Systems. Number 13 in Cam-
bridge Monographs on Applied and Computational Mathematics. Cambridge University Press,
Cambridge, UK, 2003. (Cited on pp. 269, 285.)
[1076] Sabine Van Huffel, Haesun Park, and J. Ben Rosen. Formulation and solution of structured total
least norm problems for parameter estimation. IEEE Trans. Signal Process., 44:2464–2474, 1996.
(Cited on p. 227.)
[1077] Sabine Van Huffel and Joos Vandewalle. The Total Least Squares Problem: Computational Aspects
and Analysis. SIAM, Philadelphia, 1991. (Cited on pp. 218, 220, 222, 223, 223.)
[1078] Sabine Van Huffel, Joos Vandewalle, and Ann Haegemans. An efficient and reliable algorithm for
computing the singular subspace of a matrix associated with its smallest singular values. J. Comput.
Appl. Math., 19:313–330, 1987. (Cited on p. 223.)
[1079] Charles F. Van Loan. Generalizing the singular value decomposition. SIAM J. Numer. Anal., 13:76–
83, 1976. (Cited on pp. 124, 125.)
[1080] Charles F. Van Loan. Computing the CS and the generalized singular value decomposition. Numer.
Math., 46:479–492, 1985. (Cited on pp. 19, 128.)
[1081] Charles Van Loan. On the method of weighting for equality-constrained least squares. SIAM J.
Numer. Anal., 22:851–864, 1985. (Cited on pp. 130, 159.)
[1082] Charles Van Loan. Computational Frameworks for the Fourier Transform, volume 10 of Frontiers
in Applied Math. SIAM, Philadelphia, 1992. (Cited on p. 237.)
[1083] Charles F. Van Loan. The ubiquitous Kronecker product. J. Comput. Appl. Math., 123:85–100,
2000. (Cited on p. 210.)
[1084] Field G. Van Zee, Robert A. van de Geijn, and Gregorio Quintana-Ortí. Restructuring the tridiagonal and bidiagonal QR algorithms for performance. ACM Trans. Math. Softw., 40:18:1–18:34, 2014. (Cited on p. 360.)
[1085] Robert J. Vanderbei. Symmetric quasidefinite matrices. SIAM J. Optim., 5:100–113, 1995. (Cited
on p. 329.)
[1086] J. M. Varah. On the numerical solution of ill-conditioned linear systems with application to ill-
posed problems. SIAM J. Numer. Anal., 10:257–267, 1973. (Cited on p. 172.)
[1087] J. M. Varah. A practical examination of some numerical methods for linear discrete ill-posed
problems. SIAM Rev., 21:100–111, 1979. (Cited on p. 181.)
[1088] J. M. Varah. Pitfalls in the numerical solution of linear ill-posed problems. SIAM J. Sci. Statist.
Comput., 4:164–176, 1983. (Cited on p. 178.)
[1089] James M. Varah. Least squares data fitting with implicit functions. BIT Numer. Math., 36:842–854,
1996. (Cited on pp. 412, 413.)
[1090] Richard S. Varga. Matrix Iterative Analysis. Prentice-Hall, Englewood Cliffs, 1962. (Cited on
pp. 269, 276.)
[1091] Stephen A. Vavasis. Stable numerical algorithms for equilibrium systems. SIAM J. Matrix Anal.
Appl., 15:1108–1131, 1994. (Cited on p. 133.)
[1092] Stephen A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM J. Optim.,
20:1364–1377, 2009. (Cited on p. 419.)
[1093] Vincenzo Esposito Vinzi, Wynne W. Chin, Jörg Henseler, and Huiwen Wang, editors. Handbook of
Partial Least Squares. Springer, New York, 2010. (Cited on p. 200.)
[1094] John von Neumann. Some matrix-inequalities and metrization of matrix-space. Tomsk Univ. Rev.,
1:286–300, 1937. (Cited on p. 21.)
[1095] A. J. Wathen. Preconditioning. Acta Numer., 24:323–376, 2015. (Cited on p. 287.)
[1096] Bertil Waldén, Rune Karlsson, and Ji-guang Sun. Optimal backward perturbation bounds for the
linear least squares problem. Numer. Linear Algebra Appl., 2:271–286, 1995. (Cited on p. 98.)
[1097] Homer F. Walker. Implementation of the GMRES method using Householder transformations.
SIAM J. Sci. Statist. Comput., 9:152–163, 1988. (Cited on pp. 109, 302.)
[1098] R. H. Wampler. A report on the accuracy of some widely used least squares computer programs. J.
Amer. Statist. Assoc., 65:549–565, 1970. (Cited on p. 103.)
[1099] R. H. Wampler. Solutions to weighted least squares problems by modified Gram–Schmidt with
iterative refinement. ACM Trans. Math. Softw., 5:457–465, 1979. (Cited on p. 103.)
[1100] J. Wang, Q. Zhang, and Lennart Ljung. Revisiting Hammerstein system identification through the
two-stage algorithm for bilinear parameter estimation. Automatica, 45:2627–2633, 2009. (Cited
on p. 405.)
[1101] Xiaoge Wang. Incomplete Factorization Preconditioning for Least Squares Problems. Ph.D. thesis,
Department of Mathematics, University of Illinois at Urbana-Champaign, Urbana, IL, 1993. (Cited
on p. 312.)
[1102] Xiaoge Wang, Kyle A. Gallivan, and Randall Bramley. CIMGS: An incomplete orthogonal factor-
ization preconditioner. SIAM J. Sci. Comput., 18:516–536, 1997. (Cited on p. 312.)
[1103] David S. Watkins. Understanding the QR algorithm. SIAM Rev., 24:427–440, 1982. (Cited on
p. 368.)
[1104] David S. Watkins. Francis’s algorithm. Amer. Math. Monthly, 118:387–403, 2011. (Cited on
p. 348.)
[1105] Joseph Henry Maclagan Wedderburn. Lectures on Matrices, Dover Publications, Inc., New York,
1964. Unabridged and unaltered republication of the work first published by the American Math-
ematical Society, New York, 1934 as volume XVII in their Colloquium Publications. (Cited on
p. 139.)
[1106] Per-Åke Wedin. On Pseudo-Inverses of Perturbed Matrices. Tech. Report, Department of Computer
Science, Lund University, Sweden, 1969. (Cited on p. 26.)
[1107] Per-Åke Wedin. Perturbation bounds in connection with the singular value decomposition. BIT
Numer. Math., 12:99–111, 1972. (Cited on pp. 22, 393.)
[1108] Per-Åke Wedin. Perturbation theory for pseudo-inverses. BIT Numer. Math., 13:217–232, 1973.
(Cited on pp. 26, 27, 28.)
[1109] Per-Åke Wedin. On the Gauss-Newton Method for the Nonlinear Least Squares Problems. Working
Paper 24, Institute for Applied Mathematics, Stockholm, Sweden, 1974. (Cited on p. 394.)
[1110] Per-Åke Wedin. Perturbation Theory and Condition Numbers for Generalized and Constrained
Linear Least Squares Problems. Tech. Report UMINF–125.85, Institute of Information Processing,
University of Umeå, Sweden, 1985. (Cited on pp. 119, 119, 128, 160.)
[1111] Musheng Wei. Algebraic relations between the total least squares and least squares problems with
more than one solution. Numer. Math., 62:123–148, 1992. (Cited on p. 226.)
[1112] Musheng Wei. The analysis for the total least squares problem with more than one solution. SIAM
J. Matrix Anal. Appl., 13:746–763, 1992. (Cited on p. 222.)
[1113] Musheng Wei. Perturbation theory for the rank-deficient equality constrained least squares problem.
SIAM J. Numer. Anal., 29:1462–1481, 1992. (Cited on p. 160.)
[1114] Yimin Wei and Weiyang Ding. Theory and Computation of Tensors: Multi-Dimensional Arrays.
Academic Press, New York, 2016. (Cited on p. 218.)
[1115] Yimin Wei, Pengpeng Xie, and Liping Zhang. Tikhonov regularization and randomized GSVD.
SIAM J. Matrix Anal. Appl., 37:649–675, 2016. (Cited on p. 334.)
[1116] P. R. Weil and P. C. Kettler. Rearranging matrices to block-angular form for decomposition (and
other) algorithms. Management Sci., 18:98–108, 1971. (Cited on p. 209.)
[1117] Helmut Wielandt. Das Iterationsverfahren bei nicht selbstadjungierten linearen Eigenwertaufgaben.
Math. Z., 50:93–143, 1944. (Cited on p. 365.)
[1118] James H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice-Hall, Englewood Cliffs,
NJ, 1963. (Cited on p. 101.)
[1119] James H. Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, UK, 1965.
(Cited on p. 47.)
[1120] James H. Wilkinson. Error analysis of transformations based on the use of matrices of the form
I − 2xx^H. In L. B. Rall, editor, Error in Digital Computation, pages 77–101. John Wiley, New
York, 1965. (Cited on pp. 34, 51, 350.)
[1121] James H. Wilkinson. A priori error analysis of algebraic processes. In Proc. Internat. Congr. Math.
(Moscow, 1966), Izdat. “Mir”, Moscow, 1968, pp. 629–640. (Cited on pp. 43, 345.)
[1122] James H. Wilkinson. Modern error analysis. SIAM Rev., 13:548–568, 1971. (Cited on p. 62.)
[1123] James H. Wilkinson and C. Reinsch, eds. Handbook for Automatic Computation. Volume II: Linear
Algebra. Springer, Berlin, 1971. (Cited on pp. 56, 113, 184, 478.)
[1124] Paul R. Willems and Bruno Lang. The MR^3-GK algorithm for the bidiagonal SVD. ETNA, 39:1–21, 2012. (Cited on p. 352.)
[1125] Paul R. Willems, Bruno Lang, and Christof Vömel. Computing the bidiagonal SVD using multiple
relatively robust representations. SIAM J. Matrix Anal. Appl., 28:907–926, 2006. (Cited on p. 352.)
[1126] T. J. Willmore. An Introduction to Differential Geometry. Clarendon Press, Oxford, UK, 1959.
(Cited on p. 394.)
[1127] Herman Wold. Estimation of principal components and related models by iterative least squares. In
P. R. Krishnaiah, editor, Multivariate Analysis, pages 391–420. Academic Press, New York, 1966.
(Cited on p. 200.)
[1128] S. Wold, A. Ruhe, H. Wold, and W. J. Dunn. The collinearity problem in linear regression. The
partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Statist. Comput., 5:735–
743, 1984. (Cited on p. 200.)
[1129] Svante Wold, Michael Sjöström, and Lennart Eriksson. PLS-regression: A basic tool of chemomet-
rics. Chemom. Intell. Lab. Syst., 58:109–130, 2001. (Cited on p. 203.)
[1130] Y. K. Wong. An application of orthogonalization process to the theory of least squares. Ann. Math. Statist., 6:53–75, 1935. (Cited on p. 64.)
[1131] Max A. Woodbury. Inverting Modified Matrices. Memorandum Report 42, Statistical Research
Group, Princeton, 1950. (Cited on p. 138.)
[1132] Stephen J. Wright. Stability of linear equation solvers in interior-point methods. SIAM J. Matrix
Anal. Appl., 16:1287–1307, 1995. (Cited on p. 129.)
[1133] Stephen J. Wright. Primal-Dual Interior-Point Methods. SIAM, Philadelphia, 1997. (Cited on
p. 419.)
[1134] Pengpeng Xie, Hua Xiang, and Yimin Wei. A contribution to perturbation analysis for total least
squares problems. Numer. Algorithms, 75:381–395, 2017. (Cited on p. 226.)
[1135] Andrew E. Yagle. Non-iterative Reweighted-Norm Least-Squares Local ℓ0 Minimization for Sparse
Solution to Underdetermined Linear Systems of Equations. Tech. Report Preprint, Department of
EECS, The University of Michigan, Ann Arbor, 2008. (Cited on p. 428.)
[1136] Yusaku Yamamoto, Yuji Nakatsukasa, Yuka Yanagisawa, and Takeshi Fukaya. Roundoff error
analysis of the Cholesky QR2 algorithm. ETNA, 44:306–326, 2015. (Cited on p. 213.)
[1137] Ichitaro Yamazaki, Stanimire Tomov, and Jack Dongarra. Mixed-precision Cholesky QR factoriza-
tion and its case studies on multicore CPU with multiple GPUs. SIAM J. Sci. Comput., 37:C307–
C330, 2015. (Cited on p. 213.)
[1138] L. Minah Yang, Alyson Fox, and Geoffrey Sanders. Rounding error analysis of mixed precision
block Householder QR algorithm. ETNA, 44:306–326, 2020. (Cited on p. 109.)
[1139] Sencer Nuri Yeralan, Timothy A. Davis, Wissam M. Sid-Lakhdar, and Sanjay Ranka. Algorithm
980: Sparse QR factorization on the GPU. ACM Trans. Math. Softw., 44:Article 17, 2017. (Cited
on p. 112.)
[1140] K. Yoo and Haesun Park. Accurate downdating of a modified Gram–Schmidt QR decomposition.
BIT Numer. Math., 36:166–181, 1996. (Cited on pp. 151, 152.)
[1141] David M. Young. Iterative Solution of Large Linear Systems. Dover, Mineola, NY, 2003.
Unabridged republication of the work first published by Academic Press, New York-London, 1971.
(Cited on pp. 270, 275, 277, 278.)
[1142] Jin Yun Yuan. Iterative Methods for the Generalized Least Squares Problem. Ph.D. thesis, Instituto
de Matemática Pura e Aplicada, Rio de Janeiro, Brazil, 1993. (Cited on p. 318.)
[1143] M. Zelen. Linear estimation and related topics. In John Todd, editor, Survey of Numerical Analysis,
pages 558–584. McGraw-Hill, New York, 1962. (Cited on p. 4.)
[1144] Hongyuan Zha. A two-way chasing scheme for reducing a symmetric arrowhead matrix to tridiag-
onal form. J. Num. Linear Algebra Appl., 1:49–57, 1992. (Cited on p. 362.)
[1145] Hongyuan Zha. Computing the generalized singular values/vectors of large sparse or structured
matrix pairs. Numer. Math., 72:391–417, 1996. (Cited on p. 333.)
[1146] Hongyuan Zha and Horst D. Simon. On updating problems in latent semantic indexing. SIAM J.
Sci. Comput., 21:782–791, 1999. (Cited on p. 360.)
[1147] Shaoshuai Zhang and Panruo Wu. High Accuracy Low Precision QR Factorization and Least
Squares Solver on GPU with TensorCore. Preprint. https://arxiv.org/abs/1912.05508,
2019. (Cited on p. 311.)
[1148] Zhenyue Zhang, Hongyuan Zha, and Wenlong Ying. Fast parallelizable methods for computing
invariant subspaces of Hermitian matrices. J. Comput. Math., 25:583–594, 2007. (Cited on p. 382.)
[1149] Liangmin Zhou, Lijing Lin, Yimin Wei, and Sanzheng Qiao. Perturbation analysis and condition number of scaled total least squares problems. Numer. Algorithms, 51:381–399, 2009. (Cited on p. 226.)
[1150] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23:550–560, 1997. (Cited on p. 420.)
[1151] Zahari Zlatev. Comparison of two pivotal strategies in sparse plane rotations. Comput. Math. Appl.,
8:119–135, 1982. (Cited on p. 255.)
[1152] Zahari Zlatev and H. Nielsen. LLSS01—A Fortran Subroutine for Solving Least Squares Prob-
lems (User’s Guide). Tech. Report 79-07, Institute of Numerical Analysis, Technical University of
Denmark, Lyngby, Denmark, 1979. (Cited on p. 258.)
[1153] Zahari Zlatev and Hans Bruun Nielsen. Solving large and sparse linear least-squares problems by
conjugate gradient algorithms. Comput. Math. Appl., 15:185–202, 1988. (Cited on p. 313.)
[1154] E. I. Zolotarev. Application of elliptic functions to questions of functions deviating least and most
from zero. Zap. Imp. Akad. Nauk St. Petersburg, 30, 1877. Reprinted in his Collected Works, Vol.
II, Izdat. Akad. Nauk SSSR, Moscow, 1932, pp. 1–59 (in Russian). (Cited on p. 382.)
Index
active-set methods, 162–168 left-looking, 107 optimal scaling, 93
acute perturbation, 25 LSME, 292 solution by LDLT , 91–94
adjacency graph LSQI, 170 solution by QR, 58–59, 117
of matrix, 247 LSQR, 291
adjacency set, 247 MGS, 62 backward error, 34
algebraic reconstruction NIPALS-PLS, 200, 201 analysis, 36
technique, 274 plane rotation, 48 componentwise, 99
algorithm preconditioned CGLS, 286 normwise, 97
band Cholesky, 187 preconditioned CGME, 287 optimal, 97
Bidiag2-PLS, 202 Rayleigh-quotient iteration, band matrix
CGLS, 282, 327 366 augmented, 204
CGME, 283 Rayleigh–Ritz procedure, Cholesky factorization,
CGS2, 70 369 186–187
Cholesky, 42 RCGME, 328 properties, 183–186
conjugate gradient method, recursive CGS QR, 111 QR factorization, 188–191
280 recursive Cholesky, 110 standard form, 189
elliptic MGS QR, 120 recursive Householder QR, bandwidth of LU factors, 185
extended CGLS, 330 112 bandwidth reduction, 193
FGMRES with variable right right-looking, 107 basis pursuit, 426
preconditioning, 303 secular equation, 170 Bauer–Skeel condition number,
GCGLS, 282 sequential Givens QR, 190 30
GMRES with right singular values by bisection, BBH algorithm, 239
preconditioning, 303 351 best linear unbiased estimate, 4,
Gram–Schmidt SVD of 2 × 2, 356 127, 218
classical, 60 zero-shift QRSVD, 343 biconjugation algorithm,
modified, 61, 62 ALS, see alternating least 314–315
Hager’s condition estimation, squares bidiagonal decomposition,
96 alternating least squares, 402, 191–196
Householder QR, 53, 54 406 Householder algorithm,
Householder reflector, 47 angles 194–196
IMGS factorization, 312 between subspaces, 16–17 one-sided algorithm, 194
IRLS for compressed arithmetic bidiagonal matrix, 183
sensing, 428 floating-point, 32–33 graded, 348
IRLS for overdetermined standard model, 33 unreduced, 194
systems, 424 Arnoldi decomposition, 301 bilinear least squares, 405–406
iterative refinement, 100, Arnoldi process, 301 bipartite graph, 253
102, 105 Arnoldi–Tikhonov method, 335 bisection method, 349–352
least-norm by MGS, 68 ART, see algebraic BLAS, 112–114
least squares by Householder reconstruction technique sparse, 247
QR, 57 augmented system, 8, 268 Blendenpik, 320
least squares by MGS, 67 generalized, 116–117 block algorithms, 106–112


block-angular form, 309 in QR factorization, 73–76 partial, 390
block-angular problem, reverse, 83 DHT, see discrete Hartley
206–209 column scaling, optimal, 44 transform
covariance matrix, 208 column subset selection, 80 differential qd algorithm,
doubly bordered, 207 compact SVD, 11 351–352
QR algorithm, 208–209 complementary subspace, 118 discrepancy principle, 177
BLS, see bounded least squares complex arithmetic, 33 discrete cosine transform, 320
BLUE, see best linear unbiased compressed sensing, 426–429 discrete Hartley transform, 320
estimate condition estimation distance between subspaces,
bounded least squares, 161 LINPACK, 94–95 17, 368
BP, see basis pursuit condition number, 29 distance to singular matrices,
Bunch–Kaufman pivoting, 92 Bauer–Skeel, 30 25
effective, 172 distribution function, 3
CANDECOMP, 217 estimation of, 94–97 divide and conquer for SVD,
canonical correlation, 19 of matrix, 24–25 358–382
Cauchy matrix, 239 conjugate gradient method, downdating, 144
Cauchy–Schwarz inequality, 20 278–280 by hyperbolic rotation, 137
generalized, 119 contraction product, 214 by seminormal equations,
centering data, 205 core problem, 194–196 147–148
CG method, see conjugate corrected seminormal Cholesky factorization,
gradient method equations, 104–105 145–148
CGLS, 281–285, 317, 327 Courant–Fischer theorem, 12 Saunders algorithm, 145–147
semidefinite case, 282 covariance matrix, 3, 115, 122 dqds algorithm, see differential
termination, 284–285 band matrix, 187–188 qd algorithm
CGNR, see CGLS block-angular problem, 208 dual norm, 20
CGS, see classical computing, 45
Gram–Schmidt generalized least squares, 115 Eckart–Young–Mirsky
Chambers’ downdating method, 149 theorem, 23
algorithm, 135 sparse matrix, 187–188 eigenvalue problem
characteristic equation, 359 cross-validation, 178–180, 332 projected, 368
chasing, 345 CS decomposition, 18–19 symmetric 2 by 2, 353
Chebyshev CSNE, see corrected EISPACK, 113
abscissae, 231 seminormal equations elimination tree, 249–250
interpolation, 232 curvature radius, 394 postordering, 250
polynomials, 231–233 curve fitting, 411–416 topological ordering, 250
Cholesky factorization, 8–10, Cuthill–McKee algorithm, 251 transitive reduction, 250
41–43 Cuthill–McKee ordering, 251 elliptic
band, 186–187 cyclic pivoting, 83 Gram–Schmidt QR, 119–122
downdating, 145–148 norm, 119, 329
extended matrix, 9 damped least squares, 168 singular values, 331
graph model, 249–250 data least squares problem, 219 EM algorithm, 417
incomplete, 309–312 DCT, see discrete cosine empty matrix, 18, 125
QR, 10 transform error analysis
QR2 algorithm, 213 decomposition backward, 36
semidefinite matrix, 71–73 complete orthogonal, 77–79 forward, 36
Cimmino’s method, 273–274 CS, 18–19 inner product, 34
circulant matrix, 322–324 GSVD, 124 error estimation
classical Gram–Schmidt, 63 SVD, 11–13, 339–382 a posteriori, 97–100
Clenshaw’s formula, 229 URV, 78 Cholesky, 44
clique, 248 deconvolution problem, 173 forward, 27–31
column ordering derivative optimal backward, 97–100
minimum degree, 251–252 directional, 390 errors-in-variable model, 218
nested dissection, 206–209 of inverse, 169 Euler angles, 51
column pivoting of orthogonal projector, 403 exchange operator, 136

expected value, 3 augmented system, 116–117 Harwell–Boeing collection, 265


exponential fitting, 403 inverse, 15 Harwell Software Library, 87,
extended matrix, 57, 66, 121 least squares, 115–133 400
normal equations, 115, 283 Hessenberg matrix, 55
fast cosine transform, 322 of second kind, 116 Hessian matrix, 390, 391
fast Fourier transform, 235–237 QR factorization, 122–124 Hestenes method, 352–356
FFT, see fast Fourier transform SVD, 124–128 hierarchical memory, 114
fill, in sparse matrix, 243 geodetic measurements, 3 Hölder inequality, 20
filter factor, 176, 325 Givens QR factorization, 55–56 Hölder norm, 20
fitting exponentials, 403 Givens rotation, 47–51 HOSVD, 217
flexible GMRES, 303 GKL bidiagonalization, Householder reflector, 46–47
floating-point 196–199 elliptic, 121–122
IEEE 754 standard, 32 GMRES, 301–304 unitary, 47
number, 32 flexible, 303 Huber’s M-estimator, 422
precision, 32 range-restricted, 334 hyperbolic rotation, 134
rounding error, 33 GQR, see generalized QR hypermatrix, 214–218
flop count factorization unfolding, 215
band Cholesky, 187 grade of vector, 198
bidiagonal decomposition, gradient projection methods, IC, see incomplete Cholesky
192 416–417 ill-posed problem, 171–182
Gram–Schmidt, 61 Gragg–Harrod procedure, 230 ILS, see indefinite least squares
QR factorization, 53 Gram polynomials, 230–231 IMGS, see incomplete MGS
tridiagonal system, 186 Gram–Schmidt implicit Q theorem, 344
forward error, 34 orthogonalization, 198 implicit shift QR algorithm,
analysis, 36
Gram–Schmidt QR, 59–63 344–345
of inner product, 35
classical, 60 incomplete
Fourier analysis
elliptic, 119–122 factorization, 309–314
butterfly relations, 236
modified, 61 QR factorization, 312–314
coefficients, 228
modifying, 150–152 incomplete Cholesky
discrete, 233–237
reorthogonalization, 68–71 higher-level, 310
matrix, 235
graph preconditioner, 309–312
Fourier synthesis, 235
bipartite, 253, 264 threshold, 310
Fréchet derivative, 389
matching, 264 zero-fill, 309
Fredholm integral equation, 171
clique in, 248 incomplete MGS, 312
Frobenius norm, 21
connected, 248 indefinite least squares,
fundamental
directed, 247 133–137
matrix, 128
edges, 247 inner inverse, 15
subspace, 6, 15, 28
elimination, 249–252 interior method, 419
Galerkin condition, 288, 368 filled, 249, 250 primal-dual, 418
Galerkin method, 172 nodes, 247 interior-point method, 427
gap in spectrum, 369 of matrix, 247–250 interval arithmetic, 33–34
Gauss–Markov planar, 256 INTLAB, 34
linear model, 4–5 separator, 248 invariant subspace, dominant,
theorem, 4 strongly connected, 248 368
Gauss–Newton direction, 392 subgraph of, 247 inverse
Gauss–Newton method, 224, undirected, 247 function, 390
391–395 gravitational field model, 3 generalized, 15
inexact, 400–402 GSVD, 159 inner, 15
line-search, 394–395 iteration, 224, 365–367
local convergence, 393–394 Hadamard transform, 319 simultaneous, 83
regularized, 396 Hall property, 253 least-norm, 16
trust-region, 396–398 Hankel least squares problem, least squares, 16
Gauss–Seidel’s method, 272 241 Moore–Penrose, 13
generalized harmonic Ritz value, 370 of band matrix, 185

outer, 15 J-orthogonal matrices, 134–136 least squares problem


power method, 365 band, 183–191
problem, 171–182 Kaczmarz’s method, 273 bilinear, 405–406
Schulz iteration, 379 Kalman gain vector, 149 constrained, 155–168
inverse iteration, 81 Khatri–Rao product, 216 generalized, 5, 115–133
IRLS, see iteratively KKT conditions, 161 indefinite, 133–137
reweighted least squares Kogbetliantz’s method, Kronecker, 209–211
irreducible matrix, 185, 264 356–357 nearly square, 86–88
iterative method Kronecker product, 209 nonnegative, 416–420
CGLS, 281–285 least squares problem, separable, 402–406
convergence 209–211 sequential, 155
asymptotic rate, 270 pseudoinverse, 210 statistical aspects, 230–231
average rate, 270 QR factorization, 211 stiff, 129
conditions for, 269 SVD, 211 strongly rectangular,
error-reducing, 273 Krylov subspace, 198 212–213
Gauss–Seidel, 272 augmented, 335–337 Toeplitz, 239–241
Jacobi, 272 deflated, 335–337 weighted, 128–133
LSME, 291–292 methods, 196–200, 278–305 least squares solution
LSMR, 292–294 in finite precision, 294–305 characterization of, 5–8
LSQR, 289–291 updating of, 263
ℓ1 and ℓ∞ approximation,
preconditioned, 285–287 left-looking algorithm, 60
420–423
residual-reducing, 273 Levenberg–Marquardt method,
Lagrangian function, 161
SOR, 274–275 396
LANCELOT, 400
block, 307 Lanczos linear complementarity
splitting, 270 CG method, 288–289 problem, 162
SSOR, 274–275 decomposition, 288 linear equality constraints
stationary, 269–274 process, 287–288, 370 by weighting, 159–160
symmetrizable, 270 Landweber’s method, 325–326 linear inequality constraints
Toeplitz system, 322–325 LAPACK, 114 classification, 160–162
two-block, 308–309 LARS, 425–426 linear model
iterative refinement, 100–106, LASSO, 425–426 errors-in-variables, 218–219
114, 131, 159 latent root regression, 218 Gauss–Markov, 4
in three precisions, 104 Läuchli problem, 40, 130 LINPACK, 113
for linear systems, 100–101 least-angle regression, 425–426 Lipschitz condition, 389
for sparse problem, 104–106 least-norm inverse, 16 low-rank approximation, 23–24
in fixed precision, 103–105 least-norm problem, 7 Löwdin orthogonalization, 383
in mixed precision, 100–103 least squares ℓp approximation, 420–423
iterative regularization, ℓ1 -regularized, 425 LSQR stopping criteria, 296
325–335 damped, 168 LSRN, 320–321
iteratively reweighted least derivative of solution, 26 LU factorization
squares, 423–425 history of, 1–2 partial, 88
inverse, 16 rank-revealing, 88–91
Jacobi’s method, 272 iteratively reweighted, LU preconditioner
Jacobi’s method for SVD, 423–425 for CGLS, 316
352–357 recursive, 148–149 rate of convergence, 318
classical, 354 regularized, 294, 327–331 Lyapunov’s equation, 210
cyclic, 354 least squares fitting
sweep, 354 algebraic, 411–414 matrix
threshold, 354 discrete, 229–231 band, 183–191
Jacobi–Davidson’s method, geometric, 414–416 bidiagonal, 183
374–375 of circle, 411–416 consistently ordered, 275
Jacobian matrix, 389, 391 of ellipse, 411–416 elementary, 46
Jordan–Wielandt matrix, 12, of geometric elements, envelope, 251
342 411–416 idempotent, 6

ill-conditioned, 24 update matrix, 256 complex, 5


irreducible, 185 multilinear function, 213 forming of, 39
Jordan–Wielandt, 12 multilinear operator, 391 generalized, 5
nonnegative, 29 multiple relatively robust information loss, 40
norm, 19–22 representation algorithm, method, 39–45
positive definite, 8 352 scaling of, 43–44
random orthogonal, 63 multiprecision algorithm, 104 normalized residuals, 45
reducible, 185 multirank of tensor, 215 nullspace
sign function, 380–382 method, 157–158
sparse, 243 nested dissection, 206 numerical, 173
spectral norm, 21 netlib, 375 from RRQR, 83
square root of, 378–380 Newton’s interpolation formula, from SVD, 173
stretching, 263 378 from ULV, 79
tall-and-skinny, 212 NIPALS-PLS algorithm, from URV, 78
tridiagonal, 183 200–203 of matrix, 6
two-cyclic, 13 NNLS, see nonnegative least numerical
well-conditioned, 24 squares cancellation, 253
matrix function, 376–382 NNMF, see nonnegative matrix ill-determined rank, 172
primary, 377 factorization nullspace, 27
matrix test collection no-cancellation assumption, rank, 26–27
Harwell–Boeing, 244 245 numerical rank, 22
Matrix Market, 244 node(s)
oblique projector, 118–119
SuiteSparse, 244 adjacent, 247
ODR, see orthogonal distance
mean, 421 amalgamation of, 258
regression
median, 421 connected, 248
ODRPACK, 410
medical imaging, 417 degree, 247
Oettli–Prager bound, 99–100
merit function, 400 eccentricity of, 251
optimal backward error,
MGS, see modified indistinguishable, 252
97–100, 296
Gram–Schmidt supernode, 252, 258
orthogonal
midrange, 421 nonlinear least squares
basis problem, 68, 109
MINARES, 300 Ceres solver, 258, 402
coefficients, 228
minimal residual algorithm, constrained, 400
complement, 12
299–300 nonnegative
distance regression, 408–411
minimum degree ordering, 252 least squares, 161–168, iteration, 367
MINRES, see minimal residual 416–420 projection, 15
algorithm gradient projection derivative of, 26
MINRES-QLP, 300 methods, 416–417 regression, 406–408
mixed sparse-dense problems, interior methods, 418–420 systems, 227–231
262–263 matrix, 29 orthogonal polynomials
modified Gram–Schmidt matrix factorization, 419–420 Chebyshev, 231–233
as a Householder method, norm general theory, 227–229
64–66 dual, 20 trigonometric, 233–235
backward stability, 65 elliptic, 119 orthogonal transformation
least-norm solution, 67 Frobenius, 21, 24 elementary, 46–51
least squares solution, 121 Hölder, 20 Givens, 47–51
modifying nuclear, 429 Householder, 46–47
Gram–Schmidt QR, 150–152 of matrix, 19–22 orthogonal tridiagonalization,
Moore–Penrose inverse, 13 of vector, 19–22 304–305
MRRR algorithm, see multiple Schatten, 22 orthogonality, loss of, 68–71
relatively robust spectral, 21 outer inverse, 15
representation algorithm total-variation, 426
multicore processor, 114 unitarily invariant, 21 Padé approximation, 381–382
multifrontal method, 255–258 normal curvature matrix, 393 Paige’s method, 122–123
for QR factorization normal equations PARAFAC, 217

partial derivative, 389 least squares, 306–325 partial, 204


partial least squares, 191–203 LU, 316–319 pivoted, 140
partial SVD reduced system, 308 rank-one change, 141–142
of sparse matrix, 376 SSOR, 306–309 rank-revealing, 79–84
partitioned algorithms, 106–112 two-level, 321–322 Chan’s algorithm, 82
PCR, see principal components predicting fill, 247–250 recursive, 111
regression principal row ordering, 254–255
Penrose conditions, 14, 182 angle, 17, 308 row pivoting, 131
permutation components regression, 174 row sequential, 254–255
bit-reversal, 236 vector, 17 row sorting, 131
matrix, 73 Procrustes problem, 383, Vandermonde matrix, 238
perturbation analysis 385–387 weighted, 130–132
bounds for QR factorization, product SVD, 126 QRSVD algorithm, 345–348
54 projector, 6–7 quadratic inequality constraints,
componentwise, 29–31 oblique, 118–119 168–169
least squares solutions, 27–31 orthogonal, 6–7 quasi-definite system, 319
pseudoinverse, 25–26 pseudoinverse, 13–16 quasi-Newton method, 398–400
PET, see positron emission derivative of, 26 quotient difference algorithm,
tomography from QR, 52–74 351
Peters–Wilkinson method, from SVD, 14 quotient SVD, 126
84–86, 129, 317 of Kronecker product, 210
pivoted magnitude, 83 solution, 8
pivoting from LU, 84–86 radius of convergence, 376
cyclic, 83 Wedin’s identity, 26 random
reverse, 76 pseudoskeleton approximation, errors, 4
rook, 85 90 normal projection, 319
plane reflector, 49 PSVD algorithm, 223 orthogonal matrix, 63
plane rotation, 47–51 sampling, 319
algorithm, 48 qd algorithm, see quotient randomized algorithms,
fast, 50–51 difference algorithm 319–321
self-scaling, 51, 132 QL algorithm, 347–348 range space of matrix, 6
storage of, 49 QR algorithm rank
unitary, 49 convergence criteria, 347 numerical, 26–27
PLS, see partial least squares, implicit shift, 344–345 structural, 245
202 operation count, 347 rank of tensor, 216
polar decomposition, 382–386 perfect shifts, 347 rank-revealing QR, 79–84
polynomial real symmetric matrices, modifying, 152–155
approximation, 227–233 344–345 sparse, 259–261
triangle family, 227 zero-shift, 342–344 Rayleigh quotient
positive definite matrix, 6 QR factorization, 52–84 iteration, 224, 365–367
positron emission tomography, appending a column, matrix, 369
417 143–145 Rayleigh-quotient, 224, 365
power method, 339, 363–367 appending a row, 141 Rayleigh–Ritz procedure,
preconditioner backward stability, 54, 131 368–376
approximate inverse, column pivoting, 73–76, 131 implicit restarts, 372
314–315 deleting a column, 142 recursive least squares,
block column, 306–308 deleting a row, 144 148–149
block SSOR, 307 flop count, 53 reduced gradient method, 158
by submatrix, 316–321 generalized, 122–124 reduced-order model, 23
cyclic Jacobi, 308 Givens, 55–56 reducible matrix, 185, 264
diagonal scaling, 286, 306 Hessenberg, 55 reduction
for Toeplitz systems, incomplete, 312–314 to bidiagonal form, 341–342
324–325 modifying, 139–145 to symmetric tridiagonal
IMGS, 312 multifrontal, 255–258 form, 340–341

regression Saunders algorithm, 145–147 SPAI, see sparse approximate


least-angle, 425–426 ScaLAPACK, 114 inverse
linear, 40 scaled total least squares, 218 sparse approximate inverse,
principal components, 174 scaling of columns 314–315
robust, 422 optimal, 44 sparse least squares problem
stepwise, 140 Schatten norms, 22 updating, 262–263
regularization, 173–182 Schulz iteration, 379 sparse matrix
filter factor, 325 Schur–Banachiewicz formula, block triangular form,
hybrid method, 332–333 137 263–265
iterative, 325–335 Schur complement, 89, 136, column ordering, 250–252
Krylov subspace methods, 137 irreducible, 264
331–335 Schur decomposition, 378, 379 reducible, 264
Landweber, 325 secular equation, 168–171, 360 row ordering for QR,
parameter, 177 selective reorthogonalization, 254–255
PLS, 202 69 structural cancellation, 253
semiconvergence, 326 semi-iterative method, 275–278 structural rank, 245
Tikhonov, 175 Chebyshev, 277 sparse matrix-vector product,
trust-region method, 182 seminormal equations, 104–106 245–247
regularized for downdating, 147–148 spectral
CGLS, 327–329 separable least squares norm, 21
CGME, 328–329 problem, 402–406 projector, 380
least squares, 294, 327–331 VARPRO algorithm, 404 radius, 269
relaxation parameter, 274 Sherman–Morrison formula, transformation, 365
optimal, 308 138 splitting of matrix, 270
optimal for SOR, 275 shift matrix, 323
proper, 270
reorthogonalization, 68–71, 297 signature matrix, 133
SQD system, see symmetric
selective, 69, 371 singular value decomposition,
quasi-definite system
residual polynomial, 284 11–13
square root of matrix, 378–380
reverse pivoting, 76, 83 compact, 11
SSOR, 274–275
Richardson’s method, 271–272 computation, 339–352
SSOR preconditioner, 307, 308
second order, 278 generalized, 124–128
stability of algorithm, 36–37
ridge regression, 175 Kronecker product, 211
stepwise regression, 140
right-looking algorithm, 61 modifying, 360–363
rigid body movements, 386 Stieltjes procedure, 229, 230
of tensor, 217
Riley’s method, 177 stopping criteria
of two by two, 356
Ritz value, 369 related eigenvalue problems, for LSQR, 296
harmonic, 370 12 storage scheme
Ritz vector, 369 truncated solution, 173–175 compressed
refined, 370 singular values, 12 form, 245
RLS, see recursive least squares absolute gap, 22 row, 246, 247
rook pivoting, 85 by bisection, 349–352 coordinate scheme, 245
rotation, three-dimensional, 49 elliptic, 331 dynamic, 245
rounding error interlacing property, 22 general sparse, 245–246
analysis, 34–36 min-max property, 12, 22 static, 245
row-action methods, 274 of bidiagonal matrix, 343 Sturm sequence, 350
RQI, see Rayleigh-quotient relative gap, 343 submatrix preconditioner,
iteration sensitivity, 22–23 316–321
RRQR, see rank-revealing QR singular vectors, 12 rate of convergence, 316
RRQR factorization of bidiagonal matrix, 343 subspace
Chan’s algorithm, 82 uniqueness, 12 complementary, 118
strong, 84 smoothing-norm operator, 175 iteration, 367–368
Rutishauser’s qd algorithm, 351 SNE, see seminormal equations SuiteSparseQR, 258
SOR, 274–275, 308 sum convention, 213
saddle point system, 116 block, 307 superlinear convergence, 295

SVD, see singular value tomography, 161 by Gram–Schmidt, 67


decomposition positron emission, 417 by Householder, 58
Sylvester’s seismic, 363 by normal equations, 8
criterion, 10 total least squares, 218–226 unit roundoff, 32
equation, 379 by inverse iteration, 223–224 unitarily invariant norm, 21
symmetric gauge functions, 22 by SVD, 219–221 updating
symmetric quasi-definite conditioning, 220, 221 least squares solution, 263
system, 329–331 generalized, 221–223 QR factorization, 141–145
symmetrizable, 270 generic problem, 219 URV decomposition, 78
mixed, 220
tall-and-skinny matrix, 212 multidimensional, 221–223
Taylor’s formula, 391 scaled, 218 Vandermonde matrix
tensor, 214 total-variation norm, 426 QR factorization of, 238
CP decomposition, 217 tridiagonal matrix, 183 Vandermonde systems,
decompositions, 216–217 unreduced, 194 237–239
multirank, 215 trigonometric polynomials, 233 fast algorithm, 238
operations, 214–215 truncated SVD, 173–175 variable projection method,
SVD, 217 truncated TLS, 225 402–405
Tikhonov regularization, trust-region methods, 396–398 VARPRO algorithm, 404
175–182 TSVD, see truncated SVD vector space calculus, 389–391
iterated, 177, 326 TV-norm, see total-variation Volterra integral equation, 240
standard form, 176 norm
transformation to, 180–182 two-block methods, 204–206
TLS, see total least squares wavelet transform, 322
two-cyclic matrix, 13
Toeplitz least squares problem, weighted problem
two-level preconditioner,
239–241 by LU factorization, 129–130
321–322
Toeplitz matrix, 239 by QR factorization, 130–132
fast multiplication, 324 ULV decomposition, 77–79 condition number, 132
QR factorization, 239–241 modifying, 152–155 stiff, 129
upper triangular, 240 unbiased estimate, 4, 174 Wielandt–Hoffman theorem, 22
Toeplitz systems best linear, 127 Wilkinson matrix, 94
circulant preconditioner, 324 of variance, 45 Wilkinson shift, 345
iterative solvers, 322–325 underdetermined linear system, Woodbury formula, 138
preconditioner, 324–325 7, 88, 99, 116, 128 wrapping effect, 34