Laurent El Ghaoui - Alicia Y. Tsai - Giuseppe C. Calafiore

LINEAR ALGEBRA
AND
APPLICATIONS

Hanoi, 2023
CONTENTS

INTRODUCTION 1

Part I. VECTORS

1. BASICS 4

1.1. Definitions 4

1.2. Independence 5

1.3. Subspace, span, affine sets 5

1.4. Basis, dimension 7


2. SCALAR PRODUCT, NORMS AND ANGLES 9

2.1. Scalar product 9

2.2. Norms 10

2.3. Three popular norms 10

2.4. Cauchy-Schwarz inequality 11

2.5. Angles between vectors 12


3. PROJECTION ON A LINE 13

3.1. Definition 13

3.2. Closed-form expression 14

3.3. Interpreting the scalar product 14


4. ORTHOGONALIZATION: THE GRAM-SCHMIDT PROCEDURE 16

4.1. What is orthogonalization? 16

4.2. Basic step: projection on a line 17

4.3. Gram-Schmidt procedure 17


5. HYPERPLANES AND HALF-SPACES 19

5.1. Hyperplanes 19

5.2. Projection on a hyperplane 21

5.3. Geometry of hyperplanes 22

5.4. Half-spaces 22
6. LINEAR FUNCTIONS 23

6.1. Linear and affine functions 23

6.2. First-order approximation of non-linear functions 26

6.3. Other sources of linear models 27


7. APPLICATION: DATA VISUALIZATION BY PROJECTION ON A LINE 28

7.1. Senate voting data 28

7.2. Visualization of high-dimensional data via projection 29

7.3. Examples 30
8. EXERCISES 32

8.1. Subspaces 32

8.2. Projections, scalar product, angles 33

8.3. Orthogonalization 33

8.4. Generalized Cauchy-Schwarz inequalities 33

8.5. Linear functions 34

Part II. MATRICES

9. BASICS 37

9.1. Matrices as collections of column vectors 37

9.2. Transpose 37

9.3. Matrices as collections of rows 37

9.4. Sparse Matrices 39


10. MATRIX-VECTOR AND MATRIX-MATRIX MULTIPLICATION, SCALAR PRODUCT 40

10.1. Matrix-vector product 40

10.2. Matrix-matrix product 43

10.3. Block Matrix Products 43

10.4. Trace, scalar product 45


11. SPECIAL CLASSES OF MATRICES 46

11.1 Some special square matrices 46

11.2 Dyads 49
12. QR DECOMPOSITION OF A MATRIX 51

12.1 Basic idea 51

12.2 Case when the matrix is full column rank 51

12.3 Case when the columns are not independent 52

12.4 Full QR decomposition 53


13. MATRIX INVERSES 54

13.1 Square full-rank matrices and their inverse 54

13.2 Full column rank matrices and left inverses 55

13.3 Full-row rank matrices and right inverses 55


14. LINEAR MAPS 57

14.1 Definition and Interpretation 57

14.2 First-order approximation of non-linear maps 58


15. MATRIX NORMS 59

15.1 Motivating example: effect of noise in a linear system 59

15.2 RMS gain: the Frobenius norm 59

15.3 Peak gain: the largest singular value norm 60

15.4 Applications 61
16. APPLICATIONS 63

16.1 State-space models of linear dynamical systems. 63


17. EXERCISES 66

17.1. Matrix products 66

17.2 Special matrices 66

17.3. Linear maps, dynamical systems 66

17.4 Matrix inverses, norms 67

Part III. LINEAR EQUATIONS

18. MOTIVATING EXAMPLE 70

18.1. Overview 70

18.2 From 1D to 2D: axial tomography 71

18.3. Linear equations for a single slice 71

18.4. Issues 73
19. EXISTENCE AND UNICITY OF SOLUTIONS 74

19.1. Set of solutions 74

19.2. Existence: range and rank of a matrix 74

19.3. Unicity: nullspace of a matrix 76

19.4. Fundamental facts 78


20. SOLVING LINEAR EQUATIONS VIA QR DECOMPOSITION 80

20.1. Basic idea: reduction to triangular systems of equations 80

20.2. The QR decomposition of a matrix 80

20.3. Using the full QR decomposition 81

20.4. Set of solutions 82


21. APPLICATIONS 83

21.1. Trilateration by distance measurements 83

21.2. Estimation of traffic flow 84


22. EXERCISES 86

22.1 Nullspace, rank and range 86


Part IV. LEAST-SQUARES

23. ORDINARY LEAST-SQUARES 88

23.1. Definition 88

23.2. Interpretations 89

23.3. Solution via QR decomposition (full rank case) 90

23.4. Optimal solution and optimal set 91


24. VARIANTS OF THE LEAST-SQUARES PROBLEM 92

24.1. Linearly constrained least-squares 92

24.2. Minimum-norm solution to linear equations 93

24.3. Regularized least-squares 93


25. KERNELS FOR LEAST-SQUARES 95

25.1. Motivations 95

25.2. The kernel trick 96

25.3. Nonlinear case 96

25.4. Examples of kernels 97

25.5. Kernels in practice 98


26. APPLICATIONS 99

26.1. Linear regression via least-squares 99

26.2. Auto-regressive models for time-series prediction. 100


27. EXERCISES 102

27.1. Standard forms 102

27.2. Applications 103

Part V. EIGENVALUES FOR SYMMETRIC MATRICES

28. QUADRATIC FUNCTIONS AND SYMMETRIC MATRICES 107

28.1. Symmetric matrices and quadratic functions 107

28.2. Second-order approximations of non-quadratic functions 109

28.3. Special symmetric matrices 110


29. SPECTRAL THEOREM 112

29.1. Eigenvalues and eigenvectors of symmetric matrices 112

29.2. Spectral theorem 112

29.3. Rayleigh quotients 113


30. POSITIVE SEMI-DEFINITE MATRICES 115

30.1. Definitions 115

30.2. Special cases and examples 116

30.3. Square root and Cholesky decomposition 116

30.4. Ellipsoids 117


31. PRINCIPAL COMPONENT ANALYSIS 119

31.1. Projection on a line via variance maximization 119

31.2. Principal component analysis 120

31.3. Explained variance 121


32. APPLICATIONS: PCA OF SENATE VOTING DATA 123

32.1 Introduction 123

32.2. Senate voting data and the visualization problem 124

32.3. Projecting on a line 125

32.4. Projecting on a plane 127

32.5. Direction of maximal variance 128

32.6. Principal component analysis 129

32.7. Sparse PCA 131

32.8 Sparse maximal variance problem 133


33. EXERCISES 138

33.1. Interpretation of covariance matrix 138

33.2. Eigenvalue decomposition 139

33.3. Positive-definite matrices, ellipsoids 139

33.4. Least-squares estimation 140


Part VI. SINGULAR VALUES

34. THE SVD THEOREM 142

34.1. The SVD theorem 142

34.2. Geometry 144

34.3. Link with the SED (Spectral Theorem) 144


35. MATRIX PROPERTIES VIA SVD 146

35.1. Nullspace 146

35.2. Range, rank via the SVD 148

35.3. Fundamental theorem of linear algebra 149

35.4. Matrix norms, condition number 149


36. SOLVING LINEAR SYSTEMS VIA SVD 151

36.1. Solution set 151

36.2. Pseudo-inverse 152

36.3. Sensitivity analysis and condition number 153


37. LEAST-SQUARES AND SVD 154

37.1. Set of solutions 154

37.2. Sensitivity analysis 154

37.3. BLUE property 156


38. LOW-RANK APPROXIMATIONS 158

38.1. Low-rank approximations 158

38.2. Link with Principal Component Analysis 159


39. APPLICATIONS 160

39.1 Image compression 160

39.2 Market data analysis 161


40. EXERCISES 164

40.1. SVD of simple matrices 164

40.2. Rank and SVD 165

40.3. Procrustes problem 166

40.4. SVD and projections 166

40.5. SVD and least-squares 167


Part VII. EXAMPLES

Dimension of an affine subspace 169


Sample and weighted average 170
Sample average of vectors 171
Euclidean projection on a set 172
Orthogonal complement of a subspace 173
Power laws 175
Power law model fitting 176
Definition: Vector norm 177
An infeasible linear system 178
Sample variance and standard deviation 179
Functions and maps 180

Functions 180

Maps 180
Dual norm 182
Incidence matrix of a network 183
Nullspace of a transpose incidence matrix 184
Rank properties of the arc-node incidence matrix 185
Permutation matrices 186
QR decomposition: Examples 187
Backwards substitution for solving triangular linear systems. 190
Solving triangular systems of equations: Backwards substitution example 191
Linear regression via least squares 192
Nomenclature 194

Feasible set 194

What is a solution? 194

Local vs. global optimal points 196


Standard forms 197

Functional form 197

Epigraph form 197

Other standard forms 198


A two-dimensional toy optimization problem 199
Global vs. local minima 201
Gradient of a function 202

Definition 202

Composition rule with an affine function 203

Geometric interpretation 203


Set of solutions to the least-squares problem via QR decomposition 205
Sample covariance matrix 207

Definition 207

Properties 208
Optimal set of least-squares via SVD 209
Pseudo-inverse of a matrix 212
SVD: A 4x4 example 214
Singular value decomposition of a 4 x 5 matrix 215
Representation of a two-variable quadratic function 216
Edge weight matrix of a graph 217
Network flow 218
Laplacian matrix of a graph 219
Hessian of a function 220

Definition 220

Examples 220
Hessian of a quadratic function 223
Gram matrix 224
Quadratic functions in two variables 225
Quadratic approximation of the log-sum-exp function 227
Determinant of a square matrix 228

Definition 228

Important result 229

Some properties 230


A squared linear function 232
Eigenvalue decomposition of a symmetric matrix 233
Rayleigh quotients 234
Largest singular value norm of a matrix 235
Nullspace of a 4x5 matrix via its SVD 236
Range of a 4x5 matrix via its SVD 237
Low-rank approximation of a 4x5 matrix via its SVD 238
Pseudo-inverse of a 4x5 matrix via its SVD 240

Part VIII. APPLICATIONS

Image compression via least-squares 243


Senate voting data matrix. 244
Senate voting analysis and visualization. 245
Beer-Lambert law in absorption spectrometry 247
Absorption spectrometry: Using measurements at different light frequencies. 248
Similarity of two documents 249
Image compression 250
Temperatures at different airports 251
Navigation by range measurement 252
Bag-of-words representation of text 253
Bag-of-words representation of text: Measure of document similarity 254
Rate of return of a financial portfolio 255

Rate of return of a single asset 255

Log-returns 255

Rate of return of a portfolio 255


Single factor model of financial price data 257
The problem of Gauss 258
Control of a unit mass 259
Portfolio optimization via linearly constrained least-squares. 260

Part IX. THEOREMS

Cauchy-Schwarz inequality proof 263


Dimension of hyperplanes 264
Spectral theorem: Eigenvalue decomposition for symmetric matrices 266
Singular value decomposition (SVD) theorem 268
Rank-one matrices 270
Rank-one matrices: A representation theorem 271
Full rank matrices 273
Rank-nullity theorem 274
A theorem on positive semidefinite forms and eigenvalues 275
Fundamental theorem of linear algebra 276

INTRODUCTION

BOOK DESCRIPTION:
“You can’t learn too much linear algebra.” (Benedict Gross, Professor of Mathematics at Harvard)

This book offers a guided tour of linear algebra and its applications, one of the most important building
blocks of modern engineering and computer sciences.

Topics include matrices, determinants, vector spaces, eigenvalues and eigenvectors, orthogonality, and inner
product spaces; applications include brief introductions to difference equations, Markov chains, and
systems of linear ordinary differential equations.

Rationale: Linear algebra is an important building block in engineering and computer sciences. It is applied
in many areas such as search engines, data mining and machine learning, control and optimization, graphics,
robotics, etc.

PART I
VECTORS


A vector is a collection of numbers arranged in a column or a row; it can be thought of as a point in space. We review basic notions such as independence, span, subspaces, and dimension. The scalar product allows us to define the length of a vector, and to generalize the notion of the angle between two vectors. Via the scalar product, we can view a vector as a linear function. We can also compute the projection of a vector onto a line defined by another vector, a basic ingredient in many visualization techniques for high-dimensional data sets.

Outline

• Basics
• Scalar product, norms and angles
• Projection on a line
• Orthogonalization
• Hyperplanes and half-spaces
• Linear functions
• Application: data visualization via projection on a line

• Exercises

1.

BASICS

• Definitions
• Independence
• Subspaces, span, affine sets
• Basis, dimension

1.1. Definitions

Vectors
Assume we are given a collection of $n$ real numbers, $x_1, \ldots, x_n$. We can represent them as locations on a line. Alternatively, we can represent the collection as a single point in an $n$-dimensional space. This is the vector representation of the collection of numbers; each number is called a component or element of the vector.

Vectors can be arranged in a column or a row; we usually write vectors in column format:
$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}.$$

We denote by $\mathbb{R}^n$ the set of real vectors with $n$ components. If $x$ denotes a vector, we use subscripts to denote components, so that $x_i$ is the $i$-th component of $x$. Sometimes the notation $x(i)$ is used to denote the $i$-th component.

A vector can also represent a point in a multi-dimensional space, where each component corresponds to a coordinate of the point.

Example 1: The vector $x = (2, 1)$ in $\mathbb{R}^2$, drawn as a point in the plane.

See also:

• Temperatures at different airports.


• Bag-of-words representation of text.

Transpose
If $x$ is a column vector, $x^T$ denotes the corresponding row vector, and vice-versa. Hence, if $x$ is the column vector above, $x^T = (x_1, \ldots, x_n)$.

Sometimes we use the looser, in-line notation $x = (x_1, \ldots, x_n)$ to denote a row or column vector, the orientation being understood from context.

1.2. Independence
A set of vectors $x_1, \ldots, x_m$ in $\mathbb{R}^n$ is said to be linearly independent if and only if the following condition on a vector of scalars $\lambda \in \mathbb{R}^m$:
$$\lambda_1 x_1 + \cdots + \lambda_m x_m = 0$$
implies $\lambda_1 = \cdots = \lambda_m = 0$. This means that no vector in the set can be expressed as a linear combination of the others.

Example 2: the vectors and are not linearly independent, since .
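As a quick numerical check (a sketch not from the original text, using hypothetical vectors), linear independence can be tested by comparing the rank of the matrix whose columns are the given vectors with the number of vectors:

```python
import numpy as np

# Hypothetical example vectors; stack them as columns of a matrix.
x1 = np.array([1.0, 2.0, 0.0])
x2 = np.array([0.0, 1.0, 1.0])
x3 = x1 + 2 * x2          # deliberately a linear combination of x1, x2

X = np.column_stack([x1, x2, x3])

# The columns are linearly independent iff the rank equals the number of columns.
rank = np.linalg.matrix_rank(X)
print(rank == X.shape[1])  # False here, since x3 depends on x1 and x2
```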

1.3. Subspace, span, affine sets


A subspace of is a subset that is closed under addition and scalar multiplication. Geometrically, subspaces
are ‘‘flat’’ (like a line or plane in 3D) and pass through the origin.

An important result of linear algebra, which we will prove later, says that a subspace can always be
represented as the span of a set of vectors , , that is, as a set of the form

An affine set is a translation of a subspace: it is ‘‘flat’’ but does not necessarily pass through the origin, as a subspace would. (Think for example of a line, or a plane, that does not go through the origin.) So an affine set can always be represented as the translation of the subspace spanned by some vectors:

for some vectors where . In shorthand notation, we write

Example 3: In $\mathbb{R}^3$, the span of two independent vectors is a plane passing through the origin.

When the span is that of a single non-zero vector $u$, the affine set is called a line passing through the point $x_0$. Thus, lines have the form
$$\{ x_0 + t u : t \in \mathbb{R} \},$$
where $u$ determines the direction of the line, and $x_0$ is a point through which it passes.

Example 4: A line in $\mathbb{R}^2$ passing through a point $x_0$, with direction $u$.

1.4. Basis, dimension

Basis
A basis of $\mathbb{R}^n$ is a set of $n$ independent vectors. If the vectors $u_1, \ldots, u_n$ form a basis, we can express any vector $x \in \mathbb{R}^n$ as a linear combination of the $u_i$'s:
$$x = \sum_{i=1}^n \lambda_i u_i$$
for appropriate numbers $\lambda_1, \ldots, \lambda_n$.

The standard basis (alternatively, natural basis) in $\mathbb{R}^n$ consists of the vectors $e_1, \ldots, e_n$, where the components of $e_i$ are all zero, except the $i$-th, which is equal to 1. In $\mathbb{R}^3$, for example, $e_1 = (1, 0, 0)$, $e_2 = (0, 1, 0)$, and $e_3 = (0, 0, 1)$.

Example 5: The set of three vectors in :

is not independent, since , and its span has dimension . Since are independent (the
equation has as the unique solution), a basis for that span is, for example, .
In contrast, the collection spans the whole space , and thus forms a basis of that space.

Basis of a subspace
A basis of a given subspace $\mathcal{S}$ is any independent set of vectors whose span is $\mathcal{S}$. If the vectors $u_1, \ldots, u_r$ form a basis of $\mathcal{S}$, we can express any vector of $\mathcal{S}$ as a linear combination of the $u_i$'s:

for appropriate numbers .

The number of vectors in the basis is actually independent of the choice of the basis (for example, in $\mathbb{R}^3$ you need two independent vectors to describe a plane containing the origin). This number is called the dimension of $\mathcal{S}$. We can accordingly define the dimension of an affine subspace, as that of the linear subspace of which it is a translation.

Examples:

• The dimension of a line is 1, since a line is of the form $\{ x_0 + t u : t \in \mathbb{R} \}$ for some non-zero vector $u$.
• Dimension of an affine subspace.

2.

SCALAR PRODUCT, NORMS AND ANGLES

• Scalar product
• Norms
• Three popular norms
• Cauchy-Schwarz inequality
• Angles between vectors

2.1. Scalar product

Definition
The scalar product (or inner product, or dot product) between two vectors $x, y \in \mathbb{R}^n$ is the scalar denoted $x^T y$, and defined as
$$x^T y = \sum_{i=1}^n x_i y_i.$$
The motivation for our notation above will come later when we define the matrix-matrix product. The scalar product is also sometimes denoted $\langle x, y \rangle$, a notation that originates in physics.

See also:

• Rate of return of a financial portfolio.


• Sample and weighted average.
• Beer-Lambert law in absorption spectroscopy.

Orthogonality
We say that two vectors $x, y$ are orthogonal if $x^T y = 0$.

Example 1: The two vectors in :

are orthogonal, since

2.2. Norms

Definition
Measuring the size of a scalar value is unambiguous — we just take the magnitude (absolute value) of the
number. However, when we deal with higher dimensions and try to define the notion of size, or length, of a
vector, we are faced with many possible choices. These choices are encapsulated in the notion of norm.

Norms are real-valued functions that satisfy a basic set of rules that a sensible notion of size should satisfy. You can consult the formal definition of a norm here. The norm of a vector $x$ is usually denoted $\|x\|$.

2.3. Three popular norms


In this course, we focus on the following three popular norms for a vector $x \in \mathbb{R}^n$:

The Euclidean norm ($\ell_2$-norm):
$$\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$$
corresponds to the usual notion of distance in two or three dimensions. The set of points with equal $\ell_2$-norm is a circle (in 2D), a sphere (in 3D), or a hyper-sphere in higher dimensions.

The $\ell_1$-norm:
$$\|x\|_1 = \sum_{i=1}^n |x_i|$$
corresponds to the distance traveled on a rectangular grid to go from one point to another.

The $\ell_\infty$-norm:
$$\|x\|_\infty = \max_{1 \le i \le n} |x_i|$$
is useful in measuring peak values.

Examples:

• A given vector will in general have different ‘‘lengths” under different norms. For example, the vector
yields , , and .
• Sample standard deviation.
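As an illustration (a minimal sketch with a hypothetical vector, not an example from the text), the three norms can be computed with numpy as follows.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])   # hypothetical example vector

l2 = np.linalg.norm(x, 2)        # Euclidean norm: sqrt(1 + 4 + 9)
l1 = np.linalg.norm(x, 1)        # sum of absolute values: 6
linf = np.linalg.norm(x, np.inf) # largest absolute value: 3

print(l2, l1, linf)
```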

2.4. Cauchy-Schwarz inequality


The Cauchy-Schwarz inequality allows us to bound the scalar product of two vectors in terms of their Euclidean norms.

Theorem: Cauchy-Schwarz inequality

For any two vectors $x, y \in \mathbb{R}^n$, we have
$$|x^T y| \le \|x\|_2 \cdot \|y\|_2.$$
The above inequality is an equality if and only if $x, y$ are collinear. In other words:
$$\max_{x : \|x\|_2 \le 1} x^T y = \|y\|_2,$$
with optimal $x$ given by $x^* = y / \|y\|_2$ if $y$ is non-zero.

For a proof, see here. The Cauchy-Schwarz inequality can be generalized to other norms, using the concept
of dual norm.

2.5. Angles between vectors


When neither of the vectors $x, y$ is zero, we can define the corresponding angle $\theta$ via
$$\cos\theta = \frac{x^T y}{\|x\|_2 \|y\|_2}.$$
Applying the Cauchy-Schwarz inequality above, we see that the number on the right-hand side is indeed in the interval $[-1, 1]$.

The notion above generalizes the usual notion of angle between two directions in two dimensions, and is useful in measuring the similarity (or closeness) between two vectors. When the two vectors are orthogonal, that is, $x^T y = 0$, their angle is $90^\circ$.

See also: Similarity of two documents.



3.

PROJECTION ON A LINE

• Definition
• Closed-form expression
• Interpreting the scalar product

3.1. Definition
Consider the line in $\mathbb{R}^n$ passing through a point $x_0$, with direction $u$:
$$L = \{ x_0 + t u : t \in \mathbb{R} \}.$$

Example 1: A line in $\mathbb{R}^2$ passing through a point $x_0$, with direction $u$.

The projection of a given point $x$ on the line is a vector $z$ located on the line that is closest to $x$ (in Euclidean norm). This corresponds to the simple optimization problem
$$\min_{t} \ \| x - x_0 - t u \|_2.$$
This particular problem is part of a general class of optimization problems known as least-squares. It is also a special case of a Euclidean projection on a general set.

Example 2: Projection of a vector $x$ on a line passing through the origin ($x_0 = 0$), with (normalized) direction $u$.

At optimality, the ‘‘residual’’ vector $x - z^*$ is orthogonal to the line, hence $u^T (x - z^*) = 0$, with $z^* = (u^T x)\, u$. Any other point on the line is farther away from the point $x$ than its projection $z^*$ is.

The scalar $u^T x$, that is, the scalar product between $u$ and $x$, is the component of $x$ along the normalized direction $u$.

3.2. Closed-form expression


Assuming that $u$ is normalized, so that $\|u\|_2 = 1$, the objective function of the projection problem reads, after squaring:
$$\|x - x_0 - t u\|_2^2 = t^2 - 2 t\, u^T(x - x_0) + \|x - x_0\|_2^2.$$
Thus, the optimal solution to the projection problem is
$$t^* = u^T (x - x_0),$$
and the expression for the projected vector is
$$z^* = x_0 + \big(u^T(x - x_0)\big)\, u.$$
The scalar product $u^T(x - x_0)$ is the component of $x - x_0$ along $u$.

In the case when $u$ is not normalized, the expression is obtained by replacing $u$ with its scaled version $u / \|u\|_2$:
$$z^* = x_0 + \frac{u^T(x - x_0)}{u^T u}\, u.$$
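The closed-form expression above translates directly into code; the following is a minimal numpy sketch (the point, anchor and direction below are hypothetical).

```python
import numpy as np

def project_on_line(x, x0, u):
    """Project point x onto the line {x0 + t*u}; u need not be normalized."""
    u = np.asarray(u, dtype=float)
    t_star = u @ (x - x0) / (u @ u)   # optimal scalar t*
    return x0 + t_star * u            # projected vector z*

x = np.array([2.0, 1.5])
x0 = np.zeros(2)
u = np.array([3.0, 1.0])

z = project_on_line(x, x0, u)
# The residual x - z is orthogonal to the direction u (up to rounding).
print(z, u @ (x - z))
```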

3.3. Interpreting the scalar product


We can now interpret the scalar product between two non-zero vectors $x, u$, by applying the previous derivation to the projection of $x$ on the line of direction $u$ passing through the origin. If $u$ is normalized ($\|u\|_2 = 1$), then the projection of $x$ on that line is $z^* = (u^T x)\, u$. Its length is $|u^T x|$. (See the figure in Example 2 above.)

In general, the scalar product $u^T x$ is simply the component of $x$ along the normalized direction $u / \|u\|_2$ defined by $u$.

4.

ORTHOGONALIZATION: THE
GRAM-SCHMIDT PROCEDURE

• Orthogonalization
• Projection on a line
• Gram-Schmidt procedure

A basis $(u_1, \ldots, u_n)$ is said to be orthogonal if $u_i^T u_j = 0$ when $i \ne j$. If, in addition, $\|u_i\|_2 = 1$ for every $i$, we say that the basis is orthonormal.

Example 1: An orthonormal basis in . The collection of vectors , with

forms an orthonormal basis of .

4.1. What is orthogonalization?


Orthogonalization refers to a procedure that finds an orthonormal basis of the span of given vectors.

Given vectors $a_1, \ldots, a_k \in \mathbb{R}^n$, an orthogonalization procedure computes vectors $q_1, \ldots, q_r \in \mathbb{R}^n$ such that
$$\mathbf{span}(a_1, \ldots, a_k) = \mathbf{span}(q_1, \ldots, q_r),$$
where $r$ is the dimension of that span, and
$$q_i^T q_j = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
That is, the vectors $q_1, \ldots, q_r$ form an orthonormal basis for the span of the vectors $a_1, \ldots, a_k$.

4.2. Basic step: projection on a line


A basic step in the procedure consists in projecting a vector on a line passing through zero. Consider the line
$$L = \{ t q : t \in \mathbb{R} \},$$
where $q$ is given and normalized ($\|q\|_2 = 1$).

The projection of a given point $a$ on the line is a vector $z$ located on the line, that is closest to $a$ (in Euclidean norm). This corresponds to the simple optimization problem
$$\min_t \ \|a - t q\|_2.$$
The vector $z^* = t^* q$, where $t^*$ is the optimal value, is referred to as the projection of $a$ on the line $L$. As seen here, the solution of this simple problem has the closed-form expression $t^* = q^T a$.

Note that the vector $a$ can now be written as a sum of its projection and another vector that is orthogonal to the projection:
$$a = (q^T a)\, q + \big(a - (q^T a)\, q\big),$$
where $(q^T a)\, q$ and $a - (q^T a)\, q$ are orthogonal. The vector $a - (q^T a)\, q$ can be interpreted as the result of removing the component of $a$ along $q$.

4.3. Gram-Schmidt procedure


The Gram-Schmidt procedure is a particular orthogonalization algorithm. The basic idea is to first orthogonalize each vector with respect to the previous ones, and then normalize the result to have unit norm.

Case when the vectors are independent


Let us assume that the vectors are linearly independent. The GS algorithm is as follows.

Gram-Schmidt procedure:

1. Set $\tilde{q}_1 = a_1$.
2. Normalize: set $q_1 = \tilde{q}_1 / \|\tilde{q}_1\|_2$.
3. Remove the component of $a_2$ along $q_1$: set $\tilde{q}_2 = a_2 - (q_1^T a_2)\, q_1$.
4. Normalize: set $q_2 = \tilde{q}_2 / \|\tilde{q}_2\|_2$.
5. Remove the components of $a_3$ along $q_1, q_2$: set $\tilde{q}_3 = a_3 - (q_1^T a_3)\, q_1 - (q_2^T a_3)\, q_2$.
6. Normalize: set $q_3 = \tilde{q}_3 / \|\tilde{q}_3\|_2$.
7. Etc.

The figure shows the GS procedure applied to the case of two vectors in two dimensions. We first set $q_1$ to be a normalized version of the first vector $a_1$. Then we remove the component of $a_2$ along the direction $q_1$. The difference is the (un-normalized) direction $\tilde{q}_2$, which becomes $q_2$ after normalization. At the end of the process, the vectors $q_1, q_2$ both have unit length and are orthogonal to each other.

The GS process is well-defined, since at each step $\tilde{q}_i \ne 0$ (otherwise this would contradict the linear independence of the $a_i$'s).

General case: the vectors may be dependent


It is possible to modify the algorithm to handle the case when the $a_i$'s are not linearly independent. If at step $i$ we find $\tilde{q}_i = 0$, then we simply skip to the next vector.

Modified Gram-Schmidt procedure:

1. Set $r = 0$.
2. For $i = 1, \ldots, k$:
   1. Set $\tilde{q} = a_i - \sum_{j=1}^{r} (q_j^T a_i)\, q_j$.
   2. If $\tilde{q} \ne 0$: set $r = r + 1$ and $q_r = \tilde{q} / \|\tilde{q}\|_2$.

On exit, the integer $r$ is the dimension of the span of the vectors $a_1, \ldots, a_k$.
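The procedure can be sketched in a few lines of numpy; this is an illustrative implementation of the modified Gram-Schmidt idea described above (with a numerical tolerance standing in for the exact zero test), not code from the book.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-10):
    """Return an orthonormal basis for the span of the given vectors.

    Vectors that are (numerically) dependent on the previous ones are skipped,
    as in the modified procedure above; tol replaces the exact zero test.
    """
    basis = []
    for a in vectors:
        q = np.array(a, dtype=float)
        for b in basis:
            q -= (b @ q) * b          # remove components along previous q's
        norm = np.linalg.norm(q)
        if norm > tol:                # skip (numerically) dependent vectors
            basis.append(q / norm)    # normalize
    return basis

# Hypothetical input: the third vector depends on the first two.
a1, a2 = np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])
q_list = gram_schmidt([a1, a2, a1 + a2])
print(len(q_list))                    # 2: the dimension of the span
```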



5.

HYPERPLANES AND HALF-SPACES

• Hyperplanes
• Projection on a hyperplane
• Geometry
• Half-spaces

5.1. Hyperplanes
A hyperplane is a set described by a single scalar product equality. Precisely, a hyperplane in $\mathbb{R}^n$ is a set of the form
$$H = \{ x : a^T x = b \},$$
where $a \in \mathbb{R}^n$, $a \ne 0$, and $b \in \mathbb{R}$ are given. When $b = 0$, the hyperplane is simply the set of points that are orthogonal to $a$; when $b \ne 0$, the hyperplane is a translation, along direction $a$, of that set.

If $x_0 \in H$, then for any other element $x \in H$, we have $a^T x = a^T x_0 = b$. Hence, the hyperplane can be characterized as the set of vectors $x$ such that $x - x_0$ is orthogonal to $a$:
$$H = \{ x : a^T (x - x_0) = 0 \}.$$
Hyperplanes are affine sets of dimension $n - 1$ (see the proof here). Thus, they generalize the usual notion of a plane in $\mathbb{R}^3$. Hyperplanes are very useful because they allow us to separate the whole space into two regions. The notion of half-space formalizes this.

Example 1: A hyperplane in .

Consider an affine set of dimension 2 in $\mathbb{R}^3$, which we describe as the set of points in $\mathbb{R}^3$ for which there exist two parameters such that

The set can be represented as a translation of a linear subspace: , with

and the span of the two independent vectors

Thus, the set is of dimension 2 in $\mathbb{R}^3$, hence it is a hyperplane. In $\mathbb{R}^3$, hyperplanes are ordinary planes. We can find a representation of the hyperplane in the standard form

We simply find that is orthogonal to both and . That is, we solve the equations

The above leads to . Choosing for example leads to .



The hyperplane can be expressed as


, where is a particular element,
and are two independent vectors. The set is
represented in light blue; it is a translation of the
corresponding span . Any point
in is such that belongs to . Thus, we
can represent the hyperplane as the set of points such
that is orthogonal to , where is any vector
orthogonal to both .

5.2. Projection on a hyperplane


Consider the hyperplane $H = \{ x : a^T x = b \}$, and assume without loss of generality that $a$ is normalized ($\|a\|_2 = 1$). We can represent $H$ as the set of points $x$ such that $x - x_0$ is orthogonal to $a$, where $x_0$ is any vector in $H$, that is, such that $a^T x_0 = b$. One such vector is $x_0 = b\, a$.

By construction, $x_0 = b\, a$ is the projection of the origin on $H$. That is, it is the point on $H$ closest to the origin, as it solves the projection problem
$$\min_x \ \|x\|_2 \ : \ a^T x = b.$$
Indeed, for any $x \in H$, using the Cauchy-Schwarz inequality:
$$\|x\|_2 \ge |a^T x| = |b|,$$
and the minimum length $|b|$ is attained with $x_0 = b\, a$.
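A small numerical check of this fact (a sketch with hypothetical data, not from the book): for a normalized $a$, the point $b\,a$ lies on the hyperplane and has the smallest norm among feasible points.

```python
import numpy as np

rng = np.random.default_rng(0)

a = rng.standard_normal(4)
a /= np.linalg.norm(a)            # normalize so that ||a||_2 = 1
b = 2.0                           # hypothetical offset

x0 = b * a                        # claimed closest point to the origin on H
print(np.isclose(a @ x0, b))      # x0 indeed lies on the hyperplane

# Any other feasible point (x0 plus a component orthogonal to a) is longer.
d = rng.standard_normal(4)
d -= (a @ d) * a                  # make d orthogonal to a, so x0 + d stays in H
x = x0 + d
print(np.linalg.norm(x) >= np.linalg.norm(x0))
```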



5.3. Geometry of hyperplanes

Geometrically, a hyperplane $H = \{ x : a^T x = b \}$, with $\|a\|_2 = 1$, is a translation of the set of vectors orthogonal to $a$. The direction of the translation is determined by $a$, and the amount by $b$.

Precisely, $|b|$ is the length of the closest point $x_0$ on $H$ from the origin, and the sign of $b$ determines if $H$ is away from the origin along the direction $a$ or $-a$. As we increase the magnitude of $b$, the hyperplane shifts further away along $\pm a$, depending on the sign of $b$.

In a 3D space, a hyperplane corresponds to a plane. In the image on the left, the scalar $b$ is positive, as $x_0$ and $a$ point in the same direction.

5.4. Half-spaces
A half-space is a subset of $\mathbb{R}^n$ defined by a single inequality involving a scalar product. Precisely, a half-space in $\mathbb{R}^n$ is a set of the form
$$H = \{ x : a^T x \ge b \},$$
where $a \in \mathbb{R}^n$, $a \ne 0$, and $b \in \mathbb{R}$ are given.

Geometrically, the half-space above is the set of points $x$ such that $a^T (x - x_0) \ge 0$, that is, the angle between $x - x_0$ and $a$ is acute (in $[-90^\circ, +90^\circ]$). Here $x_0$ is the point closest to the origin on the hyperplane defined by the equality $a^T x = b$. (When $a$ is normalized, as in the picture, $x_0 = b\, a$.)

The half-space is the set of points $x$ such that $x - x_0$ forms an acute angle with $a$, where $x_0$ is the projection of the origin on the boundary of the half-space. The image on the left is a visualization of half-spaces in a 3D context.

6.

LINEAR FUNCTIONS

• Linear and affine functions


• First-order approximation of non-linear functions
• Other sources of linear models

6.1. Linear and affine functions

Definition
Linear functions are functions which preserve scaling and addition of the input argument. Affine functions
are ‘‘linear plus constant’’ functions.

Formal definition: linear and affine functions. A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is linear if and only if it preserves scaling and addition of its arguments:

• for every $x \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$, $f(\alpha x) = \alpha f(x)$; and

• for every $x_1, x_2 \in \mathbb{R}^n$, $f(x_1 + x_2) = f(x_1) + f(x_2)$.

A function $f$ is affine if and only if the function $\tilde{f}$ with values $\tilde{f}(x) = f(x) - f(0)$ is linear.

An alternative characterization of linear functions:



A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is linear if and only if any one of the following conditions holds.

1. $f$ preserves scaling and addition of its arguments:

– for every $x$ in $\mathbb{R}^n$ and $\alpha$ in $\mathbb{R}$, $f(\alpha x) = \alpha f(x)$; and

– for every $x_1, x_2$ in $\mathbb{R}^n$, $f(x_1 + x_2) = f(x_1) + f(x_2)$.

2. $f$ vanishes at the origin ($f(0) = 0$), and transforms any line segment in $\mathbb{R}^n$ into another segment in $\mathbb{R}$:
$$f(\lambda x + (1 - \lambda) y) = \lambda f(x) + (1 - \lambda) f(y) \quad \text{for every } x, y \in \mathbb{R}^n, \ \lambda \in [0, 1].$$

3. $f$ is differentiable, vanishes at the origin, and the matrix of its derivatives is constant: there exists $a$ in $\mathbb{R}^n$ such that
$$f(x) = a^T x \quad \text{for every } x \in \mathbb{R}^n.$$

Example 1: Consider the functions with values

The function is linear; is affine; and is neither.

Connection with vectors via the scalar product


The following shows the connection between linear functions and scalar products.

Theorem: Representation of affine function via the scalar product

A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is affine if and only if it can be expressed via a scalar product:
$$f(x) = a^T x + b$$
for some unique pair $(a, b)$, with $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$, given by $a_i = f(e_i) - f(0)$, with $e_i$ the $i$-th unit vector in $\mathbb{R}^n$, $i = 1, \ldots, n$, and $b = f(0)$. The function is linear if and only if $b = 0$.

The theorem shows that a vector can be seen as a (linear) function from the ‘‘input’’ space $\mathbb{R}^n$ to the ‘‘output’’ space $\mathbb{R}$. Both points of view (vectors as simple collections of numbers, or as linear functions) are useful.

Gradient of an affine function


The gradient of a function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ at a point $x$, denoted $\nabla f(x)$, is the vector of first derivatives with respect to $x_1, \ldots, x_n$ (see here for a formal definition and examples). When $n = 1$ (there is only one input variable), the gradient is simply the derivative.

An affine function $f$, with values $f(x) = a^T x + b$, has a very simple gradient: the constant vector $a$. That is, for an affine function $f$, we have for every $x$:
$$\nabla f(x) = a.$$
Example 2: gradient of a linear function:

Consider the function , with values . Its gradient is constant, with values

For a given in , the -level set is the set of points such that :

The level sets are hyperplanes, and are orthogonal to the gradient.

Interpretations
The interpretation of are as follows.

• The is the constant term. For this reason, it is sometimes referred to as the bias, or intercept
(as it is the point where intercepts the vertical axis if we were to plot the graph of the function).
• The terms , , which correspond to the gradient of , give the coefficients of influence
of on . For example, if , then the first component of has much greater influence on the
value of than the third.

See also: Beer-Lambert law in absorption spectrometry.



6.2. First-order approximation of non-linear


functions
Many functions are non-linear. A common engineering practice is to approximate a given non-linear map with a linear (or affine) one, by taking derivatives. This is the main reason why linearity is such a ubiquitous tool in engineering.

One-dimensional case
Consider a function $f : \mathbb{R} \rightarrow \mathbb{R}$ of one variable, and assume it is differentiable everywhere. Then we can approximate the value of the function at a point $x$ near a point $x_0$ as follows:
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0),$$
where $f'(x_0)$ denotes the derivative of $f$ at $x_0$.

Multi-dimensional case
With more than one variable, we have a similar result. Let us approximate a differentiable function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ by a linear (affine) function $l$, so that $f$ and $l$ coincide up to and including the first derivatives at a given point $x_0$. The corresponding approximation $l$ is called the first-order approximation to $f$ at $x_0$.

The approximating function must be of the form
$$l(x) = a^T x + b,$$
where $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$. Our condition that $l$ coincides with $f$ up to and including the first derivatives shows that we must have
$$\nabla l(x_0) = a = \nabla f(x_0), \qquad l(x_0) = a^T x_0 + b = f(x_0),$$
where $\nabla f(x_0)$ is the gradient of $f$ at $x_0$. Solving for $(a, b)$ we obtain the following result:



Theorem: First-order expansion of a function.

The first-order approximation of a differentiable function $f$ at a point $x_0$ is of the form
$$f(x) \approx l(x) = f(x_0) + \nabla f(x_0)^T (x - x_0),$$
where $\nabla f(x_0)$ is the gradient of $f$ at $x_0$.

Example 3: a linear approximation to a non-linear function

Consider the log-sum-exp function $f : \mathbb{R}^n \rightarrow \mathbb{R}$, with values
$$f(x) = \log\left(\sum_{i=1}^n e^{x_i}\right).$$
It admits the gradient at a point $x_0$ given by
$$\nabla f(x_0) = \frac{1}{\sum_{i=1}^n e^{(x_0)_i}} \begin{pmatrix} e^{(x_0)_1} \\ \vdots \\ e^{(x_0)_n} \end{pmatrix}.$$
Hence $f$ can be approximated near $x_0$ by the affine function
$$l(x) = f(x_0) + \nabla f(x_0)^T (x - x_0).$$
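The sketch below (with a hypothetical expansion point and displacement, not values from the book) compares the log-sum-exp function with its first-order approximation built from the gradient formula above.

```python
import numpy as np

def f(x):
    return np.log(np.sum(np.exp(x)))       # log-sum-exp

def grad_f(x):
    z = np.exp(x)
    return z / z.sum()                     # gradient from the formula above

x0 = np.array([0.5, -1.0, 0.2])            # hypothetical expansion point
dx = 1e-3 * np.array([1.0, 2.0, -1.0])     # small displacement

approx = f(x0) + grad_f(x0) @ dx           # first-order approximation
print(f(x0 + dx), approx)                  # the two values are very close
```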

6.3. Other sources of linear models


Linearity can arise from a simple change of variables. This is best illustrated with a specific example.

Example: Power laws.



7.

APPLICATION: DATA VISUALIZATION BY


PROJECTION ON A LINE

• Senate voting data


• Visualization of high-dimensional data via projection on a line
• Examples

7.1. Senate voting data

In this section, we focus on a data set containing the votes of US Senators. This data set can be represented as a collection of vectors $x_j$, $j = 1, \ldots, n$, in $\mathbb{R}^m$, with $m$ the number of bills and $n$ the number of Senators. Thus, $x_j$ contains all the votes of Senator $j$, and the $i$-th component of $x_j$ contains the vote of that Senator on bill $i$.

Senate voting matrix: This image shows the votes of the Senators in the 2004-2006 US Senate, for a total of $m$ bills. ‘‘Yes’’ votes are represented as $+1$'s, ‘‘No’’ as $-1$'s, and the other votes are recorded as $0$. Each row represents the votes of a single Senator, and each column contains the votes of all Senators for a particular bill. The vectors $x_j$ can be read as rows in the picture.

7.2. Visualization of high-dimensional data via


projection
As seen in the picture above, simply plotting the raw data is often not very informative.

We can try to visualize the data set, by projecting each data point (each row or column of the matrix) on
(say) a one-, two- or three-dimensional space. Each ‘‘view’’ corresponds to a particular projection, that is, a
particular one-, two- or three-dimensional subspace on which we choose to project the data. Let us detail
what it means to project on a one-dimensional set, that is, on a line.

Projecting on a line allows us to assign a single number, or ‘‘score’’, to each data point, via a scalar product. We choose a (normalized) direction $u \in \mathbb{R}^m$ and a scalar $v$. This corresponds to the affine ‘‘scoring’’ function which, to a generic data point $x \in \mathbb{R}^m$, assigns the value
$$f(x) = u^T x + v.$$
We thus obtain a vector of values $f \in \mathbb{R}^n$, with components $f_j = u^T x_j + v$, $j = 1, \ldots, n$. It is often useful to center these scores around zero. This can be done by choosing $v$ such that
$$\sum_{j=1}^n f_j = 0.$$
The zero-mean condition implies $v = -u^T \hat{x}$, where
$$\hat{x} = \frac{1}{n} \sum_{j=1}^n x_j$$
is the vector of sample averages of the different data points.

The vector $\hat{x}$ can be interpreted as the ‘‘average response’’ across data points (the average vote across Senators in our running example). The values of our scoring function can now be expressed as
$$f_j = u^T (x_j - \hat{x}), \quad j = 1, \ldots, n.$$
In order to be able to compare the relative merits of different directions, we can assume, without loss of generality, that the direction vector $u$ is normalized (so that $\|u\|_2 = 1$).

Note that our definition of $f_j$ above is consistent with the idea of projecting the centered data points $x_j - \hat{x}$ on the line passing through the origin with normalized direction $u$. Indeed, the component of $x_j - \hat{x}$ on that line is $u^T (x_j - \hat{x})$.
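In code, the centered scores amount to a single matrix-vector product; here is a minimal sketch with a small random data matrix standing in for the Senate voting data (rows are data points; all values are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.choice([-1.0, 0.0, 1.0], size=(10, 20))   # hypothetical votes: 10 "Senators", 20 "bills"
u = np.ones(X.shape[1]) / np.sqrt(X.shape[1])      # normalized direction (here: an "average bill")

x_hat = X.mean(axis=0)                             # average response across data points
scores = (X - x_hat) @ u                           # centered scores f_j = u^T (x_j - x_hat)

print(np.isclose(scores.sum(), 0.0))               # scores are centered around zero
```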

In the Senate voting example above, a particular projection (that is, a direction in ) corresponds to
assigning a ‘‘score’’ to each Senator, and thus represents all the Senators as a single value on a line. We
will project the data along a vector in the ‘‘bill’’ space, which is . That is, we are going to form linear
combinations of the bills, so that the votes for each Senator are reduced to a single number, or
‘‘score’’. Since we centered our data, the average score (across Senators) is zero.

7.3. Examples

Projection in a random direction

Scores obtained with a random direction: This image shows the values of the projections of the Senators' votes (that is, with the average across Senators removed) on a (normalized) ‘‘random bill’’ direction. Each score is a number given by $f_j = u^T (x_j - \hat{x})$, with $u$ a random vector (normalized to have unit Euclidean norm). We show the party affiliation, with the green color corresponding to the Independent Senator Jeffords. This projection shows no obvious structure; in particular, it does not reveal the party affiliation.

Projection on the ‘‘all-ones’’ vector


Clearly, not all directions are ‘‘good’’, in the sense of producing informative plots. Later we discuss a general principle that allows us to choose an ‘‘informative’’ direction. But for this data set, a good guess could be to choose the direction that corresponds to the ‘‘average bill’’. That is, we choose the direction $u$ to be parallel to the vector of ones in $\mathbb{R}^m$, scaled appropriately so that its Euclidean norm is one.

Scores obtained with the all-ones direction: This image shows the values of the projections of the Senators' votes (that is, with the average across Senators removed) on a (normalized) ‘‘average bill’’ direction. Each score is a number given by $f_j = u^T (x_j - \hat{x})$, with $u$ a vector with all elements equal to each other, normalized to have unit Euclidean norm. This projection clearly reveals the party affiliation of each Senator, even though that information was not available to the projection algorithm.

The interpretation is that the behavior of Senators on an ‘‘average bill’’ almost fully determines their party affiliation. Note that the range of the plot is wider than that obtained with the random direction, allowing a better spread of the scores.

8.

EXERCISES

• Subspaces
• Projections, scalar products, angles
• Orthogonalization
• Generalized Cauchy-Schwarz inequalities
• Linear functions

8.1. Subspaces
1. Consider the set of points such that

Show that is a subspace. Determine its dimension, and find a basis for it.

2. Consider the set in , defined by the equation

a. Show that the set is an affine subspace of dimension . To this end, express it as

where , and are independent vectors.

b. Find the minimum Euclidean distance from to the set . Find a point that achieves
the minimum distance. (Hint: using the Cauchy-Schwarz inequality, prove that the minimum-
distance point is proportional to .)

8.2. Projections, scalar product, angles

1. Find the projection of the vector on the line that passes through with

direction given by the vector

2. Find the Euclidean projection of a point on a hyperplane

where and are given.

3. Determine the angle between the following two vectors:

Are these vectors linearly independent?

8.3. Orthogonalization
Let be two unit-norm vectors, that is, such that . Show that the
vectors and are orthogonal. Use this to find an orthogonal basis for the subspace
spanned by and .

8.4. Generalized Cauchy-Schwarz inequalities


1. Show that the following inequalities hold for any vector :

2. Show that the following inequalities hold for any vector:

Hint: use the Cauchy-Schwarz inequality for the second inequality.

3. In a generalized version of the above inequalities, show that for any non-zero vector ,

where $\mathbf{card}(x)$ denotes the cardinality of the vector $x$, defined as the number of non-zero elements in $x$. For which vectors $x$ is the upper bound attained?

8.5. Linear functions


1. For a -vector , with odd, we define the median of as . Now consider the
function , with values

Express as a scalar product, that is, find such that for every . Find a basis
for the set of points such that .

2. For , we consider the ‘‘power-law’’ function , with values

Justify the statement: ‘‘the coefficients provide the ratio between the relative error in to a relative
error in ’’.

3. Find the gradient of the function that gives the distance from a given point
to a point .

PART II
MATRICES

Matrices are collections of vectors of the same size, organized in a rectangular array. The image shows the
matrix of votes of the 2004-2006 US Senate.

Via the matrix-vector product, we can interpret matrices as linear maps (vector-valued functions), which act
from an ‘‘input’’ space to an ‘‘output’’ space, and preserve the addition and scaling of the inputs. Linear
maps arise everywhere in engineering, mostly via a process known as linearization (of non-linear maps).
Matrix norms are then useful to measure how the map amplifies or decreases the norm of specific inputs.

We review a number of prominent classes of matrices. Orthogonal matrices are an important special case,
as they generalize the notion of rotation in the ordinary two- or three-dimensional spaces: they preserve
(Euclidean) norms and angles. The QR decomposition, which proves useful in solving linear equations and
related problems, allows us to decompose any matrix as a two-term product involving an orthogonal matrix and a triangular matrix.

Outline

• Basics
• Matrix-vector and matrix-matrix multiplication, scalar product
• Special classes of matrices
• QR decomposition of a matrix
• Matrix inverses

• Linear maps
• Matrix norms
• Applications
• Exercises

9.

BASICS

• Matrices as collections of column vectors


• Transpose
• Matrices as collections of row vectors
• Sparse matrices

9.1. Matrices as collections of column vectors


Matrices can be viewed simply as a collection of column vectors of the same size, that is, as a collection of points in a multi-dimensional space.

A matrix can be described as follows: given $n$ vectors $a_1, \ldots, a_n$ in $\mathbb{R}^m$, we can define the $m \times n$ matrix $A$ with the $a_i$'s as columns:
$$A = [a_1, \ldots, a_n].$$
Geometrically, $A$ represents $n$ points in an $m$-dimensional space.

The notation $\mathbb{R}^{m \times n}$ denotes the set of $m \times n$ matrices.

With our convention, a column vector in $\mathbb{R}^m$ is thus a matrix in $\mathbb{R}^{m \times 1}$, while a row vector in $\mathbb{R}^n$ is a matrix in $\mathbb{R}^{1 \times n}$.

9.2. Transpose
The notation $A_{ij}$ denotes the element of $A$ in row $i$ and column $j$. The transpose of an $m \times n$ matrix $A$, denoted by $A^T$, is the $n \times m$ matrix with element $A_{ji}$ at the $(i, j)$ position, with $1 \le i \le n$ and $1 \le j \le m$.

9.3. Matrices as collections of rows


Similarly, we can describe a matrix in a row-wise manner: given $m$ vectors in $\mathbb{R}^n$, we can define the $m \times n$ matrix with the transposed vectors as its rows.

Geometrically, such a matrix represents $m$ points in an $n$-dimensional space.

Example 1: Consider the matrix

The matrix can be interpreted as the collection of two column vectors, which contain the columns of the matrix.

Geometrically, these columns represent 2 points in a 3-dimensional space.

Alternatively, we can interpret the matrix as a collection of 3 row vectors in $\mathbb{R}^2$, whose transposes form the rows of the matrix.

Geometrically, these rows represent 3 points in a 2-dimensional space.

See also:

• Arc-node incidence matrix of a network.


• Matrix of votes in the US Senate, 2004-2006.

9.4. Sparse Matrices


In many applications, one has to deal with very large matrices that are sparse, that is, they have many zeros.
It often makes sense to use a sparse storage convention to represent the matrix.

One of the most common formats involves only listing the non-zero elements, and their associated locations
in the matrix.

10.

MATRIX-VECTOR AND MATRIX-MATRIX


MULTIPLICATION, SCALAR PRODUCT

• Matrix-vector product
• Matrix-matrix product
• Block matrix product
• Trace and scalar product

10.1. Matrix-vector product

Definition
We define the matrix-vector product between an $m \times n$ matrix $A$ and an $n$-vector $x$, and denote it by $Ax$, as the $m$-vector with $i$-th component
$$(Ax)_i = \sum_{j=1}^n A_{ij} x_j, \quad i = 1, \ldots, m.$$

The picture on the left shows a symbolic example with $m = 3$ and $n = 2$. We have $y = Ax$, that is:
$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \times \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}.$$

Interpretation as linear combinations of columns


If the columns of $A$ are given by the vectors $a_1, \ldots, a_n$, so that $A = [a_1, \ldots, a_n]$, then $Ax$ can be interpreted as a linear combination of these columns, with weights given by the vector $x$:
$$Ax = \sum_{j=1}^n x_j a_j.$$

In the above symbolic example, we have:
$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \times \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1 \begin{pmatrix} a_{11} \\ a_{21} \\ a_{31} \end{pmatrix} + x_2 \begin{pmatrix} a_{12} \\ a_{22} \\ a_{32} \end{pmatrix}.$$
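A short numpy check of the column interpretation (using a hypothetical 3 x 2 matrix, not one from the text): the product $Ax$ coincides with the corresponding linear combination of the columns of $A$.

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])       # hypothetical 3 x 2 matrix
x = np.array([2.0, -1.0])

direct = A @ x                                   # matrix-vector product
combo = x[0] * A[:, 0] + x[1] * A[:, 1]          # linear combination of columns

print(np.allclose(direct, combo))                # True
```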

See also:

• Network flow.
• Image compression.

Interpretation as scalar products with rows


Alternatively, if the rows of are the row vectors :

then is the vector with elements :



In the above symbolic example, we have:
$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \times \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_{11} x_1 + a_{12} x_2 \\ a_{21} x_1 + a_{22} x_2 \\ a_{31} x_1 + a_{32} x_2 \end{pmatrix}.$$
See also: Absorption spectrometry: using measurements at different light frequencies.

Left product
If , then the notation is the row vector of size equal to the transpose of the column vector
. That is:

Example: Return to the network example, involving a incidence matrix. We note that, by
construction, the columns of sum to zero, which can be compactly written as , or .

10.2. Matrix-matrix product

Definition
We can extend the matrix-vector product to the matrix-matrix product, as follows. If $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the notation $AB$ denotes the $m \times p$ matrix with $(i, j)$ element given by
$$(AB)_{ij} = \sum_{k=1}^n A_{ik} B_{kj}.$$
Transposing a product changes the order, so that $(AB)^T = B^T A^T$.

Column-wise interpretation
If the columns of are given by the vectors , with , so that , then
can be written as

In other words, results from transforming each column of into .

Row-wise interpretation
The matrix-matrix product can also be interpreted as an operation on the rows of . Indeed, if is given
by its rows then is the matrix obtained by transforming each one of these rows via
, into :

(Note that ‘s are indeed row vectors, according to our matrix-vector rules.)

10.3. Block Matrix Products


Matrix algebra generalizes to blocks, provided block sizes are consistent. To illustrate this, consider the

matrix-vector product between a matrix and a -vector , where are partitioned in blocks,
as follows:

where is Then

Symbolically, it is as if we formed the ‘‘scalar’’ product between a ‘‘row vector’’ of blocks and a ‘‘column vector’’ of blocks!

Likewise, if a matrix is partitioned into two blocks , each of size , with


, then

Again, symbolically we apply the same rules as for the scalar product — except that now the result is a matrix.

Example: Gram matrix.

Finally, we can consider so-called outer products. Assume matrix is partitioned row-wise and matrix is
partitioned column-wise. Therefore, we have:

The dimensions of these matrices should be consistent such that are of dimensions and
respectively and are of dimensions and respectively. The dimensions of
the resultant matrices will be
respectively.

Then the product can be expressed in terms of the blocks, as follows:



10.4. Trace, scalar product

Trace
The trace of a square $n \times n$ matrix $A$, denoted by $\mathbf{tr}\, A$, is the sum of its diagonal elements:
$$\mathbf{tr}\, A = \sum_{i=1}^n A_{ii}.$$
Some important properties:

• Trace of transpose: The trace of a square matrix is equal to that of its transpose.
• Commutativity under trace: for any two matrices $A$ and $B$ of compatible sizes, we have $\mathbf{tr}(AB) = \mathbf{tr}(BA)$.

Scalar product between matrices


We can define the scalar product between two $m \times n$ matrices $A, B$ via
$$\langle A, B \rangle = \mathbf{tr}(A^T B) = \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij}.$$
The above definition is symmetric: we have
$$\langle A, B \rangle = \mathbf{tr}(A^T B) = \mathbf{tr}\big((A^T B)^T\big) = \mathbf{tr}(B^T A) = \langle B, A \rangle.$$
Our notation is consistent with the definition of the scalar product between two vectors, where we simply view a vector in $\mathbb{R}^n$ as a matrix in $\mathbb{R}^{n \times 1}$. We can interpret the matrix scalar product as the vector scalar product between two long vectors of length $mn$ each, obtained by stacking all the columns of $A, B$ on top of each other.
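A quick numerical illustration (with hypothetical random matrices): the trace-based scalar product agrees with the ordinary scalar product of the column-stacked matrices, and is symmetric.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

scalar_trace = np.trace(A.T @ B)                             # <A, B> = tr(A^T B)
scalar_stack = A.flatten(order="F") @ B.flatten(order="F")   # stack columns, then dot

print(np.isclose(scalar_trace, scalar_stack))                # True
print(np.isclose(np.trace(A.T @ B), np.trace(B.T @ A)))      # symmetry
```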

11.

SPECIAL CLASSES OF MATRICES

• Square Matrices
◦ Identity and diagonal matrices
◦ Triangular matrices
◦ Symmetric matrices
◦ Orthogonal Matrices
• Dyads

11.1 Some special square matrices


Square matrices are matrices that have the same number of rows as columns. The following are important
instances of square matrices.

Identity matrix
The $n \times n$ identity matrix (often denoted $I_n$, or simply $I$ if the context allows) has ones on its diagonal and zeros elsewhere. It is square, diagonal, and symmetric. This matrix satisfies $A I_n = A$ for every matrix $A$ with $n$ columns, and $I_n B = B$ for every matrix $B$ with $n$ rows.

Example 1: Identity matrix

The identity matrix, denoted , is given by:

This matrix has ones on its diagonal and zeros elsewhere. When multiplied by any matrix , the product
remains , and similarly, for any matrix of size .

Diagonal matrices
Diagonal matrices are square matrices with $A_{ij} = 0$ when $i \ne j$. A diagonal matrix can be denoted as $A = \mathbf{diag}(a)$, with $a$ the vector containing the elements on the diagonal. We can also write

where by convention the zeros outside the diagonal are not written.

Symmetric matrices
Symmetric matrices are square matrices that satisfy $A_{ij} = A_{ji}$ for every pair $(i, j)$. An entire section is devoted to symmetric matrices.

Example 2: A symmetric matrix

The matrix

is symmetric. The matrix

is not, since it is not equal to its transpose.

Triangular matrices
A square matrix $A$ is upper triangular if $A_{ij} = 0$ when $i > j$. Here are a few examples:

A matrix is lower triangular if its transpose is upper triangular. For example:

Orthogonal matrices
Orthogonal (or unitary) matrices are square matrices whose columns form an orthonormal basis. If $U = [u_1, \ldots, u_n]$ is an orthogonal matrix, then
$$u_i^T u_j = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
Thus, $U^T U = I_n$. Similarly, $U U^T = I_n$.

Orthogonal matrices correspond to rotations or reflections across a direction: they preserve length and angles. Indeed, for every vector $x$,
$$\|U x\|_2^2 = (U x)^T (U x) = x^T U^T U x = x^T x = \|x\|_2^2.$$
Thus, the underlying linear map $x \mapsto U x$ preserves the length (measured in Euclidean norm). This is sometimes referred to as the rotational invariance of the Euclidean norm.

In addition, angles are preserved: if $x, y$ are two vectors with unit norm, then the angle $\theta$ between them satisfies $\cos\theta = x^T y$, while the angle $\theta'$ between the rotated vectors $Ux, Uy$ satisfies $\cos\theta' = (Ux)^T (Uy)$. Since
$$(Ux)^T (Uy) = x^T U^T U y = x^T y,$$
we obtain that the angles are the same. (The converse is true: any square matrix that preserves lengths and angles is orthogonal.)

Geometrically, orthogonal matrices correspond to rotations (around a point) or reflections (around a line
passing through the origin).

Example 3: A $2 \times 2$ orthogonal matrix

The matrix

is orthogonal.

The vector is transformed by the orthogonal matrix above into

Thus, corresponds to a rotation of angle degrees counter-clockwise.

See also: Permutation matrices
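The following sketch builds a $2 \times 2$ rotation matrix (the angle below is a hypothetical choice, not the one used in the book's example) and checks that it is orthogonal and preserves Euclidean norms.

```python
import numpy as np

theta = np.deg2rad(30)                       # hypothetical rotation angle
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(U.T @ U, np.eye(2)))       # orthogonality: U^T U = I

x = np.array([2.0, 1.0])
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))  # norms are preserved
```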

11.2 Dyads
Dyads are a special class of matrices, also called rank-one matrices, for reasons seen later.

Definition
A matrix $A \in \mathbb{R}^{m \times n}$ is a dyad if it is of the form $A = u v^T$ for some vectors $u \in \mathbb{R}^m$, $v \in \mathbb{R}^n$. The dyad acts on an input vector $x \in \mathbb{R}^n$ as follows:
$$A x = (u v^T) x = (v^T x)\, u.$$
In terms of the associated linear map, for a dyad the output always points in the same direction $u$ in output space ($\mathbb{R}^m$), no matter what the input $x$ is. The output is thus always a simple scaled version of $u$. The amount of scaling depends on the vector $v$, via the linear function $x \mapsto v^T x$.

See also: Single-factor models of financial data.

Normalized dyads
We can always normalize the dyad, by assuming that both $u, v$ are of unit (Euclidean) norm, and using a factor $\sigma > 0$ to capture their scale. That is, any dyad can be written in normalized form:
$$A = \sigma\, u v^T,$$
where $\sigma > 0$, and $\|u\|_2 = \|v\|_2 = 1$.

12.

QR DECOMPOSITION OF A MATRIX

• Basic idea
• The case when the matrix has linearly independent columns
• General case
• Full QR decomposition

12.1 Basic idea


The basic goal of the QR decomposition is to factor a matrix as a product of two matrices (traditionally called $Q$ and $R$, hence the name of this factorization). Each factor has a simple structure that can be further exploited in dealing with, say, linear equations.

The QR decomposition is nothing else than the Gram-Schmidt procedure applied to the columns of the matrix, with the result expressed in matrix form. Consider an $m \times n$ matrix $A = [a_1, \ldots, a_n]$, where each $a_i \in \mathbb{R}^m$ is a column of $A$.

12.2 Case when $A$ is full column rank


Assume first that $a_1, \ldots, a_n$ (the columns of $A$) are linearly independent. Each step of the G-S procedure can be written as
$$a_i = (q_1^T a_i)\, q_1 + \cdots + (q_{i-1}^T a_i)\, q_{i-1} + \|\tilde{q}_i\|_2\, q_i, \quad i = 1, \ldots, n.$$
We write this as
$$a_i = r_{1i}\, q_1 + \cdots + r_{ii}\, q_i,$$
where $r_{ji} = q_j^T a_i$ for $j < i$, and $r_{ii} = \|\tilde{q}_i\|_2 > 0$.

Since the $q_i$'s are unit-length and mutually orthogonal, the matrix $Q = [q_1, \ldots, q_n]$ satisfies $Q^T Q = I_n$. The QR decomposition of a matrix thus allows writing the matrix in factored form:
$$A = Q R,$$
where $Q$ is an $m \times n$ matrix with $Q^T Q = I_n$, and $R$ is $n \times n$ and upper triangular.

Example: QR decomposition of a 4×6 matrix.
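In practice the factorization can be computed numerically; this sketch (with a random, hypothetical matrix that has full column rank with high probability) recovers a $Q$ with orthonormal columns and an upper triangular $R$ such that $A = QR$.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))              # hypothetical 6 x 4 matrix

Q, R = np.linalg.qr(A, mode="reduced")       # "thin" QR: Q is 6 x 4, R is 4 x 4

print(np.allclose(Q.T @ Q, np.eye(4)))       # orthonormal columns
print(np.allclose(A, Q @ R))                 # A is recovered
print(np.allclose(R, np.triu(R)))            # R is upper triangular
```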

12.3 Case when the columns are not independent


When the columns of are not independent, at some step of the G-S procedure we encounter a zero vector
, which means is a linear combination of . The modified Gram-Schmidt procedure then
simply skips to the next vector and continues.

In matrix form, we obtain , with , , and has an upper staircase


form, for example:

(This is simply an upper triangular matrix with some rows deleted. It is still upper triangular.)

We can permute the columns of to bring forward the first non-zero elements in each row:

where is a permutation matrix (that is, its columns are the unit vectors in some order), whose effect
is to permute columns. (Since is orthogonal, .) Now, is square, upper triangular,
and invertible, since none of its diagonal elements is zero.

The QR decomposition can be written

where

1. $Q_1$ is an $m \times r$ matrix whose columns are orthonormal;
2. $r$ is the rank of $A$;
3. $R_{11}$ is an $r \times r$ upper triangular, invertible matrix;
4. $R_{12}$ is an $r \times (n - r)$ matrix;
5. $P$ is an $n \times n$ permutation matrix.

12.4 Full QR decomposition


The full QR decomposition allows us to write $A = Q R$, where $Q$ is square and orthogonal ($Q^T Q = Q Q^T = I_m$). In other words, the columns of $Q$ are an orthonormal basis for the whole output space $\mathbb{R}^m$, not just for the range of $A$.

We obtain the full decomposition by appending an identity matrix to the columns of $A$, forming the augmented matrix $[A, I_m]$. The QR decomposition of the augmented matrix allows us to write

where the columns of the matrix are orthogonal and is upper triangular and
invertible. (As before, is a permutation matrix.) In the G-S procedure, the columns of are obtained
from those of , while the columns of come from the extra columns added to .

The full QR decomposition reveals the rank of $A$: we simply count the non-zero elements on the diagonal of the triangular factor, that is, the size of $R_{11}$.

Example: QR decomposition of a 4×6 matrix.



13.

MATRIX INVERSES

• Square full-rank matrices and their inverse


• Full column rank matrices and left inverses
• Full-row rank matrices and right inverses

13.1 Square full-rank matrices and their inverse


A square $n \times n$ matrix $A$ is said to be invertible if and only if its columns are independent. This is equivalent to the fact that its rows are independent as well. An equivalent definition states that a matrix is invertible if and only if its determinant is non-zero.

For an invertible matrix $A$, there exists a unique matrix $B$ such that $A B = B A = I_n$. The matrix $B$ is denoted $A^{-1}$ and is called the inverse of $A$.

Example 1: A simple matrix.


Consider the matrix and its inverse:

The product of and is:

This is the identity matrix . Similarly, will also result in .

If a matrix $A$ is square, invertible, and triangular, we can compute its inverse simply, as follows. We solve $n$ linear equations of the form $A x_i = e_i$, with $e_i$ the $i$-th column of the identity matrix, using a process known as backward substitution. The matrix $[x_1, \ldots, x_n]$ obtained by collecting the solutions is, by construction, the inverse $A^{-1}$.

For a general square and invertible matrix $A$, the QR decomposition can be used to compute its inverse. For such matrices, the QR decomposition is of the form $A = Q R$, with $Q$ an orthogonal matrix and $R$ upper triangular and invertible. Then the inverse is $A^{-1} = R^{-1} Q^T$.

A useful property is the expression of the inverse of a product of two square, invertible matrices $A, B$: $(AB)^{-1} = B^{-1} A^{-1}$. (Indeed, you can check that this inverse works.)
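The identity $A^{-1} = R^{-1} Q^T$ can be checked numerically; below is a minimal sketch with a hypothetical matrix (invertible with high probability), solving the triangular system instead of forming $R^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))              # hypothetical square matrix

Q, R = np.linalg.qr(A)
# A^{-1} = R^{-1} Q^T.  Solving R X = Q^T (a triangular system that a dedicated
# backward-substitution routine would exploit) gives X = A^{-1}.
A_inv = np.linalg.solve(R, Q.T)

print(np.allclose(A @ A_inv, np.eye(5)))     # X is indeed the inverse of A
```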

13.2 Full column rank matrices and left inverses


An $m \times n$ matrix $A$ is said to be full column rank if its columns are independent. This necessarily implies $m \ge n$.

A matrix has full column rank if and only if there exists an matrix such that (here
is the small dimension). We say that is a left-inverse of . To find one left inverse of a matrix with
independent columns , we use the full QR decomposition of to write

where is upper triangular and invertible, while is and orthogonal ( ). We


can then set a left inverse to be

The particular choice above can be expressed in terms of directly:

Note that is invertible, as it is equal to .

In general, left inverses are not unique.
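A quick numerical check (NumPy sketch; the data is random and illustrative) that one standard left inverse of a full-column-rank matrix, built from the product of the transpose with the matrix itself, satisfies the defining property:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 3))             # tall matrix, full column rank (generically)

# One possible left inverse: B = (A^T A)^{-1} A^T.
B = np.linalg.solve(A.T @ A, A.T)

print(np.allclose(B @ A, np.eye(3)))        # True: B is a left inverse of A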

13.3 Full-row rank matrices and right inverses


A matrix is said to be full row rank if its rows are independent. This necessarily implies .

A matrix has full row rank if and only if there exists an matrix such that (here
is the small dimension). We say that is a right-inverse of . We can derive expressions of right
inverses by noting that is full row rank if and only if is full column rank. In particular, for a matrix
with independent rows, the full QR decomposition (of ) allows writing

where is upper triangular and invertible, while is and orthogonal ( ). We


can then set a right inverse of to be

The particular choice above can be expressed in terms of directly:

Note that is invertible, as it is equal to .

In general, right inverses are not unique.
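Mirroring the left-inverse sketch above (again NumPy, with illustrative random data), one possible right inverse of a full-row-rank matrix can be checked as follows:

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 7))             # wide matrix, full row rank (generically)

# One possible right inverse: C = A^T (A A^T)^{-1}.
C = A.T @ np.linalg.inv(A @ A.T)

print(np.allclose(A @ C, np.eye(3)))        # True: C is a right inverse of A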



14.

LINEAR MAPS

• Definitions and interpretation


• First-order approximation of non-linear maps

14.1 Definition and Interpretation

Definition
A map is linear (resp. affine) if and only if every one of its components is. The formal
definition we saw here for functions applies verbatim to maps.

To an matrix , we can associate a linear map , with values .


Conversely, to any linear map, we can uniquely associate a matrix which satisfies for every
.

Indeed, if the components of , are linear, then they can be expressed as


for some . The matrix is the matrix that has as its -th row:

Hence, there is a one-to-one correspondence between matrices and linear maps. This extends what we saw for vectors, which are in one-to-one correspondence with linear functions.

This is summarized as follows.

Representation of affine maps via the matrix-vector product. A function is affine if


and only if it can be expressed via a matrix-vector product:

for some unique pair , with and . The function is linear if and only if .

The result above shows that a matrix can be seen as a (linear) map from the “input” space to the “output”
space . Both points of view (matrices as simple collections of vectors, or as linear maps) are useful.

Interpretations
Consider an affine map . An element gives the coefficient of influence of over .
In this sense, if we can say that has much more influence on than . Or, says
that does not depend at all on . Often the constant term is referred to as the “bias” vector.

14.2 First-order approximation of non-linear maps


Since maps are just collections of functions, we can approximate a map with a linear (or affine) map, just as
we did with functions here. If is differentiable, then we can approximate the (vector) values
of near a given point by an affine map :

where is the derivative of the -th component of with respect to ( is referred to as

the Jacobian matrix of at ).
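The sketch below (Python/NumPy, with an arbitrarily chosen map) approximates the Jacobian by finite differences and checks that the resulting affine map approximates the nonlinear map well near the chosen point:

import numpy as np

def f(x):
    # A non-linear map from R^2 to R^2, chosen only for illustration.
    return np.array([np.sin(x[0]) + x[0] * x[1], np.exp(x[1]) - x[0] ** 2])

def jacobian(f, x0, eps=1e-6):
    # Finite-difference approximation of the Jacobian matrix of f at x0.
    m, n = f(x0).size, x0.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(x0 + e) - f(x0 - e)) / (2 * eps)
    return J

x0 = np.array([0.3, -0.2])
J = jacobian(f, x0)
dx = np.array([1e-2, -2e-2])
# First-order (affine) approximation: f(x0 + dx) is close to f(x0) + J dx.
print(f(x0 + dx) - (f(x0) + J @ dx))        # residual of order ||dx||^2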

See also:

• Navigation by range measurement.


• State-space models of dynamical systems.

15.

MATRIX NORMS

• Motivating example
• RMS gain: the Frobenius norm
• Peak gain: the largest singular value norm
• Applications

15.1 Motivating example: effect of noise in a linear system
We saw how a matrix (say, ) induces, via the matrix-vector product, a linear map .
Here, is an input vector and is the output. The mapping (that is, ) could represent a linear amplifier with an audio signal as input and another audio signal as output.

Now, assume that there is some noise in the vector : the actual input is , where is an error
vector. This implies that there will be noise in the output as well: the noisy output is so the error
on the output due to noise is . How could we quantify the effect of input noise on the output noise?

One approach is to try to measure the norm of the error vector, . Obviously, this norm depends on the
noise , which we do not know. So we will assume that can take values in a set. We need to come up with a
single number that captures in some way the different values of when spans that set. Since scaling
simply scales the norm accordingly, we will restrict the vectors to have a certain norm, say
.

Clearly, depending on the choice of the set, the norms we use to measure vector lengths, and how we choose to capture many numbers with one, we will obtain different numbers.

15.2 RMS gain: the Frobenius norm


Let us first assume that the noise vector can take a finite set of directions, specifically the directions
represented by the standard basis, . Then let us look at the average of the squared error norm:

where stands for the -th column of . The quantity above can be written as , where

is the Frobenius norm of .

The function turns out to satisfy the basic conditions of a norm in the matrix space .
In fact, it is the Euclidean norm of the vector of length formed with all the coefficients of . Further,
the quantity would remain the same if we had chosen any orthonormal basis other than the standard one.

The Frobenius norm is useful to measure the RMS (root-mean-square) gain of the matrix, and its average
response along given mutually orthogonal directions in space. Clearly, this approach does not capture well
the variance of the error, only the average effect of noise.

The computation of the Frobenius norm is very easy: it requires about flops.

15.3 Peak gain: the largest singular value norm


To try to capture the variance of the output noise, we may take a worst-case approach.

Let us assume that the noise vector is bounded but otherwise unknown. Specifically, all we know about
is that , where is the maximum amount of noise (measured in Euclidean norm). What is then
the worst-case (peak) value of the norm of the output noise? This is answered by the optimization problem

The quantity

measures the peak gain of the mapping , in the sense that if the noise vector is bounded in norm by
, then the output noise is bounded in norm by . Any vector which achieves the maximum above
corresponds to a direction in input space that is maximally amplified by the mapping .

The quantity is indeed a matrix norm, called the largest singular value (LSV) norm, for reasons
seen here. It is perhaps the most popular matrix norm.

The computation of the largest singular value norm of a matrix is not as easy as with the Frobenius norm.
However, it can be computed with linear algebra methods seen here, in about flops. While

it is more expensive to compute than the Frobenius norm, it is also more useful because it goes beyond
capturing the average response to noise.
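Both norms are available in standard numerical software. A small NumPy sketch (an assumed software choice; the matrix is random) illustrates the RMS-gain and peak-gain interpretations:

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 6))

fro = np.linalg.norm(A, 'fro')              # Frobenius norm
lsv = np.linalg.norm(A, 2)                  # largest singular value norm

# RMS gain over the standard basis: average of the squared column norms.
rms_gain = np.sqrt(np.mean(np.sum(A ** 2, axis=0)))
print(np.isclose(fro, np.sqrt(A.shape[1]) * rms_gain))        # True

# Peak gain: no unit-norm input is amplified beyond the LSV norm.
X = rng.standard_normal((6, 10000))
X /= np.linalg.norm(X, axis=0)
print(np.max(np.linalg.norm(A @ X, axis=0)) <= lsv + 1e-12)   # True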

Other norms
Many other matrix norms are possible, and sometimes useful. In particular, we can generalize the notion of
peak norm by using different norms to measure vector size in the input and output spaces. For example, the
quantity

measures the peak gain with inputs bounded in the maximum norm, and outputs measured with the
-norm.

The norms we have just introduced, the Frobenius and largest singular value norms, are the most popular
ones and are easy to compute. Many other norms are hard to compute.

15.4 Applications

Distance between matrices


Matrix norms are ways to measure the size of a matrix. This allows quantifying the difference between
matrices.

Assume for example that we are trying to estimate a matrix , and came up with an estimate . How can
we measure the quality of our estimate? One way is to evaluate by how much they differ when they act on a
standard basis. This leads to the Frobenius norm.

Another way is to look at the difference in the output:

when runs over the whole space. Clearly, we need to scale, or limit the size of, ; otherwise, the difference above
may be arbitrarily big. Let’s look at the worst-case difference when satisfies . We obtain

which is the largest singular value norm of the difference .



The direction of maximal variance


Consider a data set described as a collection of vectors , with . We can gather this data
set in a single matrix . For simplicity, let us assume that the average vector is
zero:

Let us try to visualize the data set by projecting it on a single line passing through the origin. The line is thus
defined by a vector , which we can without loss of generality assume to be of Euclidean norm .
The data points, when projected on the line, are turned into real numbers .

It can be argued that a good line to project data on is one which spreads the numbers as much as
possible. (If all the data points are projected to numbers that are very close, we will not see anything, as all
data points will collapse to close locations.)

We can find a direction in space that accomplishes this, as follows. The average of the numbers is

while their variance is

The direction of maximal variance is found by computing the LSV norm of

(It turns out that this quantity is the same as the LSV norm of itself.)
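A minimal sketch of this idea (NumPy; synthetic data stretched along a known direction, purely for illustration) finds the direction of maximal variance as the top right singular vector of the centered data matrix:

import numpy as np

rng = np.random.default_rng(5)
true_dir = np.array([3.0, 1.0, 0.5]); true_dir /= np.linalg.norm(true_dir)
X = 5 * rng.standard_normal((200, 1)) * true_dir + rng.standard_normal((200, 3))

Xc = X - X.mean(axis=0)                     # center the data
# The maximizer of ||Xc u|| over unit-norm u is the top right singular vector.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
u = Vt[0]
print(np.abs(u @ true_dir))                 # close to 1: the direction is recovered
print([float(np.var(Xc @ v)) for v in Vt])  # projected variances, in decreasing order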

16.

APPLICATIONS

• State-space models of linear dynamical systems.

16.1 State-space models of linear dynamical systems

Definition
Many discrete-time dynamical systems can be modeled via linear state-space equations, of the form

where is the state, which encapsulates the state of the system at time ; contains the control variables; and contains specific outputs of interest. The matrices

are of appropriate dimensions to ensure compatibility of the matrix multiplications.

In effect, a linear dynamical model postulates that the state at the next instant is a linear function of the state
at past instants, and possibly other ‘‘exogenous’’ inputs; and that the output is a linear function of the state
and input vectors.

A continuous-time model would take the form of a differential equation

Finally, the so-called time-varying models involve time-varying matrices (see an example
below).
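As a minimal sketch of such a model (Python/NumPy; the matrices and the input signal are illustrative, not taken from the text), a discrete-time linear system can be simulated by iterating the state equation:

import numpy as np

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])                  # state matrix (illustrative values)
B = np.array([[0.0],
              [1.0]])                       # input matrix
C = np.array([[1.0, 0.0]])                  # output matrix

x = np.zeros(2)                             # initial state
outputs = []
for t in range(50):
    u = np.array([1.0 if t < 10 else 0.0])  # a simple pulse input
    outputs.append(float(C @ x))            # y_t = C x_t
    x = A @ x + B @ u                       # x_{t+1} = A x_t + B u_t
print(outputs[:5], outputs[-1])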

Motivation
The main motivation for state-space models is to be able to model high-order derivatives in dynamical
equations, using only first-order derivatives, but involving vectors instead of scalar quantities.

Consider, for instance, the second-order differential equation:

which captures the dynamics of a damped mass-spring system. In this equation:

: the mass of the object attached to the spring.
: the damping coefficient.
: the spring constant.
: any external force applied to the mass.
: the vertical displacement of the mass from its equilibrium position.

The above involves second-order derivatives of a scalar function . We can express it in an equivalent form
involving only first-order derivatives, by defining the state vector to be

The price we pay is that now we deal with a vector equation instead of a scalar equation:

The position is a linear function of the state, via the relation , where

A nonlinear system
For non-linear systems, we can also use state-space representations. For autonomous systems (no external input), for example, these come in the form

where is now a non-linear map. Now assume we want to model the behavior of the
system near an equilibrium point (such that ). Let us assume for simplicity that .

Using the first-order approximation of the map , we can write a linear approximation to the above model:

where

The pendulum’s motion is governed by the nonlinear equation

, where is the angular displacement from the vertical position and the dot denotes time differentiation.

To understand the dynamics near and , we linearize this equation using the first-order Taylor series expansion around a point :

For , setting and , we find and . This gives the approximation , resulting in the simplified equation

Similarly, for , setting gives . The linear approximation here is , leading to

This linearization elucidates the pendulum’s unstable dynamics at , helping to predict large responses to small disturbances.
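A tiny numerical check of the two linearizations (NumPy sketch, with the physical constants scaled out for simplicity):

import numpy as np

delta = np.linspace(-0.3, 0.3, 7)           # small deviations from the equilibrium

# Near the downward position: sin(delta) is close to delta (stable equilibrium).
print(np.max(np.abs(np.sin(delta) - delta)))           # small, of order delta^3

# Near the upright position: sin(pi + delta) = -sin(delta), approximately -delta.
print(np.max(np.abs(np.sin(np.pi + delta) + delta)))   # also small, of order delta^3
# The sign flip in the linearized term is what makes the upright equilibrium unstable.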

17.

EXERCISES

• Matrix products
• Special matrices
• Linear maps, dynamical systems
• Matrix inverses, norms

17.1. Matrix products


Let and be two maps. Let be the composite map
, with values for . Show that the derivatives of can be
expressed via a matrix-matrix product, as , where the Jacobian matrix
of at is defined as the matrix with element .

17.2 Special matrices


A matrix is a permutation matrix if it is a permutation of the columns of the
identity matrix.

a. For a matrix , we consider the products and . Describe in simple terms what these matrices look like with respect to the original matrix .

b. Show that is orthogonal.

c. Show that .

17.3. Linear maps, dynamical systems


1. Let be a linear map. Show how to compute the (unique) matrix such that
for every , in terms of the values of at appropriate vectors, which you will
determine.

2. Consider a discrete-time linear dynamical system (for background, see here) with state ,
input vector , and output vector , that is described by the linear equations

with , and given matrices.

a. Assuming that the system has initial condition , express the output vector at
time as a linear function of ; that is, determine a matrix such that
, where is a vector containing all the
inputs up to and including at time .

b. What is the interpretation of the range of ?

17.4 Matrix inverses, norms


1. Show that a square matrix is invertible if and only if its determinant is non-zero. You can use
the fact that the determinant of a product is a product of the determinant, together with the QR
decomposition of the matrix .

2. Let , and let . Show that


where denotes the largest singular value norm of its matrix argument.

PART III
LINEAR EQUATIONS

Linear equations have been around for thousands of years. The picture on the left shows a 17th-century
Chinese text that explains the ancient art of fangcheng (‘‘rectangular arrays’’, for more details see here).
Linear equations arise naturally in many areas of engineering, often as simple models of more complicated,
non-linear equations. They form the core of linear algebra, and often arise as constraints in optimization
problems. They are also an important building block of optimization methods, as many optimization
algorithms rely on linear equations.

Linear equations can be expressed as , where is the unknown, is a given vector,


and is a matrix. The set of solutions is an affine set. In turn, any affine set can be described as
the set of solutions to an affine equation.

The issue of existence and unicity of solutions leads to important notions attached to the associated matrix
. The nullspace, which contains the input vectors that are crushed to zero by the associated linear map
; the range, which contains the set of output vectors that is attainable by the linear map; and its
dimension, the rank. There is a variety of solution methods for linear equations; we describe how the QR
decomposition of a matrix can be used in this context.

Outline

• Motivating example
• Existence and unicity of solutions
• Solving linear equations
• Applications
• Exercises

18.

MOTIVATING EXAMPLE

• Overview
• From 1D to 2D: axial tomography
• Linear equations for a single slice
• Issues raised: finding a solution, existence and unicity

18.1. Overview
Tomography means reconstruction of an image from its sections. The word comes from the Greek ‘‘tomos’’
(‘‘slice’’) and ‘‘graph’’ (‘‘description’’). The problem arises in many fields, ranging from astronomy to
medical imaging.

Computerized Axial Tomography (CAT) is a medical imaging method that processes large amounts of two-
dimensional X-ray images in order to produce a three-dimensional image. The goal is to picture for example
the tissue density of the different parts of the brain, in order to detect anomalies (such as brain tumors).

Typically, the X-ray images represent ‘‘slices’’ of the part of the body (such as the brain) that is examined.
Those slices are indirectly obtained via axial measurements of X ray attenuation, as explained below. Thus,
in CAT for medical imaging, we use axial (line) measurements to get two-dimensional images (slices), and
from that scan of images, we may proceed to digitally reconstruct a three-dimensional view. Here, we focus
on the process that produces a single two-dimensional image from axial measurements.

A collection of ‘‘slices’’ of a human brain obtained by CAT scan. The pictures offer an image of the density of tissue in the various parts of the brain. Each slice is actually a reconstructed image obtained by a tomography technique explained below. The collection of slices can in turn be used to form a full three-dimensional representation of the brain.

Source: Wikipedia entry.



18.2 From 1D to 2D: axial tomography


In CAT-based medical imaging, a number of X rays are sent through the tissues to be examined along
different directions, and their intensity after they have traversed the tissues is captured by a camera. For each
direction, we record the attenuation of the X ray, by comparing the intensity of the X ray at the source,
, to the intensity after the X ray has traversed the tissues, at the receiver’s end, .

A single slice obtained by CAT scan. Each slice is a reconstructed image obtained by recording the attenuation of X-rays through the tissues along a vast number of directions. The picture shows the X-ray source and the corresponding receiver used to measure attenuation along a specific direction.

18.3. Linear equations for a single slice


Similar to the Beer-Lambert law of optics, it turns out that, to a reasonable degree of approximation, the log-
ratio of the intensities at the source and at the receiver is linear in the densities of the tissues traversed.

We consider a discretized version of a square slice image, with


pixels of gray scale values. On the example pictured here, we simply have
, and a total of four pixels. Each pixel can be represented by an
index pair , with . The gray scale values represent
the density of the tissues, with the density at pixel .

With the discretization, the linear relationship between intensities log-ratios and densities can be expressed
as

where denotes the indices of pixel areas traversed by the X ray, the density in the area, and
the proportion of the area within the pixel that is traversed by the ray.

Linear relationship between the observed log-intensity ratios and (unobserved) densities within the four pixels. The slanted arrow corresponds to an X-ray which traverses about of the area of pixels and , and of the areas of the pixels and .

Thus, we can relate the vector to the observed intensity log-ratio vector in terms of a
linear equation

where , with . Note that depending on the number of pixels used, and the number of
measurements, the matrix can be quite large. In general, the matrix is wide, in the sense that it has (many)
more columns than rows ( ). Thus, the above system of equations is usually underdetermined.

In the example pictured above, we have



18.4. Issues
The above example motivates us to address the problem of solving linear equations. It also raises the issues
of existence (do we have enough measurements to find the densities?) and unicity (if a solution exists, is it
unique?).

19.

EXISTENCE AND UNICITY OF SOLUTIONS

• Set of solutions to a linear equation


• Existence: the range and rank of a matrix
• Unicity: the nullspace and nullity of a matrix
• Fundamental facts about range and nullspace

19.1. Set of solutions


Consider the linear equation in :

where and are given, and is the variable.

The set of solutions to the above equation, if it is not empty, is an affine subspace. That is, it is of the form
where is a subspace.

We’d like to be able to

• determine if a solution exists;


• if so, determine if it is unique;
• compute a solution if one exists;
• find an orthonormal basis of the subspace .

19.2. Existence: range and rank of a matrix

Range
The range (or, image) of a matrix is defined as the following subset of :

The range describes the vectors that can be attained in the output space by an arbitrary choice of a
vector , in the input space. The range is simply the span of the columns of .

If , we say that the linear equation is infeasible. The set of solutions to the linear
equation is empty.

From a matrix it is possible to find a matrix, the columns of which span the range of the matrix , and are
mutually orthogonal. Hence, , where is the dimension of the range. One algorithm to obtain
the matrix is the Gram-Schmidt procedure.

Example: An infeasible linear system.

Rank
The dimension of the range is called the rank of the matrix. As we will see later, the rank cannot exceed
any one of the dimensions of the matrix . A matrix is said to be full rank if
.

Note that the rank is a very ‘‘brittle’’ notion, in that small changes in the entries of the matrix can
dramatically change its rank. Random matrices are full rank. We will develop here a better, more numerically
reliable notion.

Example 1: Range and rank of a simple matrix.


Let’s consider the matrix

Range: The columns of are

Any linear combination of these vectors can be represented as , where . For our matrix , the range can
be visually represented as the plane spanned by and .

Rank: The rank of a matrix is the dimension of its range. For our matrix , since both column vectors are linearly
independent, the rank is:

Thus, the matrix is of full rank.
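In practice, the rank and an orthonormal basis of the range can be computed numerically; a short sketch (NumPy, on an illustrative matrix rather than the one of the example):

import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])

print(np.linalg.matrix_rank(A))             # 2: the columns are independent

# An orthonormal basis of the range: the Q factor of a thin QR decomposition.
Q, _ = np.linalg.qr(A)
print(Q.shape)                              # (3, 2): two orthonormal columns spanning the range
print(np.allclose(Q.T @ Q, np.eye(2)))      # True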

See also:

• Rank-one matrices.
• Rank properties of the arc-node incidence matrix.

Full row rank matrices


The matrix is said to be full row rank (or, onto) if the range is the whole output space, . The name
‘‘full row rank’’ comes from the fact that the rank equals the row dimension of . Since the rank never exceeds the smaller of the number of columns and rows, a matrix of full row rank necessarily has no more rows than columns (that is, ).

An equivalent condition for to be full row rank is that the square, matrix is invertible,
meaning that it has full rank, .

Proof.

19.3. Unicity: nullspace of a matrix

Nullspace
The nullspace (or, kernel) of a matrix is the following subspace of :

The nullspace describes the ambiguity in given : any will be such that
, so cannot be determined by the sole knowledge of if the nullspace is not reduced to
the singleton .

From a matrix we can obtain a matrix, the columns of which span the nullspace of the matrix , and are
mutually orthogonal. Hence, , where is the dimension of the nullspace.

Example 2: Nullspace of a simple matrix.

Consider the matrix

The nullspace, , is defined as

Given the matrix structure, for any vector such that the first component is times the second component and
the third component can be arbitrary, .

For example, the vector

satisfies and is thus in the nullspace of . The dimension of this nullspace, , is 2 (since we have two free
variables).
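Numerically, an orthonormal basis of the nullspace can be obtained directly; a short sketch (SciPy, on an illustrative matrix rather than the one above):

import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, -2.0, 0.0]])            # a 1x3 matrix of rank 1

N = null_space(A)                           # columns: an orthonormal basis of the nullspace
print(N.shape[1])                           # 2 = nullity; rank-nullity: 1 + 2 = 3 columns
print(np.allclose(A @ N, 0))                # True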

Nullity
The nullity of a matrix is the dimension of the nullspace. The rank-nullity theorem states that the nullity
of a matrix is , where is the rank of .

Full column rank matrices


The matrix is said to be full column rank (or, one-to-one) if its nullspace is the singleton . In this
case, if we denote by the columns of , the equation

has as the unique solution. Hence, is one-to-one if and only if its columns are independent. Since the rank never exceeds the smaller of the number of columns and rows, a matrix of full column rank necessarily has no more columns than rows (that is, ).

The term ‘‘one-to-one’’ comes from the fact that for such matrices, the condition uniquely
determines , since and implies , so that the solution is unique:

. The name ‘‘full column rank’’ comes from the fact that the rank equals the column dimension of
.

An equivalent condition for to be full column rank is that the square, matrix is invertible,
meaning that it has full rank, .

Proof

Example: Nullspace of a transpose incidence matrix.

19.4. Fundamental facts


Two important results about the nullspace and range of a matrix.

Rank-nullity theorem

The nullity (dimension of the nullspace) and the rank (dimension of the range) of a matrix add up to the
column dimension of .

Proof.

Another important result involves the definition of the orthogonal complement of a subspace.

Fundamental theorem of linear algebra

The range of a matrix is the orthogonal complement of the nullspace of its transpose. That is, for a matrix
:

Proof.

The figure provides a sketch of the proof: consider a matrix, and denote by its rows, so that

Then if and only if . In words: is in the nullspace of if and only if it is orthogonal to the vectors . But those two vectors span the range of , hence is orthogonal to any element in the range.

20.

SOLVING LINEAR EQUATIONS VIA QR DECOMPOSITION

• Basic idea
• The QR decomposition of a matrix
• Solution via full QR decomposition
• Set of solutions

20.1. Basic idea: reduction to triangular systems of equations
Consider the problem of solving a system of linear equations , where and
are given.

The basic idea in the solution algorithm starts with the observation that in the special case when is upper
triangular, that is, if , then the system can be easily solved by a process known as backward
substitution. In backward substitution we simply start solving the system by eliminating the last variable first,
then proceed to solve backwards. The process is illustrated in this example, and described in generality here.

20.2. The QR decomposition of a matrix


The QR decomposition allows us to express any matrix as the product where is and orthogonal (that is, ) and is upper triangular. For more details on
this, see here.

Once the QR factorization of is obtained, we can solve the system by first pre-multiplying both sides of the equation with :

This is due to the fact that . The new system is triangular and can be solved by backward substitution. For example, if is full column rank, then is invertible, so that the solution is
unique, and given by .
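Before detailing the general process, here is a minimal sketch of this QR-based solution for a full-column-rank system (Python/SciPy, with randomly generated, consistent data):

import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 4))             # tall, full column rank (generically)
x_true = rng.standard_normal(4)
b = A @ x_true                              # a consistent right-hand side

Q, R = qr(A, mode='economic')               # thin QR: Q is 6x4, R is 4x4
x = solve_triangular(R, Q.T @ b)            # backward substitution on R x = Q^T b
print(np.allclose(x, x_true))               # True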

Let us detail the process now.

20.3. Using the full QR decomposition


We start with the full QR decomposition of A with column permutations:

where

• is and orthogonal ( );
• is , with orthonormal columns ( );
• is , with orthonormal columns ( );
• is the rank of ;
• is upper triangular, and invertible;
• is a matrix;
• is a permutation matrix (thus, ).
• The zero submatrices in the bottom (block) row of have rows.

Using , we can write , where . Let’s look at the equation in in expanded form:

We see that unless , there is no solution. Let us assume that . We have then

which is a set of linear equations in variables.

A particular solution is obtained upon setting , which leads to a triangular system in , with an
invertible triangular matrix . So that , which corresponds to a particular solution to
:

20.4. Set of solutions


We can also generate all the solutions, by noting that is a free variable. We have

where

The set of solutions is the affine set



21.

APPLICATIONS

• Trilateration by distance measurements.


• Estimation of traffic flow.

21.1. Trilateration by distance measurements

In many applications such as GPS it is of interest to infer the location of an emitter (for example, a cell phone) from the measurement of distances to known points. These distances are obtained by estimating the differences in time of arrival of a wave front originating from the emitter. In trilateration, only three points are used. We then have to find the intersection of three spheres. The problem can then be reduced to solving a linear equation followed by a quadratic equation in one variable. The multilateration problem, which allows for more than three points, provides more accurate measurements. (Source.)

Denote by , the three known points and by the measured distances to the emitter.
Mathematically the problem is to solve, for a point in , the equations

We write them out:

Let . The equations above imply that

Using matrix notation, with the matrix of points, and the vector of ones:

Let us assume that the square matrix is full-rank, that is, invertible. The equation above implies that

In words: the point lies on a line passing through and with direction .

We can then solve the equation in

This equation is quadratic in :

and can be solved in closed-form. The spheres intersect if and only if there is a real, non-negative solution
. Generically, if the spheres have a non-empty intersection, there are two positive solutions, hence two
points in the intersection. This is understandable geometrically: the intersection of two spheres is a circle,
and intersecting a circle with a third sphere produces two points. The line joining the two points is the line
, as identified above.
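A planar analogue makes the linear-algebra step concrete (a NumPy sketch with made-up anchor points, not the book’s three-dimensional derivation): subtracting the first distance equation from the others removes the quadratic term and leaves a linear system for the position.

import numpy as np

anchors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])   # known points (illustrative)
x_true = np.array([1.0, 2.0])
d = np.linalg.norm(anchors - x_true, axis=1)                # measured distances

# ||x - a_i||^2 = d_i^2; subtracting the first equation from the others is linear in x.
A = 2 * (anchors[1:] - anchors[0])
b = d[0] ** 2 - d[1:] ** 2 + np.sum(anchors[1:] ** 2, axis=1) - np.sum(anchors[0] ** 2)
x = np.linalg.solve(A, b)
print(x)                                    # recovers [1. 2.]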

21.2. Estimation of traffic flow

[Figure: a road network with four unknown link flows x1, x2, x3, x4 and measured flows 450, 400, 610, 640, 520, 600 on the neighboring links.]

The basic traffic flow estimation problem involves inferring the number of cars going through links based on information on the number of cars passing through neighboring links.

For the simple problem above, we simply use the fact that at each intersection, the incoming traffic has to
match the outgoing traffic. This leads to the linear equations:

We can write this in matrix format: with

The matrix is nothing else than the incidence matrix associated with the graph that has the intersections
as nodes and links as edges.

22.

EXERCISES

22.1 Nullspace, rank and range


1. Determine the nullspace, range and rank of a matrix of the form

where , with , and . In the above, the zeroes are in fact matrices of zeroes with appropriate sizes.

2. Consider the matrix with .

a. What is the size of ?

b. Determine the nullspace, the range, and the rank of .



PART IV
LEAST-SQUARES

The ordinary least-squares (OLS) problem is a particularly simple optimization problem that involves the
minimization of the Euclidean norm of a ‘‘residual error” vector that is affine in the decision variables.
The problem is one of the most ubiquitous optimization problems in engineering and applied sciences. It
can be used for example to fit a straight line through points, as in the figure on the left. The least-squares
approach then amounts to minimize the sum of the area of the squares with side-length equal to the vertical
distances to the line.

We discuss a few variants amenable to the linear algebra approach: regularized least-squares, linearly-
constrained least-squares. We also explain how to use ‘‘kernels’’ to handle problems involving non-linear
curve fitting and prediction using non-linear functions.

Outline

• Ordinary least-squares
• Variants of the least-squares problem
• Kernels for least-squares
• Applications
• Exercises

23.

ORDINARY LEAST-SQUARES

• Definition
• Interpretations
• Solution via QR decomposition (full rank case)
• Optimal solution (general case)

23.1. Definition
The Ordinary Least-Squares (OLS, or LS) problem is defined as

where are given. Together, the pair is referred to as the problem data. The
vector is often referred to as the ‘‘measurement” or “output” vector, and the data matrix as the ‘‘design‘‘
or ‘‘input‘‘ matrix. The vector is referred to as the residual error vector.

Note that the problem is equivalent to one where the norm is not squared. Taking the squares is done for
the convenience of the solution.

23.2. Interpretations

Interpretation as projection on the range

We can interpret the problem in terms of the columns of , as follows. Assume that , where is the -th column of . The problem reads

In this sense, we are trying to find the best approximation of in terms of a linear combination of the columns of . Thus, the OLS problem amounts to projecting (finding the minimum Euclidean distance of) the vector on the span of the vectors ‘s (that is to say, the range of ).

As seen in the picture, at optimum the residual vector is orthogonal to the range of .

See also: Image compression via least-squares.

Interpretation as minimum distance to feasibility


The OLS problem is usually applied to problems where the linear equation is not feasible, that is, there is no
solution to .

The OLS can be interpreted as finding the smallest (in Euclidean norm sense) perturbation of the right-hand
side, , such that the linear equation

becomes feasible. In this sense, the OLS formulation implicitly assumes that the data matrix of the
problem is known exactly, while only the right-hand side is subject to perturbation, or measurement errors.
A more elaborate model, total least-squares, takes into account errors in both and .

Interpretation as regression
We can also interpret the problem in terms of the rows of , as follows. Assume that
, where is the -th row of . The problem reads

In this sense, we are trying to fit each component of as a linear combination of the corresponding input
, with as the coefficients of this linear combination.

See also:

• Linear regression.
• Auto-regressive models for time series prediction.
• Power law model fitting.

23.3. Solution via QR decomposition (full rank case)


Assume that the matrix is tall ( ) and full column rank. Then the solution to the
problem is unique and given by

This can be seen by simply taking the gradient (vector of derivatives) of the objective function, which leads
to the optimality condition . Geometrically, the residual vector is orthogonal
to the span of the columns of , as seen in the picture above.

We can also prove this via the QR decomposition of the matrix with a matrix
with orthonormal columns ( ) and an upper-triangular, invertible matrix. Noting that

and exploiting the fact that is invertible, we obtain the optimal solution . This is the
same as the formula above, since

Thus, to find the solution based on the QR decomposition, we just need to implement two steps:

1. Rotate the output vector: set .


2. Solve the triangular system by backward substitution.
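A minimal sketch of these two steps (Python/SciPy; the design matrix and measurements are randomly generated), checked against a library least-squares solver:

import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 3))            # tall, full column rank design matrix
y = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

Q, R = qr(A, mode='economic')
x_qr = solve_triangular(R, Q.T @ y)         # rotate the output, then back-substitute

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x_qr, x_ls))              # True: both give the unique LS solution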

23.4. Optimal solution and optimal set


Recall that the optimal set of a minimization problem is its set of minimizers. For least-squares problems,
the optimal set is an affine set, which reduces to a singleton when is full column rank.

In the general case ( is not necessarily tall, and/or not full rank), the solution may not be unique. If
is a particular solution, then is also a solution, if is such that , that is, .
That is, the nullspace of describes the ambiguity of solutions. In mathematical terms:

The formal expression for the set of minimizers to the least-squares problem can be found again via the QR
decomposition. This is shown here.

24.

VARIANTS OF THE LEAST-SQUARES PROBLEM

• Linearly-constrained least-squares
• Minimum-norm solutions to linear equations
• Regularized least-squares

24.1. Linearly constrained least-squares

Definition
An interesting variant of the ordinary least-squares problem involves equality constraints on the decision
variable :

where , and are given.

See also: Minimum-variance portfolio.

Solution
We can express the solution by first computing the null space of . Assuming that the feasible set of the
constrained LS problem is not empty, that is, is in the range of , this set can be expressed as

where is the dimension of the nullspace of , is a matrix whose columns span the nullspace of , and is a particular solution to the equation .

Expressing in terms of the free variable , we can write the constrained problem as an unconstrained one:

where , and .

24.2. Minimum-norm solution to linear equations


A special case of linearly constrained LS is

in which we implicitly assume that the linear equation in , has a solution, that is, is in the
range of .

The above problem allows selecting a particular solution to a linear equation, in the case when there are
possibly many, that is, the linear system is under-determined.

As seen here, when is full row rank, that is, the matrix is invertible, the above has the closed-form
solution

See also: Control positioning of a mass.

24.3. Regularized least-squares


In the case when the matrix in the OLS problem is not full column rank, the closed-form solution cannot
be applied. A remedy often used in practice is to transform the original problem into one where the full
column rank property holds.

The regularized least-squares problem has the form

where is a (usually small) parameter.

The regularized problem can be expressed as an ordinary least-squares problem, where the data matrix is full column rank. Indeed, the above problem can be written as the ordinary LS problem

where

The presence of the identity matrix in the matrix ensures that it is full (column) rank.

Solution
Since the data matrix in the regularized LS problem has full column rank, the formula seen here applies. The
solution is unique and given by

For , we recover the ordinary LS expression that is valid when the original data matrix is full rank.

The above formula explains one of the motivations for using regularized least-squares in the case of a rank-
deficient matrix : if , but is small, the above expression is still defined, even if is rank-deficient.
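The equivalence between the regularized problem and an augmented ordinary least-squares problem is easy to verify numerically; a sketch (NumPy, with an intentionally rank-deficient random matrix):

import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((20, 5)); A[:, 4] = A[:, 3]   # make A rank-deficient on purpose
y = rng.standard_normal(20)
lam = 1e-2                                            # regularization parameter

# Closed form: x = (A^T A + lam I)^{-1} A^T y.
x_reg = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)

# Equivalent augmented ordinary LS problem: stack A with sqrt(lam) * I.
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(5)])
y_aug = np.concatenate([y, np.zeros(5)])
x_aug, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)
print(np.allclose(x_reg, x_aug))            # True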

Weighted regularized least-squares


Sometimes, as in kernel methods, we are led to problems of the form

where is positive definite (that is, for every non-zero ). The solution is again unique and
given by

25.

KERNELS FOR LEAST-SQUARES

• Motivations
• The kernel trick
• Nonlinear case
• Examples of kernels
• Kernels in practice

25.1. Motivations
Consider a linear auto-regressive model for time-series, where is a linear function of

This can be written as , with the “feature vectors”

We can fit this model based on historical data via least-squares:

The associated prediction rule is

We can introduce a non-linear version, where is a quadratic function of

This can be written as , with the augmented feature vectors

Everything is the same as before, with replaced by .

It appears that the size of the least-squares problem grows quickly with the degree of the feature vectors. How can we solve such problems in a computationally efficient manner?

25.2. The kernel trick


We exploit a simple fact: in the least-squares problem

the optimal lies in the span of the data points :

for some vector . Indeed, from the fundamental theorem of linear algebra, every can be
written as the sum of two orthogonal vectors:

where , which means that is in the nullspace .

Hence the least-squares problem depends only on :

The prediction rule depends on the scalar products between new point and the data points :

Once is formed (this takes ), then the training problem has only variables. When , this
leads to a dramatic reduction in problem size.
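A numerical sketch of this reduction (NumPy; random data, with a small ridge term added only to keep both formulations well-posed) confirms that the optimum lies in the span of the data points and that predictions need only scalar products:

import numpy as np

rng = np.random.default_rng(9)
m, n = 30, 5
X = rng.standard_normal((m, n))             # rows are the data points
y = rng.standard_normal(m)
lam = 1e-1

# Primal (regularized) least-squares in the weight vector w.
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Kernel ("dual") form: w = X^T alpha, with alpha obtained from the m x m Gram matrix.
K = X @ X.T                                 # K[i, j] = x_i . x_j
alpha = np.linalg.solve(K + lam * np.eye(m), y)
print(np.allclose(w, X.T @ alpha))          # True

# A prediction at a new point uses only scalar products with the data points.
x_new = rng.standard_normal(n)
print(np.isclose(x_new @ w, (X @ x_new) @ alpha))   # True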

25.3. Nonlinear case


In the nonlinear case, we simply replace the feature vectors with some “augmented” feature vectors
, with a non-linear mapping.

This leads to the modified kernel matrix

The kernel function associated with mapping is

It provides information about the metric in the feature space, eg:



The computational effort involved in

1. solving the training problem;

2. making a prediction,

depends only on our ability to quickly evaluate such scalar products. We can’t choose arbitrarily; it has to
satisfy the above for some .

25.4. Examples of kernels


A variety of kernels are available. Some are adapted to the structure of data, for example, text or images. Here
are a few popular choices.

Polynomial kernels
Regression with quadratic functions involves feature vectors

In fact, given two vectors , we have

More generally when is the vector formed with all the products between the components of ,
up to degree , then for any two vectors ,

The computational effort grows linearly in .

This represents a dramatic reduction in computational effort over the ‘‘brute force’’ approach:

1. Form ;

2. evaluate . In the above approach, the computational effort grows as .
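For degree two, the identity behind this speed-up can be checked directly; the sketch below (NumPy) uses the particular monomial scaling under which the explicit feature map reproduces the kernel value (1 + x.z)^2:

import numpy as np
from itertools import combinations

def phi2(x):
    # Degree-2 feature map whose scalar product equals (1 + x.z)^2.
    sq2 = np.sqrt(2.0)
    cross = [sq2 * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], sq2 * x, x ** 2, cross])

rng = np.random.default_rng(10)
x, z = rng.standard_normal(4), rng.standard_normal(4)

# Kernel evaluation costs O(n); building the features costs O(n^2).
print(np.isclose(phi2(x) @ phi2(z), (1.0 + x @ z) ** 2))   # True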

Gaussian kernels
Gaussian kernel function:

where is a scale parameter. This allows ignoring points that are too far apart. It corresponds to a non-linear mapping to an infinite-dimensional feature space.

Other kernels
There is a large variety (a zoo?) of other kernels, some adapted to the structure of data (text, images, etc).

25.5. Kernels in practice


1. Kernels need to be chosen by the user.

2. The choice is not always obvious; Gaussian or polynomial kernels are popular.

3. We control over-fitting via cross-validation (to choose, say, the scale parameter of Gaussian kernel,
or degree of the polynomial kernel).

26.

APPLICATIONS

• Linear regression via least-squares.


• Auto-regressive (AR) models for time-series prediction.

26.1. Linear regression via least-squares


Linear regression is based on the idea of fitting a linear function through data points.

In its basic form, the problem is as follows. We are given data where
is the ‘‘input’’ and is the ‘‘output’’ for the th measurement. We seek to find a linear function
such that are collectively close to the corresponding values .

In least-squares regression, the way we evaluate how well a candidate function fits the data is via the
(squared) Euclidean norm:

Since a linear function has the form for some , the problem of minimizing the
above criterion takes the form

We can formulate this as a least-squares problem:

where

The linear regression approach can be extended to multiple dimensions, that is, to problems where the
output in the above problem contains more than one dimension (see here). It can also be extended to the
problem of fitting non-linear curves.

In this example, we seek to analyze how customers react to an increase in the price of a given item. We are given two-dimensional data points . The ‘s contain the prices of the item, and the ‘s the average number of customers who buy the item at that price.

The generic equation of a non-vertical line is , where contains the decision variables. The quality of the fit of a generic line is measured via the sum of the squares of the error in the component (blue dotted lines). Thus, the best least-squares fit is obtained via the least-squares problem

Once the line is found, it can be used to predict the value of the average number of customers buying the item ( ) for a new price ( ). The prediction is shown in red.

See also: The problem of Gauss.

26.2. Auto-regressive models for time-series prediction
A popular model for the prediction of time series is based on the so-called auto-regressive
model

where ‘s are constant coefficients, and is the ‘‘memory length’’ of the model. The interpretation of the
model is that the next output is a linear function of the past. Elaborate variants of auto-regressive models are
widely used for the prediction of time series arising in finance and economics.

To find the coefficient vector in , we collect observations (with ) of the time series, and try to minimize the total squared error in the above equation:

This can be expressed as a linear least-squares problem, with appropriate data .
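A minimal sketch of the procedure (NumPy; the series is simulated from made-up coefficients so that the fit can be checked):

import numpy as np

rng = np.random.default_rng(11)
T, p = 300, 3                               # series length and memory length
a_true = np.array([0.5, -0.2, 0.1])         # illustrative AR coefficients
y = np.zeros(T)
for t in range(p, T):
    y[t] = a_true @ y[t - p:t][::-1] + 0.05 * rng.standard_normal()

# Each row of the data matrix collects the p previous values of the series.
A = np.column_stack([y[p - 1 - k:T - 1 - k] for k in range(p)])
b = y[p:]
a_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
print(a_hat)                                # close to a_true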



27.

EXERCISES

• Standard forms
• Applications

27.1. Standard forms


Regularization for noisy data. Consider a least-squares problem

in which the data matrix is noisy. Our specific noise model assumes that each row
has the form , where the noise vector has zero mean and
covariance matrix , with a measure of the size of the noise. Therefore, now the matrix is a
function of the set of uncertain vectors , which we denote by . We will write
to denote the matrix with rows . We replace the original problem with

where denotes the expected value with respect to the random variable . Show that this problem
can be written as

where is some regularization parameter, which you will determine. That is, regularized least-
squares can be interpreted as a way to take into account uncertainties in the matrix , in the expected
value sense.

Hint: compute the expected value of , for a specific row index .



27.2. Applications

The figure shows the number of transistors in 13 microprocessors as a function of the year of their introduction. The plot suggests that we can obtain a good fit with a function of the form , where is the year, is the number of transistors at year , and and are model parameters. This model results in a straight line if we plot on a logarithmic scale versus on a linear scale, as is done in the figure.

1. Moore’s law describes a long-term trend in the history of computing hardware and states that
the number of transistors that can be placed inexpensively on an integrated circuit has doubled
approximately every two years. In this problem, we investigate the validity of the claim via least-
squares.

Using the problem data below:

Year Transistor
1971 2,250
1972 2,500
1974 5,000
1978 29,000
1982 120,000
1985 275,000
1989 1,180,000
1993 3,100,000
1997 7,500,000
1999 24,000,000
2000 42,000,000
2002 220,000,000
2003 410,000,000

show how to estimate the parameters using least-squares, that is, via a problem of the form

Make sure to define precisely the data and how the variable relates to the original problem parameters . (Use the notations for the number of processors, and for the corresponding years. You can assume that no component of is zero at optimum.)

a. Is the solution to the problem above unique? Justify carefully your answer, and give the
expression for the unique solution in terms of .

b. The solution to the problem yields . Is this estimate consistent with the so-called Moore’s law, which states that the number of transistors per integrated circuit roughly doubles every two years?

2. The Michaelis–Menten model for enzyme kinetics relates the rate of an enzymatic reaction, to
the concentration of a substrate, as follows:

where , are parameters.

a. Show that the model can be expressed as a linear relation between the values and .

b. Use this expression to fit the parameter using linear least-squares.

c. The above approach has been found to be quite sensitive to errors in input data. Can you
experimentally confirm this opinion?

Hint: generate noisy data from parameter values and .



PART V
EIGENVALUES FOR SYMMETRIC MATRICES

Symmetric matrices are square matrices with elements that mirror each other across the diagonal. They can be
used to describe for example graphs with undirected, weighted edges between the nodes; distance matrices
(between say cities), and a host of other applications. Symmetric matrices are also important in optimization,
as they are closely related to quadratic functions.

A fundamental theorem, the spectral theorem, shows that we can decompose any symmetric matrix as a
three-term product of matrices, involving an orthogonal transformation and a diagonal matrix. The theorem
has a direct implication for quadratic functions: it allows us to decompose any quadratic function into a
weighted sum of squared linear functions involving vectors that are mutually orthogonal. The weights are
called the eigenvalues of the symmetric matrix.

The spectral theorem allows us, in particular, to determine when a given quadratic function is ‘‘bowl-shaped’’,
that is, convex. The spectral theorem also allows finding directions of maximal variance within a data set.
Such directions are useful to visualize high-dimensional data points in two or three dimensions. This is the
basis of a visualization method known as principal component analysis (PCA).

Outline

• Quadratic functions and symmetric matrices


• Spectral theorem
• Positive semi-definite matrices
• Principal component analysis
• Applications
• Exercises

28.

QUADRATIC FUNCTIONS AND SYMMETRIC MATRICES

• Symmetric matrices and quadratic functions


• Second-order approximation of non-linear functions
• Special symmetric matrices

28.1. Symmetric matrices and quadratic functions

Symmetric matrices
A square matrix is symmetric if it is equal to its transpose. That is,

The set of symmetric matrices is denoted . This set is a subspace of .

Example 1: A symmetric matrix.

The matrix

is symmetric. The matrix

is not, since it is not equal to its transpose.

See also:

• Representation of a weighted, undirected graph.


• Laplacian matrix of a graph.
• Hessian of a function.
• Gram matrix of data points.

Quadratic functions
A function is said to be a quadratic function if it can be expressed as

for numbers , , and , . A quadratic function is thus an affine combination of the


‘s and all the ‘‘cross-products’’ . We observe that the coefficient of is .

The function is said to be a quadratic form if there are no linear or constant terms in it:

Note that the Hessian (matrix of second-derivatives) of a quadratic function is constant.

Examples:

• Quadratic functions of two variables.


• Hessian of a quadratic function.

Link between quadratic functions and symmetric matrices


There is a natural relationship between symmetric matrices and quadratic functions. Indeed, any quadratic
function can be written as

for an appropriate symmetric matrix , vector and scalar . Here:

• is the coefficient of in ;
• for , is the coefficient of the term in ;
• is the coefficient of the term ;
• is the constant term, .

If is a quadratic form, then , , and we can write where .

Examples: Two-dimensional example.

28.2. Second-order approximations of non-quadratic functions
We have seen how linear functions arise when one seeks a simple, linear approximation to a more
complicated non-linear function. Likewise, quadratic functions arise naturally when one seeks to
approximate a given non-quadratic function by a quadratic one.

One-dimensional case
If is a twice-differentiable function of a single variable, then the second-order approximation
(or, second-order Taylor expansion) of at a point is of the form

where is the first derivative, and the second derivative, of at . We observe that the
quadratic approximation has the same value, derivative, and second-derivative as , at .

Example 2: The figure shows a second-order approximation of the univariate function , with values

at the point (in red).

Multi-dimensional case
In multiple dimensions, we have a similar result. Let us approximate a twice-differentiable function

by a quadratic function , so that and coincide up to and including the second derivatives.

The function must be of the form

where , , and . Our condition that coincides with up to and including the second derivatives shows that we must have

where is the Hessian, and the gradient, of at .

Solving for we obtain the following result:

Second-order expansion of a function. The second-order approximation of a twice-differentiable function at a point is of the form

where is the gradient of at , and the symmetric matrix is the Hessian of at .

Example: Second-order expansion of the log-sum-exp function.
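As a concrete sketch of such an expansion, the following NumPy code (a software choice the text does not mandate) builds the second-order approximation of the log-sum-exp function, whose gradient is the softmax vector and whose Hessian is diag(s) - s s^T:

import numpy as np

def lse(x):
    return np.log(np.sum(np.exp(x)))

def lse_quad(x, x0):
    # Second-order Taylor expansion of log-sum-exp around x0.
    s = np.exp(x0) / np.sum(np.exp(x0))     # gradient: the softmax of x0
    H = np.diag(s) - np.outer(s, s)         # Hessian at x0
    d = x - x0
    return lse(x0) + s @ d + 0.5 * d @ H @ d

x0 = np.array([0.1, -0.3, 0.2])
d = 1e-2 * np.array([1.0, 2.0, -1.0])
print(lse(x0 + d) - lse_quad(x0 + d, x0))   # error of order ||d||^3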

28.3. Special symmetric matrices

Diagonal matrices
Perhaps the simplest special case of symmetric matrices is the class of diagonal matrices, which are non-zero
only on their diagonal.

If , we denote by , or for short, the (symmetric) diagonal


matrix with on its diagonal. Diagonal matrices correspond to quadratic functions of the form

Such functions do not have any ‘‘cross-terms’’ of the form with .

Example 3: A diagonal matrix and its associated quadratic form.


Define a diagonal matrix:

For the matrix above, the associated quadratic form is

Symmetric dyads
Another important class of symmetric matrices is that of the form , where . The matrix has
elements and is symmetric. Such matrices are called symmetric dyads. (If , then the dyad is
said to be normalized.)

Symmetric dyads correspond to quadratic functions that are simply squared linear forms:
.

Example: A squared linear form.



29.

SPECTRAL THEOREM

• Eigenvalues and eigenvectors of symmetric matrices


• Spectral theorem
• Rayleigh quotients

29.1. Eigenvalues and eigenvectors of symmetric matrices
Let be a square, symmetric matrix. A real scalar is said to be an eigenvalue of if there exists a
non-zero vector such that

The vector is then referred to as an eigenvector associated with the eigenvalue . The eigenvector is said
to be normalized if . In this case, we have

The interpretation of is that it defines a direction along which acts just like scalar multiplication. The amount of scaling is given by . (In German, the root ‘‘eigen’’ means ‘‘self’’ or ‘‘proper’’.)
of the matrix are characterized by the characteristic equation

where the notation refers to the determinant of its matrix argument. The function, defined by
, is a polynomial of degree called the characteristic polynomial.

From the fundamental theorem of algebra, any polynomial of degree has (possibly not distinct)
complex roots. For symmetric matrices, the eigenvalues are real, since when , and
is normalized.

29.2. Spectral theorem


An important result of linear algebra called the spectral theorem, or symmetric eigenvalue decomposition
(SED) theorem, states that for any symmetric matrix, there are exactly (possibly not distinct) eigenvalues,

and they are all real; further, that the associated eigenvectors can be chosen so as to form an orthonormal
basis. The result offers a simple way to decompose the symmetric matrix as a product of simple
transformations.

Theorem: Symmetric eigenvalue decomposition

We can decompose any symmetric matrix with the symmetric eigenvalue decomposition (SED)

where the matrix of is orthogonal (that is, ), and contains the eigenvectors of , while the diagonal matrix contains the eigenvalues of .

Here is a proof. The SED provides a decomposition of the matrix in simple terms, namely dyads.

We check that in the SED above, the scalars are the eigenvalues, and ‘s are associated eigenvectors, since

The eigenvalue decomposition of a symmetric matrix can be efficiently computed with standard software,
in time that grows proportionately to its dimension as .

Example: Eigenvalue decomposition of a symmetric matrix.
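Here is a small sketch of such a decomposition (NumPy, on an arbitrary symmetric matrix):

import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

lam, U = np.linalg.eigh(A)                  # eigenvalues (ascending) and orthonormal eigenvectors
print(np.allclose(U @ np.diag(lam) @ U.T, A))   # the SED reconstructs A
print(np.allclose(U.T @ U, np.eye(3)))          # U is orthogonal

# The SED as a sum of dyads: A = sum_i lam_i u_i u_i^T.
A_rebuilt = sum(l * np.outer(u, u) for l, u in zip(lam, U.T))
print(np.allclose(A_rebuilt, A))            # True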

29.3. Rayleigh quotients


Given a symmetric matrix , we can express the smallest and largest eigenvalues of , denoted and
respectively, in the so-called variational form

For proof, see here.

The term ‘‘variational’’ refers to the fact that the eigenvalues are given as optimal values of optimization
problems, which were referred to in the past as variational problems. Variational representations exist for all
the eigenvalues but are more complicated to state.

The interpretation of the above identities is that the largest and smallest eigenvalues are a measure of the
range of the quadratic function over the unit Euclidean ball. The quantities above can be
written as the minimum and maximum of the so-called Rayleigh quotient .

Historically, David Hilbert coined the term ‘‘spectrum’’ for the set of eigenvalues of a symmetric operator
(roughly, a matrix of infinite dimensions). The fact that for symmetric matrices, every eigenvalue lies in the
interval somewhat justifies the terminology.

Example: Largest singular value norm of a matrix.
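A quick numerical check of the variational bounds (NumPy; a random symmetric matrix and random test directions):

import numpy as np

rng = np.random.default_rng(12)
B = rng.standard_normal((4, 4)); A = (B + B.T) / 2   # a random symmetric matrix
lam = np.linalg.eigvalsh(A)                          # eigenvalues, in increasing order

X = rng.standard_normal((4, 5000))                   # random directions
rayleigh = np.einsum('ij,ij->j', X, A @ X) / np.einsum('ij,ij->j', X, X)
print(lam[0] <= rayleigh.min() and rayleigh.max() <= lam[-1])   # True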



30.

POSITIVE SEMI-DEFINITE MATRICES

• Definitions
• Special cases and examples
• Square root and Cholesky decomposition
• Ellipsoids

30.1. Definitions
For a given symmetric matrix , the associated quadratic form is the function with
values

• A symmetric matrix is said to be positive semi-definite (PSD, notation: ) if and only if the
associated quadratic form is non-negative everywhere:

• It is said to be positive definite (PD, notation: ) if the quadratic form is non-negative and
definite, that is, if and only if .

It turns out that a matrix is PSD if and only if the eigenvalues of are non-negative. Thus, we can check if
a form is PSD by computing the eigenvalue decomposition of the underlying symmetric matrix.
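
As an illustration (the matrices below are arbitrary), a numerical PSD test can be written by checking the sign of the smallest eigenvalue, up to a small tolerance.

import numpy as np

def is_psd(A, tol=1e-10):
    # The smallest eigenvalue of the symmetric matrix A must be (numerically) non-negative.
    return np.linalg.eigvalsh(A)[0] >= -tol

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # eigenvalues 1 and 3: PSD (in fact PD)
B = np.array([[1.0,  2.0], [ 2.0, 1.0]])   # eigenvalues -1 and 3: not PSD
print(is_psd(A), is_psd(B))                # True False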

Theorem: eigenvalues of PSD matrices

A quadratic form , with is non-negative (resp. positive-definite) if and only if every


eigenvalue of the symmetric matrix is non-negative (resp. positive).

Proof.

By definition, the PSD and PD properties are properties of the eigenvalues of the matrix only, not of the
eigenvectors. Also, if the matrix is PSD, then for every matrix with columns, the matrix
also is.

30.2. Special cases and examples

Symmetric dyads
Special cases of PSD matrices include symmetric dyads. Indeed, if for some vector , then
for every :

More generally if , then is PSD, since

Diagonal matrices
A diagonal matrix is PSD (resp. PD) if and only if all of its (diagonal) elements are non-negative (resp.
positive).

Examples of PSD matrices


• Covariance matrix.
• Laplacian matrix of a graph.
• Gram matrix of data points.

30.3. Square root and Cholesky decomposition


For PSD matrices, we can generalize the notion of the ordinary square root of a non-negative number. Indeed,
if is PSD, there exists a unique PSD matrix, denoted , such that . We can express this
matrix square root in terms of the SED of , as , where is obtained
from by taking the square root of its diagonal elements. If is PD, then so is its square root.

Any PSD matrix can be written as a product for an appropriate matrix . The decomposition is not unique, and the square root above is only one possible choice (the only PSD one). Another choice, in terms of
the SED of , is . If is positive-definite, then we can choose to be lower
triangular, and invertible. The decomposition is then known as the Cholesky decomposition of .
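
Here is a minimal sketch (with an arbitrary PD matrix) comparing the square root built from the SED with the Cholesky factor computed by NumPy; both factorizations reproduce the same matrix, but they are different factors.

import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])                  # a positive definite matrix

# Square root via the SED: take square roots of the eigenvalues.
eigvals, U = np.linalg.eigh(A)
A_half = U @ np.diag(np.sqrt(eigvals)) @ U.T
print(np.allclose(A_half @ A_half, A))      # True

# Cholesky factor: A = L L^T, with L lower triangular and invertible.
L = np.linalg.cholesky(A)
print(np.allclose(L @ L.T, A))              # True
print(np.allclose(A_half, L))               # False: the factorization is not unique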

30.4. Ellipsoids
There is a strong correspondence between ellipsoids and PSD matrices.

Definition
We define an ellipsoid to be an affine transformation of the unit ball for the Euclidean norm:

where is an arbitrary non-singular matrix. We can express the ellipsoid as

where is PD.

Geometric interpretation via SED


We can interpret the eigenvectors and associated eigenvalues of in terms of the geometrical properties
of the ellipsoid, as follows. Consider the SED of : , with and diagonal, with
diagonal elements positive. The SED of its inverse is . Let .
We can express the condition as

Now set , . The above can be written as : in -space, the ellipsoid


is simply a unit ball. In -space, the ellipsoid corresponds to scaling each -axis by the square roots of
the eigenvalues. The ellipsoid has principal axes parallel to the coordinate axes in -space. We then apply
a rotation and a translation to get the ellipsoid in the original -space. The rotation is determined by the
eigenvectors of , which are contained in the orthogonal matrix . Thus, the geometry of the ellipsoid
can be read from the SED of the PD matrix : the eigenvectors give the principal directions,
and the semi-axis lengths are the square roots of the eigenvalues.

The graph on the left shows a two-dimensional ellipsoid, together with the SED of the corresponding PD matrix. We check that the columns of the orthogonal factor determine the principal directions, and that the square roots of the eigenvalues give the semi-axis lengths.

The above shows in particular that an equivalent representation of an ellipsoid is

where is PD.

It is possible to define degenerate ellipsoids, which correspond to cases when the matrix in the above,
or its inverse , is degenerate. For example, cylinders or slabs (intersection of two parallel half-spaces) are
degenerate ellipsoids.
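
As an illustrative sketch (the PD matrix P below is a placeholder, and the parametrization {x : x^T P^{-1} x <= 1} is one common convention), the code reads the principal directions and semi-axis lengths off the SED and generates boundary points of the ellipsoid.

import numpy as np

P = np.array([[3.0, 1.0],
              [1.0, 2.0]])                  # symmetric positive definite

eigvals, U = np.linalg.eigh(P)              # SED: P = U diag(eigvals) U^T
semi_axes = np.sqrt(eigvals)                # semi-axis lengths
print("principal directions:\n", U)
print("semi-axis lengths:", semi_axes)

# Boundary points: image of the unit circle under z -> U diag(semi_axes) z,
# which satisfies x^T inv(P) x = 1.
theta = np.linspace(0.0, 2 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)])
boundary = U @ np.diag(semi_axes) @ circle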

31.

PRINCIPAL COMPONENT ANALYSIS

• Projection on a line via variance maximization


• Principal component analysis
• Explained variance

31.1. Projection on a line via variance maximization


Consider a data set of points , in . We can represent this data set as a
matrix , where each is a -vector. The goal of the variance maximization problem
is to find a direction such that the sample variance of the corresponding vector
is maximal.

Recall that when is normalized, the scalar is the component of along , that is, it corresponds to
the projection of on the line passing through and with direction .

Here, we seek a (normalized) direction such that the empirical variance of the projected values ,
, is large. If is the vector of averages of the ‘s, then the average of the projected values is
. Thus, the direction of maximal variance is one that solves the optimization problem

The above problem can be formulated as

where

is the sample covariance matrix of the data.

We have seen the above problem before, under the name of the Rayleigh quotient of a symmetric matrix.
Solving the problem entails simply finding an eigenvector of the covariance matrix that corresponds to
the largest eigenvalue.
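
For illustration, the sketch below (with synthetic data standing in for a real data set, and one common normalization of the covariance) computes the direction of maximal variance as the top eigenvector of the sample covariance matrix.

import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 200                                  # dimension and number of points (placeholders)
X = rng.standard_normal((n, m))                # data points are the columns of X

x_hat = X.mean(axis=1, keepdims=True)          # vector of averages
Xc = X - x_hat                                 # centered data
Sigma = Xc @ Xc.T / m                          # sample covariance (one common normalization)

eigvals, U = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
u = U[:, -1]                                   # eigenvector of the largest eigenvalue
scores = u @ Xc                                # projected (centered) values
print(eigvals[-1], scores.var())               # the maximal variance is attained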

Maximal variance direction for the Senate voting data.


This image shows the scores assigned to each Senator
along the direction of maximal variance,
, , with a
normalized eigenvector corresponding to the largest
eigenvalue of the covariance matrix . Republican Senators tend to score negatively, while many Democrats appear on the positive side (the signs themselves don’t matter here, since we could switch the sign of the direction; only the ordering is important). Hence the direction could be interpreted as revealing the party affiliation.
The two Senators that are in the opposite group
(especially Sen. Chaffee) have indeed sometimes voted
against their party.
Note that the largest absolute score obtained in this
plot is about 18 times bigger than that observed on
the projection in a random direction. This is consistent
with the fact that the current direction has a maximal
variance.

31.2. Principal component analysis

Main idea
The main idea behind principal component analysis is to first find a direction that corresponds to maximal
variance between the data points. The data is then projected on the hyperplane orthogonal to that direction.
We obtain a new data set and find a new direction of maximal variance. We may stop the process when we
have collected enough directions (say, three if we want to visualize the data in 3D).

It turns out that the directions found in this way are precisely the eigenvectors of the data’s covariance
matrix. The term principal components refers to the directions given by these eigenvectors. Mathematically,
the process thus amounts to finding the eigenvalue decomposition of a positive semi-definite matrix, the
covariance matrix of the data points.

Projection on a plane
The projection used to obtain, say, a two-dimensional view with the largest variance, is of the form
, where is a matrix that contains the eigenvectors corresponding to the first two
eigenvalues.

Two-dimensional projection of the Senate


voting matrix: This particular planar
projection uses the two eigenvectors
corresponding to the largest two eigenvalues of
the data’s covariance matrix. It seems to allow the Senators to be clustered along party lines and
is therefore more informative than, say, the
plane corresponding to the two
smallest eigenvalues.

31.3. Explained variance


The total variance in the data is defined as the sum of the variances of the individual components. This
quantity is simply the trace of the covariance matrix since the diagonal elements of the latter contain the
variances. If has the EVD , where contains the eigenvalues,
and an orthogonal matrix of eigenvectors, then the total variance can be expressed as the sum of all the
eigenvalues:

When we project the data on a two-dimensional plane corresponding to the eigenvectors associated
with the two largest eigenvalues , we get a new covariance matrix , where the
total variance of the projected data is

Hence, we can define the proportion of variance ‘‘explained’’ by the projected data as the ratio:

If the ratio is high, we can say that much of the variation in the data can be observed on the projected plane.
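
A short sketch (synthetic data again) of how the explained-variance ratio of a two-dimensional projection can be computed from the eigenvalues of the covariance matrix:

import numpy as np

rng = np.random.default_rng(1)
Xc = rng.standard_normal((5, 200))
Xc = Xc - Xc.mean(axis=1, keepdims=True)      # centered data, columns = points
Sigma = Xc @ Xc.T / Xc.shape[1]               # sample covariance matrix

eigvals = np.linalg.eigvalsh(Sigma)[::-1]     # eigenvalues, largest first
total_variance = eigvals.sum()                # equals the trace of Sigma
explained_2d = eigvals[:2].sum() / total_variance
print(f"variance explained by the top-2 plane: {explained_2d:.1%}")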

This image shows the eigenvalues of the


covariance matrix of the Senate voting
data, which contains the covariances between
the votes of each pair of Senators. Clearly, the
eigenvalues decrease very fast. In this case, the ratio of explained to total variance is very high, which indicates
that ‘‘most of the information’’ is contained in
the first eigenvalue. Since the corresponding
eigenvector almost corresponds to a perfect
‘‘party line’’, we can say that party affiliation
explains most of the variance in the Senate
voting data.

32.

APPLICATIONS: PCA OF SENATE VOTING


DATA

• Introduction
• Senate voting data and the visualization problem
• Projection on a line
• Projection on a plane
• Direction of maximal variance
• Principal component analysis
• Sparse PCA
• Sparse maximal variance problem

32.1 Introduction
In this case study, we take data from the votes on bills in the US Senate (2004-2006) and explore how we
can visualize the data by projecting it, first on a line then on a plane. We investigate how we can choose the
line or plane in a way that maximizes the variance in the result, via a principal component analysis method.
Finally, we examine how a variation on PCA that encourages sparsity of the projection directions allows us
to understand which bills are most responsible for the variance in the data.

32.2. Senate voting data and the visualization


problem

Data

The data consists of the votes of Senators in the US Senate (2004-2006), for a total
of bills. “Yay” (“Yes”) votes are represented as ‘s, “Nay” (“No”) as ‘s, and the other votes are
recorded as . (A number of complexities are ignored here, such as the possibility of pairing the votes.)

This data can be represented here as a ‘‘voting’’ matrix , with elements taken
from . Each column of the voting matrix , contains the votes of a single
Senator for all the bills; each row contains the votes of all Senators on a particular bill.

Senate voting matrix: “Nay” votes are in black, “Yay” ones in white, and the others in grey. The transpose
voting matrix is shown. The picture has many gray areas, as some Senators are replaced over time.
Simply plotting the raw data matrix is often not very informative.

Visualization Problem
We can try to visualize the data set, by projecting each data point (each row or column of the matrix) on (say)
a 1D-, 2D- or 3D-space. Each ‘‘view’’ corresponds to a particular projection, that is, a particular one-, two-
or three-dimensional subspace on which we choose to project the data. The visualization problem consists
of choosing an appropriate projection.

There are many ways to formulate the visualization problem, and none dominates the others. Here, we focus
on the basics of that problem.

32.3. Projecting on a line


To simplify, let us first consider the simple problem of representing the high-dimensional data set on a simple
line, using the method described here.

Scoring Senators
Specifically we would like to assign a single number, or ‘‘score’’, to each column of the matrix. We choose a
direction in , and a scalar in . This corresponds to the affine ‘‘scoring’’ function ,
which, to a generic column in of the data matrix, assigns the value

We thus obtain a vector of values in , with

It is often useful to center these values around zero. This can be done by choosing such that

that is: where

is the vector of sample averages across the columns of the matrix (that is, data points). The vector can be
interpreted as the ‘‘average response’’ across experiments.

The values of our scoring function can now be expressed as

In order to be able to compare the relative merits of different directions, we can assume, without loss of
generality, that the vector is normalized (so that ).

Centering data
It is convenient to work with the ‘‘centered’’ data matrix, which is

where is the vector of ones in .

We can compute the (row) vector scores using the simple matrix-vector product:

We can check that the average of the above row vector is zero:

Example: visualizing along random direction

Scores obtained with random direction: This image


shows the values of the projections of the Senators’ votes
(that is, with average across Senators removed)
on a (normalized) ‘‘random bill’’ direction. This
projection shows no particular obvious structure. Note
that the range of the data is much less than obtained with
the average bill shown above.
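
The sketch below reproduces the centering and scoring computation with synthetic stand-in data (the actual voting matrix is not used here, and the sizes are placeholders), for a normalized random direction; the centered scores average to zero by construction.

import numpy as np

rng = np.random.default_rng(0)
n, m = 542, 100                                 # bills x Senators (placeholder sizes)
X = rng.choice([-1.0, 0.0, 1.0], size=(n, m))   # stand-in voting matrix, columns = Senators

x_hat = X.mean(axis=1, keepdims=True)           # "average response" across Senators
Xc = X - x_hat @ np.ones((1, m))                # centered data matrix

u = rng.standard_normal(n)
u = u / np.linalg.norm(u)                       # normalized "random bill" direction

scores = u @ Xc                                 # one score per Senator
print(abs(scores.mean()) < 1e-10)               # True: the scores are centered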

32.4. Projecting on a plane


We can also try to project the data on a plane, which involves assigning two scores to each data point.

Scoring Map
This corresponds to the affine ‘‘scoring’’ map , which, to a generic column in of the
data matrix, assigns the two-dimensional value

where are two vectors, and two scalars, while , .

The affine map allows us to generate two-dimensional data points (instead of -dimensional)
, . As before, we can require that the be centered:

by choosing the vector to be such that , where is the ‘‘average response’’ defined
above. Our (centered) scoring map takes the form

We can encapsulate the scores in the matrix . The latter can be expressed as the
matrix-matrix product

with the centered data matrix defined above.

Clearly, depending on which plane we choose to project on, we get very different pictures. Some planes seem
to be more ‘‘informative’’ than others. We return to this issue here.

Two-dimensional projection of the Senate


voting matrix: This particular projection
does not seem to be very informative. Notice
in particular, the scale of the vertical axis. The
data is all but crushed to a line, and even on the
horizontal axis, the data does not show much
variation.

Two-dimensional projection of the Senate


voting matrix: This particular projection
seems to allow the Senators to be clustered along party lines, and is therefore more informative. We explain how to choose such a direction here.

32.5. Direction of maximal variance

Motivation
We have seen here how we can choose a direction in bill space, and then project the Senate voting data matrix
on that direction, in order to visualize the data along a single line. Clearly, depending on how we choose
the line, we will get very different pictures. Some show large variation in the data, others seem to offer a
narrower range, even if we take care to normalize the directions.

What could be a good criterion to choose the direction we project the data on?

It can be argued that a direction that results in large variations of the projected data is preferable to one with small variations. A direction with high variation ‘‘explains’’ the data better, in the sense that it allows us to distinguish between data points more easily. One criterion that we can use to quantify the variation in a collection of real numbers is the sample variance, which is the sum of the squares of the differences between the
numbers and their average.

Solving the maximal variance problem


Let us find a direction which maximizes the empirical variance. We seek a (normalized) direction such that
the empirical variance of the projected values , , is large. If is the vector of averages of
the ‘s, then the average of the projected values is . Thus, the direction of maximal variance is one that
solves the optimization problem

The above problem can be formulated as

where

is the sample covariance matrix of the data. The interpretation of the coefficient is that it
provides the covariance between the votes of Senator and those of Senator .

We have seen the above problem before, under the name of the Rayleigh quotient of a symmetric matrix.
Solving the problem entails simply finding an eigenvector of the covariance matrix that corresponds to the
largest eigenvalue.

This image shows the scores assigned to each


Senator along the direction of maximal variance,
, , with a
normalized eigenvector corresponding to the
largest eigenvalue of the covariance matrix .
Republican Senators tend to score positively, while
we find many Democrats on the negative score.
Hence the direction could be interpreted as
revealing the party affiliation.
Note that the largest absolute score (about 18)
obtained in this plot is about three times bigger
than that observed on the previous one. This is
consistent with the fact that the current direction
has maximal variance.

32.6. Principal component analysis

Main idea

The main idea behind principal components analysis is to first find a direction that corresponds to maximal
variance between the data points. The data is then projected on the hyperplane orthogonal to that direction.
We obtain a new data set and find a new direction of maximal variance. We may stop the process when we
have collected enough directions (say, three if we want to visualize the data in 3D).

Mathematically, the process amounts to finding the eigenvalue decomposition of a positive semi-definite
matrix: the covariance matrix of the data points. The directions of large variance correspond to the
eigenvectors with the largest eigenvalues of that matrix. The projection to use to obtain, say, a two-
dimensional view with the largest variance, is of the form , where is a matrix that
contains the eigenvectors corresponding to the first two eigenvalues.

Low rank approximations


In some cases, we are not specifically interested in visualizing the data, but simply in approximating the data matrix with a ‘‘simpler’’ one.

Assume we are given a (sample) covariance matrix of the data, . Let us find the eigenvalue decomposition
of :

where is an orthogonal matrix. Note that the trace of that matrix has an interpretation as the
total variance in the data, which is the sum of all the variances of the votes of each Senator:

Now let us plot the values of ‘s in decreasing order.



This image shows the eigenvalues of the


covariance matrix of the Senate voting
data, which contains the covariances between
the votes of each pair of Senators.

Clearly, the eigenvalues decrease very fast. One is tempted to say that ‘‘most of the information’’ is contained
in the first eigenvalue. To make this argument more rigorous, we can simply look at the ratio:

which is the ratio of the total variance in the data (as approximated by ) to that of the whole matrix .

In the Senate voting case, this ratio is of the order of 90%. It turns out that this is true of most voting patterns
in democracies across history: the first eigenvalue ‘‘explains most of the variance’’.

32.7. Sparse PCA

Motivation
Recall that the direction of maximal variance is one vector that solves the optimization problem

Here is the estimated center. We obtain a new data set by combining the variables
according to the directions determined by . The resulting dataset would have the same dimension as the
original dataset, but each dimension has a different meaning (since each is a linear projection of the original variables).

As explained, the main idea behind principal component analysis is to find those directions that correspond to maximal variance between the data points. The data is then projected on the hyperplane
spanned by these principal components. We may stop the process when we have collected enough directions
in the sense that the new directions explain the majority of the variance. That is, we can pick those directions
corresponding to the highest scores.

We may also wonder if can have only a few non-zero coordinates. For example, if the optimal direction
is , then it is clear that the 3rd and 4th bills characterize most of the features,
and we may simply want to drop the 1st and 2nd bills. That is, we want to adjust the optimal direction vector
as . This adjustment accounts for sparsity. In the setting of PCA, each principal component is a linear combination of all input variables. Sparse PCA allows us to find principal
components as linear combinations that contain just a few input variables (hence it looks “sparse” in the
input space). This feature would enhance the interpretability of the resulting dataset and perform dimension
reduction in the input space. Reducing the number of input variables is especially helpful for the Senate voting dataset, since there are more bills (input variables) than Senators (samples).

We are going to compare this result of PCA to sparse PCA results below.

This image shows the scores assigned to each


Senator along the direction of maximal
variance, ,
, with corresponding to the PCA
optimization problem. Republican Senators
tend to score positively, while we find many
Democrats on the negative score. Hence the
direction could be interpreted as revealing the
party affiliation.
We are going to compare this result of PCA to
sparse PCA results below.

32.8 Sparse maximal variance problem

Main Idea
A mathematical generalization of the PCA can be obtained by modifying the PCA optimization problem
above. We attempt to find the direction of maximal variance as one vector that solves the
optimization problem

where

The difference is that we put one more constraint , where is the number of non-zero
coordinates in the vector . For instance, but
. Here, is a pre-determined hyper-parameter that describes the sparsity of the input space we want.
This cardinality constraint makes the optimization problem non-convex, and it admits no closed-form solution; the problem can still be handled numerically, and its solution has the sparsity properties just described, but it is hard to solve exactly. Instead, we use a penalized relaxation of the problem

This optimization problem is convex and can be solved numerically; this is the so-called sparse PCA (SPCA)
method. The parameter is a pre-determined hyper-parameter we introduced as a penalty parameter, which
can be tuned, as we shall see below.
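
As a rough illustration (again with synthetic stand-in data), the sketch below uses scikit-learn's SparsePCA, which solves an l1-penalized formulation related to, though not identical with, the relaxation above; the penalty values are placeholders to be tuned.

import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
# Rows = Senators (samples), columns = bills (variables): the transpose of the voting matrix.
X = rng.choice([-1.0, 0.0, 1.0], size=(100, 542))
X = X - X.mean(axis=0)                           # center each column

for alpha in (1.0, 10.0, 100.0):                 # penalty weight (hyper-parameter)
    spca = SparsePCA(n_components=1, alpha=alpha, random_state=0)
    spca.fit(X)
    u = spca.components_[0]                      # sparse direction in bill space
    n_active = np.count_nonzero(u)               # number of "active" bills
    scores = X @ u                               # one score per Senator
    print(f"alpha={alpha}: {n_active} active variables")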

Analysis of the senate dataset


We can apply the SPCA method to the Senate voting dataset, with (i.e. PCA) and increase to
values of 1, 10, and 1000. In each setting, we also record the number of non-zero coordinates of ,
which is a solution to the SPCA optimization problem above. Since each coordinate of represents the
voting outcome of one bill, we refer to the number of non-zero coordinates of as the number of active
variables. From the results below, we did not observe much change in the separation of party lines given by
the principal components. As increases to 1000, there are only 7 active variables left. The corresponding 7
bills, which are important for distinguishing the party lines of senators, are:

• 8 Energy Issues LIHEAP Funding Amendment 3808


• 16 Abortion Issues Unintended Pregnancy Amendment 3489
• 34 Budget, Spending and Taxes Hurricane Victims Tax Benefit Amendment 3706
• 36 Budget, Spending and Taxes Native American Funding Amendment 3498
• 47 Energy Issues Reduction in Dependence on Foreign Oil 3553
• 59 Military Issues Habeas Review Amendment 3908
• 81 Business and Consumers Targeted Case Management Amendment 3664

This image shows the scores assigned to each


Senator along the direction of maximal
variance, for
, where the
corresponds to the sparse PCA optimization
problem with . There are 242
non-zero coefficients, meaning that we only
need 242 different bills to get this score
revealing the party affiliation to this extent.
This is almost identical to the result obtained
from PCA.

This image shows the scores assigned to each


Senator along the direction of maximal
variance, for
, where the
corresponds to the sparse PCA optimization
problem with . There are 87
non-zero coefficients, meaning that we only
need 87 different bills to get this score
revealing the party affiliation to this extent.
Compared to PCA, there is one more
mis-classified senator.

This image shows the scores assigned to each


Senator along the direction of maximal
variance, for
, where the
corresponds to the sparse PCA optimization
problem with . There are 8
non-zero coefficients, meaning that we only
need 8 different bills to get this score revealing
the party affiliation to this extent, which is not much different from the result obtained with PCA using all 542 votes.

33.

EXERCISES

• Interpretation of covariance matrix


• Eigenvalue decomposition
• Positive-definite matrices, ellipsoids
• Least-squares estimation

33.1. Interpretation of covariance matrix


We are given a set of points in . We assume that the average and variance of the data
projected along a given direction do not change with the direction. In this exercise, we will show that
the sample covariance matrix is then proportional to the identity.

We formalize this as follows. To a given normalized direction ( ), we associate


the line with direction passing through the origin, . We then consider the
projection of the points , , on the line , and look at the associated coordinates
of the points on the line. These projected values are given by

We assume that for any , the sample average of the projected values ,
, and their sample variance , are both constant, independent of the direction (with
). Denote by and the (constant) sample average and variance.

Justify your answer to the following as carefully as you can.

1. Show that

2. Show that the sample average of the data points

is zero.

3. Show that the sample covariance matrix of the data points,



is of the form , where is the identity matrix of order . (Hint: the largest eigenvalue
of the matrix can be written as: , and a
similar expression holds for the smallest eigenvalue.)

33.2. Eigenvalue decomposition


Let be two linearly independent vectors, with unit norm ( ). Define
the symmetric matrix . In your derivations, it may be useful to use the notation
.

1. Show that and are eigenvectors of and determine the corresponding


eigenvalues.

2. Determine the nullspace and rank of .

3. Find an eigenvalue decomposition of . Hint: use the previous two parts.

4. What is the answer to the previous part if are not normalized?

33.3. Positive-definite matrices, ellipsoids


1. In this problem, we examine the geometrical interpretation of the positive definiteness of a matrix.
For each of the following cases determine the shape of the region generated by the constraint
.

2. Show that if a square, symmetric matrix is positive semi-definite, then for every
matrix , is also positive semi-definite. (Here, is an arbitrary integer.)

3. Drawing an ellipsoid. How would you efficiently draw an ellipsoid in , if the ellipsoid is
described by a quadratic inequality of the form

where is and symmetric, positive-definite, , and ? Describe your algorithm as


precisely as possible. (You are welcome to provide code.) Draw the ellipsoid

33.4. Least-squares estimation


BLUE property of least-squares. Consider a system of linear equations in vector

where is a noise vector, and the input is , a full rank, tall matrix ( ), and
. We do not know anything about , except that it is bounded: , with a
measure of the level of noise. Our goal is to provide an estimate of via a linear estimator, that is,
a function with a matrix. We restrict attention to unbiased estimators, which are
such that when . This implies that should be a left inverse of , that is, .
An example of the linear estimator is obtained by solving the least-squares problem

The solution is, when is full column rank, of the form , with
. We note that , which means that the LS estimator is unbiased. In this exercise, we show
that is the best unbiased linear estimator. (This is often referred to as the BLUE property.)

1. Show that the estimation error of an unbiased linear estimator is .

2. This motivates us to minimize the size of , say using the Frobenius norm:

Show that is the best unbiased linear estimator (BLUE), in the sense that it solves the above
problem.

Hint: Show that any unbiased linear estimator can be written as with
, and that is positive semi-definite.

PART VI
SINGULAR VALUES

The singular value decomposition (SVD) generalizes the spectral theorem (available for a square, symmetric
matrix), to any non-symmetric, and even rectangular, matrix. The SVD allows us to describe the effect of a
matrix on a vector (via the matrix-vector product), as a three-step process: a first rotation in the input space;
a simple positive scaling that takes a vector in the input space to the output space; and another rotation in
the output space. The figure on the left shows the SVD of a matrix of biological data.

The SVD allows us to analyze matrices and associated linear maps in detail, and to solve a host of special
optimization problems, from solving linear equations to linear least-squares. It can also be used to reduce
the dimensionality of high-dimensional data sets, by approximating data matrices with low-rank ones. This
technique is closely linked to the principal component analysis method.

Outline

• The SVD theorem


• Matrix properties via SVD
• Solving linear systems via SVD
• Least-squares and SVD
• Low-rank approximations
• Applications
• Exercises

34.

THE SVD THEOREM

• The SVD theorem


• Geometry
• Link with the spectral theorem

34.1. The SVD theorem

Basic idea

Recall from here that any matrix with rank one can be written as

where , and .

It turns out that a similar result holds for matrices of arbitrary rank . That is, we can express any matrix
of rank as a sum of rank-one matrices

where are mutually orthogonal, are also mutually orthogonal, and the ‘s are
positive numbers called the singular values of . In the above, turns out to be the rank of .

Theorem statement

The following important result applies to any matrix , and allows us to understand the structure of the
mapping .

Theorem: Singular Value Decomposition (SVD)

An arbitrary matrix admits a decomposition of the form

where are both orthogonal matrices, and the matrix is diagonal:

where the positive numbers are unique and are called the singular values of . The number
is equal to the rank of , and the triplet is called a singular value decomposition
(SVD) of . The first columns of : (resp. : ) are called left (resp. right)
singular vectors of , and satisfy

This proof of the theorem hinges on the spectral theorem for symmetric matrices. Note that in the theorem,
the zeros appearing alongside represent blocks of zeros. They may be empty, for example if then
there are no zeros to the right of .

Computing the SVD

The SVD of an matrix can be computed via a sequence of linear transformations. The
computational complexity of the algorithm, when expressed in terms of the number of floating-point
operations, is given by

This complexity can become substantial when dealing with large, dense matrices. However, for sparse
matrices, one can expedite the computation if only the largest few singular values and their corresponding
singular vectors are of interest. To understand the derivation of this complexity:

• The outer product of vectors and has a complexity of . This is because for a vector of
length and a vector of length , the outer product results in an matrix, and computing
each entry requires one multiplication.
• The matrix has at most non-zero singular values, where . Each of these singular
values will contribute to the overall computational cost.
• Combining the costs from the two previous steps, the total computational complexity becomes
.

Example: A example.
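
For a concrete illustration (the matrix below is arbitrary), the sketch computes a full SVD with NumPy, reconstructs the matrix, and reads its rank from the number of non-zero singular values.

import numpy as np

A = np.array([[1.0, 2.0, 0.0, 0.0],
              [2.0, 4.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 0.0]])            # a 3 x 4 matrix of rank 2

U, s, Vt = np.linalg.svd(A)                     # full SVD: A = U Sigma Vt
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)            # embed the singular values in a 3 x 4 block

print(np.allclose(A, U @ Sigma @ Vt))           # True
print("singular values:", s)
print("rank:", int(np.sum(s > 1e-10)))          # 2 non-zero singular values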

34.2. Geometry
The theorem allows us to decompose the action of on a given input vector as a three-step process. To get
, where , we first form . Since is an orthogonal matrix, is also orthogonal,
and is just a rotated version of , which still lies in the input space. Then we act on the rotated vector
by scaling its elements. Precisely, the first elements of are scaled by the singular values ;
the remaining elements are set to zero. This step results in a new vector which now belongs to the
output space . The final step consists in rotating the vector by the orthogonal matrix , which results
in .

For example, assume has the simple form

then for an input vector in , is a vector in with first component , second component
, and last component being zero.

To summarize, the SVD theorem states that any matrix-vector multiplication can be decomposed as a
sequence of three elementary transformations: a rotation in the input space, a scaling that goes from the
input space to the output space, and a rotation in the output space. In contrast with symmetric matrices,
input and output directions are different.

The interpretation allows us to make a few statements about the matrix.

Example: A example.

34.3. Link with the SED (Spectral Theorem)


If admits an SVD, then the matrices and have the following SEDs:

where

is (so it has trailing zeros), and



is (so it has trailing zeros). The eigenvalues of and are the same, and equal to the
squared singular values of .

The corresponding eigenvectors are the left and right singular vectors of .

This provides a method (though not the most computationally efficient one) to find the SVD of a matrix, based on the SED.

35.

MATRIX PROPERTIES VIA SVD

• Nullspace
• Range, rank
• Fundamental theorem of linear algebra
• Matrix norms and condition number

35.1. Nullspace

Finding a basis for the nullspace

The SVD allows the computation of an orthonormal basis for the nullspace of a matrix. To understand this,
let us first consider a matrix of the form

The nullspace of this matrix is readily found by solving the equation . We obtain that
is in the nullspace if and only if the first two components of are zero:

What about a general matrix , which admits the SVD as given in the SVD theorem? Since is orthogonal,
we can pre-multiply the nullspace equation by , and solve in terms of the ‘‘rotated’’ variable. We obtain the condition on


The above is equivalent to the first components of being zero. Since , this corresponds to the
fact that belongs to the span of the last columns of . Note that these columns form a set of
mutually orthogonal, normalized vectors that span the nullspace: hence they form an orthonormal basis for
it.

Theorem: nullspace via SVD

The nullspace of a matrix with SVD

where are both orthogonal matrices, admits the last columns of as an


orthonormal basis.

Example: Nullspace of a matrix.
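
A sketch (with an arbitrary rank-one matrix) extracting an orthonormal basis of the nullspace from the trailing right singular vectors:

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])                 # rank-one 2 x 3 matrix

U, s, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))                        # numerical rank

N = Vt[r:].T                                    # last n - r right singular vectors
print("rank:", r)
print(np.allclose(A @ N, 0))                    # each basis vector lies in the nullspace
print(np.allclose(N.T @ N, np.eye(A.shape[1] - r)))   # and the basis is orthonormal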

Full column-rank matrices

One-to-one (or, full column rank) matrices are the matrices with nullspace reduced to . If the dimension
of the nullspace is zero, then we must have . Thus, full column rank matrices are ones with SVD of
the form

35.2. Range, rank via the SVD

Basis of the range

As with the nullspace, we can express the range in terms of the SVD of the matrix . Indeed, the range of
is the set of vectors of the form

where . Since is orthogonal, when spans , so does . Decomposing the latter


vector into two sub-vectors , we obtain that the range is the set of vectors , with

where is an arbitrary vector of . Since is invertible, also spans . We obtain that the
range is the set of vectors , where is of the form with arbitrary. This means that
the range is the span of the first columns of the orthogonal matrix , and that these columns form an
orthonormal basis for it. Hence, the number of dyads appearing in the SVD decomposition is indeed the
rank (dimension of the range).

Theorem: range and rank via SVD

The range of a matrix with SVD

where and are both orthogonal matrices, admits the first columns of as an
orthonormal basis.

Full row rank matrices

An onto (or full row rank) matrix has a range . These matrices are characterized by an SVD of the
form

Example: Range of a matrix.



35.3. Fundamental theorem of linear algebra


The theorem already mentioned here allows us to decompose any vector into two orthogonal ones, the first in
the nullspace of a matrix , and the second in the range of its transpose.

Fundamental theorem of linear algebra

Let be a given matrix. The sets and form an orthogonal decomposition of , in the sense that any
vector can be written as

In particular, we obtain that if a vector is orthogonal to every vector in the nullspace, then it must be in the range:

Proof.

35.4. Matrix norms, condition number


Matrix norms are useful to measure the size of a matrix. Some of them can be interpreted in terms of input-
output properties of the corresponding linear map; for example, the Frobenius norm measures the average
response to unit vectors, while the largest singular value (LSV) norm measures the peak gain. These two norms
can be easily read from the SVD.

Frobenius norm

The Frobenius norm can be defined as

Using the SVD of , we obtain

Hence the squared Frobenius norm is nothing else than the sum of the squares of the singular values.

Largest singular value norm

An alternate way to measure matrix size is based on asking for the maximum ratio of the norm of the output
to the norm of the input. When the norm used is the Euclidean norm, the corresponding quantity

is called the largest singular value (LSV) norm. The reason for this wording is given by the following
theorem.

Theorem: largest singular value norm

For any matrix ,

where is the largest singular value of . Any left singular vector associated with the singular value achieves
the maximum in the above.

Example: Norms of a matrix.

Condition number

The condition number of an invertible matrix is the ratio between the largest and the smallest
singular values:

As seen in the next section, this number provides a measure of the sensitivity of the solution of a linear
equation to changes in .
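
A short sketch (arbitrary invertible matrix) reading the Frobenius norm, the LSV norm, and the condition number off the singular values, and checking them against NumPy's built-in functions:

import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])                      # an invertible matrix

s = np.linalg.svd(A, compute_uv=False)          # singular values, in decreasing order

frob = np.sqrt(np.sum(s**2))                    # Frobenius norm
lsv = s[0]                                      # largest singular value norm
cond = s[0] / s[-1]                             # condition number

print(np.isclose(frob, np.linalg.norm(A, 'fro')))   # True
print(np.isclose(lsv, np.linalg.norm(A, 2)))        # True
print(np.isclose(cond, np.linalg.cond(A, 2)))       # True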

36.

SOLVING LINEAR SYSTEMS VIA SVD

• Solution set of a linear equation


• Pseudo-inverse
• Sensitivity analysis and condition number

36.1. Solution set


Consider a linear equation

where and are given. We can completely describe the set of solutions via SVD,
as follows. Let us assume that admits the SVD given here. We first pre-multiply the linear equation by the inverse of the orthogonal factor ; then we express the equation in terms of the rotated vector .
This leads to

where is the “rotated” right-hand side of the equation.

Due to the simple form of , the above can be written as

Two cases can occur:

• If the last components of are not zero, then the above system is infeasible, and the solution
set is empty. This occurs when is not in the range of .
• If is in the range of , then the last set of conditions in the above system hold, and we can solve for
with the first set of conditions:

The last components of are free. This corresponds to elements in the nullspace of . If is full column rank (its nullspace is reduced to { }), then there is a unique solution.

36.2. Pseudo-inverse

Definition
The solution set is conveniently described in terms of the pseudo-inverse of , denoted by , and defined
via the SVD of :

as one with the same SVD, with non-zero singular values inverted, and the matrix transposed:

The pseudo-inverse of a matrix is always well-defined, and it has the same size as the transpose . When the
matrix is invertible (it is square and full column or row rank: ), then it reduces to the inverse.

Example: pseudo-inverse of a matrix.

Link with solution set


From the above development, we see that the solution set can be written as

where is the nullspace of . Both and a basis for the nullspace can be computed via the SVD.

Case when is full rank


If is full column rank, the pseudo-inverse can be written as

In that case, is a left-inverse of , since .

If is full row-rank, then the pseudo-inverse can be written as



In that case, is a right-inverse of , since .
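
The following sketch (with an arbitrary full row rank matrix) describes the solution set of a linear system via the pseudo-inverse, as above: a particular solution plus any element of the nullspace.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])                  # full row rank, 2 x 3
y = np.array([6.0, 2.0])

A_pinv = np.linalg.pinv(A)                       # pseudo-inverse, computed via the SVD
x0 = A_pinv @ y                                  # particular (minimum-norm) solution
print(np.allclose(A @ x0, y))                    # True: the system is feasible

# Any solution is x0 plus an element of the nullspace of A.
U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))
N = Vt[r:].T                                     # orthonormal basis of the nullspace
x = x0 + N @ np.array([2.5])                     # another solution, for an arbitrary coefficient
print(np.allclose(A @ x, y))                     # True

print(np.allclose(A @ A_pinv, np.eye(2)))        # full row rank: A_pinv is a right inverse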

36.3. Sensitivity analysis and condition number


Sensitivity analysis refers to the process of quantifying the impact of changes in the linear equations’
coefficients (the matrix and vector ) on the solution. To simplify, let us assume that is square and
invertible, and analyze the effects of errors in only. The condition number of the matrix quantifies this.

We start from the linear equation above, which has the unique solution . Now assume that is
changed into , where is a vector that contains the changes in . Let’s denote by the new
solution, which is . From the equations:

and using the definition of the largest singular value norm, we obtain:

Combining the two inequalities we get:

where is the condition number of , defined as:

We can express the condition number as the ratio between the largest and smallest singular values of :

The condition number gives a bound on the ratio of the relative error in the left-hand side to that of the solution. We can also analyze the effect of errors in the matrix itself on the solution. The condition
number turns out to play a crucial role there as well.

37.

LEAST-SQUARES AND SVD

• Set of solutions via the pseudo inverse


• Sensitivity analysis
• BLUE property

37.1. Set of solutions


The following theorem provides all the solutions (optimal set) of a least-squares problem.

Theorem: optimal set of ordinary least-squares

The optimal set of the OLS problem

can be expressed as

where is the pseudo-inverse of , and is the minimum-norm point in the optimal set. If is full column
rank, the solution is unique, and equal to

In general, the particular solution is the minimum-norm solution to the least-squares problem.

Proof.
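
A minimal sketch (random data, full column rank with probability one) comparing the pseudo-inverse formula for the least-squares solution with NumPy's solver and with the normal equations:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))                     # tall data matrix
y = rng.standard_normal(20)

x_pinv = np.linalg.pinv(A) @ y                       # solution via the pseudo-inverse
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)      # NumPy's least-squares solver
x_normal = np.linalg.solve(A.T @ A, A.T @ y)         # normal equations (full column rank)

print(np.allclose(x_pinv, x_lstsq), np.allclose(x_pinv, x_normal))   # True True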

37.2. Sensitivity analysis


We consider the situation where

with

• the data matrix (known), with full column rank (hence ).


• is the measurement (known).
• is the vector to be estimated (unknown).
• is a measurement noise or error (unknown).

We can use OLS to provide an estimate of . The idea is to seek the smallest vector such that the
above equation becomes feasible, that is,

This leads to the OLS problem:

Since is full column rank, its SVD can be expressed as

where contains the singular values of , with .

Since is full column rank, the solution to the OLS problem is unique, and can be written as a linear
function of the measurement vector :

with the pseudo-inverse of . Again, since is full column rank,

The OLS formulation provides an estimate of the input such that the residual error vector
is minimized in norm. We are interested in analyzing the impact of perturbations in the vector , on the
resulting solution . We begin by analyzing the absolute errors in the estimate and then turn to the
analysis of relative errors.

Set of possible errors


Let us assume a simple model of potential perturbations: we assume that belongs to a unit ball:
, where is given. We will assume for simplicity; the analysis is easily extended to
any .

We have

In the above, we have exploited the fact that is a left inverse of , that is, .

The set of possible errors on the solution is then given by

which is an ellipsoid centered at zero, with principal axes given by the singular values of . This ellipsoid
can be interpreted as an ellipsoid of confidence for the estimate , with size and shape determined by the
matrix .

We can draw several conclusions from this analysis:

• The largest absolute error in the solution that can result from a unit-norm, additive perturbation on
is of the order of , where is the smallest singular value of .
• The largest relative error is , the condition number of .

37.3. BLUE property


We now return to the case of an OLS with full column rank matrix .

Unbiased linear estimators


Consider the family of linear estimators, which are of the form

where . To this estimator, we associate the error



We say that the estimator (as determined by matrix ) is unbiased if the first term is zero:

Unbiased estimators only exist when the above equation is feasible, that is, has a left inverse. This is
equivalent to our condition that be full column rank. Since is a left-inverse of , the OLS estimator is
a particular case of an unbiased linear estimator.

Best unbiased linear estimator


The above analysis leads to the following question: which is the best unbiased linear estimator? One way
to formulate this problem is to assume that the perturbation vector is bounded in some way, and try to
minimize the possible impact of such bounded errors on the solution.

Let us assume that belongs to a unit ball: . The set of resulting errors on the solution is
then

which is an ellipsoid centered at zero, with principal axes given by the singular values of . This ellipsoid can
be interpreted as an ellipsoid of confidence for the estimate , with size and shape determined by the matrix
.

It can be shown that the OLS estimator is optimal in the sense that it provides the ‘‘smallest’’ ellipsoid of
confidence among all unbiased linear estimators. Specifically:

This optimality of the LS estimator is referred to as the BLUE (Best Linear Unbiased Estimator) property.

38.

LOW-RANK APPROXIMATIONS

• Low-rank approximations
• Link with PCA

38.1. Low-rank approximations


We consider a matrix , with SVD given as in the SVD theorem:

where the singular values are ordered in decreasing order, . In many applications, it
can be useful to approximate with a low-rank matrix.

Example: Assume that contains the log returns of assets over time periods so that each column of
is a time series for a particular asset. Approximating by a rank-one matrix of the form , with
and amounts to modeling the assets’ movements as all following the same pattern given by the time-
profile , each asset’s movements being scaled by the components in . Specifically, the component
of , which is the log-return of asset at time , then expresses as .

We consider the low-rank approximation problem

where ( ) is given. In the above, we measure the error in the approximation using
the Frobenius norm; using the largest singular value norm leads to the same set of solutions .

Theorem: Low-rank approximation

A best -rank approximation is given by zeroing out the trailing singular values of , that is

The minimal error is given by the Euclidean norm of the singular values that have been zeroed out in the process:

Sketch of proof: The proof rests on the fact that the Frobenius norm is invariant under rotations of the input and output spaces, that is, for any matrix and orthogonal matrices of
appropriate sizes. Since the rank is also invariant, we can reduce the problem to the case when .

Example: low-rank approximation of a matrix.
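
As an illustration (random matrix, placeholder target rank), the sketch below forms the truncated-SVD approximation and checks the error formula of the theorem:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
k = 2                                               # target rank (placeholder)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # zero out the trailing singular values

err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))   # True: error = norm of dropped singular values
print(np.linalg.matrix_rank(A_k))                   # 2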

38.2. Link with Principal Component Analysis


Principal Component Analysis operates on the covariance matrix of the data, which is proportional to
and sets the principal directions to be the eigenvectors of that (symmetric) matrix. As noted here,
the eigenvectors of are simply the left singular vectors of . Hence both methods, the above approximation method and PCA, rely on the same tool, the SVD. The latter is a more complete approach
as it also provides the eigenvectors of , which can be useful if we want to analyze the data in terms of
rows instead of columns.

In particular, we can express the explained variance directly in terms of the singular values. In the context
of visualization, the explained variance is simply the ratio of the total amount of variance in the projected
data, to that in the original. More generally, when we are approximating a data matrix by a low-rank matrix,
the explained variance compares the variance in the approximation to that in the original data. We can also
interpret it geometrically, as the ratio of the squared norm of the approximation matrix to that of the original
matrix:

39.

APPLICATIONS

• Image compression.
• Market data analysis.

39.1 Image compression

Images as matrices
We can represent images as matrices, as follows. Consider an image having pixels. For gray scale
images, we need one number per pixel, which can be represented as a matrix. For color images, we
need three numbers per pixel, one for each color: red, green, and blue (RGB). Each color can be represented as a
matrix, and we can represent the full color image as a matrix, where we stack each color’s
matrix column-wise alongside each other, as

The image on the left is a grayscale image, which can be


represented as a matrix containing the gray scale
values stored as integers.

The image can be visualized as well. We must first transform the matrix from integer to double. In JPEG
format, the image will be loaded into matlab as a three-dimensional array, one matrix for each color. For gray
scale images, we only need the first matrix in the array.

Low-rank approximation
Using the low-rank approximation via SVD method, we can form the best rank- approximations for the
matrix.

True and approximated images, with varying rank. We observe that with , the approximation is
almost the same as the original picture, whose rank is .

Recall that the explained variance of the rank- approximation is the ratio between the squared norm of
the rank- approximation matrix and the squared norm of the original matrix. Essentially, it measures how
much information is retained in the approximation relative to the original.

The explained variance for the various rank-


approximations to the original image provides
insight into the fidelity of the approximation.
For , the approximation contains
more than 99% of the total variance in the
picture, suggesting that the rank-50
approximation is almost indistinguishable from
the original in terms of information content.
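
A sketch of the procedure, assuming a grayscale image is available as a two-dimensional array (the file name and the ranks shown are placeholders; a random array stands in for the image so the sketch runs on its own):

import numpy as np
# from matplotlib import pyplot as plt, image as mpimg   # for loading and displaying images

# img = mpimg.imread("photo.png").astype(float)          # placeholder file name
img = np.random.default_rng(0).random((256, 512))        # stand-in "image"

U, s, Vt = np.linalg.svd(img, full_matrices=False)

for k in (5, 20, 50):
    img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation
    explained = np.sum(s[:k]**2) / np.sum(s**2)          # explained variance
    print(f"rank {k}: explained variance {explained:.3f}")
    # plt.imshow(img_k, cmap="gray"); plt.show()         # uncomment to display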

39.2 Market data analysis


We consider the daily log-returns of a collection of stocks chosen in the Fortune 100 companies over the
time period from January 3, 2007, until December 31, 2008. We can represent this as a matrix,
with each column a day, and each row a time-series corresponding to a specific stock.

The image on the left represents the time series


of the stock market mentioned above, shown as
a collection of time-series. We note that the log-
returns hover around a mean which appears to be
close to zero.

We can form the SVD of the matrix of log-


returns and plot the explained variance. We see
that the first 10 singular values explain more
than 80% of the data’s variance.

It is instructive to look at the singular vector corresponding to the largest singular value, arranged in
increasing order. We observe that all the components have the same sign (which we can always assume is
positive). This means we can interpret this vector as providing a weighted average of the market. As seen
in the previous plot, the corresponding rank-one approximation roughly explains more than 80% of the
variance in this market data, which justifies the phrase ‘‘the market average moves the market’’. The five
components with largest magnitude correspond to the following companies. Note that all are financial:

• FABC (Fidelity Advisor)


• FTU (Wachovia, bought by Wells Fargo)
• MER (Merrill Lynch, bought by Bank of America)
• AIG (don’t need to elaborate)

• MS (Morgan Stanley)

40.

EXERCISES

• SVD of simple matrices


• Rank and SVD
• Procrustes problem
• SVD and projections
• SVD and least-squares

40.1. SVD of simple matrices


1. Consider the matrix

a. Find the range, nullspace, and rank of .

b. Find an SVD of .

c. Determine the set of solutions to the linear equation , with

2. Consider the matrix

a. What is an SVD of ? Express it as , with the diagonal matrix of singular



values ordered in decreasing fashion. Make sure to check all the properties required for
.

b. Find the semi-axis lengths and principal axes (minimum and maximum distance and
associated directions from to the center) of the ellipsoid

Hint: Use the SVD of to show that every element of is of the form for
some element in . That is, . (In other words, the matrix
maps into the set .) Then analyze the geometry of the simpler set .

c. What is the set when we append a zero vector after the last column of , that is is
replaced with ?

d. The same question when we append a row after the last row of , that is, is replaced with
. Interpret geometrically your result.

40.2. Rank and SVD

The image on the left shows a matrix of pixel values. The lines indicate values; at
each intersection of lines, the corresponding matrix element is . All the other elements are zero.

1. Show that for some permutation matrices , the permuted matrix has the
symmetric form , for two vectors . Determine and .

2. What is the rank of ? Hint: find the nullspace of .

3. Find an SVD of . Hint: Find an eigenvalue decomposition of , using the results of an


exercise on eigenvalue here.

40.3. Procrustes problem


The Orthogonal Procrustes problem is a problem of the form

where denotes the Frobenius norm, and the matrices , are given.
Here, the matrix variable is constrained to have orthonormal columns. When
, the problem can be interpreted geometrically as seeking a transformation of points
(contained in ) to other points (contained in ) that involves only rotation.

1. Show that the solution to the Procrustes problem above can be found via the SVD of the
matrix .

2. Derive a formula for the answer to the constrained least-squares problem

with , given.

40.4. SVD and projections


1. We consider a set of data points , . We seek to find a line in such
that the sum of the squares of the distances from the points to the line is minimized. To simplify, we
assume that the line goes through the origin.

a. Consider a line that goes through the origin , where is given.


(You can assume without loss of generality that .) Find an expression for the
projection of a given point on .

b. Now consider the points and find an expression for the sum of the squares of the
distances from the points to the line .

c. Explain how you would find the line via the SVD of the matrix
.

d. How would you address the problem without the restriction that the line has to pass
through the origin?

2. Solve the same problems as previously by replacing the line with a hyperplane.

40.5. SVD and least-squares


1. Consider the matrix formed as , with

where is a vector chosen randomly in (this represents a 3-dimensional cube with each
dimension ranging from to ). In addition, we define

with again chosen randomly in . We consider the associated least-squares problem

a. What is the rank of ?

b. Apply the least-squares formula . What is the norm of the residual


vector, ?

c. Express the least-squares solution in terms of the SVD of . That is, form the pseudo-inverse
of and apply the formula . What is now the norm of the residual?

d. Interpret your results.

2. Consider a least-squares problem

where the data matrix has rank one.

a. Is the solution unique?

b. Show how to reduce the problem to one involving one scalar variable.

c. Express the set of solutions in closed form.



PART VII
EXAMPLES

DIMENSION OF AN AFFINE SUBSPACE

The set in defined by the linear equations

is an affine subspace of dimension . The corresponding linear subspace is defined by the linear equations
obtained from the above by setting the constant terms to zero:

We can solve for and get . We obtain a representation of the linear subspace as the
set of vectors that have the form

for some scalar . Hence the linear subspace is the span of the vector , and is of
dimension .

We obtain a representation of the original affine set by finding a particular solution , by setting say
and solving for . We obtain

The affine subspace is thus the line , where are defined above.

SAMPLE AND WEIGHTED AVERAGE

The sample mean (or, average) of given numbers , is defined as

The sample average can be interpreted as a scalar product:

where is the vector containing the samples, and , with the vector of
ones.

More generally, for any vector , with for every , and , we can define
the corresponding weighted average as . The interpretation of is in terms of a discrete probability
distribution of a random variable , which takes the value with probability , . The
weighted average is then simply the expected value (or, mean) of under the probability distribution . The
expected value is often denoted , or if the distribution is clear from context.

SAMPLE AVERAGE OF VECTORS

The sample average of given vectors is defined as the vector , with



EUCLIDEAN PROJECTION ON A SET

A Euclidean projection of a point in on a set is a point that achieves the smallest Euclidean
distance from to the set. That is, it is any solution to the optimization problem

When the set is convex, there is a unique solution to the above problem. In particular, the projection on
an affine subspace is unique.

Example: assume that is the hyperplane

The projection problem reads as a linearly constrained least-squares problem, of particularly simple form:

The projection of on turns out to be aligned with the coefficient vector . Indeed,
components of orthogonal to don’t appear in the constraint, and only increase the objective value.
Setting in the equation defining the hyperplane and solving for the scalar we obtain

so that the projection is



ORTHOGONAL COMPLEMENT OF A
SUBSPACE

Let be a subspace of . The orthogonal complement of , denoted , is the subspace of that


contains the vectors orthogonal to all the vectors in . If the subspace is described as the range of a matrix:

then the orthogonal complement is the set of vectors orthogonal to the rows of , which is the nullspace of
.

Example: Consider the line in passing through the origin and generated by the vector .
This is a subspace of dimension 1:

To find the orthogonal complement, we find the set of vectors that are orthogonal to any vector of the form
, with arbitrary . This is the same set as the set of vectors orthogonal to itself. So, we solve for
with :

This is equivalent to . This equation characterizes the elements of the orthogonal


complement , in the sense that any can be written as

for some scalars , where

The orthogonal complement is thus the span of the vectors :



POWER LAWS

Consider a physical process that has inputs , , and a scalar output . Inputs and output
are physical, positive quantities, such as volume, height, or temperature. In many cases, we can (at least
empirically) describe such physical processes by power laws, which are non-linear models of the form

where , and the coefficients , are real numbers. For example, the relationship
between area, volume, and size of basic geometric objects; the Coulomb law in electrostatics; birth and
survival rates of (say) bacteria as functions of concentrations of chemicals; heat flows and losses in pipes, as
functions of the pipe geometry; analog circuit properties as functions of circuit parameters; etc.

The relationship is neither linear nor affine, but if we take the logarithm of both sides and introduce the
new variables

then the above equation becomes an affine one:

where .

See also: Fitting power laws to data.



POWER LAW MODEL FITTING

Returning to the example involving power laws, we ask the question of finding the ‘‘best’’ model of the form

given experiments with several input vectors and associated outputs , . Here the
variables of our problem are , and the vector . Taking logarithms, we obtain

which can be rearranged to the linear form

where , and and are the logarithms of and , respectively. We can represent the above
linear equations compactly as

In practice, the power law model is only an approximate representation of reality. Finding the best fit can be
formulated as the optimization problem

where , , with the -th column of given by , and

See also: Power laws.



DEFINITION: VECTOR NORM

Informally, a (vector) norm is a function which assigns a length to vectors.

Any sensible measure of length should satisfy the following basic properties: it should be a convex function
of its argument (that is, the length of an average of two vectors should be always less than the average of
their lengths); it should be positive-definite (always non-negative, and zero only when the argument is the
zero vector), and preserve positive scaling (so that multiplying a vector by a positive number scales its norm
accordingly).

Formally, a vector norm is a function which satisfies the following properties.

Definition of a vector norm

1. Positive homogeneity: for every , , we have .

2. Triangle inequality: for every , we have

3. Definiteness: for every , implies .

A consequence of the first two conditions is that a norm only assumes non-negative values, and that it is
convex.

Popular norms include the so-called -norms, where or :

with the convention that when



AN INFEASIBLE LINEAR SYSTEM

The system of equations in unknowns is

which can be written as , where is the matrix

and is the -vector

The set of solutions turns out to be empty. Indeed, if we use the first two equations, and solve for ,
we get , but then the last equation is not satisfied, since

SAMPLE VARIANCE AND STANDARD


DEVIATION

The sample variance of given numbers , is defined as

where is the sample average of . The sample variance is a measure of the deviations of the
numbers with respect to the average value .

The sample standard deviation is the square root of the sample variance, . It can be expressed in terms of
the Euclidean norm of the vector , as

where denotes the Euclidean norm.

More generally, for any vector , with for every , and , we can define
the corresponding weighted variance as

The interpretation of is in terms of a discrete probability distribution of a random variable , which takes
the value with probability , . The weighted variance is then simply the expected value of
the squared deviation of from its mean , under the probability distribution .

See also: Sample and weighted average.



FUNCTIONS AND MAPS

• Functions
• Maps

Functions
In this course we define functions as objects which take an argument in and return a value in . We use
the notation

to refer to a function with “input” space . The “output” space for functions is .

Example: The function with values

gives the distance from the point to .

We allow functions to take infinite values. The domain of a function , denoted , is defined as
the set of points where the function is finite.

Example: Define the logarithm function as the function , with values if


, and otherwise. The domain of the function is thus (the set of positive reals).

Maps
We reserve the term map to refer to functions which return more than a single value, and use the notation

to refer to a map with input space and output space . The components of the map are the (scalar-
valued) functions .

Example: A map.

The map with values

has components the functions , with values



DUAL NORM

For a given norm on , the dual norm, denoted , is the function from to with values

The above definition indeed corresponds to a norm: it is convex, as it is the pointwise maximum of convex
(in fact, linear) functions ; it is homogeneous of degree , that is, for every
and .

By definition of the dual norm,

This can be seen as a generalized version of the Cauchy-Schwarz inequality, which corresponds to the
Euclidean norm.

Examples:

• The norm dual to the Euclidean norm is itself. This comes directly from the Cauchy-Schwarz
inequality.
• The norm dual to the -norm is the -norm. This is because the inequality

holds trivially and is attained for .

• The norm dual to the dual norm above is the original norm we started with. (The proof of this general result is more
involved.)

INCIDENCE MATRIX OF A NETWORK

Mathematically speaking, a network is a graph of nodes connected by directed arcs. Here, we assume
that arcs are ordered pairs, with at most one arc joining any two nodes; we also assume that there are no self-
loops (arcs from a node to itself). The arcs are not weighted; they are all treated alike.

We can fully describe the network with the so-called arc-node incidence matrix, which is the matrix
defined as

The figure shows the graph associated with the arc-node incidence
matrix

See also: Network flow.
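The construction can be illustrated numerically. The sketch below (in Python/NumPy) builds the arc-node incidence matrix of a small, hypothetical directed graph, under the sign convention assumed here that an entry is +1 if the arc leaves the node and -1 if it enters it; it then checks that the vector of ones is orthogonal to every column, a fact used in a later example.

import numpy as np

# Hypothetical directed graph on 4 nodes with 5 arcs: 1->2, 1->3, 2->3, 2->4, 3->4.
# Assumed convention: A[i, j] = +1 if arc j leaves node i, -1 if it enters node i, 0 otherwise.
arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
m, n = 4, len(arcs)
A = np.zeros((m, n))
for j, (start, end) in enumerate(arcs):
    A[start, j] = 1.0
    A[end, j] = -1.0

print(A)
# Each column has exactly one +1 and one -1, so the all-ones vector
# lies in the nullspace of the transpose of A:
print(A.T @ np.ones(m))   # all zeros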



NULLSPACE OF A TRANSPOSE INCIDENCE


MATRIX

Recall the definition of the arc-node incidence matrix of a network.

By definition, for the arc-node incidence matrix , we have , where 1 is the vector of ones in
. Hence, 1 is in the nullspace of .

See also: Rank properties of the arc-node incidence matrix.



RANK PROPERTIES OF THE ARC-NODE


INCIDENCE MATRIX

Recall the definition of the arc-node incidence matrix of a network.

A number of topological properties of a network with nodes and edges can be inferred from those of its
arc-node incidence matrix , and of the reduced incidence matrix , which is obtained from by removing
its last row. For example, the network is said to be connected if there is a path joining any two nodes. It can
be shown that the network is connected if and only if the rank of is equal to .

See also: Nullspace of a transpose incidence matrix.



PERMUTATION MATRICES

A matrix is a permutation matrix if it is obtained by permuting the rows or columns of an


identity matrix according to some permutation of the numbers to . Permutation matrices are orthogonal
(hence, their inverse is their transpose: ) and satisfy .

For example, the matrix

is obtained by exchanging the columns and , and and , of the identity matrix.

A permutation matrix allows us to exchange rows or columns of another matrix via the matrix-matrix product. For
example, if we take any matrix , then (with defined above) is the matrix with columns
and exchanged.

QR DECOMPOSITION: EXAMPLES

Consider the matrix

This matrix is full column rank. Q is a matrix and R is a matrix:

This shows that is full column rank since is invertible.

With the full QR decomposition, is now a orthogonal matrix:



We can see what happens when the input is not full column rank: for example, let’s consider the matrix

( is not full column rank, as it was constructed so that the last column is a combination of the first and the
third.)

The (full) QR decomposition now yields:



We observe that the last triangular element is virtually zero, and the last column is seen to be a linear
combination of the first and the third. This shows that the rank of (itself equal to the rank of ) is
effectively .

BACKWARDS SUBSTITUTION FOR


SOLVING TRIANGULAR LINEAR SYSTEMS.

Consider a triangular system of the form , where the vector is given, and is upper-
triangular. Let us first consider the case when , and is invertible. Thus, has the form

with each , , non-zero.

The backwards substitution first solves for the last component of using the last equation:

and then proceeds with the following recursion, for :

Example: Solving a triangular system by backward substitution
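A minimal NumPy sketch of the recursion just described, assuming an invertible upper-triangular matrix (the function name is ours):

import numpy as np

def backward_substitution(R, y):
    # Solve R x = y when R is upper triangular with non-zero diagonal.
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
    return x

# Quick check on a random upper-triangular system.
R = np.triu(np.random.rand(4, 4)) + np.eye(4)   # non-zero diagonal
y = np.random.rand(4)
x = backward_substitution(R, y)
print(np.allclose(R @ x, y))   # True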



SOLVING TRIANGULAR SYSTEMS OF


EQUATIONS: BACKWARDS SUBSTITUTION
EXAMPLE

Consider the triangular system

We solve for the last variable first, obtaining (from the last equation) . We plug this value of
into the first and second equation, obtaining a new triangular system in two variables :

We proceed by solving for the last variable . The last equation yields . Plugging this
value into the first equation gives

We can apply the idea to find the inverse of the square upper triangular matrix , by solving

The matrix is then the inverse of . We find

As illustrated above, the inverse of a triangular matrix is triangular.



LINEAR REGRESSION VIA LEAST SQUARES

Linear regression is based on the idea of fitting a linear function through data points.

In its basic form, the problem is as follows. We are given data where
is the ‘‘input’’ and is the ‘‘output’’ for the -th measurement. We seek to find a linear function
such that are collectively close to the corresponding values .

In least-squares regression, the way we evaluate how well a candidate function fits the data is via the
(squared) Euclidean norm:

Since a linear function has the form for some , the problem of minimizing the
above criterion takes the form

We can formulate this as a least-squares problem:

where

The linear regression approach can be extended to multiple dimensions, that is, to problems where the
output in the above problem contains more than one dimension (see here). It can also be extended to the
problem of fitting non-linear curves.
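A minimal sketch of this formulation in NumPy, on hypothetical one-dimensional data of the kind used in the example below (prices and average customer counts); np.linalg.lstsq solves the resulting least-squares problem:

import numpy as np

# Hypothetical data: x_i are prices, y_i are average numbers of customers.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([10.2, 9.1, 7.9, 7.3, 6.0])

# Model y ~ w*x + b: stack the data into the matrix of the least-squares problem.
A = np.column_stack([x, np.ones_like(x)])
theta, residual, rank, sv = np.linalg.lstsq(A, y, rcond=None)
w, b = theta
print(w, b)        # slope and intercept of the best-fit line
print(A @ theta)   # fitted values at the data points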

In this example, we seek to analyze how customers react


to an increase in the price of a given item. We are given
two-dimensional data points .
The ‘s contain the prices of the item, and the ‘s the
average number of customers who buy the item at that price.

The generic equation of a non-vertical line is


, where contains the
decision variables. The quality of the fit of a generic line
is measured via the sum of the squares of the error in the
component (blue dotted lines). Thus, the best
least-squares fit is obtained via the least-squares problem

Once the line is found, it can be used to predict the value of


the average number of customers buying the item ( ) for a
new price ( ). The prediction is shown in red.

See also:

• Auto-regressive models for time series prediction.


• The problem of Gauss.

NOMENCLATURE

• Feasible set
• What is a solution?
• Local vs. global optima

Consider the optimization problem

Feasible set
The feasible set of problem is defined as

A point is said to be feasible for problem if it belongs to the feasible set , that is, it satisfies the
constraints.

Example: In the toy optimization problem, the feasible set is the ‘‘box’’ in , described by
.

The feasible set may be empty if the constraints cannot be satisfied simultaneously. In this case, the problem
is said to be infeasible.

What is a solution?
In an optimization problem, we are usually interested in computing the optimal value of the objective
function, and also often a minimizer, which is a vector that achieves that value, if any.

Feasibility problems
Sometimes an objective function is not provided. This means that we are just interested in finding a feasible
point or determining that the problem is infeasible. By convention, we set to be a constant in that case,
to reflect the fact that we are indifferent to the choice of a point as long as it is feasible.

Optimal value
The optimal value of the problem is the value of the objective at optimum, and we denote it by :

Example: In the toy optimization problem, the optimal value is .

Optimal set
The optimal set (or the set of solutions) of the problem is defined as the set of feasible points for which
the objective function achieves the optimal value:

We take the convention that the optimal set is empty if the problem is not feasible.

A standard notation for the optimal set is via the notation:

A point is said to be optimal if it belongs to the optimal set. Optimal points may not exist, and the optimal
set may be empty. This can happen either because the problem is infeasible, or because the optimal value is
only attained in the limit.

Example: The problem

has no optimal points, as the optimal value is only attained in the limit .

If the optimal set is not empty, we say that the problem is attained.

Suboptimality
The suboptimal set is defined as

(With our notation, .) Any point in the suboptimal set is termed suboptimal.

This set allows us to characterize points that are close to being optimal (when is small). Usually, practical
algorithms are only able to compute suboptimal solutions, and never reach true optimality.

Example: Nomenclature of the two-dimensional toy problem.

Local vs. global optimal points


A point is locally optimal if there is a value such that is optimal for the problem

In other words, a local minimizer minimizes , but only over nearby points of the feasible set. The value of
the objective function at such a point is then not necessarily the (global) optimal value of the problem.

The term globally optimal (or, optimal for short) is used to distinguish points in the optimal set from local
minima.

Example: a function with local minima.

Many optimization algorithms may get trapped in local minima, a situation that is often a major
challenge in optimization problems.

STANDARD FORMS

• Functional form
• Epigraph form
• Other standard forms

Functional form
An optimization problem is a problem of the form

where

• is the decision variable;


• is the objective function, or cost;
• represent the constraints;
• is the optimal value.

In the above, the term ‘‘subject to’’ is sometimes replaced with the shorthand colon notation.

Often the above is referred to as a ‘‘mathematical program’’. The term “programming” (or “program”) does
not refer to a computer code. It is used mainly for historical purposes. We will use the more rigorous (but
less popular) term “optimization problem”.

Example: An optimization problem in two variables.

Epigraph form
In optimization, we can always assume that the objective is a linear function of the variables. This can be
done via the epigraph representation of the problem, which is based on adding a new scalar variable :

At optimum, . In the above, the objective function is , with values .

We can picture this as follows. Consider the sub-level sets of the objective function, which are of the
form for some . The problem amounts to finding the smallest for which the
corresponding sub-level set intersects the set of points that satisfy the constraints.

Example: Geometric view of the optimization problem in two variables.

Other standard forms


Sometimes we single out equality constraints, if any:

where the ‘s are given. Of course, we may reduce this problem to the standard form above, by representing
each equality constraint with a pair of inequalities.

Sometimes, the constraints are described abstractly via a set condition, of the form for some subset
of . The corresponding notation is

Some problems come in the form of maximization problems. Such problems are readily cast in standard
form via the expression

where .

A TWO-DIMENSIONAL TOY OPTIMIZATION


PROBLEM

As a toy example of an optimization problem in two variables, consider the problem

(Note that the term ‘‘subject to’’ has been replaced with the shorthand colon notation.)

The problem can be put in standard form

where:

● the decision variable is ;

● the objective function , takes values

● the constraint functions take values

● is the optimal value, which turns out to be .

● The optimal set is the singleton , with

Since the optimal set is not empty, the problem is attained.

We can represent the problem in epigraph form, as



Geometric view of the toy optimization problem above.


The level curves (curves of constant value) of the objective
function are shown. The problem amounts to finding the
smallest value of such that for some feasible
. The plot also shows the unconstrained minimum of the
objective function, located at . A
sub-optimal set for the toy problem above is shown (in a
darker color), for . This corresponds to the set of
feasible points that achieve an objective value less than or
equal to .

GLOBAL VS. LOCAL MINIMA

Minima of a nonlinear function. Local optimum is in


green and global optimum in red.

GRADIENT OF A FUNCTION

• Definition
• Composition rule with an affine function
• Geometric interpretation

The gradient of a differentiable function contains the first derivatives of the function with
respect to each variable. The gradient is useful to find the linear approximation of the function near a point.

Definition
The gradient of at , denoted , is the vector in given by

Examples:

• Distance function: The distance function from a point to another point is defined as

The function is differentiable, provided , which we assume. Then

• Log-sum-exp function: Consider the ‘‘log-sum-exp’’ function , with values



The gradient of at is

where . More generally, the gradient of the function with values

is given by

where , and .

Composition rule with an affine function


If is a matrix, and is a vector, the function with values

is called the composition of the affine map with . Its gradient is given by

Geometric interpretation
Geometrically, the gradient can be read on the plot of the level set of the function. Specifically, at any point
, the gradient is perpendicular to the level set and points outwards from the sub-level set (that is, it points
towards higher values of the function).

Level and sub-level sets of the function


with values

The gradient at a point (shown in red) is perpendicular to


the level set, and points outside the corresponding sub-level
set. The length of the gradient determines how fast the
function changes locally. (In the plot, the length of the gradient has
been scaled up by a factor of .)

SET OF SOLUTIONS TO THE


LEAST-SQUARES PROBLEM VIA QR
DECOMPOSITION

The set of solutions to the least-squares problem

where , and are given, can be expressed in terms of the full QR decomposition of :

where is upper triangular and invertible, is a permutation matrix, and is


and orthogonal.

Precisely we have , with a matrix whose columns span the nullspace of :

Proof: Since and are orthogonal, we have, with :

Exploiting the fact that leaves Euclidean norms invariant, we express the original least-squares problem in
the equivalent form:

Once the above is solved, and is found, we recover the original variable with .

Now let us decompose and in a manner consistent with the block structure of

with two -vectors. Then



which leads to the following expression for the objective function:

The optimal choice for the variables is to make the first term zero, which is achievable with

where is free and describes the ambiguity in the solution. The optimal residual is .

We are essentially done with , we can write

that is: , with



SAMPLE COVARIANCE MATRIX

• Definition
• Properties

Definition
For a vector , the sample variance measures the average deviation of its coefficients around
the sample average :

Now consider a matrix , where each column represents a data point in


. We are interested in describing the amount of variance in this data set. To this end, we look at the
numbers we obtain by projecting the data along a line defined by the direction . This corresponds
to the vector in .

The corresponding sample mean and variance are

where is the sample mean of the vectors x_1, ..., x_m.

The sample variance along direction can be expressed as a quadratic form in :

where is a symmetric matrix, called the sample covariance matrix of the data points:

Properties
The covariance matrix satisfies the following properties:

• The sample covariance matrix allows finding the variance along any direction in data space.
• The diagonal elements of give the variances of each vector in the data.
• The trace of gives the sum of all the variances.
• The matrix is positive semi-definite, since the associated quadratic form is non-
negative everywhere.
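These properties are easy to check numerically. The sketch below, on hypothetical data and with the 1/m normalization assumed for the covariance, verifies that the variance of the projected data equals the quadratic form defined by the sample covariance matrix.

import numpy as np

# Hypothetical data: m = 5 points in R^3, stored as the columns of A.
np.random.seed(0)
A = np.random.rand(3, 5)

xbar = A.mean(axis=1)              # sample mean of the data points
Xc = A - xbar[:, None]             # centered data
Sigma = Xc @ Xc.T / A.shape[1]     # sample covariance matrix (1/m normalization assumed)

# Variance of the projections along a direction u equals the quadratic form u^T Sigma u.
u = np.array([1.0, 0.0, 0.0])
proj = u @ A
print(np.allclose(proj.var(), u @ Sigma @ u))   # True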

OPTIMAL SET OF LEAST-SQUARES VIA


SVD

Theorem: optimal set of ordinary least-squares

The optimal set of the OLS problem

can be expressed as

where is the pseudo-inverse of , and is the minimum-norm point in the optimal set. If is full column
rank, the solution is unique, and equal to

Proof: The following proof relies on the SVD of , and the rotational invariance of the Euclidean norm.

Optimal value of the problem

Using the SVD we can find the optimal set to the least-squares optimization problem

Indeed, if is an SVD of , the problem can be written

where we have exploited the fact that the Euclidean norm is invariant under the orthogonal transformation
. With , and , and changing the variable to , we express the above as

Expanding the terms, and using the partitioned notations , , we obtain

Since is invertible, we can reduce the first term in the objective to zero with the choice .
Hence the optimal value is

We observe that the optimal value is zero if and only if , which is exactly the same as
.

Optimal set

Let us detail the optimal set for the problem. The variable is partly determined, via its first components:
. The remaining variables contained in are free, as does not appear in the
objective function of the above problem.

Thus, optimal points are of the form , with , , and free.

To express this in terms of the original SVD of , we observe that means that

where is partitioned as , with and . Similarly, the


vector can be expressed as , with formed with the first columns of . Thus, any element
in the optimal set is of the form

where . (We will soon explain the acronym appearing in the subscript.) The free
components correspond to the degrees of freedom allowed to by the nullspace of .

Minimum-norm optimal point

The particular solution to the problem, , is the minimum-norm solution, in the sense that it is the
element of that has the smallest Euclidean norm. This is best understood in the space of -variables.

Indeed, the particular choice corresponds to the element in the optimal set that has the
smallest Euclidean norm, since the norm of is the same as that of its rotated version, . The first
elements in are fixed, and since , we see that the minimal norm is obtained
with .

Optimal set via the pseudo-inverse

The matrix , which appears in the expression of the particular solution mentioned above,
is nothing else than the pseudo-inverse of , which is denoted . Indeed, we can express the pseudo-inverse
in terms of the SVD as

With this convention, the minimum-norm optimal point is . Recall that the last columns of
form a basis for the nullspace of . Hence the optimal set of the problem is

When is full column rank , the optimal set reduces to a singleton (a set with
only one element), as the nullspace is . The unique optimal point can then be expressed as

PSEUDO-INVERSE OF A MATRIX

The pseudo-inverse of a matrix is a matrix that generalizes to arbitrary matrices the notion
of inverse of a square, invertible matrix. The pseudo-inverse can be expressed from the singular value
decomposition (SVD) of , as follows.

Let the SVD of be

where are both orthogonal matrices, and is a diagonal matrix containing the (positive) singular
values of on its diagonal.

Then the pseudo-inverse of is the matrix defined as

Note that has the same dimension as the transpose of .

This matrix has many useful properties:

● If is full column rank, meaning , that is, is not singular, then is


a left inverse of , in the sense that . We have the closed form expression

● If is full row rank, meaning , that is, is not singular, then is a


right inverse of , in the sense that . We have the closed-form expression

● If is square, invertible, then its inverse is .

● The solution to the least-squares problem

with minimum norm is .



Example: pseudo inverse of a matrix
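The properties above can be checked numerically; a minimal sketch on a hypothetical full-column-rank matrix, using NumPy's pinv (which is computed from the SVD):

import numpy as np

# Hypothetical full-column-rank matrix.
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

A_pinv = np.linalg.pinv(A)                                 # pseudo-inverse via the SVD
print(np.allclose(A_pinv @ A, np.eye(2)))                  # left inverse (full column rank)
print(np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T))   # closed-form expression

# Minimum-norm least-squares solution of A x ~ y.
y = np.array([1.0, 1.0, 1.0])
print(A_pinv @ y)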



SVD: A 4X4 EXAMPLE

Consider a matrix in , with SVD given by

where

From the SVD, we can understand the behavior of the mapping :

• Input components along directions corresponding to and are amplified (by factors of 10 and 7,
respectively) and come out mostly along the plane spanned by and .
• Input components along directions corresponding to and are attenuated (by factors of 0.1 and
0.05, respectively).
• The matrix is nonsingular.
• For some applications, it might be appropriate to consider as effectively rank 2, given the significant
attenuation for components along and .

SINGULAR VALUE DECOMPOSITION OF A


4 X 5 MATRIX

Consider the matrix

A singular value decomposition of this matrix is given by , with

Notice above that has non-zero values only on its diagonal, and can be written as

with . The rank of (which is the number of non-zero elements on the


diagonal matrix ) is thus . We can check that , and
.
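The computation can be reproduced with NumPy. Since the numerical entries of this example are not repeated here, the sketch below uses a placeholder 4 x 5 matrix of rank 3 with the same shape:

import numpy as np

# Placeholder 4x5 matrix of rank 3 (not the matrix of the example).
np.random.seed(1)
A = np.random.rand(4, 3) @ np.random.rand(3, 5)

U, s, Vt = np.linalg.svd(A)            # full SVD: A = U S V^T
print(s)                               # singular values, in decreasing order
print(np.sum(s > 1e-10))               # numerical rank: number of non-zero singular values

# Orthogonality of the factors and reconstruction of A.
print(np.allclose(U @ U.T, np.eye(4)), np.allclose(Vt @ Vt.T, np.eye(5)))
S = np.zeros((4, 5))
S[:4, :4] = np.diag(s)
print(np.allclose(U @ S @ Vt, A))      # True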

REPRESENTATION OF A TWO-VARIABLE
QUADRATIC FUNCTION

The quadratic function , with values

can be represented via a symmetric matrix, as

In short:

where is the vector , and



EDGE WEIGHT MATRIX OF A GRAPH

A symmetric matrix is a way to describe a weighted, undirected graph: each edge in the graph is assigned a
weight . Since the graph is undirected, the edge weight is independent of the direction (from to or
vice-versa). Hence, is symmetric.

To the graph in the figure, we can associate the corresponding


undirected graph, obtained by ignoring the direction of the
arrows. Assuming that all the edges have the same weight, the
undirected graph has the edge weight matrix given by

See also: Arc-node incidence matrix of a graph.



NETWORK FLOW

We describe a flow (of goods, traffic, charge, information, etc.) across the network as a vector ,
which describes the amount flowing through any given arc. By convention, we use positive values when the
flow is in the direction of the arc, and negative ones in the opposite case.

The incidence matrix of the network, denoted by , helps represent the relationship between nodes and
arcs. For a given node , the total flow leaving it can be calculated as (remember our convention that the
index spans the arcs)

where is our notation for the th component of vector .

Now, we define the external supply as a vector . Here, a negative represents an external demand
at node , and a positive denotes a supply. We make the assumption that the total supply equals the total
demand, implying

The balance equations for the supply vector are given by . These equations represent
constraints the flow vector must satisfy to meet the external supply/demand represented by .

See also: Incidence matrix of a network.



LAPLACIAN MATRIX OF A GRAPH

Another important symmetric matrix associated with a graph is the Laplacian matrix. This is the matrix
, with as the arc-node incidence matrix. It can be shown that the element of the
Laplacian matrix is given by

See also:

• Arc-node incidence matrix of a graph.


• Edge-weight matrix of a graph.

HESSIAN OF A FUNCTION

• Definition
• Examples

Definition
The Hessian of a twice-differentiable function at a point is the matrix
containing the second derivatives of the function at that point. That is, the Hessian is the matrix with
elements given by

The Hessian of at is often denoted .

The second derivative is independent of the order in which derivatives are taken. Hence, for
every pair . Thus, the Hessian is a symmetric matrix.

Examples

Hessian of a quadratic function

Consider the quadratic function

The Hessian of at is given by



For quadratic functions, the Hessian is a constant matrix, that is, it does not depend on the point at which
it is evaluated.

Hessian of the log-sum-exp function

Consider the ‘‘log-sum-exp’’ function , with values

The gradient of at is

where , . The Hessian is given by

More generally, the Hessian of the function with values

is as follows.

● First the gradient at a point is (see here):

where , and .

● Now the Hessian at a point is obtained by taking derivatives of each component of the gradient.
If is the -th component, that is,

then

and, for :

More compactly:

HESSIAN OF A QUADRATIC FUNCTION

For quadratic functions, the Hessian (matrix of second-derivatives) is a constant matrix, that is, it does not
depend on the variable .

As a specific example, consider the quadratic function

The Hessian is given by



GRAM MATRIX

Consider -vectors . The Gram matrix of the collection is the matrix with
elements . The matrix can be expressed compactly in terms of the matrix
, as

By construction, a Gram matrix is always symmetric, meaning that for every pair . It is
also positive semi-definite, meaning that for every vector (this comes from the identity
).

Assume that each vector is normalized: . Then the coefficient can be expressed as

where is the angle between the vectors and . Thus is a measure of how similar and are.

The matrix arises for example in text document classification, with a measure of similarity between
the th and th document, and their respective bag-of-words representation (normalized to have
Euclidean norm 1).
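A minimal numerical sketch, on a hypothetical collection of vectors, of the Gram matrix and the properties just mentioned (symmetry, positive semi-definiteness, and the cosine interpretation after normalization):

import numpy as np

# Hypothetical collection of 4 vectors in R^3, stored as the columns of A.
A = np.array([[1.0, 0.0, 1.0, 2.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0]])

G = A.T @ A                                       # Gram matrix: G[i, j] = x_i^T x_j
print(np.allclose(G, G.T))                        # symmetric
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))    # positive semi-definite

# With normalized columns, G[i, j] is the cosine of the angle between x_i and x_j.
An = A / np.linalg.norm(A, axis=0)
print(An.T @ An)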

See also:

• Bag-of-words representation of text.


• Bag-of-words representation of text: measure of document similarity.

QUADRATIC FUNCTIONS IN TWO


VARIABLES

Two examples of quadratic functions are , with values

The function

is a quadratic form, since it has no linear or constant terms in it.

Level sets and graph of the quadratic


function . The epigraph is anything
that extends above the graph in the
axis direction. This function is ‘‘bowl-
shaped’’, or convex.

Level sets and graph of the quadratic function .


This quadratic function is not convex.

QUADRATIC APPROXIMATION OF THE


LOG-SUM-EXP FUNCTION

As seen here, the log-sum-exp function , with values

admits the following gradient and Hessian at a point :

where , .

Hence, the quadratic approximation of the log-sum-exp function at a point is given by
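For the plain log-sum-exp function (that is, with the identity map in place of a general affine one), the gradient is the vector z of normalized exponentials and the Hessian is diag(z) - z z^T. The sketch below, on hypothetical data, checks the gradient by finite differences and evaluates the resulting quadratic approximation at a nearby point:

import numpy as np

def lse(x):
    return np.log(np.sum(np.exp(x)))

def lse_grad_hess(x):
    z = np.exp(x) / np.sum(np.exp(x))     # gradient: normalized exponentials
    H = np.diag(z) - np.outer(z, z)       # Hessian
    return z, H

x0 = np.array([0.3, -0.5, 1.2])
g, H = lse_grad_hess(x0)

# Check the gradient against central finite differences.
eps = 1e-6
g_fd = np.array([(lse(x0 + eps * e) - lse(x0 - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(g, g_fd, atol=1e-6))    # True

# Quadratic approximation around x0, evaluated at a nearby point.
d = np.array([0.01, -0.02, 0.005])
print(lse(x0 + d), lse(x0) + g @ d + 0.5 * d @ H @ d)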



DETERMINANT OF A SQUARE MATRIX

• Definition
• Important result
• Some properties

Definition
The determinant of a square, matrix , denoted , is defined by an algebraic formula of the
coefficients of . The following formula for the determinant, known as Laplace’s expansion formula, allows
us to compute the determinant recursively:

where is the matrix obtained from by removing the -th row and first column.
(The first column does not play a special role here: the determinant remains the same if we use any other
column.)

The determinant is the unique function of the entries of such that

1. .

2. is a linear function of any column (when the others are fixed).

3. changes sign when two columns are permuted.

There are other expressions of the determinant, including the Leibniz formula (proven here):

where denotes the set of permutations of the integers . Here, denotes the sign
of the permutation , which equals 1 if the number of pairwise exchanges required to transform
into is even, and -1 otherwise.

Important result
An important result is that a square matrix is invertible if and only if its determinant is not zero. We use this
key result when introducing eigenvalues of symmetric matrices.

Geometry

The determinant of a matrix with columns


is the volume of the parallelepiped defined
by the vectors . (Source: Wikipedia.) Hence the
determinant is a measure of scale that quantifies how the
linear map associated with , changes
volumes.

In general, the absolute value of the determinant of a matrix is the volume of the parallelepiped

This is consistent with the fact that when is not invertible, its columns define a parallelepiped of zero
volume.

Determinant and inverse


The determinant can be used to compute the inverse of a square, full-rank (that is, invertible) matrix : the
inverse has elements given by

where is a matrix obtained from by removing its -th row and -th column. For example, the
determinant of a matrix

is given by

It is indeed the area of the parallelogram defined by the columns of , . The
inverse is given by

Some properties

Determinant of triangular matrices

If a matrix is square, triangular, then its determinant is simply the product of its diagonal coefficients. This
comes right from Laplace’s expansion formula above.

Determinant of transpose

The determinant of a square matrix and that of its transpose are equal.

Determinant of a product of matrices

For two invertible square matrices, we have

In particular:

This also implies that for an orthogonal matrix , that is, a matrix with , we have
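The properties listed above (triangular, transpose, product, and orthogonal matrices) are easy to verify numerically; a minimal NumPy sketch on random matrices:

import numpy as np

np.random.seed(2)
A = np.random.rand(3, 3)
B = np.random.rand(3, 3)

# Determinant of a triangular matrix: product of its diagonal entries.
T = np.triu(A)
print(np.isclose(np.linalg.det(T), np.prod(np.diag(T))))

# det(A^T) = det(A) and det(AB) = det(A) det(B).
print(np.isclose(np.linalg.det(A.T), np.linalg.det(A)))
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))

# An orthogonal matrix has determinant +1 or -1 (here, Q from a QR decomposition).
Q, _ = np.linalg.qr(A)
print(np.isclose(abs(np.linalg.det(Q)), 1.0))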

Determinant of block matrices

As a generalization of the above result, we have, for three compatible blocks :



A more general formula is



A SQUARED LINEAR FUNCTION

A squared linear function is a quadratic function of the form

for some vector .

The function vanishes on the space orthogonal to , which is the hyperplane defined by the single linear
equation . Thus, in effect this function is really one-dimensional: it varies only along the direction
.

Level sets and graph of a dyadic quadratic function, corresponding


to the vector . The function is constant along
hyperplanes orthogonal to .

EIGENVALUE DECOMPOSITION OF A
SYMMETRIC MATRIX

Let

We solve for the characteristic equation:

Hence the eigenvalues are , . For each eigenvalue , we look for a unit-norm vector such
that . For , we obtain the equation in

which leads to (after normalization) an eigenvector . Similarly for we obtain the


eigenvector . Hence, admits the SED (Symmetric Eigenvalue Decomposition)

See also: Sums-of-squares for a quadratic form.
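The same decomposition can be computed numerically with NumPy's eigh. Since the matrix of this example is not reproduced here, the sketch uses a placeholder symmetric 2 x 2 matrix:

import numpy as np

# Placeholder symmetric matrix (not the one used in the example above).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, U = np.linalg.eigh(A)     # eigenvalues (in increasing order) and orthonormal eigenvectors
print(lam)                     # [1. 3.]
print(np.allclose(U @ np.diag(lam) @ U.T, A))    # A = U diag(lam) U^T
print(np.allclose(U.T @ U, np.eye(2)))           # the eigenvectors are orthonormal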



RAYLEIGH QUOTIENTS

Theorem

For a symmetric matrix , we can express the smallest and largest eigenvalues, and , as

Proof: The proof of the expression above derives from the SED of the matrix, and the invariance of the
Euclidean norm constraint under orthogonal transformations. We show this only for the largest eigenvalue;
the proof for the expression for the smallest eigenvalue follows similar lines. Indeed, with , we
have

Now we can define the new variable , so that , and express the problem as

Clearly, the maximum is less than . That upper bound is attained, with for an index such that
, and for . This proves the result. This corresponds to setting ,
where is the eigenvector corresponding to .
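A quick numerical illustration of the theorem: sampling the Rayleigh quotient at many random points shows that it always lies between the two extreme eigenvalues, and that the bounds are attained at the corresponding eigenvectors (the matrix is hypothetical):

import numpy as np

np.random.seed(3)
B = np.random.rand(4, 4)
A = (B + B.T) / 2                    # a symmetric matrix

lam, U = np.linalg.eigh(A)

# Rayleigh quotient x^T A x / x^T x at many random points.
X = np.random.randn(4, 10000)
q = np.einsum('ij,ij->j', X, A @ X) / np.einsum('ij,ij->j', X, X)
print(q.min(), q.max())              # bracketed by the extreme eigenvalues
print(lam[0], lam[-1])               # lambda_min and lambda_max
print(U[:, -1] @ A @ U[:, -1])       # the maximum is attained at the top eigenvector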

LARGEST SINGULAR VALUE NORM OF A


MATRIX

For a matrix , we define the largest singular value (or, LSV) norm of to be the quantity

This quantity satisfies the conditions to be a norm (see here). The reason why this norm is called this way is
given here.

The LSV norm can be computed as follows. Let us square the above. We obtain a representation of the
squared LSV norm as a Rayleigh quotient of the matrix :

This shows that the squared LSV norm is the largest eigenvalue of the (positive semi-definite) symmetric
matrix , which is denoted . That is:

NULLSPACE OF A 4X5 MATRIX VIA ITS SVD

We return to this example, involving a matrix with row size and column size , and of rank
. The nullspace is the span of the last columns of the matrix :

with

We can check that .



RANGE OF A 4X5 MATRIX VIA ITS SVD

Returning to this example, the Frobenius norm is the square root of the sum of the squares of the elements,
and is equal to

The largest singular value norm is simply . Thus, the Euclidean norm of the output cannot
exceed four times that of the input .

LOW-RANK APPROXIMATION OF A 4X5


MATRIX VIA ITS SVD

Returning to this example, involving a matrix with row size and column size :

As seen here, the SVD is given by , with

The matrix is rank . A rank-two approximation is given by zeroing out the smallest singular value,
which produces

We check that the Frobenius norm of the error is the square root of the sum of squares of the singular
values that were zeroed out, which here reduces to the single value :
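The check can be reproduced numerically; since the entries of the example are not repeated here, the sketch below uses a placeholder 4 x 5 matrix of rank 3:

import numpy as np

# Placeholder 4x5 matrix of rank 3 (not the matrix of the example).
np.random.seed(4)
A = np.random.rand(4, 3) @ np.random.rand(3, 5)

U, s, Vt = np.linalg.svd(A)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation

# Frobenius norm of the error vs. root-sum-of-squares of the zeroed-out singular values.
print(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))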

PSEUDO-INVERSE OF A 4X5 MATRIX VIA


ITS SVD

Returning to this example, the pseudo-inverse of the matrix

can be computed via an SVD: , with

as follows.

We first invert , simply “inverting what can be inverted” and leaving zero values alone. We get

Then the pseudo-inverse is obtained by exchanging the roles of and in the SVD:

See also: This example.



PART VIII
APPLICATIONS

IMAGE COMPRESSION VIA


LEAST-SQUARES

We can use least-squares to represent an image in terms of a linear combination of ‘‘basic’’ images, at least
approximately.

An image can be represented, via its pixel values or some other mechanism, as a (usually long) vector
. Now assume that we have a library of ‘‘basic’’ images, also in pixel form, that are represented
as vectors . Each vector could contain the pixel representation of a unit vector on some basis,
such as a two-dimensional Fourier basis; however, we do not necessarily assume here that the ‘s form a
basis.

Let us try to find the best coefficients , which allow approximating the given image (given
by ) as a linear combination of the ‘s with coefficients . Such a combination can be expressed
as the matrix-vector product , where is the matrix that contains the basic
images. The best fit can be found via the least-squares problem

Once the representation is found, and if the optimal value of the problem above is small, we can safely
represent the given image via the vector . If the vector is sparse, in the sense that it has many zeros, such a
representation can yield a substantial saving in memory, over the initial pixel representation .

The sparsity of the solution of the problem above for a range of possible images is highly dependent on the
images’ characteristics, as well as on the collection of basic images contained in . In practice, it is desirable
to trade off the accuracy measure above against some measure of the sparsity of the optimal vector .
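A minimal sketch of the least-squares step on synthetic data (all sizes and values below are hypothetical); the sparsity trade-off mentioned above is not addressed here:

import numpy as np

# Hypothetical setup: an image with 64 pixels and a dictionary of 20 basic images.
np.random.seed(5)
n_pixels, n_basic = 64, 20
A = np.random.rand(n_pixels, n_basic)     # columns are the basic images
x = np.random.rand(n_pixels)              # the target image

# Best coefficients in the least-squares sense: minimize the norm of A z - x.
z, *_ = np.linalg.lstsq(A, x, rcond=None)
print(np.linalg.norm(A @ z - x))          # residual: how well the dictionary explains x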

SENATE VOTING DATA MATRIX.

The data consists of the votes of Senators in the US Senate (2004-2006), for a total
of bills. “Yay” (“Yes”) votes are represented as 1‘s, “Nay” (“No”) as -1‘s, and the other votes are
recorded as 0. (A number of complexities are ignored here, such as the possibility of pairing the votes.)

This data can be represented here as a ‘‘voting’’ matrix

with elements taken from . Each column of the voting matrix contains the
votes of a single Senator for all the bills; each row contains the votes of all Senators on a particular bill.

Senate voting matrix: “Nay” votes are in black,


“Yay” ones in white, and the others in grey. The
transpose voting matrix is shown. The picture has
many gray areas, as some Senators are replaced over
time. Simply plotting the raw data matrix is often not
very informative.

SENATE VOTING ANALYSIS AND


VISUALIZATION.

In this case study, we take data from the votes on bills in the US Senate (2004-2006), shown as a table above,
and explore how we can visualize the data by projecting it, first on a line then on a plane. We investigate how
we can choose the line or plane in a way that maximizes the variance in the result, via a principal component
analysis method. Finally, we examine how a variation on PCA that encourages sparsity of the projection
directions allows us to understand which bills are most responsible for the variance in the data.

• Senate voting data and the visualization problem


• Projection on a line
• Projection on a plane
• Maximum-variance projections
• PCA
• Sparse PCA

Senate voting data and the visualization problem.

Data

The data consists of the votes of Senators in the US Senate (2004-2006), for a total
of bills. “Yay” (“Yes”) votes are represented as ‘s, “Nay” (“No”) as ‘s, and the other votes are
recorded as . (A number of complexities are ignored here, such as the possibility of pairing the votes.)

This data can be represented here as a ‘‘voting’’ matrix , with elements taken
from . Each column of the voting matrix contains the votes of a single Senator
for all the bills; each row contains the votes of all Senators on a particular bill.

Senate voting matrix: “Nay” votes are in black, “Yay” ones in white, and the others in grey. The transpose
voting matrix is shown. The picture has many gray areas, as some Senators are replaced over time.
Simply plotting the raw data matrix is often not very informative.

Visualization Problem

We can try to visualize the data set, by projecting each data point (each row or column of the matrix) on (say)
a 1D-, 2D- or 3D-space. Each ‘‘view’’ corresponds to a particular projection, that is, a particular one-, two-
or three-dimensional subspace on which we choose to project the data. The visualization problem consists
of choosing an appropriate projection.

There are many ways to formulate the visualization problem, and none dominates the others. Here, we focus
on the basics of that problem.

Projection on a line and a plane

To simplify, let us first consider the simple problem of representing the high-dimensional data set on a simple
line, using the method described here.

BEER-LAMBERT LAW IN ABSORPTION


SPECTROMETRY

The Beer-Lambert law in optics is an empirical relationship that relates the absorption of light by a material,
to the properties of the material through which the light is traveling. This is the basis of absorption
spectrometry, which allows us to measure the concentration of different gases in a chamber.

The principle of an absorption spectrometer, illustrated on


the left, is as follows. Consider the following two
experiments. First, we do a control experiment, where we
illuminate from one side a container containing some
reference gas with light at a certain frequency. We measure
the light intensity (say, ) at the other side of the container.
Then, we add some other gas to the container, repeat the
experiment, and measure the light intensity again (say,
). Depending on the absorption properties, as well as the
concentration , of the added gas, the light will be more or
less absorbed with respect to the reference situation.

The Beer-Lambert law postulates that the log-ratio


is linear in the concentration . In other words,
, where the constant depends
on the light frequency and on the gas.

If the container has a mixture of ‘‘pure’’ gases in it, the law postulates that the logarithm of the ratio
of the light intensities is a linear function of the concentrations of each gas in the mix. The log-ratio of
intensities is thus of the form for some vector , where is the vector of concentrations.
The coefficients , correspond to the log-ratio of light intensities when (the -th
vector of the standard basis, which corresponds to the -th pure gas). The quantity is called the coefficient
of absorption of the -th gas and can be measured in the laboratory.

See also: Absorption spectrometry: using measurements at different light frequencies.



ABSORPTION SPECTROMETRY: USING


MEASUREMENTS AT DIFFERENT LIGHT
FREQUENCIES.

Return to the absorption spectrometry setup described here.

The Beer-Lambert law postulates that the logarithm of the ratio of the light intensities is a linear function of
the concentrations of each gas in the mix. The log-ratio of intensities is thus of the form for some
vector , where is the vector of concentrations, and the vector contains the coefficients of
absorption of each gas. This vector is actually also a function of the frequency of the light we illuminate the
container with.

Now consider a container having a mixture of “pure” gases in it. Denote by the vector of
concentrations of the gases in the mixture. We illuminate the container at different frequencies
. For each experiment, we record the corresponding log-ratio , , of the intensities. If the
Beer-Lambert law is to be believed, then we must have

for some vectors , which contain the coefficients of absorption of the gases at light frequency .

More compactly:

where

Thus, is the coefficient of absorption of the -th gas at frequency .

Since ‘s correspond to “pure” gases, they can be measured in the laboratory. We can then use the above
model to infer the concentration of the gases in a mixture, given some observed light intensity log-ratio.

See also: Absorption spectrometry: the Beer-Lambert law



SIMILARITY OF TWO DOCUMENTS

Returning to the bag-of-words example, we can use the notion of angle to measure how two different
documents are close to each other.

Given two documents, and a pre-defined list of words appearing in the documents (the dictionary), we can
compute the vectors of frequencies of the words as they appear in the documents. The angle between
the two vectors is a widely used measure of closeness (similarity) between documents.
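A minimal sketch of the computation on two hypothetical word-count vectors; the cosine of the angle is the usual similarity score:

import numpy as np

# Hypothetical word counts for two documents over the same dictionary.
d1 = np.array([3.0, 1.0, 0.0, 2.0])
d2 = np.array([2.0, 0.0, 1.0, 2.0])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
print(cos, angle)   # a cosine close to 1 (small angle) means similar documents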

See also:

• Bag-of-words representation of text.


• Gram matrix.

IMAGE COMPRESSION

In image compression applications, we are given the pixel


representation of a “target” image, as a vector in .
We would like to represent the image as a linear combination
of “basic” images , , where the matrix
is called the dictionary.
The picture on the left shows a total of “basic” images.

To represent the image in terms of the dictionary, we would like to find coefficients , such
that

or, more compactly, .

If the representation of the image (that is, the vector ) has many zeros (we say: is sparse), then we can
represent the entire image with only a few values (the components of that are not zero). We may then,
for example, send the image over a communication network at high speed. Provided the receiver has the
dictionary handy, it can reconstruct perfectly the image.

In practice, it may be desirable to trade off the sparsity of the representation (via ) against the accuracy of
the representation. Namely, we may prefer a representation that achieves only approximately
but has way more zeros. The process of searching for a good sparsity/accuracy trade-off is called image
compression.

TEMPERATURES AT DIFFERENT AIRPORTS

Representing temperatures at different airports as a


vector
We record the temperatures at three different airports and obtain the following table.
airport   Temperature (°F)
SFO       55
ORD       32
JFK       43

We can represent the temperatures on a single temperature


axis. This is known as the dot representation of a vector.

The dot representation could become very confusing if there were three temperatures to plot for each day in
the year.

Alternatively, we can view the triple of temperatures as


a point in a three-dimensional space. Each axis corresponds
to temperatures at a specific location. The vector
representation is still legible if we have more than one triplet
of temperatures to represent. The vector representation
cannot be viewed in more than three dimensions, that is if
we have more than three cities involved. However, the data
can still be represented as a collection of vectors.

NAVIGATION BY RANGE MEASUREMENT

In the plane, we measure the distances of an object located at an unknown position from points
with known coordinates . The distance vector is a non-linear
function of , given by

Now assume that we have obtained the position of the object at a given time and seek to predict
the change in position that is consistent with observed small changes in the distance vector .

We can approximate the non-linear functions via the first-order (linear) approximation. A linearized
model around a given point is , with a matrix with elements
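Each row of that matrix is the gradient of one distance function, that is, the unit-norm vector pointing from the corresponding known point towards the current position. A minimal sketch on hypothetical anchor points compares the linearized prediction with the actual change in distances:

import numpy as np

# Hypothetical known points (one per row) and current position estimate in the plane.
P = np.array([[0.0, 0.0],
              [10.0, 0.0],
              [0.0, 10.0],
              [10.0, 10.0]])
x0 = np.array([3.0, 4.0])

def distances(x):
    return np.linalg.norm(x - P, axis=1)

# Jacobian of the distance vector at x0: row i is (x0 - p_i)^T / ||x0 - p_i||.
J = (x0 - P) / distances(x0)[:, None]

# Small change in position and the corresponding change in distances.
dx = np.array([0.05, -0.02])
print(distances(x0 + dx) - distances(x0))   # actual change
print(J @ dx)                               # first-order (linearized) prediction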

BAG-OF-WORDS REPRESENTATION OF
TEXT

Consider the following text:

A (real) vector is just a collection of real numbers, referred to as the components (or, elements) of the vector; denotes
the set of vectors with elements. If denotes a vector, we use subscripts to denote elements, so that is the -th
component of . Vectors are arranged in a column, or a row. If is a column vector, denotes the corresponding row
vector, and vice-versa.

The row vector contains the number of times each word in the list {vector, of, the} appears
in the above paragraph. Vectors can thus be used to represent text documents. The representation, often
referred to as the bag-of-words representation, is not faithful, as it ignores the respective order of appearance
of the words. In addition, stop words (such as the or of) are often ignored.

See also: Bag-of-words representation of text: measure of document similarity.



BAG-OF-WORDS REPRESENTATION OF
TEXT: MEASURE OF DOCUMENT
SIMILARITY

Returning to the bag-of-words example, we can use the notion of angle to measure how two different
documents are close to each other.

Given two documents, and a pre-defined list of words appearing in the documents (the dictionary), we can
compute the vectors of frequencies of the words as they appear in the documents. The angle between
the two vectors is a widely used measure of closeness (similarity) between documents.

See also:

• Bag-of-words representation of text.


• Gram matrix.

RATE OF RETURN OF A FINANCIAL


PORTFOLIO

• Rate of return of a single asset


• Log-returns
• Rate of return of a portfolio

Rate of return of a single asset


The rate of return (or the return) of a financial asset over a given period (say, a year, or a day) is the interest
obtained at the end of the period by investing in it. In other words, if, at the beginning of the period, we
invest a sum in the asset, we will earn at the end. That is:

Log-returns
Often, the rates of return are approximated, especially if the period length is small. If , then

with the latter quantity known as log-return.

Rate of return of a portfolio


For assets, we can define the vector , with the rate of return of the -th asset.

Assume that at the beginning of the period, we invest a sum in all the assets, allocating a fraction
(in ) in the -th asset. Here is a non-negative vector which sums to one. Then the portfolio we
constituted this way will earn

The rate of return of the portfolio is the relative increase in wealth:

The rate of return is thus the scalar product between the vector of individual returns and of the portfolio
allocation weights .

Note that, in practice, rates of return are never known in advance, and they can be negative (although, by
construction, they are never less than ).

SINGLE FACTOR MODEL OF FINANCIAL


PRICE DATA

Consider a data matrix which contains the log-returns of assets over time periods (say, days).

A single-factor model for this data is one based on the assumption that the matrix is a dyad:

where , and . In practice, no component of and is zero (if that is not the case, then a
whole row or column of is zero and can be ignored in the analysis).

According to the single factor model, the entire market behaves as follows. At any time , the
log-return of asset is of the form .

The vectors and have the following interpretation.

• For any asset, the rate of change in log-returns between two time instants is given by the ratio
, independent of the asset. Hence, gives the time profile for all the assets: every asset shows the
same time profile, up to a scaling given by .
• Likewise, for any time , the ratio between the log-returns of two assets and at time is given by
, independent of . Hence gives the asset profile for all the time periods. Each time shows the
same asset profile, up to a scaling given by .

While single-factor models may seem crude, they often offer a reasonable amount of information. It turns
out that with many financial market data, a good single factor model involves a time profile equal to the
log-returns of the average of all the assets, or some weighted average (such as the SP 500 index). With this
model, all assets follow the profile of the entire market.
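The following sketch (synthetic data, not the book's code) illustrates one common way to fit a single-factor model in practice: take the leading term of the singular value decomposition of the log-return matrix, which is its best rank-one approximation.

import numpy as np

rng = np.random.default_rng(0)
n_assets, n_days = 5, 250
market = 0.01 * rng.standard_normal(n_days)            # common time profile v
betas = rng.uniform(0.5, 1.5, size=n_assets)           # per-asset scalings u
A = np.outer(betas, market) + 1e-4 * rng.standard_normal((n_assets, n_days))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
u, v = U[:, 0] * s[0], Vt[0, :]                        # single-factor model A ~ u v^T
rel_err = np.linalg.norm(A - np.outer(u, v)) / np.linalg.norm(A)
print(f"relative error of the rank-one model: {rel_err:.3e}")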

THE PROBLEM OF GAUSS

In the early 1800s, astronomers had just discovered a new planetoid, Ceres, when the object became impossible to track due to the glare of the sun. Surely the object would reappear sometime soon, but where to look in the vast sky? The problem of predicting the location of the object, based on location data gathered during the past 40 days, became the challenge of the day for the astronomy community.

The problem raised the interest of a young mathematician, Carl


Friedrich Gauss, who had invented the method of least-squares at age
18. He used it (together with several other approximations) to predict
the location of Ceres. His computations were so accurate as to allow
the astronomer Franz Xaver von Zach to quickly relocate Ceres. The
figure on the left shows a draft of Gauss’ orbital map. For more on this
story, see here.

Note that there is some controversy as to the identity of the inventor of the least-squares method: Legendre published it earlier than Gauss (in 1805), but Gauss claimed to have discovered it years before.

CONTROL OF A UNIT MASS

Consider the problem of transferring a unit mass at rest, sliding on a plane, from a point to another at a unit distance. We can exert a constant force of magnitude $u_i$ on the mass during the time interval $[i-1, i)$, $i = 1, \ldots, n$.

Denoting by $y \in \mathbb{R}^2$ the position and velocity at the final instant $T = n$, we can express via Newton's law the relationship between the force vector $u = (u_1, \ldots, u_n)$ and the position/velocity vector $y$ as $y = Au$, where $A \in \mathbb{R}^{2 \times n}$.

Now assume that we would like to find the smallest-norm (in the Euclidean sense) force that puts the mass at $y = (1, 0)$ (unit distance away, at rest) at the final time. This is the problem of finding the minimum-norm solution to the equation $Au = y$. The solution is

$$u^* = A^T (A A^T)^{-1} y.$$
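The sketch below illustrates the minimum-norm formula on an assumed discretization (a unit mass, n = 10 unit-length time intervals, and a 2-vector y collecting final position and velocity); the exact matrix A used in the text may differ, so treat the numbers as illustrative only.

import numpy as np

n = 10
# Newton's law for this assumed setup: force u_i on [i-1, i) contributes
# (n - i + 1/2) to the final position and 1 to the final velocity.
A = np.vstack([n - np.arange(1, n + 1) + 0.5, np.ones(n)])
y = np.array([1.0, 0.0])                  # end at unit distance, at rest

u = A.T @ np.linalg.solve(A @ A.T, y)     # minimum-norm solution A^T (A A^T)^{-1} y
print("forces:", np.round(u, 4))
print("final position/velocity:", A @ u)  # recovers y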

PORTFOLIO OPTIMIZATION VIA LINEARLY CONSTRAINED LEAST-SQUARES

We consider a universe of $n$ financial assets, in which we seek to invest over one time period. We denote by $r \in \mathbb{R}^n$ the vector containing the rates of return of each asset. A portfolio corresponds to a vector $x \in \mathbb{R}^n$, where $x_i$ is the amount invested in asset $i$. In our simple model, we assume that ‘‘shorting’’ (borrowing) is allowed, that is, there are no sign restrictions on $x$.

As explained earlier, the return of the portfolio is the scalar product $r^T x$. We do not know the return vector $r$ in advance. We assume that we know a reasonable prediction $\hat{r}$ of $r$. Of course, we cannot rely on the vector $\hat{r}$ alone to make a decision, since the actual values in $r$ could fluctuate around $\hat{r}$. We can consider two simple ways to model the uncertainty in $r$, which result in similar optimization problems.

Mean-variance trade-off

A first approach assumes that $r$ is a random variable with known mean $\hat{r}$ and covariance matrix $\Sigma$. If past values $r_1, \ldots, r_T$ of the returns are known, we can use the sample estimates

$$\hat{r} = \frac{1}{T}\sum_{t=1}^T r_t, \qquad \Sigma = \frac{1}{T}\sum_{t=1}^T (r_t - \hat{r})(r_t - \hat{r})^T.$$

Note that, in practice, the above estimates for the mean and covariance matrix are very unreliable, and more sophisticated estimates should be used.

Then the mean value of the portfolio's return takes the form $\hat{r}^T x$, and its variance is

$$x^T \Sigma x.$$

We can strike a trade-off between the ‘‘performance’’ of the portfolio, measured by the mean return, against the ‘‘risk’’, measured by the variance, via the optimization problem

$$\min_x \; x^T \Sigma x ~:~ \hat{r}^T x = \mu,$$

where $\mu$ is our target for the nominal return. Since $\Sigma$ is positive semi-definite, that is, it can be written as $\Sigma = A^T A$ for some matrix $A$, the above problem is a linearly constrained least-squares problem.
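A possible numerical sketch of the mean-variance problem follows (synthetic return data, a budget constraint added for illustration, and the target return imposed as an equality; this is one standard way to solve such a problem, not necessarily the exact formulation used elsewhere in the text). The equality-constrained minimization of the variance can be solved via its KKT linear system.

import numpy as np

rng = np.random.default_rng(1)
T, n = 200, 4
R = 0.01 * rng.standard_normal((T, n)) + 0.001   # synthetic past returns, one row per period

r_hat = R.mean(axis=0)                           # sample mean
Sigma = np.cov(R, rowvar=False)                  # sample covariance
mu = 0.001                                       # target nominal return

# KKT system for: minimize x^T Sigma x  subject to  C x = d
C = np.vstack([r_hat, np.ones(n)])
d = np.array([mu, 1.0])
K = np.block([[2 * Sigma, C.T], [C, np.zeros((2, 2))]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(n), d]))
x = sol[:n]
print("weights:", np.round(x, 3))
print("nominal return:", r_hat @ x, " variance:", x @ Sigma @ x)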

An ellipsoidal model

To model the uncertainty in $r$, we can use the following deterministic model. We assume that the true vector $r$ lies in a given ellipsoid $\mathcal{E}$, but is otherwise unknown. We describe $\mathcal{E}$ by its center $\hat{r}$ and a ‘‘shape matrix’’ determined by some invertible matrix $P$:

$$\mathcal{E} = \left\{ \hat{r} + P u ~:~ \|u\|_2 \le 1 \right\}.$$

We observe that if $r \in \mathcal{E}$, then the portfolio return $r^T x$ will be in an interval $[r_{\min}, r_{\max}]$, with

$$r_{\max} = \max_{r \in \mathcal{E}} \; r^T x = \hat{r}^T x + \max_{\|u\|_2 \le 1} (Pu)^T x.$$

Using the Cauchy-Schwarz inequality, as well as the form of $\mathcal{E}$ given above, we obtain that

$$r_{\max} = \hat{r}^T x + \|P^T x\|_2.$$

Likewise,

$$r_{\min} = \min_{r \in \mathcal{E}} \; r^T x = \hat{r}^T x - \|P^T x\|_2.$$

For a given portfolio vector $x$, the true return will therefore lie in an interval $[\hat{r}^T x - \rho(x), \hat{r}^T x + \rho(x)]$, where $\hat{r}^T x$ is our ‘‘nominal’’ return, and $\rho(x)$ is a measure of the ‘‘risk’’ in the nominal return:

$$\rho(x) = \|P^T x\|_2.$$

We can formulate the problem of minimizing the risk subject to a constraint on the nominal return:

$$\min_x \; \|P^T x\|_2 ~:~ \hat{r}^T x = \mu,$$

where $\mu$ is our target for the nominal return. This is again a linearly constrained least-squares problem. Note that we obtain a problem that has exactly the same form as in the stochastic model seen before.

PART IX
THEOREMS

CAUCHY-SCHWARZ INEQUALITY PROOF

For any two vectors $x, y \in \mathbb{R}^n$, we have

$$x^T y \le \|x\|_2 \cdot \|y\|_2.$$

The above inequality is an equality if and only if $x, y$ are collinear. In other words:

$$\max_{x \,:\, \|x\|_2 \le 1} \; x^T y = \|y\|_2,$$

with optimal $x$ given by $x^* = y / \|y\|_2$ if $y$ is non-zero.

Proof: The inequality is trivial if either one of the vectors $x, y$ is zero. Let us assume both are non-zero. Without loss of generality, we may re-scale $x$ and assume it has unit Euclidean norm ($\|x\|_2 = 1$). Let us first prove that

$$x^T y \le \|y\|_2.$$

We consider the polynomial

$$p(t) = \|t x - y\|_2^2 = t^2 - 2t \, (x^T y) + \|y\|_2^2.$$

Since it is non-negative for every value of $t$, its discriminant $\Delta = 4(x^T y)^2 - 4\|y\|_2^2$ is non-positive. The Cauchy-Schwarz inequality follows.

The second result is proven as follows. Let $v^*$ be the optimal value of the problem. The Cauchy-Schwarz inequality implies that $v^* \le \|y\|_2$. To prove that the value is attained (it is equal to its upper bound), we observe that if $y \ne 0$, then the vector

$$x^* = \frac{y}{\|y\|_2}$$

is feasible for the optimization problem, since it has unit norm. This establishes a lower bound on the value of the problem, $v^*$:

$$v^* \ge (x^*)^T y = \frac{y^T y}{\|y\|_2} = \|y\|_2.$$
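A quick numerical sanity check of the result (random data; an illustration, not part of the proof): the maximum of $x^T y$ over unit-norm $x$ is attained at $x = y/\|y\|_2$ and equals $\|y\|_2$.

import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_normal(5)

x_star = y / np.linalg.norm(y)                     # the claimed maximizer
print(np.isclose(x_star @ y, np.linalg.norm(y)))   # True: the bound is attained

# random unit vectors never exceed the bound
for _ in range(1000):
    x = rng.standard_normal(5)
    x /= np.linalg.norm(x)
    assert x @ y <= np.linalg.norm(y) + 1e-12
print("bound verified on random unit vectors")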

DIMENSION OF HYPERPLANES

Theorem:

A set $\mathbf{H}$ in $\mathbb{R}^n$ of the form

$$\mathbf{H} = \left\{ x \in \mathbb{R}^n ~:~ a^T x = b \right\},$$

where $a \in \mathbb{R}^n$, $a \ne 0$, and $b \in \mathbb{R}$ are given, is an affine set of dimension $n - 1$.

Conversely, any affine set of dimension $n - 1$ can be represented by a single affine equation of the form $a^T x = b$, as in the above.

Proof:

Consider a set $\mathbf{H}$ described by a single affine equation:

$$a^T x = b,$$

with $a \ne 0$. Let us assume for example that the last component $a_n$ is non-zero. We can then express $x_n$ as follows:

$$x_n = \frac{1}{a_n}\left( b - \sum_{i=1}^{n-1} a_i x_i \right).$$

This shows that the set is of the form $\mathbf{H} = \{ x^0 + z_1 u_1 + \cdots + z_{n-1} u_{n-1} ~:~ z \in \mathbb{R}^{n-1} \}$, where

$$x^0 = \frac{b}{a_n} e_n, \qquad u_i = e_i - \frac{a_i}{a_n} e_n, \quad i = 1, \ldots, n-1,$$

with $e_i$ the $i$-th vector of the standard basis. Since the vectors $u_1, \ldots, u_{n-1}$ are independent, the dimension of $\mathbf{H}$ is $n - 1$. This proves that $\mathbf{H}$ is indeed an affine set of dimension $n - 1$.

The converse is also true. Any affine set of dimension $n - 1$ can be represented via a single equation

$$a^T x = b$$

for some $a \in \mathbb{R}^n$, $a \ne 0$, and $b \in \mathbb{R}$. A sketch of the proof is as follows. We use the fact that we can form a basis $u_1, \ldots, u_{n-1}$ for the $(n-1)$-dimensional subspace $\mathbf{L}$ associated with the affine set. We can then construct a vector $a$ that is orthogonal to all of these basis vectors. By definition, $\mathbf{L}$ is the set of vectors that are orthogonal to $a$, and the affine set itself is a translate of $\mathbf{L}$, hence represented by $a^T x = b$ for a suitable $b$.
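The construction in the proof is easy to check numerically; the sketch below (arbitrary numbers, Python/NumPy) builds the point $x^0$ and the directions $u_i$ for a hyperplane in $\mathbb{R}^3$ and verifies that every generated point satisfies $a^T x = b$.

import numpy as np

a = np.array([1.0, -2.0, 3.0])                 # normal vector, last component non-zero
b = 6.0
n = a.size

e = np.eye(n)
x0 = (b / a[-1]) * e[-1]                       # particular point on the hyperplane
Us = [e[i] - (a[i] / a[-1]) * e[-1] for i in range(n - 1)]   # n-1 independent directions

rng = np.random.default_rng(9)
for z in rng.standard_normal((5, n - 1)):
    x = x0 + sum(zi * ui for zi, ui in zip(z, Us))
    assert np.isclose(a @ x, b)
print("all sampled points satisfy a^T x = b; affine dimension =", len(Us))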

SPECTRAL THEOREM: EIGENVALUE DECOMPOSITION FOR SYMMETRIC MATRICES

Spectral theorem

We can decompose any symmetric $n \times n$ matrix $A$ with the symmetric eigenvalue decomposition (SED)

$$A = U \Lambda U^T = \sum_{i=1}^n \lambda_i u_i u_i^T, \qquad \Lambda = \mathbf{diag}(\lambda_1, \ldots, \lambda_n),$$

where the matrix $U = [u_1, \ldots, u_n]$ is orthogonal (that is, $U^T U = U U^T = I_n$) and contains the eigenvectors of $A$, while the diagonal matrix $\Lambda$ contains the eigenvalues of $A$.

Proof: The proof is by induction on the size $n$ of the matrix $A$. The result is trivial for $n = 1$. Now let $n \ge 2$ and assume the result is true for any symmetric matrix of size $(n-1) \times (n-1)$.

Consider the function of $\lambda$,

$$p(\lambda) = \det(\lambda I - A).$$

From the basic properties of the determinant, it is a polynomial of degree $n$, called the characteristic polynomial of $A$. By the fundamental theorem of algebra, any polynomial of degree $n$ has $n$ (possibly not distinct) complex roots; these are called the eigenvalues of $A$. We denote these eigenvalues by $\lambda_1, \ldots, \lambda_n$.

If $\lambda$ is an eigenvalue of $A$, that is, $p(\lambda) = 0$, then $\lambda I - A$ must be non-invertible (see here). This means that there exists a non-zero vector $u$ such that $Au = \lambda u$. We can always normalize $u$ so that $u^H u = 1$. Then, since $A$ is real and symmetric, $\lambda = u^H A u$ satisfies $\overline{\lambda} = \overline{u^H A u} = u^H A^T u = \lambda$. That is, the eigenvalues of a symmetric matrix are always real, and the corresponding eigenvectors can be chosen real.

Now consider the eigenvalue $\lambda_1$ and an associated eigenvector $u_1$, normalized so that $\|u_1\|_2 = 1$. Using the Gram-Schmidt orthogonalization procedure, we can compute an $n \times (n-1)$ matrix $V_1$ such that $[u_1, V_1]$ is orthogonal. By the induction hypothesis, we can write the $(n-1) \times (n-1)$ symmetric matrix $V_1^T A V_1$ as $\tilde{U} \tilde{\Lambda} \tilde{U}^T$, where $\tilde{U}$ is an $(n-1) \times (n-1)$ orthogonal matrix of eigenvectors, and $\tilde{\Lambda} = \mathbf{diag}(\lambda_2, \ldots, \lambda_n)$ contains the remaining eigenvalues of $A$. Finally, we define the $n \times n$ matrix $U := [u_1, \; V_1 \tilde{U}]$. By construction, the matrix $U$ is orthogonal.

We have

$$U^T A U = \begin{pmatrix} u_1^T A u_1 & u_1^T A V_1 \tilde{U} \\ \tilde{U}^T V_1^T A u_1 & \tilde{U}^T V_1^T A V_1 \tilde{U} \end{pmatrix} = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \tilde{\Lambda} \end{pmatrix},$$

where we have exploited the facts that $A u_1 = \lambda_1 u_1$, $V_1^T u_1 = 0$, and $V_1^T A V_1 = \tilde{U} \tilde{\Lambda} \tilde{U}^T$.

We have exhibited an orthogonal matrix $U$ such that $U^T A U$ is diagonal. This proves the theorem.
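Numerically, the decomposition can be obtained with a standard eigenvalue routine; the brief check below (random symmetric matrix, Python/NumPy) is an illustration of the theorem, not part of the proof.

import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                                # a random symmetric matrix

lam, U = np.linalg.eigh(A)                       # eigenvalues and orthonormal eigenvectors
print(np.allclose(U @ np.diag(lam) @ U.T, A))    # True: A = U Lambda U^T
print(np.allclose(U.T @ U, np.eye(4)))           # True: U is orthogonal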

SINGULAR VALUE DECOMPOSITION (SVD) THEOREM

Theorem: Singular Value Decomposition (SVD)

An arbitrary $m \times n$ matrix $A$ admits a decomposition of the form

$$A = U \tilde{S} V^T, \qquad \tilde{S} = \begin{pmatrix} S & 0 \\ 0 & 0 \end{pmatrix},$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are both orthogonal matrices, and the matrix $S$ is diagonal:

$$S = \mathbf{diag}(\sigma_1, \ldots, \sigma_r),$$

where the positive numbers $\sigma_1 \ge \cdots \ge \sigma_r > 0$ are unique, and are called the singular values of $A$. The number $r \le \min(m, n)$ is equal to the rank of $A$, and the triplet $(U, \tilde{S}, V)$ is called a singular value decomposition (SVD) of $A$. The first $r$ columns of $U$: $u_i$, $i = 1, \ldots, r$ (resp. of $V$: $v_i$, $i = 1, \ldots, r$) are called left (resp. right) singular vectors of $A$, and satisfy

$$A v_i = \sigma_i u_i, \qquad A^T u_i = \sigma_i v_i, \qquad i = 1, \ldots, r.$$

Proof: The $n \times n$ matrix $A^T A$ is real and symmetric. According to the spectral theorem, it admits an eigenvalue decomposition in the form $A^T A = V \Lambda V^T$, with $V = [v_1, \ldots, v_n]$ an $n \times n$ matrix whose columns form an orthonormal basis (that is, $V^T V = V V^T = I_n$), and $\Lambda = \mathbf{diag}(\lambda_1, \ldots, \lambda_r, 0, \ldots, 0)$, with $\lambda_1 \ge \cdots \ge \lambda_r > 0$. Here, $r$ is the rank of $A^T A$ (if $r = n$ then there are no trailing zeros in $\Lambda$). Since $A^T A$ is positive semi-definite, the $\lambda_i$'s are non-negative, and we can define the non-zero quantities $\sigma_i := \sqrt{\lambda_i}$, $i = 1, \ldots, r$.

Note that $A v_i = 0$ when $i > r$, since then $\|A v_i\|_2^2 = v_i^T A^T A v_i = \lambda_i = 0$.

Let us construct an orthogonal $m \times m$ matrix $U$ as follows. We set

$$u_i = \frac{1}{\sigma_i} A v_i, \qquad i = 1, \ldots, r.$$

These $m$-vectors are unit-norm and mutually orthogonal, since the $v_i$'s are eigenvectors of $A^T A$: for $1 \le i, j \le r$,

$$u_i^T u_j = \frac{1}{\sigma_i \sigma_j} v_i^T A^T A v_j = \frac{\lambda_j}{\sigma_i \sigma_j} v_i^T v_j = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

Using (say) the Gram-Schmidt orthogonalization procedure, we can complete (if necessary, that is, in the case $r < m$) this set of vectors by $u_{r+1}, \ldots, u_m$ in order to form an orthogonal $m \times m$ matrix $U = [u_1, \ldots, u_m]$.

Let us check that $U$, $\tilde{S}$, $V$ satisfy the conditions of the theorem, by showing that

$$U^T A V = \tilde{S}.$$

We have

$$(U^T A V)_{ij} = u_i^T A v_j = \begin{cases} \sigma_j u_i^T u_j & \text{if } j \le r, \\ 0 & \text{if } j > r, \end{cases}$$

where the second case stems from the fact that $A v_j = 0$ when $j > r$. Thus, $(U^T A V)_{ij} = \sigma_j$ if $i = j \le r$, and is zero otherwise, so that $U^T A V = \tilde{S}$, and $A = U \tilde{S} V^T$, as claimed.
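The following brief check (random rectangular matrix, Python/NumPy) illustrates the theorem and the relations satisfied by the singular vectors; it is not part of the proof.

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A)                # U: 5x5 orthogonal, Vt: 3x3 orthogonal

r = int(np.sum(s > 1e-12))                 # numerical rank
print("rank:", r)
for i in range(r):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])      # A v_i = sigma_i u_i
    assert np.allclose(A.T @ U[:, i], s[i] * Vt[i])    # A^T u_i = sigma_i v_i
print("singular vector relations verified")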

RANK-ONE MATRICES

Recall that the rank of a matrix is the dimension of its range. A rank-one matrix is a matrix with rank equal
to one. Such matrices are also called dyads.

We can express any rank-one matrix as an outer product.

Theorem: outer product representation of a rank-one matrix

Every rank-one matrix $A \in \mathbb{R}^{m \times n}$ can be written as an ‘‘outer product’’, or dyad:

$$A = p q^T,$$

where $p \in \mathbb{R}^m$, $q \in \mathbb{R}^n$ are non-zero vectors.

Proof: see the section ‘‘Rank-one matrices: a representation theorem’’ below.

The interpretation of the corresponding linear map $x \to Ax$ for a rank-one matrix $A = p q^T$ is that the output is always in the direction $p$, with coefficient of proportionality $q^T x$ a linear function of $x$.

We can always scale the vectors $p$ and $q$ in order to express $A$ as

$$A = \sigma u v^T,$$

where $u \in \mathbb{R}^m$, $v \in \mathbb{R}^n$, with $\|u\|_2 = \|v\|_2 = 1$ and $\sigma > 0$.

The interpretation for the expression above is that the result of the map $x \to Ax$ for a rank-one matrix can be decomposed into three steps:

• we project $x$ on the direction of $v$, getting the number $v^T x$;
• we scale that number by the positive number $\sigma$;
• we lift the result (which is the scalar $\sigma v^T x$) to get a vector proportional to $u$.

See also: Single factor model of financial price data.
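A tiny numerical illustration of the dyad structure (made-up vectors, Python/NumPy): the outer product $p q^T$ has rank one, and applying it to $x$ produces a vector along $p$.

import numpy as np

p = np.array([1.0, 2.0, -1.0])
q = np.array([3.0, 0.5])
A = np.outer(p, q)                         # 3x2 rank-one matrix p q^T

print(np.linalg.matrix_rank(A))            # 1
x = np.array([2.0, 4.0])
print(np.allclose(A @ x, (q @ x) * p))     # True: the output is proportional to p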



RANK-ONE MATRICES: A REPRESENTATION THEOREM

We prove the theorem mentioned here:

Theorem: outer product representation of a rank-one matrix

Every rank-one matrix $A \in \mathbb{R}^{m \times n}$ can be written as an ‘‘outer product’’, or dyad:

$$A = p q^T,$$

where $p \in \mathbb{R}^m$, $q \in \mathbb{R}^n$ are non-zero vectors.

Proof: For any non-zero vectors $p \in \mathbb{R}^m$, $q \in \mathbb{R}^n$, the matrix $A := p q^T$ is indeed of rank one: if $x \in \mathbb{R}^n$, then

$$Ax = p q^T x = (q^T x) \, p.$$

When $x$ spans $\mathbb{R}^n$, the scalar $q^T x$ spans the entire real line (since $q \ne 0$), and the vector $Ax$ spans the subspace of vectors proportional to $p$. Hence, the range of $A$ is the line

$$\mathbf{R}(A) = \{ t p ~:~ t \in \mathbb{R} \},$$

which is of dimension 1.

Conversely, if $A$ is of rank one, then its range is of dimension one, hence it must be a line passing through $0$. Hence for any $x \in \mathbb{R}^n$ there exists a scalar $t(x)$ such that

$$Ax = t(x) \, p.$$

Using $x = e_i$, where $e_i$ is the $i$-th vector of the standard basis, we obtain that there exist numbers $q_1, \ldots, q_n$ such that for every $i$:

$$A e_i = q_i \, p.$$

We can write the above $n$ equations in a single matrix equation:

$$A [e_1, \ldots, e_n] = p [q_1, \ldots, q_n].$$

Now letting $q := (q_1, \ldots, q_n)$ and realizing that the matrix $[e_1, \ldots, e_n]$ is simply the $n \times n$ identity matrix, we obtain $A = p q^T$, as desired.

FULL RANK MATRICES

Theorem

A matrix $A$ in $\mathbb{R}^{m \times n}$ is:

• full column rank if and only if the $n \times n$ matrix $A^T A$ is invertible;

• full row rank if and only if the $m \times m$ matrix $A A^T$ is invertible.

Proof:

The matrix $A$ is full column rank if and only if its nullspace is reduced to the singleton $\{0\}$, that is,

$$Ax = 0 \;\Longrightarrow\; x = 0.$$

If $A^T A$ is invertible, then the condition $Ax = 0$ implies $A^T A x = 0$, which in turn implies $x = 0$.

Conversely, assume that the matrix $A$ is full column rank, and let $x$ be such that $A^T A x = 0$. We then have $x^T A^T A x = \|Ax\|_2^2 = 0$, which means $Ax = 0$. Since $A$ is full column rank, we obtain $x = 0$, as desired. The proof for the other property follows similar lines, with $A$ replaced by $A^T$.
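A quick numerical check of the first property (random example, Python/NumPy): a matrix with independent columns has an invertible Gram matrix $A^T A$, while a column-rank-deficient one does not.

import numpy as np

rng = np.random.default_rng(8)
A_full = rng.standard_normal((6, 3))                       # full column rank (generically)
A_deficient = np.hstack([A_full[:, :2], A_full[:, :1]])    # repeated column: rank 2

for A in (A_full, A_deficient):
    gram = A.T @ A
    invertible = np.linalg.matrix_rank(gram) == A.shape[1]
    print("column rank:", np.linalg.matrix_rank(A), " A^T A invertible:", invertible)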

RANK-NULLITY THEOREM

Rank-nullity theorem

The nullity (dimension of the nullspace) and the rank (dimension of the range) of an $m \times n$ matrix $A$ add up to the column dimension of $A$, $n$.

Proof: Let $k$ be the dimension of the nullspace ($k = \dim \mathbf{N}(A)$). Let $V$ be an $n \times k$ matrix such that its columns form an orthonormal basis of $\mathbf{N}(A)$. In particular, we have $AV = 0$. Using the QR decomposition of the matrix $V$, we obtain an $n \times (n-k)$ matrix $W$ such that the $n \times n$ matrix $[V, W]$ is orthogonal. Now define the $m \times (n-k)$ matrix $U := AW$.

We proceed to show that the columns of $U$ form a basis for the range of $A$. To do this, we first prove that the columns of $U$ span the range of $A$. Then we will show that these columns are independent. This will show that the dimension of the range (that is, the rank) is indeed equal to $n - k$.

Since $[V, W]$ is an orthogonal matrix, for any $x \in \mathbb{R}^n$, there exist two vectors $y \in \mathbb{R}^k$ and $z \in \mathbb{R}^{n-k}$ such that

$$x = Vy + Wz.$$

If $u = Ax$, then

$$u = A(Vy + Wz) = AWz = Uz.$$

This proves that the columns of $U$ span the range of $A$:

$$\mathbf{R}(A) = \mathbf{R}(U).$$

Now let us show that the columns of $U$ are independent. Assume a vector $z$ satisfies $Uz = 0$ and let us show $z = 0$. We have $AWz = 0$, which implies that $Wz$ is in the nullspace of $A$. Hence, there exists another vector $y$ such that $Wz = Vy$. Pre-multiplying the last equation by $W^T$, and exploiting the fact that $[V, W]$ is an orthogonal matrix, so that

$$W^T W = I_{n-k}, \qquad W^T V = 0,$$

we obtain

$$z = W^T W z = W^T V y = 0,$$

as desired.
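The theorem is easy to verify numerically via the SVD (random example, Python/NumPy): the last $n - r$ right singular vectors form an orthonormal basis of the nullspace, so rank and nullity add up to $n$.

import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 7
A = rng.standard_normal((m, 3)) @ rng.standard_normal((3, n))   # rank 3 (generically)

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10 * s[0]))
N_basis = Vt[r:].T                        # n x (n - r) orthonormal basis of N(A)
print("rank:", r, " nullity:", N_basis.shape[1], " sum:", r + N_basis.shape[1])
print(np.allclose(A @ N_basis, 0))        # True: these columns lie in the nullspace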

A THEOREM ON POSITIVE SEMIDEFINITE FORMS AND EIGENVALUES

Theorem: (Link with SED)

A quadratic form $q(x) = x^T A x$, with $A = A^T \in \mathbb{R}^{n \times n}$, is non-negative (resp. positive-definite) if and only if every eigenvalue of the symmetric matrix $A$ is non-negative (resp. positive).

Proof: Let $A = U \Lambda U^T$ be the SED of $A$, with $\Lambda = \mathbf{diag}(\lambda_1, \ldots, \lambda_n)$ and $U = [u_1, \ldots, u_n]$ orthogonal. For any $x$, setting $z := U^T x$, we have

$$q(x) = x^T A x = z^T \Lambda z = \sum_{i=1}^n \lambda_i z_i^2.$$

If $\lambda_i \ge 0$ for every $i$, then $q(x) \ge 0$ for every $x$, so the quadratic form is non-negative.

Conversely, since $U$ is orthogonal, it is invertible, and $z = U^T x$ spans all of $\mathbb{R}^n$ as $x$ does. Hence, if $q(x) \ge 0$ for every $x$, then choosing $x = U e_i$ (that is, $z = e_i$) yields $q(x) = \lambda_i \ge 0$ for every $i$. Equivalently, if $\lambda_i < 0$ for some $i$, we can achieve $q(x) = \lambda_i < 0$ for some non-zero $x$.

The positive-definite case follows from the same argument, with strict inequalities for non-zero $x$.
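As a small numerical illustration (random data, Python/NumPy): a matrix of the form $B^T B$ is positive semidefinite by construction, and indeed all its eigenvalues are non-negative, as is the quadratic form at any point.

import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((3, 5))
A = B.T @ B                                # 5x5 positive semidefinite matrix

lam = np.linalg.eigvalsh(A)
print(np.all(lam >= -1e-10))               # True: all eigenvalues non-negative
x = rng.standard_normal(5)
print(x @ A @ x >= -1e-10)                 # True: the quadratic form is non-negative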

FUNDAMENTAL THEOREM OF LINEAR ALGEBRA

Fundamental theorem of linear algebra

Let $A \in \mathbb{R}^{m \times n}$. The sets $\mathbf{N}(A)$ and $\mathbf{R}(A^T)$ form an orthogonal decomposition of $\mathbb{R}^n$, in the sense that any vector $x \in \mathbb{R}^n$ can be written as

$$x = y + z, \qquad y \in \mathbf{N}(A), \quad z \in \mathbf{R}(A^T), \quad y^T z = 0.$$

In particular, we obtain that the condition on a vector $x$ to be orthogonal to any vector in the nullspace of $A$ implies that it must be in the range of its transpose:

$$x^T y = 0 \text{ whenever } Ay = 0 \quad \Longleftrightarrow \quad x = A^T \lambda \text{ for some } \lambda \in \mathbb{R}^m.$$
Proof: The theorem relies on the fact that if an SVD of the matrix $A$ is

$$A = U \tilde{S} V^T,$$

then an SVD of its transpose is simply obtained by transposing the three-term matrix product involved:

$$A^T = V \tilde{S}^T U^T.$$

Thus, the left singular vectors of $A^T$ are the right singular vectors of $A$ (and vice-versa).

From this we conclude in particular that the range of $A^T$ is spanned by the first $r$ columns of $V$, where $r$ is the rank of $A$. Since the nullspace of $A$ is spanned by the last $n - r$ columns of $V$, we observe that the nullspace of $A$ and the range of $A^T$ are two orthogonal subspaces, whose dimensions sum to that of the whole space, $n$. Precisely, we can express any given vector $x \in \mathbb{R}^n$ as a linear combination of the columns of $V$; the first $r$ columns correspond to the vector $z$ and the last $n - r$ to the vector $y$:

$$x = \sum_{i=1}^{n} \alpha_i v_i = \underbrace{\sum_{i=1}^{r} \alpha_i v_i}_{=: \, z \, \in \, \mathbf{R}(A^T)} + \underbrace{\sum_{i=r+1}^{n} \alpha_i v_i}_{=: \, y \, \in \, \mathbf{N}(A)}.$$

This proves the first result in the theorem.

The last statement is then an obvious consequence of this first result: if $x$ is orthogonal to the nullspace, then the vector $y$ in the decomposition above must be zero, so that $x = z \in \mathbf{R}(A^T)$.
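The decomposition can be computed explicitly from the right singular vectors of $A$; the sketch below (random data, Python/NumPy) splits an arbitrary $x$ into its nullspace and row-space components and checks the claimed properties.

import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 6
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10 * s[0]))
V_range, V_null = Vt[:r].T, Vt[r:].T       # bases of R(A^T) and N(A)

x = rng.standard_normal(n)
z = V_range @ (V_range.T @ x)              # component in R(A^T)
y = V_null @ (V_null.T @ x)                # component in N(A)
print(np.allclose(x, y + z))               # x = y + z
print(np.isclose(y @ z, 0.0))              # y and z are orthogonal
print(np.allclose(A @ y, 0.0))             # y is in the nullspace of A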
