LINEAR ALGEBRA
AND
APPLICATIONS
Hanoi, 2023
CONTENTS
INTRODUCTION
Part I. VECTORS
1. BASICS
1.1. Definitions
1.2. Independence
2.2. Norms
3.1. Definition
5.1. Hyperplanes
5.4. Half-spaces
6. LINEAR FUNCTIONS
7.3. Examples
8. EXERCISES
8.1. Subspaces
8.3. Orthogonalization
9. BASICS
9.2. Transpose
11.2. Dyads
12. QR DECOMPOSITION OF A MATRIX
15.4. Applications
16. APPLICATIONS
18.1. Overview
18.4. Issues
19. EXISTENCE AND UNICITY OF SOLUTIONS
23.1. Definition
23.2. Interpretations
25.1. Motivations
Functions
Maps
Dual norm
Incidence matrix of a network
Nullspace of a transpose incidence matrix
Rank properties of the arc-node incidence matrix
Permutation matrices
QR decomposition: Examples
Backwards substitution for solving triangular linear systems
Solving triangular systems of equations: Backwards substitution example
Linear regression via least squares
Nomenclature
Definition
Definition
Properties
Optimal set of least-squares via SVD
Pseudo-inverse of a matrix
SVD: A 4x4 example
Singular value decomposition of a 4 x 5 matrix
Representation of a two-variable quadratic function
Edge weight matrix of a graph
Network flow
Laplacian matrix of a graph
Hessian of a function
Definition
Examples
Hessian of a quadratic function
Gram matrix
Quadratic functions in two variables
Quadratic approximation of the log-sum-exp function
Determinant of a square matrix
Definition
Log-returns
INTRODUCTION
BOOK DESCRIPTION:
“You can’t learn too much linear algebra”. Benedict Gross, Professor of Mathematics at Harvard.
This book offers a guided tour of linear algebra and its applications, one of the most important building
blocks of modern engineering and computer sciences.
Topics include matrices, determinants, vector spaces, eigenvalues and eigenvectors, orthogonality, and inner
product spaces; applications include brief introductions to difference equations, Markov chains, and
systems of linear ordinary differential equations.
Rationale: Linear algebra is an important building block in engineering and computer sciences. It is applied
in many areas such as search engines, data mining and machine learning, control and optimization, graphics,
robotics, etc.
VECTORS | 2
PART I
VECTORS
A vector is a collection of numbers arranged in a column or a row and can be thought of as a point in space.
We review basic notions such as independence, span, subspaces, and dimension. The scalar product allows us
to define the length of a vector, as well as to generalize the notion of the angle between two vectors. Via the
scalar product, we can view a vector as a linear function. We can also compute the projection of a vector onto
a line defined by another — a basic ingredient in many visualization techniques for high-dimensional data
sets.
Outline
• Basics
• Scalar product, norms and angles
• Projection on a line
• Orthogonalization
• Hyperplanes and half-spaces
• Linear functions
• Application: data visualization via projection on a line
3 | VECTORS
• Exercises
BASICS | 4
1.
BASICS
• Definitions
• Independence
• Subspaces, span, affine sets
• Basis, dimension
1.1. Definitions
Vectors
Assume we are given a collection of $n$ real numbers, $x_1, \ldots, x_n$. We can represent them as $n$ locations on a
line. Alternatively, we can represent the collection as a single point in an $n$-dimensional space. This is the vector
representation of the collection of numbers; each number is called a component or element of the vector.
Vectors can be arranged in a column or a row; we usually write vectors in column format:
We denote by $\mathbb{R}^n$ the set of real vectors with $n$ components. If $x$ denotes a vector, we use
subscripts to denote components, so that $x_i$ is the $i$-th component of $x$. Sometimes the notation $x(i)$ is
used to denote the $i$-th component.
See also:
Transpose
If $x$ is a column vector, $x^T$ denotes the corresponding row vector, and vice versa.
Sometimes we use the looser, in-line notation $x = (x_1, \ldots, x_n)$ to denote a row or column vector, the
orientation being understood from context.
1.2. Independence
A set of vectors $x_1, \ldots, x_m$ in $\mathbb{R}^n$ is said to be linearly independent if and only if the following
condition on a vector of scalars $\lambda \in \mathbb{R}^m$:
$$\lambda_1 x_1 + \cdots + \lambda_m x_m = 0$$
implies $\lambda_1 = \cdots = \lambda_m = 0$. This means that no vector in the set can be expressed as a linear
combination of the others.
An important result of linear algebra, which we will prove later, says that a subspace can always be
represented as the span of a set of vectors $x_1, \ldots, x_m$, that is, as a set of the form
$$\mathcal{S} = \operatorname{span}(x_1, \ldots, x_m) := \left\{ \sum_{i=1}^m \lambda_i x_i : \lambda \in \mathbb{R}^m \right\}.$$
An affine set is a translation of a subspace: it is ‘‘flat’’ but does not necessarily pass through the origin, as a subspace
would. (Think for example of a line, or a plane, that does not go through the origin.) So an affine set can
always be represented as the translation of the subspace spanned by some vectors:
$$\mathcal{A} = x_0 + \mathcal{S} = \{ x_0 + v : v \in \mathcal{S} \},$$
where $\mathcal{S}$ is the subspace spanned by the given vectors.
When $\mathcal{S}$ is the span of a single non-zero vector $u$, the set is called a line passing through the point $x_0$. Thus,
lines have the form
$$\{ x_0 + t u : t \in \mathbb{R} \},$$
where $u$ determines the direction of the line, and $x_0$ is a point through which it passes.
(Figure) Example 4: a line in $\mathbb{R}^2$ passing through a given point $x_0$, with a given direction $u$.
Basis
A basis of is a set of independent vectors. If the vectors form a basis, we can express any
vector as a linear combination of the ‘s:
The standard basis (alternatively, natural basis) in consists of the vectors , where ‘s components are
all zero, except the -th, which is equal to 1. In , we have
is not independent, since , and its span has dimension . Since are independent (the
equation has as the unique solution), a basis for that span is, for example, .
In contrast, the collection spans the whole space , and thus forms a basis of that space.
Basis of a subspace
A basis of a given subspace $\mathcal{S}$ is any independent set of vectors whose span is $\mathcal{S}$. If the vectors $u_1, \ldots, u_r$
form a basis of $\mathcal{S}$, we can express any vector of $\mathcal{S}$ as a linear combination of the $u_i$'s:
The number of vectors in the basis is actually independent of the choice of the basis (for example, in
you need two independent vectors to describe a plane containing the origin). This number is called the
dimension of . We can accordingly define the dimension of an affine subspace, as that of the linear subspace
of which it is a translation.
Examples:
• The dimension of a line is 1 since a line is of the form for some non-zero vector .
• Dimension of an affine subspace.
2.
SCALAR PRODUCT, NORMS AND ANGLES
• Scalar product
• Norms
• Three popular norms
• Cauchy-Schwarz inequality
• Angles between vectors
Definition
The scalar product (or, inner product, or dot product) between two vectors is the scalar denoted
, and defined as
The motivation for our notation above will come later when we define the matrix-matrix product. The
scalar product is also sometimes denoted , a notation that originates in physics.
See also:
Orthogonality
We say that two vectors are orthogonal if .
2.2. Norms
Definition
Measuring the size of a scalar value is unambiguous — we just take the magnitude (absolute value) of the
number. However, when we deal with higher dimensions and try to define the notion of size, or length, of a
vector, we are faced with many possible choices. These choices are encapsulated in the notion of norm.
Norms are real-valued functions that satisfy a basic set of rules that a sensible notion of size should involve.
You can consult the formal definition of a norm here. The norm of a vector $x$ is usually denoted $\|x\|$.
(Figure) Illustration of the $\ell_1$ and $\ell_\infty$ norms of a vector in the plane.
Examples:
• A given vector will in general have different ‘‘lengths” under different norms. For example, the vector
yields , , and .
• Sample standard deviation.
The Cauchy-Schwarz inequality states that, for any two vectors $x$ and $y$, $|x^T y| \le \|x\|_2 \, \|y\|_2$. The above inequality is an equality if and only if $x$ and $y$ are collinear. In other words:
For a proof, see here. The Cauchy-Schwarz inequality can be generalized to other norms, using the concept
of dual norm.
Applying the Cauchy-Schwarz inequality above, we see that the number used to define the angle between two
non-zero vectors is indeed in the interval $[-1, 1]$.
The notion above generalizes the usual notion of angle between two directions in two dimensions, and is
useful in measuring the similarity (or, closeness) between two vectors. When the two vectors are orthogonal,
that is, $x^T y = 0$, we obtain that their angle is $\pi/2$ (that is, $90^\circ$).
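The following short NumPy sketch (added here for illustration; the function name and test vectors are our own choices, not the text's) computes the angle between two vectors from the scalar product and the Euclidean norms, and checks the orthogonal case:

```python
import numpy as np

def angle(x, y):
    """Angle (in radians) between two non-zero vectors, via the scalar product."""
    c = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clip to [-1, 1] to guard against rounding errors before calling arccos.
    return np.arccos(np.clip(c, -1.0, 1.0))

x = np.array([1.0, 2.0])
y = np.array([-2.0, 1.0])
print(angle(x, y))               # ~1.5708 (pi/2): the vectors are orthogonal
print(np.degrees(angle(x, y)))   # ~90 degrees
```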
3.
PROJECTION ON A LINE
• Definition
• Closed-form expression
• Interpreting the scalar product
3.1. Definition
Consider the line in $\mathbb{R}^n$ passing through a point $x_0$, with direction $u$:
$$L = \{ x_0 + t u : t \in \mathbb{R} \}.$$
(Figure) Example 1: a line in $\mathbb{R}^2$ passing through a given point $x_0$, with a given direction $u$.
The projection of a given point on the line is the vector located on the line that is closest to that point (in Euclidean
norm). This corresponds to a simple optimization problem:
This particular problem is part of a general class of optimization problems known as least-squares. It is also
a special case of a Euclidean projection on a general set.
(Figure) At optimality, the ‘‘residual’’ vector is orthogonal to the line.
In the case when is not normalized, the expression is obtained by replacing with its scaled version
:
In general, the scalar product is simply the component of along the normalized direction
defined by .
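As a hedged illustration of the closed-form expression above, here is a small NumPy sketch (the variable names `z`, `x0`, `u` are our own choices) that projects a point on a line and verifies that the residual is orthogonal to the direction:

```python
import numpy as np

def project_on_line(z, x0, u):
    """Projection of the point z on the line passing through x0 with direction u."""
    u = u / np.linalg.norm(u)        # normalize the direction
    t = u @ (z - x0)                 # component of z - x0 along u
    return x0 + t * u

z  = np.array([2.0, 1.0])
x0 = np.array([0.0, 0.5])
u  = np.array([1.0, 1.0])
p = project_on_line(z, x0, u)
print(p)                              # [1.25 1.75]
# The residual z - p is orthogonal to the direction u:
print(np.isclose((z - p) @ u, 0.0))   # True
```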
4.
ORTHOGONALIZATION: THE
GRAM-SCHMIDT PROCEDURE
• Orthogonalization
• Projection on a line
• Gram-Schmidt procedure
That is, the vectors form an orthonormal basis for the span of the vectors .
The projection of a given point on the line is the vector located on the line that is closest to that point (in
Euclidean norm). This corresponds to a simple optimization problem:
The vector , where is the optimal value, is referred to as the projection of on the line
. As seen here, the solution of this simple problem has a closed-form expression:
Note that the vector can now be written as a sum of its projection and another vector that is orthogonal
to the projection:
Gram-Schmidt procedure:
1. set .
2. normalize: set .
3. remove component of in : set .
4. normalize: set .
The GS process is well-defined, since at each step the vector being normalized is non-zero (otherwise this would
contradict the linear independence of the original vectors).
1. set .
2. for :
1. set .
2. if .
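The procedure above can be summarized in a few lines of NumPy. The sketch below is one possible implementation, assuming the input columns are linearly independent; the function name and tolerance are illustrative, not part of the original text:

```python
import numpy as np

def gram_schmidt(A, tol=1e-12):
    """Orthonormalize the columns of A (assumed linearly independent).

    Returns Q with orthonormal columns spanning the same space as the columns of A.
    """
    n = A.shape[1]
    Q = []
    for i in range(n):
        q = A[:, i].astype(float)
        for prev in Q:                    # remove components along previous directions
            q = q - (prev @ q) * prev
        nrm = np.linalg.norm(q)
        if nrm < tol:                     # would contradict linear independence
            raise ValueError("columns are not linearly independent")
        Q.append(q / nrm)                 # normalize
    return np.column_stack(Q)

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q = gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(2)))    # True: columns are orthonormal
```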
5.
HYPERPLANES AND HALF-SPACES
• Hyperplanes
• Projection on a hyperplane
• Geometry
• Half-spaces
5.1. Hyperplanes
A hyperplane is a set described by a single scalar product equality. Precisely, a hyperplane in $\mathbb{R}^n$ is a set of the
form
$$\mathcal{H} = \{ x \in \mathbb{R}^n : a^T x = b \},$$
where $a \in \mathbb{R}^n$, $a \neq 0$, and $b \in \mathbb{R}$ are given. When $b = 0$, the hyperplane is simply the set of points that
are orthogonal to $a$; when $b \neq 0$, the hyperplane is a translation, along direction $a$, of that set.
Hence, the hyperplane can be characterized as the set of vectors such that is orthogonal to :
Hyperplanes are affine sets of dimension $n - 1$ (see the proof here). Thus, they generalize the usual notion
of a plane in $\mathbb{R}^3$. Hyperplanes are very useful because they allow us to separate the whole space into two regions.
The notion of half-space formalizes this.
Example 1: A hyperplane in $\mathbb{R}^3$.
Consider an affine set of dimension 2 in $\mathbb{R}^3$, which we describe as the set of points $x$ in $\mathbb{R}^3$ for which there
exist two parameters such that
Thus, the set is of dimension in , hence it is a hyperplane. In , hyperplanes are ordinary planes. We can
find a representation of the hyperplane in the standard form
We simply find that is orthogonal to both and . That is, we solve the equations
By construction, is the projection of on . That is, it is the point on closest to the origin, as it
solves the projection problem
Geometrically, a hyperplane ,
with , is a translation of the set of vectors
orthogonal to . The direction of the translation is
determined by , and the amount by .
5.4. Half-spaces
A half-space is a subset of defined by a single inequality involving a scalar product. Precisely, a half-space
in is a set of the form
Geometrically, the half-space above is the set of points such that , that is, the angle
between and is acute (in ). Here is the point closest to the origin on the
hyperplane defined by the equality . (When is normalized, as in the picture, .)
6.
LINEAR FUNCTIONS
Definition
Linear functions are functions which preserve scaling and addition of the input argument. Affine functions
are ‘‘linear plus constant’’ functions.
Formal definition (linear and affine functions). A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is linear if and only if it
preserves scaling and addition of its arguments:
A function is linear if and only if either one of the following conditions hold.
– For every in , .
2. vanishes at the origin: , and transforms any line segment in into another segment in :
3. is differentiable, vanishes at the origin, and the matrix of its derivatives is constant: there exist in
such that
for some unique pair , with and , given by , with the -th unit vector in ,
, and . The function is linear if and only if .
The theorem shows that a vector can be seen as a (linear) function from the ‘‘input“ space to the
‘‘output” space . Both points of view (matrices as simple collections of numbers, or as linear functions) are
useful.
An affine function , with values has a very simple gradient: the constant
vector . That is, for an affine function , we have for every :
Consider the function , with values . Its gradient is constant, with values
For a given in , the -level set is the set of points such that :
The level sets are hyperplanes, and are orthogonal to the gradient.
Interpretations
The interpretation of the coefficients of an affine function $f(x) = a^T x + b$ is as follows.
• The scalar $b$ is the constant term. For this reason, it is sometimes referred to as the bias, or intercept
(as it is the point where $f$ intercepts the vertical axis if we were to plot the graph of the function).
• The terms , , which correspond to the gradient of , give the coefficients of influence
of on . For example, if , then the first component of has much greater influence on the
value of than the third.
One-dimensional case
Consider a function $f$ of one variable, and assume it is differentiable everywhere. Then we can
approximate the value of the function at a point $x$ near a point $x_0$ as follows:
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0).$$
Multi-dimensional case
With more than one variable, we have a similar result. Let us approximate a differentiable function $f$
by a linear function $g$, so that $f$ and $g$ coincide up to and including the first derivatives. The
corresponding approximation $g$ is called the first-order approximation to $f$ at $x_0$. It has the form
$$g(x) = a^T x + b,$$
where $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$. Our condition that $g$ coincides with $f$ up to and including the first derivatives
shows that we must have
$$a = \nabla f(x_0), \qquad b = f(x_0) - \nabla f(x_0)^T x_0,$$
so that the first-order approximation is $g(x) = f(x_0) + \nabla f(x_0)^T (x - x_0)$.
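A minimal numerical sketch of the first-order approximation, using a finite-difference estimate of the gradient; the particular function $f$ chosen below is only an example, not one from the text:

```python
import numpy as np

def numerical_gradient(f, x0, h=1e-6):
    """Finite-difference estimate of the gradient of f at x0."""
    g = np.zeros_like(x0, dtype=float)
    for i in range(x0.size):
        e = np.zeros_like(x0, dtype=float)
        e[i] = h
        g[i] = (f(x0 + e) - f(x0 - e)) / (2 * h)
    return g

def first_order_approx(f, x0):
    """Return the affine function x -> f(x0) + grad_f(x0)^T (x - x0)."""
    fx0, g = f(x0), numerical_gradient(f, x0)
    return lambda x: fx0 + g @ (x - x0)

# Example function, chosen here only for illustration.
f = lambda x: np.log(np.exp(x[0]) + np.exp(x[1]))
x0 = np.array([1.0, -1.0])
f_lin = first_order_approx(f, x0)
x = x0 + np.array([0.01, -0.02])
print(f(x), f_lin(x))   # the two values are close for x near x0
```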
7.
APPLICATION: DATA VISUALIZATION BY PROJECTION ON A LINE
In this section, we focus on a data set containing the votes of US Senators. This data set can be
represented as a collection of vectors in $\mathbb{R}^m$, with $m$ the number
of bills, and $n$ the number of Senators. Thus, each vector contains all the votes of one Senator, and its $i$-th
component contains the vote of that Senator on bill $i$.
Senate voting matrix: This image shows the votes of the Senators in the 2004-2006 US Senate.
‘‘Yes’’ votes are represented as $+1$'s, ‘‘No’’ as $-1$'s, and the other votes (abstentions or absences) are
recorded as $0$. Each row represents the votes of a single Senator, and each column contains the votes of all
Senators for a particular bill. The Senators' vote vectors can be read as rows in the picture.
We can try to visualize the data set, by projecting each data point (each row or column of the matrix) on
(say) a one-, two- or three-dimensional space. Each ‘‘view’’ corresponds to a particular projection, that is, a
particular one-, two- or three-dimensional subspace on which we choose to project the data. Let us detail
what it means to project on a one-dimensional set, that is, on a line.
Projecting on a line allows us to assign a single number, or ‘‘score’’, to each data point, via a scalar product.
We choose a (normalized) direction , and a scalar . This corresponds to the affine ‘‘scoring’’
function , which, to a generic data point , assigns the value
The vector can be interpreted as the ‘‘average response’’ across data points (the average vote across
Senators in our running example). The values of our scoring function can now be expressed as
In order to be able to compare the relative merits of different directions, we can assume, without loss of
generality, that the direction vector u is normalized (so that ).
Note that our definition of above is consistent with the idea of projecting the data points on
the line passing through the origin and with normalized direction . Indeed, the component of on
the line is .
In the Senate voting example above, a particular projection (that is, a direction in ) corresponds to
assigning a ‘‘score’’ to each Senator, and thus represents all the Senators as a single value on a line. We
will project the data along a vector in the ‘‘bill’’ space, which is . That is, we are going to form linear
combinations of the bills, so that the votes for each Senator are reduced to a single number, or
‘‘score’’. Since we centered our data, the average score (across Senators) is zero.
7.3. Examples
8.
EXERCISES
• Subspaces
• Projections, scalar products, angles
• Orthogonalization
• Generalized Cauchy-Schwarz inequalities
• Linear functions
8.1. Subspaces
1. Consider the set of points such that
Show that is a subspace. Determine its dimension, and find a basis for it.
a. Show that the set is an affine subspace of dimension . To this end, express it as
b. Find the minimum Euclidean distance from to the set . Find a point that achieves
the minimum distance. (Hint: using the Cauchy-Schwarz inequality, prove that the minimum-
distance point is proportional to .)
1. Find the projection of the vector on the line that passes through with
8.3. Orthogonalization
Let be two unit-norm vectors, that is, such that . Show that the
vectors and are orthogonal. Use this to find an orthogonal basis for the subspace
spanned by and .
3. In a generalized version of the above inequalities, show that for any non-zero vector ,
where is the cardinality of the vector , defined as the number of non-zero elements in
For which vectors is the upper bound attained?
Express as a scalar product, that is, find such that for every . Find a basis
for the set of points such that .
Justify the statement: ‘‘the coefficients provide the ratio between the relative error in to a relative
error in ’’.
3. Find the gradient of the function that gives the distance from a given point
to a point .
PART II
MATRICES
Matrices are collections of vectors of the same size, organized in a rectangular array. The image shows the
matrix of votes of the 2004-2006 US Senate.
Via the matrix-vector product, we can interpret matrices as linear maps (vector-valued functions), which act
from an ‘‘input’’ space to an ‘‘output’’ space, and preserve the addition and scaling of the inputs. Linear
maps arise everywhere in engineering, mostly via a process known as linearization (of non-linear maps).
Matrix norms are then useful to measure how the map amplifies or decreases the norm of specific inputs.
We review a number of prominent classes of matrices. Orthogonal matrices are an important special case,
as they generalize the notion of rotation in the ordinary two- or three-dimensional spaces: they preserve
(Euclidean) norms and angles. The QR decomposition, which proves useful in solving linear equations and
related problems, allows us to decompose any matrix as a two-term product involving an orthogonal matrix
and a triangular matrix.
Outline
• Basics
• Matrix-vector and matrix-matrix multiplication, scalar product
• Special classes of matrices
• QR decomposition of a matrix
• Matrix inverses
• Linear maps
• Matrix norms
• Applications
• Exercises
9.
BASICS
A matrix can be described as follows: given $n$ vectors $a_1, \ldots, a_n$ in $\mathbb{R}^m$, we can define the $m \times n$ matrix $A$
with the $a_i$'s as columns:
With our convention, a column vector in $\mathbb{R}^m$ is thus a matrix in $\mathbb{R}^{m \times 1}$, while a row vector in $\mathbb{R}^n$ is a matrix
in $\mathbb{R}^{1 \times n}$.
9.2. Transpose
The notation $A_{ij}$ denotes the element of $A$ in row $i$ and column $j$. The transpose of an $m \times n$ matrix $A$, denoted
by $A^T$, is the $n \times m$ matrix with element $A_{ji}$ at the $(i, j)$ position, with $1 \le i \le n$ and $1 \le j \le m$.
The matrix can be interpreted as the collection of two column vectors: , where ‘s contain the
columns of :
See also:
One of the most common formats involves only listing the non-zero elements, and their associated locations
in the matrix.
10.
MATRIX-VECTOR AND MATRIX-MATRIX MULTIPLICATION, SCALAR PRODUCT
• Matrix-vector product
• Matrix-matrix product
• Block matrix product
• Trace and scalar product
Definition
We define the matrix-vector product between an $m \times n$ matrix $A$ and an $n$-vector $x$, and denote by $Ax$, the
$m$-vector with $i$-th component
$$(Ax)_i = \sum_{j=1}^n A_{ij} x_j, \qquad i = 1, \ldots, m.$$
$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} x_1 a_{11} + x_2 a_{12} \\ x_1 a_{21} + x_2 a_{22} \\ x_1 a_{31} + x_2 a_{32} \end{pmatrix}$$
See also:
• Network flow.
• Image compression.
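A small NumPy check of the definition and of the column-wise interpretation (the matrix and vector below are arbitrary examples):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])       # a 3 x 2 matrix
x = np.array([2.0, -1.0])        # a 2-vector

# Definition: (Ax)_i = sum_j A[i, j] * x[j]
print(A @ x)                     # [0. 2. 4.]

# Column interpretation: Ax is a linear combination of the columns of A.
print(x[0] * A[:, 0] + x[1] * A[:, 1])   # same result
```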
Left product
If , then the notation is the row vector of size equal to the transpose of the column vector
. That is:
Example: Return to the network example, involving a incidence matrix. We note that, by
construction, the columns of sum to zero, which can be compactly written as , or .
Definition
We can extend the matrix-vector product to the matrix-matrix product, as follows. If $A \in \mathbb{R}^{m \times n}$ and
$B \in \mathbb{R}^{n \times p}$, the notation $AB$ denotes the $m \times p$ matrix with $(i, j)$ element given by
$$(AB)_{ij} = \sum_{k=1}^n A_{ik} B_{kj}.$$
Column-wise interpretation
If the columns of are given by the vectors , with , so that , then
can be written as
Row-wise interpretation
The matrix-matrix product can also be interpreted as an operation on the rows of . Indeed, if is given
by its rows then is the matrix obtained by transforming each one of these rows via
, into :
(Note that ‘s are indeed row vectors, according to our matrix-vector rules.)
matrix-vector product between a matrix and a -vector , where are partitioned in blocks,
as follows:
where is Then
Symbolically, it’s as if we would form the ‘‘scalar’’ product between the ‘‘row vector and the
column vector !
Again, symbolically we apply the same rules as for the scalar product — except that now the result is a matrix.
Finally, we can consider so-called outer products. Assume matrix is partitioned row-wise and matrix is
partitioned column-wise. Therefore, we have:
The dimensions of these matrices should be consistent such that are of dimensions and
respectively and are of dimensions and respectively. The dimensions of
the resultant matrices will be
respectively.
Trace
The trace of a square matrix $A$, denoted by $\operatorname{tr} A$, is the sum of its diagonal elements: $\operatorname{tr} A = \sum_i A_{ii}$.
• Trace of transpose: The trace of a square matrix is equal to that of its transpose.
• Commutativity under trace: for any two matrices $A$ and $B$ of compatible sizes, we have $\operatorname{tr}(AB) = \operatorname{tr}(BA)$.
The scalar product between two $m \times n$ matrices $A$ and $B$ is defined as $\langle A, B \rangle = \operatorname{tr}(A^T B) = \sum_{i,j} A_{ij} B_{ij}$.
Our notation is consistent with the definition of the scalar product between two vectors, where we simply
view a vector in as a matrix in . We can interpret the matrix scalar product as the vector scalar
product between two long vectors of length each, obtained by stacking all the columns of on top
of each other.
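The two trace properties above, and the ‘‘stacking’’ interpretation of the matrix scalar product, can be checked numerically; the sketch below uses random matrices purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# Commutativity under the trace: trace(AB) = trace(BA).
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))           # True

# Matrix scalar product <A, C> = trace(A^T C) equals the vector scalar
# product of the two matrices flattened into long vectors.
C = rng.standard_normal((3, 4))
print(np.isclose(np.trace(A.T @ C), A.ravel() @ C.ravel()))   # True
```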
11.
SPECIAL CLASSES OF MATRICES
• Square Matrices
◦ Identity and diagonal matrices
◦ Triangular matrices
◦ Symmetric matrices
◦ Orthogonal Matrices
• Dyads
Identity matrix
The identity matrix (often denoted $I_n$, or simply $I$ if the context allows) has ones on its diagonal
and zeros elsewhere. It is square, diagonal, and symmetric. This matrix satisfies $A I_n = A$ for every
matrix $A$ with $n$ columns, and $I_n B = B$ for every matrix $B$ with $n$ rows.
Diagonal matrices
Diagonal matrices are square matrices with $A_{ij} = 0$ when $i \neq j$. A diagonal matrix can be
denoted as $A = \mathbf{diag}(a)$, with $a$ the vector containing the elements on the diagonal. We can also write
where by convention the zeros outside the diagonal are not written.
Symmetric matrices
Symmetric matrices are square matrices that satisfy $A_{ij} = A_{ji}$ for every pair $(i, j)$. An entire section is
devoted to symmetric matrices.
Triangular matrices
A square matrix $A$ is upper triangular if $A_{ij} = 0$ when $i > j$. Here are a few examples:
Orthogonal matrices
Orthogonal (or, unitary) matrices are square matrices such that their columns form an orthonormal basis. If
$U = [u_1, \ldots, u_n]$ is an orthogonal matrix, then
$$u_i^T u_j = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
Thus, $U^T U = I_n$. Similarly, $U U^T = I_n$.
Orthogonal matrices correspond to rotations or reflections across a direction: they preserve lengths and
angles. Indeed, for every vector $x$,
$$\|Ux\|_2^2 = (Ux)^T (Ux) = x^T U^T U x = x^T x = \|x\|_2^2.$$
Thus, the underlying linear map $x \rightarrow Ux$ preserves the length (measured in Euclidean norm). This is
sometimes referred to as the rotational invariance of the Euclidean norm.
In addition, angles are preserved: if are two vectors with unit norm, then the angle between them
satisfies , while the angle between the rotated vectors satisfies
. Since
we obtain that the angles are the same. (The converse is true: any square matrix that preserves lengths and
angles is orthogonal.)
Geometrically, orthogonal matrices correspond to rotations (around a point) or reflections (around a line
passing through the origin).
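As an illustration (a plane rotation is one example of an orthogonal matrix), the following sketch checks numerically that lengths and scalar products, hence angles, are preserved:

```python
import numpy as np

theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a 2D rotation matrix

print(np.allclose(U.T @ U, np.eye(2)))            # True: U is orthogonal

x = np.array([3.0, -1.0])
y = np.array([0.5, 2.0])
# Lengths and scalar products (hence angles) are preserved.
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))   # True
print(np.isclose((U @ x) @ (U @ y), x @ y))                   # True
```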
The matrix
is orthogonal.
11.2 Dyads
Dyads are a special class of matrices, also called rank-one matrices, for reasons seen later.
Definition
A matrix $A \in \mathbb{R}^{m \times n}$ is a dyad if it is of the form $A = u v^T$ for some vectors $u \in \mathbb{R}^m$, $v \in \mathbb{R}^n$. The
dyad acts on an input vector $x \in \mathbb{R}^n$ as follows:
$$A x = (u v^T) x = (v^T x)\, u.$$
In terms of the associated linear map, for a dyad, the output always points in the same direction $u$ in output
space ($\mathbb{R}^m$), no matter what the input $x$ is. The output is thus always a simple scaled version of $u$. The
amount of scaling depends on the vector $x$, via the linear function $x \rightarrow v^T x$.
Normalized dyads
We can always normalize the dyad, by assuming that both $u$ and $v$ are of unit (Euclidean) norm, and using a
factor $\sigma > 0$ to capture their scale. That is, any non-zero dyad can be written in normalized form:
$$A = u v^T = \sigma\, \tilde{u} \tilde{v}^T,$$
where $\sigma > 0$, and $\|\tilde{u}\|_2 = \|\tilde{v}\|_2 = 1$.
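A quick NumPy illustration of a dyad (the vectors below are arbitrary): the matrix has rank one, and every output is a scaled copy of $u$:

```python
import numpy as np

u = np.array([1.0, 2.0, -1.0])
v = np.array([3.0, 0.0, 1.0, -2.0])
A = np.outer(u, v)                 # the dyad u v^T, a 3 x 4 rank-one matrix

print(np.linalg.matrix_rank(A))    # 1

x = np.array([1.0, -1.0, 2.0, 0.5])
# The output A x = (v^T x) u always points along u.
print(np.allclose(A @ x, (v @ x) * u))   # True
```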
12.
QR DECOMPOSITION OF A MATRIX
• Basic idea
• The case when the matrix has linearly independent columns
• General case
• Full QR decomposition
The QR decomposition is nothing else than the Gram-Schmidt procedure applied to the columns of the
matrix, with the result expressed in matrix form. Consider an $m \times n$ matrix $A = [a_1, \ldots, a_n]$, where
each $a_i$ is a column of $A$.
We write this as
where and .
Since the $q_i$'s are unit-length and mutually orthogonal, the matrix $Q = [q_1, \ldots, q_n]$ satisfies $Q^T Q = I_n$. The QR
decomposition of a matrix thus allows writing the matrix in factored form:
$$A = Q R,$$
with $Q$ having orthonormal columns and $R$ upper triangular.
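In practice the factorization can be obtained with a library routine; the sketch below uses NumPy's `numpy.linalg.qr` (in its default ‘‘reduced’’ mode) on a random tall matrix, purely as an illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))          # tall matrix, generically with independent columns

Q, R = np.linalg.qr(A)                   # "economy" QR: Q is 5x3, R is 3x3 upper triangular

print(np.allclose(Q.T @ Q, np.eye(3)))   # True: orthonormal columns
print(np.allclose(np.triu(R), R))        # True: R is upper triangular
print(np.allclose(Q @ R, A))             # True: A = QR
```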
(This is simply an upper triangular matrix with some rows deleted. It is still upper triangular.)
We can permute the columns of to bring forward the first non-zero elements in each row:
where is a permutation matrix (that is, its columns are the unit vectors in some order), whose effect
is to permute columns. (Since is orthogonal, .) Now, is square, upper triangular,
and invertible, since none of its diagonal elements is zero.
where
1.
2. is the of ;
3. is upper triangular, invertible matrix;
4. is a matrix;
5. is a permutation matrix.
where the columns of the matrix are orthogonal and is upper triangular and
invertible. (As before, is a permutation matrix.) In the G-S procedure, the columns of are obtained
from those of , while the columns of come from the extra columns added to .
The full QR decomposition reveals the rank of : we simply look at the elements on the diagonal of that
are not zero, that is, the size of .
13.
MATRIX INVERSES
For an invertible (square) matrix $A$, there exists a unique matrix $B$ such that $A B = B A = I$. The matrix $B$
is denoted $A^{-1}$ and is called the inverse of $A$.
If a matrix is square, invertible, and triangular, we can compute its inverse simply, as follows. We
solve linear equations of the form with the -th column of the
identity matrix, using a process known as backward substitution. Here is an example. At the outset, we form
the matrix By construction, .
For a general square and invertible matrix $A$, the QR decomposition can be used to compute its inverse. For
such matrices, the QR decomposition is of the form $A = QR$, with $Q$ an orthogonal matrix and $R$ upper
triangular and invertible. Then the inverse is $A^{-1} = R^{-1} Q^T$.
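A minimal numerical check of this formula (the matrix below is random and generically invertible; this is only an illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))

Q, R = np.linalg.qr(A)
# Since A = QR with Q orthogonal and R triangular, A^{-1} = R^{-1} Q^T.
A_inv = np.linalg.solve(R, Q.T)          # solves R X = Q^T

print(np.allclose(A @ A_inv, np.eye(4)))     # True
print(np.allclose(A_inv, np.linalg.inv(A)))  # True
```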
A useful property is the expression of the inverse of a product of two square, invertible matrices $A$ and $B$:
$(AB)^{-1} = B^{-1} A^{-1}$. (Indeed, you can check that this inverse works: $(AB)(B^{-1} A^{-1}) = I$.)
A matrix has full column rank if and only if there exists an matrix such that (here
is the small dimension). We say that is a left-inverse of . To find one left inverse of a matrix with
independent columns , we use the full QR decomposition of to write
A matrix has full row rank if and only if there exists an matrix such that (here
is the small dimension). We say that is a right-inverse of . We can derive expressions of right
inverses by noting that is full row rank if and only if is full column rank. In particular, for a matrix
with independent rows, the full QR decomposition (of ) allows writing
14.
LINEAR MAPS
Definition
A map is linear (resp. affine) if and only if every one of its components is. The formal
definition we saw here for functions applies verbatim to maps.
Hence, there is a one-to-one correspondence between matrices and linear maps. This extends what we
saw for vectors, which are in one-to-one correspondence with linear functions.
for some unique pair , with and . The function is linear if and only if .
The result above shows that a matrix can be seen as a (linear) map from the “input” space to the “output”
space . Both points of view (matrices as simple collections of vectors, or as linear maps) are useful.
Interpretations
Consider an affine map . An element gives the coefficient of influence of over .
In this sense, if we can say that has much more influence on than . Or, says
that does not depend at all on . Often the constant term is referred to as the “bias” vector.
See also:
15.
MATRIX NORMS
• Motivating example
• RMS gain: the Frobenius norm
• Peak gain: the largest singular value norm
• Applications
Now, assume that there is some noise in the vector : the actual input is , where is an error
vector. This implies that there will be noise in the output as well: the noisy output is so the error
on the output due to noise is . How could we quantify the effect of input noise on the output noise?
One approach is to try to measure the norm of the error vector, . Obviously, this norm depends on the
noise , which we do not know. So we will assume that can take values in a set. We need to come up with a
single number that captures in some way the different values of when spans that set. Since scaling
simply scales the norm accordingly, we will restrict the vectors to have a certain norm, say
.
Clearly, depending on the choice of the set, the norms we use to measure lengths, and how we choose
to capture many numbers with one, etc., we will obtain different numbers.
where stands for the -th column of . The quantity above can be written as , where
The function turns out to satisfy the basic conditions of a norm in the matrix space .
In fact, it is the Euclidean norm of the vector of length formed with all the coefficients of . Further,
the quantity would remain the same if we had chosen any orthonormal basis other than the standard one.
The Frobenius norm is useful to measure the RMS (root-mean-square) gain of the matrix, and its average
response along given mutually orthogonal directions in space. Clearly, this approach does not capture well
the variance of the error, only the average effect of noise.
The computation of the Frobenius norm is very easy: for an $m \times n$ matrix, it requires about $mn$ flops.
Let us assume that the noise vector is bounded but otherwise unknown. Specifically, all we know about
is that , where is the maximum amount of noise (measured in Euclidean norm). What is then
the worst-case (peak) value of the norm of the output noise? This is answered by the optimization problem
The quantity
measures the peak gain of the mapping , in the sense that if the noise vector is bounded in norm by
, then the output noise is bounded in norm by . Any vector which achieves the maximum above
corresponds to a direction in input space that is maximally amplified by the mapping .
The quantity is indeed a matrix norm, called the largest singular value (LSV) norm, for reasons
seen here. It is perhaps the most popular matrix norm.
The computation of the largest singular value norm of a matrix is not as easy as with the Frobenius norm.
However, it can be computed with linear algebra methods seen here, in about flops. While
it is more expensive to compute than the Frobenius norm, it is also more useful because it goes beyond
capturing the average response to noise.
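Both norms are available in NumPy; the sketch below (random matrix, for illustration only) also checks the peak-gain inequality mentioned above:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))

fro = np.linalg.norm(A, 'fro')   # Frobenius norm: sqrt of the sum of squared entries
lsv = np.linalg.norm(A, 2)       # largest singular value (peak gain) norm

print(fro, lsv)
print(lsv <= fro + 1e-12)        # True: the LSV norm never exceeds the Frobenius norm

# Peak-gain interpretation: ||Ax|| <= lsv * ||x|| for every x.
x = rng.standard_normal(6)
print(np.linalg.norm(A @ x) <= lsv * np.linalg.norm(x) + 1e-12)   # True
```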
Other norms
Many other matrix norms are possible, and sometimes useful. In particular, we can generalize the notion of
peak norm by using different norms to measure vector size in the input and output spaces. For example, the
quantity
measures the peak gain with inputs bounded in the maximum norm, and outputs measured with the
-norm.
The norms we have just introduced, the Frobenius and largest singular value norms, are the most popular
ones and are easy to compute. Many other norms are hard to compute.
15.4 Applications
Assume for example that we are trying to estimate a matrix , and came up with an estimate . How can
we measure the quality of our estimate? One way is to evaluate by how much they differ when they act on a
standard basis. This leads to the Frobenius norm.
when $x$ runs over the whole space. Clearly, we need to scale, or limit the size of, $x$; otherwise the difference above
may be arbitrarily big. Let us look at the worst-case difference when $x$ satisfies $\|x\|_2 \le 1$. We obtain
Let us try to visualize the data set by projecting it on a single line passing through the origin. The line is thus
defined by a vector , which we can without loss of generality assume to be of Euclidean norm .
The data points, when projected on the line, are turned into real numbers .
It can be argued that a good line to project data on is one which spreads the numbers as much as
possible. (If all the data points are projected to numbers that are very close, we will not see anything, as all
data points will collapse to close locations.)
We can find a direction in space that accomplishes this, as follows. The average of the numbers is
(It turns out that this quantity is the same as the LSV norm of itself.)
16.
APPLICATIONS
Definition
Many discrete-time dynamical systems can be modeled via linear state-space equations, of the form
where the state vector encapsulates the condition of the system at time $t$; the input vector contains the
control variables; and the output vector contains specific outputs of interest. The matrices involved are of appropriate dimensions.
In effect, a linear dynamical model postulates that the state at the next instant is a linear function of the state
at past instants, and possibly other ‘‘exogenous’’ inputs; and that the output is a linear function of the state
and input vectors.
Finally, the so-called time-varying models involve time-varying matrices (see an example
below).
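To make the recursion concrete, here is a small simulation sketch of a discrete-time model of the form $x_{t+1} = A x_t + B u_t$, $y_t = C x_t + D u_t$; the matrices and the constant input are hypothetical choices, not taken from the text:

```python
import numpy as np

# Hypothetical system matrices, chosen only to illustrate the recursion.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

x = np.zeros((2, 1))                 # initial state
ys = []
for t in range(50):
    u = np.array([[1.0]])            # constant input, for illustration
    y = C @ x + D @ u
    ys.append(y[0, 0])
    x = A @ x + B @ u                # state update

print(ys[:3], ys[-1])                # the output settles toward a steady state
```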
Motivation
The main motivation for state-space models is to be able to model high-order derivatives in dynamical
equations, using only first-order derivatives, but involving vectors instead of scalar quantities.
The above involves second-order derivatives of a scalar function . We can express it in an equivalent form
involving only first-order derivatives, by defining the state vector to be
The price we pay is that now we deal with a vector equation instead of a scalar equation:
A nonlinear system
In the case of non-linear systems, we can also use state-space representations. In the case of autonomous
systems (no external input) for example, these come in the form
where is now a non-linear map. Now assume we want to model the behavior of the
system near an equilibrium point (such that ). Let us assume for simplicity that .
Using the first-order approximation of the map , we can write a linear approximation to the above model:
where
17.
EXERCISES
• Matrix products
• Special matrices
• Linear maps, dynamical systems
• Matrix inverses, norms
c. Show that .
2. Consider a discrete-time linear dynamical system (for background, see here) with state ,
input vector , and output vector , that is described by the linear equations
a. Assuming that the system has initial condition , express the output vector at
time as a linear function of ; that is, determine a matrix such that
, where is a vector containing all the
inputs up to and including at time .
PART III
LINEAR EQUATIONS
Linear equations have been around for thousands of years. The picture on the left shows a 17th-century
Chinese text that explains the ancient art of fangcheng (‘‘rectangular arrays’’, for more details see here).
Linear equations arise naturally in many areas of engineering, often as simple models of more complicated,
non-linear equations. They form the core of linear algebra, and often arise as constraints in optimization
problems. They are also an important building block of optimization methods, as many optimization
algorithms rely on linear equations.
The issue of existence and unicity of solutions leads to important notions attached to the associated matrix
. The nullspace, which contains the input vectors that are crushed to zero by the associated linear map
; the range, which contains the set of output vectors that are attainable by the linear map; and its
dimension, the rank. There is a variety of solution methods for linear equations; we describe how the QR
decomposition of a matrix can be used in this context.
Outline
• Motivating example
• Existence and unicity of solutions
• Solving linear equations
• Applications
• Exercises
18.
MOTIVATING EXAMPLE
• Overview
• From 1D to 2D: axial tomography
• Linear equations for a single slice
• Issues raised: finding a solution, existence and unicity
18.1. Overview
Tomography means reconstruction of an image from its sections. The word comes from the Greek ‘‘tomos’’
(‘‘slice’’) and ‘‘graph’’ (‘‘description’’). The problem arises in many fields, ranging from astronomy to
medical imaging.
Computerized Axial Tomography (CAT) is a medical imaging method that processes large amounts of two-
dimensional X-ray images in order to produce a three-dimensional image. The goal is to picture for example
the tissue density of the different parts of the brain, in order to detect anomalies (such as brain tumors).
Typically, the X-ray images represent ‘‘slices’’ of the part of the body (such as the brain) that is examined.
Those slices are indirectly obtained via axial measurements of X ray attenuation, as explained below. Thus,
in CAT for medical imaging, we use axial (line) measurements to get two-dimensional images (slices), and
from that scan of images, we may proceed to digitally reconstruct a three-dimensional view. Here, we focus
on the process that produces a single two-dimensional image from axial measurements.
With the discretization, the linear relationship between intensities log-ratios and densities can be expressed
as
where denotes the indices of pixel areas traversed by the X ray, the density in the area, and
the proportion of the area within the pixel that is traversed by the ray.
Thus, we can relate the vector to the observed intensity log-ratio vector in terms of a
linear equation
where , with . Note that depending on the number of pixels used, and the number of
measurements, the matrix can be quite large. In general, the matrix is wide, in the sense that it has (many)
more columns than rows. Thus, the above system of equations is usually underdetermined.
18.4. Issues
The above example motivates us to address the problems of solving linear equations. It also raises the issue
of existence (do we have enough measurements to find the densities?) and unicity (if a solution exists, is it
unique?).
19.
EXISTENCE AND UNICITY OF SOLUTIONS
The set of solutions to the above equation, if it is not empty, is an affine subspace. That is, it is of the form
where is a subspace.
Range
The range (or, image) of a matrix is defined as the following subset of :
The range describes the vectors that can be attained in the output space by an arbitrary choice of a
vector , in the input space. The range is simply the span of the columns of .
If , we say that the linear equation is infeasible. The set of solutions to the linear
equation is empty.
From a matrix it is possible to find a matrix, the columns of which span the range of the matrix , and are
mutually orthogonal. Hence, , where is the dimension of the range. One algorithm to obtain
the matrix is the Gram-Schmidt procedure.
Rank
The dimension of the range is called the rank of the matrix. As we will see later, the rank cannot exceed
any one of the dimensions of the matrix . A matrix is said to be full rank if
.
Note that the rank is a very ‘‘brittle’’ notion, in that small changes in the entries of the matrix can
dramatically change its rank. Matrices with entries drawn at random from a continuous distribution are, with
probability one, full rank. We will develop here a better, more numerically reliable notion.
Any linear combination of these vectors can be represented as , where . For our matrix , the range can
be visually represented as the plane spanned by and .
Rank: The rank of a matrix is the dimension of its range. For our matrix , since both column vectors are linearly
independent, the rank is:
See also:
• Rank-one matrices.
• Rank properties of the arc-node incidence matrix.
An equivalent condition for $A$ to be full row rank is that the square, $m \times m$ matrix $A A^T$ is invertible,
meaning that it has full rank, $m$.
Proof.
Nullspace
The nullspace (or, kernel) of a matrix is the following subspace of :
The nullspace describes the ambiguity in given : any will be such that
, so cannot be determined by the sole knowledge of if the nullspace is not reduced to
the singleton .
From a matrix we can obtain a matrix, the columns of which span the nullspace of the matrix , and are
mutually orthogonal. Hence, , where is the dimension of the nullspace.
Given the matrix structure, for any vector such that the first component is times the second component and
the third component can be arbitrary, .
satisfies and is thus in the nullspace of . The dimension of this nullspace, , is 2 (since we have two free
variables).
Nullity
The nullity of a matrix $A$ is the dimension of its nullspace. The rank-nullity theorem states that the nullity
of an $m \times n$ matrix $A$ is $n - r$, where $r$ is the rank of $A$.
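The theorem can be checked numerically; in the sketch below the rank is computed with `numpy.linalg.matrix_rank` and the nullity is obtained by counting the (numerically) zero singular values. The example matrix is arbitrary:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0, 1.0],
              [2.0, 4.0, 1.0, 0.0]])    # a 2 x 4 matrix

n = A.shape[1]
rank = np.linalg.matrix_rank(A)

# Nullity from the SVD: count the singular values that are (numerically) zero.
s = np.linalg.svd(A, compute_uv=False)
nullity = n - np.sum(s > 1e-10)

print(rank, nullity, rank + nullity == n)   # 2 2 True  (rank-nullity theorem)
```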
has $x = 0$ as the unique solution. Hence, $A$ is one-to-one if and only if its columns are independent. Since
the rank is always at most the smallest of the number of columns and rows, an $m \times n$ matrix of full column
rank necessarily has no more columns than rows (that is, $n \le m$).
The term ‘‘one-to-one’’ comes from the fact that for such matrices, the condition uniquely
determines , since and implies , so that the solution is unique:
. The name ‘‘full column rank’’ comes from the fact that the rank equals the column dimension of
.
An equivalent condition for $A$ to be full column rank is that the square, $n \times n$ matrix $A^T A$ is invertible,
meaning that it has full rank, $n$.
Proof
Rank-nullity theorem
The nullity (dimension of the nullspace) and the rank (dimension of the range) of a matrix add up to the
column dimension of .
Proof.
Another important result involves the definition of the orthogonal complement of a subspace.
The range of a matrix is the orthogonal complement of the nullspace of its transpose. That is, for a matrix
:
Proof.
20.
SOLVING LINEAR EQUATIONS VIA QR DECOMPOSITION
• Basic idea
• The QR decomposition of a matrix
• Solution via full QR decomposition
• Set of solutions
The basic idea in the solution algorithm starts with the observation that in the special case when is upper
triangular, that is, if , then the system can be easily solved by a process known as backward
substitution. In backward substitution we simply start solving the system by eliminating the last variable first,
then proceed to solve backwards. The process is illustrated in this example, and described in generality here.
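A possible implementation of backwards substitution in NumPy (the example system below is ours, chosen so the solution is easy to check):

```python
import numpy as np

def backward_substitution(R, b):
    """Solve R x = b for an invertible upper-triangular matrix R."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):            # eliminate the last variable first
        x[i] = (b[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

R = np.array([[2.0, 1.0, -1.0],
              [0.0, 3.0,  2.0],
              [0.0, 0.0,  4.0]])
b = np.array([2.0, 10.0, 8.0])
x = backward_substitution(R, b)
print(x)                          # [1. 2. 2.]
print(np.allclose(R @ x, b))      # True
```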
Once the QR factorization of $A$ is obtained, we can solve the system by first pre-multiplying both sides of the
equation with $Q^T$:
$$Q^T A x = R x = Q^T b.$$
This is due to the fact that $Q^T Q = I$. The new system is triangular and can be solved
by backwards substitution. For example, if $A$ is full column rank, then $R$ is invertible, so that the solution is
unique, and given by $x = R^{-1} Q^T b$.
where
• is and orthogonal ( );
• is , with orthonormal columns ( );
• is , with orthonormal columns ( );
• is the rank of ;
• is upper triangular, and invertible;
• is a matrix;
• is a permutation matrix (thus, ).
• The zero submatrices in the bottom (block) row of have rows.
We see that unless , there is no solution. Let us assume that . We have then
A particular solution is obtained upon setting , which leads to a triangular system in , with an
invertible triangular matrix . So that , which corresponds to a particular solution to
:
where
21.
APPLICATIONS
Denote by , the three known points and by the measured distances to the emitter.
Mathematically the problem is to solve, for a point in , the equations
Using matrix notation, with the matrix of points, and the vector of ones:
Let us assume that the square matrix is full-rank, that is, invertible. The equation above implies that
In words: the point lies in a line passing through and with direction .
and can be solved in closed-form. The spheres intersect if and only if there is a real, non-negative solution
. Generically, if the spheres have a non-empty intersection, there are two positive solutions, hence two
points in the intersection. This is understandable geometrically: the intersection of two spheres is a circle,
and intersecting a circle with a third sphere produces two points. The line joining the two points is the line
, as identified above.
(Figure) A network of four intersections, with measured traffic counts (450, 400, 610, 640 vehicles per hour) entering and leaving the network, and unknown link flows $x_1, x_2, x_3, x_4$.
For the simple problem above, we simply use the fact that at each intersection, the incoming traffic has to
match the outgoing traffic. This leads to the linear equations:
The matrix is nothing else than the incidence matrix associated with the graph that has the intersections
as nodes and links as edges.
22.
EXERCISES
PART IV
LEAST-SQUARES
The ordinary least-squares (OLS) problem is a particularly simple optimization problem that involves the
minimization of the Euclidean norm of a ‘‘residual error” vector that is affine in the decision variables.
The problem is one of the most ubiquitous optimization problems in engineering and applied sciences. It
can be used for example to fit a straight line through points, as in the figure on the left. The least-squares
approach then amounts to minimizing the sum of the areas of the squares with side length equal to the vertical
distances to the line.
We discuss a few variants amenable to the linear algebra approach: regularized least-squares, linearly-
constrained least-squares. We also explain how to use ‘‘kernels’’ to handle problems involving non-linear
curve fitting and prediction using non-linear functions.
Outline
• Ordinary least-squares
• Variants of the least-squares problem
• Kernels for least-squares
• Applications
• Exercises
23.
ORDINARY LEAST-SQUARES
• Definition
• Interpretations
• Solution via QR decomposition (full rank case)
• Optimal solution (general case)
23.1. Definition
The Ordinary Least-Squares (OLS, or LS) problem is defined as
where are given. Together, the pair is referred to as the problem data. The
vector is often referred to as the ‘‘measurement” or “output” vector, and the data matrix as the ‘‘design‘‘
or ‘‘input‘‘ matrix. The vector is referred to as the residual error vector.
Note that the problem is equivalent to one where the norm is not squared. Taking the squares is done for
the convenience of the solution.
23.2. Interpretations
The OLS can be interpreted as finding the smallest (in Euclidean norm sense) perturbation of the right-hand
side, , such that the linear equation
becomes feasible. In this sense, the OLS formulation implicitly assumes that the data matrix of the
problem is known exactly, while only the right-hand side is subject to perturbation, or measurement errors.
A more elaborate model, total least-squares, takes into account errors in both and .
Interpretation as regression
We can also interpret the problem in terms of the rows of , as follows. Assume that
, where is the -th row of . The problem reads
In this sense, we are trying to fit each component of as a linear combination of the corresponding input
, with as the coefficients of this linear combination.
See also:
• Linear regression.
• Auto-regressive models for time series prediction.
• Power law model fitting.
This can be seen by simply taking the gradient (vector of derivatives) of the objective function, which leads
to the optimality condition $A^T (A x - y) = 0$ (the so-called normal equations). Geometrically, the residual vector $A x - y$ is orthogonal
to the span of the columns of $A$, as seen in the picture above.
We can also prove this via the QR decomposition of the matrix with a matrix
with orthonormal columns ( ) and a upper-triangular, invertible matrix. Noting that
and exploiting the fact that is invertible, we obtain the optimal solution . This is the
same as the formula above, since
Thus, to find the solution based on the QR decomposition, we just need to implement two steps:
In the general case ( is not necessarily tall, and /or not full rank) then the solution may not be unique. If
is a particular solution, then is also a solution, if is such that , that is, .
That is, the nullspace of describes the ambiguity of solutions. In mathematical terms:
The formal expression for the set of minimizers to the least-squares problem can be found again via the QR
decomposition. This is shown here.
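The following sketch (random data, for illustration only) solves a least-squares problem via the reduced QR factorization and compares the result with the normal equations and with NumPy's built-in solver; it also checks the orthogonality of the residual:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 3))            # tall, generically full column rank
y = rng.standard_normal(20)

# Least squares via the (reduced) QR decomposition: x = R^{-1} Q^T y.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ y)

# Same result from the normal equations, or from numpy's built-in solver.
x_ne = np.linalg.solve(A.T @ A, A.T @ y)
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_qr, x_ne), np.allclose(x_qr, x_ls))   # True True

# Optimality check: the residual is orthogonal to the columns of A.
print(np.allclose(A.T @ (A @ x_qr - y), 0.0))             # True
```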
24.
VARIANTS OF THE LEAST-SQUARES PROBLEM
• Linearly-constrained least-squares
• Minimum-norm solutions to linear equations
• Regularized least-squares
Definition
An interesting variant of the ordinary least-squares problem involves equality constraints on the decision
variable :
Solution
We can express the solution by first computing the null space of . Assuming that the feasible set of the
constrained LS problem is not empty, that is, is in the range of , this set can be expressed as
where is the dimension of the nullspace of is a matrix whose columns span the nullspace of , and
is a particular solution to the equation .
Expressing in terms of the free variable , we can write the constrained problem as an unconstrained one:
where , and .
in which we implicitly assume that the linear equation in , has a solution, that is, is in the
range of .
The above problem allows selecting a particular solution to a linear equation, in the case when there are
possibly many, that is, the linear system is under-determined.
As seen here, when $A$ is full row rank, that is, the matrix $A A^T$ is invertible, the above has the closed-form
solution
$$x^* = A^T (A A^T)^{-1} y.$$
The regularized problem can be expressed as an ordinary least-squares problem, where the data matrix is full
column rank. Indeed, the above problem can be written as the ordinary LS problem
where
The presence of the identity matrix in the matrix ensures that it is full (column) rank.
Solution
Since the data matrix in the regularized LS problem has full column rank, the formula seen here applies. The
solution is unique and given by
$$x^* = (A^T A + \lambda I)^{-1} A^T y.$$
For $\lambda \rightarrow 0$, we recover the ordinary LS expression that is valid when the original data matrix is full rank.
The above formula explains one of the motivations for using regularized least-squares in the case of a rank-
deficient matrix $A$: if $\lambda > 0$ is small, the above expression is still defined, even if $A$ is rank-deficient.
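A numerical sketch of the closed-form regularized solution and of its reformulation as an ordinary least-squares problem with an augmented data matrix; the data and the value of the regularization parameter are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 4))
y = rng.standard_normal(10)
lam = 0.1                                    # regularization parameter (illustrative value)

# Closed-form regularized (ridge) solution: x = (A^T A + lam * I)^{-1} A^T y.
n = A.shape[1]
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Equivalent ordinary LS problem with an augmented, full-column-rank data matrix.
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
y_aug = np.concatenate([y, np.zeros(n)])
x_aug, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)

print(np.allclose(x_ridge, x_aug))           # True
```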
where is positive definite (that is, for every non-zero ). The solution is again unique and
given by
25.
KERNELS FOR LEAST-SQUARES
• Motivations
• The kernel trick
• Nonlinear case
• Examples of kernels
• Kernels in practice
25.1. Motivations
Consider a linear auto-regressive model for time-series, where is a linear function of
It appears that the size of the least-squares problem grows quickly with the degree of the feature vectors.
How do we do it in a computationally efficient manner?
for some vector . Indeed, from the fundamental theorem of linear algebra, every can be
written as the sum of two orthogonal vectors:
The prediction rule depends on the scalar products between new point and the data points :
Once is formed (this takes ), then the training problem has only variables. When , this
leads to a dramatic reduction in problem size.
2. making a prediction,
depends only on our ability to quickly evaluate such scalar products. We can’t choose arbitrarily; it has to
satisfy the above for some .
Polynomial kernels
Regression with quadratic functions involves feature vectors
More generally when is the vector formed with all the products between the components of ,
up to degree , then for any two vectors ,
This represents a dramatic reduction in speed over the ‘‘brute force’’ approach:
1. Form ;
Gaussian kernels
Gaussian kernel function:
where $\sigma$ is a scale parameter. This allows ignoring points that are too far apart, and corresponds to a non-
linear mapping to an infinite-dimensional feature space.
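As a hedged illustration of the kernel approach, the sketch below performs kernel ridge regression with a Gaussian kernel on synthetic one-dimensional data; all names, data, and hyper-parameter values are our own choices, not the text's:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(40, 1))                   # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)    # noisy outputs

lam, sigma = 1e-3, 0.8                                 # illustrative hyper-parameters
K = gaussian_kernel(X, X, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # dual variables

X_test = np.array([[0.5], [1.5]])
y_pred = gaussian_kernel(X_test, X, sigma) @ alpha     # prediction uses only kernel values
print(y_pred)          # roughly sin(0.5) and sin(1.5), up to noise and regularization
```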
Other kernels
There is a large variety (a zoo?) of other kernels, some adapted to the structure of data (text, images, etc).
2. The choice is not always obvious; Gaussian or polynomial kernels are popular.
3. We control over-fitting via cross-validation (to choose, say, the scale parameter of Gaussian kernel,
or degree of the polynomial kernel).
26.
APPLICATIONS
In its basic form, the problem is as follows. We are given data where
is the ‘‘input’’ and is the ‘‘output’’ for the th measurement. We seek to find a linear function
such that are collectively close to the corresponding values .
In least-squares regression, the way we evaluate how well a candidate function fits the data is via the
(squared) Euclidean norm:
Since a linear function has the form for some , the problem of minimizing the
above criterion takes the form
where
The linear regression approach can be extended to multiple dimensions, that is, to problems where the
output in the above problem contains more than one dimension (see here). It can also be extended to the
problem of fitting non-linear curves.
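A minimal example of fitting a straight line through points by least squares (the data below are made up for illustration):

```python
import numpy as np

# Hypothetical data: points (t_i, y_i) roughly on a line y = a * t + b.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])

# Design matrix: one column for the slope, one for the intercept.
A = np.column_stack([t, np.ones_like(t)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(a, b)                                        # slope ~2, intercept ~1
print(np.linalg.norm(A @ np.array([a, b]) - y))    # small residual norm
```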
where ‘s are constant coefficients, and is the ‘‘memory length’’ of the model. The interpretation of the
model is that the next output is a linear function of the past. Elaborate variants of auto-regressive models are
widely used for the prediction of time series arising in finance and economics.
27.
EXERCISES
• Standard forms
• Applications
in which the data matrix is noisy. Our specific noise model assumes that each row
has the form , where the noise vector has zero mean and
covariance matrix , with a measure of the size of the noise. Therefore, now the matrix is a
function of the set of uncertain vectors , which we denote by . We will write
to denote the matrix with rows . We replace the original problem with
where denotes the expected value with respect to the random variable . Show that this problem
can be written as
where is some regularization parameter, which you will determine. That is, regularized least-
squares can be interpreted as a way to take into account uncertainties in the matrix , in the expected
value sense.
27.2. Applications
1. Moore’s law describes a long-term trend in the history of computing hardware and states that
the number of transistors that can be placed inexpensively on an integrated circuit has doubled
approximately every two years. In this problem, we investigate the validity of the claim via least-
squares.
Year Transistor
1971 2,250
1972 2,500
1974 5,000
1978 29,000
1982 120,000
1985 275,000
1989 1,180,000
1993 3,100,000
1997 7,500,000
1999 24,000,000
2000 42,000,000
2002 220,000,000
2003 410,000,000
show how to estimate the parameters using least-squares, that is, via a problem of the form
Make sure to define precisely the data and how the variable relates to the original problem
for the corresponding years. You can assume that no component of is zero at
optimum.)
a. Is the solution to the problem above unique? Justify your answer carefully, and give the expression for the unique solution in terms of .
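A possible numerical starting point for this exercise is sketched below (Python with NumPy; this is only one way to set up the problem, and it assumes the exponential model is fitted to the base-10 logarithm of the transistor counts):

    import numpy as np

    years = np.array([1971, 1972, 1974, 1978, 1982, 1985, 1989,
                      1993, 1997, 1999, 2000, 2002, 2003])
    counts = np.array([2250, 2500, 5000, 29000, 120000, 275000, 1180000,
                       3100000, 7500000, 24000000, 42000000, 220000000, 410000000])

    # fit log10(count) ~ a * year + b by least squares
    A = np.column_stack([years.astype(float), np.ones(len(years))])
    a, b = np.linalg.lstsq(A, np.log10(counts), rcond=None)[0]

    doubling_time = np.log10(2.0) / a    # years needed for the fitted count to double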
2. The Michaelis–Menten model for enzyme kinetics relates the rate of an enzymatic reaction, to
the concentration of a substrate, as follows:
a. Show that the model can be expressed as a linear relation between the values and .
c. The above approach has been found to be quite sensitive to errors in the input data. Can you confirm this experimentally?
PART V
EIGENVALUES FOR SYMMETRIC
MATRICES
Symmetric matrices are square matrices whose elements mirror each other across the diagonal. They can be used to describe, for example, graphs with undirected, weighted edges between the nodes, distance matrices (say, between cities), and a host of other objects. Symmetric matrices are also important in optimization, as they are closely related to quadratic functions.
A fundamental theorem, the spectral theorem, shows that we can decompose any symmetric matrix as a
three-term product of matrices, involving an orthogonal transformation and a diagonal matrix. The theorem
has a direct implication for quadratic functions: it allows us to decompose any quadratic function into a
weighted sum of squared linear functions involving vectors that are mutually orthogonal. The weights are
called the eigenvalues of the symmetric matrix.
The spectral theorem allows us, in particular, to determine when a given quadratic function is ‘‘bowl-shaped’’, that is, convex. The spectral theorem also allows us to find directions of maximal variance within a data set.
Such directions are useful to visualize high-dimensional data points in two or three dimensions. This is the
basis of a visualization method known as principal component analysis (PCA).
Outline
28.
QUADRATIC FUNCTIONS AND SYMMETRIC MATRICES
Symmetric matrices
A square matrix is symmetric if it is equal to its transpose: in symbols, A = A^T.
Quadratic functions
A function is said to be a quadratic function if it can be expressed as
The function is said to be a quadratic form if there are no linear or constant terms in it:
Examples:
• is the coefficient of in ;
• for , is the coefficient of the term in ;
• is the coefficient of the term ;
• is the constant term, .
One-dimensional case
If is a twice-differentiable function of a single variable, then the second-order approximation
(or, second-order Taylor expansion) of at a point is of the form
where is the first derivative, and the second derivative, of at . We observe that the
quadratic approximation has the same value, derivative, and second-derivative as , at .
Multi-dimensional case
In multiple dimensions, we have a similar result. Let us approximate a twice-differentiable function
where , , and . Our condition that coincides with up to and including the second derivatives shows that we must have
Diagonal matrices
Perhaps the simplest special case of symmetric matrices is the class of diagonal matrices, which can have non-zero entries only on their diagonal.
Symmetric dyads
Another important class of symmetric matrices is that of the form , where . The matrix has
elements and is symmetric. Such matrices are called symmetric dyads. (If , then the dyad is
said to be normalized.)
Symmetric dyads correspond to quadratic functions that are simply squared linear forms:
.
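A quick numerical check of this correspondence (a Python/NumPy sketch; the vectors are ours):

    import numpy as np

    u = np.array([1.0, -2.0, 3.0])
    A = np.outer(u, u)                  # symmetric dyad u u^T

    x = np.array([0.5, 1.0, -1.0])
    assert np.isclose(x @ A @ x, (u @ x) ** 2)   # x^T (u u^T) x = (u^T x)^2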
29.
SPECTRAL THEOREM
An eigenvalue of a square matrix is a scalar such that the matrix maps some non-zero vector to that same vector scaled by . The vector is then referred to as an eigenvector associated with the eigenvalue . The eigenvector is said
to be normalized if . In this case, we have
The interpretation of is that it defines a direction along which the matrix behaves just like scalar multiplication. The amount of scaling is given by . (In German, the root ‘‘eigen’’ means ‘‘self’’ or ‘‘proper’’.) The eigenvalues
of the matrix are characterized by the characteristic equation
where the notation refers to the determinant of its matrix argument. The function, defined by
, is a polynomial of degree called the characteristic polynomial.
From the fundamental theorem of algebra, any polynomial of degree has (possibly not distinct)
complex roots. For symmetric matrices, the eigenvalues are real, since when , and
is normalized.
The spectral theorem states that the eigenvalues of a symmetric matrix exist and are all real; further, the associated eigenvectors can be chosen so as to form an orthonormal basis. The result offers a simple way to decompose the symmetric matrix as a product of simple transformations.
We can decompose any symmetric matrix with the symmetric eigenvalue decomposition (SED)
Here is a proof. The SED provides a decomposition of the matrix in simple terms, namely dyads.
We check that in the SED above, the scalars are the eigenvalues, and ‘s are associated eigenvectors, since
The eigenvalue decomposition of a symmetric matrix can be efficiently computed with standard software, in time that grows roughly as the cube of its dimension.
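In practice the decomposition is obtained with a standard routine; a small sketch using NumPy's symmetric eigensolver (our choice of tool, not the book's):

    import numpy as np

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])      # a symmetric matrix

    eigvals, U = np.linalg.eigh(A)       # real eigenvalues and orthonormal eigenvectors

    # reconstruct A as a weighted sum of dyads u_i u_i^T
    A_rec = sum(lam * np.outer(u, u) for lam, u in zip(eigvals, U.T))
    assert np.allclose(A, A_rec)
    assert np.allclose(U.T @ U, np.eye(3))   # eigenvectors form an orthonormal basis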
The term ‘‘variational’’ refers to the fact that the eigenvalues are given as optimal values of optimization
problems, which were referred to in the past as variational problems. Variational representations exist for all
the eigenvalues but are more complicated to state.
The interpretation of the above identities is that the largest and smallest eigenvalues are a measure of the
range of the quadratic function over the unit Euclidean ball. The quantities above can be
written as the minimum and maximum of the so-called Rayleigh quotient .
Historically, David Hilbert coined the term ‘‘spectrum’’ for the set of eigenvalues of a symmetric operator
(roughly, a matrix of infinite dimensions). The fact that for symmetric matrices, every eigenvalue lies in the
interval somewhat justifies the terminology.
30.
POSITIVE SEMI-DEFINITE MATRICES
• Definitions
• Special cases and examples
• Square root and Cholesky decomposition
• Ellipsoids
30.1. Definitions
For a given symmetric matrix , the associated quadratic form is the function with
values
• A symmetric matrix is said to be positive semi-definite (PSD, notation: ) if and only if the
associated quadratic form is non-negative everywhere:
• It is said to be positive definite (PD, notation: ) if the quadratic form is non-negative and
definite, that is, if and only if .
It turns out that a matrix is PSD if and only if the eigenvalues of are non-negative. Thus, we can check if
a form is PSD by computing the eigenvalue decomposition of the underlying symmetric matrix.
Proof.
By definition, the PSD and PD properties are properties of the eigenvalues of the matrix only, not of the
eigenvectors. Also, if the matrix is PSD, then for every matrix with columns, the matrix
also is.
30.2. Special cases and examples

Symmetric dyads
Special cases of PSD matrices include symmetric dyads. Indeed, if for some vector , then
for every :
Diagonal matrices
A diagonal matrix is PSD (resp. PD) if and only if all of its (diagonal) elements are non-negative (resp.
positive).
30.3. Square root and Cholesky decomposition

Any PSD matrix can be written as a product for an appropriate matrix . The decomposition
is not unique, and is only a possible choice (the only PSD one). Another choice, in terms of
the SED of , is . If is positive-definite, then we can choose to be lower
triangular, and invertible. The decomposition is then known as the Cholesky decomposition of .
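A small sketch of both factorizations for a positive definite matrix (Python/NumPy; the matrix is ours):

    import numpy as np

    A = np.array([[4.0, 2.0],
                  [2.0, 3.0]])           # positive definite

    # PSD square root, built from the eigenvalue decomposition
    lam, U = np.linalg.eigh(A)
    A_half = U @ np.diag(np.sqrt(lam)) @ U.T
    assert np.allclose(A_half @ A_half, A)

    # Cholesky factor: lower-triangular, invertible L with A = L L^T
    L = np.linalg.cholesky(A)
    assert np.allclose(L @ L.T, A)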
30.4. Ellipsoids
There is a strong correspondence between ellipsoids and PSD matrices.
Definition
We define an ellipsoid to be an affine transformation of the unit ball for the Euclidean norm:
where is PD.
with
where is PD.
It is possible to define degenerate ellipsoids, which correspond to cases when the matrix in the above,
or its inverse , is degenerate. For example, cylinders or slabs (intersection of two parallel half-spaces) are
degenerate ellipsoids.
31.
PRINCIPAL COMPONENT ANALYSIS
Recall that when is normalized, the scalar is the component of along , that is, it corresponds to
the projection of on the line passing through and with direction .
Here, we seek a (normalized) direction such that the empirical variance of the projected values ,
, is large. If is the vector of averages of the ‘s, then the average of the projected values is
. Thus, the direction of maximal variance is one that solves the optimization problem
where
We have seen the above problem before, under the name of the Rayleigh quotient of a symmetric matrix.
Solving the problem entails simply finding an eigenvector of the covariance matrix that corresponds to
the largest eigenvalue.
Main idea
The main idea behind principal component analysis is to first find a direction that corresponds to maximal
variance between the data points. The data is then projected on the hyperplane orthogonal to that direction.
We obtain a new data set and find a new direction of maximal variance. We may stop the process when we
have collected enough directions (say, three if we want to visualize the data in 3D).
It turns out that the directions found in this way are precisely the eigenvectors of the data’s covariance
matrix. The term principal components refers to the directions given by these eigenvectors. Mathematically,
the process thus amounts to finding the eigenvalue decomposition of a positive semi-definite matrix, the
covariance matrix of the data points.
Projection on a plane
The projection used to obtain, say, a two-dimensional view with the largest variance, is of the form
, where is a matrix that contains the eigenvectors corresponding to the first two
eigenvalues.
When we project the data on a two-dimensional plane corresponding to the eigenvectors associated
with the two largest eigenvalues , we get a new covariance matrix , where the
total variance of the projected data is
Hence, we can define the ratio of variance ‘‘explained’’ by the projected data as the ratio:
If the ratio is high, we can say that much of the variation in the data can be observed on the projected plane.
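The computation just described can be sketched in a few lines (Python with NumPy; the random data and the names are ours):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 5))    # 50 data points in dimension 5, one per row

    Xc = X - X.mean(axis=0)             # center the data
    C = Xc.T @ Xc / X.shape[0]          # sample covariance matrix

    eigvals, U = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]   # sort eigenvalues in decreasing order
    eigvals, U = eigvals[order], U[:, order]

    W = U[:, :2]                        # the two leading principal directions
    X_2d = Xc @ W                       # two-dimensional view of the data

    explained = eigvals[:2].sum() / eigvals.sum()   # ratio of variance "explained"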
32.
APPLICATIONS: PCA OF SENATE VOTING DATA
• Introduction
• Senate voting data and the visualization problem
• Projection on a line
• Projection on a plane
• Direction of maximal variance
• Principal component analysis
• Sparse PCA
• Sparse maximal variance problem
32.1 Introduction
In this case study, we take data from the votes on bills in the US Senate (2004-2006) and explore how we
can visualize the data by projecting it, first on a line then on a plane. We investigate how we can choose the
line or plane in a way that maximizes the variance in the result, via a principal component analysis method.
Finally, we examine how a variation on PCA that encourages sparsity of the projection directions allows us
to understand which bills are most responsible for the variance in the data.
Data
The data consists of the votes of Senators in the US Senate (2004-2006), for a total of bills. “Yay” (“Yes”) votes are represented as +1, “Nay” (“No”) votes as -1, and the other votes are recorded as 0. (A number of complexities are ignored here, such as the possibility of pairing the votes.)
This data can be represented here as a ‘‘voting’’ matrix , with elements taken
from . Each column of the voting matrix , contains the votes of a single
Senator for all the bills; each row contains the votes of all Senators on a particular bill.
Senate voting matrix: “Nay” votes are in black, “Yay” ones in white, and the others in grey. The transpose of the voting matrix is shown. The picture has many grey areas, as some Senators are replaced over time. Simply plotting the raw data matrix is often not very informative.
Visualization Problem
We can try to visualize the data set, by projecting each data point (each row or column of the matrix) on (say)
a 1D-, 2D- or 3D-space. Each ‘‘view’’ corresponds to a particular projection, that is, a particular one-, two-
or three-dimensional subspace on which we choose to project the data. The visualization problem consists
of choosing an appropriate projection.
There are many ways to formulate the visualization problem, and none dominates the others. Here, we focus
on the basics of that problem.
Scoring Senators
Specifically we would like to assign a single number, or ‘‘score’’, to each column of the matrix. We choose a
direction in , and a scalar in . This corresponds to the affine ‘‘scoring’’ function ,
which, to a generic column in of the data matrix, assigns the value
It is often useful to center these values around zero. This can be done by choosing such that
is the vector of sample averages across the columns of the matrix (that is, data points). The vector can be
interpreted as the ‘‘average response’’ across experiments.
In order to be able to compare the relative merits of different directions, we can assume, without loss of
generality, that the vector is normalized (so that ).
Centering data
It is convenient to work with the ‘‘centered’’ data matrix, which is
We can compute the (row) vector scores using the simple matrix-vector product:
We can check that the average of the above row vector is zero:
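These steps (centering, then scoring along a normalized direction) can be sketched as follows (Python/NumPy; the random voting matrix below is a toy placeholder, not the actual Senate data):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.choice([-1.0, 0.0, 1.0], size=(20, 10))   # rows: bills, columns: Senators (toy data)

    x_bar = X.mean(axis=1, keepdims=True)   # average vote on each bill
    Xc = X - x_bar                          # centered data matrix

    a = rng.standard_normal(20)
    a /= np.linalg.norm(a)                  # normalized scoring direction in "bill space"

    scores = a @ Xc                         # one score per Senator
    assert np.isclose(scores.mean(), 0.0)   # centered scores average to zero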
Scoring Map
This corresponds to the affine ‘‘scoring’’ map , which, to a generic column in of the
data matrix, assigns the two-dimensional value
The affine map allows us to generate two-dimensional data points (instead of -dimensional) , . As before, we can require that the be centered:
by choosing the vector to be such that , where is the ‘‘average response’’ defined
above. Our (centered) scoring map takes the form
We can encapsulate the scores in the matrix . The latter can be expressed as the
matrix-matrix product
Clearly, depending on which plane we choose to project on, we get very different pictures. Some planes seem to be more ‘‘informative’’ than others. We return to this issue here.
Motivation
We have seen here how we can choose a direction in bill space, and then project the Senate voting data matrix
on that direction, in order to visualize the data along a single line. Clearly, depending on how we choose
the line, we will get very different pictures. Some show large variation in the data, others seem to offer a narrower range, even if we take care to normalize the directions.
What could be a good criterion to choose the direction we project the data on?
It can be argued that a direction that results in large variations of the projected data is preferable to one with small variations. A direction with high variation ‘‘explains’’ the data better, in the sense that it allows us to distinguish between data points better. One criterion that we can use to quantify the variation in a collection of real numbers is the sample variance, which is the sum of the squares of the differences between the numbers and their average.
where
is the sample covariance matrix of the data. The interpretation of the coefficient is that it
provides the covariance between the votes of Senator and those of Senator .
We have seen the above problem before, under the name of the Rayleigh quotient of a symmetric matrix.
Solving the problem entails simply finding an eigenvector of the covariance matrix that corresponds to the
largest eigenvalue.
Main idea
The main idea behind principal components analysis is to first find a direction that corresponds to maximal
variance between the data points. The data is then projected on the hyperplane orthogonal to that direction.
We obtain a new data set and find a new direction of maximal variance. We may stop the process when we
have collected enough directions (say, three if we want to visualize the data in 3D).
Mathematically, the process amounts to finding the eigenvalue decomposition of a positive semi-definite
matrix: the covariance matrix of the data points. The directions of large variance correspond to the
eigenvectors with the largest eigenvalues of that matrix. The projection to use to obtain, say, a two-
dimensional view with the largest variance, is of the form , where is a matrix that
contains the eigenvectors corresponding to the first two eigenvalues.
Assume we are given a (sample) covariance matrix of the data, . Let us find the eigenvalue decomposition
of :
where is an orthogonal matrix. Note that the trace of that matrix has an interpretation as the
total variance in the data, which is the sum of all the variances of the votes of each Senator:
Clearly, the eigenvalues decrease very fast. One is tempted to say that ‘‘most of the information’’ is contained
in the first eigenvalue. To make this argument more rigorous, we can simply look at the ratio:
which is the ratio of the total variance in the data (as approximated by ) to that of the whole matrix .
In the Senate voting case, this ratio is of the order of 90%. It turns out that this is true of most voting patterns
in democracies across history: the first eigenvalue ‘‘explains most of the variance’’.
Motivation
Recall that the direction of maximal variance is one vector that solves the optimization problem
Here is the estimated center. We obtain a new data set by combining the variables according to the directions determined by . The resulting dataset has the same dimension as the original dataset, but each dimension has a different meaning (since each new coordinate is a linear combination of the original variables).
As explained, the main idea behind principal component analysis is to find those directions that correspond to maximal variance between the data points. The data is then projected on the subspace spanned by these principal components. We may stop the process when we have collected enough directions, in the sense that the selected directions explain the majority of the variance; that is, we can pick the directions corresponding to the highest variances.
We may also wonder whether can have only a few non-zero coordinates. For example, if the optimal direction is , then it is clear that the 3rd and 4th bills account for most of the variation, and we may simply want to drop the 1st and 2nd bills. That is, we want to adjust the optimal direction vector as . This adjustment accounts for sparsity. In the setting of PCA, each principal component is a linear combination of all input variables. Sparse PCA allows us to find principal components that are linear combinations of just a few input variables (hence they look “sparse” in the input space). This feature enhances the interpretability of the resulting dataset and performs dimension reduction in the input space. Reducing the number of input variables is particularly helpful for the Senate voting dataset, since there are more bills (input variables) than Senators (samples).
We are going to compare this result of PCA to sparse PCA results below.
Main Idea
A mathematical generalization of the PCA can be obtained by modifying the PCA optimization problem
above. We attempt to find the direction of maximal variance as one vector that solves the
optimization problem
where
The difference is that we put one more constraint , where is the number of non-zero
coordinates in the vector . For instance, but
. Here, is a pre-determined hyper-parameter that describes the sparsity of the input space we want.
This constraint on makes the optimization problem non-convex, and there is no analytically closed-form solution; the problem can still be solved numerically, and its solutions have the sparsity properties explained above. Because the exact constrained problem is difficult to solve, in practice we use a regularized relaxation alternative
This optimization problem is convex and can be solved numerically; this is the so-called sparse PCA (SPCA)
method. The parameter is a pre-determined hyper-parameter we introduced as a penalty parameter, which
can be tuned, as we shall see below.
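As an illustration only (not the method used in the original case study), scikit-learn ships an l1-penalized sparse PCA in the same spirit; a minimal sketch, assuming scikit-learn is installed and using random placeholder data:

    import numpy as np
    from sklearn.decomposition import SparsePCA

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 12))       # toy data: 50 samples, 12 input variables

    spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)   # alpha is the sparsity penalty
    scores = spca.fit_transform(X - X.mean(axis=0))

    # each row of components_ is a direction; many of its entries are driven to zero
    print(spca.components_)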
33.
EXERCISES
We assume that for any , the sample average of the projected values ,
, and their sample variance , are both constant, independent of the direction (with
). Denote by and the (constant) sample average and variance.
1. Show that
is zero.
is of the form , where is the identity matrix of order . (Hint: the largest eigenvalue
of the matrix can be written as: , and a
similar expression holds for the smallest eigenvalue.)
2. Show that if a square, symmetric matrix is positive semi-definite, then for every
matrix , is also positive semi-definite. (Here, is an arbitrary integer.)
3. Drawing an ellipsoid. How would you efficiently draw an ellipsoid in , if the ellipsoid is
described by a quadratic inequality of the form
where is a noise vector, and the input is , a full rank, tall matrix ( ), and
. We do not know anything about , except that it is bounded: , with a
measure of the level of noise. Our goal is to provide an estimate of via a linear estimator, that is,
a function with a matrix. We restrict attention to unbiased estimators, which are
such that when . This implies that should be a left inverse of , that is, .
An example of the linear estimator is obtained by solving the least-squares problem
The solution is, when is full column rank, of the form , with
. We note that , which means that the LS estimator is unbiased. In this exercise, we show
that is the best unbiased linear estimator. (This is often referred to as the BLUE property.)
2. This motivates us to minimize the size of , say using the Frobenius norm:
Show that is the best unbiased linear estimator (BLUE), in the sense that it solves the above
problem.
Hint: Show that any unbiased linear estimator can be written as with
, and that is positive semi-definite.
PART VI
SINGULAR VALUES
The singular value decomposition (SVD) generalizes the spectral theorem (available for a square, symmetric
matrix), to any non-symmetric, and even rectangular, matrix. The SVD allows us to describe the effect of a
matrix on a vector (via the matrix-vector product), as a three-step process: a first rotation in the input space;
a simple positive scaling that takes a vector in the input space to the output space; and another rotation in
the output space. The figure on the left shows the SVD of a matrix of biological data.
The SVD allows us to analyze matrices and associated linear maps in detail, and to solve a host of special
optimization problems, from solving linear equations to linear least-squares. It can also be used to reduce
the dimensionality of high-dimensional data sets, by approximating data matrices with low-rank ones. This
technique is closely linked to the principal component analysis method.
Outline
34.
THE SVD THEOREM
Basic idea
Recall from here that any matrix with rank one can be written as
where , and .
It turns out that a similar result holds for matrices of arbitrary rank . That is, we can express any matrix
of rank as sum of rank-one matrices
where are mutually orthogonal, are also mutually orthogonal, and the ‘s are
positive numbers called the singular values of . In the above, turns out to be the rank of .
Theorem statement
The following important result applies to any matrix , and allows us to understand the structure of the
mapping .
where the positive numbers are unique and are called the singular values of . The number
is equal to the rank of , and the triplet is called a singular value decomposition
(SVD) of . The first columns of : (resp. : ) are called left (resp. right)
singular vectors of , and satisfy
This proof of the theorem hinges on the spectral theorem for symmetric matrices. Note that in the theorem,
the zeros appearing alongside represent blocks of zeros. They may be empty, for example if then
there are no zeros to the right of .
The SVD of an matrix can be computed via a sequence of linear transformations. The
computational complexity of the algorithm, when expressed in terms of the number of floating-point
operations, is given by
This complexity can become substantial when dealing with large, dense matrices. However, for sparse
matrices, one can expedite the computation if only the largest few singular values and their corresponding
singular vectors are of interest. To understand the derivation of this complexity:
• The outer product of vectors and has a complexity of . This is because for a vector of
length and a vector of length , the outer product results in an matrix, and computing
each entry requires one multiplication.
• The matrix has at most non-zero singular values, where . Each of these singular
values will contribute to the overall computational cost.
• Combining the costs from the two previous steps, the total computational complexity becomes
.
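A numerical sketch of the decomposition and of the sum-of-dyads expansion (Python/NumPy; the matrix is ours):

    import numpy as np

    A = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0],
                  [2.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])       # a 4 x 3 matrix

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # singular values in decreasing order

    # rank-one (dyadic) expansion: A = sum_i s_i u_i v_i^T
    A_rec = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
    assert np.allclose(A, A_rec)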
Example: A example.
34.2. Geometry
The theorem allows us to decompose the action of on a given input vector as a three-step process. To get
, where , we first form . Since is an orthogonal matrix, is also orthogonal,
and is just a rotated version of , which still lies in the input space. Then we act on the rotated vector
by scaling its elements. Precisely, the first elements of are scaled by the singular values ;
the remaining elements are set to zero. This step results in a new vector which now belongs to the
output space . The final step consists in rotating the vector by the orthogonal matrix , which results
in .
then for an input vector in , is a vector in with first component , second component
, and last component being zero.
To summarize, the SVD theorem states that any matrix-vector multiplication can be decomposed as a
sequence of three elementary transformations: a rotation in the input space, a scaling that goes from the
input space to the output space, and a rotation in the output space. In contrast with symmetric matrices,
input and output directions are different.
Example: A example.
where
is (so it has trailing zeros). The eigenvalues of and are the same, and equal to the
squared singular values of .
The corresponding eigenvectors are the left and right singular vectors of .
This is a method (not the most computationally efficient) to find the SVD of a matrix, based on the SED.
35.
MATRIX PROPERTIES VIA SVD
• Nullspace
• Range, rank
• Fundamental theorem of linear algebra
• Matrix norms and condition number
35.1. Nullspace
The SVD allows the computation of an orthonormal basis for the nullspace of a matrix. To understand this,
let us first consider a matrix of the form
The nullspace of this matrix is readily found by solving the equation . We obtain that
is in the nullspace if and only if the first two components of are zero:
What about a general matrix , which admits the SVD as given in the SVD theorem? Since is orthogonal,
we can pre-multiply the nullspace equation by , and solve in terms of the ‘‘rotated’’ variable
We obtain the condition on
The above is equivalent to the first components of being zero. Since , this corresponds to the
fact that belongs to the span of the last columns of . Note that these columns form a set of
mutually orthogonal, normalized vectors that span the nullspace: hence they form an orthonormal basis for
it.
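A sketch of extracting such an orthonormal nullspace basis from the SVD (Python/NumPy; names are ours):

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0]])       # a rank-one 2 x 3 matrix

    U, s, Vt = np.linalg.svd(A)
    tol = max(A.shape) * np.finfo(float).eps * s[0]
    r = int((s > tol).sum())              # numerical rank

    N = Vt[r:, :].T                       # last columns of V span the nullspace
    assert np.allclose(A @ N, 0.0, atol=1e-10)
    assert np.allclose(N.T @ N, np.eye(N.shape[1]))   # orthonormal basis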
One-to-one (or, full column rank) matrices are the matrices with nullspace reduced to . If the dimension
of the nullspace is zero, then we must have . Thus, full column rank matrices are ones with SVD of
the form
35.2. Range, rank

As with the nullspace, we can express the range in terms of the SVD of the matrix . Indeed, the range of
is the set of vectors of the form
where is an arbitrary vector of . Since is invertible, also spans . We obtain that the
range is the set of vectors , where is of the form with arbitrary. This means that
the range is the span of the first columns of the orthogonal matrix , and that these columns form an
orthonormal basis for it. Hence, the number of dyads appearing in the SVD decomposition is indeed the
rank (dimension of the range).
where and are both orthogonal matrices, admits the first columns of as an
orthonormal basis.
An onto (or full row rank) matrix has a range . These matrices are characterized by an SVD of the
form
Let be a given matrix. The sets and form an orthogonal decomposition of , in the sense that any vector can be written as
In particular, we obtain that if a vector is orthogonal to every vector in the nullspace, then it must be in the range:
Proof.
Frobenius norm
Hence the squared Frobenius norm is nothing else than the sum of the squares of the singular values.
An alternate way to measure matrix size is based on asking for the maximum ratio of the norm of the output
to the norm of the input. When the norm used is the Euclidean norm, the corresponding quantity
is called the largest singular value (LSV) norm. The reason for this wording is given by the following
theorem.
where is the largest singular value of . Any left singular vector associated with the singular value achieves
the maximum in the above.
Condition number
The condition number of an invertible matrix is the ratio between the largest and the smallest
singular values:
As seen in the next section, this number provides a measure of the sensitivity of the solution of a linear
equation to changes in .
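A quick numerical check of this definition (Python/NumPy sketch; the matrix is ours):

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 1.001]])          # nearly singular, hence badly conditioned

    s = np.linalg.svd(A, compute_uv=False)
    kappa = s[0] / s[-1]                  # ratio of largest to smallest singular value
    assert np.isclose(kappa, np.linalg.cond(A))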
36.
SOLVING LINEAR SYSTEMS VIA SVD
Consider a linear equation , where and are given. We can completely describe the set of solutions via the SVD, as follows. Let us assume that admits an SVD given here. We pre-multiply the linear equation by the inverse of , ; then we express the equation in terms of the rotated vector . This leads to
• If the last components of are not zero, then the above system is infeasible, and the solution
set is empty. This occurs when is not in the range of .
• If is in the range of , then the last set of conditions in the above system hold, and we can solve for
with the first set of conditions:
The last components of are free. This corresponds to elements in the nullspace of . If is full column rank (its nullspace is reduced to { }), then there is a unique solution.
36.2. Pseudo-inverse
Definition
The solution set is conveniently described in terms of the pseudo-inverse of , denoted by , and defined
via the SVD of :
as one with the same SVD, with non-zero singular values inverted, and the matrix transposed:
The pseudo-inverse of a matrix is always well-defined, and it has the same size as the transpose . When the
matrix is invertible (it is square and full column or row rank: ), then it reduces to the inverse.
where is the nullspace of . Both and a basis for the nullspace can be computed via the SVD.
We start from the linear equation above, which has the unique solution . Now assume that is
changed into , where is a vector that contains the changes in . Let’s denote by the new
solution, which is . From the equations:
and using the definition of the largest singular value norm, we obtain:
We can express the condition number as the ratio between the largest and smallest singular values of :
The condition number gives a bound on the ratio between the relative error in the left-hand side to that
of the solution. We can also analyze the effect of errors in the matrix itself on the solution. The condition
number turns out to play a crucial role there as well.
37.
LEAST-SQUARES AND SVD
The optimal set of the least-squares problem can be expressed as
where is the pseudo-inverse of , and is the minimum-norm point in the optimal set. If is full column
rank, the solution is unique, and equal to
In general, the particular solution is the minimum-norm solution to the least-squares problem.
Proof.
with
We can use OLS to provide an estimate of . The idea is to seek the smallest vector such that the
above equation becomes feasible, that is,
Since is full column rank, the solution to the OLS problem is unique, and can be written as a linear
function of the measurement vector :
The OLS formulation provides an estimate of the input such that the residual error vector
is minimized in norm. We are interested in analyzing the impact of perturbations in the vector , on the
resulting solution . We begin by analyzing the absolute errors in the estimate and then turn to the
analysis of relative errors.
We have
In the above, we have exploited the fact that is a left inverse of , that is, .
which is an ellipsoid centered at zero, with principal axes given by the singular values of . This ellipsoid
can be interpreted as an ellipsoid of confidence for the estimate , with size and shape determined by the
matrix .
• The largest absolute error in the solution that can result from a unit-norm, additive perturbation on
is of the order of , where is the smallest singular value of .
• The largest relative error is , the condition number of .
We say that the estimator (as determined by matrix ) is unbiased if the first term is zero:
Unbiased estimators only exist when the above equation is feasible, that is, has a left inverse. This is
equivalent to our condition that be full column rank. Since is a left-inverse of , the OLS estimator is
a particular case of an unbiased linear estimator.
Let us assume that belongs to a unit ball: . The set of resulting errors on the solution is
then
which is an ellipsoid centered at zero, with principal axes given by the singular values of . This ellipsoid can
be interpreted as an ellipsoid of confidence for the estimate , with size and shape determined by the matrix
.
It can be shown that the OLS estimator is optimal in the sense that it provides the ‘‘smallest’’ ellipsoid of
confidence among all unbiased linear estimators. Specifically:
This optimality of the LS estimator is referred to as the BLUE (Best Linear Unbiased Estimator) property.
38.
LOW-RANK APPROXIMATIONS
• Low-rank approximations
• Link with PCA
where the singular values are ordered in decreasing order, . In many applications, it
can be useful to approximate with a low-rank matrix.
Example: Assume that contains the log returns of assets over time periods so that each column of
is a time series for a particular asset. Approximating by a rank-one matrix of the form , with and , amounts to modeling the assets’ movements as all following the same pattern given by the time-profile , with each asset’s movements scaled by the corresponding component of . Specifically, the component of , which is the log-return of asset at time , is then expressed as .
where ( ) is given. In the above, we measure the error in the approximation using
the Frobenius norm; using the largest singular value norm leads to the same set of solutions .
A best rank- approximation is given by zeroing out the trailing singular values of , that is
The minimal error is given by the Euclidean norm of the singular values that have been zeroed out in the process:
Sketch of proof: The proof rests on the fact that the Frobenius norm is invariant under rotations of the input and output spaces, that is, for any matrix and orthogonal matrices of appropriate sizes. Since the rank is also invariant, we can reduce the problem to the case when .
In particular, we can express the explained variance directly in terms of the singular values. In the context
of visualization, the explained variance is simply the ratio of the total amount of variance in the projected
data, to that in the original. More generally, when we are approximating a data matrix by a low-rank matrix,
the explained variance compares the variance in the approximation to that in the original data. We can also
interpret it geometrically, as the ratio of the squared norm of the approximation matrix to that of the original
matrix:
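A sketch of the best rank-k approximation, its error, and the explained variance ratio (Python/NumPy; the random matrix and names are ours):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 6))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # zero out the trailing singular values

    err = np.linalg.norm(A - A_k, 'fro')
    assert np.isclose(err, np.linalg.norm(s[k:]))      # error = norm of the dropped singular values

    explained = (s[:k] ** 2).sum() / (s ** 2).sum()    # explained variance ratio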
39.
APPLICATIONS
• Image compression.
• Market data analysis.
Images as matrices
We can represent images as matrices, as follows. Consider an image having pixels. For gray scale
images, we need one number per pixel, which can be represented as a matrix. For color images, we
need three numbers per pixel, for each color: red, green, and blue (RGB). Each color can be represented as a
matrix, and we can represent the full color image as a matrix, where we stack each color’s
matrix column-wise alongside each other, as
The image can be visualized as well. We must first convert the matrix entries from integers to doubles. In JPEG format, the image is loaded into MATLAB as a three-dimensional array, with one matrix for each color. For gray-scale images, we only need the first matrix in the array.
Low-rank approximation
Using the low-rank approximation via SVD method, we can form the best rank- approximations for the
matrix.
True and approximated images, with varying rank. We observe that with , the approximation is
almost the same as the original picture, whose rank is .
Recall that the explained variance of the rank- approximation is the ratio between the squared norm of
the rank- approximation matrix and the squared norm of the original matrix. Essentially, it measures how
much information is retained in the approximation relative to the original.
It is instructive to look at the singular vector corresponding to the largest singular value, with its components arranged in increasing order. We observe that all the components have the same sign (which we can always assume is
positive). This means we can interpret this vector as providing a weighted average of the market. As seen
in the previous plot, the corresponding rank-one approximation roughly explains more than 80% of the
variance in this market data, which justifies the phrase ‘‘the market average moves the market’’. The five
components with largest magnitude correspond to the following companies. Note that all are financial:
• MS (Morgan Stanley)
40.
EXERCISES
b. Find an SVD of .
values ordered in decreasing fashion. Make sure to check all the properties required for
.
b. Find the semi-axis lengths and principal axes (minimum and maximum distance and
associated directions from to the center) of the ellipsoid
Hint: Use the SVD of to show that every element of is of the form for
some element in . That is, . (In other words, the matrix
maps into the set .) Then analyze the geometry of the simpler set .
c. What is the set when we append a zero vector after the last column of , that is is
replaced with ?
d. The same question when we append a row after the last row of , that is, is replaced with
. Interpret geometrically your result.
The image on the left shows a matrix of pixel values. The lines indicate values; at
each intersection of lines, the corresponding matrix element is . All the other elements are zero.
1. Show that for some permutation matrices , the permuted matrix has the
symmetric form , for two vectors . Determine and .
where denotes the Frobenius norm, and the matrices , are given.
Here, the matrix variable is constrained to have orthonormal columns. When
, the problem can be interpreted geometrically as seeking a transformation of points
(contained in ) to other points (contained in ) that involves only rotation.
1. Show that the solution to the Procrustes problem above can be found via the SVD of the
matrix .
with , given.
b. Now consider the points and find an expression for the sum of the squares of the
distances from the points to the line .
c. Explain how you would find the line via the SVD of the matrix
.
d. How would you address the problem without the restriction that the line has to pass
through the origin?
2. Solve the same problems as previously by replacing the line with a hyperplane.
where is a vector chosen randomly in (this represents a 3-dimensional cube with each
dimension ranging from to ). In addition, we define
c. Express the least-squares solution in terms of the SVD of . That is, form the pseudo-inverse
of and apply the formula . What is now the norm of the residual?
b. Show how to reduce the problem to one involving one scalar variable.
PART VII
EXAMPLES
DIMENSION OF AN AFFINE SUBSPACE
is an affine subspace of dimension . The corresponding linear subspace is defined by the linear equations
obtained from the above by setting the constant terms to zero:
We can solve for and get . We obtain a representation of the linear subspace as the
set of vectors that have the form
for some scalar . Hence the linear subspace is the span of the vector , and is of
dimension .
We obtain a representation of the original affine set by finding a particular solution , by setting say
and solving for . We obtain
The affine subspace is thus the line , where are defined above.
SAMPLE AND WEIGHTED AVERAGE
where is the vector containing the samples, and , with the vector of
ones.
More generally, for any vector , with for every , and , we can define
the corresponding weighted average as . The interpretation of is in terms of a discrete probability
distribution of a random variable , which takes the value with probability , . The
weighted average is then simply the expected value (or, mean) of under the probability distribution . The
expected value is often denoted , or if the distribution is clear from context.
A Euclidean projection of a point in on a set is a point that achieves the smallest Euclidean
distance from to the set. That is, it is any solution to the optimization problem
When the set is convex, there is a unique solution to the above problem. In particular, the projection on
an affine subspace is unique.
The projection problem reads as a linearly constrained least-squares problem, of particularly simple form:
The projection of on turns out to be aligned with the coefficient vector . Indeed,
components of orthogonal to don’t appear in the constraint, and only increase the objective value.
Setting in the equation defining the hyperplane and solving for the scalar we obtain
ORTHOGONAL COMPLEMENT OF A
SUBSPACE
then the orthogonal complement is the set of vectors orthogonal to the rows of , which is the nullspace of
.
Example: Consider the line in passing through the origin and generated by the vector .
This is a subspace of dimension 1:
To find the orthogonal complement, we find the set of vectors that are orthogonal to any vector of the form
, with arbitrary . This is the same set as the set of vectors orthogonal to itself. So, we solve for
with :
POWER LAWS
Consider a physical process that has inputs , , and a scalar output . Inputs and output
are physical, positive quantities, such as volume, height, or temperature. In many cases, we can (at least
empirically) describe such physical processes by power laws, which are non-linear models of the form
where , and the coefficients , are real numbers. Examples include the relationship between area, volume, and size of basic geometric objects; the Coulomb law in electrostatics; birth and survival rates of (say) bacteria as functions of concentrations of chemicals; heat flows and losses in pipes, as functions of the pipe geometry; and analog circuit properties as functions of circuit parameters.
The relationship is neither linear nor affine, but it becomes affine if we take the logarithm of both sides and introduce the new variables
where .
Returning to the example involving power laws, we ask the question of finding the ‘‘best’’ model of the form
given experiments with several input vectors and associated outputs , . Here the
variables of our problem are , and the vector . Taking logarithms, we obtain
where , and and are the logarithms of and , respectively. We can represent the above
linear equations compactly as
In practice, the power law model is only an approximate representation of reality. Finding the best fit can be
formulated as the optimization problem
Any sensible measure of length should satisfy the following basic properties: it should be a convex function of its argument (that is, the length of an average of two vectors should never exceed the average of their lengths); it should be positive-definite (always non-negative, and zero only when the argument is the
their lengths); it should be positive-definite (always non-negative, and zero only when the argument is the
zero vector), and preserve positive scaling (so that multiplying a vector by a positive number scales its norm
accordingly).
A consequence of the first two conditions is that a norm only assumes non-negative values, and that it is
convex.
The set of solutions turns out to be empty. Indeed, if we use the first two equations, and solve for ,
we get , but then the last equation is not satisfied, since
SAMPLE VARIANCE AND STANDARD DEVIATION
where is the sample average of . The sample variance is a measure of the deviations of the
numbers with respect to the average value .
The sample standard deviation is the square root of the sample variance, . It can be expressed in terms of
the Euclidean norm of the vector , as
More generally, for any vector , with for every , and , we can define
the corresponding weighted variance as
The interpretation of is in terms of a discrete probability distribution of a random variable , which takes
the value with probability , . The weighted variance is then simply the expected value of
the squared deviation of from its mean , under the probability distribution .
• Functions
• Maps
Functions
In this course we define functions as objects which take an argument in and return a value in . We use
the notation
to refer to a function with “input” space . The “output” space for functions is .
We allow for functions to take infinite values. The domain of a function , denoted , is defined as
the set of points where the function is finite.
Maps
We reserve the term map to refer to functions which return more than a single value, and use the notation
to refer to a map with input space and output space . The components of the map are the (scalar-
valued) functions .
Example: A map.
DUAL NORM
For a given norm on , the dual norm, denoted , is the function from to with values
The above definition indeed corresponds to a norm: it is convex, as it is the pointwise maximum of convex
(in fact, linear) functions ; it is homogeneous of degree , that is, for every
and .
This can be seen as a generalized version of the Cauchy-Schwarz inequality, which corresponds to the Euclidean norm.
Examples:
• The norm dual to the Euclidean norm is itself. This comes directly from the Cauchy-Schwarz inequality.
• The norm dual to the -norm is the -norm. This is because the inequality
• The dual of the dual norm is the original norm we started with. (The proof of this general result is more involved.)
INCIDENCE MATRIX OF A NETWORK
Mathematically speaking, a network is a graph of nodes connected by directed arcs. Here, we assume
that arcs are ordered pairs, with at most one arc joining any two nodes; we also assume that there are no self-
loops (arcs from a node to itself). We do not assume that the edges of the graph are weighted; they are all treated alike.
We can fully describe the network with the so-called arc-node incidence matrix, which is the matrix
defined as
The figure shows the graph associated with the arc-node incidence
matrix
By definition, for the arc-node incidence matrix , we have , where 1 is the vector of ones in
. Hence, 1 is in the nullspace of .
A number of topological properties of a network with nodes and edges can be inferred from those of its
node-arc incidence matrix , and of the reduced incidence matrix , which is obtained from by removing
its last row. For example, the network is said to be connected if there is a path joining any two nodes. It can
be shown that the network is connected if and only if the rank of is equal to .
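A small sketch of these facts, under one possible sign convention (rows indexed by arcs, with -1 at the start node and +1 at the end node; the toy graph is ours):

    import numpy as np

    arcs = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]    # (start, end) pairs on 4 nodes
    m, n = len(arcs), 4

    A = np.zeros((m, n))
    for i, (start, end) in enumerate(arcs):
        A[i, start] = -1.0
        A[i, end] = 1.0

    assert np.allclose(A @ np.ones(n), 0.0)            # the vector of ones is in the nullspace

    A_reduced = A[:, :-1]                              # drop the column of the last node
    print(np.linalg.matrix_rank(A_reduced))            # n - 1 = 3, since this graph is connected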
PERMUTATION MATRICES
is obtained by exchanging the columns and , and and , of the identity matrix.
A permutation matrix allows us to exchange rows or columns of another matrix via the matrix-matrix product. For example, if we take any matrix , then (with defined above) is the matrix with columns and exchanged.
QR DECOMPOSITION: EXAMPLES
We can see what happens when the input is not full column rank: for example, let’s consider the matrix
( is not full column rank, as it was constructed so that the last column is a combination of the first and the
third.)
We observe that the last diagonal element of the triangular factor is virtually zero, and the last column is seen to be a linear combination of the first and the third. This shows that the rank of (itself equal to the rank of ) is effectively .
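The same phenomenon can be reproduced numerically (Python/NumPy sketch; the matrix below is our own construction with the same property, namely that its last column equals the first plus the third):

    import numpy as np

    a1 = np.array([1.0, 0.0, 2.0, 1.0])
    a2 = np.array([0.0, 1.0, 1.0, 3.0])
    a3 = np.array([2.0, 1.0, 0.0, 1.0])
    A = np.column_stack([a1, a2, a3, a1 + a3])   # last column = first + third

    Q, R = np.linalg.qr(A)
    print(np.abs(np.diag(R)))        # the last diagonal element of R is numerically zero
    print(np.linalg.matrix_rank(A))  # effective rank is 3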
BACKWARDS SUBSTITUTION FOR SOLVING TRIANGULAR LINEAR SYSTEMS
Consider a triangular system of the form , where the vector is given, and is upper-
triangular. Let us first consider the case when , and is invertible. Thus, has the form
The backwards substitution first solves for the last component of using the last equation:
We solve for the last variable first, obtaining (from the last equation) . We plug this value of
into the first and second equation, obtaining a new triangular system in two variables :
We proceed by solving for the last variable . The last equation yields . Plugging this
value into the first equation gives
We can apply the idea to find the inverse of the square upper triangular matrix , by solving
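A direct implementation of backwards substitution (a Python/NumPy sketch; names are ours):

    import numpy as np

    def back_substitution(R, y):
        # solve R x = y for an invertible upper-triangular matrix R
        n = len(y)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):                  # last equation first
            x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
        return x

    R = np.array([[2.0, 1.0, -1.0],
                  [0.0, 3.0,  2.0],
                  [0.0, 0.0,  4.0]])
    y = np.array([1.0, 8.0, 8.0])

    x = back_substitution(R, y)
    assert np.allclose(R @ x, y)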
LINEAR REGRESSION VIA LEAST SQUARES

Linear regression is based on the idea of fitting a linear function through data points.
In its basic form, the problem is as follows. We are given data where
is the ‘‘input’’ and is the ‘‘output’’ for the -th measurement. We seek to find a linear function
such that are collectively close to the corresponding values .
In least-squares regression, the way we evaluate how well a candidate function fits the data is via the
(squared) Euclidean norm:
Since a linear function has the form for some , the problem of minimizing the
above criterion takes the form
where
The linear regression approach can be extended to multiple dimensions, that is, to problems where the
output in the above problem contains more than one dimension (see here). It can also be extended to the
problem of fitting non-linear curves.
NOMENCLATURE
• Feasible set
• What is a solution?
• Local vs. global optima
Feasible set
The feasible set of problem is defined as
A point is said to be feasible for problem if it belongs to the feasible set , that is, it satisfies the
constraints.
Example: In the toy optimization problem, the feasible set is the ‘‘box’’ in , described by
.
The feasible set may be empty if the constraints cannot be satisfied simultaneously. In this case, the problem
is said to be infeasible.
What is a solution?
In an optimization problem, we are usually interested in computing the optimal value of the objective
function, and also often a minimizer, which is a vector that achieves that value, if any.
Feasibility problems
Sometimes an objective function is not provided. This means that we are just interested in finding a feasible
point or determining that the problem is infeasible. By convention, we set to be a constant in that case,
to reflect the fact that we are indifferent to the choice of a point as long as it is feasible.
Optimal value
The optimal value of the problem is the value of the objective at optimum, and we denote it by :
Optimal set
The optimal set (or the set of solutions) of the problem is defined as the set of feasible points for which
the objective function achieves the optimal value:
We take the convention that the optimal set is empty if the problem is not feasible.
A point is said to be optimal if it belongs to the optimal set. Optimal points may not exist, and the optimal
set may be empty. This can be due to the fact that the problem is infeasible. Or it may be due to the fact that
the optimal value is only attained in the limit.
has no optimal points, as the optimal value is only attained in the limit .
If the optimal set is not empty, we say that the problem is attained.
Suboptimality
The suboptimal set is defined as
(With our notation, .) Any point in the suboptimal set is termed suboptimal.
This set allows characterizing points that are close to being optimal (when is small). Usually, practical
algorithms are only able to compute suboptimal solutions, and never reach true optimality.
In other words, a local minimizer minimizes , but only for nearby points on the feasible set. Then the
value of the objective function is not necessarily the optimal value of the problem.
The term globally optimal (or, optimal for short) is used to distinguish points in the optimal set from local
minima.
Many optimization algorithms may get trapped in local minima, a situation often described as a major challenge in optimization problems.
STANDARD FORMS
• Functional form
• Epigraph form
• Other standard forms
Functional form
An optimization problem is a problem of the form
where
In the above, the term ‘‘subject to’’ is sometimes replaced with the shorthand colon notation.
Often the above is referred to as a ‘‘mathematical program’’. The term “programming” (or “program”) does not refer to computer code; it is used mainly for historical reasons. We will use the more rigorous (but less popular) term “optimization problem”.
Epigraph form
In optimization, we can always assume that the objective is a linear function of the variables. This can be
done via the epigraph representation of the problem, which is based on adding a new scalar variable :
We can picture this as follows. Consider the sub-level sets of the objective function, which are of the
form for some . The problem amounts to finding the smallest for which the
corresponding sub-level set intersects the set of points that satisfy the constraints.
where ‘s are given. Of course, we may reduce this problem to the standard form above, representing each equality constraint by a pair of inequalities.
Sometimes, the constraints are described abstractly via a set condition, of the form for some subset
of . The corresponding notation is
Some problems come in the form of maximization problems. Such problems are readily cast in standard
form via the expression
where .
A TWO-DIMENSIONAL TOY OPTIMIZATION PROBLEM
(Note that the term ‘‘subject to’’ has been replaced with the shorthand colon notation.)
where:
GRADIENT OF A FUNCTION
• Definition
• Composition rule with an affine function
• Geometric interpretation
The gradient of a differentiable function contains the first derivatives of the function with
respect to each variable. The gradient is useful to find the linear approximation of the function near a point.
Definition
The gradient of at , denoted , is the vector in given by
Examples:
• Distance function: The distance function from a point to another point is defined as
The gradient of at is
is given by
where , and .
is called the composition of the affine map with . Its gradient is given by
Geometric interpretation
Geometrically, the gradient can be read on the plot of the level set of the function. Specifically, at any point
, the gradient is perpendicular to the level set and points outwards from the sub-level set (that is, it points
towards higher values of the function).
where , and are given, can be expressed in terms of the full QR decomposition of :
Exploiting the fact that leaves Euclidean norms invariant, we express the original least-squares problem in
the equivalent form:
Once the above is solved, and is found, we recover the original variable with .
Now let us decompose and in a manner consistent with the block structure of
The optimal choice for the variables is to make the first term zero, which is achievable with
where is free and describes the ambiguity in the solution. The optimal residual is .
SAMPLE COVARIANCE MATRIX

• Definition
• Properties
Definition
For a vector , the sample variance measures the average deviation of its coefficients around
the sample average :
where is a symmetric matrix, called the sample covariance matrix of the data points:
Properties
The covariance matrix satisfies the following properties:
• The sample covariance matrix allows finding the variance along any direction in data space.
• The diagonal elements of give the variances of each vector in the data.
• The trace of gives the sum of all the variances.
• The matrix is positive semi-definite, since the associated quadratic form is non-
negative everywhere.
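These properties are easy to verify numerically (a Python/NumPy sketch on random data; here the covariance matrix is normalized by the number of points):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 4))            # 100 data points in dimension 4, one per row

    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / X.shape[0]                   # sample covariance matrix

    u = rng.standard_normal(4)
    u /= np.linalg.norm(u)
    assert np.isclose(u @ C @ u, (Xc @ u).var()) # variance of the data along direction u

    assert np.all(np.linalg.eigvalsh(C) >= -1e-12)   # C is positive semi-definite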
OPTIMAL SET OF LEAST-SQUARES VIA SVD
The optimal set of the least-squares problem can be expressed as
where is the pseudo-inverse of , and is the minimum-norm point in the optimal set. If is full column
rank, the solution is unique, and equal to
Proof: The following proof relies on the SVD of , and the rotational invariance of the Euclidean norm.
Using the SVD we can find the optimal set to the least-squares optimization problem
where we have exploited the fact that the Euclidean norm is invariant under the orthogonal transformation
. With , and , and changing the variable to , we express the above as
Since is invertible, we can reduce the first term in the objective to zero with the choice .
Hence the optimal value is
We observe that the optimal value is zero if and only if , which is exactly the same as
.
Optimal set
Let us detail the optimal set for the problem. The variable is partly determined, via its first components:
. The remaining variables contained in are free, as does not appear in the
objective function of the above problem.
To express this in terms of the original SVD of , we observe that means that
where . (We will soon explain the acronym appearing in the subscript.) The free
components correspond to the degrees of freedom allowed to by the nullspace of .
The particular solution to the problem, , is the minimum-norm solution, in the sense that it is the
element of that has the smallest Euclidean norm. This is best understood in the space of -variables.
Indeed, the particular choice corresponds to the element in the optimal set that has the smallest Euclidean norm: the norm of is the same as that of its rotated version, . The first elements in are fixed, and since , we see that the minimal norm is obtained with .
The matrix , which appears in the expression of the particular solution mentioned above, is nothing other than the pseudo-inverse of , which is denoted . Indeed, we can express the pseudo-inverse
in terms of the SVD as
With this convention, the minimum-norm optimal point is . Recall that the last columns of
form a basis for the nullspace of . Hence the optimal set of the problem is
When is full column rank , the optimal set reduces to a singleton (a set with only one element), as the nullspace is . The unique optimal point is then given by
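The following Python/numpy sketch illustrates the construction above on an arbitrary wide matrix (so that the nullspace is non-trivial): the pseudo-inverse is assembled from the SVD, the minimum-norm solution is recovered, and adding a nullspace component is checked to leave the residual unchanged while increasing the norm.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((4, 6))                   # wide matrix: non-trivial nullspace
    b = rng.standard_normal(4)

    U, s, Vt = np.linalg.svd(A)                       # A = U @ diag(s) @ Vt (economy part)
    r = np.sum(s > 1e-10)                             # numerical rank

    # Pseudo-inverse built from the SVD: invert the non-zero singular values only.
    A_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T
    x_mn = A_pinv @ b                                 # minimum-norm least-squares solution

    print(np.allclose(A_pinv, np.linalg.pinv(A)))                     # expect True
    print(np.allclose(x_mn, np.linalg.lstsq(A, b, rcond=None)[0]))    # expect True

    # Adding any nullspace direction keeps the residual, but increases the norm.
    N = Vt[r:].T                                      # columns span the nullspace of A
    x_other = x_mn + N @ rng.standard_normal(N.shape[1])
    print(np.isclose(np.linalg.norm(A @ x_mn - b), np.linalg.norm(A @ x_other - b)))  # True
    print(np.linalg.norm(x_mn) <= np.linalg.norm(x_other))                            # True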
PSEUDO-INVERSE OF A MATRIX
The pseudo-inverse of a matrix is a matrix that generalizes to arbitrary matrices the notion
of inverse of a square, invertible matrix. The pseudo-inverse can be expressed from the singular value
decomposition (SVD) of , as follows.
where are both orthogonal matrices, and is a diagonal matrix containing the (positive) singular
values of on its diagonal.
where
• Input components along directions corresponding to and are amplified (by factors of 10 and 7,
respectively) and come out mostly along the plane spanned by and .
• Input components along directions corresponding to and are attenuated (by factors of 0.1 and
0.05, respectively).
• The matrix is nonsingular.
• For some applications, it might be appropriate to consider as effectively rank 2, given the significant
attenuation for components along and .
SINGULAR VALUE DECOMPOSITION OF A 4 X 5 MATRIX
Notice above that has non-zero values only on its diagonal, and can be written as
REPRESENTATION OF A TWO-VARIABLE QUADRATIC FUNCTION
In short:
A symmetric matrix is a way to describe a weighted, undirected graph: each edge in the graph is assigned a
weight . Since the graph is undirected, the edge weight is independent of the direction (from to or
vice-versa). Hence, is symmetric.
NETWORK FLOW
We describe a flow (of goods, traffic, charge, information, etc.) across the network as a vector ,
which describes the amount flowing through any given arc. By convention, we use positive values when the
flow is in the direction of the arc, and negative ones in the opposite case.
The incidence matrix of the network, denoted by , helps represent the relationship between nodes and
arcs. For a given node , the total flow leaving it can be calculated as (remember our convention that the
index spans the arcs)
Now, we define the external supply as a vector . Here, a negative represents an external demand
at node , and a positive denotes a supply. We make the assumption that the total supply equals the total
demand, implying
The balance equations for the supply vector are given by . These equations represent
constraints the flow vector must satisfy to meet the external supply/demand represented by .
Another important symmetric matrix associated with a graph is the Laplacian matrix. This is the matrix
, with as the arc-node incidence matrix. It can be shown that the element of the
Laplacian matrix is given by
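To make the constructions above concrete, here is a small Python/numpy sketch for a toy directed graph; the sign convention of the incidence matrix and the relation L = A A^T follow the description above, while the specific nodes and arcs are an arbitrary choice.

    import numpy as np

    # Arc-node incidence matrix of a small directed graph on 4 nodes and 5 arcs:
    # entry (i, j) is +1 if arc j leaves node i, -1 if it enters node i, 0 otherwise.
    arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
    A = np.zeros((4, len(arcs)))
    for j, (start, end) in enumerate(arcs):
        A[start, j] = 1.0
        A[end, j] = -1.0

    L = A @ A.T            # Laplacian matrix of the graph

    # Diagonal entries are node degrees; off-diagonal entries are -1 for connected pairs.
    print(L)
    print(np.allclose(L.sum(axis=1), 0))              # rows sum to zero
    print(np.all(np.linalg.eigvalsh(L) >= -1e-10))    # positive semi-definite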
HESSIAN OF A FUNCTION
• Definition
• Examples
Definition
The Hessian of a twice-differentiable function at a point is the matrix
containing the second derivatives of the function at that point. That is, the Hessian is the matrix with
elements given by
The second derivative is independent of the order in which derivatives are taken. Hence, for
every pair . Thus, the Hessian is a symmetric matrix.
Examples
For quadratic functions, the Hessian is a constant matrix, that is, it does not depend on the point at which
it is evaluated.
The gradient of at is
is as follows.
where , and .
Now the Hessian at a point is obtained by taking derivatives of each component of the gradient.
If is the -th component, that is,
then
and, for :
More compactly:
HESSIAN OF A QUADRATIC FUNCTION
For quadratic functions, the Hessian (matrix of second-derivatives) is a constant matrix, that is, it does not
depend on the variable .
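As a numerical check, the sketch below (Python/numpy) compares a finite-difference Hessian of a quadratic function with its constant analytic Hessian. The quadratic is written here in the common form (1/2) x'Px + q'x + r with P symmetric, which may differ from the text's convention by a factor of two in P.

    import numpy as np

    # Quadratic function f(x) = (1/2) x' P x + q' x + r with P symmetric;
    # its gradient is P x + q and its Hessian is the constant matrix P.
    P = np.array([[3.0, 1.0], [1.0, 2.0]])
    q = np.array([1.0, -1.0])
    r = 0.5
    f = lambda x: 0.5 * x @ P @ x + q @ x + r

    def numerical_hessian(f, x, eps=1e-5):
        n = x.size
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
                H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                           - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
        return H

    # The numerical Hessian matches P at any point, confirming it is constant.
    for x0 in (np.zeros(2), np.array([2.0, -3.0])):
        print(np.allclose(numerical_hessian(f, x0), P, atol=1e-4))   # expect True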
GRAM MATRIX
Consider -vectors . The Gram matrix of the collection is the matrix with
elements . The matrix can be expressed compactly in terms of the matrix
, as
By construction, a Gram matrix is always symmetric, meaning that for every pair . It is
also positive semi-definite, meaning that for every vector (this comes from the identity
).
Assume that each vector is normalized: . Then the coefficient can be expressed as
where is the angle between the vectors and . Thus is a measure of how similar and are.
The matrix arises for example in text document classification, with a measure of similarity between the -th and -th documents, and their respective bag-of-words representations (normalized to have Euclidean norm 1).
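A short Python/numpy illustration of these facts, using randomly generated unit-norm columns in place of actual bag-of-words vectors:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((50, 4))        # columns a_1, ..., a_4 are 50-vectors
    A /= np.linalg.norm(A, axis=0)          # normalize each column to unit Euclidean norm

    G = A.T @ A                             # Gram matrix: G[i, j] = a_i . a_j

    print(np.allclose(G, G.T))                          # symmetric
    print(np.all(np.linalg.eigvalsh(G) >= -1e-10))      # positive semi-definite
    print(np.allclose(np.diag(G), 1.0))                 # unit-norm columns
    # With normalized columns, G[i, j] is the cosine of the angle between a_i and a_j.
    angle_01 = np.degrees(np.arccos(np.clip(G[0, 1], -1, 1)))
    print(angle_01)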
The function
where , .
DETERMINANT OF A SQUARE MATRIX
• Definition
• Important result
• Some properties
Definition
The determinant of a square matrix , denoted , is defined by an algebraic formula in the coefficients of . The following formula for the determinant, known as Laplace’s expansion formula, allows us to compute the determinant recursively:
where is the matrix obtained from by removing the -th row and first column.
(The first column does not play a special role here: the determinant remains the same if we use any other
column.)
1. .
There are other expressions of the determinant, including the Leibniz formula (proven here):
where denotes the set of permutations of the integers . Here, denotes the sign of the permutation , which equals if the number of pairwise exchanges required to transform into is even, and if it is odd.
Important result
An important result is that a square matrix is invertible if and only if its determinant is not zero. We use this
key result when introducing eigenvalues of symmetric matrices.
Geometry
In general, the absolute value of the determinant of a matrix is the volume of the parallelepiped spanned by its columns.
This is consistent with the fact that when is not invertible, its columns define a parallelepiped of zero
volume.
where is a matrix obtained from by removing its -th row and -th column. For example, the
determinant of a matrix
is given by
It is indeed the area of the parallelogram defined by the columns of , . The inverse is given by
Some properties
If a matrix is square and triangular, then its determinant is simply the product of its diagonal coefficients. This follows directly from Laplace’s expansion formula above.
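The recursive Laplace expansion, and the triangular-matrix property just mentioned, can be checked numerically; the sketch below (Python/numpy) expands along the first column, on an arbitrary 4 x 4 matrix.

    import numpy as np

    def det_laplace(A):
        """Determinant by Laplace expansion along the first column (recursive)."""
        n = A.shape[0]
        if n == 1:
            return A[0, 0]
        total = 0.0
        for i in range(n):
            minor = np.delete(np.delete(A, i, axis=0), 0, axis=1)
            total += (-1) ** i * A[i, 0] * det_laplace(minor)
        return total

    rng = np.random.default_rng(4)
    A = rng.standard_normal((4, 4))
    print(np.isclose(det_laplace(A), np.linalg.det(A)))       # expect True

    # Triangular matrix: the determinant is the product of the diagonal entries.
    T = np.triu(A)
    print(np.isclose(np.linalg.det(T), np.prod(np.diag(T))))  # expect True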
Determinant of transpose
The determinant of a square matrix and that of its transpose are equal.
In particular:
This also implies that for an orthogonal matrix , that is, a matrix with , we have
The function vanishes on the space orthogonal to , which is the hyperplane defined by the single linear
equation . Thus, in effect this function is really one-dimensional: it varies only along the direction
.
EIGENVALUE DECOMPOSITION OF A SYMMETRIC MATRIX
Let
Hence the eigenvalues are , . For each eigenvalue , we look for a unit-norm vector such
that . For , we obtain the equation in
RAYLEIGH QUOTIENTS
Theorem
For a symmetric matrix , we can express the smallest and largest eigenvalues, and , as
Proof: The proof of the expression above derives from the SED of the matrix, and the invariance of the
Euclidean norm constraint under orthogonal transformations. We show this only for the largest eigenvalue;
the proof for the expression for the smallest eigenvalue follows similar lines. Indeed, with , we
have
Now we can define the new variable , so that , and express the problem as
Clearly, the maximum is at most . That upper bound is attained, with for an index such that
, and for . This proves the result. This corresponds to setting ,
where is the eigenvector corresponding to .
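A quick numerical confirmation of the theorem (Python/numpy, random symmetric matrix): Rayleigh quotients of random directions stay between the smallest and largest eigenvalues, and the bounds are attained at the corresponding eigenvectors.

    import numpy as np

    rng = np.random.default_rng(5)
    B = rng.standard_normal((5, 5))
    A = (B + B.T) / 2                          # a symmetric matrix

    eigvals, eigvecs = np.linalg.eigh(A)       # eigenvalues in increasing order
    rayleigh = lambda x: (x @ A @ x) / (x @ x)

    # Random directions never beat the extreme eigenvalues...
    samples = [rayleigh(rng.standard_normal(5)) for _ in range(10000)]
    print(eigvals[0] <= min(samples) and max(samples) <= eigvals[-1])   # expect True

    # ...and the bounds are attained at the corresponding eigenvectors.
    print(np.isclose(rayleigh(eigvecs[:, -1]), eigvals[-1]))   # largest eigenvalue
    print(np.isclose(rayleigh(eigvecs[:, 0]), eigvals[0]))     # smallest eigenvalue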
LARGEST SINGULAR VALUE NORM OF A MATRIX
For a matrix , we define the largest singular value (or, LSV) norm of to be the quantity
This quantity satisfies the conditions to be a norm (see here). The reason for the name is explained here.
The LSV norm can be computed as follows. Squaring the above, we obtain a representation of the squared LSV norm as a Rayleigh quotient of the matrix :
This shows that the squared LSV norm is the largest eigenvalue of the (positive semi-definite) symmetric
matrix , which is denoted . That is:
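The following Python/numpy sketch checks, on an arbitrary matrix, that the squared LSV norm equals the largest eigenvalue of the matrix A'A, and that the norm bounds the amplification of any input:

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.standard_normal((4, 6))

    sigma_max = np.linalg.norm(A, 2)                 # largest singular value norm
    lam_max = np.linalg.eigvalsh(A.T @ A)[-1]        # largest eigenvalue of A'A

    print(np.isclose(sigma_max**2, lam_max))         # expect True

    # ||A x||_2 <= sigma_max * ||x||_2 for every x.
    x = rng.standard_normal(6)
    print(np.linalg.norm(A @ x) <= sigma_max * np.linalg.norm(x) + 1e-12)   # expect True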
NULLSPACE OF A 4X5 MATRIX VIA ITS SVD
Returning to this example, which involves a matrix with row size , column size , and rank , the nullspace is the span of the last columns of the matrix :
with
Returning to this example, the Frobenius norm is the square root of the sum of the squares of the elements,
and is equal to
The largest singular value norm is simply . Thus, the Euclidean norm of the output cannot
exceed four times that of the input .
LOW-RANK APPROXIMATION OF A 4X5 MATRIX VIA ITS SVD
Returning to this example, involving a matrix with row size and column size :
The matrix has rank . A rank-two approximation is given by zeroing out the smallest singular value,
which produces
We check that the Frobenius norm of the error is the square root of the sum of the squares of the singular values we have zeroed out, which here (with a single singular value zeroed out) reduces to :
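The same computation can be reproduced on any matrix; the sketch below (Python/numpy) uses a random 4 x 5 matrix rather than the one of the example, keeps the two largest singular values, and checks the error formula.

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.standard_normal((4, 5))                    # a generic 4 x 5 matrix has rank 4

    U, s, Vt = np.linalg.svd(A)
    k = 2                                              # keep the two largest singular values
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation

    # The Frobenius norm of the error equals the square root of the sum of the
    # squares of the dropped singular values.
    err = np.linalg.norm(A - A_k, 'fro')
    print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))  # expect True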
PSEUDO-INVERSE OF A 4X5 MATRIX VIA ITS SVD
as follows.
We first invert , simply “inverting what can be inverted” and leaving zero values alone. We get
Then the pseudo-inverse is obtained by exchanging the roles of and in the SVD:
PART VIII
APPLICATIONS
IMAGE COMPRESSION VIA LEAST-SQUARES
We can use least-squares to represent an image in terms of a linear combination of ‘‘basic’’ images, at least
approximately.
An image can be represented, via its pixel values or some other mechanism, as a (usually long) vector
. Now assume that we have a library of ‘‘basic’’ images, also in pixel form, that are represented
as vectors . Each vector could contain the pixel representation of a unit vector on some basis,
such as a two-dimensional Fourier basis; however, we do not necessarily assume here that the ‘s form a
basis.
Let us try to find the best coefficients , which allow approximating the given image (given
by ) as a linear combination of the ‘s with coefficients . Such a combination can be expressed
as the matrix-vector product , where is the matrix that contains the basic
images. The best fit can be found via the least-squares problem
Once the representation is found, and if the optimal value of the problem above is small, we can safely
represent the given image via the vector . If the vector is sparse, in the sense that it has many zeros, such a
representation can yield a substantial saving in memory, over the initial pixel representation .
The sparsity of the solution of the problem above for a range of possible images is highly dependent on the
images’ characteristics, as well as on the collection of basic images contained in . In practice, it is desirable to trade off the accuracy measure above against some measure of the sparsity of the optimal vector .
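A minimal Python/numpy sketch of the least-squares step follows; the "images" below are just random vectors and the sizes are arbitrary, and no sparsity-inducing term is included.

    import numpy as np

    rng = np.random.default_rng(8)
    n, k = 400, 30                       # n pixels, k "basic" images (both arbitrary here)
    A = rng.standard_normal((n, k))      # each column is a basic image, in pixel form
    x = rng.standard_normal(n)           # the image to be represented

    # Best coefficients in the least-squares sense: minimize ||A z - x||_2.
    z, residual, *_ = np.linalg.lstsq(A, x, rcond=None)

    approx_error = np.linalg.norm(A @ z - x) / np.linalg.norm(x)
    print(z.shape, round(approx_error, 3))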
SENATE VOTING DATA MATRIX
The data consists of the votes of Senators in the US Senate (2004-2006), for a total of bills. “Yea” (“Yes”) votes are represented as 1’s, “Nay” (“No”) as -1’s, and the other votes are recorded as 0. (A number of complexities are ignored here, such as the possibility of pairing the votes.)
with elements taken from . Each column of the voting matrix contains the
votes of a single Senator for all the bills; each row contains the votes of all Senators on a particular bill.
SENATE VOTING ANALYSIS AND VISUALIZATION
In this case study, we take data from the votes on bills in the US Senate (2004-2006), shown as a table above, and explore how we can visualize the data by projecting it, first on a line, then on a plane. We investigate how we can choose the line or plane in a way that maximizes the variance in the result, via a principal component analysis method. Finally, we examine how a variation on PCA that encourages sparsity of the projection directions allows us to understand which bills are most responsible for the variance in the data.
Data
The data consists of the votes of Senators in the US Senate (2004-2006), for a total of bills. “Yea” (“Yes”) votes are represented as ‘s, “Nay” (“No”) as ‘s, and the other votes are recorded as . (A number of complexities are ignored here, such as the possibility of pairing the votes.)
This data can be represented here as a ‘‘voting’’ matrix , with elements taken
from . Each column of the voting matrix contains the votes of a single Senator
for all the bills; each row contains the votes of all Senators on a particular bill.
Senate voting matrix: “Nay” votes are in black, “Yea” ones in white, and the others in gray. The transpose of the voting matrix is shown. The picture has many gray areas, as some Senators are replaced over time. Simply plotting the raw data matrix is often not very informative.
Visualization Problem
We can try to visualize the data set, by projecting each data point (each row or column of the matrix) on (say)
a 1D-, 2D- or 3D-space. Each ‘‘view’’ corresponds to a particular projection, that is, a particular one-, two-
or three-dimensional subspace on which we choose to project the data. The visualization problem consists
of choosing an appropriate projection.
There are many ways to formulate the visualization problem, and none dominates the others. Here, we focus
on the basics of that problem.
To simplify, let us first consider the problem of representing the high-dimensional data set on a single line, using the method described here.
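A minimal sketch of such a one-dimensional view (Python/numpy, on random data rather than the actual voting matrix): the projection direction is chosen as the eigenvector of the sample covariance matrix associated with its largest eigenvalue, so the projected data has maximal variance.

    import numpy as np

    rng = np.random.default_rng(9)
    X = rng.standard_normal((20, 100))            # 100 data points (columns) in R^20
    Xc = X - X.mean(axis=1, keepdims=True)        # center the data
    Sigma = (Xc @ Xc.T) / X.shape[1]              # sample covariance matrix

    # Direction of maximal variance: eigenvector of Sigma for the largest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    u = eigvecs[:, -1]

    scores = u @ Xc                               # 1D "view" of the data set
    print(np.isclose(scores.var(), eigvals[-1]))  # projected variance = largest eigenvalue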
BEER-LAMBERT LAW IN ABSORPTION SPECTROMETRY
The Beer-Lambert law in optics is an empirical relationship that relates the absorption of light by a material,
to the properties of the material through which the light is traveling. This is the basis of absorption spectrometry, which allows us to measure the concentration of different gases in a chamber.
If the container has a mixture of ‘‘pure’’ gases in it, the law postulates that the logarithm of the ratio
of the light intensities is a linear function of the concentrations of each gas in the mix. The log-ratio of
intensities is thus of the form for some vector , where is the vector of concentrations.
The coefficients , correspond to the log-ratio of light intensities when (the -th
vector of the standard basis, which corresponds to the -th pure gas). The quantity is called the coefficient
of absorption of the -th gas and can be measured in the laboratory.
The Beer-Lambert law postulates that the logarithm of the ratio of the light intensities is a linear function of
the concentrations of each gas in the mix. The log-ratio of intensities is thus of the form for some
vector , where is the vector of concentrations, and the vector contains the coefficients of
absorption of each gas. This vector is actually also a function of the frequency of the light we illuminate the
container with.
Now consider a container having a mixture of “pure” gases in it. Denote by the vector of
concentrations of the gases in the mixture. We illuminate the container at different frequencies
. For each experiment, we record the corresponding log-ratio , , of the intensities. If the
Beer-Lambert law is to be believed, then we must have
for some vectors , which contain the coefficients of absorption of the gases at light frequency .
More compactly:
where
Since ‘s correspond to “pure” gases, they can be measured in the laboratory. We can then use the above
model to infer the concentration of the gases in a mixture, given some observed light intensity log-ratio.
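The inference step can be sketched as a least-squares problem (Python/numpy); the absorption coefficients and concentrations below are made up for illustration, not measured values.

    import numpy as np

    rng = np.random.default_rng(10)
    m, n = 8, 3                               # m light frequencies, n gases (arbitrary sizes)
    A = np.abs(rng.standard_normal((m, n)))   # absorption coefficients, one row per frequency
    x_true = np.array([0.2, 0.5, 0.3])        # "true" concentrations (made up)

    y = A @ x_true + 1e-3 * rng.standard_normal(m)   # observed log-ratios, slightly noisy

    # Infer the concentrations by least squares.
    x_hat = np.linalg.lstsq(A, y, rcond=None)[0]
    print(np.round(x_hat, 3))                 # close to [0.2, 0.5, 0.3]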
IMAGE COMPRESSION
To represent the image in terms of the dictionary, we would like to find coefficients , such
that
If the representation of the image (that is, the vector ) has many zeros (we say: is sparse), then we can
represent the entire image with only a few values (the components of that are not zero). We may then,
for example, send the image over a communication network at high speed. Provided the receiver has the
dictionary handy, it can reconstruct perfectly the image.
In practice, it may be desirable to trade off the sparsity of the representation (via ) against the accuracy of
the representation. Namely, we may prefer a representation that achieves only approximately
but has many more zeros. The process of searching for a good sparsity/accuracy trade-off is called image
compression.
TEMPERATURES AT DIFFERENT AIRPORTS
Airport Temperature
SFO 55
ORD 32
JFK 43
The dot representation could become very confusing if there were three temperatures to plot for each day in
the year.
In the plane, we measure the distances of an object located at an unknown position from points
with known coordinates . The distance vector is a non-linear
function of , given by
Now assume that we have obtained the position of the object at a given time and seek to predict
the change in position that is consistent with observed small changes in the distance vector .
We can approximate the non-linear functions via the first-order (linear) approximation. A linearized
model around a given point is , with a matrix with elements
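Here is a small Python/numpy sketch of this linearization, using the standard expression for the derivatives of a Euclidean distance; the beacon positions and the current position are made up for illustration.

    import numpy as np

    # Beacons with known positions (rows of P) and the object's current position x0.
    P = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
    x0 = np.array([1.0, 1.0])

    rho = lambda x: np.linalg.norm(x - P, axis=1)          # vector of distances

    # Jacobian of the distance vector: row i is (x0 - p_i)' / ||x0 - p_i||.
    J = (x0 - P) / rho(x0)[:, None]

    # The linearized model predicts the change in distances for a small move dx.
    dx = np.array([0.001, -0.002])
    print(np.allclose(rho(x0 + dx) - rho(x0), J @ dx, atol=1e-5))   # expect True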
BAG-OF-WORDS REPRESENTATION OF TEXT
A (real) vector is just a collection of real numbers, referred to as the components (or, elements) of the vector; denotes
the set of vectors with elements. If denotes a vector, we use subscripts to denote elements, so that is the -th
component of . Vectors are arranged in a column, or a row. If is a column vector, denotes the corresponding row
vector, and vice-versa.
The row vector contains the number of times each word in the list {vector, of, the} appears in the above paragraph. Vectors can thus be used to represent text documents. The representation, often referred to as the bag-of-words representation, is not faithful, as it ignores the respective order of appearance of the words. In addition, stop words (such as ‘‘the’’ or ‘‘of’’) are often also ignored.
BAG-OF-WORDS REPRESENTATION OF TEXT: MEASURE OF DOCUMENT SIMILARITY
Returning to the bag-of-words example, we can use the notion of angle to measure how close two different documents are to each other.
Given two documents, and a pre-defined list of words appearing in the documents (the dictionary), we can
compute the vectors of frequencies of the words as they appear in the documents. The angle between
the two vectors is a widely used measure of closeness (similarity) between documents.
Log-returns
Often, the rates of return are approximated, especially if the period length is small. If , then
Assume that at the beginning of the period, we invest a sum in all the assets, allocating a fraction (in ) to the -th asset. Here is a non-negative vector which sums to one. Then the portfolio we
constituted this way will earn
The rate of return is thus the scalar product between the vector of individual returns and the vector of portfolio allocation weights .
Note that, in practice, rates of return are never known in advance, and they can be negative (although, by
construction, they are never less than ).
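For illustration (Python/numpy, made-up numbers): the portfolio return is the scalar product of the return vector and the weight vector, and the log-return log(1 + r) is close to r when r is small.

    import numpy as np

    r = np.array([0.05, -0.02, 0.03])      # hypothetical per-asset rates of return
    w = np.array([0.5, 0.2, 0.3])          # allocation fractions (non-negative, sum to one)

    portfolio_return = r @ w               # scalar product of returns and weights
    print(portfolio_return)                # 0.5*0.05 - 0.2*0.02 + 0.3*0.03 = 0.03

    # Log-return approximation: log(1 + r) is close to r when r is small.
    print(np.round(np.log1p(r), 4))        # [ 0.0488 -0.0202  0.0296]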
SINGLE FACTOR MODEL OF FINANCIAL PRICE DATA
Consider a data matrix which contains the log-returns of assets over time periods (say, days).
A single-factor model for this data is one based on the assumption that the matrix is a dyad:
where , and . In practice, no component of and is zero (if that is not the case, then a
whole row or column of is zero and can be ignored in the analysis).
According to the single factor model, the entire market behaves as follows. At any time , the
log-return of asset is of the form .
• For any asset, the rate of change in log-returns between two time instants is given by the ratio
, independent of the asset. Hence, gives the time profile for all the assets: every asset shows the
same time profile, up to a scaling given by .
• Likewise, for any time , the ratio between the log-returns of two assets and at time is given by
, independent of . Hence gives the asset profile for all the time periods. Each time shows the
same asset profile, up to a scaling given by .
While single-factor models may seem crude, they often offer a reasonable amount of information. It turns out that for many financial market data sets, a good single-factor model involves a time profile equal to the log-returns of the average of all the assets, or some weighted average (such as the S&P 500 index). With this
model, all assets follow the profile of the entire market.
THE PROBLEM OF GAUSS
In the early 1800s, astronomers had just discovered a new planetoid, Ceres, when the object became
impossible to track due to the glare of the sun. Surely the object would reappear sometime soon; but where
to look in the vast sky? The problem of predicting the location of the object, based on location data gathered
during the past 40 days, became the challenge of the day for the astronomy community.
Note that there is a controversy as to the identity of the inventor of the least-squares method: Legendre published it before Gauss (in 1805), but Gauss claimed to have discovered it years before.
CONTROL OF A UNIT MASS
Consider the problem of transferring a unit mass, initially at rest and sliding on a plane, from one point to another at a
unit distance. We can exert a constant force of magnitude on the mass at time intervals ,
.
Denoting by the position at the final instant , we can express via Newton’s law the relationship
between the force vector and position/velocity vector as , where .
Now assume that we would like to find the smallest-norm (in the Euclidean sense) force that puts the mass
at at the final time. This is the problem of finding the minimum-norm solution to the equation
. The solution is .
PORTFOLIO OPTIMIZATION VIA LINEARLY CONSTRAINED LEAST-SQUARES
We consider a universe of financial assets, in which we seek to invest over one time
period. We denote by the vector containing the rates of return of each asset. A portfolio
corresponds to a vector , where is the amount invested in asset . In our simple
model, we assume that ‘‘shorting’’ (borrowing) is allowed, that is, there are no sign
restrictions on .
As explained, the return of the portfolio is the scalar product . We do not know the return
vector in advance. We assume that we know a reasonable prediction of . Of course, we cannot rely on the vector alone to make a decision, since the actual values in could fluctuate around . We can
consider two simple ways to model the uncertainty on , which result in similar optimization problems.
Mean-variance trade-off.
A first approach assumes that is a random variable, with known mean and covariance matrix . If past
values of the returns are known, we can use the following estimates
Note that, in practice, the above estimates for the mean and covariance matrix are very unreliable, and
more sophisticated estimates should be used.
Then the mean value of the portfolio’s return takes the form , and its variance is
We can strike a trade-off between the ‘‘performance’’ of the portfolio, measured by the mean return, and the ‘‘risk’’, measured by the variance, via the optimization problem
where is our target for the nominal return. Since is positive semi-definite, that is, it can be written as with , the above problem is a linearly constrained least-squares problem.
An ellipsoidal model
To model the uncertainty in , we can use the following deterministic model. We assume that the true vector
lies in a given ellipsoid , but is otherwise unknown. We describe by its center and a ‘‘shape matrix’’
determined by some invertible matrix :
Using the Cauchy-Schwarz inequality, as well as the form of given above, we obtain that
Likewise,
For a given portfolio vector , the true return will lie in an interval ,
where is our ‘‘nominal’’ return, and is a measure of the ‘‘risk’’ in the nominal return:
We can formulate the problem of minimizing the risk subject to a constraint on the nominal return:
where is our target for the nominal return, and . This is again a linearly constrained least-squares problem. Note that we obtain a problem that has exactly the same form as in the stochastic model seen before.
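A minimal sketch of the mean-variance version as an equality-constrained least-squares problem (Python/numpy): the return data is randomly generated, the estimates are the simple ones mentioned above (and, as noted, unreliable in practice), and the problem is solved through the KKT linear system of the equality-constrained quadratic program.

    import numpy as np

    rng = np.random.default_rng(11)
    n = 4
    R = rng.standard_normal((250, n)) * 0.01 + 0.0005     # hypothetical past daily returns
    r_hat = R.mean(axis=0)                                # estimated mean return
    Sigma = np.cov(R, rowvar=False)                       # estimated covariance matrix

    target = r_hat.mean()                                 # a (modest) target nominal return

    # Minimize w' Sigma w subject to r_hat' w = target and sum(w) = 1,
    # by solving the KKT linear system of this equality-constrained problem.
    C = np.vstack([r_hat, np.ones(n)])                    # constraint matrix (2 x n)
    d = np.array([target, 1.0])
    KKT = np.block([[2 * Sigma, C.T], [C, np.zeros((2, 2))]])
    sol = np.linalg.solve(KKT, np.concatenate([np.zeros(n), d]))
    w = sol[:n]

    print(np.isclose(r_hat @ w, target), np.isclose(w.sum(), 1.0))   # constraints hold
    print(w @ Sigma @ w)                                             # minimized variance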
PART IX
THEOREMS
CAUCHY-SCHWARZ INEQUALITY PROOF
The above inequality is an equality if and only if are collinear. In other words:
Proof: The inequality is trivial if either one of the vectors is zero. Let us assume both are non-zero.
Without loss of generality, we may re-scale and assume it has unit Euclidean norm ( ). Let us
first prove that
The second result is proven as follows. Let be the optimal value of the problem. The Cauchy-Schwarz inequality implies that . To prove that the value is attained (it is equal to its upper bound),
we observe that if , then
The vector is feasible for the optimization problem . This establishes a lower bound on
the value of , :
DIMENSION OF HYPERPLANES
Theorem:
Conversely, any affine set of dimension can be represented by a single affine equation of the form ,
as in the above.
Proof:
Since the vectors are independent, the dimension of is . This proves that is
indeed an affine set of dimension .
The converse is also true. Any subspace of dimension can be represented via an equation
for some . A sketch of the proof is as follows. We use the fact that we can form a basis
for the subspace . We can then construct a vector that is orthogonal to all of these basis vectors. By
definition, is the set of vectors that are orthogonal to .
SPECTRAL THEOREM: EIGENVALUE DECOMPOSITION FOR SYMMETRIC MATRICES
Spectral theorem
We can decompose any symmetric matrix with the symmetric eigenvalue decomposition (SED)
Proof: The proof is by induction on the size of the matrix . The result is trivial for . Now let
and assume the result is true for any matrix of size .
Now consider the eigenvalue and an associated eigenvector . Using the Gram-Schmidt
orthogonalization procedure, we can compute a matrix such that is orthogonal.
By induction, we can write the symmetric matrix as , where is a matrix of eigenvectors, and are the eigenvalues of . Finally, we define the matrix . By construction, the matrix is orthogonal.
We have
We have exhibited an orthogonal matrix such that is diagonal. This proves the theorem.
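The decomposition can be verified numerically on a random symmetric matrix (Python/numpy):

    import numpy as np

    rng = np.random.default_rng(12)
    B = rng.standard_normal((5, 5))
    A = (B + B.T) / 2                          # a symmetric matrix

    lam, U = np.linalg.eigh(A)                 # eigenvalues and orthonormal eigenvectors

    print(np.allclose(U.T @ U, np.eye(5)))                # U is orthogonal
    print(np.allclose(U @ np.diag(lam) @ U.T, A))         # A = U diag(lam) U'
    print(np.allclose(U.T @ A @ U, np.diag(lam)))         # U' A U is diagonal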
SINGULAR VALUE DECOMPOSITION (SVD) THEOREM
where the positive numbers are unique, and are called the singular values of . The number
is equal to the rank of , and the triplet is called a singular value decomposition
(SVD) of . The first columns of (resp. ) are called left (resp.
right) singular vectors of , and satisfy
Proof: The matrix is real and symmetric. According to the spectral theorem, it admits an
eigenvalue decomposition in the form , with a matrix whose columns form an
orthonormal basis (that is, ), and . Here, is
the rank of (if then there are no trailing zeros in ). Since is positive semi-definite, the
‘s are non-negative, and we can define the non-zero quantities .
These -vectors are unit-norm, and mutually orthogonal, since the ‘s are eigenvectors of . Using (say)
the Gram-Schmidt orthogonalization procedure, we can complete (if necessary, that is in the case )
this set of vectors by in order to form an orthogonal matrix .
Let us check that satisfy the conditions of the theorem, by showing that
We have
where the second line stems from the fact that when . Thus, , as claimed.
RANK-ONE MATRICES
Recall that the rank of a matrix is the dimension of its range. A rank-one matrix is a matrix with rank equal
to one. Such matrices are also called dyads.
where .
Proof.
The interpretation of the corresponding linear map for a rank-one matrix is that the
output is always in the direction , with coefficient of proportionality a linear function of
.
The interpretation for the expression above is that the result of the map for a rank-one
matrix can be decomposed into three steps:
RANK-ONE MATRICES: A REPRESENTATION THEOREM
where .
Proof: For any non-zero vectors , the matrix is indeed of rank one: if , then
When spans , the scalar spans the entire real line (since ), and the vector spans the
subspace of vectors proportional to . Hence, the range of is the line:
which is of dimension 1.
Conversely, if is of rank one, then its range is of dimension one, hence it must be a line passing through . Hence, for any , there exists a function such that
Using , where is the th vector of the standard basis, we obtain that there exist numbers
such that for every :
Now letting and realizing that the matrix is simply the identity matrix, we
obtain , as desired.
FULL RANK MATRICES
Theorem
A matrix in is:
Proof:
The matrix is full column rank if and only if its nullspace is reduced to the singleton , that is,
Conversely, assume that the matrix is full column rank, and let be such that . We then have
, which means . Since is full column rank, we obtain , as
desired. The proof for the other property follows similar lines.
RANK-NULLITY THEOREM
Rank-nullity theorem
The nullity (dimension of the nullspace) and the rank (dimension of the range) of an matrix add up to
the column dimension of , .
Proof: Let be the dimension of the nullspace ( ). Let be a matrix such that
its columns form an orthonormal basis of . In particular, we have . Using the QR
decomposition of the matrix , we obtain a matrix such that the matrix
is orthogonal. Now define the matrix .
We proceed to show that the columns of form a basis for the range of . To do this, we first prove that
the columns of span the range of . Then we will show that these columns are independent. This
will show that the dimension of the range (that is, the rank) is indeed equal to .
Since is an orthonormal matrix, for any , there exist two vectors such that
If , then
Now let us show that the columns of are independent. Assume a vector satisfies and let
us show . We have , which implies that is in the nullspace of . Hence,
there exists another vector such that . This is contradicted by the fact that is an
orthogonal matrix: pre-multiplying the last equation by , and exploiting the fact that
we obtain
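A quick numerical check of the theorem (Python/numpy): a 5 x 7 matrix built as a product with inner dimension 3 has rank 3, and an orthonormal basis of its nullspace, read off the SVD, has 4 = 7 - 3 columns.

    import numpy as np

    rng = np.random.default_rng(13)
    # Build a 5 x 7 matrix of rank 3 (so the nullity should be 7 - 3 = 4).
    A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))

    U, s, Vt = np.linalg.svd(A)
    tol = 1e-10
    rank = np.sum(s > tol)
    N = Vt[rank:].T                        # columns: orthonormal basis of the nullspace
    nullity = N.shape[1]

    print(rank, nullity, rank + nullity == A.shape[1])     # 3 4 True
    print(np.allclose(A @ N, 0))                           # the basis really lies in N(A)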
A THEOREM ON POSITIVE SEMIDEFINITE FORMS AND EIGENVALUES
Conversely, if there exists for which , then choosing will result in , contradicting the condition
FUNDAMENTAL THEOREM OF LINEAR ALGEBRA
Let . The sets and form an orthogonal decomposition of , in the sense that any vector can be written as
In particular, we obtain that if a vector is orthogonal to every vector in the nullspace of , then it must be in the range of its transpose:
then an SVD of its transpose is simply obtained by transposing the three-term matrix product involved:
Thus, the left singular vectors of are the right singular vectors of .
From this we conclude in particular that the range of is spanned by the first columns of . Since
the nullspace of is spanned by the last columns of , we observe that the nullspace of and the
range of are two orthogonal subspaces, whose dimensions sum to that of the whole space. Precisely, we
can express any given vector in terms of a linear combination of the columns of ; the first columns
correspond to the vector and the last to the vector :
The last statement is then an obvious consequence of this first result: if is orthogonal to the nullspace,
then the vector in the theorem above must be zero, so that .