Machine Learning and Pattern Recognition
Peter J. Ramadge
Princeton University
© Peter J. Ramadge 2015, 2016, 2017, 2018. Please do not distribute without permission.
Contents
3 Matrix Algebra  31
  3.1 Matrix Product  31
  3.2 Matrix Inverse  32
  3.3 Eigenvalues and Eigenvectors  33
  3.4 Similarity Transformations and Diagonalization  35
  3.5 Symmetric and Positive Semidefinite Matrices  35
6 Multivariable Differentiation  61
  6.1 Real Valued Functions on Rn  62
  6.2 Functions f : Rn → Rm  64
  6.3 Real Valued Functions on Rn×m  69
  6.4 Matrix Valued Functions of a Matrix  71
  6.5 Appendix: Proofs  72
I Appendices 233
C QR-Factorization  241
  C.1 The Gram-Schmidt Procedure  241
  C.2 QR-Factorization  242
Notation
Z The set of integers. Integers are typically denoted by lower case roman letters, e.g., i, j, k.
N The set of non-negative integers.
R The set of real numbers.
R+ The set of non-negative real numbers.
Rn The set of n-tuples of real numbers for n ∈ N and n ≥ 1.
Rn×m The set of n × m matrices of real numbers.
[1 : k] The set of integers 1, . . . , k.
≜ Equal by definition, as in A(x) ≜ xx^T.
a ∈ A Indicates that a is an element of the set A.
0n , 1n The vectors in Rn with 0n = (0, . . . , 0) and 1n = (1, . . . , 1). (n is omitted if clear from context).
ej The j-th vector in the standard basis for Rn .
In The n × n identity matrix, In = [e1 , . . . , en ].
X −1 The matrix inverse of the square matrix X.
XT The transpose of the matrix X. If X = [Xi,j ], then X T = [Xj,i ].
X −T The transpose of the inverse of the square matrix X, i.e., (X −1 )T .
Xi,: The i-th row of X ∈ Rn×m , i.e., the 1 × m matrix Xi,: = [Xi,j , j ∈ [1 : m]].
X:,j The j-th column of X ∈ Rn×m , i.e., the n × 1 matrix X:,j = [Xi,j , i ∈ [1 : n]].
On The group of (real) n × n orthogonal matrices.
Vn,k The set of real n × k matrices (k ≤ n) with orthonormal columns.
Sn , Sn+ The subsets of symmetric and symmetric PSD matrices in Rn×n , respectively.
X ⊗ Y The Schur product [Xi,j Yi,j ] of X, Y ∈ Rm×n . Similarly, x ⊗ y = [xi yi ] for x, y ∈ Rn .
<·, ·> An inner product.
x⊥y Vectors x and y are orthogonal. Similarly, X ⊥ Y indicates orthogonality of matrices X and Y .
U⊥ The orthogonal complement of a subspace U.
Df (x) The derivative of f (x) w.r.t. x ∈ Rn . Often displayed as Df (x)(v) to indicate its action on v ∈ Rn .
∇f (x) The gradient of the real valued function f (x) at x ∈ Rn . Note that ∇f (x) ∈ Rn .
X A random variable or random vector (set in a distinct typeface from the matrix X).
E[X] The expected value of the random variable, or random vector X.
µX The mean of the random variable, or random vector X.
σ_X^2 The variance of a random variable X.
ΣX The covariance matrix of the random vector X.
Chapter 1
The objective of machine learning is to develop principled methods to automatically identify useful relation-
ships in data. By useful we mean that these relationships or patterns must generalize to new data collected
from the same source. The patterns of interest can take several forms. For example, these could be topologi-
cal (e.g., clusters in the data), or be (approximate) functional relationships between subsets of data features.
Patterns can also be expressed probabilistically by identifying probabilistic dependences between compo-
nents of the data. The identified patterns might be used to better understand the data, to compress the data,
to estimate the values of missing variables, and to make decisions based on observed data.
The data is the primary guide for accomplishing the machine learning task. But we don’t necessarily
want to give the data total control. We also want to impose appropriate objectives, and use domain knowl-
edge (prior knowledge) to guide the learning process.
Before introducing formal definitions and mathematical notation, it’s helpful to see a simple machine
learning problem. This serves to motivate the subsequent development.
than once. In practice, you are often given, or must form, your own set of testing data. Despite the fact that
the testing data is in your possession, you must reserve it only for testing the final classifier.
One way to form the training and testing sets is to partition the labelled data into two fixed disjoint
subsets. We want the training set to be large so that it is representative of the data that is likely to be
encountered in the future, and we want the testing data to be large so that it can provide a good estimate of
the accuracy of the learned classifier. If the original set of labeled data is sufficiently large, a single split
into disjoint training and testing sets can work well. This is sometimes called the holdout method of testing
classifier performance, since the testing data is “held out” during the training phase.
by setting α = p̂. The selected randomized classifier outputs the label 1 with probability p̂. Its testing performance is
$$\hat{p}\,p + (1 - \hat{p})(1 - p).$$
Figure 1.1: Performance of simple spam filters. On each graph, the vertical axis is the fraction of correctly classified emails, and the horizontal axis is the fraction p of spam emails, in each case for the testing data. Left: The performance of the classifiers f_0, f_1, and f_{1/2}. Center: The performance of the classifiers f_p̂ and g_p̂ for the ideal situation p̂ = p. Right: The performance of f_p̂ and g_p̂ when p̂ differs from p by up to 20%.
Notice that we reduced the entire training dataset to a single scalar quantity p̂, and then used this value to select a classifier from the fixed parameterized family F. Hence the selected classifier f_p̂ is a function of the training data. Here is another example. Use p̂ to select a classifier from the set of classifiers {f_0, f_1} using the rule:
$$g_{\hat{p}} = \begin{cases} f_0, & \text{if } \hat{p} \le 0.5;\\ f_1, & \text{otherwise.} \end{cases}$$
For the ideal situation p̂ = p, the performance curves of f_p̂ and g_p̂ are shown in the center plot of Figure 1.1. Both classifiers show improved performance over the trivial classifiers f_0, f_1, f_{1/2}, but performance remains low because we are still ignoring the content of the input email.
Under the assumption that the training dataset is representative of the testing dataset, we expect p̂ to be a good approximation to the fraction p of spam emails in the testing data. Let's examine what happens to the testing performance of the above classifiers when p̂ differs from p. The plot on the right side of Figure 1.1 displays the performance of f_p̂ and g_p̂ when p̂ and p differ by up to 20%. Notice that the classifiers don't always generalize well to the testing data. There has been a tradeoff: the classifiers are tuned to the training data, and this improves performance when p̂ = p, but it can also reduce robustness to variations between the testing and training data.
The classifiers fp̂ and gp̂ are special instances of a common theme in machine learning. First fix a
parameterized family of classifiers. Then use information derived from the training data to select a particular
classifier from this family. In this way, the selected classifier is a function of the training data.
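To make these performance curves concrete, here is a minimal numerical sketch (not from the text) that evaluates the expected fraction of correct classifications for the classifiers discussed above as a function of the spam fraction p in the testing data. The grid of p values and the ±20% mismatch factors are illustrative choices mirroring Figure 1.1.

```python
import numpy as np

p = np.linspace(0.0, 1.0, 11)          # fraction of spam in the testing data

acc_f0 = 1.0 - p                        # f0: always predict "non-spam"
acc_f1 = p                              # f1: always predict "spam"
acc_fhalf = np.full_like(p, 0.5)        # f_1/2: guess each label with probability 1/2

def acc_fhat(p, p_hat):
    # randomized classifier: outputs label 1 with probability p_hat
    return p_hat * p + (1.0 - p_hat) * (1.0 - p)

def acc_ghat(p, p_hat):
    # g_phat: picks f0 if p_hat <= 0.5, and f1 otherwise
    return np.where(p_hat <= 0.5, 1.0 - p, p)

print("p     :", p)
print("f0    :", acc_f0)
print("f1/2  :", acc_fhalf)
for m in (1.0, 0.8, 1.2):               # p_hat = p, and p_hat off by -20% / +20%
    p_hat = np.clip(m * p, 0.0, 1.0)
    print(f"f_phat (p_hat = {m:.1f}p):", np.round(acc_fhat(p, p_hat), 3))
    print(f"g_phat (p_hat = {m:.1f}p):", np.round(acc_ghat(p, p_hat), 3))
```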
As a result, a high occurrence rate of either of these symbols was useful in identifying spam.
We can illustrate these ideas using the Spambase dataset. This publicly available dataset is described in Figure 1.2. Feature 51 in the Spambase dataset is the occurrence rate of the symbol "!" in an email. To determine whether the values of this particular feature vary between spam and non-spam email, we plot histograms of the feature values for each class over the training data. These are shown superimposed in Figure 1.3. As you can see, the spam and non-spam histograms differ significantly. This confirms that feature 51 is indeed relevant for identifying spam.
When we examine one feature in isolation, we are doing a univariate analysis. For example, examining the histograms of each feature one at a time is a univariate analysis. It gives a subjective measure of the ability of each feature (in isolation) to distinguish the two classes. Designing a classifier based on a single feature and testing how well it can distinguish spam from non-spam email is also a univariate analysis. This provides a quantitative measure of the feature's potential (considered in isolation) for distinguishing email spam. For the moment, we will restrict our attention to such univariate analyses.
[Plot: density histograms of feature 51 ("!") for the non-spam and spam classes on the training data; range [0, 20], 100 bins, classification performance 0.791.]
Figure 1.3: Histograms of feature 51 ("!") for spam and non-spam training emails. The histograms are plotted as densities, so the area under each histogram is 1. To obtain the corresponding probability mass function, multiply each bin value by the bin width (= 0.2). Using a Bayes classifier, these histograms yield a training-data classification performance of 0.791.
conditional densities $p_1(x) \triangleq p(x \mid y = 1)$ and $p_0(x) \triangleq p(x \mid y = 0)$.
Our agreed metric for selecting a classifier is maximizing the probability of success, or equivalently,
minimizing the probability of error. Hence given x, we want to select the label value y ∈ {0, 1} that
maximizes p(y|x). Using Bayes rule one can write
$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}. \qquad (1.3)$$
Since the term p(x) in the denominator of (1.3) doesn’t depend on y, it doesn’t influence the maximization.
Hence the desired classifier is
$$f(x) = \arg\max_{k \in \{0,1\}}\ p_k(x)\,p(k) = \begin{cases} 1, & \text{if } p(1)\,p_1(x) > p(0)\,p_0(x);\\ 0, & \text{otherwise.} \end{cases} \qquad (1.4)$$
This is called the Bayes classifier for the given model. Because our objective is to minimize the probability of error, the Bayes classifier makes the decision with maximum a posteriori probability (MAP). This special form of Bayes classifier is called a MAP classifier.
Here are two alternative expressions for a binary MAP classifier (1.4):
$$f(x) = \begin{cases} 1, & \text{if } \dfrac{p_1(x)}{p_0(x)} > \dfrac{p(0)}{p(1)};\\ 0, & \text{otherwise;} \end{cases} \qquad\quad f(x) = \begin{cases} 1, & \text{if } \ln\dfrac{p_1(x)}{p_0(x)} > \ln\dfrac{p(0)}{p(1)};\\ 0, & \text{otherwise.} \end{cases} \qquad (1.5)$$
the label is k. The likelihood ratio compares the two likelihoods under the observation x. Since a strictly monotone increasing function is order preserving, we can take the natural log of both sides of the likelihood ratio comparison without changing the result. This yields the second expression for f(x) in (1.5). The term $\ln\frac{p_1(x)}{p_0(x)}$ in this expression is called the log-likelihood ratio.
Performance
The performance of the MAP classifier is determined as follows. There are two mutually exclusive ways to
make a correct decision: f (x) = 1 and the joint outcome is (x, 1), or f (x) = 0 and the joint outcome is
(x, 0). For the first case, p(x, 1) = p1 (x)p(1), and for the second, p(x, 0) = p0 (x)p(0). The contribution to
the success of f at outcome x is thus
This is the MAP classifier for the given histograms H0, H1, and prior probability of spam p̂. Note that the number of bins used in the histograms is a selectable parameter, so we have a family of classifiers parameterized by an integer variable N. The value of N must be selected before learning the probabilistic model from the training data. Quantities of this form are often called hyperparameters. In contrast, p̂ is a parameter of the model learned directly from the training data.
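As a rough illustration of this construction (not the author's code), the sketch below estimates p̂ and the class-conditional histograms H0 and H1 from a single training feature and applies the MAP rule p̂·H1(x) vs. (1 − p̂)·H0(x). The synthetic feature values, the value range, and the choice N = 100 are placeholder assumptions.

```python
import numpy as np

def fit_hist_map(x_train, y_train, n_bins, value_range):
    """Estimate p_hat and the per-class density histograms H0, H1 on one feature."""
    p_hat = y_train.mean()                        # estimated prior probability of spam
    edges = np.linspace(value_range[0], value_range[1], n_bins + 1)
    h0, _ = np.histogram(x_train[y_train == 0], bins=edges, density=True)
    h1, _ = np.histogram(x_train[y_train == 1], bins=edges, density=True)
    return p_hat, edges, h0, h1

def map_classify(x, p_hat, edges, h0, h1):
    """Label 1 iff p_hat * H1(x) > (1 - p_hat) * H0(x)."""
    b = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(h0) - 1)
    return (p_hat * h1[b] > (1.0 - p_hat) * h0[b]).astype(int)

# Synthetic stand-in for one standardized feature (e.g., an occurrence rate).
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.4).astype(int)
x = np.where(y == 1, rng.exponential(1.0, 2000), rng.exponential(0.3, 2000))

p_hat, edges, h0, h1 = fit_hist_map(x, y, n_bins=100, value_range=(0.0, 8.0))
pred = map_classify(x, p_hat, edges, h0, h1)
print("training performance:", (pred == y).mean())
```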
We are now in a position to make an initial assessment of whether a feature in the Spambase dataset is informative of class membership. To do so, we estimate a Bayes classifier using the feature's training data and report this classifier's performance. The results of classifier training and testing on each of the 57 univariate features in the Spambase dataset are shown in Figure 1.4. For all features we set N = 100. The scalar data for each feature was first standardized by subtracting its sample mean from each example and scaling the resulting values to have unit variance.
[Plot: Bayes classifier performance on each standardized feature (fraction correct, training and testing; 100 bins, range [-1.0, 19.0]). The horizontal axis lists the 57 Spambase feature names and the vertical axis is the fraction correct; the level of the baseline classifier f_0 is also indicated.]
Figure 1.4: Univariate classification results for each feature in the Spambase dataset. An empirical Bayes classifier
using 100 histogram bins was trained on the (standardized) feature. The plot suggests that approximately half of the
features are informative of class membership.
Figure 1.5: Classification performance of the Bayes classifier based on single-feature histograms as a function of the number of histogram bins. The top plots are for three word occurrence rate features: "our", "remove", and "free". The bottom plots are for two symbol occurrence rates ("(" and "!") and a feature that measures the longest run length of capital letters.
The ability to generalize from training to testing data is one of the main goals of machine learning. Finally, we note that a "good" value for N depends on the particular feature being used. Moreover, we can't use the testing data to determine this value; we can only use the training data. That is a new problem that needs to be solved.
results provide an average performance and information on the spread of performance about the average. The scheme is slightly more complex than the holdout method, and requires a k-fold increase in computation. Nevertheless, it is a useful and widely used heuristic for evaluating classifier performance.
Figure 1.7: Hyperparameter selection by subdividing the training set into fixed training and validation subsets.
on the left-out fold. This results in k estimates of performance. Around this loop place a selection loop that cycles through the hyperparameter values in contention and evaluates each classifier's performance in the above manner. After the classifiers have been trained and evaluated, select the hyperparameter value that gave the best results. You can then train the classifier on all of the training data using the selected hyperparameter value, and then report its performance on the test data. The scheme is illustrated in Figure 1.8.
Because k-fold cross-validation obtains k performance estimates for every value of the hyperparameter being examined, it is computationally expensive. Nevertheless it is frequently used.
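A minimal sketch of this selection loop, assuming the univariate histogram classifier sketched earlier and synthetic training data, is shown below. The candidate bin counts and the choice k = 5 are illustrative, not values from the text; wrapping this procedure in an outer loop over held-out folds gives the nested scheme described next.

```python
import numpy as np

def cv_accuracy(x, y, n_bins, k=5, seed=0):
    """k-fold cross-validation accuracy of a univariate histogram MAP classifier."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        p_hat = y[tr].mean()
        edges = np.linspace(x[tr].min(), x[tr].max(), n_bins + 1)
        h0, _ = np.histogram(x[tr][y[tr] == 0], bins=edges, density=True)
        h1, _ = np.histogram(x[tr][y[tr] == 1], bins=edges, density=True)
        b = np.clip(np.searchsorted(edges, x[val], side="right") - 1, 0, n_bins - 1)
        pred = (p_hat * h1[b] > (1.0 - p_hat) * h0[b]).astype(int)
        scores.append((pred == y[val]).mean())
    return float(np.mean(scores)), float(np.std(scores))

# Selection loop over candidate hyperparameter values N.
rng = np.random.default_rng(1)
y = (rng.random(2000) < 0.4).astype(int)
x = np.where(y == 1, rng.exponential(1.0, 2000), rng.exponential(0.3, 2000))
results = {N: cv_accuracy(x, y, N) for N in (10, 25, 50, 100, 200)}
best_N = max(results, key=lambda N: results[N][0])
print(results)
print("selected N =", best_N)
```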
Figure 1.9: Classifier hyperparameter selection, training, and testing using nested k-fold cross-validation.
to evaluate classifier performance under the selected parameter. The outer loop iterates through the k folds by training on k − 1 folds and testing performance on the corresponding left-out fold. This results in k estimates of classifier performance. As explained below, it also provides k corresponding hyperparameter values.
Inside the outer loop we place two inner loops. At the start of each of the above k iterations, we place
a selection loop that iterates through a designated set of hyperparameter values. For each hyperparameter
value we train and evaluate the resulting classifier. This is where a third innermost loop is used. The simplest
way to implement this inner loop is to use k − 2 of the k − 1 current training folds as the training subset,
and the left-aside fold of the current training folds as the validation subset. At the completion of the inner
loop, we obtain k − 1 estimates of performance. These give an estimate of the average performance of
the current classifier, and the spread of its performance about the average. After training and evaluation
using each of the hyperparameter values, we select a “best” value based on the information obtained. We
can then train the classifier using the selected hyperparameter value on all of the k − 1 folds of the current
training data. Then report its performance on the left-out test fold. At the end of the k iterations in the outer
loop, we have k best hyperparameter value selections and k sets of performance metrics. The entire scheme
is illustrated in Figure 1.9. This scheme is both complex and computationally expensive, and it is only used when its computational expense is justified. For example, it may be appealing when the set of labelled data is limited.
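If scikit-learn is available, the nested scheme can be written compactly: GridSearchCV implements the inner hyperparameter-selection loop and cross_val_score the outer performance loop, refitting the search within each outer training set. The classifier (k-nearest neighbors), the candidate hyperparameter values, and the synthetic data below are placeholders, not choices made in the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                               # placeholder features
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)  # placeholder labels

inner = KFold(n_splits=4, shuffle=True, random_state=0)     # hyperparameter selection
outer = KFold(n_splits=5, shuffle=True, random_state=1)     # performance estimation

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                      cv=inner)
scores = cross_val_score(search, X, y, cv=outer)            # outer loop over k folds
print("outer-fold accuracies:", np.round(scores, 3))
print("mean / std:", round(float(scores.mean()), 3), round(float(scores.std()), 3))
```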
We have seen that a single scalar valued feature can yield considerable improvement in classification performance over baseline classifiers. We suspect, however, that combining a complementary set of univariate features is likely to yield even greater performance improvement. This is called multivariate analysis.
For several reasons, the use of feature maps is ubiquitous in machine learning. For many applications it is
possible to use feature extraction methods to map the application data into a vector of features in Rn . This
may incur some loss of information, but it allows analysis methods developed for Rn to be used across many
domains. Mapping data to vectors in Rn has several additional advantages. First, by reducing the dimension
of the data it can make the machine learning problem more tractable. Second, it allows us to exploit the
algebraic, metric and geometric structure of Rn to pose and solve machine learning problems. This includes,
for example, using the tools of linear algebra, Euclidean geometry, differential calculus, convex optimization,
and so on. Finally, a well chosen representation can reveal informative features of the data, and hence
simplify the design of a classifier.
In general, there are two forms of feature map: hand-crafted and learned. A hand-crafted feature map
is a pre-specified set of computations derived using insights from the application domain. In applications
where we have good insights, a hand-crafted feature map can provide a concise and informative representa-
tion of the data. The Spambase dataset is an example of a hand-crafted feature map.
In complex applications where human insights are less well-honed, the training data can be used to learn
a feature map. This is called representation learning. In principle, the learned features can then be used in
a variety of machine learning tasks. In applications such as image content classification, this approach has
enabled machine learning classifiers to perform on par with humans.
The use of numerical surrogates (or proxies) in place of the real data does have potential pitfalls. For
example, distinct email messages can map to the same feature vector φ(x). Hence one needs to think
carefully about the set of features being used, and any unintended consequences that may result. See [34]
for a discussion of the misuse of proxies in certain applications.
Linear Classifiers
Let’s now focus on the problem of learning a classifier based on the feature vectors of the training data.
From this point forward we drop the explicit notation for the feature map φ, and denote each element of the
training data as a pair (x_i, y_i) with x_i ∈ Rn the numerical proxy for the i-th training example, and y_i its corresponding label.
The training data specifies two point clouds in Rn: the cloud of points with label 0, {x_i : y_i = 0}, and the cloud with label 1, {x_i : y_i = 1}. Our task is to learn a decision boundary that separates these point clouds into two classes in a way that matches (to the extent possible) the point labels. Think of this decision boundary as a surface in Rn that approximately separates the two point clouds.
Figure 1.10: A hyperplane H in R2. If b > 0, the half space containing the origin is the positive half space, and the other half space is the negative half space.
A linear classifier in Rn is a binary classifier that tries to separate the two clouds of training points
using a hyperplane. Recall that a hyperplane is a set of points satisfying an affine equation of the form
wT x + b = 0, for some w ∈ Rn , and b ∈ R. The vector w and scalar b are parameters of the hyperplane.
You can think of the hyperplane as a flat n − 1 dimensional surface in Rn . In R2 a hyperplane is a line, and
in R3 it’s a two dimensional plane.
A hyperplane separates Rn into two half spaces: the positive half space with w^T x + b > 0, and the negative half space with w^T x + b < 0. A linear classifier classifies a point x ∈ Rn according to the half space in which it is contained. By changing the sign of w if necessary, we can always write the classifier in
the form
$$f(x) = \begin{cases} 1, & \text{if } w^T x > -b \quad \text{(positive half space)};\\ 0, & \text{if } w^T x \le -b \quad \text{(negative half space)}. \end{cases} \qquad (1.9)$$
Here we have included the hyperplane in the negative half space. Notice that classification reduces to
computing the sign of wT x + b.
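As a small illustration (not from the text), the sketch below implements the rule (1.9) directly: classification reduces to checking the sign of w^T x + b. The weight vector, offset, and test points are arbitrary examples.

```python
import numpy as np

def linear_classify(X, w, b):
    """Label 1 if w^T x + b > 0 (positive half space), else 0, as in (1.9)."""
    return (X @ w + b > 0).astype(int)

w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 0.0],     # w^T x + b =  1.5  -> label 1
              [0.0, 1.0],     # w^T x + b = -1.5  -> label 0
              [1.0, 1.0]])    # w^T x + b = -0.5  -> label 0
print(linear_classify(X, w, b))   # [1 0 0]
```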
To explore this further, we use some elementary linear algebra.⁴ Consider the line through the origin in the direction w. This is the set of points {αw : α ∈ R}. Each point p on this line can be specified by a unique coordinate α ∈ R with p = αw. An easy calculation shows that α = w^T p/‖w‖². The line intersects the hyperplane w^T x + b = 0 at the point q = −bw/‖w‖² with coordinate α = −b/‖w‖². These points are illustrated for a hyperplane in R2 in Figure 1.10.
For notational simplicity, assume that ‖w‖ = 1. If x ∈ Rn is orthogonally projected onto the line through the origin in the direction w, the projected point has the coordinate α(x) = w^T x. If x lies in the hyperplane, α(x) = −b; if x is in the positive half space, α(x) > −b; and if x is in the negative half space, α(x) < −b. So the positive half space projects to the half line with α > −b, and the negative half space projects to the half line with α < −b. It follows that an equivalent classification can be obtained by orthogonally projecting all points onto the line, and then using the 1-D linear classifier
$$g(w^T x) = \begin{cases} 1, & \text{if } w^T x > -b;\\ 0, & \text{if } w^T x \le -b. \end{cases}$$
⁴The linear algebra needed in the course will be revised in the early chapters.
So a linear classifier uses a linear map to reduce each multivariate example x ∈ Rn to a composite scalar
feature wT x ∈ R. It then classifies the scalar points using a simple thresholding function.
To design a linear classifier we only need to select the direction of the projection line w ∈ Rn and
the threshold −b; a total of n parameters. Over the course of time, many methods have been proposed for
selecting suitable values for w and b. A few are listed below with very brief descriptions. Each will be
discussed in greater detail in subsequent chapters.
• Linear Discriminant Analysis (LDA). This method uses the training data to fit multivariate Gaussian
densities to each class, under the assumption that the class covariance matrices are the same. So the
method computes the empirical means of each class: µ̂0 and µ̂1 , and estimates a single covariance
matrix Σ̂ for both classes. It then uses the corresponding Bayes classifier. The number of scalar
parameters estimated in the learning process is O(n2 ).
LDA has an alternative but equivalent formulation. This requires selecting w so that the projection
of the class means and the training data onto the line in direction w maximizes the ratio of squared
distance between the projected class means, and the variance of the projected points about their class
means.
• The Perceptron. The perceptron is an elementary neural network. For each input x, it computes the affine function w^T x + b and passes the result through a smooth scalar nonlinearity ψ(·) to obtain an output scalar value. The parameters w and b are selected to minimize a suitable cost function (e.g., $\sum_i (y_i - \psi(w^T x_i + b))^2$) over the training data using a gradient descent procedure. Once w and b have been determined, the classification is given by the closest label to ψ(w^T x + b). The number of scalar parameters estimated during the learning process is n + 1. (A minimal numerical sketch of this training procedure appears at the end of this list.)
• The Linear Support Vector Machine (Linear SVM). The linear SVM selects w and b by solving a
convex optimization problem. The optimization objective has two terms. These terms are balanced
using a scalar hyperparameter that needs to be selected. Roughly, one term in the objective function
seeks to position the hyperplane “equally between” the two classes of training examples, and the other
penalizes points that deviate from this objective. The number of scalar parameters estimated during
the learning process is n + 1.
[Plot: training and testing fraction correct for the classifiers GNB, LDA, NCen, kNN(3), Ptron, LReg, LSVM, AdaB, RBFSVM, and NNet.]
Figure 1.11: Training and testing performance of some multivariate spam filters using the Spambase dataset. The
vertical axis is the fraction of correctly classified emails. The horizontal axis indicates the ML classifier.
• k-Nearest Neighbor Classifier. The nearest neighbor classifier estimates a label for a new test ex-
ample x by assigning it the label of the closest training example. Notice that this classifier makes no
attempt to learn from the training data. It simply stores (memorizes) it. Hence the classifier requires
an increasing amount of memory as the number of training examples grows. To improve efficiency,
the training data is usually preprocessed and stored in a data structure that makes finding the closest
example more efficient. Nevertheless, finding the closest training example is the computational bot-
tleneck in performing a classification. The k-nearest neighbor classifier finds the k nearest neighbors and then resolves the label by a weighted voting scheme.
• Gaussian Naive Bayes (GNB). The Gaussian Naive Bayes classifier uses the training data to fit
multivariate Gaussian densities to each class, under the assumption that the covariance matrix for each
class is diagonal. This corresponds to assuming that the features are independent Gaussian random
variables. The method computes the empirical means µ̂1 and µ̂0 , and diagonal covariance matrices
Σ̂0 , Σ̂1 to fit each class. It then uses the Bayes classifier for this estimated model. In general, this
results in a quadratic decision surface. A total of 4n scalar parameters are learned during the training.
• Radial Basis Function SVM (RBF-SVM). This is a nonlinear classifier based on a kernel function and the linear SVM. Conceptually, it maps the training and test data into a higher dimensional space and then uses a linear SVM in this space. In practice, this is done seamlessly using a function known as a radial basis function kernel.
• Neural Network. A neural network is a concatenation of layers. Each layer consists of a set of
affine combinations followed by applications of a fixed scalar nonlinear function. The first layer
is the input layer where x is presented. The output of the next layer is formed by taking affine
combinations of x each followed by the same scalar nonlinear function. This layer is called the
first hidden layer. This can be repeated to form additional hidden layers. The final output layer
maps the results of the previous hidden layer to two outputs (for a binary classifier). Classification
is accomplished by selecting label 0 if the first output has the larger value, and label 1 otherwise.
The number of parameters in a neural network depends on the number of hidden layers and the widths of these layers. A network with one hidden layer of width m1 would typically have O(m1 n) scalar parameters. The values of these parameters must be learned from the training data. The number, and the sizes, of hidden layers must be decided prior to training.
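The sketch below (not the author's code) illustrates the perceptron training procedure from the list above: a logistic sigmoid is used for the nonlinearity ψ, the squared-error cost is minimized by batch gradient descent, and classification takes the closest label to ψ(w^T x + b). The learning rate, iteration count, and synthetic data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_perceptron(X, y, lr=0.5, n_iter=2000):
    """Minimize sum_i (y_i - sigmoid(w^T x_i + b))^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(n_iter):
        s = sigmoid(X @ w + b)
        grad_z = -2.0 * (y - s) * s * (1.0 - s)   # d(cost_i) / d(w^T x_i + b)
        w -= lr * (X.T @ grad_z) / n              # average gradient w.r.t. w
        b -= lr * grad_z.mean()                   # average gradient w.r.t. b
    return w, b

def classify(X, w, b):
    # closest label to psi(w^T x + b): label 1 iff sigmoid(.) > 0.5
    return (sigmoid(X @ w + b) > 0.5).astype(int)

# Two well-separated synthetic point clouds in R^2 (placeholder training data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)),
               rng.normal(+1.5, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
w, b = train_perceptron(X, y)
print("training accuracy:", (classify(X, w, b) == y).mean())
```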
Notes
For extra reading, see the tutorials by Kulkarni and Harman [25], Bousquet, Boucheron, and Lugosi [6], and
the perspective article by Mitchell [31]. The introductory sections of Duda et al. [13], Scholkopf et al [42],
Bishop [5], Murphy [32], Theodoridis [48] are also good resources.
Chapter 2
For the time being, we will think of datasets as subsets of Rn , the set of n-tuples of real numbers. Doing so
allows us to exploit the algebraic, metric and geometric structure of Rn to pose and solve machine learning
problems. It also brings in the tools of linear algebra, Euclidean geometry, differential calculus, convex
optimization, and so on.
Linear algebra provides an underlying foundation for a great deal of modern machine learning. We
assume a background in linear algebra at the level of an introductory undergraduate course. This chapter
provides a concise review of the critical elements of that material. If you are already an expert, you may
want to skim the chapter before proceeding to the next. If you are a little rusty on the material, use the
chapter to revise and refine your understanding. The chapter will also continue the introduction of our
notational conventions. An excellent notational convention is an asset, but inevitably it will have conflicts
and exceptions. Hence it is essential to learn how to distinguish meaning from context.
2.1 Vectors
Let Rn denote the set of n-tuples of real numbers. By convention, the elements of Rn are called vectors.
We denote elements of Rn by lower case roman letters, e.g., x, y, z and display the components of x ∈ Rn
either by writing x = (x_1, x_2, . . . , x_n), or x = (x(1), x(2), . . . , x(n)), where x_i, x(i) ∈ R, i ∈ [1 : n]. For
clarity, we often denote scalars (i.e., real numbers) by lower case Greek letters, e.g., α, β, γ.
There are two core operations on Rn . These are called vector addition and scalar multiplication. These
operations are defined component-wise. For vectors x, y ∈ Rn and scalars α ∈ R,
$$x + y \triangleq (x_1 + y_1, \ldots, x_n + y_n), \qquad \alpha x \triangleq (\alpha x_1, \alpha x_2, \ldots, \alpha x_n). \qquad (2.1)$$
Many useful concepts and constructions derive from these two operations.
A finite indexed set of elements in Rn is displayed as x_1, x_2, . . . , x_k, or $\{x_j\}_{j=1}^k$, and an infinite sequence as x_1, x_2, . . ., or $\{x_j\}_{j=1}^\infty$. This notation conflicts with that used for the components of an n-tuple. However, this is unlikely to cause confusion unless we simultaneously refer to the elements of x_k. In such cases, we use the alternative notation x_k = (x_k(1), x_k(2), . . . , x_k(n)).
2.2 Matrices
Recall that a real n × m matrix is a rectangular array of real numbers with n rows and m columns. We let
Rn×m denote the set of real n × m matrices and denote elements of Rn×m by upper case roman letters, e.g.,
X, Y, Z. A matrix is said to be square if it has the same number of rows and columns. The components (or
entries) of X are denoted by Xi,j , or X(i, j), for i ∈ [1 : n], j ∈ [1 : m]. The first index is the row index, and
the second is the column index. To display the entries of a matrix X ∈ Rn×m we place the corresponding
rectangular array of elements within square brackets, thus
$$X = \begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,m}\\ X_{2,1} & X_{2,2} & & X_{2,m}\\ \vdots & & \ddots & \vdots\\ X_{n,1} & X_{n,2} & \cdots & X_{n,m} \end{bmatrix}.$$
It is often convenient to display or specify a matrix by providing a formula for its i, j-element. To do so we
write X = [Xi,j ], where Xi,j is an expression for the i, j-th element of X. For example, the transpose of a
matrix X ∈ Rn×m is the m × n matrix X T with X T = [Xj,i ].
The set Rn×m has two key operations, called matrix addition and scalar multiplication. For X, Y ∈ Rn×m and α ∈ R, these operations are defined component-wise:
$$X + Y \triangleq [X_{i,j} + Y_{i,j}], \qquad \alpha X \triangleq [\alpha X_{i,j}]. \qquad (2.2)$$
So [x_1, . . . , x_n] denotes a 1 × n matrix and (x_1, . . . , x_n) denotes a vector in Rn. Similarly, a finite set of vectors $\{x_j\}_{j=1}^m \subset$ Rn can be written as a matrix X ∈ Rn×m by letting x_j be the j-th column of X, j ∈ [1 : m]. Then $x_i^T$ is the i-th row of the matrix X^T.
For X ∈ Rn×m , we let X:,j denote the j-th column of X (a column vector) and Xi,: denote its i-th row
(a row vector).
$$f(A) = \begin{bmatrix} A_{:,1}\\ \vdots\\ A_{:,m} \end{bmatrix}$$
Here f denotes the function that maps the matrix A ∈ Rn×m into the vector f(A) ∈ Rnm. It is easy to verify that this mapping respects the operations of matrix addition and scalar multiplication: f(A + B) = f(A) + f(B) and f(αA) = αf(A). A function that satisfies these properties is called a linear function. In addition, since we know n and m, the
mapping f is invertible. We can recover A from a = f (A) by simply “unstacking” the columns of a. Thus
column-wise vectorization is an isomorphism between the vector spaces Rn×m and Rnm . See Appendix A
for more details. As far as the algebraic operations of the vector spaces are concerned, it does not matter
whether we work in Rn×m or (under the isomorphism f ) in Rnm ; the results obtained will correspond under
the isomorphism. We will examine this issue again after we discuss the geometric structure of Rn and
Rn×m .
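In NumPy, for example, the column-stacking map f and its inverse correspond to reshaping in column-major (Fortran) order; the short check below (an illustration, not part of the text) also verifies the two linearity properties.

```python
import numpy as np

def vec(A):
    """Column-wise vectorization f: R^{n x m} -> R^{nm} (stack the columns)."""
    return A.reshape(-1, order="F")

A = np.arange(6.0).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]
B = np.ones((2, 3))

print(vec(A))                                             # [0. 3. 1. 4. 2. 5.]
print(np.allclose(vec(A + B), vec(A) + vec(B)))           # respects addition
print(np.allclose(vec(2.5 * A), 2.5 * vec(A)))            # respects scaling
print(np.allclose(vec(A).reshape(2, 3, order="F"), A))    # invertible: unstack columns
```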
Linear Combinations
We say that z is a linear combination of the vectors x_1, . . . , x_m using scalars α_1, . . . , α_m if $z = \sum_{j=1}^m \alpha_j x_j$. The span of a set of vectors $\{x_j\}_{j=1}^m \subset$ Rn is the set of all linear combinations of its elements:
$$\operatorname{span}\{x_1, \ldots, x_m\} \triangleq \Big\{ x : x = \sum_{j=1}^m \alpha_j x_j, \text{ for some scalars } \alpha_j,\ j \in [1:m] \Big\}.$$
Subspaces
A subspace of Rn is a subset U ⊆ Rn that is closed under linear combinations of its elements. So for any x_1, . . . , x_k ∈ U and any scalars α_1, . . . , α_k, $\sum_{j=1}^k \alpha_j x_j \in U$.
Example 2.3.1. Some examples:
(a) For any set of vectors x1 , . . . , xm ∈ Rn , U = span{x1 , . . . , xm } is a subspace of Rn .
(b) Let w ∈ Rn be nonzero, and consider the set of vectors W = {x : x = αw, α ∈ R}. This is called
the line in Rn in the direction of w. Since W = span(w), it is a subspace of Rn .
(c) U = Rn and U = {0n } are trivial subspaces of Rn (Note: 0n denotes the zero vector in Rn ).
For X ∈ Rn×m , the span of the columns of X is a subspace of Rn . This is called the range of X and
denoted by R(X). Similarly, the set of vectors a ∈ Rm with Xa = 0n is a subspace of Rm . This is called
the null space of X and denoted by N (X).
The rank of X ∈ Rn×m is the dimension of the range of X. Hence the rank of X equals the number of linearly independent columns in X. The rank of X also equals the number of linearly independent rows of X. Thus rank(X) ≤ min(n, m). The matrix X is said to be full rank if rank(X) = min(n, m).
Linear independence
A finite set of vectors {x_1, . . . , x_k} ⊂ Rn is linearly independent if for every set of scalars α_1, . . . , α_k,
$$\sum_{i=1}^k \alpha_i x_i = 0 \ \Rightarrow\ \alpha_i = 0,\ i \in [1:k].$$
Notice that a linearly independent set can’t contain the zero vector. A set of vectors which is not linearly
independent is said to be linearly dependent. The key consequence of linear independence is that every
x ∈ span{x1 , . . . , xk } has a unique representation as a linear combination of x1 , . . . , xk .
Bases
Let U be a subspace of Rn . A finite set of vectors {x1 , . . . , xk } is said to span U, or to be a spanning set
for U, if U = span{x1 , . . . , xk }. In this case, every x ∈ U has a representation as a linear combination of
{x1 , . . . , xk }. However, a spanning set may be redundant in the sense that one or more elements of the set
may be a linear combination of the remaining elements. A basis for U is a finite set of linearly independent
vectors that span U. The spanning property means that every vector in U has a representation as a linear
combination of the basis vectors, and linear independence ensures that this representation is unique. It is a
standard result that every nonzero subspace U ⊆ Rn has a basis, and every basis for U contains the same
number of vectors.
Dimension
A vector space that has a basis is said to be finite dimensional. The dimension of a finite dimensional
subspace U is the number of elements in any basis for U.
For example, it is easy to see that Rn is finite dimensional. The standard basis for Rn is the set of vectors e_j, j ∈ [1 : n], defined by
$$e_j(k) = \begin{cases} 1, & \text{if } k = j;\\ 0, & \text{otherwise.} \end{cases}$$
It is clear that if $\sum_{j=1}^n \alpha_j e_j = 0$, then α_j = 0, j ∈ [1 : n]. Thus the set is linearly independent. It is also clear that any vector in Rn can be written as a linear combination of the e_j's. Hence e_1, . . . , e_n is a basis, and Rn is finite dimensional. Thus every basis for Rn has n elements, and Rn has dimension n.
Coordinates
Let $\{b_j\}_{j=1}^n$ be a basis for Rn. The coordinates of x ∈ Rn with respect to this basis are the unique scalars $\{\alpha_j\}_{j=1}^n$ such that $x = \sum_{j=1}^n \alpha_j b_j$. Every vector uniquely determines and is uniquely determined by its coordinates. For the standard basis, the coordinates of x ∈ Rn are simply the entries of x, since $x = \sum_{j=1}^n x(j)\, e_j$.
You can think of the coordinates of x as an alternative representation of x, specified using the selected
basis. If we choose a different basis, then we obtain a distinct representation. The idea of modifying how
we represent data is important in machine learning.
Example 2.3.2. Let w ∈ Rn be a non-zero vector and consider the line in Rn in the direction of w. This
is a one dimensional subspace of Rn with basis {w}. Every point z on this line can be uniquely written as
z = αw for some α ∈ R. The scalar α is the coordinate of z with respect to the basis {w}.
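Numerically, finding the coordinates of x with respect to a basis amounts to solving a linear system whose coefficient matrix has the basis vectors as its columns. The basis and vector below are arbitrary illustrative choices, not from the text.

```python
import numpy as np

# Basis vectors of R^2 as the columns of V.
V = np.array([[1.0, 1.0],
              [0.0, 1.0]])
x = np.array([3.0, 2.0])

alpha = np.linalg.solve(V, x)      # coordinates of x w.r.t. the columns of V
print(alpha)                       # [1. 2.], since x = 1*(1,0) + 2*(1,1)
print(np.allclose(V @ alpha, x))   # reconstruct x from its coordinates
```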
Notes
We have focused on the vector space Rn , but many of the concepts and definitions have natural extensions
to the vector space of real n × m matrices. We illustrate some of these extensions in the examples and the
exercises. For additional reading see the excellent introductory book by Gilbert Strang [46], and Chapter 0
in Horn and Johnson [22]. For the more technical proofs see Horn and Johnson [22].
Exercises
Exercise 2.1. Show that for any x1 , . . . , xk ∈ Rn , span(x1 , . . . , xk ) is a subspace of Rn .
Exercise 2.2. Given fixed scalars α_i, i ∈ [1 : n], show that the set $U = \{x : \sum_{i=1}^n \alpha_i x(i) = 0\}$ is a subspace of Rn. More generally, given k sets of scalars $\{\alpha_i^{(j)}\}_{i=1}^n$, j ∈ [1 : k], show that the set $U = \{x : \sum_{i=1}^n \alpha_i^{(j)} x(i) = 0,\ j \in [1:k]\}$ is a subspace of Rn.
Exercise 2.3. Show that span{x_1, . . . , x_k} is the smallest subspace of Rn that contains the vectors x_1, . . . , x_k. By this we mean that if V is a subspace with x_1, . . . , x_k ∈ V, then span{x_1, . . . , x_k} ⊆ V.
Exercise 2.4. For subspaces U, V ⊆ Rn, let
$$U \cap V \triangleq \{x : x \in U \text{ and } x \in V\}, \qquad U + V \triangleq \{x = u + v : u \in U,\ v \in V\}.$$
Show that U ∩ V and U + V are also subspaces of Rn.
Exercise 2.5. Show that:
(a) A linearly independent set in Rn containing n vectors is a basis for Rn .
(b) A subset of Rn containing k > n vectors is linearly dependent.
(c) If U is a proper subspace of Rn, then dim(U) < n.
Exercise 2.6. Consider the set of vectors in R4 displayed as the columns of the following matrix:
$$\begin{bmatrix} u_1 & u_2 & u_3 & u_4 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 0\\ 1 & 1 & -1 & 0\\ 1 & -1 & 0 & 1\\ 1 & -1 & 0 & -1 \end{bmatrix}.$$
Show that {u_1, . . . , u_4} is a basis for R4.
Exercise 2.8. A square matrix S ∈ Rn×n is symmetric if S T = S. Show that the set of n × n symmetric matrices is
a subspace of Rn×n . What is the dimension of this subspace?
Exercise 2.9. A real n × n matrix S is antisymmetric if S T = −S. Show that the set of n × n antisymmetric matrices
is a subspace of Rn×n . What is the dimension of this subspace?
Exercise 2.10. Let Rn×n denote the set of n×n real matrices. Let S ⊂ Rn×n denote the subspace of n×n symmetric
matrices and A ⊂ Rn×n denote the subspace of n × n antisymmetric matrices. Show that Rn×n = S + A.
Chapter 3
Matrix Algebra
Notice that the number of columns in the first matrix X must equal the number of rows in the second matrix
Y . Hence for general rectangular matrices, one or both of the products XY and Y X may not exist. For
X ∈ Rn×m and Y ∈ Rm×n both XY and Y X exist, but have different sizes. When X, Y ∈ Rn×n , both
products exist and have the same size, but in general XY ≠ Y X.
When X ∈ Rn×m and y is a column vector of length m,
$$Xy = \Big[ \sum_{k=1}^m X_{i,k}\, y_k \Big] = \sum_{k=1}^m X_{:,k}\, y_k.$$
So Xy is the column vector formed by taking a linear combination of the columns of X using the entries of
y. This can be generalized as follows. For X ∈ Rn×q and Y ∈ Rq×m ,
$$XY = \begin{bmatrix} X\,Y_{:,1} & \cdots & X\,Y_{:,m} \end{bmatrix}.$$
So the j-th column of XY is the linear combination of the columns of X using the j-th column of Y .
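A quick numerical check of these two column interpretations of the matrix product (an illustration, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y = rng.normal(size=(3, 2))
y = Y[:, 0]

# Xy is the linear combination of the columns of X with coefficients y.
print(np.allclose(X @ y, sum(y[k] * X[:, k] for k in range(3))))
# The j-th column of XY is X times the j-th column of Y.
print(np.allclose((X @ Y)[:, 1], X @ Y[:, 1]))
```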
1) A(BC) = (AB)C
2) A(B + C) = AB + AC
4) (AB)^T = B^T A^T.
Proof. Exercise.
The first equation is an expression in the vector space Rn , the second is an equivalent matrix equation. This
interpretation is worth emphasizing: the matrix product Xa forms a linear combination of the columns of
X using the elements of the column vector a.
Since vectors in Rn can be regarded as n × 1 matrices, this definition also extends to vectors. For x, y ∈ Rn, (x ⊗ y)_j = x_j y_j.
Lemma 3.2.2. A ∈ Rn×n is invertible if and only if the columns of A are linearly independent.
Proof. (⇒) First assume that A is invertible. Suppose Ax = 0. So the linear combination of the columns
of A using the entries of x as coefficients is the zero vector. Then 0 = A−1 Ax = x. So x = 0. Hence the
columns of A are linearly independent.
(⇐) Now assume the columns of A are linearly independent. So {Ae1 , . . . , Aen } is a basis. Let B
be the matrix defined on this basis by B(Aej ) = ej , j ∈ [1 : n]. Then BA = I. Suppose Bx = 0.
Writing x as a linear combination of the columns of A we have x = Ay. So 0 = Bx = BAy = y.
Hence y = 0. Thus x = Ay = 0. So the columns of B are linearly independent. Then 0 = BA − I
gives 0 = BAB − B = B(AB − I). Since the columns of B are linearly independent, this implies
AB = BA = I. So B = A−1 .
Lemma 3.2.3. A ∈ Rn×n is invertible if and only if the rows of A are linearly independent.
M1 has distinct eigenvalues 1 and 2, each with a one dimensional eigenspace. So the algebraic and geometric multiplicities both equal 1. M2 has a single eigenvalue at 1 with algebraic multiplicity 2, but its eigenspace has dimension 1. So the algebraic multiplicity is strictly greater than the geometric multiplicity. M3 has an eigenvalue of 1 with algebraic multiplicity 2. Its eigenspace has dimension 2. In this case, the geometric multiplicity equals the algebraic multiplicity.
Lemma 3.3.1. Eigenvectors of A ∈ Cn×n corresponding to distinct eigenvalues are linearly indepen-
dent.
Multiplying both sides of (3.1) by A, using the eigenvalue property, and subtracting from this the result of multiplying both sides of (3.1) by λ_r, yields
$$\sum_{j=1}^{r-1} (\lambda_j - \lambda_r)\,\alpha_j x_j = 0.$$
Since the λ_j are distinct and α_j ≠ 0, all of the coefficients in this sum are nonzero. Hence, using nonzero coefficients, there is a linear combination of r − 1 eigenvectors that yields the zero vector; this is a contradiction.
It follows from Lemma 3.3.1 that if A has k distinct eigenvalues, then A has at least k linearly indepen-
dent eigenvectors. However, this is only a lower bound. Depending on the particular matrix, there can be
anywhere from k to n linearly independent eigenvectors.
The following lemma records some other useful results on matrix eigenvalues.
(c) AB and BA have the same eigenvalues with the same algebraic multiplicities.
Proof. (a) and (b) are standard results that can be found in any text on linear algebra. (c) is proved in
Exercise 3.8.
Proof. We first use det(M1 M2 ) = det(M1 ) det(M2 ) and det(M −1 ) = det(M )−1 to show that A and B
have the same characteristic polynomial:
It follows that A and B have the same eigenvalues with the same algebraic multiplicities. Let λ be an
eigenvalue and consider the eigenspace of B for λ:
Suppose A and B are similar with B = V −1 AV . Let a ∈ Cn denote the coordinate vector of x ∈ Cn
with respect to the columns of V . So x = V a. Then Ba = V −1 AV a = V −1 Ax. Hence b = Ba is the
coordinate vector of Ax with respect to the columns of V .
The action of A = V BV^{-1} on x can be separated into three steps: (1) compute the coordinates of x with respect to V, a = V^{-1}x; (2) multiply this coordinate vector by B to obtain b = Ba; (3) use b as a coordinate vector to give Ax = V b. By this decomposition you see that B is the matrix corresponding to A when we use the basis V.
3.4.1 Diagonalization
If A ∈ Cn×n has n linearly independent eigenvectors, then A is similar to a diagonal matrix, and we say that
A is diagonalizable. To see this, let v1 , . . . , vn be linearly independent eigenvectors of A with Avi = λi vi ,
i ∈ [1 : n]. Form V ∈ Cn×n with V = [v_1, v_2, . . . , v_n]. Then
$$AV = A\begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} \lambda_1 v_1 & \lambda_2 v_2 & \cdots & \lambda_n v_n \end{bmatrix} = V\Lambda,$$
where Λ ∈ Cn×n is diagonal with the corresponding eigenvalues on the diagonal. It follows that
Λ = V −1 AV and A = V ΛV −1 .
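A small numerical check of this factorization (an illustration only; the matrix below is an arbitrary example with distinct eigenvalues, so its eigenvectors are linearly independent):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])                           # distinct eigenvalues 2 and 3
lam, V = np.linalg.eig(A)                            # columns of V are eigenvectors
Lam = np.diag(lam)

print(np.allclose(A @ V, V @ Lam))                   # AV = V Lambda
print(np.allclose(A, V @ Lam @ np.linalg.inv(V)))    # A = V Lambda V^{-1}
```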
Theorem 3.5.1. S ∈ Rn×n is symmetric if and only if S has n real eigenvalues and n orthonormal
eigenvectors.
Proof. (⇒) Let Sx = λx with x ≠ 0. Then $S\bar{x} = \overline{Sx} = \bar{\lambda}\bar{x}$. Hence $\bar{x}^T S x = \bar{\lambda}\|x\|^2$ and $\bar{x}^T S x = \lambda\|x\|^2$. Subtracting these expressions and using x ≠ 0 yields λ = λ̄. Thus λ is real. It follows that x can be selected in Rn. We prove the second claim under the simplifying assumption that S has distinct eigenvalues. Let Sx_1 = λ_1 x_1 and Sx_2 = λ_2 x_2. Then $x_2^T S x_1 = \lambda_2 x_2^T x_1$ and $x_2^T S x_1 = \lambda_1 x_2^T x_1$. Subtracting these expressions and using the fact that λ_1 ≠ λ_2 yields $x_2^T x_1 = 0$. Thus x_1 ⊥ x_2. For a proof without the simplifying assumption, see Theorem 2.5.6 in Horn and Johnson [22].
(⇐) We can write S = V ΛV^T, where Λ is diagonal with the real eigenvalues of S on the diagonal and the columns of V ∈ Rn×n are n corresponding orthonormal eigenvectors of S. Here we used the fact that V^{-1} = V^T. Then S^T = (V ΛV^T)^T = V ΛV^T = S.
If we place the n orthonormal eigenvectors of S in the columns of the matrix V and place the corre-
sponding eigenvalues on the diagonal of the diagonal matrix Λ, then SV = V Λ and hence S = V ΛV T .
Corollary 3.5.1. Let P ∈ Rn×n be a symmetric matrix. Then P is positive semidefinite (respectively
positive definite) if and only if the eigenvalues of P are real and nonnegative (respectively positive).
Proof. Since P is symmetric, all of its eigenvalues are real and it has a set of n real orthonormal eigenvectors. Let x be an eigenvector of P with eigenvalue λ.
(⇒) Since P is PSD, $x^T P x = x^T \lambda x = \lambda\|x\|^2 \ge 0$. Since ‖x‖² ≠ 0, λ ≥ 0. If P is PD, then $x^T P x > 0$. Hence λ‖x‖² > 0. Since ‖x‖² ≠ 0, we must have λ > 0.
(⇐) Write P = V ΛV^T where Λ is diagonal with the nonnegative eigenvalues of P on the diagonal and the columns of V ∈ Rn×n are n corresponding orthonormal eigenvectors of P. Then for any x ∈ Rn,
$$x^T P x = x^T V \Lambda V^T x = y^T \Lambda y = \sum_i \lambda_i y_i^2 \ge 0,$$
where y = V^T x. Hence P is PSD. Now let the eigenvalues of P be positive and x ≠ 0. Then y ≠ 0, and we obtain $x^T P x = \sum_i \lambda_i y_i^2 > 0$.
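As a numerical illustration (not from the text), the sketch below builds a symmetric PSD matrix as B Bᵀ and checks the corollary: its eigenvalues, computed with a symmetric eigensolver, are nonnegative, and xᵀ P x ≥ 0 for a random x.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
P = B @ B.T                           # symmetric and PSD by construction

lam, V = np.linalg.eigh(P)            # eigh: eigendecomposition of a symmetric matrix
print(np.all(lam >= -1e-12))          # nonnegative eigenvalues  <=>  P is PSD
x = rng.normal(size=4)
print(x @ P @ x >= 0)                 # x^T P x >= 0 for any x
```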
Notes
You will find all of this material in a good introductory linear algebra textbook. See for example the book
by Gilbert Strang [46]. There are many other similar books. For the more technical proofs see Horn and
Johnson [22].
Exercises
Exercise 3.1. For an n × m real matrix X show that:
(a) R(X) = {z ∈ Rn : z = Xw, for w ∈ Rm } is a subspace of Rn .
(b) N (X) = {a ∈ Rm : Xa = 0} is a subspace of Rm .
Exercise 3.2. Let A_j ∈ R^{n_j × m} and N_j = {x ∈ Rm : A_j x = 0}, j = 1, 2. Give a similar matrix equation for the subspace N_1 ∩ N_2.
Exercise 3.3. Let Aj ∈ Rn×mj and Rj = {y ∈ Rn : y = Aj x with x ∈ Rmj }, j = 1, 2. Give a similar matrix
equation for the subspace R1 + R2 .
Exercise 3.4. A permutation matrix P ∈ Rn×n is a square matrix of the form P = [ek1 , ek2 , . . . , ekn ], where
ek1 , . . . , ekn is some ordering of the standard basis.
(a) Show that P T is a permutation matrix.
(b) Show that P T is the inverse permutation of P .
(c) Show that y = P T x is a permutation of the entries of x according to the ordering of the standard basis in the
rows of P T (= the ordering in columns of P ).
(d) Show that P T A permutes the rows of A and AP permutes the columns of A in the same way as above.
Exercise 3.5. Let x ∈ Rm and y ∈ Rn . Find the range and null space of the matrix xy T ∈ Rm×n . What is the rank
of this matrix?
Exercise 3.6. Let R denote the right rotation matrix that maps a column vector x = [x1 , x2 , . . . , xn−1 , xn ]T ∈ Rn to
column vector Rx = [xn , x1 , x2 , . . . , xn−1 ]T ∈ Rn .
So C = [h, Rh, R^2 h, . . . , R^{n−1} h], where h ∈ Rn is the first column of C and R ∈ Rn×n is the right rotation matrix.
(a) Show that the family of real n × n circulant matrices is a subspace of Rn×n .
(b) Show that I, R, . . . , Rn−1 is a basis for the above subspace. [Assume the results of previous questions.]
(c) Show that if C1 , C2 are circulant matrices, so is the product C1 C2 .
(d) Show that all circulant matrices commute.
Exercise 3.8. Let A, B ∈ Cn×n . It always holds that trace(AB) = trace(BA). Hence the sum of eigenvalues of
AB is the same as the sum of eigenvalues of BA. We now show the stronger result that AB and BA have the same
characteristic polynomial and hence the same eigenvalues with the same algebraic multiplicities.
(a) Show that the following 2n × 2n block matrices are similar:
$$M_1 = \begin{bmatrix} AB & 0\\ B & 0 \end{bmatrix}, \qquad M_2 = \begin{bmatrix} 0 & 0\\ B & BA \end{bmatrix}. \qquad \text{Hint: consider } \begin{bmatrix} I_n & A\\ 0 & I_n \end{bmatrix}.$$
(b) Show that the characteristic polynomial of M_1 is $s^n p_{AB}(s)$ and of M_2 is $s^n p_{BA}(s)$.
(c) Use (a) and (b) to show that AB and BA have the same characteristic polynomial. Conclude that AB and BA have the same eigenvalues with the same algebraic multiplicities.
(d) Now prove a corresponding result for A ∈ Rm×n and B ∈ Rn×m . Without loss of generality, consider m ≤ n.
Exercise 3.9. Let P, Q ∈ Rn×n be symmetric with P PSD and Q PD. Clearly P and Q have nonnegative eigenvalues.
Show that QP may not be symmetric, but QP always has nonnegative eigenvalues.
Chapter 4
We now consider the Euclidean geometry of Rn . This geometry brings in the important concepts of length,
distance, angle, orthogonality, and orthogonal projection. Although we focus on Rn , the main concepts also
apply to general finite-dimensional inner product spaces.
We can equivalently write the inner product as a matrix product: <x, y> = xT y. The following lemma
indicates that this function satisfies the basic properties required of an inner product.
Proof. These claims follow from the definition of the inner product via simple algebra.
[Figure 4.1: the vectors x, y, and x + y, with lengths ‖x‖, ‖y‖, and ‖x + y‖, illustrating the triangle inequality.]
Proof. Items (1) and (2) easily follow from the definition of the norm. Item (3) can be proved using the
Cauchy-Schwarz inequality. This proof is left as an exercise.
The triangle inequality is illustrated in Figure 4.1. The norm ‖x‖ measures the "length" or "size" of the vector x. Equivalently, ‖x‖ is the distance between 0 and x, and ‖x − y‖ is the distance between x and y. If ‖x‖ = 1, x is called a unit vector, or a unit direction. The set {x : ‖x‖ = 1} of all unit vectors is called the unit sphere.
The Euclidean inner product and norm satisfy the Cauchy-Schwarz inequality.
Lemma 4.1.3 (Cauchy-Schwarz Inequality). For all x, y ∈ Rn, |<x, y>| ≤ ‖x‖ ‖y‖.
Proof. Using the definition of the norm and properties of the inner product we have:
$$\Big\|\sum_{j=1}^k x_j\Big\|^2 = \Big\langle \sum_{i=1}^k x_i,\ \sum_{j=1}^k x_j \Big\rangle = \sum_{i=1}^k \sum_{j=1}^k \langle x_i, x_j\rangle = \sum_{j=1}^k \langle x_j, x_j\rangle = \sum_{j=1}^k \|x_j\|^2.$$
A set of vectors $\{x_i\}_{i=1}^k$ in Rn is orthonormal if it is orthogonal and every vector has unit norm: ‖x_i‖ = 1, i ∈ [1 : k].
Proof. Let $\{x_i\}_{i=1}^k$ be an orthonormal set and suppose that $\sum_{i=1}^k \alpha_i x_i = 0$. Then for each x_j we have
$$0 = \Big\langle \sum_{i=1}^k \alpha_i x_i,\ x_j \Big\rangle = \alpha_j.$$
An orthonormal basis for Rn is a basis of n orthonormal vectors. Since an orthonormal set is always linearly independent, any set of n orthonormal vectors is an orthonormal basis for Rn. A nice property of orthonormal bases is that it is easy to find the coordinates of any vector x with respect to the basis. To see this, let $\{x_i\}_{i=1}^n$ be an orthonormal basis and $x = \sum_{i=1}^n \alpha_i x_i$. Then
$$\langle x, x_j\rangle = \Big\langle \sum_i \alpha_i x_i,\ x_j \Big\rangle = \sum_i \alpha_i \langle x_i, x_j\rangle = \alpha_j.$$
So the coordinate of x with respect to the basis element xj is simply αj = <x, xj >, j ∈ [1 : n].
Example 4.1.1. The Hadamard basis is an orthonormal basis in Rn with n = 2p. It can be defined recursively as the columns of the Hadamard matrix $H_p^a$, where
$$H_0^a = 1 \quad\text{and}\quad H_p^a = \frac{1}{\sqrt{2}}\begin{bmatrix} H_{p-1}^a & H_{p-1}^a \\ H_{p-1}^a & -H_{p-1}^a \end{bmatrix}.$$
For example, in R2, R4, and R8, the Hadamard basis is given by the columns of the matrices:
$$H_1^a = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1\\ 1 & -1\end{bmatrix},\qquad
H_2^a = \frac{1}{\sqrt{4}}\begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & -1 & 1 & -1\\ 1 & 1 & -1 & -1\\ 1 & -1 & -1 & 1\end{bmatrix},\qquad
H_3^a = \frac{1}{\sqrt{8}}\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\ 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1\\ 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1\\ 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1\\ 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1\\ 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1\\ 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1\\ 1 & -1 & -1 & 1 & -1 & 1 & 1 & -1\end{bmatrix}.$$
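As a quick numerical check (a sketch we add, not part of the original text), the recursion above is easy to implement in NumPy; the function name hadamard_basis is ours.

```python
import numpy as np

def hadamard_basis(p):
    """Normalized Hadamard matrix H_p^a built from the recursion in Example 4.1.1."""
    H = np.array([[1.0]])                      # H_0^a = 1
    for _ in range(p):
        H = np.block([[H, H], [H, -H]]) / np.sqrt(2)
    return H

H3 = hadamard_basis(3)
print(np.allclose(H3.T @ H3, np.eye(8)))       # columns are orthonormal: True
print(np.allclose(H3, H3.T))                   # H_p^a is symmetric (cf. Exercise 4.17): True
```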
Example 4.1.2. The Haar basis is an orthonormal basis in Rn with n = 2p. The elements of the basis can
be arranged into groups. The first group consists of one vector: $\frac{1}{\sqrt{2^p}}\mathbf{1}$. The next group also has one vector:
the entries in the first half of the vector are $\frac{1}{\sqrt{2^p}}$ and those in the second half are $-\frac{1}{\sqrt{2^p}}$. Subsequent groups are derived
by subsampling by 2, scaling by $\sqrt{2}$, and translating. We illustrate the procedure below for p = 1, 2, 3. In
each case, the Haar basis consists of the columns of the Haar matrix $H_p$:
$$H_1 = \begin{bmatrix}\tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}}\\[2pt] \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}}\end{bmatrix},\qquad
H_2 = \begin{bmatrix}\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{\sqrt{2}} & 0\\[2pt] \tfrac{1}{2} & \tfrac{1}{2} & -\tfrac{1}{\sqrt{2}} & 0\\[2pt] \tfrac{1}{2} & -\tfrac{1}{2} & 0 & \tfrac{1}{\sqrt{2}}\\[2pt] \tfrac{1}{2} & -\tfrac{1}{2} & 0 & -\tfrac{1}{\sqrt{2}}\end{bmatrix},\qquad
H_3 = \begin{bmatrix}\tfrac{1}{\sqrt{8}} & \tfrac{1}{\sqrt{8}} & \tfrac{1}{2} & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0 & 0\\[2pt] \tfrac{1}{\sqrt{8}} & \tfrac{1}{\sqrt{8}} & \tfrac{1}{2} & 0 & -\tfrac{1}{\sqrt{2}} & 0 & 0 & 0\\[2pt] \tfrac{1}{\sqrt{8}} & \tfrac{1}{\sqrt{8}} & -\tfrac{1}{2} & 0 & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0\\[2pt] \tfrac{1}{\sqrt{8}} & \tfrac{1}{\sqrt{8}} & -\tfrac{1}{2} & 0 & 0 & -\tfrac{1}{\sqrt{2}} & 0 & 0\\[2pt] \tfrac{1}{\sqrt{8}} & -\tfrac{1}{\sqrt{8}} & 0 & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{\sqrt{2}} & 0\\[2pt] \tfrac{1}{\sqrt{8}} & -\tfrac{1}{\sqrt{8}} & 0 & \tfrac{1}{2} & 0 & 0 & -\tfrac{1}{\sqrt{2}} & 0\\[2pt] \tfrac{1}{\sqrt{8}} & -\tfrac{1}{\sqrt{8}} & 0 & -\tfrac{1}{2} & 0 & 0 & 0 & \tfrac{1}{\sqrt{2}}\\[2pt] \tfrac{1}{\sqrt{8}} & -\tfrac{1}{\sqrt{8}} & 0 & -\tfrac{1}{2} & 0 & 0 & 0 & -\tfrac{1}{\sqrt{2}}\end{bmatrix}.$$
Lemma 4.1.5. If Q ∈ On , then for each x, y ∈ Rn , <Qx, Qy> = <x, y> and kQxk = kxk.
Proof. <Qx, Qy> = xT QT Qy = xT y = <x, y>, and kQxk2 = <Qx, Qx> = <x, x> = kxk2 .
Let On denote the set of n × n orthogonal matrices. We show below that the set On forms a (noncom-
mutative) group under matrix multiplication. Hence On is called the n × n orthogonal group.
Lemma 4.1.6. The set On contains the identity matrix In , and is closed under matrix multiplication
and matrix inverse.
The following lemma gives a useful alternative expression for <A, B>.
$$\begin{aligned}\min_{z\in\mathbb{R}^n}\ & \tfrac{1}{2}\|x - z\|^2\\ \text{s.t. }\ & z \in \operatorname{span}\{u\}.\end{aligned}\tag{4.3}$$
The subspace span{u} is a line through the origin in the direction u, and we seek the point z on this line
that is closest to x. Every point z on the line has a unique coordinate α ∈ R with z = αu. Hence we can
equivalently solve the unconstrained problem
$$\alpha^\star = \arg\min_{\alpha\in\mathbb{R}}\ \tfrac{1}{2}\|x - \alpha u\|^2.$$
Using the definition of the norm and the properties of the inner product,
$$\tfrac{1}{2}\|x - z\|^2 = \tfrac{1}{2}\langle x - z, x - z\rangle = \tfrac{1}{2}\|x\|^2 - \alpha\langle u, x\rangle + \tfrac{1}{2}\alpha^2\|u\|^2.$$
This is a strictly convex quadratic function of the scalar α. Hence there is a unique value of α? that minimizes
the objective. Setting the derivative of the above expression w.r.t. α equal to zero gives the unique solution
α? = <u, x>. Hence the closest point to x on the line span{u} is
x̂ = <u, x>u.
The associated error vector rx = x − x̂ is called the residual. We claim that the residual is orthogonal to u
and hence to the subspace span{u}. To see this note that
$$\langle u, r_x\rangle = \langle u,\ x - \langle u, x\rangle u\rangle = \langle u, x\rangle - \langle u, x\rangle\|u\|^2 = 0,$$
since u is a unit vector.
Thus x̂ is the unique orthogonal projection of x onto the line span{u}, and by Pythagoras we have kxk2 =
kx̂k2 + krx k2 . This result is illustrated on the left in Figure 4.2.
Figure 4.2: Left: Orthogonal projection of x onto a line through zero. Right: Orthogonal projection of x onto the subspace U.
We can also write the solution using matrix notation. Noting that <u, x> = uT x, we have
x̂ = (uuT )x = P x
rx = (I − uuT )x = (I − P )x,
with P = uuT . So for fixed u, both x̂ and rx are linear functions of x. As one might expect, these linear
functions have special properties. For example, since x̂ ∈ span{u}, the projection of x̂ onto span{u} must
be x̂. Hence P 2 = P . Such a matrix is said to be idempotent. This property is easily checked using the
formula P = uuT . We have P 2 = (uuT )(uuT ) = uuT = P . We also note that P = uuT is symmetric.
Hence P is both symmetric (P T = P ) and idempotent (P 2 = P ). A matrix with these two properties is
called a projection matrix.
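The following NumPy snippet (a sketch we add for illustration) checks these properties for a random unit direction u.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5)
u /= np.linalg.norm(u)                   # unit direction

P = np.outer(u, u)                       # P = u u^T
x = rng.standard_normal(5)
x_hat, r = P @ x, x - P @ x              # projection and residual

print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.isclose(u @ r, 0.0))                       # residual orthogonal to u
print(np.isclose(x @ x, x_hat @ x_hat + r @ r))     # Pythagoras
```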
$$\begin{aligned}\min_{z\in\mathbb{R}^n}\ & \tfrac{1}{2}\|x - z\|^2\\ \text{s.t. }\ & z \in \mathcal{U}.\end{aligned}\tag{4.4}$$
By uniquely writing $z = \sum_{j=1}^{k}\alpha_j u_j$, we can equivalently solve the unconstrained problem:
$$\min_{\alpha_1,\ldots,\alpha_k}\ \tfrac{1}{2}\Big\|x - \sum_{j=1}^{k}\alpha_j u_j\Big\|^2.$$
Using the definition of the norm and the properties of the inner product, the objective function can be
expanded to:
$$\tfrac{1}{2}\|x - z\|^2 = \tfrac{1}{2}\langle x - z, x - z\rangle = \tfrac{1}{2}\|x\|^2 - \langle z, x\rangle + \tfrac{1}{2}\|z\|^2 = \tfrac{1}{2}\|x\|^2 - \sum_{j=1}^{k}\alpha_j\langle u_j, x\rangle + \tfrac{1}{2}\sum_{j=1}^{k}\alpha_j^2.$$
In the last equality we used Pythagoras to write $\|z\|^2 = \sum_{j=1}^{k}\alpha_j^2$. Taking the derivative with respect to $\alpha_j$ and
setting this equal to zero yields the unique solution
$$\alpha_j^\star = \langle u_j, x\rangle,\quad j\in[1:k],\qquad\text{so that}\qquad \hat{x} = \sum_{j=1}^{k}\langle u_j, x\rangle\, u_j.\tag{4.5}$$
Moreover, the residual rx = x − x̂ is orthogonal to every uj and hence to the subspace U = span({ui }ki=1 ).
To see this compute
<uj , rx > = <uj , x − x̂> = <uj , x> − <uj , x̂> = <uj , x> − <uj , x> = 0.
Thus x̂ is the unique orthogonal projection of x onto U, and by Pythagoras, kxk2 = kx̂k2 + krx k2 . This
property is illustrated on the right in Figure 4.2. From (4.5), notice that x̂ and the residual rx = x − x̂ are
linear functions of x.
We can also write these results as matrix equations. First let U ∈ Rn×k be the matrix with columns
u1 , . . . , uk . From (4.5) we have
$$\hat{x} = \sum_{j=1}^{k} u_j u_j^T\, x = \Big(\sum_{j=1}^{k} u_j u_j^T\Big)x = P x,$$
with $P = \sum_{j=1}^{k} u_j u_j^T = U U^T$. Hence,
x̂ = U U T x
rx = (I − U U T )x.
This confirms that x̂ and rx are linear functions of x and that P is symmetric and idempotent.
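A short NumPy sketch (ours) illustrating projection onto a k-dimensional subspace, with an orthonormal basis obtained from a QR factorization.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 3
U, _ = np.linalg.qr(rng.standard_normal((n, k)))   # n x k matrix with orthonormal columns

P = U @ U.T                                        # projection onto the span of the columns of U
x = rng.standard_normal(n)
x_hat, r = P @ x, (np.eye(n) - P) @ x

print(np.allclose(P @ P, P), np.allclose(P, P.T))  # symmetric and idempotent
print(np.allclose(U.T @ r, np.zeros(k)))           # residual orthogonal to every u_j
print(np.isclose(x @ x, x_hat @ x_hat + r @ r))    # Pythagoras
```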
It is clear that {0}⊥ = Rn and (Rn )⊥ = {0}. For a nonzero vector u, span{u}⊥ is the subspace
of Rn with normal u. When U = span{u}, we write U ⊥ as simply u⊥ .
Given a subspace U in Rn and x ∈ Rn , the projection x̂ of x onto U lies in U, the residual rx lies in U ⊥ ,
and x = x̂ + rx . Because U and U ⊥ are orthogonal, this representation is unique.
Lemma 4.4.2. Every x ∈ Rn has a unique representation in the form x = u + v with u ∈ U and
v ∈ U ⊥.
Proof. By the properties of orthogonal projection, x = x̂ + rx with x̂ ∈ U and rx ∈ U ⊥ . This gives one
decomposition of the required form. Suppose there are two decompositions of this form: x = ui + vi , with
ui ∈ U and vi ∈ U ⊥ , i = 1, 2. Subtracting the expressions gives (u1 − u2 ) = −(v1 − v2 ). Now u1 − u2 ∈ U
and v1 − v2 ∈ U ⊥ , and since U ∩ U ⊥ = 0 (Lemma 4.4.1), we must have u1 = u2 and v1 = v2 .
It follows from Lemma 4.4.2 that U + U ⊥ = Rn . This simply states that every vector in Rn is the sum
of some vector in U and some vector in U ⊥ . Because this representation is also unique, this is sometimes
written as Rn = U ⊕ U ⊥ , and we say that Rn is the direct sum of U and U ⊥ . Exercise 4.26 covers several
additional properties of the orthogonal complement.
Theorem 4.4.1. Let A ∈ Rn×m have null space N (A) and range R(A). Then N (A)⊥ = R(AT ).
4.5 Norms on Rn
The Euclidean norm is one of many norms on Rn . Each of the following functions is also a norm on Rn :
(b) $\|x\|_p = \big(\sum_{j=1}^{n} |x(j)|^p\big)^{1/p}$ for an integer p ≥ 1. This is called the p-norm.
Note that the 1-norm and the 2-norm are instances of the p-norm with p = 1 and p = 2, respectively.
Notes
We have only given a brief outline of the geometric structure of Rn . For more details see the relevant
sections in Chapter 2 of Strang [46].
Exercises
Exercise 4.1. The mean of a vector x ∈ Rn is the scalar $m_x = (1/n)\sum_{i=1}^{n} x(i)$. Show that the set of all vectors in
Rn with mean 0 is a subspace U0 ⊂ Rn of dimension n − 1. Show that all vectors in U0 are orthogonal to 1n ∈ Rn.
Exercise 4.2. Given x, y ∈ Rn find the closest point to x on the line through 0 in the direction of y.
Exercise 4.3. Let P ∈ Rn×n be the matrix for orthogonal projection of Rn onto the subspace U of Rn . Show that P
is symmetric, idempotent, PSD, and that trace(P ) = dim(U).
Exercise 4.4. Let P ∈ Rn×n be the matrix for orthogonal projection of Rn onto the subspace U of Rn . Show that
I − P is symmetric, idempotent, PSD, and that trace(I − P ) = n − dim(U).
Exercise 4.5. Prove or disprove: for any subspace U ⊂ Rn and each x ∈ Rn there exists a point u ∈ U such that
(x − u) ⊥ u.
Exercise 4.6. Prove or disprove: If u, v ∈ Rn are orthogonal, then ku − vk2 = kuk2 + kvk2 .
Exercise 4.7. Prove or disprove: For C ∈ Rn×n , if for each x ∈ Rn , xT Cx = 0, then C = 0.
Exercise 4.8. The correlation of x1, x2 ∈ Rn is the scalar
$$\rho(x_1, x_2) = \frac{\langle x_1, x_2\rangle}{\|x_1\|\,\|x_2\|}.$$
For given x1 , what vectors x2 maximize the correlation? What vectors x2 minimize the correlation? Show that
ρ(x1 , x2 ) ∈ [−1, 1] and is zero precisely when the vectors are orthogonal.
Basic Properties
Exercise 4.9. Prove that the properties in Lemma 4.1.1 hold for the inner product <x, y> = xT y in Rn .
Exercise 4.10. Use the definition of the inner product and its properties listed in Lemma 4.1.1, together with the
definition of the norm, to prove the Cauchy-Schwarz Inequality (Lemma 4.1.3).
(a) First let x, y ∈ Cn with kxk = kyk = 1.
(1) Set x̂ = <x, y>y and rx = x − x̂. Show that <rx , y> = <rx , x̂> = 0.
(2) Show that krx k2 = 1 − <x̂, x>.
(3) Use the previous result and the definition of x̂ to show that |<x, y>| ≤ 1.
Exercise 4.11. Prove that for x ∈ Rn the function (xT x)1/2 satisfies the properties of a norm listed in Lemma 4.1.2.
Hint: For the triangle inequality, use the properties of the inner product to expand ‖x + y‖² and apply the
Cauchy-Schwarz inequality.
Exercise 4.12. Show that the Euclidean norm in Cn is:
(a) permutation invariant: if y is a permutation of the entries of x, then ‖y‖ = ‖x‖.
(b) an absolute norm: if y = |x| component-wise, then kyk = kxk.
Exercise 4.13. Let X , Y be inner product spaces over the same field F with F = R, or F = C. A linear isometry
from X to Y is a linear function D : X → Y that preserves distances: (∀x ∈ X ) kD(x)k = kxk. Show that a linear
isometry between inner product spaces also preserves inner products.
(a) First examine kD(x + y)k2 and conclude that Re(<Dx, Dy>) = Re(<x, y>).
(b) Now examine kD(x + iy)k2 where i is the imaginary unit.
Orthonormal Bases in Rn
Exercise 4.14. For p = 1, 2, 3, show that the Haar basis in $\mathbb{R}^{2^p}$ is an orthonormal basis (see Example 4.1.2).
Exercise 4.15. Show that the general Haar basis is orthonormal (see Example 4.1.2). One means to do so is to show
that Hp can be specified recursively via
$$P_p H_p = \frac{1}{\sqrt{2}}\begin{bmatrix} H_{p-1} & I_{2^{p-1}}\\ H_{p-1} & -I_{2^{p-1}}\end{bmatrix},$$
where $P_p$ is an appropriate $2^p \times 2^p$ permutation matrix.
Exercise 4.16. Show that for p = 1, 2, 3, the Hadamard basis is orthonormal (see Example 4.1.1).
Exercise 4.17. Show that the Hadamard basis is orthonormal and that the Hadamard matrix Hpa is symmetric (see
Example 4.1.1). [Hint: the definition is recursive].
Exercise 4.18. Let u1 , . . . , uk ∈ Rn be an ON set spanning a subspace U and let v ∈ Rn with v ∉ U. Find a point ŷ
on the linear manifold M = {x : x − v ∈ U} that is closest to a given point y ∈ Rn . [Hint: transform the problem to
one that you know how to solve.]
Exercise 4.19. A Householder transformation on Rn is a linear transformation that reflects each point x in Rn about
a given n − 1 dimensional subspace U specified by giving its unit normal u ∈ Rn . To reflect x about U we want to
move it orthogonally through the subspace to the point on the opposite side that is equidistant from the subspace.
(a) Given U = u⊥ = {x : uT x = 0}, find the required Householder matrix.
(b) Show that a Householder matrix H is symmetric, orthogonal, and its own inverse.
Exercise 4.20. Let H ∈ Rn×n be the matrix of a Householder transformation reflecting about a subspace with unit
normal u ∈ Rn .
(a) Show that H has n − 1 eigenvalues at 1, and one eigenvalue at −1. Find an eigenvector for the eigenvalue −1.
(b) Show that H has a complete set of ON eigenvectors.
Exercise 4.21. Let P ∈ Rn×n be a projection matrix with R(P ) = U.
(a) Show that the distinct eigenvalues of P are {0, 1} with the eigenvalue 1 having k = dim(U) linearly indepen-
dent eigenvectors, and the eigenvalue 0 having n − k linearly independent eigenvectors.
(b) Show that P has n orthonormal eigenvectors.
(c) Show that trace(P ) = dim(U).
Orthogonal Matrices
Exercise 4.22. Show that an orthogonal matrix Q ∈ On has all of its entries in the interval [−1, 1].
Exercise 4.23. Let Pn denote the set of n × n permutation matrices. Show that Pn is a (noncommutative) group under
matrix multiplication. Show that every permutation matrix is an orthogonal matrix. Hence Pn is a subgroup of On .
Exercise 4.24. A subspace U ⊆ Rn of dimension d ≤ n can be represented by a matrix U = [u1 , . . . , ud ] ∈ Rn×d
with U T U = Id . The columns of U form an orthonormal basis for U. However, this representation is not unique
since there are infinitely many orthonormal bases for U. Show that for any two such representations U1 , U2 for U there
exists a d × d orthogonal matrix Q with U2 = U1 Q.
Orthogonal Complement
Exercise 4.25. Prove Lemma 4.4.1.
Exercise 4.26. Let X be an inner product space of dimension n, and U, V be subspaces of X . Prove each of the
following:
(a) U ⊆ V implies V ⊥ ⊆ U ⊥ .
(b) (U ⊥ )⊥ = U.
(c) (U + V)⊥ = U ⊥ ∩ V ⊥
(d) (U ∩ V)⊥ = U ⊥ + V ⊥
(e) If dim(U) = k, then dim(U ⊥ ) = n − k
Chapter 5
This chapter reviews the Singular Value Decomposition (SVD) of a rectangular matrix. The SVD extends the
idea of an eigendecomposition of a square matrix to non-square matrices. It has specific applications in data
analysis, dimensionality reduction (PCA), low-rank matrix approximation, and some forms of regression.
We first present and interpret the main SVD result in what is called the compact form. Then we introduce
an alternative version known as the full SVD. After these discussions, we examine the relationship of the
SVD to several norms, then use the SVD to examine the issue of a “best” rank k approximation to a given
matrix. Finally, we turn our attention to the ideas and constructions that form the foundation of the SVD.
The factorization (5.1) of A is called a compact singular value decomposition (compact SVD) of A.
The positive scalars σj are called the singular values of A. We also write σj (A) to indicate the j-th singular
value of A. The r orthonormal columns of U are called the left or output singular vectors of A, and
the r orthonormal columns of V are called the right or input singular vectors of A. The compact SVD
decomposition is illustrated in Figure 5.1.
Figure 5.2: A visualization of the three operational steps in the compact SVD. The projection of x ∈ Rn onto N (A)⊥
is represented in terms of the basis v1 , v2 . Here x = α1 v1 + α2 v2 . The singular values scale these coordinates. Then
the scaled coordinates are transferred to the output space Rm and used to form y = Ax as the linear combination
y = σ1 α1 u1 + σ2 α2 u2 .
When r ≪ m, n, the SVD provides a more concise representation of A. In place of the mn entries of A,
the SVD uses the (m+n+1)r parameters required to specify U , V and the r singular values. The conditions
U T U = Ir and V T V = Ir indicate that U and V have orthonormal columns. In general, U U T ≠ Im and
V V T ≠ In because U and V need not be square matrices. The theorem does not claim that U and V are
unique. We discuss this issue later in the chapter.
Corollary 5.1.1. The matrices U and V in the compact SVD have the following additional properties:
(a) The columns of U form an orthonormal basis for R(A).
(b) The columns of V form an orthonormal basis for N (A)⊥ = R(AT ).
(c) The rank one matrices uj vjT , j ∈ [1 : r], form an orthonormal set in Rm×n .
Proof. (a) Writing Ax = U (ΣV T x) shows that Ax ∈ R(U ) and hence that R(A) ⊆ R(U ). Let u ∈ R(U ).
Then for some z ∈ Rr , u = U z = U ΣV T V (Σ−1 z) = A(V Σ−1 z). Hence R(U ) ⊆ R(A).
(b) By taking transposes and using part (a), the columns of V form an ON basis for the range of AT . Using
N (A)⊥ = R(AT ) yields the desired result.
(c) Taking the inner product of uk vkT and uj vjT and using the property that trace(AB) = trace(BA)
whenever both products are defined, yields
$$\langle u_k v_k^T,\ u_j v_j^T\rangle = \operatorname{trace}(v_k u_k^T u_j v_j^T) = \operatorname{trace}(u_k^T u_j\, v_j^T v_k) = \begin{cases}0, & j \neq k;\\ 1, & j = k.\end{cases}$$
The above observations lead to the following operational interpretation of the SVD. Since the columns
of V form an orthonormal basis for N (A)⊥ , the orthogonal projection of x ∈ Rn onto N (A)⊥
is x̂ = V V T x. Hence V T x gives the coordinates of x̂ with respect to V . These r coordinates are then
individually scaled using the r diagonal entries of Σ. Finally, we synthesize the output vector by using the
scaled coordinates and the ON basis U for R(A): y = U (ΣV T x). So the SVD has three steps: (1) An
Figure 5.3: A visualization of the action of A on the unit sphere in Rn in terms of its SVD.
analysis step: V T x, (2) A scaling step: Σ(V T x), and (3) a synthesis step: U (ΣV T x). In particular, for
x = vk , y = Ax = σk uk , k ∈ [1 : r]. So the r ON basis vectors for N (A)⊥ are mapped to scaled versions of
the corresponding ON basis vectors for R(A). These steps are illustrated in Fig. 5.2. Notice that restricted
to N (A)⊥ , the map A : N (A)⊥ → R(A) is one-to-one and onto and hence invertible.
Finally, we note that the SVD is selecting an orthonormal set of rank one matrices $\{u_j v_j^T\}_{j=1}^{r}$ specifically adapted to A, and expressing A as a positive linear combination of this set: $A = \sum_{j=1}^{r}\sigma_j(A)\, u_j v_j^T$.
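The following NumPy sketch (ours, using numpy.linalg.svd) illustrates the compact SVD as a sum of rank one terms and the analysis, scaling, and synthesis steps.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # a 5 x 4 matrix of rank 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # "thin" SVD; keep only positive singular values
r = int(np.sum(s > 1e-12))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# A as a positive combination of the orthonormal rank-one matrices u_j v_j^T
A_sum = sum(s[j] * np.outer(U[:, j], Vt[j]) for j in range(r))
print(np.allclose(A, A_sum))                       # True

# three-step action on a vector: analysis, scaling, synthesis
x = rng.standard_normal(4)
y = U @ (np.diag(s) @ (Vt @ x))
print(np.allclose(y, A @ x))                       # True
```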
We then have a full SVD factorization A = U ΣV T . The full SVD is illustrated in Fig. 5.4. The utility of
the full SVD derives from U and V being orthogonal (hence invertible) matrices. We note from the above
construction that in general the full SVD is not unique.
If P is a symmetric positive semidefinite matrix, a full SVD of P is simply an eigendecomposition of P :
U ΣV T = QΣQT , where Q is the orthogonal matrix of eigenvectors of P . In this sense, the SVD extends
the eigendecomposition by using different orthonormal sets of vectors in the input and output spaces.
The following theorem is sometimes useful. It says that to maximize the inner product of A and B by
modifying the left and right singular vectors of B, make B’s left and right singular vectors the same as the
corresponding singular vectors for A.
Corollary 5.2.1. Let A, B ∈ Rm×n , with full SVDs A = UA ΣA VAT and B = UB ΣB VBT . The
maximum value of <A, QBR> over Q ∈ Om and R ∈ On is <ΣA , ΣB >. This is attained by setting
Q = UA UBT and R = VB VAT .
Proof. Write
$$\langle A, QBR\rangle = \operatorname{trace}(A^T QBR) = \operatorname{trace}\big(\Sigma_A^T (U_A^T Q U_B)\,\Sigma_B\,(V_B^T R V_A)\big) = \langle \Sigma_A,\ S\,\Sigma_B\, T\rangle,$$
where S = (UAT QUB ) and T = (VBT RVA ) are orthogonal matrices. The result now follows by Theorem 5.2.1.
Corollary 5.2.2. Let A, B ∈ Rm×n , have full SVD factorizations A = U ΣA V T and B = QΣB RT .
Then kA − Bk2F ≥ kΣA − ΣB k2F .
Proof. Exercise.
Proof. The SVD expresses A as a positive linear combination of the orthonormal, rank one matrices $u_j v_j^T$,
$j\in[1:r]$: $A = \sum_{j=1}^{r}\sigma_j(A)\, u_j v_j^T$. Applying Pythagoras' theorem we have
$$\|A\|_F^2 = \Big\|\sum_{j=1}^{r}\sigma_j u_j v_j^T\Big\|_F^2 = \sum_{j=1}^{r}\sigma_j^2\,\|u_j v_j^T\|_F^2 = \sum_{j=1}^{r}\sigma_j^2.$$
$$\|A\|_2 \stackrel{\Delta}{=} \max_{x\neq 0}\frac{\|Ax\|_2}{\|x\|_2} = \max_{\|x\|_2 = 1}\|Ax\|_2.\tag{5.4}$$
It is easy to check that the induced norm is indeed a norm on Rm×n , i.e., it satisfies the properties of a norm
listed in Lemma 4.1.2. We say that it is the matrix norm induced by the Euclidean norms on Rn and Rm .
Because of the following connection with eigenvalues, the induced matrix 2-norm is also called the
spectral norm.
Lemma 5.3.2. For $A\in\mathbb{R}^{m\times n}$, $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}$.
Proof. We want to maximize the expression $\|Ax\|_2^2 = x^T(A^T A)x$ over unit vectors x ∈ Rn. This is a
simple Rayleigh quotient problem (see Appendix D). To maximize the expression, select x to be a unit norm
eigenvector of AT A with maximum eigenvalue.
Proof. Let A have compact SVD U ΣV T . Using Lemma 5.3.2, kAk22 = λmax (AT A) = λmax (V Σ2 V T ) =
σ12 (A). Hence the induced 2-norm of A equals the maximum singular value of A.
We see from the proof of the above lemma that the input direction with the most gain is v1 , this appears
in the output in the direction u1 , and the gain is σ1 : Av1 = σ1 u1 . This is visualized in Figure 5.3.
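A quick numerical illustration (ours) of Lemma 5.3.2 and of the fact that no unit vector achieves a gain larger than σ1.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))

sigma1 = np.linalg.svd(A, compute_uv=False)[0]     # largest singular value
lam_max = np.linalg.eigvalsh(A.T @ A).max()        # largest eigenvalue of A^T A

# brute-force the induced 2-norm over many random unit vectors
xs = rng.standard_normal((6, 20000))
xs /= np.linalg.norm(xs, axis=0)
gain = np.linalg.norm(A @ xs, axis=0).max()

print(np.isclose(sigma1, np.sqrt(lam_max)))        # Lemma 5.3.2
print(gain <= sigma1 + 1e-9, gain / sigma1)        # random search never exceeds sigma1
```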
$$\begin{aligned}\min_{B\in\mathbb{R}^{m\times n}}\ & \|A - B\|_F^2\\ \text{s.t. }\ & \operatorname{rank}(B) = k.\end{aligned}\tag{5.5}$$
If à solves (5.5), we call à a best rank k approximation to A under the Frobenius norm.
Let A have compact SVD $A = U\Sigma V^T = \sum_{i=1}^{r}\sigma_i u_i v_i^T$, and set
$$A_k = \sum_{i=1}^{k}\sigma_i u_i v_i^T.\tag{5.6}$$
Theorem 5.4.1. For any matrix A ∈ Rm×n with rank r ≥ k, the matrix Ak formed by truncating an
SVD of A to its k leading terms is a best rank k approximation to A under the Frobenius norm.
Proof. Since the expression (5.6) specifies a compact SVD of Ak , it is clear that Ak has rank k. Its distance
from A can be found using standard equalities:
$$\|A - A_k\|_F^2 = \|U\Sigma V^T - U\Sigma_k V^T\|_F^2 = \|\Sigma - \Sigma_k\|_F^2 = \sum_{i=k+1}^{r}\sigma_i^2.$$
Now let B ∈ Rm×n be any matrix of rank k, with singular values λ1 ≥ · · · ≥ λk > 0. By Corollary 5.2.2,
$$\|A - B\|_F^2 \ge \|\Sigma - \Sigma_B\|_F^2 = \sum_{i=1}^{k}(\sigma_i - \lambda_i)^2 + \sum_{i=k+1}^{r}\sigma_i^2 \ge \|A - A_k\|_F^2.$$
So among all m × n matrices of rank k, the matrix Ak achieves the minimum Euclidean distance to A.
The matrix Ak is also a best rank k approximation to A under the induced 2-norm.
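The following NumPy sketch (ours) forms A_k by truncating the SVD and checks the error formula; the random rank k comparison is only a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # truncated SVD, rank k

err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err**2, np.sum(s[k:]**2)))        # ||A - A_k||_F^2 = sum_{i>k} sigma_i^2

# a random rank-k matrix does at least as badly
B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
print(np.linalg.norm(A - B, 'fro') >= err)
```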
Noting that N (A) = N (AT A) (Exercise 5.3), we see that the null space of AT A also has dimension n − r.
It follows that n − r of the eigenvectors of AT A must lie in N (A) and r must lie in N (A)⊥ . Hence
Noting that N (AAT ) = N (AT ), and N (AT ) = R(A)⊥ , we see that the dimension of N (AAT ) is m − r.
So m − r of the eigenvectors of AAT must lie in N (AT ) and r must lie in N (AT )⊥ = R(A). Hence
So either Avj = 0, or Avj is an eigenvector of AAT with eigenvalue σj2 . If Avj = 0, then (AT A)vj = 0.
This contradicts AT Avj = σj2 vj with σj2 > 0. Hence Avj must be an eigenvector of AAT with eigenvalue
σj2 . Assume for simplicity, that the positive eigenvalues of both AT A and AAT are distinct. Then for some
k, with 1 ≤ k ≤ r:
σj2 = λ2k and Avj = αuk , with α > 0.
We can take α > 0 by swapping −uk for uk if necessary. Using this result we find
$$v_j^T A^T A v_j = \begin{cases}\sigma_j^2\, v_j^T v_j = \sigma_j^2;\\ (A v_j)^T(A v_j) = \alpha^2\, u_k^T u_k = \alpha^2.\end{cases}$$
Hence $\alpha^2 = \sigma_j^2$. Next consider $A^T u_k$: since $(AA^T)u_k = \lambda_k^2 u_k$, we have $(A^T A)(A^T u_k) = A^T(AA^T u_k) = \lambda_k^2\, A^T u_k$.
Since λ2k > 0, we can’t have AT uk = 0. So AT uk is an eigenvector of AT A with eigenvalue λ2k . Under the
assumption of distinct nonzero eigenvalues, this implies that for some p with 1 ≤ p ≤ r and some scalar β, $A^T u_k = \beta v_p$.
Using this expression to evaluate uTk (AAT )uk we find λ2k = β 2 . Hence β 2 = λ2k = σp2 and AT uk = βvp .
We now have two ways to evaluate AT Avj :
$$A^T A v_j = \begin{cases}\sigma_j^2\, v_j & \text{by definition;}\\ \alpha A^T u_k = \alpha\beta\, v_p & \text{using the above analysis.}\end{cases}$$
Equating these answers gives j = p and αβ = σj2 . Since α > 0, it follows that β > 0 and α = σj = λj = β.
Thus $A v_j = \sigma_j u_j$, $j\in[1:r]$. Written in matrix form this is almost the compact SVD:
$$A\begin{bmatrix} v_1 & \cdots & v_r\end{bmatrix} = \begin{bmatrix} u_1 & \cdots & u_r\end{bmatrix}\begin{bmatrix}\sigma_1 & & \\ & \ddots & \\ & & \sigma_r\end{bmatrix}.$$
From this we deduce that $AVV^T = U\Sigma V^T$. $VV^T$ computes the orthogonal projection of x onto N (A)⊥ .
Hence for every x ∈ Rn , $AVV^T x = Ax$. Thus $AVV^T = A$, and we have $A = U\Sigma V^T$.
Finally note that $\sigma_j = \sqrt{\lambda_j(A^T A)} = \sqrt{\lambda_j(AA^T)}$, $j\in[1:r]$. So the singular values are always unique.
Notes
For more detailed reading about the SVD see Chapter 7, section 7.3, of Horn and Johnson [22].
Exercises
Preliminaries
Exercise 5.1. Let A ∈ Rm×n . The rank of A is the dimension of R(A). Show that: (a) the rank of A equals the
number of linearly independent columns of A, (b) the rank of A equals the number of linearly independent rows of A.
Exercise 5.2. Let A ∈ Rm×n have rank r. Show that
(a) dim N (A) = n − r
(b) dim R(A)⊥ = m − r
(c) dim N (A)⊥ = dim R(A)
(d) dim N (A) + dim R(A) = n
Exercise 5.3. Let A ∈ Rm×n . Show that N (AT A) = N (A) and N (AAT ) = N (AT ).
Induced 2-norm
Exercise 5.4. Show that the induced 2-norm satisfies the properties of a norm.
Exercise 5.13. Let A ∈ Rn×n have rank n. Show that minkxk2 =1 kAxk2 = σn (A).
Exercise 5.14. Let A ∈ Rm×n and B ∈ Rn×m . So AB ∈ Rm×m and BA ∈ Rn×n . If both AB and BA
are symmetric, show that the nonzero singular values of AB are the same as those of BA, including multiplicities.
Without loss of generality assume m ≤ n.
Exercise 5.17. Let A ∈ Rm×n and y ∈ Rm be given. We want to find a solution x of the linear equations AT Ax =
AT y. Show that if A = U ΣV T is a compact SVD of A, then a solution is x? = V Σ−1 U T y and x? ∈ N (A)⊥ .
Exercise 5.19. Let Σr be an r × r diagonal matrix with positive diagonal entries σ1 ≥ σ2 ≥ · · · ≥ σr > 0. Let
Σ ∈ Rn×n be block diagonal with Σ = diag(Σr , 0(n−r)×(n−r) ).
(a) What orthogonal matrices Q ∈ Or maximize the inner product <Q, Σr >?
(b) What orthogonal matrices W ∈ On maximize the inner product <W, Σ>?
Exercise 5.20. Let A ∈ Rn×n .
(a) Find Q ∈ On to maximize the inner product <Q, A> and determine the maximum value.
(b) Show that Q ∈ On maximizes <Q, A> iff QT A is symmetric PSD.
Exercise 5.21. For given A, B ∈ Rm×n , find an orthogonal matrix Q ∈ Om to maximize the inner product <A, QB>.
This Q “rotates” the columns of B to maximize the inner product with A.
Exercise 5.22. Let Σ ∈ Rr×r be diagonal with positive diagonal entries. Find a matrix B ∈ Rr×r subject to
σ1 (B) ≤ 1 that maximizes the inner product <Σ, B>.
Exercise 5.23. For given A ∈ Rm×n with r = rank(A), we want to find B ∈ Rm×n subject to σ1 (B) ≤ 1 that
maximizes <A, B>. Show that the maximum value is $\sum_{i=1}^{r}\sigma_i(A)$ and find a solution B.
Minimizing Norms
Exercise 5.24. For given A, B ∈ Rm×n , we seek Q ∈ Om to minimize kA − QBk2F . Show that this is equivalent to
finding Q ∈ Om to maximize the inner product <A, QB>, and determine the minimum achievable value.
Exercise 5.25. For A, B ∈ Rm×n , we seek orthogonal matrices Q ∈ Om and R ∈ On to minimize kA − QBRk2F .
Let A = UA ΣA VAT and B = UB ΣB VBT be full singular value decompositions. For Q ∈ Om and R ∈ On , show the
following claims:
Exercise 5.26. Let A, B ∈ Rm×n have full SVDs A = UA ΣA VAT and B = UB ΣB VBT . Show that
‖A − B‖F ≥ ‖ΣA − ΣB ‖F .
Miscellaneous
Exercise 5.30. Prove or disprove: Let X ∈ Rm×n . Let the columns of V be k orthonormal eigenvectors for the
nonzero eigenvalues of X T X, and the columns of U be k orthonormal eigenvectors for the nonzero eigenvalues of
XX T . Then XV = U .
Exercise 5.31. Prove or disprove: Let $\{x_j\}_{j=1}^{k}$ be a linearly independent set in Rn. Then $A = \sum_{j=1}^{k} x_j x_j^T$ has k nonzero singular values.
Exercise 5.32. Prove or disprove: For X ∈ Rm×n with X 6= 0, the nonzero eigenvalues of the matrices X T X and
XX T are identical.
Exercise 5.33. Prove or disprove: For x ∈ Rm and y ∈ Rn , kxy T kF = kxy T k2 = kxy T k∗ .
Exercise 5.34. Let W ∈ Rn×k , with k ≤ n, have ON columns (note W need not be square). Prove or disprove
each of the following claims: for each x ∈ Rk and each A ∈ Rk×m : (a) kW xk2 = kxk2 , (b) kW Ak2 = kAk2 , (c)
kW AkF = kAkF , and (d) kW Ak∗ = kAk∗ .
Exercise 5.35 (Idempotent But Not Symmetric). Let P ∈ Rn×n have rank r and compact SVD P = U ΣV T . If
P 2 = P , show that either r = n and P = In or r < n and P = V V T + V0 V T where the columns of V0 lie in R(V )⊥ .
Choosing V0 = 0 yields a projection matrix P = V V T . But choosing V0 6= 0, this yields an idempotent matrix P
that is not symmetric.
Exercise 5.36. Let A ∈ Rn×n have a compact SVD UA ΣA VAT . Show that trace(A) ≤ trace(ΣA ).
Chapter 6
Multivariable Differentiation
Many machine learning problems are formulated as optimization problems. Typically, this involves a real-
valued loss function f (w) where the vector w ∈ Rn parameterizes the possible solutions. The objective is to
select w to minimize the loss f (w). Assuming the function f : Rn → R is differentiable, a natural approach
is to find the derivative of f with respect to w and set this equal to zero. Sometimes it is possible to solve the
resulting equation for w or reduce its solution to known computations.
An example is the generalized Rayleigh quotient problem
$$\max_{w\in\mathbb{R}^n}\ f(w) = \frac{w^T P w}{w^T Q w},$$
where P and Q are n × n real symmetric matrices with P positive semidefinite and Q positive definite.
This problem arises in one formulation of Linear Discriminant Analysis.
More generally, the cost could be a function of matrix W ∈ Rn×d . In this case, we want to take the
derivative of f (W ) with respect to the matrix W , set this equal to zero and solve for the optimal W . An
example is the matrix Rayleigh quotient problem
$$\begin{aligned}\max_{W\in\mathbb{R}^{n\times d}}\ & f(W) = \operatorname{trace}(W^T P W)\\ \text{s.t. }\ & W^T W = I_d,\end{aligned}$$
where P ∈ Rn×n is a symmetric positive semidefinite matrix. This constrained optimization problem arises
in dimensionality reduction via Principal Components Analysis.
In many cases, even if we can compute the derivative of f it may not be clear how to solve for the optimal
value of the parameter. In such cases, an alternative is to determine (or approximate) the gradient of the loss
function f . Then we can iteratively minimize f by gradient descent (or stochastic gradient descent when
we have an approximation to the gradient). For example, this is the currently preferred means of training a
neural network.
The first issue in addressing all of these problems is the ability to take derivatives and compute gradients.
This chapter reviews this task. We define the derivative of various forms of functions (real-valued, vector-
valued, matrix-valued) defined on Rn or Rn×m . We also define the gradient of real-valued functions and
point out the distinction between the gradient and the derivative. By the end of the chapter the reader will
be equipped to derive the gradients and derivatives shown in Table 6.1, and to solve the Rayleigh quotient
problems shown above.
f              | ∇f             | Df
a^T x          | a              | a^T v
‖x‖₂²          | 2x             | 2 x^T v
‖x‖₂           | x/‖x‖₂         | (x/‖x‖₂)^T v
‖Ax‖₂²         | 2 A^T A x      | 2 x^T (A^T A) v
a ⊗ x          | n/a            | a ⊗ v
x ⊗ x          | n/a            | 2 x ⊗ v
trace(A^T M)   | A              | trace(A^T V)
‖M‖F²          | 2M             | 2 trace(M^T V)
‖M‖F           | M/‖M‖F         | trace((M/‖M‖F)^T V)
trace(M)       | In             | trace(V)
det(M)         | det(M) M^{−T}  | det(M) trace(M^{−1} V)
A ⊗ M          | n/a            | A ⊗ V
M ⊗ M          | n/a            | 2 M ⊗ V
M²             | n/a            | M V + V M
M^{−1}         | n/a            | −M^{−1} V M^{−1}
Table 6.1: Summary Table. A selection of functions f shown with corresponding derivatives and gradients. In the
table entries, x and M are the variables of the function f, a and A are constants, and v and V are the dummy variables
of the derivatives, with x, a, v ∈ Rn and M, A, V ∈ Rn×n.
$$\lim_{v\to 0}\frac{f(x+v) - f(x) - Df(x)(v)}{\|v\|_2} = 0.\tag{6.1}$$
Since $Df(x)(v) = \nabla f(x)^T v = \langle\nabla f(x), v\rangle$, the derivative of f at x determines the gradient of f at x,
and the gradient of f at x determines the derivative of f at x.
Example 6.1.2. We illustrate the gradient using the functions in Example 6.1.1.
To avoid confusion, keep in mind that the derivative and the gradient are not the same. The definition
of the gradient assumes a real-valued function. In contrast, the derivative can exist for real-valued, vector-
valued, and matrix-valued functions. Even when both are defined, one is a linear function mapping the
domain of f into R, while the other is a vector in Rn . In short, when both are defined, the derivative and the
gradient are distinct but connected objects.
6.2 Functions f : Rn → Rm
Conceptually, the derivative of a function f : Rn → Rm at x ∈ Rn is the linear function Df (x) that gives
the best local approximation to f around the point x. So to first order, the best linear approximation to f at
x is f (x + v) ≈ f (x) + Df (x)(v). This leads to the following definition.
A function f : Rn → Rm is differentiable at x ∈ Rn if there is a linear function Df (x) : Rn → Rm such
that
$$\lim_{v\to 0}\frac{f(x+v) - f(x) - Df(x)(v)}{\|v\|_2} = 0.\tag{6.4}$$
Here v ∈ Rn , Df (x)(v) ∈ Rm , and the norm in the denominator is the 2-norm in Rn .
Write $f(x) = \begin{bmatrix} f_1(x) & f_2(x) & \cdots & f_m(x)\end{bmatrix}^T$ where $f_i : \mathbb{R}^n\to\mathbb{R}$ is a scalar valued function that gives
the value of the i-th entry of f (x) ∈ Rm . The best linear approximation to f at x must also give the best
linear approximation for each component of f (x). Hence
$$Df(x)(v) = \begin{bmatrix} Df_1(x)(v)\\ Df_2(x)(v)\\ \vdots\\ Df_m(x)(v)\end{bmatrix}.$$
Let $x = (x_1, x_2, \ldots, x_n)$ and $v = (v_1, v_2, \ldots, v_n)$. Then we can use partial derivatives to write
$$Df(x)(v) = \begin{bmatrix}\frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_1(x)}{\partial x_2} & \cdots & \frac{\partial f_1(x)}{\partial x_n}\\[4pt] \frac{\partial f_2(x)}{\partial x_1} & \frac{\partial f_2(x)}{\partial x_2} & \cdots & \frac{\partial f_2(x)}{\partial x_n}\\ \vdots & \vdots & & \vdots\\ \frac{\partial f_m(x)}{\partial x_1} & \frac{\partial f_m(x)}{\partial x_2} & \cdots & \frac{\partial f_m(x)}{\partial x_n}\end{bmatrix} v.$$
(b) Let a, x ∈ Rn and f (x) = a ⊗ x where ⊗ denotes the Schur product. Then $f_i(x) = a_i x_i$ and
$$Df(x)(v) = \Big[\frac{\partial f_i(x)}{\partial x_j}\Big]\, v = \operatorname{diag}(a)\, v = a\otimes v.$$
Proof. If it exists, the derivative of f at x is the unique linear function from Rn to Rm that best matches f in
a neighborhood of x. Since f is linear, its best linear approximation at any point x is itself. Hence a linear
function is differentiable everywhere and Df (x)(v) = f (v). Let’s check
$$\lim_{v\to 0}\frac{f(x+v) - f(x) - Df(x)(v)}{\|v\|_2} = \lim_{v\to 0}\frac{f(x) + f(v) - f(x) - f(v)}{\|v\|_2} = 0.$$
The product rule can be generalized to functions f, g : Rn → Rm by considering h(x) = <f (x), g(x)>.
This leads to the following extension of the product rule.
Lemma 6.2.4 (Inner Product Rule). Let f, g : Rn → Rm be differentiable at x and set h(x) = <f (x), g(x)>. Then h is differentiable at x and
$$Dh(x)(v) = \langle Df(x)(v),\ g(x)\rangle + \langle f(x),\ Dg(x)(v)\rangle.$$
Proof. $h(x) = \sum_{j=1}^{m} f_j(x) g_j(x)$. Hence using the product rule
$$Dh(x)(v) = \sum_{j=1}^{m} D\big(f_j(x) g_j(x)\big)(v) = \sum_{j=1}^{m} Df_j(x)(v)\, g_j(x) + f_j(x)\, Dg_j(x)(v).$$
The RHS of the above equation can be rearranged to yield the stated result.
Example 6.2.2. The following examples illustrate the above properties using functions f : Rn → R. We
also determine the gradient of each example.
(a) Fix a ∈ Rn and for x ∈ Rn set f (x) = aT x. f (x) is a linear function from Rn to R. The best
approximation to a linear function is the linear function itself. Hence,
Df (x)(v) = aT v,
∇f (x) = a.
(b) Let x ∈ Rn , A ∈ Rn×n and f (x) = xT Ax. Using the generalized product rule we have:
$$Df(x)(v) = v^T A x + x^T A v = x^T(A + A^T)\, v,$$
$$\nabla f(x) = (A + A^T)\, x.$$
(c) Let x ∈ Rn , A ∈ Rn×n and $f(x) = \sqrt{x^T A x}$. For α > 0, define $g(\alpha) = \alpha^{1/2}$. Then write f (x) =
g(xT Ax). Scalar differentiation yields $g'(\alpha) = \tfrac{1}{2}\alpha^{-1/2}$. Hence by the chain rule
$$Df(x)(v) = \frac{1}{2\sqrt{x^T A x}}\, x^T(A + A^T)\, v,\qquad x^T A x \neq 0,$$
$$\nabla f(x) = \frac{1}{2\sqrt{x^T A x}}\, (A + A^T)\, x,\qquad x^T A x \neq 0.$$
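A finite-difference check (our own sketch, not part of the text) of the gradient formula in item (b).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda x: x @ A @ x
grad = (A + A.T) @ x                      # formula from Example 6.2.2(b)

# central finite-difference approximation of the gradient
eps = 1e-6
fd = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(n)])
print(np.allclose(fd, grad, atol=1e-5))   # True
```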
$$\begin{aligned} x^\star = \arg\max_{x\in\mathbb{R}^n}\ & x^T P x\\ \text{s.t. }\ & x^T x = 1.\end{aligned}\tag{6.8}$$
Problem (6.8) can be solved using the method of Lagrange multipliers. We first give some insights into
how this works. The constraint xT x = 1 is the zero level set of the function g(x) = 1 − xT x. This level
set is a surface in Rn (in this particular case it’s a sphere in Rn of radius 1). At a point x on the surface,
$g(x + v) \approx g(x) + \nabla g(x)^T v$. So the set of vectors tangent to the surface at x is the subspace ∇g(x)⊥ . This
result implies that ∇g(x) is normal to the tangent plane of the surface at x.
We want to maximize f (x) = xT P x. The gradient ∇f (x) gives the direction of maximum increase in
f at x. Generally, moving x in this direction will violate the constraint. But we can always move x in the
direction of the orthogonal projection of ∇f (x) onto the tangent plane at x. This projection is
$$d(x) = \nabla f(x) - \frac{1}{\|\nabla g(x)\|_2^2}\,\nabla g(x)\nabla g(x)^T\nabla f(x).$$
As long as d(x) is nonzero, x can be moved in the tangent plane to increase the objective function f (x).
Hence for x to be a solution it is necessary that d(x) = 0. This requires that for some scalar µ,
$$\nabla f(x) + \mu\,\nabla g(x) = 0.$$
This necessary condition can be packaged by forming the Lagrangian
$$L(x,\mu) \stackrel{\Delta}{=} f(x) + \mu\, g(x),$$
where µ ∈ R is a Lagrange multiplier or dual variable. Taking the derivative of L(x, µ) with respect to x
and setting this equal to zero yields (Df (x) + µDg(x))v = 0 for each v ∈ Rn . This yields the necessary
condition ∇f (x) + µ∇g(x) = 0. Setting the derivative of L(x, µ) with respect to µ equal to zero yields the
constraint g(x) = 0. So the derivatives of the Lagrangian give two necessary conditions for x to be a solution
of the optimization problem: ∇f (x) + µ∇g(x) = 0 and g(x) = 0.
Theorem 6.2.1. Let the eigenvalues of P be λ1 ≥ λ2 ≥ · · · ≥ λn . Problem (6.8) has the optimal value
λ1 and this is achieved if and only if x? is a unit norm eigenvector of P for λ1 . If λ1 > λ2 , this solution
is unique up to the sign of x? .
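A small numerical illustration (ours) of Theorem 6.2.1: the top eigenvector of P attains the maximum of x^T P x over unit vectors.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
B = rng.standard_normal((n, n))
P = B @ B.T                                        # symmetric PSD

lam, V = np.linalg.eigh(P)                         # eigenvalues in increasing order
x_star, lam1 = V[:, -1], lam[-1]                   # unit eigenvector for the largest eigenvalue

print(np.isclose(x_star @ P @ x_star, lam1))       # optimal value is lambda_1
xs = rng.standard_normal((n, 10000))
xs /= np.linalg.norm(xs, axis=0)
print((np.einsum('in,ij,jn->n', xs, P, xs) <= lam1 + 1e-9).all())   # no unit x does better
```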
(b) Let A ∈ Rn×n and f (M ) = trace(M T AM ). Since trace(·) is a linear function, Df (M )(V ) =
trace(D(M T AM )V ). The product rule then yields Df (M )(V ) = trace(V T AM + M T AV ) =
trace(M T (A + AT )V ) = <(A + AT )M, V >. Thus
Df (M )(V ) = <(A + AT )M, V >,
∇f (M ) = (A + AT )M.
(c) Let f (M ) = kM k2F . We can write f (M ) = <M, M > = trace(M T M ). By the chain rule and
linearity of the trace function, Df (M )(V ) = trace(D(M T M )(V )). Then by the product rule,
D(M T M )(V ) = V T M + M T V . So Df (M )(V ) = trace(V T M + M T V ) = 2 trace(M T V ). Thus
Df (M )(V ) = 2<M, V >
∇f (M ) = 2M.
To see informally why the above result holds, first note that $\det(M + \epsilon V) = \det(M)\det(I + \epsilon M^{-1}V)$.
For any matrix A, det(I + εA) is the product of the eigenvalues of I + εA. If ε is sufficiently small, I + εA
is dominated by its diagonal terms 1 + εa_ii. Hence to first order, the product of the eigenvalues can be
approximated by the first order terms in $\prod_{i=1}^{n}(1 + \epsilon a_{ii})$. This gives $\det(I + \epsilon A) = 1 + \epsilon\operatorname{trace}(A) + O(\epsilon^2)$.
Thus
$$Df(M)(V) = \lim_{\epsilon\to 0}\frac{\det(M + \epsilon V) - \det(M)}{\epsilon} = \det(M)\operatorname{trace}(M^{-1}V).$$
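A finite-difference sketch (ours) comparing the limit above with the formula det(M) trace(M^{-1}V).

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4)) + 4 * np.eye(4)    # comfortably invertible
V = rng.standard_normal((4, 4))

eps = 1e-6
fd = (np.linalg.det(M + eps * V) - np.linalg.det(M - eps * V)) / (2 * eps)
formula = np.linalg.det(M) * np.trace(np.linalg.solve(M, V))   # det(M) trace(M^{-1} V)
print(np.isclose(fd, formula, rtol=1e-5))          # True
```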
$$\begin{aligned}\max_{W\in\mathbb{R}^{n\times d}}\ & \operatorname{trace}(W^T P W)\\ \text{s.t. }\ & W^T W = I_d.\end{aligned}\tag{6.11}$$
This problem seeks d orthonormal vectors $w_i$, $i\in[1:d]$ (the columns of W), such that $\operatorname{trace}(W^T P W) = \sum_{j=1}^{d} w_j^T P w_j$ is maximized. The objective function is a real valued function of W . There are d constraints
of the form wjT wj = 1, j ∈ [1 : d], and d(d − 1)/2 constraints of the form wjT wk = 0 for j ∈ [1 : d − 1],
k ∈ [j + 1 : d]. So we will need d scalar dual variables for the first set of constraints and another d(d − 1)/2
scalar dual variables for the second set of constraints. Each constraint will be multiplied by its corresponding
dual variable and these products will be summed and added to the objective function to form the Lagrangian.
Since the constraints Id − W T W = 0 are symmetric, we represent the dual variables as the entries of real
symmetric matrix Ω. We can then write the Lagrangian in the compact form
$$L(W,\Omega) = \operatorname{trace}(W^T P W) + \operatorname{trace}\big(\Omega\,(I_d - W^T W)\big).$$
Theorem 6.3.1. Let P ∈ Rn×n be a symmetric PSD matrix and d ≤ rank(P ). Then every solution of
max trace(W T P W )
W ∈Rn×d (6.12)
s.t. W T W = Id ,
has the form W ? = We Q where the columns of We are orthonormal eigenvectors of P for its d largest
(hence non-zero) eigenvalues, and Q is a d × d orthogonal matrix.
with Ω a real symmetric matrix. Setting the derivative of L with respect to W acting on V ∈ Rn×d equal to
zero yields
$$2\operatorname{trace}\big((P W - W\Omega)^T V\big) = 0,\qquad V\in\mathbb{R}^{n\times d}.$$
Since this holds for all V , a solution W ? must satisfy the necessary condition
P W ? = W ? Ω. (6.13)
By the symmetry of Ω ∈ Rd×d , there exists an orthogonal matrix Q ∈ Od such that Ω = QΛQT with Λ a
diagonal matrix with the real eigenvalues of Ω listed in decreasing order on the diagonal. Substituting this
expression into (6.13) and rearranging yields
$$P(W^\star Q) = (W^\star Q)\,\Lambda,\tag{6.14}$$
$$\operatorname{trace}(\Lambda) = \operatorname{trace}\big((W^\star Q)^T P (W^\star Q)\big) = \operatorname{trace}(W^{\star T} P W^\star).\tag{6.15}$$
The last term in (6.15) is the optimal value of (6.12). Thus the optimal value of (6.12) is trace(Λ), and
W ? Q is also a solution of (6.12). Finally, (6.14) shows that the columns of We = W ? Q are orthonormal
eigenvectors of P . By optimality, the diagonal entries of Λ must hence be the d largest eigenvalues of P .
Since d ≤ rank(P ), all of these eigenvalues are positive.
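A numerical illustration (ours) of Theorem 6.3.1: the top d eigenvectors of P attain the maximum of trace(W^T P W) over matrices with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 8, 3
B = rng.standard_normal((n, n))
P = B @ B.T                                        # symmetric PSD with rank n >= d

lam, V = np.linalg.eigh(P)                         # increasing eigenvalues
W_e = V[:, -d:]                                    # eigenvectors for the d largest eigenvalues
best = np.trace(W_e.T @ P @ W_e)
print(np.isclose(best, lam[-d:].sum()))            # optimal value = sum of top-d eigenvalues

# a random W with orthonormal columns never does better
W, _ = np.linalg.qr(rng.standard_normal((n, d)))
print(np.trace(W.T @ P @ W) <= best + 1e-9)
```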
(a) Let A ∈ Rp×m , B ∈ Rn×q and f (M ) = AM B. This is a linear function of M . Hence Df (M )(V ) =
AV B.
(b) Let A ∈ Rn×n and f (M ) = A⊗M . This is also a linear function of M . Hence Df (M )(V ) = A⊗V .
(c) Let f (M ) = M −1 on the set of invertible matrices. Then
$$Df(M)(V) = -M^{-1} V M^{-1}.\tag{6.16}$$
To see this, differentiate both sides of the identity $M M^{-1} = I$ using the product rule. Since the right side is constant, this gives
$$V M^{-1} + M\, Df(M)(V) = 0,$$
and solving for Df (M )(V ) yields (6.16).
Notes
Our presentation of multivariable differentiation follows that of standard texts such as Rudin [38] and Flem-
ing [15]. For additional reading on this topic see Section 3.3 of [15]. For a discussion of vector-valued
functions of a vector variable, see Chapter 4 of the same book.
Exercises
Exercise 6.1. Show that equation (6.6) continues to hold when the functions fj take values in Rm .
$$p(x) = [\,e^{x_i}\,],\qquad q(z) = \frac{1}{\mathbf{1}^T z}\, z.$$
Here Rn+ denotes the positive cone {x ∈ Rn : xi > 0}. p(·) maps x ∈ Rn into the positive cone Rn+ , and for z ∈ Rn+ ,
q(·) normalizes z to a probability mass function in Rn+ .
(a) Determine the derivative of p(x) at x ∈ Rn
Chapter 7
7.1 Preliminaries
We begin with some simple properties of subsets of Rn . Specifically, the properties of being bounded, open,
closed, and compact. A set S ⊂ Rn is bounded if there exists β > 0 such that for each x ∈ S,
‖x‖ ≤ β. So a bounded set is, as the name implies, bounded in extent.¹ The set S is open if, roughly
speaking, it doesn’t contain any boundary points. For example the interval (0, 1) in R is open. This interval
has two boundary points 0, 1 and these are not contained in the set. A more precise definition is that for
every point x ∈ S there exists δ > 0, such that the ball B = {z : kx − zk ≤ δ} is entirely contained in
S. So every point in S is completely surrounded by other points in S. The set S is closed if it contains its
boundary. For example, the interval [0, 1] in R is closed. By contrast the intervals (0, 1), [0, 1), (0, 1] are
not, since each fails to contain all of its boundary points. A more precise definition is that S is closed if
every convergent sequence of points in S converges to a point in S. So a closed set contains the limits of
its convergent sequences. The subset S is said to be compact if it is both closed and bounded. Compactness
has important implications. For example, the extreme value theorem states that every real-valued continuous
function defined on a compact subset S ⊂ Rn achieves a minimum and maximum value at points in S.
(b) Any interval is a convex subset of R. Conversely, a convex subset of R must be an interval. If the
interval is bounded, then it takes one of the forms: (a, b), (a, b], [a, b), or [a, b], where a, b ∈ R and
a < b. If it is unbounded, then it takes one of the forms (a, ∞), [a, ∞), (−∞, a], (−∞, a), or R. In
all cases, the interior of the interval has the form (a, b), where a could be −∞ and b could be ∞.
¹ By the equivalence of norms on Rn , if S is bounded in some norm, it is bounded in all norms on Rn .
(c) A subspace U ⊂ Rn is convex. If u, v ∈ U, then for α ∈ [0, 1], it is clear that (1 − α)u + αv ∈ U.
(d) A closed half space H of Rn is a set of the form {x : aT x ≤ b} where a ∈ Rn with a ≠ 0 and b ∈ R.
Thus H is one side of a hyperplane in Rn including the hyperplane itself. Every closed half space is
convex. If aT x ≤ b and aT y ≤ b, and α ∈ [0, 1], then $a^T\big((1-\alpha)x + \alpha y\big) = (1-\alpha)a^T x + \alpha a^T y \le (1-\alpha)b + \alpha b = b$.
(e) The unit ball of the 1-norm, $B_1 = \{x : \sum_{i=1}^{n}|x_i| \le 1\}$, is convex. To see this let x, y ∈ B1 and
α ∈ [0, 1]. Then $\sum_{i=1}^{n}|(1-\alpha)x_i + \alpha y_i| \le (1-\alpha)\sum_{i=1}^{n}|x_i| + \alpha\sum_{i=1}^{n}|y_i| \le 1$.
(a) Closure under intersection: If for each a ∈ A, Sa ⊂ Rn is convex, then ∩a∈A Sa is convex.
(c) Image of a convex set under a linear map: If S ⊂ Rn is convex and F is a linear map from Rn to
Rm , then F (S) = {z ∈ Rm : z = F s, s ∈ S} is convex.
(d) Pre-image of a convex set under a linear map: If S ⊂ Rm is convex and F is a linear map from
Rn to Rm , then $F^{-1}(S) \stackrel{\Delta}{=} \{x\in\mathbb{R}^n : Fx \in S\}$ is convex.
Proof. Exercise.
Example 7.2.2. We can use Theorem 7.2.1 to provide some additional examples of convex sets.
(b) A closed polytope is any bounded region of Rn defined by the intersection of a finite number of closed
half spaces aTj x ≤ bj , j ∈ [1 : p]. Since half spaces are convex, and convex sets are closed under
intersection, a closed polytope is convex.
(c) The unit ball of the max norm, B∞ = {x : kxk∞ ≤ 1}, is a closed polytope defined by the intersection
of the closed half spaces eTj x ≤ 1 and eTj x ≥ −1, j ∈ [1 : n]. Hence it is a (closed) convex set.
(d) For A ∈ Rm×n and b ∈ Rm , the set C = {x : Ax ≤ b} is convex. C is the intersection of the half
spaces {x : Ai,: x ≤ bi } where Ai,: is the i-th row of A and bi is the i-th entry of b. Since half spaces
are convex, and convex sets are closed under intersection, C is convex.
Figure 7.1: Illustration of the concept of a convex function f . This function is strictly convex over the interval [x, y].
For α ∈ [0, 1], the point xα = (1 − α)x + αy lies on the line joining x (α = 0) to y (α = 1). On the other
hand, the scalar value fα = (1−α)f (x)+αf (y) is the corresponding linear interpolation of the values f (x)
(α = 0) and f (y) (α = 1). Convexity requires that the value of the function f along the line segment from
x to y is no greater than the corresponding linear interpolation (1 − α)f (x) + αf (y) of the values f (x) and
f (y). This is illustrated in Figure 7.1. A function is strictly convex if for x ≠ y and α ∈ (0, 1), (7.1) holds
with strict inequality. Hence strict convexity requires that the value of the function f along the line segment
between two distinct x and y is strictly less than the corresponding linear interpolation (1−α)f (x)+αf (y).
It is easy to see that a strictly convex function is convex, but that not every convex function is strictly convex.
A concept closely related to convexity is concavity. A function f : Rn → R is concave if −f (x)
is convex, and it is strictly concave if −f (x) is strictly convex.
For f : Rn → R and S ⊂ Rn , the restriction of f to S, is the function g : S → R with g(x) = f (x)
for each x ∈ S. Clearly, if f is a convex function and C ⊂ Rn is a convex set, then the restriction of f to
C is a convex function. In particular, the restriction f to any line in Rn is convex, and the restriction of f
to any finite line segment is convex. The last statement is essentially the definition of convexity. Thus the
following statements are equivalent: f is convex on Rn , f is convex on every convex subset of Rn , f is
convex on every line in Rn , and f is convex on every line segment in Rn .
The above idea is often used in the following way. Let f : C → R be a convex function. Pick x, y ∈ C.
Then for t ∈ [0, 1], define the function g : [0, 1] → R by g(t) = f ((1 − t)x + ty). It is easy to see that g
is a convex function on [0, 1]. The converse of this result also holds. If for every x, y ∈ C, g(t) is a convex
function on [0, 1], then f is a convex function.
Theorem 7.3.2 (Jensen's Inequality). Let C be a convex set and f : C → R be a convex function. If
$\{x_i\}_{i=1}^{k} \subset C$ and $\{\alpha_i\}_{i=1}^{k}$ is a set of nonnegative scalars with $\sum_{i=1}^{k}\alpha_i = 1$, then
$$f\Big(\sum_{i=1}^{k}\alpha_i x_i\Big) \le \sum_{i=1}^{k}\alpha_i f(x_i).$$
Proof. Exercise.
Figure 7.2: Illustration of the vertices of the unit 1-norm balls in Rk , for k = 1, 2, 3.
Example 7.3.1 (An illustration of convex combinations). We show that any point in the set $B_1 \stackrel{\Delta}{=} \{x\in\mathbb{R}^n : \|x\|_1 \le 1\}$ (the unit ball of the 1-norm) can be written as a convex combination of its 2n vertices
$\{v_i\}_{i=1}^{2n} = \{e_1,\ldots,e_n, -e_1,\ldots,-e_n\}$. These vertices are illustrated in Figure 7.2.
Let $x\in B_1$. First assume that $\sum_{i=1}^{n}|x_i| = 1$. We know that for a unique set of scalars $\{\alpha_i\}_{i=1}^{n}$,
$x = \sum_{i=1}^{n}\alpha_i e_i$. Clearly, $\sum_{i=1}^{n}|\alpha_i| = 1$. Define scalars $\{\beta_j\}_{j=1}^{2n}$ as follows:
$$\text{for } j\in[1:n]:\ \beta_j = \begin{cases}\alpha_j, & \alpha_j \ge 0;\\ 0, & \text{otherwise;}\end{cases}\qquad \text{for } j\in[n+1:2n]:\ \beta_j = \begin{cases}-\alpha_{j-n}, & \alpha_{j-n} < 0;\\ 0, & \text{otherwise.}\end{cases}$$
Then $x = \sum_{j=1}^{2n}\beta_j v_j$ with $\beta_j \ge 0$ and $\sum_j\beta_j = 1$.
Now assume x is in the interior of $B_1$. If x = 0, then $x = \tfrac{1}{2}e_1 - \tfrac{1}{2}e_1 = \tfrac{1}{2}v_1 + \tfrac{1}{2}v_{n+1}$. Hence
assume x ≠ 0. Consider the line that passes through 0 and x. This line intersects the boundary of $B_1$ at
two points $a, b\in\mathbb{R}^n$ with $\sum_{i=1}^{n}|a_i| = \sum_{i=1}^{n}|b_i| = 1$. Since x lies on the line segment from a to b, we can
write $x = (1-\gamma)a + \gamma b$ for some $\gamma\in[0,1]$. By construction, a, b are on the boundary of $B_1$. So each is a
convex combination of the vertices: $a = \sum_{j=1}^{2n}\alpha_j v_j$ and $b = \sum_{j=1}^{2n}\beta_j v_j$. Then
$$x = (1-\gamma)\sum_{j=1}^{2n}\alpha_j v_j + \gamma\sum_{j=1}^{2n}\beta_j v_j = \sum_{j=1}^{2n}\big((1-\gamma)\alpha_j + \gamma\beta_j\big)v_j,$$
with $(1-\gamma)\alpha_j + \gamma\beta_j \ge 0$ and $\sum_{j=1}^{2n}\big((1-\gamma)\alpha_j + \gamma\beta_j\big) = 1$. Hence every $x\in B_1$ is a convex
combination of its vertices $\{v_j\}_{j=1}^{2n}$.
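A short sketch (ours, not from the text) that computes such a convex combination; for points in the interior it splits the leftover weight between e_1 and −e_1 rather than using the line-intersection argument above.

```python
import numpy as np

def vertex_weights(x):
    """Weights expressing x (with ||x||_1 <= 1) as a convex combination of the
    vertices {e_1,...,e_n, -e_1,...,-e_n}; a variant of Example 7.3.1."""
    n = len(x)
    beta = np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)])  # weights on e_j and -e_j
    slack = 1.0 - beta.sum()                 # 1 - ||x||_1 >= 0
    beta[0] += slack / 2                     # split the slack between e_1 and -e_1,
    beta[n] += slack / 2                     # i.e. add (slack/2)(e_1 - e_1) = 0 to x
    return beta

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 4)
x /= np.linalg.norm(x, 1) * 1.5              # a point with ||x||_1 = 2/3 < 1
beta = vertex_weights(x)
V = np.hstack([np.eye(4), -np.eye(4)])       # vertices as columns
print(np.allclose(V @ beta, x), np.isclose(beta.sum(), 1.0), (beta >= 0).all())
```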
Example 7.3.2 (An application of Jensen's inequality). Let $B_1$ denote the 1-norm unit ball in Rn. From
Examples 7.2.1 and 7.3.1 we know that $B_1$ is a convex set, and that any $x\in B_1$ can be written in the form
$x = \sum_{j=1}^{2n}\alpha_j v_j$ with $\alpha_j \ge 0$ and $\sum_{j=1}^{2n}\alpha_j = 1$, where the points $v_j\in\mathbb{R}^n$ are the 2n vertices of $B_1$. Now
consider a convex function $f : B_1\to\mathbb{R}$. By Jensen's inequality, we have
$$f(x) = f\Big(\sum_{j=1}^{2n}\alpha_j v_j\Big) \le \sum_{j=1}^{2n}\alpha_j f(v_j) \le \max_j f(v_j).$$
Hence the values of f on $B_1$ are bounded above by the maximum value of f on the vertices of $B_1$.
Theorem 7.3.3. Let f and g be convex functions on Rn . Then the following functions are convex:
(e) h(x) = g(f (x)) where g : R → R is convex and nondecreasing on the range of f .
(f) h(x) = limk→∞ fk (x) where the sequence of functions {fk }k≥1 converges pointwise to h.
h((1 − α)x + αy) = max{f ((1 − α)x + αy), g((1 − α)x + αy)}
≤ max{(1 − α)f (x) + αf (y), (1 − α)g(x) + αg(y)} f, g are convex
≤ (1 − α) max{f (x), g(x)} + α max{f (y), g(y)}
= (1 − α)h(x) + αh(y).
(e) Since f is convex, f ((1 − α)x + αy) ≤ (1 − α)f (x) + αf (y). Then since g is nondecreasing
h((1 − α)x + αy) = g(f ((1 − α)x + αy)) ≤ g((1 − α)f (x) + αf (y))
≤ (1 − α)g(f (x)) + αg(f (y)) g is convex
= (1 − α)h(x) + αh(y).
(f) Let x, y ∈ Rn , α ∈ [0, 1], and xα = (1 − α)x + αy. By assumption, limk→∞ fk (xα ) = h(xα ). Let
∆
gk (xα ) = fk (xα )−(1−α)fk (x)−αfk (y). By the convexity of fk , gk (xα ) ≤ 0, and then by the assumption
of pointwise convergence, limk→∞ gk (xα ) = h(xα ) − (1 − α)h(x) − αh(y) ≤ 0. Thus h is convex.
A function f : Rn → R is strongly convex if for some c > 0, $f(x) - \tfrac{c}{2}\|x\|_2^2$ is convex. When this holds, we
say that f is strongly convex with modulus c.
Proof. Exercise.
Example 7.3.3. These examples are selected to highlight the distinctions between convexity, strict convexity
and strong convexity.
(a) Let f : R → R with f (x) = ax + b, for a, b ∈ R. Then f (x) is convex, but not strictly convex on
R. You can see this directly from the derivation in Example 7.3.1(a).
(b) Let f : R → R with f (x) = x2 . Then f (x) is strictly convex (Example 7.3.1(b)). It is also strongly
convex. To see this let 0 < c ≤ 2 and set $g(x) = f(x) - \tfrac{c}{2}x^2$. Then $g(x) = x^2 - \tfrac{c}{2}x^2 = (1 - \tfrac{c}{2})x^2$.
Since g(x) is a non-negative scaling of the convex function x2 , it is convex. Thus f (x) is strongly
convex on R.
(c) Consider the function fk : R → R with fk (x) = x2k for an integer k ≥ 0. The function f0 (x) = 1
is convex but not strictly convex. The function f1 (x) = x2 is strongly convex (see (b) above) and
hence also strictly convex. The function f2 (x) = x4 is strictly convex but not strongly convex. To see this, for any c > 0 consider
$$g(x) = x^4 - \tfrac{c}{2}x^2 = x^2\big(x^2 - \tfrac{c}{2}\big).$$
We see that g is an even function, g(0) = 0, g(x) < 0 for $x^2 < \tfrac{c}{2}$, and g(x) > 0 for $x^2 > \tfrac{c}{2}$. Thus it
can't be convex for any c > 0. So f2 (x) = x4 is not strongly convex. The same argument shows that
the functions $f_{2k}(x) = x^{2k}$ for k > 1 are strictly convex but not strongly convex.
Like convexity, strict convexity and strong convexity are preserved by a variety of standard operations
on functions. A list of these operations would be similar to, but not identical to those given in Theorem
7.3.3. These operations are examined in the exercises.
Theorem 7.3.5. A convex function defined on a convex set C ⊂ Rn is continuous on the interior of C.
Theorem 7.4.1. Let f : I → R be differentiable on an open interval I. Then f is convex if and only if
for all x, y ∈ I,
f (y) ≥ f (x) + f 0 (x)(y − x). (7.2)
It is possible to extend Theorem 7.4.1 to cover strictly convex and strongly convex functions. These
extensions are given in the following corollary. Part (a) is proved in Appendix 7.7. The proof of part (b) is
left as an exercise.
(a) f is strictly convex if and only if for all x, y ∈ I with y 6= x, f (y) > f (x) + f 0 (x)(y − x).
(b) f is strongly convex if and only if there exists c > 0 such that for all x, y ∈ I with y ≠ x,
$f(y) \ge f(x) + f'(x)(y - x) + \tfrac{c}{2}(y - x)^2$.
If f has a second derivative, then the convexity of f is determined by the sign of f 00 (x).
The second derivative is the rate of change of the slope f 0 (x) of f at x. In these terms, f is convex if and
only if the slope f 0 (x) is nondecreasing.
There is a partial extension of Theorem 7.4.2 to strictly convex functions, and a full extension to strongly
convex functions. These extensions are given in the following corollary. The proof is left as exercise.
(b) f is strongly convex on I if and only if there exists c > 0 such that for all x ∈ I, f 00 (x) ≥ c.
Example 7.4.1. The following examples illustrate the application of Theorem 7.4.2 and its corollaries.
(a) f (x) = x2 has f 00 (x) = 2 > 0. Hence f is strictly convex on R. In addition, at each point x ∈ R,
f 00 (x) ≥ 2. Hence f is strongly convex on R.
(b) f (x) = x4 has f 00 (x) = 12x2 . Using Corollary 7.4.2 we conclude that f (x) is strictly convex on
every open interval not containing 0. However, we know that f (x) = x4 is strictly convex on R. So
in general, f 00 (x) > 0 for all x ∈ I, is sufficient but not necessary for the strict convexity on I.
(c) $f(x) = \sqrt{x}$ for x ∈ (0, ∞). $f''(x) = -\tfrac{1}{4}x^{-3/2}$ is negative for x > 0. Hence f is not convex on
(0, ∞). However, $-\sqrt{x}$ is strictly convex on (0, ∞). Hence f (x) is strictly concave on (0, ∞).
(d) f (x) = x ln x for x ∈ (0, ∞). We have f 0 (x) = ln x + 1 and f 00 (x) = 1/x. Since f 00 (x) > 0 at each
x ∈ (0, ∞), f is strictly convex on (0, ∞).
Figure 7.3: Left: Illustration of the bound (7.3). The function f (v) is bounded below by the best linear approximation
to the function at any point x. Right: Illustration of the bound (7.3) applied at the point xα . We see that f (x) ≥ g(x)
and f (y) ≥ g(y). Hence (1 − α)f (x) + αf (y) ≥ (1 − α)g(x) + αg(y) = f (xα ).
You will often see equation (7.3) written in the following equivalent form using the gradient of f:
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x).
The following corollary gives the extension of Theorem 7.4.3 to strictly convex and strongly convex
functions. The proof is left as an exercise.
(a) f is strictly convex if and only if for all x, y ∈ Rⁿ with y ≠ x, f(y) > f(x) + Df(x)(y − x).
(b) f is strongly convex if and only if there exists c > 0 such that for all x, y ∈ C, f(y) ≥ f(x) +
Df(x)(y − x) + (c/2)‖y − x‖².
where Hf(x) ∈ Rⁿˣⁿ is the Hessian matrix of f at x. Comparison of this equation with (7.3) suggests that
f is convex if and only if Hf(x) is a positive semidefinite matrix for each x in the domain of f. This leads to
the following multivariable analog of Theorem 7.4.2.
Theorem 7.4.4. A C 2 function f on an open convex set C ⊂ Rn is convex if and only if at each x ∈ C
the Hessian matrix Hf (x) is positive semidefinite.
Here are the generalizations for strict convexity and strong convexity. The proofs are left as exercises.
(a) If Hf(x) is positive definite at each x ∈ C, then f is strictly convex on C.
(b) f is strongly convex on C if and only if there exists c > 0 such that for all x ∈ C, Hf(x) − cIₙ is
positive semidefinite.
Sublevel sets are nested in the sense that for a ≤ b, La ⊆ Lb . In particular, if f has a global minimum at x? ,
then the set of x that achieve the global minimum value is Lf (x? ) , and for any x0 ∈ Rn , Lf (x? ) ⊆ Lf (x0 ) .
For convex functions we can say more.
Theorem 7.5.1. The sublevel sets of a convex function f : Rn → R are convex and closed.
Proof. If Lc = ∅, then it is closed and convex. Otherwise let x, y ∈ Lc and α ∈ [0, 1]. Then f((1 − α)x + αy) ≤
(1 − α)f(x) + αf(y) ≤ c. Hence Lc is convex. Since f is convex on Rⁿ it is continuous on Rⁿ. Thus
if {xk } ⊂ Lc with xk converging to x, then by the continuity of f , limk→∞ f (xk ) = f (x). Finally, since
f (xk ) ≤ c, limk→∞ f (xk ) = f (x) ≤ c. Thus Lc is closed.
If the sublevel sets of a continuous function are bounded, then by the extreme value theorem, there
exists x? ∈ Rn such that f achieves a global minimum value at x? . In general, a convex function f need
not have bounded sublevel sets. For example, the linear function f (x) = ax + b on R is convex and has
unbounded sublevel sets. Hence convexity alone does not ensure the existence of a minimizing point x⋆.
Strong convexity, however, is sufficient.
Theorem 7.5.2. The sublevel sets of strongly convex functions are bounded.
Theorem 7.5.3. Let f : Rⁿ → R be convex.
(a) If f has a local minimum at x⋆, then f has a global minimum at x⋆.
(b) The set of points at which f attains its global minimum value is convex (possibly empty).
(c) If f is strictly convex and has a local minimum, then this is the unique global minimum.
(d) If f is strongly convex, then there exists a unique x⋆ ∈ Rⁿ at which f has a local, and hence
global, minimum.
Proof. (a) Assume f has a local minimum at x⋆, with f(x⋆) = c. Then there exists r > 0 such that for all x
with ‖x − x⋆‖₂ < r, f(x) ≥ c. Suppose that for some z ∈ Rⁿ, f(z) < c. Let xα = (1 − α)x⋆ + αz with
α ∈ (0, 1). Then for α > 0 sufficiently small, ‖xα − x⋆‖₂ < r and f(xα) ≤ (1 − α)f(x⋆) + αf(z) < c; a
contradiction. Hence f has a global minimum at x⋆.
(b) If f has no local minimum, then the set of global minimizers is empty, which is a convex set. Now assume f
has a global minimum at x⋆ ∈ Rⁿ with f(x⋆) = c. The set of all points at which f attains its global minimum is the
sublevel set Lc = {x : f(x) ≤ c}. By Theorem 7.5.1, this set is convex.
(c) Assume f has a local minimum at x⋆ with f(x⋆) = c. Then by (a), f has a global minimum at x⋆. If f has
global minima at two distinct points x⋆ and y⋆, then f attains the value c at every point on the line segment joining
x⋆ and y⋆. But this violates the strict convexity of f.
(d) Since f is strongly convex, by Theorems 7.5.1 and 7.5.2 its sublevel sets are compact and convex. The continuity
of f on Rⁿ and the extreme value theorem then ensure that f achieves a minimum value over each nonempty
sublevel set. Select a sublevel set La with nonempty interior. Then f achieves a minimum value at some
point x⋆ ∈ La. If this point were on the boundary of La, then f(x⋆) = a. But La has interior points y with
f(y) < a; a contradiction. So x⋆ is an interior point of La. Hence f has a local minimum at x⋆. Since
strong convexity implies strict convexity, by (c) x⋆ is the unique point at which f has a local minimum.
In summary, convexity ensures that if f has a local minimum, it is a global minimum. Moreover, the set of
points at which f attains a global minimum is itself convex. Strict convexity allows us to conclude that f has at
most one local minimum. However, this leaves the question of the existence of a local minimum unresolved. A
strongly convex function always has a local minimum. The previous results then ensure that this is the unique
point at which f achieves a global minimum value.
Strong convexity is sufficient, but not necessary, for the existence of a local minimum. For example, the
function f(x) = x⁴ is not strongly convex, but it has a unique local minimum at x = 0.
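To make the existence and uniqueness discussion concrete, here is a minimal NumPy sketch (an illustration of mine, not from the notes) that runs gradient descent on the strongly convex quadratic f(x) = ½xᵀAx − bᵀx with A positive definite; the iterates converge to its unique global minimizer A⁻¹b. The matrices, step size, and iteration count are assumptions chosen for this toy example.

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite, so f is strongly convex
b = np.array([1.0, -1.0])

def grad(x):                              # gradient of f(x) = 0.5 x^T A x - b^T x
    return A @ x - b

x = np.zeros(2)
for _ in range(200):
    x = x - 0.1 * grad(x)                 # fixed step size, small enough for this A

print(x, np.linalg.solve(A, b))           # both are (approximately) the unique minimizer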
for y ∈ C, x? + α(y − x? ) lies in C for α ∈ [0, 1]. It follows that if x? minimizes f (x) over C, then for each
y ∈ C,
∇f (x? )T (y − x? ) ≥ 0. (7.5)
It turns out that (7.5) is both necessary and sufficient for x⋆ to minimize f over C. Here is a formal
statement of this result.

Theorem 7.5.4. Let f : Rⁿ → R be convex and differentiable, and let C ⊆ Rⁿ be a convex set. Then
x⋆ ∈ C minimizes f over C if and only if for each y ∈ C,
Df(x⋆)(y − x⋆) ≥ 0. (7.6)

Theorem 7.5.4 allows for the possibility that x⋆ is a boundary point of C. If we exclude this possibility,
then a stronger result holds.

Corollary 7.5.1. If C ⊆ Rⁿ is an open convex set, then x⋆ ∈ C minimizes f over C if and only if
Df(x⋆) = 0. (7.7)
Theorem 7.6.1. For each y ∈ Rn , there exists a unique ŷ ∈ C that is closest to y in the Euclidean norm.
Proof. If y ∈ C, then ŷ = y. Suppose y ∉ C. Since C is nonempty, there exists r > 0 such that R =
{x : ‖y − x‖₂ ≤ r} ∩ C ≠ ∅. The set R is closed and bounded, and hence a compact subset of C. The
function f(x) = ‖y − x‖₂ is a continuous function defined on Rⁿ. Hence by the extreme value theorem, f
achieves a minimum on the compact set R. So there exists ŷ ∈ R ⊆ C minimizing the Euclidean distance
to y over all points in R, and hence over all points in C. That ŷ is unique follows by noting that f(x) is a
strictly convex function and applying Theorem 7.5.3. So ŷ is the unique point in C closest to y.
Theorem 7.5.4 can be used to give the following characterization of the point ŷ. This characterization is
illustrated in Figure 7.4.

Lemma 7.6.1. Let y ∈ Rⁿ and z ∈ C. Then z = ŷ, the closest point in C to y, if and only if for each
x ∈ C, (y − z)ᵀ(x − z) ≤ 0.

Proof. f(x) = ‖y − x‖₂² is convex and differentiable with Df(x)(h) = −2(y − x)ᵀh. By Theorem 7.5.4,
z minimizes f(x) over C if and only if for each x ∈ C, Df(z)(x − z) = −2(y − z)ᵀ(x − z) ≥ 0, i.e.,
(y − z)ᵀ(x − z) ≤ 0.
Figure 7.4: Left: An illustration of the projection ŷ of a point y onto a closed convex set C. For each point x ∈ C,
the angle between y − ŷ and x − ŷ must be at least π/2. Right: An illustration of the non-expansive property of the
projection. For all points y1 , y2 , the distance between the ŷ1 and ŷ2 is at most the distance between y1 and y2 .
The projection onto C is non-expansive: for all y₁, y₂ ∈ Rⁿ, ‖ŷ₁ − ŷ₂‖₂ ≤ ‖y₁ − y₂‖₂, where ŷᵢ = PC(yᵢ).

Proof. If ŷ₁ = ŷ₂, then ‖ŷ₁ − ŷ₂‖₂ = 0 ≤ ‖y₁ − y₂‖₂ and the result holds. Hence assume ŷ₁ ≠ ŷ₂. By Lemma 7.6.1,
(y₁ − ŷ₁)ᵀ(ŷ₂ − ŷ₁) ≤ 0 and (y₂ − ŷ₂)ᵀ(ŷ₁ − ŷ₂) ≤ 0. Adding these expressions and expanding we find
(y₁ − y₂)ᵀ(ŷ₂ − ŷ₁) + ‖ŷ₂ − ŷ₁‖₂² ≤ 0.
By regrouping terms and using the Cauchy-Schwarz inequality this can be written as
‖ŷ₂ − ŷ₁‖₂² ≤ (y₂ − y₁)ᵀ(ŷ₂ − ŷ₁) ≤ ‖y₂ − y₁‖₂ ‖ŷ₂ − ŷ₁‖₂.
Dividing both sides by ‖ŷ₂ − ŷ₁‖₂ > 0 gives ‖ŷ₁ − ŷ₂‖₂ ≤ ‖y₁ − y₂‖₂.

Now consider the distance from y to C. This is given by dC(y) ≜ ‖y − PC(y)‖₂. This is a well-defined
function of y that is zero for y ∈ C and positive for y ∉ C. As one might expect, dC is a convex function.
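Here is a minimal NumPy sketch (mine, not from the notes) illustrating the projection results above, using the Euclidean unit ball as a concrete closed convex set; its projection has the simple closed form y/max(1, ‖y‖₂). The checks correspond to Lemma 7.6.1 and the non-expansive property.

import numpy as np

def proj_ball(y, r=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= r}."""
    nrm = np.linalg.norm(y)
    return y if nrm <= r else (r / nrm) * y

rng = np.random.default_rng(0)
y1, y2 = 3 * rng.normal(size=3), 3 * rng.normal(size=3)
p1, p2 = proj_ball(y1), proj_ball(y2)

# Variational inequality (Lemma 7.6.1): (y - y_hat)^T (x - y_hat) <= 0 for any x in C.
x = proj_ball(rng.normal(size=3))                     # an arbitrary point of C
print((y1 - p1) @ (x - p1) <= 1e-12)

# Non-expansiveness: ||P_C(y1) - P_C(y2)||_2 <= ||y1 - y2||_2.
print(np.linalg.norm(p1 - p2) <= np.linalg.norm(y1 - y2) + 1e-12)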
Proof. Let y₁, y₂ ∈ Rⁿ and ŷᵢ = PC(yᵢ), i = 1, 2. Then for α ∈ [0, 1], let yα = (1 − α)y₁ + αy₂ and
ŷα = (1 − α)ŷ₁ + αŷ₂. Since C is convex, ŷα ∈ C, and hence dC(yα) ≤ ‖yα − ŷα‖₂. Thus
dC(yα) ≤ ‖(1 − α)(y₁ − ŷ₁) + α(y₂ − ŷ₂)‖₂ ≤ (1 − α)‖y₁ − ŷ₁‖₂ + α‖y₂ − ŷ₂‖₂ = (1 − α)dC(y₁) + αdC(y₂).
Hence dC is convex.

α = ‖x − z‖₁ / δ. (7.8)
Now write x = (1 − α)z + αa, and use the convexity of f to obtain
Similarly, the point z lies on the line segment from x to b. So z = x + β(b − x). The 1-norm of b − x is
δ(1 + α) and that of z − x is δα, with α given by (7.8). Hence
β = δα / (δ(1 + α)) = α / (1 + α). (7.10)
Now write z = (1 − β)x + βb, and use the convexity of f and (7.10) to obtain
f(z) ≤ (1 − β)f(x) + βf(b) = (1/(1 + α)) f(x) + (α/(1 + α)) f(b).
Multiplying both sides of this equation by 1 + α and rearranging the result yields
|f(x) − f(z)| ≤ (‖x − z‖₁ / δ) |c − f(z)|.
Hence if a sequence xk converges to z, then f (xk ) converges to f (z). Thus f is continuous at z.
Proof of Theorem 7.4.1. (If) Assume that the lower bound (7.2) holds. Let x, y ∈ I, α ∈ [0, 1], and set
xα = (1 − α)x + αy. Applying (7.2) at xα with variable v we obtain f(v) ≥ f(xα) + f′(xα)(v − xα). For
v = x and v = y this yields f(x) ≥ f(xα) + f′(xα)(x − xα) and f(y) ≥ f(xα) + f′(xα)(y − xα). Hence
(1 − α)f(x) + αf(y) ≥ (1 − α)(f(xα) + f′(xα)(x − xα)) + α(f(xα) + f′(xα)(y − xα))
= f(xα) + f′(xα)[(1 − α)(x − xα) + α(y − xα)]
= f(xα).
(Only If) If y = x, the result clearly holds. Hence let y ≠ x, and α ∈ (0, 1). Since f is convex, f(x + α(y −
x)) ≤ f(x) + α(f(y) − f(x)). Thus f(x + α(y − x)) − f(x) ≤ α(f(y) − f(x)). Dividing both sides by
α gives
( (f(x + α(y − x)) − f(x)) / (α(y − x)) ) (y − x) ≤ f(y) − f(x).
Taking the limit as α → 0 yields f′(x)(y − x) ≤ f(y) − f(x).
Proof of Corollary 7.4.1. (If) This follows the proof of the corresponding part of Theorem 7.4.1. (Only
If) f is strictly convex and hence convex. Let x ∈ I. By the convexity of f, for any y ∈ I, f(y) ≥
f(x) + f′(x)(y − x). Suppose that for some y ∈ I, with y ≠ x, f(y) = f(x) + f′(x)(y − x). Then for
β ∈ (0, 1), and zβ = (1 − β)x + βy, we have
Proof of Theorem 7.4.2. (If) By Taylor's theorem, f(x + h) = f(x) + f′(x)h + ½f″(z)h², where z is a point
between x and x + h. Hence f(x + h) ≥ f(x) + f′(x)h. So at any point in I, f is bounded below by its
tangent line approximation. We can then apply Theorem 7.4.1 to conclude that f is convex.
(Only If) By convexity, f(x) = f( ((x + h) + (x − h))/2 ) ≤ ( f(x + h) + f(x − h) )/2. Hence f(x + h) − 2f(x) + f(x − h) ≥ 0.
The second derivative of f at x is found by taking the limit as h ↓ 0 of
(1/h) [ (f(x + h) − f(x))/h − (f(x) − f(x − h))/h ] = ( f(x + h) − 2f(x) + f(x − h) ) / h² ≥ 0.
Hence f″(x) ≥ 0.
Proof of Theorem 7.4.3. (If) This part of the proof follows the corresponding part of the proof of Theorem
7.4.1, except that we use the bound (7.3).
(Only If) f is convex on C. Let x, y ∈ C, and set xₜ = (1 − t)x + ty, for t ∈ R. Since C is open, there
exists an open interval (a, b) containing [0, 1], such that for t ∈ (a, b), xₜ ∈ C, with x₀ = x and x₁ = y.
Then define g(t) = f((1 − t)x + ty); g is a convex function on (a, b) with g(0) = f(x) and g(1) = f(y). Hence by
Theorem 7.4.1, for each t ∈ (a, b), g(t) ≥ g(0) + g′(0)t. This gives g(t) ≥ f(x) + Df(x)(y − x)t. Setting
t = 1 we obtain f(y) ≥ f(x) + Df(x)(y − x).
Proof of Theorem 7.4.4. (If) Let x, y ∈ C. By the multivariable version of Taylor's theorem we have
f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½ (y − x)ᵀ Hf(z) (y − x),
where z is a point on the line segment joining x and y. Since Hf (z) is PSD, it follows that f (y) ≥
f (x) + ∇f (x)T (y − x). Thus f is convex.
(Only If) Assume f is convex. Fix x ∈ C and consider an open ball B around x contained in C. Let y ∈ B
with y ≠ x. For t ∈ R, let z(t) = (1 − t)x + ty. Then there exists an interval (a, b) containing [0, 1] such
that for t ∈ (a, b), z(t) ∈ B, with z(0) = x and z(1) = y.
Now for t ∈ (a, b) define g(t) = f((1 − t)x + ty). Since f is convex on C, g is convex on (a, b). Hence
for t ∈ (a, b), g″(t) ≥ 0. By direct evaluation we find g′(t) = ∇f(z(t))ᵀ(y − x). Here we have
used the gradient representation of the derivative and omitted the dummy variable h since we are dealing
with a scalar variable t. The function ∇f(z) is a map from Rⁿ into Rⁿ. Hence its derivative is a linear map
Hf(z) from Rⁿ into Rⁿ. Thus
g″(t) = (y − x)ᵀ Hf(z(t)) (y − x) ≥ 0,
where we have used the fact that the Hessian matrix is symmetric. Evaluation at t = 0 ∈ (a, b) gives
(y − x)ᵀ Hf(x) (y − x) ≥ 0.
Since this holds for all y ∈ B, we conclude that Hf(x) is positive semidefinite.
Proof of Theorem 7.5.2. Let a ∈ R and consider the sublevel set La(f) = {x : f(x) ≤ a}. If La(f)
is empty, then it is bounded. Hence assume La(f) is nonempty and select x₀ ∈ La(f). Without loss of
generality we can assume x₀ = 0. This follows by noting that the sublevel sets of f(x) and h(x) = f(x + x₀)
are related by a translation. Hence La(f) is bounded if and only if La(h) is bounded. Moreover, since
x₀ ∈ La(f), 0 ∈ La(h).
We now prove by contradiction that La(f) is bounded. Assume that La(f) is unbounded. Then La(f)
contains an unbounded sequence of points {xₖ}ₖ≥₀. We can always ensure that the first term is x₀ = 0, and
by selecting a subsequence if necessary, that
‖x₁‖₂ ≥ 1 and ‖xₖ₊₁‖₂ ≥ ‖xₖ‖₂ for k ≥ 1. (7.12)
By radially mapping xₖ onto the unit ball centered at 0, we obtain a bounded sequence {yₖ}ₖ≥₁ with
yₖ = αₖxₖ, where αₖ = 1/‖xₖ‖₂. (7.13)
The conditions (7.12) ensure {αₖ}ₖ≥₁ ⊂ (0, 1] and that {αₖ}ₖ≥₁ monotonically decreases to 0.
Since f is strongly convex, for some c > 0, g(x) = f(x) − (c/2)‖x‖₂² is convex. The convexity of g implies
it is continuous and hence it has a finite maximum and minimum value over the unit ball centered at x0 . The
points yk are on the boundary of this ball. It follows that |g(yk )| is bounded as k → ∞.
On the other hand, using the convexity of f and k · k22 , the bound f (xk ) ≤ a, and (7.13) we find
The final term converges to −∞ as k → ∞, contradicting the fact that g takes bounded values on the
sequence {yk }. Hence the nonempty sublevel set La (f ) must be bounded.
Proof of Theorem 7.5.4. (If) For y ∈ C, (7.3) and (7.6) give f(y) − f(x⋆) ≥ Df(x⋆)(y − x⋆) ≥ 0. Hence
f(y) ≥ f(x⋆).
(Only If) Suppose x⋆ minimizes f over C. If (7.6) does not hold, then for some y ∈ C, Df(x⋆)(y − x⋆) <
0. Let h = y − x⋆ and note that for α ∈ [0, 1], x⋆ + αh = (1 − α)x⋆ + αy ∈ C. Using the definition of the
derivative we have
Df(x⋆)h = lim_{α↓0} ( f(x⋆ + αh) − f(x⋆) ) / α < 0.
By the definition of a limit, there exists α₀ > 0 such that for all 0 < α ≤ α₀, f(x⋆ + αh) − f(x⋆) < 0. For
such α, x⋆ + αh ∈ C and f(x⋆ + αh) < f(x⋆); a contradiction.

Proof of Corollary 7.5.1. (If) By Theorem 7.4.3, for each y ∈ C, f(y) ≥ f(x⋆) + Df(x⋆)(y − x⋆) = f(x⋆).
(Only If) Since x⋆ is an interior point of C, for each direction h ∈ Rⁿ there exists α > 0 such that x⋆ + αh ∈ C. Hence
for each direction h, by Theorem 7.5.4, f(x⋆ + αh) − f(x⋆) ≥ Df(x⋆)(αh) ≥ 0. But then Df(x⋆)(h) ≥ 0
and Df(x⋆)(−h) ≥ 0. Thus Df(x⋆) = 0.
Notes
The material in this chapter is standard and can be found in any modern book on optimization or convex
analysis. See, for example, the optimization books by Bertsekas [4], Boyd and Vandenberghe [7], and Chong
and Zak [9]; and the texts by Fleming [15], and Urruty and Lemaréchal [18]. The proof of Theorem 7.3.5 is
drawn (with modifications) from [15, Theorem 3.5]. For a physical interpretation of Jensen’s inequality, see
MacKay [28, §3.5].
Exercises
Convex Sets
Exercise 7.1. Prove the following basic properties of convex sets:
(a) Let Sa ⊂ Rn be a convex set for each a ∈ A. Show that ∩a∈A Sa is a convex set.
(b) Let S ⊂ Rn be a convex set and F be a linear map from Rn to Rm . Show that F (S) = {z : z = F s, s ∈ S} is
a convex set.
Convex Functions
Exercise 7.2. Let k · k be a norm on Rn . Are its sublevel sets convex? Are the sublevel sets bounded? Does k · k have
a unique global minimum?
Exercise 7.3. Let P ∈ Rn×n , b ∈ Rn , and c ∈ R. Show that the quadratic function xT P x + bT x + c is convex if and
only if P is positive semidefinite.
Exercise 7.4. Let f(x, y) be a convex function of (x, y) ∈ Rᵖ⁺ᵠ with x ∈ Rᵖ and y ∈ Rᵠ. Show that for each
x₀ ∈ Rᵖ the function g_{x₀}(y) ≜ f(x₀, y) is a convex function of y ∈ Rᵠ.
Exercise 7.5 (Epigraph). Let S ⊆ Rⁿ, and f : S → R. The epigraph of f is the set of points
epi(f) ≜ {(x, v) ∈ S × R : f(x) ≤ v}.
Show that:
(a) If S is a convex set and f is a convex function, then epi(f ) is a convex subset of Rn+1
(b) If epi(f ) is a convex set, then S is a convex set and f is a convex function.
Exercise 7.6 (Jensen's Inequality). Let C be a convex set, and f : C → R be a convex function. Show that for any
{xᵢ}ᵏᵢ₌₁ ⊂ C, and {αᵢ ≥ 0}ᵏᵢ₌₁ with Σᵏᵢ₌₁ αᵢ = 1,
f( Σᵏᵢ₌₁ αᵢxᵢ ) ≤ Σᵏᵢ₌₁ αᵢ f(xᵢ).
Exercise 7.7. For {Aⱼ}ᵏⱼ₌₁ ⊂ Rᵐˣⁿ, show that σ₁²( Σⱼ Aⱼ ) ≤ Σⱼ σ₁²(Aⱼ).
Exercise 7.8. Let fj : Rn → R be a strongly convex function, j = 1, 2, 3. Assuming it is non-empty, give a labelled
conceptual sketch of the region {x : fj (x) ≤ βj , j = 1, 2, 3} and indicate its key properties.
Exercise 7.10. Determine general sufficient conditions (if any exist) under which the indicated function is convex.
(a) f : Rⁿ → R with f(x) = (xᵀQx)ʳ. Here Q ∈ Rⁿˣⁿ is symmetric PSD.
(b) f : Rⁿ → R with f(x) = 1 + e^{( Σⁿᵢ₌₁ |x(i)| )ʳ}.
(c) f : Rᵐˣⁿ → R with f(A) = σ₁(A), where σ₁(A) is the maximum singular value of A.
(d) f : Rᵐˣⁿ → R with f(A) = Σⱼ σⱼ(A), where σⱼ(A) is the j-th singular value of A.
Exercise 7.11. Let C = {x ∈ Rn : x(i) > 0, i ∈ [1 : n]} and for x ∈ C, let ln(x) = [ln(x(i))] ∈ Rn . Prove or
disprove: f (x) = xT ln(x) is a convex function on the set C.
Exercise 7.12. Let p : Rⁿ → R be the function p(x) = e^{−γ‖x‖₂²} and f : Rⁿ → R be a convex function such that for
each x ∈ Rⁿ, the function f(u)p(u − x) is integrable. Then we can "smooth" f by convolution with p to obtain
g(x) = ∫_{Rⁿ} f(u) p(x − u) du.
Strict Convexity
Exercise 7.20. Let f, g : Rn → R with f strictly convex, and g convex. Show that:
(a) For α > 0, αf (x) is strictly convex.
(b) f(x) + g(x) is strictly convex.
Exercise 7.21. Show that if f, g : Rn → R are both strictly convex, then h(x) = max{f (x), g(x)} is strictly convex.
Exercise 7.22. Show that if f : Rn → R is strictly convex, A ∈ Rn×m has rank m, and b ∈ Rm , then h(x) =
f (Ax + b) is strictly convex.
Exercise 7.23. Assume that g has the properties stated below on the image of f . Show that:
(a) If f is strictly convex, and g is convex and strictly increasing, then h(x) = g(f (x)) is strictly convex.
(b) If f is convex, and g is strictly convex and nondecreasing, then h(x) = g(f(x)) is strictly convex.
Exercise 7.24. Show that for symmetric P ∈ Rn×n , xT P x is strictly convex on Rn if and only if P is positive
definite. Similarly, show that for a ∈ Rn and b ∈ R, the quadratic function xT P x + aT x + b is strictly convex if and
only if P is positive definite.
Exercise 7.25. Let f : I → R be twice differentiable on an open interval I. Show that if for all x ∈ I, f 00 (x) > 0,
then f is strictly convex on I.
Strong Convexity
Exercise 7.26. Show that a strongly convex function is strictly convex.
Exercise 7.27. Show that f : Rⁿ → R is strongly convex with modulus c if and only if for x, y ∈ Rⁿ and α ∈ [0, 1],
f((1 − α)x + αy) ≤ (1 − α)f(x) + αf(y) − (c/2)α(1 − α)‖x − y‖₂².
Exercise 7.28. Show that if f (x) is strongly convex with modulus c > 0, then f (x) is strongly convex with modulus
β for 0 < β ≤ c.
Proximal Operators
Exercise 7.41. Let f : Rⁿ → R be a convex function, x ∈ Rⁿ, and λ > 0. The function
P_f(x) ≜ arg min_{v ∈ Rⁿ} f(v) + (1/(2λ)) ‖x − v‖₂²
is called a proximal operator. Show that the problem on the RHS always has a unique solution.
Exercise 7.42. Let λ > 0 and ‖x‖₁ = Σⁿⱼ₌₁ |xⱼ|. Obtain an analytical expression for the proximal operator
P₁(x) = arg min_{v ∈ Rⁿ} ‖v‖₁ + (1/(2λ)) ‖x − v‖₂².
Chapter 8
8.1 Preliminaries
8.1.1 Linear Manifolds
Lemma 8.1.1. Let U1 , U2 ∈ Vn,d both be representations for the d-dimensional subspace U ⊂ Rn .
Then there exists Q ∈ O_d with U₂ = U₁Q and U₁ = U₂Qᵀ.
Proof. Let U₁, U₂ ∈ Rⁿˣᵈ be two orthonormal bases for the d-dimensional subspace U. Since U₁ is a basis
for U and every column of U₂ lies in U, there must exist a matrix Q ∈ Rᵈˣᵈ such that U₂ = U₁Q. It follows
that Q = U₁ᵀU₂. Using U₁U₁ᵀU₂ = U₂ and U₂U₂ᵀU₁ = U₁, we then have
QᵀQ = U₂ᵀU₁U₁ᵀU₂ = U₂ᵀU₂ = I_d and QQᵀ = U₁ᵀU₂U₂ᵀU₁ = U₁ᵀU₁ = I_d.
Hence Q ∈ O_d, and multiplying U₂ = U₁Q on the right by Qᵀ gives U₁ = U₂Qᵀ.
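A quick NumPy sketch of Lemma 8.1.1 (an illustration of mine, with arbitrary dimensions): two orthonormal bases of the same subspace differ by an orthogonal factor, recovered as Q = U₁ᵀU₂.

import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
U1, _ = np.linalg.qr(rng.normal(size=(n, d)))   # orthonormal basis for a d-dimensional subspace
Q0, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random d x d orthogonal matrix
U2 = U1 @ Q0                                    # a second orthonormal basis for the same subspace

Q = U1.T @ U2                                   # as in the proof
print(np.allclose(Q.T @ Q, np.eye(d)))          # Q is orthogonal
print(np.allclose(U1, U2 @ Q.T))                # and U1 = U2 Q^T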
xⱼ = z + ûⱼ + εⱼ
= v + u + U Uᵀ(xⱼ − (u + v)) + (I − U Uᵀ)(xⱼ − (u + v))
= v + U Uᵀ(xⱼ − v) + (I − U Uᵀ)(xⱼ − v).
So only the component of z in U ⊥ plays a role in determining the resulting residuals. Moreover, the sum of
squared residuals is
Σᵐⱼ₌₁ ‖εⱼ‖₂² = Σᵐⱼ₌₁ ‖(I − U Uᵀ)xⱼ − v‖₂² = Σᵐⱼ₌₁ ‖pⱼ − v‖₂²,
where pj = (I − U U T )xj , j ∈ [1 : m]. The minimization of the above expression over v is a simple calculus
problem with solution
v = (1/m) Σᵐⱼ₌₁ pⱼ = (I − U Uᵀ) ( (1/m) Σᵐⱼ₌₁ xⱼ ) = (I − U Uᵀ)μ̂,
Figure 8.1: A linear manifold µ̂ + U and its subspace U in R2 . Here U = span(u) with kuk2 = 1.
where µ̂ is the empirical mean of the data. So given a fixed U with orthonormal basis U , an optimal z is
obtained by setting v = (I − U U T )µ̂, letting u ∈ U be arbitrary, and setting z = v + u. In particular, it is
convenient to select u = U U T µ̂ since the resulting z is then independent of the selection of U:
z = U U T µ̂ + (I − U U T )µ̂ = µ̂.
Problem 8.1 is an optimization problem over the Stiefel manifold. We know that it does not have a unique
solution since if U is a solution, then U Q is a solution for every Q ∈ Od . These solutions correspond to
different parameterizations of the same subspace. However, beyond this obvious non-uniqueness, it may be
possible that two distinct subspaces are both optimal projection subspaces of dimension d. We will examine
that issue in due course.
We can rewrite the problem 8.1 to make the constraint U ∈ Vn,d explicit as follows:
min_{U ∈ Rⁿˣᵈ} ‖X − U UᵀX‖_F²  s.t. UᵀU = I_d. (8.2)
Let S = XX T ∈ Rn×n denote the scatter matrix of the data. Then we can equivalently solve:
max_{U ∈ Rⁿˣᵈ} trace(UᵀSU)  s.t. UᵀU = I_d. (8.3)
This is a matrix Rayleigh quotient problem. The simplest version with d = 1 is easily solved. The solution
is to take u to be a unit norm eigenvector corresponding to a maximum eigenvalue of S. The solution of the
general case with d > 1 is only slightly more complicated. One solution is obtained by taking the columns
of U to be d orthonormal eigenvectors corresponding to the d largest eigenvalues of S. Then for all Q ∈ Od ,
U ? = U Q is also a solution (Theorem 6.3.1).
It follows that a solution U⋆ to (8.3) is obtained by selecting the columns of U⋆ to be a set of orthonormal
eigenvectors of S = XX T corresponding to its d largest eigenvalues. Working backwards, we see that U ? is
then also a solution to (8.2). Any basis of the form U ? Q with Q ∈ Od also spans the same optimal subspace
U ? . In general, the subspace U ? need not be unique. To see this, consider the situation when λd = λd+1 .
When this holds, the selection of a d-th eigenvector in U ? is not unique. However, aside from this very
special situation, U ? is unique.
In summary, a solution to problem (8.2) is obtained as follows. Find the d largest eigenvalues of the
scatter matrix S = XX T and a corresponding set of orthonormal eigenvectors U ? . Then over all d dimen-
sional subspaces, U ? = R(U ? ) minimizes the sum of the squared norms of the projection residuals. Here
are some other important observations:
(1) The case d = 0 merits comment. When d = 0, U = 0 and M = {z}. So we project the data to
a single point z (the translation vector). Hence we seek z ∈ Rⁿ that minimizes the sum of squared
distances Σᵐⱼ₌₁ ‖z − xⱼ‖₂². The solution is the empirical mean of the data μ̂ = (1/m) Σᵐⱼ₌₁ xⱼ. This is
consistent with our result that xⱼ = μ̂ + ûⱼ + εⱼ, since in this case ûⱼ ∈ U = 0.
(2) For d = 1 we find the unit norm eigenvector for the largest eigenvalue of the scatter matrix of the
centered data. Let this be u. Then the best approximation linear manifold is the straight line µ̂ +
span{u}. This is illustrated using synthetic data in Figure 8.2.
(3) As we vary d we obtain a nested set of optimal projection subspaces U0? ⊂ U1? ⊂ · · · ⊂ Ur? where r is
the rank of the centered data matrix. Thus the best approximation linear manifolds M?d = µ̂ + Ud? are
also nested. Hence for each data point xj there is a sequence of progressively refined approximations:
Here ui is the i-th unit norm eigenvector of the scatter matrix of the centered data, listed in order of
decreasing eigenvalues.
Figure 8.2: Optimal data approximation using a linear manifold in R2 . When d = 0, the best approximation linear
manifold is the empirical mean µ̂. When d = 1 it is the 1-dimensional linear manifold M shown, and when d = 2 it
is R2 .
(2) By projecting xⱼ − μ̂ to x̂ⱼ = U⋆(U⋆)ᵀ(xⱼ − μ̂) we obtain an approximation of the centered data
as points in the subspace U⋆. Each x̂ⱼ is exactly represented by its coordinates aⱼ = (U⋆)ᵀ(xⱼ − μ̂) with
respect to the orthonormal basis U⋆. So by forming a d-dimensional approximation to the data we have
effectively reduced the dimension of its representation from n to d < n. This is an example of
dimensionality reduction.
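The following NumPy sketch (mine, on synthetic data) carries out the steps above: center the data, form the scatter matrix, take the top-d orthonormal eigenvectors, and map each example to its d coordinates.

import numpy as np

rng = np.random.default_rng(0)
n, m, d = 5, 200, 2
X = rng.normal(size=(n, m)) * np.array([3.0, 2.0, 0.3, 0.2, 0.1])[:, None]  # synthetic data

mu = X.mean(axis=1, keepdims=True)          # empirical mean mu_hat
Xc = X - mu                                 # centered data
S = Xc @ Xc.T                               # scatter matrix

evals, evecs = np.linalg.eigh(S)            # eigenvalues in ascending order
U = evecs[:, ::-1][:, :d]                   # top-d orthonormal eigenvectors (a choice of U*)
A = U.T @ Xc                                # d x m matrix of coordinates a_j
X_hat = U @ A                               # projections of the centered data onto the subspace

print(np.linalg.norm(Xc - X_hat, 'fro')**2) # sum of squared residuals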
Since confusion is unlikely, we have dropped the hat on R for notational simplicity. We can use R to write
an expression for the empirical variance of the data in any unit norm direction u: this is uᵀRu.
The direction u in which the data has maximum variance is then obtained by solving the problem:
arg max_{u ∈ Rⁿ} uᵀRu  s.t. uᵀu = 1, (8.6)
with R a symmetric positive semidefinite matrix. This is a simple Rayleigh quotient problem with d = 1.
The solution is to take u = v1 , where v1 is a unit norm eigenvector for a maximum eigenvalue σ12 of R.
We must take care if we want to find two directions of largest variance. Without any constraint, the
second direction can be arbitrarily close to v₁ and yield variance near σ₁². One way to prevent this is to
constrain the second direction to be orthogonal to the first. Then if we want a third direction, constrain it
to be orthogonal to the two previous directions, and so on. In this case, for d orthogonal directions we want
to find U = [u₁, . . . , u_d] ∈ V_{n,d} to maximize Σᵈⱼ₌₁ uⱼᵀRuⱼ = trace(UᵀRU). Hence we want to solve
problem (8.3) with S = R. As discussed previously, one solution is attained by taking the d directions to be
unit norm eigenvectors v₁, . . . , v_d for the largest d eigenvalues of R.
By this means you see that we can obtain n orthonormal directions of maximum empirical variance in
the data. These directions v₁, v₂, . . . , vₙ and the corresponding empirical variances σ₁² ≥ σ₂² ≥ · · · ≥ σₙ²
are eigenvectors and corresponding eigenvalues of R: Rvⱼ = σⱼ²vⱼ, j ∈ [1 : n]. The vectors vⱼ are called the
principal components of the data, and this decomposition is called Principal Components Analysis (PCA).
Let V be the matrix with the vⱼ as its columns, and Σ² = diag(σ₁², . . . , σₙ²) (note σ₁² ≥ σ₂² ≥ · · · ≥ σₙ²).
Then PCA is an ordered eigen-decomposition of the empirical covariance matrix: R = V Σ²Vᵀ.
There is a clear connection between PCA and finding a subspace that minimizes the sum of squared
norms of the residuals. We can see this by noting that the sample covariance is just a scalar multiple of the
scatter matrix XX T :
R = (1/m) Σᵐⱼ₌₁ xⱼxⱼᵀ = (1/m) X Xᵀ.
Hence the principal components are the unit norm eigenvectors of S = XX T listed in order of decreasing
eigenvalues. In particular, the first d principal components are the first d unit norm eigenvectors (ordered
by eigenvalue) of XX T . This is an orthonormal basis that defines an optimal d-dimensional projection
subspace U. Thus the leading d principal components give a particular orthonormal basis for an optimal
d-dimensional projection subspace.
A direction in which the data has small variance relative to σ₁² may not be an important direction;
after all, the data stays close to the mean in this direction. If one accepts this hypothesis, then the directions
of the largest variance are the important directions. These directions explain most of the variance in the data.
This hypothesis suggests that we could select an integer d < rank(R) and project the centered data onto
the d directions of largest variance. Let Vd = [v1 , v2 , . . . , vd ]. Then the projection of the centered data onto
the span of the columns of Vd is x̂j = Vd (VdT xj ). The term aj = VdT xj gives the coordinates of xj with
respect to Vd , and the product Vd aj synthesizes x̂j using these to form the appropriate linear combination of
the columns of Vd .
Here is a critical observation: since the directions are fixed and known, we do not need to form x̂j .
Instead we can simply map xj to the coordinate vector aj ∈ Rd . We lose no information in working with
aj instead of x̂j since the latter is an invertible linear function of the former. Hence {aj }m j=1 gives a new set
of data that captures most of the variance in the original data, and lies in the reduced dimension space Rd
(d ≤ rank(R) ≤ n).
We now address the question of how to select the dimension d. The selection of d involves a tradeoff
between dimensionality reduction and the amount of captured variance in the resulting approximation. The
"variance captured" is ν² = Σᵈⱼ₌₁ σⱼ² and the "residual variance" is ρ² = Σⁿⱼ₌ᵈ₊₁ σⱼ². Reducing d reduces
ν² and increases ρ². The selection of d thus involves determining the fraction of the total variance that
ensures the projected data is useful for the task at hand. For example, if the projected data is to be used to
learn a classifier, then d can be selected to yield acceptable (or perhaps best) classifier performance using
cross-validation.
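Here is a small sketch (mine, with an arbitrary 0.95 threshold and synthetic data) of the captured-variance criterion: pick the smallest d whose cumulative fraction of variance exceeds a chosen level.

import numpy as np

def choose_d(Xc, frac=0.95):
    """Smallest d whose captured variance nu^2 is at least frac of the total."""
    R = (Xc @ Xc.T) / Xc.shape[1]                   # empirical covariance of centered data
    sig2 = np.sort(np.linalg.eigvalsh(R))[::-1]     # sigma_1^2 >= ... >= sigma_n^2
    cum = np.cumsum(sig2) / np.sum(sig2)            # fraction of variance captured
    return int(np.searchsorted(cum, frac) + 1)

rng = np.random.default_rng(0)
Xc = rng.normal(size=(5, 500)) * np.array([3.0, 2.0, 0.3, 0.2, 0.1])[:, None]
Xc = Xc - Xc.mean(axis=1, keepdims=True)
print(choose_d(Xc, 0.95))                           # typically 2 for this synthetic data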
S = XX T = U ΣV T V ΣU T = U Σ2 U T .
Hence the principal components with nonzero variances are the r left singular vectors of X (the columns of U),
and the variance of the data in the direction of the j-th principal component uⱼ is (1/m)σⱼ², j ∈ [1 : r].
Now let d ≤ r, and write U = [Ud Ur−d ] and V = [Vd Vr−d ]. Similarly, let Σd be the top left
d × d submatrix of Σ, and Σr−d denote its bottom right (r − d) × (r − d) submatrix. To form a d-
dimensional approximation of the centered data we project X onto the subspace spanned by its first d
principal components:
U_d U_dᵀ X = U_d U_dᵀ U Σ Vᵀ = U_d [ I_d  0_{d×(r−d)} ] [ Σ_d  0 ; 0  Σ_{r−d} ] [ V_dᵀ ; V_{r−d}ᵀ ] = U_d Σ_d V_dᵀ.
Hence PCA projection of the data to d dimensions is equivalent to finding the best rank d approximation to
the centered data matrix X.
Here are some other important points to note from the above expression:
(1) If we know we want the d-dimensional PCA projection of X, then we need only compute the best
rank d approximation U_d Σ_d V_dᵀ to the centered data X.
(2) The d-dimensional coordinates of the projected points with respect to the basis U_d are given in the
columns of the d × m matrix Σ_d V_dᵀ. So the coordinates can be found directly from a compact SVD
of X, or from a compact SVD of its best rank d approximation.
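As a minimal illustration (mine, synthetic data), the same computation via a compact SVD of the centered data: the left singular vectors give the principal directions and Σ_d V_dᵀ gives the d-dimensional coordinates.

import numpy as np

rng = np.random.default_rng(0)
Xc = rng.normal(size=(5, 100))
Xc = Xc - Xc.mean(axis=1, keepdims=True)            # centered data, n x m

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # compact SVD: Xc = U diag(s) Vt
d = 2
Ud, sd, Vtd = U[:, :d], s[:d], Vt[:d, :]

coords = np.diag(sd) @ Vtd                          # d x m coordinate matrix Sigma_d V_d^T
X_rank_d = Ud @ coords                              # best rank-d approximation U_d Sigma_d V_d^T
print(np.allclose(X_rank_d, Ud @ Ud.T @ Xc))        # same as projecting Xc onto span(U_d)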
Exercises
Exercise 8.1. Let x₁, . . . , xₘ ∈ R. We seek the point z⋆ ∈ R that is "closest" to the set of points {xᵢ}ᵐᵢ₌₁.
(a) If we measure closeness using squared error, then we seek z⋆ = arg min_{z∈R} Σᵐᵢ₌₁ (z − xᵢ)². In this case show
that z⋆ = (1/m) Σᵐᵢ₌₁ xᵢ. This is the empirical mean of the data points.
(b) If we measure closeness using absolute error, then we seek z⋆ = arg min_{z∈R} Σᵐᵢ₌₁ |z − xᵢ|. Show that in this
case z⋆ is the median of the data points.
Exercise 8.2. Let x₁, . . . , xₘ ∈ Rⁿ. We seek the point z⋆ ∈ Rⁿ that is "closest" to the set of points {xᵢ}ᵐᵢ₌₁.
(a) If we measure closeness using squared error, then we seek z⋆ = arg min_{z∈Rⁿ} Σᵐᵢ₌₁ ‖z − xᵢ‖₂². In this case
show that z⋆ = (1/m) Σᵐᵢ₌₁ xᵢ. This is the empirical mean of the data points.
(b) If we measure closeness using absolute error, then we seek z⋆ = arg min_{z∈Rⁿ} Σᵐᵢ₌₁ ‖z − xᵢ‖₁. Show that in
this case z⋆(i) = median{xⱼ(i)}ᵐⱼ₌₁. This is the vector median of the data points.
Exercise 8.3. Let {xⱼ}ᵐⱼ₌₁ be a dataset of interest and set X = [x₁, . . . , xₘ] ∈ Rⁿˣᵐ. Then μ̂ = (1/m)X1ₘ. Let X̃
denote the corresponding matrix of centered data and u = (1/√m)1ₘ. Show that X̃ = X(Iₘ − uuᵀ). Hence centering
the data amounts to multiplying X on the right by the projection matrix Iₘ − uuᵀ.

For P ∈ Rⁿˣⁿ and x ≠ 0, consider the Rayleigh quotient
R_P(x) = (xᵀPx) / (xᵀx).
If P is symmetric, show that λ_min(P) ≤ R_P(x) ≤ λ_max(P).
Chapter 9
L = {f : f (x) = wT x, w ∈ Rn }.
Based on a finite set of training examples {(xj , yj ) ∈ Rn × R}m j=1 , we want to select f ∈ L so that the
linear function f (x) best approximates the relationship between the input variable x ∈ Rn and the output
variable y ∈ R. Since L is parameterized by the vector variable w ∈ Rn , this is equivalent to selecting ŵ so
that f (x) = ŵT x achieves the above goal.
Let X = [x1 , . . . , xm ] ∈ Rn×m be the matrix of training examples (input vectors) and y = [yj ] ∈ Rm
be the vector of corresponding target values (output values). Each row of X gives the observed values of
one feature of the data examples. For example, in a medical context xi (1) might be a measurement of heart
rate, xi (2) of blood pressure, and so on. In this case, the first row of X gives the values of the heart rate
feature across all examples, and the second row gives the values of the blood pressure feature, and so on.
We call the rows of X the feature vectors.
For a given w ∈ Rn , the vector of predicted values ŷ ∈ Rm , and the corresponding vector of prediction
errors ε ∈ Rm on the training data, are given by
ŷ = X T w (9.1)
T
ε = y − ŷ = y − X w. (9.2)
Each row in Xᵀ is a training example, and each column of Xᵀ is a feature vector. So ŷ is formed as a linear
combination of the feature vectors. The error vector ε is the part of y that is "unexplained" by Xᵀw. It is often
called the residual.
To learn a w that achieves the “best” prediction performance, we need to measure performance using a
function of w ∈ Rn . This can be done, for example, by taking a norm of the residual vector on the training
data. What is ultimately important, however, is the prediction error on held-out testing examples. This is
the testing error or generalization error.
The problem described above is called linear regression. Linear regression finds the “best approxima-
tion” to the target vector y as a linear combination of the feature vectors f1 , . . . , fn (the columns of X T ) by
minimizing a cost function of the residual ε = y − X T w on the training data. In this context, the matrix X T
is often called the regression matrix and a column of X T is called a regressor.
The matrix √P is the unique symmetric, positive definite square root of P. Using √P we can write
‖y − Fw‖²_P = (y − Fw)ᵀP(y − Fw) = ‖√P y − √P F w‖₂² = ‖ỹ − F̃w‖₂².
This reduces (9.5) to a standard least squares problem with a modified regressor matrix F̃ = √P F and
target vector ỹ = √P y.
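A hedged NumPy sketch of this reduction (mine, on random data). It uses a Cholesky factor L with P = LLᵀ in place of the symmetric square root; any factor with LLᵀ = P gives the same solution, which the check against the weighted normal equations confirms.

import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
F = rng.normal(size=(m, n))
y = rng.normal(size=m)
P = np.diag(rng.uniform(0.5, 2.0, size=m))    # a positive definite weight matrix

L = np.linalg.cholesky(P)                     # P = L L^T; L^T plays the role of sqrt(P) here
F_t, y_t = L.T @ F, L.T @ y                   # modified regressor matrix and target
w = np.linalg.lstsq(F_t, y_t, rcond=None)[0]  # standard least squares on the transformed problem

# Check against the weighted normal equations F^T P F w = F^T P y.
print(np.allclose(F.T @ P @ F @ w, F.T @ P @ y))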
Here F1 ∈ Rm1 ×n , F2 ∈ Rm2 ×n , y1 ∈ Rm1 , and y2 ∈ Rm2 . Noting that the sum kF1 w−y1 k22 +kF2 w−y2 k22
is just a sum of squares we can write:
‖F₁w − y₁‖₂² + ‖F₂w − y₂‖₂² = ‖ [ F₁w − y₁ ; F₂w − y₂ ] ‖₂² = ‖ [ F₁ ; F₂ ] w − [ y₁ ; y₂ ] ‖₂² = ‖F̃w − ỹ‖₂²,
where
F̃ = [ F₁ ; F₂ ] ∈ R^{(m₁+m₂)×n} and ỹ = [ y₁ ; y₂ ] ∈ R^{m₁+m₂}
(the blocks are stacked vertically). This reduces (9.6) to a standard least squares problem with an augmented
regression matrix F̃ and target vector ỹ. A similar transformation can be applied if the objective is a finite
sum of quadratic terms: Σᵏⱼ₌₁ ‖Fⱼw − yⱼ‖₂².
Here the scalar λ > 0 is selected to appropriately balance the competing objectives of minimizing the
residual squared error while keeping w small. Notice that (9.7) is a special case of (9.6) with F₁ = F,
y₁ = y, F₂ = √λ Iₙ and y₂ = 0. Hence problem (9.7) can be transformed into a standard least squares
problem with the objective ‖F̃w − ỹ‖₂², where
F̃ = [ F ; √λ Iₙ ] ∈ R^{(m+n)×n} and ỹ = [ y ; 0 ] ∈ R^{m+n}.
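A minimal sketch (mine) of this stacking trick: augment F and y, call a standard least squares solver, and compare with the closed form (FᵀF + λI)⁻¹Fᵀy that the augmented normal equations give.

import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 15, 4, 0.5
F = rng.normal(size=(m, n))
y = rng.normal(size=m)

F_aug = np.vstack([F, np.sqrt(lam) * np.eye(n)])   # augmented regression matrix F~
y_aug = np.concatenate([y, np.zeros(n)])           # augmented target y~

w_aug = np.linalg.lstsq(F_aug, y_aug, rcond=None)[0]
w_closed = np.linalg.solve(F.T @ F + lam * np.eye(n), F.T @ y)
print(np.allclose(w_aug, w_closed))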
At a stationary point, equality in this expression must hold for all h ∈ Rn . Hence for w? to be a solution it
is necessary that
F T F w? = F T y. (9.8)
These are called the normal equations. The convexity of the objective function and Corollary 7.5.1 ensure
that a solution w? of the normal equations is a solution of the least squares problem. Hence any vector
satisfying (9.8) is called a least squares solution of (9.3).
All least squares solutions have the following fundamental property.
Lemma 9.2.1. Let w? be a solution of the normal equations (9.8), and set ŷ = F w? and ε = y − ŷ.
Then
y = ŷ + ε with ŷ ∈ R(F ) and ε ∈ R(F )⊥ .
By Lemma 9.2.1, ŷ is the unique orthogonal projection of y onto R(F ) (the span of the feature vectors)
and ε is the orthogonal residual. Every least squares solution w? gives an exact representation of ŷ as a
linear combination of the columns of F . Hence w? is unique if and only if the columns of F (the feature
vectors) are linearly independent. This requires rank(F) = n ≤ m. So uniqueness requires at least as many examples
as features. One can readily show that the columns of F are linearly independent if and only if FᵀF is
invertible. In that case,
w? = (F T F )−1 F T y.
On the other hand, if the columns of F are linearly dependent (rank(F ) = r < n), then N (F ) is nontrivial,
and there are infinitely many solutions w? , each giving a different representation of the same point ŷ.
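For the full column rank case, here is a small NumPy check (mine, random data): the normal equations and a standard least squares solver agree, and the residual is orthogonal to the columns of F as Lemma 9.2.1 states.

import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 4
F = rng.normal(size=(m, n))        # columns linearly independent with probability one
y = rng.normal(size=m)

w_ne = np.linalg.solve(F.T @ F, F.T @ y)          # normal equations (F^T F) w = F^T y
w_ls = np.linalg.lstsq(F, y, rcond=None)[0]       # QR/SVD based solver
print(np.allclose(w_ne, w_ls))

# The residual is orthogonal to the columns of F.
print(np.allclose(F.T @ (y - F @ w_ne), 0))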
In summary, finding the solution of a standard least squares problem involves two operations:
(a) Projection: y is orthogonally projected onto R(F ) to yield the unique vector ŷ.
V ΣU T U ΣV T w = V ΣU T y.
V V T w? = V Σ−1 U T y. (9.9)
If the columns of F are linearly independent, N (F ) = 0 and N (F )⊥ = Rn . Recall that the columns of V
span N (F )⊥ . So in this case, V ∈ On , V V T w? = w? and the unique least squares solution is given by
w? = V Σ−1 U T y. (9.10)
On the other hand, if the columns of F are linearly dependent then N (F ) is nontrivial. In this case, if w? is
a solution of the normal equations, then so is w? + v for every v ∈ N (F ). Conversely, if w is a solution of
the normal equations, then F T F (w − w? ) = 0. So w = w? + v with v ∈ N (F ). Hence the set of solutions
is a linear manifold formed by translating the subspace N (F ) by a particular solution: w? + N (F ).
We claim that the solution manifold always contains a unique solution w⋆_ln of least norm. To see this,
note that the manifold is a convex set, and ‖w‖₂² is a strongly convex function on this set. Hence by Theorem
7.5.3, there is a unique point of least norm on the solution manifold. What is less obvious is that (9.10) is
the least norm solution.
Theorem 9.2.1. Let UΣVᵀ be a compact SVD of F. Then w⋆_ln = VΣ⁻¹Uᵀy.
Proof. Let w̃ = VΣ⁻¹Uᵀy. We first show that w̃ is a solution. This follows by the expansion
‖y − Fw‖₂² = ‖(I − UUᵀ)y‖₂² + ‖UUᵀy − UΣVᵀw‖₂².
The first term is a constant and the second is made zero by setting w = w̃. Hence w̃ achieves the minimum of
the least squares cost. Thus it must be a solution. It follows that we can write w⋆_ln = w̃ + w₀ for some w₀ ∈
N(F). From the definition of w̃, note that w̃ ∈ N(F)⊥. Hence by Pythagoras, ‖w⋆_ln‖₂² = ‖w̃‖₂² + ‖w₀‖₂².
So ‖w⋆_ln‖₂² ≥ ‖w̃‖₂². Since w⋆_ln is the least norm solution, we must have w₀ = 0, and hence w⋆_ln = w̃.
So when the columns of F are linearly independent, (9.10) gives the unique solution, and when the
columns are linearly dependent, it gives the least norm solution. The matrix F + = V Σ−1 U T is the Moore-
Penrose pseudo-inverse of F (see Exercise 5.16).
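A quick numerical illustration (mine) of the rank-deficient case: form the least norm solution from a compact SVD and check that it satisfies the normal equations and matches NumPy's minimum-norm solver.

import numpy as np

rng = np.random.default_rng(0)
m, n, r = 10, 6, 3
F = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank r < n, so N(F) is nontrivial
y = rng.normal(size=m)

U, s, Vt = np.linalg.svd(F, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]                   # compact SVD
w_ln = Vt.T @ ((U.T @ y) / s)                           # V Sigma^{-1} U^T y

print(np.allclose(w_ln, np.linalg.lstsq(F, y, rcond=None)[0]))   # lstsq returns the least norm solution
print(np.allclose(F.T @ F @ w_ln, F.T @ y))                      # it satisfies the normal equations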
An Alternative Parameterization
Here is a corollary to Theorem 9.2.1. Recall that F = Xᵀ, where X ∈ Rⁿˣᵐ is the matrix with the
training examples as its columns.

Corollary 9.2.1. The least norm solution w⋆_ln (including the least squares solution when it is unique)
lies in the span of the training examples: w⋆_ln ∈ R(X). Hence w⋆_ln = Xa⋆ for some a⋆ ∈ Rᵐ.
At first sight, this reformulation of the problem might seem to offer little advantage. But notice that the
reformulated problem uses the m × m Gram matrix X T X of the examples. If the number of examples m
is significantly less than the dimension of the data n, that could be useful. In addition, the computations in
(9.11) only require taking inner products of training examples (X T X), and of the training examples with a
test example (X T x). That also turns out to be potentially useful.
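Equation (9.11) itself is not reproduced in this extract, so here is my own minimal illustration of the idea for the underdetermined case (m < n with linearly independent examples, an assumption of this sketch): the least norm solution can be written as w = Xa with a obtained from the m × m Gram matrix, and predictions use only inner products with the training examples.

import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 10                       # more features than examples
X = rng.normal(size=(n, m))         # training examples in the columns
y = rng.normal(size=m)
F = X.T

G = X.T @ X                         # m x m Gram matrix of the examples
a = np.linalg.solve(G, y)           # coefficients of w in the span of the examples
w = X @ a                           # w = X a

print(np.allclose(w, np.linalg.pinv(F) @ y))        # matches the least norm solution

# Prediction on a test point uses only inner products with the training examples.
x_test = rng.normal(size=n)
print(np.allclose(w @ x_test, a @ (X.T @ x_test)))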
projecting y onto the r left singular vectors U T y, then using V Σ−1 to map these coordinates to the vector
wln? ∈ N (F )⊥ . If the columns of F are almost linearly dependent, we expect the features to be very
close to a subspace of dimension d < m in Rm . A natural candidate for this subspace is obtained by a
rank d approximation of F . Let Ud , Vd denote the matrices consisting of the first d columns of U and V
respectively, and Σd denote the top left d × d submatrix of Σ. Then the least squares solution using the rank
d approximation Ud Σd VdT to F is
w⋆_d = V_d Σ_d⁻¹ U_dᵀ y = Σᵈⱼ₌₁ (1/σⱼ) vⱼ uⱼᵀ y.
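A brief NumPy sketch (mine, random data) of this truncated-SVD solution: keep only the d largest singular values and compare the resulting w⋆_d with the full least norm solution.

import numpy as np

rng = np.random.default_rng(0)
m, n, d = 40, 8, 3
F = rng.normal(size=(m, n))
y = rng.normal(size=m)

U, s, Vt = np.linalg.svd(F, full_matrices=False)
w_d = Vt[:d, :].T @ ((U[:, :d].T @ y) / s[:d])      # V_d Sigma_d^{-1} U_d^T y
w_full = np.linalg.pinv(F) @ y                      # least norm solution (d = rank F)

print(np.linalg.norm(w_d), np.linalg.norm(w_full))  # truncation typically shrinks the solution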
new term to the objective function that ensures a unique solution. For example, if prior knowledge suggests
that w⋆ is likely to be small, one can modify the least squares problem to min_{w∈Rⁿ} ‖y − Fw‖₂² + λ‖w‖₂². Here
λ > 0 is a selectable hyperparameter that balances the two competing terms, ‖y − Fw‖₂² and ‖w‖₂², in the
overall objective. We show below that this modified problem always has a unique solution. More generally,
prior information may indicate that w is likely to be close to a given vector b ∈ Rn . This information can be
incorporated by selecting w to minimize ky − F wk22 + λkw − bk22 . This problem also has a unique solution.
The imposition of an auxiliary term into the least squares objective, as illustrated above, is often called
regularization. It was first investigated in the context of underdetermined problems by the Russian math-
ematician Tikhonov (1943). Stated in our context, he studied regularization terms of the form kGwk22 for
a specified matrix G. This approach can be generalized to kGw − gk22 where g is a given vector. This
form of regularization is called Tikhonov regularization. Somewhat later, regularized linear regression was
studied by Hoerl (1962) [19], and Hoerl and Kennard (1970) [20], using the regularization term kwk22 . This
approach is now widely known as ridge regression.
Tikhonov regularized least squares can be posed as:
min_{w ∈ Rⁿ} ‖Fw − y‖₂² + λ‖Gw − g‖₂², (9.12)
where F ∈ Rm×n and y ∈ Rm are the usual elements of the least squares problem, and G ∈ Rk×n and
g ∈ Rk are the new elements of the regularization penalty. The selectable scalar λ > 0 balances the two
competing objectives. Ridge regularization is a special case with G = In and g = 0. We have already seen
that (9.12) can be transformed into a standard least squares problem. Since the objective of (9.12) is a sum
of squares, we can write it as
‖Fw − y‖₂² + λ‖Gw − g‖₂² = ‖ [ Fw − y ; √λ(Gw − g) ] ‖₂² = ‖F̃w − ỹ‖₂²,
where
F̃ = [ F ; √λ G ] ∈ R^{(m+k)×n} and ỹ = [ y ; √λ g ] ∈ R^{m+k}.
If F̃ has n linearly independent columns (i.e., rank(F̃ ) = n), the augmented problem has the unique
solution
w? (λ) = (F T F + λGT G)−1 (F T y + λGT g) . (9.13)
A sufficient condition ensuring that rank(F̃ ) = n is rank(G) = n (Exercise 9.10).
Ridge regularization is a special case with k = n, G = In and g = 0. In this case, F T F is symmetric
positive semidefinite, and adding λIn ensures the sum is positive definite and hence invertible. Thus ridge
regression always has a unique solution, and this is obtained by solving
w⋆_rr(λ) = (FᵀF + λIₙ)⁻¹ Fᵀy. (9.14)
The residual ε = y − Fw⋆ under Tikhonov and ridge regularization is generally not orthogonal to the
columns of F. You can see this using the normal equations. For ridge regularization, Fᵀ(y − Fw⋆_rr) = λw⋆_rr,
and for Tikhonov regularization, Fᵀ(y − Fw⋆) = λGᵀ(Gw⋆ − g).
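A quick numerical check (mine, random data) of the ridge identity just stated: the residual is not orthogonal to the columns of F, and Fᵀ(y − Fw⋆_rr) equals λw⋆_rr.

import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 25, 4, 2.0
F = rng.normal(size=(m, n))
y = rng.normal(size=m)

w_rr = np.linalg.solve(F.T @ F + lam * np.eye(n), F.T @ y)   # ridge solution (9.14)
resid = y - F @ w_rr
print(np.allclose(F.T @ resid, lam * w_rr))                  # F^T (y - F w_rr) = lambda * w_rr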
Regularization Path
The solutions (9.13) and (9.14) are functions of the regularization parameter λ. As λ varies the solutions
trace out a curve in Rⁿ called the regularization path. For ridge regression, the entire regularization path is
contained in R(X).
(VΣ²Vᵀ + λIₙ)w⋆_rr = (VΣ²Vᵀ + λIₙ)V Vᵀw⋆_rr = V(Σ² + λI_r)Vᵀw⋆_rr.
This allows us to write the normal equations as V(Σ² + λI_r)Vᵀw⋆_rr = VΣUᵀy. Multiplying both sides of
this equation by V(Σ² + λI_r)⁻¹Vᵀ yields:
w⋆_rr(λ) = V(Σ² + λI_r)⁻¹ΣUᵀy = V diag( σⱼ/(λ + σⱼ²) ) Uᵀy = Σʳⱼ₌₁ ( σⱼ/(λ + σⱼ²) ) vⱼuⱼᵀy. (9.15)
By varying λ in (9.15), the entire regularization path is easily computed. In addition, (9.15) provides a proof
of the following result.

Lemma 9.3.1. lim_{λ→0} w⋆_rr(λ) = w⋆_ln.
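The following sketch (mine, random data) evaluates (9.15) on a grid of λ values; as λ decreases toward 0 the ridge solution approaches the least norm solution, in line with Lemma 9.3.1.

import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 5
F = rng.normal(size=(m, n))
y = rng.normal(size=m)
U, s, Vt = np.linalg.svd(F, full_matrices=False)
uy = U.T @ y

def w_ridge(lam):
    # w_rr(lambda) = sum_j sigma_j/(lambda + sigma_j^2) (u_j^T y) v_j, as in (9.15)
    return Vt.T @ (s / (lam + s**2) * uy)

path = [w_ridge(lam) for lam in np.logspace(2, -6, 9)]          # points on the regularization path
print(np.allclose(path[-1], np.linalg.pinv(F) @ y, atol=1e-4))  # lambda -> 0 recovers w_ln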
Proof. Since 1 + dᵀP⁻¹d ≠ 0, the RHS of (9.21) is a finite valued k × k real matrix. To prove the claim
simply multiply the RHS by (P + ddᵀ):
(P + ddᵀ) [ P⁻¹ − (1/(1 + dᵀP⁻¹d)) (P⁻¹d)(P⁻¹d)ᵀ ]
= Iₙ + ddᵀP⁻¹ − (1/(1 + dᵀP⁻¹d)) ddᵀP⁻¹ − (1/(1 + dᵀP⁻¹d)) d(dᵀP⁻¹d)dᵀP⁻¹
= Iₙ + ( 1 − 1/(1 + dᵀP⁻¹d) − dᵀP⁻¹d/(1 + dᵀP⁻¹d) ) ddᵀP⁻¹
= Iₙ.
For an alternative proof, and a generalization, see Appendix B. Also see the appendix to this chapter.
By assumption, the columns of Fₘ are linearly independent. Hence Pₘ is nonsingular and the least squares
solution is w⋆ₘ = Pₘ⁻¹sₘ. After obtaining w⋆ₘ we assume that Pₘ⁻¹ and sₘ remain available. When a
new training example is added, Fₘ₊₁ still has n linearly independent columns (Fₘ has linearly independent
columns and adding a row to these vectors does not change this). Hence Pₘ₊₁ is also invertible. Application
of Lemma 9.4.1 to equation (9.20) yields the following set of recursive equations for computing P⁻¹ₘ₊₁ from
P⁻¹ₘ and xₘ₊₁, and hence for obtaining w⋆ₘ₊₁:
P⁻¹ₘ₊₁ = P⁻¹ₘ − (P⁻¹ₘ xₘ₊₁)(P⁻¹ₘ xₘ₊₁)ᵀ / (1 + xₘ₊₁ᵀ P⁻¹ₘ xₘ₊₁), (9.22)
w⋆ₘ₊₁ = P⁻¹ₘ₊₁ sₘ₊₁. (9.23)
ŷ(m + 1) = xₘ₊₁ᵀ w⋆ₘ, (9.24)
w⋆ₘ₊₁ = w⋆ₘ + ( P⁻¹ₘ xₘ₊₁ / (1 + xₘ₊₁ᵀ P⁻¹ₘ xₘ₊₁) ) ( y(m + 1) − ŷ(m + 1) ). (9.25)
This update procedure is known as recursive least squares. It gives an efficient update formula for w⋆ₘ₊₁ in
terms of w⋆ₘ and each new training example. The update is driven by the prediction error y(m + 1) − ŷ(m + 1),
with no update required if the prediction error is zero. Inverting Pₘ₊₁ ∈ Rⁿˣⁿ requires O(n³) operations.
On the other hand, the recursive equations above require O(n²) operations per update. Hence RLS is an
efficient procedure when examples are presented sequentially and a solution is needed immediately.
Exercise 9.19 explores online least squares using mini-batch updates, and Exercise 9.20 explores an
online version of ridge regression.
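A compact NumPy sketch of the recursion (9.22) and (9.25) (mine, on synthetic data), assuming the initial batch of examples has linearly independent features so that P is invertible; the final recursive solution matches the batch least squares solution.

import numpy as np

rng = np.random.default_rng(0)
n, m_total = 4, 60
X = rng.normal(size=(n, m_total))
w_true = rng.normal(size=n)
y = X.T @ w_true + 0.1 * rng.normal(size=m_total)

m0 = 10                                            # initial batch with independent features
F0 = X[:, :m0].T
P_inv = np.linalg.inv(F0.T @ F0)                   # P_m^{-1}
w = P_inv @ (F0.T @ y[:m0])                        # initial least squares solution

for k in range(m0, m_total):                       # process remaining examples one at a time
    x_new, y_new = X[:, k], y[k]
    Px = P_inv @ x_new                             # P_m^{-1} x_{m+1}
    denom = 1.0 + x_new @ Px
    P_inv = P_inv - np.outer(Px, Px) / denom       # update (9.22)
    w = w + Px * (y_new - x_new @ w) / denom       # update (9.25), driven by the prediction error

w_batch = np.linalg.lstsq(X.T, y, rcond=None)[0]
print(np.allclose(w, w_batch))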
It is easy to prove this result; one need only do the required multiplication. Here is an alternative
analysis. For λ small
B⁻¹(λ) = P⁻¹ + λ (d/dλ)B⁻¹(λ)|_{λ=0} + H.O.T.
Now B⁻¹(λ)B(λ) = I. Hence
(d/dλ)B⁻¹(λ) = −B⁻¹(λ)uuᵀB⁻¹(λ). (9.26)
Thus for λ small
B⁻¹(λ) = P⁻¹ − λ(P⁻¹u)(P⁻¹u)ᵀ + H.O.T.
We see that the curve leaves the point P⁻¹ in the direction (P⁻¹u)(P⁻¹u)ᵀ.
Now look at the second derivative. This is easily found using (9.26):
(d²/dλ²)B⁻¹(λ)|_{λ=0} = 2 B⁻¹(λ)uuᵀB⁻¹(λ)uuᵀB⁻¹(λ)|_{λ=0}
= 2 P⁻¹u (uᵀP⁻¹u) uᵀP⁻¹
= 2 [uᵀP⁻¹u] (P⁻¹u)(P⁻¹u)ᵀ.
To second order, the curve still leaves the point P −1 in a straight line in the direction (P −1 u)(P −1 u)T .
One may already see that this will be true to any order. The same method used to obtain the second derivative
can be used to obtain derivatives to any order, and when evaluated at λ = 0, each is a scalar multiple of the
same matrix (P −1 u)(P −1 u)T . So the curve is a straight line leaving P −1 in the direction (P −1 u)(P −1 u)T .
To find the parameterization of the straight line given in the lemma, work out the n-th derivative of
B⁻¹(λ) at λ = 0. Then examine the Taylor series. This results in
(P + λuuᵀ)⁻¹ = P⁻¹ + [ Σ_{n=1}^∞ (−1)ⁿ λⁿ (uᵀP⁻¹u)^{n−1} ] (P⁻¹u)(P⁻¹u)ᵀ
= P⁻¹ − ( λ/(1 + λuᵀP⁻¹u) ) (P⁻¹u)(P⁻¹u)ᵀ.
Exercises
Preliminaries
Exercise 9.1. Let A ∈ Rm×n . Show each of the following claims.
a) AT A is positive definite (hence invertible) if and only if the columns of A are linearly independent.
b) AAT is positive definite if and only if the rows of A are linearly independent.
c) N (AT A) = N (A) and N (AAT ) = N (AT ).
d) R(AT A) = R(AT ) and R(AAT ) = R(A).
Exercise 9.2. Show that for any A ∈ Rm×n and λ > 0, AT (λIm + AAT )−1 = (λIn + AT A)−1 AT .
Exercise 9.3. Let P ∈ Rn×n be a symmetric positive definite matrix.
a) Show that there exists a unique symmetric positive definite matrix P 1/2 such that P = P 1/2 P 1/2 .
Exercise 9.5. (Affine least squares and data centering) Let {(xᵢ, yᵢ)}ᵐᵢ₌₁ ⊂ Rⁿ × R be a training dataset. Place
the xᵢ in the columns of X ∈ Rⁿˣᵐ, and the target values yᵢ in the rows of y ∈ Rᵐ. Consider the affine predictor
ŷ(x) = wᵀx + b, with x ∈ Rⁿ. One can fit this predictor to the training data by forming w̃ = [w, b]ᵀ, X̃ = [X, 1ₘ],
The solution gives w⋆ and b⋆ for the affine predictor. Here we explore the alternative option of explicitly determining
w and b by solving
w⋆, b⋆ = arg min_{w ∈ Rⁿ, b ∈ R} ‖y − Xᵀw − b1ₘ‖₂². (9.27)
(a) Show that the modified normal equations for (9.27) are
XXᵀw⋆ + b⋆X1ₘ = Xy, (9.28)
μ̂ₓᵀw⋆ + b⋆ = μ̂_y, (9.29)
where μ̂ₓ = (1/m)X1ₘ ∈ Rⁿ and μ̂_y = (1/m)1ₘᵀy ∈ R.
(b) Show that the affine predictor takes the form ŷ(x) = w⋆ᵀ(x − μ̂ₓ) + μ̂_y.
(c) Show that w⋆ satisfies
X_c X_cᵀ w⋆ = X_c y_c, (9.30)
where X_c = X − μ̂ₓ1ₘᵀ is the matrix of centered input examples, and y_c = y − μ̂_y1ₘ is the vector of centered
output targets. Hence we can find w⋆ by first centering the data and solving a standard least squares problem.
We can then use w⋆, μ̂ₓ, μ̂_y, and (9.29) to find b⋆.
Exercise 9.6. (Affine ridge regression) Let {(xᵢ, yᵢ)}ᵐᵢ₌₁ ⊂ Rⁿ × R be a training dataset. Place the xᵢ in the columns
of X ∈ Rⁿˣᵐ, and the target values yᵢ in the rows of y ∈ Rᵐ. We learn an affine predictor ŷ(x) = w⋆ᵀx + b⋆ by
(b) Show that the affine predictor takes the form ŷ(x) = w⋆ᵀ(x − μ̂ₓ) + μ̂_y, where μ̂ₓ = (1/m)X1ₘ ∈ Rⁿ and
μ̂_y = (1/m)1ₘᵀy ∈ R.
where X_c = X − μ̂ₓ1ₘᵀ is the matrix of centered input examples, and y_c = y − μ̂_y1ₘ is the vector of centered
output labels. Hence by first centering the data and solving a standard ridge regression problem we find w⋆.
Show that we can then use w⋆, μ̂ₓ, μ̂_y, and the normal equations to find b⋆.
Exercise 9.7. (Centered versus uncentered problems) Let F ∈ Rᵐˣⁿ have rank n, and y ∈ Rᵐ. Center the rows
of F and the entries of y by forming
μ̂_y = (1/m)1ₘᵀy,  y_c = (I − (1/m)1ₘ1ₘᵀ)y = y − 1ₘμ̂_y,
μ̂_fᵀ = (1/m)1ₘᵀF,  F_c = (I − (1/m)1ₘ1ₘᵀ)F = F − 1ₘμ̂_fᵀ.
Assume F_c also has rank n. Determine the relationship between w⋆ and w_c, where
Exercise 9.8. (Least squares regression with vector targets) We are given training data {(xᵢ, zᵢ)}ᵐᵢ₌₁ with input
examples xᵢ ∈ Rⁿ and vector target values zᵢ ∈ Rᵈ. Place the input examples into the columns of X ∈ Rⁿˣᵐ and the
targets into the columns of Z ∈ Rᵈˣᵐ. We want to learn a linear predictor of the targets z ∈ Rᵈ of test inputs x ∈ Rⁿ.
To do so, first use the training data to find:
W⋆ = arg min_{W ∈ Rⁿˣᵈ} ‖Y − FW‖_F² + λ‖W‖_F², (9.34)
where we have set Y = Zᵀ and F = Xᵀ and require λ ≥ 0 (λ = 0 removes the ridge regularizer).
(a) Show that (9.34) separates into d standard ridge regression problems, each solvable separately.
(b) Without using the property in (a), find an expression for the solution W⋆. Is the separation property evident
from this expression?
Exercise 9.9. (SVD/PCA Regression) Let F ∈ Rᵐˣⁿ have rank r and compact SVD UΣVᵀ. Let y ∈ Rᵐ and w⋆_ln
denote the least norm solution to the regression problem min_{w∈Rⁿ} ‖y − Fw‖₂. For 1 ≤ k ≤ r, let F_k = U_kΣ_kV_kᵀ be
the SVD rank k approximation to F. Here U_k and V_k consist of the first k columns of U and V, respectively, and Σ_k
is the top left k × k submatrix of Σ. Then let w_k⋆ be the least norm solution to the problem:
min_{w ∈ Rⁿ} ‖y − F_k w‖₂.
Show that:
(a) w_k⋆ = V_kΣ_k⁻¹U_kᵀy.
Exercise 9.10. (Uniqueness of the Tikhonov solution) Tikhonov regularized least squares is posed as:
Show that w? = By for some matrix B ∈ Rp×n , and relate the singular values of B to those of F .
Exercise 9.13. Let X ∈ Rn×m and y ∈ Rm be given, and consider the problem
Show that there exists a unique solution w? and that w? ∈ R(X). Using (9.36) and these two results, show that
w? = X(X T X + λIm )−1 y.
Approximation Problems
Exercise 9.14. For y, z ∈ Rn and λ > 0, find and interpret the solution of the approximation problem
Exercise 9.15. Let D ∈ Rn×n be diagonal with nonnegative diagonal entries and consider the problem:
This problem seeks to best approximate y ∈ Rⁿ with a nonuniform penalty for large entries in x.
(a) Solve this problem using the formula for the solution of regularized regression.
(b) Show that the objective function is separable into a sum of decoupled terms. Show that this decomposes the
problem into n independent scalar problems.
(c) Find the solution of each scalar problem.
(d) By putting these scalar solutions together, find and interpret the solution to the original problem.
Exercise 9.16. You are given k points {zi }ki=1 in Rn and you want to find the points x ∈ Rn that minimize the sum
of squared distances to these fixed points.
(a) Solve this directly using calculus.
(b) Now pose it as a regression problem and solve it using your knowledge of least squares and the SVD.
Exercise 9.17. You want to learn an unknown function f : [0, 1] → R using a set of noisy measurements (xⱼ, yⱼ),
with yⱼ = f(xⱼ) + εⱼ, j ∈ [1 : m]. Your plan is to approximate f(·) by a Fourier series on [0, 1] with q ∈ N terms:
f_q(x) ≜ a₀/2 + Σ^q_{k=1} ( a_k cos(2πkx) + b_k sin(2πkx) ).
To control the smoothness of f_q(·), you also decide to penalize the size of the coefficients a_k, b_k more heavily as k
increases.
(a) Formulate the above problem as a regularized regression problem.
(b) For q = 2, display the regression matrix, the target y, and the regularization term.
(c) Comment briefly on how to select q.
How do these equations change if the mini-batches are not all the same size?
Exercise 9.20. (On-line ridge regression)
(a) Determine the detailed equations for on-line ridge regression.
(b) Use these equations to explain how RLS and on-line ridge regression differ.
(c) As m → ∞, will the two solutions differ?
(d) In light of your answer to (c), what is the role of λ in ridge regression?
Chapter 10
We now consider linear least squares problems with the additional objective of finding a sparse solution. By
this we mean that many of the entries of the solution w? are zero. Let kwk0 denote the number of nonzero
components of w ∈ Rn . Then sparse linear regression can be posed as
Underdetermined Problems. When F has more columns than rows, problem (10.1) seeks a sparse solution
to an underdetermined least squares problem.
Subset Selection. Given training data, we can learn a linear predictor by solving a least squares problem
such as ridge regression:
    ŵ = arg min_{w∈Rn} ky − F wk22 + λkwk22 .    (10.2)
Generally, the entries of the solution ŵ are all non-zero, suggesting that all of the features are important for
predicting y. However, in many practical applications this is unlikely to be true. Not all of the available
features will be relevant to the desired prediction. This suggests seeking a (small) subset of the features that
are most relevant for forming a linear prediction of y. This is called a subset selection problem. Solving
(10.1) to find a sparse solution w? incorporates subset selection directly into the regression problem. The
indices with nonzero entries in w? indicate the selected subset of features.
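As a point of reference, the ridge predictor (10.2) has a closed form and is typically dense. A minimal NumPy sketch (the names F, y, lam and the toy data are illustrative, not from the text):

```python
import numpy as np

def ridge(F, y, lam):
    """Closed-form ridge regression: argmin_w ||y - F w||_2^2 + lam ||w||_2^2."""
    n = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(n), F.T @ y)

# Even when only a few features are relevant, every entry of w_hat is
# generally nonzero; this is what motivates subset selection.
rng = np.random.default_rng(0)
F = rng.standard_normal((50, 10))
y = F[:, 0] - 2 * F[:, 1] + 0.1 * rng.standard_normal(50)   # only 2 relevant features
w_hat = ridge(F, y, lam=1.0)
print(np.count_nonzero(np.abs(w_hat) > 1e-12))              # typically 10
```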
Sparse Representation Classification. Suppose we are provided with a database of labelled face images {(fj , zj )}_{j=1}^{m}. Here fj ∈ Rn is the j-th vectorized face image and zj is its label (the person’s identity). Form the face examples into the matrix F = [f1 , . . . , fm ]. In this case, the examples are the columns of F and
we call F the dictionary. Given a new (unlabelled) face image y, we want to predict its label. We suspect
that the subset of images in the database corresponding to the identity of y will be most important. So we
set out to find a sparse representation of y in terms of the columns of F by solving (10.1). The result is an
approximate representation of y as a linear combination of relatively few columns of F . This is called a
sparse representation of y using the dictionary F . The solution w? selects a subset of the columns of F and
gives each selected column a nonzero weight. If we extract the subset of selected columns with label z and
the corresponding weights, then we can form a class z predictor of y as a linear combination of the columns
of F with label z. The class predictor that yields the least error in representing y provides the estimated label z of y. This is called sparse representation classification.
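The classification rule just described is easy to write down once a sparse code for y is available. A minimal sketch, assuming a sparse solution w_star of (10.1) has already been computed by some solver (the function name classify_src and the argument names are illustrative):

```python
import numpy as np

def classify_src(F, labels, y, w_star):
    """Sparse representation classification: for each class z, reconstruct y
    using only the selected columns of F whose label is z, and return the
    class giving the smallest reconstruction error."""
    errors = {}
    for z in np.unique(labels):
        w_z = np.where(labels == z, w_star, 0.0)   # keep weights of class-z atoms only
        errors[z] = np.linalg.norm(y - F @ w_z)
    return min(errors, key=errors.get)
```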
Sparse Data Corruption. In security applications one needs to classify a new face image with respect to a
set of previously captured face images. Unlike the images in the dictionary, in the new image the subject is
wearing glasses, or sun glasses, or a scarf. So the new image will most likely differ significantly from the
best match among the previous images by a sparse set of pixels. Hence we could model the new image y as
y ≈ ŷ + ys , where ŷ is a sparse representation of y using images in the database and ys is a sparse image
with most pixels having value 0. One might then pose the problem of finding the best match to y from the
dictionary F as
    min_{w∈Rm , ys ∈Rn} ky − F w − ys k22
Sparse Approximation. The simplest version of problem (10.1) is called sparse approximation. Given
y ∈ Rn we want to find an approximation x to y such that x has at most k nonzero entries. This can be
posed as:
    min_{x∈Rn} ky − xk   s.t. kxk0 ≤ k.    (10.3)
Since x is close to y, but is specified with fewer nonzero coordinates, we can regard x as a compressed form
of y. We suspect that such an approximation will be useful if y has relatively few large entries, and many
small entries. In other words, y has some special structure that makes it “compressible”. This is often the
case for natural forms of data such as speech and images. For example, after the application of a wavelet
transform, most natural images are highly compressible.
10.1 Preliminaries
The support of x ∈ Rn is the set of indices S(x) = {i : x(i) ≠ 0}. Let |S(x)| denote the size of S(x). Then the number of nonzero entries in the vector x is |S(x)|. For an integer k ≥ 0, we say that x is k-sparse if |S(x)| ≤ k. More generally, we say that a vector x ∈ Rn is sparse if |S(x)| ≪ n, i.e., relatively few of its entries are nonzero.
An alternative notation is to let |α|0 denote the indicator function of the set {α ∈ R : α ≠ 0}:

    |α|0 ∆= { 0, if α = 0;   1, otherwise. }    (10.4)
The number of nonzero entries in a vector x ∈ Rn can then be expressed as

    kxk0 ∆= ∑_{j=1}^{n} |x(j)|0 .    (10.5)
Figure 10.1: An illustration of the 1-sublevel sets of the function fp (x) = (∑_j |x(j)|^p)^{1/p} and the connection to the 1-sublevel set of kxk0 . Left: for p ≥ 1, fp (x) is a norm and the 1-sublevel sets must be convex. Right: for p < 1, the sublevel sets of fp (x) are no longer convex. As p ↓ 0, these 1-sublevel sets “look like” the 1-sublevel set of kxk0 intersected with the 1-norm unit ball.
The function k · k0 is not a norm. You can verify this by checking each of the required norm properties: (a) kxk0 is non-negative and zero iff x = 0 (positivity); (b) k · k0 satisfies kx + yk0 ≤ kxk0 + kyk0 (triangle inequality); (c) but for x ≠ 0 and |α| ∉ {0, 1}, kαxk0 = kxk0 ≠ |α| kxk0 . So k · k0 doesn’t satisfy the scaling property of a norm.
The function k · k0 is also not convex. To see this, let e1 , e2 denote the first two standard basis vectors
in Rn , and consider xα = (1 − α)e1 + αe2 for α ∈ (0, 1). The points e1 , e2 are clearly 1-sparse, but xα is
not 1-sparse. Hence the 1-sublevel set of k · k0 is not convex. Thus k · k0 is not a convex function.
The left panel of Figure 10.1 illustrates the unit balls of some ℓp norms on R2 . The right panel of the figure illustrates that for 0 < p < 1 the 1-sublevel sets of the function fp (x) = (∑_j |x(j)|^p)^{1/p} are no longer convex. Indeed as p ↓ 0, these sets begin to look like the 1-sublevel set of k · k0 restricted to the unit ball of the 1-norm (or any other p-norm).
    min_{x∈Rn} ky − Axk22   s.t. kxk0 ≤ k,  k ∈ N, k > 0    (10.6)

    min_{x∈Rn} kxk0   s.t. ky − Axk22 ≤ ε,  ε > 0    (10.7)

    min_{w} ky − Awk22 + λkwk0 ,  λ > 0    (10.8)

The three formulations share the difficulty that k · k0 is not a convex function. For the moment we will focus on the formulation (10.8).
Proof. (a) It is clear that any zero atom of A can be removed. This changes the size of A and w, but
otherwise leaves the problem intact.
(b) Let y = ŷ + ỹ with ŷ ∈ R(A) and ỹ ∈ R(A)⊥ . Then
ky − Awk22 + λkwk0 = kỹ + ŷ − Awk22 + λkwk0
= kỹk22 + kŷ − Awk22 + λkwk0
≡ kŷ − Awk22 + λkwk0 .
So solving the problem using ŷ yields a solution of the original problem. Conversely, if w? is a solution of
the original problem, then it is also a solution for the problem with ŷ. Hence without loss of generality we
can assume y ∈ R(A).
(c) If the atoms do not have unit norm, let à = AD where the atoms of à have unit norm and D is diagonal
with positive diagonal entries. If z = Dw, then kwk0 = kD−1 zk0 = kzk0 and
ky − Awk22 + λkwk0 = ky − ÃDwk22 + λkwk0
= ky − Ãzk22 + λkD−1 zk0
= ky − Ãzk22 + λkzk0 .
So if we solve the sparse regression problem using à (unit norm atoms) to obtain z ? , then w? = D−1 z ?
solves the original problem. Conversely, if w? solves the original problem, then Dw? is a solution for the
problem using Ã. Hence without loss of generality we can assume that A has unit norm atoms.
Now multiply the objective function by c2 > 0 and set u = cy, z = cw and λ̃ = c2 λ. Noting that kcwk0 = kwk0 , this gives

    c2 ( ky − Awk22 + λkwk0 ) = kcy − Acwk22 + c2 λkcwk0 = ku − Azk22 + λ̃kzk0 .

If z ? solves the modified problem using u = cy and λ̃ = c2 λ, then w? = z ? /c solves the original problem
using y and λ. Conversely, if w? is a solution of the original problem, then z ? = cw? is a solution for the
modified problem using u and λ̃. Now note that choosing c = 1/kyk2 ensures u has unit norm. Hence
without loss of generality we can assume y has unit norm.
We show in the next section that problem (a) is easily solved. The other special cases are all reducible to
problem (a), and hence are also easily solved. In contrast, the general sparse least squares problem is difficult
to solve. For example, problem (10.7) is known to be NP-hard [33]. In light of the difficulty of finding an
efficient general solution method, a number of greedy algorithms have been proposed for efficiently finding
an approximate solution. Examples of such methods are examined in §10.4.
    min_{x∈Rn} f (y − x)   s.t. kxk0 ≤ k.    (10.9)

    min_{x∈Rn} f (y − x) + λkxk0 ,  λ > 0.    (10.10)

Here f (y − x) and λkxk0 are competing objectives. Selecting a small value for λ encourages less sparsity and a better match between y and x. Increasing λ encourages greater sparsity, but potentially a worse match between x and y. A solution x? of (10.10) will generally be sparse, but we can’t guarantee it will be k-sparse. Nevertheless, it will be convenient to first solve (10.10).
We assume that the functions hj : R+ → R+ are strictly monotone increasing functions with hj (0) = 0. In
this case, we want to minimize
    f (y − x) + λkxk0 = ∑_{j=1}^{n} ( hj (|y(j) − x(j)|) + λ|x(j)|0 ) .    (10.11)
This is the sum of n decoupled scalar subproblems. Subproblem j has the form
    min_{α∈R}  hj (|y(j) − α|) + λ|α|0 .

This subproblem can be solved by case analysis: (1) if y(j) = 0, we set α = 0; (2) if y(j) ≠ 0, there are two options to consider: either we set α = y(j) and incur a cost λ, or we set α = 0 and incur a cost hj (|y(j)|). This yields the following result.
Theorem 10.3.1. Assume f (z) = ∑_{j=1}^{n} hj (|z(j)|), where the functions hj : R+ → R+ are strictly monotone increasing functions with hj (0) = 0. Then the solution of problem (10.10) is

    x? (j) = { y(j), if hj (|y(j)|) ≥ λ;   0, otherwise. }    (10.12)
So for suitable separable functions f , the solution to the sparse approximation problem is obtained by hard thresholding y(j) based on a comparison of hj (|y(j)|) and λ. Because hj (·) is strictly monotone increasing, it has an inverse; define tj (λ) ∆= hj−1 (λ). Using tj (λ) we can equivalently write the thresholding condition as |y(j)| ≥ tj (λ).
Bring in the generic hard thresholding operator defined for z ∈ R by

    Ht (z) = { z, if |z| ≥ t;   0, otherwise. }    (10.13)
Example 10.3.2. Consider the problem: min_{x∈Rn} ky − xk22 + λkxk0 . For this problem, f (y − x) = ∑_{j=1}^{n} (y(j) − x(j))2 = ∑_{j=1}^{n} hj (|y(j) − x(j)|) with hj (α) = α2 . Hence the appropriate hard threshold is t = √λ. This is applied to each entry of y to obtain:

    x? (j) = H√λ (y(j)) = { y(j), if |y(j)| ≥ √λ;   0, otherwise. }
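A minimal NumPy sketch of this entrywise hard-thresholding solution (names are illustrative):

```python
import numpy as np

def hard_threshold(y, lam):
    """Solve min_x ||y - x||_2^2 + lam * ||x||_0 by entrywise hard thresholding
    at t = sqrt(lam), as in Example 10.3.2."""
    t = np.sqrt(lam)
    return np.where(np.abs(y) >= t, y, 0.0)

y = np.array([3.0, -0.2, 0.5, -2.0])
print(hard_threshold(y, lam=1.0))   # [ 3.  0.  0. -2.]
```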
A key observation from (10.12) is that x? is a function of λ, and as λ decreases smoothly, kx? (λ)k0
increases monotonically in a staircase fashion. Depending on the value of λ, we may obtain kx? k0 > k,
or kx? k0 < k. As a thought experiment, imagine starting with a large value of λ and computing x? (λ) as
λ smoothly decreases. For simplicity, suppose that at least k of the values hj (|y(j)|) are nonzero and that the nonzero values are distinct. For each value of λ, let S(λ) denote the support set of x? (λ), i.e., the set of indices j for which x? (j) ≠ 0. As we decrease λ the following things happen:
(0) Start with λ = λ0 > maxj hj (|y(j)|); then x? (λ0 ) = 0 and S(λ0 ) = ∅.
(1) When λ decreases to λ1 = maxj hj (|y(j)|), with j1 = arg maxj hj (|y(j)|), there is a jump change. At this point S(λ1 ) = {j1 }, x? (λ1 )(j1 ) = y(j1 ), and x? (λ1 ) is the unique optimal 1-sparse approximation to y.
(2) Continuing to decrease λ we reach a value λ2 equal to the second largest value of {hj (|y(j)|)}.
Suppose λ2 = hj2 (|y(j2 )|). At this point, j2 is added to S so that S(λ2 ) = {j1 , j2 }, and x? is
modified so that x? (j2 ) = y(j2 ). Then x? (λ2 ) is the unique optimal 2-sparse approximation to y.
(3) Continuing in this fashion, we see that the optimal k-sparse approximation x? to y is obtained by letting S be the set of indices of the k largest values of hj (|y(j)|), and setting

    x? (j) = { y(j), if j ∈ S;   0, otherwise. }    (10.14)
The assumption that all of the nonzero values hj (|y(j)|) are distinct was made to simplify the explanation. More generally, one can prove the following result.
Theorem 10.3.2. Let S be the indices of any set of k largest values of hj (|y(j)|). Then x? defined by
(10.14) is a solution to problem (10.9). This solution is unique if and only if the k-th largest value of
hj (|y(j)|) is strictly larger than the (k + 1)-st value.
Proof. Exercise.
For a fixed value of λ, solving problem (10.10) under the assumption of separability is very efficient.
One just needs to threshold the values of y(j) based on the corresponding values of hj (|y(j)|) and λ. The
downside is that solving problem (10.10) doesn’t give precise control of the resulting value of kx? (λ)k0 .
But we now see how to solve problem (10.9), and this gives precise control over kx? k0 . We simply need to
find the indices j1 , . . . , jk of any set of k largest values of hi (|y(i)|). This can be done using the following
algorithm:
(1) Scan the entries y(j) in order for j ∈ [1 : n].
(2) Maintain a sorted list of at most k pairs (j, hj (|y(j)|)) of the k largest values of hj (|y(j)|) seen so far.
(3) When entry j of y(j) is scanned, compute hj (|y(j)|). If the number of table entries is less than k, add
(j, hj (|y(j)|)). Otherwise, if hj (|y(j)|) is larger than the smallest corresponding value in the table,
add (j, hj (|y(j)|)) to the sorted table and remove the entry (i, hi (|y(i)|)) with the smallest value
hi (|y(i)|). Otherwise, read the next value.
The overall complexity of this algorithm is O(n) for computing each hj (|y(j)|) and making a comparison, plus O(log k) per update for maintaining the ordered list of the k largest values seen so far. If a predetermined
value of k is required, then the second solution method is probably more efficient. But if either k or λ
is to be determined by cross-validation (checking performance on held-out testing data), then the solution
method based on thresholding using λ may have an efficiency advantage.
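A minimal sketch of the scan just described, using a heap to maintain the running k largest values; hj is taken to be the same for every coordinate for simplicity, and the default h(a) = a^2 matches the squared-error case:

```python
import heapq
import numpy as np

def best_k_sparse(y, k, h=lambda a: a ** 2):
    """Optimal k-sparse approximation for a separable objective (problem (10.9)):
    keep the k entries with the largest h(|y(j)|), zero the rest."""
    heap = []                                   # min-heap of (score, index) pairs
    for j, v in enumerate(y):
        score = h(abs(v))
        if len(heap) < k:
            heapq.heappush(heap, (score, j))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, j))
    x = np.zeros_like(y, dtype=float)
    for _, j in heap:
        x[j] = y[j]
    return x

print(best_k_sparse(np.array([0.1, -3.0, 0.7, 2.0]), k=2))   # [ 0. -3.  0.  2.]
```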
    min_{x∈Rn} ky − xk∞   s.t. kxk0 ≤ k.    (10.15)
Note that the max norm kxk∞ = maxi |x(i)| is a non-separable function. For simplicity, initially assume
that the entries of y are nonzero with distinct absolute values, and that these values are arranged in y from
largest to smallest by absolute value. Hence kyk∞ = |y(1)| > |y(2)| > · · · > |y(n)| > 0.
We can use up to k nonzero elements in x to minimize ky − xk∞ . The optimal allocation is to use
x(1), . . . x(k) to make |y(j)−x(j)| no larger than |y(k+1)|, j ∈ [1 : k]. This yields ky −x? k∞ = |y(k+1)|.
This is the smallest achievable value of ky − xk∞ using a k-sparse x. Any x? with S(x? ) ⊆ [1 : k] and |y(j) − x? (j)| ≤ |y(k + 1)| for j ∈ [1 : k] achieves this value, so the solution is not unique.
Lemma 10.3.1. GP(n) is closed under matrix multiplication, contains the identity, and every P ∈
GP(n) has an inverse P −1 ∈ GP(n).
Proof. Let Dn denote the family of n × n diagonal matrices with diagonal entries in {±1}. Then I ∈ Dn
and if D ∈ Dn , then D2 = I. So D−1 = D ∈ Dn . Finally, if D1 , D2 ∈ Dn , then D = D1 D2 ∈ Dn .
Let Q be a permutation matrix and P = DQ ∈ GP(n). Then P = QD′ for some D′ ∈ Dn and hence P −1 = D′ QT ∈ GP(n). Now let Pi = Di Qi ∈ GP(n), with Qi permutation matrices, i = 1, 2. Then P1 P2 = D1 Q1 D2 Q2 = DQ with Q1 D2 = D2′ Q1 , D = D1 D2′ and Q = Q1 Q2 . So P1 P2 ∈ GP(n).
Example 10.3.3. Some examples of symmetric norms are given below. In the accompanying verifications,
Q ∈ GP(n).
(a) Every p-norm: kQxkp = (∑_{j=1}^{n} |(Qx)(j)|^p)^{1/p} = (∑_{j=1}^{n} |x(j)|^p)^{1/p} = kxkp .
c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 129
(b) The max norm: kQxk∞ = maxj |(Qx)(j)| = maxj |x(j)| = kxk∞ .
(c) The c-norm defined by kxkc = maxP ∈GP(n) {xT P c}, where c ∈ Rn is nonzero:
Proof. This proof uses some advanced aspects of norms and convex sets. If x = 0, then kxk = 0 ≤ kyk.
Hence assume x 6= 0, and set Lkxk = {u : kuk ≤ kxk}. This is the kxk-sublevel set of k · k with kxk > 0.
Clearly x ∈ Lkxk . A norm is a convex function, and the sublevel sets of a convex function are closed and
convex. Hence Lkxk is a closed convex set.
Consider the two convex sets Lkxk and C = {x}. Lkxk contains interior points, C is nonempty and con-
tains no interior points of Lkxk . Hence by the separation theorem for convex sets, there exists a hyperplane
wT z = c, with w 6= 0, such that for all z ∈ Lkxk , wT z ≤ c, and for all z ∈ C, wT z ≥ c. Since x is in both
sets, wT x = c.
Under the assumption that x > 0, we show that w ≥ 0. Suppose to the contrary that w(i) < 0. Form x̂
from x by setting x̂(i) = −x(i). By the symmetry of the norm we have kx̂k = kxk and x̂ ∈ Lkxk . Hence
we must have wT x̂ ≤ c. On the other hand, w(i) < 0 and x̂i = −x(i) < 0 imply that
Then 0 < x̃ ≤ ỹ. Let xα = (1 − α)x + αx̃ and yα = (1 − α)y + αỹ for α ∈ [0, 1]. For α > 0,
Hence by the result proved above, kxα k ≤ kyα k. Now take the limit as α → 0 and use the continuity of the
norm to conclude that kxk ≤ kyk.
Under a symmetric norm, the sparse approximation problem (10.3) has the following simple solution.
Theorem 10.3.3. Let k · k be a symmetric norm, y ∈ Rn , and S be the indices of k largest values of
|y(j)|, j ∈ [1 : n]. Then (10.14) gives a solution to problem (10.3).
Proof. For any P ∈ GP (n), we have ky − xk = kP y − P xk and kxk0 = kP xk0 . So we can select a
P to ensure (P y)(1) ≥ (P y)(2) ≥ · · · ≥ (P y)(n) ≥ 0. Hence from this point forward we assume that
y(1) ≥ y(2) ≥ · · · ≥ y(n) ≥ 0. To simplify the proof, we will also assume the largest k values in y are
distinct, but this is not required.
Let z = y − x. The possibilities for the best k-sparse solution fall into two forms: either z(j) = 0 for all j ∈ [1 : k], or there exist integers p, q with 1 ≤ p ≤ k and k + 1 ≤ q ≤ n, such that z(p) ≠ 0, and z(q) = 0 with y(q) < y(p). In the first case, let z1 = y − x, and in the second, let z2 = y − x. We can permute the entries of z2 to form z2′ by swapping the zero values outside the range 1, . . . , k with the locations of nonzero values in the range 1, . . . , k. This is visualized below.
So kz2 k = kz2′ k and 0 ≤ z1 ≤ z2′ . Hence by Lemma 10.3.2, kz1 k ≤ kz2′ k = kz2 k. Thus x? (j) = y(j), j ∈ [1 : k], achieves an objective value at least as good as any other x.
The solution under a symmetric norm need not be unique, even when |y(k)| > |y(k + 1)| in an ordered
list of these values. For example, it is not unique under the max norm.
(2) Check if a termination condition is satisfied (see below). If not, go to step (1).
The construction terminates after a desired number of distinct atoms have been selected (problem (10.6)) or
the size of the residual falls below some threshold (problem (10.7)). On termination, the algorithm results
in the set of indices of the selected atoms i1 , . . . , ik and weights w(i1 ), . . . , w(ik ). These give the sparse approximation ŷ = ∑_{j=1}^{k} w(ij ) aij to y with residual r = y − ŷ.
(0) Initialize: t = 0, S0 = ∅, A0 = [ ], r0 = y.
(2) Check if a termination condition is satisfied (see below). If not, go to step (1).
Step (1)(d) requires solving a least squares problem with one additional column than the previous iteration.
Note that after step (1) is completed, rt ⊥ span(At ). So once an atom has been selected it can’t be selected
a second time. The construction terminates after a desired number of atoms have been selected (for (10.6)),
the size of the residual falls below some threshold (for (10.7)), or no atom can be found that has a nonzero
correlation with the residual. On termination, the algorithm results in a set of selected atoms ai1 , . . . , aik
and weights w(i1 ), . . . , w(ik ). These give ŷ = ∑_{j=1}^{k} w(ij ) aij as a sparse approximation to y with residual r = y − ŷ.
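A minimal sketch in the spirit of the greedy construction just described (an orthogonal matching pursuit variant: each iteration re-solves a least squares problem over the selected atoms). Step (1) is not fully shown in the extracted text, so the selection rule used here — the atom with the largest absolute inner product with the current residual — is an assumption, consistent with the termination condition above:

```python
import numpy as np

def omp(A, y, k, tol=1e-10):
    """Greedily select up to k atoms (columns of A) to approximate y."""
    r = y.copy()
    S = []                                     # indices of selected atoms
    w = np.zeros(A.shape[1])
    for _ in range(k):
        corr = np.abs(A.T @ r)
        corr[S] = 0.0                          # selected atoms already satisfy r ⟂ span(A_S)
        j = int(np.argmax(corr))
        if corr[j] < tol:                      # no atom correlates with the residual
            break
        S.append(j)
        w_S, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)   # re-fit on selected atoms
        r = y - A[:, S] @ w_S                  # new residual, orthogonal to span(A_S)
        w = np.zeros(A.shape[1]); w[S] = w_S
    return w, S, r
```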
If spark(A) = 1, at least one column of A must be 0. We normally assume all columns of A are nonzero.
In this case, it takes at least two columns to form a linearly dependent set, and by the definition of rank, any
r + 1 columns of A will be linearly dependent. Hence for matrices with nonzero columns
2 ≤ spark(A) ≤ r + 1.
Note that if the columns of A are linearly independent, then every subset of columns of A is linearly independent. In this situation, spark(A) is not defined.
Example 10.5.1. Here are some illustrative examples:
(a) Let A = [e1 , e2 , e3 , e4 , e1 ]. Then rank(A) = 4 and spark(A) = 2. This achieves the lower bound for spark.
(b) Let A = [e1 , e2 , e3 , e4 , ∑_{j=1}^{4} ej ]. In this case, rank(A) = 4 and spark(A) = 5. This achieves the upper bound for spark.
(c) Let A = [e1 , e2 , e3 , ∑_{j=1}^{3} ej , e5 , e6 ]. In this case, rank(A) = 5 and spark(A) = 4. This achieves neither the lower nor the upper bound for spark.
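For small dictionaries the spark can be checked directly from its definition; a brute-force sketch (exponential in the number of columns, so only for illustration):

```python
import itertools
import numpy as np

def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (np.inf if the
    columns are linearly independent, in which case spark is undefined)."""
    m = A.shape[1]
    for k in range(1, m + 1):
        for cols in itertools.combinations(range(m), k):
            if np.linalg.matrix_rank(A[:, list(cols)], tol=tol) < k:
                return k
    return np.inf

e = np.eye(4)
A = np.column_stack([e[:, 0], e[:, 1], e[:, 2], e[:, 3], e.sum(axis=1)])
print(spark(A))   # 5, as in Example 10.5.1(b)
```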
The spark of A is of interest because it can be used to give a sufficient condition for (10.16) to have a
unique solution. Roughly, if a solution is sufficiently sparse, then it is the unique sparsest solution. To show
this we let Sk = {x : kxk0 ≤ k} denote the set of k-sparse vectors in Rn . We now state the following simple
results.
Lemma 10.5.1. k < spark(A) ⇔ N (A) ∩ Sk = {0}.
Proof. (IF) Assume k < spark(A). Let x ∈ Sk ∩ N (A), so Ax = 0. If x ≠ 0, then kxk0 of the columns of A are linearly dependent. Hence k ≥ kxk0 ≥ spark(A); a contradiction. Thus x = 0. Since x was an arbitrary element in Sk ∩ N (A), we conclude that Sk ∩ N (A) = {0}.
(ONLY IF) Assume N (A) ∩ Sk = {0}. Then for any nonzero x with kxk0 ≤ k, Ax ≠ 0. Hence spark(A) > k.
Proof. (IF) Assume 2k < spark(A). For any x, z ∈ Sk , x − z ∈ S2k . Now suppose Ax = Az. Then
A(x − z) = 0. Since x − z ∈ S2k and 2k < spark(A), Lemma 10.5.1 implies x − z = 0, i.e., x = z. So A
is injective on Sk .
(ONLY IF) Assume A is injective on Sk . Any nonzero w ∈ S2k can be written as w = x − z for nonzero x, z ∈ Sk . If kwk0 ≤ k, set x = w/2 and z = −w/2; otherwise allot the nonzero entries of w between x and −z so that both are nonzero and each is in Sk . Then x ≠ z and Aw = A(x − z) = Ax − Az ≠ 0, by the assumption. Since w was any nonzero element of S2k , N (A) ∩ S2k = {0}, and hence by Lemma 10.5.1, spark(A) > 2k.
Theorem 10.5.1. Let Aw? = y. If kw? k0 < (1/2) spark(A), then w? is the unique sparsest solution of Aw = y.
Proof. Set k = kw? k0 and assume k < (1/2) spark(A). By Lemma 10.5.2, A is injective on Sk . Hence for every z ∈ Sk with z ≠ w? , Az ≠ y = Aw? . Hence w? is the unique solution in Sk . Thus it is the unique sparsest solution.
(ONLY IF) Assume w? is the unique sparsest solution of Aw = y. Let k = kw? k0 . Then for each z ∈ Sk with z ≠ w? , Az ≠ Aw? . So z − w? ∉ N (A).
Theorem 10.5.1 indicates that there are situations in which (10.16) has a unique sparsest solution. The condition kw? k0 < (1/2) spark(A) in the theorem is sufficient but not necessary for w? to be a unique sparsest solution. Moreover, in general, computing spark(A) is computationally expensive.
Then spark(A) = 2, Aw? = y, and kw? k0 = 1 = (1/2) spark(A). So the condition of Theorem 10.5.1 fails to hold. Moreover, z is also a 1-sparse solution. So there is not a unique sparsest solution.
    µ(A) ∆= max_{i≠j} |aiT aj | / ( kai k2 kaj k2 ).    (10.17)

The coherence of A is relatively easy to compute and gives a measure of pairwise similarity among the columns of A. In general, we would like µ(A) to be small. For example, for A ∈ Vn,k , the columns of A are orthonormal, and µ(A) = 0.
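A minimal sketch of computing µ(A) from (10.17):

```python
import numpy as np

def coherence(A):
    """Mutual coherence mu(A): largest absolute normalized inner product
    between two distinct columns of A."""
    An = A / np.linalg.norm(A, axis=0)     # normalize the columns
    G = np.abs(An.T @ An)                  # magnitudes of the Gram matrix entries
    np.fill_diagonal(G, 0.0)               # ignore the diagonal (i = j)
    return G.max()

A = np.array([[1.0, 0.0, 1/np.sqrt(2),  1/np.sqrt(2)],
              [0.0, 1.0, 1/np.sqrt(2), -1/np.sqrt(2)]])
print(coherence(A))   # about 0.7071 = 1/sqrt(2), cf. Example 10.5.5 below
```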
Since the columns of A appear in (10.17) in normalized form, without loss of generality we can assume
that the columns of A have unit norm. Bring in the Gram matrix G = AT A. The diagonal entries of G are 1 and the off-diagonal entries indicate the signed similarity of pairs of distinct columns of A. Moreover,
Here σ1 (M ) denotes the maximum singular value, and kM k2 the induced 2-norm of M . The center in-
equality holds since the magnitudes of the entries of M are bounded above by σ1 (M ) (Exercise 5.8).
If µ(A) = 0, then the columns of A are orthonormal and hence linearly independent. In this case the
spark of A is not defined. When spark(A) is defined, we must have µ(A) > 0. In this case, the coherence
can be used to lower bound spark(A).
Lemma 10.5.3. Let A be a matrix with linearly dependent columns. Then µ(A) > 0, and

    1 + 1/µ(A) ≤ spark(A).    (10.18)
Proof. Without loss of generality, assume A has unit norm columns. Note that µ(A) > 0, since µ(A) = 0 would make the columns orthonormal and hence linearly independent. Let ai = A:,i denote the i-th column of A, and G = AT A. Then Gij = 1 if i = j and has magnitude at most µ(A) otherwise. Since A has linearly dependent columns there exists a nonzero x with Ax = ∑_i x(i) ai = 0. Choose j so that |x(j)| is maximal. Then aj = −∑_{i≠j} (x(i)/x(j)) ai . Taking the inner product of both sides with aj yields 1 = −∑_{i≠j} (x(i)/x(j)) ajT ai . From this we obtain 1 ≤ ∑_{i∈S(x), i≠j} (|x(i)|/|x(j)|) |ajT ai | ≤ (kxk0 − 1) µ(A), so kxk0 ≥ 1 + 1/µ(A). Since this holds for every nonzero x ∈ N (A), and spark(A) is the smallest value of kxk0 over such x, (10.18) follows.
Combining Lemma 10.5.3 and Theorem 10.5.1 yields the following result.
Theorem 10.5.2. If Aw? = y and kw? k0 < (1/2)(1 + 1/µ(A)), then w? is the unique solution of (10.16).
Proof. Exercise.
This result is weaker than the corresponding result using spark(A), but is easier to verify.
Example 10.5.5. Consider

    A = [ 1  0  1/√2   1/√2 ;  0  1  1/√2  −1/√2 ] .
A has unit norm columns. Any three columns must be linearly dependent. So spark(A) ≤ 3. By inspection we see that no two columns are linearly dependent. Hence spark(A) = 3. To determine µ(A) we compute
    AT A = [ 1      0      1/√2   1/√2 ;
             0      1      1/√2  −1/√2 ;
             1/√2   1/√2   1      0    ;
             1/√2  −1/√2   0      1    ] .

Hence µ(A) = 1/√2. We can then check 3 = spark(A) ≥ 1 + 1/µ(A) = 1 + √2 ≈ 2.414. The condition kwk0 < 1.5 is sufficient to ensure a unique sparsest solution of Aw = y. This is equivalent to y being a scalar multiple of a column of A.
Proof. Let x ∈ Sk be non-zero. By the RIP property, δk < 1 and (1 − δk )kxk22 ≤ kAxk22 ≤ (1 + δk )kxk22 . So kAxk22 > 0 and hence x ∉ N (A). Thus k < spark(A).
Combining Lemma 10.5.4 with Lemmas 10.5.1, 10.5.2, and Theorem 10.5.1 we see that:
(a) If A satisfies the RIP of order k, then N (A) ∩ Sk = {0}.
(b) If A satisfies the RIP of order 2k, then A is injective on Sk . If in addition, w? ∈ Sk and Aw? = y, then w? is the unique sparsest solution of Aw = y.
The proof of these claims is left as an exercise.
Notes
The sparse representation classifier discussion in the Introduction is based on the work by Wright et al. in [50]. The
Matching Pursuit algorithm is due to Mallat and Zhang [29]. Orthogonal Matching Pursuit was introduced by Pati et al.
in [35]. These methods followed earlier approaches for iterative basis selection that had come to be called projection
pursuit [16], [23]. The Restricted Isometry Property was introduced by Candes and Tao in [8].
Other suggested work includes:
Donoho, D.L., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization, Proc. Nat. Acad. Sci. 100, 2197–2202 (2003).
Candès, E.J., Tao, T.: Decoding by linear programming, IEEE Trans. Inform. Theory 51 (2005)
Davenport, M.A., Duarte, M.F., Eldar, Y.C., Kutyniok, G.: Introduction to Compressed Sensing. In: Eldar, Y.C.,
Kutyniok, G. (Eds.), Compressed Sensing: Theory and Applications, Cambridge University Press (2011)
Fickus, M., Mixon, D.G.: Deterministic matrices with the restricted isometry property, Proc. SPIE (2011)
Exercises
Sparsity
Exercise 10.1. Invariance properties of k · k0 . A norm on Rn is symmetric if: (a) for any permutation matrix P and all x ∈ Rn , kP xk = kxk; and (b) for any diagonal matrix D with diagonal entries in {±1}, and all x ∈ Rn , kDxk = kxk. A square matrix DP , with D = diag[±1] and P a permutation matrix, is called a generalized permutation. Show that k · k0 is:
(a) A symmetric function, i.e., invariant under the group of generalized permutations on Rn .
(b) Invariant under the action of any diagonal matrix with nonzero diagonal entries.
(c) Not invariant under the orthogonal group On .
Separability
Exercise 10.2. Let f : Rn → R be a separable convex function with f (z) = ∑_{j=1}^{n} hj (z(j)) for functions hj : R → R with hj (0) = 0. Show that each hj is a convex function.
Sparse Approximation
Exercise 10.3. (Approximation with an ℓ1 penalty) Let x, y ∈ Rn . Replacing the sparsity penalty kxk0 in problem (10.10) by the 1-norm penalty kxk1 yields the approximation problem:

    min_{x∈Rn} ky − xk22   subject to: kxk1 ≤ α.
Exercise 10.6. (Sparse approximation in a non-symmetric norm) Let D ∈ Rn×n be diagonal and positive definite.
Then kxkD = (xT Dx)1/2 is a norm.
(a) Solve the following problem and determine if the solution is unique.

    min_{x∈Rn} ky − xk2D   subject to: kxk0 ≤ k.

For D ≠ αIn , k · kD is non-symmetric. So sparse approximation is easily solved for some non-symmetric norms.
Exercise 10.7. (Sparse representation in an ON basis) Let r ≤ n and Q ∈ Rn×r have orthonormal columns.
(a) Find a solution of the following problem and determine if the solution is unique.
    min_{x∈Rr} ky − Qxk22   subject to: kxk0 ≤ k.
(b) Show that the same method also gives solutions for:
    min_{w∈Rr} ky − Qwk22   subject to: kwk0 ≤ k.
Exercise 10.9. (Sparse representation when A = U ΣP T ) Let r ≤ n, U ∈ Rn×r have orthonormal columns, Σ ∈ Rr×r be diagonal with positive diagonal entries, and P ∈ Rr×r be a generalized permutation. Set A = U ΣP T .
(a) Find a solution of the following problem and determine if the solution is unique:

    min_{x∈Rr} ky − Axk22   subject to: kxk0 ≤ k.
(b) Show that a similar approach gives the solution for: minx∈Rr ky − Axk22 + λkxk0 .
Exercise 10.10. (Sparse approximation in a quadratic norm) Let P, D ∈ Rn×n be symmetric PD with D diagonal, and let Q ∈ On . For x ∈ Rn , define kxkP ∆= (xT P x)1/2 and kxkD ∆= (xT Dx)1/2 .
(b) Show that the following two problems are equivalent, in the sense that an instance of one can be transformed
into an instance of the other. Thus a solution method for one problem gives a solution method for the other.
For general P , the first problem is a sparse approximation problem in a non-symmetric quadratic norm. The
second problem is a sparse representation problem with respect to an orthonormal basis and a simpler non-
symmetric norm.
Exercise 10.11. Let P ∈ Rn×n be symmetric. Show that f (x) = xT P x is separable if and only if P is diagonal.
Exercise 10.14. The Hadamard basis is an orthonormal basis in Rn with n = 2p . It can be defined recursively as the columns of the matrix Hpa with

    H0a = 1   and   Hpa = (1/√2) [ Hp−1a   Hp−1a ;  Hp−1a  −Hp−1a ] .
Chapter 11
The Lasso
11.1 Introduction
Sparse regression problems, e.g., minimizing ky − Axk22 + λkxk0 , have two coupled aspects: selecting a subset of atoms (subset selection) and forming a weighted sum of these atoms to obtain the best approximation to y. In general, the first aspect is computationally challenging, while the second is easy. The computational challenge of solving these problems motivated the introduction of greedy methods for finding an approximation to the solution.
An alternative approach is to relax the sparsity function k · k0 to the convex approximation k · k1 . To see
why we regard k·k1 as a convex relaxation of k·k0 , see Figure 10.1 and the informal explanation given in the
caption of the figure. The relaxed problem requires minimizing a convex function, e.g., ky − Axk22 + λkxk1 .
In general, for appropriate values of λ, the solution of the relaxed problem will be sparse. In this case, the
solution selects a subset S of atoms of A and finds a linear combination of these atoms to approximate y. In
a variety of applications the solution obtained may be an adequate substitute for the solution of the original
sparse regression problem. However, one can also use the subset of selected atoms to solve the least squares
problem minw∈Rn ky − AS wk22 , where the columns of AS are the selected atoms. This yields an alternative
approximation to the solution of the original sparse regression problem.
    min_{w∈Rn} kAw − yk22 + λkwk1 .    (11.1)

Problem (11.1) is an unconstrained convex optimization problem. Indeed, the objective function is the sum of two competing convex terms, kAw − yk22 and λkwk1 . These terms are competing in the sense that each has its own distinct minimizer. Example sublevel sets of these terms are illustrated in Figure 11.1. Consideration of these sublevel sets leads to two equivalent formulations of 1-norm regularized regression. The first minimizes kAw − yk22 over a fixed sublevel set of the ℓ1 norm:

    min_{w∈Rn} kAw − yk22   s.t. kwk1 ≤ ε.    (11.2)
Figure 11.1: Sublevel sets of kAw − yk22 (red) and the kwk1 regularizer (blue) together with the regularization path of the lasso solution x?sr . For comparison, a sublevel set of the ℓ2 norm (dashed blue) and the corresponding regularization path of the ridge regression solution are also shown.
The second minimizes kwk1 over a fixed sublevel set of kAw − yk22 :

    min_{w∈Rn} kwk1   s.t. kAw − yk22 ≤ δ.    (11.3)
The optimal solution of (11.2) occurs at the point w? where a level set of kAw − yk22 first intersects the ε-ball of k · k1 . This is illustrated in Figure 11.1. We also illustrate the ℓ2 -ball that first intersects the same level set of kAw − yk22 . Notice the difference in the sparsity of the two intersection points. The ℓ1 ball has vertices and these are aligned with the coordinate axes. These vertices are more likely to first touch a sublevel set of the quadratic term. So the shape of the unit ℓ1 ball encourages a sparse solution. There will be exceptions, but the exceptional cases require the sublevel sets of kAw − yk22 to be positioned in a particular way relative to the axes. The solution of (11.3) occurs at the point wδ? where a level set of k · k1 first intersects the δ-sublevel set of kAw − yk22 . Notice that for appropriate choices of δ and ε, problems (11.2) and (11.3) have the same solution. See Figure 11.1.
The above three problems are often called lasso problems. In general, no closed form expressions for
the solutions of (11.1), (11.2) and (11.3) are known. This is the usual situation for most convex optimization
problems. The important point is that the above convex problems are amenable to solution via efficient
numerical algorithms. The special case of (11.1) for sparse approximation is particularly easy to solve and
this connects to the sparse approximation problem using k · k0 . We discuss this in the next section.
If the data are rescaled by α > 0, problem (11.1) becomes min_{w} kȳ − Āwk22 + λ̄kwk1 , where ȳ = αy, Ā = αA, and λ̄ = α2 λ. As a result, it is meaningless to talk about the value of λ employed when solving (11.1) without accounting for possible re-scaling of the data. One way to do this is to let aj denote the j-th column of A and define λmax = max_{j∈[1:m]} |ajT y|. Then the ratio λ/λmax is invariant to scaling.
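A minimal sketch of this scale-invariant parameterization (names are illustrative):

```python
import numpy as np

def lam_max(A, y):
    """lam_max = max_j |a_j^T y|.  If y -> alpha*y, A -> alpha*A and
    lam -> alpha^2 * lam, then the ratio lam / lam_max(A, y) is unchanged."""
    return float(np.max(np.abs(A.T @ y)))
```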
    min_{x∈Rn} ky − xk22 + λkxk1 .    (11.4)

This requires selecting x to approximate y with a convex 1-norm penalty on x to encourage a sparse approximation. The objective function in (11.4) is convex and separable:

    ky − xk22 + λkxk1 = ∑_{i=1}^{n} ( (y(i) − x(i))2 + λ|x(i)| ) .
Each term in the sum can be optimized separately, leading to n scalar problems of the form:

    min_{α∈R} (y(i) − α)2 + λ|α| .    (11.5)

We would like to use differential calculus to solve (11.5), but we see that |α| is not differentiable at α = 0. For the moment, we will handle this as follows. First consider α > 0, and set the derivative w.r.t. α of the objective in (11.5) equal to 0. This yields α = y(i) − λ/2, which is consistent with α > 0 only when y(i) > λ/2. Similarly, for α < 0 we obtain α = y(i) + λ/2, which is consistent only when y(i) < −λ/2. The only case that remains is α = 0, with objective value y(i)2 . This must be the solution for −λ/2 ≤ y(i) ≤ λ/2. Hence the solution of each scalar problem is:
    x? (i) = { y(i) − λ/2, if y(i) ≥ λ/2;   0, if −λ/2 < y(i) < λ/2;   y(i) + λ/2, if y(i) ≤ −λ/2. }
If the magnitude of y(i) is smaller than λ/2, x? (i) is set to 0. This introduces sparsity in x? . The remaining
nonzero components of x? are formed by reducing the magnitude of the corresponding values in y by λ/2.
For this reason, this operation is called shrinkage.
Bring in the scalar soft thresholding function:

    St (z) = { z − t, if z ≥ t;   0, if −t < z < t;   z + t, if z ≤ −t. }

This function is illustrated in Figure 11.2. We have shown above that x? (i) = Sλ/2 (y(i)). The optimal solution of (11.4) is therefore obtained by applying Sλ/2 entrywise to y.
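A minimal NumPy sketch of this entrywise shrinkage solution (names are illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    """S_t(z): shrink z toward 0 by t, zeroing anything with |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

y = np.array([3.0, -0.2, 0.5, -2.0])
lam = 1.0
x_star = soft_threshold(y, lam / 2)    # solves min_x ||y - x||_2^2 + lam*||x||_1
print(x_star)                          # [ 2.5  0.   0.  -1.5]
```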
Figure 11.2: The hard (left) and soft (right) scalar thresholding functions Ht (z) and St (z).
Figure 11.3: Plots of √λ and λ/2 versus λ.
We can handle this issue by introducing the concept of the subgradient of a convex function. Recall that the derivative of a differentiable function f : Rn → R at x is the linear function from Rn into R given by Df (x)(h) = ∇f (x)T h. If f is also convex, then for any z ∈ Rn ,

    f (z) ≥ f (x) + ∇f (x)T (z − x).    (11.7)

This gives a global lower bound for f (z) in terms of f (x), the gradient of f at x, and the deviation of z from x. This result was given in Chapter 7 and is illustrated in Figure 7.3.
The bound (11.7) provides a way to define a “generalized gradient” for nondifferentiable convex functions. A vector g ∈ Rn is called a subgradient of a convex function f at x if for all z ∈ Rn ,

    f (z) ≥ f (x) + g T (z − x).

The function f is called subdifferentiable at x if it has a nonempty set of subgradients. When this holds, the set of subgradients, denoted by ∂f (x), is called the subdifferential of f at x. If f is differentiable at x then it is subdifferentiable, and its subdifferential is ∂f (x) = {∇f (x)}.
Example 11.4.1. Consider the scalar function |z|. For z ≠ 0, this is differentiable with ∂|z| = {1} if z > 0 and ∂|z| = {−1} if z < 0. At z = 0, we have ∂|z| = [−1, 1]. We can write this as

    g ∈ ∂|z| ⇐⇒ g = { 1, if z > 0;   γ ∈ [−1, 1], if z = 0;   −1, if z < 0. }

Similarly,

    g ∈ ∂kwk1 ⇐⇒ g(i) = { 1, if w(i) > 0;   γ ∈ [−1, 1], if w(i) = 0;   −1, if w(i) < 0. }    (11.8)
Lemma 11.4.1. The point w? minimizes the convex function f : Rn → R if and only if 0 ∈ ∂f (w? ).
Proof. If w? minimizes f , then for every z, f (z) ≥ f (w? ) = f (w? ) + 0T (z − w? ). Hence 0 ∈ ∂f (w? ).
Conversely, if 0 ∈ ∂f (w? ), then for all z, f (z) ≥ f (w? ) + 0T (z − w? ) = f (w? ). So w? minimizes f .
Expanding the quadratic term of the objective gives kAw − yk22 = wT AT Aw − 2wT AT y + y T y.
Proof. (If) Assume the condition (11.10) holds. Then 0 ∈ AT (Awla? − y) + (λ/2) ∂kwla? k1 . Hence by Lemma 11.4.1, wla? is a solution of (11.1).
(Only If) Assume wla? is a solution of (11.1). Then by Lemma 11.4.1, 0 ∈ AT (Awla? − y) + (λ/2) ∂kwla? k1 . Hence there exists g ∈ ∂kwla? k1 such that AT Awla? = AT y − (λ/2) g. This can be rearranged as (λ/2) g = AT (y − Awla? ). It follows that g must also be in R(AT ) = N (A)⊥ . Thus (11.10) holds.
The term y − Awla? is the residual, and the rows of AT are the atoms (columns of A). The j-th entry of g depends on the sign of the j-th entry of wla? , and by the above, (λ/2) g(j) must equal the inner product of the residual with the j-th atom. Specifically,

    ajT (y − Awla? ) = (λ/2) g(j) = { λ/2, if wla? (j) > 0;   γ ∈ [−λ/2, λ/2], if wla? (j) = 0;   −λ/2, if wla? (j) < 0. }

So when wla? uses atom aj (i.e., wla? (j) ≠ 0), the inner product of aj and the residual is (λ/2) sign(wla? (j)); but if wla? does not use atom aj , then the inner product of aj and the residual lies in the interval [−λ/2, λ/2]. Notice that the residual is never orthogonal to an atom used by wla? .
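This characterization is easy to check numerically: every used atom should have inner product ±λ/2 with the residual, and every unused atom an inner product of magnitude at most λ/2. A minimal sketch (w_star is assumed to be a candidate lasso solution obtained elsewhere):

```python
import numpy as np

def check_lasso_optimality(A, y, w_star, lam, tol=1e-6):
    """Verify the subgradient optimality condition for min_w ||Aw-y||_2^2 + lam*||w||_1."""
    c = A.T @ (y - A @ w_star)                 # inner products of atoms with residual
    used = np.abs(w_star) > tol
    ok_used = np.allclose(c[used], (lam / 2) * np.sign(w_star[used]), atol=tol)
    ok_unused = np.all(np.abs(c[~used]) <= lam / 2 + tol)
    return ok_used and ok_unused
```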
We have previously derived corresponding conditions for ridge and least squares regression. For least squares the condition is

    AT Awls? = AT y.

In this case, for every atom ai of A, aiT (y − Awls? ) = 0. So every atom is orthogonal to the residual. If AT A is invertible, the solution wls? = (AT A)−1 AT y is unique and is linear in y. For ridge regression the corresponding condition is

    AT Awrr? = AT y − λwrr? .
Hence for any atom aj , the residual y − Awrr? satisfies ajT (y − Awrr? ) = λwrr? (j). So an atom is orthogonal to the residual if and only if it has zero weight in wrr? . The unique ridge solution is wrr? = (AT A + λI)−1 AT y, which is also linear in y.
Equation (11.10) has the same form as those above, but g is a nonlinear function of y and wla? . In particular, wla? is not a linear function of y. This is shown in the examples below.
Corollary 11.5.1. In the limit as λ → 0, the solution of (11.1) is a least 1-norm solution of the least
squares problem minw∈Rn ky − Awk2 .
Proof. From (11.10),

    kAT Awla? (λ) − AT yk22 = (λ/2)2 kgk22 ≤ (λ/2)2 n.
Hence in the limit as λ → 0, wla? satisfies the normal equations and is hence a solution of the stated least squares problem. If the least squares problem has a unique solution, then we are done. If not, there is at least one point satisfying the normal equations that has the least 1-norm over all such solutions; let u be such a point. Without loss of generality we can assume that y ∈ R(A). Then ky − Auk22 = 0. Hence by the optimality of wla? (λ),

    ky − Awla? (λ)k22 + λkwla? (λ)k1 ≤ ky − Auk22 + λkuk1 = λkuk1 .

Thus kwla? (λ)k1 ≤ kuk1 . Taking the limit as λ → 0, and using the continuity of k · k1 , we obtain

    k limλ→0 wla? (λ)k1 = limλ→0 kwla? (λ)k1 ≤ kuk1 .

Since limλ→0 wla? (λ) solves the least squares problem, we must have equality in the final inequality. Hence k limλ→0 wla? (λ)k1 = kuk1 .
Corollary 11.5.2. The application of Theorem 11.5.1 to various special cases of (11.1) yields the following previously stated results for 1-norm regularized regression:
Proof. (a) Applying the necessary and sufficient conditions (11.10) to the approximation problem (11.4) yields x? = y − (λ/2) g for some g ∈ ∂kx? k1 . Writing this out componentwise we obtain

    x? (i) = y(i) − { λ/2, if x? (i) > 0 ≡ y(i) > λ/2;   γ ∈ [−λ/2, λ/2] with γ = y(i), if x? (i) = 0 ≡ y(i) ∈ [−λ/2, λ/2];   −λ/2, if x? (i) < 0 ≡ y(i) < −λ/2 }
           = Sλ/2 (y(i)).
11.5.1 Examples
Example 11.5.1. Let u, y ∈ Rn , with kuk2 = 1, and set A = [u]. For λ > 0 and w ∈ R, we seek the solution of (11.1).
Example 11.5.2. Let u, y ∈ Rn , with kuk2 = 1, and set A = [u, u]. For λ > 0 and w ∈ R2 , we seek the solution of (11.1). We have AT A = 11T , N (A)⊥ = span{1}, and AT y = (uT y)1. By Theorem 11.5.1, a necessary and sufficient condition for w? to be a solution is

    11T w? = (uT y)1 − (λ/2) g,  where g ∈ ∂kw? k1 ∩ span{1}.    (11.11)
The condition g ∈ ∂kw? k1 ∩ span{1} implies that g must take the form g = γ1, where γ ∈ [−1, 1].
If 1T w? > 0, then one entry of w? is positive and the other must be nonnegative. In this case γ = 1. If 1T w? < 0, then one entry of w? is negative and the other must be nonpositive. In this case γ = −1. If 1T w? = 0, then w? = 0 and γ ∈ [−1, 1]. In all cases we can write g = γ1 for appropriate γ ∈ [−1, 1].
Substituting this into (11.11) reveals that we require 1T w? = uT y − (λ/2) γ for the prescribed allowed values of γ. This yields

    1T w? = { uT y − λ/2, if 1T w? > 0 ≡ uT y > λ/2;   0, if 1T w? = 0 ≡ −λ/2 ≤ uT y ≤ λ/2;   uT y + λ/2, if 1T w? < 0 ≡ uT y < −λ/2. }
Thus 1T w? = Sλ/2 (uT y). This uniquely specifies the sum of the entries of w? . In general w? is not unique.
If the entries of w? are nonnegative, then any point w ≥ 0 with w(1) + w(2) = Sλ/2 (uT y) is a solution.
Similarly, if the entries of w? are nonpositive, then any point w ≤ 0 with w(1) + w(2) = Sλ/2 (uT y) is a
solution.
Example 11.5.3. Consider problem (11.1) with

    A = [ 1  0  0.5 ;  0  1  0.5 ] ,    y = (1, 0).
Note that y is one of the atoms in the dictionary. Simple computation yields

    AT A = [ 1  0  1/2 ;  0  1  1/2 ;  1/2  1/2  1/2 ] ,   N (A) = span{(1, 1, −2)} ,
    N (A)⊥ = span{(1, 0, 1/2), (0, 1, 1/2)} ,   AT y = (1, 0, 1/2).

A necessary and sufficient condition for w? to be a solution is

    AT A w? = (1, 0, 1/2) − (λ/2) g,  where g ∈ ∂kw? k1 ∩ span{(1, 0, 1/2), (0, 1, 1/2)}.    (11.12)
To find a solution of (11.12), first consider the limit as λ → 0. In this limiting case, w? = e1 + αn is a solution, where n = (1, 1, −2) and α ∈ R. It is easily checked that kw? k1 is minimized at α = 0. So at λ = 0 the least 1-norm solution is w? = e1 . As λ increases we expect the soft threshold operator to shrink the above solution to w? = (1 − λ/2)e1 . This is verified by selecting g = (1, 0, 1/2), and checking the equality of both sides in (11.12):

    AT A (1 − λ/2)e1 = (1 − λ/2, 0, 1/2 − λ/4)   and   (1, 0, 1/2) − (λ/2)(1, 0, 1/2) = (1 − λ/2, 0, 1/2 − λ/4).

Equality continues to hold for 0 ≤ λ < 2. Hence w? = (1 − λ/2)e1 is a solution for 0 ≤ λ < 2. At λ = 2 the solution is w? = 0. We expect that increasing λ further will leave the solution invariant at 0. This is verified by selecting g = (2/λ, 0, 1/λ) and checking the equality of both sides of (11.12). This is left as an exercise. In summary, we have determined the lasso solution path
summary, we have determined the lasso solution path
(
? (1 − λ/2)e1 , 0 ≤ λ ≤ 2;
wla (λ) = (11.13)
0, λ > 2.
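The path (11.13) can be checked numerically with a simple proximal-gradient (iterative soft-thresholding) loop for (11.1); a minimal sketch, not a production solver:

```python
import numpy as np

def ista(A, y, lam, n_iter=5000):
    """Iterative soft thresholding for min_w ||A w - y||_2^2 + lam * ||w||_1."""
    step = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = 2 * A.T @ (A @ w - y)                  # gradient of the quadratic term
        z = w - step * g
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # prox of step*lam*||.||_1
    return w

A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
y = np.array([1.0, 0.0])
print(ista(A, y, lam=1.0))   # approximately (1 - lam/2) e_1 = [0.5, 0, 0]
```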
Example 11.5.4. Consider the dictionary in Example 11.5.3. This time we seek the solution for yt = (1 − t)e1 + te2 , where t ∈ [0, 1]. For simplicity, consider λ < 1/2. In this case AT yt = (1 − t, t, 1/2), and the optimality condition becomes
    AT A w? = (1 − t, t, 1/2) − (λ/2) g,  where g ∈ ∂kw? k1 ∩ span{(1, 0, 1/2), (0, 1, 1/2)}.    (11.14)
For t = 0 we know from Example 11.5.3 that the solution is w? = (1 − λ/2)e1 . Hence for t > 0 we first examine whether w? = αe1 , with α > 0, continues to be a solution. In this case, (11.14) requires that

    α (1, 0, 1/2) = (1 − t, t, 1/2) − (λ/2) (1, γ, 1/2 + γ/2),   for some γ ∈ [−1, 1].

This yields

    α = 1 − t − λ/2, for t ≤ 1 − λ/2;    γ = 2t/λ, for t ≤ λ/2.
We note that at t = 1 − λ/2, w? (1) = 0. So for larger values of t we examine w? = βe2 . In this case, (11.14) requires that

    β (0, 1, 1/2) = (1 − t, t, 1/2) − (λ/2) (γ, 1, 1/2 + γ/2),   for some γ ∈ [−1, 1].

This yields

    β = t − λ/2, for λ/2 ≤ t ≤ 1;    γ = 2(1 − t)/λ, for 1 − λ/2 ≤ t ≤ 1.
Finally, we have the complete solution path as t takes values from 0 to 1 for a fixed value of λ > 0:

    wla? (t) = { (1 − t − λ/2)e1 , 0 ≤ t ≤ λ/2;
                 (1 − t − λ/2)e1 + (t − λ/2)e2 , λ/2 ≤ t ≤ 1 − λ/2;
                 (t − λ/2)e2 , 1 − λ/2 ≤ t ≤ 1. }    (11.15)
We now make the following observations. For the solution w? to be linear in y, the solution path for
yt must be w? (t) = (1 − t)w? (0) + tw? (1), where w? (0) is the solution for y0 and w? (1) for y1 . So the
solution path would move in a straight line from w? (0) to w? (1). But we see that this is not the case. So the
lasso solution is not linear in y. From w? (0) = (1 − λ/2)e1 the solution first shrinks until t = λ/2. At this t,
w? (t) = (1 − λ)e1 . Symmetrically, as t decreases from 1, the solution w? (1) = (1 − λ/2)e2 shrinks until
t = 1 − λ/2. At which point w? (t) = (1 − λ)e2 . As t transits from λ/2 to 1 − λ/2, the solution path does
linearly interpolate between w? (λ/2) and w? (1 − λ/2).
The optimality condition that 0 must be in each subdifferential gives µ = z and AT µ = λg for some
g ∈ ∂kwk1 . These equations allow the elimination of z and w from L. In detail,
To ensure the satisfiability of the equation AT µ = λg for some g ∈ ∂kwk1 , it is necessary that |aTi µ| ≤ λ
for each atom ai . This leads to the following dual problem:
    min_{µ∈Rn} kµ − yk22   subject to: |aiT µ| ≤ λ, i ∈ [1 : m].    (11.20)
    y = µ? + Aw?    (11.21)

    aiT µ? = { λ sgn(w? (i)), if w? (i) ≠ 0;   γ ∈ [−λ, λ], if w? (i) = 0. }    (11.22)
From (11.21) we see that µ? is the optimal residual resulting from the selection of w? . So the dual problem
directly seeks the optimal residual.
Let Fλ denote the set of µ satisfying the constraints in (11.20). These constraints can be written as a set of 2m linear half-space constraints aT µ ≤ λ for each a ∈ {±ai }_{i=1}^{m}. So the set of feasible points is a convex polytope in Rn . In addition, minimizing the objective function in (11.20) seeks the closest point in Fλ to y. Hence µ? = PFλ (y).
It is sometimes convenient to introduce the scaled dual variable θ = µ/λ. This removes λ from the
constraints in the dual problem. This results in the modified dual problem,
    min_{θ∈Rn} kθ − y/λk22   s.t. |aiT θ| ≤ 1, i ∈ [1 : m],    (11.23)
In this case, the dual solution θ? (λ) is the unique projection of y/λ onto the closed convex polytope F =
{θ : |aTi θ| ≤ 1}. Notice that now the dual feasible set F does not depend on λ and we project the point
y/λ onto F. This is illustrated in Figure 11.4. Version (11.23) of the dual problem makes it very clear that
the dual solution θ? is a continuous function of λ, and this can be used to show that the solution w? of the
primal problem (11.1) is also a continuous function of λ.
Theorem 11.6.1. Let θ? (λ) be the solution of the dual problem (11.23) and w? (λ) be the solution of
the primal problem (11.1). Then θ? (λ), Aw? (λ) are continuous functions of λ.
Proof. From (11.23) we note that the set of dual feasible points F is the intersection of a set of closed half spaces. Hence F is closed and convex. Thus θ? (λ) is the unique projection of y/λ onto F. The argument y/λ is continuous at each λ > 0, and the projection PF : Rn → F is continuous. Hence θ? (λ) is continuous in λ for all λ > 0. From (11.24) we see that Aw? (λ) = y − λθ? (λ) is also continuous in λ.
Notes
Not done yet.
Exercises
Exercise 11.1. Solve each of the following problems:
(a) minx∈Rn ky − xk22 + λkDxk1 .
(c) minx∈Rk ky − U DP T xk22 + λkxk1 , where U ∈ Rn×k has ON columns, D ∈ Rk×k is diagonal with a positive
diagonal, and P ∈ Rk×k is a generalized permutation.
Figure 11.4: The geometry of the dual problem in terms of the dual variable θ (shown: the atoms ±ai , the dual feasible region, and the points y and y/λ).
Suppose λ > 1/2 and yγ = (1 − γ)e1 + γe2 where γ ∈ [0, 1]. Find and plot the solution wγ? of the corresponding lasso
regression problem as a function of γ. Also plot ŷγ = Awγ? .
Let yγ = (1 − γ)e1 + γe2 where γ ∈ [0, 1]. By imposing an upper bound on λ, find and plot the solution wγ? of the
corresponding lasso regression problem as a function of γ. Also plot ŷγ = Awγ? .
Exercise 11.4. Let D = [d1 , . . . , dp ] be a dictionary of unit norm atoms and consider the sparse representation
problem
minp ky − Dwk22 + λkwk1 .
w∈R
(a) Let y = dj . Show that w? = (1 − λ)ej and Dw? = (1 − λ)dj , for 0 < λ < 1.
(b) Fix dj , and let k ∆= arg maxi≠j |diT dj |, C = dkT dj , and for convenience assume C ≥ 0. Constrain λ < (1 + C)/2. Now let yγ = (1 − γ)dj + γdk for γ ∈ [0, 1]. So yγ traces out the line segment from dj to dk . Determine the corresponding solution wγ? of the ℓ1 -sparse regression problem as a function of γ.
Exercise 11.5. (Basis Pursuit) Let A ∈ Rm×n with rank(A) = m < n, and y ∈ Rm . We seek the sparsest solution
of Ax = y:
    min_{x∈Rn} kxk0 ,   s.t. Ax = y.    (11.25)
minx,z∈Rn 1T z
s.t. Ax = y
x−z ≤ 0
−x − z ≤ 0
−z ≤ 0
Exercise 11.6. Let A ∈ Rm×n with rank(A) = m < n. Then for y ∈ Rm set
with λ > 0, A ∈ Rm×n , and y ∈ Rm . For simplicity, assume that rank(A) = m < n.
(a) Show that problem (11.27) can be written in the form
Chapter 12
This chapter formulates a generative model for data drawn from c classes. We then derive the corresponding Bayes classifier for this model. To illustrate this in greater detail, we examine Gaussian generative
models as a special case. We then introduce the exponential family of densities and derive the binary
Bayes classifier for any two densities in the same exponential family. In the process, we also introduce
the concept of the KL-divergence of two densities or probability mass functions. Finally, we introduce
empirically derived Bayes classifiers and show that several simple classifiers (Nearest Centroid, Gaussian
Naive Bayes, and Linear Discriminant Analysis), are instances of this method of classifier design under
particular assumptions about the structure of an underlying Gaussian generative model.
The term fY|X (k|x) is called the posterior probability of class k. The example x gives information about
its corresponding class. Hence once x is known, the prior class probabilities πk can be updated. The pmf
fY|X (k|x) specifies the posterior class probabilities after observing the example x. Notice that the posterior
pmf combines information provided by x with information provided by the class prior pmf.
If we predict the label of x to be k, then the probability of error is 1 − fY|X (k|x). Hence to minimize
the probability of error, we select the label with maximum posterior probability. This yields the Maximum
Aposteriori Probability (MAP) classifier. Since the term fX (x) in (12.2) is common to all label values, it
can be dropped. The MAP classifier is then specified by
    ŷMAP (x) ∆= arg max_{k∈[1:c]} πk pk (x).    (12.3)
By taking the natural log of the RHS, this can be equivalently written as

    ŷMAP (x) = arg max_{k∈[1:c]} ln pk (x) + ln πk .    (12.4)
The term ln pk (x) is the log-likelihood for class k under the observation x. It gives a measure of the
likelihood that the class is k, given the example value x. The term ln πk is a measure of the prior likeli-
hood that the class is k. The sum of the two terms blends prior information with information provided by
observing x, to give an overall measure of the likelihood of label k. Each of the functions gk (x) = πk pk (x)
and hk (x) = ln pk (x) + ln(πk ) measures the likelihood that the class is k given the data. In each case, classification is performed by taking the maximum over these functions: ŷ(x) = arg max_{k∈[1:c]} gk (x) = arg max_{k∈[1:c]} hk (x). Hence the functions {gk (x)} and the functions {hk (x)}, k ∈ [1 : c], are called discriminant functions.
The term p1 (x)/p0 (x) is called the likelihood ratio, and ln(p1 (x)/p0 (x)) is called the log-likelihood ratio.
One can think of these terms as quantifying the relative evidence in support of class 1 provided by x.
Similarly, ln(π1 /π0 ) quantifies the prior relative evidence in support of class 1. So the classifier (12.5)
decides the class is 1 when the aggregate relative evidence provided by the example and the prior for class 1
is greater than that for class 0.
Given the example value x ∈ R, we select the label value to minimize the probability of error. This is a MAP classification problem based on a scalar valued feature.
The posterior probabilities fY|X (k|x) satisfy

    fX (x) fY|X (k|x) = πk (1/(√(2π) σk )) exp( −(x − µk )2 /(2σk2 ) ),   k ∈ [1 : c].
We obtain the MAP classifier by taking the negative of the natural log of this expression, and dropping terms
that don’t depend on k. This yields
    ŷMAP (x) = arg min_{k∈[1:c]} ( (x − µk )/σk )2 + 2 ln(σk ) − 2 ln(πk ).    (12.7)
The term σk inside the parentheses in (12.7) scales x − µk to units of class k standard deviations. The
quadratic term thus penalizes the standardized distance between x and µk . The second term adds a penalty
for larger variance, and the third term adds a penalty for lower class probability.
If all class variances equal σ 2 , we obtain the slightly simpler classifier
ŷMAP(x) = arg min_{k∈[1:c]} ((x − µk)/σ)² − 2 ln(πk)
        = arg min_{k∈[1:c]} (x − µk)² − 2σ² ln(πk).    (12.8)
The first line in (12.8) classifies to the closest class mean in standardized units, except for a bias term that
takes into account prior class probability. The second expression is equivalent. It indicates that if the distance
to the mean is not in standardized units, then the bias due to the prior must be scaled by the variance.
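As a quick numerical check (a sketch, with made-up parameters), the two forms in (12.8) give identical decisions:

```python
import numpy as np

# Equal-variance scalar Gaussian MAP classifier: the standardized form with a
# prior bias and the unstandardized form with the bias scaled by sigma^2
# agree for every x. Parameters are illustrative.
mu, sigma, priors = np.array([0.0, 3.0]), 1.5, np.array([0.7, 0.3])

def decide_standardized(x):
    return np.argmin(((x - mu) / sigma) ** 2 - 2 * np.log(priors))

def decide_unstandardized(x):
    return np.argmin((x - mu) ** 2 - 2 * sigma ** 2 * np.log(priors))

assert all(decide_standardized(x) == decide_unstandardized(x)
           for x in np.linspace(-3, 6, 19))
```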
The first term ((x − µ0 )/σ0 )2 on the LHS of (12.9) measures the squared standardized distance between
x and µ0 . The second term does the same computation for class 1. These terms are then subtracted to
determine which is larger. The third term accounts for any difference in class variances. It gives a bonus to
the class of smaller variance. The final term adds a bias in favor of the class of higher prior probability.
There are two interesting special cases. The first is equal class variances: σ02 = σ12 = σ 2 . For simplicity
assume µ1 ≥ µ0 , and let
µ̄ ≜ (µ0 + µ1)/2   and   κ ≜ (µ1 − µ0)/σ.
The point µ̄ is the midpoint of the line joining µ0 and µ1. The term κ is a measure of the dissimilarity of the class-conditional densities.¹ The more dissimilar the class-conditional densities, the more information an example provides (on average) about its class. For example, increasing σ makes the examples more noisy and hence less "informative". Conversely, increasing µ1 − µ0 makes the examples more "informative".
¹ Dissimilarity of densities is measured by KL-divergence. In this case, DKL(N(µ1, σ²)‖N(µ0, σ²)) = ½κ².
Figure 12.1: Six examples of pairs of conditional Gaussian densities and the resulting MAP decision regions. In the
left column, the class means are equal, while in the right column the class variances are equal. The middle column
shows a general example. In the top row π0 = π1 = 0.5, and in the bottom row π0 = 0.6, π1 = 0.4. Note that the
left two figures in both rows have a decision interval for class 1 that is surrounded by the decision region for class 0.
This is due to the quadratic form of the decision surface. In contrast, the two plots on the right use a single threshold
to separate the line into two decision regions. These plots show linear classifiers.
The term (x − µ̄)/σ in (12.10) measures the signed standardized distance from µ̄ to x. If κ > 1, κ amplifies
(x − µ̄)/σ, otherwise it reduces it. In the first case, we weight the example more and discount the prior. In
the second, we discount the example and put more weight on the prior. The second expression in (12.10)
indicates that for equal class variances, the binary Gaussian MAP classifier simply compares x to a threshold
τ , deciding class 1 if x > τ and class 0 otherwise (recall we assumed µ1 ≥ µ0 ). This is a linear classifier.
The classifier is illustrated in the right plots of Figure 12.1.
The second special case is when the class means are equal: µ0 = µ1 = µ.² Under this constraint, the binary MAP classifier takes the form (12.11) below.
² In this case, DKL(N(µ, σ1²)‖N(µ, σ0²)) = ½(σ1²/σ0² − 1 + ln(σ0²/σ1²)).
Figure 12.2: Binary MAP classification for conditional class densities N(µk, σ²), k ∈ {0, 1}. Left: MAP classifier performance as a function of κ = |µ1 − µ0|/σ. The plot shows the conditional probability of error Rk given that the true label is k ∈ {0, 1}, and the overall probability of error R?. Right: ROC curves for various values of κ. An ROC curve plots the probability of classifier success given that the true label is 1 versus the probability of error given that the true label is 0. Each curve is the locus of the point (R0, 1 − R1) as the threshold τ in (12.10) moves from −∞ to ∞. The MAP classifier specifies a particular operating point (determined by τ in (12.10)) on each curve. These points are indicated for two values of π0. R0 is sometimes called the probability of a false positive (or false alarm, false discovery), and 1 − R1, the probability of a true positive (or true detection, true discovery). The abbreviation ROC stands for Receiver Operating Characteristic, a term coined in the early days of radar.
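A sketch of how such ROC curves can be generated (illustrative parameter values, not the ones used for the figure): sweep the threshold τ and record the operating points (R0, 1 − R1).

```python
import numpy as np
from scipy.stats import norm

# For the equal-variance model, decide class 1 when x > tau. Sweeping tau
# traces the ROC locus (R0, 1 - R1). Parameters are illustrative.
mu0, mu1, sigma = 0.0, 2.0, 1.0                      # kappa = 2
taus = np.linspace(mu0 - 5 * sigma, mu1 + 5 * sigma, 201)
R0 = 1.0 - norm.cdf((taus - mu0) / sigma)            # P(decide 1 | class 0)
tpr = 1.0 - norm.cdf((taus - mu1) / sigma)           # 1 - R1 = P(decide 1 | class 1)
roc = np.column_stack([R0, tpr])                     # points on the ROC curve
```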
ŷMAP(x) = { 1, if ((x − µ)/σ1)² (σ1²/σ0² − 1) + 2 ln(σ0/σ1) + 2 ln(π1/π0) > 0;
            0, otherwise.    (12.11)
The decision boundary is the set of points satisfying the above inequality with equality. This is a level set
of a quadratic function in x. Hence one decision region will be surrounded by the other. If σ1 > σ0 , the
quadratic is convex. So the outer region will classify to class 1 and the inner region to class 0. Conversely, if
σ0 > σ1 , the quadratic is concave, the inner region classifies to class 1 and the outer region to class 0. This
case is illustrated in the left plots of Figure 12.1.
Let Dk ⊂ R denote the set of x ∈ R for which the binary classifier (12.9) makes decision k, k ∈ {0, 1}.
Clearly D1 = D0c . We call these sets the decision regions of the classifier. For the quadratic function
in (12.9), in general, one decision region is an interval of the form (a, b) and the other takes the form
(−∞, a] ∪ [b, ∞).
The classifier can make two types of error. It can decide the class is 1 when the true class is 0, or it can
decide the class is 0 when the true class is 1. These errors are made with conditional probabilities
Pe|0 = ∫_{D1} (1/(√(2π) σ0)) e^{−(x−µ0)²/(2σ0²)} dx    (12.12)
Pe|1 = ∫_{D0} (1/(√(2π) σ1)) e^{−(x−µ1)²/(2σ1²)} dx.    (12.13)
The total probability of error Pe is then the average of Pe|0 and Pe|1 with respect to the class probabilities:
Pe = π0 Pe|0 + π1 Pe|1 . (12.14)
In the literature, you will often see Pe|1 and Pe|0 denoted by R1 and R0 , respectively. This notation reflects
the description of these quantities as the conditional risks. The probability of error of the Bayes classifier is
then called the Bayes risk and denoted by R? .
It will be useful to determine detailed expressions for R0 , R1 and R? under the assumption that σ02 = σ12 .
To do so assume µ0 ≤ µ1 , let τ denote the threshold in (12.10), and Φ denote the cumulative distribution
function of the N (0, 1) density. Then
R0 = ∫_τ^∞ Nµ0,σ²(x) dx = 1 − Φ((τ − µ0)/σ)
R1 = ∫_{−∞}^τ Nµ1,σ²(x) dx = Φ((τ − µ1)/σ) = 1 − Φ((µ1 − τ)/σ).
Substituting the value for τ from (12.10), and simplifying yields:
(τ − µ0)/σ = κ/2 + (1/κ) ln(π0/π1)   and   (µ1 − τ)/σ = κ/2 − (1/κ) ln(π0/π1).
Substituting these expressions into those for R0 and R1 , and averaging over the class probabilities, we find
R? = 1 − π0 Φ( κ/2 + (1/κ) ln(π0/π1) ) − π1 Φ( κ/2 − (1/κ) ln(π0/π1) ).    (12.15)
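The closed forms above are easy to evaluate numerically. The following sketch (illustrative parameters) computes R0, R1 and R? and checks that R? = π0 R0 + π1 R1.

```python
import numpy as np
from scipy.stats import norm

# Evaluate R0, R1 and R? using the substitutions above; check consistency.
mu0, mu1, sigma, pi0 = 0.0, 2.0, 1.0, 0.6
pi1 = 1.0 - pi0
kappa = (mu1 - mu0) / sigma
a = kappa / 2 + np.log(pi0 / pi1) / kappa            # (tau - mu0)/sigma
b = kappa / 2 - np.log(pi0 / pi1) / kappa            # (mu1 - tau)/sigma
R0 = 1.0 - norm.cdf(a)
R1 = 1.0 - norm.cdf(b)
R_star = 1.0 - pi0 * norm.cdf(a) - pi1 * norm.cdf(b)
assert np.isclose(R_star, pi0 * R0 + pi1 * R1)
```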
We expect the probability of error to decrease monotonically as κ2 increases. This is indeed the case.
Lemma 12.3.1. The probability of error (12.15) decreases monotonically as κ2 increases.
Proof. When π0 = 1/2, we have π1 = 1−π0 = 1/2 and the probability of error simplifies to Pe = 1−Φ( κ2 ).
This function is clearly strictly monotone decreasing in κ. Since Pe is continuous in κ, it follows that
there must be an open interval around the point 1/2 such that for each π0 in this interval, Pe is monotone
decreasing in κ. The proof of the lemma (given in the appendix) shows that this interval is (0, 1).
Figure 12.2 shows plots of R0 , R1 , and R? versus κ, and of R0 and R1 plotted as ROC curves. Note that
R0 and R1 are not both monotonic in κ, but the weighted sum Pe = R? is monotonically decreasing in κ.
The density (12.19) is called a mixture of Gaussians. It is a convex combination of c distinct Gaussian densities. Taking the natural log of pk(x) = fX|Y(x|k) yields

ln(pk(x)) = −(n/2) ln(2π) − ½ ln(|Σk|) − ½ (x − µk)^T Σk^{-1}(x − µk).    (12.20)
Substituting this expression into (12.4), dropping constant terms, and multiplying by −2, yields the MAP classifier for the assumed model:

ŷMAP(x) = arg min_{k∈[1:c]} (x − µk)^T Σk^{-1}(x − µk) + ln(|Σk|) − 2 ln(πk).    (12.21)
Binary Classification
For binary classification only two terms appear in (12.21). These can be subtracted and the result compared
to 0. This yields the classifier,
ŷ(x) = { 1, if (x − µ0)^T Σ0^{-1}(x − µ0) − (x − µ1)^T Σ1^{-1}(x − µ1) + ln(|Σ0|/|Σ1|) + 2 ln(π1/π0) > 0;
         0, otherwise.    (12.22)
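A minimal sketch of the quadratic rule (12.22); the means, covariances and priors below are made up for illustration.

```python
import numpy as np

# Binary Gaussian MAP classifier with distinct covariances (rule (12.22)).
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S0 = np.array([[1.0, 0.3], [0.3, 1.0]])
S1 = np.array([[2.0, 0.0], [0.0, 0.5]])
pi0, pi1 = 0.5, 0.5

def y_hat(x):
    d0 = (x - mu0) @ np.linalg.solve(S0, x - mu0)    # Mahalanobis distance to class 0
    d1 = (x - mu1) @ np.linalg.solve(S1, x - mu1)    # Mahalanobis distance to class 1
    stat = d0 - d1 + np.log(np.linalg.det(S0) / np.linalg.det(S1)) + 2 * np.log(pi1 / pi0)
    return 1 if stat > 0 else 0

print(y_hat(np.array([1.5, 0.5])), y_hat(np.array([-1.0, 0.0])))
```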
with parameters
w = Σ^{-1}(µ1 − µ0)
b = −w^T(µ0 + µ1)/2 + ln(π1/π0).    (12.25)
If the classes are equiprobable, the classifier selects the closest class mean. For two classes, the classifier is
linear, with w and b given by (12.25) using Σ = σ 2 In .
The Nearest Centroid Classifier. This is an empirically derived Gaussian Bayes classifier that as-
sumes all classes have the same spherical covariance. If m and mk denote the number of examples and
the number of examples in class k, respectively, then the parameter estimates are
π̂k = mk/m,    µ̂k = (1/mk) Σ_{yj=k} xj,    σ̂² = (1/(nm)) Σ_{k=1}^c Σ_{yj=k} Σ_{i=1}^n (xj(i) − µ̂k(i))².
The classifier is formed by using these estimates in the MAP classifier (12.26).
Gaussian Naive Bayes. This is an empirically derived Gaussian Bayes classifier that assumes inde-
pendent features. If m and mk denote the number of examples and the number of examples in class k,
respectively, then
π̂k = mk/m,    µ̂k(i) = (1/mk) Σ_{y(j)=k} xj(i),    σ̂k²(i) = (1/mk) Σ_{y(j)=k} (xj(i) − µ̂k(i))²,    i ∈ [1:n], k ∈ [1:c].
The classifier is formed by using these estimates in the MAP classifier (12.27).
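A sketch of Gaussian Naive Bayes on synthetic data (the data, and the convention of storing examples as rows, are choices made for this illustration):

```python
import numpy as np

# Estimate pi_k, per-feature means and variances, then classify by summing
# per-feature Gaussian log-likelihoods plus the log prior.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [1, 2], (50, 2)),
               rng.normal([3, 1], [1, 1], (60, 2))])
y = np.array([0] * 50 + [1] * 60)
classes = np.unique(y)

pi_hat  = np.array([np.mean(y == k) for k in classes])          # m_k / m
mu_hat  = np.array([X[y == k].mean(axis=0) for k in classes])   # mu_hat_k(i)
var_hat = np.array([X[y == k].var(axis=0) for k in classes])    # sigma_hat_k^2(i)

def predict(x):
    ll = -0.5 * np.sum(np.log(2 * np.pi * var_hat) + (x - mu_hat) ** 2 / var_hat, axis=1)
    return classes[np.argmax(ll + np.log(pi_hat))]

print(predict(np.array([0.2, -0.5])), predict(np.array([2.8, 1.2])))
```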
To set the stage, recall that the line through the origin in the direction of w ∈ Rn is the set of all points of the form αw with α ∈ R. Each point p on this line is uniquely specified by its corresponding coordinate α ∈ R. Without loss of generality, we will assume that ‖w‖2 = 1. Hence the orthogonal projection of x ∈ Rn onto the line is the point p(x) = (w^T x)w, and the coordinate of p(x) with respect to w is w^T x.
Now consider a binary labelled training dataset {(xi, yi)}_{i=1}^m with xi ∈ Rn and yi ∈ {0, 1}, i ∈ [1:m]. A linear classifier for this dataset consists of a pair w ∈ Rn and b ∈ R. Since any nonzero scaling of w and b does not change the classifier, we assume that ‖w‖2 = 1. The classifier orthogonally projects the dataset onto the line through the origin in the direction w, yielding a labelled set of scalars {(w^T xi, yi)}_{i=1}^m. If w^T x > −b, then x is predicted to be in one class, and in the other class otherwise.
Bring in the class sample means µ̂k and sample covariance matrices R̂k, k = 0, 1. Then the class sample means and variances of the projected training examples {(w^T xi, yi)}_{i=1}^m are given by

ν̂k(w) = (1/mk) Σ_{yi=k} w^T xi = w^T ( (1/mk) Σ_{yi=k} xi ) = w^T µ̂k
σ̂k²(w) = (1/mk) Σ_{yi=k} (w^T(xi − µ̂k))² = w^T R̂k w,    k = 0, 1.
To improve class separation under the projection, it seems reasonable to select w to make (ν̂1(w) − ν̂0(w))² large (push the projected class means apart). However, this ignores the dependence of the variances σ̂k²(w) on w. Large variances could hinder class separation. To resolve this, LDA selects the unit norm vector w to maximize the ratio of the squared distance between the projected means

(ν̂0(w) − ν̂1(w))² = w^T(µ̂0 − µ̂1)(µ̂0 − µ̂1)^T w,

and the weighted sum of class variances

(m0/m) w^T R̂0 w + (m1/m) w^T R̂1 w = (1/m) w^T(S0 + S1)w,
where Sk = mk R̂k denotes the scatter matrix of class k = 0, 1, and the constant factor 1/m is dropped since it does not affect the maximizer. This leads to the objective function
J(w) = ( w^T(µ̂0 − µ̂1)(µ̂0 − µ̂1)^T w ) / ( w^T(S0 + S1)w ) = ( w^T SB w ) / ( w^T SW w ),
where
SB = (µ̂0 − µ̂1 )(µ̂0 − µ̂1 )T
is called the between-class scatter matrix and
SW = S0 + S1 = Σ_{yi=0} (xi − µ̂0)(xi − µ̂0)^T + Σ_{yi=1} (xi − µ̂1)(xi − µ̂1)^T
is called the within-class scatter matrix. By this reasoning we arrive at the LDA optimization problem:
w?LDA = arg max_{w∈Rn} ( w^T SB w ) / ( w^T SW w )
        s.t. ‖w‖2² = 1.    (12.28)
Problem (12.28) is an instance of a generalized Rayleigh quotient problem. In general, finding a solution
involves finding the maximum eigenvalue and a corresponding unit norm eigenvector of an appropriate
matrix (see Appendix D). However, in this instance, the rank one structure of SB ensures that (12.28) has a
simple solution: up to a scale factor, w?LDA must be a solution of the set of linear equations

SW w = µ̂0 − µ̂1.
This solution can then be scaled to ensure it has unit norm (Exercise 12.1).
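A sketch of this computation on synthetic data (the data are invented; only the linear-algebra steps follow the text):

```python
import numpy as np

# Compute the LDA direction by solving S_W w = mu0_hat - mu1_hat and
# scaling the solution to unit norm.
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 40)   # class 0 examples
X1 = rng.multivariate_normal([2, 1], [[1, 0.5], [0.5, 1]], 40)   # class 1 examples

mu0_hat, mu1_hat = X0.mean(axis=0), X1.mean(axis=0)
S0 = (X0 - mu0_hat).T @ (X0 - mu0_hat)        # class 0 scatter matrix
S1 = (X1 - mu1_hat).T @ (X1 - mu1_hat)        # class 1 scatter matrix
SW = S0 + S1                                  # within-class scatter

w = np.linalg.solve(SW, mu0_hat - mu1_hat)
w /= np.linalg.norm(w)                        # unit norm LDA direction
print(w)
```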
An Alternative Perspective
Assume the example value x ∈ Rn and its label y ∈ {0, 1} are the outcomes of random variables X and Y,
with the joint probability density specified in the factored form (12.1). Let the class-conditional densities be
N (µk , Σ), and πk denote the prior probability of class k, k ∈ {0, 1}. By (12.24), the MAP classifier for this
model is linear. It predicts class 1 if w?MAP^T x + b?MAP > 0, and class 0 otherwise, where

w?MAP = Σ^{-1}(µ1 − µ0)
b?MAP = −w?MAP^T(µ0 + µ1)/2 + ln(π1/π0).    (12.29)
To link this model to the LDA problem, let w ∈ Rn have unit norm and consider the scalar random variable Z = w^T X. Z will have scalar class-conditional densities N(νk, σ²), and class prior probabilities πk, k = 0, 1. We must have

νk = E_{N(µk,Σ)}[w^T X] = w^T E_{N(µk,Σ)}[X] = w^T µk,  k = 0, 1,   and   σ² = w^T Σ w.
Without loss of generality, assume that ν0 < ν1 . Then the MAP classifier for the scalar model is given by
(12.10). Translated into the current notation, this has the form
ŷ(w^T x) = { 1, if w^T x + b > 0;
             0, otherwise,    (12.30)
where the threshold b is given by
b = −w^T(µ0 + µ1)/2 + ( (w^T Σ w) / (w^T(µ1 − µ0)) ) ln(π1/π0).    (12.31)
Expression (12.31) is an extension to general w of the expression for b?MAP in (12.29). The first terms in
the expressions for b?MAP and b are the same. Moreover, if we substitute the value of w given in (12.29) into
the second term in (12.31), it simplifies to the second term in the expression for b?MAP .
We now select w to minimize the classifier probability of error Pe given in (12.15). In the current
situation
κ² = (ν1 − ν0)²/σ² = ( w^T(µ0 − µ1)(µ0 − µ1)^T w ) / ( w^T Σ w ).
By Lemma 12.3.1, the probability of error decreases monotonically as κ2 increases. Hence we select w to
maximize κ2 :
max_{w∈Rn}  ( w^T(µ0 − µ1)(µ0 − µ1)^T w ) / ( w^T Σ w )
s.t.  ‖w‖2² = 1.    (12.32)
Modulo a scaling of the objective function, problem (12.32) is the same as problem (12.28). We now make a
few more observations. The Bayes classifier (12.29) minimizes the probability of error for its corresponding model. Since problem (12.32) includes the normalized solution w?MAP in its feasible set, the solution of (12.32) is at least as good as that of (12.29). The optimality of (12.29) then implies that both solutions are equivalent in the sense that each specifies the same hyperplane. So the binary LDA classifier determined by solving (12.32) is the binary Bayes classifier for the "common covariance" Gaussian model. We summarize
LDA in the box below.
The Binary LDA Classifier. LDA is an empirically derived Gaussian Bayes classifier that assumes both classes have the same covariance. This assumption results in a linear classifier. Assuming the estimated covariance matrix Σ̂ is full rank, the parameters of the LDA classifier can be specified by

w = Σ̂^{-1}(µ̂1 − µ̂0),    b = −w^T(µ̂0 + µ̂1)/2 + ln(π̂1/π̂0).
The KL-divergence of a density (or pmf) p1 from a density (or pmf) p0 is defined by

D(p1(x)‖p0(x)) ≜ E_{p1}[ ln( p1(X)/p0(X) ) ],

provided the expectation exists (is finite). By definition, KL-divergence is the expected value under p1 of the log-likelihood ratio ln(p1(x)/p0(x)). In general, D(p1(x)‖p0(x)) ≠ D(p0(x)‖p1(x)). So KL-divergence is not symmetric. Hence it is not a true distance. However, it is nonnegative. To prove this we use the following extension of Jensen's inequality.
Theorem 12.6.1 (Jensen's Inequality [11]). If f is a convex function and X is a random variable, then

f(E[X]) ≤ E[f(X)].    (12.35)

Moreover, if f is strictly convex, then equality in (12.35) implies that X = E[X] with probability 1.
When two functions are equal except on a set of points with probability zero, we say that the functions
are equal with probability one (w.p.1.). A closely related concept is that two functions are equal almost
everywhere (a.e.). This holds when the functions are equal except on a set that has zero “measure”. An
example of a set measure is the probability of the set. Another example is the length of an interval. The cor-
responding measure for sets built from unions and intersections of intervals is derived accordingly. We will
not need to go into the details of set measures, except to understand the concept of equal almost everywhere.
Theorem 12.6.2. For all p0(x), p1(x) for which D(p1‖p0) is defined, D(p1‖p0) ≥ 0 with equality if and only if p1 = p0 a.e..

Proof. We use the strict convexity of the function − ln(·) together with Jensen's inequality to obtain

D(p1‖p0) = E_{p1}[ − ln( p0(X)/p1(X) ) ] ≥ − ln( E_{p1}[ p0(X)/p1(X) ] ) = − ln( ∫ p0(x) dx ) = − ln(1) = 0.

Suppose D(p1‖p0) = 0. Then there must be equality in Jensen's inequality. Theorem 12.6.1 then implies that ln(p1/p0) = 0 a.e., and hence that p1 = p0 a.e..
A density (or pmf) in the exponential family has the form

fX(x) = (1/Z(θ)) h(x) e^{<θ, t(x)>}.    (12.36)

Here θ is a vector parameter taking values in a finite dimensional inner product space H, t(x) is a vector valued function of x taking values in H, h(x) is a real-valued function of x, and Z(θ) is a real-valued function of θ. The term <θ, t(x)> denotes the inner product of θ and t(x) in H.
We require that h(x)e^{<θ,t(x)>} is non-negative. Hence h(x) ≥ 0. The purpose of Z(θ) is to ensure that fX(x) integrates to 1. Hence we also require that

Z(θ) ≜ ∫_D h(x) e^{<θ, t(x)>} dx < ∞.    (12.37)

The admissible set of natural parameter values, denoted by Θ, consists of all θ for which (12.37) holds. The partition function Z is hence a real valued function defined on Θ ⊆ H.
The parameter θ is called the natural parameter, t(x) is called the sufficient statistic, and Z(θ) is called
the partition function of the density. An alternative but equivalent notation is to set A(θ) = ln(Z(θ)) and
write
fX (x) = h(x)e<θ,t(x)>−A(θ) .
A(θ) is called the log partition function.
The parameterization (12.36) is non-redundant if every density in the family is specified by a unique natural parameter θ ∈ Θ. If the parameterization is redundant, then for some θ0, θ1 ∈ Θ with θ0 ≠ θ1, fθ1(x) = fθ0(x) a.e..
Proof. Exercise.
Theorem 12.7.2. Let fk(x) = (1/Z(θk)) h(x) e^{<θk, t(x)>}, k = 0, 1, be density functions. Then the binary MAP classifier under class-conditional densities f0, f1 is

ŷ(x) = { 1, if <θ1 − θ0, t(x)> + ln(Z(θ0)/Z(θ1)) + ln(π1/π0) > 0;
         0, otherwise.    (12.39)
Proof. Exercise.
By computing the (possibly) nonlinear function t(x) we obtain an alternative (possibly lossy) representation of x. Nevertheless, the MAP classifier can be exactly implemented using t(x). Moreover, the resulting MAP classifier is linear in t(x).
12.7.1 Examples
Example 12.7.1 (Exponential Density). The scalar exponential density is defined on R+ by

f(x) = λ e^{−λx},  x ≥ 0.

Here λ > 0 is a fixed parameter. This density has the form (12.36) with

h(x) = 1,  t(x) = −x,  θ = λ,  Z(θ) = 1/θ.
Using (12.38), the KL-divergence of two exponential densities with parameters λ0 and λ1 is

D(λ1‖λ0) = λ0/λ1 − 1 + ln(λ1/λ0),

and using (12.39), the corresponding binary MAP classifier is linear in t(x) with parameters

w = λ1 − λ0,    b = ln(λ1/λ0) + ln(π1/π0).
Example 12.7.2 (Poisson pmf). The Poisson pmf is defined on the natural numbers N by

f(k) = (λ^k / k!) e^{−λ},  k ∈ N.

Here λ > 0 is a parameter of the density. This density is often used to model the number of events occurring over a fixed interval of time. The density can be placed in the form (12.36) by setting

h(k) = 1/k!,  t(k) = k,  θ = ln λ,  Z(θ) = e^λ = e^{e^θ}.
and by (12.39), the corresponding binary MAP classifier is linear in t(k) with parameters

w = ln(λ1/λ0),    b = (λ0 − λ1) + ln(π1/π0).
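A small check (illustrative rates and priors) that the linear rule in t(k) = k agrees with a direct comparison of the posterior scores πk fk(k):

```python
import numpy as np
from scipy.stats import poisson

# Binary MAP classification of a Poisson count via the exponential-family
# form versus the direct likelihood comparison; the two rules must agree.
lam0, lam1, pi0, pi1 = 2.0, 5.0, 0.6, 0.4
w = np.log(lam1 / lam0)
b = (lam0 - lam1) + np.log(pi1 / pi0)

for k in range(12):
    linear = 1 if w * k + b > 0 else 0
    direct = 1 if pi1 * poisson.pmf(k, lam1) > pi0 * poisson.pmf(k, lam0) else 0
    assert linear == direct
```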
Example 12.7.3 (Bernoulli pmf). The Bernoulli pmf is defined on two outcomes {0, 1} by f(1) = p and f(0) = 1 − p, where p ∈ [0, 1] is a parameter. We can write this pmf in the form (12.36) as follows:

p = 1/(1 + e^{−θ})   and   1 − p = 1 − 1/(1 + e^{−θ}) = e^{−θ}/(1 + e^{−θ}) = 1/(1 + e^θ).    (12.40)

Thus Z(θ) = 1 + e^θ.
We can now use (12.38) to determine the KL-divergence between two members of the family with parameters θk = ln(pk/(1 − pk)), k = 0, 1. First note that

Z(θk) = 1 + e^{θk} = 1 + pk/(1 − pk) = 1/(1 − pk),  k = 0, 1
θ1 − θ0 = ln(p1/(1 − p1)) − ln(p0/(1 − p0)) = ln( p1(1 − p0) / (p0(1 − p1)) ).
Hence we have

D(p1‖p0) = (θ1 − θ0) E_{p1}[x] + ln(Z(θ0)/Z(θ1)) = p1 ln( p1(1 − p0) / (p0(1 − p1)) ) + ln( (1 − p1)/(1 − p0) ).    (12.41)
Similarly, by (12.39) the corresponding MAP classifier is linear in t(x) with parameters

w = ln( p1(1 − p0) / (p0(1 − p1)) ),    b = ln( (1 − p1)/(1 − p0) ) + ln(π1/π0).
Assume the prior is uniform. If x = 0, the classifier decides according to which of the probabilities 1 − p0
or 1 − p1 is larger. But if x = 1, it decides according to which of the probabilities p0 or p1 is larger.
Example 12.7.4 (Binomial pmf). The binomial pmf gives the probability of k successes in n independent Bernoulli trials when the probability of success in each trial is p. Hence

f(k) = (n choose k) p^k (1 − p)^{n−k},  k ∈ [0:n].
When the number of trials is fixed, the binomial pmf is in the exponential family. To see this write

f(k) = (n choose k) e^{k ln p + (n−k) ln(1−p)} = (h(k)/Z(θ)) e^{θ t(k)},

where θ = ln(p/(1 − p)), t(k) = k, h(k) = (n choose k), and Z(θ) = e^{−n ln(1−p)} = (1 + e^θ)^n.
It follows that the KL-divergence of two binomial pmfs using the same number of trials n but distinct parameters p0 and p1 is

D(B(n, p1)‖B(n, p0)) = < ln( p1(1 − p0) / (p0(1 − p1)) ), E_{B(n,p1)}[K] > + n ln( (1 − p1)/(1 − p0) )
                     = n p1 ln( p1(1 − p0) / (p0(1 − p1)) ) + n ln( (1 − p1)/(1 − p0) )
                     = n p1 ln(p1/p0) + n(1 − p1) ln( (1 − p1)/(1 − p0) ).
Similarly, using (12.39), the corresponding MAP classifier is linear in t(x) with parameters

w = ln( p1(1 − p0) / (p0(1 − p1)) ),    b = n ln( (1 − p1)/(1 − p0) ) + ln(π1/π0).
Example 12.7.5 (Univariate Gaussian). The univariate Gaussian density f(x) = (1/(√(2π)σ)) e^{−½((x−µ)/σ)²} can be written as

f(x) = (1/(√(2π)σ)) e^{−(1/(2σ²))(x² − 2µx + µ²)} = (1/(√(2π)σ)) e^{−(1/(2σ²))x² + (µ/σ²)x} e^{−µ²/(2σ²)} = (1/Z(θ)) h(x) e^{θ^T t(x)},

where

h(x) = 1,  t(x) = [x, x²]^T,  θ = [µ/σ², −1/(2σ²)]^T,  Z(θ) = √(2π)σ e^{µ²/(2σ²)} = √(π/(−θ(2))) e^{−θ(1)²/(4θ(2))}.

Notice that x is a scalar, but t(x) is a vector of dimension 2.
Using (12.38), the KL-divergence of two Gaussian densities N(µk, σk²) with natural parameters θk, k = 0, 1, is given by

D(N(µ1, σ1²)‖N(µ0, σ0²)) = < [µ1/σ1² − µ0/σ0², 1/(2σ0²) − 1/(2σ1²)], E_{p1}[t(X)] > + ln(Z(θ0)/Z(θ1))
  = < [µ1/σ1² − µ0/σ0², 1/(2σ0²) − 1/(2σ1²)], [µ1, σ1² + µ1²] > + ln( (σ0 e^{µ0²/(2σ0²)}) / (σ1 e^{µ1²/(2σ1²)}) )
  = < [µ1/σ1² − µ0/σ0², 1/(2σ0²) − 1/(2σ1²)], [µ1, σ1² + µ1²] > + ln(σ0/σ1) + µ0²/(2σ0²) − µ1²/(2σ1²)
  = ½( ((µ0 − µ1)/σ0)² + σ1²/σ0² − 1 + 2 ln(σ0/σ1) ).
By (12.39), the corresponding MAP classifier is a linear classifier of the sufficient statistic t(x) ∈ R² with

w^T = [µ1/σ1² − µ0/σ0², 1/(2σ0²) − 1/(2σ1²)],    b = ln(σ0/σ1) + µ0²/(2σ0²) − µ1²/(2σ1²) + ln(π1/π0).
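The closed-form KL-divergence above can be checked against a Monte Carlo estimate of E_{p1}[ln(p1(X)/p0(X))]; the parameter values in this sketch are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Compare the closed-form univariate Gaussian KL-divergence with a
# sample average of the log-likelihood ratio under p1.
mu0, s0, mu1, s1 = 0.0, 2.0, 1.0, 1.0
closed = 0.5 * (((mu0 - mu1) / s0) ** 2 + s1 ** 2 / s0 ** 2 - 1 + 2 * np.log(s0 / s1))

rng = np.random.default_rng(2)
x = rng.normal(mu1, s1, 200_000)
mc = np.mean(norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu0, s0))
print(closed, mc)      # the two values should agree closely
```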
Example 12.7.6 (Multivariate Gaussian). The density f(x) = (1/((2π)^{n/2} |Σ|^{1/2})) e^{−½(x−µ)^T Σ^{-1}(x−µ)} can be written as

f(x) = (1/((2π)^{n/2} |Σ|^{1/2})) e^{−½(x^T Σ^{-1}x − 2µ^T Σ^{-1}x + µ^T Σ^{-1}µ)}
     = (1/((2π)^{n/2} |Σ|^{1/2})) e^{µ^T Σ^{-1}x − trace(½Σ^{-1}xx^T)} e^{−½µ^T Σ^{-1}µ}
     = (1/Z(θ)) h(x) e^{<θ, t(x)>},

where

h(x) = 1,  t(x) = (x, −½xx^T),  θ = (Σ^{-1}µ, Σ^{-1}),

and

Z(θ) = (2π)^{n/2} |Σ|^{1/2} e^{½µ^T Σ^{-1}µ} = (2π)^{n/2} det(θ(2))^{-1/2} e^{½θ(1)^T θ(2)^{-1} θ(1)}.
Here t(x), θ ∈ Rn × Sn with the inner product <(x, M ), (y, N )> = <x, y> + <M, N >.
We now compute the KL-divergence of two Gaussian densities N(µk, Σk), k = 0, 1, with natural parameters θ0 and θ1. We first list some intermediate results:

E_{N(µ1,Σ1)}[t(X)] = E_{N(µ1,Σ1)}[(X, −½XX^T)] = (µ1, −½(Σ1 + µ1µ1^T))

<θ1 − θ0, E_{N(µ1,Σ1)}[t(X)]> = µ1^T Σ1^{-1}µ1 − µ0^T Σ0^{-1}µ1 − ½( trace(In − Σ0^{-1}Σ1) + µ1^T Σ1^{-1}µ1 − µ1^T Σ0^{-1}µ1 )
                             = µ1^T Σ1^{-1}µ1 − µ0^T Σ0^{-1}µ1 + ½( −n + trace(Σ0^{-1}Σ1) − µ1^T Σ1^{-1}µ1 + µ1^T Σ0^{-1}µ1 )

ln(Z(θ0)/Z(θ1)) = ln( (|Σ0|^{1/2} e^{½µ0^T Σ0^{-1}µ0}) / (|Σ1|^{1/2} e^{½µ1^T Σ1^{-1}µ1}) ) = ½ ln(|Σ0|/|Σ1|) + ½µ0^T Σ0^{-1}µ0 − ½µ1^T Σ1^{-1}µ1.

Adding the last two results and simplifying yields

D(N(µ1, Σ1)‖N(µ0, Σ0)) = ½( (µ0 − µ1)^T Σ0^{-1}(µ0 − µ1) + trace(Σ0^{-1}Σ1) − n + ln(|Σ0|/|Σ1|) ).
The corresponding MAP classifier is a linear classifier of the sufficient statistic t(x) = (x, −½xx^T) ∈ Rn × Rn×n. Let w = (v, Ω). Then

w = θ1 − θ0 = ( Σ1^{-1}µ1 − Σ0^{-1}µ0, Σ1^{-1} − Σ0^{-1} ).

Hence v = Σ1^{-1}µ1 − Σ0^{-1}µ0 and Ω = Σ1^{-1} − Σ0^{-1}. The MAP classifier then takes the form:

ŷ(x) = { 1, if <w, t(x)> + b > 0;
         0, otherwise.
We have

<w, t(x)> = v^T x − ½ x^T Ω x = (Σ1^{-1}µ1 − Σ0^{-1}µ0)^T x − ½ x^T(Σ1^{-1} − Σ0^{-1})x,

and

b = ½ ln(|Σ0|/|Σ1|) + ½µ0^T Σ0^{-1}µ0 − ½µ1^T Σ1^{-1}µ1 + ln(π1/π0).
Hölder’s Inequality
A real valued function f defined on a subset D ⊆ Rn is said to be integrable if ∫_D |f(x)| dx < ∞. We often have functions f, g and we want to show that the product fg is integrable. For example, suppose we know
that f, g are square integrable (i.e., f 2 and g 2 are integrable), does that imply that f g is integrable? Hölder’s
inequality tells us that the answer is yes. Remarkably, Hölder’s inequality tells us that for any real numbers
p, q > 1 with 1/p + 1/q = 1, if f p and g q are integrable so is f g.
Theorem 12.7.3 (Hölder's Inequality). Let p, q > 1 with 1/p + 1/q = 1. If |f|^p and |g|^q are integrable on D, then fg is integrable on D and ∫_D |f(x)g(x)| dx ≤ (∫_D |f(x)|^p dx)^{1/p} (∫_D |g(x)|^q dx)^{1/q}.

Proof. A proof can be found, for example, in [1, 2, 15, 37]. The proofs given in Bartle [2] and Fleming [15] make the condition for equality explicit.
We now examine the convexity of the set Θ and the convexity of the log partition function.
Theorem 12.7.4. The admissible set Θ of a density (or pmf) f in the exponential family is convex, and
its log partition function ln(Z(θ)) is a convex function on Θ. If the parameterization of the density is
non-redundant, then ln(Z(θ)) is strictly convex on Θ.
Proof. Let θ0, θ1 ∈ Θ. Then Z(θk) is nonnegative and finite for k = 0, 1. Hence for any α ∈ (0, 1), Z(θ0)^{1−α} and Z(θ1)^α are finite. Now consider the point θα = (1 − α)θ0 + αθ1. By definition,

Z(θα) = ∫ h(x) e^{<θα, t(x)>} dx
      = ∫ h(x)^{1−α} e^{<(1−α)θ0, t(x)>} h(x)^α e^{<αθ1, t(x)>} dx
      = ∫ g0(x) g1(x) dx                                        (g0^{1/(1−α)} and g1^{1/α} are integrable)
      ≤ ( ∫ g0(x)^{1/(1−α)} dx )^{1−α} ( ∫ g1(x)^{1/α} dx )^α    (Hölder's inequality)
      = Z(θ0)^{1−α} Z(θ1)^α.
From the above bound and the finiteness of Z(θ0)^{1−α} and Z(θ1)^α, it follows that Z(θα) is finite. So θα ∈ Θ, and thus Θ is a convex set. Taking the log of the above inequality yields ln(Z(θα)) ≤ (1 − α) ln(Z(θ0)) + α ln(Z(θ1)). Hence the log partition function is convex on Θ. It is strictly convex if for each α ∈ (0, 1) Hölder's inequality is strict. Set p = 1/(1 − α) and q = 1/α. There is equality in Hölder's inequality at α if and only if

g0(x)^p / ‖g0‖_p^p = h(x)e^{<θ0,t(x)>}/Z(θ0) = h(x)e^{<θ1,t(x)>}/Z(θ1) = g1(x)^q / ‖g1‖_q^q    a.e.,

that is, if and only if fθ0(x) = fθ1(x) a.e.. Thus strict convexity follows from the assumption that the density parameterization is non-redundant.
Lemma 12.7.1. The gradient of the log partition function of f is ∇ ln(Z(θ)) = Ef [t(X)].
Proof. We have Dθ ln(Z(θ))(v) = (1/Z(θ)) Dθ Z(θ)(v). So if Dθ Z(θ)(v) exists, then the previous equation gives Dθ ln(Z(θ))(v). To take the derivative of Z(θ) we interchange the derivative operator and the integration. An interchange of limiting operations is something we need to check. This issue is discussed in Appendix 12.8. Making the exchange and taking the derivative of the function inside the integral yields

Dθ Z(θ)(v) = Dθ( ∫ h(x) e^{<θ, t(x)>} dx )(v)
           = ∫ <v, t(x)> h(x) e^{<θ, t(x)>} dx
           = <v, ∫ t(x) h(x) e^{<θ, t(x)>} dx>.

Dividing by Z(θ) gives Dθ ln(Z(θ))(v) = <v, (1/Z(θ)) ∫ t(x) h(x) e^{<θ, t(x)>} dx> = <v, E_f[t(X)]>, and hence ∇ ln(Z(θ)) = E_f[t(X)].
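A quick numerical illustration of the lemma for the Bernoulli family of Example 12.7.3 (a sketch, not part of the text): the derivative of ln Z(θ) = ln(1 + e^θ) should equal E[t(X)] = p.

```python
import numpy as np

# Lemma 12.7.1 for Bernoulli: the gradient of ln Z(theta) equals the mean of
# the sufficient statistic t(x) = x.
theta, eps = 0.4, 1e-6
log_Z = lambda th: np.log(1.0 + np.exp(th))
grad_numeric = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)
p = 1.0 / (1.0 + np.exp(-theta))                       # E[t(X)] = p

rng = np.random.default_rng(3)
mc_mean = rng.binomial(1, p, 100_000).mean()           # Monte Carlo estimate of E[t(X)]
print(grad_numeric, p, mc_mean)
```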
Figure 12.3: The geometry of the log partition function of a density in the exponential family.
We now appeal to Theorem 7.4.3. This theorem indicates that for a differentiable convex function f over a convex domain C and any fixed point x1 in the interior of C, for every x ∈ C,

f(x) ≥ f(x1) + <∇f(x1), x − x1>.

So the derivative provides a global lower bound for the function. This is illustrated in Figure 7.3.
Now consider the log partition function. The set of all points (θ, ln(Z(θ))) ∈ Θ × R is called the graph of ln(Z(θ)). This set of points forms the surface in Θ × R illustrated in Figure 12.3. The lower bound for ln(Z(θ)) given by the derivative at θ1 is

ln(Z(θ)) ≥ ln(Z(θ1)) + <∇ ln(Z(θ1)), θ − θ1>.    (12.43)

This lower bound is illustrated as the shaded plane in Figure 12.3. The plane is tangent to the surface at (θ1, ln(Z(θ1))). In Θ, if the gradient vector ∇ ln(Z(θ1)) is translated to the point θ1, it points in the direction of greatest increase of ln(Z(θ)) at θ1. It is thus orthogonal to the level set of ln(Z(θ)) with value ln(Z(θ1)). This is also illustrated in Figure 12.3.
Now consider a second point θ0 ∈ Θ. The lower bound (12.43) requires that ln(Z(θ0)) ≥ ln(Z(θ1)) + <∇ ln(Z(θ1)), θ0 − θ1>. The point (θ0, ln(Z(θ0))) lies on the surface, while the corresponding point on the lower bound is (θ0, ln(Z(θ1)) + <∇ ln(Z(θ1)), θ0 − θ1>). The distance between the point on the plane
fv(x) = ( f(x, t0 + v) − f(x, t0) ) / v.

The dominated convergence theorem tells us that if there exists an integrable function g(x) such that |fv(x)| ≤ g(x), then

(d/dt) ∫ f(x, t) dx = lim_{v→0} ∫ fv(x) dx = ∫ lim_{v→0} fv(x) dx = ∫ (d/dt) f(x, t) dx.
For example, h(x)e^{<θ,t(x)>} is a function of θ ∈ Θ and x ∈ D. For simplicity assume that D is a compact set. At θ0, we have

( h(x)e^{<θ0+v, t(x)>} − h(x)e^{<θ0, t(x)>} ) / ‖v‖ = h(x)e^{<θ0, t(x)>} ( e^{<v, t(x)>} − 1 ) / ‖v‖.

For sufficiently small ‖v‖, the term on the right is bounded above by a constant multiple of the integrable function h(x)e^{<θ0, t(x)>}. Hence under the stated assumptions,

Dθ( ∫ h(x) e^{<θ, t(x)>} dx )(v) = ∫ Dθ( h(x) e^{<θ, t(x)>} )(v) dx.
For further reading see Bartle [2, Theorem 5.6 and Corollary 5.9].
it is more convenient to work with the probability of correct classification. From (12.15) you see that this is the sum of the terms

C0 = π0 Φ(κ/2 + α/κ)   and   C1 = π1 Φ(κ/2 − α/κ),

where α = ln(π0/π1), and that the derivatives of these terms w.r.t. κ are

C0' = π0 Φ'(κ/2 + α/κ)(1/2 − α/κ²)   and   C1' = π1 Φ'(κ/2 − α/κ)(1/2 + α/κ²).

Notice that C0' > 0 for κ² > 2α and C1' > 0 for all κ² > 0. Hence for κ² ≥ 2α the probability of being correct is monotone increasing in κ. For 0 < κ² ≤ 2α, C0' ≤ 0 and C1' > 0. To ensure C0' + C1' > 0 for 0 < κ² ≤ 2α, it is sufficient that

−C0' = π0 Φ'(κ/2 + α/κ)(α/κ² − 1/2) < C1'.

To verify that the above holds we examine

ln(−C0'/C1') = ln(π0/π1) + ln( Φ'(κ/2 + α/κ) / Φ'(κ/2 − α/κ) ) + ln( (α/κ² − 1/2) / (α/κ² + 1/2) )
             = α + ½( −(κ/2 + α/κ)² + (κ/2 − α/κ)² ) + ln( (α/κ² − 1/2) / (α/κ² + 1/2) )
             = ln( (α/κ² − 1/2) / (α/κ² + 1/2) )
             < 0.

Hence for α ≥ 0, the probability that the classifier is correct is monotonically increasing in κ and hence also in κ².
Notes
For a good introduction to Bayesian decision theory see Duda et al. [13, Ch. 2], and the more advanced discussion
for Gaussian conditional densities in Murphy [32, Ch. 4]. Poor [36, Ch. 2] gives a concise summary of general
Bayes decision rules. See also Wasserman [49], Silvey [45], and Lehmann [27]. For additional reading on ROC
curves, see Duda et al. [13, §2.8.3]. The exponential family is discussed in many texts. For a detailed modern
treatment see [32, §9.2]. Other interesting accounts are given in the books by Wasserman [49, §9.13.3], Berger [3],
and Lehmann [26, §1.5], [27, §2.7]. The essence of the proof of Theorem 12.7.4 follows that in [27, §2.7].
Exercises
Exercise 12.1. Show that the solution of the LDA problem (12.28) with SW PD can be obtained by scaling the solution
of the linear equations SW w = µ1 − µ0 to unit norm. (This only requires linear algebra.)
Exercise 12.2. Prove Theorem 12.7.1.
Exercise 12.3. Prove Theorem 12.7.2.
Chapter 13
13.1 Introduction
This chapter formulates a generative model for the regression problem.
Let a data example take the form (x, y) with x ∈ Rn and y ∈ Rq. The values x and y are assumed to be the outcomes of random variables X and Y with joint density fXY. For the moment we assume that this joint density is known.
The specific problem of interest is to estimate the value of Y having observed the outcome x of X. Hence we call Y the target and X the observation. The information provided by x about the outcome of Y refines the uncertainty in Y through the conditional density

fY|X(y|x) = fXY(x, y) / fX(x).
If X and Y are independent, then fXY(x, y) = fX(x)fY(y) and the observation provides no information about Y. In general, the joint density factors as fXY(x, y) = fX|Y(x|y)fY(y), and from this factorization it follows that we can write

fY|X(y|x) = fXY(x, y)/fX(x) = fX|Y(x|y)fY(y)/fX(x)   for all x with fX(x) > 0.    (13.2)
This equation relates the prior density fY (y) (prior to the observation) to the posterior density fY|X (y|x)
(after the observation). Equation (13.2) also suggests that the natural generative model in this situation is
given by the right hand side of (13.1).
The estimated value of Y will be a function of the observed value of X. This function, denoted by ŷ(·),
is called an estimator or predictor. In contrast, the value produced by the estimator for a particular value x
of X is called an estimate or prediction of Y.
Our objective is to determine an optimal estimator for Y given the value of X. To do so we must
first decide what criterion we seek to optimize. Since Y can take on a continuum of values, it is not the
probability of error that is important (it will almost certainly be 1), but how close, on average, the estimate
ŷ(x) is to the actual value y of Y. There are many ways to measure this error. One way is the squared
distance kŷ(x) − yk22 . Sometimes this will be small, other times it will be large. What is important is the
expected value of this cost over the outcomes of X and Y. This leads to the performance metric
E kŷ(X) − Yk22 .
This is called the mean squared error (MSE), and an estimator that minimizes this cost is called a minimum
mean square error (MMSE) estimator.
E[ ‖Y − ŷ‖2² ] = E[ ‖Y − µY + µY − ŷ‖2² ]
              = E[ (Y − µY + µY − ŷ)^T (Y − µY + µY − ŷ) ]
              = E[ ‖Y − µY‖2² ] + 2(µY − ŷ)^T E[Y − µY] + ‖µY − ŷ‖2²
              = E[ ‖Y − µY‖2² ] + ‖µY − ŷ‖2².

The assumption that Y has finite first and second order moments ensures that the final expression for the MSE is finite, and the expression is clearly minimized by setting ŷ = µY.
A different cost function would give a different result for the optimal estimator ŷ, but it will still be some constant determined by the density of Y and the selected cost function. The MSE cost function is easy to work with since it is quadratic in ŷ and it yields an intuitively reasonable result.
Now consider how to estimate the value of Y when a realization of (X, Y) is determined but we can
only observe the value assumed by X, i.e., we know that X = x. In this case, we simply replace the prior
density of Y by its posterior density.
Lemma 13.2.2. Assume the conditional density fY|X(y|x) has finite first and second order moments. Then the optimal MSE estimate of the value of Y given that X = x is the mean of the conditional density fY|X(y|x):

ŷ(x) = E[Y|X = x] = ∫ y fY|X(y|x) dy = µY|X(x).
Proof. Given X = x, any residual uncertainty in the value of Y is completely described by fY|X (y|x).
Hence, under the assumptions of Lemma 13.2.1, the optimal MSE estimate of the value of Y given that
X = x, is the mean of the conditional density.
The mean of the conditional density is a function of the observed value x. Hence it is an estimator. The value of the conditional mean for a particular observed value x of X is the corresponding estimate ŷ(x) of Y. In general, it is difficult to find a closed form expression for the (usually nonlinear) conditional mean estimator. So although we know that this estimator exists and that it is optimal under the MSE cost, in general, computing the estimator can be a challenge. It is easy to find the MMSE estimator when X and
Y are jointly Gaussian. We consider this situation in the next section.
Let

W^T = ΣYX ΣX^{-1}
b = µY − ΣYX ΣX^{-1} µX.    (13.6)
13.4 Examples
In the following examples we assume that X and Y are jointly Gaussian, and that the joint density is specified using the generative model (13.1). Hence Y is N(µY, ΣY), with µY and ΣY known. The conditional density fX|Y(x|y) is then specified using the outcome of Y. Finally, we observe the value of X, and want to estimate a corresponding value of Y. For the denoising model X = Y + N, with N independent of Y, zero mean, and with covariance ΣN, we have

µX = E[Y + N] = µY
ΣX = E[ (Y − µY + N)(Y − µY + N)^T ] = ΣY + ΣN
ΣXY = E[ (Y − µY + N)(Y − µY)^T ] = ΣY.

Additional insight into (13.7) can be obtained by considering the special cases discussed below.
ŷ(x) = σ²/(σ² + σN²) (x − µ) + µ
      = αx + (1 − α)µ,   with α = σ²/(σ² + σN²) ∈ [0, 1].    (13.8)
This estimator has an intuitive interpretation. When σN² ≪ σ², the observation provides clear information about the value of Y. In this case, α ≈ 1 and ŷ(x) ≈ x. On the other hand, when σN² ≫ σ², the observation provides little information about Y. In this case, α ≈ 0, and ŷ(x) ≈ µ. In the first case, the observation is trustworthy and the estimator puts most of its emphasis on x. In the second, the observation is unreliable, and the estimator puts most of its weight on the prior mean µ. For situations between these extremes the estimator forms a convex combination of the observation x and the prior mean µ.
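A sketch of the estimator (13.8) in action (illustrative parameters): its empirical MSE should be below that of using the raw observation.

```python
import numpy as np

# Scalar denoising: y_hat = alpha * x + (1 - alpha) * mu with
# alpha = sigma^2 / (sigma^2 + sigmaN^2).
mu, sigma2, sigmaN2 = 1.0, 4.0, 1.0
alpha = sigma2 / (sigma2 + sigmaN2)

rng = np.random.default_rng(4)
y = rng.normal(mu, np.sqrt(sigma2), 100_000)           # outcomes of Y
x = y + rng.normal(0.0, np.sqrt(sigmaN2), y.size)      # observations X = Y + N
y_hat = alpha * x + (1 - alpha) * mu                   # estimator (13.8)

print(np.mean((y_hat - y) ** 2), np.mean((x - y) ** 2))   # MMSE vs raw MSE
```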
of coordinates of the projection of Y − µY onto the line in the direction uj. The variance of this random variable is

E[ uj^T (Y − µY)(Y − µY)^T uj ] = uj^T ΣY uj = σj².

So we can interpret σj² as the variance of Y in the direction of its j-th unit norm eigenvector uj.
Now use the fact that U^T U = Iq to express W^T in terms of U:

W^T = ΣY (ΣY + ΣN)^{-1} = (U D U^T)( U D U^T + σN² U U^T )^{-1}
    = (U D U^T) U (D + σN² I)^{-1} U^T
    = U [ diag( σj²/(σj² + σN²) ) ] U^T.    (13.9)

So W^T attenuates the deviations of X from its mean in the direction of uj using the gain

αj ≜ σj²/(σj² + σN²),   αj ∈ [0, 1],  j ∈ [1:q].

For eigenvectors uj with σj² ≫ σN², αj ≈ 1, while for eigenvectors with σj² ≪ σN², αj ≈ 0. For the first set of directions, X − µX passes through W^T with almost unit gain, but for the second set, X − µX is highly attenuated. In situations between these extremes, the attenuation in direction uj is determined by αj = σj²/(σj² + σN²).
Now bring the mean µY into the picture by using (13.9) to rewrite (13.7) in the form

ŷ(x) = U( [diag(αj)] U^T x + [diag(1 − αj)] U^T µY ).    (13.10)

If σj² < σN², αj is small and the component of ŷ in the direction uj is formed mainly from the component of µY in this direction. Conversely, when σj² > σN², αj is large and the component of ŷ in the direction uj is formed mainly from the component of x in this direction. In particular, when σN² ≫ σj² for all j ∈ [1:q], the optimal MMSE estimate of Y given X = x reverts to the prior mean µY. This makes sense, since in this case the observation X = x adds very little information about the value of Y and (under the MSE metric) the best we can do is to estimate the value of Y to be its mean µY.
W^T = ΣY B^T (B ΣY B^T + ΣN)^{-1}
ŷ(x) = W^T(x − µX) + µY.    (13.12)
ŷ(x) = bσ²/(b²σ² + σN²) (x − bµ) + µ
      = α(b)(b^{-1}x) + (1 − α(b))µ,   with α(b) = b²σ²/(b²σ² + σN²) ∈ [0, 1].    (13.13)
The second equation in (13.13) indicates that when b ≠ 0, we can divide x by b and then proceed as in the previous example. As b becomes large, the estimate moves towards x/b, and as b becomes small it moves towards the prior mean µ. If b = 0 then the observation is useless and the best estimate is just µ.
Then

W^T = [σ1²  σ12; σ21  σ2²] [b1  0; 0  b2] ( [b1  0; 0  b2] [σ1²  σ12; σ21  σ2²] [b1  0; 0  b2] + [σN²  0; 0  σN²] )^{-1}
    = [b1σ1²  b2σ12; b1σ21  b2σ2²] [b1²σ1² + σN²   b1b2σ12; b1b2σ21   b2²σ2² + σN²]^{-1}.    (13.14)
Equation (13.15) indicates that we can also implement the optimal predictor by first applying B^{-1} to the observation x and then the MMSE estimator for the model (B^{-1}X) = Y + B^{-1}N. One should take care, however, if either of b1, b2 is close to 0, since then the center matrix in the second equation in (13.15) is likely to be ill-conditioned.
Let's examine what happens when b2 = 0. In this case, (13.14) is still applicable and simplifies to

W^T = [b1σ1²  0; b1σ21  0] [1/(b1²σ1² + σN²)  0; 0  1/σN²] = [ b1σ1²/(b1²σ1² + σN²)  0; b1σ21/(b1²σ1² + σN²)  0 ].
Notice that the estimator provides an estimated value for both components of y even though we had no direct observation of the second component (b2 = 0). The estimator is able to do so because of the cross-correlation term σ21 in ΣY.
min_{W∈Rn×q, b∈Rq}  E[ ‖Y − W^T X − b‖2² ].    (13.16)
Let X have mean µX ∈ Rn and covariance ΣX ∈ Rn×n , Y have mean µY ∈ Rq and covariance
ΣY ∈ Rq×q , and let the cross covariance of X and Y be ΣXY ∈ Rn×q . We show below that a solution of
(13.16) can be found using only these quantities.
Theorem 13.5.1. A minimum MSE affine estimator of Y given X = x has the form

ŷ(x) = W?^T (x − µX) + µY,    (13.17)

with W? satisfying ΣX W? = ΣXY. In particular, if ΣX is positive definite, the minimum MSE affine estimator is unique and is specified by W? = ΣX^{-1} ΣXY.
Proof. Assume for the moment that µX = 0 and µY = 0. Expanding the RHS of (13.16) yields

E[ ‖Y − W^T X − b‖2² ] = E[ (Y − W^T X − b)^T (Y − W^T X − b) ]    (13.18)
                      = E[ (Y − W^T X)^T (Y − W^T X) ] + E[ −2(Y − W^T X)^T b ] + b^T b
                      = E[ (Y − W^T X)^T (Y − W^T X) ] + b^T b.    (13.19)

Since the two terms in (13.19) are nonnegative, and the first does not depend on b, the expression is minimized with b = 0. Using the properties of the trace function, the first term in (13.19) can be rewritten as

E[ (Y − W^T X)^T (Y − W^T X) ] = E[ trace( Y^T Y − Y^T W^T X − X^T W Y + X^T W W^T X ) ]
Hence

E[ ‖Y − W^T X − b‖2² ] = trace( ΣY − 2W^T ΣXY + W^T ΣX W ).    (13.20)

The derivative of (13.20) with respect to W acting on H ∈ Rn×q is

trace( −2H^T ΣXY + H^T ΣX W + W^T ΣX H ) = 2<H, ΣX W − ΣXY>.    (13.21)

Setting this expression equal to zero we find that for every H ∈ Rn×q, <H, ΣX W − ΣXY> = 0. It follows that a necessary condition for W to minimize the MSE cost is that

ΣX W = ΣXY.    (13.22)
To verify that a solution of (13.22) minimizes the objective function we can compute the second derivative,
i.e., the derivative with respect to W of the RHS of (13.21). This yields trace(H T ΣX H). Noting that ΣX is
PSD, we conclude that trace(H T ΣX H) ≥ 0 and hence that all solutions of (13.22) minimize the MSE cost
objective. If ΣX is positive definite, (13.22) has the unique solution
W ? = Σ−1
X ΣXY . (13.23)
When the means µX and µY are nonzero, we apply the reasoning above to predict Y − µY given the
value of X − µX . This yields the minimum MSE predictor W ? T (x − µX ). Hence the least MSE predictor
of Y is ŷ(x) = W ? T (x − µX ) + µY , where W ? is any solution of (13.22).
We leave it as an exercise to show that these two approaches have the same set of solutions.
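The following sketch applies Theorem 13.5.1 with empirical first and second order statistics on synthetic data (the data-generating model is invented for the example):

```python
import numpy as np

# Form W from Sigma_X W = Sigma_XY and predict with
# y_hat(x) = W^T (x - mu_X) + mu_Y, as in (13.17).
rng = np.random.default_rng(5)
m = 5000
Y = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.5], [0.5, 1.0]], m)        # targets
X = Y @ np.array([[1.0, 0.2], [0.0, 1.0]]).T + rng.normal(0, 0.3, (m, 2))    # observations

mu_X, mu_Y = X.mean(axis=0), Y.mean(axis=0)
Sigma_X = np.cov(X, rowvar=False)
Sigma_XY = ((X - mu_X).T @ (Y - mu_Y)) / (m - 1)       # cross covariance

W = np.linalg.solve(Sigma_X, Sigma_XY)                 # solves Sigma_X W = Sigma_XY
def y_hat(x):
    return W.T @ (x - mu_X) + mu_Y

print(y_hat(X[0]), Y[0])
```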
Exercises
Exercise 13.1. Prove Lemma F.0.1.
Exercise 13.2. Let X be a Gaussian random vector with parameters µ, Σ. Show that:
(a) E[X] = µ.
(b) E[(X − µ)(X − µ)^T] = Σ.
Exercise 13.3. Let X be an n-dimensional Gaussian random vector with mean µ and PD covariance Σ and let Y =
A(X − µ) + b for b ∈ Rn and an invertible matrix A ∈ Rn×n . Show that Y is also a Gaussian random vector with
mean b and covariance Ω = AΣAT .
Exercise 13.4. Let X be an n-dimensional Gaussian random vector with mean µ and PD covariance Σ. Show that
there exists a symmetric PD matrix A and vector b such that Y = AX + b, is a Gaussian random vector with mean 0
and covariance In .
Exercise 13.5. Let X be an n-dimensional Gaussian random vector with mean µ and PD covariance Σ. Show that
there exists a matrix A and a vector b such that X = AY + b, where Y is a zero mean Gaussian random vector with
independent components of unit variance. Is A unique? Can A be chosen to be symmetric PD? Is such a choice
unique?
Exercise 13.6. Find the mean and covariance of Ŷ? .
Exercise 13.7. Find the mean and covariance of the error random vector E = Y − Ŷ? . Use this to show that in all
directions, the variance of Y is greater than or equal to the variance of E.
Exercise 13.8. Assume ΣX is positive definite. Show that the MSE of the optimal MSE affine estimator is trace(ΣY − ΣYX ΣX^{-1} ΣXY).
Exercise 13.10. Show that the MSE of the estimator (13.12) is trace ((I − W ? B)ΣY ).
Exercise 13.11. Let B ∈ Rm×n and Y, N, X be random vectors with Y and N independent, E[N] = 0, and X =
BY + N. The first and second order statistics of Y and N are known, and ΣN is PD.
The minimum MSE affine estimate of Y given X = x is ŷ?(x) = ΣY B^T (B ΣY B^T + ΣN)^{-1} (x − B µY) + µY.
(a) Show that using minimum MSE affine denoising to estimate BY given X = x, then multiplying the result by
B + , yields the same estimator.
(b) Show that in general, the estimator that results by first multiplying X by B + and then denoising is not an
optimal MSE affine estimator.
Exercise 13.12. Bias, error covariance, and MSE. Consider random vectors X and Y with a joint density fXY and
PD covariance Σ. Let X have mean µX ∈ Rn and covariance ΣX ∈ Rn×n , Y have mean µY ∈ Rq and covariance
ΣY ∈ Rq×q , and let the cross-covariance of X and Y be ΣXY ∈ Rn×q .
Let ŷ(x) be an estimator of Y given X = x, and denote the corresponding prediction error by E ≜ Y − ŷ(X). Of interest is µE, ΣE and the MSE. The estimator is said to be unbiased if µE = 0.
(a) For any estimator ŷ with finite µE and MSE, show that MSE(ŷ) = trace(ΣE) + ‖µE‖2². This shows that the MSE is the sum of two terms: the total variance trace(ΣE) of the error, and the squared norm of the bias ‖µE‖2².
(b) Let ŷ(x) = µY . Show that this is an unbiased estimator, determine ΣE , show that ΣE is PD, and determine the
estimator MSE.
(c) The minimum MSE affine estimator of Y given X = x is ŷ?(x) = ΣYX ΣX^{-1}(x − µX) + µY. Show that ŷ?(·) is an unbiased estimator, determine ΣE, show that ΣE is PD, and determine the estimator MSE.
Exercise 13.13. Empirical statistics, MSE affine prediction, and least squares. Fix a training dataset {(xi , yi )}m
i=1 ,
with examples xi ∈ Rn and targets yi ∈ Rq . Let X denote the matrix with the examples as its columns, Y denote the
matrix with the corresponding targets as its columns. Define the following first and second order empirical statistics
of the data:

µ̂X = (1/m) X 1m,    µ̂Y = (1/m) Y 1m
Σ̂X = (1/m)(X − µ̂X 1m^T)(X − µ̂X 1m^T)^T,    Σ̂XY = (1/m)(X − µ̂X 1m^T)(Y − µ̂Y 1m^T)^T.    (13.26)
A MMSE affine estimator ŷ(x) = W^T x + b based on the empirical statistics (13.26) must satisfy

Σ̂X W = Σ̂XY   and   b = µ̂Y − W^T µ̂X.    (13.27)

(a) Consider the least squares problem

Wls, bls = arg min_{W∈Rn×q, b∈Rq} 1/m ‖Y − W^T X − b 1m^T‖F².    (13.28)

Show that Wls, bls satisfy (13.27). Thus directly solving the least squares problem (13.28) yields an optimal MSE affine estimator for the empirical first and second order statistics in (13.26).
(b) Consider the ridge regression problem
Wrr, brr = arg min_{W∈Rn×q, b∈Rq} 1/m ‖Y − W^T X − b 1m^T‖F² + λ‖W‖F²,   λ > 0.    (13.29)
Determine if Wrr , brr satisfy (13.27). If not, what needs to be changed in (13.26) to ensure Wrr , brr satisfy
(13.27). Interpret the change you suggest.
Chapter 14
Convex Optimization
This chapter briefly summarizes the essential elements of convex optimization. There are many good books
on this topic that go into much greater depth. See the references at the end of the chapter for additional
reading. Before reading this chapter, it may be helpful to revise the material on convex sets and functions in
Chapter 7.
min_{w∈Rn}  f(w)
s.t.  fi(w) ≤ 0,  i ∈ [1:k]    (14.1)
      Aw − b = 0,
where f, fi : Rn → R, i ∈ [1 : k], are convex functions, A ∈ Rm×n and b ∈ Rm . The function f is called the
objective function, the functions fi are called the constraint inequalities, and the rows of the affine vector
function Aw − b are called the affine equality constraints.
A point w ∈ Rn satisfying all of the constraints in (14.1) is said to be a feasible point, and the set of all feasible points is called the feasible set. The feasible set is the intersection of the 0-sublevel sets L0^(i) = {w : fi(w) ≤ 0}, i ∈ [1:k], and the set S = {w : Aw − b = 0}. The set S is either empty (b ∉ R(A)), or is an affine manifold of the form wp + N(A) where Awp = b. Hence S is closed and convex. Since each of the functions fi is convex, each sublevel set L0^(i) is closed and convex. So the feasible set is an intersection of closed convex sets and is hence closed and convex. Thus problem (14.1) seeks to
minimize a convex function f over a closed, convex set. To be an interesting problem, we need the feasible
set to be nonempty. In this case we say that the problem is feasible. Otherwise we say that it is infeasible.
min_{w∈Rn}  h^T w
s.t.  F w ≤ g.    (14.2)
The objective function f (w) = hT w is linear and hence convex. Let Fi,: denote the i-th row of F and gi
denote the i-th entry of g. Then there are m constraint inequalities fi (w) = Fi,: w − gi ≤ 0. These are affine
and hence convex. Thus a linear program is a convex program. To ensure feasibility, we need the existence
of a w ∈ Rn such that F w ≤ g.
min_{w∈Rn}  ½ w^T P w + q^T w + r
s.t.  F w ≤ g.    (14.3)
L(w, λ, µ) ≜ f(w) + Σ_{i=1}^k λi fi(w) + µ^T(Aw − b).
Notice that L(w, λ, µ) is a convex function of w. Moreover, if the objective f (w) and constraint functions
fi (w), i ∈ [1 : k], are differentiable w.r.t. w, then so is L(w, λ, µ).
We now set out to minimize the unconstrained convex function L with respect to w, without requiring that w is feasible. This gives rise to the dual objective function g(λ, µ) defined by

g(λ, µ) ≜ inf_{w∈Rn} L(w, λ, µ).

The domain of g is the set of all (λ, µ) ∈ Rk × Rm satisfying λ ≥ 0 and g(λ, µ) > −∞. Such points are said to be dual feasible. By construction, for all dual feasible (λ, µ) and feasible w,

g(λ, µ) ≤ L(w, λ, µ) = f(w) + Σ_{i=1}^k λi fi(w) + µ^T(Aw − b) ≤ f(w).

So for all dual feasible (λ, µ) and all feasible w, the dual objective lower bounds the primal objective:

g(λ, µ) ≤ f(w).
Hence the following conditions, known as the KKT conditions¹, are necessarily satisfied at w?, λ?, µ?:

∇f(w?) + Σ_{i=1}^k λi? ∇fi(w?) + A^T µ? = 0        (∇w L(w?, λ?, µ?) = 0)
fi(w?) ≤ 0,  i ∈ [1:k]                              (primal constraints)
Aw? − b = 0                                          (primal constraint)
λ? ≥ 0                                               (dual constraints; there can be more)
λi? fi(w?) = 0,  i ∈ [1:k]                           (complementary slackness)
For general optimization problems that’s as much as we can say. However, for convex programs satisfying
the above assumptions one can say more.
¹ Named for those who first published the result: William Karush (1939), and Harold W. Kuhn and Albert W. Tucker (1951).
Theorem 14.5.1. For a convex program in which strong duality holds and the functions f (w) and fi (w), i ∈
[1 : k], are continuously differentiable, the KKT conditions are both necessary and sufficient for optimality.
Example 14.5.1 (The KKT Conditions for a Quadratic Program). The primal quadratic program is
a convex program with affine inequality constraints. Hence if the primal problem has a feasible point,
Slater’s condition is satisfied and strong duality holds. Let w? and λ? denote the solutions to the primal
and dual problems, respectively. Then g(λ? ) = f (w? ). In addition, the primal objective function f (w)
is continuously differentiable. Hence the following KKT conditions are necessary and sufficient for the
optimality of w? , λ? :
w? + P^{-1} q + P^{-1} F^T λ? = 0                   (∇w L(w?, λ?) = 0)
F w? − g ≤ 0                                         (primal constraint)
λ? ≥ 0                                               (dual constraint)
λi? (Fi,: w? − gi) = 0,  i ∈ [1:m]                   (complementary slackness)
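A small numerical sketch of these conditions (the QP below is made up, and its single active constraint is solved by hand rather than by a solver):

```python
import numpy as np

# Verify the KKT conditions for min 1/2 w'Pw + q'w s.t. Fw <= g, where the
# unconstrained minimizer (2, 2) violates w_1 <= 1, so that constraint is active.
P = np.eye(2)
q = np.array([-2.0, -2.0])
F = np.array([[1.0, 0.0]])
g = np.array([1.0])

w_star = np.array([1.0, 2.0])        # minimizer with the constraint active
lam_star = np.array([1.0])           # multiplier from P w* + q + F' lam = 0

assert np.allclose(P @ w_star + q + F.T @ lam_star, 0)   # stationarity
assert np.all(F @ w_star - g <= 1e-12)                   # primal feasibility
assert np.all(lam_star >= 0)                             # dual feasibility
assert np.allclose(lam_star * (F @ w_star - g), 0)       # complementary slackness
```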
Notes
For additional reading see the books by Boyd and Vandenberghe [7], Bertsekas [4] and Chong and Zak [9]. The
presentation here follows [7].
Exercises
Exercise 14.1. Consider the primal problem below. Here X ∈ Rn×m , y ∈ Rm and 1 ∈ Rm is the vector of all 1’s.
Give a detailed but concise derivation of the corresponding dual problem.
min_{w∈Rn, b∈R}  ½‖w‖²
s.t.  X^T w + b y ≥ 1.
Exercise 14.2. Let A ∈ Rm×n with rank(A) = n and y ∈ Rm. Consider the constrained regression problem:

min_{w∈Rn}  ‖Aw − y‖2²
s.t.  ‖w‖∞ ≤ 1.

This requires that we minimize the least squares residual while keeping the magnitude of the entries of w to at most 1. Show that this is a feasible convex program and that strong duality holds.
Exercise 14.3. Let A ∈ Rm×n, y ∈ Rm and c > 0. Consider the constrained regression problem:

min_{w∈Rn}  ‖Aw − y‖2²
s.t.  ‖w‖2² ≤ c.

We want to minimize the least squares residual while ensuring the squared norm of w is at most c.
(a) Verify that this is a convex program, that it is feasible and that strong duality holds.
(b) Write down the KKT conditions.
(c) Show that the primal solution is given by ridge regression using the optimal value λ? of the dual variable.
Chapter 15
The support vector machine (SVM) is a framework for using a set of labelled training examples to learn
a binary classifier. A special case of the SVM, called the generalized portrait method, was introduced by
Vapnik and Lerner in 1963. It assumed that the labelled training data could be separated using a hyperplane
(i.e., linearly separable training data). This is a special case of what is now called the linear SVM. The
original method was later extended to allow a nonlinear decision boundary constructed as a hyperplane in
a higher dimensional space. The SVM framework became popular after 1995 when Cortes and Vapnik
extended the framework by removing the requirement of linearly separable training data. The extended
method, together with the previous nonlinear extensions, is what we know today as the support vector
machine. From a practical perspective, the SVM has been employed across a wide variety of applications
and has produced robust classifiers that generalize well. The two main limitations of the SVM are its natural
binding to binary classification and the need to specify (rather than learn) a kernel function (more on this
later).
This chapter focuses on the linear SVM. We begin by considering linearly separable training data. Then we show how to remove the linearly separable assumption. In a subsequent chapter we show how to extend these results to the general (nonlinear) SVM.
15.1 Preliminaries
15.1.1 Hyperplanes
Recall that a hyperplane H in Rn is a subset of Rn of the form {x : w^T x + b = 0}, where w ∈ Rn is nonzero, and b ∈ R. The ray in the direction of w is the set of all points of the form αw for α > 0. This ray intersects the hyperplane at the point q = αw satisfying w^T(αw) + b = 0. Thus q = −bw/‖w‖². The distance d from the origin to the hyperplane is just the norm of q. Hence d = |b|/‖w‖. Finally, for each x on the hyperplane, w^T(x − q) = w^T x + b = 0. Thus w ⊥ (x − q). In this sense, the vector w is normal to the hyperplane. These observations are summarized in the left diagram in Figure 15.1.
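A short sketch verifying these facts for an arbitrary (invented) choice of w and b:

```python
import numpy as np

# Hyperplane H = {x : w'x + b = 0}: the point q where the ray along w meets H,
# the distance d from the origin to H, and orthogonality of w to H.
w = np.array([3.0, 4.0])
b = -10.0

q = -b * w / np.dot(w, w)                  # q = -b w / ||w||^2
d = abs(b) / np.linalg.norm(w)             # distance from the origin to H
x = np.array([2.0, 1.0])                   # a point on H: w'x + b = 0

assert np.isclose(w @ x + b, 0)
assert np.isclose(np.linalg.norm(q), d)
assert np.isclose(w @ (x - q), 0)          # w is normal to H
```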
The parameters (w, b) of a given hyperplane H are not unique. For any α > 0, (w, b) and (αw, αb)
specify the same hyperplane. Let [H] denote the set of equivalent parameters (w, b) all of which yield the
same hyperplane H.
A hyperplane H divides Rn into a positive half space: {x : wT x + b > 0}, and a negative half space:
{x : wT x + b < 0}. For the sake of definiteness, we will include the hyperplane H = {x : wT x + b = 0}
in the positive half space. This yields a binary subdivision of Rn that assigns each x ∈ Rn a label ŷw,b(x) ∈ {±1}: ŷw,b(x) = 1 if w^T x + b ≥ 0, and ŷw,b(x) = −1 otherwise.
Figure 15.1: Left: Hyperplane properties. Right: Computing the distance d(xj, H) from xj to the hyperplane H.
This is a linear classifier on Rn . Normally we write this linear classifier as ŷw,b (x) = sign(wT x + b), where
it is understood that we will take sign(0) = 1.
The family of linear classifiers on Rn is parameterized by (w, b) ∈ Rn+1 . For any α > 0, (w, b) and
(αw, αb) specify the same hyperplane, hence the same linear classifier. So a linear classifier corresponds to
the equivalence class [H] of pairs (w, b) yielding the same hyperplane H.
Now suppose we are given labelled training data {(xj, yj)}_{j=1}^m with xj ∈ Rn and yj ∈ {±1}. The hyperplane H separates the training data if each positive example lies in the positive half space and each negative example lies in the negative half space, that is, if w^T xj + b > 0 when yj = 1, and w^T xj + b < 0 when yj = −1, for j ∈ [1 : m]. This is a set of m linear inequalities in (w, b) defining a subset of (w, b) pairs in Rn+1. These inequalities can be written more compactly as
    yj (w^T xj + b) > 0,  j ∈ [1 : m].    (15.2)
In light of (15.2), for every example xj there exists a scalar γj > 0 such that yj (γj wT xj + γj b) = 1. Letting
γ = maxj γj we see that
yj (γwT xj + γb) ≥ 1 j ∈ [1 : m].
Figure 15.2: Left: Linearly separable data in R2 . The red points are the positive examples, and the blue points are
the negative examples. The hyperplane H separates these two sets of points. Right: The maximum margin separating
hyperplane, and its support vectors (circled), for the separable data on the left.
Moreover there exists at least one j such that yj (γwT xj + γb) = 1. For future reference, we summarize
these useful observations in the following lemma.
Lemma 15.1.1. A hyperplane H separates the training data {(xj, yj)}_{j=1}^m if and only if there exists (w, b) ∈ [H] satisfying
    yj (w^T xj + b) ≥ 1,  j ∈ [1 : m],    (15.3)
    yj (w^T xj + b) = 1  for at least one j ∈ [1 : m].    (15.4)
Proof. (If) If (w, b) satisfies (15.3) and (15.4), then (w, b) satisfies (15.2). So H separates the training data.
(Only If) Suppose H separates the training data and let (w, b) ∈ [H]. Then (w, b) must satisfy (15.2).
So for each training example, yj (wT xj + b) > 0. It follows that for each j there exists γj > 0 such that
yj ((γj w)T xj + (γj b)) = 1. Since the training data is finite, γ = maxj∈[1:m] {γj } exists and is finite. Let
(w̄, b̄) = (γw, γb). Then (w̄, b̄) ∈ [H], and for each training example, yj (w̄T xj + b̄) ≥ 1. Hence (w̄, b̄)
satisfies (15.3). In addition, for some j, γ = γj , and hence yj (w̄T xj + b̄) = 1. Thus (15.4) is satisfied.
For (w, b) ∈ [H] satisfying the conditions of Lemma 15.1.1, the margin of the separating hyperplane H on the training data is
    ρH = min_j yj (w^T xj + b)/‖w‖ = 1/‖w‖.    (15.6)
The linear SVM problem for separable data is posed as finding the separating hyperplane of maximum
margin. By (15.6), this is achieved by minimizing kwk2 under the constraints of Lemma 15.1.1:
min 1/2kwk2
w∈Rn ,b∈R
We claim that minimizing ‖w‖² always ensures that the second constraint in (15.7) is satisfied. To see this, suppose that (w, b) satisfies the first constraint with strict inequality for each j. Then there is a positive γ < 1 so that (γw, γb) satisfies the second constraint. Since this scaling decreases ‖w‖², it will always be favored under the minimization of ‖w‖². This observation allows us to simplify (15.7) to:
    min_{w ∈ Rn, b ∈ R}  1/2 ‖w‖²
    s.t.  yj (w^T xj + b) ≥ 1,  j ∈ [1 : m].    (15.8)
This is a convex program with affine inequality constraints. Under the assumption of linearly separable
training data, the problem is feasible. Hence Slater’s condition is satisfied and strong duality holds. Since
strong duality holds, and the objective and the constraint functions are continuously differentiable, the KKT
conditions are both necessary and sufficient for optimality.
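As a rough numerical sketch (not part of the text), the hard-margin problem (15.8) can be solved directly with a generic convex-optimization modeling tool. The use of cvxpy and the synthetic separable data below are assumptions made purely for illustration.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 40, 2
# columns of X are the training examples; first half negative, second half positive
X = np.vstack([rng.normal(-2, 0.5, (m // 2, n)),
               rng.normal(+2, 0.5, (m // 2, n))]).T
y = np.concatenate([-np.ones(m // 2), np.ones(m // 2)])

w = cp.Variable(n)
b = cp.Variable()
# constraints y_j (w^T x_j + b) >= 1, written componentwise
constraints = [cp.multiply(y, X.T @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()
print("w* =", w.value, " b* =", b.value, " margin =", 1 / np.linalg.norm(w.value))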
Problem (15.8) can be written more compactly by forming the vector y ∈ {±1}m with y(i) = yi , and
letting Z ∈ Rn×m be the matrix Z = [y1 x1 , . . . , ym xm ] of label-weighted examples. Then (15.8) can be
written as
min 1/2 wT w
w∈Rn ,b∈R
(15.9)
s.t. Z T w + by − 1 ≥ 0.
Here 1 denotes the vector of all 1’s, 0 denotes the vector of all 0’s, and the inequality is interpreted compo-
nentwise. Now bring in the dual variables α ∈ Rm with α ≥ 0, and form the Lagrangian
    L(w, b, α) = 1/2 w^T w − α^T (Z^T w + by − 1).
Let u ⊗ v = [u(i)v(i)] denote the Schur product of u and v. Then the KKT conditions for problem (15.9) are:
    w − Zα = 0                        (∇w L = 0)                  (15.13)
    α^T y = 0                         (∇b L = 0)                  (15.14)
    Z^T w + yb − 1 ≥ 0                primal constraint            (15.15)
    α ≥ 0                             dual variable constraint     (15.16)
    α ⊗ (Z^T w + yb − 1) = 0          complementary slackness      (15.17)
Since Slater’s condition is satisfied, we known that w? , b? , and α? satisfy these equations if and only if
w? , b? is a solution of (15.9) and α? is a solution of the corresponding dual problem.
By (15.13), w? = Zα? = Σ_{i∈A} αi? yi xi, where A = {i : αi? > 0} indexes the support vectors. By complementary slackness (15.17), each support vector satisfies
    w?^T xi + b? = yi,  i ∈ A.    (15.19)
Equation (15.19) shows that each support vector lies on one of the two hyperplanes w? T x + b? = ±1,
according to the value of its label. This is illustrated on the right of Figure 15.2. We know from the argu-
ment used to simplify (15.7) to (15.8) that there exist training examples satisfying the primal constraint with
equality. The support vectors constitute a subset of these training examples.
Averaging (15.19) over i ∈ A gives b? = ȳA − w?^T x̄A. Here ȳA is the average label, and x̄A the average example, over the support vectors.
So w? and b? are determined by the nonzero entries of α? , and the resulting linear classifier is
ŷ(x) = sign w? T (x − x̄A ) + ȳA . (15.22)
Only the support vectors play a role in classification. Hence once we know α? , the classifier is determined.
max 1T α − 1/2 αT Z T Zα
α∈Rm
s.t. y T α = 0 (15.23)
α ≥ 0.
Solving this problem gives α? , and the results of the previous subsection then uniquely determine the SVM
classifier. By multiplying the objective by −1 and changing the max to a min we see that the dual problem
is equivalent to a convex program with a quadratic objective and affine inequality constraints.
Figure 15.3: An illustration of the linear SVM applied to separable training data. In both plots, the red points are the
positive examples, and the blue points are the negative examples. Left: A linear SVM classifier trained with C = 0.5.
There are six support vectors. Two satisfy yi (wT xi + b) = 1, but four have yi (wT xi + b) < 1 and hence si > 0 even
though the data is linearly separable. Right: A linear SVM classifier trained on the same data with C = 5. Now there
are only two support vectors and both have si = 0.
with s(i) = si. The primal linear SVM problem can then be written as
    min_{w ∈ Rn, b ∈ R, s ∈ Rm}  1/2 w^T w + C 1^T s
    s.t.  Z^T w + by + s − 1 ≥ 0,    (15.26)
          s ≥ 0,
with the inequalities interpreted componentwise. Notice that the training data appears in the first constraint
as the n × m matrix Z and the vector y.
Problem (15.26) is a feasible, convex (quadratic) program. Since it is feasible and has affine inequality
constraints, Slater’s condition is satisfied and strong duality holds. It also has a continuously differentiable
objective function. Thus the KKT conditions are both necessary and sufficient for optimality.
To obtain the KKT conditions, bring in the dual variables α, µ ∈ Rm with α, µ ≥ 0, and form the
Lagrangian
L(w, b, s, α, µ) = 1/2 wT w + C1T s − αT (Z T w + by + s − 1) − µT s. (15.27)
Setting the gradients of L with respect to w, b, and s to zero, we conclude that w − Zα = 0, α^T y = 0, and α + µ = C1. The KKT conditions for problem (15.26) are:
    w − Zα = 0                        (∇w L = 0)                  (15.31)
    α^T y = 0                         (∇b L = 0)                  (15.32)
    α + µ − C1 = 0                    (∇s L = 0)                  (15.33)
    Z^T w + yb − 1 + s ≥ 0            primal constraint            (15.34)
    s ≥ 0                             primal constraint            (15.35)
    α ≥ 0                             dual variable constraint     (15.36)
    µ ≥ 0                             dual variable constraint     (15.37)
    α ⊗ (Z^T w + yb − 1 + s) = 0      complementary slackness      (15.38)
    µ ⊗ s = 0                         complementary slackness      (15.39)
We can use the KKT conditions to draw several conclusions about the solution w? , b? , s? and α? , µ? .
By (15.31), w? = Zα? = Σ_{i∈A} αi? yi xi,    (15.40)
where A ∆= {i : αi? > 0} is nonempty since α? ≠ 0. The examples with indices in A are called the support
vectors. Only these examples contribute to forming w? . By (15.38), for i ∈ A we have
w? T xi + b? = yi (1 − s?i ). (15.41)
For each i ∈ A there are two cases to consider.
(a) 0 < αi? < C. In this case, (15.33) implies µi? > 0 and then (15.39) implies si? = 0. So this support
vector lies on one of the two parallel hyperplanes:
w? T xi + b? = yi , yi ∈ {±1}.
(b) αi? = C. In this case, by (15.33), µ?i = 0 and we have si ≥ 0. This support vector lies on the
hyperplane specified by (15.41). If si < 1, the support vector is correctly classified, otherwise it is
incorrectly classified.
The support vectors are made up of both positive and negative examples. To see this, note that (15.32) implies Σ_{i∈A} αi? yi = 0. Since αi? > 0 for i ∈ A, we conclude that yi must take both positive and negative values over i ∈ A.
Figure 15.4: Training a linear SVM using nonseparable data. In both plots, the red points are the positive examples,
and the blue points are the negative examples. Left: A linear SVM classifier trained with C = 1. There are twelve
support vectors. Right: A linear SVM classifier trained on the same data with C = 5. In this case there are nine
support vectors.
    min_{s ∈ Rm, b ∈ R}  1^T s
    s.t.  [ Im  0 ] [ s ]     [ 0            ]
          [ Im  y ] [ b ]  ≥  [ 1 − Z^T Zα?  ] .    (15.42)
Only the support vectors participate in classification, and do so only via inner products with the test example.
15.3.2 The Dual Linear SVM Problem
The analysis in the previous section indicates the importance of α? in determining the linear SVM classifier.
Often α? is obtained by directly solving the dual problem. Following the procedure in Chapter 14, the dual
SVM problem is found to be,
max 1T α − 1/2 αT Z T Zα
α∈Rm , µ∈Rm
s.t. y T α = 0
α + µ − C1 = 0 (15.43)
α≥0
µ ≥ 0.
We can drop µ from the problem and replace the second constraint with α ≤ C1. Then µ = C1 − α. This
simplifies the dual to:
max 1T α − 1/2 αT Z T Zα
α∈Rm
s.t. y T α = 0 (15.44)
α ≤ C1
α ≥ 0.
By multiplying the objective by −1 and changing the max to a min we see that the dual problem is equivalent
to a convex program with a quadratic objective and affine inequality constraints. The ij-th entry of the matrix
Z T Z is yi (xTi xj )yj . Using this expression to write the term αT Z T Zα out in detail we see that
    α^T Z^T Z α = Σ_{i=1}^m Σ_{j=1}^m αi αj yi yj (xi^T xj).
So the training examples appear in the dual problem through the terms yi yj (xTi xj ). In due course we will
see that this gives the dual problem a significant advantage.
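A quick numpy check (synthetic X and y are assumptions for illustration) that Z^T Z is the Schur product of y y^T with the Gram matrix X^T X, i.e., the dual needs only the labels and the inner products of the examples.

import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 5
X = rng.normal(size=(n, m))                 # columns are examples
y = rng.choice([-1.0, 1.0], size=m)

Z = X * y                                   # label-weighted examples as columns
lhs = Z.T @ Z
rhs = np.outer(y, y) * (X.T @ X)            # Schur (elementwise) product
print(np.allclose(lhs, rhs))                # True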
Proposition 15.4.1 (After Schölkopf et al, 2000). If the primal ν-SVM for given ν > 0 has solution
w? , b? , r? , s? with r? > 0, then the solution of the primal C-SVM problem with C = 1/(r? m) yields
the identical classifier. Conversely, if the C-SVM for given C > 0 has solution w? , b? , s? , α? , µ? , then
the ν-SVM with ν = (1T α? )/(Cm) yields the identical classifier.
Proof. Let w? , b? , r? , s? with r? > 0 be a solution of the primal ν-SVM problem. We claim that w̄ = w? /r? ,
b̄ = b? /r? , and s̄ = s? /r? solve the C-SVM problem for C = 1/(r? m). To see this, go down the list of
KKT conditions for the primal ν-SVM and transform each by the modifications above. Then check that
the corresponding equation in the column for the primal C-SVM is satisfied. For most of the equations
this simply requires dividing each side of the equation by r? . For the complementary slackness equations
it requires dividing each side of the equation by r? 2 . Hence for every ν > 0, if the solution of the primal
ν-SVM problem yields r? > 0, then the solution of the primal C-SVM with C = 1/(r? m) has solution
    C-SVM                                     |  ν-SVM
    min 1/2 w^T w + C 1^T s                   |  min 1/2 w^T w − νr + (1/m) 1^T s
    s.t. Z^T w + by + s − 1 ≥ 0               |  s.t. Z^T w + by + s − r1 ≥ 0
         s ≥ 0                                |       s ≥ 0
                                              |       r ≥ 0
    Lagrangian:                               |  Lagrangian:
    1/2 w^T w + C 1^T s                       |  1/2 w^T w − νr + (1/m) 1^T s
      − α^T(Z^T w + by + s − 1) − µ^T s       |    − α^T(Z^T w + by + s − r1) − µ^T s − γr
    KKT conditions:                           |  KKT conditions:
    w − Zα = 0                 (∇w L)         |  w − Zα = 0                  (∇w L)
    α^T y = 0                  (∇b L)         |  α^T y = 0                   (∇b L)
    α + µ − C1 = 0             (∇s L)         |  α + µ − (1/m) 1 = 0         (∇s L)
                                              |  α^T 1 − ν − γ = 0           (∇r L)
    Z^T w + yb − 1 + s ≥ 0     p. c.          |  Z^T w + yb − r1 + s ≥ 0     p. c.
    s ≥ 0                      p. c.          |  s ≥ 0                       p. c.
                                              |  r ≥ 0                       p. c.
    α ≥ 0                      d. c.          |  α ≥ 0                       d. c.
    µ ≥ 0                      d. c.          |  µ ≥ 0                       d. c.
                                              |  γ ≥ 0                       d. c.
    α ⊗ (Z^T w + yb − 1 + s) = 0   c. s.      |  α ⊗ (Z^T w + yb − r1 + s) = 0   c. s.
    µ ⊗ s = 0                      c. s.      |  µ ⊗ s = 0                       c. s.
                                              |  γr = 0                          c. s.
Table 15.1: The C-SVM and ν-SVM and corresponding Lagrangians and KKT conditions (p. c. = primal constraint, d. c. = dual variable constraint, c. s. = complementary slackness).
w̄ = w? /r? , b̄ = b? /r? , and s̄ = s? /r? . This solution defines the same hyperplane and hence the same
classifier as the ν-SVM.
To prove the converse, let w? , b? , s? , α? , µ? satisfy the KKT conditions for the C-SVM for some C > 0.
We claim that w̄, b̄, s̄, ᾱ, µ̄ (where z̄ = z/(Cm)), with r̄ = 1/(Cm) and γ̄ = 0 satisfy the KKT conditions
for the primal ν-SVM with ν = 1T ᾱ. To see this, go down the list of KKT conditions for the primal C-SVM
and transform each by the modifications above. For most of the equations this simply requires dividing each
side of the equation by Cm. For the complementary slackness equations, divide each side of the equation by
(Cm)². Then check that the corresponding equation in the column for the primal ν-SVM is satisfied. This
verifies all equations except the 4-th, 9-th and 12-th. Since r̄ = 1/(Cm) > 0, select γ̄ = 0. This ensures the
9-th and 12-th equations are satisfied. Then ν = 1T ᾱ ensures the 4-th equation is satisfied. Since all KKT
conditions are satisfied, we have a solution of the primal ν-SVM. Moreover this solution yields the same
hyperplane and hence classifier as the C-SVM solution.
The question concerning advantages of the ν-SVM is answered in the following propositions.
Proposition 15.4.2 (Schölkopf et al, 2000). If we solve a ν-SVM yielding a solution with r? > 0, then
ν is an upper bound on the fraction of the training examples with s?i > 0 and a lower bound on the
fraction of training examples that are support vectors.
Proof. Let q be the number of training examples with s?i > 0 and let t ≥ q be the number of support vectors.
Since r? > 0, the KKT conditions imply γ ? = 0 and hence ν = α? T 1. Moreover, by the KKT conditions,
if si? > 0, then µi? = 0 and αi? = 1/m; otherwise αi? ≤ 1/m. Hence q/m ≤ ν = 1^T α? ≤ t/m.
Proposition 15.4.3 (Schölkopf et al, 2000). Suppose the m training examples used in the ν-SVM are
drawn independently from a distribution p(x, y) such that p(x|y = 1) and p(x|y = −1) are absolutely
continuous. Then with probability one, in the limit as m → ∞ the fraction of support vectors and the
fraction of margin errors both converge to ν.
can set ŵ = w/ε. Then ŵ^T xi = w^T xi / ε ≥ 1, i ∈ [1 : m]. Thus there exists a hyperplane ŵ^T x + b = 0 with b = −1 that separates the data from the origin. Moreover, the distance from 0 to this hyperplane is 1/‖ŵ‖₂.
It follows that an unlabeled dataset {xi}_{i=1}^m is linearly separable from the origin if and only if there exists w ∈ Rn such that w^T xi − 1 ≥ 0, i ∈ [1 : m]. Moreover, the distance from any such hyperplane to the origin is 1/‖w‖₂. It is now clear that we can find a hyperplane in this family that is furthest from the origin.
It will be convenient to set X = [x1 , . . . , xm ] and write the condition wT xi − 1 ≥ 0, i ∈ [1 : m], in
vector form as X T w − 1 ≥ 0. We can then pose the simple one-class SVM problem as:
min 1/2 wT w
w∈Rn (15.45)
s.t. X T w − 1 ≥ 0.
Since problem (15.45) assumes the data is linearly separable from the origin, it is analogous to the simple
SVM problem. The problem can be generalized by allowing some points to be on the “wrong” side of the
hyperplane. This removes the linear separability assumption and leads to the following formulation of the
one class C-SVM (shown in primal and dual forms):
One can also consider a one-class ν-SVM. In this case the primal and dual problems can be stated as:
Spherical One Class C-SVM Primal: Spherical One Class C-SVM Dual:
Notes
The SVM has a long history. It rose to wide prominence after the paper by Cortes and Vapnik in 1995 [10].
Notice that this paper did not use the name SVM. The ν-SVM appears in Schölkopf et al. [41].
The use of spheres with hard boundaries to describe data was examined with a soft margin one-class
SVM by Tax and Duin in 1999 [47]. The one class ν-SVM appears in the paper by Schölkopf et al. in
1999 [44] and is expanded upon in [39]. See [42] for a more expansive coverage of one class SVM classifiers.
Exercises
Primal SVM Problem
Exercise 15.1. (Uniqueness of SVM solutions)
(a) Show that the solution w? , b? of the simple linear SVM based on separable data is unique.
(b) Show that if w? , s? , b? is a solution of the primal linear SVM problem, then w? and 1T s? are unique. Show
that the uniqueness of s? and b? is determined by a feasible linear program.
Exercise 15.2. Show that if the solution of the primal linear SVM problem is unique, then there is at least one support
vector with s?i = 0. This support vector determines b? . Thus when the primal problem has a unique solution, b? is
easily determined from α? . What can one say when the solution is not unique?
(a) Let h ∈ Rn and Q ∈ On, and form the second training set {(Q(xi − h), yi)}_{i=1}^m. Show that the SVM classifier
for this dataset is (Qw?, w?^T h + b?).
(b) Show that both classifiers have the same accuracy on any testing set.
(c) In particular, if we first center the training examples, how does this change the SVM classifier?
min 1/2 wT w
w∈Rn ,b∈R
s.t. Z T w + by − 1 ≥ 0
Exercise 15.7. Give a clear and concise derivation of the dual of the primal linear SVM problem shown below and
explain the origin of each of the constraints in the dual problem.
s.t. Z T w + by + s − 1 ≥ 0
s ≥ 0.
Exercise 15.8. Show that the solution of the dual SVM problem is invariant under a “rigid body” transformation of the training data. Specifically, let {(xi, yi)}_{i=1}^m with xi ∈ Rn and yi ∈ {±1}, i ∈ [1 : m], be a training dataset, and h ∈ Rn, Q ∈ On. Then the solutions of the dual SVM problems for the datasets {(xi, yi)}_{i=1}^m and {(Q(xi − h), yi)}_{i=1}^m are identical.
Exercise 15.10. If we drop the constraint s ≥ 0 in the primal linear SVM problem, we obtain the modified problem
    min_{w ∈ Rn, b ∈ R, s ∈ Rm}  1/2 w^T w + C 1^T s
    s.t.  Z^T w + by + s − 1 ≥ 0.
(a) Show that this problem is equivalent to the corresponding problem with an equality constraint:
    min_{w ∈ Rn, b ∈ R, s ∈ Rm}  1/2 w^T w + C 1^T s
    s.t.  Z^T w + by + s − 1 = 0.
(b) Analyze the classifier that results from the problem in (a) in detail. In particular, show its connection to the
nearest centroid classifier.
Exercise 15.11. As in Exercise 15.10, drop the constraint s ≥ 0 in the primal linear SVM problem and make the primal constraint an equality. However, this time replace C Σ_{i=1}^m si in the SVM objective with the quadratic penalty 1/2 C Σ_{i=1}^m si².
(a) Formulate the new primal problem in vector form and determine when the primal problem feasible and when
strong duality holds.
(b) Write down the KKT conditions.
(c) Show that α? and b? solve a set of linear equations.
(d) Show that these linear equations have a unique solution.
(e) Find the dual problem in its simplest form.
Exercise 15.12. You are provided with m > 1 data points {xj ∈ Rn}_{j=1}^m of which at least d, with 1 < d ≤ m, are distinct. Let X = [x1, . . . , xm] and consider the one class SVM problem:
    min_{R ∈ R, a ∈ Rn, s ∈ Rm}  R² + C 1^T s
    s.t.  ‖xj − a‖₂² ≤ R² + sj,  j ∈ [1 : m],
          s ≥ 0.
(a) Show that this is a feasible convex program and that strong duality holds. [Hint: let r = R2 ]
(b) Write down the KKT conditions.
(c) Show that α? 6= 0 and that if C > 1/(d − 1) then (R2 )? > 0 (harder).
(d) What are the support vectors for this problem?
(e) Derive the dual problem.
(f) Assume C > 1/(d − 1). Given the dual solution, how should a and R2 be selected?
Exercise 15.13. You are provided with m > 1 data points {xi ∈ Rn }m i=1 that are linearly separable from the origin.
Let X = [x1 , . . . , xm ] and consider the (simple) one-class SVM problem
min 1/2 wT w
w∈Rn
s.t. X T w − 1 ≥ 0.
(a) Show that this is a feasible convex program and that strong duality holds.
(b) Find the KKT conditions.
(c) Derive the corresponding dual problem.
Exercise 15.14. Consider the same set-up as the previous exercise except the set of unlabeled examples {xi ∈ Rn}_{i=1}^m may not be linearly separable from the origin. In this case, the one-class SVM primal problem is
    min_{w ∈ Rn, s ∈ Rm}  1/2 w^T w + C 1^T s
    s.t.  X^T w + s − 1 ≥ 0,
          s ≥ 0.
(a) Show that this is a feasible convex program and that strong duality holds.
(b) Write down the KKT conditions.
(c) Derive the corresponding dual problem.
Chapter 16
A feature map is a function that maps the examples of interest (both training and testing) into an inner
product space. In doing so, the feature map can transform the representation of the data to aid subsequent
analysis. Here are several potential advantages of this approach:
(1) The new representation can enable the application of known machine learning methods. For example, suppose the original data is a set of text, binary images, or tables of categorical data. If a feature map can be found that represents the important aspects of the data as Euclidean vectors, standard machine learning methods can be applied to the transformed data.
(2) A feature map can reorganize the data representation to aid data analysis. This is usually done by
extracting, combining and reorganizing informative features from the initial data representation. For example, mapping an image into a vector of wavelet features, or a sound signal into a vector of spectral features, or using a trained neural network to map images into a large dimensional vector of learned
image features.
(3) A nonlinear feature map has the ability to warp the data space to reduce within class variation and
increase between class variation. This ability can often be enhanced by making the dimension of the
feature space higher than that of the initial space.
We first discuss feature maps and the potential advantages and disadvantages of using such maps in
machine learning. Then we bring in the concept of the kernel of a feature map and discuss kernel properties.
Figure 16.1: Left: A simple example showing a set of training data along the x-axis that is not linearly separable, and above that, a mapping of the data into R² using φ(x) = (x, 0.5 + x²), after which it is linearly separable. Middle and Right: Illustrative plots for the second example in Example 16.1.1. The middle plot shows a set of training data in the plane that is not linearly separable. The right plot shows a mapping of the data into R³ using φ(x) = (x1², x2², √2 x1x2), after which it is linearly separable using the plane φ(x)(3) = 0.
After learning the classifier, a new test point x ∈ Rn is classified in Rq using its image φ(x):
    ŷ(φ(x)) = 1 if w?^T φ(x) + b? ≥ 0, and ŷ(φ(x)) = −1 otherwise.
Example 16.1.1. The following examples give some insight into the potential benefits of the above idea.
(1) Consider the specific binary classification problem shown in left plot of Figure 16.1. The set of binary
labeled training examples along the x-axis is not linearly separable. Shown above that are the results
of mapping the data into R2 using φ(x) = (x, 0.5 + x2 ). The mapped training examples are now
linearly separable in R2 .
(2) Let x ∈ R2 and write x = (x1 , x2 ). We create a variation of the famous “exclusive or” problem,
by letting the label of x = (x1 , x2 ) be y(x) = sign(x1 x2 ). Under this labelling, any sufficiently
large training set drawn from a density supported on R2 will not be linearly separable. But if we first
apply the nonlinear feature map φ(x) = x1 x2 , the data is linearly separable in R. This feels like
cheating; how would we know to select that special feature map? If we thought that second order statistics of the data might help classification, we could instead apply a simple quadratic feature map such as φ(x) = (x1², x2², √2 x1x2). The mapped data is then linearly separable in R³ using the third
feature. Moreover, this fact can be learned using the training data to find a linear SVM classifier. This
is illustrated in the middle and right plots of Figure 16.1.
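A tiny numerical illustration of the second example (the synthetic points below are an assumption; they are not the figure's data): with labels y = sign(x1 x2), the data is not linearly separable in R², but after the quadratic map φ(x) = (x1², x2², √2 x1x2) the third coordinate alone separates the classes.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])

phi = np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])
# the hyperplane phi(x)(3) = 0 classifies every point correctly
print(np.all(np.sign(phi[:, 2]) == y))      # True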
Example 16.1.1 suggests that a feature map can be a powerful tool. The feature map can nonlinearly
warp the original space to bring examples with the same label closer together (thus reducing within class
variation) while moving examples with distinct labels further apart (thus increasing between class distances).
This task can potentially benefit by lifting the data into a higher dimension space since that adds extra degrees
of freedom for warping the data surface. Of course that’s a vague idea. It suggests that using nonlinear
functions to map the data into a high dimensional space is a good idea, but leaves open the question of how
to actually select the feature map. That will be application dependent, but the general idea is to identify and
build informative composite features from the original weakly informative features.
We should also note a potential computational issue with the use of high dimensional feature maps.
Each example could already have a high-dimensional representation (e.g., a large image). Hence mapping
the data into an even higher dimensional space and doing subsequent computation in that higher dimensional
space could be computationally expensive or even infeasible.
    k(x, z) ∆= <φ(x), φ(z)>.    (16.1)
We say that k is the kernel of φ and that k is a kernel on X . If it is important to indicate that a kernel k and
feature map φ are connected by (16.1), we denote the kernel by kφ . A few important properties of kernels
follow immediately from (16.1). We list these below.
(1) For any function h : X → R, k(x, z) = h(x)h(z) is a kernel on X . This follows by noting that the
simplest feature maps extract one scalar feature, i.e., φ : X → R. Every such function defines a kernel
on X of the form k(x, z) = φ(x)φ(z). Hence h(x)h(z) is the kernel for the feature map h.
(3) A kernel k(x, z) inherits the following properties of the inner product in H: positivity, k(x, x) ≥ 0;
symmetry, k(x, z) = k(z, x); and the Cauchy-Schwarz inequality,
    |k(x, z)| = |<φ(x), φ(z)>| ≤ ‖φ(x)‖₂ ‖φ(z)‖₂ = √(k(x, x) k(z, z)).
Hence
    −1 ≤ k(x, z)/√(k(x, x) k(z, z)) ≤ 1.
k4(x, z) = x1²z1² + x2²z2² + 2x1x2z1z2 + 2x1z1 + 2x2z2 + 1 = (x^T z + 1)².
Note that for any points {xj ∈ X}_{j=1}^m and any a ∈ Rm, a^T K a = Σ_{i,j} ai aj <φ(xi), φ(xj)> = ‖Σ_i ai φ(xi)‖₂² ≥ 0. Hence if we want to pick a function k : X × X → R such that k is the kernel of some feature map on X, then for every integer m ≥ 1 and every set of points {xj ∈ X}_{j=1}^m, the matrix K = [k(xi, xj)] ∈ Rm×m must be symmetric positive semidefinite. It turns out that this is also a sufficient condition for k to be the kernel of some feature map on X.
kernel of some feature map on X .
Theorem 16.2.1. A function k : X × X → R is the kernel of some feature map φ : X → H, where H
is a Hilbert space, if and only if for every integer m ≥ 1 and every set of m points {xj ∈ X }m
j=1 , the
matrix K = [k(xi , xj )] is symmetric positive semidefinite.
Proof. We proved necessity above the statement of the theorem. The proof of sufficiency involves the
construction of a reproducing kernel Hilbert space H of functions with reproducing kernel k(·, ·). The
required feature map is then φ(xi ) = k(xi , ·) ∈ H. See, for example [42].
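The PSD condition of Theorem 16.2.1 is easy to probe numerically. The helper below is an assumption made for illustration (it is not part of the text): it samples points, builds K = [k(xi, xj)], and checks symmetry and positive semidefiniteness up to a small tolerance.

import numpy as np

def is_kernel_matrix(k, points, tol=1e-10):
    K = np.array([[k(x, z) for z in points] for x in points])
    symmetric = np.allclose(K, K.T)
    psd = np.min(np.linalg.eigvalsh(K)) >= -tol
    return symmetric and psd

pts = [np.random.randn(3) for _ in range(20)]
print(is_kernel_matrix(lambda x, z: (x @ z + 1.0) ** 2, pts))       # True: inhomogeneous quadratic
print(is_kernel_matrix(lambda x, z: np.linalg.norm(x - z), pts))    # False: not a kernel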
Together with the definition of a kernel and its immediate consequences, the following core properties
play a key role in identifying and constructing kernels.
Theorem 16.2.2 (Core Properties of Kernels). Kernels have the following properties:
(c) Product: if k1, k2 are kernels on X, then k(x, x′) = k1(x, x′) k2(x, x′) is a kernel on X.
(d) Direct product: if k1 is a kernel on X and k2 is a kernel on Y, then k((x, y), (x′, y′)) = k1(x, x′) k2(y, y′)
is a kernel on X × Y.
(e) Limits: if {kp}p≥1 are kernels on X and limp→∞ kp(x, z) = k(x, z), then k is a kernel on X.
(b) For a polynomial q(s) with nonnegative coefficients, q(k(x, z)) is a kernel on X .
Hence we can set φ(x) = (x(1)², x(2)², √2 x(1)x(2)). For n = 2 and general d, the same expansion yields
    φ(x) = ( x(1)^d,  (d choose 1)^{1/2} x(1)^{d−1} x(2),  (d choose 2)^{1/2} x(1)^{d−2} x(2)²,  . . . ,  (d choose 1)^{1/2} x(1) x(2)^{d−1},  x(2)^d ).
For general n and d, we expand k(x, z) = (x(1)z(1) + x(2)z(2) + · · · + x(n)z(n))^d using the multinomial theorem, and use the pattern from above to obtain
    φ(x) = ( (d choose k1, . . . , kn)^{1/2} ∏_{t=1}^n x(t)^{kt} ).
The term inside the outer parentheses is the generic component of the vector. The components are indexed by k1, . . . , kn with Σ_{t=1}^n kt = d. For any m points in Rn arranged in the columns of X,
the kernel matrix is K = ⊗_{j=1}^d (X^T X).
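A small numpy check of the d = 2, n = 2 case (synthetic points are an assumption): the explicit feature map reproduces the homogeneous quadratic kernel, and the kernel matrix is the elementwise (Schur) square of the Gram matrix.

import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 6))                      # columns are points in R^2
K_feat = np.array([[phi(X[:, i]) @ phi(X[:, j]) for j in range(6)] for i in range(6)])
K_kern = (X.T @ X) ** 2                          # d-fold Schur product of X^T X, d = 2
print(np.allclose(K_feat, K_kern))               # True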
(f) k(x, z) = e^{−γ‖x−z‖₂²} on Rn, with γ ≥ 0.
This kernel is called the Gaussian kernel. Writing
    k(x, z) = e^{−γ‖x‖₂²} e^{2γ x^T z} e^{−γ‖z‖₂²} = f(x) f(z) e^{2γ x^T z},
and using the results of Theorem 16.2.2 and Corollary 16.2.1, one sees that this is indeed a kernel. To obtain a feature map, for x ∈ Rn consider
    φ(x) = g_x(θ) = √C e^{−2γ‖θ−x‖₂²},    (16.2)
where the normalizing constant satisfies C ∫_{Rn} e^{−4γ‖θ‖₂²} dθ = 1. The function g_x(θ) lies in the Hilbert space L²(Rn) of real valued square integrable functions on Rn with inner product
    <f(θ), g(θ)> = ∫_{Rn} f(θ) g(θ) dθ.    (16.3)
To check that φ is a feature map of the Gaussian kernel, use (16.3) to compute
    <g_x(θ), g_z(θ)> = C ∫_{Rn} e^{−2γ‖θ−x‖₂²} e^{−2γ‖θ−z‖₂²} dθ = e^{−γ‖x−z‖₂²} = k(x, z).
(g) For any symmetric PD matrix P ∈ Rn×n, k(x, z) = e^{−(x−z)^T P (x−z)} on Rn.
The Gaussian kernel is a special case with P = γIn. A second special case is P = (1/2) diag(σj²)^{−1} for fixed scalars σj², j ∈ [1 : n]. This gives k(x, z) = e^{−(1/2) Σ_{j=1}^n (xj − zj)²/σj²} (Exercise 16.31).
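In practice the Gaussian kernel matrix of (f) is usually formed from pairwise squared distances. The scipy-based helper below is one common way to do this (the library choice and data layout are assumptions, not prescribed by the text).

import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel_matrix(X, gamma):
    """X is n x m with examples as columns; returns K = [exp(-gamma ||x_i - x_j||_2^2)]."""
    D2 = cdist(X.T, X.T, metric="sqeuclidean")   # squared pairwise distances
    return np.exp(-gamma * D2)

X = np.random.default_rng(0).normal(size=(4, 10))
K = gaussian_kernel_matrix(X, gamma=0.5)
print(K.shape, np.allclose(K, K.T), np.min(np.linalg.eigvalsh(K)) > -1e-10)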
remains true if we take a limit of weighted sums. If the weights are appropriately selected, this yields a
kernel formed as a weighted integral:
    k(x, z) = ∫_a^b p(t) e^{−t‖x−z‖₂²} dt,
where 0 ≤ a ≤ b, and p(t) is a non-negative integrable function on [a, b]. This is called a smoothed kernel. In this example, one can regard t (or more appropriately 1/t) as a length scale parameter. By scaling the term ‖x − z‖₂², t determines when points are regarded as close versus distant. Roughly, the division between close and distant is determined by t‖x − z‖₂² = 1. Smoothing over a range of t can incorporate a range of length scales into the kernel.
More generally, one can prove the following result.
Theorem 16.4.1. Let k(t, x, z) be a family of kernels on Rn parameterized by t ∈ [a, b]. Assume that
for all x, y ∈ Rn , and t ∈ [a, b], |k(t, x, y)| ≤ B(x, y) < ∞. Then for any real-valued, non-negative,
integrable function p(t) defined over [a, b],
    κ(x, z) = ∫_a^b p(t) k(t, x, z) dt,
is a kernel on Rn.
Proof. Let m ≥ 1, {xj}_{j=1}^m ⊂ Rn, and set K = [κ(xi, xj)] = [∫_a^b p(t) k(t, xi, xj) dt]. Clearly K is symmetric. For any a ∈ Rm,
    a^T K a = ∫_a^b p(t) a^T [k(t, xi, xj)] a dt = ∫_a^b p(t) ( Σ_{i=1}^m Σ_{j=1}^m ai aj k(t, xi, xj) ) dt.
The term (· · ·) in the second integral is non-negative and bounded above by max_{i,j} B(xi, xj) ‖a‖₁². Since
p(t) is integrable, the integral is well defined, and since p(t) is non-negative, it yields a (finite) non-negative
number. Hence aT Ka ≥ 0. Thus K is PSD and κ(x, z) is a kernel.
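A numerical sanity check of Theorem 16.4.1 (discretizing the integral with a simple quadrature is an assumption made here for illustration): a non-negatively weighted integral of Gaussian kernels over a range of length scales still produces a PSD kernel matrix.

import numpy as np

def smoothed_kernel(x, z, p, a, b, num=200):
    t = np.linspace(a, b, num)
    vals = p(t) * np.exp(-t * np.sum((x - z) ** 2))
    return np.trapz(vals, t)                      # approximate the integral over t

rng = np.random.default_rng(0)
pts = rng.normal(size=(15, 3))
p = lambda t: np.exp(-t)                          # a non-negative weight function
K = np.array([[smoothed_kernel(x, z, p, 0.1, 5.0) for z in pts] for x in pts])
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)    # True (up to round-off)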
Notes
The material in this chapter is standard and can be found in most modern books and tutorials on machine
learning. See, for example, [42, Part III], [5, Chapter 6], [32, Chapter 14], [48, Chapter 11] and the tutorial
paper [21].
Exercises
Miscellaneous
Exercise 16.1. Prove the results in Corollary 16.2.1. Specifically, if k(x, z) is a kernel on X , then
(b) For a polynomial q(s) with nonnegative coefficients, q(k(x, z)) is a kernel on X .
Exercise 16.2. Let KX = {k : X × X → R, k is a kernel} denote the family of kernels on the set X . Show that KX
is a closed convex cone.
Exercise 16.3. Let A be a finite set and for each subset U ⊆ A let |U| denote the number of elements in U. For
U, V ⊂ A, let k(U, V) = |U ∩ V|. By finding a suitable feature map, show that k(·, ·) is a kernel on the power set
P(A) of all subsets of A.
Simple kernels on R
Exercise 16.5. Let a > 0 and L2[0, a] denote the set of real valued square integrable functions on the interval [0, a]. L2[0, a] is a Hilbert space under the inner product <g, h> = ∫_0^a g(s) h(s) ds. For f ∈ L2[0, a], let g(t) = ∫_0^t f²(s) ds and h(t) = ∫_t^a f²(s) ds, where t ∈ [0, a]. Show that
Exercise 16.6. Show that if, for each a > 0, k(x, z) is a kernel on [−a, a], then k(x, z) is a kernel on R.
Shift-invariant kernels on R
It will be convenient to consider kernels on R taking complex values. In this case, the feature space H is a Hilbert
space over the complex field C, and for some feature map φ : X → H, k(x, z) = <φ(x), φ(z)>. Any kernel matrix
K is then required to be hermitian (K = K ∗ = K̄ T ) and PSD.
Exercise 16.10. A complex-valued function r : R → C is said to be positive semidefinite if for every integer m ≥ 1
∆
and every set of m points {xj ⊂ R, j ∈ [1 : m]}, the matrix K = [r(xi − xj )] is hermitian (K = K ∗ = K̄ T ) and
m T
positive semidefinite (∀a ∈ C , a Kā ≥ 0). Show that r(x − z) is a shift-invariant kernel if and only if r(t) is a
positive semidefinite function.
Exercise 16.11. If r is a positive semi-definite function, show that
(a) r(0) ≥ 0
(b) for all t ∈ R, r(−t) = r(t) (Hint: consider K for 2 points.)
(c) for all t ∈ R, r(0) ≥ |r(t)|
(d) for all s, t ∈ R, |r(t) − r(s)|2 ≤ r(0)|r(0) − r(t − s)| (Hint: consider K for 3 points t, s, 0.)
(e) if r(t) is continuous at t = 0, then r(t) is uniformly continuous on R
Exercise 16.12. Show that each of the following operations on PSD functions yields a PSD function.
(a) non-negative scaling: r(t) a PSD function and α ≥ 0, implies αr(t) is a PSD function.
(b) addition: r1 (t), r2 (t) PSD functions implies r1 (t) + r2 (t) is a PSD function.
(c) product: r1 (t), r2 (t) PSD functions implies r1 (t)r2 (t) is a PSD function.
(d) limit: rj (t), j ≥ 1, PSD functions and limj→∞ rj (t) = r(t) implies r(t) is a PSD function.
Exercise 16.13. Determine, with justification, which of the following functions are PSD:
(a) r(t) = e^{iωt}, for each ω ∈ R
(b) r(t) = e^{iω1 t} + e^{iω2 t}, for general ω1, ω2 ∈ R
(c) r(t) = cos(ωt), for each ω ∈ R
(d) r(t) = cos²(ωt), for each ω ∈ R
(e) dn(t) = 1 + 2 Σ_{k=1}^n cos(kt) = sin((n + 1/2)t) / sin(t/2)
(f) r(t) = Σ_{k=−∞}^∞ αk e^{ikωt}, for ω ∈ R, αk ≥ 0, and pointwise series convergence
(g) r(t) = sin²(ωt), for each ω ∈ R
(h) r(t) = 1 for −1/2 ≤ t ≤ 1/2, and r(t) = 0 otherwise
(i) r(t) = |t|.
Exercise 16.14. (Simplified Bochner’s Theorem) Prove that if r̂(ω) is an integrable nonnegative function, then
    r(t) = (1/2π) ∫_R r̂(ω) e^{iωt} dω    (16.4)
is a positive semidefinite function.
Exercise 16.16. Let L1(R) denote the set of real-valued integrable functions on R. The convolution of f, g ∈ L1(R) is the function f ∗ g : R → R defined by
    (f ∗ g)(t) ∆= ∫_R f(s) g(t − s) ds.    (16.5)
If f, g ∈ L1(R), then h = f ∗ g is well defined and is also in L1(R). In addition, the Fourier transforms f̂, ĝ, ĥ are well defined, and ĥ(ω) = f̂(ω) ĝ(ω) (convolution theorem).
Show that if r1 , r2 ∈ L1 (R) are continuous PSD functions , then r = (r1 ∗ r2 ) is a continuous PSD function.
Hence the convolution of integrable, continuous PSD functions is an integrable, continuous PSD function.
Exercise 16.17. Let L2(R) denote the set of real-valued, square-integrable functions on R. L2(R) is a Hilbert space under the inner product <f, g> ∆= ∫_R f(s) g(s) ds. Define the autocorrelation function rh of h ∈ L2(R) by
    rh(τ) ∆= ∫_R h(s) h(s + τ) ds.    (16.6)
Exercise 16.18. The convolution f ∗ g given by (16.5) of f, g ∈ L2 (R) is well defined. The convolution operation is
commutative, f ∗ g = g ∗ f. It is also associative: (f ∗ g) ∗ h = f ∗ (g ∗ h), provided the stated convolutions are well defined. For h ∈ L2(R), let hr(t) ∆= h(−t).
Show that:
(a) For h ∈ L2 (R), rh (t) = (h ∗ hr )(t). Hence for each h ∈ L2 (R), h ∗ hr is a shift-invariant kernel on R
Exercise 16.19. Every function h ∈ L2 (R) has a Fourier transform ĥ(ω) with ĥ(ω) a complex valued, square inte-
grable function on R. Moreover, the inverse Fourier transform of ĥ(ω) yields a function that is equal to h(t) almost
everywhere. Let rh (τ ) denote the autocorrelation function of h ∈ L2 (R).
(a) Show that rh (τ ) has a Fourier transform and r̂h (ω) = ĥ(ω)ĥ(−ω)
(b) Without using Bochner’s theorem, show that r̂h (ω) is a real-valued, even, non-negative, integrable function.
Exercise 16.20. Use the result of Exercise 16.17 to prove that for γ ≥ 0,
(a) k(x, z) = e−γ|x−z| is a kernel on R
For each function below, sketch the function and determine (with a justification) if it is a shift-invariant kernel on R:
(a) r(t) = p(t)
(b) r(t) = (p ∗ p)(t)
(c) r(t) = (p ∗ p ∗ p)(t)
(d) r(t) = (p ∗ p ∗ p ∗ p)(t)
Kernels on Rn
Exercise 16.23. (Separable Kernels) Let κ(u, v) be a kernel on R. Show that the following functions are kernels on Rn: (a) k(x, z) = ∏_{i=1}^n κ(xi, zi), and (b) k(x, z) = Σ_{i=1}^n κ(xi, zi).
Exercise 16.24. For all γ ≥ 0, show that k(x, z) = e^{−γ‖x−z‖₁} is a separable kernel on Rn.
Exercise 16.25. For all γ ≥ 0, show that k(x, z) = e^{−γ‖x−z‖₂²} is a separable kernel on Rn.
Exercise 16.26. Determine, with justification, which of the following functions are kernels on Rn:
(a) k(x, z) = Σ_{j=1}^n e^{iωj(xj − zj)}, ωj ∈ R
(b) k(x, z) = Σ_{j=1}^n cos(xj − zj)
(c) k(x, z) = Σ_{j=1}^n cos²(xj − zj)
(d) k(x, z) = e^{i Σ_{j=1}^n ωj(xj − zj)}, ωj ∈ R
(e) k(x, z) = Σ_{k=1}^q cos( Σ_{j=1}^n ω_{jk}(xj − zj) ), ω_{jk} ∈ R
(f) k(x, z) = cos( Σ_j xj − Σ_j zj )
Exercise 16.27. Let k(x, z) be a kernel on Rn. Show that the following function is a kernel on the same space:
    k̃(x, z) = k(x, z)/√(k(x, x) k(z, z))  if k(x, x), k(z, z) ≠ 0;  and  k̃(x, z) = 0 otherwise.
Exercise 16.28. Let kj be a kernel on X with feature map φj : X → Rq , j = 1, 2. In each part below, find a simple
feature map for the kernel k in terms of feature maps for the kernels kj . By this means, give an interpretation for the
new kernel k.
(a) k(x, z) = k1 (x, z) + k2 (x, z)
(b) k(x, z) = k1(x, z) k2(x, z)
(c) k(x, z) = k1(x, z)/√(k1(x, x) k1(z, z))
Exercise 16.29. Find a feature map for the 4th order homogeneous polynomial kernel on R3 .
Exercise 16.30. You want to learn an unknown function f : [0, 1] → R using a set of noisy measurements (xj, yj), with yj = f(xj) + εj, j ∈ [1 : m]. To do so, you plan to approximate f(·) by a Fourier series on [0, 1] with q ∈ N terms:
    fq(x) ∆= a0/2 + Σ_{k=1}^q ( ak cos(2πkx) + bk sin(2πkx) ).
Then learn the coefficients ak , bk using regularized regression (see Exercise 9.17). Give an expression for the feature
map being used and determine its kernel in the simplest form.
Exercise 16.31. Let P ∈ Rn×n be symmetric PD. Show that k(x, z) = e^{−(1/2)(x−z)^T P (x−z)} is a kernel on Rn.
Exercise 16.32. Show that the following functions are kernels on Rn: (a) k(x, z) = ‖x‖₂² + ‖z‖₂² − ‖x − z‖₂², and (b) k(x, z) = ‖x + z‖₂² − ‖x − z‖₂².
Exercise 16.33. Which of the following functions are kernels on Rn and why/why not?
(a) Let αj ≥ 0, Σ_{j=1}^k αj = 1, and {Pj}_{j=1}^k ⊂ Rn×n be symmetric PSD. Set k(x, z) = e^{Σ_{j=1}^k αj (x^T Pj z)}.
(b) k(x, z) = (x e^{−x^T x})^T (z e^{−z^T z}).
(c) k(x, z) = 1/2 ( 1 + (x/‖x‖₂)^T (z/‖z‖₂) ).
(d) k(x, y) = ∏_{i=1}^n |xi yi|.
(e) k(x, y) = ht(Q^T x)^T ht(Q^T y), where Q ∈ Rn×n is orthogonal and ht is a thresholding function that maps z = [zi] to ht(z) = [z̃i] with z̃i = zi if |zi| > t and 0 otherwise.
Smoothed kernels on Rn
Exercise 16.34. Let p(t) be a real-valued, non-negative, integrable function on [0, ∞). Show that the following function is a kernel on Rn:
    k(x, z) = ∫_0^∞ p(t) e^{−t‖x−z‖₂²} dt.
Exercise 16.35. From ∫_{−∞}^∞ e^{−t²} dt = √π one can derive the identity ∫_0^∞ e^{−(a²x² + b²/x²)} dx = (1/2)(√π/a) e^{−2ab} (see, e.g., [17]). Use this identity to show that k(x, z) = e^{−α‖x−z‖₂} is a kernel on Rn. Is k(x, z) = e^{−α‖x−z‖₂^{3/2}} a kernel on Rn?
Exercise 16.36. Let t ≥ 0 and σ > 0. Show that the following shift-invariant functions are kernels on Rn:
(a) k(x, z) = σ² / (σ² + ‖x − z‖₂²)
(b) k(x, z) = σ / (σ + ‖x − z‖₂)
(c) k(x, z) = σ / (σ + ‖x − z‖₁)
Chapter 17
We now explore using kernels to form nonlinear extensions of learning algorithms. For mathematical sim-
plicity, we consider feature maps φ : Rn → Rq for finite q. This allows us to continue to use vector and
matrix notation to derive a kernel extension of an existing machine learning method. The equations derived will be useful in their own right and will also suggest the extension to infinite dimensional feature spaces (e.g., for the Gaussian kernel), although we omit the proofs in the infinite dimensional setting.
We begin with a general observation. Suppose we decide to employ a feature map φ : Rn → Rq ,
and then apply a known machine learning method in the feature space Rq . If the kernel function kφ (x, z)
is known, and can be efficiently evaluated, it gives a means of bypassing computing the feature vectors
φ(x), and the inner products of feature vectors, by directly computing inner products using the kernel:
k(x, z) = <φ(x), φ(z)>. Any machine learning algorithm using training and testing examples only within
inner products can be applied in feature space using such kernel evaluations. This gives a way to efficiently
extend various machine learning methods to include nonlinear feature maps. This is the path we explore.
max 1T α − 1/2αT Z T Zα
α∈Rm
yT α = 0 (17.1)
α ≤ C1
α ≥ 0.
The matrix Z = [yi xi ] ∈ Rn×m contains the m label-weighted examples as its columns. So Z T Z ∈ Rm×m
has i, j-th entry yi (xTi xj )yj . Notice that the training examples appear in the form of inner products.
We first examine how the SVM problem can be solved directly in the feature space. To do so, bring
in a feature map φ : Rn → Rq and map the training examples {(xi, yi)}_{i=1}^m into feature space to obtain {(φ(xi), yi)}_{i=1}^m. Let φ(Z) denote the q × m matrix [yj φ(xj)]. Then in feature space, the dual SVM problem requires the matrix K = φ(Z)^T φ(Z), and becomes
Figure 17.1: A comparison of the operations required to: (left) compute the feature map of the training data, use the
SVM to learn a classifier in the high dimensional space, then classify new data in the high dimensional space; versus:
(right) picking a kernel and using that directly in the SVM to learn and subsequently classify in the original space.
max 1T α − 1/2αT Kα
α∈Rm
yT α = 0 (17.2)
α ≤ C1
α ≥ 0.
The label-weighted kernel matrix K is m × m with ij-th entry yi <φ(xi ), φ(xj )>yj . So the dual problem is
no larger, and in this sense no harder to solve. However, we need to spend extra time and energy computing
each φ(xi ) and the inner products <φ(xi ), φ(xj )> in the higher dimensional space to obtain K. If the
feature map has a known kernel function, then some of this work can be avoided by using direct evaluations
of the kernel function: K = [yi yj k(xi , xj )].
Once the dual problem is solved, let A = {i : αi? > 0} denote the indices of the support vectors. In the
feature map approach, a new test example x is then classified by computing
    ŷ(x) = sign( Σ_{i∈A} αi? yi <φ(xi), φ(x)> + b? ).
So we first compute φ(x) and then compute |A| feature space inner products <φ(xi), φ(x)>, i ∈ A. Typically |A| is much less than the number of training examples. Alternatively, if the kernel function of φ is known, we can simply compute |A| kernel evaluations k(xi, x), i ∈ A, to obtain
    ŷ(x) = sign( Σ_{i∈A} αi? yi k(xi, x) + b? ).
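A minimal sketch of this decision rule (the variable names and the assumption that α?, b? and the support set A come from some dual solver are illustrative, not prescribed by the text):

import numpy as np

def svm_predict(x, X_sv, y_sv, alpha_sv, b_star, k):
    """Classify x using only the support vectors (columns of X_sv) and kernel k."""
    score = sum(a * yi * k(xi, x)
                for a, yi, xi in zip(alpha_sv, y_sv, X_sv.T)) + b_star
    return np.sign(score) if score != 0 else 1.0   # sign(0) = 1 convention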
Computation in the higher dimensional feature space will generally be more expensive. Hence in terms of
computation, direct and efficient evaluation of the kernel k(x, z) is critical. The kernels of some useful fea-
ture maps can indeed be efficiently evaluated without requiring any computation in the higher dimensional
space. For example, the homogeneous and inhomogeneous quadratic kernels, polynomial kernels, and the
Gaussian kernel.
The above discussion reinforces an important idea. Rather than picking a feature map φ, instead pick
a kernel function that is useful, and efficiently computable in the original space. Then kernel evaluations
don’t require computation in the higher dimensional space. Sometimes this is called the “kernel trick”, but
it’s more of a clever observation than a trick. Figure 17.1 illustrates the idea for the SVM by contrasting the
two ways of organizing the computational process: in feature space, and in the original space making direct
use of a known kernel function.
Notice that in these formulas the examples appear as XX T rather than the Gram matrix X T X and x
does not appear in the form X T x. However, this is easily remedied. First we make a side observation. If
A ∈ Rn×n and B ∈ Rm×m are invertible and AM = M B for some M ∈ Rn×m , then A−1 M = M B −1
(exercise). Using this together with the simple equality (XX T + λIn )X = X(X T X + λIm ), we conclude
that
(XX T + λIn )−1 X = X(X T X + λIm )−1 .
Applying this equality to (17.5) and (17.6) yields
    w? = X(λIm + X^T X)^{−1} y,    (17.7)
    ŷ(x) = w?^T x = y^T (λIm + X^T X)^{−1} X^T x.    (17.8)
These formulas indicate how a kernel can be introduced, and also remind us of some known properties of the
solution. First, by (17.7) we see that w? lies in the range of X. So there exists a vector a? ∈ Rm such that
w? = Xa?. Then from (17.8), ŷ(x) = a?^T X^T x = Σ_{j=1}^m a?(j) xj^T x. Hence the ridge regression predictor ŷ(x) is formed by taking a linear combination of the inner products xj^T x. These properties were discussed
previously - see below.
A Representer Theorem
The following theorem (an instance of a “representer theorem”) yields the above observations directly from
(17.3). The theorem was previously stated as Lemma 9.3.1 in Chapter 9 on least squares.
Theorem 17.2.1. The solution w? of the ridge regression problem (17.3) can be represented as w? =
Xa? , for some a? ∈ Rm , and the corresponding ridge regression predictor is given by
    ŷ(·) = Σ_{j=1}^m a?(j) <xj, ·>.
Proof. For each w ∈ Rn write w = ŵ + w̃ where ŵ ∈ R(X) and w̃ ∈ R(X)⊥ . Then ky − X T wk22 =
ky −X T ŵk22 . So w and ŵ give the same value for the first term in the ridge regression objective. The second
term is kwk22 = kŵk22 + kw̃k22 ≥ kŵk22 . Hence w? ∈ R(X). So w? = Xa? for some a? ∈ Rm . The second
claim then follows from ŷ(x) = w? T x = (Xa? )T x.
By Theorem 17.2.1 we can substitute w = Xa into (17.3) to obtain the following problem for a?:
    a? = arg min_{a∈Rm}  1/2 ‖y − X^T X a‖₂² + λ ‖Xa‖₂².
This is a Tikhonov regularized least squares problem with corresponding normal equations
X T X(λIm + X T X)a? = X T Xy. (17.9)
It is clear that a? = (λIm + X T X)−1 y is a particular solution. Hence w? = X(λIm + X T X)−1 y. This
gives (17.7), from which (17.8) immediately follows.
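A quick numpy check (with synthetic data, an assumption for illustration) of the matrix identity used to derive (17.7): working through the m × m matrix λIm + X^T X gives the same w? as the n × n formulation, so the solution and predictor need only inner products of the examples.

import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 5, 8, 0.3
X = rng.normal(size=(n, m))                     # columns are training examples
y = rng.normal(size=m)

w_primal = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)   # (XX^T + lam I)^{-1} X y
a_star = np.linalg.solve(lam * np.eye(m) + X.T @ X, y)
w_dual = X @ a_star                             # w* = X a*   as in (17.7)
print(np.allclose(w_primal, w_dual))            # True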
If the columns of X are linearly dependent, a? is not unique. Indeed, if a? is a solution to (17.9), so is
a? + v for every v ∈ N (X). The good news is that the predictor is unique even when a? is not. This follows
by noting that w? defines the predictor and w? = Xa? . So the component of a? in N (X) can’t influence
the predictor.
Hence to find the d-dimensional PCA projection we seek a solution U ? of the matrix Rayleigh quotient
problem (see (8.3)):
max trace(U T XX T U )
U ∈Rn×d (17.15)
s.t. U T U = Id .
We then project each example xi to its d-dimensional coordinates zi = U ? T xi with respect to U ? . Let
Z = [z1 , . . . , zm ] ∈ Rd×m be the matrix of these coordinates. Then
Z = U ? T X. (17.16)
Keep in mind that the objective of PCA dimensionality reduction is to compute Z. U ? is just an intermediate
variable.
By Theorem D.3.1, a solution U ? of (17.15) can be obtained by letting the columns of U ? be the leading
d eigenvectors of XX T (or equivalently the leading d left singular vectors of X). However, it is not imme-
diately clear how to add a kernel to this procedure since it involves XX T rather than X T X. To remedy this
we bring in the following representation result (a representer theorem).
Theorem 17.3.1. For every solution U ? of (17.15) there exists A? ∈ Rm×d such that U ? = XA? .
This is another instance of a matrix Rayleigh quotient problem. One solution is obtained by selecting
the columns of V ? to be orthonormal eigenvectors of G for its d largest eigenvalues. This V ? satisfies
GV ? = V ? Λ where Λ ∈ Rd×d is the diagonal matrix with the d largest eigenvalues of G listed in decreasing
order on the diagonal. These are the same d largest eigenvalues of XX^T. From the change of coordinates we have V? = √G A?, and hence GA? = √G V?. Using √G V? = V? √Λ then yields
    GA? = √G V? = V? √Λ.
Since Z = U?^T X = A?^T X^T X = A?^T G = (GA?)^T, this gives
    Z = √Λ V?^T.    (17.20)
In summary, PCA projection to d dimensions can be accomplished by finding the largest d eigenvalues
Λ = diag(λ1 , . . . , λd ) and corresponding orthonormal eigenvectors V ? = [v1 , . . . , vd ] of the Gram matrix
G = X T X. The PCA projections of the data to dimension d are then given by
    Z = [z1 z2 . . . zm] = [ √λ1 v1(1)   √λ1 v1(2)   . . .   √λ1 v1(m) ]
                           [ √λ2 v2(1)   √λ2 v2(2)   . . .   √λ2 v2(m) ]
                           [     ...         ...                 ...   ]
                           [ √λd vd(1)   √λd vd(2)   . . .   √λd vd(m) ] .    (17.21)
Note that this alternative method of computing Z uses only the Gram matrix G = X T X.
Now bring in a feature map φ : Rn → Rq with kernel k(·, ·). We seek the projection of the mapped data
{φ(xj )}m
j=1 in feature space onto its first d principal components. Normally the first step is to center the
data. For the moment assume this has been done. After discussing the main steps of computing a kernel
PCA projection, we show how to modify the procedure to include centering.
Let φ(X) = [φ(x1), . . . , φ(xm)] ∈ Rq×m, and let K denote the m × m kernel matrix on the training data: K = φ(X)^T φ(X) = [k(xi, xj)] ∈ Rm×m.
Assuming the kernel function is given, the matrix K can be computed in the ambient space of the data.
To obtain the PCA projection from feature space to d dimensions, we first solve problem (17.19) for
the data mapped into feature space. The only required modification of (17.19) is the replacement of G by
the kernel matrix K. A solution of the modified problem is obtained by letting Λ be the diagonal d × d
matrix with the largest eigenvalues of K listed in decreasing order down the diagonal and forming V ? from
the corresponding orthonormal eigenvectors of K. The desired PCA projection is then given by (17.20).
So the kernel PCA projection of the data is obtained by finding the leading d eigenvectors of the PSD
kernel matrix K and the corresponding eigenvalues λj , j ∈ [1 : d]. Then training example xj ∈ Rn is
projected to the point zj ∈ Rd by taking the j-th entries of the eigenvectors scaled by the square roots of
the corresponding eigenvalues as shown in (17.21). This only uses the m × m kernel matrix K and can be
computed in the ambient space of the examples.
K̃ = <φ̃(X), φ̃(X)>
= <φ(X)(Im − 1/m 1m 1Tm ), φ(X)(Im − 1/m 1m 1Tm )>
= (Im − 1/m 1m 1Tm )K(Im − 1/m 1m 1Tm )
= K − 1/m 1m 1Tm K − 1/m K1m 1m T + 1/m2 (1Tm K1m ) 1m 1Tm . (17.22)
Hence to perform kernel PCA with centered data we simply follow the method outlined previously using
the kernel matrix K̃ in place of K. The matrix K̃ can be computed directly from K without the need for
any computation in feature space.
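A compact sketch of this recipe (the rbf kernel choice and synthetic data below are assumptions): center K as in (17.22), take the top d eigenpairs, and read off Z as in (17.21).

import numpy as np

def kernel_pca(K, d):
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m          # centering matrix I - (1/m) 1 1^T
    K_tilde = J @ K @ J
    evals, evecs = np.linalg.eigh(K_tilde)       # eigenvalues in ascending order
    lam = evals[::-1][:d]
    V = evecs[:, ::-1][:, :d]
    return np.sqrt(np.maximum(lam, 0))[:, None] * V.T   # Z is d x m

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))                    # columns are examples
K = np.exp(-0.1 * np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0))
Z = kernel_pca(K, d=2)
print(Z.shape)                                   # (2, 50)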
Now we implement the same classifier in feature space. Let φ : Rn → Rq be a feature map with kernel
k(·, ·), and φ(Xj ) ∈ Rq×mj be obtained by applying φ to each column of Xj . Using (17.23), the nearest
sample mean classifier in feature space can be written as
where
kj (x) = φ(Xj )T φ(x) = [k(xi , x)] ∈ Rmj
denotes the vector of kernel evaluations using x and the examples xi in class j, and Kj = φ(Xj )T φ(Xj ) =
[k(xi1 , xi2 )] is the kernel matrix of the examples in class j. Equation (17.24) gives an expression for kernel
nearest centroid classification in terms of the kernel of the feature map. This is directly computable in the
original space. However, there is a penalty. For each classification, a kernel evaluation k(xi, x) is done for
every training example. So we must now remember every example in the training set and each classification
requires m kernel evaluations.
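A sketch of kernel nearest-centroid classification (the class layout and kernel choice are illustrative assumptions). The score follows the expansion ‖φ(x) − (1/mj) φ(Xj) 1‖₂² = k(x, x) − (2/mj) 1^T kj(x) + (1/mj²) 1^T Kj 1; the k(x, x) term is the same for every class and is dropped.

import numpy as np

def kernel_nearest_centroid(x, class_examples, k):
    """class_examples: list of arrays, each n x m_j with the class-j examples as columns."""
    scores = []
    for Xj in class_examples:
        kj = np.array([k(xi, x) for xi in Xj.T])              # vector k_j(x)
        Kj = np.array([[k(a, b) for b in Xj.T] for a in Xj.T])  # class-j kernel matrix
        mj = Xj.shape[1]
        scores.append(-2.0 / mj * kj.sum() + Kj.sum() / mj**2)
    return int(np.argmin(scores))                             # index of the nearest class mean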
Let’s see how to form a kernel version of the nearest neighbor classifier. We bring in a feature map
φ : Rn → Rq and then use nearest neighbor classification on the mapped training examples {(φ(xi ), yi )}m
i=1 .
This results in the classifier:
    j? = arg min_{j∈[1:m]} ‖φ(x) − φ(xj)‖₂²,    ŷ(x) = y_{j?}.
Expanding the squared norm gives ‖φ(x) − φ(xj)‖₂² = k(x, x) − 2 k(xj, x) + k(xj, xj), where k(·, ·) is the kernel function of φ. Noting that k(xj, xj) = Kjj, where K = [k(xi, xj)] is the kernel matrix of k on the training data, and dropping the term k(x, x) (which does not depend on j), yields the following kernelized version of the nearest neighbor classifier:
    j? = arg min_{j∈[1:m]} ( Kjj − 2 k(xj, x) ),    ŷ(x) = y_{j?}.
For each xj , k(xj , ·) is a function from Rn into R. The kernelized nearest neighbor classifier evaluates each
of these functions at x and forms the estimated label of x as a function of these evaluations.
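A minimal sketch of the kernelized nearest neighbor rule above (data layout assumed: training examples as columns of X_train):

import numpy as np

def kernel_1nn(x, X_train, y_train, k):
    Kdiag = np.array([k(xj, xj) for xj in X_train.T])            # K_jj
    scores = Kdiag - 2.0 * np.array([k(xj, x) for xj in X_train.T])
    return y_train[int(np.argmin(scores))]                       # label of nearest neighbor in feature space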
Notes
For extra reading, see [42] and [21]. Kernel PCA is first mentioned in [43] and in [30]. A more refined
perspective is given in [42] and [40].
Exercises
Exercise 17.1. A binary labelled set of data in R2 is used to learn a SVM using the homogeneous quadratic kernel.
By writing the equation for the decision boundary in terms of a quadratic form, reason about the types of decision
boundaries that are possible in R2 using this framework. In each case, give a neat sketch.
Exercise 17.2. The solution of kernel PCA projection is not unique. Characterize the set of solutions.
Exercise 17.3. Let {(x_i, y_i)}_{i=1}^m be a set of labeled training data. Suppose we select a kernel k(·, ·) and use this to perform kernel PCA dimensionality reduction to dimension d. This yields a new training set {(z_i, y_i)}_{i=1}^m. Is training a linear SVM on the reduced dimension training data equivalent to using a kernel SVM on the original dataset? If so, is the SVM kernel distinct from k?
Part I
Appendices
Appendix A
Vector Spaces
A.1 Definition
The concept of a vector space generalizes the algebraic structure of Cartesian space R^n. A vector space
consists of a set X of vectors and a field F of scalars together with an operation of vector addition x + y for
x, y ∈ X and an operation of scalar multiplication αx for x ∈ X , α ∈ F. The operations of vector addition
and scalar multiplication must satisfy the following axioms.
For vector addition:
1. (∀x1 , x2 ∈ X ), x1 + x2 = x2 + x1
2. (∀x1 , x2 , x3 ∈ X ), (x1 + x2 ) + x3 = x1 + (x2 + x3 )
3. ∃0 ∈ X such that ∀x ∈ X , x + 0 = 0 + x = x
These axioms ensure that X together with the operation of vector addition forms a commutative group.
For scalar multiplication:
4. (∀x ∈ X ), 1x = x & 0x = 0
We denote generic vector spaces by upper case script letters U, V, . . . ; generic vectors by lower case
roman letters u, v, . . . ; and generic scalars by lower case greek letters α, β, . . . . If extra clarity is required,
we will emphasize that u is a vector by writing u or ~u. However, usually, context will distinguish the
intended meaning.
Example A.1.1. Vector addition on C^n and scalar multiplication by α ∈ C are defined similarly to the corresponding operations on R^n. Under these operations, the vectors C^n and scalars C form a vector space.
Example A.1.2. Vector addition on Rn×m and scalar multiplication by α ∈ R are defined by (2.2). Under
these operations, the set of matrices Rn×m and scalars R form a vector space.
a) onto (or surjective) if for each y ∈ Rn there exists x ∈ Rm with f (x) = y. In this case, every point
in Y is the image of some point in X .
b) one-to-one (or injective) if for each x1 , x2 ∈ R^m with x1 ≠ x2 , we have f (x1 ) ≠ f (x2 ). In this case, no two distinct points in X map to the same point in Y.
c) invertible if f is both onto and one-to-one. In this case, there exists a function f −1 : Y → X such that
∀x ∈ X , f −1 (f (x)) = x and ∀y ∈ Y, f (f −1 (y)) = y.
Sometimes we also call a linear function a linear map. More generally, if X and Y are vector spaces over
the same field, a function from X to Y that satisfies the above two properties is called a linear function or
linear map.
An invertible linear function is called an isomorphism [isomorphism from the Greek isos (equal) and morphe (shape)]. Two vector spaces are isomorphic if there exists an isomorphism f from one to the other.
In this case, the two vector spaces have the same algebraic structure. The isomorphism f verifies this by
giving a correspondence between the vectors of the two spaces that matches (or respects) the vector space
operations.
Appendix B
Assuming A is invertible, we can zero the block below A by left matrix multiplication:
[ I_p        0   ] [ A  B ]   [ A   B            ]
[ −CA^{−1}   I_q ] [ C  D ] = [ 0   D − CA^{−1}B ] .        (B.2)
This is just a 2 × 2 block matrix version of Gaussian elimination. Similarly, we can zero the block to the
right of A by right matrix multiplication:
[ A  B ] [ I_p   −A^{−1}B ]   [ A   0            ]
[ C  D ] [ 0      I_q     ] = [ C   D − CA^{−1}B ] .        (B.3)
Combining these two operations block-diagonalizes M:

[ I_p        0   ] [ A  B ] [ I_p   −A^{−1}B ]   [ A   0   ]
[ −CA^{−1}   I_q ] [ C  D ] [ 0      I_q     ] = [ 0   S_A ] ,        (B.4)

where S_A = D − CA^{−1}B is the Schur complement of A in M.
Proof. Multiply.
Rearranging (B.4) gives the factorization

[ A  B ]   [ I_p       0   ] [ A   0   ] [ I_p   A^{−1}B ]
[ C  D ] = [ CA^{−1}   I_q ] [ 0   S_A ] [ 0      I_q    ] .
On the RHS, the first and third matrices are invertible and the submatrix A of M is assumed to be invertible.
Hence if SA is also invertible, then M must be invertible and the above equality can be used to find an
expression for M −1 . Alternatively, we can assume that M is invertible. Then SA must be invertible, and we
again arrive at an expression for M −1 .
Proof. If M and A are invertible, then (B.4) and Lemma B.1.1 imply that SA is invertible. The result then
follows by taking the inverse of both sides of (B.4) and using Lemma B.1.1.
Similarly, assuming D is invertible,

[ I_p   −BD^{−1} ] [ A  B ] [ I_p        0   ]   [ S_D   0 ]
[ 0      I_q     ] [ C  D ] [ −D^{−1}C   I_q ] = [ 0     D ] ,        (B.7)

where S_D = A − BD^{−1}C is called the Schur complement of D in M. These derivations yield the following result.
Proof. Exercise.
Lemmas B.1.2 and B.1.3 give two distinct expressions for M^{−1}. Since both equal M^{−1}, applying any function to either expression must yield the same result. This and other properties are explored in the exercises below.
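These identities are easy to check numerically. The sketch below verifies the block-diagonalization (B.4) and the standard Schur-complement expression for M^{−1} that it implies; the explicit form of M^{−1} used in the code is the usual one and is stated here as an assumption rather than quoted from the lemmas.

import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
A = rng.standard_normal((p, p)) + 4 * np.eye(p)   # make A comfortably invertible
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
D = rng.standard_normal((q, q)) + 4 * np.eye(q)

M = np.block([[A, B], [C, D]])
Ainv = np.linalg.inv(A)
SA = D - C @ Ainv @ B                              # Schur complement of A in M

# (B.4): left/right multiplication by unit block-triangular factors block-diagonalizes M.
L = np.block([[np.eye(p), np.zeros((p, q))], [-C @ Ainv, np.eye(q)]])
R = np.block([[np.eye(p), -Ainv @ B], [np.zeros((q, p)), np.eye(q)]])
print(np.allclose(L @ M @ R, np.block([[A, np.zeros((p, q))],
                                       [np.zeros((q, p)), SA]])))

# Block expression for M^{-1} built from A^{-1} and SA^{-1}.
SAinv = np.linalg.inv(SA)
Minv = np.block([[Ainv + Ainv @ B @ SAinv @ C @ Ainv, -Ainv @ B @ SAinv],
                 [-SAinv @ C @ Ainv,                   SAinv]])
print(np.allclose(Minv, np.linalg.inv(M)))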
Notes
For further reading on block matrices and matrix identities see the comprehensive summary in [24, Appendix A].
Exercises
Exercise B.1. Let M be given by (B.1). Show that:
(a) M is invertible if and only if A and SA = D − CA−1 B are invertible.
(b) M is invertible if and only if D and SD = A − BD−1 C are invertible.
Exercise B.2. Let A ∈ Rp×p and D ∈ Rq×q . Assuming A or D is invertible, as appropriate, show that
det [ A  B ]  =  det(A) det(D − CA^{−1}B)     if A is invertible,
    [ C  D ]     det(D) det(A − BD^{−1}C)     if D is invertible.
(b) Using (a), show that if A, D, and at least one of A + BDC or D^{−1} + CA^{−1}B are invertible, then:

(A + BDC)^{−1} = A^{−1} − A^{−1}B(D^{−1} + CA^{−1}B)^{−1}CA^{−1}.

Equalities of this form are called the Woodbury identity or the Matrix Inversion Lemma.
Exercise B.4. Let P ∈ R^{n×n} have known inverse P^{−1} and u, v ∈ R^n. Derive the following identity:

(P + uv^T)^{−1} = P^{−1} − (P^{−1} u v^T P^{−1}) / (1 + v^T P^{−1} u).
Exercise B.5. Let P ∈ R^{n×n} have known inverse P^{−1} and U, V ∈ R^{n×r} with r ≪ n. Show that

(P + U V^T)^{−1} = P^{−1} − P^{−1}U(I + V^T P^{−1}U)^{−1}V^T P^{−1}.

What is the complexity of finding the inverse on the LHS compared with using the formula on the RHS?
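As a numerical sanity check of the identity in Exercise B.5 (not a solution to the complexity question), the following sketch compares the two sides for an illustrative choice of sizes.

import numpy as np

rng = np.random.default_rng(3)
n, r = 500, 5
P = rng.standard_normal((n, n)) + n * np.eye(n)    # well conditioned, invertible
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))

Pinv = np.linalg.inv(P)                            # assumed already known

# Low-rank update: only an r x r system needs to be inverted.
middle = np.linalg.inv(np.eye(r) + V.T @ Pinv @ U)
lhs = np.linalg.inv(P + U @ V.T)                   # direct inverse of the updated matrix
rhs = Pinv - Pinv @ U @ middle @ V.T @ Pinv        # update formula, reusing Pinv
print(np.max(np.abs(lhs - rhs)))                   # tiny, e.g. around 1e-12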
Appendix C
QR-Factorization
Proof. It is clear that {q1 , . . . , qk } is an ON set. Hence we need only show that span{a1 , . . . , ak } =
span{q1 , . . . , qk }. We first show that qj ∈ span{a1 , . . . , ak }, j ∈ [1 : k], using induction. Note that q1 ∈ U.
Now assume q_i ∈ U for i < j. Then r_j = a_j − Σ_{i=1}^{j−1} (q_i^T a_j) q_i ∈ U. Hence q_j ∈ U. We now show that a_j ∈ span{q_1 , . . . , q_k }. From step (j) of the Gram-Schmidt procedure,

a_j = ‖r_j‖ q_j + Σ_{i=1}^{j−1} q_i q_i^T a_j = Σ_{i=1}^{j} r_{ij} q_i.        (C.1)
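A minimal numpy sketch of the classical Gram-Schmidt procedure and the relation A = QR implied by (C.1); it is written for clarity rather than numerical stability, and the variable names are illustrative.

import numpy as np

def gram_schmidt(A):
    # Classical Gram-Schmidt on the columns of A (assumed linearly independent).
    # Returns Q with orthonormal columns and upper triangular R with A = Q R.
    n, k = A.shape
    Q = np.zeros((n, k))
    R = np.zeros((k, k))
    for j in range(k):
        r = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]     # r_ij = q_i^T a_j
            r -= R[i, j] * Q[:, i]          # remove the component along q_i
        R[j, j] = np.linalg.norm(r)         # r_jj = ||r_j||
        Q[:, j] = r / R[j, j]
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q, R = gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(4)), np.allclose(Q @ R, A))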
C.2 QR-Factorization
The matrix version of the Gram-Schmidt procedure is called QR-factorization.
Theorem C.2.1. If A ∈ Rn×k has linearly independent columns, then there exists a matrix Q ∈ Vn,k and
an upper triangular and invertible matrix R ∈ Rk×k such that A = QR.
Proof. Let A = [a_1 . . . a_k] ∈ R^{n×k} with {a_1 , . . . , a_k } a linearly independent set. Writing the Gram-Schmidt equations (C.1) for j ∈ [1 : k] in matrix form gives A = QR, where Q ∈ V_{n,k} and the matrix R ∈ R^{k×k} is upper triangular with the terms r_{ij} given by (C.2). In particular, r_{ii} = ‖r_i‖_2 > 0, i ∈ [1 : k]. So R has positive diagonal entries and is invertible.
By construction, the columns of Q form an ON basis for the range of A. One can also see this from
R(A) = R(QR) and since R is invertible R(QR) = R(Q).
Theorem C.2.1 is a particular case of the following result.
Theorem C.2.2 (General QR-Factorization). Let A ∈ Rn×m and r = rank(A). Then A can be written in
the form
AP = QR
where P ∈ Rm×m is a permutation matrix, Q ∈ Vn,r , and the first r × r block of R ∈ Rr×m is upper
triangular and invertible.
Proof. Perform Gram-Schmidt as before, except when aj ∈ span(a1 , . . . , aj−1 ), add the coefficients to a
matrix R̂ but do not add a new column to Q. This yields A = QR̂ where Q ∈ Vn,r , and R̂ ∈ Rr×m has the
form:
       [ •  ×  ×  ×  ×  ×  ×  ×  × ]
       [ 0  •  ×  ×  ×  ×  ×  ×  × ]
R̂  =  [ 0  0  0  •  ×  ×  ×  ×  × ]
       [ 0  0  0  0  •  ×  ×  ×  × ]
       [ 0  0  0  0  0  0  •  ×  × ]
Here • indicates a positive entry, and × indicates a possibly nonzero entry. Now use a permutation matrix
P to permute the columns of R̂ and A so the r linearly independent columns are first:
           [ •  ×  ×  ×  ×  ×  ×  ×  × ]
           [ 0  •  ×  ×  ×  ×  ×  ×  × ]
R = R̂P =  [ 0  0  •  ×  ×  ×  ×  ×  × ]
           [ 0  0  0  •  ×  ×  ×  ×  × ]
           [ 0  0  0  0  •  ×  ×  ×  × ]
The first r × r block of R is upper triangular with positive diagonal entries. So R has r linearly independent
rows and hence rank(R) = r. Finally, AP = QR with R = R̂P of the required form.
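In practice a rank-revealing factorization of this kind can be computed with a column-pivoted QR, for example scipy.linalg.qr with pivoting enabled. The sketch below illustrates the statement AP = QR; it is not the construction used in the proof, and the rank-3 test matrix and the tolerance are assumptions.

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
# A 6 x 5 matrix of rank 3: two columns are combinations of the others.
B = rng.standard_normal((6, 3))
A = np.column_stack([B[:, 0], B[:, 1], B[:, 0] + B[:, 1], B[:, 2], B[:, 1] - B[:, 2]])

Q, R, piv = qr(A, mode='economic', pivoting=True)
r = np.sum(np.abs(np.diag(R)) > 1e-10)        # numerical rank from the staircase diagonal
P = np.eye(A.shape[1])[:, piv]                # permutation matrix with A P = Q R

print(r)                                      # 3
print(np.allclose(A @ P, Q @ R))              # True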
has O(kn) time complexity. When the columns of A ∈ Rn×m are not linearly independent, some computa-
tions are not required. But the time complexity remains O(mn).
Exercises
QR-Factorization
Exercise C.1. Let {a1 , . . . , ak } ⊂ Rn be a linearly independent set of vectors. Show that the Gram-Schmidt procedure
can be written in the form:
r_{11} q_1 = a_1
r_{22} q_2 = (I_n − Q_1 Q_1^T) a_2 ,        Q_1 = [q_1]
r_{33} q_3 = (I_n − Q_2 Q_2^T) a_3 ,        Q_2 = [q_1 q_2]
        ⋮
r_{kk} q_k = (I_n − Q_{k−1} Q_{k−1}^T) a_k ,        Q_{k−1} = [q_1 · · · q_{k−1}].
Exercise C.2. Let A1 ∈ Rn×k have rank k and QR-factorization A1 = Q1 R1 . Let A2 ∈ Rn×m be such that
A = [A1 A2 ] has linearly independent columns. Show that the QR-factorization of A is
[ A_1  A_2 ] = [ Q_1  Q_2 ] [ R_1   U   ]
                            [ 0     R̄_2 ] .
where A2 is the block of columns we want to remove and Q2 is the block of corresponding columns in Q. Show
that the QR-factorization of [A1 A3 ] is
[ A_1  A_3 ] = [ Q_1  Q̄_3 ] [ R_1   V   ]
                             [ 0     R̄_3 ] .
Appendix D
For convenient reference, this Appendix gathers a set of Rayleigh quotient problems together in one place.
x⋆ = arg max_{x∈R^n}  x^T P x
     s.t.  x^T x = 1.        (D.1)
Theorem D.1.1. Let the eigenvalues of P be λ1 ≥ λ2 ≥ · · · ≥ λn . Problem (D.1) has the optimal value λ1 and this is achieved if and only if x⋆ is a unit norm eigenvector of P for λ1 . If λ1 > λ2 , this solution is unique up to the sign of x⋆ .
Proof. We want to maximize x^T P x subject to x^T x = 1. Bring in a dual variable µ ∈ R and form the Lagrangian L(x, µ) = x^T P x + µ(1 − x^T x). Setting the derivative of L(x, µ) with respect to x equal to zero yields the necessary condition P x = µx. Hence µ must be an eigenvalue of P with x a corresponding eigenvector, normalized so that x^T x = 1. For such x, x^T P x = µ x^T x = µ. Hence the maximum achievable value of the objective is λ1 , and this is achieved when x⋆ is a corresponding unit norm eigenvector of P. Conversely, if u is any unit norm eigenvector of P for λ1 , then u^T P u = λ1 and hence u is a solution.
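A quick numerical illustration of Theorem D.1.1 using numpy: the top eigenvector attains the value λ1, and random unit vectors do no better. The test matrix is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(0)
n = 5
G = rng.standard_normal((n, n))
P = G @ G.T                                  # symmetric PSD

w, V = np.linalg.eigh(P)                     # eigenvalues in ascending order
x_star = V[:, -1]                            # unit norm eigenvector for lambda_1
lam1 = w[-1]

# Compare x_star^T P x_star with the objective at random unit vectors.
vals = []
for _ in range(1000):
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    vals.append(x @ P @ x)
print(x_star @ P @ x_star, lam1, max(vals) <= lam1 + 1e-9)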
Next consider maximizing the generalized Rayleigh quotient

x^T P x / x^T Q x

over x ≠ 0.
Since the quotient is invariant to a scaling of x, we can set xT Qx = 1 and state the problem as
x⋆ = arg max_{x∈R^n}  x^T P x
     s.t.  x^T Q x = 1.        (D.2)
Lemma D.2.1. If P and Q are symmetric positive semidefinite matrices and Q is invertible, then Q−1 P has
real nonnegative eigenvalues.
Proof. A similarity transformation of a square matrix A leaves its eigenvalues invariant, i.e., for an invertible matrix B, A and BAB^{−1} have the same eigenvalues. Let Q^{1/2} denote the symmetric PD square root of Q. It follows from the above that Q^{−1}P and Q^{1/2}(Q^{−1}P)Q^{−1/2} = Q^{−1/2} P Q^{−1/2} have the same eigenvalues. The second matrix is symmetric PSD. Hence the eigenvalues of Q^{−1}P are real and nonnegative.
Theorem D.2.1 (Generalized Rayleigh Quotient). Denote the real nonnegative eigenvalues of Q^{−1}P by λ1 ≥ · · · ≥ λn . Then the solution of problem (D.2) is given by any eigenvector x⋆ of Q^{−1}P for the maximum eigenvalue λ1 , normalized so that x⋆^T Q x⋆ = 1. If λ1 > λ2 , this solution is unique up to sign.
Proof. Let Q^{1/2} denote the symmetric PD square root of Q and set y = Q^{1/2}x, so x = Q^{−1/2}y. Making this substitution in (D.2) yields the equivalent problem

y⋆ = arg max_{y∈R^n}  y^T Q^{−1/2} P Q^{−1/2} y
     s.t.  y^T y = 1.        (D.3)

By Theorem D.1.1, any unit norm eigenvector y⋆ of Q^{−1/2} P Q^{−1/2} for its largest eigenvalue is a solution of (D.3). Then Q^{−1/2} P Q^{−1/2} y⋆ = λ1 y⋆ implies Q^{−1}P(Q^{−1/2}y⋆) = λ1 (Q^{−1/2}y⋆). It follows that x⋆ = Q^{−1/2}y⋆ is an eigenvector of Q^{−1}P with eigenvalue λ1 .
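The proof translates directly into the following numpy sketch, which forms Q^{−1/2}, solves the transformed problem (D.3), and checks that x⋆ = Q^{−1/2}y⋆ is an eigenvector of Q^{−1}P normalized so that x⋆^T Q x⋆ = 1. The random test matrices are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)); P = A @ A.T               # symmetric PSD
B = rng.standard_normal((n, n)); Q = B @ B.T + np.eye(n)   # symmetric PD

# Symmetric PD square root of Q and the transformed problem of Theorem D.2.1.
wq, Vq = np.linalg.eigh(Q)
Q_half_inv = Vq @ np.diag(1.0 / np.sqrt(wq)) @ Vq.T        # Q^{-1/2}
C = Q_half_inv @ P @ Q_half_inv                            # Q^{-1/2} P Q^{-1/2}
wc, Vc = np.linalg.eigh(C)
y_star = Vc[:, -1]                                         # top unit norm eigenvector
x_star = Q_half_inv @ y_star                               # solution of (D.2)
lam1 = wc[-1]

# x_star is an eigenvector of Q^{-1} P for lambda_1 and satisfies x^T Q x = 1.
print(np.allclose(np.linalg.solve(Q, P @ x_star), lam1 * x_star))
print(np.isclose(x_star @ Q @ x_star, 1.0), x_star @ P @ x_star, lam1)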
Theorem D.3.1. Let P ∈ R^{n×n} and d be specified as above. Then every solution of (D.4) has the form W⋆ = W_e Q, where the columns of W_e are orthonormal eigenvectors of P for its d largest (hence nonzero) eigenvalues, and Q is a d × d orthogonal matrix.
Proof. Form the Lagrangian L = trace(W^T P W) − trace(Ω(W^T W − I_d)). Here Ω is a real symmetric matrix. Setting the derivative of L with respect to W, acting on V ∈ R^{n×d}, equal to zero yields

0 = trace(2W^T P V − 2ΩW^T V) = 2<W^T P − ΩW^T, V>.

Since this holds for all V, we conclude that a solution W⋆ must satisfy the necessary condition

P W⋆ = W⋆ Ω.        (D.5)
Since Ω ∈ R^{d×d} is symmetric, there exists an orthogonal matrix Q ∈ O_d such that Ω = QΛQ^T, with Λ a diagonal matrix with the real eigenvalues of Ω listed in decreasing order on the diagonal. Substituting this expression into (D.5) and rearranging yields
P(W⋆Q) = (W⋆Q)Λ,        (D.6)

trace(Λ) = trace((W⋆Q)^T P (W⋆Q)) = trace(W⋆^T P W⋆).        (D.7)
The last term in (D.7) is the optimal value of (D.4). Thus the optimal value of (D.4) is trace(Λ), and W ? Q
is also a solution of (D.4). Lastly, we note that by (D.6) the columns of We = W ? Q are orthonormal
eigenvectors of P . By optimality, the diagonal entries of Λ must be the d largest eigenvalues of P . Since
d ≤ rank(P ), all of these eigenvalues are positive.
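Assuming (D.4) is the trace maximization problem appearing in the Lagrangian above (maximize trace(W^T P W) subject to W^T W = I_d), the following numpy sketch checks that W = W_e Q achieves the sum of the d largest eigenvalues for any orthogonal Q, and that a random feasible W does no better.

import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 2
G = rng.standard_normal((n, n))
P = G @ G.T                                  # symmetric PSD with rank n >= d

w, V = np.linalg.eigh(P)
We = V[:, ::-1][:, :d]                       # orthonormal eigenvectors for the d largest eigenvalues
opt = w[::-1][:d].sum()                      # optimal value: sum of the d largest eigenvalues

# Any W = We Q with Q orthogonal is feasible and achieves the same objective value.
Qg, _ = np.linalg.qr(rng.standard_normal((d, d)))
W = We @ Qg
print(np.isclose(np.trace(W.T @ P @ W), opt))
print(np.allclose(W.T @ W, np.eye(d)))

# A random feasible W (orthonormal columns) does no better.
Wr, _ = np.linalg.qr(rng.standard_normal((n, d)))
print(np.trace(Wr.T @ P @ Wr) <= opt + 1e-9)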
Appendix E
Let ⊗ denote the Schur product of matrices. If A, B ∈ R^{n×n} are symmetric, it is clear that A ⊗ B is also symmetric. What is less obvious is that if A, B are also positive semidefinite, so is A ⊗ B. This can be shown using the following elementary properties of the Schur product. For u, v, x ∈ R^n,

(uu^T) ⊗ (vv^T) = (u ⊗ v)(u ⊗ v)^T,        (E.1)
(u ⊗ v)^T x = u^T (v ⊗ x).        (E.2)

The first of these properties shows that the stated claim holds for symmetric rank one matrices.
Proof. By (E.1), the theorem holds if A and B have rank 1. More generally, compact SVDs give A = Σ_{i=1}^{r_a} σ_i u_i u_i^T and B = Σ_{j=1}^{r_b} ρ_j v_j v_j^T. Hence using (E.1),

A ⊗ B = (Σ_{i=1}^{r_a} σ_i u_i u_i^T) ⊗ (Σ_{j=1}^{r_b} ρ_j v_j v_j^T) = Σ_{i=1}^{r_a} Σ_{j=1}^{r_b} σ_i ρ_j (u_i ⊗ v_j)(u_i ⊗ v_j)^T.        (E.3)
Since σi ρj > 0, and (ui ⊗ vj )(ui ⊗ vj )T is symmetric PSD, it follows that A ⊗ B is symmetric PSD.
Now assume that A and B are symmetric PD. So ra = rb = n and {ui }ni=1 and {vi }ni=1 are orthonormal
bases for Rn . Since A ⊗ B is symmetric PSD, we only need to show that xT (A ⊗ B)x = 0 implies x = 0.
From the SVD expansion (E.3), xT (A ⊗ B)x = 0 implies ∀i∀j, (ui ⊗ vj )T x = 0. Using (E.2), this means
that ∀i∀j, uTi (vj ⊗ x) = 0. Since {ui }ni=1 is an ON basis, it follows that ∀j, vj ⊗ x = 0. Now since {vj }nj=1
is a basis, for no index k can it be that ∀j, vj (k) = 0. Hence x = 0.
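A short numpy check of the Schur product theorem and of the rank-one identity (E.1); here ⊗ is the entrywise product, implemented with numpy's elementwise *. The random test matrices are illustrative.

import numpy as np

rng = np.random.default_rng(3)
n = 5
A0 = rng.standard_normal((n, n)); A = A0 @ A0.T       # symmetric PSD
B0 = rng.standard_normal((n, n)); B = B0 @ B0.T

C = A * B                                             # Schur (entrywise) product
print(np.allclose(C, C.T))                            # symmetric
print(np.min(np.linalg.eigvalsh(C)) > -1e-10)         # no negative eigenvalues

# Rank-one building block of the proof: (uu^T) o (vv^T) = (u o v)(u o v)^T.
u, v = rng.standard_normal(n), rng.standard_normal(n)
print(np.allclose(np.outer(u, u) * np.outer(v, v), np.outer(u * v, u * v)))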
This statement and proof of the Schur product theorem follows that in the first edition of Horn and
Johnson. See also Horn and Johnson [22, p. 479].
Exercises
Exercise E.1. Prove the two results listed in equations (E.1) and (E.2).
Appendix F
Let the random vector X take values in R^n and have mean µ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. We say that X is a non-degenerate Gaussian random vector if Σ is positive definite and X has the density

f_X(x) = (2π)^{−n/2} det(Σ)^{−1/2} exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) ).        (F.1)

This is called a multivariate Gaussian density. The matrix K = Σ^{−1} is called the precision matrix of the density. Clearly, K is also a symmetric positive definite matrix.
We can use (F.1) to write

ln f_X(x) = −(1/2)(x − µ)^T K (x − µ) + C = −(1/2) x^T K x + x^T K µ + C′,        (F.2)

where C and C′ are constants that do not depend on x. In expression (F.2), the quadratic term specifies K, and the linear term specifies Kµ. Since K is known from the quadratic term, Kµ specifies the mean µ. So if f_X(x) is known to be a Gaussian density, then the precision matrix K and the mean µ can be extracted from the quadratic and linear terms in an expansion of ln f_X(x). A converse result is given in the following lemma.
Lemma F.0.1. If fX (x) is a density and ln fX (x) has the form (F.2), with K symmetric positive definite,
then fX (x) is a Gaussian density with precision matrix K and mean µ.
Proof. Exercise.
ΣXY is called the cross covariance of X and Y and ΣYX is called the cross covariance of Y and X. Clearly
ΣTXY = ΣYX . Since X and Y need not have the same dimensions, in general ΣXY is not a square matrix.
Let fXY (x, y) denote the density of Z = (X, Y), and fX (x) and fY (y) denote the marginal densities of
X and Y, respectively. We will think of X as a random vector that generates an example, and Y as a random
vector that generates its corresponding target value. Given the value of X, we want to predict the value of Y.
Proof. By Lemma F.2.1, Σ_X is invertible. Hence, we can use the results in Appendix B to write

[ I                  0 ] [ Σ_X    Σ_XY ] [ I    −Σ_X^{−1}Σ_XY ]   [ Σ_X   0       ]
[ −Σ_YX Σ_X^{−1}     I ] [ Σ_YX   Σ_Y  ] [ 0     I            ] = [ 0     S_{Σ_X} ] ,
Lemma F.2.3.
where C′ does not depend on x or y. As a function of y, f_{Y|X}(y|x) is a density, and ln f_{Y|X}(y|x) has the form (F.2). In addition, by Lemma F.2.2, S_{Σ_X}^{−1} is symmetric PD. Hence by Lemma F.0.1, f_{Y|X}(y|x) is a Gaussian density. Equations (F.6) and (F.7) then follow directly from the previous expression.
Notice that the conditional covariance of f_{Y|X}(y|x) does not depend on x. But, as expected, the conditional mean µ_{Y|X} does depend on x.
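A small numpy sketch of this observation, assuming the usual conditional formulas µ_{Y|X} = µ_Y + Σ_YX Σ_X^{−1}(x − µ_X) and Σ_{Y|X} = Σ_Y − Σ_YX Σ_X^{−1} Σ_XY (the standard forms of (F.6) and (F.7), stated here as an assumption). The joint distribution is constructed so that the conditional is known by design.

import numpy as np

rng = np.random.default_rng(4)
nx, ny = 3, 2

# Construct a joint Gaussian (X, Y) with Y = W X + b + E, E ~ N(0, R), X ~ N(mu_x, Sx).
mu_x = rng.standard_normal(nx)
A = rng.standard_normal((nx, nx)); Sx = A @ A.T + np.eye(nx)
W = rng.standard_normal((ny, nx))
b = rng.standard_normal(ny)
B = rng.standard_normal((ny, ny)); R = B @ B.T + np.eye(ny)

mu_y = W @ mu_x + b
Sxy = Sx @ W.T                       # cross covariance of X and Y
Sy = W @ Sx @ W.T + R

# Conditional mean and covariance computed from the joint parameters.
x = rng.standard_normal(nx)
mu_cond = mu_y + Sxy.T @ np.linalg.solve(Sx, x - mu_x)
S_cond = Sy - Sxy.T @ np.linalg.solve(Sx, Sxy)

# By construction, Y | X = x is N(W x + b, R): the mean depends on x, the covariance does not.
print(np.allclose(mu_cond, W @ x + b))
print(np.allclose(S_cond, R))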
When we fix the data x1 , . . . , xm , and regard µ and Σ as the variables, this function is called the likelihood
function. It measures the likelihood of the observed training data under each set of parameters. Maximum
likelihood estimation selects estimates for the unknown parameters by maximizing the likelihood function,
or equivalently by maximizing the log-likelihood ln(L). For (F.8), the log-likelihood is
ln(L) = −(m/2) ln det(Σ) − (1/2) Σ_{i=1}^m (x_i − µ)^T Σ^{−1} (x_i − µ) + C,
where the constant C does not depend on the data, µ, or Σ. Thus the problem of maximizing the log-
likelihood is equivalent to:
min_{µ∈R^n, Σ∈R^{n×n}}  J(µ, Σ) = m ln det(Σ) + Σ_{i=1}^m (x_i − µ)^T Σ^{−1}(x_i − µ)
s.t.  Σ is symmetric PD.        (F.9)
Problem (F.9) can be solved as follows. First set the derivative of J(µ, Σ) with respect to µ equal to
zero. This gives
D_µ J(µ, Σ)(h) = − Σ_{i=1}^m [ h^T Σ^{−1}(x_i − µ) + (x_i − µ)^T Σ^{−1} h ]
               = −2 ( Σ_{i=1}^m (x_i − µ) )^T Σ^{−1} h
               = 0.
Since this holds for all h ∈ R^n, we have Σ_{i=1}^m (x_i − µ)^T Σ^{−1} = 0. Multiplying both sides of this expression by Σ and rearranging gives the maximum likelihood estimate

µ̂ = (1/m) Σ_{i=1}^m x_i.        (F.10)
This is just the empirical mean of the training data. Note that this expression does not depend on Σ.
We can now substitute µ̂ for µ in J(µ, Σ) to obtain a new objective that is only a function of Σ. It is convenient to do this by setting z_i = x_i − µ̂ and S = Σ_{i=1}^m z_i z_i^T. Noting that z_i^T Σ^{−1} z_i = trace(z_i z_i^T Σ^{−1}) gives Σ_{i=1}^m z_i^T Σ^{−1} z_i = trace(SΣ^{−1}). The new problem can now be written as

min_{Σ∈R^{n×n}}  J(Σ) = m ln det(Σ) + trace(SΣ^{−1})
s.t.  Σ is symmetric PD.        (F.11)
The symmetric matrices in Rn×n form a subspace of dimension (n + 1)n/2. The symmetric positive
semidefinite matrices constitute a closed subset C of this subspace and the symmetric positive definite ma-
trices form the interior of C. If (F.9) has a positive definite solution, then this lies in the interior of C and we can use calculus to try to find it.
To take the derivative of the objective function in (F.11) with respect to Σ we use the derivatives of the functions f : R^{n×n} → R with f(M) = det(M) and g : R^{n×n} → R^{n×n} with g(M) = M^{−1} (where M is assumed to be invertible). Expressions for these derivatives for general M (not necessarily symmetric) were given in Lemma 6.3.1 and Example 6.4.1. For convenience, these results are restated here:

Df(M)(H) = det(M) trace(M^{−1}H),        Dg(M)(H) = −M^{−1} H M^{−1}.
Now we return to (F.11) and set the derivative of the objective function with respect to Σ equal to zero.
This gives
DJ(Σ)(H) = m (1/det(Σ)) det(Σ) trace(Σ^{−1}H) − trace(SΣ^{−1}HΣ^{−1})
         = trace( (mΣ^{−1} − Σ^{−1}SΣ^{−1}) H )
         = 0.
Thus for all H, <mΣ−1 − Σ−1 SΣ−1 , H> = 0. It follows that mΣ−1 − Σ−1 SΣ−1 = 0. Multiplying both
sides of this expression on the right and left by Σ and rearranging yields the candidate maximum likelihood
estimate
Σ̂ = (1/m) S = (1/m) Σ_{i=1}^m (x_i − µ̂)(x_i − µ̂)^T.        (F.12)
This estimate is just the empirical covariance of the training data. It is symmetric and positive semidefinite
but it might fail to be positive definite. Assuming fX (x) is non-degenerate, if Σ̂ fails to be positive definite,
then we have used insufficient training data.
We have proved the following result.
Theorem F.5.1. Let {x_i}_{i=1}^m be independent samples from a non-degenerate multivariate Gaussian density with mean µ ∈ R^n and covariance Σ ∈ R^{n×n}. Then the maximum likelihood estimates of µ and Σ based on these samples are given by (F.10) and (F.12), respectively.
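A minimal numpy sketch of Theorem F.5.1: draw samples from a known multivariate Gaussian and compare the estimates (F.10) and (F.12) with the true parameters. The particular µ, Σ and the sample size are illustrative.

import numpy as np

rng = np.random.default_rng(5)
n, m = 3, 5000
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((n, n))
Sigma = A @ A.T + 0.5 * np.eye(n)                 # a positive definite covariance

X = rng.multivariate_normal(mu, Sigma, size=m)    # m x n array of samples

# Maximum likelihood estimates (F.10) and (F.12).
mu_hat = X.mean(axis=0)
Z = X - mu_hat
Sigma_hat = (Z.T @ Z) / m                         # note the 1/m (not 1/(m-1)) factor

print(np.round(mu_hat - mu, 2))
print(np.max(np.abs(Sigma_hat - Sigma)))          # small for large m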
Appendix G
One often needs to bound the eigenvalues of a matrix. A simple result that is often useful for this purpose is Gershgorin's Circle Theorem. It specifies discs in the complex plane that are known to contain the eigenvalues. You will see Gershgorin's name spelt in various ways, and the theorem is sometimes called the Gershgorin disc theorem.
Let A = [a_{ij}] be an n × n matrix and, for each i, let r_i = Σ_{j≠i} |a_{ij}|. Gershgorin's theorem states that every eigenvalue of A lies in the union of the discs

D_i = {z : |z − a_{ii}| ≤ r_i},   i = 1, . . . , n.
Proof. Let λ be an eigenvalue of A with eigenvector x ∈ C^n. Then λx − Ax = 0. Writing out the i-th row of this equation and separating the a_{ii} term gives λx_i − a_{ii}x_i = Σ_{j≠i} a_{ij}x_j. Thus

|λ − a_{ii}| |x_i| ≤ Σ_{j≠i} |a_{ij}| |x_j|.        (G.1)

Now choose i so that |x_i| ≥ |x_j| for all j. Since x ≠ 0, |x_i| > 0, and dividing (G.1) by |x_i| gives

|λ − a_{ii}| ≤ Σ_{j≠i} |a_{ij}| (|x_j| / |x_i|) ≤ Σ_{j≠i} |a_{ij}|.
The Gershgorin disc D_i is centered at a_{ii} and its radius is the sum of the absolute values of the entries on the i-th row of A, excluding the entry on the diagonal. If A is diagonal, say A = diag(a_{11}, . . . , a_{nn}), then the eigenvalues are the diagonal entries. In this case, the discs have zero radius and each eigenvalue sits exactly at the center of its disc.
One can apply the same reasoning in Gershgorin’s theorem to the columns of A by applying the theorem
to AT . Since A and AT have the same eigenvalues and the same diagonal entries, the only possible change
is the radius of each Gershgorin disc.
G.2 Examples
Example G.2.1. The 5 × 5 matrix

A = [  0.5000   −0.4000    0          0          0
       0.4500    0.4500    0          0          0
       0         0        −0.6500     0          0.3000
       0.1000    0         0         −0.2500     0.4000
       0.1000    0         0.1000    −0.3000    −0.3000 ]
has the Gershgorin discs indicated in Figure G.1. All eigenvalues of A (shown using ∗) are inside the unit circle, and all of the Gershgorin discs are within the unit circle. The corresponding plot for A^T is also shown in Figure G.1. The eigenvalues and disc centers remain the same, but this time the Gershgorin discs are not contained in the unit circle.
Figure G.1: The Gershgorin discs for Examples G.2.1 and G.2.2. Left: For A and AT in Example G.2.1. Right: For F
in Example G.2.2. Disc centers shown as • and eigenvalues as ∗. In all plots, the unit circle is indicated by the dashed
blue curve.
Example G.2.2. A second matrix F has the Gershgorin discs shown on the right in Figure G.1. The matrix has one eigenvalue at 1 and all other eigenvalues inside the unit circle.
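The discs for the matrix A of Example G.2.1 can be computed directly; the following numpy sketch prints, for each eigenvalue, the row discs that contain it, and compares the row and column radii. The tolerance is an arbitrary choice.

import numpy as np

A = np.array([[ 0.50, -0.40,  0.00,  0.00,  0.00],
              [ 0.45,  0.45,  0.00,  0.00,  0.00],
              [ 0.00,  0.00, -0.65,  0.00,  0.30],
              [ 0.10,  0.00,  0.00, -0.25,  0.40],
              [ 0.10,  0.00,  0.10, -0.30, -0.30]])

centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)   # row sums without the diagonal entry
eigs = np.linalg.eigvals(A)

# Every eigenvalue lies in at least one Gershgorin (row) disc.
for lam in eigs:
    in_disc = np.abs(lam - centers) <= radii + 1e-12
    print(np.round(lam, 4), "in discs", np.where(in_disc)[0].tolist())

# Row discs of A vs. column discs (the discs of A^T): same centers, different radii.
print("row radii   ", np.round(radii, 2))
print("column radii", np.round(np.sum(np.abs(A), axis=0) - np.abs(centers), 2))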
Exercises
Exercise G.1. Show that a symmetric matrix S ∈ R^{n×n} with 0 < Σ_{j≠i} |S_{ij}| < S_{ii}, i ∈ [1 : n], is PD.
Exercise G.2. Let P ∈ Rn×n be symmetric, have nonnegative entries, and satisfy P 1 = 1. Show that if Pii > 1/2,
i ∈ [1 : n], then P is PD.
Appendix H
When we fix the data x1 , . . . , xm , and regard θ as the variable, this function is called the likelihood function
and is denoted by L(θ). It measures the likelihood of the training data under each value of the parameter.
Maximum likelihood estimation selects the value of the unknown parameter by maximizing the likelihood
function, or equivalently by maximizing the log-likelihood function ln L(θ). Notice that subject only to the
assumed form of the density f_X, this method gives the data total control: the assumed form of the density is the only means of regularizing the estimate.
The objective function is convex, and strictly convex if the density parameterization is non-redundant. Hence if a local minimum exists, it is a global minimum. To obtain a solution we take the derivative of the log-likelihood and set this equal to zero. This gives

∇ ln(Z(θ)) = (1/m) Σ_{i=1}^m t(x_i).
We also know that ∇ ln(Z(θ)) = E_θ[t(X)]. Hence the maximum likelihood estimate of θ satisfies

∇ ln(Z(θ̂)) = E_{θ̂}[t(X)] = (1/m) Σ_{i=1}^m t(x_i).        (H.4)
H.1.2 Examples
Example H.1.1 (Exponential Density). The scalar exponential density is f(x) = λe^{−λx}, x ∈ [0, ∞). Here λ > 0 is a fixed parameter. This density has the form (H.2) with h(x) = 1, t(x) = −x, θ = λ, Z(θ) = 1/θ. Using (H.4), the maximum likelihood estimate of θ given m i.i.d. examples drawn from the density must satisfy ∇ ln(Z(θ)) = −1/θ = −(1/m) Σ_{i=1}^m x_i. Thus the maximum likelihood estimate is

λ̂ = θ̂ = 1 / ( (1/m) Σ_{i=1}^m x_i ).
Example H.1.2 (Poisson pmf). The Poisson pmf is f(k) = (λ^k / k!) e^{−λ}, k ∈ N. Here λ > 0 is a fixed parameter. This density has the form (H.2) with h(k) = 1/k!, t(k) = k, θ = ln λ, Z(θ) = e^λ = e^{e^θ}. Using (H.4), the maximum likelihood estimate of θ given m i.i.d. examples drawn from the density must satisfy ∇ ln(Z(θ)) = e^θ = λ = (1/m) Σ_{i=1}^m k_i. Thus the maximum likelihood estimate is

λ̂ = (1/m) Σ_{i=1}^m k_i   and   θ̂ = ln λ̂.
Example H.1.3 (Bernoulli pmf). The Bernoulli pmf is given by f(x) = p^x (1 − p)^{(1−x)}, where p is a parameter and x ∈ {0, 1}. This takes the form (H.2) with h(x) = 1, t(x) = x, θ = ln( p / (1 − p) ), and Z(θ) = 1 + e^θ. By (H.4), the maximum likelihood estimate of θ given m i.i.d. examples drawn from the density is

p̂ = e^{θ̂} / (1 + e^{θ̂}) = (1/m) Σ_{i=1}^m x_i   and   θ̂ = ln( p̂ / (1 − p̂) ).
and Z(θ) = (1 + e^θ)^n. By (H.4), the maximum likelihood estimate of θ given m i.i.d. examples drawn from the density is

p̂ = e^{θ̂} / (1 + e^{θ̂}) = (1/(nm)) Σ_{i=1}^m x_i   and   θ̂ = ln( p̂ / (1 − p̂) ).
Example H.1.5 (Univariate Gaussian). As shown in Example 12.7.5, the univariate Gaussian density has the form (H.2) with h(x) = 1, t(x) = (x, x²), θ = ( µ/σ², −1/(2σ²) ), and

Z(θ) = √(2πσ²) e^{µ²/(2σ²)} = √( π / (−θ(2)) ) e^{−θ(1)²/(4θ(2))}.

The maximum likelihood estimate of θ given m i.i.d. examples drawn from the density satisfies (H.4) with

(1/m) Σ_{i=1}^m t(x_i) = ( (1/m) Σ_{i=1}^m x_i ,  (1/m) Σ_{i=1}^m x_i² ).

We know that E_θ[t(X)] = (µ, σ² + µ²). Hence the maximum likelihood estimates of µ and σ² are

µ̂ = (1/m) Σ_{i=1}^m x_i ,        σ̂² = (1/m) Σ_{i=1}^m x_i² − µ̂² = (1/m) Σ_{i=1}^m (x_i − µ̂)².
If desired, one can then obtain θ̂ using the known relationship between θ, µ and σ 2 .
Example H.1.6 (Multivariate Gaussian). As shown in Example 12.7.6, the multivariate Gaussian density has the form (H.2) with h(x) = 1, t(x) = (x, xx^T), θ = (Σ^{−1}µ, −(1/2)Σ^{−1}), and

Z(θ) = (2π)^{n/2} |Σ|^{1/2} e^{(1/2) µ^T Σ^{−1} µ}.

Here t(x), θ ∈ R^n × S^n with the inner product <(x, M), (y, N)> = <x, y> + <M, N>. The maximum likelihood estimate of θ given m i.i.d. examples drawn from the density satisfies (H.4) with

(1/m) Σ_{i=1}^m t(x_i) = ( (1/m) Σ_{i=1}^m x_i ,  (1/m) Σ_{i=1}^m x_i x_i^T ).

Using E_θ[t(X)] = (µ, Σ + µµ^T) yields the maximum likelihood estimates of µ and Σ:

µ̂ = (1/m) Σ_{i=1}^m x_i ,        Σ̂ = (1/m) Σ_{i=1}^m x_i x_i^T − µ̂µ̂^T = (1/m) Σ_{i=1}^m (x_i − µ̂)(x_i − µ̂)^T.
One can obtain θ̂ using the known relationship between θ, µ and Σ. You might like to contrast this deriva-
tion of the maximum likelihood estimates for a multivariate Gaussian with the derivation from scratch in
Appendix F.
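The moment-matching form of (H.4) is easy to check by simulation. The following numpy sketch does so for the exponential, Bernoulli, and univariate Gaussian examples above; the parameter values and sample size are illustrative.

import numpy as np

rng = np.random.default_rng(6)
m = 100000

# Exponential(lambda): theta = lambda, and (H.4) gives lambda_hat = 1 / sample mean.
lam = 2.5
x = rng.exponential(scale=1.0 / lam, size=m)
print(1.0 / x.mean())                 # close to 2.5

# Bernoulli(p): theta = log(p / (1 - p)), and p_hat is the sample mean.
p = 0.3
x = rng.binomial(1, p, size=m)
p_hat = x.mean()
print(p_hat, np.log(p_hat / (1.0 - p_hat)))

# Univariate Gaussian: mu_hat is the sample mean, sigma2_hat the (1/m) sample variance.
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=m)
mu_hat = x.mean()
sigma2_hat = (x**2).mean() - mu_hat**2
print(mu_hat, sigma2_hat)             # close to (1.0, 4.0)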
References
[1] Malcolm Adams and Victor Guillemin. Measure Theory and Probability. Wadsworth, 1986.
[2] R.G. Bartle. The Elements of Integration. John Wiley and Sons, 1966.
[3] J.O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 2nd edition, 1985.
[5] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
[6] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In: Bousquet O., von Luxburg U., Rätsch G. (eds), Advanced Lectures on Machine Learning. Lecture Notes in Computer Science, vol 3176. Springer, Berlin, Heidelberg, 2004.
[7] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[8] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory,
51(12):4203–4215, Dec 2005.
[9] Edwin Chong and Stanislaw Zak. An Introduction to Optimization. John Wiley and Sons, 2008.
[10] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297,
1995.
[11] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[12] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
[13] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley and Sons, 2nd
edition, 2001.
[14] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–
188, 1936.
[16] J. H. Friedman and W. Stuetzle. Projection pursuit regression. J. Amer. Statist. Asso., 76:817–823,
1981.
[18] J.-B Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.
[19] A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress,
58:54–59, 1962.
[20] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics, 12(1):55–67, 1970.
[21] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning.
The Annals of Statistics, 36(3):1171–1220, 2008.
[22] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition,
2013.
[24] Thomas Kailath, Ali H. Sayed, and Babak Hassibi. Linear Estimation. Prentice Hall, 2000.
[25] S.R. Kulkarni and G. Harman. Statistical learning theory: a tutorial. Wiley Interdisciplinary Reviews:
Computational Statistics, 3:543–556, 2011.
[26] E. L. Lehmann, S. Fienberg, and G. Casella. Theory of Point Estimation. Springer, 1998.
[27] E.L. Lehmann. Testing Statistical Hypotheses. Wiley Interscience, 2nd edition, 1986.
[28] David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University
Press, 2003.
[29] S. G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transac-
tions on Signal Processing, 41(12):3397–3415, Dec 1993.
[30] Sebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, pages 536–542, 1999.
[31] Tom M Mitchell. The Discipline of Machine Learning. Machine Learning, 17(July):1–7, 2006.
[32] Kevin Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[33] B.K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing,
24(2):227–234, 1995.
[34] Cathy O’Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens
Democracy. Crown Publishing Group, New York, NY, USA, 2016.
[35] Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: recursive function approxi-
mation with application to wavelet decomposition. Asilomar Conf. on Signals, Systems and Computing,
1993.
[36] H. Vincent Poor. An Introduction to Signal Detection and Estimation. Springer, 2nd edition, 1994.
[38] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill Education, 3rd edition, 1976.
[39] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson. Estimating the support of
a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.
[40] B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In Artificial Neural Networks — ICANN'97, Lecture Notes in Computer Science, vol 1327, pages 583–588. Springer, 1997.
[41] Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. New Support Vector
Algorithms. Neural Computation, 12(5):1207–1245, 2000.
[42] Bernhard Schölkopf and Alexander Smola. Learning with Kernels: Support Vector Machines, Regu-
larization, Optimization, and Beyond. The MIT Press, 2002.
[43] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear Component Analysis as
a Kernel Eigenvalue Problem. Neural Computation, 10(5):1299–1319, 1998.
[44] Bernhard Schölkopf, Robert Williamson, Alex Smola, John Shawe-Taylor, and John Platt. Support
Vector Method for Novelty Detection. Advances in Neural Information Processing Systems 12, pages
582–588, 1999.
[46] Gilbert Strang. Linear Algebra and Its Applications. Brooks Cole; 4th edition, 2006.
[47] David M. J. Tax and Robert P. W. Duin. Support vector domain description. Pattern Recognition
Letters, 20:1191–1199, 1999.
[48] Sergios Theodoridis. Machine Learning: A Bayesian and Optimization Perspective. Elsevier, 2015.
[50] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, Feb 2009.