
ELE 435/535

Machine Learning and Pattern Recognition 1

Peter J. Ramadge
Princeton University

Fall 2018, Version 4.3

© Peter J. Ramadge 2015, 2016, 2017, 2018. Please do not distribute without permission.
Contents

1 A Machine Learning Primer
  1.1 An Email Spam Filter
  1.2 Learning the Prior Probability of Labels
  1.3 Numerical Features
  1.4 A Simple Univariate Bayes Classifier
  1.5 Underfitting, Overfitting, and Generalization
  1.6 Moving from Univariate to Multivariate Classification

2 Vectors and Matrices
  2.1 Vectors
  2.2 Matrices
  2.3 Vector Spaces

3 Matrix Algebra
  3.1 Matrix Product
  3.2 Matrix Inverse
  3.3 Eigenvalues and Eigenvectors
  3.4 Similarity Transformations and Diagonalization
  3.5 Symmetric and Positive Semidefinite Matrices

4 Inner Product Spaces
  4.1 Inner Product on Rn
  4.2 An Inner Product on Rn×n
  4.3 Orthogonal Projection
  4.4 The Orthogonal Complement of a Subspace
  4.5 Norms on Rn

5 Singular Value Decomposition
  5.1 The Compact SVD
  5.2 The Full SVD
  5.3 Singular Values and Norms
  5.4 Best Rank k Approximation to a Matrix
  5.5 Inner Workings of the SVD

6 Multivariable Differentiation
  6.1 Real Valued Functions on Rn
  6.2 Functions f : Rn → Rm
  6.3 Real Valued Functions on Rn×m
  6.4 Matrix Valued Functions of a Matrix
  6.5 Appendix: Proofs

7 Convex Sets and Functions
  7.1 Preliminaries
  7.2 Convex Sets
  7.3 Convex Functions
  7.4 Differentiable Convex Functions
  7.5 Minimizing a Convex Function
  7.6 Projection onto a Convex Set
  7.7 Appendix: Proofs

8 Principal Components Analysis
  8.1 Preliminaries
  8.2 Selecting the Best Linear Manifold Approximation
  8.3 Principal Components Analysis
  8.4 The Connection between PCA and SVD

9 Least Squares Regression
  9.1 Ordinary Least Squares
  9.2 The Normal Equations and Properties of the Solution
  9.3 Tikhonov and Ridge Regularization
  9.4 On-Line Least Squares

10 Sparse Least Squares
  10.1 Preliminaries
  10.2 Sparse Least Squares Problems
  10.3 Sparse Approximation
  10.4 Greedy Algorithms for Sparse Least Squares Problems
  10.5 Spark, Coherence, and the Restricted Isometry Property

11 The Lasso
  11.1 Introduction
  11.2 Lasso Problems
  11.3 1-Norm Sparse Approximation
  11.4 Subgradients and the Subdifferential
  11.5 Application to 1-Norm Regularized Least Squares
  11.6 The Dual Lasso Problem

12 Generative Models for Classification
  12.1 Generative Models
  12.2 The Bayes Classifier
  12.3 MAP Classifiers for Scalar Gaussian Conditional Densities
  12.4 Bayes Classifiers for Multivariate Gaussian Conditional Densities
  12.5 Learning Bayes Classifiers
  12.6 Kullback-Leibler Divergence
  12.7 The Exponential Family
  12.8 Appendix: Interchanging Derivatives and Integrals
  12.9 Appendix: Proofs

13 Generative Models for Regression
  13.1 Introduction
  13.2 Determining the MMSE Estimator
  13.3 The MSE Estimator for Jointly Gaussian Random Vectors
  13.4 Examples
  13.5 General Affine Estimation
  13.6 Learning an Affine Estimator from Training Data

14 Convex Optimization
  14.1 Convex Programs
  14.2 The Lagrangian and the Dual Problem
  14.3 Weak Duality, Strong Duality and Slater’s Condition
  14.4 Complementary Slackness
  14.5 The KKT Conditions

15 The Linear Support Vector Machine
  15.1 Preliminaries
  15.2 A Simple Linear SVM
  15.3 The Linear SVM for General Training Data
  15.4 The ν-SVM
  15.5 One Class SVMs

16 Feature Maps and Kernels
  16.1 Feature Maps
  16.2 Kernels and Kernel Properties
  16.3 Examples of Kernels
  16.4 Shift-Invariant, Isotropic, and Smoothed Kernels

17 Machine Learning with Kernels
  17.1 Kernel SVM
  17.2 Kernel Ridge Regression
  17.3 Kernel PCA
  17.4 Kernel Nearest Centroid
  17.5 Kernel Nearest Neighbor

I Appendices

A Vector Spaces
  A.1 Definition
  A.2 Functions between Vector Spaces

B Matrix Inverses and the Schur Complement
  B.1 Block Matrices and the Schur Complement

C QR-Factorization
  C.1 The Gram-Schmidt Procedure
  C.2 QR-Factorization

D Rayleigh Quotient Problems
  D.1 The Rayleigh Quotient
  D.2 The Generalized Rayleigh Quotient
  D.3 Matrix Rayleigh Quotient
  D.4 Generalized Matrix Rayleigh Quotient

E Schur Product Theorem
  E.1 The Theorem and its Proof

F Gaussian Random Vectors
  F.1 Jointly Gaussian Random Variables
  F.2 Some Preliminary Lemmas
  F.3 The Marginal Densities fX(x) and fY(y)
  F.4 The Conditional Density fX|Y(x|y)
  F.5 Maximum Likelihood for the Gaussian Density

G Gershgorin Circle Theorem
  G.1 The Theorem and its Proof
  G.2 Examples

H Maximum Likelihood Estimation
  H.1 The Likelihood Function

Notation
Z The set of integers. Integers are typically denoted by lower case roman letters, e.g., i, j, k.
N The set of non-negative integers.
R The set of real numbers.
R+ The set of non-negative real numbers.
Rn The set of n-tuples of real numbers for n ∈ N and n ≥ 1.
Rn×m The set of n × m matrices of real numbers.
[1 : k] The set of integers 1, . . . , k.
≜ Equal by definition, as in A(x) ≜ xxT.
a ∈ A Indicates that a is an element of the set A.
0n , 1n The vectors in Rn with 0n = (0, . . . , 0) and 1n = (1, . . . , 1). (n is omitted if clear from context).
ej The j-th vector in the standard basis for Rn .
In The n × n identity matrix, In = [e1 , . . . , en ].
X −1 The matrix inverse of the square matrix X.
XT The transpose of the matrix X. If X = [Xi,j ], then X T = [Xj,i ].
X −T The transpose of the inverse of the square matrix X, i.e., (X −1 )T .
Xi,: The i-th row of X ∈ Rn×m , i.e., the 1 × m matrix Xi,: = [Xi,j , j ∈ [1 : m]].
X:,j The j-th column of X ∈ Rn×m , i.e., the n × 1 matrix X:,j = [Xi,j , i ∈ [1 : n]].
On The group of (real) n × n orthogonal matrices.
Vn,k The set of real n × k matrices, k ≤ n, with orthonormal columns.
Sn , Sn+ The subsets of symmetric and symmetric PSD matrices in Rn×n , respectively.
X ⊗ Y The Schur product [Xi,j Yi,j ] of X, Y ∈ Rm×n . Similarly, x ⊗ y = [xi yi ] for x, y ∈ Rn .
⟨·, ·⟩ An inner product.
x⊥y Vectors x and y are orthogonal. Similarly, X ⊥ Y indicates orthogonality of matrices X and Y .
U⊥ The orthogonal complement of a subspace U.
Df (x) The derivative of f (x) w.r.t. x ∈ Rn . Often displayed as Df (x)(v) to indicate its action on v ∈ Rn .
∇f (x) The gradient of the real valued function f (x) at x ∈ Rn . Note that ∇f (x) ∈ Rn .
X A random variable, or random vector. X denotes a matrix, whereas X denotes a random vector.
E[X] The expected value of the random variable, or random vector X.
µX The mean of the random variable, or random vector X.
σX² The variance of a random variable X.
ΣX The covariance matrix of the random vector X.


Chapter 1

A Machine Learning Primer

The objective of machine learning is to develop principled methods to automatically identify useful relationships in data. By useful we mean that these relationships or patterns must generalize to new data collected from the same source. The patterns of interest can take several forms. For example, these could be topological (e.g., clusters in the data), or be (approximate) functional relationships between subsets of data features. Patterns can also be expressed probabilistically by identifying probabilistic dependences between components of the data. The identified patterns might be used to better understand the data, to compress the data, to estimate the values of missing variables, and to make decisions based on observed data.
The data is the primary guide for accomplishing the machine learning task. But we don’t necessarily
want to give the data total control. We also want to impose appropriate objectives, and use domain knowledge (prior knowledge) to guide the learning process.
Before introducing formal definitions and mathematical notation, it’s helpful to see a simple machine
learning problem. This serves to motivate the subsequent development.

1.1 An Email Spam Filter


A spam filter is a function f that takes an email message as input and outputs a label in the set {spam, non-spam}.
Since f classifies an email into one of two classes, it is called a binary classifier.
A spam filter can make two types of errors: it can classify a non-spam email as spam, or it can classify
a spam email as non-spam. For simplicity, we treat these errors equally. It is then natural to measure the
performance of a spam filter by the fraction of emails it correctly classifies.
To help design a good spam filter we have gathered a set of example emails, with each example labelled
as either 0 (non-spam) or 1 (spam). We call this a labelled dataset. In fact, we begin by using two such
labelled datasets. We call one the training dataset, and the other the testing dataset. The training data will be
used to learn the classifier. In contrast, the disjoint testing data will be used to determine the generalization
performance of the classifier to new data. The testing dataset must not be used to select any aspect of the spam
filter. Its dedicated purpose is to determine and report the performance of the final spam filter. However, we
assume that the emails in the training data are representative of the future emails that we need to classify,
i.e., those in the testing data.
One way to think about this is as follows. You use the training data to distill information about the two
classes of email, and thereby design your spam filter f . Then you submit the spam filter to your customer.
The customer applies the classifier to their testing data, and returns the resulting testing performance S(f ).
This scenario makes it very clear that the testing data is not at your disposal to examine, query, or use more


than once. In practice, you are often given, or must form, your own set of testing data. Despite the fact that
the testing data is in your possession, you must reserve it only for testing the final classifier.
One way to form the training and testing sets is to partition the labelled data into two fixed disjoint
subsets. We want the training set to be large so that it is representative of the data that is likely to be
encountered in the future, and we want the testing data to be large so that it can provide a good estimate of
the accuracy of the learned classifier. If the original set of labeled data is sufficiently large, a single split
into disjoint training and testing sets can work well. This is sometimes called the holdout method of testing
classifier performance, since the testing data is “held out” during the training phase.

1.1.1 Default Classifiers


We first give some examples of spam filters that ignore the training data and the email to be classified, and simply output a label. Such classifiers provide a baseline of performance that our learned classifier must improve upon.
Let f0 denote the classifier that labels every email as non-spam. Clearly this is correct on every non-spam email and incorrect otherwise. If the fraction of spam email in the testing data is p, then the testing performance of f0 is:
S(f0) = (1 − p) · 1 + p · 0 = 1 − p.
This classifier performs well when there is little spam (p ≈ 0), but performance decreases linearly to 0 as
p increases to 1. Similarly, let f1 classify every email message as spam. This classifier must have testing
performance S(f1 ) = 1 − S(f0 ) = p.
A randomized classifier outputs a label according to a specified probability mass function. Let fα denote
the randomized classifier that selects the label “spam” with probability α, for some fixed α ∈ [0, 1]. This
yields f0 when α = 0, and f1 when α = 1. For α = 1/2 it corresponds to randomly guessing the label.
You can think of fα as randomizing between f0 and f1 . Each time you want to classify an email, toss a coin
that comes up heads with probability α. If the result is heads, apply the classifier f1 ; otherwise apply the
classifier f0 .
These classifiers ignore the training data (it isn't used at all) and ignore the input email (it's never examined). Hence we don't expect high performance. This is confirmed by the performance curves shown in the left plot of Figure 1.1. We have plotted performance curves over p ∈ [0, 1] because we don't know p; it's a parameter of the testing dataset.
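
To make these baselines concrete, here is a minimal Python sketch (not part of the original notes; the array names are illustrative) that estimates S(f0), S(f1), and S(f1/2) on a synthetic set of test labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic test labels: 1 = spam with probability p, 0 = non-spam.
p = 0.4
y_test = (rng.random(10_000) < p).astype(int)

def f_const(label):
    """Default classifier that outputs the same label for every email."""
    return lambda y: np.full(y.shape[0], label)

def f_random(alpha):
    """Randomized classifier: output 1 (spam) with probability alpha."""
    return lambda y: (rng.random(y.shape[0]) < alpha).astype(int)

def accuracy(classifier, y):
    """Fraction of emails correctly classified (the emails themselves are never examined)."""
    return (classifier(y) == y).mean()

print("S(f0)   ≈", accuracy(f_const(0), y_test))      # ≈ 1 - p
print("S(f1)   ≈", accuracy(f_const(1), y_test))      # ≈ p
print("S(f1/2) ≈", accuracy(f_random(0.5), y_test))   # ≈ 0.5
```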

1.2 Learning the Prior Probability of Labels


Using the training labels we can readily compute the fraction of each class in the training data. If p̂ denotes
the fraction of spam, then the fraction of non-spam must be 1 − p̂. Think of p̂ as an estimate of the prior
probability that an email is spam. This uses the training data in a very simple way.
The estimate p̂ can now be used to select a randomized classifier from the set

F = {fα : α ∈ [0, 1]},

by setting α = p̂. The selected randomized classifier outputs the label 1 with probability p̂. Its testing
performance is:

S(fp̂ ) = (1 − p)(1 − p̂) + pp̂ = 1 − (p + p̂) + 2pp̂ (1.1)


= 1 − 2p(1 − p) if p̂ = p. (1.2)

Figure 1.1: Performance of simple spam filters. On each graph, the vertical axis is the fraction of correctly classified
emails, and the horizontal axis is the fraction p of spam emails, in each case for the testing data. Left: The performance
of the classifiers f0 , f1 , and f1/2 . Center: The performance of the classifiers fp̂ , and gp̂ for the ideal situation p̂ = p.
Right: The performance of fp̂ , and gp̂ when p̂ differs from p by up to 20%.

Notice that we reduced the entire training dataset to a single scalar quantity p̂, and then used this value to select a classifier from the fixed parameterized family F. Hence the selected classifier fp̂ is a function of the training data. Here is another example. Use p̂ to select a classifier from the set {f0 , f1 } using the rule: gp̂ = f0 if p̂ ≤ 0.5, and gp̂ = f1 otherwise.

For the ideal situation p̂ = p, the performance curves of fp̂ and gp̂ are shown in the center plot of Figure 1.1. Both classifiers show improved performance over the trivial classifiers f0 , f1 , f1/2 , but performance
remains low because we are still ignoring the content of the input email.
Under the assumption that the training dataset is representative of the testing dataset, we expect p̂ to be a
good approximation to the fraction p of spam emails in the testing data. Let’s examine what happens to the
testing performance of the above classifiers when p̂ differs from p. The plot on the right side of Figure 1.1
displays the performance of fp̂ and gp̂ when p̂ and p differ by up to 20%. Notice that the classifiers don’t
always generalize well to the testing data. There has been a tradeoff. The classifiers are tuned to the training
data and this improves performance when p̂ = p, but it can also reduce robustness to variations between the
testing and training data.
The classifiers fp̂ and gp̂ are special instances of a common theme in machine learning. First fix a
parameterized family of classifiers. Then use information derived from the training data to select a particular
classifier from this family. In this way, the selected classifier is a function of the training data.
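
As a small illustration of this theme (a sketch, not the notes' code; the data is synthetic and the names are illustrative), the following selects fp̂ and gp̂ from their families using only the training labels:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training and testing labels (1 = spam), drawn with the same prior.
p_true = 0.4
y_train = (rng.random(3000) < p_true).astype(int)
y_test = (rng.random(2000) < p_true).astype(int)

# Reduce the entire training set to a single scalar: the fraction of spam.
p_hat = y_train.mean()

# f_{p_hat}: output 1 with probability p_hat (expected accuracy 1 - (p + p_hat) + 2 p p_hat).
pred_f = (rng.random(y_test.shape[0]) < p_hat).astype(int)

# g_{p_hat}: commit to f0 or f1 according to which class dominates the training data.
pred_g = np.full(y_test.shape[0], 1 if p_hat > 0.5 else 0)

print("p_hat      =", round(p_hat, 3))
print("S(f_p_hat) ≈", (pred_f == y_test).mean())
print("S(g_p_hat) ≈", (pred_g == y_test).mean())
```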

1.3 Numerical Features


To improve classifier performance further, we need to determine features of email messages that help distinguish spam from the real thing. To keep things simple, we restrict attention to features consisting of a scalar
numerical value derived from the email. For example, the number of sentences, and the average length of
sentences, are numerical features of an email. So are the occurrence rates of particular symbols, character
types, words, word types, and so on. We will call a feature of this form a univariate email feature.
We are particularly interested in univariate features that vary in value between spam and non-spam
emails. For example, in its early days, spam was often abundantly sprinkled with the symbols “!” and “$”.


The Spambase Dataset


The spambase dataset is a publicly available spam email dataset [12]. It provides a vector of 57 numerical
features for each of 4,601 labelled emails. The first 48 features are the occurrence rates of particular words, the
next 6 are occurrence rates of particular symbols, and the last 3 are related to the use of capital letters. For a
word, the occurrence rate is the number of times that word appears divided by the number of words in the email.
For a symbol, it is the number of times that symbol appears divided by the number of characters in the email.
For both words and symbols the rates are expressed as percentages (0 − 100). The 3 run length features have
a different character. The first (crla) is the average run length of capital letters; the second (crll) is the integer
length of the longest run of capital letters; and the third (crlt) is the total number of capital letters [12]. A full
listing of the features is shown below.

0: make 1: address 2: all 3: 3d 4: our 5: over 6: remove


7: internet 8: order 9: mail 10: receive 11: will 12: people 13: report
14: addresses 15: free 16: business 17: email 18: you 19: credit 20: your
21: font 22: 000 23: money 24: hp 25: hpl 26: george 27: 650
28: lab 29: labs 30: telnet 31: 857 32: data 33: 415 34: 85
35: technology 36: 1999 37: parts 38: pm 39: direct 40: cs 41: meeting
42: original 43: project 44: re 45: edu 46: table 47: conference
48: ; 49: ( 50: [ 51: ! 52: $ 53: #
54: crla 55: crll 56: crlt
To explore classification using this dataset, we randomly split the dataset into 60% (2760) training examples and
40% (1841) testing examples. The fraction of spam in the training set is 0.394, and in the testing set is 0.404.
Figure 1.2: The spambase dataset.

As a result, a high occurrence rate of either of these symbols was useful in identifying spam.
We can illustrate these ideas using the Spambase dataset. This publicly available dataset is described
in Figure 1.2. Feature 51 in the Spambase dataset is the occurrence rate of the symbol “!” in an email. To
determine if the values of this particular feature vary between spam and non-spam email, we plot histograms
of the feature values for each class over the training data. These are shown superimposed in Figure 1.3. As
you can see, the spam and non-spam histograms differ significantly. This confirms that feature 51 is indeed
relevant for identifying spam.
When we examine one feature in isolation, we are doing a univariate analysis. For example, examining the histograms of each feature one at a time is a univariate analysis. It gives a subjective measure of the ability of each feature (in isolation) to distinguish the two classes. Designing a classifier based on a single feature and testing how well it can distinguish spam from non-spam is also a univariate analysis. This would provide a quantitative measure of this feature's potential (considered in isolation) for distinguishing email spam. For the moment, we will restrict our attention to such univariate analyses.
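
The univariate explorations above can be reproduced with a few lines of numpy. The sketch below assumes the UCI spambase.data file (comma-separated, 57 feature columns followed by a 0/1 spam label); the variable names are illustrative and the random split will not exactly reproduce the one reported in Figure 1.2.

```python
import numpy as np

# Load the spambase data: 57 feature columns followed by the 0/1 spam label.
data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1].astype(int)

# Random 60/40 split into training and testing data.
rng = np.random.default_rng(0)
perm = rng.permutation(len(y))
n_train = int(0.6 * len(y))
train, test = perm[:n_train], perm[n_train:]
X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]

print("fraction of spam in training data:", round(y_train.mean(), 3))
print("fraction of spam in testing data: ", round(y_test.mean(), 3))

# Univariate look at feature 51 (occurrence rate of "!"): class-conditional
# histograms over shared bins, plotted as densities in the spirit of Figure 1.3.
feat = X_train[:, 51]
bins = np.linspace(0.0, 20.0, 101)                 # 100 bins over the range [0, 20]
h0, _ = np.histogram(feat[y_train == 0], bins=bins, density=True)
h1, _ = np.histogram(feat[y_train == 1], bins=bins, density=True)
```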

1.4 A Simple Univariate Bayes Classifier


To design a binary classifier based on a single feature, we bring in the following model for the data. Assume
that a labelled example is the outcome of a pair of random variables (X, Y). The label Y is a Bernoulli random
variable with parameter p. Denote the two possible outcomes of Y by 1 (spam) and 0 (non-spam). Then
p(1) = p and p(0) = 1 − p. Further assume that given the value of the label, the random variable X has


[Figure: superimposed density histograms of feature 51 on the training data; range [0, 20], 100 bins, classification performance 0.791. Horizontal axis: bin for the feature value "!" (51); vertical axis: probability density; one histogram for non-spam, one for spam.]

Figure 1.3: Histograms of feature 51 ("!") for spam and non-spam training emails. The histograms are plotted as
densities. So the area under each histogram is 1. To obtain the corresponding probability mass function, multiply
each bin value by the bin width (= 0.2). Using a Bayes classifier, these histograms yield training data classification
performance of 0.791.

conditional densities p1(x) ≜ p(x|y = 1) and p0(x) ≜ p(x|y = 0).
Our agreed metric for selecting a classifier is maximizing the probability of success, or equivalently,
minimizing the probability of error. Hence given x, we want to select the label value y ∈ {0, 1} that
maximizes p(y|x). Using Bayes rule one can write

p(y|x) = p(x|y) p(y) / p(x).    (1.3)

Since the term p(x) in the denominator of (1.3) doesn’t depend on y, it doesn’t influence the maximization.
Hence the desired classifier is
f(x) = arg max_{k ∈ {0,1}} p(k) pk(x), i.e., f(x) = 1 if p(1)p1(x) > p(0)p0(x), and f(x) = 0 otherwise.    (1.4)

This is called the Bayes classifier for the given model. Because our objective is to minimize the probability
of error, the Bayes classifier results in making the decision with maximum a posteriori probability (MAP).
This special form of Bayes classifier is called a MAP classifier.
Here are two alternative expressions for a binary MAP classifier (1.4):
f(x) = 1 if p1(x)/p0(x) > p(0)/p(1), and 0 otherwise; equivalently,
f(x) = 1 if ln(p1(x)/p0(x)) > ln(p(0)/p(1)), and 0 otherwise.    (1.5)

The quantity p1(x)/p0(x) is called the likelihood ratio. When x is known and fixed, pk(x) is called the likelihood of label k. It can be thought of as a measure of the “evidence” provided by x in support of deciding that the label is k. The likelihood ratio compares the two likelihoods under the observation x. Since a strictly monotone increasing function is order preserving, we can take the natural log of both sides of the likelihood ratio comparison without changing the result. This yields the second expression for f(x) in (1.5). The term ln(p1(x)/p0(x)) in this expression is called the log-likelihood ratio.

Performance
The performance of the MAP classifier is determined as follows. There are two mutually exclusive ways to
make a correct decision: f (x) = 1 and the joint outcome is (x, 1), or f (x) = 0 and the joint outcome is
(x, 0). For the first case, p(x, 1) = p1 (x)p(1), and for the second, p(x, 0) = p0 (x)p(0). The contribution to
the success of f at outcome x is thus

p(1)p1 (x)f (x) + p(0)p0 (x)(1 − f (x)).

The MAP classifier performance is then obtained by integrating over x:


S(f) = p(1) ∫R p1(x) f(x) dx + p(0) ∫R p0(x) (1 − f(x)) dx    (1.6)
     = p(1) ∫{f(x)=1} p1(x) dx + p(0) ∫{f(x)=0} p0(x) dx.    (1.7)

Estimating a MAP classifier on the Spambase dataset


The Bayes classifier model is an approximation to reality. Nevertheless, it does give some valuable insights.
In practice, we know neither p, nor the conditional class densities p0 (x) and p1 (x). But we can set out to
estimate these quantities from the training data and thereby obtain an empirical approximation to the MAP
classifier.
We can estimate the probability of spam p to be the fraction p̂ of spam in the training data. There are
several ways we could obtain estimates for the two conditional densities. Probably the simplest approximation is obtained using histograms H0 , H1 of the scalar feature values over each class. In this case, we do
not use x as the basis of classification, but instead the histogram bin b(x) into which x falls. Hence it is
important that the two class histograms H0 and H1 use the same bins. For simplicity, we assume that there
are N bins of uniform width covering a fixed interval of scalar values.
Once p̂, H0 and H1 are computed, the MAP classifier is formed as before. However, this time the
classifier selects a label for each bin, i.e., f : b(x) ↦ ŷ. When an email needs to be classified, we first
determine the value of the feature being used. Then determine the bin into which this value falls. The email
is then assigned the label of that bin.
By replicating the previous derivation we obtain the following expression for the classifier
f(b) = 1 if p̂ H1(b) > (1 − p̂) H0(b), equivalently if H1(b)/H0(b) > (1 − p̂)/p̂, and f(b) = 0 otherwise.

This is the MAP classifier for the given histograms H0 , H1 , and prior probability of spam p̂. Note that
the number of bins used in the histograms is a selectable parameter. So we have a family of classifiers
parameterized by an integer variable N . The value of N must be selected before learning the probabilistic model
from the training data. Quantities of this form are often called hyperparameters. In contrast, p̂ is a parameter
of the model learned directly from the training data.


The overall performance of the classifier on the training data is given by


S(f) = p̂ Σ_{b=1}^{N} H1(b) f(b) + (1 − p̂) Σ_{b=1}^{N} H0(b) (1 − f(b)).    (1.8)
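
A minimal sketch of this empirical MAP classifier for a single scalar feature is given below (assumed inputs: 1-D arrays of feature values and 0/1 labels; the function names are illustrative, not the notes' code):

```python
import numpy as np

def fit_histogram_map(x_train, y_train, n_bins, lo, hi):
    """Estimate p_hat and the per-class bin probabilities H0, H1 on shared bins,
    then assign each bin its MAP label: 1 iff p_hat*H1(b) > (1-p_hat)*H0(b)."""
    edges = np.linspace(lo, hi, n_bins + 1)
    p_hat = y_train.mean()
    H0, _ = np.histogram(x_train[y_train == 0], bins=edges)
    H1, _ = np.histogram(x_train[y_train == 1], bins=edges)
    H0 = H0 / max(H0.sum(), 1)          # probability mass per bin, class 0
    H1 = H1 / max(H1.sum(), 1)          # probability mass per bin, class 1
    f = (p_hat * H1 > (1 - p_hat) * H0).astype(int)
    return edges, f, p_hat, H0, H1

def predict(x, edges, f):
    """Assign each value the label of the bin b(x) into which it falls."""
    b = np.clip(np.digitize(x, edges) - 1, 0, len(f) - 1)
    return f[b]

def training_performance(f, p_hat, H0, H1):
    """Equation (1.8): S(f) = p_hat*sum_b H1(b) f(b) + (1-p_hat)*sum_b H0(b)(1-f(b))."""
    return p_hat * (H1 * f).sum() + (1 - p_hat) * (H0 * (1 - f)).sum()

# Example use (x_train, x_test hold one standardized feature; y_train, y_test its labels):
# edges, f, p_hat, H0, H1 = fit_histogram_map(x_train, y_train, 100, -1.0, 19.0)
# print(training_performance(f, p_hat, H0, H1))
# print((predict(x_test, edges, f) == y_test).mean())
```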

We are now in a position to make an initial assessment of whether a feature in the spambase dataset
is informative of class membership. To do so, we estimate a Bayes classifier using the feature’s training
data and report this classifier’s performance. The results of classifier training and testing on each of the 57
univariate features in the Spambase dataset are shown in Figure 1.4. For all features we set N = 100. The
scalar data for each feature was first standardized by subtracting its sample mean from each example, and
scaling the resulting values to have unit variance.

[Figure: Bayes classifier performance on each feature (standardized data, range [−1.0, 19.0], 100 bins). For each of the 57 features (make, address, ..., crla, crll, crlt), training and testing accuracy are plotted as the fraction correct, together with the f0 baseline.]

Figure 1.4: Univariate classification results for each feature in the Spambase dataset. An empirical Bayes classifier
using 100 histogram bins was trained on the (standardized) feature. The plot suggests that approximately half of the
features are informative of class membership.

1.5 Underfitting, Overfitting, and Generalization


We now consider the selection of the hyperparameter N . Figure 1.5 plots the classification results for six of the features versus N (the number of bins used). The results indicate a number of important points. First, these six features are informative of class membership. Both testing and training performance are well above the default levels given by f0. But we can also make several other important observations. If we choose too few bins¹, the resulting classifier fails to reach peak performance. This is apparent in the initial region of the curves below about 20 bins. This phenomenon is called underfitting. In this regime, the classifier structure doesn't have the flexibility to fully capture the relevant information in the training data. On the other hand, if we use too many bins, we see gradually improving training performance but gradually decreasing testing performance. This phenomenon is called overfitting. In this regime, the classifier has too much flexibility and is capturing patterns in the training data that do not hold in the testing data. In this regime one says that the classifier performance doesn't generalize (from the training data) to the testing data.
¹ Since the range is fixed, increasing the number of bins is equivalent to decreasing bin width.
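
A sketch of the kind of sweep behind Figure 1.5 is shown below; it compresses the histogram MAP classifier of Section 1.4 into one helper and assumes the feature arrays and labels already exist (names are illustrative, not the notes' code):

```python
import numpy as np

def bin_map_accuracy(x_tr, y_tr, x_te, y_te, n_bins, lo=-1.0, hi=19.0):
    """Fit the histogram MAP classifier with n_bins bins; return (train_acc, test_acc)."""
    edges = np.linspace(lo, hi, n_bins + 1)
    p_hat = y_tr.mean()
    H0, _ = np.histogram(x_tr[y_tr == 0], bins=edges)
    H1, _ = np.histogram(x_tr[y_tr == 1], bins=edges)
    H0, H1 = H0 / max(H0.sum(), 1), H1 / max(H1.sum(), 1)
    f = (p_hat * H1 > (1 - p_hat) * H0).astype(int)      # MAP label of each bin

    def acc(x, y):
        b = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
        return (f[b] == y).mean()

    return acc(x_tr, y_tr), acc(x_te, y_te)

# Sweep the hyperparameter N: too few bins underfits, too many bins overfits.
# for n in range(20, 501, 20):
#     tr, te = bin_map_accuracy(x_train[:, 51], y_train, x_test[:, 51], y_test, n)
#     print(n, round(tr, 3), round(te, 3))
```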


[Figure: six panels of Bayes classifier performance versus the number of bins (40 to 480) on standardized data, for the features "our" (04), "remove" (06), "free" (15), "(" (49), "!" (51), and crll (55). Each panel shows training and testing accuracy (fraction correct) together with the f0 baseline.]

Figure 1.5: Classification performance of the Bayes Classifier based on single feature histograms as a function of the
number of histogram bins. The top plots are for three word occurrence rate features: “our”, “remove” and “free”. The
bottom plots are for two symbol occurrence rates (“(” and “!”) and a feature that measures the longest run length of
capital letters.

The ability to generalize from training to testing data is one of the main goals of machine learning. Finally,
we note that a “good” value for N depends on the particular feature being used. Moreover, we can’t use the
testing data to determine this value; we can only use the training data. That is a new problem that needs to
be solved.

1.5.1 k-Fold Cross-Validation for Training and Testing


k-fold cross-validation (k-fold CV) starts by partitioning a labelled dataset into k disjoint, balanced subsets
(folds). By balanced we mean that each fold is (approximately) the same size, and has the same proportion
of class labels as the entire labelled dataset. This is also referred to as stratified k-fold cross-validation. The
folds are then used to train and test a classifier as follows. A union of (k − 1) folds is used as the training
data, and the left-out fold is used as the testing data. This is done for each of the k possible left-out folds.
So the classifier is trained and tested k times. The results obtained yield a mean test accuracy, and provide
a measure of the uncertainty in this estimate. The scheme is illustrated in Figure 1.6.
The hold out scheme of splitting the labelled data into one training set and one testing set provides
one round of classifier training and testing, and yields one estimate of classifier performance. The k-fold
cross-validation procedure is a natural extension of this idea. It uses all of the labelled data to obtain k
test performance results, each on a disjoint set of testing data. This results in k estimates of classifier
performance. However, these k results are not independent since the training datasets are not disjoint. The


results provide an average performance and information on the spread of performance about the average. The
scheme is slightly more complex than the hold out method, and requires a k-fold increase in computation.
Nevertheless, it is a useful and widely used heuristic for evaluating classifier performance.

Figure 1.6: Training and testing of a classifier using k-fold cross-validation.
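
A short sketch of this procedure, assuming scikit-learn is available and using its nearest-centroid classifier (a linear classifier described in Section 1.6.1) as a stand-in for the classifier being evaluated:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestCentroid

def kfold_cv_accuracy(X, y, k=5, seed=0):
    """Stratified k-fold CV: train on k-1 folds, test on the left-out fold, k times."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = NearestCentroid().fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)      # mean accuracy and its spread

# Example use: mean_acc, spread = kfold_cv_accuracy(X_labelled, y_labelled, k=5)
```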

1.5.2 Hyperparameter Selection


To select the value of a classifier hyperparameter, we need to be able to assess performance on data not used for training (and we can't use the testing data). One way to accomplish this is by partitioning the training data into a learning subset and a disjoint validation subset. This variation on the hold out method is illustrated in Figure 1.7.
The labelled dataset is first split into training data and testing data, and the testing data is set aside for
later use. The training data is further subdivided into a training subset and a disjoint validation subset². To
select a classifier hyperparameter, you train the classifier for each hyperparameter value in an agreed set,
using the training subset. Then you evaluate the classifier’s performance on the validation subset. When
all hyperparameter values have been tried, you select the value yielding the best validation performance. You can then retrain the classifier on all of the training data using the selected hyperparameter value. Then report its performance on the test data³.
This fixed partitioning of the training data is simple, and has low computational overhead. However,
one must take care to ensure that the training subset is not too small, and that the validation subset is large
enough to yield reliable estimates of performance. When this is not the case, the resulting selection of the
hyperparameter value may be sensitive to the particular split of the training data that was chosen. This is a
concern when you begin with a small labelled dataset.

Using k-fold Cross-Validation


An alternative approach to hyperparameter selection uses k-fold cross-validation. The labelled dataset is first
split into training data and testing data, and the testing data is set aside for later use. The training data is then
partitioned into k disjoint balanced folds. To evaluate classifier performance for a particular hyperparameter
value, loop through the k splits of training data obtained by training on k − 1 folds and evaluating performance
² Some authors swap the terms validation data and testing data.
³ Are there situations in which going directly to the test data might be preferred?


Figure 1.7: Hyperparameter selection by subdividing the training set into fixed training and validation subsets.

Figure 1.8: Hyperparameter selection using k-fold cross-validation.

on the left-out fold. This results in k estimates of performance. Around this loop place a selection loop
that cycles through the hyperparameter values in contention and evaluates each classifier’s performance in
the above manner. After the classifiers have been trained and evaluated, select the hyperparameter value
that gave the best results. You can then train the classifier on all of the training data using the selected
hyperparameter value. Then report its performance on the test data. The scheme is illustrated in Figure 1.8.
Because k-fold cross-validation obtains k performance estimates for every value of the hyperparameter
being examined, it is computationally expensive. Nevertheless it is frequently used.

1.5.3 Using nested k-fold cross-validation


The k-fold cross-validation method can also be used in a nested fashion for hyperparameter selection,
classifier training, and testing. This is done by nesting the parameter selection scheme shown in Figure 1.8
inside the training and testing scheme shown in Figure 1.6.
As usual, we first partition the entire labelled dataset into k balanced folds. We then use an outer loop


Figure 1.9: Classifier hyperparameter selection, training, and testing using nested k-fold cross-validation.

to evaluate classifier performance under the selected hyperparameter value. The outer loop iterates through the k folds
by training on k − 1 folds and testing performance on the corresponding left-out fold. This results in k
estimates of classifier performance. As explained below, it also provides k corresponding hyperparameter
values.
Inside the outer loop we place two inner loops. At the start of each of the above k iterations, we place
a selection loop that iterates through a designated set of hyperparameter values. For each hyperparameter
value we train and evaluate the resulting classifier. This is where a third innermost loop is used. The simplest
way to implement this inner loop is to use k − 2 of the k − 1 current training folds as the training subset,
and the left-aside fold of the current training folds as the validation subset. At the completion of the inner
loop, we obtain k − 1 estimates of performance. These give an estimate of the average performance of
the current classifier, and the spread of its performance about the average. After training and evaluation
using each of the hyperparameter values, we select a “best” value based on the information obtained. We
can then train the classifier using the selected hyperparameter value on all of the k − 1 folds of the current
training data. Then report its performance on the left-out test fold. At the end of the k iterations in the outer
loop, we have k best hyperparameter value selections and k sets of performance metrics. The entire scheme
is illustrated in Figure 1.9. This scheme is both complex and computationally expensive. It is only used
when its computational expense is justified. For example, it may be appealing when the set of labelled data is
limited.
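
The nesting is easier to see in code. The sketch below (assuming scikit-learn, and using a k-nearest neighbor classifier with the number of neighbors as the hyperparameter) follows the scheme above, except that the innermost loop uses its own stratified split of the current training folds rather than literally leaving aside one of the k − 1 folds; the names are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def nested_cv(X, y, hyperparams=(1, 3, 5, 9), k_outer=5, k_inner=4, seed=0):
    """Outer loop: estimate test performance. Inner loops: select the hyperparameter
    using only the current training folds."""
    outer = StratifiedKFold(n_splits=k_outer, shuffle=True, random_state=seed)
    outer_scores, chosen = [], []
    for train_idx, test_idx in outer.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]

        def inner_score(h):
            # Cross-validated accuracy of hyperparameter value h on the training folds only.
            inner = StratifiedKFold(n_splits=k_inner, shuffle=True, random_state=seed)
            scores = []
            for fit_idx, val_idx in inner.split(X_tr, y_tr):
                clf = KNeighborsClassifier(n_neighbors=h).fit(X_tr[fit_idx], y_tr[fit_idx])
                scores.append(clf.score(X_tr[val_idx], y_tr[val_idx]))
            return np.mean(scores)

        best_h = max(hyperparams, key=inner_score)
        # Retrain with the selected value on all current training folds,
        # then report performance on the left-out test fold.
        clf = KNeighborsClassifier(n_neighbors=best_h).fit(X_tr, y_tr)
        outer_scores.append(clf.score(X[test_idx], y[test_idx]))
        chosen.append(best_h)
    return outer_scores, chosen
```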

1.6 Moving from Univariate to Multivariate Classification

We have seen that a single scalar valued feature can yield considerable improvement in classification performance over baseline classifiers. We suspect, however, that combining a complementary set of univariate features is likely to yield even greater performance improvement. This is called multivariate analysis.


1.6.1 Feature Maps


With the goal of multivariate analysis in mind, assume we have selected a set of n (simple, univariate)
numerical features, each measuring a distinct email characteristic. The selected features define a mapping φ
of each email x into a vector of numerical values φ(x) ∈ Rn . We call the function φ a feature map, and φ(x)
a feature vector. The feature vectors φ(x) have n univariate component features, and are hence multivariate
objects.
You can think of φ(x) as a summary of the actual email x. Indeed, at the cost of losing some information,
the actual email x can be replaced by φ(x). A classifier f can then be learned using the set of labelled feature
vectors as numerical surrogates for the email training data. The learned classifier is then tested using the
corresponding numerical surrogates of the email testing data.

Some general comments on feature maps

For several reasons, the use of feature maps is ubiquitous in machine learning. For many applications it is
possible to use feature extraction methods to map the application data into a vector of features in Rn . This
may incur some loss of information, but it allows analysis methods developed for Rn to be used across many
domains. Mapping data to vectors in Rn has several additional advantages. First, by reducing the dimension
of the data it can make the machine learning problem more tractable. Second, it allows us to exploit the
algebraic, metric and geometric structure of Rn to pose and solve machine learning problems. This includes,
for example, using the tools of linear algebra, Euclidian geometry, differential calculus, convex optimization,
and so on. Finally, a well chosen representation can reveal informative features of the data, and hence
simplify the design of a classifier.
In general, there are two forms of feature map: hand-crafted and learned. A hand-crafted feature map
is a pre-specified set of computations derived using insights from the application domain. In applications
where we have good insights, a hand-crafted feature map can provide a concise and informative representation of the data. The Spambase dataset is an example of a hand-crafted feature map.
In complex applications where human insights are less well-honed, the training data can be used to learn
a feature map. This is called representation learning. In principle, the learned features can then be used in
a variety of machine learning tasks. In applications such as image content classification, this approach has
enabled machine learning classifiers to perform on par with humans.
The use of numerical surrogates (or proxies) in place of the real data does have potential pitfalls. For
example, distinct email messages can map to the same feature vector φ(x). Hence one needs to think
carefully about the set of features being used, and any unintended consequences that may result. See [34]
for a discussion of the misuse of proxies in certain applications.

Linear Classifiers

Let’s now focus on the problem of learning a classifier based on the feature vectors of the training data.
From this point forward we drop the explicit notation for the feature map φ, and denote each element of the
training data as a pair (xi , yi ) with xi ∈ Rn the numerical proxy for the i-th training example, and yi its
corresponding label.
The training data specifies two point-clouds in Rn : the cloud of points with label 0, {xi : yi = 0}, and the cloud with label 1, {xi : yi = 1}. Our task is to learn a decision boundary that separates these point clouds into two classes in
a way that matches (to the extent possible) the point labels. Think of this decision boundary as a surface in
Rn that approximately separates the two point clouds.


[Figure: a hyperplane H = {x : wT x + b = 0} in R2, showing the normal direction w, a point p = αw on the line through the origin in direction w, and the intersection point q = −bw/‖w‖².]

Figure 1.10: A hyperplane H in R2 . If b > 0, the half space containing the origin is the negative half space, and the
other half space is the positive half space.

A linear classifier in Rn is a binary classifier that tries to separate the two clouds of training points
using a hyperplane. Recall that a hyperplane is a set of points satisfying an affine equation of the form
wT x + b = 0, for some w ∈ Rn , and b ∈ R. The vector w and scalar b are parameters of the hyperplane.
You can think of the hyperplane as a flat n − 1 dimensional surface in Rn . In R2 a hyperplane is a line, and
in R3 it’s a two dimensional plane.
A hyperplane separates Rn into two half spaces: the positive half space with wT x + b > 0, and the
negative half space with wT x + b < 0. A linear classifier, classifies a point x ∈ Rn according to the half
space in which it is contained. By changing the sign of w if necessary, we can always write the classifier in
the form: f(x) = 1 if wT x > −b (positive half space), and f(x) = 0 if wT x ≤ −b (negative half space).    (1.9)
Here we have included the hyperplane in the negative half space. Notice that classification reduces to
computing the sign of wT x + b.
To explore this further, we use some elementary linear algebra⁴. Consider the line through the origin in
the direction w. This is the set of points {αw : α ∈ R}. Each point p on this line can be specified by a
unique coordinate α ∈ R with p = αw. An easy calculation shows that α = wT p/‖w‖². The line intersects the hyperplane wT x + b = 0 at the point q = −bw/‖w‖² with coordinate α = −b/‖w‖². These points are
illustrated for a hyperplane in R2 in Figure 1.10.
For notational simplicity, assume that ‖w‖ = 1. If x ∈ Rn is orthogonally projected onto the line through the origin in the direction w, the projected point has the coordinate α(x) = wT x. If x lies in the hyperplane, α(x) = −b; if x is in the positive half space, α(x) > −b; and if in the negative half space, α(x) < −b. So the positive half space projects to the half line with α > −b, and the negative half space projects to the half line with α < −b. It follows that equivalent classification can be obtained by orthogonally projecting all points to the line, and then using the 1-D linear classifier g(wT x) = 1 if wT x > −b, and g(wT x) = 0 if wT x ≤ −b.
⁴ The linear algebra needed in the course will be revised in the early chapters.


So a linear classifier uses a linear map to reduce each multivariate example x ∈ Rn to a composite scalar
feature wT x ∈ R. It then classifies the scalar points using a simple thresholding function.
To design a linear classifier we only need to select the direction of the projection line w ∈ Rn and
the threshold −b; a total of n parameters. Over the course of time, many methods have been proposed for
selecting suitable values for w and b. A few are listed below with very brief descriptions. Each will be
discussed in greater detail in subsequent chapters.
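
A minimal sketch of equation (1.9) and the equivalent 1-D view (illustrative code, not from the notes):

```python
import numpy as np

def linear_classify(X, w, b):
    """Label the rows of X by the half space of the hyperplane w^T x + b = 0 (eq. (1.9))."""
    return (X @ w + b > 0).astype(int)

def project_then_classify(X, w, b):
    """Equivalent 1-D classifier: project onto the line in direction w and threshold."""
    w_norm = np.linalg.norm(w)
    alpha = X @ (w / w_norm)                 # coordinate of each projected point
    return (alpha > -b / w_norm).astype(int)

# Tiny check in R^2 with w = (1, 2), b = -1:
# linear_classify(np.array([[0.0, 0.0], [1.0, 1.0]]), np.array([1.0, 2.0]), -1.0) -> [0, 1]
```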

Examples of linear classifiers


• Nearest Centroid (NC). This classifier uses the training data to compute the empirical means µ̂1 and µ̂0 of each class. It then classifies a point x according to the closest mean: fNC(x) = 0 if ‖x − µ̂0‖₂ ≤ ‖x − µ̂1‖₂, and fNC(x) = 1 otherwise. The number of scalar parameters estimated in the learning process is 2n. (A short code sketch of this classifier appears after this list.)

• Linear Discriminant Analysis (LDA). This method uses the training data to fit multivariate Gaussian
densities to each class, under the assumption that the class covariance matrices are the same. So the
method computes the empirical means of each class: µ̂0 and µ̂1 , and estimates a single covariance
matrix Σ̂ for both classes. It then uses the corresponding Bayes classifier. The number of scalar
parameters estimated in the learning process is O(n2 ).
LDA has an alternative but equivalent formulation. This requires selecting w so that the projection
of the class means and the training data onto the line in direction w maximizes the ratio of the squared distance between the projected class means to the variance of the projected points about their class
means.

• The Perceptron. The perceptron is an elementary neural network. For each input x, it computes the
affine function wT x + b and passes the result through a smooth scalar nonlinearity ψ(·) to obtain an
output scalar value. The parameters w and b are selected to minimize a suitable cost function (e.g., Σi (yi − ψ(wT xi + b))²) over the training data using a gradient descent procedure. Once w and b
have been determined, the classification is given by the closest label to ψ(wT x + b). The number of
scalar parameters estimated during the learning process is n + 1.

• The Linear Support Vector Machine (Linear SVM). The linear SVM selects w and b by solving a
convex optimization problem. The optimization objective has two terms. These terms are balanced
using a scalar hyperparameter that needs to be selected. Roughly, one term in the objective function
seeks to position the hyperplane “equally between” the two classes of training examples, and the other
penalizes points that deviate from this objective. The number of scalar parameters estimated during
the learning process is n + 1.
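To make the first of these methods concrete, here is a minimal nearest centroid sketch in NumPy. The data are made-up toy blobs and the code is only an illustration of the rule stated above, not a reference implementation:

    import numpy as np

    def fit_nearest_centroid(X, y):
        """Compute the empirical class means from training data X (N, n) with labels y in {0, 1}."""
        mu0 = X[y == 0].mean(axis=0)
        mu1 = X[y == 1].mean(axis=0)
        return mu0, mu1

    def predict_nearest_centroid(X, mu0, mu1):
        """Assign each row of X the label of the closest class mean (squared Euclidean distance)."""
        d0 = ((X - mu0) ** 2).sum(axis=1)
        d1 = ((X - mu1) ** 2).sum(axis=1)
        return (d1 < d0).astype(int)   # label 0 when d0 <= d1, else label 1

    # toy data: two Gaussian blobs (made up for illustration)
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(2.0, 1.0, (50, 4))])
    y = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]

    mu0, mu1 = fit_nearest_centroid(X, y)
    print((predict_nearest_centroid(X, mu0, mu1) == y).mean())   # training accuracy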

1.6.2 Nonlinear Classifiers


There are many forms of nonlinear classifiers. In general, a Bayes classifier learned from training data will be nonlinear. There is also a wide variety of kernel-based nonlinear classifiers. Deep learning (multilayer neural networks) is naturally nonlinear. We give some examples below. These nonlinear classifiers will also be discussed during the course.
[Figure 1.11: Training and testing performance of some multivariate spam filters using the Spambase dataset. The vertical axis is the fraction of correctly classified emails; the horizontal axis indicates the ML classifier (GNB, LDA, NCen, kNN(3), Ptron, LReg, LSVM, AdaB, RBFSVM, NNet).]

Examples of Nonlinear Classifiers

• k-Nearest Neighbor Classifier. The nearest neighbor classifier estimates a label for a new test example x by assigning it the label of the closest training example. Notice that this classifier makes no attempt to learn from the training data; it simply stores (memorizes) it. Hence the classifier requires an increasing amount of memory as the number of training examples grows. To improve efficiency, the training data is usually preprocessed and stored in a data structure that makes finding the closest example more efficient. Nevertheless, finding the closest training example is the computational bottleneck in performing a classification. The k-nearest neighbor classifier finds the k nearest neighbors and then resolves the label by a weighted voting scheme. (A minimal sketch is given after this list.)

• Gaussian Naive Bayes (GNB). The Gaussian Naive Bayes classifier uses the training data to fit
multivariate Gaussian densities to each class, under the assumption that the covariance matrix for each
class is diagonal. This corresponds to assuming that the features are independent Gaussian random
variables. The method computes the empirical means µ̂1 and µ̂0 , and diagonal covariance matrices
Σ̂0 , Σ̂1 to fit each class. It then uses the Bayes classifier for this estimated model. In general, this
results in a quadratic decision surface. A total of 4n scalar parameters are learned during the training.

• Radial Basis Function SVM (RBF-SVM). This is a nonlinear classifier based on using a kernel function and the linear SVM. Conceptually, it maps the training and test data into a higher dimensional space and then uses a linear SVM in that space. In practice, this is done seamlessly using a function known as a radial basis function kernel.

• Neural Network. A neural network is a concatenation of layers. Each layer consists of a set of
affine combinations followed by applications of a fixed scalar nonlinear function. The first layer
is the input layer where x is presented. The output of the next layer is formed by taking affine
combinations of x each followed by the same scalar nonlinear function. This layer is called the
first hidden layer. This can be repeated to form additional hidden layers. The final output layer
maps the results of the previous hidden layer to two outputs (for a binary classifier). Classification
is accomplished by selecting label 0 if the first output has the larger value, and label 1 otherwise.
The number of parameters in a neural network depends on the number of hidden layers and the widths of these layers. A network with one hidden layer of width m1 would typically have O(m1 n) scalar parameters. The values of these parameters must be learned from the training data. The number, and the sizes, of the hidden layers must be decided prior to training.
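As promised above, here is a minimal brute-force k-nearest neighbor sketch in NumPy. It uses an unweighted majority vote and made-up toy data; a practical implementation would use a weighted vote and a spatial data structure to speed up the neighbor search:

    import numpy as np

    def knn_predict(X_train, y_train, X_test, k=3):
        """Label each test point by a majority vote over its k nearest training points."""
        preds = []
        for x in X_test:
            dists = np.linalg.norm(X_train - x, axis=1)   # distances to all training examples
            nearest = np.argsort(dists)[:k]               # indices of the k closest
            votes = np.bincount(y_train[nearest], minlength=2)
            preds.append(int(np.argmax(votes)))
        return np.array(preds)

    # toy usage with made-up data
    rng = np.random.default_rng(2)
    X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
    y_train = np.r_[np.zeros(30, dtype=int), np.ones(30, dtype=int)]
    X_test = np.array([[0.2, -0.1], [2.8, 3.1]])
    print(knn_predict(X_train, y_train, X_test, k=3))     # expect [0 1]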

1.6.3 Learning Multivariate Classifiers on the Spambase Dataset


All of the classifiers mentioned above were trained and tested using the same random split (60% training,
40% testing) of the labelled examples in the Spambase dataset. The training and testing performance re-
sults are shown in Figure 1.11. Six of the ten classifiers (including two linear classifiers) achieve testing
performance above 90%.

Notes
For extra reading, see the tutorials by Kulkarni and Harman [25], Bousquet, Boucheron, and Lugosi [6], and
the perspective article by Mitchell [31]. The introductory sections of Duda et al. [13], Scholkopf et al [42],
Bishop [5], Murphy [32], Theodoridis [48] are also good resources.
Chapter 2

Vectors and Matrices

For the time being, we will think of datasets as subsets of Rn , the set of n-tuples of real numbers. Doing so
allows us to exploit the algebraic, metric and geometric structure of Rn to pose and solve machine learning
problems. It also brings in the tools of linear algebra, Euclidean geometry, differential calculus, convex
optimization, and so on.
Linear algebra provides an underlying foundation for a great deal of modern machine learning. We
assume a background in linear algebra at the level of an introductory undergraduate course. This chapter
provides a concise review of the critical elements of that material. If you are already an expert, you may
want to skim the chapter before proceeding to the next. If you are a little rusty on the material, use the
chapter to revise and refine your understanding. The chapter will also continue the introduction of our
notational conventions. An excellent notational convention is an asset, but inevitably it will have conflicts
and exceptions. Hence it is essential to learn how to distinguish meaning from context.

2.1 Vectors

Let Rn denote the set of n-tuples of real numbers. By convention, the elements of Rn are called vectors.
We denote elements of Rn by lower case roman letters, e.g., x, y, z and display the components of x ∈ Rn
either by writing x = (x1 , x2 , . . . , xn ), or x = (x(1), x(2), . . . , x(n)), where xi , x(i) ∈ R, i = [1 : n]. For
clarity, we often denote scalars (i.e., real numbers) by lower case Greek letters, e.g., α, β, γ.
There are two core operations on Rn . These are called vector addition and scalar multiplication. These
operations are defined component-wise. For vectors x, y ∈ Rn and scalars α ∈ R,


x + y = (x1 + y1 , . . . , xn + yn ),        αx = (αx1 , αx2 , . . . , αxn ).        (2.1)

Many useful concepts and constructions derive from these two operations.
A finite indexed set of elements in Rn is displayed as x1 , x2 , . . . , xk , or {xj }kj=1 , and an infinite sequence as x1 , x2 , . . . , or {xj }∞j=1 . This notation conflicts with that used for the components of an n-tuple. However, this is unlikely to cause confusion unless we simultaneously refer to the elements of xk . In such cases, we use the alternative notation xk = (xk (1), xk (2), . . . , xk (n)).
2.2 Matrices
Recall that a real n × m matrix is a rectangular array of real numbers with n rows and m columns. We let
Rn×m denote the set of real n × m matrices and denote elements of Rn×m by upper case roman letters, e.g.,
X, Y, Z. A matrix is said to be square if it has the same number of rows and columns. The components (or
entries) of X are denoted by Xi,j , or X(i, j), for i ∈ [1 : n], j ∈ [1 : m]. The first index is the row index, and
the second is the column index. To display the entries of a matrix X ∈ Rn×m we place the corresponding
rectangular array of elements within square brackets, thus
 
X = [ X1,1  X1,2  . . .  X1,m
      X2,1  X2,2  . . .  X2,m
        ..    ..           ..
      Xn,1  Xn,2  . . .  Xn,m ] .

It is often convenient to display or specify a matrix by providing a formula for its i, j-element. To do so we
write X = [Xi,j ], where Xi,j is an expression for the i, j-th element of X. For example, the transpose of a
matrix X ∈ Rn×m is the m × n matrix X T with X T = [Xj,i ].
The set Rn×m has two key operations called matrix addition and scalar multiplication. For X, Y ∈ Rn×m and α ∈ R, these operations are defined component-wise:

X + Y = [Xi,j + Yi,j ],        αX = [αXi,j ].        (2.2)

2.2.1 Vectors as Matrices


Each vector x ∈ Rn can also be regarded as a matrix by writing it as an n × 1 matrix. Such matrices are called column vectors. If x ∈ Rn is interpreted as a column vector, then xT denotes the corresponding 1 × n row vector. When using this matrix interpretation, we display x and xT using square brackets, thus

x = [ x1
      x2
      ..
      xn ] ,        xT = [ x1  x2  . . .  xn ] .

So [x1 , . . . , xn ] denotes a 1 × n matrix and (x1 , . . . , xn ) denotes a vector in Rn . Similarly, a finite set of vectors {xj }mj=1 ⊂ Rn can be written as a matrix X ∈ Rn×m by letting xj be the j-th column of X, j ∈ [1 : m]. Then xTi is the i-th row of the matrix X T .
For X ∈ Rn×m , we let X:,j denote the j-th column of X (a column vector) and Xi,: denote its i-th row
(a row vector).

2.2.2 Matrices as Vectors


In some applications, a data point naturally takes the form of a matrix A. For example, an image. It
is sometimes convenient to map A into a column vector. This operation is called matrix vectorization.
Two simple ways to do this are column-wise or row-wise vectorization. In column-wise vectorization, the

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 27

columns of A are stacked to form a long column vector f (A):

A ∈ Rn×m 7→ f (A) = [ A:,1
                      A:,2
                       ..
                      A:,m ] ∈ Rnm .

Here f denotes the function that maps the matrix A ∈ Rn×m into the vector f (A) ∈ Rnm . It is easy to
verify that this mapping respects the operations of matrix addition and scalar multiplication:

(∀A, B ∈ Rn×m ) : A + B 7→ f (A) + f (B)


(∀α ∈ R)(∀A ∈ Rn×m ) : αA 7→ αf (A)

A function that satisfies these properties is called a linear function. In addition, since we know n and m, the
mapping f is invertible. We can recover A from a = f (A) by simply “unstacking” the columns of a. Thus
column-wise vectorization is an isomorphism between the vector spaces Rn×m and Rnm . See Appendix A
for more details. As far as the algebraic operations of the vector spaces are concerned, it does not matter
whether we work in Rn×m or (under the isomorphism f ) in Rnm ; the results obtained will correspond under
the isomorphism. We will examine this issue again after we discuss the geometric structure of Rn and
Rn×m .
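A quick NumPy illustration of column-wise vectorization and its inverse (the matrices below are made-up examples; NumPy implements this column-major stacking via order='F'):

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])        # a 2 x 3 matrix

    a = A.flatten(order='F')               # stack the columns: [1, 4, 2, 5, 3, 6]
    A_back = a.reshape(A.shape, order='F') # "unstack" to recover A

    print(a)
    print(np.array_equal(A, A_back))       # True: the mapping is invertible

    # linearity of the vectorization map f
    B = np.ones((2, 3))
    print(np.array_equal((A + 2 * B).flatten(order='F'),
                         A.flatten(order='F') + 2 * B.flatten(order='F')))   # True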

2.3 Vector Spaces


A vector space consists of a set of vectors, and a set of scalars, equipped with two algebraic operations:
vector addition, and scalar multiplication. The operations of vector addition and scalar multiplication must
satisfy certain axioms. These are listed in Appendix A.
The set of vectors Rn , with the scalars R, and the operations defined in (2.1), is probably the best known
example of vector space. However, there are many others. For example, the set Rn×m of real n × m
matrices, with the scalars R, and the operations of matrix addition and scalar multiplication is also a vector
space. Below we highlight useful properties of vector spaces in the context of Rn . Most of the concepts
apply in any vector space.

Linear Combinations
We say that z is a linear combination of the vectors x1 , . . . , xm using scalars α1 , . . . , αm if z = Σ_{j=1}^{m} αj xj . The span of a set of vectors {xj }mj=1 ⊂ Rn is the set of all linear combinations of its elements:

span{x1 , . . . , xm } = {x : x = Σ_{j=1}^{m} αj xj , for some scalars αj , j ∈ [1 : m]}.

Subspaces
A subspace of Rn is a subset U ⊆ Rn that is closed under linear combinations of its elements. So for any x1 , . . . , xk ∈ U and any scalars α1 , . . . , αk , Σ_{j=1}^{k} αj xj ∈ U.
Example 2.3.1. Some examples:
(a) For any set of vectors x1 , . . . , xm ∈ Rn , U = span{x1 , . . . , xm } is a subspace of Rn .
(b) Let w ∈ Rn be nonzero, and consider the set of vectors W = {x : x = αw, α ∈ R}. This is called
the line in Rn in the direction of w. Since W = span(w), it is a subspace of Rn .
(c) U = Rn and U = {0n } are trivial subspaces of Rn (Note: 0n denotes the zero vector in Rn ).

(d) The set of symmetric matrices {A ∈ Rn×n : A = AT } is a subspace of Rn×n .

Matrix Range, Null space, and Rank

For X ∈ Rn×m , the span of the columns of X is a subspace of Rn . This is called the range of X and
denoted by R(X). Similarly, the set of vectors a ∈ Rm with Xa = 0n is a subspace of Rm . This is called
the null space of X and denoted by N (X).
The rank of X ∈ Rn×m is the dimension of the range of X. Hence the rank of X equals the number of
linearly independent columns in X. The rank of X also equals the number of linearly independent rows of
X. Thus rank(X) ≤ min(n, m). The matrix X is said to be of full rank if rank(X) = min(n, m).

Linear independence

A finite set of vectors {x1 , . . . , xk } ⊂ Rn is linearly independent if for every set of scalars α1 , . . . , αk ,

Σ_{i=1}^{k} αi xi = 0  ⇒  αi = 0, i ∈ [1 : k].

Notice that a linearly independent set can’t contain the zero vector. A set of vectors which is not linearly
independent is said to be linearly dependent. The key consequence of linear independence is that every
x ∈ span{x1 , . . . , xk } has a unique representation as a linear combination of x1 , . . . , xk .

Bases

Let U be a subspace of Rn . A finite set of vectors {x1 , . . . , xk } is said to span U, or to be a spanning set
for U, if U = span{x1 , . . . , xk }. In this case, every x ∈ U has a representation as a linear combination of
{x1 , . . . , xk }. However, a spanning set may be redundant in the sense that one or more elements of the set
may be a linear combination of the remaining elements. A basis for U is a finite set of linearly independent
vectors that span U. The spanning property means that every vector in U has a representation as a linear
combination of the basis vectors, and linear independence ensures that this representation is unique. It is a
standard result that every nonzero subspace U ⊆ Rn has a basis, and every basis for U contains the same
number of vectors.

Dimension

A vector space that has a basis is said to be finite dimensional. The dimension of a finite dimensional
subspace U is the number of elements in any basis for U.
For example, it is easy to see that Rn is finite dimensional. The standard basis for Rn is the set of
vectors ej , j ∈ [1 : n], defined by
ej (k) = 1 if k = j, and ej (k) = 0 otherwise.

It is clear that if Σ_{j=1}^{n} αj ej = 0, then αj = 0, j ∈ [1 : n]. Thus the set is linearly independent. It is also clear that any vector in Rn can be written as a linear combination of the ej ’s. Hence e1 , . . . , en is a basis, and Rn is finite dimensional. Thus every basis for Rn has n elements, and Rn has dimension n.
Coordinates
Let {bj }nj=1 be a basis for Rn . The coordinates of x ∈ Rn with respect to this basis are the unique scalars {αj }nj=1 such that x = Σ_{j=1}^{n} αj bj . Every vector uniquely determines and is uniquely determined by its coordinates. For the standard basis, the coordinates of x ∈ Rn are simply the entries of x, since x = Σ_{j=1}^{n} x(j)ej .
You can think of the coordinates of x as an alternative representation of x, specified using the selected
basis. If we choose a different basis, then we obtain a distinct representation. The idea of modifying how
we represent data is important in machine learning.
Example 2.3.2. Let w ∈ Rn be a non-zero vector and consider the line in Rn in the direction of w. This
is a one dimensional subspace of Rn with basis {w}. Every point z on this line can be uniquely written as
z = αw for some α ∈ R. The scalar α is the coordinate of z with respect to the basis {w}.

Notes
We have focused on the vector space Rn , but many of the concepts and definitions have natural extensions
to the vector space of real n × m matrices. We illustrate some of these extensions in the examples and the
exercises. For additional reading see the excellent introductory book by Gilbert Strang [46], and Chapter 0
in Horn and Johnson [22]. For the more technical proofs see Horn and Johnson [22].

Exercises
Exercise 2.1. Show that for any x1 , . . . , xk ∈ Rn , span(x1 , . . . , xk ) is a subspace of Rn .
Exercise 2.2. Given fixed scalars αi , i ∈ [1 : n], show that the set U = {x : Σ_{i=1}^{n} αi x(i) = 0} is a subspace of Rn . More generally, given k sets of scalars {αi(j) }ni=1 , j ∈ [1 : k], show that the set U = {x : Σ_{i=1}^{n} αi(j) x(i) = 0, j ∈ [1 : k]} is a subspace of Rn .
Exercise 2.3. Show that span{x1 , . . . , xk } is the smallest subspace of Rn that contains the vectors x1 , . . . , xk . By this we mean that if V is a subspace with x1 , . . . , xk ∈ V, then span{x1 , . . . , xk } ⊆ V.
Exercise 2.4. For subspaces U, V ⊆ Rn , let

U ∩ V = {x : x ∈ U and x ∈ V}

U + V = {x = u + v : u ∈ U, v ∈ V}
Show that U ∩ V and U + V are also subspaces of Rn .
Exercise 2.5. Show that:
(a) A linearly independent set in Rn containing n vectors is a basis for Rn .
(b) A subset of Rn containing k > n vectors is linearly dependent.
(c) If U is a proper subspace of Rn , then dim(U) < n.
Exercise 2.6. Consider the set of vectors in R4 displayed as the columns of the following matrix:

  [ u1  u2  u3  u4 ] = [ 1   1   1   0
                         1   1  −1   0
                         1  −1   0   1
                         1  −1   0  −1 ]
Show that {u1 , . . . , u4 } is a basis for R4 .
The vector space of real n × m matrices:

Exercise 2.7. Let Rn×m denote the set of n × m real matrices.


(a) Show that together with the field of scalars R, Rn×m is a vector space.
(b) Let Ei,j = ei dTj , where ei denotes the i-th standard basis element in Rn and dj denotes the j-th standard basis
element in Rm . Thus Ei,j is the outer product of ei and dj . Show that the set of matrices {Ei,j : i ∈ [1 : n], j ∈
[1 : m]} is a basis for Rn×m . This is called the standard basis for Rn×m .
(c) Show that the dimension of Rn×m is nm. (Justify your answer.)

Exercise 2.8. A square matrix S ∈ Rn×n is symmetric if S T = S. Show that the set of n × n symmetric matrices is
a subspace of Rn×n . What is the dimension of this subspace?

Exercise 2.9. A real n × n matrix S is antisymmetric if S T = −S. Show that the set of n × n antisymmetric matrices
is a subspace of Rn×n . What is the dimension of this subspace?

Exercise 2.10. Let Rn×n denote the set of n×n real matrices. Let S ⊂ Rn×n denote the subspace of n×n symmetric
matrices and A ⊂ Rn×n denote the subspace of n × n antisymmetric matrices. Show that Rn×n = S + A.
Chapter 3

Matrix Algebra

3.1 Matrix Product


The matrix product of X ∈ Rn×q and Y ∈ Rq×m , denoted by XY , is the n × m matrix defined by

XY = [ Σ_{k=1}^{q} Xi,k Yk,j ].

Notice that the number of columns in the first matrix X must equal the number of rows in the second matrix
Y . Hence for general rectangular matrices, one or both of the products XY and Y X may not exist. For
X ∈ Rn×m and Y ∈ Rm×n both XY and Y X exist, but have different sizes. When X, Y ∈ Rn×n , both
products exist and have the same size, but in general XY ≠ Y X.
When X ∈ Rn×m and y is a column vector of length m,

Xy = [ Σ_{k=1}^{m} Xi,k yk ] = Σ_{k=1}^{m} X:,k yk .

So Xy is the column vector formed by taking a linear combination of the columns of X using the entries of
y. This can be generalized as follows. For X ∈ Rn×q and Y ∈ Rq×m ,

XY = [ XY:,1   XY:,2   . . .   XY:,m ].

So the j-th column of XY is the linear combination of the columns of X using the j-th column of Y .
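These column interpretations are easy to verify numerically. A short NumPy check, using made-up matrices:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(4, 3))            # X in R^{4x3}
    Y = rng.normal(size=(3, 2))            # Y in R^{3x2}

    XY = X @ Y

    # X y is a linear combination of the columns of X with coefficients from y
    y = Y[:, 0]
    combo = sum(y[k] * X[:, k] for k in range(3))
    print(np.allclose(X @ y, combo))             # True

    # the j-th column of XY is X times the j-th column of Y
    print(np.allclose(XY[:, 1], X @ Y[:, 1]))    # True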

3.1.1 Properties of the Matrix Product


For simplicity, we state the properties below for square matrices of the same size.

Lemma 3.1.1. For all A, B, C ∈ Rn×n and for all α ∈ R:

1) A(BC) = (AB)C

2) A(B + C) = AB + AC

3) α(AB) = (αA)B = A(αB) = (AB)α

4) (AB)T = B T AT .

Proof. Exercise.
3.1.2 Identity Matrix


Let ej , j ∈ [1 : n], denote the standard basis for Rn , and let In ∈ Rn×n denote the square matrix with columns e1 , . . . , en . So In = [ e1 e2 . . . en ]. Notice that In has 1’s down the diagonal and zeros
elsewhere. For A ∈ Rn×m , Im is the unit element for right multiplication and In is the unit element for left
multiplication: AIm = A and In A = A.

3.1.3 Outer Product


For nonzero vectors x ∈ Rn and y ∈ Rm , the outer product of x and y is the n × m matrix xy T . This matrix has rank one. To see this, notice that for all w ∈ Rm , (xy T )w = x(y T w) is a scalar multiple of x. Moreover, by suitable choice of w, we can make this scalar any real value. So R(xy T ) = span(x) and the rank of xy T is one.

3.1.4 Linear Combinations as Matrix-Vector Products


A linear combination of {xj }mj=1 using scalars {αj }mj=1 can be written as a matrix equation by letting X be the matrix with j-th column xj , j ∈ [1 : m], and a ∈ Rm be the column vector a = [αj ]. Then

z = Σ_{j=1}^{m} αj xj        and        z = Xa.

The first equation is an expression in the vector space Rn , the second is an equivalent matrix equation. This
interpretation is worth emphasizing: the matrix product Xa forms a linear combination of the columns of
X using the elements of the column vector a.

3.1.5 Schur Product


A second useful matrix product is called the Schur product. It is defined for matrices X, Y ∈ Rn×m of the
same size by
X ⊗ Y = [Xi,j Yi,j ].

Since vectors in Rn can be regarded as n × 1 matrices, this definition also extends to vectors. For x, y ∈ Rn ,

(x ⊗ y)j = xj yj .

3.2 Matrix Inverse


A square matrix A ∈ Rn×n is invertible if there exists A−1 ∈ Rn×n with AA−1 = In and A−1 A = In . In
this case, A−1 is called the inverse of A. Not every (nonzero) matrix is invertible.

Lemma 3.2.1. Let A, B ∈ Rn×n . Then:

(1) If A has an inverse, then the inverse is unique;

(2) If A, B ∈ Rn×n are invertible, so is AB with (AB)−1 = B −1 A−1 ;

(3) If A is invertible so is AT and (AT )−1 = (A−1 )T .
Proof. We prove each part in turn:


(1) If B and C are inverses of A, then B = B(AC) = (BA)C = C.
(2) Use uniqueness together with

(AB)(B −1 A−1 ) = A(BB −1 )A−1 = AA−1 = I


(B −1 A−1 )(AB) = B −1 (A−1 A)B = B −1 B = I

(3) A−1 A = I implies AT (A−1 )T = I; and AA−1 = I implies (A−1 )T AT = I.

Lemma 3.2.2. A ∈ Rn×n is invertible if and only if the columns of A are linearly independent.

Proof. (⇒) First assume that A is invertible. Suppose Ax = 0. So the linear combination of the columns
of A using the entries of x as coefficients is the zero vector. Then 0 = A−1 Ax = x. So x = 0. Hence the
columns of A are linearly independent.
(⇐) Now assume the columns of A are linearly independent. So {Ae1 , . . . , Aen } is a basis. Let B
be the matrix defined on this basis by B(Aej ) = ej , j ∈ [1 : n]. Then BA = I. Suppose Bx = 0.
Writing x as a linear combination of the columns of A we have x = Ay. So 0 = Bx = BAy = y.
Hence y = 0. Thus x = Ay = 0. So the columns of B are linearly independent. Then 0 = BA − I
gives 0 = BAB − B = B(AB − I). Since the columns of B are linearly independent, this implies
AB = BA = I. So B = A−1 .

Lemma 3.2.3. A ∈ Rn×n is invertible if and only if the rows of A are linearly independent.

Proof. Apply the previous lemma to AT .

3.3 Eigenvalues and Eigenvectors


The eigenvectors of a square matrix A ∈ Cn×n are the nonzero vectors x ∈ Cn such that for some λ ∈ C,
called the eigenvalue corresponding to x, Ax = λx. Note that an eigenvector for the eigenvalue λ is
not unique; if x is an eigenvector, so is αx for all nonzero α ∈ C. We have considered the family of
n × n complex matrices since the eigenvalues of a real matrix can be complex with corresponding complex
eigenvectors.

The characteristic polynomial of A is defined to be pA (s) = det(sI − A). It follows from the definition
that the eigenvalues of A ∈ Cn×n are the scalars λ ∈ C such that A − λIn is singular. Hence λ is an
eigenvalue of A if and only if λ is a root of the characteristic polynomial of A. Thus A has n eigenvalues
over the field C. Some of these eigenvalues may be repeated. In general, the characteristic polynomial can be factored as pA (s) = (s − λ1 )m1 (s − λ2 )m2 · · · (s − λk )mk , where Σ_{j=1}^{k} mj = n. In this case, we say that A has k distinct eigenvalues, and that λj is an eigenvalue of A with algebraic multiplicity mj , j ∈ [1 : k].
If λ is an eigenvalue of A, then every nonzero x ∈ Cn with (A − λIn )x = 0 is an eigenvector of A
with eigenvalue λ. We call the null subspace N (A − λIn ) the eigenspace of A for the eigenvalue λ. The
dimension of the eigenspace N (A−λIn ) is called the geometric multiplicity of λ. The geometric multiplicity
equals the maximum number of linearly independent eigenvectors for λ. The geometric multiplicity is at
most equal to (but can be less than) the algebraic multiplicity.
Example 3.3.1. Consider the matrices shown below.


    
M1 = [ 1  1        M2 = [ 1  1        M3 = [ 1  0
       0  2 ] ,           0  1 ] ,           0  1 ] .

M1 has distinct eigenvalues 1 and 2, each with a one dimensional eigenspace. So the algebraic and geometric multiplicities both equal 1. M2 has a single eigenvalue at 1 with algebraic multiplicity 2, but its eigenspace has dimension 1. So the algebraic multiplicity is strictly greater than the geometric multiplicity. M3 has an eigenvalue of 1 with algebraic multiplicity 2. Its eigenspace has dimension 2. In this case, the geometric multiplicity equals the algebraic multiplicity.
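These multiplicities can also be checked numerically. The NumPy sketch below computes the eigenvalues of M2 and obtains the geometric multiplicity as dim N (M2 − λI) = n − rank(M2 − λI):

    import numpy as np

    M2 = np.array([[1.0, 1.0],
                   [0.0, 1.0]])

    evals, evecs = np.linalg.eig(M2)
    print(evals)              # [1. 1.] : eigenvalue 1 with algebraic multiplicity 2

    lam = 1.0
    # geometric multiplicity = dim N(M2 - lam I) = n - rank(M2 - lam I)
    geo_mult = M2.shape[0] - np.linalg.matrix_rank(M2 - lam * np.eye(2))
    print(geo_mult)           # 1 : strictly less than the algebraic multiplicity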

Lemma 3.3.1. Eigenvectors of A ∈ Cn×n corresponding to distinct eigenvalues are linearly indepen-
dent.

Proof. Suppose that x1 , x2 , . . . , xk are eigenvectors with distinct eigenvalues λ1 , λ2 , . . . , λk . Assume that these eigenvectors are not linearly independent. Then there exist scalars such that Σ_j αj xj = 0 with not all of the αj = 0. Remove any terms with αj = 0. What remains is a linear combination of a subset of m eigenvectors, with all αj ≠ 0, that yields the zero vector. Among the set of all such linear combinations,
there is a subset using the smallest value of m. Denote this least value of m by r. Now assume, without loss
of generality, that such a sum can be formed using the first r eigenvectors. So there exists αj 6= 0, j ∈ [1 : r],
such that
Σ_{j=1}^{r} αj xj = 0.        (3.1)

Multiplying both sides of (3.1) by A, using the eigenvalue property, and subtracting from this the result of
multiplying both sides of (3.1) by λr , yields
Σ_{j=1}^{r−1} (λj − λr )αj xj = 0.

Since the λj are distinct and αj 6= 0, all of the coefficients in this sum are nonzero. Hence using nonzero
coefficients, there is a linear combination of r − 1 eigenvectors that yields the zero vector; this is a contra-
diction.

It follows from Lemma 3.3.1 that if A has k distinct eigenvalues, then A has at least k linearly indepen-
dent eigenvectors. However, this is only a lower bound. Depending on the particular matrix, there can be
anywhere from k to n linearly independent eigenvectors.
The following lemma records some other useful results on matrix eigenvalues.

Lemma 3.3.2. For A, B ∈ Cn×n :

(a) trace(A) = Σ_{j=1}^{n} λj (A).

(b) det(A) = Π_{j=1}^{n} λj (A).

(c) AB and BA have the same eigenvalues with the same algebraic multiplicities.

Proof. (a) and (b) are standard results that can be found in any text on linear algebra. (c) is proved in
Exercise 3.8.
3.4 Similarity Transformations and Diagonalization


Let A, B ∈ Cn×n . A and B are said to be similar or related by a similarity transformation, if there exists
an invertible matrix V ∈ Cn×n such that B = V −1 AV .
Proposition 3.4.1. If A, B ∈ Cn×n are similar, then A and B have the same characteristic polynomial,
the same eigenvalues, and the eigenvalues have the same algebraic and geometric multiplicities.

Proof. We first use det(M1 M2 ) = det(M1 ) det(M2 ) and det(M −1 ) = det(M )−1 to show that A and B
have the same characteristic polynomial:

pB (s) = det(sI − V −1 AV ) = det(V −1 (sI − A)V ) = det(V −1 ) det(sI − A) det(V ) = pA (s).

It follows that A and B have the same eigenvalues with the same algebraic multiplicities. Let λ be an
eigenvalue and consider the eigenspace of B for λ:

N (λIn − B) = N (λIn − V −1 AV ) = N (V −1 (λIn − A)V ) = N ((λIn − A)V ) = V −1 N (λIn − A),

where V −1 N (λIn − A) = {x : V x ∈ N (λIn − A)} is the pre-image of N (λIn − A) under V . Since V


has linearly independent columns, dim N (λIn − B) = dim N (λIn − A).

Suppose A and B are similar with B = V −1 AV . Let a ∈ Cn denote the coordinate vector of x ∈ Cn
with respect to the columns of V . So x = V a. Then Ba = V −1 AV a = V −1 Ax. Hence b = Ba is the
coordinate vector of Ax with respect to the columns of V .
The action of A = V BV −1 on x can be separated into three steps: (1) compute the coordinates of x with respect to V , a = V −1 x; (2) multiply this coordinate vector by B to obtain b = Ba; (3) use b as a coordinate vector to give Ax = V b. From this decomposition you see that B is the matrix corresponding to A when we use the basis V .

3.4.1 Diagonalization
If A ∈ Cn×n has n linearly independent eigenvectors, then A is similar to a diagonal matrix, and we say that
A is diagonalizable. To see this, let v1 , . . . , vn be linearly independent eigenvectors of A with Avi = λi vi ,
i ∈ [1 : n]. Form V ∈ Cn×n with V = [ v1 v2 . . . vn ]. Then

AV = A [ v1 v2 . . . vn ] = [ λ1 v1  λ2 v2  . . .  λn vn ] = V Λ,

where Λ ∈ Cn×n is diagonal with the corresponding eigenvalues on the diagonal. It follows that

Λ = V −1 AV and A = V ΛV −1 .

So A is similar to a diagonal matrix with the eigenvalues of A on the diagonal.
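A brief NumPy check of this factorization on a made-up diagonalizable matrix (constructed from an invertible V and chosen eigenvalues):

    import numpy as np

    rng = np.random.default_rng(4)
    V = rng.normal(size=(3, 3))                 # a made-up (almost surely invertible) basis
    Lam = np.diag([2.0, -1.0, 0.5])
    A = V @ Lam @ np.linalg.inv(V)              # A = V Lambda V^{-1}

    # recover a diagonalization numerically
    evals, W = np.linalg.eig(A)                 # columns of W are eigenvectors of A
    print(np.allclose(A @ W, W @ np.diag(evals)))                  # A W = W Lambda
    print(np.allclose(A, W @ np.diag(evals) @ np.linalg.inv(W)))   # A = W Lambda W^{-1}
    print(np.sort(evals.real))                  # eigenvalues (up to ordering): -1, 0.5, 2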

3.5 Symmetric and Positive Semidefinite Matrices


A square matrix S ∈ Rn×n is symmetric if S T = S. The eigenvalues and eigenvectors of real symmetric
matrices have special properties.
Theorem 3.5.1. S ∈ Rn×n is symmetric if and only if S has n real eigenvalues and n orthonormal
eigenvectors.

Proof. (⇒) Let Sx = λx with x ≠ 0. Since S is real, conjugating Sx = λx gives S x̄ = λ̄x̄. Hence x̄T Sx = λ̄kxk2 and x̄T Sx = λkxk2 . Subtracting these expressions and using x ≠ 0 yields λ = λ̄. Thus λ is real. It follows that x can be
selected in Rn . We prove the second claim under the simplifying assumption that S has distinct eigenvalues.
Let Sx1 = λ1 x1 and Sx2 = λ2 x2 . Then xT2 Sx1 = λ2 xT2 x1 and xT2 Sx1 = λ1 xT2 x1 . Subtracting these
expressions and using the fact that λ1 6= λ2 yields xT2 x1 = 0. Thus x1 ⊥ x2 . For a proof without the
simplifying assumption, see Theorem 2.5.6 in Horn and Johnson [22].
(⇐) We can write S = V ΛV T , where Λ is diagonal with the real eigenvalues of S on the diagonal and
the columns of V ∈ Rn×n are n corresponding orthonormal eigenvectors of S. Here we used the fact that
V −1 = V T . Then S T = (V ΛV T )T = V ΛV T = S.

If we place the n orthonormal eigenvectors of S in the columns of the matrix V and place the corre-
sponding eigenvalues on the diagonal of the diagonal matrix Λ, then SV = V Λ and hence S = V ΛV T .

Positive Semidefinite and Positive Definite Matrices


A symmetric matrix P ∈ Rn×n is positive semidefinite (PSD) if for all x ∈ Rn , xT P x ≥ 0, and positive
definite (PD) if for all x 6= 0, xT P x > 0. For positive semidefinite matrices, we have the following corollary
to Theorem 3.5.1.

Corollary 3.5.1. Let P ∈ Rn×n be a symmetric matrix. Then P is positive semidefinite (respectively
positive definite) if and only if the eigenvalues of P are real and nonnegative (respectively positive).

Proof. Since P is symmetric, all of its eigenvalues are real and it has a set of n real ON eigenvectors. Let x
be an eigenvector of P with eigenvalue λ.
(⇒) Since P is PSD, xT P x = xT λx = λkxk2 ≥ 0. Since kxk2 ≠ 0, λ ≥ 0. If P is PD, then xT P x > 0. Hence λkxk2 > 0. Since kxk2 ≠ 0, we must have λ > 0.
(⇐) Write P = V ΛV T where Λ is diagonal with the nonnegative eigenvalues of P on the diagonal and the columns of V ∈ Rn×n are n corresponding orthonormal eigenvectors of P . Then for any x ∈ Rn , xT P x = xT V ΛV T x = y T Λy = Σ_i λi yi2 ≥ 0, where y = V T x. Hence P is PSD. Now let the eigenvalues of P be positive and x ≠ 0. Then y ≠ 0, and we obtain xT P x = Σ_i λi yi2 > 0.

Notes
You will find all of this material in a good introductory linear algebra textbook. See for example the book
by Gilbert Strang [46]. There are many other similar books. For the more technical proofs see Horn and
Johnson [22].

Exercises
Exercise 3.1. For a n × m real matrix X show that:
(a) R(X) = {z ∈ Rn : z = Xw, for w ∈ Rm } is a subspace of Rn .
(b) N (X) = {a ∈ Rm : Xa = 0} is a subspace of Rm .
Exercise 3.2. Let Aj ∈ Rnj ×m and Nj = {x ∈ Rn : Aj x = 0}, j = 1, 2. Give a similar matrix equation for the
subspace N1 ∩ N2 .
Exercise 3.3. Let Aj ∈ Rn×mj and Rj = {y ∈ Rn : y = Aj x with x ∈ Rmj }, j = 1, 2. Give a similar matrix
equation for the subspace R1 + R2 .
Exercise 3.4. A permutation matrix P ∈ Rn×n is a square matrix of the form P = [ek1 , ek2 , . . . , ekn ], where
ek1 , . . . , ekn is some ordering of the standard basis.
(a) Show that P T is a permutation matrix.
(b) Show that P T is the inverse permutation of P .
(c) Show that y = P T x is a permutation of the entries of x according to the ordering of the standard basis in the
rows of P T (= the ordering in columns of P ).
(d) Show that P T A permutes the rows of A and AP permutes the columns of A in the same way as above.
Exercise 3.5. Let x ∈ Rm and y ∈ Rn . Find the range and null space of the matrix xy T ∈ Rm×n . What is the rank
of this matrix?
Exercise 3.6. Let R denote the right rotation matrix that maps a column vector x = [x1 , x2 , . . . , xn−1 , xn ]T ∈ Rn to
column vector Rx = [xn , x1 , x2 , . . . , xn−1 ]T ∈ Rn .

(a) Find the matrix R with respect to the standard basis.


(b) Show that: (i) R is a permutation matrix (display it using its columns and the standard basis). (ii) R is orthog-
onal, (iii) R is invertible and (iv) find its inverse matrix.
(c) Use the fact that R is a permutation to find compact expressions for R2 , R3 , . . . , Rn . Show that Rn = I = R0 .
(d) Show that I, R, . . . , Rn−1 is a basis for an n-dimensional subspace of Rn×n .
(e) Is the basis in (d) orthogonal? How about orthonormal? (If not how can you make it orthonormal?)

Exercise 3.7. A real n × n circulant matrix has the form:


 
C = [ h1    hn    hn−1  . . .  h2
      h2    h1    hn    . . .  h3
      h3    h2    h1    . . .  h4
       ..                       ..
      hn    hn−1  hn−2  . . .  h1 ]

So C = [h, Rh, R2 h, . . . , Rn−1 h] where h ∈ Rn is the first column of C and R ∈ Rn×n is the right rotation matrix.

(a) Show that the family of real n × n circulant matrices is a subspace of Rn×n .
(b) Show that I, R, . . . , Rn−1 is a basis for the above subspace. [Assume the results of previous questions.]
(c) Show that if C1 , C2 are circulant matrices, so is the product C1 C2 .
(d) Show that all circulant matrices commute.

Exercise 3.8. Let A, B ∈ Cn×n . It always holds that trace(AB) = trace(BA). Hence the sum of eigenvalues of
AB is the same as the sum of eigenvalues of BA. We now show the stronger result that AB and BA have the same
characteristic polynomial and hence the same eigenvalues with the same algebraic multiplicities.
(a) Show that the following 2n × 2n block matrices are similar:
     
M1 = [ AB  0 ; B  0 ],    M2 = [ 0  0 ; B  BA ].    Hint: consider the block matrix [ In  A ; 0  In ] (semicolons separate block rows).
(b) Show that the characteristic polynomial of M1 is sn pAB (s) and of M2 is sn pBA (s).
(c) Use (a) and (b) to show that AB and BA have the same characteristic polynomial, and hence that AB and BA have the same eigenvalues with the same algebraic multiplicities.
(d) Now prove a corresponding result for A ∈ Rm×n and B ∈ Rn×m . Without loss of generality, consider m ≤ n.
Exercise 3.9. Let P, Q ∈ Rn×n be symmetric with P PSD and Q PD. Clearly P and Q have nonnegative eigenvalues.
Show that QP may not be symmetric, but QP always has nonnegative eigenvalues.
Chapter 4

Inner Product Spaces

We now consider the Euclidean geometry of Rn . This geometry brings in the important concepts of length,
distance, angle, orthogonality, and orthogonal projection. Although we focus on Rn , the main concepts also
apply to general finite-dimensional inner product spaces.

4.1 Inner Product on Rn


The Euclidean inner product of vectors x, y ∈ Rn is defined by

<x, y> = Σ_{i=1}^{n} x(i)y(i).        (4.1)

We can equivalently write the inner product as a matrix product: <x, y> = xT y. The following lemma
indicates that this function satisfies the basic properties required of an inner product.

Lemma 4.1.1 (Properties of an Inner Product). For x, y, z ∈ Rn and α ∈ R,

(1) <x, x> ≥ 0 with equality ⇔ x = 0

(2) <x, y> = <y, x>

(3) <αx, y> = α<x, y>

(4) <x + y, z> = <x, z> + <y, z>

Proof. These claims follow from the definition of the inner product via simple algebra.

The Euclidean norm on Rn corresponding to the inner product is given by

kxk = (<x, x>)1/2 = ( Σ_{k=1}^{n} |x(k)|2 )1/2 .

This norm satisfies the following standard properties required of a norm.

Lemma 4.1.2 (Properties of a Norm). For x, y ∈ Rn and α ∈ R:

(1) kxk ≥ 0 with equality if and only if x = 0 (positivity).

(2) kαxk = |α| kxk (scaling).
[Figure 4.1: An illustration of the triangle inequality: kx + yk ≤ kxk + kyk.]

(3) kx + yk ≤ kxk + kyk (triangle inequality).

Proof. Items (1) and (2) easily follow from the definition of the norm. Item (3) can be proved using the
Cauchy-Schwarz inequality. This proof is left as an exercise.

The triangle inequality is illustrated in Figure 4.1. The norm kxk measures the “length” or “size” of the
vector x. Equivalently, kxk is the distance between 0 and x, and kx − yk is the distance between x and y.
If kxk = 1, x is called a unit vector, or a unit direction. The set {x : kxk = 1} of all unit vectors is called
the unit sphere.
The Euclidean inner product and norm satisfy the Cauchy-Schwarz inequality.

Lemma 4.1.3 (Cauchy-Schwarz Inequality). For all x, y ∈ Rn , |<x, y>| ≤ kxk kyk.

Proof. See Exercise 4.10.

For nonzero vectors, the Cauchy-Schwarz inequality can also be written as

−1 ≤ <x, y> / (kxk kyk) ≤ 1.

4.1.1 Orthogonality, Pythagoras, and Orthonormal Bases


Vectors x, y ∈ Rn are orthogonal, written x ⊥ y, if <x, y> = 0. A set of vectors {x1 , . . . , xk } in Rn is
orthogonal if each pair is orthogonal: xi ⊥ xj , i, j ∈ [1 : k] and i 6= j. An orthogonal set of vectors has the
following useful property.
Theorem 4.1.1 (Pythagoras). If {xi }ki=1 ⊂ Rn is an orthogonal set, then k Σ_{j=1}^{k} xj k2 = Σ_{j=1}^{k} kxj k2 .

Proof. Using the definition of the norm and properties of the inner product we have:
k Σ_{j=1}^{k} xj k2 = < Σ_{i=1}^{k} xi , Σ_{j=1}^{k} xj > = Σ_{i=1}^{k} Σ_{j=1}^{k} <xi , xj > = Σ_{j=1}^{k} <xj , xj > = Σ_{j=1}^{k} kxj k2 .

A set of vectors {xi }ki=1 in Rn is orthonormal if it is orthogonal and every vector has unit norm: kxi k =
1, i ∈ [1 : k].
Lemma 4.1.4. An orthonormal set of vectors is a linearly independent set.

Proof. Let {xi }ki=1 be an orthonormal set and suppose that Σ_{i=1}^{k} αi xi = 0. Then for each xj we have 0 = < Σ_{i=1}^{k} αi xi , xj > = αj .

An orthonormal basis for Rn is a basis of n orthonormal vectors. Since an orthonormal set is always
linearly independent, any set of n orthonormal vectors is an orthonormal basis for Rn . A nice property of
orthonormal bases is that it is easy to find the coordinates of any vector x with respect to the basis. To see
this, let {xi }ni=1 be an orthonormal basis and x = Σ_{i=1}^{n} αi xi . Then

<x, xj > = < Σ_i αi xi , xj > = Σ_i αi <xi , xj > = αj .

So the coordinate of x with respect to the basis element xj is simply αj = <x, xj >, j ∈ [1 : n].

Example 4.1.1. The Hadamard basis is an orthonormal basis in Rn with n = 2p . It can be defined recursively as the columns of the Hadamard matrix Hpa , where

H0a = [ 1 ]    and    Hpa = (1/√2) [ Hp−1a    Hp−1a
                                     Hp−1a   −Hp−1a ] .

For example, in R2 , R4 , and R8 , the Hadamard basis is given by the columns of the matrices

H1a = (1/√2) [ 1   1
               1  −1 ] ,

H2a = (1/√4) [ 1   1   1   1
               1  −1   1  −1
               1   1  −1  −1
               1  −1  −1   1 ] ,

H3a = (1/√8) [ 1   1   1   1   1   1   1   1
               1  −1   1  −1   1  −1   1  −1
               1   1  −1  −1   1   1  −1  −1
               1  −1  −1   1   1  −1  −1   1
               1   1   1   1  −1  −1  −1  −1
               1  −1   1  −1  −1   1  −1   1
               1   1  −1  −1  −1  −1   1   1
               1  −1  −1   1  −1   1   1  −1 ] .
The verification that the Hadamard basis is orthonormal is left as an exercise.
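A minimal NumPy sketch that builds Hpa by the recursion above, checks that its columns are orthonormal, and reads off coordinates via inner products (the test vector is made up):

    import numpy as np

    def hadamard(p):
        """Recursively build the normalized 2^p x 2^p Hadamard matrix H_p^a."""
        H = np.array([[1.0]])
        for _ in range(p):
            H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)
        return H

    H3 = hadamard(3)
    print(np.allclose(H3.T @ H3, np.eye(8)))       # columns are orthonormal

    # coordinates of a vector x in this basis are the inner products <x, h_j>
    rng = np.random.default_rng(5)
    x = rng.normal(size=8)
    alpha = H3.T @ x                                # coordinate vector
    print(np.allclose(H3 @ alpha, x))               # x is recovered from its coordinates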

Example 4.1.2. The Haar basis is an orthonormal basis in Rn with n = 2p . The elements of the basis can be arranged into groups. The first group consists of one vector: (1/√2p ) 1. The next group also has one vector: the entries in the first half of the vector are 1/√2p and those in the second half are −1/√2p . Subsequent groups are derived by subsampling by 2, scaling by √2, and translating. We illustrate the procedure below for p = 1, 2, 3. In each case, the Haar basis consists of the columns of the Haar matrix Hp :

H1 = [ 1/√2   1/√2
       1/√2  −1/√2 ] ,

H2 = [ 1/2   1/2   1/√2    0
       1/2   1/2  −1/√2    0
       1/2  −1/2    0     1/√2
       1/2  −1/2    0    −1/√2 ] ,

and H3 is the 8 × 8 matrix whose columns are

(1/√8)(1, 1, 1, 1, 1, 1, 1, 1),     (1/√8)(1, 1, 1, 1, −1, −1, −1, −1),
(1/2)(1, 1, −1, −1, 0, 0, 0, 0),    (1/2)(0, 0, 0, 0, 1, 1, −1, −1),
(1/√2)(1, −1, 0, 0, 0, 0, 0, 0),    (1/√2)(0, 0, 1, −1, 0, 0, 0, 0),
(1/√2)(0, 0, 0, 0, 1, −1, 0, 0),    (1/√2)(0, 0, 0, 0, 0, 0, 1, −1).
4.1.2 Orthogonal Matrices


A (square) matrix Q ∈ Rn×n is orthogonal if QT Q = QQT = In . In this case, the columns of Q form an
orthonormal basis for Rn , and QT is the inverse of Q. Multiplication of a vector by an orthogonal matrix
preserves inner products, angles, norms, distances, and hence Euclidean geometry.

Lemma 4.1.5. If Q ∈ On , then for each x, y ∈ Rn , <Qx, Qy> = <x, y> and kQxk = kxk.

Proof. <Qx, Qy> = xT QT Qy = xT y = <x, y>, and kQxk2 = <Qx, Qx> = <x, x> = kxk2 .

Let On denote the set of n × n orthogonal matrices. We show below that the set On forms a (noncom-
mutative) group under matrix multiplication. Hence On is called the n × n orthogonal group.
Lemma 4.1.6. The set On contains the identity matrix In , and is closed under matrix multiplication
and matrix inverse.

Proof. If Q, W are orthogonal, then (QW )T (QW ) = W T QT QW = In and (QW )(QW )T = QW W T QT =


In . So QW is orthogonal. If Q is orthogonal, Q−1 = QT is orthogonal. Clearly, In is orthogonal.

4.2 An Inner Product on Rn×n


More generally, a vector space X over the real field equipped with a function <·, ·> satisfying the properties
listed in Lemma 4.1.1 is called an inner product space. For example, we can define an inner product on
Rn×m as follows
<A, B> = Σ_{i=1}^{n} Σ_{j=1}^{m} Aij Bij .        (4.2)
This function satisfies the inner product properties listed in Lemma 4.1.1. The corresponding norm, called
the Frobenius norm and denoted by kAkF , is given by
kAkF = (<A, A>)1/2 = ( Σ_{i=1}^{n} Σ_{j=1}^{m} |Aij |2 )1/2 .

The following lemma gives a useful alternative expression for <A, B>.

Lemma 4.2.1. For all A, B ∈ Rn×m , <A, B> = trace(AT B).

Proof. Exercise 4.35.

4.3 Orthogonal Projection


Fix an inner product space X , and let x ∈ X and U be a subspace of X . We consider the fundamental
problem of locating a point in U that is closest to x. We use this simple operation repeatedly.

4.3.1 Projection to a Line


The simplest instance of the problem is when x, u ∈ Rn with kuk = 1, and we seek the closest point to x
on the line span{u}. This can be posed as the constrained optimization problem:

min_{z∈Rn} (1/2) kx − zk2    s.t. z ∈ span{u}.        (4.3)
The subspace span{u} is a line through the origin in the direction u, and we seek the point z on this line that is closest to x. Every point z on the line has a unique coordinate α ∈ R with z = αu. Hence we can equivalently solve the unconstrained problem

α⋆ = arg min_{α∈R} (1/2) kx − αuk2 .

Expanding the objective function in (4.3) and setting z = αu yields

(1/2) kx − zk2 = (1/2) <x − z, x − z> = (1/2) kxk2 − α<u, x> + (1/2) α2 kuk2 .

This is a strictly convex quadratic function of the scalar α. Hence there is a unique value α⋆ that minimizes the objective. Setting the derivative of the above expression w.r.t. α equal to zero gives the unique solution α⋆ = <u, x>. Hence the closest point to x on the line span{u} is

x̂ = <u, x>u.

The associated error vector rx = x − x̂ is called the residual. We claim that the residual is orthogonal to u
and hence to the subspace span{u}. To see this note that

<u, rx > = <u, x − x̂> = <u, x> − <u, x><u, u> = 0.

Thus x̂ is the unique orthogonal projection of x onto the line span{u}, and by Pythagoras we have kxk2 =
kx̂k2 + krx k2 . This result is illustrated on the left in Figure 4.2.

[Figure 4.2: Left: Orthogonal projection of x onto a line through zero. Right: Orthogonal projection of x onto the subspace U.]

We can also write the solution using matrix notation. Noting that <u, x> = uT x, we have

x̂ = (uuT )x = P x
rx = (I − uuT )x = (I − P )x,

with P = uuT . So for fixed u, both x̂ and rx are linear functions of x. As one might expect, these linear
functions have special properties. For example, since x̂ ∈ span{u}, the projection of x̂ onto span{u} must
be x̂. Hence P 2 = P . Such a matrix is said to be idempotent. This property is easily checked using the
formula P = uuT . We have P 2 = (uuT )(uuT ) = uuT = P . We also note that P = uuT is symmetric.
Hence P is both symmetric (P T = P ) and idempotent (P 2 = P ). A matrix with these two properties is
called a projection matrix.
4.3.2 Projection to an Arbitrary Subspace


Now let U be a subspace of Rn with an orthonormal basis {ui }ki=1 . For a given x ∈ Rn , we seek a point z
in U that minimizes the distance to x. This can be stated as the optimization problem

min_{z∈Rn} (1/2) kx − zk2    s.t. z ∈ U.        (4.4)

By uniquely writing z = Σ_{j=1}^{k} αj uj , we can equivalently solve the unconstrained problem:

min_{α1 ,...,αk} (1/2) kx − Σ_{j=1}^{k} αj uj k2 .

Using the definition of the norm and the properties of the inner product, the objective function can be
expanded to:
(1/2) kx − zk2 = (1/2) <x − z, x − z>
             = (1/2) kxk2 − <z, x> + (1/2) kzk2
             = (1/2) kxk2 − Σ_{j=1}^{k} αj <uj , x> + (1/2) Σ_{j=1}^{k} αj2 .

In the last line we used Pythagoras to write kzk2 = Σ_{j=1}^{k} αj2 . Taking the derivative with respect to αj and setting this equal to zero yields the unique solution

α⋆j = <uj , x>, j ∈ [1 : k].

Hence the unique closest point in U to x is


x̂ = Σ_{j=1}^{k} <uj , x> uj .        (4.5)

Moreover, the residual rx = x − x̂ is orthogonal to every uj and hence to the subspace U = span({ui }ki=1 ).
To see this compute

<uj , rx > = <uj , x − x̂> = <uj , x> − <uj , x̂> = <uj , x> − <uj , x> = 0.

Thus x̂ is the unique orthogonal projection of x onto U, and by Pythagoras, kxk2 = kx̂k2 + krx k2 . This
property is illustrated on the right in Figure 4.2. From (4.5), notice that x̂ and the residual rx = x − x̂ are
linear functions of x.
We can also write these results as matrix equations. First let U ∈ Rn×k be the matrix with columns
u1 , . . . , uk . From (4.5) we have
x̂ = Σ_{j=1}^{k} uj (uTj x) = ( Σ_{j=1}^{k} uj uTj ) x = P x,

with P = Σ_{j=1}^{k} uj uTj = U U T . Hence,

x̂ = U U T x
rx = (I − U U T )x.

This confirms that x̂ and rx are linear functions of x and that P is symmetric and idempotent.
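A short NumPy sketch of these formulas (the orthonormal basis below is made up by applying a QR factorization to a random matrix):

    import numpy as np

    rng = np.random.default_rng(6)
    n, k = 6, 2
    # an orthonormal basis for a k-dimensional subspace U of R^n (columns of U)
    U, _ = np.linalg.qr(rng.normal(size=(n, k)))

    P = U @ U.T                       # projection matrix onto the subspace
    x = rng.normal(size=n)
    x_hat = P @ x                     # orthogonal projection of x onto U
    r = x - x_hat                     # residual

    print(np.allclose(P @ P, P), np.allclose(P, P.T))      # idempotent and symmetric
    print(np.allclose(U.T @ r, np.zeros(k)))               # residual is orthogonal to U
    print(np.isclose(np.dot(x, x), np.dot(x_hat, x_hat) + np.dot(r, r)))   # Pythagoras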
4.4 The Orthogonal Complement of a Subspace


The orthogonal complement of a subspace U ⊆ Rn is the subset of vectors that are orthogonal to every
vector in U:

U ⊥ = {x ∈ Rn : (∀u ∈ U) x ⊥ u}.

Lemma 4.4.1. U ⊥ is a subspace of Rn and U ∩ U ⊥ = 0.

Proof. Exercise 4.25.

It is clear that {0}⊥ = Rn and (Rn )⊥ = {0}. For a nonzero vector u, span{u}⊥ is the subspace of Rn with normal u, i.e., the hyperplane through the origin orthogonal to u. When U = span{u}, we write U ⊥ as simply u⊥ .
Given a subspace U in Rn and x ∈ Rn , the projection x̂ of x onto U lies in U, the residual rx lies in U ⊥ ,
and x = x̂ + rx . Because U and U ⊥ are orthogonal, this representation is unique.

Lemma 4.4.2. Every x ∈ Rn has a unique representation in the form x = u + v with u ∈ U and
v ∈ U ⊥.

Proof. By the properties of orthogonal projection, x = x̂ + rx with x̂ ∈ U and rx ∈ U ⊥ . This gives one
decomposition of the required form. Suppose there are two decompositions of this form: x = ui + vi , with
ui ∈ U and vi ∈ U ⊥ , i = 1, 2. Subtracting the expressions gives (u1 − u2 ) = −(v1 − v2 ). Now u1 − u2 ∈ U
and v1 − v2 ∈ U ⊥ , and since U ∩ U ⊥ = 0 (Lemma 4.4.1), we must have u1 = u2 and v1 = v2 .

It follows from Lemma 4.4.2 that U + U ⊥ = Rn . This simply states that every vector in Rn is the sum
of some vector in U and some vector in U ⊥ . Because this representation is also unique, this is sometimes
written as Rn = U ⊕ U ⊥ , and we say that Rn is the direct sum of U and U ⊥ . Exercise 4.26 covers several
additional properties of the orthogonal complement.

4.4.1 N (A)⊥ = R(AT )


The following fundamental result is often useful.

Theorem 4.4.1. Let A ∈ Rn×m have null space N (A) and range R(A). Then N (A)⊥ = R(AT ).

Proof. Let x ∈ N (A). Then Ax = 0 and xT AT = 0T . So for every y ∈ Rn , xT (AT y) = 0. Thus


x ∈ R(AT )⊥ . This shows that N (A) ⊆ R(AT )⊥ . Now for all subspaces: (a) (U ⊥ )⊥ = U, and (b) U ⊆ V
implies V ⊥ ⊆ U ⊥ . Applying these properties yields R(AT ) ⊆ N (A)⊥ .
Conversely, suppose x ∈ R(AT )⊥ . Then for all y ∈ Rn , xT AT y = 0. Hence for all y ∈ Rn , y T Ax = 0.
This implies Ax = 0 and hence that x ∈ N (A). Thus R(AT )⊥ ⊆ N (A) and N (A)⊥ ⊆ R(AT ). We have
shown R(AT ) ⊆ N (A)⊥ and N (A)⊥ ⊆ R(AT ). Thus N (A)⊥ = R(AT ).

4.5 Norms on Rn
The Euclidean norm is one of many norms on Rn . Each of the following functions is also a norm on Rn :

(a) kxk1 = Σ_{j=1}^{n} |x(j)|. This is called the 1-norm.

(b) kxkp = ( Σ_{j=1}^{n} |x(j)|p )1/p for an integer p ≥ 1. This is called the p-norm.
(c) kxk∞ = maxj {|x(j)|}. This is called the max norm.

(d) kxk = (xT P x)1/2 , where P ∈ Rn×n is symmetric positive definite.

Note that the 1-norm and the 2-norm are instances of the p-norm with p = 1 and p = 2, respectively.

Notes
We have only given a brief outline of the geometric structure of Rn . For more details see the relevant
sections in Chapter 2 of Strang [46].

Exercises
Exercise 4.1. The mean of a vector x ∈ Rn is the scalar mx = (1/n) Σ_{i=1}^{n} x(i). Show that the set of all vectors in Rn with mean 0 is a subspace U0 ⊂ Rn of dimension n − 1. Show that all vectors in U0 are orthogonal to 1n ∈ Rn .
Exercise 4.2. Given x, y ∈ Rn find the closest point to x on the line through 0 in the direction of y.
Exercise 4.3. Let P ∈ Rn×n be the matrix for orthogonal projection of Rn onto the subspace U of Rn . Show that P
is symmetric, idempotent, PSD, and that trace(P ) = dim(U).
Exercise 4.4. Let P ∈ Rn×n be the matrix for orthogonal projection of Rn onto the subspace U of Rn . Show that
I − P is symmetric, idempotent, PSD, and that trace(I − P ) = n − dim(U).
Exercise 4.5. Prove or disprove: for any subspace U ⊂ Rn and each x ∈ Rn there exists a point u ∈ U such that
(x − u) ⊥ u.
Exercise 4.6. Prove or disprove: If u, v ∈ Rn are orthogonal, then ku − vk2 = kuk2 + kvk2 .
Exercise 4.7. Prove or disprove: For C ∈ Rn×n , if for each x ∈ Rn , xT Cx = 0, then C = 0.
Exercise 4.8. The correlation of x1 , x2 ∈ Rn is the scalar:
ρ(x1 , x2 ) = <x1 , x2 > / (kx1 k kx2 k).

For given x1 , what vectors x2 maximize the correlation? What vectors x2 minimize the correlation? Show that
ρ(x1 , x2 ) ∈ [−1, 1] and is zero precisely when the vectors are orthogonal.

Basic Properties
Exercise 4.9. Prove that the properties in Lemma 4.1.1 hold for the inner product <x, y> = xT y in Rn .
Exercise 4.10. Use the definition of the inner product and its properties listed in Lemma 4.1.1, together with the
definition of the norm, to prove the Cauchy-Schwarz Inequality (Lemma 4.1.3).
(a) First let x, y ∈ Cn with kxk = kyk = 1.

(1) Set x̂ = <x, y>y and rx = x − x̂. Show that <rx , y> = <rx , x̂> = 0.
(2) Show that krx k2 = 1 − <x̂, x>.
(3) Use the previous result and the definition of x̂ to show that |<x, y>| ≤ 1.

(b) Prove the result when kxk ≠ 0 and kyk ≠ 0.
(c) Prove the result when x or y (or both) is zero.
Exercise 4.11. Prove that for x ∈ Rn the function (xT x)1/2 satisfies the properties of a norm listed in Lemma 4.1.2.
Hint: For the triangle inequality, use the properties of the inner product to expand kx + yk2 and apply the Cauchy-Schwarz inequality.
Exercise 4.12. Show that the Euclidean norm in Cn is:
(a) permutation invariant: if y is a permutation of the entries of x, then kyk = kxk.
(b) an absolute norm: if y = |x| component-wise, then kyk = kxk.
Exercise 4.13. Let X , Y be inner product spaces over the same field F with F = R, or F = C. A linear isometry
from X to Y is a linear function D : X → Y that preserves distances: (∀x ∈ X ) kD(x)k = kxk. Show that a linear
isometry between inner product spaces also preserves inner products.
(a) First examine kD(x + y)k2 and conclude that Re(<Dx, Dy>) = Re(<x, y>).
(b) Now examine kD(x + iy)k2 where i is the imaginary unit.

Orthonormal Bases in Rn
Exercise 4.14. For p = 1, 2, 3, show that the Haar basis in Rn with n = 2p is an orthonormal basis (see Example 4.1.2).
Exercise 4.15. Show that the general Haar basis is orthonormal (see Example 4.1.2). One means to do so is to show
that Hp can be specified recursively via
 
Pp Hp = (1/√2) [ Hp−1   I2p−1
                 Hp−1  −I2p−1 ] ,
where Pp is an appropriate 2p × 2p permutation matrix.
Exercise 4.16. Show that for p = 1, 2, 3, the Hadamard basis is orthonormal (see Example 4.1.1).
Exercise 4.17. Show that the Hadamard basis is orthonormal and that the Hadamard matrix Hpa is symmetric (see
Example 4.1.1). [Hint: the definition is recursive].
Exercise 4.18. Let u1 , . . . , uk ∈ Rn be an ON set spanning a subspace U and let v ∈ Rn with v ∈/ U. Find a point ŷ
on the linear manifold M = {x : x − v ∈ U} that is closest to a given point y ∈ Rn . [Hint: transform the problem to
one that you know how to solve.]
Exercise 4.19. A Householder transformation on Rn is a linear transformation that reflects each point x in Rn about
a given n − 1 dimensional subspace U specified by giving its unit normal u ∈ Rn . To reflect x about U we want to
move it orthogonally through the subspace to the point on the opposite side that is equidistant from the subspace.
(a) Given U = u⊥ = {x : uT x = 0}, find the required Householder matrix.
(b) Show that a Householder matrix H is symmetric, orthogonal, and its own inverse.
Exercise 4.20. Let H ∈ Rn×n be the matrix of a Householder transformation reflecting about a subspace with unit
normal u ∈ Rn .
(a) Show that H has n − 1 eigenvalues at 1, and one eigenvalue at −1. Find an eigenvector for the eigenvalue −1.
(b) Show that H has a complete set of ON eigenvectors.
Exercise 4.21. Let P ∈ Rn×n be a projection matrix with R(P ) = U.
(a) Show that the distinct eigenvalues of P are {0, 1} with the eigenvalue 1 having k = dim(U) linearly indepen-
dent eigenvectors, and the eigenvalue 0 having n − k linearly independent eigenvectors.
(b) Show that P has n orthonormal eigenvectors.
(c) Show that trace(P ) = dim(U).

Orthogonal Matrices
Exercise 4.22. Show that an orthogonal matrix Q ∈ On has all of its entries in the interval [−1, 1].
Exercise 4.23. Let Pn denote the set of n × n permutation matrices. Show that Pn is a (noncommutative) group under
matrix multiplication. Show that every permutation matrix is an orthogonal matrix. Hence Pn is a subgroup of On .
Exercise 4.24. A subspace U ⊆ Rn of dimension d ≤ n can be represented by a matrix U = [u1 , . . . , ud ] ∈ Rn×d
with U T U = Id . The columns of U form an orthonormal basis for U. However, this representation is not unique
since there are infinitely many orthonormal bases for U. Show that for any two such representations U1 , U2 for U there
exists a d × d orthogonal matrix Q with U2 = U1 Q.

Orthogonal Complement
Exercise 4.25. Prove Lemma 4.4.1.
Exercise 4.26. Let X be an inner product space of dimension n, and U, V be subspaces of X . Prove each of the
following:
(a) U ⊆ V implies V ⊥ ⊆ U ⊥ .
(b) (U ⊥ )⊥ = U.
(c) (U + V)⊥ = U ⊥ ∩ V ⊥
(d) (U ∩ V)⊥ = U ⊥ + V ⊥
(e) If dim(U) = k, then dim(U ⊥ ) = n − k

The Inner Product Space Rn×m


Exercise 4.27. For A, B ∈ Rn×m show that the function <A, B> = trace(AT B) satisfies the properties required of
an inner product.
Exercise 4.28. For A, B ∈ Rm×n , Q ∈ Om , R ∈ On show that <QAR, QBR> = <A, B>. Thus inner products
are invariant under orthogonal transformations.
Exercise 4.29. For A ∈ Rm×n , Q ∈ Om , R ∈ On show that kQARkF = kAkF . Thus the Frobenius norm is
invariant under orthogonal transformations.
Exercise 4.30. Let b1 , . . . , bn be an orthonormal basis for Rn , and set Bj,k = bj bkT , j, k ∈ [1 : n]. Show that
{Bj,k }nj,k=1 is an orthonormal basis for Rn×n .
Exercise 4.31. Consider the vector space of real n × n matrices. Let S and A denote the subsets of symmetric
(M T = M ) and antisymmetric (M T = −M ) matrices, respectively. Show that S ⊥ = A and hence that Rn×n = S ⊕ A.
Exercise 4.32. Find the orthogonal projection of M ∈ Rn×n onto the subspace of symmetric matrices in Rn×n .
Similarly, find the orthogonal projection of M onto the subspace of anti-symmetric matrices in Rn×n .
Exercise 4.33. Let u ∈ Rn , v ∈ Rm and A ∈ Rn×m . Find the orthogonal projection of A onto the line span(uv T ).
Exercise 4.34. Let {uj }kj=1 be an orthonormal set in Rn and {vj }kj=1 be an orthonormal set in Rm . Find the orthog-
onal projection of A ∈ Rn×m onto the subspace of matrices U = span(uj vjT , j ∈ [1 : k]) ⊂ Rn×m .
Exercise 4.35. Show that for A, B ∈ Cn×m :
(a) <A, B> = trace(AT B) = trace(B ∗ A).
(b) | trace(B ∗ A)| ≤ trace(A∗ A)1/2 trace(B ∗ B)1/2 .


Chapter 5

Singular Value Decomposition

This chapter reviews the Singular Value Decomposition (SVD) of a rectangular matrix. The SVD extends the
idea of an eigendecomposition of a square matrix to non-square matrices. It has specific applications in data
analysis, dimensionality reduction (PCA), low-rank matrix approximation, and some forms of regression.
We first present and interpret the main SVD result in what is called the compact form. Then we introduce
an alternative version known as the full SVD. After these discussions, we examine the relationship of the
SVD to several norms, then use the SVD to examine the issue of a “best” rank k approximation to a given
matrix. Finally, we turn our attention to the ideas and constructions that form the foundation of the SVD.

5.1 The Compact SVD


Theorem 5.1.1 (Singular Value Decomposition). Let A ∈ Rm×n have rank r ≤ min{m, n}. Then
there exist U ∈ Rm×r with U T U = Ir , V ∈ Rn×r with V T V = Ir , and a diagonal matrix Σ ∈ Rr×r
with positive diagonal entries σ1 ≥ σ2 ≥ · · · ≥ σr > 0, such that
\[
A = U\Sigma V^T = \sum_{j=1}^{r} \sigma_j u_j v_j^T. \tag{5.1}
\]

Proof. See Section 5.5.

Figure 5.1: A visualization of the matrices in the compact SVD.

The factorization (5.1) of A is called a compact singular value decomposition (compact SVD) of A.
The positive scalars σj are called the singular values of A. We also write σj (A) to indicate the j-th singular
value of A. The r orthonormal columns of U are called the left or output singular vectors of A, and
the r orthonormal columns of V are called the right or input singular vectors of A. The compact SVD
decomposition is illustrated in Figure 5.1.


Figure 5.2: A visualization of the three operational steps in the compact SVD. The projection of x ∈ Rn onto N (A)⊥
is represented in terms of the basis v1 , v2 . Here x = α1 v1 + α2 v2 . The singular values scale these coordinates. Then
the scaled coordinates are transferred to the output space Rm and used to form y = Ax as the linear combination
y = σ1 α1 u1 + σ2 α2 u2 .

When r ≪ m, n, the SVD provides a more concise representation of A. In place of the mn entries of A,
the SVD uses the (m+n+1)r parameters required to specify U , V and the r singular values. The conditions
U T U = Ir and V T V = Ir indicate that U and V have orthonormal columns. In general, U U T 6= Im and
V V T 6= In because U and V need not be square matrices. The theorem does not claim that U and V are
unique. We discuss this issue later in the chapter.

Corollary 5.1.1. The matrices U and V in the compact SVD have the following additional properties:

(a) The columns of U form an orthonormal basis for the range of A.

(b) The columns of V form an orthonormal basis for N (A)⊥ .

(c) The rank one matrices uj vjT , j ∈ [1 : r], form an orthonormal set in Rm×n .

Proof. (a) Writing Ax = U (ΣV T x) shows that Ax ∈ R(U ) and hence that R(A) ⊆ R(U ). Let u ∈ R(U ).
Then for some z ∈ Rr , u = U z = U ΣV T V (Σ−1 z) = A(V Σ−1 z). Hence R(U ) ⊆ R(A).
(b) By taking transposes and using part (a), the columns of V form an ON basis for the range of AT . Using
N (A)⊥ = R(AT ) yields the desired result.
(c) Taking the inner product of uk vkT and uj vjT and using the property that trace(AB) = trace(BA)
whenever both products are defined, yields
\[
<u_k v_k^T, u_j v_j^T> = \operatorname{trace}(v_k u_k^T u_j v_j^T) = \operatorname{trace}(u_k^T u_j v_j^T v_k) =
\begin{cases} 0, & \text{if } j \neq k; \\ 1, & \text{if } j = k. \end{cases}
\]

The above observations lead to the following operational interpretation of the SVD. Since the columns
of V form an orthonormal basis for N (A)⊥ , the orthogonal projection of x ∈ Rn onto N (A)⊥
is x̂ = V V T x. Hence V T x gives the coordinates of x̂ with respect to V . These r coordinates are then
individually scaled using the r diagonal entries of Σ. Finally, we synthesize the output vector by using the
scaled coordinates and the ON basis U for R(A): y = U (ΣV T x). So the SVD has three steps: (1) An


Figure 5.3: A visualization of the action of A on the unit sphere in Rn in terms of its SVD.

analysis step: V T x, (2) A scaling step: Σ(V T x), and (3) a synthesis step: U (ΣV T x). In particular, for
x = vk , y = Ax = σk uk , k ∈ [1 : r]. So the r ON basis vectors for N (A)⊥ are mapped to scaled versions of
the corresponding ON basis vectors for R(A). These steps are illustrated in Fig. 5.2. Notice that restricted
to N (A)⊥ , the map A : N (A)⊥ → R(A) is one-to-one and onto and hence invertible.
Finally, we note that the SVD is selecting an orthonormal basis of rank one matrices {uj vjT }rj=1 specifically
adapted to A, and expressing A as a positive linear combination of this basis: A = Σ_{j=1}^r σj (A) uj vjT .
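As a concrete illustration, here is a minimal numerical sketch of the compact SVD (it assumes numpy, which is not otherwise part of these notes). It truncates numpy's full SVD to the numerical rank and checks the properties above.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # a rank-2 matrix in R^{5x4}

U_full, s, Vt_full = np.linalg.svd(A)            # numpy's full SVD; singular values in decreasing order
r = int(np.sum(s > 1e-10))                       # numerical rank
U, Sigma, V = U_full[:, :r], np.diag(s[:r]), Vt_full[:r, :].T   # compact factors

print(np.allclose(U.T @ U, np.eye(r)))           # U has orthonormal columns
print(np.allclose(V.T @ V, np.eye(r)))           # V has orthonormal columns
print(np.allclose(A, U @ Sigma @ V.T))           # A = U Sigma V^T
print(np.allclose(A @ V[:, 0], s[0] * U[:, 0]))  # A v_1 = sigma_1 u_1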

5.2 The Full SVD


There is a second version of the SVD that is sometimes convenient. This second version is frequently called
the SVD. However, to emphasize its distinctness from the equally useful compact SVD, we refer to it as a
full SVD.
The basic idea is straightforward. Let A = Uc Σc VcT be a compact SVD with Uc ∈ Rm×r , Vc ∈ Rn×r , and
Σc ∈ Rr×r . To Uc we add an orthonormal basis for R(Uc )⊥ to form the orthogonal matrix U = [Uc Uc⊥ ] ∈ Rm×m .
Similarly, to Vc we add an orthonormal basis for R(Vc )⊥ to form the orthogonal matrix V = [Vc Vc⊥ ] ∈ Rn×n .
To ensure that these extra columns in U and V do not interfere with the factorization of A, we form Σ ∈ Rm×n
by padding Σc with zero entries:
\[
\Sigma = \begin{bmatrix} \Sigma_c & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{bmatrix}.
\]

We then have a full SVD factorization A = U ΣV T . The full SVD is illustrated in Fig. 5.4. The utility of
the full SVD derives from U and V being orthogonal (hence invertible) matrices. We note from the above
construction that in general the full SVD is not unique.
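The padding construction can be checked numerically. A minimal sketch (again assuming numpy; numpy's svd routine already returns the orthogonal completions of Uc and Vc):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # rank r = 2, m = 5, n = 4
m, n = A.shape

Uf, s, Vtf = np.linalg.svd(A)                     # orthogonal U (m x m) and V (n x n)
r = int(np.sum(s > 1e-10))
Sigma = np.zeros((m, n))                          # pad the r x r diagonal block with zeros
Sigma[:r, :r] = np.diag(s[:r])

U, V = Uf, Vtf.T
print(np.allclose(U @ U.T, np.eye(m)))            # U is orthogonal
print(np.allclose(V @ V.T, np.eye(n)))            # V is orthogonal
print(np.allclose(A, U @ Sigma @ V.T))            # full SVD: A = U Sigma V^T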
If P is a symmetric positive semidefinite matrix, a full SVD of P is simply an eigendecomposition of P :
U ΣV T = QΣQT , where Q is the orthogonal matrix of eigenvectors of P . In this sense, the SVD extends
the eigendecomposition by using different orthonormal sets of vectors in the input and output spaces.
The following theorem is sometimes useful. It says that to maximize the inner product of A and B by
modifying the left and right singular vectors of B, make B’s left and right singular vectors the same as the
corresponding singular vectors for A.


Figure 5.4: A visualization of the matrices in the full SVD.

5.2.1 Some Useful Inequalities


Theorem 5.2.1. Let the entries of the matrices Λ, Ω ∈ Rm×n be zero except on the diagonal where the
entries are nonnegative and ordered by magnitude: λ1 ≥ λ2 ≥ · · · ≥ λq ≥ 0, and ω1 ≥ ω2 ≥ · · · ≥
ωq ≥ 0, with q = min(m, n). Then for each Q ∈ Om and R ∈ On

<Λ, QΩR> ≤ <Λ, Ω> (5.2)

with equality when Q = Im and R = In .

Proof. See Horn and Johnson 2nd Ed, Theorem 7.4.1.1.

Corollary 5.2.1. Let A, B ∈ Rm×n , with full SVDs A = UA ΣA VAT and B = UB ΣB VBT . The
maximum value of <A, QBR> over Q ∈ Om and R ∈ On is <ΣA , ΣB >. This is attained by setting
Q = UA UBT and R = VB VAT .

Proof.
\[
<A, QBR> = \operatorname{trace}(V_A \Sigma_A U_A^T Q U_B \Sigma_B V_B^T R)
= \operatorname{trace}\big(\Sigma_A (U_A^T Q U_B) \Sigma_B (V_B^T R V_A)\big)
= <\Sigma_A,\, S \Sigma_B T>,
\]
where S = UAT QUB and T = VBT RVA are orthogonal matrices. The result now follows from Theorem 5.2.1.

Corollary 5.2.2. Let A, B ∈ Rm×n have full SVD factorizations A = U ΣA V T and B = QΣB RT .
Then kA − Bk2F ≥ kΣA − ΣB k2F .

Proof. Exercise.

5.3 Singular Values and Norms


Several important norms of A ∈ Rm×n are connected to the singular values of A. Examples include
the Frobenius norm, the induced matrix 2-norm (defined below), and the nuclear norm (discussed in the
exercises).


5.3.1 Singular Values and the Frobenius Norm


We first express the Frobenius norm of A ∈ Rm×n using the singular values of A. The following lemma
shows that kAkF is the Euclidean norm of the vector of its singular values.
Lemma 5.3.1. Let A ∈ Rm×n have rank(A) = r and SVD A = U ΣV T = Σ_{j=1}^r σj uj vjT . Then
\[
\|A\|_F = \Big( \sum_{j=1}^{r} \sigma_j^2 \Big)^{1/2}. \tag{5.3}
\]

Proof. The SVD expresses A as a positive linear combination of the orthonormal, rank one matrices uj vjT ,
j ∈ [1 : r]: A = Σ_{j=1}^r σj (A)uj vjT . Applying Pythagoras' Theorem we have
\[
\|A\|_F^2 = \Big\| \sum_{j=1}^{r} \sigma_j u_j v_j^T \Big\|_F^2 = \sum_{j=1}^{r} \sigma_j^2 \|u_j v_j^T\|_F^2 = \sum_{j=1}^{r} \sigma_j^2.
\]
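A one-line numerical check of Lemma 5.3.1 (assuming numpy):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)           # singular values only
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))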

5.3.2 The Induced Matrix 2-Norm


Consider the real vector spaces Rn and Rm , together with the corresponding Euclidean norms. The gain of
a matrix A ∈ Rm×n acting on a unit norm vector u ∈ Rn is kAuk2 . More generally, for nonzero x ∈ Rn ,
the gain is kAxk2 /kxk2 , where the 2-norm in the numerator is in Rm , and in the denominator it is in Rn .
The induced matrix 2-norm of A is defined to be the maximum gain of A over all x ∈ Rn :

\[
\|A\|_2 \stackrel{\Delta}{=} \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \max_{\|x\|_2 = 1} \|Ax\|_2. \tag{5.4}
\]

It is easy to check that the induced norm is indeed a norm on Rm×n , i.e., it satisfies the properties of a norm
listed in Lemma 4.1.2. We say that it is the matrix norm induced by the Euclidean norms on Rn and Rm .
Because of the following connection with eigenvalues, the induced matrix 2-norm is also called the
spectral norm.
Lemma 5.3.2. For A ∈ Rm×n , kAk2 = √(λmax (AT A)).

Proof. We want to maximize the expression kAxk22 = xT (AT A)x over unit vectors x ∈ Rn . This is a
simple Rayleigh quotient problem (see Appendix D). To maximize the expression, select x to be a unit norm
eigenvector of AT A with maximum eigenvalue.

The induced matrix 2-norm has the following additional properties.

Lemma 5.3.3. Let A, B be matrices of appropriate size and x ∈ Rn . Then

1) kAxk2 ≤ kAk2 kxk2 ;

2) kABk2 ≤ kAk2 kBk2 .

Proof. Exercise 5.5.

The induced matrix 2-norm is also closely connected to singular values.


Lemma 5.3.4. For A ∈ Rm×n , kAk2 = σ1 (A).

Proof. Let A have compact SVD U ΣV T . Using Lemma 5.3.2, kAk22 = λmax (AT A) = λmax (V Σ2 V T ) =
σ12 (A). Hence the induced 2-norm of A equals the maximum singular value of A.

We see from the proof of the above lemma that the input direction with the most gain is v1 ; this appears
in the output in the direction u1 , and the gain is σ1 : Av1 = σ1 u1 . This is visualized in Figure 5.3.
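A minimal numerical check of Lemmas 5.3.2 and 5.3.4 (assuming numpy):

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A)

print(np.isclose(np.linalg.norm(A, 2), s[0]))                        # spectral norm equals sigma_1
print(np.isclose(np.sqrt(np.linalg.eigvalsh(A.T @ A).max()), s[0]))  # sqrt of lambda_max(A^T A)
print(np.isclose(np.linalg.norm(A @ Vt[0]), s[0]))                   # the gain in the direction v_1 is sigma_1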

5.4 Best Rank k Approximation to a Matrix


Let A ∈ Rm×n have rank r ≤ min{m, n}. For fixed k < r we consider the problem of finding a rank k
matrix B ∈ Rm×n that is closest to A. If we use the Frobenius norm to quantify the distance between A and
B, the problem can be stated as

\[
\min_{B \in \mathbb{R}^{m\times n}} \|A - B\|_F^2 \quad \text{s.t. } \operatorname{rank}(B) = k. \tag{5.5}
\]

If à solves (5.5), we call à a best rank k approximation to A under the Frobenius norm.
Let A have compact SVD A = U ΣV T = Σ_{i=1}^r σi ui viT , and set
\[
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T. \tag{5.6}
\]
Ak is formed by truncating the SVD of A.

Theorem 5.4.1. For any matrix A ∈ Rm×n with rank r ≥ k, the matrix Ak formed by truncating an
SVD of A to its k leading terms is a best rank k approximation to A under the Frobenius norm.

Proof. Since the expression (5.6) specifies a compact SVD of Ak , it is clear that Ak has rank k. Its distance
from A can be found using standard equalities:
\[
\|A - A_k\|_F^2 = \|U\Sigma V^T - U\Sigma_k V^T\|_F^2 = \|\Sigma - \Sigma_k\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2.
\]
Now let B ∈ Rm×n be any matrix of rank k, with singular values λ1 ≥ · · · ≥ λk > 0. By Corollary 5.2.2,
\[
\|A - B\|_F^2 \geq \|\Sigma - \Sigma_B\|_F^2 = \sum_{i=1}^{k} (\sigma_i - \lambda_i)^2 + \sum_{i=k+1}^{r} \sigma_i^2 \geq \|A - A_k\|_F^2.
\]

So among all m × n matrices of rank k, the matrix Ak achieves the minimum Euclidean distance to A.

The matrix Ak is also a best rank k approximation to A under the induced 2-norm.
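A minimal numerical sketch of Theorem 5.4.1 (assuming numpy): truncate the SVD and compare the squared error with the tail sum of squared singular values.

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # truncated SVD

err = np.linalg.norm(A - Ak, 'fro')**2
print(np.isclose(err, np.sum(s[k:]**2)))          # matches the tail sum of squared singular values
print(np.linalg.matrix_rank(Ak) == k)             # Ak has rank k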


5.5 Inner Workings of the SVD


We now give a quick overview of the origins of the matrices U , V and Σ in an SVD. Let A ∈ Rm×n have
rank r. So the range of A has dimension r, and the null space of A has dimension n − r.
Since AT A ∈ Rn×n is symmetric PSD, it has nonnegative eigenvalues and a full set of orthonormal
eigenvectors. Order the eigenvalues in decreasing order: σ12 ≥ σ22 ≥ · · · ≥ σn2 ≥ 0 and let vj denote the
eigenvector for σj2 . So
(AT A)vj = σj2 vj , j ∈ [1 : n].

Noting that N (A) = N (AT A) (Exercise 5.3), we see that the null space of AT A also has dimension n − r.
It follows that n − r of the eigenvectors of AT A must lie in N (A) and r must lie in N (A)⊥ . Hence

\[
\sigma_1^2 \geq \cdots \geq \sigma_r^2 > 0 \quad\text{and}\quad \sigma_{r+1}^2 = \cdots = \sigma_n^2 = 0,
\]

with v1 , . . . , vr an orthonormal basis for N (A)⊥ .


Similarly, AAT ∈ Rm×m is symmetric and PSD, and hence has nonnegative eigenvalues and a full set
of orthonormal eigenvectors. Order the eigenvalues in decreasing order: λ21 ≥ λ22 ≥ · · · ≥ λ2m ≥ 0 and let
uj denote the eigenvector for λ2j . So

(AAT )uj = λ2j uj , j ∈ [1 : m].

Noting that N (AAT ) = N (AT ), and N (AT ) = R(A)⊥ , we see that the dimension of N (AAT ) is m − r.
So m − r of the eigenvectors of AAT must lie in N (AT ) and r must lie in N (AT )⊥ = R(A). Hence

λ21 ≥ · · · ≥ λ2r > 0 and λ2r+1 = · · · = λ2m = 0,

with u1 , . . . , ur an orthonormal basis for R(A).


We now connect the eigenvalues σj2 , λ2j and the corresponding eigenvectors vj , uj , for j ∈ [1 : r]. First
note that
(AAT )(Avj ) = A(AT Avj ) = σj2 (Avj ).

So either Avj = 0, or Avj is an eigenvector of AAT with eigenvalue σj2 . If Avj = 0, then (AT A)vj = 0.
This contradicts AT Avj = σj2 vj with σj2 > 0. Hence Avj must be an eigenvector of AAT with eigenvalue
σj2 . Assume for simplicity, that the positive eigenvalues of both AT A and AAT are distinct. Then for some
k, with 1 ≤ k ≤ r:
σj2 = λ2k and Avj = αuk , with α > 0.
We can take α > 0 by swapping −uk for uk if necessary. Using this result we find
\[
v_j^T A^T A v_j = \begin{cases} \sigma_j^2\, v_j^T v_j = \sigma_j^2; \\ (A v_j)^T (A v_j) = \alpha^2 u_k^T u_k = \alpha^2. \end{cases}
\]
So we must have α = σj and Avj = σj uk .
The same analysis, this time using (AAT )uk = λ2k uk with λ2k > 0, yields

(AT A)(AT uk ) = AT (AAT uk ) = λ2k (AT uk ).


Since λ2k > 0, we can’t have AT uk = 0. So AT uk is an eigenvector of AT A with eigenvalue λ2k . Under the
assumption of distinct nonzero eigenvalues, this implies that for some p with 1 ≤ p ≤ r,

λ2k = σp2 and AT uk = βvp , for some β ≠ 0.

Using this expression to evaluate uTk (AAT )uk we find λ2k = β 2 . Hence β 2 = λ2k = σp2 and AT uk = βvp .
We now have two ways to evaluate AT Avj :
\[
A^T A v_j = \begin{cases} \sigma_j^2 v_j & \text{by definition;} \\ \alpha A^T u_k = \alpha\beta v_p & \text{using the above analysis.} \end{cases}
\]

Equating these answers gives j = p and αβ = σj2 . Since α > 0, it follows that β > 0 and α = σj = λj = β.
Thus Avj = σj uj , j ∈ [1 : r]. Written in matrix form this is almost the compact SVD:
\[
A \begin{bmatrix} v_1 & \cdots & v_r \end{bmatrix} = \begin{bmatrix} u_1 & \cdots & u_r \end{bmatrix}
\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix}.
\]

From this we deduce that AV V T = U ΣV T . V V T computes the orthogonal projection of x onto N (A)⊥ .
Hence for every x ∈ Rn , AV V T x = Ax. Thus AV V T = A, and we have A = U ΣV T .
Finally note that σj = √(λj (AT A)) = √(λj (AAT )), j ∈ [1 : r]. So the singular values are always unique.

If the singular values are distinct, the SVD is unique up to sign interchanges between the uj and vj . But this
still leaves the representation A = Σ_{j=1}^r σj uj vjT unique. If the singular values are not distinct, then U and
V are not unique. For example, In = U In U T for every orthogonal matrix U .
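The construction above translates directly into a short numerical sketch (assuming numpy): take the eigendecomposition of A^T A, keep the eigenvectors for the positive eigenvalues, and set u_j = A v_j / σ_j.

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))
lam, Vfull = np.linalg.eigh(A.T @ A)              # eigenvalues of A^T A in increasing order
order = np.argsort(lam)[::-1]                     # reorder so they are decreasing
lam, Vfull = lam[order], Vfull[:, order]

r = int(np.sum(lam > 1e-10))
sigma = np.sqrt(lam[:r])
V = Vfull[:, :r]
U = A @ V / sigma                                 # columns u_j = A v_j / sigma_j

print(np.allclose(U.T @ U, np.eye(r)))            # the u_j are orthonormal
print(np.allclose(A, U @ np.diag(sigma) @ V.T))   # A = U Sigma V^T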

Notes
For more detailed reading about the SVD see Chapter 7, section 7.3, of Horn and Johnson [22].

Exercises

Preliminaries
Exercise 5.1. Let A ∈ Rm×n . The rank of A is the dimension of R(A). Show that: (a) the rank of A equals the
number of linearly independent columns of A, (b) the rank of A equals the number of linearly independent rows of A.
Exercise 5.2. Let A ∈ Rm×n have rank r. Show that
(a) dim N (A) = n − r
(b) dim R(A)⊥ = m − r
(c) dim N (A)⊥ = dim R(A)
(d) dim N (A) + dim R(A) = n
Exercise 5.3. Let A ∈ Rm×n . Show that N (AT A) = N (A) and N (AAT ) = N (AT ).

Induced 2-norm
Exercise 5.4. Show that the induced 2-norm satisfies the properties of a norm.


Exercise 5.5. Prove Lemma 5.3.3.


Exercise 5.6. Show that the induced 2-norm of A ∈ Rm×n is invariant under orthogonal transformations.
Exercise 5.7. Show that the induced 2-norm of a symmetric matrix is the maximum magnitude of its eigenvalues.

Basic SVD Properties


Exercise 5.8. Prove the following:
(a) For α ∈ R and A ∈ Rm×n , σi (αA) = |α|σi (A), i ∈ [1 : q], where q = min(m, n).
(b) For A, B ∈ Rm×n , σ1 (A + B) ≤ σ1 (A) + σ1 (B).
(c) For A ∈ Rn×n , and every eigenvalue λ of A, |λ| ≤ σ1 (A).
(d) For A ∈ Rm×n , |Ai,j | ≤ σ1 (A).
(e) For A, B ∈ Rm×n and q = min(m, n), Σ_{i=1}^q σi (A + B) ≤ Σ_{i=1}^q σi (A) + Σ_{i=1}^q σi (B). [Hint: nuclear norm]
Exercise 5.9. For A ∈ Rm×n and Q ∈ Om , R ∈ On , show that the singular values of QAR are the same as those of
A. Hence singular values are invariant under orthogonal transformations.
Exercise 5.10. Let P be a real symmetric n × n matrix. Show how to form an SVD factorization of P from its
eigendecomposition P = V ΛV T . Here the columns of V are ON eigenvectors of P and Λ is a diagonal matrix with
the corresponding eigenvalues on the diagonal.
Exercise 5.11. Show that the eigenvalues of a symmetric PSD matrix are also its singular values.
Exercise 5.12. Let A ∈ Rm×n have rank r and compact SVD A = U ΣV T = Σ_{i=1}^r σi ui viT . For k ∈ [1 : r], let
Uk = span{ui viT }ki=1 , and Ak = Σ_{i=1}^k σi ui viT . Clearly Ak ∈ Uk and Ar = A. Using the Frobenius norm as the
distance measure, show the following:

(a) Ak has rank k.


(b) Ak is the closest point in Uk to A.
(c) Ak is the closest rank k matrix to A.

Exercise 5.13. Let A ∈ Rn×n have rank n. Show that minkxk2 =1 kAxk2 = σn (A).
Exercise 5.14. Let A ∈ Rm×n and B ∈ Rn×m . So AB ∈ Rm×m and BA ∈ Rn×n . If both AB and BA
are symmetric, show that the nonzero singular values of AB are the same as those of BA, including multiplicities.
Without loss of generality assume m ≤ n.

Inverse Maps and the SVD


Exercise 5.15. Let A ∈ Rn×n be a square invertible matrix with SVD A = U ΣV T . Show that A−1 = V Σ−1 U T .
Exercise 5.16. The Moore-Penrose pseudo-inverse of a matrix A ∈ Rm×n is the unique matrix A+ ∈ Rn×m satisfy-
ing the following four properties:
(1) A(A+ A) = A
(2) (A+ A)A+ = A+
(3) (A+ A)T = A+ A
(4) (AA+ )T = AA+
Let A have compact SVD A = U ΣV T . Show that A+ = V Σ−1 U T . Give an interpretation of A+ in terms of N (A),
N (A)⊥ and R(A).


Exercise 5.17. Let A ∈ Rm×n and y ∈ Rm be given. We want to find a solution x of the linear equations AT Ax =
AT y. Show that if A = U ΣV T is a compact SVD of A, then a solution is x? = V Σ−1 U T y and x? ∈ N (A)⊥ .

Maximizing Inner Products


Exercise 5.18. Let Σ, Λ ∈ Rn×n be diagonal matrices with nonnegative diagonal entries. We want to find permutation
matrices P, Q ∈ Rn×n to maximize <Σ, P ΛQ>.

(a) Show that optimal permutation matrices P ? , Q? must exist.


(b) Show that P ? ΛQ? must be diagonal.
(c) In light of (b), show that the problem can be simplified as follows. For x, y ∈ Rn with nonnegative entries, find
a permutation matrix P ∈ Rn×n that maximizes the inner product <x, P y> = xT P y.
(d) Solve the problem in part (c). You can initially assume that x(1) ≥ x(2) ≥ · · · ≥ x(n). Once you have a
solution, show how to remove the above assumption.

Exercise 5.19. Let Σr be an r × r diagonal matrix with positive diagonal entries σ1 ≥ σ2 ≥ · · · ≥ σr > 0. Let
Σ ∈ Rn×n be block diagonal with Σ = diag(Σr , 0(n−r)×(n−r) ).
(a) What orthogonal matrices Q ∈ Or maximize the inner product <Q, Σr >?
(b) What orthogonal matrices W ∈ On maximize the inner product <W, Σ>?
Exercise 5.20. Let A ∈ Rn×n .
(a) Find Q ∈ On to maximize the inner product <Q, A> and determine the maximum value.
(b) Show that Q ∈ On maximizes <Q, A> iff QT A is symmetric PSD.
Exercise 5.21. For given A, B ∈ Rm×n , find an orthogonal matrix Q ∈ Om to maximize the inner product <A, QB>.
This Q “rotates” the columns of B to maximize the inner product with A.
Exercise 5.22. Let Σ ∈ Rr×r be diagonal with positive diagonal entries. Find a matrix B ∈ Rr×r subject to
σ1 (B) ≤ 1 that maximizes the inner product <Σ, B>.
Exercise 5.23. For given A ∈ Rm×n with r = rank(A), we want to find B ∈ Rm×n subject to σ1 (B) ≤ 1 that
maximizes <A, B>. Show that the maximum value is Σ_{i=1}^r σi (A) and find a solution B.

Minimizing Norms
Exercise 5.24. For given A, B ∈ Rm×n , we seek Q ∈ Om to minimize kA − QBk2F . Show that this is equivalent to
finding Q ∈ Om to maximize the inner product <A, QB>, and determine the minimum achievable value.
Exercise 5.25. For A, B ∈ Rm×n , we seek orthogonal matrices Q ∈ Om and R ∈ On to minimize kA − QBRk2F .
Let A = UA ΣA VAT and B = UB ΣB VBT be full singular value decompositions. For Q ∈ Om and R ∈ On , show the
following claims:

(a) minQ,R kA − QBRk2F = trace(Σ2A ) + trace(Σ2B ) − 2 maxQ,R trace(ΣTA QΣB R).


(b) minQ,R kA − QBRk2F = kΣA − ΣB k2F .

Exercise 5.26. Let A, B ∈ Rm×n have full SVDs A = UA ΣA VAT and B = UB ΣB VBT . Show that

kA − BkF ≥ kΣA − ΣB kF .

The Nuclear Norm
Let A ∈ Rm×n and q = min(m, n). The nuclear norm of A (also called the trace norm) is kAk∗ = Σ_{i=1}^q σi (A).
Exercise 5.27. Show that kAk∗ = maxkCk2 ≤1 <A, C>.


Exercise 5.28. Show that k · k∗ is a norm on Rm×n .


Exercise 5.29. Show that:
(a) The nuclear norm is invariant under orthogonal transformations.

(b) For A ∈ Rm×n , kAk∗ = trace(√(AT A)).
(c) If P is symmetric PSD, kP k∗ = trace(P ).

Miscellaneous
Exercise 5.30. Prove or disprove: Let X ∈ Rm×n . Let the columns of V be k orthonormal eigenvectors for the
nonzero eigenvalues of X T X, and the columns of U be k orthonormal eigenvectors for the nonzero eigenvalues of
XX T . Then XV = U .
Exercise 5.31. Prove or disprove: Let {xj }kj=1 be a linearly independent set in Rn . Then A = Σ_{j=1}^k xj xjT has k
nonzero singular values.
Exercise 5.32. Prove or disprove: For X ∈ Rm×n with X 6= 0, the nonzero eigenvalues of the matrices X T X and
XX T are identical.
Exercise 5.33. Prove or disprove: For x ∈ Rm and y ∈ Rn , kxy T kF = kxy T k2 = kxy T k∗ .
Exercise 5.34. Let W ∈ Rn×k , with k ≤ n, have ON columns (note W need not be square). Prove or disprove
each of the following claims: for each x ∈ Rk and each A ∈ Rk×m : (a) kW xk2 = kxk2 , (b) kW Ak2 = kAk2 , (c)
kW AkF = kAkF , and (d) kW Ak∗ = kAk∗ .
Exercise 5.35 (Idempotent But Not Symmetric). Let P ∈ Rn×n have rank r and compact SVD P = U ΣV T . If
P 2 = P , show that either r = n and P = In or r < n and P = V V T + V0 V T where the columns of V0 lie in R(V )⊥ .
Choosing V0 = 0 yields a projection matrix P = V V T . But choosing V0 6= 0, this yields an idempotent matrix P
that is not symmetric.
Exercise 5.36. Let A ∈ Rn×n have a compact SVD UA ΣA VAT . Show that trace(A) ≤ trace(ΣA ).


Chapter 6

Multivariable Differentiation

Many machine learning problems are formulated as optimization problems. Typically, this involves a real-
valued loss function f (w) where the vector w ∈ Rn parameterizes the possible solutions. The objective is to
select w to minimize the loss f (w). Assuming the function f : Rn → R is differentiable, a natural approach
is to find the derivative of f with respect to w and set this equal to zero. Sometimes it is possible to solve the
resulting equation for w or reduce its solution to known computations.
An example is the generalized Rayleigh quotient problem

\[
\max_{w \in \mathbb{R}^n} f(w) = \frac{w^T P w}{w^T Q w},
\]
where P and Q are n × n real symmetric matrices with P positive semidefinite and Q positive definite.
This problem arises in one formulation of Linear Discriminant Analysis.
More generally, the cost could be a function of a matrix W ∈ Rn×d . In this case, we want to take the
derivative of f (W ) with respect to the matrix W , set this equal to zero and solve for the optimal W . An
example is the matrix Rayleigh quotient problem
\[
\max_{W \in \mathbb{R}^{n\times d}} f(W) = \operatorname{trace}(W^T P W) \quad \text{s.t. } W^T W = I_d,
\]

where P ∈ Rn×n is a symmetric positive semidefinite matrix. This constrained optimization problem arises
in dimensionality reduction via Principal Components Analysis.
In many cases, even if we can compute the derivative of f it may not be clear how to solve for the optimal
value of the parameter. In such cases, an alternative is to determine (or approximate) the gradient of the loss
function f . Then we can iteratively minimize f by gradient descent (or stochastic gradient descent when
we have an approximation to the gradient). For example, this is the currently preferred means of training a
neural network.
The first issue in addressing all of these problems is the ability to take derivatives and compute gradients.
This chapter reviews this task. We define the derivative of various forms of functions (real-valued, vector-
valued, matrix-valued) defined on Rn or Rn×m . We also define the gradient of real-valued functions and
point out the distinction between the gradient and the derivative. By the end of the chapter the reader will
be equipped to derive the gradients and derivatives shown in Table 6.1, and to solve the Rayleigh quotient
problems shown above.


f                   ∇f                   Df
aT x                a                    aT v
kxk22               2x                   2xT v
kxk2                x/kxk2               (x/kxk2 )T v
kAxk22              2AT Ax               2xT (AT A)v
a ⊗ x               n/a                  a ⊗ v
x ⊗ x               n/a                  2x ⊗ v
trace(AT M )        A                    trace(AT V )
kM k2F              2M                   2 trace(M T V )
kM kF               M/kM kF              trace((M/kM kF )T V )
trace(M )           In                   trace(V )
det(M )             det(M )M −T          det(M ) trace(M −1 V )
A ⊗ M               n/a                  A ⊗ V
M ⊗ M               n/a                  2M ⊗ V
M2                  n/a                  M V + V M
M −1                n/a                  −M −1 V M −1

Table 6.1: Summary Table. A selection of functions f shown with corresponding derivatives and gradients. In the
table entries, x and M are the variables of the function f , a and A are constants, and v and V are the dummy variables
of the derivatives, with x, a, v ∈ Rn and M, A, V ∈ Rn×n .

6.1 Real Valued Functions on Rn


Consider a function f mapping a subset D ⊆ Rn into R. We call D the domain of f . Simple examples
include, f (x) = xT x, f (x) = aT x for some fixed a ∈ Rn , and f (x) = kAx − yk22 for fixed A ∈ Rm×n
and y ∈ Rm .
Let x be a point in the interior of D. Intuitively, the derivative of f at x, denoted by Df (x), is the
best linear approximation to f around the point x. So for a small vector increment v, f (x) + Df (x)(v)
gives the best linear approximation to f (x + v). We say that to first order f (x + v) ≈ f (x) + Df (x)(v).
Geometrically one can think of the derivative as the tangent plane to the surface (x, f (x)) at x.
The above discussion motivates the following definition. A function f : Rn → R is differentiable at
x ∈ Rn if there exists a linear function Df (x) : Rn → R such that

\[
\lim_{v \to 0} \frac{f(x+v) - f(x) - Df(x)(v)}{\|v\|_2} = 0. \tag{6.1}
\]

Here v ∈ Rn , Df (x)(v) ∈ R, and the norm in the denominator is the 2-norm in Rn .


Df (x) can be computed using partial derivatives. For this it is convenient to write x = (x1 , x2 , . . . , xn )


and v = (v1 , v2 , . . . , vn ). Then
\[
Df(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} & \frac{\partial f(x)}{\partial x_2} & \cdots & \frac{\partial f(x)}{\partial x_n} \end{bmatrix} \tag{6.2}
\]
\[
Df(x)(v) = \sum_{j=1}^{n} \frac{\partial f(x)}{\partial x_j}\, v_j. \tag{6.3}
\]
So Df (x) is a row vector with j-th entry ∂f (x)/∂xj , j ∈ [1 : n]. In general, this row vector depends on x. As
we move the point x around, the row vector will change. This change makes sense since the best linear
approximation to f locally about x will depend on x.
When f is differentiable at every point in a open set X ⊂ Rn , we say that f is a differentiable function
on X , and when f is differentiable on Rn , we simply say that f is a differentiable function.

Example 6.1.1. Here are some simple examples.

(a) Let a ∈ R2 and f : R2 → R with f (z) = aT z. Then
\[
Df(z)(v) = \begin{bmatrix} \frac{\partial f(z)}{\partial z_1} & \frac{\partial f(z)}{\partial z_2} \end{bmatrix} v = a_1 v_1 + a_2 v_2 = a^T v.
\]
(b) Let p : R2 → R with p(z) = z1 z2 . Then
\[
Dp(z)(v) = \begin{bmatrix} \frac{\partial p(z)}{\partial z_1} & \frac{\partial p(z)}{\partial z_2} \end{bmatrix} v = z_2 v_1 + z_1 v_2 = \begin{bmatrix} z_2 & z_1 \end{bmatrix} v.
\]
(c) Let f : R2 → R with f (z) = z T z. Then
\[
Df(z)(v) = \begin{bmatrix} \frac{\partial f(z)}{\partial z_1} & \frac{\partial f(z)}{\partial z_2} \end{bmatrix} v = 2z_1 v_1 + 2z_2 v_2 = 2z^T v.
\]

6.1.1 The Gradient


Any real-valued linear function l(·) on Rn , can be written in the form l(x) = <a, x> for some a ∈ Rn . Let
f : Rn → R be differentiable. Then the derivative of f at x is a linear function Df (x) : Rn → R. Hence
there exists a vector gx ∈ Rn , that depends on x, such that Df (x)(v) = <gx , v>. The vector gx is called
the gradient of f at x and is denoted by ∇f (x). Thus we can write Df (x)(v) = <∇f (x), v> = ∇f (x)T v.
The gradient ∇f (x) ∈ Rn points in the direction of steepest ascent of f at the point x, and k∇f (x)k2
gives the rate of increase of f in this direction. To see this write Df (x) = gxT , and let u be a unit norm
vector (a direction). Then to first order, the rate of increase of f at x in the direction u is gxT u. By Cauchy-
Schwartz, the direction that gives the maximum rate of increase is u? = gx /kgx k2 , and the rate of increase
in this direction is gxT gx /kgx k2 = kgx k2 .
Using (6.2), ∇f (x) is given by
\[
\nabla f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} & \frac{\partial f(x)}{\partial x_2} & \cdots & \frac{\partial f(x)}{\partial x_n} \end{bmatrix}^T.
\]

Since Df (x)(v) = ∇f (x)T v = <∇f (x), v>, the derivative of f at x determines the gradient of f at x,
and the gradient of f at x determines the derivative of f at x.


Example 6.1.2. We illustrate the gradient using the functions in Example 6.1.1.

(a) Let a ∈ R2 and f : R2 → R with f (z) = aT z. Then ∇f (z) = a.

(b) Let p : R2 → R with p(z) = z1 z2 . Then
\[
\nabla p(z) = \begin{bmatrix} z_2 \\ z_1 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}.
\]

(c) Let f : R2 → R with f (z) = z T z. Then ∇f (z) = 2z.

To avoid confusion, keep in mind that the derivative and the gradient are not the same. The definition
of the gradient assumes a real-valued function. In contrast, the derivative can exist for real-valued, vector-
valued, and matrix-valued functions. Even when both are defined, one is a linear function mapping the
domain of f into R, while the other is a vector in Rn . In short, when both are defined, the derivative and the
gradient are distinct but connected objects.
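A practical way to keep the two objects straight is to test a gradient formula against the definition of the derivative using finite differences: for small v, f(x + v) − f(x) should be close to <∇f(x), v>. A minimal sketch (assuming numpy) for the functions of Example 6.1.2:

import numpy as np

rng = np.random.default_rng(6)
n = 4
a = rng.standard_normal(n)
x = rng.standard_normal(n)
v = rng.standard_normal(n)
eps = 1e-6

f1 = lambda z: a @ z                              # f(x) = a^T x, gradient a
print(np.isclose((f1(x + eps * v) - f1(x)) / eps, a @ v))

f2 = lambda z: z @ z                              # f(x) = x^T x, gradient 2x
print(np.isclose((f2(x + eps * v) - f2(x)) / eps, 2 * x @ v, atol=1e-4))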

6.1.2 Next Steps


The natural next step is to discuss additional properties of the derivative and gradient. However, it will be
convenient to first discuss the derivative of functions from Rn into Rm . Real-valued functions constitute the
special case m = 1, and we can infer properties of the derivative (and the gradient) from the more general
case.

6.2 Functions f : Rn → Rm
Conceptually, the derivative of a function f : Rn → Rm at x ∈ Rn is the linear function Df (x) that gives
the best local approximation to f around the point x. So to first order, the best linear approximation to f at
x is f (x + v) ≈ f (x) + Df (x)(v). This leads to the following definition.
A function f : Rn → Rm is differentiable at x ∈ Rn if there exists a linear function Df (x) : Rn → Rm such that
\[
\lim_{v \to 0} \frac{f(x+v) - f(x) - Df(x)(v)}{\|v\|_2} = 0. \tag{6.4}
\]
Here v ∈ Rn , Df (x)(v) ∈ Rm , and the norm in the denominator is the 2-norm in Rn .
Write f (x) = [f1 (x) f2 (x) . . . fm (x)]T where fi : Rn → R is a scalar valued function that gives
the value of the i-th entry of f (x) ∈ Rm . The best linear approximation to f at x must also give the best
linear approximation for each component of f (x). Hence
\[
Df(x)(v) = \begin{bmatrix} Df_1(x)(v) \\ Df_2(x)(v) \\ \vdots \\ Df_m(x)(v) \end{bmatrix}.
\]

Let x = (x1 , x2 , . . . , xn ) and v = (v1 , v2 , . . . , vn ). Then we can use partial derivatives to write
\[
Df(x)(v) = \begin{bmatrix}
\frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_1(x)}{\partial x_2} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\
\frac{\partial f_2(x)}{\partial x_1} & \frac{\partial f_2(x)}{\partial x_2} & \cdots & \frac{\partial f_2(x)}{\partial x_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial f_m(x)}{\partial x_1} & \frac{\partial f_m(x)}{\partial x_2} & \cdots & \frac{\partial f_m(x)}{\partial x_n}
\end{bmatrix} v.
\]


So Df (x) is an m × n matrix with i, j-th entry ∂fi (x)/∂xj , i ∈ [1 : m], j ∈ [1 : n]. In general, as we move the
point x around, the matrix Df (x) will change since the best local linear approximation to f will depend
on x. Fortunately, in many useful instances, we don’t have to compute all of the above partial derivatives.
Instead, we can use properties of the derivative to determine the functional form of Df (x)(v).
Example 6.2.1. The derivative of a linear function f : Rn → Rn .
(a) Let x ∈ R2 , A ∈ R2×2 and f (x) = Ax. Then
\[
Df(x)(v) = \Big[\frac{\partial f_i(x)}{\partial x_j}\Big] v = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix} v = Av.
\]
(b) Let a, x ∈ Rn and f (x) = a ⊗ x where ⊗ denotes the Schur product. Then fi (x) = ai xi and
\[
Df(x)(v) = \Big[\frac{\partial f_i(x)}{\partial x_j}\Big] v = \operatorname{diag}(a)\, v = a \otimes v.
\]

6.2.1 Key Properties


The derivative of a function f : Rn → Rm has three key properties. First, the derivative of a linear function is
the same linear function. Second, the derivative of a composition of two functions is the composition of the
derivatives of each function in the same order. This property is called the chain rule. A simple consequence
of the first two properties is that the derivative of a linear function of f (x) is the same linear function of
Df (x)(v). In particular, for any scalar α, the derivative of αf (x) is αDf (x)(v), and the derivative of a sum
of functions is the sum of the derivatives of the functions. Third, one obtains the derivative of the product
of two scalar-valued functions f (x)g(x) from the derivatives of each separate function using a simple rule
called the product rule. We discuss all of these properties below.

The derivative of a linear function


Lemma 6.2.1. A linear function f : Rn → Rm is differentiable and Df (x)(v) = f (v).

Proof. If it exists, the derivative of f at x is the unique linear function from Rn to Rm that best matches f in
a neighborhood of x. Since f is linear, its best linear approximation at any point x is itself. Hence a linear
function is differentiable everywhere and Df (x)(v) = f (v). Let's check:
\[
\lim_{v\to 0}\frac{f(x+v) - f(x) - Df(x)(v)}{\|v\|_2} = \lim_{v\to 0}\frac{f(x) + f(v) - f(x) - f(v)}{\|v\|_2} = 0.
\]

The derivative of a composition of functions: chain rule


Let g ◦ f denote the composition of the functions f : Rn → Rm and g : Rm → Rd . This function first
computes f (x) then applies g to f (x). So (g ◦ f )(x) = g(f (x)). Let f be differentiable at x and g be
differentiable at f (x). Then to first order, a small perturbation of x to x + v is mapped by f to f (x) +
Df (x)(v). So the point f (x) is perturbed to f (x) + Df (x)(v). Thus when g is applied, to first order the
result is
g (f (x + v)) = g (f (x) + Df (x)(v)) = g (f (x)) + Dg (f (x)) (Df (x)(v)) .
This informal argument suggests the following result.


Lemma 6.2.2 (Chain Rule). Let f : Rn → Rm be differentiable at x, g : Rm → Rd be differentiable


at f (x), and set h(x) = (g ◦ f )(x). Then

Dh(x)(v) = Dg(f (x)) ◦ Df (x)(v). (6.5)

Proof. See any book on multivariable differential calculus.

A linear function of a set of functions


Let f : Rn → Rm , g : Rm → Rd and consider the composition g ◦ f . When g is a linear function,
Dg(f (x))(w) = g(w) and hence by the chain rule
D(g ◦ f )(x)(v) = g (Df (x)(v)) = g ◦ Df (x)(v).
So the derivative of a linear function of a function evaluated at x is the same linear function of the derivative
of f evaluated at x.
Now consider m scalar valued functions fj : Rn → R and let h(x) = Σ_{j=1}^m aj fj (x), for scalars aj ,
j ∈ [1 : m]. We can write this as h(x) = aT f (x), where f (x) = [f1 (x) · · · fm (x)]T . Then by the
previous result
\[
Dh(x)(v) = a^T Df(x)(v) = \sum_{j=1}^{m} a_j\, Df_j(x)(v). \tag{6.6}
\]
So the derivative of a linear combination of scalar valued functions is the same linear combination of the
derivatives of these functions. Equation (6.6) continues to hold when the functions fj take values in Rm
(Exercise 6.1).

The derivative of a product or inner product of two functions: product rule


The product rule for scalar-valued functions is a simple consequence of the chain rule. Let f, g : Rn → R
be differentiable at x and consider the function h : Rn → R with h(x) = f (x)g(x). We want to find
the derivative of h at x. To do so, bring in the function s : Rn → R2 with s(x) = (f (x), g(x)), and the
function p : R2 → R, with p(z) = z1 z2 . The composition of p and s yields the desired overall function,
p ◦ s(x) = p(f (x), g(x)) = f (x)g(x) = h(x). So we can find the derivative of h using the chain rule and
the derivatives of p and s. The derivative of s at x is found by taking the derivative of each component of s
at x:  
Df (x)(v)
Ds(x)(v) = .
Dg(x)(v)
The derivative of p at z ∈ R2 was obtained in Example 6.1.2:
 
Dp(z)(v) = z2 z1 v.
Now set z = (f (x), g(x)). Then z = s(x) and h(x) = p(z). Using the chain rule, we have
Dh(x)(v) = D(p ◦ s(x))(v) = Dp(z) ◦ Ds(x)(v).
Substituting the expressions for Dp(z) and Ds(x) yields the product rule for finding the derivative of the
product of functions f, g : Rn → R:
D(f (x)g(x))(v) = [g(x), f (x)][Df (x)(v), Dg(x)(v)]T = g(x)Df (x)(v) + f (x)Dg(x)(v).


Lemma 6.2.3 (Product Rule). Let f, g : Rn → R be differentiable at x. Then

D(f (x)g(x))(v) = g(x)Df (x)(v) + f (x)Dg(x)(v). (6.7)

Proof. We have given a proof outline above.

The product rule can be generalized to functions f, g : Rn → Rm by considering h(x) = <f (x), g(x)>.
This leads to the following extension of Lemma 6.2.3.

Lemma 6.2.4 (Inner Product Rule). Let f, g : Rn → Rm be differentiable at x and set h(x) =
<f (x), g(x)>. Then h is differentiable at x and

Dh(x)(v) = <Df (x)(v), g(x)> + <f (x), Dg(x)(v)>.

Proof. h(x) = Σ_{j=1}^m fj (x)gj (x). Hence using the product rule
\[
Dh(x)(v) = \sum_{j=1}^{m} D(f_j(x) g_j(x))(v) = \sum_{j=1}^{m} \big( Df_j(x)(v)\, g_j(x) + f_j(x)\, Dg_j(x)(v) \big).
\]

The RHS of the above equation can be rearranged to yield the stated result.
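A minimal numerical check of the inner product rule (assuming numpy), using the linear maps f(x) = Ax and g(x) = Bx as an arbitrary example:

import numpy as np

rng = np.random.default_rng(7)
n, m = 3, 5
A, B = rng.standard_normal((m, n)), rng.standard_normal((m, n))
x, v = rng.standard_normal(n), rng.standard_normal(n)

h = lambda z: (A @ z) @ (B @ z)                   # h(x) = <f(x), g(x)>
Dh = (A @ v) @ (B @ x) + (A @ x) @ (B @ v)        # <Df(x)(v), g(x)> + <f(x), Dg(x)(v)>

eps = 1e-6
print(np.isclose((h(x + eps * v) - h(x)) / eps, Dh, atol=1e-4))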

Example 6.2.2. The following examples illustrate the above properties using functions f : Rn → R. We
also determine the gradient of each example.

(a) Fix a ∈ Rn and for x ∈ Rn set f (x) = aT x. f (x) is a linear function from Rn to R. The best
approximation to a linear function is the linear function itself. Hence,

Df (x)(v) = aT v,
∇f (x) = a.

(b) Let x ∈ Rn , A ∈ Rn×n and f (x) = xT Ax. Using the generalized product rule we have:

Df (x)(v) = v T Ax + xT Av = xT (AT + A)v,


∇f (x) = (AT + A)x.

(c) Let x ∈ Rn , A ∈ Rn×n and f (x) = √(xT Ax). For α > 0, define g(α) = α^{1/2} . Then write f (x) =
g(xT Ax). Scalar differentiation yields g′(α) = (1/2)α^{−1/2} . Hence by the chain rule
\[
Df(x)(v) = \frac{1}{2\sqrt{x^T A x}}\, x^T (A + A^T) v, \qquad x^T A x \neq 0,
\]
\[
\nabla f(x) = \frac{1}{2\sqrt{x^T A x}}\, (A + A^T) x, \qquad x^T A x \neq 0.
\]


6.2.2 The Rayleigh Quotient Problem


Let P ∈ Rn×n be a given symmetric positive semidefinite matrix, and consider the problem of selecting
x ∈ Rn to maximize the quotient
\[
\frac{x^T P x}{x^T x}.
\]
Since the quotient is invariant to a scaling of x, we can assume that kxk2 = 1. The problem can then be
stated as the constrained optimization problem,

\[
x^\star = \arg\max_{x \in \mathbb{R}^n} \; x^T P x \quad \text{s.t. } x^T x = 1. \tag{6.8}
\]

Problem (6.8) can be solved using the method of Lagrange multipliers. We first give some insights into
how this works. The constraint xT x = 1 is the zero level set of the function g(x) = 1 − xT x. This level
set is a surface in Rn (in this particular case it’s a sphere in Rn of radius 1). At a point x on the surface,
g(x + v) ≈ g(x) + ∇g(x)v. So the set of vectors tangent to the surface at x is the subspace ∇g(x)⊥ . This
result implies that ∇g(x) is normal to the tangent plane of the surface at x.
We want to maximize f (x) = xT P x. The gradient ∇f (x) gives the direction of maximum increase in
f at x. Generally, moving x in this direction will violate the constraint. But we can always move x in the
direction of the orthogonal projection of ∇f (x) onto the tangent plane at x. This projection is

\[
d(x) = \nabla f(x) - \frac{1}{\|\nabla g(x)\|_2^2}\, \nabla g(x) \nabla g(x)^T \nabla f(x).
\]

As long as d(x) is nonzero, x can be moved in the tangent plane to increase the objective function f (x).
Hence for x to be a solution it is necessary that d(x) = 0. This requires that for some scalar µ,

∇f (x) + µ∇g(x) = 0.

Now bring in the Lagrangian function


L(x, µ) = f (x) + µg(x),

where µ ∈ R is a Lagrange multiplier or dual variable. Taking the derivative of L(x, µ) with respect to x
and setting this equal to zero yields (Df (x) + µDg(x))v = 0 for each v ∈ Rn . This yields the necessary
condition ∇f (x) + µ∇g(x) = 0. Setting the derivative of L(x, µ) with respect to µ equal to zero yields the
constraint g(x) = 0. So the derivatives of the Lagrangian give two necessary conditions for x to be a solution
of the optimization problem:

∇f (x) + µ∇g(x) = 0 for some µ ∈ R,


g(x) = 0.

We now apply these ideas to the Rayleigh quotient problem.

Theorem 6.2.1. Let the eigenvalues of P be λ1 ≥ λ2 ≥ · · · ≥ λn . Problem (6.8) has the optimal value
λ1 and this is achieved if and only if x? is a unit norm eigenvector of P for λ1 . If λ1 > λ2 , this solution
is unique up to the sign of x? .


Proof. We want to maximize xT P x subject to xT x = 1. Bring in the Lagrangian L(x, µ) = xT P x + µ(1 −


xT x), with µ ∈ R. Taking the derivative of this expression with respect to x and setting this equal to zero
yields the necessary condition P x = µx. Hence µ must be an eigenvalue of P with x a corresponding unit
norm eigenvector (xT x = 1). For such x, xT P x = µxT x = µ. Hence the maximum achievable value of
the objective is λ1 and this is achieved when x is a corresponding unit norm eigenvector of P . Conversely,
if u is any unit norm eigenvector of P for λ1 , then uT P u = λ1 and hence u is a solution.
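A minimal numerical illustration of Theorem 6.2.1 (assuming numpy):

import numpy as np

rng = np.random.default_rng(8)
n = 5
B = rng.standard_normal((n, n))
P = B @ B.T                                      # symmetric positive semidefinite

lam, Q = np.linalg.eigh(P)                       # eigenvalues in increasing order
x_star = Q[:, -1]                                # unit-norm eigenvector for the largest eigenvalue
print(np.isclose(x_star @ P @ x_star, lam[-1]))  # attains the optimal value lambda_1

for _ in range(1000):                            # random unit vectors never exceed lambda_1
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    assert x @ P @ x <= lam[-1] + 1e-9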

6.3 Real Valued Functions on Rn×m


Now consider functions mapping real n × m matrices into the set of real numbers. Examples include
f (M ) = kM kF , f (M ) = det(M ), and f (M ) = trace(M ). Conceptually, the derivative of f at M ,
denoted Df (M ), is a linear function from Rn×m to R that gives the best local linear approximation to f
around the point M . So for a small matrix increment V , to first order f (M + V ) ≈ f (M ) + Df (M )(V ).
The formal definition is the natural extension of the two previous cases. A function f : Rn×m → R is
differentiable at M ∈ Rn×m if there exists a linear function Df (M ) : Rn×m → R such that
\[
\lim_{V \to 0} \frac{f(M+V) - f(M) - Df(M)(V)}{\|V\|_F} = 0. \tag{6.9}
\]
Here V ∈ Rn×m , Df (M )(V ) ∈ R, and the norm in the denominator is the Frobenius norm on Rn×m .
Example 6.3.1. Example derivatives of real valued functions of a matrix.
(a) Let A, B, M ∈ Rn×n be square matrices, and set f (M ) = trace(AM T B). This is a composition
of three linear functions: trace(·), A(·)B, and transpose. Thus f is a linear function of M . Hence
Df (M )(V ) = trace(AV T B) = trace(V T BA) = <V, BA>. This gives
Df (M )(V ) = <BA, V >,
∇f (M ) = BA.

(b) Let A ∈ Rn×n and f (M ) = trace(M T AM ). Since trace(·) is a linear function, Df (M )(V ) =
trace(D(M T AM )V ). The product rule then yields Df (M )(V ) = trace(V T AM + M T AV ) =
trace(M T (A + AT )V ) = <(A + AT )M, V >. Thus
Df (M )(V ) = <(A + AT )M, V >,
∇f (M ) = (A + AT )M.

(c) Let f (M ) = kM k2F . We can write f (M ) = <M, M > = trace(M T M ). By the chain rule and
linearity of the trace function, Df (M )(V ) = trace(D(M T M )(V )). Then by the product rule,
D(M T M )(V ) = V T M + M T V . So Df (M )(V ) = trace(V T M + M T V ) = 2 trace(M T V ). Thus
Df (M )(V ) = 2<M, V >
∇f (M ) = 2M.

6.3.1 The Derivative of det(M)


The following lemma gives a less obvious derivative.


Lemma 6.3.1. For invertible M ∈ Rn×n let f (M ) = det(M ). Then

Df (M )(V ) = det(M ) trace(M −1 V ). (6.10)

Proof. See Appendix 6.5.

To see informally why the above result holds, first note that
\[
\det(M + \epsilon V) = \det\big(M(I + \epsilon M^{-1}V)\big) = \det(M)\det(I + \epsilon M^{-1}V).
\]
For any matrix A, det(I + εA) is the product of the eigenvalues of I + εA. If ε is sufficiently small, I + εA
is dominated by its diagonal terms 1 + εaii . Hence to first order, the product of the eigenvalues can be
approximated by the first order terms in Π_{i=1}^n (1 + εaii ). This gives det(I + εA) = 1 + ε trace(A) + O(ε2 ).
Thus
\[
Df(M)(V) = \lim_{\epsilon \to 0} \frac{\det(M + \epsilon V) - \det(M)}{\epsilon} = \det(M)\operatorname{trace}(M^{-1}V).
\]
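A minimal finite-difference check of Lemma 6.3.1 (assuming numpy; the shift by nI is only to keep the example matrix safely invertible):

import numpy as np

rng = np.random.default_rng(9)
n = 4
M = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned, hence invertible
V = rng.standard_normal((n, n))

claimed = np.linalg.det(M) * np.trace(np.linalg.solve(M, V))   # det(M) trace(M^{-1} V)
eps = 1e-5
fd = (np.linalg.det(M + eps * V) - np.linalg.det(M - eps * V)) / (2 * eps)  # central difference
print(np.isclose(claimed, fd, rtol=1e-4, atol=1e-4))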

6.3.2 The Matrix Rayleigh Quotient Problem


Let P ∈ Rn×n be a symmetric positive semidefinite matrix, and d ≤ rank(P ). Consider the matrix Rayleigh
quotient problem:

\[
\max_{W \in \mathbb{R}^{n\times d}} \operatorname{trace}(W^T P W) \quad \text{s.t. } W^T W = I_d. \tag{6.11}
\]

This problem seeks d orthonormal vectors wi , i ∈ [1 : d] (the columns of W ), such that trace(W T P W ) =
Σ_{j=1}^d wjT P wj is maximized. The objective function is a real valued function of W . There are d constraints
of the form wjT wj = 1, j ∈ [1 : d], and d(d − 1)/2 constraints of the form wjT wk = 0 for j ∈ [1 : d − 1],
k ∈ [j + 1 : d]. So we will need d scalar dual variables for the first set of constraints and another d(d − 1)/2
scalar dual variables for the second set of constraints. Each constraint will be multiplied by its corresponding
dual variable and these products will be summed and added to the objective function to form the Lagrangian.
Since the constraints Id − W T W = 0 are symmetric, we represent the dual variables as the entries of real
symmetric matrix Ω. We can then write the Lagrangian in the compact form

L(W, Ω) = trace(W T P W ) + <Ω, Id − W T W >.

Theorem 6.3.1. Let P ∈ Rn×n be a symmetric PSD matrix and d ≤ rank(P ). Then every solution of

\[
\max_{W \in \mathbb{R}^{n\times d}} \operatorname{trace}(W^T P W) \quad \text{s.t. } W^T W = I_d, \tag{6.12}
\]

has the form W ? = We Q where the columns of We are orthonormal eigenvectors of P for its d largest
(hence non-zero) eigenvalues, and Q is a d × d orthogonal matrix.

Proof. The Lagrangian takes the form

\[
L(W, \Omega) = \operatorname{trace}\big(W^T P W + \Omega(I_d - W^T W)\big),
\]





with Ω a real symmetric matrix. Setting the derivative of L with respect to W acting on V ∈ Rn×d equal to
zero yields

0 = trace 2W T P V − 2ΩW T V = 2<W T P − ΩW T , V >.




Since this holds for all V , a solution W ? must satisfy the necessary condition

P W ? = W ? Ω. (6.13)

By the symmetry of Ω ∈ Rd×d , there exists an orthogonal matrix Q ∈ Od such that Ω = QΛQT with Λ a
diagonal matrix with the real eigenvalues of Ω listed in decreasing order on the diagonal. Substituting this
expression into (6.13) and rearranging yields
\[
P(W^\star Q) = (W^\star Q)\Lambda, \tag{6.14}
\]
\[
\operatorname{trace}(\Lambda) = \operatorname{trace}\big((W^\star Q)^T P (W^\star Q)\big) = \operatorname{trace}(W^{\star T} P W^\star). \tag{6.15}
\]

The last term in (6.15) is the optimal value of (6.12). Thus the optimal value of (6.12) is trace(Λ), and
W ? Q is also a solution of (6.12). Finally, (6.14) shows that the columns of We = W ? Q are orthonormal
eigenvectors of P . By optimality, the diagonal entries of Λ must hence be the d largest eigenvalues of P .
Since d ≤ rank(P ), all of these eigenvalues are positive.
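A minimal numerical illustration of Theorem 6.3.1 (assuming numpy): the top-d eigenvectors of P attain the optimum, and a random W with orthonormal columns does no better.

import numpy as np

rng = np.random.default_rng(10)
n, d = 6, 2
B = rng.standard_normal((n, n))
P = B @ B.T                                        # symmetric PSD with rank(P) >= d

lam, Q = np.linalg.eigh(P)                         # eigenvalues in increasing order
We = Q[:, -d:]                                     # eigenvectors for the d largest eigenvalues
opt = np.trace(We.T @ P @ We)
print(np.isclose(opt, lam[-d:].sum()))             # optimal value = sum of the d largest eigenvalues

W, _ = np.linalg.qr(rng.standard_normal((n, d)))   # a random W with orthonormal columns
print(np.trace(W.T @ P @ W) <= opt + 1e-9)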

6.4 Matrix Valued Functions of a Matrix


Finally we briefly consider functions mapping Rn×m into Rp×q . Examples include functions such as
f (M ) = M −1 , f (M ) = AM B, and f (M ) = M T M . Conceptually, the derivative of f at M , denoted
Df (M ), is a linear function from Rn×m to Rp×q that gives the best local linear approximation to f around
the point M . So for a small matrix increment V , to first order f (M + V ) ≈ f (M ) + Df (M )(V ). We give
some examples below.

Example 6.4.1. Example derivatives of matrix valued functions of a matrix.

(a) Let A ∈ Rp×m , B ∈ Rn×q and f (M ) = AM B. This is a linear function of M . Hence Df (M )(V ) =
AV B.

(b) Let A ∈ Rn×n and f (M ) = A⊗M . This is also a linear function of M . Hence Df (M )(V ) = A⊗V .

(c) Let f (M ) = M T M . Then Df (M )(V ) = V T M + M T V .

(d) Let M be square and invertible, and f (M ) = M −1 . Then

Df (M )(V ) = −M −1 V M −1 . (6.16)

This follows by noting that M M −1 = I and hence

V M −1 + M Df (M )(V ) = 0.

Rearranging this expression gives the result above.
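A minimal finite-difference check of (6.16) (assuming numpy):

import numpy as np

rng = np.random.default_rng(11)
n = 4
M = rng.standard_normal((n, n)) + n * np.eye(n)   # shifted so the example matrix is safely invertible
V = rng.standard_normal((n, n))

Minv = np.linalg.inv(M)
claimed = -Minv @ V @ Minv                        # -M^{-1} V M^{-1}
eps = 1e-6
fd = (np.linalg.inv(M + eps * V) - Minv) / eps    # finite-difference approximation of Df(M)(V)
print(np.allclose(claimed, fd, atol=1e-4))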


6.5 Appendix: Proofs


6.5.1 The Determinant of I + εA

Lemma 6.5.1. For A ∈ Rn×n and small ε ∈ R, |I + εA| = 1 + ε trace(A) + O(ε2 ).

Proof. If A is a 2 × 2 matrix, then
\[
|I + \epsilon A| = \begin{vmatrix} 1 + \epsilon a_{11} & \epsilon a_{12} \\ \epsilon a_{21} & 1 + \epsilon a_{22} \end{vmatrix} = 1 + \epsilon\,\operatorname{trace}(A) + \epsilon^2 |A|.
\]
So the result holds for 2 × 2 matrices. Assume the result holds for any real n × n matrix. We show it then
holds for any real (n + 1) × (n + 1) matrix. Let B be an (n + 1) × (n + 1) matrix and write B in the form
\[
B = \begin{bmatrix} A & b \\ c^T & a_{n+1,n+1} \end{bmatrix}
\]
with A ∈ Rn×n , b ∈ Rn , c ∈ Rn and an+1,n+1 ∈ R. Then
\[
|I + \epsilon B| = \begin{vmatrix} I + \epsilon A & \epsilon b \\ \epsilon c^T & 1 + \epsilon a_{n+1,n+1} \end{vmatrix}.
\]
Expand the determinant along the last row of the matrix. This yields a sum of n + 1 terms. The terms from
the first n entries of the last row have the form (−1)^{(n+1)+i} εci |Di |, where Di is the submatrix formed by
eliminating the last row and i-th column of the matrix I + εB, i ∈ [1 : n]. Notice that the last column of Di is
εb. So |Di | = ε|D̄i |, where D̄i is obtained from Di by replacing εb by b. Here we have used the property of
the determinant that scaling a column scales the determinant. Hence each of the first n terms has the form
(−1)^{(n+1)+i} ε2 ci |D̄i |. Now |D̄i | is a sum of products of entries of D̄i and hence is a polynomial in ε. Hence
each of the first n terms is O(ε2 ). Now let's look at the last term in the expansion. Using the induction
hypothesis, this has the form
\[
(-1)^{2(n+1)} (1 + \epsilon a_{n+1,n+1}) |I + \epsilon A| = (1 + \epsilon a_{n+1,n+1})\big(1 + \epsilon\operatorname{trace}(A) + O(\epsilon^2)\big) = 1 + \epsilon\operatorname{trace}(B) + O(\epsilon^2).
\]
Thus |I + εB| = 1 + ε trace(B) + O(ε2 ).

6.5.2 The Derivative D det(A)(V )


Lemma 6.5.2. For invertible A ∈ Rn×n , and any V ∈ Rn×n ,
\[
D_A \det(A)(V) = |A|\operatorname{trace}(A^{-1}V). \tag{6.17}
\]
Proof. Using Lemma 6.5.1,
\[
\frac{\det(A + \epsilon V) - \det(A)}{\epsilon}
= \frac{\det\big(A(I + \epsilon A^{-1}V)\big) - \det(A)}{\epsilon}
= \frac{\det(A)\big(1 + \epsilon\operatorname{trace}(A^{-1}V) + O(\epsilon^2)\big) - \det(A)}{\epsilon}
= \det(A)\operatorname{trace}(A^{-1}V) + \frac{O(\epsilon^2)}{\epsilon}.
\]
Hence
\[
\lim_{\epsilon \to 0} \frac{\det(A + \epsilon V) - \det(A)}{\epsilon} = \det(A)\operatorname{trace}(A^{-1}V).
\]


Notes
Our presentation of multivariable differentiation follows that of standard texts such as Rudin [38] and Flem-
ing [15]. For additional reading on this topic see Section 3.3 of [15]. For a discussion of vector-valued
functions of a vector variable, see Chapter 4 of the same book.

Exercises
Exercise 6.1. Show that equation (6.6) continues to hold when the functions fj take values in Rm .

The Derivative and Gradient of Real-Valued Functions on Rn


Exercise 6.2. Derive the derivative and gradient of the following functions f : Rn → R.
(a) f (x) = Σ_{j=1}^n xj
(b) f (x) = Π_{j=1}^n xj
(c) f (x) = kxk22
(d) f (x) = kxk2
(e) f (x) = e^{Σ_{j=1}^n xj}
(f) f (x) = Σ_{j=1}^n e^{xj}
(g) f (x) = xT Ax + aT x + b, where b ∈ R, a ∈ Rn , and A ∈ Rn×n
(h) f (x) = kAx − yk22 , where y ∈ Rn and A ∈ Rn×n

The Derivative and Gradient of Real-Valued Functions on Rm×n


Exercise 6.3. Let M ∈ Rm×n and f (M ) be differentiable at M . Show that the gradient and derivative of f (M ) with
respect to M T are (∇f (M ))T and trace(∇f (M )H), respectively.
Exercise 6.4. Let M ∈ Rn×m and f (M ) be a real valued function of M . For each of the following functions,
determine a sufficient condition for f to be differentiable at M . Under this condition find Df (M )(H) and ∇f (M ).

(a) f (M ) = kM kF (e) On the subspace S of real symmetric n × n matri-


ces, define f (S) = λmax (S)
(b) f (M ) = kM k2 = σ1 (M ) where σ1 (M ) is the
maximum singular value of M (f) Let n = m, A, B ∈ Rn×n , and f (M ) =
trace(AM −1 B)
(c) Let n = m and f (M ) = ln det(M )
(g) Let m = n, A, B ∈ Rn×n , and f (M ) =
n×n
(d) Let M ∈ R and f (M ) = trace(M ) trace(M AM T B)

Exercise 6.5. Let k be a positive integer, and A, B, M be invertible matrices with A and B fixed and M a variable.
Find the gradient of the real valued function f (M ) = ln(det(AM k B)).

The Derivative of Functions Rn → Rm


Exercise 6.6 (The Softmax Function). The softmax function is a core component of several machine learning algorithms.
It maps x ∈ Rn to a probability mass function s(x) on n outcomes. It can be written as the composition of
two functions s(x) = q(p(x)), where p : Rn → Rn+ and q : Rn+ → Rn+ are defined by
\[
p(x) = [e^{x_i}], \qquad q(z) = \frac{1}{\mathbf{1}^T z}\, z.
\]
Here Rn+ denotes the positive cone {x ∈ Rn : xi > 0}. p(·) maps x ∈ Rn into the positive cone Rn+ , and for z ∈ Rn+ ,
q(·) normalizes z to a probability mass function in Rn+ .
(a) Determine the derivative of p(x) at x ∈ Rn


(b) Determine the derivative of q(z) at z ∈ Rn+


(c) Determine the derivative of the softmax function at x ∈ Rn

The Derivative of a Matrix Valued Function


Exercise 6.7 (Outer Product). Let f : Rn → Rn×n with f (x) = xxT . Prove that Df (x)(v) = vxT + xv T .
Exercise 6.8. Let A, B, M ∈ Rn×n . Determine when the derivative of each of the functions below exists at M . Under
these conditions find Df (M )(V ).
(a) f (M ) = AM −1 B
(b) f (M ) = M T AM
(c) f (M ) = M BM T
Exercise 6.9. Let M ∈ Rn×n be a differentiable function of a scalar variable t and Ṁ = (d/dt)M .

(a) Show that (d/dt) det(M ) = trace(M −1 Ṁ ) det(M ). (1 line)
(b) For G ∈ Rn×n , use (a) to show that det(e^{Gt}) = e^{trace(G)t}. (2 lines)
(c) The adjunct (or adjugate) of a square matrix M ∈ Rn×n is the matrix adj(M ) ∈ Rn×n formed by taking
the transpose of the cofactor matrix of M . It has the property that adj(M )M = det(M )In . Prove Jacobi's
formula: (d/dt) det(M ) = trace(adj(M )Ṁ ). (1 line)
Exercise 6.10. The adjunct of a square matrix M ∈ Rn×n is defined in Exercise 6.9. For M ∈ Rn×n , let f (M ) =
adj(M ). Determine when the derivative of f exists at M . Under these conditions find Df (M )(H).
Exercise 6.11. Compute the derivative of the following functions.
(a) f (x) = A diag(x)B
Exercise 6.12. We perform gradient descent on the function

f (M ) = | trace(M − diag(xo ))|,

starting from M = Mo . Here M ∈ Rn×n and xo ∈ Rn is a fixed but unknown vector.


(a) For what M does the gradient exist?
(b) Compute the gradient where it exists
(c) Assuming the gradient exists at M0 , find the minimum attainable value of f under gradient descent and the final
value of M
Exercise 6.13. Let the columns of X ∈ Rn×m be given data points. We want to learn M ∈ Rn×d and A ∈ Rd×m
such that X ≈ M A under the constraints that the entries of A are nonnegative and each column of A sums to 1. Pose
this as an optimization problem and determine necessary conditions for an optimal solution.

Chapter 7

Convex Sets and Functions

7.1 Preliminaries
We begin with some simple properties of subsets of Rn . Specifically, the properties of being bounded, open,
closed, and compact. A set S ⊂ Rn is bounded if there exists β > 0 such that for each x ∈ S,
‖x‖ ≤ β. So a bounded set is, as the name implies, bounded in extent.¹ The set S is open if, roughly
speaking, it doesn’t contain any boundary points. For example the interval (0, 1) in R is open. This interval
has two boundary points 0, 1 and these are not contained in the set. A more precise definition is that for
every point x ∈ S there exists δ > 0, such that the ball B = {z : kx − zk ≤ δ} is entirely contained in
S. So every point in S is completely surrounded by other points in S. The set S is closed if it contains its
boundary. For example, the interval [0, 1] in R is closed. By contrast the intervals (0, 1), [0, 1), (0, 1] are
not, since each fails to contain all of its boundary points. A more precise definition is that S is closed if
every convergent sequence of points in S converges to a point in S. So a closed set contains the limits of
its convergent sequences. The subset S is said to be compact if it is both closed and bounded. Compactness
has important implications. For example, the extreme value theorem states that every real-valued continuous
function defined on a compact subset S ⊂ Rn achieves a minimum and maximum value at points in S.

7.2 Convex Sets


For x1 , x2 ∈ Rn the line segment between x1 and x2 is the set of points xα = (1−α)x1 +αx2 , for α ∈ [0, 1].
We can also write xα = x1 + α(x2 − x1 ). This shows that xα starts at x1 when α = 0 and proceeds in a
straight line in the direction x2 − x1 reaching x2 when α = 1.
A subset S ⊂ Rn is convex if for each x1 , x2 ∈ S the line segment joining x1 and x2 is contained in S.
So S is convex if for each x1 , x2 ∈ S and each α ∈ [0, 1], xα = (1 − α)x1 + αx2 ∈ S.
Example 7.2.1. Here are some simple examples of convex subsets of Rn .

(a) The empty set and Rn are convex.

(b) Any interval is a convex subset of R. Conversely, a convex subset of R must be an interval. If the
interval is bounded, then it takes one of the forms: (a, b), (a, b], [a, b), or [a, b], where a, b ∈ R and
a < b. If it is unbounded, then it takes one of the forms (a, ∞), [a, ∞), (−∞, a], (−∞, a), or R. In
all cases, the interior of the interval has the form (a, b), where a could be −∞ and b could be ∞.
¹ By the equivalence of norms on Rn, if S is bounded in some norm, it is bounded in all norms on Rn.


(c) A subspace U ⊂ Rn is convex. If u, v ∈ U, then for α ∈ [0, 1], it is clear that (1 − α)u + αv ∈ U.

(d) A closed half space H of Rn is a set of the form {x : aT x ≤ b} where a ∈ Rn with a 6= 0 and b ∈ R.
Thus H is one side of a hyperplane in Rn, including the hyperplane itself. Every closed half space is
convex. If aT x ≤ b and aT y ≤ b, and α ∈ [0, 1], then

aT ( (1 − α)x + αy ) = (1 − α)aT x + αaT y ≤ (1 − α)b + αb = b.

(e) The unit ball of the 1-norm, B1 = {x : \sum_{i=1}^n |x_i| ≤ 1}, is convex. To see this let x, y ∈ B1 and
α ∈ [0, 1]. Then \sum_{i=1}^n |(1 − α)x_i + αy_i| ≤ (1 − α)\sum_{i=1}^n |x_i| + α\sum_{i=1}^n |y_i| ≤ 1.

7.2.1 Set Operations that Preserve Convexity


The following theorem gives a useful list of set operations that preserve convexity.

Theorem 7.2.1. Convex sets have the following properties:

(a) Closure under intersection: If for each a ∈ A, Sa ⊂ Rn is convex, then ∩a∈A Sa is convex.

(b) Translation: If S ⊂ Rn is convex, and x0 ∈ Rn, then x0 + S = {z : z = x0 + s, s ∈ S} is convex.

(c) Image of a convex set under a linear map: If S ⊂ Rn is convex and F is a linear map from Rn to
Rm , then F (S) = {z ∈ Rm : z = F s, s ∈ S} is convex.

(d) Pre-image of a convex set under a linear map: If S ⊂ Rm is convex and F is a linear map from

Rn to Rm , then F −1 (S) = {x ∈ Rn : F x ∈ S} is convex.

Proof. Exercise.

Example 7.2.2. We can use Theorem 7.2.1 to provide some additional examples of convex sets.

(a) An affine manifold in Rn is a set of the form M = x0 + U = {x = x0 + u, u ∈ U} where x0 ∈ Rn


and U is a subspace of Rn . A subspace is convex. Hence by the translation property, every affine
manifold is convex.

(b) A closed polytope is any bounded region of Rn defined by the intersection of a finite number of closed
half spaces aTj x ≤ bj , j ∈ [1 : p]. Since half spaces are convex, and convex sets are closed under
intersection, a closed polytope is convex.

(c) The unit ball of the max norm, B∞ = {x : kxk∞ ≤ 1}, is a closed polytope defined by the intersection
of the closed half spaces eTj x ≤ 1 and eTj x ≥ −1, j ∈ [1 : n]. Hence it is a (closed) convex set.

(d) For A ∈ Rm×n and b ∈ Rm , the set C = {x : Ax ≤ b} is convex. C is the intersection of the half
spaces {x : Ai,: x ≤ bi } where Ai,: is the i-th row of A and bi is the i-th entry of b. Since half spaces
are convex, and convex sets are closed under intersection, C is convex.



Figure 7.1: Illustration of the concept of a convex function f . This function is strictly convex over the interval [x, y].

7.3 Convex Functions


Let C ⊆ Rn be a convex set. A function f : C → R is convex if for all x, y ∈ C and α ∈ [0, 1],

f ((1 − α)x + αy) ≤ (1 − α)f (x) + αf (y). (7.1)

For α ∈ [0, 1], the point xα = (1 − α)x + αy lies on the line joining x (α = 0) to y (α = 1). On the other
hand, the scalar value fα = (1−α)f (x)+αf (y) is the corresponding linear interpolation of the values f (x)
(α = 0) and f (y) (α = 1). Convexity requires that the value of the function f along the line segment from
x to y is no greater than the corresponding linear interpolation (1 − α)f (x) + αf (y) of the values f (x) and
f (y). This is illustrated in Figure 7.1. A function is strictly convex if for x 6= y and α ∈ (0, 1), (7.1) holds
with strict inequality. Hence strict convexity requires that the value of the function f along the line segment
between two distinct x and y is strictly less than the corresponding linear interpolation (1−α)f (x)+αf (y).
It is easy to see that a strictly convex function is convex, but that not every convex function is strictly convex.
A concept closely related to convexity is concavity. A function f : Rn → R is concave if −f(x)
is convex, and it is strictly concave if −f(x) is strictly convex.
For f : Rn → R and S ⊂ Rn, the restriction of f to S is the function g : S → R with g(x) = f(x)
for each x ∈ S. Clearly, if f is a convex function and C ⊂ Rn is a convex set, then the restriction of f to
C is a convex function. In particular, the restriction of f to any line in Rn is convex, and the restriction of f
to any finite line segment is convex. The last statement is essentially the definition of convexity. Thus the
following statements are equivalent: f is convex on Rn , f is convex on every convex subset of Rn , f is
convex on every line in Rn , and f is convex on every line segment in Rn .
The above idea is often used in the following way. Let f : C → R be a convex function. Pick x, y ∈ C.
Then for t ∈ [0, 1], define the function g : [0, 1] → R by g(t) = f ((1 − t)x + ty). It is easy to see that g
is a convex function on [0, 1]. The converse of this result also holds. If for every x, y ∈ C, g(t) is a convex
function on [0, 1], then f is a convex function.
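The definition (7.1), and its equivalent statement in terms of restrictions to line segments, can be probed numerically. The sketch below (a testing aid, not part of the development) samples random pairs x, y and random α ∈ [0, 1] and looks for violations of (7.1); such a sampled test can refute convexity but cannot prove it. The test functions are arbitrary examples.

    import numpy as np

    def violates_convexity(f, n, trials=10000, tol=1e-9, rng=np.random.default_rng(1)):
        # Search for x, y, alpha with f((1-a)x + a y) > (1-a)f(x) + a f(y) + tol.
        for _ in range(trials):
            x, y = rng.standard_normal(n), rng.standard_normal(n)
            a = rng.uniform()
            if f((1 - a) * x + a * y) > (1 - a) * f(x) + a * f(y) + tol:
                return True
        return False

    f_convex = lambda x: np.sum(np.abs(x)) + np.dot(x, x)      # ||x||_1 + ||x||_2^2
    f_nonconvex = lambda x: np.sqrt(np.sum(np.abs(x)))         # sqrt of the 1-norm
    print(violates_convexity(f_convex, 5))     # expected: False (no violation found)
    print(violates_convexity(f_nonconvex, 5))  # expected: True (violations are easy to find)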

7.3.1 Examples of Convex Functions


The following theorem gives some examples of convex functions on R and Rn. In each case, we show that the
function is convex by directly verifying the condition in the definition of convexity.


Theorem 7.3.1. Here are some simple convex functions:

(a) For every b ∈ R, the constant function f (x) = b is a convex function on R.

(b) For every a ∈ R, the linear function f (x) = ax is a convex function on R.

(c) f (x) = x2 is a strictly convex function on R.

(d) Every norm on Rn is a convex function.

Proof. (a) Let x, y ∈ R, and α ∈ [0, 1]. Then

f ( (1 − α)x + αy ) = b = (1 − α)b + αb = (1 − α)f (x) + αf (y).

Hence f is convex but not strictly convex.


(b) Let x, y ∈ R, and α ∈ [0, 1]. Then

f ( (1 − α)x + αy ) = a(1 − α)x + aαy


= (1 − α)(ax) + α(ay)
= (1 − α)f (x) + αf (y).

Hence f is convex but not strictly convex.


(c) Let x, y ∈ R, with x 6= y, and α ∈ (0, 1). Then

(1 − α)x2 + αy 2 − ( (1 − α)x + αy )2 = ( (1 − α) − (1 − α)2 )x2 + (α − α2 )y 2 − 2(1 − α)αxy


= α(1 − α)(x2 + y 2 − 2xy)
= α(1 − α)(x − y)2
>0

Hence ( (1 − α)x + αy )2 < (1 − α)x2 + αy 2 . Thus f is strictly convex.


(d) Let x, y ∈ Rn , and α ∈ [0, 1]. Then by the triangle inequality, k(1 − α)x + αyk ≤ (1 − α)kxk + αkyk.
Thus k · k is a convex function.

7.3.2 Jensen’s Inequality


When α_i ≥ 0 and \sum_{i=1}^k α_i = 1, we call \sum_{i=1}^k α_i x_i a convex combination of x_1, . . . , x_k. The definition of
a convex function requires that the value of a function f at a convex combination of two points is bounded
above by the same convex combination of the values of f at those points. An obvious generalization is that
the value of a convex function at a convex combination of k points is at most the same convex combination
of the function values at those points. This property is called Jensen’s inequality.

Theorem 7.3.2 (Jensen's Inequality). Let C be a convex set and f : C → R be a convex function. If
{x_i}_{i=1}^k ⊂ C, and {α_i}_{i=1}^k is a set of nonnegative scalars with \sum_{i=1}^k α_i = 1, then

f( \sum_{i=1}^k α_i x_i ) ≤ \sum_{i=1}^k α_i f(x_i).

Proof. Exercise.
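Jensen's inequality can be illustrated numerically. The sketch below (an illustration only) draws random points and random convex weights, generated here with a Dirichlet distribution as an arbitrary choice, and confirms that f of the combination never exceeds the combination of the f-values, using f = ‖·‖_2.

    import numpy as np

    rng = np.random.default_rng(2)
    f = lambda x: np.linalg.norm(x, 2)        # a convex function (every norm is convex)

    ok = True
    for _ in range(1000):
        X = rng.standard_normal((7, 3))       # k = 7 points in R^3 (rows)
        alpha = rng.dirichlet(np.ones(7))     # alpha_i >= 0, sum_i alpha_i = 1
        lhs = f(alpha @ X)                    # f( sum_i alpha_i x_i )
        rhs = alpha @ np.array([f(x) for x in X])
        ok &= lhs <= rhs + 1e-12
    print(ok)   # expected: True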


Figure 7.2: Illustration of the vertices of the unit 1-norm balls in Rk , for k = 1, 2, 3.


Example 7.3.1 (An illustration of convex combinations). We show that any point in the set B1 = {x ∈
Rn : ‖x‖_1 ≤ 1} (the unit ball of the 1-norm) can be written as a convex combination of its 2n vertices
{v_i}_{i=1}^{2n} = {e_1, . . . , e_n, −e_1, . . . , −e_n}. These vertices are illustrated in Figure 7.2.
Let x ∈ B1. First assume that \sum_{i=1}^n |x_i| = 1. We know that for a unique set of scalars {α_i}_{i=1}^n,
x = \sum_{i=1}^n α_i e_i. Clearly, \sum_{i=1}^n |α_i| = 1. Define scalars {β_j}_{j=1}^{2n} as follows:

for j ∈ [1 : n]:       β_j = α_j if α_j ≥ 0, and 0 otherwise;
for j ∈ [n + 1 : 2n]:  β_j = −α_{j−n} if α_{j−n} < 0, and 0 otherwise.

Then x = \sum_{j=1}^{2n} β_j v_j with β_j ≥ 0 and \sum_j β_j = 1.
Now assume x is in the interior of B1. If x = 0, then x = (1/2)e_1 − (1/2)e_1 = (1/2)v_1 + (1/2)v_{n+1}. Hence
assume x ≠ 0. Consider the line that passes through 0 and x. This line intersects the boundary of B1 at
two points a, b ∈ Rn with \sum_{i=1}^n |a_i| = \sum_{i=1}^n |b_i| = 1. Since x lies on the line segment from a to b, we can
write x = (1 − γ)a + γb for some γ ∈ [0, 1]. By construction, a, b are on the boundary of B1. So each is
a convex combination of the vertices: a = \sum_{j=1}^{2n} α_j v_j and b = \sum_{j=1}^{2n} β_j v_j. Then

x = (1 − γ)\sum_{j=1}^{2n} α_j v_j + γ\sum_{j=1}^{2n} β_j v_j = \sum_{j=1}^{2n} ( (1 − γ)α_j + γβ_j )v_j,

with (1 − γ)α_j + γβ_j ≥ 0 and \sum_{j=1}^{2n} ( (1 − γ)α_j + γβ_j ) = 1. Hence every x ∈ B1 is a convex
combination of its vertices {v_j}_{j=1}^{2n}.

Example 7.3.2 (An application of Jensen's inequality). Let B1 denote the 1-norm unit ball in Rn. From
Examples 7.2.1 and 7.3.1 we know that B1 is a convex set, and that any x ∈ B1 can be written in the form
x = \sum_{j=1}^{2n} α_j v_j with α_j ≥ 0 and \sum_{j=1}^{2n} α_j = 1, where the points v_j ∈ Rn are the 2n vertices of B1. Now
consider a convex function f : B1 → R. By Jensen's inequality, we have

f(x) = f( \sum_{j=1}^{2n} α_j v_j ) ≤ \sum_{j=1}^{2n} α_j f(v_j) ≤ max_j f(v_j).

Hence the values of f on B1 are bounded above by the maximum value of f on the vertices of B1 .

7.3.3 Operations on Convex Functions


A variety of standard operations on functions preserve convexity. This allows known convex functions to
be combined to form new convex functions. Conversely, it allows the identification of a convex function by
noting that it is formed from convex functions using such operations.


Theorem 7.3.3. Let f and g be convex functions on Rn . Then the following functions are convex:

(a) h(x) = βf (x) where β ≥ 0.

(b) h(x) = f (x) + g(x).

(c) h(x) = max{f (x), g(x)}.

(d) h(x) = f (Ax + b) where A ∈ Rn×m and b ∈ Rn .

(e) h(x) = g(f (x)) where g : R → R is convex and nondecreasing on the range of f .

(f) h(x) = limk→∞ fk(x) where {fk}k≥1 is a sequence of convex functions converging pointwise to h.

Proof. Let α ∈ [0, 1] and x, y ∈ Rn .


(a) Exercise.
(b) By the definition of h:

h((1 − α)x + αy) = f ((1 − α)x + αy) + g((1 − α)x + αy)


≤ (1 − α)f (x) + αf (y) + (1 − α)g(x) + αg(y) f, g are convex
= (1 − α)(f(x) + g(x)) + α(f(y) + g(y))
= (1 − α)h(x) + αh(y).

(c) By the definition of h:

h((1 − α)x + αy) = max{f ((1 − α)x + αy), g((1 − α)x + αy)}
≤ max{(1 − α)f (x) + αf (y), (1 − α)g(x) + αg(y)} f, g are convex
≤ (1 − α) max{f (x), g(x)} + α max{f (y), g(y)}
= (1 − α)h(x) + αh(y).

(d) By the definition of h:

h((1 − α)x + αy) = f ((1 − α)(Ax) + α(Ay) + b)


= f ((1 − α)(Ax + b) + α(Ay + b))
≤ (1 − α)f (Ax + b) + αf (Ay + b) f is convex
= (1 − α)h(x) + αh(y).

(e) Since f is convex, f ((1 − α)x + αy) ≤ (1 − α)f (x) + αf (y). Then since g is nondecreasing

h((1 − α)x + αy) = g(f ((1 − α)x + αy)) ≤ g((1 − α)f (x) + αf (y))
≤ (1 − α)g(f (x)) + αg(f (y)) g is convex
= (1 − α)h(x) + αh(y).

(f) Let x, y ∈ Rn , α ∈ [0, 1], and xα = (1 − α)x + αy. By assumption, limk→∞ fk (xα ) = h(xα ). Let

gk (xα ) = fk (xα )−(1−α)fk (x)−αfk (y). By the convexity of fk , gk (xα ) ≤ 0, and then by the assumption
of pointwise convergence, limk→∞ gk (xα ) = h(xα ) − (1 − α)h(x) − αh(y) ≤ 0. Thus h is convex.

The following theorem illustrates applications of the results in Theorem 7.3.3.


Theorem 7.3.4. Some additional convex functions:

(a) For a ≥ 0, b, c ∈ R, f (x) = ax2 + bx + c is a convex function on R.

(b) For b ∈ Rn , c ∈ R, the affine function f (x) = bT x + c is convex on Rn .

(c) For x = (x1 , . . . , xn ) ∈ Rn , the coordinate projection f (x) = xi is convex.

(d) For symmetric positive semidefinite P ∈ Rn×n , f (x) = xT P x is a convex function on Rn .

(e) For symmetric positive semidefinite P ∈ Rn×n, and b ∈ Rn, c ∈ R, f(x) = xT P x + bT x + c is


a convex function on Rn .

Proof. We use the results in Theorem 7.3.3.


(a) By Theorem 7.3.1, g(x) = x2 , g(x) = x and g(x) = c are convex functions on R. Hence by Theorem
7.3.3 and the assumption a ≥ 0, f (x) = ax2 + bx + c is a convex function.
(b) The function g(x) = x in convex on R, and h(x) = bT x + c is an affine function from Rn to R. Hence
by Theorem 7.3.3, g(h(x)) = bT x + c is convex.
(c) This is a special case of (b).
(d) The norm ‖·‖_2 is convex and hence the function ‖P^{1/2}x‖_2 is convex (Theorem 7.3.3(d)), and it clearly
takes nonnegative values. The function g(x) = x^2 is convex (Theorem 7.3.1) and monotone increasing on
the nonnegative reals. Hence x^T P x = ‖P^{1/2}x‖_2^2 is a convex function (Theorem 7.3.3(e)).
(e) By (d) and (b), x^T P x and b^T x + c are convex functions, so their sum is convex by Theorem 7.3.3(b).

7.3.4 Strongly Convex Functions

A function f : Rn → R is strongly convex if for some c > 0, f(x) − (c/2)‖x‖_2^2 is convex. When this holds, we
say that f is strongly convex with modulus c.

Lemma 7.3.1. A strongly convex function is strictly convex.

Proof. Exercise.

Example 7.3.3. These examples are selected to highlight the distinctions between convexity, strict convexity
and strong convexity.

(a) Let f : R → R with f(x) = ax + b, for a, b ∈ R. Then f(x) is convex, but not strictly convex, on
R. You can see this directly from the derivation in the proof of Theorem 7.3.1(b).

(b) Let f : R → R with f(x) = x^2. Then f(x) is strictly convex (Theorem 7.3.1(c)). It is also strongly
convex. To see this let 0 < c ≤ 2 and set g(x) = f(x) − (c/2)x^2. Then g(x) = x^2 − (c/2)x^2 = (1 − c/2)x^2.
Since g(x) is a nonnegative scaling of the convex function x^2, it is convex. Thus f(x) is strongly
convex on R.

(c) Consider the function fk : R → R with fk (x) = x2k for an integer k ≥ 0. The function f0 (x) = 1
is convex but not strictly convex. The function f1 (x) = x2 is strongly convex (see (b) above) and
hence also strictly convex. The function f2 (x) = x4 is strictly convex but not strongly convex. First


we show that it is strictly convex: for x ≠ y and α ∈ (0, 1),

( (1 − α)x + αy )^4 = ( (1 − α)x + αy )^2 ( (1 − α)x + αy )^2
    < ( (1 − α)x^2 + αy^2 )^2        by the strict convexity of f_1
    = ( (1 − α)u + αv )^2,           u = x^2, v = y^2
    ≤ (1 − α)u^2 + αv^2              by the convexity of f_1
    = (1 − α)x^4 + αy^4.

To show that it is not strongly convex, pick c > 0 and consider

g(x) = x^4 − (c/2)x^2 = x^2( x^2 − c/2 ).

We see that g is an even function, g(0) = 0, g(x) < 0 for x^2 < c/2, and g(x) > 0 for x^2 > c/2. Thus it
can't be convex for any c > 0. So f_2(x) = x^4 is not strongly convex. The same argument shows that
the functions f_{2k}(x) = x^{2k} for k > 1 are strictly convex but not strongly convex.
Like convexity, strict convexity and strong convexity are preserved by a variety of standard operations
on functions. A list of these operations would be similar to, but not identical to those given in Theorem
7.3.3. These operations are examined in the exercises.

7.3.5 The Continuity of Convex Functions


The following theorem shows that convex functions are always continuous on the interior of their domain.

Theorem 7.3.5. A convex function defined on a convex set C ⊂ Rn is continuous on the interior of C.

Proof. See Appendix 7.7

7.4 Differentiable Convex Functions


We now consider functions that are differentiable and ask if the derivative can be used to verify convexity.
This leads to conditions for convexity of two forms: first order conditions involving the derivative of f , and
second order conditions involving the second derivative of f . We first develop these conditions for functions
defined on the real line, or an open interval I of the real line. We then extend the conditions to functions defined
on an open convex subset of Rn.

7.4.1 Differentiable Convex Functions on R


First Order Conditions for Convexity
Let f : I → R be differentiable on an open interval I ⊂ R. For fixed x ∈ I and variable y ∈ I, the line
f (x) + f 0 (x)(y − x) defines the tangent to f at x. The following theorem shows that f is convex if and only
if this linear function of y is a lower bound for f (y). This is illustrated in the left plot of Figure 7.3.

Theorem 7.4.1. Let f : I → R be differentiable on an open interval I. Then f is convex if and only if
for all x, y ∈ I,
f (y) ≥ f (x) + f 0 (x)(y − x). (7.2)


Proof. See Appendix 7.7

It is possible to extend Theorem 7.4.1 to cover strictly convex and strongly convex functions. These
extensions are given in the following corollary. Part (a) is proved in Appendix 7.7. The proof of part (b) is
left as an exercise.

Corollary 7.4.1. Let f : I → R be differentiable on an open interval I. Then

(a) f is strictly convex if and only if for all x, y ∈ I with y 6= x, f (y) > f (x) + f 0 (x)(y − x).

(b) f is strongly convex if and only if there exists c > 0 such that for all x, y ∈ I with y ≠ x,
f(y) ≥ f(x) + f'(x)(y − x) + (c/2)(y − x)^2.

Second Order Conditions for Convexity

If f has a second derivative, then the convexity of f is determined by the sign of f 00 (x).

Theorem 7.4.2. Let f : I → R be twice differentiable on an open interval I. Then f is convex on I if


and only if at each point x ∈ I, f 00 (x) ≥ 0.

Proof. See Appendix 7.7

The second derivative is the rate of change of the slope f 0 (x) of f at x. In these terms, f is convex if and
only if the slope f 0 (x) is nondecreasing.
There is a partial extension of Theorem 7.4.2 to strictly convex functions, and a full extension to strongly
convex functions. These extensions are given in the following corollary. The proof is left as exercise.

Corollary 7.4.2. Let f : I → R be twice differentiable on an open interval I.

(a) If for all x ∈ I, f 00 (x) > 0, then f is strictly convex on I.

(b) f is strongly convex on I if and only if there exists c > 0 such that for all x ∈ I, f 00 (x) ≥ c.

Example 7.4.1. The following examples illustrate the application of Theorem 7.4.2 and its corollaries.

(a) f (x) = x2 has f 00 (x) = 2 > 0. Hence f is strictly convex on R. In addition, at each point x ∈ R,
f 00 (x) ≥ 2. Hence f is strongly convex on R.

(b) f (x) = x4 has f 00 (x) = 12x2 . Using Corollary 7.4.2 we conclude that f (x) is strictly convex on
every open interval not containing 0. However, we know that f (x) = x4 is strictly convex on R. So
in general, f 00 (x) > 0 for all x ∈ I, is sufficient but not necessary for the strict convexity on I.

(c) f(x) = √x for x ∈ (0, ∞). f''(x) = −(1/4)x^{−3/2} is negative for x > 0. Hence f is not convex on
(0, ∞). However, −√x is strictly convex on (0, ∞). Hence f(x) is strictly concave on (0, ∞).

(d) f (x) = x ln x for x ∈ (0, ∞). We have f 0 (x) = ln x + 1 and f 00 (x) = 1/x. Since f 00 (x) > 0 at each
x ∈ (0, ∞), f is strictly convex on (0, ∞).
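A quick numerical counterpart to Example 7.4.1: the sketch below approximates f'' on a grid and reports the smallest value found, which suggests, but of course does not prove, convexity, strict convexity, or strong convexity on the sampled interval. The grid and the example functions are arbitrary choices.

    import numpy as np

    def second_derivative(f, x, h=1e-5):
        # Central difference approximation of f''(x).
        return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

    grid = np.linspace(0.1, 5.0, 500)   # sample points in an interval avoiding 0
    for name, f in [("x**2", lambda x: x**2),
                    ("x**4", lambda x: x**4),
                    ("sqrt(x)", np.sqrt),
                    ("x*log(x)", lambda x: x * np.log(x))]:
        vals = np.array([second_derivative(f, x) for x in grid])
        print(f"{name:10s}  min f'' on grid = {vals.min():.4f}")
    # x**2 stays near 2 (strongly convex); x**4 is positive here but approaches 0 near the
    # origin (strictly, not strongly, convex); sqrt(x) is negative (concave); x*log(x) is positive.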



Figure 7.3: Left: Illustration of the bound (7.3). The function f (v) is bounded below by the best linear approximation
to the function at any point x. Right: Illustration of the bound (7.3) applied at the point xα . We see that f (x) ≥ g(x)
and f (y) ≥ g(y). Hence (1 − α)f (x) + αf (y) ≥ (1 − α)g(x) + αg(y) = f (xα ).

7.4.2 Differentiable Convex Functions on Rn


First Order Conditions for Convexity
As illustrated in left plot of Figure 7.3, a differentiable function is convex if and only if at every point in the
interior of its domain the best local linear approximation to f is a global lower bound for f .

Theorem 7.4.3. Let f : C → R be differentiable on an open convex set C ⊂ Rn . Then f is convex if


and only if for all x, y ∈ C,
f (y) ≥ f (x) + Df (x)(y − x). (7.3)

Proof. See Appendix 7.7

You will often see equation (7.3) written in the following equivalent form using the gradient of f :

f (y) ≥ f (x) + ∇f (x)T (y − x). (7.4)

The following corollary gives the extension of Theorem 7.4.3 to strictly convex and strongly convex
functions. The proof is left as an exercise.

Corollary 7.4.3. Let f : C → R be differentiable on an open convex set C ⊂ Rn .

(a) f is strictly convex if and only if for all x, y ∈ C with y ≠ x, f(y) > f(x) + Df(x)(y − x).

(b) f is strongly convex if and only if there exists c > 0 such that for all x, y ∈ C, f(y) ≥ f(x) +
Df(x)(y − x) + (c/2)‖y − x‖_2^2.

Second Order Conditions for Convexity


Now assume that f is a C 2 function, i.e., f is twice continuously differentiable. Then to second order, the
Taylor series expansion of f at x is

f (y) = f (x) + ∇f (x)T (y − x) + 1/2(y − x)T Hf (x)(y − x),


where Hf(x) ∈ Rn×n is the Hessian matrix of f at x. Comparison of this equation with (7.3) suggests that
f is convex if and only if Hf(x) is a positive semidefinite matrix for each x in the domain of f. This leads to
the following multivariable analog of Theorem 7.4.2.

Theorem 7.4.4. A C 2 function f on an open convex set C ⊂ Rn is convex if and only if at each x ∈ C
the Hessian matrix Hf (x) is positive semidefinite.

Proof. See Appendix 7.7

Here are the generalizations for strict convexity and strong convexity. The proofs are left as exercises.

Corollary 7.4.4. Let f : C → R be a C 2 function on an open convex set C.

(a) If for all x ∈ C, Hf (x) is positive definite, then f is strictly convex on C.

(b) f is strongly convex on C if and only if there exists c > 0 such that for all x ∈ C, Hf (x) − cIn is
positive semidefinite.
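For a concrete instance of Theorem 7.4.4 and Corollary 7.4.4, take f(x) = x^T P x + b^T x + c with P symmetric: its Hessian is the constant matrix 2P, so convexity reduces to an eigenvalue check. A minimal Python sketch, with arbitrarily chosen matrices:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((5, 5))
    P_psd = A @ A.T              # symmetric positive semidefinite (positive definite with prob. 1)
    P_indef = A + A.T            # symmetric but typically indefinite

    def classify_quadratic(P, tol=1e-10):
        # f(x) = x^T P x + b^T x + c with symmetric P has constant Hessian Hf(x) = 2P,
        # so convexity reduces to checking the smallest eigenvalue of P.
        lam_min = np.linalg.eigvalsh(P).min()
        if lam_min > tol:
            return f"strongly convex (modulus about {2 * lam_min:.3f})"
        if lam_min >= -tol:
            return "convex but not strongly convex"
        return "not convex"

    print(classify_quadratic(P_psd))
    print(classify_quadratic(P_indef))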

7.5 Minimizing a Convex Function


Convex functions are frequently used as objective functions in optimization problems. Here we consider
how the property of convexity gives insight into the existence and uniqueness of a point that minimizes a
convex function over a convex set.

7.5.1 Sublevel Sets


For any function f : Rn → R, the c-sublevel set of f is the set of points

Lc = {x : f (x) ≤ c}.

Sublevel sets are nested in the sense that for a ≤ b, La ⊆ Lb . In particular, if f has a global minimum at x? ,
then the set of x that achieve the global minimum value is Lf (x? ) , and for any x0 ∈ Rn , Lf (x? ) ⊆ Lf (x0 ) .
For convex functions we can say more.

Theorem 7.5.1. The sublevel sets of a convex function f : Rn → R are convex and closed.

Proof. If Lc = ∅, then it is closed and convex. Otherwise let x, y ∈ Lc and α ∈ [0, 1]. Then f((1 − α)x + αy) ≤
(1 − α)f(x) + αf(y) ≤ c. Hence Lc is convex. Since f is convex on Rn it is continuous on Rn (Theorem 7.3.5). Thus
if {xk} ⊂ Lc with xk converging to x, then by the continuity of f, limk→∞ f(xk) = f(x). Finally, since
f(xk) ≤ c, limk→∞ f(xk) = f(x) ≤ c. Thus Lc is closed.

If the sublevel sets of a continuous function are bounded, then by the extreme value theorem, there
exists x? ∈ Rn such that f achieves a global minimum value at x? . In general, a convex function f need
not have bounded sublevel sets. For example, the linear function f (x) = ax + b on R is convex and has
unbounded sublevel sets. Hence convexity alone does not ensure the existence of a minimizing point x?.
Strong convexity, however, is sufficient.


Theorem 7.5.2. The sublevel sets of strongly convex functions are bounded.

Proof. See Appendix 7.7

7.5.2 Local Minima


We now turn our attention to the local minima of a convex function.
Theorem 7.5.3. Let f : Rn → R be a convex function. Then:

(a) If f has a local minimum, then it is a global minimum of f.

(b) The subset of Rn on which f attains a global minimum is convex.

(c) If f is strictly convex and has a local minimum, then it is attained at a unique point and is the global minimum.

(d) If f is strongly convex, then there exists a unique x? ∈ Rn at which f has a local, and hence
global, minimum.

Proof. (a) Assume f has a local minimum at x?, with f(x?) = c. Then there exists r > 0 such that for all x
with ‖x − x?‖_2 < r, f(x) ≥ c. Suppose that for some z ∈ Rn, f(z) < c. Let xα = (1 − α)x? + αz with
α ∈ (0, 1). Then for α > 0 sufficiently small, ‖xα − x?‖_2 < r and f(xα) ≤ (1 − α)f(x?) + αf(z) < c; a
contradiction. Hence f has a global minimum at x?.
(b) If f has no local minimum, then the set of global minimizers is empty, which is a convex set. Now assume f
has a global minimum at x? ∈ Rn with f(x?) = c. The set of all points at which f attains its global minimum is the
sublevel set Lc = {x : f(x) ≤ c}. By Theorem 7.5.1, this set is convex.
(c) Assume f has a local minimum at x? with f(x?) = c. Then by (a), f has a global minimum at x?. If f attained its
global minimum at two distinct points x? and y?, then f would attain its global minimum at all points on the line segment joining
x? and y?. But this violates the strict convexity of f.
(d) By Theorems 7.5.1 and 7.5.2, since f is strongly convex, its sublevel sets are compact and convex. The continuity
of f on Rn and the extreme value theorem then ensure that f achieves a minimum value over each nonempty
sublevel set. Select a sublevel set La with nonempty interior. Then f achieves a minimum value at some
point x? ∈ La. If this point is on the boundary of La, then f(x?) = a. But La has interior points y with
f(y) < a; a contradiction. So x? is an interior point of La. Hence f has a local minimum at x?. Since
strong convexity implies strict convexity, by (c) x? is the unique point at which f has a local minimum.

In summary, convexity ensures that if f has a local minimum, it is a global minimum. Moreover, the set of
points at which f attains its global minimum is itself convex. Strict convexity allows us to conclude that f has at
most one local minimizer. However, this leaves the question of the existence of a local minimum unresolved. A
strongly convex function always has a local minimum. The previous results then ensure that this is attained at the unique
point at which f achieves its global minimum value.
Strong convexity is sufficient, but not necessary, for the existence of a local minimum. For example the
function f(x) = x^4 is not strongly convex, but it has a unique local minimum at x = 0.

7.5.3 Minimizing a Differentiable Convex Function


Let f : Rn → R be a differentiable convex function and C ⊆ Rn be a convex set. Recall that ∇f (x? ) is the
direction of greatest increase in f at x? . Now suppose x? minimizes f (x) over C. Then f (x) can’t decrease
as x moves from x? in any direction h that keeps x ∈ C. Hence for such h, ∇f (x? )T h ≥ 0. In particular,


for y ∈ C, x? + α(y − x? ) lies in C for α ∈ [0, 1]. It follows that if x? minimizes f (x) over C, then for each
y ∈ C,
∇f (x? )T (y − x? ) ≥ 0. (7.5)
It turns out that (7.5) is both necessary and sufficient for x? to minimize f over C. Here is a formal
statement of this result.

Theorem 7.5.4. Let f : Rn → R be a differentiable convex function and C be a convex subset of Rn .


A point x? ∈ C minimizes f over C if and only if for each y ∈ C

Df (x? )(y − x? ) ≥ 0. (7.6)

Proof. See Appendix 7.7

Theorem 7.5.4 allows for the possibility that x? is a boundary point of C. If we exclude this possibility,
then a stronger result holds. Specifically, if C is an open set, then x? ∈ C minimizes f (x) over C if and only
if Df (x? ) = 0.

Corollary 7.5.1. Let C be an open convex subset of Rn and f : C → R be a differentiable convex


function on C. A point x? ∈ C minimizes f over C if and only if

Df (x? ) = 0. (7.7)

Proof. See Appendix 7.7
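To connect Corollary 7.5.1 with computation, consider the strongly convex quadratic f(x) = (1/2)‖Ax − y‖_2^2 + (λ/2)‖x‖_2^2, an example chosen for this sketch: the condition Df(x?) = 0 can be solved in closed form, and gradient descent drives the gradient to zero. The data, step size, and iteration count below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(4)
    A, y, lam = rng.standard_normal((30, 8)), rng.standard_normal(30), 0.1

    grad = lambda x: A.T @ (A @ x - y) + lam * x          # Df(x) = 0 characterizes the minimizer
    H = A.T @ A + lam * np.eye(8)                         # constant Hessian (positive definite)
    x_star = np.linalg.solve(H, A.T @ y)                  # closed-form solution of grad(x) = 0

    x = np.zeros(8)
    step = 1.0 / np.linalg.eigvalsh(H).max()              # a safe constant step size
    for _ in range(2000):
        x = x - step * grad(x)

    print(np.linalg.norm(grad(x)), np.linalg.norm(x - x_star))   # both should be near zero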

7.6 Projection onto a Convex Set


Now we consider a distinct but related problem. Let C denote a nonempty, closed, convex subset of Rn . We
show that for any y ∈ Rn , there exists a unique point ŷ ∈ C that minimizes the Euclidean distance to y. This
point is called the projection of y onto C.

Theorem 7.6.1. For each y ∈ Rn , there exists a unique ŷ ∈ C that is closest to y in the Euclidean norm.

Proof. If y ∈ C, then ŷ = y. Suppose y ∉ C. Since C is nonempty, there exists r > 0 such that R =
{x : ‖y − x‖_2 ≤ r} ∩ C ≠ ∅. The set R is a closed and bounded, and hence compact, subset of C. The
function f(x) = ‖y − x‖_2 is a continuous function defined on Rn. Hence by the extreme value theorem, f
achieves a minimum on the compact set R. So there exists ŷ ∈ R ⊆ C minimizing the Euclidean distance
to y over all points in R, and hence over all points in C. That ŷ is unique follows by noting that g(x) = ‖y − x‖_2^2
is strictly convex and has the same minimizers over C as f: if two distinct points of C were both closest to y, their
midpoint would lie in C and, by strict convexity, be strictly closer to y; a contradiction. So ŷ is the unique point in C closest to y.

Theorem 7.5.4 can be used to give the following characterization of the point ŷ. This characterization is
illustrated in Figure 7.4.

Lemma 7.6.1. For y ∈ Rn , z = ŷ if and only if for each x ∈ C, (y − z)T (x − z) ≤ 0.

Proof. f (x) = ky − xk22 is convex and differentiable with Df (x)(h) = −2(y − x)T h. By Theorem 7.5.4,
z minimizes f (x) over C if and only if for each x ∈ C, (y − z)T (x − z) ≤ 0.
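A minimal sketch of two projections that have simple closed forms (projection onto a Euclidean ball and onto a coordinate box), together with a sampled check of the optimality condition of Lemma 7.6.1. The specific sets, radii, and sample sizes are arbitrary choices, and the sampled check is an illustration, not a proof.

    import numpy as np

    def project_ball(y, r=1.0):
        # P_C for C = {x : ||x||_2 <= r}: scale y back to the ball if it lies outside.
        nrm = np.linalg.norm(y)
        return y if nrm <= r else (r / nrm) * y

    def project_box(y, lo=-1.0, hi=1.0):
        # P_C for C = {x : lo <= x_i <= hi}: clip each coordinate.
        return np.clip(y, lo, hi)

    rng = np.random.default_rng(5)
    for project, sample in [(project_ball, lambda: rng.standard_normal(4) * 0.5),
                            (project_box,  lambda: rng.uniform(-1, 1, 4))]:
        y = 5 * rng.standard_normal(4)          # a point typically outside C
        yhat = project(y)
        # Lemma 7.6.1: (y - yhat)^T (x - yhat) <= 0 for every x in C (checked on random samples).
        checks = [np.dot(y - yhat, project(sample()) - yhat) <= 1e-10 for _ in range(1000)]
        print(all(checks))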



Figure 7.4: Left: An illustration of the projection ŷ of a point y onto a closed convex set C. For each point x ∈ C,
the angle between y − ŷ and x − ŷ must be at least π/2. Right: An illustration of the non-expansive property of the
projection. For all points y1, y2, the distance between ŷ1 and ŷ2 is at most the distance between y1 and y2.

7.6.1 Nonexpansive Functions


A function f : Rn → Rn is said to be nonexpansive if for each y1, y2 ∈ Rn, ‖f(y1) − f(y2)‖_2 ≤ ‖y1 − y2‖_2.
So f is nonexpansive if the action of f on any pair of points does not move the points further apart.
It is easy to see that a nonexpansive function must be continuous. To show this let y ∈ Rn and
{xk} ⊂ Rn be a sequence of points converging to y. Then ‖f(y) − f(xk)‖_2 ≤ ‖y − xk‖_2 → 0, as k → ∞.
Thus f is continuous at y. Since y was an arbitrary point, it follows that f is continuous at every point.

7.6.2 The Projection Function PC


For a nonempty, closed, convex set C ⊂ Rn, let PC : Rn → C denote the projection function that maps y ∈ Rn
to the closest point ŷ ∈ C to y.

Lemma 7.6.2. The projection function PC is nonexpansive.

Proof. If ŷ1 = ŷ2, then ‖ŷ1 − ŷ2‖_2 = 0 ≤ ‖y1 − y2‖_2 and the result holds. Hence assume ŷ1 ≠ ŷ2. By Lemma 7.6.1, (y1 −
ŷ1 )T (ŷ2 − ŷ1 ) ≤ 0 and (y2 − ŷ2 )T (ŷ1 − ŷ2 ) ≤ 0. Adding these expressions and expanding we find

0 ≥ (y1 − ŷ1 )T (ŷ2 − ŷ1 ) + (y2 − ŷ2 )T (ŷ1 − ŷ2 )


= y1T ŷ2 − y1T ŷ1 − ŷ1T ŷ2 + ŷ1T ŷ1 + y2T ŷ1 − y2T ŷ2 − ŷ2T ŷ1 + ŷ2T ŷ2 .

By regrouping terms and using the Cauchy-Schwarz inequality this can be written as

0 ≥ kŷ1 − ŷ2 k22 + (y1 − y2 )T (ŷ2 − ŷ1 )


≥ kŷ1 − ŷ2 k22 − ky1 − y2 k2 kŷ2 − ŷ1 k2 .

Since ŷ1 6= ŷ2 , this yields kŷ1 − ŷ2 k2 ≤ ky1 − y2 k2 .


Now consider the distance from y to C. This is given by dC (y) = ky − PC (y)k2 . This is a well-defined
function of y that is zero for y ∈ C and positive for y ∉ C. As one might expect, dC is a convex function.

Lemma 7.6.3. The distance function dC (y) = ky − PC (y)k2 is convex.


Proof. Let y1 , y2 ∈ Rn and ŷi = PC (yi ), i = 1, 2. Then for α ∈ [0, 1], let yα = (1 − α)y1 + αy2 and
ŷα = (1 − α)ŷ1 + αŷ2 . Since C is convex, ŷα ∈ C, and hence dC (yα ) ≤ kyα − ŷα k2 . Thus

dC (yα ) ≤ kyα − ŷα k2


= k(1 − α)(y1 − ŷ1 ) + α(y2 − ŷ2 )k2
≤ (1 − α)k(y1 − ŷ1 )k2 + αk(y2 − ŷ2 )k2
= (1 − α)dC (y1 ) + αdC (y2 ).

7.7 Appendix: Proofs


Proof of Theorem 7.3.5. Pick a point z ∈ C. Since C is open, there exists a 1-norm ball centered at z that
is contained in C. Let this be B = {x : \sum_{i=1}^n |x_i − z_i| ≤ δ}. Let x be any point in this ball other than z, and
consider the line passing through z and x. This line intersects the boundary of B at two points a and b with
\sum_{i=1}^n |a_i − z_i| = \sum_{i=1}^n |b_i − z_i| = δ. f(a) and f(b) are bounded above by the maximum value of f on the
vertices of B (see Example 7.3.2). Denote this upper bound by c. So f (a) ≤ c and f (b) ≤ c.
Let x lie on the line segment from z to a. Then x = z + α(a − z), with α ∈ [0, 1]. In this case,
x − z = α(a − z), and by taking the 1-norm of both sides we find

α = ‖x − z‖_1 / δ.   (7.8)
Now write x = (1 − α)z + αa, and use the convexity of f to obtain

f (x) ≤ (1 − α)f (z) + αf (a) ≤ f (z) + α(c − f (z)). (7.9)

Similarly, the point z lies on the line segment from x to b. So z = x + β(b − x). The 1-norm of b − x is
δ(1 + α) and that of z − x is δα, with α given by (7.8). Hence

β = δα / ( δ(1 + α) ) = α / (1 + α).   (7.10)

Now write z = (1 − β)x + βb, and use the convexity of f and (7.10) to obtain

f(z) ≤ (1 − β)f(x) + βf(b) = \frac{1}{1+α} f(x) + \frac{α}{1+α} f(b).
Multiplying both sides of this equation by 1 + α and rearranging the result yields

f (z) − f (x) ≤ α(f (b) − f (z)) ≤ α(c − f (z)). (7.11)

Together, the bounds (7.9) and (7.11) imply that

|f(x) − f(z)| ≤ \frac{‖x − z‖_1}{δ} |c − f(z)|.
Hence if a sequence xk converges to z, then f (xk ) converges to f (z). Thus f is continuous at z.


Proof of Theorem 7.4.1. (If) Assume that the lower bound (7.2) holds. Let x, y ∈ I, α ∈ [0, 1], and set
xα = (1 − α)x + αy. Applying (7.2) at xα with variable v we obtain f (v) ≥ f (xα ) + f 0 (xα )(v − xα ). For
v = x and v = y this yields f (x) ≥ f (xα ) + f 0 (xα )(x − xα ) and f (y) ≥ f (xα ) + f 0 (xα )(y − xα ). Hence

(1 − α)f (x) + αf (y) ≥ (1 − α)(f (xα ) + f 0 (xα )(x − xα )) + α(f (xα ) + f 0 (xα )(y − xα ))
= f (xα ) + f 0 (xα )[(1 − α)(x − xα ) + α(y − xα )]
= f (xα ).

(Only If) If y = x, the result clearly holds. Hence let y ≠ x, and α ∈ (0, 1). Since f is convex, f(x + α(y −
x)) ≤ f(x) + α(f(y) − f(x)). Thus f(x + α(y − x)) − f(x) ≤ α(f(y) − f(x)). Dividing both sides by
α gives

\frac{f(x + α(y − x)) − f(x)}{α(y − x)} (y − x) ≤ f(y) − f(x).

Taking the limit as α → 0 yields f'(x)(y − x) ≤ f(y) − f(x).

Proof of Corollary 7.4.1. (If) This follows the proof of the corresponding part of Theorem 7.4.1. (Only
If) f is strictly convex and hence convex. Let x ∈ I. By the convexity of f , for any y ∈ I, f (y) ≥
f (x) + f 0 (x)(y − x). Suppose that for some y ∈ I, with y 6= x, f (y) = f (x) + f 0 (x)(y − x). Then for
β ∈ (0, 1), and zβ = (1 − β)x + βy, we have

f(zβ) < (1 − β)f(x) + βf(y) = f(x) + βf'(x)(y − x)   by strict convexity and the assumption on y,
f(zβ) ≥ f(x) + f'(x)(zβ − x) = f(x) + βf'(x)(y − x)   by convexity and (7.2);

a contradiction. Thus for y ∈ I, with y 6= x, f (y) > f (x) + f 0 (x)(y − x).

Proof of Theorem 7.4.2. (If) By Taylor's theorem, f(x + h) = f(x) + f'(x)h + (1/2)f''(z)h^2 where z is a point
between x and x + h. Since f''(z) ≥ 0, f(x + h) ≥ f(x) + f'(x)h. So at any point in I, f is bounded below by its
tangent line. We can then apply Theorem 7.4.1 to conclude that f is convex.
(Only If) By convexity, f(x) = f( \frac{(x+h) + (x−h)}{2} ) ≤ \frac{f(x+h) + f(x−h)}{2}. Hence f(x + h) − 2f(x) + f(x − h) ≥ 0.
The second derivative of f at x is found by taking the limit as h ↓ 0 of

\frac{1}{h}\left( \frac{f(x + h) − f(x)}{h} − \frac{f(x) − f(x − h)}{h} \right) = \frac{f(x + h) − 2f(x) + f(x − h)}{h^2} ≥ 0.

Hence f 00 (x) ≥ 0.

Proof of Theorem 7.4.3. (If) This part of the proof follows the corresponding part of the proof of Theorem
7.4.1, except that we use the bound (7.3).
(Only If) f is convex on C. Let x, y ∈ C, and set xt = (1 − t)x + ty, for t ∈ R. Since C is open, there
exists an open interval (a, b) containing [0, 1], such that for t ∈ (a, b), xt ∈ C, with x0 = x and x1 = y.

Then g(t) = f ( (1 − t)x + ty ) is a convex function on (a, b) with g(0) = f (x) and g(1) = f (y). Hence by
Theorem 7.4.1, for each t ∈ (a, b), g(t) ≥ g(0) + g 0 (0)t. This gives g(t) ≥ f (x) + Df (x)(y − x)t. Setting
t = 1 we obtain f (y) ≥ f (x) + Df (x)(y − x).

Proof of Theorem 7.4.4. (If) Let x, y ∈ C. By the multivariable version of Taylor’s theorem we have

f (y) = f (x) + ∇f (x)T (y − x) + 1/2(y − x)T Hf (z)(y − x),


where z is a point on the line segment joining x and y. Since Hf (z) is PSD, it follows that f (y) ≥
f (x) + ∇f (x)T (y − x). Thus f is convex.
(Only If) Assume f is convex. Fix x ∈ C and consider an open ball B around x contained in C. Let y ∈ B
with y ≠ x. For t ∈ R, let z(t) = (1 − t)x + ty. Then there exists an interval (a, b) containing [0, 1] such
that for t ∈ (a, b), z(t) ∈ B, with z(0) = x and z(1) = y.
Now for t ∈ (a, b) define g(t) = f ( (1 − t)x + ty ). Since f is convex on C, g is convex on (a, b). Hence
for t ∈ (a, b), g''(t) ≥ 0. By direct evaluation we find g'(t) = (∇f(z(t)))^T (y − x). Here we have
used the gradient representation of the derivative and omitted the dummy variable h since we are dealing
with a scalar variable t. The function ∇f(z) is a map from Rn into Rn. Hence its derivative at z is the linear map
Hf(z) from Rn into Rn. Thus, by the chain rule with ż(t) = y − x,

g''(t) = ( Hf(z(t))(y − x) )^T (y − x) = (y − x)^T Hf(z(t))(y − x),

where we have used the fact that the Hessian matrix is symmetric. Evaluation at t = 0 ∈ (a, b) gives

0 ≤ g 00 (0) = (y − x)T Hf (x)(y − x).

Since this holds for all y ∈ B, we conclude that Hf (x) is positive semidefinite.

Proof of Theorem 7.5.2. Let a ∈ R and consider the sublevel set La (f ) = {x : f (x) ≤ a}. If La (f )
is empty, then it is bounded. Hence assume La (f ) is nonempty and select x0 ∈ La (f ). Without loss of
generality we can assume x0 = 0. This follows by noting that the sublevel sets of f (x) and h(x) = f (x−x0 )
are related by a translation. Hence La (f ) is bounded if and only if La (h) is bounded. Moreover, since
x0 ∈ La (f ), 0 ∈ La (h).
We now prove by contradiction, that La (f ) is bounded. Assume that La (f ) is unbounded. Then La (f )
contains an unbounded sequence of points {xk }k≥0 . We can always ensure that the first term is x0 = 0, and
by selecting a subsequence if necessary, that
‖x_1‖_2 ≥ 1, and ‖x_{k+1}‖_2 ≥ ‖x_k‖_2 for k ≥ 1.   (7.12)

By radially mapping xk onto the unit ball centered at 0, we obtain a bounded sequence {yk }k≥1 with
y_k = α_k x_k where α_k = 1/‖x_k‖_2.   (7.13)
The conditions (7.12) ensure {αk }k≥1 ⊂ (0, 1] and that {αk }k≥1 monotonically decreases to 0.
Since f is strongly convex, for some c > 0, g(x) = f(x) − (c/2)‖x‖_2^2 is convex. The convexity of g implies
it is continuous and hence it has a finite maximum and minimum value over the unit ball centered at x0 . The
points yk are on the boundary of this ball. It follows that |g(yk )| is bounded as k → ∞.
On the other hand, since y_k = α_k x_k = α_k x_k + (1 − α_k)x_0 (recall x_0 = 0), using the convexity of g, the
bound f(x_k) ≤ a, and (7.13) we find

g(y_k) = f( α_k x_k + (1 − α_k)x_0 ) − (c/2)‖α_k x_k + (1 − α_k)x_0‖_2^2
       ≤ α_k f(x_k) + (1 − α_k)f(x_0) − (c/2)α_k‖x_k‖_2^2 − (c/2)(1 − α_k)‖x_0‖_2^2
       ≤ α_k a + (1 − α_k)a − (c/2)(1/‖x_k‖_2)‖x_k‖_2^2
       = a − (c/2)‖x_k‖_2.

The final term converges to −∞ as k → ∞, contradicting the fact that g takes bounded values on the
sequence {yk }. Hence the nonempty sublevel set La (f ) must be bounded.


Proof of Theorem 7.5.4. (If) For y ∈ C, (7.3) and (7.6) give f (y) − f (x? ) ≥ Df (x? )(y − x? ) ≥ 0. Hence
f (y) ≥ f (x? ).
(Only If) Suppose x? minimizes f over C. If (7.6) does not hold, then for some y ∈ C, Df (x? )(y − x? ) <
0. Let h = y − x? and note that for α ∈ [0, 1], x? + αh = (1 − α)x? + αy ∈ C. Using the definition of the
derivative we have
f (x? + αh) − f (x? )
Df (x? )h = lim < 0.
α↓0 α
By the definition of a limit, there exists α0 > 0 such that for all 0 < α ≤ α0 , f (x? + αh) − f (x? ) < 0. For
such α, x? + αh ∈ C and f (x? + αh) < f (x? ); a contradiction.

Proof of Corollary 7.5.1. (If) By Theorem 7.4.3, for each y ∈ C, f (y) ≥ f (x? )+Df (x? )(y−x? ) = f (x? ).
(Only If) Since x? is an interior point of S, for some α > 0, and each direction h ∈ Rn , x? + αh ∈ C. Hence
for each direction h, by Theorem 7.5.4, f (x? + αh) − f (x? ) ≥ Df (x? )(αh) ≥ 0. But then Df (x? )(h) ≥ 0
and Df (x? )(−h) ≥ 0. Thus Df (x? ) = 0.

Notes
The material in this chapter is standard and can be found in any modern book on optimization or convex
analysis. See, for example, the optimization books by Bertsekas [4], Boyd and Vandenberghe [7], and Chong
and Zak [9]; and the texts by Fleming [15], and Urruty and Lemaréchal [18]. The proof of Theorem 7.3.5 is
drawn (with modifications) from [15, Theorem 3.5]. For a physical interpretation of Jensen’s inequality, see
MacKay [28, §3.5].

Exercises
Convex Sets
Exercise 7.1. Prove the following basic properties of convex sets:
(a) Let Sa ⊂ Rn be a convex set for each a ∈ A. Show that ∩a∈A Sa is a convex set.
(b) Let S ⊂ Rn be a convex set and F be a linear map from Rn to Rm . Show that F (S) = {z : z = F s, s ∈ S} is
a convex set.

Convex Functions
Exercise 7.2. Let k · k be a norm on Rn . Are its sublevel sets convex? Are the sublevel sets bounded? Does k · k have
a unique global minimum?
Exercise 7.3. Let P ∈ Rn×n , b ∈ Rn , and c ∈ R. Show that the quadratic function xT P x + bT x + c is convex if and
only if P is positive semidefinite.
Exercise 7.4. Let f (x, y) be a convex function of (x, y) ∈ Rp+q with x ∈ Rp and y ∈ Rq . Show that for each

x0 ∈ Rp the function gx0 (y) = f (x0 , y) is a convex function of y ∈ Rq .
Exercise 7.5 (Epigraph). Let S ⊆ Rn , and f : S → R. The epigraph of f is the set of points

epi(f ) = {(x, v) ∈ S × R : f (x) ≤ v}.

Show that:

(a) If S is a convex set and f is a convex function, then epi(f ) is a convex subset of Rn+1


(b) If epi(f ) is a convex set, then S is a convex set and f is a convex function.

Exercise 7.6 (Jensen's Inequality). Let C be a convex set, and f : C → R be a convex function. Show that for any
{x_i}_{i=1}^k ⊂ C, and {α_i ≥ 0}_{i=1}^k with \sum_{i=1}^k α_i = 1,

f( \sum_{i=1}^k α_i x_i ) ≤ \sum_{i=1}^k α_i f(x_i).

Exercise 7.7. For {A_j}_{j=1}^k ⊂ R^{m×n}, show that σ_1^2( \sum_j A_j ) ≤ \sum_j σ_1^2(A_j).

Exercise 7.8. Let fj : Rn → R be a strongly convex function, j = 1, 2, 3. Assuming it is non-empty, give a labelled
conceptual sketch of the region {x : fj (x) ≤ βj , j = 1, 2, 3} and indicate its key properties.

Identifying Convex Functions


Exercise 7.9. Determine general sufficient conditions (if any exist) under which the function indicated is convex.
(a) f : [0, ∞) → R with f(x) = x^r.
(b) f : R → R with f(x) = |x|.
(c) f : R → R with f(x) = |x|^r.
(d) f : [0, ∞) → R with f(x) = e^x.
(e) f : [0, ∞) → R with f(x) = e^{−x}.
(f) f : [0, ∞) → R with f(x) = − log(x).
(g) f : [0, ∞) → R with f(x) = x log(x).
(h) f : (0, ∞) → R with f(x) = 1/x^r.
(i) f : [d, ∞) → R with f(x) = ax^3 + bx^2 + c.

Exercise 7.10. Determine general sufficient conditions (if any exist) under which the indicated function is convex.
(a) f : R^n → R with f(x) = (x^T Q x)^r. Here Q ∈ R^{n×n} is symmetric PSD.
(b) f : R^n → R with f(x) = 1 + e^{(\sum_{i=1}^n |x(i)|)^r}.
(c) f : R^{m×n} → R with f(A) = σ_1(A), where σ_1(A) is the maximum singular value of A.
(d) f : R^{m×n} → R with f(A) = \sum_j σ_j(A), where σ_j(A) is the j-th singular value of A.
Exercise 7.11. Let C = {x ∈ Rn : x(i) > 0, i ∈ [1 : n]} and for x ∈ C, let ln(x) = [ln(x(i))] ∈ Rn . Prove or
disprove: f (x) = xT ln(x) is a convex function on the set C.
Exercise 7.12. Let p : R^n → R be the function p(x) = e^{−γ‖x‖_2^2} and f : R^n → R be a convex function such that for
each x ∈ R^n, the function f(u)p(u − x) is integrable. Then we can "smooth" f by convolution with p to obtain

g(x) = \int_{R^n} f(u) p(x − u) du.

Show that g(x) is also convex on Rn .

Bounded Sublevel Sets


Exercise 7.13. Show that every norm on Rn has bounded sublevel sets.
Exercise 7.14. Under what conditions on a, b, c ∈ R is the function on R defined by f (x) = a|x| + bx + c convex
with bounded sublevel sets.
Exercise 7.15. Find necessary and sufficient conditions on b ∈ Rn for the convex function f : Rn → R with f (x) =
kxk1 + bT x to have bounded sublevel sets.
Exercise 7.16. Let f : Rn → R be a convex function with bounded sublevel sets. Show that the following functions
are convex and have bounded sublevel sets:
(a) αf (x) for α > 0.
(b) f (x + b) where b ∈ Rn .


(c) f (Ay) where A ∈ Rn×m is injective and y ∈ Rm .


(d) g(f (x)) where g is convex and strictly increasing on the range of f .
(e) f (x) + g(x) where g is also convex with bounded level sets.
Exercise 7.17. Let k · k be a norm on Rn , and f : Rn → R be a convex function. Show that if there exist c > 0, and

a > 1 such that g(x) = f (x) − ckxka is a convex function, then f has bounded sublevel sets.
Exercise 7.18. Let f (x) be a convex function on Rn with bounded sublevel sets. Let A ∈ Rm×n with rank(A) =
m < n. Then for y ∈ Rm set

g(y) = min_{x ∈ R^n} f(x)   s.t.  Ax = y.

Show that g(y) is a convex function of y ∈ Rm .


Exercise 7.19. Let A ∈ Rm×n with rank(A) = m < n. Then for y ∈ Rm set

g(y) = min_{x ∈ R^n} ‖x‖_1   s.t.  Ax = y.

Determine conditions under which g(y) is a convex function of y ∈ Rm .

Strict Convexity
Exercise 7.20. Let f, g : Rn → R with f strictly convex, and g convex. Show that:
(a) For α > 0, αf (x) is strictly convex.
(b) f (x) + g(x) is a strictly convex.
Exercise 7.21. Show that if f, g : Rn → R are both strictly convex, then h(x) = max{f (x), g(x)} is strictly convex.
Exercise 7.22. Show that if f : Rn → R is strictly convex, A ∈ Rn×m has rank m, and b ∈ Rm , then h(x) =
f (Ax + b) is strictly convex.
Exercise 7.23. Assume that g has the properties stated below on the image of f . Show that:
(a) If f is strictly convex, and g is convex and strictly increasing, then h(x) = g(f (x)) is strictly convex.
(b) If f convex, and g is strictly convex and nondecreasing, then h(x) = g(f (x)) is strictly convex.
Exercise 7.24. Show that for symmetric P ∈ Rn×n , xT P x is strictly convex on Rn if and only if P is positive
definite. Similarly, show that for a ∈ Rn and b ∈ R, the quadratic function xT P x + aT x + b is strictly convex if and
only if P is positive definite.
Exercise 7.25. Let f : I → R be twice differentiable on an open interval I. Show that if for all x ∈ I, f 00 (x) > 0,
then f is strictly convex on I.

Strong Convexity
Exercise 7.26. Show that a strongly convex function is strictly convex.
Exercise 7.27. Show that f : Rn → R is strongly convex with modulus c if and only if for x, y ∈ Rn and α ∈ [0, 1],

f ( (1 − α)x + αy ) ≤ (1 − α)f (x) + αf (y) − 1/2 cα(1 − α)kx − yk22 . (7.14)

Exercise 7.28. Show that if f (x) is strongly convex with modulus c > 0, then f (x) is strongly convex with modulus
β for 0 < β ≤ c.


Exercise 7.29. Let f, g : Rn → R. Show that:


(a) If f is strongly convex and α > 0, then αf (x) is strongly convex.
(b) If f is strongly convex, and g(x) is convex, then f (x) + g(x) is strongly convex.
Exercise 7.30. Show that if f, g : Rn → R are strongly convex with moduli cf , cg respectively, then h(x) =
max{f (x), g(x)} is strongly convex with modulus c = min{cf , cg }.
Exercise 7.31. Show that if f : Rn → R is strongly convex, A ∈ Rn×m has rank m, and b ∈ Rm , then h(x) =
f (Ax + b) is strongly convex with modulus 0 < β ≤ cσm (A).
Exercise 7.32. Let {fk (x)}k≥1 be a sequence of strongly convex functions on Rn converging pointwise to f (x). Let
ck denote the modulus of fk and assume c = limk→∞ ck > 0. Show that f (x) is strongly convex with modulus c.
Exercise 7.33. For a symmetric P ∈ Rn×n show that the quadratic function xT P x + aT x + b is strongly convex if
and only if P is positive definite.
Exercise 7.34. Determine which (if any) of the following functions are strongly convex on Rn :
(a) f (x) = ky − xk22 , for fixed y ∈ Rn .
(b) f (x) = kx − yk1 , for fixed y ∈ Rn .
Exercise 7.35. Let f : I → R be differentiable on an open interval I. Show that f is strongly convex on I if and only
if there exists c > 0 such that for all x, y ∈ I,
c
f (y) ≥ f (x) + f 0 (x)(y − x) + (y − x)2 . (7.15)
2
Exercise 7.36. Let f : I → R be twice differentiable on an open interval I. Show that f is strongly convex on I if
and only if there exists c > 0 such that for all x ∈ I, f 00 (x) ≥ c.
Exercise 7.37. Let f : C → R be differentiable on the open convex set C. Show that f is strongly convex on C if and
only if there exists c > 0 such that for all x, y ∈ C,
f(y) ≥ f(x) + Df(x)(y − x) + (c/2)‖y − x‖_2^2.   (7.16)
Exercise 7.38. Let f : C → R be twice differentiable on the open convex set C. Show that f is strongly convex on C
if and only if there exists c > 0 such that for all x ∈ C, Hf (x) − cI is positive semidefinite.
Exercise 7.39. A function f : R^{m×n} → R is strongly convex if for some c > 0, f(A) − (c/2)‖A‖_F^2 is a convex function.
Show that the Frobenius norm is a strongly convex function. It is hence also a strictly convex function.
Exercise 7.40. Show that the nuclear norm and the induced 2-norm are not strongly convex functions. [Hint: by
counterexample show these norms are not strictly convex.]

Proximal Operators
Exercise 7.41. Let f : Rn → R be a convex function, x ∈ Rn , and λ > 0. The function

P_f(x) ≜ arg min_{v ∈ R^n}  f(v) + \frac{1}{2λ}‖x − v‖_2^2
is called a proximal operator. Show that the problem on the RHS always has a unique solution.
Exercise 7.42. Let λ > 0 and ‖x‖_1 = \sum_{j=1}^n |x_j|. Obtain an analytical expression for the proximal operator

P_1(x) = arg min_{v ∈ R^n}  ‖v‖_1 + \frac{1}{2λ}‖x − v‖_2^2.


Projection to a Convex Set


Exercise 7.43. For a closed convex set C ⊂ Rn and y ∈ Rn , let PC (y) denote the unique projection of y onto C, i.e.,
PC (y) is the unique point in C closest to y. Derive analytic expressions for PC in each of the following cases:

(a) C = {x : wT x − b = 0} for fixed w ∈ Rn , b ∈ R, with w 6= 0.


(b) C = {x : wT x − b ≤ 0} for fixed w ∈ Rn , b ∈ R, with w 6= 0.
(c) C = {x : kxk2 ≤ r} for fixed r > 0.


Chapter 8

Principal Components Analysis

In many situations a dataset {x_j}_{j=1}^m ⊂ R^n doesn't fill out its ambient space R^n uniformly in all directions,
but rather lies close to some lower dimensional surface in R^n. So the data has (at least approximately) a
lower dimensional structure within the ambient space R^n. The simplest case is when the data lies close
but rather lies close to some lower dimensional surface in R . So the data has (at least approximately) a
lower dimensional structure within the ambient space Rn . The simplest case is when the data lies close
to a subspace in Rn , or to a translated subspace in Rn . Such a surface is called a linear manifold. In this
situation, we could usefully approximate the data by projecting it onto the linear manifold. Of course, in
practice we don’t know the linear manifold. Hence we set out to learn a suitable linear manifold from the
data. This is the problem we now explore.

8.1 Preliminaries
8.1.1 Linear Manifolds

A linear manifold of dimension d is a set of the form M = {x : x = z + u, u ∈ U}, where z ∈ Rn is fixed


and U is a fixed d-dimensional subspace of Rn . For brevity we sometimes write M = z + U. The subspace
U defines the orientation and dimension of the linear manifold, and the vector z defines its translation from
the origin. Since z = z + 0 and 0 ∈ U, it is always the case that z ∈ M. For a given linear manifold
M = z + U, the subspace U is unique, but the translation vector z is not. For any u ∈ U, M = z + U and
M = (z + u) + U. So the translation vector is unique up to the addition of an element of U.
To approximate the data points {x_j}_{j=1}^m by points in M, we write x_j = (z + û_j) + ε_j where û_j ∈ U,
and ε_j ∈ R^n is the approximation error. In particular, we want to select the û_j to minimize the sum of
squared approximation errors \sum_{j=1}^m ‖ε_j‖_2^2. Since x_j − z = û_j + ε_j, we see that û_j should be the orthogonal
projection of x_j − z onto the subspace U with ε_j ∈ U^⊥ the corresponding residual. Hence, if the columns
of U ∈ R^{n×d} form an orthonormal basis for U, then û_j = U U^T(x_j − z) and ε_j = (I − U U^T)(x_j − z).
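A small sketch of the approximation just described: given an orthonormal basis U for U and a translation z, each data point decomposes as x_j = z + û_j + ε_j with û_j = U U^T(x_j − z) and ε_j orthogonal to U. The dimensions, noise level, and random data below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(6)
    n, d, m = 5, 2, 100

    # An orthonormal basis U for a d-dimensional subspace, and a translation z.
    U, _ = np.linalg.qr(rng.standard_normal((n, d)))
    z = rng.standard_normal(n)

    # Synthetic data lying near the linear manifold z + range(U).
    X = z[:, None] + U @ rng.standard_normal((d, m)) + 0.05 * rng.standard_normal((n, m))

    U_hat = U @ (U.T @ (X - z[:, None]))      # the components u_hat_j, as columns
    E = (X - z[:, None]) - U_hat              # the residuals eps_j, as columns

    print(np.allclose(U.T @ E, 0))            # residuals are orthogonal to the subspace
    print(np.sum(E**2))                       # sum of squared approximation errors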

8.1.2 Centering the Data


The empirical mean of the dataset {x_j}_{j=1}^m is the vector µ̂ = (1/m)\sum_{j=1}^m x_j. The operation of subtracting the
empirical mean µ̂ from each data point is called centering the data. This shifts µ̂ to the origin and yields the
centered dataset {y_j = x_j − µ̂}_{j=1}^m. The centered data has zero sample mean. In the next section, we show
that centering the data transforms the problem of fitting a linear manifold to the dataset to the problem of
fitting a subspace to the centered dataset.


8.1.3 Parameterizing the Family of d-Dimensional Subspaces


Let Vn,d denote the family of n × d real matrices with orthonormal columns. If U ∈ Vn,d then U T U = Id .
This set of matrices is called the Stiefel manifold in Rn×d.
A subspace U ⊆ Rn of dimension d ≤ n can be represented by a matrix U ∈ Vn,d by selecting the
columns of U to form an orthonormal basis for U. However, this representation is not unique since there are
infinitely many orthonormal bases for U. The following lemma shows that any two such representations are
related by a d × d orthogonal matrix.

Lemma 8.1.1. Let U1 , U2 ∈ Vn,d both be representations for the d-dimensional subspace U ⊂ Rn .
Then there exists Q ∈ O_d with U_2 = U_1 Q and U_1 = U_2 Q^T.

Proof. Let U1 , U2 ∈ Rn×d be two orthonormal bases for the d-dimensional subspace U. Since U1 is a basis
for U and every column of U2 lies in U, there must exist a matrix Q ∈ Rd×d such that U2 = U1 Q. It follows
that Q = U1T U2 . Using U1 U1T U2 = U2 and U2 U2T U1 = U1 , we then have

QT Q = U2T U1 U1T U2 = U2T U2 = Id , and


QQT = U1T U2 U2T U1 = U1T U1 = Id .

Hence Q ∈ Od , and the result follows.

8.2 Selecting the Best Linear Manifold Approximation


We now address the problem of finding a d-dimensional linear manifold that best approximates a given
dataset. We do this in two stages. First we fix a d-dimensional subspace U and determine the best choice of
the translation vector z. Then with z determined we find the best d-dimensional subspace U.

8.2.1 Selecting the Best Translation Vector z


Fix U and let the columns of U ∈ Rn×d be an orthonormal basis for U. For any z ∈ Rn we can uniquely
write z = u + v where u ∈ U and v ∈ U ⊥ . This decomposition can then be used to write

x_j = z + û_j + ε_j
    = v + u + U U^T(x_j − (u + v)) + (I − U U^T)(x_j − (u + v))
    = v + U U^T(x_j − v) + (I − U U^T)(x_j − v).

So only the component of z in U ⊥ plays a role in determining the resulting residuals. Moreover, the sum of
squared residuals is
∑_{j=1}^m ‖ε_j‖₂² = ∑_{j=1}^m ‖(I − U U^T)x_j − v‖₂² = ∑_{j=1}^m ‖p_j − v‖₂²,

where pj = (I − U U T )xj , j ∈ [1 : m]. The minimization of the above expression over v is a simple calculus
problem with solution
v = (1/m) ∑_{j=1}^m p_j = (I − U U^T) (1/m) ∑_{j=1}^m x_j = (I − U U^T) µ̂,


Figure 8.1: A linear manifold µ̂ + U and its subspace U in R2 . Here U = span(u) with kuk2 = 1.

where µ̂ is the empirical mean of the data. So given a fixed U with orthonormal basis U , an optimal z is
obtained by setting v = (I − U U T )µ̂, letting u ∈ U be arbitrary, and setting z = v + u. In particular, it is
convenient to select u = U U T µ̂ since the resulting z is then independent of the selection of U:

z = U U T µ̂ + (I − U U T )µ̂ = µ̂.

Note that z is not unique; any point z = µ̂ + u with u ∈ U is also optimal.


Once U is selected, the linear manifold is M = µ̂ + U. It follows that µ̂ ∈ M and for any u ∈ U,
µ̂ + u ∈ M. Since (I − U U T )µ̂ = µ̂ − U U T µ̂, it follows that v = (I − U U T )µ̂ ∈ M. The point v is the
intersection of the manifold M with U ⊥ . This is illustrated in Figure 8.1.

8.2.2 Selecting the Best Subspace U


We know that an optimal value for the translation vector is z = µ̂. Hence we can write xj − µ̂ = ûj + j
with ûj the orthogonal projection of xj − µ̂ onto the subspace U and j the corresponding residual. Our
goal now is to find a d-dimensional subspace U such that the orthogonal projection of the centered data
onto U minimizes the sum of squared norms of the residuals. Assuming such a subspace exists, we call it
an optimal projection subspace of dimension d.
From this point forward we assume the data X has been centered. Let the columns of U ∈ V_{n,d}
form an orthonormal basis for U. Then the matrix of projected data is X̂ = U U T X, and the corresponding
matrix of residuals is (I − U U T )X. Hence we seek to solve:

min_{U ∈ V_{n,d}} ‖X − U U^T X‖_F²    (8.1)

Problem 8.1 is an optimization problem over the Stiefel manifold. We know that it does not have a unique
solution since if U is a solution, then U Q is a solution for every Q ∈ Od . These solutions correspond to
different parameterizations of the same subspace. However, beyond this obvious non-uniqueness, it may be
possible that two distinct subspaces are both optimal projection subspaces of dimension d. We will examine
that issue in due course.
We can rewrite problem (8.1) to make the constraint U ∈ V_{n,d} explicit as follows:

min_{U ∈ R^{n×d}} ‖X − U U^T X‖_F²   s.t. U^T U = I_d.    (8.2)


Using standard equalities, the objective function of (8.2) can be rewritten as

kX − U U T Xk2F = trace(X T (I − U U T )(I − U U T )X) = trace(XX T ) − trace(U T XX T U ).

Let S = XX T ∈ Rn×n denote the scatter matrix of the data. Then we can equivalently solve:

max_{U ∈ R^{n×d}} trace(U^T S U)   s.t. U^T U = I_d.    (8.3)

This is a matrix Rayleigh quotient problem. The simplest version with d = 1 is easily solved. The solution
is to take u to be a unit norm eigenvector corresponding to a maximum eigenvalue of S. The solution of the
general case with d > 1 is only slightly more complicated. One solution is obtained by taking the columns
of U to be d orthonormal eigenvectors corresponding to the d largest eigenvalues of S. Then for all Q ∈ Od ,
U ? = U Q is also a solution (Theorem 6.3.1).
It follows that a solution U^⋆ to (8.3) is obtained by selecting the columns of U^⋆ to be a set of orthonormal
eigenvectors of S = XX T corresponding to its d largest eigenvalues. Working backwards, we see that U ? is
then also a solution to (8.2). Any basis of the form U ? Q with Q ∈ Od also spans the same optimal subspace
U ? . In general, the subspace U ? need not be unique. To see this, consider the situation when λd = λd+1 .
When this holds, the selection of a d-th eigenvector in U ? is not unique. However, aside from this very
special situation, U ? is unique.
In summary, a solution to problem (8.2) is obtained as follows. Find the d largest eigenvalues of the
scatter matrix S = XX T and a corresponding set of orthonormal eigenvectors U ? . Then over all d dimen-
sional subspaces, U ? = R(U ? ) minimizes the sum of the squared norms of the projection residuals. Here
are some other important observations:

(1) The case d = 0 merits comment. When d = 0, U = 0 and M = {z}. So we project the data to a single point z (the translation vector). Hence we seek z ∈ R^n that minimizes the sum of squared distances ∑_{j=1}^m ‖z − x_j‖₂². The solution is the empirical mean of the data µ̂ = (1/m) ∑_{j=1}^m x_j. This is consistent with our result that x_j = µ̂ + û_j + ε_j, since in this case û_j ∈ U = 0.

(2) For d = 1 we find the unit norm eigenvector for the largest eigenvalue of the scatter matrix of the
centered data. Let this be u. Then the best approximation linear manifold is the straight line µ̂ +
span{u}. This is illustrated using synthetic data in Figure 8.2.

(3) As we vary d we obtain a nested set of optimal projection subspaces U0? ⊂ U1? ⊂ · · · ⊂ Ur? where r is
the rank of the centered data matrix. Thus the best approximation linear manifolds M?d = µ̂ + Ud? are
also nested. Hence for each data point xj there is a sequence of progressively refined approximations:

d = 0 :  x_j = µ̂ + ε_{j,0},   ε_{j,0} = x_j − µ̂
d = 1 :  x_j = µ̂ + u_1 u_1^T ε_{j,0} + ε_{j,1},   ε_{j,1} = (I − u_1 u_1^T) ε_{j,0}
d = 2 :  x_j = µ̂ + u_1 u_1^T ε_{j,0} + u_2 u_2^T ε_{j,1} + ε_{j,2},   ε_{j,2} = (I − u_2 u_2^T) ε_{j,1}
  ⋮
d = r :  x_j = µ̂ + ∑_{i=1}^r u_i u_i^T ε_{j,i−1},   ε_{j,r} = 0.

Here ui is the i-th unit norm eigenvector of the scatter matrix of the centered data, listed in order of
decreasing eigenvalues.


Figure 8.2: Optimal data approximation using a linear manifold in R2 . When d = 0, the best approximation linear
manifold is the empirical mean µ̂. When d = 1 it is the 1-dimensional linear manifold M shown, and when d = 2 it
is R2 .

(4) By projecting x_j − µ̂ to x̂_j = U^⋆(U^⋆)^T(x_j − µ̂) we obtain an approximation of the centered data as points in the subspace U^⋆. Each x̂_j is exactly represented by its coordinates a_j = (U^⋆)^T(x_j − µ̂) with respect to the orthonormal basis U^⋆. So by forming a d-dimensional approximation to the data we have effectively reduced the dimension of its representation from n to d < n. This is an example of dimensionality reduction; a small numerical sketch is given below.
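To make the procedure concrete, here is a minimal numpy sketch (ours, not from the notes) that forms the scatter matrix of the centered data, takes d leading orthonormal eigenvectors as U^⋆, and maps each data point to its d-dimensional coordinate vector a_j. The synthetic data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 5, 200, 2
X = rng.normal(size=(n, m))                  # columns are the data points
Xc = X - X.mean(axis=1, keepdims=True)       # center the data

S = Xc @ Xc.T                                # scatter matrix of the centered data
evals, evecs = np.linalg.eigh(S)             # eigenvalues in increasing order
U_star = evecs[:, ::-1][:, :d]               # d leading orthonormal eigenvectors

A = U_star.T @ Xc                            # d x m matrix of coordinates a_j
Xc_hat = U_star @ A                          # projection of the centered data onto the subspace

# The sum of squared residual norms equals the sum of the trailing eigenvalues of S.
resid = np.linalg.norm(Xc - Xc_hat, 'fro') ** 2
print(np.isclose(resid, evals[::-1][d:].sum()))
```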

8.3 Principal Components Analysis


An alternative way to view the problem considered in Section 8.2 focuses on the directions of maximum
variance of the centered data. This viewpoint yields some additional insights into the solution previously
derived.

8.3.1 Directions of Maximum Variance


The data points xj are “spread out” around the sample mean µ̂. Recall that the sample covariance matrix is
the symmetric PSD matrix
R = (1/m) ∑_{j=1}^m (x_j − µ̂)(x_j − µ̂)^T.    (8.4)

Since confusion is unlikely, we have dropped the hat on R for notational simplicity. We can use R to write
an expression for the empirical variance of the data in any direction u:

σ 2 (u) = uT Ru. (8.5)

The direction u in which the data has maximum variance is then obtained by solving the problem:

arg max_{u∈R^n} u^T R u   s.t. u^T u = 1    (8.6)


with R a symmetric positive semidefinite matrix. This is a simple Rayleigh quotient problem with d = 1.
The solution is to take u = v1 , where v1 is a unit norm eigenvector for a maximum eigenvalue σ12 of R.
We must take care if we want to find two directions of largest variance. Without any constraint, the
second direction can be arbitrarily close to v1 and yield variance near σ12 . One way to prevent this is to
constrain the second direction to be orthogonal to the first. Then if we want a third direction, constrain it to be orthogonal to the two previous directions, and so on. In this case, for d orthogonal directions we want to find U = [u_1, . . . , u_d] ∈ V_{n,d} to maximize ∑_{j=1}^d u_j^T R u_j = trace(U^T R U). Hence we want to solve
problem (8.3) with S = R. As discussed previously, one solution is attained by taking the d directions to be
unit norm eigenvectors v1 , . . . , vd for the largest d eigenvalues of R.
By this means you see that we can obtain n orthonormal directions of maximum empirical variance in
the data. These directions v_1, v_2, . . . , v_n and the corresponding empirical variances σ_1² ≥ σ_2² ≥ · · · ≥ σ_n²
are eigenvectors and corresponding eigenvalues of R: Rvj = σj2 vj , j ∈ [1 : n]. The vectors vj are called the
principal components of the data, and this decomposition is called Principal Components Analysis (PCA).
Let V be the matrix with the vj as its columns, and Σ2 = diag(σ12 , . . . , σn2 ) (note σ12 ≥ σ22 ≥ · · · ≥ σn2 ).
Then PCA is an ordered eigen-decomposition of the empirical covariance matrix: R = V Σ² V^T.
There is a clear connection between PCA and finding a subspace that minimizes the sum of squared
norms of the residuals. We can see this by noting that the sample covariance is just a scalar multiple of the
scatter matrix XX T :
R = (1/m) ∑_{j=1}^m x_j x_j^T = (1/m) X X^T.

Hence the principal components are the unit norm eigenvectors of S = XX T listed in order of decreasing
eigenvalues. In particular, the first d principal components are the first d unit norm eigenvectors (ordered
by eigenvalue) of XX T . This is an orthonormal basis that defines an optimal d-dimensional projection
subspace U. Thus the leading d principal components give a particular orthonormal basis for an optimal
d-dimensional projection subspace.
A direction in which the data has small variance relative to σ_1² may not be an important direction;
after all the data stays close to the mean in this direction. If one accepts this hypothesis, then the directions
of the largest variance are the important directions. These directions explain most of the variance in the data.
This hypothesis suggests that we could select an integer d < rank(R) and project the centered data onto
the d directions of largest variance. Let Vd = [v1 , v2 , . . . , vd ]. Then the projection of the centered data onto
the span of the columns of Vd is x̂j = Vd (VdT xj ). The term aj = VdT xj gives the coordinates of xj with
respect to Vd , and the product Vd aj synthesizes x̂j using these to form the appropriate linear combination of
the columns of Vd .
Here is a critical observation: since the directions are fixed and known, we do not need to form x̂j .
Instead we can simply map xj to the coordinate vector aj ∈ Rd . We lose no information in working with
aj instead of x̂j since the latter is an invertible linear function of the former. Hence {aj }m j=1 gives a new set
of data that captures most of the variance in the original data, and lies in the reduced dimension space Rd
(d ≤ rank(R) ≤ n).
We now address the question of how to select the dimension d. The selection of d involves a tradeoff
between dimensionality reduction and the amount of captured variance in the resulting approximation. The “variance captured” is ν² = ∑_{j=1}^d σ_j² and the “residual variance” is ρ² = ∑_{j=d+1}^n σ_j². Reducing d reduces ν² and increases ρ². The selection of d thus involves determining the fraction of the total variance that
ensures the projected data is useful for the task at hand. For example, if the projected data is to be used to
learn a classifier, then d can be selected to yield acceptable (or perhaps best) classifier performance using
cross-validation.
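The following minimal numpy sketch (ours, not from the notes) computes the principal components and empirical variances from the sample covariance and then picks the smallest d whose captured variance exceeds a chosen fraction; the 95% threshold and the synthetic data are arbitrary illustrative choices, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 10, 500
X = rng.normal(size=(n, m))
Xc = X - X.mean(axis=1, keepdims=True)

R = (Xc @ Xc.T) / m                          # sample covariance matrix
var, V = np.linalg.eigh(R)                   # eigenvalues in increasing order
var, V = var[::-1], V[:, ::-1]               # principal components v_j and variances sigma_j^2

# Select the smallest d capturing at least 95% of the total variance (example threshold).
frac = np.cumsum(var) / var.sum()
d = int(np.searchsorted(frac, 0.95) + 1)

A = V[:, :d].T @ Xc                          # reduced d-dimensional representation a_j
print(d, A.shape)
```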


8.4 The Connection between PCA and SVD


We have shown that the d-dimensional PCA projection subspace is spanned by the leading eigenvectors
(those with largest eigenvalues) of S = X X^T or equivalently R = (1/m) X X^T, where X denotes the matrix of centered data. An alternative way to find the principal components is to compute a compact SVD of X. To see this, let X have rank r, and X = U Σ V^T be a compact SVD. So U ∈ V_{n,r}, V ∈ V_{m,r} and Σ ∈ R^{r×r} is diagonal with positive diagonal entries σ_1 ≥ · · · ≥ σ_r > 0. Then

S = XX T = U ΣV T V ΣU T = U Σ2 U T .

Hence the principal components with nonzero variances are the r left singular vectors of X (the columns of U), and the variance of the data in the direction of the j-th principal component u_j is (1/m) σ_j², j ∈ [1 : r].
Now let d ≤ r, and write U = [Ud Ur−d ] and V = [Vd Vr−d ]. Similarly, let Σd be the top left
d × d submatrix of Σ, and Σr−d denote its bottom right (r − d) × (r − d) submatrix. To form a d-
dimensional approximation of the centered data we project X onto the subspace spanned by its first d
principal components:

U_d U_d^T X = U_d U_d^T U Σ V^T = U_d [ I_d   0_{d×(r−d)} ] diag(Σ_d, Σ_{r−d}) [ V_d   V_{r−d} ]^T = U_d Σ_d V_d^T.

Hence PCA projection of the data to d dimensions is equivalent to finding the best rank d approximation to
the centered data matrix X.
Here are some other important points to note from the above expression:

(1) If we know we want the d-dimensional PCA projection of X, then we need only to compute the best rank d approximation U_d Σ_d V_d^T to the centered data X.

(2) The d-dimensional coordinates of the projected points with respect to the basis U_d are given in the
columns of the d × m matrix Σd VdT . So the coordinates can be found directly from a compact SVD
of X, or from a compact SVD of its best rank d approximation.

(3) The captured and residual variances can be found directly from the compact SVD of X: the captured variance is (1/m) ∑_{j=1}^d σ_j² and the residual variance is (1/m) ∑_{j=d+1}^r σ_j².
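The numpy sketch below (ours, with synthetic data) carries out these computations with a compact SVD of the centered data: the columns of U are the principal components, the coordinates are Σ_d V_d^T, and the captured and residual variances are read off from the singular values. It also checks consistency with the eigenvalues of the scatter matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, d = 8, 300, 3
X = rng.normal(size=(n, m))
Xc = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # compact SVD: Xc = U diag(s) Vt

A = np.diag(s[:d]) @ Vt[:d, :]        # d x m coordinates of the projected points w.r.t. U_d
Xc_hat = U[:, :d] @ A                 # best rank-d approximation U_d Sigma_d V_d^T

captured = (s[:d] ** 2).sum() / m     # captured variance
residual = (s[d:] ** 2).sum() / m     # residual variance
print(captured, residual)

# Consistency check: the squared singular values are the eigenvalues of the scatter matrix.
S = Xc @ Xc.T
evals = np.linalg.eigvalsh(S)[::-1]
print(np.allclose(s ** 2, evals[: len(s)]))
```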

Exercises
Exercise 8.1. Let x_1, . . . , x_m ∈ R. We seek the point z^⋆ ∈ R that is “closest” to the set of points {x_i}_{i=1}^m.
(a) If we measure closeness using squared error, then we seek z^⋆ = arg min_{z∈R} ∑_{i=1}^m (z − x_i)². In this case show that z^⋆ = (1/m) ∑_{i=1}^m x_i. This is the empirical mean of the data points.
(b) If we measure closeness using absolute error, then we seek z^⋆ = arg min_{z∈R} ∑_{i=1}^m |z − x_i|. Show that in this case z^⋆ is the median of the data points.
Exercise 8.2. Let x_1, . . . , x_m ∈ R^n. We seek the point z^⋆ ∈ R^n that is “closest” to the set of points {x_i}_{i=1}^m.
(a) If we measure closeness using squared error, then we seek z^⋆ = arg min_{z∈R^n} ∑_{i=1}^m ‖z − x_i‖₂². In this case show that z^⋆ = (1/m) ∑_{i=1}^m x_i. This is the empirical mean of the data points.
(b) If we measure closeness using absolute error, then we seek z^⋆ = arg min_{z∈R^n} ∑_{i=1}^m ‖z − x_i‖₁. Show that in this case z^⋆(i) = median{x_j(i)}_{j=1}^m. This is the vector median of the data points.


Exercise 8.3. Let {x_j}_{j=1}^m be a dataset of interest and set X = [x_1, . . . , x_m] ∈ R^{n×m}. Then µ̂ = (1/m) X 1_m. Let X̃ denote the corresponding matrix of centered data and u = (1/√m) 1_m. Show that X̃ = X(I_m − u u^T). Hence centering can be written as a matrix multiplication operation on the data matrix X.


Exercise 8.4. Let X ∈ Rn×m . Show that the set of nonzero eigenvalues of XX T is the same as the set of nonzero
eigenvalues of X T X.
Exercise 8.5. The Rayleigh quotient of matrix P ∈ Rn×n evaluated at a nonzero x ∈ Rn is

R_P(x) = (x^T P x)/(x^T x).
If P is symmetric, show that λmin (P ) ≤ RP (x) ≤ λmax (P ).


Chapter 9

Least Squares Regression

Let L denote the family of real-valued linear functions on Rn , i.e.,

L = {f : f (x) = wT x, w ∈ Rn }.

Based on a finite set of training examples {(xj , yj ) ∈ Rn × R}m j=1 , we want to select f ∈ L so that the
linear function f (x) best approximates the relationship between the input variable x ∈ Rn and the output
variable y ∈ R. Since L is parameterized by the vector variable w ∈ Rn , this is equivalent to selecting ŵ so
that f (x) = ŵT x achieves the above goal.
Let X = [x1 , . . . , xm ] ∈ Rn×m be the matrix of training examples (input vectors) and y = [yj ] ∈ Rm
be the vector of corresponding target values (output values). Each row of X gives the observed values of
one feature of the data examples. For example, in a medical context xi (1) might be a measurement of heart
rate, xi (2) of blood pressure, and so on. In this case, the first row of X gives the values of the heart rate
feature across all examples, and the second row gives the values of the blood pressure feature, and so on.
We call the rows of X the feature vectors.
For a given w ∈ Rn , the vector of predicted values ŷ ∈ Rm , and the corresponding vector of prediction
errors ε ∈ Rm on the training data, are given by

ŷ = X^T w    (9.1)
ε = y − ŷ = y − X^T w.    (9.2)

Each row in X T is a training example, and each column of X T is a feature vector. So ŷ is formed as a linear
combination of the feature vectors. The error vector ε is the part of y that is “unexplained” by X^T w. It is often called the residual.
To learn a w that achieves the “best” prediction performance, we need to measure performance using a
function of w ∈ Rn . This can be done, for example, by taking a norm of the residual vector on the training
data. What is ultimately important, however, is the prediction error on held-out testing examples. This is
the testing error or generalization error.
The problem described above is called linear regression. Linear regression finds the “best approxima-
tion” to the target vector y as a linear combination of the feature vectors f1 , . . . , fn (the columns of X T ) by
minimizing a cost function of the residual ε = y − X T w on the training data. In this context, the matrix X T
is often called the regression matrix and a column of X T is called a regressor.


9.1 Ordinary Least Squares


Least squares is a linear regression method based on selecting the parameter w to minimize the squared
Euclidean norm of the residual ‖ε‖₂² = ‖y − F w‖₂², where we have set F = X^T. The columns of F are the features. Since this metric is the sum of squared errors ∑_{i=1}^m ε(i)², it is called the least squares or residual
sum of squares (RSS) objective.
This gives rise to the standard least squares problem:
w^⋆ = arg min_{w∈R^n} ‖y − F w‖₂²,    ŷ(x) = w^⋆T x.    (9.3)
We do not claim that this method of learning w is “optimal” beyond that w? minimizes the least squares
objective function on the training data.

9.1.1 Some Simple Variations


The properties of the solution of (9.3), and methods for finding the solution, can often be transferred to
related forms of least squares. Here are some examples of closely related problems.

(1) Learning an affine function


Suppose we want to learn an affine function of the form ŷ(x) = wT x + b with w ∈ Rn and b ∈ R. Letting
F = X T and y be defined as before, you hence set out to solve:
w^⋆, b^⋆ = arg min_{w∈R^n, b∈R} ‖y − F w − b 1_m‖₂².    (9.4)

By a simple reorganization, problem (9.4) can be recast as a standard least squares problem. Let F̃ be formed by adding a column of all 1's at the right of F. So F̃ = [F  1_m]. Similarly, let v be formed by concatenating b to the end of w. So v = [w^T  b]^T. Then problem (9.4) is equivalent to the standard problem:

v^⋆ = arg min_{v∈R^{n+1}} ‖y − F̃ v‖₂²,

with v^⋆ = [w^⋆T  b^⋆]^T.
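Here is a minimal numpy sketch (ours, not from the notes) of this reorganization: append a column of ones to F, solve the standard problem, and read off w^⋆ and b^⋆. The synthetic data and the chosen "true" parameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 3
F = rng.normal(size=(m, n))                       # rows are training examples
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.7
y = F @ w_true + b_true + 0.01 * rng.normal(size=m)

F_tilde = np.hstack([F, np.ones((m, 1))])         # append a column of ones
v_star, *_ = np.linalg.lstsq(F_tilde, y, rcond=None)
w_star, b_star = v_star[:n], v_star[n]
print(w_star, b_star)                             # close to w_true, b_true
```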

(2) Quadratic performance metric



Instead of the Euclidean norm, suppose we use the quadratic norm ‖x‖_P = (x^T P x)^{1/2} where P ∈ R^{n×n} is symmetric positive definite. This is sometimes called a Mahalanobis norm. In this case, the least squares problem becomes

w^⋆ = arg min_{w∈R^n} ‖y − F w‖_P².    (9.5)
By a simple transformation, problem (9.5) can be written in the standard form (9.3). To see this, bring
in the symmetric positive definite square root of the matrix P . Using an eigendecomposition of P ,
P = U Λ U^T = (U Λ^{1/2} U^T)(U Λ^{1/2} U^T) = √P √P.

The matrix √P is the unique symmetric, positive definite, square root of P. Using √P we can write

‖y − F w‖_P² = (y − F w)^T P (y − F w) = ‖√P y − √P F w‖₂² = ‖ỹ − F̃ w‖₂².

This reduces (9.5) to a standard least squares problem with a modified regressor matrix F̃ = √P F and target vector ỹ = √P y.


(3) Multiple least squares objectives


Now consider a problem with multiple least squares objectives:

w^⋆ = arg min_{w∈R^n} ‖F_1 w − y_1‖₂² + ‖F_2 w − y_2‖₂².    (9.6)

Here F_1 ∈ R^{m_1×n}, F_2 ∈ R^{m_2×n}, y_1 ∈ R^{m_1}, and y_2 ∈ R^{m_2}. Noting that the sum ‖F_1 w − y_1‖₂² + ‖F_2 w − y_2‖₂² is just the squared Euclidean norm of the vector formed by stacking F_1 w − y_1 on top of F_2 w − y_2, we can write:

‖F_1 w − y_1‖₂² + ‖F_2 w − y_2‖₂² = ‖F̃ w − ỹ‖₂²,

where

F̃ = [F_1^T  F_2^T]^T ∈ R^{(m_1+m_2)×n}   and   ỹ = [y_1^T  y_2^T]^T ∈ R^{m_1+m_2}.

This reduces (9.6) to a standard least squares problem with an augmented regression matrix F̃ and tar-
get vector ỹ. A similar transformation can be applied if the objective is a finite sum of quadratic terms ∑_{j=1}^k ‖F_j w − y_j‖₂².

(4) Ridge regression


A particular example of linear regression, called ridge regression, takes the form:

w^⋆ = arg min_{w∈R^n} ‖F w − y‖₂² + λ‖w‖₂².    (9.7)

Here the scalar λ > 0 is selected to appropriately balance the competing objectives of minimizing the residual squared error while keeping w small. Notice that (9.7) is a special case of (9.6) with F_1 = F, y_1 = y, F_2 = √λ I_n and y_2 = 0. Hence problem (9.7) can be transformed into a standard least squares problem with the objective ‖F̃ w − ỹ‖₂², where

F̃ = [F^T  √λ I_n]^T ∈ R^{(m+n)×n}   and   ỹ = [y^T  0^T]^T ∈ R^{m+n}.
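A short numpy sketch (ours, arbitrary synthetic data and λ) of this augmentation trick: solving the augmented standard least squares problem gives the same answer as the direct ridge formula derived later in (9.14).

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, lam = 40, 6, 0.5
F = rng.normal(size=(m, n))
y = rng.normal(size=m)

# Augmented standard least squares: F_tilde = [F; sqrt(lam) I], y_tilde = [y; 0].
F_tilde = np.vstack([F, np.sqrt(lam) * np.eye(n)])
y_tilde = np.concatenate([y, np.zeros(n)])
w_aug, *_ = np.linalg.lstsq(F_tilde, y_tilde, rcond=None)

# Direct ridge solution (F^T F + lam I)^{-1} F^T y.
w_ridge = np.linalg.solve(F.T @ F + lam * np.eye(n), F.T @ y)
print(np.allclose(w_aug, w_ridge))
```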

9.2 The Normal Equations and Properties of the Solution


Let J(w) denote the objective function in (9.3). J(w) is the square of a norm of an affine function of w. Hence by the results in Chapter 7 (Theorems 7.3.1 and 7.3.3), it is a convex function. Thus any local minimum of J is a global minimum (Theorem 7.5.3). To find a local minimum, we set the derivative of J with respect to w acting on h ∈ R^n equal to zero. This yields

DJ(w)(h) = (y − F w)T (−F h) + (−F h)T (y − F w)


= 2(F w − y)T F h
= 0.


Figure 9.1: Least squares regression. Find w? such that F w? = ŷ.

At a stationary point, equality in this expression must hold for all h ∈ Rn . Hence for w? to be a solution it
is necessary that
F T F w? = F T y. (9.8)

These are called the normal equations. The convexity of the objective function and Corollary 7.5.1 ensure
that a solution w? of the normal equations is a solution of the least squares problem. Hence any vector
satisfying (9.8) is called a least squares solution of (9.3).
All least squares solutions have the following fundamental property.

Lemma 9.2.1. Let w? be a solution of the normal equations (9.8), and set ŷ = F w? and ε = y − ŷ.
Then
y = ŷ + ε with ŷ ∈ R(F ) and ε ∈ R(F )⊥ .

Proof. Clearly ŷ = F w? ∈ R(F ), and by definition y = ŷ +ε. By (9.8), 0 = F T y −F T F w? = F T (y − ŷ).


So ε = y − ŷ ∈ N (F T ) = R(F )⊥ .

By Lemma 9.2.1, ŷ is the unique orthogonal projection of y onto R(F ) (the span of the feature vectors)
and ε is the orthogonal residual. Every least squares solution w? gives an exact representation of ŷ as a
linear combination of the columns of F . Hence w? is unique if and only if the columns of F (the feature
vectors) are linearly independent. This requires rank(F ) = n ≤ m. So uniqueness requires more examples
than features. One can readily show that the columns of F are linearly independent if and only if F T F is
invertible. In that case,
w? = (F T F )−1 F T y.

On the other hand, if the columns of F are linearly dependent (rank(F ) = r < n), then N (F ) is nontrivial,
and there are infinitely many solutions w? , each giving a different representation of the same point ŷ.
In summary, finding the solution of a standard least squares problem involves two operations:

(a) Projection: y is orthogonally projected onto R(F ) to yield the unique vector ŷ.

(b) Representation: ŷ is exactly represented as a linear combination of the columns of F : ŷ = F w? .


However, the uniqueness of w? depends on the rank of F . If r = rank(F ) = n, then the columns of
F are linearly independent and the solution w? is unique. But if rank(F ) < n the columns of F are
linearly dependent, N (F ) is nontrivial, and w? is not unique.
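As a concrete sketch (ours, synthetic data), when the columns of F are linearly independent the unique solution can be obtained by solving the normal equations; np.linalg.lstsq returns the same answer, and the residual is orthogonal to the columns of F as in Lemma 9.2.1.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 30, 4
F = rng.normal(size=(m, n))                       # columns independent with probability one
y = rng.normal(size=m)

w_ne = np.linalg.solve(F.T @ F, F.T @ y)          # solve the normal equations (9.8)
w_ls, *_ = np.linalg.lstsq(F, y, rcond=None)
print(np.allclose(w_ne, w_ls))

# The residual is orthogonal to R(F) (Lemma 9.2.1).
print(np.allclose(F.T @ (y - F @ w_ne), 0.0))
```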


9.2.1 SVD Analysis


Let F = U ΣV T be a compact SVD. Substituting this into the normal equations (9.8) gives

V ΣU T U ΣV T w = V ΣU T y.

Using the properties of U , V and Σ this expression can be simplified to

V V T w? = V Σ−1 U T y. (9.9)

If the columns of F are linearly independent, N (F ) = 0 and N (F )⊥ = Rn . Recall that the columns of V
span N (F )⊥ . So in this case, V ∈ On , V V T w? = w? and the unique least squares solution is given by

w? = V Σ−1 U T y. (9.10)

On the other hand, if the columns of F are linearly dependent then N (F ) is nontrivial. In this case, if w? is
a solution of the normal equations, then so is w? + v for every v ∈ N (F ). Conversely, if w is a solution of
the normal equations, then F T F (w − w? ) = 0. So w = w? + v with v ∈ N (F ). Hence the set of solutions
is a linear manifold formed by translating the subspace N (F ) by a particular solution: w? + N (F ).
We claim that the solution manifold always contains a unique solution w⋆_ln of least norm. To see this, note that the manifold is a convex set, and ‖w‖₂² is a strongly convex function on this set. Hence by Theorem
7.5.3, there is a unique point of least norm on the solution manifold. What is less obvious is that (9.10) is
the least norm solution.
Theorem 9.2.1. Let U Σ V^T be a compact SVD of F. Then w⋆_ln = V Σ^{−1} U^T y.

Proof. Let w̃ = V Σ−1 U T y. We first show that w̃ is a solution. This follows by the expansion

ky − F wk22 = k(I − U U T )y + U U T y − U (ΣV T w)k22 = k(I − U U T )yk22 + kU T y − ΣV T wk22 .

The first term is a constant and the second is made zero by setting w = w̃. Hence w̃ achieves the minimum of the least squares cost. Thus it must be a solution. It follows that we can write w⋆_ln = w̃ + w_0 for some w_0 ∈ N(F). From the definition of w̃, note that w̃ ∈ N(F)^⊥. Hence by Pythagoras, ‖w⋆_ln‖₂² = ‖w̃‖₂² + ‖w_0‖₂². So ‖w⋆_ln‖₂² ≥ ‖w̃‖₂². Since w⋆_ln is the least norm solution, we must have w_0 = 0, and hence w⋆_ln = w̃.

So when the columns of F are linearly independent, (9.10) gives the unique solution, and when the
columns are linearly dependent, it gives the least norm solution. The matrix F + = V Σ−1 U T is the Moore-
Penrose pseudo-inverse of F (see Exercise 5.16).
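When the columns of F are linearly dependent, the following sketch (ours) computes the least norm solution from a compact SVD and checks it against numpy's pseudo-inverse; the rank-deficient F is constructed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, r = 20, 6, 3
F = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank r < n, so N(F) is nontrivial
y = rng.normal(size=m)

U, s, Vt = np.linalg.svd(F, full_matrices=False)
Uc, sc, Vtc = U[:, :r], s[:r], Vt[:r, :]                 # compact SVD
w_ln = Vtc.T @ (Uc.T @ y / sc)                           # w_ln = V Sigma^{-1} U^T y

print(np.allclose(w_ln, np.linalg.pinv(F) @ y))          # Moore-Penrose pseudo-inverse gives the same w

# w_ln has no component in N(F): the trailing right singular vectors span the null space.
print(np.allclose(Vt[r:, :] @ w_ln, 0.0))
```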

An Alternative Parameterization
Here is a corollary to Theorem 9.2.1. Recall that F = X^T, where X ∈ R^{n×m} is the matrix with the
training examples as its columns.

Corollary 9.2.1. The least norm solution w⋆_ln (including the least squares solution when it is unique) lies in the span of the training examples: w⋆_ln ∈ R(X). Hence for some a^⋆ ∈ R^m, w⋆_ln = X a^⋆.

Proof. w⋆_ln = V Σ^{−1} U^T y ∈ N(F)^⊥ = R(F^T). Hence w⋆_ln ∈ R(X).


Corollary 9.2.1 allows us to re-parameterize the least squares problem as follows:

a^⋆ = arg min_{a∈R^m} ‖y − X^T X a‖₂²,    ŷ(x) = a^⋆T X^T x.    (9.11)

At first sight, this reformulation of the problem might seem to offer little advantage. But notice that the
reformulated problem uses the m × m Gram matrix X T X of the examples. If the number of examples m
is significantly less than the dimension of the data n, that could be useful. In addition, the computations in
(9.11) only require taking inner products of training examples (X T X), and of the training examples with a
test example (X T x). That also turns out to be potentially useful.

Linear Independence of the Columns of F


Recall that the first column of F ∈ Rm×n is the vector of measurements of the first feature, the second is
the vector of measurements of the second feature, and so on. If the third column is linearly dependent on the
first two, then the third feature can be exactly predicted as a linear function of the first two features. In this
sense, the features are redundant. More generally, if F has rank r < n, then the feature vectors are linearly
dependent. If the number of features n is greater than the number of examples m, this will always be the case. Even when the number of features is less than the number of examples (n < m), one could still encounter linearly dependent feature vectors. As a result, the feature vectors span a proper subspace R(F) ⊂ R^m of
dimension r = rank(F ) ≤ min(n, m).
Let F = U ΣV T be a compact SVD. The least norm solution wln ? = V Σ−1 U T y is obtained by first

projecting y onto the r left singular vectors U T y, then using V Σ−1 to map these coordinates to the vector
wln? ∈ N (F )⊥ . If the columns of F are almost linearly dependent, we expect the features to be very

close to a subspace of dimension d < m in Rm . A natural candidate for this subspace is obtained by a
rank d approximation of F . Let Ud , Vd denote the matrices consisting of the first d columns of U and V
respectively, and Σd denote the top left d × d submatrix of Σ. Then the least squares solution using the rank
d approximation Ud Σd VdT to F is

w_d^⋆ = V_d Σ_d^{−1} U_d^T y = ∑_{j=1}^d (1/σ_j) v_j u_j^T y.

This gives a sequence of solutions w_d^⋆, d ∈ [1 : r], with w_r^⋆ = w⋆_ln.
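A small numpy sketch (ours, synthetic data) of this sequence of solutions: each w_d is built from the d leading singular triples, and w_r recovers the least norm solution.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 25, 8
F = rng.normal(size=(m, n))
y = rng.normal(size=m)

U, s, Vt = np.linalg.svd(F, full_matrices=False)
r = int(np.sum(s > 1e-12 * s[0]))                  # numerical rank

def w_trunc(d):
    # w_d = sum_{j<=d} (1/sigma_j) v_j u_j^T y, using the d leading singular triples.
    return Vt[:d, :].T @ ((U[:, :d].T @ y) / s[:d])

w_seq = [w_trunc(d) for d in range(1, r + 1)]
print(np.allclose(w_seq[-1], np.linalg.pinv(F) @ y))   # w_r equals the least norm solution
```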

9.3 Tikhonov and Ridge Regularization


The least norm solution wln ? is one approach for handling non-unique solutions. Another approach is to add a

new term to the objective function that ensures a unique solution. For example, if prior knowledge suggests
that w^⋆ is likely to be small, one can modify the least squares problem to min_{w∈R^n} ‖y − F w‖₂² + λ‖w‖₂². Here λ > 0 is a selectable hyperparameter that balances the two competing terms, ‖y − F w‖₂² and ‖w‖₂², in the
overall objective. We show below that this modified problem always has a unique solution. More generally,
prior information may indicate that w is likely to be close to a given vector b ∈ Rn . This information can be
incorporated by selecting w to minimize ky − F wk22 + λkw − bk22 . This problem also has a unique solution.
The imposition of an auxiliary term into the least squares objective, as illustrated above, is often called
regularization. It was first investigated in the context of underdetermined problems by the Russian math-
ematician Tikhonov (1943). Stated in our context, he studied regularization terms of the form kGwk22 for


a specified matrix G. This approach can be generalized to kGw − gk22 where g is a given vector. This
form of regularization is called Tikhonov regularization. Somewhat later, regularized linear regression was
studied by Hoerl (1962) [19], and Hoerl and Kennard (1970) [20], using the regularization term kwk22 . This
approach is now widely known as ridge regression.
Tikhonov regularized least squares can be posed as:

min_{w∈R^n} ‖F w − y‖₂² + λ‖G w − g‖₂²,    (9.12)

where F ∈ Rm×n and y ∈ Rm are the usual elements of the least squares problem, and G ∈ Rk×n and
g ∈ Rk are the new elements of the regularization penalty. The selectable scalar λ > 0 balances the two
competing objectives. Ridge regularization is a special case with G = In and g = 0. We have already seen
that (9.12) can be transformed into a standard least squares problem. Since the objective of (9.12) is a sum
of squares, we can write it as

‖F w − y‖₂² + λ‖G w − g‖₂² = ‖F̃ w − ỹ‖₂²,

where

F̃ = [F^T  √λ G^T]^T ∈ R^{(m+k)×n}   and   ỹ = [y^T  √λ g^T]^T ∈ R^{m+k}.

The corresponding augmented normal equations are

(F T F + λGT G)w = F T y + λGT g.

If F̃ has n linearly independent columns (i.e., rank(F̃ ) = n), the augmented problem has the unique
solution
w? (λ) = (F T F + λGT G)−1 (F T y + λGT g) . (9.13)
A sufficient condition ensuring that rank(F̃ ) = n is rank(G) = n (Exercise 9.10).
Ridge regularization is a special case with k = n, G = In and g = 0. In this case, F T F is symmetric
positive semidefinite, and adding λIn ensures the sum is positive definite and hence invertible. Thus ridge
regression always has a unique solution, and this is obtained by solving
?
wrr (λ) = (F T F + λIn )−1 F T y . (9.14)

The residual ε = y − F w^⋆ under Tikhonov and ridge regularization is generally not orthogonal to the columns of F. You can see this using the normal equations. For ridge regularization F^T(y − F w⋆_rr) = λ w⋆_rr, and for Tikhonov regularization F^T(y − F w^⋆) = λ G^T(G w^⋆ − g).
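The sketch below (ours, arbitrary data) evaluates the Tikhonov solution (9.13) and confirms numerically that the regularized residual satisfies the identity above and is therefore not orthogonal to R(F).

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, k, lam = 30, 5, 5, 2.0
F = rng.normal(size=(m, n))
y = rng.normal(size=m)
G = rng.normal(size=(k, n))                       # rank n with probability one, so the solution is unique
g = rng.normal(size=k)

# Tikhonov solution (9.13).
w_tik = np.linalg.solve(F.T @ F + lam * G.T @ G, F.T @ y + lam * G.T @ g)

# F^T (y - F w) = lam G^T (G w - g), so the residual is generally not orthogonal to R(F).
lhs = F.T @ (y - F @ w_tik)
rhs = lam * G.T @ (G @ w_tik - g)
print(np.allclose(lhs, rhs), np.allclose(lhs, 0.0))
```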

Regularization Path
The solutions (9.13) and (9.14) are functions of the regularization parameter λ. As λ varies the solutions
trace out a curve in Rn called the regularization path. For ridge regression, the entire regularization path is
contained in R(X).

Theorem 9.3.1. For λ > 0, w⋆_rr(λ) traces out a curve in N(F)^⊥ = R(X).



Figure 9.2: An example of a ridge regression regularization path as λ varies from 0 to ∞.

Proof. Abbreviate w⋆_rr(λ) to w⋆_rr, and let w⋆_rr = v + w with v ∈ N(F)^⊥ and w ∈ N(F). Then F w⋆_rr = F(v + w) = F v and ‖F w⋆_rr − y‖₂² = ‖F v − y‖₂². So w⋆_rr and v have the same least squares cost. By Pythagoras, ‖w⋆_rr‖₂² = ‖v‖₂² + ‖w‖₂². If w ≠ 0, then ‖v‖₂² < ‖w⋆_rr‖₂², so v would achieve a strictly smaller ridge objective; a contradiction. Hence w⋆_rr ∈ N(F)^⊥. Finally, N(F)^⊥ = R(F^T) = R(X).
We can say more by bringing in a compact SVD F = U ΣV T . Recall that the columns of U form an ON
basis for R(F ) and the columns of V form an ON basis for N (F )⊥ . Substituting F = U ΣV T into (9.14)
yields:
(F^T F + λ I_n) w⋆_rr = (V Σ² V^T + λ I_n) w⋆_rr,   and   F^T y = V Σ U^T y.

By Theorem 9.3.1, V V^T w⋆_rr = w⋆_rr. Hence

(V Σ² V^T + λ I_n) w⋆_rr = (V Σ² V^T + λ I_n) V V^T w⋆_rr = V(Σ² + λ I_r) V^T w⋆_rr.
This allows us to write the normal equations as V(Σ² + λ I_r) V^T w⋆_rr = V Σ U^T y. Multiplying both sides of this equation by V(Σ² + λ I_r)^{−1} V^T yields:

w⋆_rr(λ) = V(Σ² + λ I_r)^{−1} Σ U^T y = V diag( σ_j/(λ + σ_j²) ) U^T y = ∑_{j=1}^r σ_j/(λ + σ_j²) v_j u_j^T y.    (9.15)

By varying λ in (9.15), the entire regularization path is easily computed. In addition, (9.15) provides a proof
of the following result.
Lemma 9.3.1. lim_{λ→0} w⋆_rr(λ) = w⋆_ln.

Proof. By (9.15), lim_{λ→0} w⋆_rr(λ) = V Σ^{−1} U^T y. Thus by Theorem 9.2.1, lim_{λ→0} w⋆_rr(λ) = w⋆_ln.
So as λ increases from 0, w⋆_rr(λ) starts at w⋆_ln ∈ R(X), moves along the regularization path within the
subspace R(X), and gradually shrinks to 0 as λ ↑ ∞. An example is shown in Figure 9.2.
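A short numpy sketch (ours, with an arbitrary grid of λ values and synthetic data) of computing the ridge regularization path from a compact SVD via (9.15).

```python
import numpy as np

rng = np.random.default_rng(10)
m, n = 40, 6
F = rng.normal(size=(m, n))
y = rng.normal(size=m)

U, s, Vt = np.linalg.svd(F, full_matrices=False)
Uty = U.T @ y

def w_rr(lam):
    # w_rr(lam) = sum_j sigma_j / (lam + sigma_j^2) v_j (u_j^T y), equation (9.15).
    return Vt.T @ (s / (lam + s ** 2) * Uty)

lambdas = np.logspace(-4, 4, 9)
path = np.array([w_rr(lam) for lam in lambdas])
print(path.shape)                                    # one point on the path per lambda

# As lambda -> 0 the path approaches the least norm solution (Lemma 9.3.1).
print(np.allclose(w_rr(1e-12), np.linalg.pinv(F) @ y))
```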


9.4 On-Line Least Squares


In a useful variation of least squares the training examples are presented sequentially and the least squares
solution is updated as each example is presented. This approach gives rise to an update algorithm known as
recursive least squares.
If one continues to add training examples, eventually m > n and the regression problem becomes
overdetermined. In this situation, let Fm ∈ Rm×n denote the feature matrix, ym ∈ Rm denote the target
vector, and assume that Fm has n linearly independent columns. Since we will be adding rows to a narrow
regression matrix (m > n), it is natural to write the normal equations in terms of the rows of Fm . To this
end, let Pm denote the scatter matrix of the data at step m
P_m = F_m^T F_m = [x_1  x_2  . . .  x_m][x_1  x_2  . . .  x_m]^T = ∑_{j=1}^m x_j x_j^T,    (9.16)

and

s_m = F_m^T y_m = ∑_{j=1}^m x_j y(j).    (9.17)
In terms of P_m and s_m, the normal equations are P_m w_m^⋆ = s_m. Since the columns of F_m are linearly independent, P_m is nonsingular and the unique least squares solution is w_m^⋆ = P_m^{−1} s_m.
Adding an (m + 1)-th training example adds a new row to Fm , and a new entry to ym . This yields
P_{m+1} = F_{m+1}^T F_{m+1} = P_m + x_{m+1} x_{m+1}^T,    (9.18)
s_{m+1} = F_{m+1}^T y_{m+1} = s_m + y(m+1) x_{m+1}.    (9.19)
Hence to find w_{m+1}^⋆ we need to solve the updated normal equations

(P_m + x_{m+1} x_{m+1}^T) w_{m+1}^⋆ = s_m + y(m+1) x_{m+1}.    (9.20)
We want to exploit the connection between Pm and Pm+1 to efficiently update the least squares solution.
Since Pm+1 and sm+1 have the same dimensions as Pm ∈ Rn×n and sm ∈ Rn , respectively, directly solving
the updated normal equations requires O(n3 ) time. Assuming we have previously computed Pm −1 , we seek

a more efficient solution.


We can state the fundamental problem as follows. Given an invertible symmetric matrix P ∈ Rk×k and
a vector d ∈ Rk , efficiently find the inverse of the matrix P + ddT , assuming it exists. Remarkably, this
problem has a straightforward solution.

Lemma 9.4.1. Let P ∈ R^{k×k} be an invertible symmetric matrix and d ∈ R^k. If 1 + d^T P^{−1} d ≠ 0, then P + d d^T is invertible and

(P + d d^T)^{−1} = P^{−1} − (1/(1 + d^T P^{−1} d)) (P^{−1} d)(P^{−1} d)^T.    (9.21)

Proof. Since 1 + d^T P^{−1} d ≠ 0, the RHS of (9.21) is a finite valued k × k real matrix. To prove the claim simply multiply the RHS by (P + d d^T):

(P + d d^T)[ P^{−1} − (1/(1 + d^T P^{−1} d)) (P^{−1} d)(P^{−1} d)^T ]
  = I_k + d d^T P^{−1} − (1/(1 + d^T P^{−1} d)) d d^T P^{−1} − (1/(1 + d^T P^{−1} d)) d (d^T P^{−1} d) d^T P^{−1}
  = I_k + ( 1 − 1/(1 + d^T P^{−1} d) − (d^T P^{−1} d)/(1 + d^T P^{−1} d) ) d d^T P^{−1}
  = I_k.


For an alternative proof, and a generalization, see Appendix B. Also see the appendix to this chapter.

By assumption, the columns of F_m are linearly independent. Hence P_m is nonsingular and the least squares solution is w_m^⋆ = P_m^{−1} s_m. After obtaining w_m^⋆ we assume that P_m^{−1} and s_m remain available. When a new training example is added, F_{m+1} still has n linearly independent columns (F_m has linearly independent columns and adding a row to these vectors does not change this). Hence P_{m+1} is also invertible. Application of Lemma 9.4.1 to equation (9.20) yields the following set of recursive equations for computing P_{m+1}^{−1} from P_m^{−1} and x_{m+1}, and hence for obtaining w_{m+1}^⋆:

P_{m+1}^{−1} = P_m^{−1} − (P_m^{−1} x_{m+1})(P_m^{−1} x_{m+1})^T / (1 + x_{m+1}^T P_m^{−1} x_{m+1}),    (9.22)
w_{m+1}^⋆ = P_{m+1}^{−1} s_{m+1}.    (9.23)

If we substitute (9.22) and (9.19) into (9.23) and simplify we obtain

ŷ(m+1) = x_{m+1}^T w_m^⋆,    (9.24)
w_{m+1}^⋆ = w_m^⋆ + ( P_m^{−1} x_{m+1} / (1 + x_{m+1}^T P_m^{−1} x_{m+1}) ) (y(m+1) − ŷ(m+1)).    (9.25)

This update procedure is known as recursive least squares. It gives an efficient update formula for w_{m+1}^⋆ in terms of w_m^⋆ and each new training example. The update is driven by the prediction error y(m+1) − ŷ(m+1),
with no update required if the prediction error is zero. Inverting Pm+1 ∈ Rn×n requires O(n3 ) operations.
On the other hand, the recursive equations above require O(n2 ) operations per update. Hence RLS is an
efficient procedure when examples are presented sequentially and a solution is needed immediately.
Exercise 9.19 explores online least squares using mini-batch updates, and Exercise 9.20 explores an
online version of ridge regression.
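Below is a minimal numpy sketch (ours) of the recursive least squares updates (9.22)–(9.25). The data stream is synthetic, and the initial batch of m0 examples is an arbitrary choice used only to make P_m invertible before recursion begins.

```python
import numpy as np

rng = np.random.default_rng(11)
n, m0, m_total = 4, 10, 200
w_true = rng.normal(size=n)
X = rng.normal(size=(n, m_total))
y = X.T @ w_true + 0.01 * rng.normal(size=m_total)

# Initialize from the first m0 examples solved in batch form.
F0 = X[:, :m0].T
P_inv = np.linalg.inv(F0.T @ F0)
w = P_inv @ (F0.T @ y[:m0])

# Recursive least squares updates (9.22)-(9.25) for the remaining examples.
for j in range(m0, m_total):
    x = X[:, j]
    Px = P_inv @ x
    denom = 1.0 + x @ Px
    y_hat = x @ w                               # prediction (9.24)
    w = w + Px * (y[j] - y_hat) / denom         # update w (9.25)
    P_inv = P_inv - np.outer(Px, Px) / denom    # update P^{-1} (9.22)

# Matches the batch least squares solution on all the data.
w_batch, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print(np.allclose(w, w_batch))
```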

Appendix: Matrix Inversion After a Rank One Update


Let P ∈ Rk×k be an invertible symmetric matrix and u ∈ Rk . For a scalar α, the matrix B(α) = P + αuuT
is the point in Rk×k obtained by moving from P in a straight line in the direction of the rank one matrix
uuT . B(α) is also symmetric, and for α sufficiently small, it will also be invertible. This observation
follows by observing that the coefficients of the characteristic polynomial of B(α) are polynomials in α and
are hence continuous in α. It is a standard result that the roots of a polynomial are continuous functions of
its coefficients. Since the eigenvalues of P are nonzero, it follows that for α sufficiently small, so are those
of B(α).
As α increases from 0, B(α) traces out a ray in R^{k×k} starting from the point P. We are interested in
investigating the corresponding curve B −1 (α) traced out in Rk×k as α increases from 0. It is remarkable
that this curve is also a ray in Rk×k that starts from P −1 . This is summarized as:

Lemma 9.4.2. For all α with 1 + α u^T P^{−1} u ≠ 0,

(P + α u u^T)^{−1} = P^{−1} − (α/(1 + α u^T P^{−1} u)) (P^{−1} u)(P^{−1} u)^T.


It is easy to prove this result; one need only do the required multiplication. Here is an alternative
analysis. For λ small
B^{−1}(λ) = P^{−1} + λ (d/dλ) B^{−1}(λ)|_{λ=0} + H.O.T.

Now B^{−1}(λ)B(λ) = I. Hence

(d/dλ) B^{−1}(λ) = −B^{−1}(λ) u u^T B^{−1}(λ).    (9.26)

Thus for λ small
B −1 (λ) = P −1 − λ(P −1 u)(P −1 u)T + H.O.T.
We see that the curve leaves the point P −1 in the direction (P −1 u)(P −1 u)T .
Now look at the second derivative. This is easily found using (9.26)

(d²/dλ²) B^{−1}(λ)|_{λ=0} = 2 B^{−1}(λ) u u^T B^{−1}(λ) u u^T B^{−1}(λ)|_{λ=0}
  = 2 P^{−1} u (u^T P^{−1} u) u^T P^{−1}
  = 2 [u^T P^{−1} u] (P^{−1} u)(P^{−1} u)^T.

This yields the second order expansion

B^{−1}(λ) = P^{−1} + [−λ + λ² (u^T P^{−1} u)] (P^{−1} u)(P^{−1} u)^T + H.O.T.

To second order, the curve still leaves the point P −1 in a straight line in the direction (P −1 u)(P −1 u)T .
One may already see that this will be true to any order. The same method used to obtain the second derivative
can be used to obtain derivatives to any order, and when evaluated at λ = 0, each is a scalar multiple of the
same matrix (P −1 u)(P −1 u)T . So the curve is a straight line leaving P −1 in the direction (P −1 u)(P −1 u)T .
To find the parameterization of the straight line given in the lemma, work out the n-th derivative of
B^{−1}(λ) at λ = 0. Then examine the Taylor series. This results in

(P + λ u u^T)^{−1} = P^{−1} + [ ∑_{n=1}^∞ (−1)^n λ^n (u^T P^{−1} u)^{n−1} ] (P^{−1} u)(P^{−1} u)^T
  = P^{−1} − (λ/(1 + λ u^T P^{−1} u)) (P^{−1} u)(P^{−1} u)^T.

Exercises
Preliminaries
Exercise 9.1. Let A ∈ Rm×n . Show each of the following claims.
a) AT A is positive definite (hence invertible) if and only if the columns of A are linearly independent.
b) AAT is positive definite if and only if the rows of A are linearly independent.
c) N (AT A) = N (A) and N (AAT ) = N (AT ).
d) R(AT A) = R(AT ) and R(AAT ) = R(A).
Exercise 9.2. Show that for any A ∈ Rm×n and λ > 0, AT (λIm + AAT )−1 = (λIn + AT A)−1 AT .
Exercise 9.3. Let P ∈ Rn×n be a symmetric positive definite matrix.


a) Show that there exists a unique symmetric positive definite matrix P 1/2 such that P = P 1/2 P 1/2 .

b) Let kxkP = (xT P x)1/2 . Show that k · kP is a norm on Rn .


Exercise 9.4. (Vandermonde matrix) Show that if the real numbers {t_j}_{j=1}^m are distinct and m > n − 1, then the Vandermonde matrix

V = [ 1  t_1  t_1²  . . .  t_1^{n−1}
      1  t_2  t_2²  . . .  t_2^{n−1}
      ⋮
      1  t_m  t_m²  . . .  t_m^{n−1} ]

has n linearly independent columns (Hint: a polynomial of degree n − 1 has at most n − 1 roots).

Least Squares Regression

Exercise 9.5. (Affine least squares and data centering) Let {(x_i, y_i)}_{i=1}^m ⊂ R^n × R be a training dataset. Place the x_i in the columns of X ∈ R^{n×m}, and the target values y_i in the rows of y ∈ R^m. Consider the affine predictor ŷ(x) = w^T x + b, with x ∈ R^n. One can fit this predictor to the training data by forming w̃ = [w^T  b]^T and appending a row of ones to X to form X̃ = [X^T  1_m]^T, and solving the standard least squares problem

min_{w̃∈R^{n+1}} ‖y − X̃^T w̃‖₂².

The solution gives w^⋆ and b^⋆ for the affine predictor. Here we explore the alternative option of explicitly determining w and b by solving

w^⋆, b^⋆ = arg min_{w∈R^n, b∈R} ‖y − X^T w − b 1_m‖₂².    (9.27)

(a) Show that the modified normal equations for (9.27) are

X X^T w^⋆ + b^⋆ X 1_m = X y    (9.28)
µ̂_x^T w^⋆ + b^⋆ = µ̂_y,    (9.29)

where µ̂_x = (1/m) X 1_m ∈ R^n and µ̂_y = (1/m) 1_m^T y ∈ R.

(b) Show that the affine predictor takes the form ŷ(x) = w? T (x − µ̂x ) + µ̂y .

(c) By substituting (9.29) into (9.28), show that w? must satisfy

Xc XcT w? = Xc yc (9.30)

where Xc = X − µ̂x 1Tm is the matrix of centered input examples, and yc = y − µ̂y 1m is the vector of centered
output targets. Hence we can find w? by first centering the data, and solving a standard least squares problem.
We can then use w? , µ̂x , µ̂y , and (9.29), to find b? .
Exercise 9.6. (Affine ridge regression) Let {(x_i, y_i)}_{i=1}^m ⊂ R^n × R be a training dataset. Place the x_i in the columns of X ∈ R^{n×m}, and the target values y_i in the rows of y ∈ R^m. We learn an affine predictor ŷ(x) = w^⋆T x + b^⋆ by solving the ridge regression problem

w^⋆, b^⋆ = arg min_{w∈R^n, b∈R} ‖y − X^T w − b 1_m‖₂² + λ‖w‖₂²,   λ > 0.    (9.31)

(a) Determine the modified normal equations for this problem.

(b) Show that the affine predictor takes the form ŷ(x) = w^⋆T(x − µ̂_x) + µ̂_y, where µ̂_x = (1/m) X 1_m ∈ R^n and µ̂_y = (1/m) 1_m^T y ∈ R.


(c) Show that w^⋆ is the solution of the ridge regression problem

min_{w∈R^n} ‖y_c − X_c^T w‖₂² + λ‖w‖₂²    (9.32)

where Xc = X − µ̂x 1Tm is the matrix of centered input examples, and yc = y − µ̂y 1m is the vector of centered
output labels. Hence by first centering the data, and solving a standard ridge regression problem we find w? .
Show that we can then use w? , µ̂x , µ̂y , and the normal equations to find b? .
Exercise 9.7. (Centered versus uncentered problems) Let F ∈ R^{m×n} have rank n, and y ∈ R^m. Center the rows of F and the entries of y by forming

µ̂_y = (1/m) 1_m^T y,   y_c = (I − (1/m) 1_m 1_m^T) y = y − 1_m µ̂_y,
µ̂_f^T = (1/m) 1_m^T F,   F_c = (I − (1/m) 1_m 1_m^T) F = F − 1_m µ̂_f^T.

Assume F_c also has rank n. Determine the relationship between w^⋆ and w_c, where

w^⋆ ∈ arg min_{w∈R^n} ‖y − F w‖₂²,   w_c ∈ arg min_{w∈R^n} ‖y_c − F_c w‖₂².    (9.33)

Exercise 9.8. (Least squares regression with vector targets.) We are given training data {(xi , zi )}m i=1 with input
examples xi ∈ Rn and vector target values zi ∈ Rd . Place the input examples into the columns of X ∈ Rn×m and the
targets into the columns of Z ∈ Rd×m . We want to learn a linear predictor of the targets z ∈ Rd of test inputs x ∈ Rn .
To do so, first use the training data to find:

W^⋆ = arg min_{W∈R^{n×d}} ‖Y − F W‖_F² + λ‖W‖_F²,    (9.34)

where we have set Y = Z T and F = X T and require λ ≥ 0 (λ = 0 removes the ridge regularizer).
(a) Show that (9.34) separates into d standard ridge regression problems each solvable separately.
(b) Without using the property in (a), find an expression for the solution W ? . Is the separation property evident
from this expression?
Exercise 9.9. (SVD/PCA Regression). Let F ∈ R^{m×n} have rank r and compact SVD U Σ V^T. Let y ∈ R^m and let w⋆_ln denote the least norm solution to the regression problem min_{w∈R^n} ‖y − F w‖₂. For 1 ≤ k ≤ r, let F_k = U_k Σ_k V_k^T be the SVD rank k approximation to F. Here U_k and V_k consist of the first k columns of U and V, respectively, and Σ_k is the top left k × k submatrix of Σ. Then let w_k^⋆ be the least norm solution to the problem:

min_{w∈R^n} ‖y − F_k w‖₂.

Show that:
(a) w_k^⋆ = V_k Σ_k^{−1} U_k^T y.
(b) w_k^⋆ is the orthogonal projection of w⋆_ln onto the subspace R(V_k).
(c) F_k w_k^⋆ yields the orthogonal projection ŷ_k of y onto R(U_k).
Instead of regressing y on Fk , suppose we regress y on the first k left singular vectors Uk . This problem always has a
unique solution. In this case we solve:
z_k^⋆ = arg min_{z∈R^k} ‖y − U_k z‖₂.

(d) Show that the unique solution is zk? = UkT y.


(e) Show that zk? = Σk VkT wk? . So zk? is simply the vector of coordinates of the least norm solution wk? w.r.t. Vk
scaled by Σk .
(f) What can you say about the solutions wˆk? and zˆk? when Ûk and V̂k are k corresponding columns of U and V (not
necessarily the first k), and Σ̂k is the corresponding submatrix of Σ?


Exercise 9.10. (Uniqueness of the Tikhonov solution) Tikhonov regularized least squares is posed as:

min_{w∈R^n} ‖F w − y‖₂² + λ‖G w − g‖₂²,    (9.35)

where F ∈ Rm×n , y ∈ Rm , G ∈ Rk×n , g ∈ Rk and λ > 0.


(a) Show that a sufficient condition for (9.35) to have a unique solution is that rank(G) = n.
(b) Show that a necessary and sufficient condition is that N (F ) ∩ N (G) = 0.
Exercise 9.11. (Weighted least squares.) Suppose you want to put more weight on matching the target values of
some examples and less weight on others. You can do this by bringing in a diagonal matrix D ∈ Rm×m with diagonal
weights d_i > 0 and solving arg min_{w∈R^n} ‖y − F w‖_D².
(a) Determine and interpret the effect of the di on the least norm least squares solution.
(b) Do the same for the corresponding form of ridge regression.
Exercise 9.12. Let F ∈ R^{n×p} have rank r, and λ > 0. Consider the regression problem

w^⋆ = arg min_{w∈R^p} ‖y − F w‖₂² + λ‖w‖₂².

Show that w? = By for some matrix B ∈ Rp×n , and relate the singular values of B to those of F .

Exercise 9.13. Let X ∈ Rn×m and y ∈ Rm be given, and consider the problem

w^⋆ = arg min_{w∈R^n} ‖y − X^T w‖₂² + λ‖w‖₂².    (9.36)

Show that there exists a unique solution w? and that w? ∈ R(X). Using (9.36) and these two results, show that
w? = X(X T X + λIm )−1 y.

Approximation Problems
Exercise 9.14. For y, z ∈ Rn and λ > 0, find and interpret the solution of the approximation problem

arg min_{x∈R^n} ‖y − x‖₂² + λ‖x − z‖₂².

Exercise 9.15. Let D ∈ Rn×n be diagonal with nonnegative diagonal entries and consider the problem:

min_{x∈R^n} ‖x − y‖₂² + λ‖D x‖₂².

This problem seeks to best approximate y ∈ R^n with a nonuniform penalty for large entries in x.
(a) Solve this problem using the formula for the solution of regularized regression.
(b) Show that the objective function is separable into a sum of decoupled terms. Show that this decomposes the
problem into n independent scalar problems.
(c) Find the solution of each scalar problem.
(d) By putting these scalar solutions together, find and interpret the solution to the original problem.
Exercise 9.16. You are given k points {zi }ki=1 in Rn and you want to find the points x ∈ Rn that minimize the sum
of squared distances to these fixed points.
(a) Solve this directly using calculus.
(b) Now pose it as a regression problem and solve it using your knowledge of least squares and the SVD.


Exercise 9.17. You want to learn an unknown function f : [0, 1] → R using a set of noisy measurements (x_j, y_j), with y_j = f(x_j) + ε_j, j ∈ [1 : m]. Your plan is to approximate f(·) by a Fourier series on [0, 1] with q ∈ N terms:

f_q(x) ≜ a_0/2 + ∑_{k=1}^q ( a_k cos(2πkx) + b_k sin(2πkx) ).

To control the smoothness of f_q(·), you also decide to penalize the size of the coefficients a_k, b_k more heavily as k increases.
(a) Formulate the above problem as a regularized regression problem.
(b) For q = 2, display the regression matrix, the target y, and the regularization term.
(c) Comment briefly on how to select q.

On-line Least Squares


Exercise 9.18. Let P ∈ R^{n×n} be symmetric PD and u ∈ R^n. Show that

lim_{λ→∞} (P + λ u u^T)^{−1} = P^{−1} − (1/(u^T P^{−1} u)) P^{−1} u u^T P^{−1}.
Exercise 9.19. (On-line least squares with mini-batch updates.) You want to solve a least squares regression problem by processing the data in small batches (mini-batches), yielding a new least squares solution after each batch update. For simplicity, assume each mini-batch contains k training examples. Group the examples in the t-th mini-batch into the columns of X_t ∈ R^{n×k}, and the corresponding targets into the rows of y_t ∈ R^k. Let P_{t−1} = ∑_{i=1}^{t−1} X_i X_i^T ∈ R^{n×n}. Assume P_{t−1}^{−1} exists and is known. Similarly, let s_{t−1} = ∑_{i=1}^{t−1} X_i y_i ∈ R^n. Derive the following equations for the t-th mini-batch update:

ŷ_t = X_t^T w_{t−1}^⋆    (target prediction)
w_t^⋆ = w_{t−1}^⋆ + P_{t−1}^{−1} X_t [I_k + X_t^T P_{t−1}^{−1} X_t]^{−1} (y_t − ŷ_t)    (update w^⋆)
P_t^{−1} = P_{t−1}^{−1} − P_{t−1}^{−1} X_t [I_k + X_t^T P_{t−1}^{−1} X_t]^{−1} X_t^T P_{t−1}^{−1}    (update P)
How do these equations change if the mini-batches are not all the same size?
Exercise 9.20. (On-line ridge regression)
(a) Determine the detailed equations for on-line ridge regression.
(b) Use these equations to explain how RLS and on-line ridge regression differ.
(c) As m → ∞, will the two solutions differ?
(d) In light of your answer to (c), what is the role of λ in ridge regression?

Chapter 10

Sparse Least Squares

We now consider linear least squares problems with the additional objective of finding a sparse solution. By
this we mean that many of the entries of the solution w? are zero. Let kwk0 denote the number of nonzero
components of w ∈ Rn . Then sparse linear regression can be posed as

w^⋆ = arg min_{w∈R^n} ‖y − F w‖₂²   s.t. ‖w‖₀ ≤ k.    (10.1)
Sparse vectors and matrices arise naturally in a variety of practical problems, often in variations of the
simple problem posed above. We give some examples below.

Underdetermined Problems. When F has more columns than rows, problem (10.1) seeks a sparse solution
to an underdetermined least squares problem.

Subset Selection. Given training data, we can learn a linear predictor by solving a least squares problem
such as ridge regression:
ŵ = arg min_{w∈R^n} ‖y − F w‖₂² + λ‖w‖₂².    (10.2)
Generally, the entries of the solution ŵ are all non-zero, suggesting that all of the features are important for
predicting y. However, in many practical applications this is unlikely to be true. Not all of the available
features will be relevant to the desired prediction. This suggests seeking a (small) subset of the features that
are most relevant for forming a linear prediction of y. This is called a subset selection problem. Solving
(10.1) to find a sparse solution w? incorporates subset selection directly into the regression problem. The
indices with nonzero entries in w? indicate the selected subset of features.

Sparse Representation Classification. Suppose we are provided with a database of labelled face images {(f_j, z_j)}_{j=1}^m. Here f_j ∈ R^n is the j-th vectorized face image and z_j is its label (the person's identity). Form the face examples into the matrix F = [f_1, . . . , f_m]. In this case, the examples are the columns of F and
we call F the dictionary. Given a new (unlabelled) face image y, we want to predict its label. We suspect
that the subset of images in the database corresponding to the identity of y will be most important. So we
set out to find a sparse representation of y in terms of the columns of F by solving (10.1). The result is an
approximate representation of y as a linear combination of relatively few columns of F . This is called a
sparse representation of y using the dictionary F . The solution w? selects a subset of the columns of F and
gives each selected column a nonzero weight. If we extract the subset of selected columns with label z and
the corresponding weights, then we can form a class z predictor of y as a linear combination of the columns


of F with label z. The class predictor that yields the least error in representing y, provides the estimated
label z of y. This is called sparse representation classification.

Sparse Data Corruption. In security applications one needs to classify a new face image with respect to a
set of previously captured face images. Unlike the images in the dictionary, in the new image the subject is
wearing glasses, or sun glasses, or a scarf. So the new image will most likely differ significantly from the
best match among the previous images by a sparse set of pixels. Hence we could model the new image y as
y ≈ ŷ + ys , where ŷ is a sparse representation of y using images in the database and ys is a sparse image
with most pixels having value 0. One might then pose the problem of finding the best match to y from the
dictionary F as
min_{w∈Rm , ys∈Rn} ky − F w − ys k22   s.t. kwk0 + kys k0 ≤ k.


Compressive Sensing. Consider a signal x ∈ Rn with a sparse representation with respect to a known
basis B. So x = Bw, where kwk0 ≤ k ≪ n. Suppose we use a sensing matrix S ∈ Rm×n to obtain
m ≪ n measurements y = Sx, where the rows of S are selected randomly. We say that y is formed by
compressive sensing of x using the sensing matrix S. Given y ∈ Rm we want to reconstruct x ∈ Rn . A natu-
ral way to proceed is to first solve the sparse least squares problem (10.1) with F = SB. Then set x̂ = Bw? .
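A minimal numpy sketch of this measurement model is given below; the sizes, the random orthonormal basis, and the Gaussian sensing matrix are all illustrative assumptions, and the recovery step (solving (10.1) with F = SB) is left to the methods discussed later in the chapter.

import numpy as np

rng = np.random.default_rng(0)
n, m, k = 256, 64, 5                               # signal length, measurements, sparsity

B = np.linalg.qr(rng.standard_normal((n, n)))[0]   # a known orthonormal basis (assumed)
w = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
w[support] = rng.standard_normal(k)                # k-sparse coefficient vector
x = B @ w                                          # the signal, sparse in basis B

S = rng.standard_normal((m, n)) / np.sqrt(m)       # random sensing matrix, m << n
y = S @ x                                          # compressive measurements

F = S @ B   # recovering w from y = F w with kwk0 <= k is an instance of problem (10.1)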

Sparse Approximation. The simplest version of problem (10.1) is called sparse approximation. Given
y ∈ Rn we want to find an approximation x to y such that x has at most k nonzero entries. This can be
posed as:
min_{x∈Rn} ky − xk   s.t. kxk0 ≤ k.        (10.3)
Since x is close to y, but is specified with fewer nonzero coordinates, we can regard x as a compressed form
of y. We suspect that such an approximation will be useful if y has relatively few large entries, and many
small entries. In other words, y has some special structure that makes it “compressible”. This is often the
case for natural forms of data such as speech and images. For example, after the application of a wavelet
transform, most natural images are highly compressible.

10.1 Preliminaries
The support of x ∈ Rn is the set of indices S(x) = {i : x(i) 6= 0}. Let |S(x)| denote the size of S(x). Then
the number of nonzero entries in the vector x is |S(x)|. For an integer k ≥ 0, we say that x is k-sparse if
|S(x)| ≤ k. More generally, we say that a vector x ∈ Rn is sparse if |S(x)| ≪ n, i.e., relatively few of its
entries are nonzero.
An alternative notation is to let |α|0 denote the indicator function of the set {α ∈ R : α 6= 0}:

|α|0 = { 0, if α = 0;  1, otherwise. }        (10.4)
The number of nonzero entries in a vector x ∈ Rn can then be expressed as

kxk0 = ∑_{j=1}^n |x(j)|0 .        (10.5)


Figure 10.1: An illustration of the 1-sublevel sets of the function fp (x) = (∑_j |x(j)|^p)^{1/p} and the connection to the 1-sublevel set of kxk0 . Left panel (p = 1, 3/2, 2, 4): for p ≥ 1, fp (x) is a norm and the 1-sublevel sets must be convex. Right panel (p = 1/4, 1/2, 3/4, 1): for p < 1, the sublevel sets of fp (x) are no longer convex. As p ↓ 0, these 1-sublevel sets “look like” the 1-sublevel set of kxk0 intersected with the 1-norm unit ball.

The function k · k0 is not a norm. You can verify this by checking each of the required norm properties: (a) kxk0 is non-negative and zero iff x = 0 (positivity); (b) k · k0 satisfies kx + yk0 ≤ kxk0 + kyk0 (triangle inequality); (c) but scaling fails: for x 6= 0 and any α with |α| not equal to 0 or 1, kαxk0 = kxk0 6= |α| kxk0 . So k · k0 doesn't satisfy the scaling property of a norm.
The function k · k0 is also not convex. To see this, let e1 , e2 denote the first two standard basis vectors
in Rn , and consider xα = (1 − α)e1 + αe2 for α ∈ (0, 1). The points e1 , e2 are clearly 1-sparse, but xα is
not 1-sparse. Hence the 1-sublevel set of k · k0 is not convex. Thus k · k0 is not a convex function.
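A quick numerical illustration of these two observations (scaling invariance and the failure of convexity of the 1-sublevel set), using numpy; it adds nothing beyond the definitions above.

import numpy as np

def l0(x):
    # number of nonzero entries, i.e. kxk0
    return np.count_nonzero(x)

e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])

# scaling: k(alpha x)k0 = kxk0 for any alpha != 0, so k.k0 cannot be a norm
assert l0(5.0 * e1) == l0(e1) != abs(5.0) * l0(e1)

# convexity fails: the midpoint of two 1-sparse points is 2-sparse
x_mid = 0.5 * e1 + 0.5 * e2
print(l0(e1), l0(e2), l0(x_mid))   # prints 1 1 2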
The left panel of Figure 10.1 illustrates the unit balls of some `p norms on R2 . The right panel of the figure illustrates that for 0 < p < 1 the 1-sublevel sets of the function fp (x) = (∑_j |x(j)|^p)^{1/p} are no
longer convex. Indeed as p ↓ 0, these sets begin to look like the 1-sublevel set of k · k0 restricted to the unit
ball of the 1-norm (or any other p-norm).

10.2 Sparse Least Squares Problems


Henceforth, we denote the matrix in a standard sparse least squares problem by A. The columns of A might
be features, as in a regression problem; or the data examples, as in a dictionary representation problem. A
general sparse least squares problem can be posed in the following three forms:

min_{x∈Rn} ky − Axk22   s.t. kxk0 ≤ k,  k ∈ N, k > 0        (10.6)

min_{x∈Rn} kxk0   s.t. ky − Axk22 ≤ ε,  ε > 0        (10.7)

min_{w∈Rn} ky − Awk22 + λkwk0 ,   λ > 0.        (10.8)

The three formulations share the difficulty that k · k0 is not a convex function. For the moment we will focus
on the formulation (10.8).


10.2.1 Some Simplifications


Let A = [a1 , . . . , am ]. We will refer to the columns ai of A as atoms. We first show that without loss of
generality, we can always make certain simplifications to problem (10.8). By this we mean that there is a
modified problem satisfying the stated properties that is equivalent to the original problem.
Lemma 10.2.1. Without loss of generality, we can always assume that in (10.8): (a) all atoms of A are
nonzero, (b) y ∈ R(A), and (c) y and the atoms of A have unit norm.

Proof. (a) It is clear that any zero atom of A can be removed. This changes the size of A and w, but
otherwise leaves the problem intact.
(b) Let y = ŷ + ỹ with ŷ ∈ R(A) and ỹ ∈ R(A)⊥ . Then
ky − Awk22 + λkwk0 = kỹ + ŷ − Awk22 + λkwk0
= kỹk22 + kŷ − Awk22 + λkwk0
≡ kŷ − Awk22 + λkwk0 .
So solving the problem using ŷ yields a solution of the original problem. Conversely, if w? is a solution of
the original problem, then it is also a solution for the problem with ŷ. Hence without loss of generality we
can assume y ∈ R(A).
(c) If the atoms do not have unit norm, let à = AD where the atoms of à have unit norm and D is diagonal
with positive diagonal entries. If z = Dw, then kwk0 = kD−1 zk0 = kzk0 and
ky − Awk22 + λkwk0 = ky − ÃDwk22 + λkwk0
= ky − Ãzk22 + λkD−1 zk0
= ky − Ãzk22 + λkzk0 .

So if we solve the sparse regression problem using à (unit norm atoms) to obtain z ? , then w? = D−1 z ?
solves the original problem. Conversely, if w? solves the original problem, then Dw? is a solution for the
problem using Ã. Hence without loss of generality we can assume that A has unit norm atoms.
Now multiply the objective function by c2 > 0 and set u = cy, z = cw and λ̃ = c2 λ. Noting that
kcwk0 = kwk0 this gives
c2 ky − Awk22 + λkwk0 = kcy − Acwk22 + λc2 kcwk0


= ku − Azk22 + λc2 kzk0


= ku − Azk22 + λ̃kzk0 .

If z ? solves the modified problem using u = cy and λ̃ = c2 λ, then w? = z ? /c solves the original problem
using y and λ. Conversely, if w? is a solution of the original problem, then z ? = cw? is a solution for the
modified problem using u and λ̃. Now note that choosing c = 1/kyk2 ensures u has unit norm. Hence
without loss of generality we can assume y has unit norm.

10.2.2 Special Cases


Problem (10.8) is easily solved for a matrix A ∈ Rn×k , when k ≤ n and A has a SVD factorization of
the form A = U ΣP T , where U ∈ Vn,k , Σ ∈ Rk×k is diagonal with a positive diagonal, and P ∈ Rk×k
is a generalized permutation matrix. A generalized permutation is a square matrix of the form DP where
D = diag[±1] and P is a permutation.
Special instances of the above include:


(a) Sparse approximation:
    min_{x∈Rn} ky − xk22 + λkxk0 .
(b) Sparse weighted approximation: for D = diag(d) with d ∈ Rn and d > 0,
    min_{x∈Rn} ky − Dxk22 + λkxk0 .
(c) Sparse representation in an orthonormal basis: for Q ∈ Vn,k ,
    min_{x∈Rk} ky − Qxk22 + λkxk0 .

We show in the next section that problem (a) is easily solved. The other special cases are all reducible to
problem (a), and hence are also easily solved. In contrast, the general sparse least squares problem is difficult
to solve. For example, problem (10.7) is known to be NP-hard [33]. In light of the difficulty of finding an
efficient general solution method, a number of greedy algorithms have been proposed for efficiently finding
an approximate solution. Examples of such methods are examined in §10.4.

10.3 Sparse Approximation


Given y ∈ Rn and k < n, consider the sparse approximation problem of finding a k-sparse vector x ∈ Rn
that is “closest” to y. This can be formulated as

min_{x∈Rn} f (y − x)   s.t. kxk0 ≤ k,        (10.9)

where f : Rn → R is a cost function. Typically, f (z) = g(kzk), where k · k is a norm on Rn and g is a


strictly monotone increasing convex function g : R+ → R+ with g(0) = 0.
Example 10.3.1. Examples of possible cost functions in (10.9) include:
(a) f (z) = kzk1 = ∑_{j=1}^n |z(j)|.
(b) f (z) = (kzk2 )2 = ∑_{j=1}^n |z(j)|2 .
(c) f (z) = (kzkp )p = ∑_{j=1}^n |z(j)|p .
(d) f (z) = kzk∞ = maxj |z(j)|.
(e) f (x) = xT P x, where P ∈ Rn×n is symmetric positive definite.


Alternatively, we can bring in a parameter λ > 0 and add the penalty λkxk0 to the objective function in
(10.9) to form the unconstrained problem:

min_{x∈Rn} f (y − x) + λkxk0 .        (10.10)

Here f (y − x) and λkxk0 are competing objectives. Selecting a small value for λ encourages less sparsity
and a better match between y and x. Increasing λ encourages greater sparsity, but potentially a worse match
between x and y. A solution x? of (10.10) will be generally sparse, but we can’t guarantee it will be k-sparse.
Nevertheless, it will be convenient to first solve (10.10).


10.3.1 Separable Objective Function


Problem (10.10) is easy to solve when f (z) is a separable function. By this we mean that

f (z) = ∑_{j=1}^n hj (|z(j)|).

We assume that the functions hj : R+ → R+ are strictly monotone increasing functions with hj (0) = 0. In
this case, we want to minimize

f (y − x) + λkxk0 = ∑_{j=1}^n ( hj (|y(j) − x(j)|) + λ|x(j)|0 ) .        (10.11)

This is the sum of n decoupled scalar subproblems. Subproblem j has the form
min hj (|y(j) − α|) + λ|α|0 .
α

This subproblem can be solved by considering three cases: (1) If y(j) = 0, we set α = 0; (2) If y(j) 6= 0,
there are two options to consider: either we set α = y(j) and incur a cost λ, or we set α = 0 and incur a
cost hj (|y(j)|). This yields the following result.

Theorem 10.3.1. Assume f (z) = ∑_{j=1}^n hj (|z(j)|), where the functions hj : R+ → R+ are strictly monotone increasing functions with hj (0) = 0. Then the solution of problem (10.10) is

x? (j) = { y(j), if hj (|y(j)|) ≥ λ;  0, otherwise. }        (10.12)

Proof. This is proved above.

So for suitable separable functions f , the solution to the sparse approximation problem is obtained by hard thresholding y(j) based on a comparison of hj (|y(j)|) and λ. Because hj (·) is strictly monotone increasing, it has an inverse tj (λ) = hj^{-1}(λ). Using tj (λ) we can equivalently write the thresholding condition as |y(j)| ≥ tj (λ).
Bring in the generic hard thresholding operator defined for z ∈ R by

Ht (z) = { z, if |z| ≥ t;  0, otherwise. }        (10.13)

Then in terms of Ht (z) and tj (λ) = hj^{-1}(λ) we can write

x? = [ H_{tj(λ)} (y(j)) ]_{j=1}^n ,   i.e.,   x? (j) = y(j) if |y(j)| ≥ tj (λ), and 0 otherwise.

Example 10.3.2. Consider the problem: min_{x∈Rn} ky − xk22 + λkxk0 . For this problem, f (x) = ∑_{j=1}^n (y(j) − x(j))2 = ∑_{j=1}^n h(|y(j) − x(j)|) with h(α) = α2 . Hence the appropriate hard threshold is t = √λ. This is applied to each entry of y to obtain:

x? = H_{√λ} (y),   i.e.,   x? (j) = { y(j), if |y(j)| ≥ √λ;  0, otherwise. }
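A minimal numpy sketch of this hard-thresholding solution for the case h(α) = α2 ; the helper name is ours.

import numpy as np

def hard_threshold(y, lam):
    """Solve min_x ||y - x||_2^2 + lam * ||x||_0 by hard thresholding.

    Entry j is kept when h(|y(j)|) = y(j)^2 >= lam, i.e. |y(j)| >= sqrt(lam).
    """
    t = np.sqrt(lam)
    return np.where(np.abs(y) >= t, y, 0.0)

y = np.array([3.0, -0.2, 0.05, 1.5, -0.6])
print(hard_threshold(y, lam=1.0))   # keeps entries with |y(j)| >= 1: [3. 0. 0. 1.5 0.]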


A key observation from (10.12) is that x? is a function of λ, and as λ decreases smoothly, kx? (λ)k0
increases monotonically in a staircase fashion. Depending on the value of λ, we may obtain kx? k0 > k,
or kx? k0 < k. As a thought experiment, imagine starting with a large value of λ and computing x? (λ) as
λ smoothly decreases. For simplicity, suppose that at least k of the values hj (|y(j)|) are nonzero and that
the nonzero values are distinct. For each value of λ, let S(λ) denote the support set of x? (λ), i.e., the set of
indices j for which x? (j) 6= 0. As we decrease λ the following things happen:
(0) Start with λ = λ0 > maxj hj (|y(j)|), then x? (λ0 ) = 0 and S(λ0 ) = ∅.
(1) When λ decreases to λ = λ1 = maxj {hj (|y(j)|)} with j1 = arg maxj {hj (|y(j)|)}, there is a jump
change. At this point S(λ1 ) = {j1 }, x? (λ1 )(j1 ) = y(j1 ), and x? (λ1 ) is the unique optimal 1-sparse
approximation to y.
(2) Continuing to decrease λ we reach a value λ2 equal to the second largest value of {hj (|y(j)|)}.
Suppose λ2 = hj2 (|y(j2 )|). At this point, j2 is added to S so that S(λ2 ) = {j1 , j2 }, and x? is
modified so that x? (j2 ) = y(j2 ). Then x? (λ2 ) is the unique optimal 2-sparse approximation to y.
(3) Continuing in this fashion, we see that the optimal k-sparse approximation x? to y is obtained by
letting S be the set of indices of the k largest values of hj (|y(j)|), and setting
x? (j) = { y(j), if j ∈ S;  0, otherwise. }        (10.14)

The assumption that all of the nonzero values hj (|y(j)|) are distinct was made to simplify the explana-
tion. More generally, one can prove the following result.
Theorem 10.3.2. Let S be the indices of any set of k largest values of hj (|y(j)|). Then x? defined by
(10.14) is a solution to problem (10.9). This solution is unique if and only if the k-th largest value of
hj (|y(j)|) is strictly larger than the (k + 1)-st value.
Proof. Exercise.
For a fixed value of λ, solving problem (10.10) under the assumption of separability is very efficient.
One just needs to threshold the values of y(j) based on the corresponding values of hj (|y(j)|) and λ. The
downside is that solving problem (10.10) doesn’t give precise control of the resulting value of kx? (λ)k0 .
But we now see how to solve problem (10.9), and this gives precise control over kx? k0 . We simply need to
find the indices j1 , . . . , jk of any set of k largest values of hi (|y(i)|). This can be done using the following
algorithm:
(1) Scan the entries y(j) in order for j ∈ [1 : n].
(2) Maintain a sorted list of at most k pairs (j, hj (|y(j)|)) of the k largest values of hj (|y(j)|) seen so far.
(3) When entry j of y(j) is scanned, compute hj (|y(j)|). If the number of table entries is less than k, add
(j, hj (|y(j)|)). Otherwise, if hj (|y(j)|) is larger than the smallest corresponding value in the table,
add (j, hj (|y(j)|)) to the sorted table and remove the entry (i, hi (|y(i)|)) with the smallest value
hi (|y(i)|). Otherwise, read the next value.
The overall complexity of this algorithm is O(n) for computing hj (|y(j)|) and making a comparison, and
an additional O(k log k) overhead for keeping an ordered list of the k largest values. If a predetermined
value of k is required, then the second solution method is probably more efficient. But if either k or λ
is to be determined by cross-validation (checking performance on held-out testing data), then the solution
method based on thresholding using λ may have an efficiency advantage.
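A short Python sketch of this scan-and-keep-k procedure using the standard heapq module; it returns the optimal k-sparse approximation (10.14) for the separable case hj (α) = α2 , and the names are ours.

import heapq
import numpy as np

def best_k_sparse(y, k):
    """Optimal k-sparse approximation of y when h_j(a) = a^2 (keep k largest |y(j)|)."""
    # heapq.nlargest maintains a bounded heap internally: O(n log k) overall
    top = heapq.nlargest(k, range(len(y)), key=lambda j: abs(y[j]))
    x = np.zeros_like(y, dtype=float)
    x[top] = y[top]
    return x

y = np.array([0.1, -2.0, 0.3, 1.2, -0.4])
print(best_k_sparse(y, k=2))   # [ 0.  -2.   0.   1.2  0. ]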


10.3.2 A Non-Separable Objective Function


As an example of a non-separable sparse approximation problem, consider:

min_{x∈Rn} ky − xk∞   s.t. kxk0 ≤ k.        (10.15)

Note that the max norm kxk∞ = maxi |x(i)|, is a non-separable function. For simplicity, initially assume
that the entries of y are nonzero with distinct absolute values, and that these values are arranged in y from
largest to smallest by absolute value. Hence kyk∞ = |y(1)| > |y(2)| > · · · > |y(n)| > 0.
We can use up to k nonzero elements in x to minimize ky − xk∞ . The optimal allocation is to use
x(1), . . . x(k) to make |y(j)−x(j)| no larger than |y(k+1)|, j ∈ [1 : k]. This yields ky −x? k∞ = |y(k+1)|.
This is the smallest achievable value of ky − xk∞ using a k-sparse x. Any x? with

|x? (j) − y(j)| ≤ |y(k + 1)|, j ∈ [1 : k],

is an optimal solution. So in general, the solution to problem (10.15) is not unique.


One should now see how to obtain the general solution without the simplifying assumptions. Sequen-
tially scan the elements of y and determine an ordered list of the k largest values of |y(j)|. We also need to
record the indices jp of these elements (and the corresponding values y(jp ), p ∈ [1 : k]). The entries of the
list determine which components of x? will be nonzero and the required values of these components. This
solution (and the algorithm for obtaining it) is remarkably similar to that discussed in the previous section
for problem (10.9) under a separability assumption. This suggests that separability is not the critical attribute
defining this solution. This is explored further in §10.3.3 and in the exercises.

10.3.3 Sparse Approximation under a Symmetric Norm


A norm on Rn is said to be symmetric if it has the following two properties:
(a) for any permutation matrix P and all x ∈ Rn , kP xk = kxk; and
(b) for any diagonal matrix D with diagonal entries in {±1}, and all x ∈ Rn , kDxk = kxk.
A norm that satisfies the first property is called a permutation invariant norm, and one that satisfies the
second property is called an absolute norm (since kxk = k|x|k). A square matrix of the form DP where
D = diag[±1] and P is a permutation is called a generalized permutation. Let GP(n) denote the set of
n × n generalized permutation matrices. Then a norm is symmetric if and only if it is invariant under all
generalized permutations.

Lemma 10.3.1. GP(n) is closed under matrix multiplication, contains the identity, and every P ∈
GP(n) has an inverse P −1 ∈ GP(n).

Proof. Let Dn denote the family of n × n diagonal matrices with diagonal entries in {±1}. Then I ∈ Dn
and if D ∈ Dn , then D2 = I. So D−1 = D ∈ Dn . Finally, if D1 , D2 ∈ Dn , then D = D1 D2 ∈ Dn .
Let Q be a permutation matrix and P = DQ ∈ GP(n). Then P = QD0 for some D0 ∈ Dn and hence
P −1 = D0 QT ∈ GP(n). Now let Pi = Di Qi ∈ GP(n), with Qi permutation matrices, i = 1, 2. Then
P1 P2 = D1 Q1 D2 Q2 = DQ with Q1 D2 = D20 Q1 , D = D1 D20 and Q = Q1 Q2 . So P1 P2 ∈ GP(n).

Example 10.3.3. Some examples of symmetric norms are given below. In the accompanying verifications,
Q ∈ GP(n).
(a) Every p-norm: kQxkp = (∑_{j=1}^n |(Qx)(j)|p )^{1/p} = (∑_{j=1}^n |x(j)|p )^{1/p} = kxkp .

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 129

(b) The max norm: kQxk∞ = maxj {|(Qx)(j)|}nj=1 = maxj {|x(j)|}nj=1 = kxk∞ .

(c) The c-norm defined by kxkc = maxP ∈GP(n) {xT P c}, where c ∈ Rn is nonzero:

kQxkc = maxP ∈GP(n) {(Qx)T P c} = maxP ∈GP(n) {xT P c} = kxkc .

Example 10.3.4. Let D = diag(2, 1),
and consider the norm kxkD = (xT Dx)1/2 . We have eT1 De1 = 2, eT2 De2 = 1, with e1 and e2 related by a
permutation e2 = P e1 . Hence kxkD is not a symmetric norm.
It will be convenient to use the following notation. If the entries of a vector x ∈ Rn are positive (resp.
nonnegative) we write x > 0 (resp. x ≥ 0). For x, y ∈ Rn , x ≤ y means y − x ≥ 0. The following
property of symmetric norms will be useful.

Lemma 10.3.2. For any symmetric norm k · k on Rn , if 0 ≤ x ≤ y, then kxk ≤ kyk.

Proof. This proof uses some advanced aspects of norms and convex sets. If x = 0, then kxk = 0 ≤ kyk.
Hence assume x 6= 0, and set Lkxk = {u : kuk ≤ kxk}. This is the kxk-sublevel set of k · k with kxk > 0.
Clearly x ∈ Lkxk . A norm is a convex function, and the sublevel sets of a convex function are closed and
convex. Hence Lkxk is a closed convex set.
Consider the two convex sets Lkxk and C = {x}. Lkxk contains interior points, C is nonempty and con-
tains no interior points of Lkxk . Hence by the separation theorem for convex sets, there exists a hyperplane
wT z = c, with w 6= 0, such that for all z ∈ Lkxk , wT z ≤ c, and for all z ∈ C, wT z ≥ c. Since x is in both
sets, wT x = c.
Under the assumption that x > 0, we show that w ≥ 0. Suppose to the contrary that w(i) < 0. Form x̂
from x by setting x̂(i) = −x(i). By the symmetry of the norm we have kx̂k = kxk and x̂ ∈ Lkxk . Hence
we must have wT x̂ ≤ c. On the other hand, w(i) < 0 and x̂i = −x(i) < 0 imply that

wT x̂ = ∑_{j=1}^n w(j)x̂(j) = ∑_{j6=i} w(j)x(j) + (−x(i)w(i)) > wT x = c.

This is a contradiction. Thus w ≥ 0.


Continuing with the assumption x > 0, we now show that kxk ≤ kyk. Since x ≤ y and x > 0 we
have y > 0 and z = y − x ≥ 0. If wT y > c, then kyk > kxk and we are done. On the other hand, if
wT y = wT (x + z) ≤ c, then using wT x = c we conclude that wT z ≤ 0. But w, z ≥ 0. So we must have
wT z = 0. Thus wT y = wT (x + z) = wT x = c. This still leaves three possibilities:
(1) y ∈/ Lkxk . In this case kxk < kyk and we are done.
(2) y is a boundary point of Lkxk . In this case kxk = kyk and we are done.
(3) y is an interior point of Lkxk .
If y is an interior point of Lkxk , then there exists a ball B(y, ) = {a : ka − yk ≤ } centered at y with
radius  > 0 such that B(y, ) ⊂ Lkxk . Hence for some small δ > 0, (1 + δ)y ∈ Lkxk . But this means that
wT (1 + δ)y = (1 + δ)wT y = (1 + δ)c > c. A contradiction. Thus kxk ≤ kyk.
The final step is to consider x ≥ 0 with x ≤ y. Form
ỹ(j) = { y(j), if y(j) > 0;  1, if y(j) = 0 }     and     x̃(j) = { x(j), if x(j) > 0;  ỹ(j), if x(j) = 0 }.


Then 0 < x̃ ≤ ỹ. Let xα = (1 − α)x + αx̃ and yα = (1 − α)y + αỹ for α ∈ [0, 1]. For α > 0,

0 < xα = (1 − α)x + αx̃ ≤ (1 − α)y + αỹ = yα .

Hence by the result proved above, kxα k ≤ kyα k. Now take the limit as α → 0 and use the continuity of the
norm to conclude that kxk ≤ kyk.

Under a symmetric norm, the sparse approximation problem (10.3) has the following simple solution.

Theorem 10.3.3. Let k · k be a symmetric norm, y ∈ Rn , and S be the indices of k largest values of
|y(j)|, j ∈ [1 : n]. Then (10.14) gives a solution to problem (10.3).

Proof. For any P ∈ GP (n), we have ky − xk = kP y − P xk and kxk0 = kP xk0 . So we can select a
P to ensure (P y)(1) ≥ (P y)(2) ≥ · · · ≥ (P y)(n) ≥ 0. Hence from this point forward we assume that
y(1) ≥ y(2) ≥ · · · ≥ y(n) ≥ 0. To simplify the proof, we will also assume the largest k values in y are
distinct, but this is not required.
Let z = y − x. The possibilities for the best k-sparse solution fall into two forms: either z(j) = 0, for
all j ∈ [1 : k], or there exist integers p, q with 1 ≤ p ≤ k and k + 1 ≤ q ≤ n, such that z(p) 6= 0, and
z(q) = 0 with y(q) < y(p). In the first case, let z1 = y − x, and in the second, let z2 = y − x. We can permute the entries of z2 to form z2′ by swapping the zero values outside the range 1, . . . , k with the locations with nonzero values in the range 1, . . . , k. This is visualized below.

z1 :  0 . . . 0 0 y(k + 1) y(k + 2) · · · y(n)
z2 :  0 . . . y(k − 1) y(k) 0 y(k + 2) · · · 0
z2′ : 0 . . . 0 0 y(k − 1) y(k + 2) · · · y(k)

So kz2 k = kz2′ k and 0 ≤ z1 ≤ z2′ . Hence by Lemma 10.3.2, kz1 k ≤ kz2′ k = kz2 k. Thus x? (j) = y(j),
j ∈ [1 : k], achieves an objective value at least as good as any other x.

The solution under a symmetric norm need not be unique, even when |y(k)| > |y(k + 1)| in an ordered
list of these values. For example, it is not unique under the max norm.

10.4 Greedy Algorithms for Sparse Least Squares Problems


We now consider the general sparse regression problem in either of the forms (10.6) or (10.7) using a general
dictionary A ∈ Rm×n with atoms a1 , a2 , . . . , am . No computationally efficient solution algorithm is known
for this general class of problems (and it seems unlikely that one exists). This has led to the development
of a variety of efficient suboptimal solution methods. These methods typically employ an iterative “greedy”
approach to finding a solution.
Greedy solution methods operate by iteratively alternating between two actions: (1) adding (or remov-
ing) an atom (or atoms) to the support set of the estimated solution w? , and (2) updating the weights assigned
to atoms indexed by the support set. For notational simplicity assume that y and the atoms of A have unit
norm. Let t ∈ N denote the iteration number, and at the completion of step t, let: Ŝt denote the set of
indices of the selected atoms, ŷt denote the sparse approximation to y, and rt = y − ŷt denote the associated
residual.


10.4.1 Matching Pursuit


The Matching Pursuit (MP) algorithm iteratively either adds one new atom to the estimated support set and
assigns a new weight to this atom, or updates the weight of an atom already in the estimated support set.

(0) Initialize t = 0, St = ∅, ŷt = 0, rt = y.

(1) (a) Update the iteration count t = t + 1


(b) Select an atom ait most correlated with the current residual: it ∈ arg maxi |aTi rt−1 |.
(c) Update the indices of the selected atoms St = St−1 ∪ {it }
(d) The projection of rt−1 onto ait is r̂t−1 = (aTit rt−1 )ait
(e) Set w(it ) = w(it ) + aTit rt−1
(f) Update ŷt = ŷt−1 + w(it )ait = ŷt−1 + r̂t−1
(g) Find the new residual rt = y − ŷt = rt−1 − r̂t−1

(2) Check if a termination condition is satisfied (see below). If not, go to step (1).

The construction terminates after a desired number of distinct atoms have been selected (problem (10.6)) or
the size of the residual falls below some threshold (problem (10.7)). On termination, the algorithm results
in the set of indices of the selected atoms i1 , . . . , ik and weights w(i1 ), . . . , w(ik ). These give the sparse approximation ŷ = ∑_{j=1}^k w(ij ) aij to y with residual r = y − ŷ.
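A compact numpy sketch of the MP iteration above, assuming (as in the text) that the atoms of A have unit norm and terminating after a fixed number of iterations; the function name and the termination rule are illustrative choices.

import numpy as np

def matching_pursuit(A, y, n_iters):
    """Matching Pursuit: returns the weight vector w and the final residual."""
    n = A.shape[1]
    w = np.zeros(n)
    r = y.copy()
    for _ in range(n_iters):
        corr = A.T @ r                      # correlations of atoms with the residual
        i = int(np.argmax(np.abs(corr)))    # most correlated atom (step (1)(b))
        w[i] += corr[i]                     # accumulate its weight (step (1)(e))
        r = r - corr[i] * A[:, i]           # subtract the projection (step (1)(g))
    return w, r

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))
A /= np.linalg.norm(A, axis=0)              # unit-norm atoms
y = 2.0 * A[:, 3] - 0.5 * A[:, 17]
w, r = matching_pursuit(A, y, n_iters=10)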

10.4.2 Orthogonal Matching Pursuit


Orthogonal Matching Pursuit (OMP) is a similar algorithm to MP except that during iteration t the weights
are updated jointly by orthogonal projection of y onto the span of the atoms selected up to and including
iteration t. OMP (and its extensions) can quickly produce a solution with a specified sparsity or accuracy,
and has been found useful in a variety of applications.
Let At denote the matrix consisting of the columns of A with indices in St .

(0) Initialize: t = 0, S0 = ∅, A0 = [ ], r0 = y.

(1) (a) Update the iteration count t = t + 1


(b) Select an atom ait most correlated with the residual rt−1 : it ∈ arg maxi |aTi rt−1 |
(c) Update the selected atoms St = St−1 ∪ {it }, At = [At−1 , ait ]
(d) Solve the least squares problem wt = arg min_w ky − At wk22
(e) Update the residual rt = y − At wt .

(2) Check if a termination condition is satisfied (see below). If not, go to step (1).

Step (1)(d) requires solving a least squares problem with one additional column than the previous iteration.
Note that after step (1) is completed, rt ⊥ span(At ). So once an atom has been selected it can’t be selected
a second time. The construction terminates after a desired number of atoms have been selected (for (10.6)),
the size of the residual falls below some threshold (for (10.7)), or no atom can be found that has a nonzero
correlation with the residual. On termination, the algorithm results in a set of selected atoms ai1 , . . . , aik
and weights w(i1 ), . . . , w(ik ). These give ŷ = ∑_{j=1}^k w(ij ) aij as a sparse approximation to y with residual r = y − ŷ.
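A corresponding numpy sketch of OMP; the joint weight update in step (1)(d) is done with numpy.linalg.lstsq, termination here is simply a fixed number of selected atoms, and the names are ours (unit-norm atoms assumed).

import numpy as np

def omp(A, y, k):
    """Orthogonal Matching Pursuit: select k atoms and return (support, weights, residual)."""
    support = []
    r = y.copy()
    w = np.zeros(0)
    for _ in range(k):
        corr = A.T @ r
        i = int(np.argmax(np.abs(corr)))    # atom most correlated with the residual
        if np.abs(corr[i]) == 0:            # no atom correlates with the residual: stop
            break
        support.append(i)
        At = A[:, support]                  # all atoms selected so far
        w, *_ = np.linalg.lstsq(At, y, rcond=None)   # jointly refit the weights
        r = y - At @ w                      # residual is now orthogonal to span(At)
    return support, w, r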


10.5 Spark, Coherence, and the Restricted Isometry Property


For A ∈ Rm×n and y ∈ Rm , consider the set of linear equations Aw = y. Assume that A has rank m. Hence Aw = y has a solution but in general this need not be unique. This could be resolved by finding the unique least norm solution. However, we now have another option: find the sparsest solution. It is clear that we can always find a solution that minimizes kwk0 . The question is whether this solution is unique. In other words, does the following problem have a unique solution:

min_{w∈Rn} kwk0   s.t. Aw = y.        (10.16)

10.5.1 The Spark of a Matrix


Let A ∈ Rm×n have linearly dependent columns. So r = rank(A) < n. In this case, define the spark of A,
denoted by spark(A), to be the smallest number of columns of A that form a linearly dependent set. Hence

spark(A) = min{kxk0 : Ax = 0, x 6= 0}.

If spark(A) = 1, at least one column of A must be 0. We normally assume all columns of A are nonzero.
In this case, it takes at least two columns to form a linearly dependent set, and by the definition of rank, any
r + 1 columns of A will be linearly dependent. Hence for matrices with nonzero columns

2 ≤ spark(A) ≤ r + 1.

Note that if rank(A) = n, then every subset of columns of A is linearly independent. In this situation,
spark(A) is not defined.
Example 10.5.1. Here are some illustrative examples:

(a) Let A = [e1 , e2 , e3 , e4 , e1 ]. Then rank(A) = 4 and spark(A) = 2. This achieves the lower bound for spark.
(b) Let A = [e1 , e2 , e3 , e4 , ∑_{j=1}^4 ej ]. In this case, rank(A) = 4 and spark(A) = 5. This achieves the upper bound for spark.
(c) Let A = [e1 , e2 , e3 , ∑_{j=1}^3 ej , e5 , e6 ]. In this case, rank(A) = 5 and spark(A) = 4. This achieves neither the lower nor upper bound for spark.
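Computing spark(A) is combinatorial in general, but for tiny examples like these a brute-force search over column subsets is feasible. The numpy sketch below checks each subset for linear dependence via its rank; it is only meant to illustrate the definition.

import numpy as np
from itertools import combinations

def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (brute force; tiny A only)."""
    m, n = A.shape
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            As = A[:, cols]
            if np.linalg.matrix_rank(As, tol=tol) < k:   # these k columns are dependent
                return k
    return None   # all columns independent: spark is undefined in this case

e = np.eye(6)
A = np.column_stack([e[:, 0], e[:, 1], e[:, 2], e[:, 0] + e[:, 1] + e[:, 2], e[:, 4], e[:, 5]])
print(spark(A))   # 4, as in part (c) of the example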

The spark of A is of interest because it can be used to give a sufficient condition for (10.16) to have a
unique solution. Roughly, if a solution is sufficiently sparse, then it is the unique sparsest solution. To show
this we let Sk = {x : kxk0 ≤ k} denote the set of k-sparse vectors in Rn . We now state the following simple
results.
Lemma 10.5.1. k < spark(A) ⇔ N (A) ∩ Sk = 0.

Proof. (IF) Assume k < spark(A). Let x ∈ Sk ∩ N (A). Hence Ax = 0. If x 6= 0, then kxk0 of the
columns of A are linearly dependent. Hence k ≥ kxk0 ≥ spark(A); a contradiction. Thus x = 0. Since x
was an arbitrary element in Sk ∩ N (A) we conclude that Sk ∩ N (A) = 0.
(ONLY IF) Assume N (A) ∩ Sk = 0. Then for any nonzero x with kxk0 ≤ k, Ax 6= 0. Hence spark(A) >
k.


Lemma 10.5.2. 2k < spark(A) ⇔ A is injective on Sk .

Proof. (IF) Assume 2k < spark(A). For any x, z ∈ Sk , x − z ∈ S2k . Now suppose Ax = Az. Then
A(x − z) = 0. Since x − z ∈ S2k and 2k < spark(A), Lemma 10.5.1 implies x − z = 0, i.e., x = z. So A
is injective on Sk .
(ONLY IF) Assume A is injective on Sk . Any nonzero w ∈ S2k can be written as w = x − z for nonzero
x, z ∈ Sk . If kwk0 ≤ k, set x = w/2 and z = −w/2, otherwise allot the nonzero entries of w between x
and −z so that both are nonzero and each is in Sk . Then x 6= z and Aw = A(x − z) = A(x) − A(z) 6= 0,
by the assumption. Since w was any nonzero element of S2k , it follows that spark(A) > 2k.

Theorem 10.5.1. Let Aw? = y. If kw? k0 < (1/2) spark(A), then w? is the unique sparsest solution of Aw = y.

Proof. Set k = kw? k0 and assume k < (1/2) spark(A). By Lemma 10.5.2, A is injective on Sk . Hence for every z ∈ Sk with z 6= w? , Az 6= y = Aw? . Hence w? is the unique solution in Sk . Thus it is the unique sparsest solution.
(ONLY IF) Assume w? is the unique sparsest solution of Aw = y. Let k = kw? k0 . Then for each z ∈ Sk with z 6= w? , Az 6= Aw? . So z − w? is not in N (A).

Theorem 10.5.1 indicates that there are situations in which (10.16) has a unique sparsest solution. The
condition kw? k0 < (1/2) spark(A) in the theorem is sufficient but not necessary for w? to be a unique sparsest
solution. Moreover, in general, computing spark(A) is computationally expensive.

Example 10.5.2. Let

A = [ 1  −1 ;  1  −1 ],   y = [ 1 ;  1 ],   w? = [ 1 ;  0 ],   z = [ 0 ;  −1 ].

Then spark(A) = 2, Aw? = y, and kw? k0 = 1 = (1/2) spark(A). So the condition of Theorem 10.5.1 fails to
hold. Moreover, z is also a 1-sparse solution. So there is not a unique sparsest solution.

Example 10.5.3. Let

A = [ 0  0 ;  0  1 ],   y = [ 0 ;  1 ],   w? = [ 0 ;  1 ].

Then spark(A) = 1, Aw? = y, and kw? k0 > (1/2) spark(A). So the condition of Theorem 10.5.1 fails to hold.
But w? is the unique sparsest solution.

Example 10.5.4. Let

A = [ 1  −1  3 ;  1  −1  2 ;  0  0  1 ],   y = [ 3 ;  2 ;  1 ],   w? = [ 0 ;  0 ;  1 ].

Then spark(A) = 2, Aw? = y, and kw? k0 = (1/2) spark(A). So the condition of Theorem 10.5.1 fails to hold.
But w? is the unique sparsest solution.


10.5.2 The Coherence of a Matrix


For A ∈ Rn×k , define the coherence of A by

µ(A) ≜ max_{i6=j} |aTi aj | / ( kai k2 kaj k2 ).        (10.17)

The coherence of A is relatively easy to compute and gives a measure of pairwise similarity among the
columns of A. In general, we would like µ(A) to be small. For example, for A ∈ Vn,k , the columns of A
are orthonormal, and µ(A) = 0.
Since the columns of A appear in (10.17) in normalized form, without loss of generality we can assume
that the columns of A have unit norm. Bring in the Gram matrix G = AT A. The diagonal entries of G are
1 and the off diagonal entires indicate the signed similarity of pairs of distinct columns of A. Moreover,

µ(A) = max_{i6=j} |Gij | = max_{i,j} |(G − In )ij | ≤ σ1 (AT A − In ) = kAT A − In k2 .

Here σ1 (M ) denotes the maximum singular value, and kM k2 the induced 2-norm of M . The center in-
equality holds since the magnitudes of the entries of M are bounded above by σ1 (M ) (Exercise 5.8).
If µ(A) = 0, then the columns of A are orthonormal and hence linearly independent. In this case the
spark of A is not defined. When spark(A) is defined, we must have µ(A) > 0. In this case, the coherence
can be used to lower bound spark(A).

Lemma 10.5.3. Let A be a matrix with linearly dependent columns. Then µ(A) > 0, and
1 + 1/µ(A) ≤ spark(A).        (10.18)

Proof. Without loss of generality, assume A has unit norm columns. Let ai = A:,i denote the i-th column
of A, and G = AT A. Then Gij = 1 if i = j and has magnitude at most µ(A) otherwise. Since A has linearly dependent columns there exists a nonzero x ∈ Rn with Ax = ∑_{i=1}^n xi ai = 0. Hence for some j, aj = −∑_{i6=j} (xi /xj ) ai . Taking the inner product of both sides with aj yields 1 = −∑_{i6=j} (xi /xj ) aTj ai . From this we conclude that max_{i6=j} |aTj ai | > 0. Thus µ(A) > 0.


Select k columns of A and consider the corresponding k × k Gram matrix G. If ∑_{i6=j} |Gij | < Gjj for
each j, then by Theorem G.1.1, G is positive definite and hence the k selected columns of A are linearly
independent. Since Gjj = 1, a sufficient condition for the above to hold is (k − 1)µ(A) < 1, i.e., k <
1 + 1/µ(A). Hence spark(A) ≥ 1 + 1/µ(A).

Combining Lemma 10.5.3 and Theorem 10.5.1 yields the following result.
 
Theorem 10.5.2. If Aw? = y and kw? k0 < (1/2)(1 + 1/µ(A)), then w? is the unique solution of (10.16).

Proof. Exercise.

This result is weaker than the corresponding result using spark(A), but is easier to verify.
Example 10.5.5. Consider

A = [ 1   0   1/√2    1/√2 ;
      0   1   1/√2   −1/√2 ].


A has unit norm columns. Any three columns must be linearly dependent. So spark(A) ≤ 3. By inspection we see that no two columns are linearly dependent. Hence spark(A) = 3. To determine µ(A) we compute

AT A = [ 1      0      1/√2    1/√2 ;
         0      1      1/√2   −1/√2 ;
         1/√2   1/√2   1       0 ;
         1/√2  −1/√2   0       1 ].

Hence µ(A) = 1/√2. We can then check 3 = spark(A) ≥ 1 + 1/µ(A) = 1 + √2 ≈ 2.414. The condition kwk0 < 1.5 is sufficient to ensure a unique sparsest solution of Aw = y. This is equivalent to y being a scalar multiple of a column of A.
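A short numpy sketch that computes µ(A) from the normalized Gram matrix and the resulting lower bound 1 + 1/µ(A) on spark(A), checked on the matrix of this example; the helper name is ours.

import numpy as np

def coherence(A):
    """Mutual coherence mu(A) of the columns of A, via the normalized Gram matrix."""
    An = A / np.linalg.norm(A, axis=0)      # unit-norm columns
    G = np.abs(An.T @ An)
    np.fill_diagonal(G, 0.0)                # ignore the diagonal (i = j)
    return G.max()

s = 1.0 / np.sqrt(2.0)
A = np.array([[1.0, 0.0, s,  s],
              [0.0, 1.0, s, -s]])
mu = coherence(A)
print(mu, 1.0 + 1.0 / mu)    # approximately 0.7071 and 2.414, so spark(A) >= 2.414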

10.5.3 The Restricted Isometry Property


A matrix A ∈ Rm×n satisfies the Restricted Isometry Property (RIP) of order k if there exists a δ ∈ [0, 1)
such that for all x ∈ Rn with kxk0 ≤ k,
(1 − δ)kxk22 ≤ kAxk22 ≤ (1 + δ)kxk22 . (10.19)
If A satisfies the RIP of order k, we let δk denote the smallest value of δ for which (10.19) holds.
If A satisfies the RIP of order k, then A has special properties when acting on the set Sk = {x : kxk0 ≤
k}. For example, if δk is very small, A acts on vectors in Sk almost like a matrix with orthonormal columns.
But even for larger values of δk the condition δk < 1 gives A special properties on Sk . These properties are
detailed below.
Lemma 10.5.4. If A satisfies the RIP of order k, then k < spark(A).

Proof. Let x ∈ Sk be non-zero. By the RIP property, δk < 1 and (1 − δk )kxk22 ≤ kAxk22 ≤ (1 + δk )kxk22 .
So kAxk22 > 0 and hence x is not in N (A). Thus k < spark(A).
Combining Lemma 10.5.4 with Lemmas 10.5.1, 10.5.2, and Theorem 10.5.1 we see that:
(a) If A satisfies the RIP of order k, then N (A) ∩ Sk = {0}.
(b) If A satisfies the RIP of order 2k, then A is injective on Sk . If in addition, w? ∈ Sk and Aw? = y,
then w? is the unique sparsest solution of Aw = y.
The proof of these claims is left as an exercise.
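For small matrices, δk can be estimated by brute force using the characterization suggested in Exercise 10.15, namely δk = max over supports S with |S| = k of kATS AS − Ik k2 . The numpy sketch below (names ours) does exactly that and is practical only for tiny n and k.

import numpy as np
from itertools import combinations

def rip_constant(A, k):
    """Brute-force delta_k: max over |S| = k of ||A_S^T A_S - I_k||_2 (tiny A only)."""
    n = A.shape[1]
    delta = 0.0
    for S in combinations(range(n), k):
        As = A[:, S]
        M = As.T @ As - np.eye(k)
        delta = max(delta, np.linalg.norm(M, 2))   # spectral norm of a symmetric matrix
    return delta

rng = np.random.default_rng(0)
A = rng.standard_normal((15, 8)) / np.sqrt(15)     # a roughly normalized random matrix
print(rip_constant(A, k=2))    # A satisfies the RIP of order 2 iff this value is < 1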

Notes
The sparse representation classifier discussion in the Introduction is based on the work by Wright et al. in [50]. The
Matching Pursuit algorithm is due to Mallat and Zhang [29]. Orthogonal Matching Pursuit was introduced by Pati et al.
in [35]. These methods followed earlier approaches for iterative basis selection that had come to be called projection
pursuit [16], [23]. The Restricted Isometry Property was introduced by Candes and Tao in [8].
Other suggested work includes:
Donoho, D.L., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries via `1 minimiza-
tion, Proc. Nat. Acad. Sci. 100, 2197–2202 (2003).
Candès, E.J., Tao, T.: Decoding by linear programming, IEEE Trans. Inform. Theory 51 (2005)
Davenport, M.A., Duarte, M.F., Eldar, Y.C., Kutyniok, G.: Introduction to Compressed Sensing. In: Eldar, Y.C.,
Kutyniok, G. (Eds.), Compressed Sensing: Theory and Applications, Cambridge University Press (2011)
Fickus, M., Mixon, D.G.: Deterministic matrices with the restricted isometry property, Proc. SPIE (2011)


Exercises

Sparsity
Exercise 10.1. Invariance properties of k · k0 . A norm on Rn is symmetric if: (a) for any permutation matrix P
and all x ∈ Rn , kP xk = kxk; and (b) for any diagonal matrix D with diagonal entries in {±1}, and all x ∈ Rn ,
kDxk = kxk. A square matrix DP , with D = diag[±1] and P is a permutation is called a generalized permutation.
Show that k · k0 is:
(a) A symmetric function, i.e., invariant under the group of generalized permutations on Rn .
(b) Is invariant under the action of any diagonal matrix with nonzero diagonal entries.
(c) Is not invariant under the orthogonal group On .

Separability
Exercise 10.2. Let f : Rn → R be a separable convex function with f (z) = ∑_{j=1}^n hj (z(j)) for functions hj : R → R
with hj (0) = 0. Show that each hj is a convex function.

Sparse Approximation
Exercise 10.3. (Approximation with an `1 penalty) Let x, y ∈ Rn . Replacing the sparsity penalty kxk0 in problem
(10.10) by the 1-norm penalty kxk1 , yields the approximation problem:

min ky − xk22 + λkxk1 .


x∈Rn

Solving this problem can also give a sparse solution.


(a) Find a solution to the above problem, and determine if the solution is unique.
(b) Interpret the solution in terms of an operation on the components of y.
(c) Assume the components of y are permuted so that |y(1)| ≥ |y(2)| ≥ · · · ≥ |y(n)|. Show that the solution is
k-sparse if and only if |yk+1 | ≤ λ < |yk |.
(d) Does this solution give a better k-sparse approximation to y than the solution of problem (10.10)?
Exercise 10.4. (Approximation with an `1 constraint) Let x, y ∈ Rn .
(a) Use the solution of part (a) in Exercise (10.3) to solve the following problem:

min ky − xk22
x∈Rn
subject to: kxk1 ≤ α.

Determine if the solution is unique.


(b) Assume the components of y are permuted so that |y(1)| ≥ |y(2)| ≥ · · · ≥ |y(n)|. Show that the solution of part (a) is k-sparse if and only if ∑_{j=1}^k |y(j)| − k|y(k)| ≤ α < ∑_{j=1}^k |y(j)| − k|y(k + 1)|.
(c) Give an algorithm for directly solving the problem in part (a).
Exercise 10.5. Let D ∈ Rn×n be diagonal with nonnegative diagonal entries. Solve the following problem. The
problem seeks the best approximation x to y with a non-symmetric `1 penalty on x.

min_{x∈Rn} ky − xk22 + λkDxk1 .

Exercise 10.6. (Sparse approximation in a non-symmetric norm) Let D ∈ Rn×n be diagonal and positive definite.
Then kxkD = (xT Dx)1/2 is a norm.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 137

(a) Solve the following problem and determine if the solution is unique.

min ky − xk2D
x∈Rn
subject to: kxk0 ≤ k.

(b) Show that a similar approach also provides a solution for:

min_{x∈Rn} ky − xk2D + λkxk0 .

For D 6= αIn , k · kD is non-symmetric. So sparse approximation is easily solved for some non-symmetric
norms.

Special Cases of Sparse Regression

Exercise 10.7. (Sparse representation in an ON basis) Let r ≤ n and Q ∈ Rn×r have orthonormal columns.
(a) Find a solution of the following problem and determine if the solution is unique.

min ky − Qxk22
x∈Rr
subject to: kxk0 ≤ k.

(b) Show that the same method also gives solutions for:

min_{x∈Rr} ky − Qxk22 + λkxk0 ;
min_{x∈Rr} ky − Qxk22 , subject to kxk1 ≤ c; and
min_{x∈Rr} ky − Qxk22 + λkxk1 .

In each case, give the corresponding solution.


Exercise 10.8. Let the columns of X ∈ Rn×m be a centered set of unlabelled training data and the columns of
Q ∈ Rn×r be the left singular vectors of a compact SVD of X. In this context, interpret the solution of the problem,

min ky − Qwk22
w∈Rr
subject to: kwk0 ≤ k.

Exercise 10.9. (Sparse representation when A = UΣPT ) Let r ≤ n, U ∈ Rn×r have orthonormal columns,
Σ ∈ Rr×r be diagonal with positive diagonal entries, and P ∈ Rr×r be a generalized permutation. Set A = U ΣP T .
(a) Find a solution of the following problem and determine if the solution is unique:

min ky − Axk22
x∈Rr
subject to: kxk0 ≤ k.

(b) Show that a similar approach gives the solution for: minx∈Rr ky − Axk22 + λkxk0 .
Exercise 10.10. (Sparse approximation in a quadratic norm) Let P, D ∈ Rn×n be symmetric PD with D diagonal, and let Q ∈ On . For x ∈ Rn , define kxkP ≜ (xT P x)1/2 and kxkD ≜ (xT Dx)1/2 .

(a) Show that kxkP and kxkD are norms.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 138

(b) Show that the following two problems are equivalent, in the sense that an instance of one can be transformed
into an instance of the other. Thus a solution method for one problem gives a solution method for the other.

min_{x∈Rn} (1/2) ky − xk2P , subject to kxk0 ≤ k        min_{x∈Rn} (1/2) kz − Qxk2D , subject to kxk0 ≤ k.

For general P , the first problem is a sparse approximation problem in a non-symmetric quadratic norm. The
second problem is a sparse representation problem with respect to an orthonormal basis and a simpler non-
symmetric norm.

Exercise 10.11. Let P ∈ Rn×n be symmetric. Show that f (x) = xT P x is separable if and only if P is diagonal.

Spark, Mutual Coherence


Exercise 10.12. Let M = [ e1 , (1/√2)(e1 + e2 ), e3 , (1/√3)(e1 + e2 + e3 ) ], where ei denotes the i-th standard basis vector in Rn .
(a) Show that the columns of M are linearly dependent.
(b) Determine spark(M ).
(c) Determine the mutual coherence µ(M ).
Exercise 10.13. One way to create a dictionary is to combine known ON bases. Here we explore combining the
standard basis with the Haar wavelet basis. The Haar wavelet basis (discussed in Example 4.1.2) consists of n = 2p
ON vectors in Rn . These vectors can be arranged into the columns of an orthogonal matrix Hp . Hp is displayed for
p = 1, 2, 3 in Example 4.1.2. Form a dictionary D ∈ Rn×2n by setting D = [In , Hp ] with n = 2p .

(a) For p = 1, show that spark(D) = 3 and µ(D) = 1/√2.
(b) For all p > 1, determine spark(D) and µ(D).
(c) For a given y ∈ Rn , we seek the sparsest solution of y = Dw. What condition on y is sufficient to ensure the
sparsest solution is unique?

Exercise 10.14. The Hadamard basis is an orthonormal basis in Rn with n = 2p . It can be defined recursively as the
columns of the matrix Hpa with
H0^a = 1   and   Hp^a = (1/√2) [ H_{p−1}^a   H_{p−1}^a ;  H_{p−1}^a   −H_{p−1}^a ].

Specific instances of Hpa are given in Example 4.1.1.


Form the dictionary Dp = [In , Hpa ] ∈ Rn×2n with n = 2p .

(a) Show that µ(D1 ) = 1/√2 and spark(D1 ) = 3.
(b) Show that µ(D2 ) = 1/2 and spark(D2 ) = 4.
(c) Determine µ(Dp ) for p ≥ 1.
(d) Show that 1 + 2^{p/2} ≤ spark(Dp ) ≤ 2 + 2^{p−1} .
(e) For a given y ∈ Rn , we seek the sparsest solution of y = Dp w. What condition on y is sufficient to ensure the
sparsest solution is unique?

The Restricted Isometry Property


Exercise 10.15. (The Restricted Isometry Property) Assume A satisfies the RIP of order s. Let S ⊂ [1 : n] with
|S| = s, and AS denote the matrix of the columns of A indexed by S.
(a) Show that RIP requires: ∀S ⊂ [1 : n], ∀z ∈ Rs , |z T (ATS AS − Is )z| ≤ δkzk22 .


(b) Show that (a) implies kATS AS − Is k2 ≤ δ.


(c) Give an expression for δs .
(d) Show that δ1 ≤ δ2 ≤ · · · ≤ δn .
(e) Derive an interval bound on the eigenvalues of ATS AS .
(f) Derive an interval bound on the singular values of AS .
Exercise 10.16. Let the columns of A ∈ Rm×n have unit norm.
(a) Show that A has the RIP order 1 with δ1 = 0.
(b) Show that A has the RIP order 2 iff µ(A) < 1, where µ(A) is the coherence of A (see §10.5.2).

Chapter 11

The Lasso

11.1 Introduction
Sparse regression problems, e.g., minimizing ky − Axk22 + λkxk0 , have two coupled aspects: select-
ing a subset of atoms (subset selection) and forming a weighted sum of these atoms to obtain the best
approximation to y. In general, the first aspect is computationally challenging, while the second is easy.
The computational challenge of solving these problems motivated the introduction of greedy methods for
finding an approximation to the solution.
An alternative approach is to relax the sparsity function k · k0 to the convex approximation k · k1 . To see
why we regard k·k1 as a convex relaxation of k·k0 , see Figure 10.1 and the informal explanation given in the
caption of the figure. The relaxed problem requires minimizing a convex function, e.g., ky − Axk22 + λkxk1 .
In general, for appropriate values of λ, the solution of the relaxed problem will be sparse. In this case, the
solution selects a subset S of atoms of A and finds a linear combination of these atoms to approximate y. In
a variety of applications the solution obtained may be an adequate substitute for the solution of the original
sparse regression problem. However, one can also use the subset of selected atoms to solve the least squares
problem minw∈Rn ky − AS wk22 , where the columns of AS are the selected atoms. This yields an alternative
approximation to the solution of the original sparse regression problem.

11.2 Lasso Problems


Consider the 1-norm regularized least squares problem

min_{w∈Rm} ky − Awk22 + λkwk1 ,   λ > 0.        (11.1)

Problem (11.1) is an unconstrained convex optimization problem. Indeed, the objective function is the sum of two competing convex terms, kAw − yk22 and λkwk1 . These terms are competing in the sense that
each has its own distinct minimizer. Example sublevel sets of these terms are illustrated in Figure 11.1.
Consideration of these sublevel sets leads to two equivalent formulations of 1-norm regularized regression.
The first minimizes kAw − yk22 over a fixed sublevel set of the `1 norm:

min_{w∈Rn} kAw − yk22   s.t. kwk1 ≤ ε.        (11.2)


Figure 11.1: Sublevel sets of kAw − yk22 (red) and the kwk1 regularizer (blue) together with the regularization path of
the lasso solution x?sr . For comparison, a sublevel set of the `2 norm (dashed blue) and the corresponding regularization
path of the ridge regression solution are also shown.

The second minimizes kwk1 over a fixed sublevel set of kAw − yk22 :

min_{w∈Rn} kwk1   s.t. kAw − yk22 ≤ δ.        (11.3)

The optimal solution of (11.2) occurs at the point w? where a level set of kAw − yk22 first intersects the
ε-ball of k · k1 . This is illustrated in Figure 11.1. We also illustrate the `2 -ball that first intersects the same
level set of kAw − yk22 . Notice the difference in the sparsity of the two intersection points. The `1 ball
has vertices and these are aligned with the coordinate axes. These vertices are more likely to first touch a
sublevel level set of the quadratic term. So the shape of the unit `1 ball encourages a sparse solution. There
will be exceptions, but the exceptional cases require the sublevel sets of kAw − yk22 to be positioned in a
particular way relative to the axes. The solution of (11.3) occurs at the point wδ? where a level set of k · k1
first intersects the δ-sublevel set of kAw − yk22 . Notice that for appropriate choices of δ and ε, problems
(11.2) and (11.3) have the same solution. See Figure 11.1.
The above three problems are often called lasso problems. In general, no closed form expressions for
the solutions of (11.1), (11.2) and (11.3) are known. This is the usual situation for most convex optimization
problems. The important point is that the above convex problems are amenable to solution via efficient
numerical algorithms. The special case of (11.1) for sparse approximation is particularly easy to solve and
this connects to the sparse approximation problem using k · k0 . We discuss this in the next section.

Scaling the Data


Multiplying the objective of (11.1) by α2 , with α > 0, yields the equivalent problem:

min_{w∈Rm} kȳ − Āwk22 + λ̄kwk1 ,

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 143

where ȳ = αy, Ā = αA, and λ̄ = α2 λ. As a result, it is meaningless to talk about the value of λ employed
when solving (11.1) without accounting for possible re-scaling of the data. One way to do this is to let
aj denote the j-th column of A and define λmax = max_{j=1}^m |aTj y|. Then the ratio λ/λmax is invariant to scaling.

11.3 1-Norm Sparse Approximation


We begin by considering the `1 -sparse approximation problem

x? = arg min_{x∈Rn} ky − xk22 + λkxk1 .        (11.4)

This requires selecting x to approximate y with a convex 1-norm penalty on x to encourage a sparse approx-
imation. The objective function in (11.4) is convex and separable:
Pn
ky − xk22 + λkxk1 = i=1 (y(i) − x(i))2 + λ|x(i)| .

Each term in the sum can be optimized separately, leading to n scalar problems of the form:

min (y(i) − α)2 + λ|α| . (11.5)


α∈R

We would like to use differential calculus to solve (11.5), but we see that |α| is not differentiable at α = 0.
For the moment, we will handle this as follows. First consider α > 0, and set the derivative w.r.t. α of the
objective in (11.5) equal to 0. This yields

−2(y(i) − α) + λ = 0 ⇒ α = y(i) − λ/2, provided y(i) > λ/2.

Doing the same for α < 0 yields

−2(y(i) − α) − λ = 0 ⇒ α = y(i) + λ/2, provided y(i) < −λ/2.

The only case that remains is α = 0, with objective value y(i)2 . This must be the solution for −λ/2 ≤
y(i) ≤ λ/2. Hence the solution of each scalar problem is:

x? (i) = { y(i) − λ/2, if y(i) ≥ λ/2;  0, if −λ/2 < y(i) < λ/2;  y(i) + λ/2, if y(i) ≤ −λ/2. }

If the magnitude of y(i) is smaller than λ/2, x? (i) is set to 0. This introduces sparsity in x? . The remaining
nonzero components of x? are formed by reducing the magnitude of the corresponding values in y by λ/2.
For this reason, this operation is called shrinkage.
Bring in the scalar soft thresholding function:

St (z) = { z − t, if z ≥ t;  0, if −t < z < t;  z + t, if z ≤ −t. }

This function is illustrated in Figure 11.2. We have shown above that x? (i) = Sλ/2 (y(i)). The optimal


Figure 11.2: The hard (left) and soft (right) scalar thresholding functions Ht (z) and St (z).

solution of (11.4) can then be written in vector form as:

x? = Sλ/2 (y), (11.6)

where the vector function St : Rn → Rn acts componentwise.
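A small numpy sketch of the soft-thresholding solution (11.6); the helper name is ours.

import numpy as np

def soft_threshold(y, t):
    """Componentwise soft thresholding S_t(y): shrink each entry of y toward 0 by t."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

y = np.array([3.0, -0.2, 0.05, 1.5, -0.6])
lam = 1.0
x_star = soft_threshold(y, lam / 2.0)     # solves min_x ||y - x||_2^2 + lam*||x||_1
print(x_star)                             # [ 2.5  0.   0.   1.  -0.1]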


The same approach can be used to solve the problem minx∈Rn ky − xk22 + λkDxk1 , where D ∈ Rn×n
is a positive definite diagonal matrix. The solution is again a soft thresholding operation on y except that
the threshold ti for component i depends on d(i). The soft thresholding method of solution generalizes to
any problem of the form minx∈Rk ky − U DP T xk22 + λkxk1 , where U ∈ Vn,k has orthonormal columns,
D ∈ Rk×k is diagonal and positive definite, and P ∈ Rk×k is a generalized permutation matrix. The
verification of these claims is left as an exercise.
Example 11.3.1. The sparse approximation problems:

min_{x∈Rn} ky − xk22 + λkxk0     and     min_{x∈Rn} ky − xk22 + λkxk1 ,

have the solutions

x?0 = H√λ (y)     and     x?1 = Sλ/2 (y).

The two solutions use similar but distinct thresholding functions and distinct threshold values: √λ versus
λ/2. In both cases, the components of y with magnitudes above the threshold will remain nonzero, and
components with values below the threshold are set to zero. So a higher threshold creates the opportunity
for greater sparsity in the solution.
These functions are compared in Figure 11.3. We see that for λ < 1, sparse approximation using kxk0
has the opportunity to yield a sparser solution. We say opportunity because the sparsity of the solution will
also depend on the values of the components of y. In practice, λ often takes very small values, well below
1. If y has a few large values and all other values equal to zero, then for small λ, x?0 and x?1 will have the
same support set. On the other hand, if y has a few large magnitude components and many components with
magnitudes clustered in a neighborhood of 0, then we expect x?0 to be sparser than x?1 .

11.4 Subgradients and the Subdifferential


We have already noted that the scalar function g(z) = |z| is not differentiable at z = 0. Hence kwk1 = ∑_{j=1}^n |w(j)| is not differentiable at any point for which a component of w is zero. Since (11.4) seeks a
sparse solution, it’s thus unlikely that kwk1 is differentiable at a solution.


Figure 11.3: Plots of √λ and λ/2 versus λ.

We can handle this issue by introducing the concept of the subgradient of a convex function. Recall that
the derivative of a differentiable function f : Rn → R at x is the linear function from Rn into R given by
Df (x)(h) = ∇f (x)T h. If f is also convex, then for any z ∈ Rn ,

f (z) ≥ f (x) + ∇f (x)T (z − x). (11.7)

This gives a global lower bound for f (z) in terms of f (x), the gradient of f at x, and the deviation of z from
x. This result was given in Chapter 7 and is illustrated in Figure 7.3.
The bound (11.7) provides a way to define a “generalized gradient” for nondifferentiable convex func-
tions. A vector g ∈ Rn is called a subgradient of a convex function f at x if for all z ∈ Rn ,

f (z) ≥ f (x) + g T (z − x).

The function f is called subdifferentiable at x if it has a nonempty set of subgradients. When this holds, the
set of subgradients, denoted by ∂f (x), is called the subdifferential of f at x. If f is differentiable at x then
it is subdifferentiable, and its subdifferential is ∂f (x) = {∇f (x)}.
Example 11.4.1. Consider the scalar function |z|. For z 6= 0, this is differentiable with ∂|z| = 1 if z > 0
and ∂|z| = −1 if z < 0. At z = 0, we have ∂|z| = [−1, 1]. We can write this as

1,
 if z > 0;
g ∈ ∂|z| ⇐⇒ g = γ ∈ [−1, 1], if z = 0;

−1, if z < 0.

Similarly, 
1,
 if w(i) > 0;
g ∈ ∂kwk1 ⇐⇒ g(i) = γ ∈ [−1, 1], if w(i) = 0; (11.8)

−1, if w(i) < 0.

The following important lemma extends Corollary 7.5.1 to subdifferentials.


Lemma 11.4.1. The point w? minimizes the convex function f : Rn → R if and only if 0 ∈ ∂f (w? ).

Proof. If w? minimizes f , then for every z, f (z) ≥ f (w? ) = f (w? ) + 0T (z − w? ). Hence 0 ∈ ∂f (w? ).
Conversely, if 0 ∈ ∂f (w? ), then for all z, f (z) ≥ f (w? ) + 0T (z − w? ) = f (w? ). So w? minimizes f .

11.5 Application to 1-Norm Regularized Least Squares


We now use Lemma 11.4.1 to find necessary and sufficient conditions for a point w to be a minimizer of the
regularized loss function f (w) = ky − Awk22 + λkwk1 . The first term in f can be expanded as

wT AT Aw − 2wT AT y + y T y.

The gradient of this term is 2(AT Aw − AT y). So the subdifferential of f at w is:

∂f (w) = 2AT (Aw − y) + λ∂kwk1 . (11.9)

Appealing to Lemma 11.4.1 then yields the following result.


? is a solution of (11.1) if and only if
Theorem 11.5.1. wla
? = AT y − λ g,
AT Awla ? k ∩ N (A)⊥ .
for some g ∈ ∂kwla (11.10)
2 1

Proof. (If) Assume the condition (11.10) holds. Then 0 ∈ AT (Awla ? − y) + λ ∂kw ? k . Hence by Lemma
2 la 1
? is a solution of (11.1).
11.4.1, wla
(Only If) Assume wla ? is a solution of (11.1). Then by Lemma 11.4.1, 0 ∈ AT (Aw ? −y)+ λ ∂kw ? k . Hence
la 2 la 1
there exists g ∈ ∂kwla ? k such that AT Aw ? = AT y − λ g. This can be rearranged as λ g = AT (y − Aw ? ).
1 la 2 2 la
It follows that g must also be in R(AT ) = N (A)⊥ . Thus (11.10) holds.

The term y − Awla ? is the residual, and the rows of AT are the atoms (columns of A). The j-th entry

in g depends on the sign of the j-th entry of w? , and by the above this must equal the inner product of the
residual with the j-th atom. Specifically,

? (j) > 0;
λ/2,
 if wla
aTj (y − Awla
?
) = λ2 g = γ ∈ [−λ/2, λ/2], if wla ? = 0;


−λ/2, ? (j) < 0.
if wla

So when wla ? uses atom a (i.e., w ? (j) 6= 0), the inner product of a and the residual is λ sign(w ? (j)); but
j la j 2 la
? does not use atom a , then the inner product of a and the residual lies in the interval [− λ , λ ]. Notice
if wla j j 2 2
that the residual is never orthogonal to an atom used by wla ?.

We have previously derived corresponding conditions for ridge and least squares regression. For least
squares the condition is
AT Awls? = AT y.
In this case, for every atom ai of A, aTi (y − Awls? ) = 0. So every atom is orthogonal to the residual. If
AT A is invertible, the solution wls? = (AT A)−1 AT y is unique and is linear in y. For ridge regression the
corresponding condition is
AT Awrr
?
= AT y − λwrr ?
.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 147

Hence for any atom aj , the residual y − Awrr ? satisfies aT (y − AT w ? ) = λw ? (j). So an atom is orthogonal
j rr rr
? . The unique ridge solution is w ? = (AT A+λI)−1 AT y,
to the residual if and only if it has zero weight in wrr rr
which is also linear in y.
Equation (11.10) has the same form as those above, but g is a nonlinear function of y and wla ? . In
?
particular, wla is not a linear function of y. This is shown in the examples below.

Corollary 11.5.1. In the limit as λ → 0, the solution of (11.1) is a least 1-norm solution of the least
squares problem minw∈Rn ky − Awk2 .

Proof. From (11.10) we have

λ λ
kAT Awla
?
(λ) − AT yk22 = kgk22 ≤ n.
2 2

Hence in the limit as λ → 0, wla ? satisfies the normal equations and is hence a solution of the stated least

squares problem. If the least squares problem has a unique solution, then we are done. If not, there is at least
one point satisfying the normal equations that has the least 1-norm over all such solutions. Without loss of
generality we can assume that y ∈ R(A). Then ky − Auk22 = 0. Hence by the optimality of wla ? (λ),

?
ky − Awla (λ)k22 + λkwla
?
(λ)k1 ≤ ky − Auk22 + λkuk1 = λkuk1 .

? (λ)k ≤ λkuk . Taking the limit as λ → 0, and using the continuity of k · k , we obtain
Thus kwla 1 1 1

? ?
k lim wla (λ)k1 = lim kwla (λ)k1 ≤ kuk1 .
λ→0 λ→0

Since limλ→0 wla? (λ) solves the least squares problem, we must have equality in the final inequality: Hence
?
k limλ→0 wla (λ)k1 = kuk1 .

Corollary 11.5.2. The application of Theoem 11.5.1 to various special cases of (11.1) yields the fol-
lowing previously stated results for 1-norm regularization regression:

(a) Approximation: If A = In , then w? = Sλ/2 (y)

(b) Weighted approximation: if A = diag[d1 , . . . , dn ] with di > 0, i ∈ [1 : n], then w? =


D−1 S λ D−1 (y).
2

(c) Representation in an ON basis: If A = U ∈ Vn,k , then w? = Sλ/2 (U T y)

Proof. (a) Applying the necessary and sufficient conditions (11.10) to the approximation problem (11.4)
yields x? = y − λ2 g for some g ∈ ∂kx? k1 . Writing this out componentwise we obtain

λ/2
 if x? (i) > 0 ≡ y(i) > λ/2;
?
x (i) = y(i) − γ ∈ [−λ/2, λ/2] if x? (i) = 0 ≡ y(i) ∈ [−λ/2, λ/2] with γ = y(i);

− /2 if x? (i) < 0 ≡ y(i) < −λ/2.
 λ

= Sλ/2 (y(i)).

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 148

Hence x? is obtained by soft-thresholding y using the threshold λ/2.


(b) Applying (11.10) we obtain D2 w? = Dy − λ2 g for some g ∈ ∂kx? k1 ∩ N (D)⊥ . Since N (D) = 0, we
only need g ∈ ∂kx? k1 . Writing this out componentwise yields
 λ
 2d2i
 if w? (i) > 0 ≡ y(i) > 2dλi ;
y(i) 
w? (i) = − γ ∈ [− 2dλ2 , 2dλ2 ] if w? (i) = 0 ≡ y(i) ∈ [− 2dλi , 2dλi ]; with di γ = y(i);
di  i i
− λ

if w? (i) < 0 ≡ y(i) < − λ .
2d2i 2di
1
= S λ (y(i)).
di 2di
Hence w? is obtained by scaling a soft-thresholding of y: w? = D−1 S λ D−1 (y).
2
(c) Applying (11.10) we obtain w? = U T y + λ2 g for some g ∈ ∂kx? k1 ∩ N (U )⊥ . Since U has ON columns,
N (U ) = Rk . Hence we only need g ∈ ∂kx? k1 . Thus by (a), w? = Sλ/2 (U T y).

11.5.1 Examples
Example 11.5.1. Let u, y ∈ Rn , with kuk2 = 1, and set A = u . For λ > 0 and w ∈ R, we seek
 

the solution of (11.1). We have AT A = uT u = 1, N (A)⊥ = R, and AT y = uT y. Hence by (11.5.1) a


necessary and sufficient condition for w? to be a solution is that for some g ∈ ∂|w? | ∩ R,

1,
 if w? > 0 ≡ uT y > λ/2;
? T λ
w = (u y) − g, where g = γ ∈ [−1, 1], if w? = 0 ≡ −λ/2 ≤ uT y ≤ λ/2;
2 
−1, if w? < 0 ≡ uT y < −λ/2.

? (λ) = S (uT y).


Hence wla λ/2

Example 11.5.2. Let u, y ∈ Rn , with kuk2 = 1, and set A = u u . For λ > 0 and w ∈ R2 , we seek the
 

solution of (11.1). We have AT A = 11T , N (A)⊥ = span{1}, and AT y = uT y1. By (11.5.1) a necessary
and sufficient condition for w? to be a solution is

1(1T w? ) = (uT y)1 − λ2 g where g ∈ ∂kw? k1 ∩ span{1}. (11.11)

The condition g ∈ ∂kw? k1 ∩ span{1} implies that g must take the form
 
γ
g= where γ ∈ [−1, 1].
γ

If 1T w? > 0, then one entry of w? is positive and the other must be nonnegative. In this case γ = 1. If
1T w? < 0, then one entry of w? is negative and the other must be nonpositive. In this case γ = −1. if
1T w? = 0 then w? = 0, and γ ∈ [−1, 1]. In all cases we can write g = γ1 for appropriate γ ∈ [−1, 1].
Substituting this into (11.11) reveals that we require 1T w? = uT y − λ2 γ for the prescribed allowed values
of γ. This yields 
T T ? T
u y − λ/2, if 1 w > 0 ≡ u y > λ/2;

1T w? = 0, if 1T w? = 0 ≡ −λ/2 ≤ uT y ≤ λ/2;

 T
u y + λ/2, if 1T w? < 0 ≡ uT y < −λ/2.
Thus 1T w? = Sλ/2 (uT y). This uniquely specifies the sum of the entries of w? . In general w? is not unique.
If the entries of w? are nonnegative, then any point w ≥ 0 with w(1) + w(2) = Sλ/2 (uT y) is a solution.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 149

Similarly, if the entries of w? are nonpositive, then any point w ≤ 0 with w(1) + w(2) = Sλ/2 (uT y) is a
solution.
Example 11.5.3. Consider problem (11.1) with
   
1 0 0.5 1
A= , y= .
0 1 0.5 0

Note that y is one of the atoms in the dictionary. Simple computation yields

1 0 12
       
1 1 0 1
AT A =  0 1 12  , N (A) = span  1  , N (A)⊥ = span  0 1  , T
and A y = 0  .

1 1 1 1 1 1
2 2 2 −2 2 2 2

By (11.5.1) a necessary and sufficient condition for w? to be a solution is

1 0 12
     
1 1 0
 0 1 1  w? =  0  − λ g where g ∈ ∂kw? k1 ∩ span  0 1 . (11.12)
2 2
1 1 1 1 1 1
2 2 2 2 2 2

To find a solution of (11.14), first consider the limit as λ → 0. In this limiting case, w? = e1 + αn is a
solution, where n = (1, 1, −2) and α ∈ R. It easily checked that kw? k1 is minimized at α = 0. So at λ = 0
the least 1-norm solution is w? = e1 . As λ increases we expect the soft threshold operator to shrink the
above solution to w? = (1 − λ/2)e1 . This is verified by selecting g = (1, 0, 1/2), and checking the equality
of both sides in (11.14):

1 0 21 1 − λ2 1 − λ2 1 − λ2
          
1 1
0 1 1   0  =  0  and 0 − λ 0 =  0  .
2 2
1 1 1 1 λ 1 1 1 λ
2 2 2 0 2 − 4 2 2 2 − 4

Equality continues holds for 0 ≤ λ < 2. Hence w? = (1 − λ2 )e1 is a solution for 0 ≤ λ < 2. At λ = 2 the
solution is w? = 0. We expect that increasing λ further will leave the solution invariant at 0. This is verified
by selecting g = ( λ2 , 0, λ1 ) and checking the equality of both sides of (11.14). This is left as an exercise. In
summary, we have determined the lasso solution path
(
? (1 − λ/2)e1 , 0 ≤ λ ≤ 2;
wla (λ) = (11.13)
0, λ > 2.

Example 11.5.4. Consider the dictionary in Example 11.5.3. This time we seek the solution for yt =
(1 − t)e1 + te2 , where t ∈ [0, 1]. For simplicity, consider λ < 1/2. In this case
 
1−t
AT y t =  t  ,
1
2

and by (11.5.1) a necessary and sufficient condition for w? to be a solution is

1 0 12
     
1−t 1 0
 0 1 1  w? =  t  − λ g where g ∈ ∂kw? k1 ∩ span  0 1 . (11.14)
2 2
1 1 1 1 1 1
2 2 2 2 2 2

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 150

For t = 0 we know from Example (11.5.3) that the solution is w? = (1 − λ/2)e1 . Hence for t > 0 we first
examine if w? = αe1 , with α > 0, continues to be a solution. In this case, (11.14) requires that
     
1 1−t 1
λ
α  0  =  t  −  γ  some γ ∈ [−1, 1].
1 1 2 1 γ
2 2 2 + 2

This yields

α = 1 − t − λ2 , for t ≤ 1 − λ2 ;
2t
γ= λ, for t ≤ λ2 .

So now we have the solution


w? = (1 − t − λ2 )e1 , 0 ≤ t ≤ λ2 .
To proceed further, we must bring another column of A into play. Hence we consider solutions of the form
w? = αe1 + βe2 , with α, β > 0. This solution uses the first two columns of A. In this case, (11.14) requires
that      
α 1−t 1
λ
 =  t  − 1 , some γ ∈ [−1, 1].
 β
1 1 2
2 (α + β) 2 1
This yields
λ
α=1−t− 2 for 0 ≤ t ≤ 1 − λ2 ;
β = t − λ2 , for λ
2 ≤ t ≤ 1.

So now we have the solution


(
(1 − t − λ2 )e1 , 0 ≤ t ≤ λ2 ;
w? =
(1 − t − λ2 )e1 + (t − λ2 )e2 , λ λ
2 ≤ t ≤ 1 − 2;

We note that at t = 1 − λ2 , w? (1) = 0. So for larger values of t we examine w? = βe2 . In this case, (11.14)
requires that      
0 1−t γ
λ
β  1  =  t  −  1  , some γ ∈ [−1, 1].
1 1 2 1 γ
2 2 2 + 2
This yields

β = t − λ2 , for λ
2 ≤ t ≤ 1;
γ = (1 − t) λ2 , for 1 − λ
2 ≤ t ≤ 1.

Finally, we have the complete solution path as t takes values from 0 to 1 for a fixed value of λ > 0:

λ
(1 − t − 2 )e1 ,
 0 ≤ t ≤ λ2 ;
? (t) =
wla (1 − t − λ2 )e1 + (t − λ2 )e2 , λ2 ≤ t ≤ 1 − λ2 ; (11.15)
 λ λ
(t − 2 )e2 , 1 − 2 ≤ t ≤ 1.

We now make the following observations. For the solution w? to be linear in y, the solution path for
yt must be w? (t) = (1 − t)w? (0) + tw? (1), where w? (0) is the solution for y0 and w? (1) for y1 . So the

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 151

solution path would move in a straight line from w? (0) to w? (1). But we see that this is not the case. So the
lasso solution is not linear in y. From w? (0) = (1 − λ/2)e1 the solution first shrinks until t = λ/2. At this t,
w? (t) = (1 − λ)e1 . Symmetrically, as t decreases from 1, the solution w? (1) = (1 − λ/2)e2 shrinks until
t = 1 − λ/2. At which point w? (t) = (1 − λ)e2 . As t transits from λ/2 to 1 − λ/2, the solution path does
linearly interpolate between w? (λ/2) and w? (1 − λ/2).

11.6 The Dual Lasso Problem


Setting z = y − Aw in (11.1) yields the constrained problem:

min 1/2 z T z + λkwk1


z∈Rn w∈Rm (11.16)
s.t. z = y − Aw.

Bring in a dual variable µ ∈ Rn and form the Lagrangian

L(z, w, µ) = 1/2 z T z + λkwk1 + µT (y − Aw − z).

Computing the subdifferentials with respect to z and w yields:

∂z L = z − µ, ∂w L = −AT µ + λg, g ∈ ∂kwk1 . (11.17)

The optimality condition that 0 must be in each subdifferential gives µ = z and AT µ = λg for some
g ∈ ∂kwk1 . These equations allow the elimination of z and w from L. In detail,

L = −1/2µT µ + λkwk1 + µT y − µT Aw; setting µ = z


= −1/2µT µ + λkwk1 + µT y − λg T w; setting µT Aw = λg T w
(11.18)
= −1/2µT µ + µT y; using g T w = kwk1
= −1/2ky − µk22 + 1/2kyk22 ; completing the square.

To ensure the satisfiability of the equation AT µ = λg for some g ∈ ∂kwk1 , it is necessary that |aTi µ| ≤ λ
for each atom ai . This leads to the following dual problem:

max 1/2 kyk22 − 1/2kµ − yk22


µ∈Rn
(11.19)
subject to: |aTi µ| ≤ λ, i ∈ [1 : m].

This can be simplified to

min kµ − yk22
µ∈Rn
(11.20)
subject to: |aTi µ| ≤ λ, i ∈ [1 : m].

By the above construction, the solutions w? ∈ Rm of (11.1) and µ? ∈ Rn of (11.19) satisfy:

y = µ? + Aw? (11.21)
(
T ? λsgn(w? (i)), if w? (i) 6= 0;
ai µ = (11.22)
γ ∈ [−λ, λ] , if w? (i) = 0.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 152

From (11.21) we see that µ? is the optimal residual resulting from the selection of w? . So the dual problem
directly seeks the optimal residual.
Let Fλ denote the set of µ satisfying the constraints in (11.20). These constraints can be written as a
set of 2m linear half plane constraints aT µ ≤ λ for each a ∈ {±ai }pi=1 . So the set of feasible points is a
convex polytope in Rn . In addition, maximizing the objective function in (11.20) seeks the closest point in
Fλ to y. Hence µ? = PFλ (y).
It is sometimes convenient to introduce the scaled dual variable θ = µ/λ. This removes λ from the
constraints in the dual problem. This results in the modified dual problem,

min kθ − y/λk22
θ∈Rn
(11.23)
s.t. |aTi θ| ≤ 1 i ∈ [1 : m],

and the solutions w? ∈ Rp of (11.1) and θ? ∈ Rn of (11.23) satisfy:


(
? ? ? ?T sgn(w? (i)), if w? (i) 6= 0;
θ = Aw + λθ , θ ai = (11.24)
γ ∈ [−1, 1] , if w? (i) = 0.

In this case, the dual solution θ? (λ) is the unique projection of y/λ onto the closed convex polytope F =
{θ : |aTi θ| ≤ 1}. Notice that now the dual feasible set F does not depend on λ and we project the point
y/λ onto F. This is illustrated in Figure 11.4. Version (11.23) of the dual problem makes it very clear that
the dual solution θ? is a continuous function of λ, and this can be used to show that the solution w? of the
primal problem (11.1) is also a continuous function of λ.

Theorem 11.6.1. Let θ? (λ) be the solution of the dual problem (11.23) and w? (λ) be the solution of
the primal problem (11.1). Then θ? (λ), Aw? (λ) are continuous functions of λ.

Proof. From (11.23) we note that the set of dual feasible points F is the intersection of a set of closed half
spaces. Hence F is closed and convex. Thus θ? (λ) is the unique projection of y/λ onto F. The argument
y/λ is continuous at each λ > 0, and the function PC : Rn → F is continuous. Hence θ? (λ) is continuous
in λ for all λ > 0. From (11.24) we see that Aw? (λ) = (1 − λ)θ? (λ) is also continuous in λ.

Notes
Not done yet.

Exercises
Exercise 11.1. Solve each of the following problems:
(a) minx∈Rn ky − xk22 + λkDxk1 .

(b) minx∈Rk ky − U xk22 + λkxk1 , where U ∈ Vn,k .

(c) minx∈Rk ky − U DP T xk22 + λkxk1 , where U ∈ Rn×k has ON columns, D ∈ Rk×k is diagonal with a positive
diagonal, and P ∈ Rk×k is a generalized permutation.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 153

S n-1 -a1

-a3

-a2
0

a2
Dual
Feasible a
3
Region
a1 y

y/λ

Figure 11.4: The geometry of the dual problem in terms of the dual variable θ.

Exercise 11.2. Consider the dictionary  


1 0 1/2
A= .
0 1 1/2

Suppose λ > 1/2 and yγ = (1 − γ)e1 + γe2 where γ ∈ [0, 1]. Find and plot the solution wγ? of the corresponding lasso
regression problem as a function of γ. Also plot ŷγ = Awγ? .

Exercise 11.3. Consider the dictionary  √ 


1 0 1/ 2
A= √ .
0 1 1/ 2

Let yγ = (1 − γ)e1 + γe2 where γ ∈ [0, 1]. By imposing an upper bound on λ, find and plot the solution wγ? of the
corresponding lasso regression problem as a function of γ. Also plot ŷγ = Awγ? .

Exercise 11.4. Let D = [d1 , . . . , dp ] be a dictionary of unit norm atoms and consider the sparse representation
problem
minp ky − Dwk22 + λkwk1 .
w∈R

(a) Let y = dj . Show that w? = (1 − λ)ej and Dw? = (1 − λ)dj , for 0 < λ < 1.


(b) Fix dj , and let k = arg maxi6=j |dTi dj |, C = dTk dj , and for convenience assume C ≥ 0. Constrain λ <
(1 + C)/2. Now let yγ = (1 − γ)dj + γdk for γ ∈ [0, 1]. So yγ traces out the line segment from dj to dk .
Determine the corresponding solution of wγ? of the `1 -sparse regression problem as a function of γ.

Exercise 11.5. (Basis Pursuit) Let A ∈ Rm×n with rank(A) = m < n, and y ∈ Rm . We seek the sparsest solution
of Ax = y:
minn kxk0 , s.t. Ax = y. (11.25)
x∈R

The convex relaxation of this problem is called Basis Pursuit:

min kxk1 , s.t. Ax = y. (11.26)


x∈Rn

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 154

Show that (11.26) is equivalent to the linear program:

minx,z∈Rn 1T z
s.t. Ax = y
x−z ≤ 0
−x − z ≤ 0
−z ≤ 0

Exercise 11.6. Let A ∈ Rm×n with rank(A) = m < n. Then for y ∈ Rm set

f (y) = minn kxk1


x∈R
s.t. Ax = y.

Show that f (y) is a convex function on Rm .


Exercise 11.7. Consider the lasso problem

min ky − Awk22 + λkwk1 , (11.27)


w∈Rn

with λ > 0, A ∈ Rm×n , and y ∈ Rm . For simplicity, assume that rank(A) = m < n.
(a) Show that problem (11.27) can be written in the form

min ky − xk22 + λf (x), (11.28)


x∈Rm

where f (x) is a non-negative, convex function with f (0) = 0.


(b) Show that (11.28) has a unique solution.
(c) Show that the unique solution x? (λ) of (11.28) is a bounded and continuous function of λ ∈ (0, ∞).

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
Chapter 12

Generative Models for Classification

This chapter formulates a generative model for data drawn from c classes. We then derive the corresponding
the Bayes classifier for this model. To illustrate this in greater detail, we examine Gaussian generative
models as a special case. We then introduce the exponential family of densities and derive the binary
Bayes classifier for any two densities in the same exponential family. In the process, we also introduce
the concept of the KL-divergence of two densities or probability mass functions. Finally, we introduce
empirically derived Bayes classifiers and show that several simple classifiers (Nearest Centroid, Gaussian
Naive Bayes, and Linear Discriminant Analysis), are instances of this method of classifier design under
particular assumptions about the structure of an underlying Gaussian generative model.

12.1 Generative Models


Let a labelled example take the form (x, y) with x ∈ Rn and y ∈ [1 : c]. The values x and y are assumed to
be the outcomes of random variables X and Y, with the joint probability density described in the factored
form
fXY (x, y) = fY (y)fX|Y (x|y) = πy py (x). (12.1)
∆ ∆
Here π is a probability mass function over c classes with πk = fY (k), k ∈ [1 : c], and pk (x) = fX|Y (x|k) is
the conditional density of X given that Y = k, k ∈ [1 : c]. We call πk the prior probability of class k, and the
densities pk (x) the class-conditional densities of X. For the moment we assume that the prior probabilities
πk and the class-conditional densities pk (x), k ∈ [1 : c], are known. Our objective is to determine the
classifier that minimizes the probability of misclassification.
The model (12.1) is an example of a generative model. A generative model directly specifies how one
can generate examples that conform to the specified joint probability distribution. For example, for the
model (12.1) with two classes labelled 0 and 1, first flip a biased coin with sides labelled “0” (probability
π0 ) and “1” (probability π1 = 1 − π0 ) to determine a label value k ∈ {0, 1}. Then generate an example x
according to the selected class-conditional density pk (x).

12.2 The Bayes Classifier


Using (12.1) and Bayes’ rule we can write

fX|Y (x|k)fY (k) πk pk (x)


fY|X (k|x) = = (12.2)
fX (x) fX (x)

155
ELE 435/535 Fall 2018 156

The term fY|X (k|x) is called the posterior probability of class k. The example x gives information about
its corresponding class. Hence once x is known, the prior class probabilities πk can be updated. The pmf
fY|X (k|x) specifies the posterior class probabilities after observing the example x. Notice that the posterior
pmf combines information provided by x with information provided by the class prior pmf.
If we predict the label of x to be k, then the probability of error is 1 − fY|X (k|x). Hence to minimize
the probability of error, we select the label with maximum posterior probability. This yields the Maximum
Aposteriori Probability (MAP) classifier. Since the term fX (x) in (12.2) is common to all label values, it
can be dropped. The MAP classifier is then specified by

ŷMAP (x) = arg max πk pk (x). (12.3)
k∈[1:c]

By taking the natural log of the RHS, this can be equivalently written as

ŷMAP (x) = arg max ln pk (x) + ln(πk ). (12.4)


k∈[1:c]

The term ln pk (x) is the log-likelihood for class k under the observation x. It gives a measure of the
likelihood that the class is k, given the example value x. The term ln πk is a measure of the prior likeli-
hood that the class is k. The sum of the two terms blends prior information with information provided by
observing x, to give an overall measure of the likelihood of label k. Each of the functions gk (x) = πk pk (x)
and hk (x) = ln pk (x) + ln(πk ) measure the likelihood that the class is k given the data. In each case,
classification is performed by taking the maximum over these functions: ŷ(x) = arg maxk∈[1,c] gk (x) =
arg max k ∈ [1, c]hk (x). Hence the functions {gk (x)} and the functions {hk (x)}, k ∈ [1 : c], are called
discriminant functions.

12.2.1 The Binary MAP Classifier


When there are just two classes, the Bayes classifier can be simplified by subtracting the two functions in
(12.4) and comparing the result with 0. This yields the following expression for the classifier:
(    
1, ln pp01 (x)
(x)
+ ln ππ10 > 0
ŷ(x) = (12.5)
0, otherwise.

The term p1 (x)/p0 (x) is called the likelihood ratio, and ln(p1 (x)/p0 (x)) is called the log-likelihood ratio.
One can think of these terms as quantifying the relative evidence in support of class 1 provided by x.
Similarly, ln(π1 /π0 ) quantifies the prior relative evidence in support of class 1. So the classifier (12.5)
decides the class is 1 when the aggregate relative evidence provided by the example and the prior for class 1
is greater than that for class 0.

12.3 MAP Classifiers for Scalar Gaussian Conditional Densities


To provide a more detailed set of examples, in this section we assume the class-conditional densities
fX|Y (x|k), k ∈ [1 : c], are Gaussian. To build insight, we first assume each example is the outcome of
a scalar Gaussian random variable X with mean µk ∈ R and variance σk2 ∈ R+ , k ∈ [1 : c]. Under this
model, the class-conditional densities are
1 1 2 2
pk (x) = fX|Y (x|k) = √ e− 2 (x−µk ) /σk , k ∈ [1 : c]. (12.6)
2πσk

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 157

Given the example value x ∈ Rn , we select the label value to minimize the probability of error. This is a
MAP classification problem based on a scalar valued feature.
The posterior probabilities fY|X (k|x) satisfy

1 1 2 2
fX (x)fY|X (k|x) = πk √ e− 2 (x−µk ) /σk , k ∈ [1 : c].
2πσk
We obtain the MAP classifier by taking the negative of the natural log of this expression, and dropping terms
that don’t depend on k. This yields

x − µk 2
 
ŷMAP (x) = arg min + 2 ln(σk ) − 2 ln(πk ). (12.7)
k∈[1,c] σk

The term σk inside the parentheses in (12.7) scales x − µk to units of class k standard deviations. The
quadratic term thus penalizes the standardized distance between x and µk . The second term adds a penalty
for larger variance, and the third term adds a penalty for lower class probability.
If all class variances equal σ 2 , we obtain the slightly simpler classifier

x − µk 2
 
ŷMAP (x) = arg min − 2 ln(πk )
k∈[1,c] σ (12.8)
= arg min (x − µk )2 − 2σ 2 ln(πk ).
k∈[1,c]

The first line in (12.8) classifies to the closest class mean in standardized units, except for a bias term that
takes into account prior class probability. The second expression is equivalent. It indicates that if the distance
to the mean is not in standardized units, then the bias due to the prior must be scaled by the variance.

12.3.1 Binary Classification


For binary classification, only two terms appear in the minimization in (12.7). These terms can be subtracted
and the result compared to 0. This yields the binary MAP classifier:

1, if x−µ0 2 − x−µ1 2 + 2 ln σ0 + 2 ln π1 > 0;
       
ŷMAP (x) = σ0 σ1 σ1 π0 (12.9)
0, otherwise.

The first term ((x − µ0 )/σ0 )2 on the LHS of (12.9) measures the squared standardized distance between
x and µ0 . The second term does the same computation for class 1. These terms are then subtracted to
determine which is larger. The third term accounts for any difference in class variances. It gives a bonus to
the class of smaller variance. The final term adds a bias in favor of the class of higher prior probability.
There are two interesting special cases. The first is equal class variances: σ02 = σ12 = σ 2 . For simplicity
assume µ1 ≥ µ0 , and let
∆ µ 0 + µ1 ∆ µ1 − µ 0
µ̄ = and κ= .
2 σ
The point µ̄ is the midpoint of the line joining µ0 and µ1 . The term κ is a measure of the dissimilarity of the
class-conditional densities1 . The more dissimilar the class-conditional densities, the more information an
example provides (on average) about its class. For example, increasing σ makes the examples more noisy
and hence less “informative”. Conversely, increasing µ1 − µ0 makes the examples more “informative”.
1
Dissimilarity of densities is measured by KL-divergence. In this case, DKL (N (µ1 , σ 2 )kN (µ0 , σ 2 )) = 21 κ2 .

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 158

0.25 0.25 0.25


0N(0, 2.2) 0N(1, 2.2) 0N(2, 1.0)
0= 0.5 1N(0, 1.0) 0= 0.5 1N( 1, 1.0) 0= 0.5 1N( 2, 1.0)
0.20 1= 0.5 decide 1 0.20 1= 0.5 decide 1 0.20 1= 0.5 decide 1
decide 0 decide 0 decide 0

0.15 0.15 0.15


kN( k, k2)

kN( k, k2)

kN( k, k2)
0.10 0.10 0.10

0.05 0.05 0.05

0.00 0.00 0.00


6 4 2 0 2 4 6 6 4 2 0 2 4 6 6 4 2 0 2 4 6
x x x
0.25 0.25 0.25
0N(0, 2.2) 0N(1, 2.2) 0N(2, 1.0)
0= 0.6 1N(0, 1.0) 0= 0.6 1N( 1, 1.0) 0= 0.6 1N( 2, 1.0)
0.20 1= 0.4 decide 1 0.20 1= 0.4 decide 1 0.20 1= 0.4 decide 1
decide 0 decide 0 decide 0

0.15 0.15 0.15


kN( k, k2)

kN( k, k2)

kN( k, k2)
0.10 0.10 0.10

0.05 0.05 0.05

0.00 0.00 0.00


6 4 2 0 2 4 6 6 4 2 0 2 4 6 6 4 2 0 2 4 6
x x x

Figure 12.1: Six examples of pairs of conditional Gaussian densities and the resulting MAP decision regions. In the
left column, the class means are equal, while in the right column the class variances are equal. The middle column
shows a general example. In the top row π0 = π1 = 0.5, and in the bottom row π0 = 0.6, π1 = 0.4. Note that the
left two figures in both rows have a decision interval for class 1 that is surrounded by the decision region for class 0.
This is due to the quadratic form of the decision surface. In contrast, the two plots on the right use a single threshold
to separate the line into two decision regions. These plots show linear classifiers.

In the current setting, the classifier (12.9) simplifies to


(  
x−µ̄  π1
1, if σ κ + ln π0 > 0;
ŷMAP (x) =
0, otherwise.
(   (12.10)
σ π0
1, x>τ where τ = µ̄ + κ ln π1 ;
=
0, otherwise.

The term (x − µ̄)/σ in (12.10) measures the signed standardized distance from µ̄ to x. If κ > 1, κ amplifies
(x − µ̄)/σ, otherwise it reduces it. In the first case, we weight the example more and discount the prior. In
the second, we discount the example and put more weight on the prior. The second expression in (12.10)
indicates that for equal class variances, the binary Gaussian MAP classifier simply compares x to a threshold
τ , deciding class 1 if x > τ and class 0 otherwise (recall we assumed µ1 ≥ µ0 ). This is a linear classifier.
The classifier is illustrated in the right plots of Figure 12.1.
The second special case is when the class means are equal µ0 = µ1 = µ.2 Under this constraint, the
2 2
  
σ1 σ0
2
In this case, DKL (N (µ, σ12 )kN (µ, σ02 )) = 1
2 2
σ0
− 1 + ln 2
σ1
.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 159

1.0 1.0
R0
R1

1 R1: Probability of success | class 1


0.8 R 0.8
Probability of Error: R0, R1, R

MAP Classifier Performance ROC Curves


N( 0, 2) vs N( 1, 2) N( 0, 2) vs N( 1, 2)
0.6 0.6
0 = 0.6

= 4.00
0.4 0.4 = 2.00
= 1.00
= 0.50
0.2 0.2 =0
MAP 0 = 0.5
MAP 0 = 0.6
0.0 0.0
1 2 3 4 5 6 0.0 0.2 0.4 0.6 0.8 1.0
R0: Probability of error | class 0
Figure 12.2: Binary MAP classification for conditional class densities N (µk , σ 2 ), k ∈ {0, 1}. Left: MAP classifier
performance as a function of κ = |µ1 −µ σ
0|
. The plot shows the conditional probability of error Rk given that the true
label is k ∈ {0, 1}, and the overall probability of error R? . Right: ROC curves for various values of κ. An ROC
curve plots the probability of classifier success given that the true label is 1 versus the probability of error given that
the true label is 0. Each curve is the locus of the point (R0 , 1 − R1 ) as the threshold τ in (12.10) moves from −∞
to ∞. The MAP classifier specifies a particular operating point (determined by τ in (12.10)) on each curve. These
points are indicated for two values of π0 . R0 is sometimes called the probability of a false positive (or false alarm,
false discovery), and 1 − R1 , the probability of a true postive (or true detection, true discovery). The abbreviation
ROC stands for Receiver Operating Characteristic, a term coined in the early days of radar.

classifier (12.9) simplifies to

 2 
σ12
     
x−µ
1, if σ1 σ02
− 1 + 2 ln σσ01 + 2 ln ππ01 > 0;
ŷMAP (x) = (12.11)
0, otherwise.

The decision boundary is the set of points satisfying the above inequality with equality. This is a level set
of a quadratic function in x. Hence one decision region will be surrounded by the other. If σ1 > σ0 , the
quadratic is convex. So the outer region will classify to class 1 and the inner region to class 0. Conversely, if
σ0 > σ1 , the quadratic is concave, the inner region classifies to class 1 and the outer region to class 0. This
case is illustrated in the left plots of Figure 12.1.

Binary Classifier Performance

Let Dk ⊂ R denote the set of x ∈ R for which the binary classifier (12.9) makes decision k, k ∈ {0, 1}.
Clearly D1 = D0c . We call these sets the decision regions of the classifier. For the quadratic function
in (12.9), in general, one decision region is an interval of the form (a, b) and the other takes the form
(−∞, a] ∪ [b, ∞).
The classifier can make two types of error. It can decide the class is 1 when the true class is 0, or it can

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 160

decide the class is 0 when the true class is 1. These errors are made with conditional probabilities
Z
1 1 2 2
Pe|0 = √ e− 2 (x−µ0 ) /σ0 dx (12.12)
2πσ0
ZD1
1 1 2 2
Pe|1 = √ e− 2 (x−µ1 ) /σ1 dx. (12.13)
D0 2πσ1
The total probability of error Pe is then the average of Pe|0 and Pe|1 with respect to the class probabilities:
Pe = π0 Pe|0 + π1 Pe|1 . (12.14)
In the literature, you will often see Pe|1 and Pe|0 denoted by R1 and R0 , respectively. This notation reflects
the description of these quantities as the conditional risks. The probability of error of the Bayes classifier is
then called the Bayes risk and denoted by R? .
It will be useful to determine detailed expressions for R0 , R1 and R? under the assumption that σ02 = σ12 .
To do so assume µ0 ≤ µ1 , let τ denote the threshold in (12.10), and Φ denote the cumulative distribution
function of the N (0, 1) density. Then
Z ∞  
τ − µ0
R0 = Nµ0 ,σ2 (x)dx = 1 − Φ
τ σ
Z τ    
τ − µ1 µ1 − τ
R1 = Nµ1 ,σ2 (x)dx = Φ =1−Φ .
−∞ σ σ
Substituting the value for τ from (12.10), and simplifying yields:
   
τ − µ0 κ 1 π0 µ1 − τ κ 1 π0
= + ln and = − ln .
σ 2 κ π1 σ 2 κ π1
Substituting these expressions into those for R0 and R1 , and averaging over the class probabilities, we find
      
? κ 1 π0 κ 1 π0
R = 1 − π0 Φ + ln + π1 Φ − ln . (12.15)
2 κ π1 2 κ π1
We expect the probability of error to decrease monotonically as κ2 increases. This is indeed the case.
Lemma 12.3.1. The probability of error (12.15) decreases monotonically as κ2 increases.

Proof. When π0 = 1/2, we have π1 = 1−π0 = 1/2 and the probability of error simplifies to Pe = 1−Φ( κ2 ).
This function is clearly strictly monotone decreasing in κ. Since Pe is continuous in κ, it follows that
there must be an open interval around the point 1/2 such that for each π0 in this interval, Pe is monotone
decreasing in κ. The proof of the lemma (given in the appendix) shows that this interval is (0, 1).
Figure 12.2 shows plots of R0 , R1 , and R? versus κ, and of R0 and R1 plotted as ROC curves. Note that
R0 and R1 are not both monotonic in κ, but the weighted sum Pe = R? is monotonically decreasing in κ.

12.4 Bayes Classifiers for Multivariate Gaussian Conditional Densities


Let the random vector X take values in Rn and have mean µ ∈ Rn and covariance matrix Σ ∈ Rn×n . We
say that X is a non-degenerate Gaussian random vector if Σ is positive definite, and X has the density
1 1 − 1 (x−µ)T Σ−1 (x−µ)
fX (x) = n 1 e
2 . (12.16)
(2π) 2 |Σ| 2
This is called a multivariate Gaussian density.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 161

12.4.1 Multivariate Gaussian Bayes Classifiers


Now assume each class conditional density pk (x) = fX|Y (x|k) is a multivariate Gaussian density with mean
µk ∈ Rn and covariance matrix Σk ∈ Rn×n . The densities fX|Y fXY , and fX are then given by
1 1 1 T −1
fX|Y (x|k) = n/2 1/2
e− 2 (x−µk ) Σk (x−µk ) (12.17)
(2π) |Σk |
1 1 1 T −1
fXY (x, k) = πk n/2 1/2
e− 2 (x−µk ) Σk (x−µk ) (12.18)
(2π) |Σk |
c
X 1 1 1 T −1
fX (x) = πk n/2 1/2
e− 2 (x−µk ) Σk (x−µk ) (12.19)
k=1
(2π) |Σk |

The density (12.19) is called a mixture of Gaussians. It is the convex sum of k distinct Gaussian densities.
Taking the natural log of pk (x) = fX|Y (x|k) yields
n 1 1
ln (pk (x)) = − ln(2π) − ln (|Σk |) − (x − µk )T Σ−1
k (x − µk ). (12.20)
2 2 2
Substituting this expression into (12.4), dropping constant terms, and multiplying by −2, yields the MAP
classifier for the assumed model:

ŷ(x) = arg min (x − µk )T Σ−1


k (x − µk ) + ln(|Σk |) − 2 ln(πk ). (12.21)
k

In general, the discriminant functions dk (x) = (x − µk )T Σ−1


k (x − µk ) + ln(|Σk |) − 2 ln(πk ) are quadratic
functions in x. Hence the boundaries between decision regions are piecewise smooth quadratic functions.

Binary Classification
For binary classification only two terms appear in (12.21). These can be subtracted and the result compared
to 0. This yields the classifier,
   
|Σ0 |
(
1, if (x − µ0 )T Σ−1 T −1 π1
0 (x − µ0 ) − (x − µ1 ) Σ1 (x − µ1 ) + ln |Σ1 | + 2 ln π0 > 0;
ŷ(x) =
0, otherwise.
(12.22)

All classes have the same covariance


If the conditional densities pk (x) all have the same covariance matrix Σ, the MAP classifier simplifies to

ŷMAP (x) = arg min kx − µk k2Σ−1 − 2 ln(πk )


k
µTk Σ−1 µk (12.23)
= arg min −µTk Σ−1 x + − ln(πk ).
k 2
Subject to additive offsets that adjust for prior probabilities, this classifier selects the closest class mean µk
under the squared Mahalanobis distance k · k2Σ−1 . The resulting discriminant functions are linear in x.
For binary classification, (12.23) simplifies to the linear classifier
(  
1, if (µ1 − µ0 )T Σ−1 (x − µ0 +µ 2
1
) + ln π1
π0 > 0;
ŷMAP (x) = (12.24)
0, otherwise.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 162

with parameters
w = Σ−1 (µ1 − µ0 )
(12.25)
 
π1
b = −wT (µ0 + µ1 )/2 + ln .
π0

All classes have covariance Σ = σ 2 In


In this case, x is classified by selecting the closest class mean µk subject to a variance weighted adjustment
for the class prior probability:
ŷ(x) = arg min kx − µk k22 − 2σ 2 ln(πk )
k
(12.26)
= arg min −µTk x + 1/2µTk µk − σ 2 ln(πk )
k

If the classes are equiprobable, the classifier selects the closest class mean. For two classes, the classifier is
linear, with w and b given by (12.25) using Σ = σ 2 In .

12.5 Learning Bayes Classifiers


In machine learning, we don’t know a priori the generative model for the training and testing data. One
approach is to assume a specific generative model (e.g., Gaussian class-conditional densities) and then use
the training data to estimate the unknown class prior and the parameters of the class conditional densities.
The resulting estimates can then be used in the MAP classifier for the assumed generative model. Some
well-known machine learning classifiers, including nearest centroid, naive Gaussian Bayes, and the linear
discriminant analysis classifier are examples of such classifiers.
Given a set of labelled training data, one can estimate the class prior probabilities by counting the fraction
of occurrence of each class in the training data. However, estimating class-conditional densities is a more
nuanced problem. A primary consideration is to model the data adequately while avoiding overfitting the
model. Generally, this requires limiting or regularizing the degrees of freedom of the model. For example,
to use the classifier (12.21) one could estimate the unknown parameters of the corresponding generative
model: µk , Σk , and πk , k ∈ [1 : c]. Since each Σk has (n2 + n)/2 free parameters, the number of parameters
to be estimated is quadratic in the number of features (dimension of the examples). If the amount of training
data is constrained, and n is large, this can be problematic. To ameliorate this problem, one could restrict
the degrees of freedom of the model by assuming the features are independent. This restriction allows
a separate density estimate for each feature. For example, for Gaussian class-conditional densities this
assumption requires estimating the sample mean and variance of each feature, a total of 2cn quantities.
Since the features are rarely independent in practice, empirical Bayes classifiers based on this assumption
are called Naive Bayes Classifiers.

12.5.1 Nearest Centroid Classifier


The nearest centroid classifier is an empirically derived Bayes classifier for a Gaussian model in which the
covariance matrix of each class is assumed to have the form Σk = σ 2 In , k ∈ [1, c]. This model assumes
the features are independent, each has the same variance, and the variance is not class dependent. This
assumption is stronger than Naive Bayes. Under this assumption, we only need to estimate the class sample
means, and one scalar variance over all examples. Then use the estimates in the MAP classifier (12.26).

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 163

The Nearest Centroid Classifier. This is an empirically derived Gaussian Bayes classifier that as-
sumes all classes have the same spherical covariance. If m and mk denote the number of examples and
the number of examples in class k, respectively, then the parameter estimates are
c n
mk 1 X 1 XXX
π̂k = , µ̂k = xj , σ̂ 2 = (xj (i) − µ̂k (i))2 .
m mk nm
yj =k k=1 yj =k i=1

The classifier is formed by using these estimates in the MAP classifier (12.26).

12.5.2 Gaussian Naive Bayes


The Gaussian Naive Bayes classifier uses the training data to fit multivariate Gaussian densities to each
class, under the assumption that the features are independent Gaussian random variables. This requires that
the covariance matrix for each class is diagonal:

Σk = diag σk2 (1), . . . , σk2 (n) , k ∈ [1 : c].


 

The corresponding MAP classifier is


n 
!
x(i) − µk (i) 2
X 
σk2 (i)

ŷ(x) = arg min + ln − 2 ln(πk ). (12.27)
k σk (i)
i=1

In general, this classifier has quadratic decision surfaces.

Gaussian Naive Bayes. This is an empirically derived Gaussian Bayes classifier that assumes inde-
pendent features. If m and mk denote the number of examples and the number of examples in class k,
respectively, then
mk 1 X 1 X
π̂k = , µ̂k (i) = xj (i), σ̂k2 (i) = (xj (i) − µ̂k (i))2 , i ∈ [1 : n], k ∈ [1 : c].
m mk mk
y(j)=k y(j)=k

The classifier is formed by using these estimates in the MAP classifier (12.27).

12.5.3 Linear Discriminant Analysis


We first present Linear Discriminant Analysis (LDA) roughly as originally proposed. Then we examine an
alternative viewpoint yielding the same result. Roughly, LDA seeks a direction w such that after projection
onto the line in direction w, the classes in the labelled data are “maximally separated” [14]. One can draw
an analogy with PCA applied to unlabelled data. In PCA, the first principal component is the line on which
the projected data has maximal sample variance. In contrast, LDA uses labeled data. Instead of maximizing
variance, LDA seeks the line direction such that the projection of the labelled data onto the line maximally
separates the two classes. In other words, select w such that for some b, the linear classifier (w, b) gives the
best separation of the projected training data into two classes.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 164

To set the stage, recall that the line though the origin in the direction of w ∈ Rn is the set of all points
of the form αw with α ∈ R. Each point p on this line is uniquely specified by its corresponding coordinate
α ∈ R. Without loss of generality, we will assume that kwk2 = 1. Hence the orthogonal projection of
x ∈ Rn onto the line is the point p(x) = (wT x)w, and the coordinate of p(x) with respect to w is wT x.
Now consider a binary labelled training dataset {xi , yi }m n
i=1 with xi ∈ R and yi ∈ {0, 1}, i ∈ [1 : m].
n
A linear classifier for this dataset consists of a pair w ∈ R and b ∈ R. Since any nonzero scaling of w
and b does not change the classifier, we assume that kwk2 = 1. The classifier orthogonally projects the
dataset onto the line through the origin in the direction w, yielding a labelled set of scalars {(wT xi , yi )}m
i=1 .
If (wT x) > −b, (wT x) is predicted to be in one class, and in the other class otherwise.
Bring in the class sample means µ̂k and sample covariance matrices R̂k , k = 0, 1. Then the class sample
means and variances of the projected training examples {(wT xi , yi )}m i=1 are given by
 P 
ν̂k (w) = m1k yi =k wT xi = wT m1j yi =k xi = wT µ̂k
P

1 X T
σ̂k2 (w) = (w (xi − µ̂k ))2 = wT R̂k w, k = 0, 1.
mk
yi =k

To improve class separation under the projection, it seems reasonable to select w to make (ν̂1 (w) − ν̂0 (w))2
large (push the projected class means apart). However, this ignores the dependence of the variances σ̂k2 (w)
on w. Large variances could hinder class separation. To resolve this, LDA selects the unit norm vector w to
maximize the ratio of the squared distance between the projected means
(ν̂0 (w) − ν̂1 (w))2 = wT (µ̂0 − µ̂1 )(µ̂0 − µ̂1 )T w,
and the weighted sum of class variances
m0 T m1 T
w R̂0 w + w R̂1 w = wT (S0 + S1 )w,
m m
where we have dropped m since it is a constant, and Sk denotes the scatter matrix of class k = 0, 1. This
leads to the objective function
wT (µ̂0 − µ̂1 )(µ̂0 − µ̂1 )T w wT SB w
J(w) = = ,
wT (S0 + S1 )w w T SW w
where
SB = (µ̂0 − µ̂1 )(µ̂0 − µ̂1 )T
is called the between-class scatter matrix and
X X
SW = S0 + S1 = (xi − µ̂0 )(xi − µ̂0 )T + (xi − µ̂1 )(xi − µ̂1 )T
yi =0 yi =1

is called the within-class scatter matrix. By this reasoning we arrive at the LDA optimization problem:
? wT SB w
wLDA = arg maxn
w∈R wT SW w (12.28)
s.t. kwk22 = 1.
Problem (12.28) is an instance of a generalized Rayleigh quotient problem. In general, finding a solution
involves finding the maximum eigenvalue and a corresponding unit norm eigenvector of an appropriate
matrix (see Appendix D). However, in this instance, the rank one structure of SB ensures that (12.28) has a
?
simple solution: up to a scale factor, wLDA must be a solution of the set of linear equations
SW w = µ0 − µ1 .
This solution can then be scaled to ensure it has unit norm (Exercise 12.1).

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 165

An Alternative Perspective
Assume the example value x ∈ Rn and its label y ∈ {0, 1} are the outcomes of random variables X and Y,
with the joint probability density specified in the factored form (12.1). Let the class-conditional densities be
N (µk , Σ), and πk denote the prior probability of class k, k ∈ {0, 1}. By (12.24), the MAP classifier for this
? T
model is linear. It predicts class 1 if wMAP x + b?MAP > 0, and class 0 otherwise, where
?
wMAP = Σ−1 (µ1 − µ0 )
(12.29)
 
T π1
b?MAP = ?
−wMAP (µ0 + µ1 )/2 + ln .
π0
To link this model to the LDA problem, let w ∈ Rn have unit norm and consider the scalar random variable
Z = wT X. Z will have a scalar class-conditional densities N (νk , σ 2 ), and class prior probabilities πk ,
k = 0, 1. We must have
νk = EN (µk ,Σ) wT X = wT EN (µk ,Σ) [X] = wT µk , k = 0, 1,
 

σ 2 = EN (µk ,Σ) (wT X − νk )2 = EN (µk ,Σ) (wT (X − µk ))(X − µk )T w = wT Σw.


   

Without loss of generality, assume that ν0 < ν1 . Then the MAP classifier for the scalar model is given by
(12.10). Translated into the current notation, this has the form
(
1, wT x + b > 0;
ŷ(wT x) = (12.30)
0, otherwise.
where the threshold b is given by
wT Σw
 
T π1
b = −w (µ0 + µ1 )/2 + T ln . (12.31)
w (µ1 − µ0 ) π0
Expression (12.31) is an extension to general w of the expression for b?MAP in (12.29). The first terms in
the expressions for b?MAP and b are the same. Moreover, if we substitute the value of w given in (12.29) into
the second term in (12.31), it simplifies to the second term in the expression for b?MAP .
We now select w to minimize the classifier probability of error Pe given in (12.15). In the current
situation
(ν1 − ν0 )2 wT (µ0 − µ1 )(µ0 − µ1 )T w
κ2 = = .
σ2 wT Σw
By Lemma 12.3.1, the probability of error decreases monotonically as κ2 increases. Hence we select w to
maximize κ2 :
wT (µ0 − µ1 )(µ0 − µ1 )T w
maxn
w∈R wT Σw (12.32)
2
s.t. kwk2 = 1.
Modulo a scaling of the objective function, problem (12.32) is the same as problem (12.28). We now make a
few more observations. The Bayes classifier (12.29) minimizes the probability of error for its corresponding
model. Since problem (12.32) includes the normalized solution wMAP ? in its feasible set, the solution of
(12.32) is at least as good as that of (12.29). The optimality of (12.29) then implies that both solutions are
equivalent in the sense that each specifies the same hyperplane. So the binary LDA classifier determined by
solving (12.32), is the binary Bayes classifier for the “common covariance” Gaussian model. We summarize
LDA in the box below.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 166

The Binary LDA Classifier. LDA is an empirically derived Gaussian Bayes classifier that assumes
both classes have the same covariance. This assumption results in a linear classifier. Assuming the
estimated covariance matrix Σ̂ is full rank, the parameters of the LDA classifier can be specified by

wT (µ̂0 − µ̂1 )(µ̂0 − µ̂1 )T w


w? = arg maxn
w∈R wT Σ̂w
s.t. wT w = 1. (12.33)
w? T Σ̂w?
 
(µ̂0 + µ̂1 ) π̂1
b? = −w? T + ?T ln .
2 w (µ̂0 − µ̂1 ) π̂0

or simply as the solution of the linear equations

Σ̂w? = µ̂1 − µ̂0


(12.34)
 
? ?T (µ̂0 + µ̂1 ) π̂1
b = −w + ln .
2 π0

12.6 Kullback-Leibler Divergence


A useful measure of the distinctiveness or dissimilarity of densities (or pmfs) p0 (x) and p1 (x) is the
Kullback-Leibler Divergence (KL-divergence, or divergence for short). This is denoted by D(p1 kp0 ), and
defined as follows
Z     
p1 (x) p1 (x)
D(p1 kp0 ) = p1 (x) ln dx = Ep1 ln ,
p0 (x) p0 (x)

provided the expectation exists (is finite). By definition, KL-divergence is the expected value under p1 of the
log-likelihood ratio ln(p1 (x)/p0 (x)). In general, D(p1 (x)kp0 (x)) 6= D(p0 (x)kp1 (x)). So KL-divergence
is not symmetric. Hence it is not a true distance. However, it is nonnegative. To prove this we use the
following extension of Jensen’s inequality.

Theorem 12.6.1 (Jensen’s Inequality [11]). If f is a convex function and X is a random variable, then

f (E [X]) ≤ E [f(X)] . (12.35)

Moreover, if f is strictly convex, then equality in (12.35) implies that X = E [X] with probability 1.

Proof. See [11, Theorem 2.6.2 ].

When two functions are equal except on a set of points with probability zero, we say that the functions
are equal with probability one (w.p.1.). A closely related concept is that two functions are equal almost
everywhere (a.e.). This holds when the functions are equal except on a set that has zero “measure”. An
example of a set measure is the probability of the set. Another example is the length of an interval. The cor-
responding measure for sets built from unions and intersections of intervals is derived accordingly. We will
not need to go into the details of set measures, except to understand the concept of equal almost everywhere.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 167

Theorem 12.6.2. For all p0 (x), p1 (x) for which D(p1 kp0 ) is defined, D(p1 kp0 ) ≥ 0 with equality if
and only if p1 = p0 a.e..

Proof. We use the strict convexity of the function − ln(·) together with Jensen’s inequality to obtain

D(p1 kp0 ) = Ep1 [ln(p1 /p0 )]


= Ep1 [− ln(p0 /p1 )]
≥ − ln(Ep1 [p0 /p1 ]) by Jensen’s inequality
= − ln 1
= 0.

Suppose D(p1 kp0 ) = 0. Then there must be equality in Jensen’s inequality. Theorem 12.6.1 then implies
that ln(p1 /p0 ) = 0 a.e., and hence that p1 = p0 a.e..

12.7 The Exponential Family


A density or pmf fX for a random variable X taking values in a subset D ⊆ Rn is said to belong to the
exponential family if it can be written in the form
1
fX (x) = h(x)e<θ,t(x)> . (12.36)
Z(θ)

Here θ is a vector parameter taking values in a finite dimensional inner product space H, t(x) is a vector
valued function of x taking values in H, h(x) is a real-valued function of x, and Z(θ) is a real-valued
function of θ. The term <θ, t(x)> denotes the inner product of θ and t(x) in H.
We require that h(x)e<θ,t(x)> is non-negative. Hence h(x) ≥ 0. The purpose of Z(θ) is to ensure that
fX (x) integrates to 1. Hence we also require that
Z
h(x)e<θ,t(x)> dx < ∞.

Z(θ) = (12.37)
D

The admissible set of natural parameter values, denoted by Θ, consists of all θ for which (12.37) holds. The
partition function Z is hence a real valued function defined on Θ ⊆ H.
The parameter θ is called the natural parameter, t(x) is called the sufficient statistic, and Z(θ) is called
the partition function of the density. An alternative but equivalent notation is to set A(θ) = ln(Z(θ)) and
write
fX (x) = h(x)e<θ,t(x)>−A(θ) .
A(θ) is called the log partition function.
The parameterization (12.36) is non-redundant if every density in the family is specified by a unique
natural parameter θ ∈ Θ. If the parameterizarion is redundant, then for some θ0 , θ1 ∈ Θ, with θ0 6= θ1 ,
fθ1 (x) = fθ0 (x) a.e..

Connection to KL-Divergence and MAP Classification


It is easy to compute the KL-divergence between exponential family densities of the same form, and it easy
to find the corresponding MAP classifier. We state these results in the following two theorems.

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 168

Theorem 12.7.1. Let fk (x) = 1 <θk ,t(x)> ,


Z(θk ) h(x)e k = 0, 1, be density functions. Then
 
Z(θ0 )
D(f1 kf0 ) = <(θ1 − θ0 ), Ef1 [t(X)] > + ln . (12.38)
Z(θ1 )

Proof. Exercise.

Theorem 12.7.2. Let fk (x) = Z(θ1 k ) h(x)e<θk ,t(x)> , k = 0, 1, be density functions. Then the binary
MAP classifier under class-conditional densities f0 , f1 is
(    
Z(θ0 ) π1
1, if <(θ1 − θ0 ), t(x)> + ln Z(θ 1)
+ ln π0 > 0;
ŷ(x) = (12.39)
0, otherwise.

Proof. Exercise.

By computing the (possibly) nonlinear function t(x) we obtain an alternative (possibly lossy) represen-
tation of x. Nevertheless, the MAP classifier can be exactly implemented using t(x). Moreover, the resulting
the MAP classifier is linear in t(x).

12.7.1 Examples
Example 12.7.1 (Exponential Density). The scalar exponential density is defined on R+ by

f (x) = λe−λx , x ∈ [0, ∞).

Here λ > 0 is fixed parameter. This density has the form (12.36) with

h(x) = 1, t(x) = −x, θ = λ, Z(θ) = 1/θ.

Using (12.38), the KL-divergence of two exponential densities with parameters λ0 and λ1 is
 
λ0 λ1
D(λ1 kλ0 ) = − 1 + ln ,
λ1 λ0

and using 12.39, the corresponding binary MAP classifier is linear in t(x) with parameters
   
λ1 π1
w = (λ1 − λ0 ), b = ln + ln .
λ0 π0

Example 12.7.2 (Poisson pmf). The Poisson pmf is defined on the natural numbers N by

λk −λ
f (k) = e , k ∈ N.
k!
Here λ > 0 is parameter of the density. This density is often used to model the number of events occurring
over a fixed interval of time. The density can be placed in the form (12.36) by setting
1 θ
h(k) = , t(k) = k, θ = ln λ, Z(θ) = eλ = ee .
k!
c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 169

Using (12.38), the KL-divergence of poisson pmfs with parameters λ0 and λ1 is


 
λ1
D(λ1 kλ0 ) = λ1 ln + (λ0 − λ1 ),
λ0

and by (12.39), the corresponding binary MAP classifier is the linear in t(k) with parameters
   
λ1 π1
w = ln , b = (λ0 − λ1 ) + ln .
λ0 π0

Example 12.7.3 (Bernoulli pmf). The Bernoulli pmf is defined on two outcomes {0, 1} by f (1) = p and
f (0) = 1 − p, where p ∈ [0, 1] is a parameter. We can write this pmf in the form (12.36) as follows:

f (x) = px (1 − p)(1−x) = ex ln(p)+(1−x) ln(1−p) = ex ln(p/(1−p)) (1 − p).

Hence we can take  


p
h(x) = 1, t(x) = x, θ = ln .
1−p
To find Z(θ) we first invert the formula for θ. This gives

1 1 e−θ 1
p= and 1−p=1− = = . (12.40)
(1 + e−θ ) 1 + e−θ 1 + e−θ 1 + eθ

Thus Z(θ) = 1 + eθ .
We can now use 12.38 to determine the KL-divergence between two members of the family with param-
pk
eters θk = ln( 1−p k
), k = 0, 1. First note that

pk 1
Z(θk ) = 1 + eθk = 1 + = , k = 0, 1
1 − pk 1 − pk
     
p1 p0 p1 (1 − p0 )
θ1 − θ0 = ln − ln = ln .
1 − p1 1 − p0 p0 (1 − p1 )
Hence we have
     
Z(θ0 ) p1 (1 − p0 ) 1 − p1
D(p1 kp0 ) = (u1 − u0 )Ep1 [x] + ln = ln p1 + ln . (12.41)
Z(θ1 ) p0 (1 − p1 ) 1 − p0

Similarly, by (12.39) the corresponding MAP classifier is linear in t(x) with parameters
     
p1 (1 − p0 ) 1 − p1 π1
w = ln , b = ln + ln .
p0 (1 − p1 ) 1 − p0 π0
Assume the prior is uniform. If x = 0, the classifier decides according to which of the probabilities 1 − p0
or 1 − p1 is larger. But if x = 1, it decides according to which of the probabilities p0 or p1 is larger.
Example 12.7.4 (Binomial pmf). The binomial pmf gives the probability of k successes in n independent
Bernoulli trials when the probability of success in each trial is p. Hence
 
n k
f (k) = p (1 − p)n−k , k ∈ [0 : n].
k

c Peter J. Ramadge, 2015, 2016, 2017, 2018. Please do not distribute without permission.
ELE 435/535 Fall 2018 170

When the number of trials is fixed, the binomial pmf is in the exponential family. To see this write
 
n k ln p+(n−k) ln(1−p) h(k) θT t(k)
f (k) = e = e ,
k Z(θ)
 
p
, t(k) = k, h(k) = nk , and Z(θ) = e−n ln(1−p) = (1 + eθ )n .

where θ = ln 1−p
It follows that the KL-divergence of two binomial pmfs using the same number of trials n but distinct
parameters p0 and p1 is
   
p1 1 − p0 1 − p1
D(B(n, p1 )kB(n, p0 )) = < ln , EB(n,p1 ) [K] > + n ln
1 − p1 p0 1 − p0
   
p1 (1 − p0 ) 1 − p1
= np1 ln + n ln
p0 (1 − p1 ) 1 − p0
   
p1 1 − p1
= np1 ln + n(1 − p1 ) ln .
p0 1 − p0
Similarly using (12.39), the corresponding MAP classifier is linear in t(x) with parameters
   
p1 (1 − p0 ) 1 − p1
w = ln , b = n ln .
p0 (1 − p1 ) 1 − p0

x−µ 2
1
e− /2( )
1
Example 12.7.5 (Univariate Gaussian). The univariate Gaussian density f (x) = √2πσ σ , can
written as
1 1 2 2 1 1 2 µ µ2 1 T
f (x) = √ e− 2σ2 (x −2µx+µ ) = √ e− 2σ2 x + σ2 x e− 2σ2 = h(x)eθ t(x) ,
2πσ 2πσ Z(θ)
where
2 T µ −1 √ µ2
r
π −θ(1)2
h(x) = 1, t(x) = [x, x ] , θ = [ 2 , 2 ]T , Z(θ) = 2πσe 2σ 2 = e 4θ(2) .
σ 2σ −θ(2)
Notice that x is a scalar, but t(x) is a vector of dimension 2.
Using 12.38, the KL-divergence of two exponential densities N (µk , σk2 ) with natural parameters θk ,
k = 0, 1, is given by
   
2 2 µ1 µ0 1 1 Z(θ0 )
D(N (µ1 , σ1 )kN (µ0 , σ0 )) = − , − Ep1 [t(X)] + ln
σ12 σ02 2σ02 2σ12 Z(θ1 )
2 2
!
σ0 e1/2µ0 /σ0
  
µ1 µ0 1 1 µ1
= − , − + ln
σ12 σ02 2σ02 2σ12 σ12 + µ21 2 2
σ1 e1/2µ1 /σ1
   2
µ21
   
µ1 µ0 1 1 µ1 σ0 µ0
= − , − + ln + −
σ12 σ02 2σ02 2σ12 σ12 + µ21 σ1 2σ02 2σ12
 !
µ0 − µ1 2 σ12
 
1 σ0
= + 2 − 1 + 2 ln .
2 σ0 σ0 σ1

By (12.39), the corresponding MAP classifier is a linear classifier of the sufficient statistic t(x) ∈ R2 with
   2
µ21
    
µ1 µ0 1 1 σ0 µ0 π1
wT = − , − , b = ln + − + ln .
σ12 σ02 2σ02 2σ12 σ1 2σ02 2σ12 π0


Example 12.7.6 (Multivariate Gaussian). The density f(x) = ( 1/((2π)^{n/2}|Σ|^{1/2}) ) e^{−(1/2)(x−µ)^T Σ^{−1}(x−µ)} can be written as

f(x) = ( 1/((2π)^{n/2}|Σ|^{1/2}) ) e^{−(1/2)( x^T Σ^{−1}x − 2µ^T Σ^{−1}x + µ^T Σ^{−1}µ )}
 = ( 1/((2π)^{n/2}|Σ|^{1/2}) ) e^{µ^T Σ^{−1}x − trace((1/2)Σ^{−1}xx^T)} e^{−(1/2)µ^T Σ^{−1}µ}
 = ( 1/Z(θ) ) h(x) e^{<θ, t(x)>},

where

h(x) = 1,   t(x) = (x, −(1/2)xx^T),   θ = (Σ^{−1}µ, Σ^{−1}),

and

Z(θ) = (2π)^{n/2}|Σ|^{1/2} e^{(1/2)µ^T Σ^{−1}µ} = (2π)^{n/2} det(θ(2))^{−1/2} e^{(1/2)θ(1)^T θ(2)^{−1} θ(1)}.
Here t(x), θ ∈ Rn × Sn with the inner product <(x, M ), (y, N )> = <x, y> + <M, N >.
We now compute the KL-divergence of two Gaussian densities N(µk, Σk), k = 0, 1, with natural parameters θ0 and θ1. We first list some intermediate results:

E_{N(µ1,Σ1)}[t(x)] = E_{N(µ1,Σ1)}[ (x, −(1/2)xx^T) ] = ( µ1, −(1/2)(Σ1 + µ1µ1^T) ),

<θ1 − θ0, E_{N(µ1,Σ1)}[t(x)]> = <( Σ1^{−1}µ1 − Σ0^{−1}µ0, Σ1^{−1} − Σ0^{−1} ), ( µ1, −(1/2)(Σ1 + µ1µ1^T) )>
 = µ1^T Σ1^{−1}µ1 − µ0^T Σ0^{−1}µ1 − (1/2) trace( In − Σ0^{−1}Σ1 + Σ1^{−1}µ1µ1^T − Σ0^{−1}µ1µ1^T )
 = µ1^T Σ1^{−1}µ1 − µ0^T Σ0^{−1}µ1 + (1/2)( −n + trace(Σ0^{−1}Σ1) − µ1^T Σ1^{−1}µ1 + µ1^T Σ0^{−1}µ1 ),

ln( Z(θ0)/Z(θ1) ) = ln( ( |Σ0|^{1/2} e^{(1/2)µ0^T Σ0^{−1}µ0} ) / ( |Σ1|^{1/2} e^{(1/2)µ1^T Σ1^{−1}µ1} ) ) = (1/2) ln( |Σ0|/|Σ1| ) + (1/2)µ0^T Σ0^{−1}µ0 − (1/2)µ1^T Σ1^{−1}µ1.

Putting these results together we find:

D(N(µ1, Σ1)‖N(µ0, Σ0)) = (1/2) ln( |Σ0|/|Σ1| ) + (1/2)( trace(Σ0^{−1}Σ1) − n + µ1^T Σ0^{−1}µ1 − 2µ0^T Σ0^{−1}µ1 + µ0^T Σ0^{−1}µ0 )
 = (1/2)( ln( |Σ0|/|Σ1| ) + trace(Σ0^{−1}Σ1) − n + (µ1 − µ0)^T Σ0^{−1}(µ1 − µ0) ).

The corresponding MAP classifier is a linear classifier of the sufficient statistic t(x) = (x, −(1/2)xx^T) ∈ Rn × Rn×n. Let w = (v, Ω). Then

w = θ1 − θ0 = ( Σ1^{−1}µ1 − Σ0^{−1}µ0, Σ1^{−1} − Σ0^{−1} ).

Hence v = Σ1^{−1}µ1 − Σ0^{−1}µ0 and Ω = Σ1^{−1} − Σ0^{−1}. The MAP classifier then takes the form:

ŷ(x) = 1 if <w, t(x)> + b > 0, and ŷ(x) = 0 otherwise.


We have

<w, t(x)> = v^T x − (1/2)<Ω, xx^T>
 = µ1^T Σ1^{−1}x − µ0^T Σ0^{−1}x − (1/2) trace( (Σ1^{−1} − Σ0^{−1}) xx^T )
 = µ1^T Σ1^{−1}x − µ0^T Σ0^{−1}x − (1/2)( x^T Σ1^{−1}x − x^T Σ0^{−1}x ),

and

b = (1/2) ln( |Σ0|/|Σ1| ) + (1/2)µ0^T Σ0^{−1}µ0 − (1/2)µ1^T Σ1^{−1}µ1 + ln(π1/π0).
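The sketch below checks this quadratic-in-x (linear-in-t(x)) MAP rule against a direct comparison of the weighted Gaussian log-densities. The means, covariances and priors are illustrative assumptions chosen at random.

```python
import numpy as np
rng = np.random.default_rng(0)

# Illustrative class-conditional Gaussians and priors (assumptions for this sketch).
n = 3
mu0, mu1 = np.zeros(n), np.array([1.0, -1.0, 0.5])
A0, A1 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
S0, S1 = A0 @ A0.T + n * np.eye(n), A1 @ A1.T + n * np.eye(n)   # PD covariances
pi0, pi1 = 0.4, 0.6

S0i, S1i = np.linalg.inv(S0), np.linalg.inv(S1)
v = S1i @ mu1 - S0i @ mu0                 # weight on x
Omega = S1i - S0i                         # weight paired with -1/2 x x^T
b = (0.5 * np.log(np.linalg.det(S0) / np.linalg.det(S1))
     + 0.5 * mu0 @ S0i @ mu0 - 0.5 * mu1 @ S1i @ mu1 + np.log(pi1 / pi0))

def map_label(x):
    score = v @ x - 0.5 * x @ Omega @ x + b   # <w, t(x)> + b
    return int(score > 0)

def loggauss(x, mu, S, Si):
    d = x - mu
    return -0.5 * (n * np.log(2 * np.pi) + np.log(np.linalg.det(S)) + d @ Si @ d)

for _ in range(200):
    x = rng.standard_normal(n) * 3
    direct = int(np.log(pi1) + loggauss(x, mu1, S1, S1i)
                 > np.log(pi0) + loggauss(x, mu0, S0, S0i))
    assert map_label(x) == direct
print("Quadratic MAP rule matches the direct log-density comparison.")
```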

12.7.2 Geometry of the Log Partition Function


Our objective is to examine the geometry of the log partition function ln(Z(θ)). But first we digress to introduce a useful mathematical result.

Hölder’s Inequality

A real valued function f defined on a subset D ⊆ Rn is said to be integrable if ∫_D |f(x)| dx < ∞. We often have functions f, g and want to show that the product f g is integrable. For example, suppose we know that f, g are square integrable (i.e., f² and g² are integrable); does that imply that f g is integrable? Hölder's inequality tells us that the answer is yes. More generally, Hölder's inequality tells us that for any real numbers p, q > 1 with 1/p + 1/q = 1, if |f|^p and |g|^q are integrable then so is f g.

Theorem 12.7.3 (Hölder's Inequality). Let p, q > 1 be real numbers satisfying 1/p + 1/q = 1. Let f(x), g(x) be functions defined on a subset D of Rn with ∫_D |f(x)|^p dx < ∞ and ∫_D |g(x)|^q dx < ∞. Then f(x)g(x) is an integrable function and

∫_D |f(x)g(x)| dx ≤ ( ∫_D |f(x)|^p dx )^{1/p} ( ∫_D |g(x)|^q dx )^{1/q}.   (12.42)

Equality holds in (12.42) if and only if |f(x)|^p/‖f‖_p^p = |g(x)|^q/‖g‖_q^q a.e., where ‖f‖_p = ( ∫_D |f(x)|^p dx )^{1/p} and ‖g‖_q = ( ∫_D |g(x)|^q dx )^{1/q}.

Proof. A proof can be found, for example, in [1, 2, 15, 37]. The proofs given in Bartle [2] and Fleming [15]
make the condition for equality explicit.
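As a crude numerical illustration of (12.42), the snippet below evaluates both sides on a grid over D = [0, 1]; the test functions are arbitrary choices, not from the text, and the sums with weight dx play the role of the integrals.

```python
import numpy as np

# Grid approximation of the integrals in Hölder's inequality on D = [0, 1].
x = np.linspace(0.0, 1.0, 10001)
dx = x[1] - x[0]
f = np.exp(np.sin(7 * x))          # arbitrary nonnegative test functions
g = 1.0 + x**2
for p in (1.5, 2.0, 3.0):
    q = p / (p - 1.0)              # conjugate exponent: 1/p + 1/q = 1
    lhs = np.sum(np.abs(f * g)) * dx
    rhs = (np.sum(np.abs(f)**p) * dx)**(1/p) * (np.sum(np.abs(g)**q) * dx)**(1/q)
    print(f"p = {p:.1f}:  {lhs:.4f} <= {rhs:.4f}")
    assert lhs <= rhs * (1 + 1e-12)
```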

Convexity of the Log Partition Function

We now examine the convexity of the set Θ and the convexity of the log partition function.

Theorem 12.7.4. The admissible set Θ of a density (or pmf) f in the exponential family is convex, and
its log partition function ln(Z(θ)) is a convex function on Θ. If the parameterization of the density is
non-redundant, then ln(Z(θ)) is strictly convex on Θ.


Proof. Let θ0, θ1 ∈ Θ. Then Z(θk) is nonnegative and finite for k = 0, 1. Hence for any α ∈ (0, 1), Z(θ0)^{1−α} and Z(θ1)^α are finite. Now consider the point θα = (1 − α)θ0 + αθ1. By definition,

Z(θα) = ∫ h(x) e^{<θα, t(x)>} dx
 = ∫ ( h(x)^{1−α} e^{<(1−α)θ0, t(x)>} )( h(x)^α e^{<αθ1, t(x)>} ) dx
 = ∫ g0(x) g1(x) dx        [g0^{1/(1−α)}(x) and g1^{1/α}(x) are integrable]
 ≤ ( ∫ g0^{1/(1−α)}(x) dx )^{1−α} ( ∫ g1^{1/α}(x) dx )^{α}        [Hölder's inequality]
 = Z(θ0)^{1−α} Z(θ1)^α.

From the above bound and the finiteness of Z(θ0)^{1−α} and Z(θ1)^α, it follows that Z(θα) is finite. So θα ∈ Θ, and thus Θ is a convex set. Taking the log of the above inequality yields ln(Z(θα)) ≤ (1 − α) ln(Z(θ0)) + α ln(Z(θ1)). Hence the log partition function is convex on Θ. It is strictly convex if for each α ∈ (0, 1) Hölder's inequality is strict. Set p = 1/(1 − α) and q = 1/α. There is equality in Hölder's inequality at α if and only if

g0^p(x)/‖g0‖_p^p = h(x)e^{<θ0, t(x)>}/Z(θ0) = h(x)e^{<θ1, t(x)>}/Z(θ1) = g1^q(x)/‖g1‖_q^q   a.e.

Thus strict convexity follows from the assumption that the density parameterization is non-redundant.

The Gradient of the Log Partition Function


It will be useful to determine the derivative and gradient of the log partition function for the density f in (12.36).

Lemma 12.7.1. The gradient of the log partition function of f is ∇ ln(Z(θ)) = Ef [t(X)].

Proof. We have Dθ ln(Z(θ))(v) = (1/Z(θ)) Dθ Z(θ)(v). So if Dθ Z(θ)(v) exists, then the previous equation gives Dθ ln(Z(θ))(v). To take the derivative of Z(θ) we interchange the derivative operator and the integration. An interchange of limiting operations is something we need to check; this issue is discussed in Appendix 12.8. Making the exchange and taking the derivative of the function inside the integral yields

Dθ Z(θ)(v) = Dθ( ∫ h(x) e^{<θ, t(x)>} dx )(v)
 = ∫ <v, t(x)> h(x) e^{<θ, t(x)>} dx
 = <v, ∫ t(x) h(x) e^{<θ, t(x)>} dx>.

Putting the first and second result together we find

Dθ ln(Z(θ))(v) = <v, ∫ t(x) ( h(x)/Z(θ) ) e^{<θ, t(x)>} dx> = <v, Ef[t(X)]>.

Hence ∇ ln(Z(θ)) = Ef[t(X)].
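A quick numerical sanity check of Lemma 12.7.1 is sketched below for the Bernoulli family of Example 12.7.3, where Z(θ) = 1 + e^{θ} and Ef[t(X)] = E[X] = 1/(1 + e^{−θ}); the particular θ values are arbitrary.

```python
import numpy as np

def logZ(theta):                      # log partition of the Bernoulli family
    return np.log1p(np.exp(theta))

for theta in (-2.0, -0.3, 0.0, 1.2, 3.0):
    eps = 1e-6
    grad_numeric = (logZ(theta + eps) - logZ(theta - eps)) / (2 * eps)
    mean_t = 1.0 / (1.0 + np.exp(-theta))          # E_f[t(X)] = p for this θ
    assert abs(grad_numeric - mean_t) < 1e-6
    print(f"θ={theta:+.1f}: d/dθ ln Z = {grad_numeric:.6f}, E[t(X)] = {mean_t:.6f}")
```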


Figure 12.3: The geometry of the log partition function of a density in the exponential family.

The Derivative Provides a Global Lower Bound

We now appeal to Theorem 7.4.3. This theorem indicates that for a differentiable convex function f over a
convex domain C and any fixed point x1 in the interior of C, for every x ∈ C

f (x) ≥ f (x1 ) + <∇f (x1 ), x − x1 >.

So the derivative provides a global lower bound for the function. This is illustrated in Figure 7.3.
Now consider the log partition function. The set of all points (θ, ln(Z(θ))) ∈ Θ × R is called the graph of ln(Z(θ)). This set of points forms the surface in Θ × R illustrated in Figure 12.3. The lower bound for ln Z(θ) given by the derivative at θ1 is

ln(Z(θ)) ≥ ln(Z(θ1 )) + <∇ ln(Z(θ1 )), θ − θ1 >. (12.43)

This lower bound is illustrated as the shaded plane in Figure 12.3. The plane is tangent to the surface at (θ1, ln(Z(θ1))). In Θ, if the gradient vector ∇ ln(Z(θ1)) is translated to the point θ1, it points in the direction of greatest increase of ln(Z(θ)) at θ1. It is thus orthogonal to the level set of ln(Z(θ)) with value ln(Z(θ1)). This is also illustrated in Figure 12.3.
Now consider a second point θ0 ∈ Θ. The lower bound (12.43) requires that ln(Z(θ0)) ≥ ln(Z(θ1)) + <∇ ln(Z(θ1)), θ0 − θ1>. The point (θ0, ln(Z(θ0))) lies on the surface, while the corresponding point on the lower bound is (θ0, ln(Z(θ1)) + <∇ ln(Z(θ1)), θ0 − θ1>). The distance between the point on the plane


and the point on the surface is

∆ = ln(Z(θ0)) − ( ln(Z(θ1)) + <∇ ln(Z(θ1)), θ0 − θ1> )
 = ln( Z(θ0)/Z(θ1) ) + <θ1 − θ0, ∇ ln(Z(θ1))>
 = ln( Z(θ0)/Z(θ1) ) + <θ1 − θ0, E_{fθ1}[t(X)]>        [by Lemma 12.7.1]
 = D(fθ1‖fθ0)        [by Theorem 12.7.1].

This is also illustrated in Figure 12.3.
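The following sketch verifies this gap/KL identity numerically for the univariate Gaussian family of Example 12.7.5; the two sets of mean/variance parameters are arbitrary assumptions.

```python
import numpy as np

def nat(mu, s2):                 # natural parameters θ = (µ/σ², −1/(2σ²))
    return np.array([mu / s2, -1.0 / (2 * s2)])

def logZ(th):                    # ln Z(θ) = ½ ln(π/(−θ2)) − θ1²/(4θ2)
    return 0.5 * np.log(np.pi / (-th[1])) - th[0]**2 / (4 * th[1])

def grad_logZ(th):               # E[t(X)] = (µ, σ² + µ²)
    mu = -th[0] / (2 * th[1]); s2 = -1.0 / (2 * th[1])
    return np.array([mu, s2 + mu**2])

mu0, s20, mu1, s21 = 0.0, 1.0, 2.0, 0.5      # illustrative parameter choices
th0, th1 = nat(mu0, s20), nat(mu1, s21)

gap = logZ(th0) - (logZ(th1) + grad_logZ(th1) @ (th0 - th1))
kl = 0.5 * ((mu0 - mu1)**2 / s20 + s21 / s20 - 1 + 2 * np.log(np.sqrt(s20 / s21)))
print(gap, kl)                   # the two numbers should agree
assert abs(gap - kl) < 1e-10
```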

12.8 Appendix: Interchanging Derivatives and Integrals


In general, changing the order of two limit operations can change the result obtained. However, there are
sufficient conditions when exchanging limit operations is possible. A particular case of interest is changing
the order of integration and differentiation.
Let f(x, t) be a function of two variables. We want to integrate f(x, t) over x and differentiate f(x, t) with respect to t. There are two ways we can do this: (d/dt) ∫ f(x, t) dx and ∫ (d/dt) f(x, t) dx. Do both methods yield the same result? Questions such as this can often be answered using a result known as the dominated convergence theorem. Consider the differentiation operation at t = t0 as the limit as v → 0 of the function

fv(x) = ( f(x, t0 + v) − f(x, t0) ) / v.

The dominated convergence theorem tells us that if there exists an integrable function g(x) such that |fv(x)| ≤ g(x), then

(d/dt) ∫ f(x, t) dx = lim_{v→0} ∫ fv(x) dx = ∫ lim_{v→0} fv(x) dx = ∫ (d/dt) f(x, t) dx.

For example, h(x)e^{<θ, t(x)>} is a function of θ ∈ Θ and x ∈ D. For simplicity assume that D is a compact set. At θ0, we have

( h(x)e^{<θ0+v, t(x)>} − h(x)e^{<θ0, t(x)>} ) / ‖v‖ = h(x)e^{<θ0, t(x)>} ( e^{<v, t(x)>} − 1 ) / ‖v‖.

For sufficiently small ‖v‖, the term on the right is bounded above by a constant multiple of the integrable function h(x)e^{<θ0, t(x)>}. Hence under the stated assumptions,

Dθ( ∫ h(x)e^{<θ, t(x)>} dx )(v) = ∫ Dθ( h(x)e^{<θ, t(x)>} )(v) dx.

For further reading see Bartle [2, Theorem 5.6 and Corollary 5.9].

12.9 Appendix: Proofs


Proof of Lemma 12.3.1 . For notational simplicity set α = ln(π0 /π1 ) and assume α ≥ 0. The case α ≤ 0
will follow by symmetry. We can always assume κ > 0. Rather than working with the probability of error,


it is more convenient to work with the probability of correct classification. From (12.15) you see that this is the sum of the terms

C0 = π0 Φ( κ/2 + α/κ )   and   C1 = π1 Φ( κ/2 − α/κ ),

and that the derivatives of these terms w.r.t. κ are

C0' = π0 Φ'( κ/2 + α/κ )( 1/2 − α/κ² )   and   C1' = π1 Φ'( κ/2 − α/κ )( 1/2 + α/κ² ).

Notice that C0' > 0 for κ² > 2α and C1' > 0 for all κ² > 0. Hence for κ² ≥ 2α the probability of being correct is monotone increasing in κ. For 0 < κ² ≤ 2α, C0' ≤ 0 and C1' > 0. To ensure C0' + C1' > 0 for 0 < κ² ≤ 2α, it is sufficient that

−C0' = π0 Φ'( κ/2 + α/κ )( α/κ² − 1/2 ) < C1'.

To verify that the above holds we examine

ln( −C0'/C1' ) = ln(π0/π1) + ln( Φ'( κ/2 + α/κ ) / Φ'( κ/2 − α/κ ) ) + ln( (α/κ² − 1/2) / (α/κ² + 1/2) )
 = α + (1/2)( −( κ/2 + α/κ )² + ( κ/2 − α/κ )² ) + ln( (α/κ² − 1/2) / (α/κ² + 1/2) )
 = ln( (α/κ² − 1/2) / (α/κ² + 1/2) )
 < 0.

Hence for α ≥ 0, the probability that the classifier is correct is monotonically increasing in κ and hence also in κ².

Notes
For a good introduction to Bayesian decision theory see Duda et al. [13, Ch. 2], and the more advanced discussion
for Gaussian conditional densities in Murphy [32, Ch. 4]. Poor [36, Ch 2], gives a concise summary of general
Bayes decision rules. See also Wasserman [49], Silvey [45], and Lehmann [27]. For additional reading on ROC
curves, see Duda et al. [13, §2.8.3]. The exponential family is discussed in many texts. For a detailed modern
treatment see [32, §9.2]. Other interesting accounts are given in the books by Wasserman [49, §9.13.3], Berger [3],
and Lehmann [26, §1.5], [27, §2.7]. The essence of the proof of Theorem 12.7.4 follows that in [27, §2.7].

Exercises
Exercise 12.1. Show that the solution of the LDA problem (12.28) with SW PD can be obtained by scaling the solution
of the linear equations SW w = µ1 − µ0 to unit norm. (This only requires linear algebra.)
Exercise 12.2. Prove Theorem 12.7.1.
Exercise 12.3. Prove Theorem 12.7.2.

Chapter 13

Generative Models for Regression

13.1 Introduction
This chapter formulates a generative model for the regression problem.
Let a data example take the form (x, y) with x ∈ Rn and y ∈ Rq. The values x and y are assumed to be the outcomes of random variables X and Y with joint density fXY. For the moment we assume that this joint density is known.
The specific problem of interest is to estimate the value of Y having observed the outcome x of X. Hence we call Y the target and X the observation. The information provided by x about the outcome of Y refines the uncertainty in Y through the conditional density

fY|X(y|x) = fXY(x, y) / fX(x).

If X and Y are independent, then by definition fXY(x, y) = fX(x)fY(y). From this equality it follows that

fY|X(y|x) = fXY(x, y)/fX(x) = fX(x)fY(y)/fX(x) = fY(y).
In this case, knowing the outcome of X does nothing to refine our uncertainty in the value of Y. Thus for
the problem to be interesting, X and Y must be dependent random vectors. This dependency is captured by
the joint density fXY and reflected in the conditional density fY|X .
By factoring the joint density as

fXY(x, y) = fX|Y(x|y) fY(y),   (13.1)

we can write

fY|X(y|x) = fXY(x, y)/fX(x) = fX|Y(x|y) fY(y)/fX(x)   for all x with fX(x) > 0.   (13.2)

This equation relates the prior density fY (y) (prior to the observation) to the posterior density fY|X (y|x)
(after the observation). Equation (13.2) also suggests that the natural generative model in this situation is
given by the right hand side of (13.1).
The estimated value of Y will be a function of the observed value of X. This function, denoted by ŷ(·),
is called an estimator or predictor. In contrast, the value produced by the estimator for a particular value x
of X is called an estimate or prediction of Y.


Our objective is to determine an optimal estimator for Y given the value of X. To do so we must
first decide what criterion we seek to optimize. Since Y can take on a continuum of values, it is not the
probability of error that is important (it will almost certainly be 1), but how close, on average, the estimate
ŷ(x) is to the actual value y of Y. There are many ways to measure this error. One way is the squared
distance kŷ(x) − yk22 . Sometimes this will be small, other times it will be large. What is important is the
expected value of this cost over the outcomes of X and Y. This leads to the performance metric

E[ ‖ŷ(X) − Y‖2² ].

This is called the mean squared error (MSE), and an estimator that minimizes this cost is called a minimum
mean square error (MMSE) estimator.

13.2 Determining the MMSE Estimator


First consider the situation when we have no information about the value of Y beyond its density fY (y).
In other words, we don’t observe the value of X and must rely solely on fY (y) to estimate the value of Y.
In this situation we seek a value ŷ (the estimate) that minimizes the expected value of the squared error, E[ ‖Y − ŷ‖2² ]. To ensure that each ŷ yields a finite MSE cost, it is sufficient to assume that Y has finite first
and second moments. Under this assumption, the following lemma verifies the intuitively reasonable result
that the MSE is minimized by the selection ŷ = µY .

Lemma 13.2.1. Assume that the components of Y have finite first and second order moments. Then the estimator ŷ = µY minimizes the MSE E[ ‖Y − ŷ‖2² ].

Proof. Expanding the expression for the MSE gives

E[ ‖Y − ŷ‖2² ] = E[ ‖Y − µY + µY − ŷ‖2² ]
 = E[ (Y − µY + µY − ŷ)^T (Y − µY + µY − ŷ) ]
 = E[ trace( (Y − µY)^T (Y − µY) ) ] + ‖µY − ŷ‖2²
 = trace(ΣY) + ‖µY − ŷ‖2².

The assumption that Y has finite first and second order moments ensures that the final expression for the
MSE is finite, and the expression is clearly minimized by setting ŷ = µY .

A different cost function would yield a different result for the optimal estimator ŷ, but it will still be some constant determined by the density of Y and the selected cost function. The MSE cost function is easy to work with since it is quadratic in ŷ and it yields an intuitively reasonable result.
Now consider how to estimate the value of Y when a realization of (X, Y) is determined but we can
only observe the value assumed by X, i.e., we know that X = x. In this case, we simply replace the prior
density of Y by its posterior density.

Lemma 13.2.2. Assume the conditional density fY|X(y|x) has finite first and second order moments. Then the optimal MSE estimate of the value of Y given that X = x is the mean of the conditional density fY|X(y|x):

ŷ(x) = E[Y|X = x] = ∫ y fY|X(y|x) dy = µY|X(x).


Proof. Given X = x, any residual uncertainty in the value of Y is completely described by fY|X (y|x).
Hence, under the assumptions of Lemma 13.2.1, the optimal MSE estimate of the value of Y given that
X = x, is the mean of the conditional density.

The mean of the conditional density is a function of the observed value x. Hence it is an estimator. The value of the conditional mean for a particular observed value x of X is the corresponding estimate ŷ(x) of Y. In general, it is difficult to find a closed form expression for the (usually nonlinear) conditional mean estimator. So although we know that this estimator exists and that it is optimal under the MSE cost, in general, computing the estimator can be a challenge. It is easy to find the MMSE estimator when X and Y are jointly Gaussian. We consider this situation in the next section.

13.3 The MSE Estimator for Jointly Gaussian Random Vectors


Let the random vector Z take values in Rp and assume that Z is a non-degenerate Gaussian random vector with density

fZ(z) = ( 1/((2π)^{p/2}|Σ|^{1/2}) ) e^{−(1/2)(z−µ)^T Σ^{−1}(z−µ)}.   (13.3)

Partition Z into two component vectors X and Y, and concordantly partition the mean and covariance of Z as follows

Z = [X; Y],   µ = [µX; µY],   Σ = [ΣX, ΣXY; ΣYX, ΣY].   (13.4)
Since X and Y need not have the same dimensions, in general ΣXY is not a square matrix. Given the value
of X we want to predict the value of Y.
Properties of jointly Gaussian random vectors are discussed in Appendix F. There it is shown that
fY|X(y|x) is also a Gaussian density with mean and covariance

µY|X = µY + ΣYX ΣX^{−1}(x − µX),
ΣY|X = ΣY − ΣYX ΣX^{−1} ΣXY.   (13.5)

Hence the MMSE estimator of Y given X is the affine function:

ŷ(x) = µY|X(x) = ΣYX ΣX^{−1}(x − µX) + µY.

Let

W^T = ΣYX ΣX^{−1},
b = µY − ΣYX ΣX^{−1} µX.   (13.6)

Then the MMSE estimator can be written as ŷ(x) = W^T x + b.


Now consider the random vector Ŷ = W^T X + b. Since Ŷ is an affine function of X, it is also Gaussian. Using (13.5) and (13.6) we see that Ŷ has mean µY and covariance ΣŶ = ΣYX ΣX^{−1} ΣXY. Its outcomes are the predictions of Y based on the corresponding outcome of X. The resulting prediction error is the random vector

E = Y − Ŷ = Y − W^T X − b = (Y − µY) − W^T(X − µX).

This is also Gaussian with mean µE = 0 and covariance ΣE = ΣY − ΣYX ΣX^{−1} ΣXY.
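The sketch below builds the affine MMSE estimator (13.6) from a partitioned mean and covariance and confirms, by Monte Carlo, that it beats the prior-mean predictor and matches the theoretical MSE. The joint mean and covariance are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(0)

# Illustrative jointly Gaussian model for Z = (X, Y).
n, q = 2, 1
mu = np.array([1.0, -1.0, 0.5])
A = rng.standard_normal((n + q, n + q))
Sigma = A @ A.T + 0.5 * np.eye(n + q)                 # PD joint covariance
muX, muY = mu[:n], mu[n:]
SX, SXY = Sigma[:n, :n], Sigma[:n, n:]
SYX, SY = Sigma[n:, :n], Sigma[n:, n:]

W = np.linalg.solve(SX, SXY)                          # W satisfies ΣX W = ΣXY
b = muY - W.T @ muX                                   # intercept of (13.6)

# Monte Carlo sanity check against the prior-mean predictor.
Z = rng.multivariate_normal(mu, Sigma, size=200000)
X, Y = Z[:, :n], Z[:, n:]
mse_affine = np.mean(np.sum((Y - (X @ W + b)) ** 2, axis=1))
mse_prior = np.mean(np.sum((Y - muY) ** 2, axis=1))
print(f"MSE affine: {mse_affine:.4f}  (theory: {np.trace(SY - SYX @ W):.4f})")
print(f"MSE prior mean: {mse_prior:.4f}  (theory: {np.trace(SY):.4f})")
```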


13.4 Examples
In the following examples we assume that X and Y are jointly Gaussian, and that the joint density is specified using the generative model (13.1). Hence Y is N(µY, ΣY), with µY and ΣY known. The conditional density fX|Y(x|y) is then specified using the outcome of Y. Finally, we observe the value of X, and want to estimate the corresponding value of Y.

13.4.1 The Observation is a Noisy Copy of the Target


The simplest nontrivial specification of the conditional density is fX|Y(x|y) = N(y, ΣN). This corresponds to setting X = Y + N, where N is a Gaussian random noise vector independent of Y with mean 0 and covariance ΣN. Hence the observation X is a noisy copy of Y. Under this model:

µX = E[Y + N] = µY,
ΣX = E[ (Y − µY + N)(Y − µY + N)^T ] = ΣY + ΣN,
ΣXY = E[ (Y − µY + N)(Y − µY)^T ] = ΣY.

By (13.17) the MMSE affine estimator is

ŷ(x) = ΣY(ΣY + ΣN)^{−1}(x − µY) + µY.   (13.7)

Additional insight into (13.7) can be obtained by considering the special cases discussed below.

Scalar valued X and Y


In this case, µX = µY = µ, ΣY = σ², ΣN = σN², ΣX = σ² + σN², and the MMSE estimator is

ŷ(x) = ( σ²/(σ² + σN²) )(x − µ) + µ
 = αx + (1 − α)µ,   with α = σ²/(σ² + σN²) ∈ [0, 1].   (13.8)

This estimator has an intuitive interpretation. When σN² ≪ σ², the observation provides clear information about the value of Y.¹ In this case, α ≈ 1 and ŷ(x) ≈ x. On the other hand, when σN² ≫ σ², the observation provides little information about Y. In this case, α ≈ 0, and ŷ(x) ≈ µ. In the first case, the observation is trustworthy and the estimator puts most of its emphasis on x. In the second, the observation is unreliable, and the estimator puts most of its weight on the prior mean µ. For situations between these extremes the estimator forms a convex combination of the observation x and the prior mean µ.
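The shrinkage behaviour of (13.8) is easy to see empirically; the sketch below uses arbitrary illustrative values of µ, σ² and σN².

```python
import numpy as np
rng = np.random.default_rng(0)

# Shrinkage rule (13.8) for clean, comparable, and very noisy observations.
mu, s2 = 2.0, 1.0
for s2N in (0.01, 1.0, 100.0):
    alpha = s2 / (s2 + s2N)
    y = rng.normal(mu, np.sqrt(s2), size=100000)
    x = y + rng.normal(0.0, np.sqrt(s2N), size=y.shape)
    yhat = alpha * x + (1 - alpha) * mu
    print(f"σN²={s2N:6.2f}  α={alpha:.3f}  "
          f"MSE(shrinkage)={np.mean((y - yhat)**2):.3f}  MSE(x alone)={np.mean((y - x)**2):.3f}")
```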

The noise has uncorrelated components


Now consider the special case when the noise components are uncorrelated and each has the same variance σN², i.e., ΣN = σN² Iq. Let ΣY = UDU^T be an eigen-decomposition of the symmetric positive semidefinite matrix ΣY. So the columns uj of U are orthonormal eigenvectors of ΣY and D denotes the diagonal matrix formed from the corresponding eigenvalues σj², j ∈ [1 : q]. The random variable uj^T(Y − µY) is the vector
¹The relevant KL-divergence is DKL( N(µ, σ²), N(µ, σ² + σN²) ) = (1/2)(α − ln(α) − 1).
)) = 1/2(α − ln(α) − 1).


of coordinates of the projection of Y − µY onto the line in the direction uj . The variance of this random
variable is

E[ uj^T (Y − µY)(Y − µY)^T uj ] = uj^T ΣY uj = σj².

So we can interpret σj² as the variance of Y in the direction of its j-th unit norm eigenvector uj.
Now use the fact that U^T U = Iq to express W^T in terms of U:

W^T = ΣY(ΣY + ΣN)^{−1} = (UDU^T)( UDU^T + σN² UU^T )^{−1}
 = (UDU^T)( U(D + σN² Iq)U^T )^{−1}
 = U [ diag( σj²/(σj² + σN²) ) ] U^T.   (13.9)

So W^T attenuates the deviations of X from its mean in the direction of uj using the gain

αj ≜ σj²/(σj² + σN²),   αj ∈ [0, 1],   j ∈ [1 : q].

For eigenvectors uj with σj² ≫ σN², αj ≈ 1, while for eigenvectors with σj² ≪ σN², αj ≈ 0. For the first set of directions, X − µX passes through W^T with almost unit gain, but for the second set, X − µX is highly attenuated. In situations between these extremes, the attenuation in direction uj is determined by αj = σj²/(σj² + σN²).

Now bring the mean µY into the picture by using (13.9) to rewrite (13.7) in the form

ŷ(x) = U( [diag(αj)] U^T x + [diag(1 − αj)] U^T µY ).   (13.10)

If σj² < σN², αj is small and the component of ŷ in the direction uj is formed mainly from the component of µY in this direction. Conversely, when σj² > σN², αj is large and the component of ŷ in the direction uj is formed mainly from the component of x in this direction. In particular, when σN² ≫ σj² for all j ∈ [1 : q], the optimal MMSE estimate of Y given X = x reverts to the prior mean µY. This makes sense, since in this case the observation X = x adds very little information about the value of Y and (under the MSE metric) the best we can do is to estimate the value of Y to be its mean µY.
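The direction-by-direction view in (13.9)–(13.10) can be checked directly; in the sketch below the prior covariance, mean and noise variance are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(1)

# Illustrative prior and noise model: X = Y + N with ΣN = σN² I_q.
q = 4
muY = np.array([1.0, 0.0, -1.0, 2.0])
A = rng.standard_normal((q, q))
SY = A @ A.T                                   # PSD prior covariance
s2N = 1.0

# Direct form of the estimator gain versus the eigen-decomposition form (13.9).
W_T = SY @ np.linalg.inv(SY + s2N * np.eye(q))
evals, U = np.linalg.eigh(SY)
alphas = evals / (evals + s2N)
assert np.allclose(W_T, U @ np.diag(alphas) @ U.T)

# Estimator (13.10): mix observation and prior mean direction by direction.
x = muY + rng.multivariate_normal(np.zeros(q), SY + s2N * np.eye(q))
yhat = U @ (np.diag(alphas) @ (U.T @ x) + np.diag(1 - alphas) @ (U.T @ muY))
assert np.allclose(yhat, W_T @ (x - muY) + muY)
print("per-direction gains:", np.round(alphas, 3))
```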

13.4.2 The Observation is a Noisy Linear Function of the Target Vector


Let’s add one more element to the previous model. Let B ∈ Rn×q be given and set

fX|Y(x|y) = N(By, ΣN).

So a target value y is first generated. Then X is normal with mean By and covariance ΣN.
Let N be a Gaussian noise vector with density N (0, ΣN ), and Y and N be independent. Then under the
above generative model, the observed variable X can be written as
X = BY + N. (13.11)
Given X = x and knowledge of B, µY , ΣY and ΣN , we want to estimate the value of the target Y. We
have
µX = E[BY + N] = BµY,
ΣX = E[ B(Y − µY)(Y − µY)^T B^T + NN^T ] = BΣY B^T + ΣN,
ΣYX = E[ (Y − µY)(B(Y − µY) + N)^T ] = ΣY B^T.

Hence the minimum MSE affine estimator is given by

W^T = ΣY B^T( BΣY B^T + ΣN )^{−1},
ŷ(x) = W^T(x − µX) + µY.   (13.12)

Scalar valued X and Y


In this case, µY = µ, ΣY = σ², ΣN = σN², µX = bµ, ΣX = b²σ² + σN², and the MMSE estimator is

ŷ(x) = ( bσ²/(b²σ² + σN²) )(x − bµ) + µ
 = α(b)(b^{−1}x) + (1 − α(b))µ,   with α(b) = b²σ²/(b²σ² + σN²) ∈ [0, 1].   (13.13)

The second equation in (13.13) indicates that when b 6= 0, we can divide x by b and then proceed as in the
previous example. As b becomes large, the estimate moves towards x/b, and as b becomes small it moves
towards the prior mean µ. If b = 0 then the observation is useless and the best estimate is just µ.

X and Y take values in R2 and B is diagonal


For simplicity take µY = 0 and

ΣY = [σ1², σ12; σ21, σ2²],   ΣN = [σN², 0; 0, σN²],   B = [b1, 0; 0, b2].

Then

W^T = [b1σ1², b2σ12; b1σ21, b2σ2²] [b1²σ1² + σN², b1b2σ12; b1b2σ21, b2²σ2² + σN²]^{−1}.   (13.14)

When b1, b2 ≠ 0, we can rearrange the above equation as

W^T = ΣY ( ΣY + B^{−1}ΣN B^{−1} )^{−1} B^{−1}
 = [σ1², σ12; σ21, σ2²] [σ1² + b1^{−2}σN², σ12; σ21, σ2² + b2^{−2}σN²]^{−1} [b1, 0; 0, b2]^{−1}.   (13.15)

Equation (13.15) indicates that we can also implement the optimal predictor by first applying B^{−1} to the observation x and then the MMSE estimator for the model (B^{−1}X) = Y + B^{−1}N. One should take care, however, if either of b1, b2 is close to 0, since then the center matrix in the second equation in (13.15) is likely to be ill-conditioned.
Let’s examine what happens when b2 = 0. In this case, (13.14) is still applicable and simplifies to

W^T = [b1σ1², 0; b1σ21, 0] [1/(b1²σ1² + σN²), 0; 0, 1/σN²] = [b1σ1²/(b1²σ1² + σN²), 0; b1σ21/(b1²σ1² + σN²), 0].

Writing x = (x1, x2) we hence have

ŷ(x) = [ b1σ1²/(b1²σ1² + σN²); b1σ21/(b1²σ1² + σN²) ] x1.

Notice that the estimator provides an estimated value for both components of y even though we had no direct observation of the second component (b2 = 0). The estimator is able to do so because of the cross-correlation term σ21 in ΣY.

13.5 General Affine Estimation


We now return to the problem of estimating the value of a random vector Y given the outcome of a related
random vector X. Previously we assumed that the joint density fXY was known. This time we make the
weaker assumption that only the first and second order statistics of the joint density are known. In general,
this provides less information about the dependency between X and Y, but it can still be a useful model.
Moreover, these statistics can be learned using training data. We will also restrict the form of allowed
estimators to be affine functions of x. This assumption ensures that we don’t need to know fXY, just its first and second order statistics.
Consider random vectors X and Y generated under a general joint density fXY. Rather than seeking the conditional mean estimator µY|X(x) for fXY, we seek the affine estimator that minimizes the mean square error. So we seek W ∈ Rn×q and b ∈ Rq such that averaged over the values of X and Y, the estimate ŷ(x) = W^T x + b of the outcome of Y achieves the minimal MSE. This problem can be stated as

min_{W ∈ Rn×q, b ∈ Rq}  E[ ‖Y − W^T X − b‖2² ].   (13.16)

Let X have mean µX ∈ Rn and covariance ΣX ∈ Rn×n , Y have mean µY ∈ Rq and covariance
ΣY ∈ Rq×q , and let the cross covariance of X and Y be ΣXY ∈ Rn×q . We show below that a solution of
(13.16) can be found using only these quantities.

Theorem 13.5.1. A minimum MSE affine estimator of Y given X = x has the form

ŷ(x) = W ? T (x − µX ) + µY , (13.17)

with W ? satisfying ΣX W ? = ΣXY . In particular, if ΣX is positive definite, the minimum MSE affine
estimator is unique and is specified by W ? = Σ−1
X ΣXY .

Proof. Assume for the moment that µX = 0 and µY = 0. Expanding the RHS of (13.16) yields

E[ ‖Y − W^T X − b‖2² ] = E[ (Y − W^T X − b)^T(Y − W^T X − b) ]   (13.18)
 = E[ (Y − W^T X)^T(Y − W^T X) ] + E[ −2(Y − W^T X)^T b ] + b^T b
 = E[ (Y − W^T X)^T(Y − W^T X) ] + b^T b.   (13.19)

Since the two terms in (13.19) are nonnegative, and the first does not depend on b, the expression is minimized with b = 0. Using the properties of the trace function, the first term in (13.19) can be rewritten as

E[ (Y − W^T X)^T(Y − W^T X) ] = E[ trace( Y^T Y − Y^T W^T X − X^T W Y + X^T W W^T X ) ]
 = E[ trace( YY^T − 2W^T XY^T + W^T XX^T W ) ].

Hence

E[ ‖Y − W^T X − b‖2² ] = trace( ΣY − 2W^T ΣXY + W^T ΣX W ).   (13.20)

The derivative of (13.20) with respect to W acting on H ∈ Rn×q is

trace( −2H^T ΣXY + H^T ΣX W + W^T ΣX H ) = 2 trace( H^T ΣX W − H^T ΣXY ).   (13.21)

Setting this expression equal to zero we find that for every H ∈ Rn×q , <H, ΣX W − ΣXY > = 0. It follows
that a necessary condition for W to minimize the MSE cost is that

ΣX W = ΣXY . (13.22)

To verify that a solution of (13.22) minimizes the objective function we can compute the second derivative,
i.e., the derivative with respect to W of the RHS of (13.21). This yields trace(H T ΣX H). Noting that ΣX is
PSD, we conclude that trace(H T ΣX H) ≥ 0 and hence that all solutions of (13.22) minimize the MSE cost
objective. If ΣX is positive definite, (13.22) has the unique solution

W* = ΣX^{−1} ΣXY.   (13.23)

When the means µX and µY are nonzero, we apply the reasoning above to predict Y − µY given the
value of X − µX . This yields the minimum MSE predictor W ? T (x − µX ). Hence the least MSE predictor
of Y is ŷ(x) = W ? T (x − µX ) + µY , where W ? is any solution of (13.22).

13.6 Learning an Affine Estimator from Training Data


Theorem 13.5.1 assumes that we know the relevant first and second order statistics of the random vectors X
and Y. If these are unknown, we can learn an affine estimator using training data {(xi , yi )}m
i=1 . We assume
the pairs (xi , yi ) are independent samples from the joint density fXY (x, y). Let X denote the matrix with
the training examples as its columns, and Y denote the matrix containing the corresponding targets as its
columns. The training data can be used to form empirical estimates µ̂X , µ̂Y , Σ̂X , and Σ̂XY of the quantities
of interest as shown in Appendix F. In general, these empirical estimates will not be maximum likelihood
estimates since fXY (x, y) need not be Gaussian. The empirical estimates can then be used in (13.22) or
(13.23) to form the affine estimator Ŵ and b̂ = µ̂Y − Ŵ T µ̂X .
Alternatively, we could directly learn an affine estimator by solving the least squares problem

W_ls, b_ls = arg min_{W ∈ Rn×q, b ∈ Rq} ‖Y − W^T X − b1_m^T‖_F².   (13.24)

We leave it as an exercise to show that these two approaches have the same set of solutions.
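The agreement of the two routes is easy to confirm numerically; the sketch below generates synthetic training data (all values are assumptions for illustration), fits both ways, and compares the results.

```python
import numpy as np
rng = np.random.default_rng(0)

# Synthetic training data: examples and targets stored as columns.
n, q, m = 3, 2, 500
X = rng.standard_normal((n, m))
Wtrue = rng.standard_normal((n, q))
Y = Wtrue.T @ X + 0.1 * rng.standard_normal((q, m))

# Route 1: empirical first and second order statistics, then solve ΣX W = ΣXY.
muX = X.mean(axis=1, keepdims=True)
muY = Y.mean(axis=1, keepdims=True)
Xc, Yc = X - muX, Y - muY
SX, SXY = Xc @ Xc.T / m, Xc @ Yc.T / m
W1 = np.linalg.solve(SX, SXY)
b1 = muY - W1.T @ muX

# Route 2: least squares (13.24) with an appended intercept row.
Xa = np.vstack([X, np.ones((1, m))])
Theta, *_ = np.linalg.lstsq(Xa.T, Y.T, rcond=None)    # Theta is (n+1) x q
W2, b2 = Theta[:n, :], Theta[n:, :].T

assert np.allclose(W1, W2) and np.allclose(b1, b2)
print("Empirical-statistics estimator and least squares agree.")
```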

Exercises
Exercise 13.1. Prove Lemma F.0.1.
Exercise 13.2. Let X be a Gaussian random vector with parameters µ, Σ. Show that:
(a) E[X] = µ.
(b) E[ (X − µ)(X − µ)^T ] = Σ.
Exercise 13.3. Let X be an n-dimensional Gaussian random vector with mean µ and PD covariance Σ and let Y =
A(X − µ) + b for b ∈ Rn and an invertible matrix A ∈ Rn×n . Show that Y is also a Gaussian random vector with
mean b and covariance Ω = AΣAT .


Exercise 13.4. Let X be an n-dimensional Gaussian random vector with mean µ and PD covariance Σ. Show that
there exists a symmetric PD matrix A and vector b such that Y = AX + b, is a Gaussian random vector with mean 0
and covariance In .
Exercise 13.5. Let X be an n-dimensional Gaussian random vector with mean µ and PD covariance Σ. Show that
there exists a matrix A and a vector b such that X = AY + b, where Y is a zero mean Gaussian random vector with
independent components of unit variance. Is A unique? Can A be chosen to be symmetric PD? Is such a choice
unique?
Exercise 13.6. Find the mean and covariance of Ŷ? .
Exercise 13.7. Find the mean and covariance of the error random vector E = Y − Ŷ? . Use this to show that in all
directions, the variance of Y is greater than or equal to the variance of E.
Exercise 13.8. Assume ΣX is positive definite. Show that the MSE of the optimal MSE affine estimator is

trace( ΣY − ΣYX ΣX^{−1} ΣXY ).
Exercise 13.9. Show that the MSE of the estimator (13.7) is trace (I − W ? T )ΣY .

Exercise 13.10. Show that the MSE of the estimator (13.12) is trace ((I − W ? B)ΣY ).
Exercise 13.11. Let B ∈ Rm×n and Y, N, X be random vectors with Y and N independent, E[N] = 0, and X =
BY + N. The first and second order statistics of Y and N are known, and ΣN is PD.
The minimum MSE estimate of Y given X = x is

ŷ ? (x) = µY + ΣY B T (BΣY B T + ΣN )−1 (x − BµY ). (13.25)

(a) Show that using minimum MSE affine denoising to estimate BY given X = x, then multiplying the result by
B + , yields the same estimator.
(b) Show that in general, the estimator that results by first multiplying X by B + and then denoising is not an
optimal MSE affine estimator.

Exercise 13.12. Bias, error covariance, and MSE. Consider random vectors X and Y with a joint density fXY and
PD covariance Σ. Let X have mean µX ∈ Rn and covariance ΣX ∈ Rn×n , Y have mean µY ∈ Rq and covariance
ΣY ∈ Rq×q , and let the cross-covariance of X and Y be ΣXY ∈ Rn×q .

Let ŷ(x) be an estimator of Y given X = x, and denote the corresponding prediction error by E = Y − ŷ(X). Of
interest is µE , ΣE and the MSE. The estimator is said to be unbiased if µE = 0.

(a) For any estimator ŷ with finite µE and MSE, show that MSE(ŷ) = trace(ΣE ) + kµE k22 . This shows that the
MSE is the sum of two terms: the total variance trace(ΣE ) of the error, and the squared norm of the bias kµE k22 .
(b) Let ŷ(x) = µY . Show that this is an unbiased estimator, determine ΣE , show that ΣE is PD, and determine the
estimator MSE.
(c) The minimum MSE affine estimator of Y given X = x is

ŷ ? (x) = µY + W ? T (x − µX ) with ΣX W ? = ΣXY .

Show that ŷ ? (·) is an unbiased estimator, determine ΣE , show that ΣE is PD, and determine the estimator MSE.

Exercise 13.13. Empirical statistics, MSE affine prediction, and least squares. Fix a training dataset {(xi , yi )}m
i=1 ,
with examples xi ∈ Rn and targets yi ∈ Rq . Let X denote the matrix with the examples as its columns, Y denote the
matrix with the corresponding targets as its columns. Define the following first and second order empirical statistics
of the data:
µ̂X = 1/mX1m µ̂Y = 1/mY 1m
(13.26)
Σ̂X = 1/m(X − µ̂X 1Tm )(X − µ̂X 1Tm )T Σ̂XY = 1/m(X − µ̂X 1Tm )(Y − µ̂Y 1Tm )T


A MMSE affine estimator ŷ(x) = W T x + b based on the empirical statistics (13.26) must satisfy

Σ̂X W = Σ̂XY b = µ̂Y − W T µ̂X . (13.27)

(a) Consider the least squares problem

Wls , bls = arg min kY − W T X − b1Tm k2F . (13.28)


W ∈Rn×q ,b∈Rq

Show that Wls , bls satisfy (13.27). Thus directly solving the least squares problem (13.28) yields an optimal
MSE affine estimator for the empirical first and second order statistics in (13.26).
(b) Consider the ridge regression problem

Wrr , brr = arg min 1/mkY − W T X − b1Tm k2F + λkW k2F , λ > 0. (13.29)
W ∈Rn×q ,b∈Rq

Determine if Wrr , brr satisfy (13.27). If not, what needs to be changed in (13.26) to ensure Wrr , brr satisfy
(13.27). Interpret the change you suggest.

Chapter 14

Convex Optimization

This chapter briefly summarizes the essential elements of convex optimization. There are many good books
on this topic that go into much greater depth. See the references at the end of the chapter for additional
reading. Before reading this chapter, it may be helpful to revise the material on convex sets and functions in
Chapter 7.

14.1 Convex Programs


A convex program is an optimization problem of the form

min f (w)
w∈Rn
s.t. fi (w) ≤ 0, i ∈ [1 : k] (14.1)
Aw − b = 0,

where f, fi : Rn → R, i ∈ [1 : k], are convex functions, A ∈ Rm×n and b ∈ Rm . The function f is called the
objective function, the functions fi are called the constraint inequalities, and the rows of the affine vector
function Aw − b are called the affine equality constraints.
A point w ∈ Rn satisfying all of the constraints in (14.1) is said to be a feasible point, and the set
of all feasible points is called the feasible set. The feasible set is the intersection of the 0-sublevel sets
L0^{(i)} = {w : fi(w) ≤ 0}, i ∈ [1 : k], and the set S = {w : Aw − b = 0}. The set S is either empty (b ∉ R(A)), or is an affine manifold of the form wp + N(A) where Awp = b. Hence S is closed and convex. Since each of the functions fi is convex, each sublevel set L0^{(i)} is closed and convex. So the feasible
set is an intersection of closed convex sets and is hence closed and convex. Thus problem (14.1) seeks to
minimize a convex function f over a closed, convex set. To be an interesting problem, we need the feasible
set to be nonempty. In this case we say that the problem is feasible. Otherwise we say that it is infeasible.

14.1.1 Linear Programs


Let F ∈ Rm×n , g ∈ Rm and h ∈ Rn . A linear program minimizes a linear function subject to linear
inequality constraints:

min hT w
w∈Rn (14.2)
s.t. F w ≤ g.


The objective function f (w) = hT w is linear and hence convex. Let Fi,: denote the i-th row of F and gi
denote the i-th entry of g. Then there are m constraint inequalities fi (w) = Fi,: w − gi ≤ 0. These are affine
and hence convex. Thus a linear program is a convex program. To ensure feasibility, we need the existence
of a w ∈ Rn such that F w ≤ g.

14.1.2 Quadratic Programs


Let P ∈ Rn×n be a symmetric positive semidefinite matrix, q ∈ Rn, r ∈ R, F ∈ Rm×n, and g ∈ Rm. A
quadratic program minimizes a quadratic objective subject to affine inequality constraints:

min 1/2 wT P w + q T w + r
w∈Rn (14.3)
s.t. F w ≤ g.

Since P is symmetric positive semidefinite, the objective function f(w) = 1/2 w^T P w + q^T w + r is convex.


There are m affine constraint functions fi (w) = Fi,: w − gi ≤ 0, i ∈ [1 : m]. To ensure feasibility, we need
at least one point w ∈ Rn to satisfy F w ≤ g.

14.2 The Lagrangian and the Dual Problem


Consider the general convex program (14.1). For clarity, we will refer to (14.1) as the primal problem and
its objective and constraints as the primal objective and primal constraints, respectively.
Bring in dual variables λ ∈ Rk , with λ ≥ 0, and µ ∈ Rm , and form the Lagrangian:

k

X
L(w, λ, µ) = f (w) + λi fi (w) + µT (Aw − b).
i=1

By construction, for all λ ≥ 0, µ ∈ Rm and feasible w:

L(w, λ, µ) ≤ f (w) and max L(w, λ, µ) = f (w).


λ≥0,µ

Notice that L(w, λ, µ) is a convex function of w. Moreover, if the objective f (w) and constraint functions
fi (w), i ∈ [1 : k], are differentiable w.r.t. w, then so is L(w, λ, µ).
We now set out to minimize the unconstrained convex function L with respect to w, without requiring
that w is feasible. This gives rise to the dual objective function g(λ, µ) defined by

g(λ, µ) = minn L(w, λ, µ).


w∈R

The domain of g is the set of all (λ, µ) ∈ Rk × Rm satisfying λ ≥ 0 and g(λ, µ) > −∞. Such points are
said to be dual feasible. By construction, for all dual feasible (λ, µ) and feasible w,

g(λ, µ) = minn L(v, λ, µ) ≤ L(w, λ, µ) ≤ max L(w, λ, µ) = f (w). (14.4)


v∈R λ≥0,µ

So for all dual feasible (λ, µ) and all feasible w, the dual objective lower bounds the primal objective:

g(λ, µ) ≤ f (w). (14.5)


We now use g(λ, µ) to define the dual problem:


max g(λ, µ)
λ∈Rk ,µ∈Rm (14.6)
s.t. (λ, µ) is dual feasible.
The requirement of dual feasibility sometimes imposes implicit constraints on λ and µ that go beyond the
obvious requirement that λ ≥ 0.
Example 14.2.1 (The Dual of a Quadratic Program). Consider the quadratic program introduced in
§14.1.2 with the stronger assumption that P is symmetric positive definite. Bring in a dual variable λ ∈ Rm
with λ ≥ 0 and form the Lagrangian
L(w, λ) = 1/2 wT P w + q T w + r + λT (F w − c). (14.7)
The corresponding dual objective function is g(λ) = minw∈Rn L(w, λ). Setting the derivative of L with
respect to w equal to zero yields
wT P h + q T h + λT F h = (wT P + q T + λT F )h = 0.
Hence the value of w minimizing L(w, λ) is
w = −P −1 q − P −1 F T λ. (14.8)
Substituting this expression into L(w, λ) and simplifying gives the dual objective
g(λ) = −1/2 λT (F P −1 F T )λ − (q T P −1 F T + cT )λ + (r − 1/2 q T P −1 q).
Thus the dual problem is
max − 1/2 λT (F P −1 F T )λ − (q T P −1 F T + cT )λ + (r − 1/2 q T P −1 q)
λ∈Rm (14.9)
s.t. λ ≥ 0.
Using some simple algebra, one can write the dual problem in the standard form (14.3) for a quadratic
program. In this sense, it is also a quadratic program.
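The sketch below illustrates (14.8)–(14.9) numerically on a small randomly generated QP (all data are assumptions); it solves the primal with a generic SciPy SLSQP solver, solves the dual as a bound-constrained minimization of −g(λ), and compares the two optimal values and the primal point recovered from (14.8). The constraint vector is written c as in (14.7).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, m = 3, 4
A = rng.standard_normal((n, n)); P = A @ A.T + np.eye(n)      # symmetric PD
q = rng.standard_normal(n); r = 0.5
F = rng.standard_normal((m, n))
c = np.abs(rng.standard_normal(m)) + 0.5                      # w = 0 is strictly feasible

f = lambda w: 0.5 * w @ P @ w + q @ w + r
# Primal: minimize f subject to F w <= c.
primal = minimize(f, np.zeros(n), method="SLSQP",
                  constraints=[{"type": "ineq", "fun": lambda w: c - F @ w}])

# Dual: maximize g(λ) over λ >= 0, i.e. minimize -g(λ) with simple bounds.
Pinv = np.linalg.inv(P)
def neg_dual(lam):
    return (0.5 * lam @ (F @ Pinv @ F.T) @ lam
            + (q @ Pinv @ F.T + c) @ lam - (r - 0.5 * q @ Pinv @ q))
dual = minimize(neg_dual, np.ones(m), method="L-BFGS-B", bounds=[(0, None)] * m)

w_from_dual = -Pinv @ (q + F.T @ dual.x)      # primal recovery via (14.8)
print("primal optimum :", primal.fun)
print("dual optimum   :", -dual.fun)          # strong duality: should (approximately) match
print("w* (primal)    :", primal.x)
print("w* from (14.8) :", w_from_dual)
```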

14.3 Weak Duality, Strong Duality and Slater’s Condition


Let w? denote a solution of the primal problem and (λ? , µ? ) denote a solution of the dual problem. The
point w? must be feasible, and (λ? , µ? ) must be dual feasible. Hence by (14.5) the following weak duality
condition always holds:
g(λ? , µ? ) ≤ f (w? ). (14.10)
However, in general, there need not be equality in this expression. If for a particular convex program
g(λ? , µ? ) = f (w? ), then we say that strong duality holds.

14.3.1 Slater’s Condition


There are a number of sufficient conditions, known as constraint qualifications, each ensuring that strong
duality holds. One of the simplest of these is known as Slater’s condition. We describe this below. The
important point is that if Slater’s condition is satisfied, then strong duality holds.
For the convex program (14.1), Slater’s condition requires that there is feasible point w satisfying
fi (w) < 0, i ∈ [1 : k]. In the special case when the inequality constraints in (14.1) are affine constraints of
the form fi (w) = aTi w − bi ≤ 0, Slater’s condition is even simpler: it only requires that the primal problem
(14.1) has a feasible point.


14.4 Complementary Slackness


Assume that strong duality holds. Then g(λ? , µ? ) = f (w? ) and by (14.4) it follows that g(λ? , µ? ) =
L(w? , λ? , µ? ) = f (w? ). Hence
f (w? ) = L(w? , λ? , µ? )
k
X
= f (w? ) + λ?i fi (w? ) + µ? T (Aw? − b)
i=1
Xk
= f (w? ) + λ?i fi (w? ).
i=1
Pk ? ?
Thus i=1 λi fi (w ) = 0. Since the terms in this sum are nonpositive, each term must be zero:
λ?i fi (w? ) = 0, i ∈ [1 : k]. (14.11)
The equations (14.11) are called complementary slackness conditions. If fi (w? ) = 0, we say that the
constraint is active, otherwise it is inactive or slack. The dual variable λ?i must satisfy λ?i ≥ 0. Hence when
λ?i = 0, its constraint is active, and when λ?i > 0, it is slack. So (14.11) says that at an optimal solution:
if a primal constraint is slack (fi (w? ) < 0), then the corresponding dual variable constraint must be active
(λ?i = 0), and if the dual variable constraint is slack (λ?i > 0), then the corresponding primal constraint must
be active (fi (w? ) = 0). Hence the term complementary slackness.

14.5 The KKT Conditions


Assume that strong duality holds and that the functions f (w) and fi (w), i ∈ [1 : k], are continuously
differentiable. For feasible w, the inequality
g(λ? , µ? ) ≤ L(w, λ? , µ? ) ≤ f (w),
and strong duality imply that w? minimizes L(w, λ? , µ? ) over feasible w. Since L(w, λ? , µ? ) is differen-
tiable w.r.t. w, it follows that
∇f(w*) + Σ_{i=1}^{k} λ*_i ∇fi(w*) + A^T µ* = 0.

Hence the following conditions, known as the KKT conditions1 , are necessarily satisfied at w? , λ? , µ? :
∇f(w*) + Σ_{i=1}^{k} λ*_i ∇fi(w*) + A^T µ* = 0        (∇w L(w*, λ*, µ*) = 0)
fi(w*) ≤ 0, i ∈ [1 : k]        (primal constraints)
Aw* − b = 0        (primal constraint)
λ* ≥ 0        (dual constraints; there can be more)
λ*_i fi(w*) = 0, i ∈ [1 : k]        (complementary slackness)
For general optimization problems that’s as much as we can say. However, for convex programs satisfying
the above assumptions one can say more.
1
Named for those who first published the result: William Karush (1939), and Harold W. Kuhn and Albert W. Tucker (1951).


Theorem 14.5.1. For a convex program in which strong duality holds and the functions f (w) and fi (w), i ∈
[1 : k], are continuously differentiable, the KKT conditions are both necessary and sufficient for optimality.
Example 14.5.1 (The KKT Conditions for a Quadratic Program). The primal quadratic program is
a convex program with affine inequality constraints. Hence if the primal problem has a feasible point,
Slater’s condition is satisfied and strong duality holds. Let w? and λ? denote the solutions to the primal
and dual problems, respectively. Then g(λ? ) = f (w? ). In addition, the primal objective function f (w)
is continuously differentiable. Hence the following KKT conditions are necessary and sufficient for the
optimality of w? , λ? :

w* + P^{−1}q + P^{−1}F^T λ* = 0        (∇w L(w*, λ*) = 0)
Fw* − g ≤ 0        (primal constraint)
λ* ≥ 0        (dual constraint)
λ*_i (Fi,: w* − gi) = 0, i ∈ [1 : m]        (complementary slackness).

Notes
For additional reading see the books by Boyd and Vandenberghe [7], Bertsekas [4] and Chong and Zak [9]. The
presentation here follows [7].

Exercises
Exercise 14.1. Consider the primal problem below. Here X ∈ Rn×m , y ∈ Rm and 1 ∈ Rm is the vector of all 1’s.
Give a detailed but concise derivation of the corresponding dual problem.

min 1/2kwk2
w∈Rn ,b∈R

s.t. X T w + by ≥ 1.

Exercise 14.2. Let A ∈ Rm×n with rank(A) = n and y ∈ Rm . Consider the constrained regression problem:

min 1/2ky − Awk22


w∈Rn
s.t. max |w(j)| ≤ 1.
j

This requires that we minimize the least squares residual while keeping the magnitude of the entries of w to at most 1.
Show that this is a feasible convex program and that strong duality holds.
Exercise 14.3. Let A ∈ Rm×n , y ∈ Rm and c > 0. Consider the constrained regression problem:

min 1/2ky − Awk22


w∈Rn

s.t. kwk22 ≤ c.

We want to minimize the least squares residual while ensuring the squared norm of w is at most c.
(a) Verify that this is a convex program, that it is feasible and that strong duality holds.
(b) Write down the KKT conditions.
(c) Show that the primal solution is given by ridge regression using the optimal value λ? of the dual variable.


Chapter 15

The Linear Support Vector Machine

The support vector machine (SVM) is a framework for using a set of labelled training examples to learn
a binary classifier. A special case of the SVM, called the generalized portrait method, was introduced by
Vapnik and Lerner in 1963. It assumed that the labelled training data could be separated using a hyperplane
(i.e., linearly separable training data). This is a special case of what is now called the linear SVM. The
original method was later extended to allow a nonlinear decision boundary constructed as a hyperplane in
a higher dimensional space. The SVM framework became popular after 1995 when Cortes and Vapnik
extended the framework by removing the requirement of linearly separable training data. The extended
method, together with the previous nonlinear extensions, is what we know today as the support vector
machine. From a practical perspective, the SVM has been employed across a wide variety of applications
and has produced robust classifiers that generalize well. The two main limitations of the SVM are its natural
binding to binary classification and the need to specify (rather than learn) a kernel function (more on this
later).
This chapter focuses on the linear SVM. We begin by considering linearly separable training data. Then we show how to remove the linearly separable assumption. In a subsequent chapter we show how to extend these results to the general (nonlinear) SVM.

15.1 Preliminaries
15.1.1 Hyperplanes
Recall that a hyperplane H in Rn is a subset of Rn of the form {x : wT x + b = 0}, where w ∈ Rn is
nonzero, and b ∈ R. The ray in the direction of w is the set of all points of the form αw for α > 0. This ray
intersects the hyperplane at the point q with q = αw satisfying wT (αw) + b = 0. Thus q = −bw/kwk2 .
The distance d from the origin to the hyperplane is just the norm of q. Hence d = |b|/kwk. Finally, for each
x on the hyperplane, wT (x − q) = wT x + b = 0. Thus w ⊥ (x − q). In this sense, the vector w is normal
to the hyperplane. These observations are summarized in the left diagram in Figure 15.1.
The parameters (w, b) of a given hyperplane H are not unique. For any α > 0, (w, b) and (αw, αb)
specify the same hyperplane. Let [H] denote the set of equivalent parameters (w, b) all of which yield the
same hyperplane H.
A hyperplane H divides Rn into a positive half space: {x : wT x + b > 0}, and a negative half space:
{x : wT x + b < 0}. For the sake of definiteness, we will include the hyperplane H = {x : wT x + b = 0}
in the positive half space. This yields a binary subdivision of Rn that assigns each x ∈ Rn a label ŷw,b (x) ∈



Figure 15.1: Left: Hyperplane properties. Right: Computing the distance d(xj, H) from xj to the hyperplane H.

{±1} using the rule

ŷw,b(x) = +1 if w^T x + b ≥ 0, and ŷw,b(x) = −1 if w^T x + b < 0.

This is a linear classifier on Rn . Normally we write this linear classifier as ŷw,b (x) = sign(wT x + b), where
it is understood that we will take sign(0) = 1.
The family of linear classifiers on Rn is parameterized by (w, b) ∈ Rn+1 . For any α > 0, (w, b) and
(αw, αb) specify the same hyperplane, hence the same linear classifier. So a linear classifier corresponds to
the equivalence class [H] of pairs (w, b) yielding the same hyperplane H.
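The sketch below exercises the hyperplane facts above: the closest point to the origin, the distance from an arbitrary point to H, and the invariance of the classifier under positive rescaling of (w, b). The particular (w, b) and test point are assumptions chosen for illustration.

```python
import numpy as np

w = np.array([3.0, 4.0])
b = -5.0
classify = lambda x: 1 if w @ x + b >= 0 else -1        # sign convention with sign(0) = 1

q = -b * w / (w @ w)                                    # point of H closest to the origin
print("closest point to origin:", q, " distance:", np.abs(b) / np.linalg.norm(w))

x = np.array([4.0, 1.0])
dist = np.abs(w @ x + b) / np.linalg.norm(w)            # distance from x to H
print("label:", classify(x), " distance to H:", dist)

# (αw, αb) with α > 0 defines the same hyperplane, hence the same classifier.
alpha = 2.5
assert classify(x) == (1 if (alpha * w) @ x + alpha * b >= 0 else -1)
```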

15.1.2 Linearly Separable Training Data


Let {(xj, yj)}_{j=1}^{m} denote a set of training examples, with xj ∈ Rn and yj ∈ {±1}, j ∈ [1 : m]. If the label yj of example xj is +1, then xj is said to be a positive example, otherwise it is said to be a negative example.
We make a standing assumption that the training data contains both positive and negative examples.
We say that the training data is linearly separable if for some hyperplane H, the positive training exam-
ples lie in the positive half space of H, and the negative training examples lie in the negative half space of
H. In this case, we also say that H is a separating hyperplane and that it separates the training data. This
is illustrated in the left diagram of Figure 15.2.
Using the above definition, we see that H separates the training data if and only if for all (w, b) ∈ [H]:

for j ∈ [1 : m],   if yj = −1, then w^T xj + b < 0;   if yj = +1, then w^T xj + b > 0.   (15.1)

This is a set of m linear inequalities in (w, b) defining a subset of (w, b) pairs in Rn+1 . These equations can
be written more compactly as
yj (wT xj + b) > 0, j ∈ [1 : m]. (15.2)
In light of (15.2), for every example xj there exists a scalar γj > 0 such that yj (γj wT xj + γj b) = 1. Letting
γ = maxj γj we see that
yj (γwT xj + γb) ≥ 1 j ∈ [1 : m].



Figure 15.2: Left: Linearly separable data in R2 . The red points are the positive examples, and the blue points are
the negative examples. The hyperplane H separates these two sets of points. Right: The maximum margin separating
hyperplane, and its support vectors (circled), for the separable data on the left.

Moreover there exists at least one j such that yj (γwT xj + γb) = 1. For future reference, we summarize
these useful observations in the following lemma.
Lemma 15.1.1. A hyperplane H separates the training data {(xj , yj )}m
j=1 if and only if there exists
(w, b) ∈ [H] satisfying

yj (wT xj + b) ≥ 1 j ∈ [1 : m]; (15.3)


T
min yj (w xj + b) = 1. (15.4)
j

Proof. (If) If (w, b) satisfies (15.3) and (15.4), then (w, b) satisfies (15.2). So H separates the training data.
(Only If) Suppose H separates the training data and let (w, b) ∈ [H]. Then (w, b) must satisfy (15.2).
So for each training example, yj (wT xj + b) > 0. It follows that for each j there exists γj > 0 such that
yj ((γj w)T xj + (γj b)) = 1. Since the training data is finite, γ = maxj∈[1:m] {γj } exists and is finite. Let
(w̄, b̄) = (γw, γb). Then (w̄, b̄) ∈ [H], and for each training example, yj (w̄T xj + b̄) ≥ 1. Hence (w̄, b̄)
satisfies (15.3). In addition, for some j, γ = γj , and hence yj (w̄T xj + b̄) = 1. Thus (15.4) is satisfied.

15.2 A Simple Linear SVM


To explain the key ideas of the SVM we first make the strong assumption that the training data is linearly
separable. In this case, there will generally be many separating hyperplanes. Let H be a separating hy-
perplane specified by (w, b) ∈ [H] satisfying (15.3) and (15.4). The distance d(xj , H) from xj to H can
be computed by orthogonally projecting xj onto the line Lw = {x : x = αw, α ∈ R}. This yields the point
x̂j = wwT xj /kwk2 . Let q be the intersection of Lw and the hyperplane H. We have already shown that
q = −wb/kwk2 . Thus
|wT xj + b| yj (wT xj + b)
d(xj , H) = kx̂j − qk2 = = . (15.5)
kwk kwk


This construction is illustrated in the right diagram of Figure 15.1.


Define the margin of a separating hyperplane H, denoted ρH, to be the distance from H to the closest training example. Using (15.5) and (15.4) it follows that under our parametrization:

ρH = min_j yj(w^T xj + b)/‖w‖ = 1/‖w‖.   (15.6)

The linear SVM problem for separable data is posed as finding the separating hyperplane of maximum
margin. By (15.6), this is achieved by minimizing kwk2 under the constraints of Lemma 15.1.1:

min 1/2kwk2
w∈Rn ,b∈R

s.t. yj (wT xj + b) ≥ 1 j ∈ [1 : m]; (15.7)


min yj (wT xj + b) = 1.
j

We claim that minimizing kwk2 always ensures that the second constraint in (15.7) is satisfied. To see this
suppose that (w, b) satisfies the first constraint with strict inequality for each j. Then there is a positive
γ < 1 so that (γw, γb) satisfies the second constraint. Since this scaling decreases kwk2 , it will always be
favored under the minimization of kwk2 . This observation allows us to simplify (15.7) to:

min 1/2kwk2
w∈Rn ,b∈R
(15.8)
s.t. yj (wT xj + b) ≥ 1, j ∈ [1 : m].

This is a convex program with affine inequality constraints. Under the assumption of linearly separable
training data, the problem is feasible. Hence Slater’s condition is satisfied and strong duality holds. Since
strong duality holds, and the objective and the constraint functions are continuously differentiable, the KKT
conditions are both necessary and sufficient for optimality.
Problem (15.8) can be written more compactly by forming the vector y ∈ {±1}m with y(i) = yi , and
letting Z ∈ Rn×m be the matrix Z = [y1 x1 , . . . , ym xm ] of label-weighted examples. Then (15.8) can be
written as

min 1/2 wT w
w∈Rn ,b∈R
(15.9)
s.t. Z T w + by − 1 ≥ 0.

Here 1 denotes the vector of all 1’s, 0 denotes the vector of all 0’s, and the inequality is interpreted compo-
nentwise. Now bring in the dual variables α ∈ Rm with α ≥ 0, and form the Lagrangian

L(w, b, α) = 1/2 wT w − αT (Z T w + by − 1). (15.10)

Setting the derivatives of L w.r.t. w and b equal to zero gives:

Dw L(w, b, α)(h) = wT h − αT Z T h = (wT − αT Z T )h = 0 (15.11)


T
Db L(w, b, α)(r) = −α yr = 0. (15.12)

From these equations we conclude that w − Zα = 0 and αT y = 0. Finally, for u, v ∈ Rm , let u ⊗ v =


[u(i)v(i)] denote the Schur product of u and v. Then the KKT conditions for problem (15.9) are:
w − Zα = 0        (∇w L = 0)   (15.13)
α^T y = 0        (∇b L = 0)   (15.14)
Z^T w + yb − 1 ≥ 0        (primal constraint)   (15.15)
α ≥ 0        (dual variable constraint)   (15.16)
α ⊗ (Z^T w + yb − 1) = 0        (complementary slackness)   (15.17)
Since Slater’s condition is satisfied, we know that w*, b*, and α* satisfy these equations if and only if w*, b* is a solution of (15.9) and α* is a solution of the corresponding dual problem.

(1) Under our assumptions, α? 6= 0 and w? 6= 0


Assume to the contrary that α? = 0. Then by (15.13) w? = 0. Using (15.15), this implies b? y ≥ 1. This
is impossible since we have a standing assumption that y contains both positive and negative labels. Thus
α? 6= 0. We obtain the same contradiction if we assume w? = 0. Thus w? 6= 0.

(2) Support vectors


From (15.13) we see that

w? = Zα? = Σ_{i=1}^m αi? yi xi = Σ_{i∈A} αi? yi xi        (15.18)

where A = {i : αi? > 0}. Since α? 6= 0, A is nonempty. The examples with indices in A are called the
support vectors of the classifier. Only the support vectors contribute to forming w? .
By (15.17), for i ∈ A, yi xTi w? + yi b? = 1. Multiplying both sides by yi allows us to write this as

w? T xi + b? = yi , i ∈ A. (15.19)
Equation (15.19) shows that each support vector lies on one of the two hyperplanes w? T x + b? = ±1,
according to the value of its label. This is illustrated on the right of Figure 15.2. We know from the argu-
ment used to simplify (15.7) to (15.8) that there exist training examples satisfying the primal constraint with
equality. The support vectors constitute a subset of these training examples.

(3) The nonzero dual variables determine w? , b? and the classifier


We know from (15.18) that w? = Σ_{i∈A} αi? yi xi . So w? is determined by the values of αi? , i ∈ A. We can
similarly determine b? by rearranging (15.19) to
b? = yi − w? T xi , i ∈ A. (15.20)
If we average (15.20) over i ∈ A we obtain
b? = (1/|A|) Σ_{i∈A} yi − w? T ( (1/|A|) Σ_{i∈A} xi ) = ȳA − w? T x̄A .        (15.21)

Here ȳA is the average label, and x̄A the average example, over the support vectors.
So w? and b? are determined by the nonzero entries of α? , and the resulting linear classifier is

ŷ(x) = sign( w? T (x − x̄A ) + ȳA ).        (15.22)

Only the support vectors play a role in classification. Hence once we know α? , the classifier is determined.


15.2.1 The Dual Problem


Often the simplest way to find α? is by directly solving the dual SVM problem. Following the procedure
outlined in Chapter 14, the dual problem is found to be

max 1T α − 1/2 αT Z T Zα
α∈Rm
s.t. y T α = 0 (15.23)
α ≥ 0.
Solving this problem gives α? , and the results of the previous subsection then uniquely determine the SVM
classifier. By multiplying the objective by −1 and changing the max to a min we see that the dual problem
is equivalent to a convex program with a quadratic objective and affine inequality constraints.
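To make this concrete, the following is a minimal sketch (not part of the course materials) of solving the dual problem (15.23) with the CVXPY package on a small synthetic separable dataset, and then recovering w? , b? and the support vectors via (15.18) and (15.21). The data, the tolerance used to detect nonzero αi? , and the tiny ridge added for numerical positive semidefiniteness are illustrative assumptions.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m = 20
X = np.hstack([rng.normal(2.0, 0.5, (2, m // 2)),      # positive examples as columns
               rng.normal(-2.0, 0.5, (2, m // 2))])    # negative examples as columns
y = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])

Z = X * y                          # label-weighted examples, Z = [y_i x_i]
G = Z.T @ Z + 1e-8 * np.eye(m)     # Z^T Z plus a tiny ridge so CVXPY accepts it as PSD

alpha = cp.Variable(m)
prob = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, G)),
                  [y @ alpha == 0, alpha >= 0])
prob.solve()

a = alpha.value
A = np.where(a > 1e-6)[0]          # indices of the support vectors
w = Z @ a                          # w* = Z alpha*, as in (15.18)
b = np.mean(y[A] - X[:, A].T @ w)  # b* by averaging (15.20), i.e., (15.21)
print("support vectors:", A, " w* =", w, " b* =", b)

The resulting classifier sign(w? T x + b? ) depends on the training data only through the support vectors, as noted above.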

15.3 The Linear SVM for General Training Data


Since realistic data is usually not linearly separable, the simple SVM problem (15.8) has restricted appli-
cability. Realistic data requires modifying (15.8) to allow some training examples to violate the constraint
yj (wT xj + b) ≥ 1. This can be done by bringing in nonnegative variables {sj }m j=1 and relaxing the con-
straint in (15.8) to
yj (wT xj + b) + sj ≥ 1, j ∈ [1 : m]. (15.24)
Fix a hyperplane H and let (w, b) ∈ [H] satisfy (15.24). If yj (wT xj + b) < 1, then we need sj > 0 to
ensure yj (wT xj + b) + sj ≥ 1. If in addition yj (wT xj + b) > 0, then 0 < sj < 1 suffices; this training example
is still on the correct side of the hyperplane and hence is not misclassified. But if yj (wT xj + b) < 0,
we need sj > 1, and xj is misclassified by H.
Relaxing the constraint yj (wT xj + b) ≥ 1 allows us to handle non separable training data, but we must
also add a penalty for the use of positive sj , otherwise there is no incentive to classify training examples
correctly. The size of sj is a measure
P of how much (xj , yj ) violates the original constraint. This suggests
that we add a penalty of the form m j=1 sj . This leads to what is called the primal linear SVM problem:
m
X
min 1/2kwk2 +C sj
w∈R ,b∈R,s∈Rm
n
j=1
(15.25)
s.t. yj (xTj w + b) + sj ≥ 1, j ∈ [1 : m],
sj ≥ 0, j ∈ [1 : m].
Here C > 0 is a regularization parameter that weights how strongly we penalize nonzero sj .
If the training data is linearly separable, and C is sufficiently large, problem (15.25) will yield the same
solution as the original problem (15.8). This is illustrated in Figure 15.3 (also see Exercise 15.3). In this
sense, problem (15.25) is a generalization of (15.8). If the training data is not linearly separable, then the
original problem (15.8) is infeasible, but the relaxed problem (15.25) is feasible and often yields a useful
solution. It does so by balancing the competing objectives of minimizing kwk2 and the cost of violations of
the original constraint.

15.3.1 Analysis of the Primal Linear SVM Problem


It will be convenient to write (15.25) in a more compact form. Let Z ∈ Rn×m be the matrix of label-
weighted examples Z = [y1 x1 , . . . , ym xm ], and form the vectors y ∈ {±1}m with y(i) = yi , and s ∈ Rm with s(i) = si .



Figure 15.3: An illustration of the linear SVM applied to separable training data. In both plots, the red points are the
positive examples, and the blue points are the negative examples. Left: A linear SVM classifier trained with C = 0.5.
There are six support vectors. Two satisfy yi (wT xi + b) = 1, but four have yi (wT xi + b) < 1 and hence si > 0 even
though the data is linearly separable. Right: A linear SVM classifier trained on the same data with C = 5. Now there
are only two support vectors and both have si = 0.

The primal linear SVM problem can then be written as

min 1/2 wT w + C1T s


w∈Rn ,b∈R,s∈Rm

s.t. Z T w + by + s − 1 ≥ 0 (15.26)
s ≥ 0,

with the inequalities interpreted componentwise. Notice that the training data appears in the first constraint
as the n × m matrix Z and the vector y.
Problem (15.26) is a feasible, convex (quadratic) program. Since it is feasible and has affine inequality
constraints, Slater’s condition is satisfied and strong duality holds. It also has a continuously differentiable
objective function. Thus the KKT conditions are both necessary and sufficient for optimality.
To obtain the KKT conditions, bring in the dual variables α, µ ∈ Rm with α, µ ≥ 0, and form the
Lagrangian
L(w, b, s, α, µ) = 1/2 wT w + C1T s − αT (Z T w + by + s − 1) − µT s. (15.27)

Setting the derivatives of L w.r.t. w, b and s equal to zero gives:

Dw L(h) = wT h − αT Z T h = (wT − αT Z T )h = 0 (15.28)


Db L(r) = −αT yr = 0 (15.29)
T T T T T T
Ds L(t) = C1 t − α t − µ t = (C1 − α − µ )t = 0. (15.30)

From these equations we conclude that w − Zα = 0, αT y = 0, and α + µ = C1. The KKT conditions for


problem (15.26) are thus:

w − Zα = 0                         (∇w L = 0)                   (15.31)
αT y = 0                           (∇b L = 0)                   (15.32)
α + µ − C1 = 0                     (∇s L = 0)                   (15.33)
Z T w + yb − 1 + s ≥ 0             primal constraint            (15.34)
s ≥ 0                              primal constraint            (15.35)
α ≥ 0                              dual variable constraint     (15.36)
µ ≥ 0                              dual variable constraint     (15.37)
α ⊗ (Z T w + yb − 1 + s) = 0       complementary slackness      (15.38)
µ ⊗ s = 0                          complementary slackness      (15.39)

We can use the KKT conditions to draw several conclusions about the solution w? , b? , s? and α? , µ? .

(1) Under our assumptions, α? 6= 0


Suppose to the contrary that α? = 0. Then by (15.31), w? = 0, and by (15.33), µ? = C1, so each µ?i = C > 0. Hence
using (15.39), s? = 0. It then follows by (15.34) that yb? ≥ 1. This is impossible since, by assumption, y
contains both positive and negative entries. Thus α? 6= 0.

(2) Support vectors


From (15.31) we see that

w? = Zα? = Σ_{i=1}^m αi? yi xi = Σ_{i∈A} αi? yi xi ,        (15.40)


where A = {i : αi? > 0} is nonempty since α? 6= 0. The examples with indices in A are called the support
vectors. Only these examples contribute to forming w? . By (15.38), for i ∈ A we have

w? T xi + b? = yi (1 − s?i ). (15.41)

The support vectors can take two forms:

(a) 0 < αi? < C. In this case, (15.33) implies µ?i > 0 and then (15.39) implies s?i = 0. So this support
vector lies on one of the two parallel hyperplanes:

w? T xi + b? = yi , yi ∈ {±1}.

(b) αi? = C. In this case, by (15.33), µ?i = 0, so (15.39) allows s?i ≥ 0. This support vector lies on the
hyperplane specified by (15.41). If s?i < 1, the support vector is correctly classified, otherwise it is
incorrectly classified.

The support vectors are made up of both positive and negative examples. To see this note that (15.32)
implies Σ_{i∈A} αi? yi = 0. Since αi? > 0 for i ∈ A, we conclude that yi must take both positive and negative
values over i ∈ A.


Figure 15.4: Training a linear SVM using nonseparable data. In both plots, the red points are the positive examples,
and the blue points are the negative examples. Left: A linear SVM classifier trained with C = 1. There are twelve
support vectors. Right: A linear SVM classifier trained on the same data with C = 5. In this case there are nine
support vectors.

(3) The Support Vectors Determine w? , b? , s? , and the classifier


By (15.40), w? = Σ_{i∈A} αi? yi xi . In addition, by (15.26) and (15.31), b? and s? are determined by solving
the following feasible linear program:

min_{s∈Rm, b∈R}   1T s

s.t.   [ Im  0 ] [ s ]  ≥  [        0        ]        (15.42)
       [ Im  y ] [ b ]     [ 1 − Z T Zα?     ]

Thus w? , b? , and s? , and hence the classifier, are all determined by α? :

ŷ(x) = sign( w? T x + b? ) = sign( Σ_{i∈A} αi? yi (xTi x) + b? ).

Only the support vectors participate in classification, and do so only via inner products with the test example.
15.3.2 The Dual Linear SVM Problem
The analysis in the previous section indicates the importance of α? in determining the linear SVM classifier.
Often α? is obtained by directly solving the dual problem. Following the procedure in Chapter 14, the dual
SVM problem is found to be,

max 1T α − 1/2 αT Z T Zα
α∈Rm , µ∈Rm

s.t. y T α = 0
α + µ − C1 = 0 (15.43)
α≥0
µ ≥ 0.


We can drop µ from the problem and replace the second constraint with α ≤ C1. Then µ = C1 − α. This
simplifies the dual to:

max 1T α − 1/2 αT Z T Zα
α∈Rm
s.t. y T α = 0 (15.44)
α ≤ C1
α ≥ 0.

By multiplying the objective by −1 and changing the max to a min we see that the dual problem is equivalent
to a convex program with a quadratic objective and affine inequality constraints. The ij-th entry of the matrix
Z T Z is yi (xTi xj )yj . Using this expression to write the term αT Z T Zα out in detail we see that
αT Z T Zα = Σ_{i=1}^m Σ_{j=1}^m αi αj yi yj (xTi xj ).

So the training examples appear in the dual problem through the terms yi yj (xTi xj ). In due course we will
see that this gives the dual problem a significant advantage.
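As a concrete illustration of these identities, the following hedged sketch uses scikit-learn's SVC (whose underlying solver works with a dual of this form) and checks that w? can be recovered from the dual variables via (15.40). The dataset and the value of C are made up for illustration.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
m = 200
X = rng.normal(size=(m, 2))                    # rows are examples (scikit-learn convention)
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=m) > 0, 1, -1)

clf = SVC(C=5.0, kernel="linear").fit(X, y)

# clf.dual_coef_ stores y_i alpha_i* for the support vectors, so (15.40) reads
w_from_dual = (clf.dual_coef_ @ clf.support_vectors_).ravel()
print(np.allclose(w_from_dual, clf.coef_.ravel()))    # w* = sum_{i in A} alpha_i* y_i x_i
print("number of support vectors:", len(clf.support_))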

15.4 The ν-SVM


The linear SVM has one free parameter that is frequently denoted by C > 0. For this reason it is often called
the C-parameterized SVM, or just C-SVM for short. There is an alternative formulation of the SVM called
the ν-parameterized SVM or ν-SVM for short. It has one free parameter ν > 0. For comparison purposes
the primal C-SVM and ν-SVM are shown at the top of Table 15.1. The primal ν-SVM makes three changes
to the primal C-SVM. First, the parameter C has been removed and replaced by the fixed constant 1/m. To
compensate, a new free parameter ν is introduced together with a new scalar variable r ≥ 0. These appear
as an additional term −νr in the objective, and the first constraint in the primal C-SVM has been modified
to Z T w + by + s − r1 ≥ 0. The Lagrangians and KKT conditions for these problems are also shown in
Table 15.1.
There are two obvious questions. First, are the C-SVM and ν-SVM equivalent? And second, what is the
benefit of the ν-SVM? The answer to the first question is yes: the two forms of SVM are equivalent. This is
the content of the following proposition.

Proposition 15.4.1 (After Schölkopf et al, 2000). If the primal ν-SVM for given ν > 0 has solution
w? , b? , r? , s? with r? > 0, then the solution of the primal C-SVM problem with C = 1/(r? m) yields
the identical classifier. Conversely, if the C-SVM for given C > 0 has solution w? , b? , s? , α? , µ? , then
the ν-SVM with ν = (1T α? )/(Cm) yields the identical classifier.

Proof. Let w? , b? , r? , s? with r? > 0 be a solution of the primal ν-SVM problem. We claim that w̄ = w? /r? ,
b̄ = b? /r? , and s̄ = s? /r? solve the C-SVM problem for C = 1/(r? m). To see this, go down the list of
KKT conditions for the primal ν-SVM and transform each by the modifications above. Then check that
the corresponding equation in the column for the primal C-SVM is satisfied. For most of the equations
this simply requires dividing each side of the equation by r? . For the complementary slackness equations
it requires dividing each side of the equation by r? 2 . Hence for every ν > 0, if the solution of the primal
ν-SVM problem yields r? > 0, then the solution of the primal C-SVM with C = 1/(r? m) has solution


Primal C-SVM:

min_{w∈Rn, b∈R, s∈Rm}   1/2 wT w + C 1T s
s.t.   Z T w + by + s − 1 ≥ 0
       s ≥ 0.

Primal ν-SVM:

min_{w∈Rn, b∈R, r∈R, s∈Rm}   1/2 wT w − νr + 1/m 1T s
s.t.   Z T w + by + s − r1 ≥ 0
       s ≥ 0
       r ≥ 0.

Lagrangian (C-SVM):

L = 1/2 wT w + C 1T s − αT [Z T w + by + s − 1] − µT s

Lagrangian (ν-SVM):

L = 1/2 wT w − νr + 1/m 1T s − αT [Z T w + by + s − r1] − µT s − γr

KKT conditions (C-SVM):

w − Zα = 0                        (∇w L)
αT y = 0                          (∇b L)
α + µ − C1 = 0                    (∇s L)
Z T w + yb − 1 + s ≥ 0            p. c.
s ≥ 0                             p. c.
α ≥ 0                             d. c.
µ ≥ 0                             d. c.
α ⊗ (Z T w + yb − 1 + s) = 0      c. s.
µ ⊗ s = 0                         c. s.

KKT conditions (ν-SVM):

w − Zα = 0                        (∇w L)
αT y = 0                          (∇b L)
α + µ − 1/m 1 = 0                 (∇s L)
αT 1 − ν − γ = 0                  (∇r L)
Z T w + yb − r1 + s ≥ 0           p. c.
s ≥ 0                             p. c.
r ≥ 0                             p. c.
α ≥ 0                             d. c.
µ ≥ 0                             d. c.
γ ≥ 0                             d. c.
α ⊗ (Z T w + yb − r1 + s) = 0     c. s.
µ ⊗ s = 0                         c. s.
γr = 0                            c. s.

Table 15.1: The C-SVM and ν-SVM and corresponding Lagrangians and KKT conditions.

w̄ = w? /r? , b̄ = b? /r? , and s̄ = s? /r? . This solution defines the same hyperplane and hence the same
classifier as the ν-SVM.
To prove the converse, let w? , b? , s? , α? , µ? satisfy the KKT conditions for the C-SVM for some C > 0.
We claim that w̄, b̄, s̄, ᾱ, µ̄ (where z̄ = z/(Cm)), with r̄ = 1/(Cm) and γ̄ = 0 satisfy the KKT conditions
for the primal ν-SVM with ν = 1T ᾱ. To see this, go down the list of KKT conditions for the primal C-SVM
and transform each by the modifications above. For most of the equations this simply requires dividing each
side of the equation by Cm. For the complementary slackness equations, divide each side of the equation by
(Cm)2 . Then check that the corresponding equation in the column for the primal ν-SVM is satisfied. This
verifies all equations except the 4-th, 9-th and 12-th. Since r̄ = 1/(Cm) > 0, select γ̄ = 0. This ensures the
9-th and 12-th equations are satisfied. Then ν = 1T ᾱ ensures the 4-th equation is satisfied. Since all KKT


conditions are satisfied, we have a solution of the primal ν-SVM. Moreover this solution yields the same
hyperplane and hence classifier as the C-SVM solution.

The question concerning advantages of the ν-SVM is answered in the following propositions.

Proposition 15.4.2 (Schölkopf et al, 2000). If we solve a ν-SVM yielding a solution with r? > 0, then
ν is an upper bound on the fraction of the training examples with s?i > 0 and a lower bound on the
fraction of training examples that are support vectors.

Proof. Let q be the number of training examples with s?i > 0 and let t ≥ q be the number of support vectors.
Since r? > 0, the KKT conditions imply γ ? = 0 and hence ν = α? T 1. Moreover, by the KKT conditions,
if s?i > 0, then µ?i = 0 and αi? = 1/m; otherwise αi? ≤ 1/m. Hence q/m ≤ ν = 1T α? ≤ t/m.

Proposition 15.4.3 (Schölkopf et al, 2000). Suppose the m training examples used in the ν-SVM are
drawn independently from a distribution p(x, y) such that p(x|y = 1) and p(x|y = −1) are absolutely
continuous. Then with probability one, in the limit as m → ∞ the fraction of support vectors and the
fraction of margin errors both converge to ν.

Proof. See Schölkopf et al, 2000.
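The following is a small empirical sketch of Propositions 15.4.2 and 15.4.3, assuming scikit-learn is installed; its NuSVC estimator implements the ν-SVM, and the dataset and the value of ν below are illustrative only.

import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(2)
m = 1000
X = rng.normal(size=(m, 2))
y = np.where(X[:, 0] - X[:, 1] + 0.7 * rng.normal(size=m) > 0, 1, -1)

nu = 0.2
clf = NuSVC(nu=nu, kernel="linear").fit(X, y)
frac_sv = clf.support_vectors_.shape[0] / m
print(f"nu = {nu}, fraction of support vectors = {frac_sv:.3f}")
# By Proposition 15.4.2 this fraction is at least nu; since the labels here are
# noisy (overlapping classes), Proposition 15.4.3 suggests it approaches nu as m grows.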

15.5 One Class SVMs


A set of unlabeled examples {xi ∈ Rn }m i=1 is said to be linearly separable from the origin if for some
w ∈ Rn with w 6= 0, wT xi > 0, i ∈ [1 : m]. Assume this holds and set ε = min_i wT xi . Since ε > 0, we
can set ŵ = w/ε. Then ŵT xi = wT xi /ε ≥ 1, i ∈ [1 : m]. Thus there exists a hyperplane ŵT x + b = 0 with
b = −1 that separates the data from the origin. Moreover, the distance from 0 to this hyperplane is 1/kŵk2 .
It follows that an unlabeled dataset {xi }m i=1 is linearly separable from the origin if and only if there
exists w ∈ Rn such that wT xi − 1 ≥ 0, i ∈ [1 : m]. Moreover, the distance from any such hyperplane to the
origin is 1/kwk2 . It is now clear that we can find a hyperplane in this family that is furthest from the origin.
It will be convenient to set X = [x1 , . . . , xm ] and write the condition wT xi − 1 ≥ 0, i ∈ [1 : m], in
vector form as X T w − 1 ≥ 0. We can then pose the simple one-class SVM problem as:

min 1/2 wT w
w∈Rn (15.45)
s.t. X T w − 1 ≥ 0.

Since problem (15.45) assumes the data is linearly separable from the origin, it is analogous to the simple
SVM problem. The problem can be generalized by allowing some points to be on the “wrong” side of the
hyperplane. This removes the linear separability assumption and leads to the following formulation of the
one class C-SVM (shown in primal and dual forms):

One Class C-SVM Primal: One Class C-SVM Dual:

min 1/2 wT w + C1T s max − 1/2 αT (X T X)α + 1T α


w∈Rn α∈Rm (15.46)
s.t. X T w − 1 + s ≥ 0 s.t. 0 ≤ α ≤ C1.
s ≥ 0.


One can also consider a one-class ν-SVM. In this case the primal and dual problems can be stated as:

One Class ν-SVM Primal: One Class ν-SVM Dual:

min 1/2 wT w − νr + 1/m1T s max − 1/2 αT (X T X)α


w∈Rn α∈Rm
s.t. X T w − r1 + s ≥ 0 s.t. 1T α = ν (15.47)
s≥0 0 ≤ α ≤ 1/m1.
r ≥ 0.
These problems look very similar to the binary SVM problems we have previously discussed. For
example, we can obtain the one-class C-SVM from the binary C-SVM by setting b = 0 (no offset) and
replacing the labeled weighted data matrix Z = [yi xi ] by the unlabeled data matrix X = [xi ].
Here is an interesting variation on the above ideas. Suppose we are provided with m > 1 data points
{xj ∈ Rn }m j=1 drawn from one class. To estimate the support region of the data in Rn , we seek a tight
spherical region that “encloses” the training data. This leads to the optimization problems:

Spherical One Class C-SVM Primal:

min_{R∈R, a∈Rn, s∈Rm}   R2 + C 1T s
s.t.   kxi − ak22 ≤ R2 + si ,   i ∈ [1 : m]
       s ≥ 0.

Spherical One Class C-SVM Dual:        (15.48)

max_{α∈Rm}   − αT (X T X)α + 1T diag(X T X)α
s.t.   0 ≤ α ≤ C1
       1T α = 1.
Here the variables R and a represent the radius and center, respectively, of the estimated spherical
region, and the variables in s are slack variables that allow some points to lie outside the estimated region.
This is another instance of a one class SVM.
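For reference, the one-class ν-SVM above is available in scikit-learn as OneClassSVM. The short sketch below (synthetic data and an illustrative value of ν) fits it with a linear kernel and reports the fraction of training points flagged as lying on the wrong side of the learned hyperplane.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X = rng.normal(loc=3.0, scale=0.7, size=(300, 2))   # a single class, away from the origin

oc = OneClassSVM(nu=0.05, kernel="linear").fit(X)
pred = oc.predict(X)                                # +1 for inliers, -1 for outliers
print("fraction flagged as outliers:", np.mean(pred == -1))   # roughly bounded by nu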

Notes
The SVM has a long history. It rose to wide prominence after the paper by Cortes and Vapnik in 1995 [10].
Notice that this paper did not use the name SVM. The ν-SVM appears in Schölkopf et al. [41].
The use of spheres with hard boundaries to describe data was examined with a soft margin one-class
SVM by Tax and Duin in 1999 [47]. The one class ν-SVM appears in the paper by Schölkopf et al. in
1999 [44] and is expanded upon in [39]. See [42] for a more expansive coverage of one class SVM classifiers.

Exercises
Primal SVM Problem
Exercise 15.1. (Uniqueness of SVM solutions)
(a) Show that the solution w? , b? of the simple linear SVM based on separable data is unique.
(b) Show that if w? , s? , b? is a solution of the primal linear SVM problem, then w? and 1T s? are unique. Show
that the uniqueness of s? and b? is determined by a feasible linear program.
Exercise 15.2. Show that if the solution of the primal linear SVM problem is unique, then there is at least one support
vector with s?i = 0. This support vector determines b? . Thus when the primal problem has a unique solution, b? is
easily determined from α? . What can one say when the solution is not unique?


Exercise 15.3. Let {(xi , yi )}m i=1 with xi ∈ Rn and yi ∈ {±1}, i ∈ [1 : m], be a linearly separable set of training data.
Show that if C is sufficiently large, the solution of the primal SVM problem will give the unique maximum margin
separating hyperplane. Give a method to determine how large C needs to be.
Exercise 15.4. Let ŷ(x) = sign(wT x + b) be a linear classifier with w, x ∈ Rn and b ∈ R. Suppose that the example
space Rn is transformed using a “rigid body” transformation z = Q(x − h) with Q ∈ On and h ∈ Rn . This translates
h to the origin and performs an orthogonal transformation about the origin. Let {xi } be any set of testing data in the
original space and {zi } be the corresponding test data in the transformed space.

(a) Show that the linear classifier ŷ(z) = sign (Qw)T z + (b + wT h) in the transformed space achieves the same
performance on the test examples zi as the original linear classifier on the test examples xi .
(b) Derive a similar result for the transformation z = Q(x − h) + g.
Exercise 15.5. Let {(xi , yi )}m i=1 with xi ∈ Rn and yi ∈ {±1}, i ∈ [1 : m], be a training dataset. For a fixed value of
C, let the corresponding SVM classifier be w? , b? .

(a) Let h ∈ Rn and Q ∈ On , and form the second training set: {Q(xi − h), yi )}m
i=1 . Show that the SVM classifier
for this dataset is Qw? , w? T h + b? .
(b) Show that both classifiers have the same accuracy on any testing set.
(c) In particular, if we first center the training examples, how does this change the SVM classifier?

Dual SVM Problems


Exercise 15.6. Give a clear and concise derivation of the dual of the simplified SVM problem shown below and
explain the origin of each of the constraints in the dual problem.

min 1/2 wT w
w∈Rn ,b∈R

s.t. Z T w + by − 1 ≥ 0

Exercise 15.7. Give a clear and concise derivation of the dual of the primal linear SVM problem shown below and
explain the origin of each of the constraints in the dual problem.

min 1/2 wT w + C1T s


w∈Rn ,b∈R,s∈Rm

s.t. Z T w + by + s − 1 ≥ 0
s ≥ 0.

Exercise 15.8. Show that the solution of the dual SVM problem is invariant under a “rigid body” transformation of
the training data. Specifically, let {(xi , yi )}m n
i=1 with xi ∈ R and yi ∈ {±1}, i ∈ [1 : m], be a training dataset, and
h ∈ R , Q ∈ On . Then the solutions of the dual SVM problems for the datasets {(xi , yi )}m
n m
i=1 and {(Q(xi −h), yi )}i=1
are identical.

Modified SVM Problems


Exercise 15.9. Suppose that instead of using C Σ_{i=1}^m si as the penalty term in the objective of the primal SVM
problem we use the quadratic penalty 1/2 C Σ_{i=1}^m s2i , while maintaining the constraint si ≥ 0.
(a) Formulate the new primal problem in vector form. When is the primal problem feasible?
(b) Does strong duality hold for this problem? Justify your answer.
(c) Write down the KKT conditions.
(d) Find the dual problem.


Exercise 15.10. If we drop the constraint s ≥ 0 in the primal linear SVM problem, we obtain the modified problem

min 1/2 wT w + C1T s


w∈Rn ,b∈R,s∈Rm

s.t. Z T w + by + s − 1 ≥ 0.

(a) Show that the above problem is equivalent to:

min 1/2 wT w + C1T s


w∈Rn ,b∈R,s∈Rm

s.t. Z T w + by + s − 1 = 0.

(b) Analyze the classifier that results from the problem in (a) in detail. In particular, show its connection to the
nearest centroid classifier.

Exercise 15.11. As in Exercise 15.10, drop the constraint s ≥ 0 in the primal linear SVM problem and make the
primal constraint an equality. However, this time replace C Σ_{i=1}^m si in the SVM objective with the quadratic penalty
1/2 C Σ_{i=1}^m s2i .

(a) Formulate the new primal problem in vector form and determine when the primal problem is feasible and when
strong duality holds.
(b) Write down the KKT conditions.
(c) Show that α? and b? solve a set of linear equations.
(d) Show that these linear equations have a unique solution.
(e) Find the dual problem in its simplest form.
Exercise 15.12. You are provided with m > 1 data points {xj ∈ Rn }m j=1 of which at least d, with 1 < d ≤ m are
distinct. Let X = [x1 , . . . , xm ] and consider the one class SVM problem:

min R2 + C1T s
R∈R,a∈Rn ,s∈Rm

s.t. kxi − ak22 ≤ R2 + si , i ∈ [1 : m],


s ≥ 0.

(a) Show that this is a feasible convex program and that strong duality holds. [Hint: let r = R2 ]
(b) Write down the KKT conditions.
(c) Show that α? 6= 0 and that if C > 1/(d − 1) then (R2 )? > 0 (harder).
(d) What are the support vectors for this problem?
(e) Derive the dual problem.
(f) Assume C > 1/(d − 1). Given the dual solution, how should a and R2 be selected?

Exercise 15.13. You are provided with m > 1 data points {xi ∈ Rn }m i=1 that are linearly separable from the origin.
Let X = [x1 , . . . , xm ] and consider the (simple) one-class SVM problem

min 1/2 wT w
w∈Rn

s.t. X T w − 1 ≥ 0.

(a) Show that this is a feasible convex program and that strong duality holds.
(b) Find the KKT conditions.
(c) Derive the corresponding dual problem.


Exercise 15.14. Consider the same set-up as the previous exercise except the set of unlabeled examples {xi ∈ Rn }m i=1
may not be linearly separable from the origin. In this case, the one-class SVM primal problem is

min 1/2 wT w + C1T s


w∈Rn

s.t. XT w + s − 1 ≥ 0
s ≥ 0.

(a) Show that this is a feasible convex program and that strong duality holds.
(b) Write down the KKT conditions.
(c) Derive the corresponding dual problem.


Chapter 16

Feature Maps and Kernels

A feature map is a function that maps the examples of interest (both training and testing) into an inner
product space. In doing so, the feature map can transform the representation of the data to aid subsequent
analysis. Here are several potential advantages of this approach:

(1) The new representation can enable the application of known machine learning methods. For example,
suppose the original data is a set of text, binary images, or tables of categorical data. If a feature map can
be found that represents the important aspects of the data as Euclidean vectors, standard machine
learning methods can be applied to the transformed data.

(2) A feature map can reorganize the data representation to aid data analysis. This is usually done by
extracting, combining and reorganizing informative features from the initial data representation. For
example, mapping an image into a vector of wavelet features, or a sound signal into a vector of spectral
features, or using a trained neural network to map images into a high-dimensional vector of learned
image features.

(3) A nonlinear feature map has the ability to warp the data space to reduce within class variation and
increase between class variation. This ability can often be enhanced by making the dimension of the
feature space higher than that of the initial space.

We first discuss feature maps and the potential advantages and disadvantages of using such maps in
machine learning. Then we bring in the concept of the kernel of a feature map and discuss kernel properties.

16.1 Feature Maps


Let the examples of interest be drawn from a set X . This could be an arbitrary set, or a subset of a Euclidean
space. A feature map is a function φ : X → H mapping examples x ∈ X to feature vectors φ(x) in the
feature space H. We require the space H to be a Hilbert space. A Hilbert space is an inner product space
that satisfies an extra technical condition (every Cauchy sequence converges). Every Euclidean space Rq is
a Hilbert space, and certain infinite dimensional function spaces are also Hilbert spaces.
To make this more concrete, suppose we want to use binary labelled training data {(xi , yi )}m i=1 , with
xi ∈ Rn and yi ∈ {±1}, to learn a linear classifier. However, instead of doing this directly in Rn ,
we first apply a (nonlinear) feature map φ : Rn → Rq that maps the training examples to the points
{(φ(xi ), yi )}m i=1 ⊂ Rq × {±1}. Then we use the mapped data to learn a linear classifier in Rq .



Figure 16.1: Left: A simple example showing a set of training data along the x-axis that is not linearly separable, and
above that, a mapping of the data into R2 using φ(x) = (x, 0.5 + x2 ), after which it is linearly separable. Middle and
Right: Illustrative plots for the second example in Example 16.1.1. The middle plot shows a set of training data √ in the
plane that is not linearly separable. The right plot shows a mapping of the data into R3 using φ(x) = (x21 , x22 , 2x1 x2 ),
after which it is linearly separable using the plane φ(x)(3) = 0.

After learning the classifier, a new test point x ∈ Rn is classified in Rq using its image φ(x):

ŷ(φ(x)) = 1 if w? T φ(x) + b? ≥ 0, and −1 otherwise.

Example 16.1.1. The following examples give some insight into the potential benefits of the above idea.

(1) Consider the specific binary classification problem shown in left plot of Figure 16.1. The set of binary
labeled training examples along the x-axis is not linearly separable. Shown above that are the results
of mapping the data into R2 using φ(x) = (x, 0.5 + x2 ). The mapped training examples are now
linearly separable in R2 .

(2) Let x ∈ R2 and write x = (x1 , x2 ). We create a variation of the famous “exclusive or” problem,
by letting the label of x = (x1 , x2 ) be y(x) = sign(x1 x2 ). Under this labelling, any sufficiently
large training set drawn from a density supported on R2 will not be linearly separable. But if we first
apply the nonlinear feature map φ(x) = x1 x2 , the data is linearly separable in R. This feels like
cheating; how would we know to select that special feature map? If we thought that second order
statistics of the data might help classification, we could instead apply a simple quadratic feature map
such as φ(x) = (x21 , x22 , √2 x1 x2 ). The mapped data is then linearly separable in R3 using the third
feature. Moreover, this fact can be learned using the training data to find a linear SVM classifier. This
is illustrated in the middle and right plots of Figure 16.1.
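A small numerical sketch of the second example follows; it assumes NumPy and scikit-learn and uses synthetic data, so the exact scores are illustrative, but it shows a linear classifier failing in R2 and succeeding after the quadratic feature map.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sign(X[:, 0] * X[:, 1])                 # the "exclusive or" style labels

def phi(X):
    # the quadratic feature map (x1^2, x2^2, sqrt(2) x1 x2) from the example
    return np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])

acc_raw = LinearSVC().fit(X, y).score(X, y)            # poor: not linearly separable in R^2
acc_phi = LinearSVC().fit(phi(X), y).score(phi(X), y)  # near 1.0: separable in R^3
print(acc_raw, acc_phi)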

Example 16.1.1 suggests that a feature map can be a powerful tool. The feature map can nonlinearly
warp the original space to bring examples with the same label closer together (thus reducing within class
variation) while moving examples with distinct labels further apart (thus increasing between class distances).
This task can potentially benefit by lifting the data into a higher dimension space since that adds extra degrees
of freedom for warping the data surface. Of course that’s a vague idea. It suggests that using nonlinear
functions to map the data into a high dimensional space is a good idea, but leaves open the question of how


to actually select the feature map. That will be application dependent, but the general idea is to identify and
build informative composite features from the original weakly informative features.
We should also note a potential computational issue with the use of high dimensional feature maps.
Each example could already have a high-dimensional representation (e.g., a large image). Hence mapping
the data into an even higher dimensional space and doing subsequent computation in that higher dimensional
space could be computationally expensive or even infeasible.

16.2 Kernels and Kernel Properties


Let X be a nonempty set from which data is drawn, and φ be a feature map from X into a Hilbert space H.
We define the kernel function k : X × X → R of φ by


k(x, z) = <φ(x), φ(z)>. (16.1)

We say that k is the kernel of φ and that k is a kernel on X . If it is important to indicate that a kernel k and
feature map φ are connected by (16.1), we denote the kernel by kφ . A few important properties of kernels
follow immediately from (16.1). We list these below.

Elementary Properties of Kernels

(1) For any function h : X → R, k(x, z) = h(x)h(z) is a kernel on X . This follows by noting that the
simplest feature maps extract one scalar feature, i.e., φ : X → R. Every such function defines a kernel
on X of the form k(x, z) = φ(x)φ(z). Hence h(x)h(z) is the kernel for the feature map h.

(2) If kφ is a kernel on X , then preceding kφ by a feature map ψ : Y → X also yields a kernel:

kφ (ψ(u), ψ(v)) = <φ ◦ ψ(u), φ ◦ ψ(v)> = kφ◦ψ (u, v),

where f ◦ g denotes the composition of functions f and g.

(3) A kernel k(x, z) inherits the following properties of the inner product in H: positivity, k(x, x) ≥ 0;
symmetry, k(x, y) = k(y, x); and the Cauchy-Schwartz inequality,

|k(x, z)| = |<φ(x), φ(z)>| ≤ kφ(x)k2 kφ(z)k2 = √( k(x, x) k(z, z) ).

Hence
−1 ≤ k(x, z) / √( k(x, x) k(z, z) ) ≤ 1.

In this sense, k(x, z) can be interpreted as a measure of similarity between x, z ∈ X .

Distinct feature maps can give rise to the same kernel function. For example, let X = R and consider the
feature maps φ1 (x) = x and φ2 (x) = (x/√2, x/√2). Both feature maps have the same kernel: k1 (x, z) =
xz and k2 (x, z) = 1/2 xz + 1/2 xz = xz. This suggests that kernels are more fundamental than feature maps.


Potential Computational Advantages


The kernel of a feature map has potential computational advantages. Using the feature map directly requires
mapping example points into the feature space H and performing subsequent computations in H. Since H
generally has a higher dimension (possibly infinite), this incurs an additional computational cost. On the
other hand, if the kernel function can be evaluated efficiently (e.g., without first computing φ(x) and φ(z)),
then it can provide a computational advantage. This is illustrated by the simple kernels in the example below.
Each can be evaluated without leaving the original space.
Example 16.2.1. Let x, z ∈ R2 and write x = (x1 , x2 ) and z = (z1 , z2 ).

(a) φ1 (x) = x1 x2   ⇒   k1 (x, z) = x1 x2 z1 z2 .

(b) φ2 (x) = (x1 , x2 , x1 x2 )   ⇒   k2 (x, z) = x1 z1 + x2 z2 + x1 x2 z1 z2 = xT z + k1 (x, z).

(c) φ3 (x) = (x21 , x22 , √2 x1 x2 )   ⇒   k3 (x, z) = x21 z12 + x22 z22 + 2x1 x2 z1 z2 = (xT z)2 .

(d) φ4 (x) = (x21 , x22 , √2 x1 x2 , √2 x1 , √2 x2 , 1)   ⇒
    k4 (x, z) = x21 z12 + x22 z22 + 2x1 x2 z1 z2 + 2x1 z1 + 2x2 z2 + 1 = (xT z + 1)2 .
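As a quick sanity check (not part of the text), the sketch below verifies numerically that the feature map φ4 in part (d) reproduces the kernel (xT z + 1)2; NumPy and the random test points are assumptions.

import numpy as np

def phi4(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
lhs = phi4(x) @ phi4(z)          # <phi_4(x), phi_4(z)> computed in the feature space R^6
rhs = (x @ z + 1.0)**2           # the kernel evaluated directly in R^2
print(np.isclose(lhs, rhs))      # True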

16.2.1 Properties of Kernels


Since kernels are more fundamental objects, we might consider selecting a kernel instead of selecting a
feature map. This raises an important question: what properties must a function k : X × X → R have in
order to be the kernel of some feature map? We address this question below.
The Gram matrix of a kernel kφ on the set of points {xi }m
i=1 is the m × m matrix

K = [<φ(xi ), φ(xj )>].


The Gram matrix is always a real, symmetric, positive semidefinite matrix. Symmetry is clear. To show that
it is positive semidefinite, let u ∈ Rm and compute
uT Ku = Σ_{i=1}^m Σ_{j=1}^m u(i)u(j)<φ(xi ), φ(xj )> = < Σ_{i=1}^m u(i)φ(xi ), Σ_{j=1}^m u(j)φ(xj )> ≥ 0.

Hence if we want to pick a function k : X × X → R such that k is the kernel of some feature map on X ,
then for every integer m ≥ 1 and every set of points {xj ∈ X }m j=1 , the matrix K = [k(xi , xj )] ∈ Rm×m

must be symmetric positive semidefinite. It turns out that this is also a sufficient condition for k to be the
kernel of some feature map on X .
Theorem 16.2.1. A function k : X × X → R is the kernel of some feature map φ : X → H, where H
is a Hilbert space, if and only if for every integer m ≥ 1 and every set of m points {xj ∈ X }m
j=1 , the
matrix K = [k(xi , xj )] is symmetric positive semidefinite.

Proof. We proved necessity above the statement of the theorem. The proof of sufficiency involves the
construction of a reproducing kernel Hilbert space H of functions with reproducing kernel k(·, ·). The
required feature map is then φ(xi ) = k(xi , ·) ∈ H. See, for example [42].
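The finite-point-set condition in Theorem 16.2.1 is easy to probe numerically. The hedged sketch below (NumPy assumed, candidate kernels chosen for illustration) builds K = [k(xi , xj )] on random points and checks symmetry and the sign of the smallest eigenvalue. Note that passing the test on one point set is only a necessary check, so it can refute but never certify a candidate kernel.

import numpy as np

def gram(k, X):
    # X holds the points as rows; returns K = [k(x_i, x_j)]
    m = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(m)] for i in range(m)])

def looks_psd(K, tol=1e-9):
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
k_gauss = lambda x, z: np.exp(-0.5 * np.sum((x - z)**2))   # a kernel (Gaussian)
k_bad = lambda x, z: -np.sum((x - z)**2)                   # not a kernel
print(looks_psd(gram(k_gauss, X)))    # True
print(looks_psd(gram(k_bad, X)))      # False: negative eigenvalues appear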
Together with the definition of a kernel and its immediate consequences, the following core properties
play a key role in identifying and constructing kernels.


Theorem 16.2.2 (Core Properties of Kernels). Kernels have the following properties:

(a) Nonnegative scaling: If k is a kernel on X and α ≥ 0, then αk is a kernel on X .

(b) Sum: If k1 , k2 are kernels on X , then k1 + k2 is a kernel on X .

(c) Product: If k1 , k2 are kernels on X , then k(x, x0 ) = k1 (x, x0 )k2 (x, x0 ) is a kernel on X .

(d) Tensor product: If k1 is a kernel on X and k2 is a kernel on Y, then

k((x, y), (x0 , y 0 )) = k1 (x, x0 )k2 (y, y 0 )

is a kernel on X × Y.

(e) Limits: If {kp }p≥1 are kernels on X and limp→∞ kp (x, z) = k(x, z), then k is a kernel on X .

Proof. In the following, m is an integer with m ≥ 1, {xj }m j=1 ⊂ X , and {yj }m j=1 ⊂ Y.
(a) K = [αk(xi , xj )] = α[k(xi , xj )] is symmetric positive semidefinite.
(b) The matrix K = [k1 (xi , xj )] + [k2 (xi , xj )] is symmetric positive semidefinite.
(c) Let K = [k1 (xi , xj )k2 (xi , xj )] = [k1 (xi , xj )] ⊗ [k2 (xi , xj )]. The result then follows by the Schur product
theorem (Theorem E.1.1).
(d) Consider the points {(xj , yj )}m j=1 ⊂ X ×Y. Let K = [k1 (xi , xj )k2 (yi , yj )] = [k1 (xi , xj )]⊗[k2 (yi , yj )].
The result then follows by the Schur Product Theorem (Theorem E.1.1).
(e) Let Kp = [kp (xi , xj )]. Then limp→∞ Kp = [limp→∞ kp (xi , xj )] = [k(xi , xj )]. So the limit limp→∞ Kp
exists. Now note that Kp is symmetric positive semidefinite and the limit of a sequence of symmetric positive
semidefinite matrices is symmetric positive semidefinite.
The following additional properties are easy corollaries of Theorem 16.2.2.
Corollary 16.2.1. If k is a kernel on X , then

(a) For any integer d ≥ 1, k(x, z)d is a kernel on X .

(b) For a polynomial q(s) with nonnegative coefficients, q(k(x, z)) is a kernel on X .

(c) ek(x,z) is a kernel on X .


Proof. Exercise.

16.3 Examples of Kernels


The elementary properties of kernels following from the definition, together with the properties listed in
Theorem 16.2.2 and Corollary 16.2.1, allow us to identify a variety of specific kernels. In each example
below, we denote the kernel by k, a matching feature map by φ, and a kernel matrix by K.
(a) k(x, z) = 1 on any set X .
This kernel results from the feature map φ(x) = 1. For any m points in X , the kernel matrix is
K = 1m 1Tm .
(b) k(x, z) = xT z on Rn .
This kernel results from the identity map φ(x) = x. For any m points in Rn arranged in the columns
of X ∈ Rn×m , the kernel matrix is K = X T X (the Gram matrix of X).


(c) k(x, z) = (xT z)d on Rn , d ∈ N.


This kernel is the d-th power of the kernel in (b). It is called the d-th order homogeneous polynomial
kernel. For n = 2 and d = 2, k(x, z) = (x(1)z(1) + x(2)z(2))2 . Expansion using the binomial
theorem yields
k(x, z) = x(1)2 z(1)2 + 2 x(1)z(1)x(2)z(2) + x(2)2 z(2)2 .

Hence we can set φ(x) = (x(1)2 , x(2)2 , √2 x(1)x(2)). For n = 2 and general d, the same expansion
yields

φ(x) = ( x(1)d , (d choose 1)1/2 x(1)d−1 x(2), (d choose 2)1/2 x(1)d−2 x(2)2 , . . . , (d choose 1)1/2 x(1) x(2)d−1 , x(2)d ).

For general n and d, we expand k(x, z) = (x(1)z(1) + x(2)z(2) + · · · + x(n)z(n))d using the
multinomial theorem, and use the pattern from above to obtain

φ(x) = ( (d choose k1 , . . . , kn )1/2 Π_{t=1}^n x(t)kt ).

The term inside the outer parentheses is the generic component of the vector. The components are
indexed by k1 , . . . , kn with Σ_{t=1}^n kt = d. For any m points in Rn arranged in the columns of X,
the kernel matrix is K = ⊗dj=1 (X T X).

(d) k(x, z) = (1 + xT z)d on Rn , d ∈ N.


This kernel is called the d-th order inhomogeneous polynomial kernel. It is the d-th power of the sum
of the kernels in (a) and (b). This can be mapped to the kernel in (c) by adding an extra dimension to
x and setting x(n + 1) = 1.

(e) k(x, z) = xT P z on Rn , where P ∈ Rn×n is a symmetric PSD matrix.

Let √P denote the symmetric square root of P . Then we can set φ(x) = √P x. For any m points in
Rn arranged in the columns of X, K = X T P X.

(f) k(x, z) = e−γkx−zk22 on Rn , with γ ≥ 0.

This kernel is called the Gaussian kernel. Writing

k(x, z) = e−γkxk22 e2γxT z e−γkzk22 = f (x)f (z) e2γxT z ,   where f (x) = e−γkxk22 ,

and using the results of Theorem 16.2.2 and Corollary 16.2.1, one sees that this is indeed a kernel. To
obtain a feature map, for x ∈ Rn consider
φ(x) = gx (θ) = √C e−2γkθ−xk22 ,        (16.2)

where the normalizing constant satisfies C ∫_Rn e−4γkθk22 dθ = 1. The function gx (θ) lies in the Hilbert

space L2 (Rn ) of real valued square integrable functions on Rn with inner product
<f (θ), g(θ)> = ∫_Rn f (θ)g(θ) dθ.        (16.3)


To check that φ is a feature map of the Gaussian kernel, use (16.3) to compute

k(x, z) = <φ(x), φ(z)>
        = C ∫_Rn e−2γkθ−xk22 e−2γkθ−zk22 dθ
        = C ∫_Rn e−2γ( 2θT θ − 2θT (x+z) + xT x + zT z ) dθ
        = e−2γ(xT x + zT z) eγ(x+z)T (x+z) C ∫_Rn e−4γ( θT θ − θT (x+z) + (1/4)(x+z)T (x+z) ) dθ
        = e−γ( xT x − 2xT z + zT z ) C ∫_Rn e−4γkθ − (1/2)(x+z)k22 dθ
        = e−γkx−zk22 .

(g) For any symmetric PD matrix P ∈ Rn×n , k(x, z) = e−(x−z)T P (x−z) on Rn .
The Gaussian kernel is a special case with P = γIn . A second special case is P = 1/2 diag(σj2 )−1 for
fixed scalars σj2 , j ∈ [1 : n]. This gives k(x, z) = e−(1/2) Σ_{j=1}^n (xj −zj )2 /σj2 (Exercise 16.31).

(h) For γ > 0, k(x, z) = e−γkx−zk2 on Rn .


This is called the Exponential kernel (Exercise 16.35).

(k) For γ > 0, k(x, z) = e−γkx−zk1 on Rn .


This is called the Laplacian kernel (Exercise 16.24).
The Gaussian, Exponential and Laplacian kernels arise from feature maps into an infinite dimensional
Hilbert space.
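In practice the kernel matrices above are computed directly from the data matrix, without ever forming feature vectors. The following is an illustrative NumPy sketch (made-up data, arbitrary parameter values) for the linear, inhomogeneous polynomial, and Gaussian kernels.

import numpy as np

rng = np.random.default_rng(6)
n, m = 5, 8
X = rng.normal(size=(n, m))                # columns are the points x_1, ..., x_m

G = X.T @ X                                # (b) linear kernel: K = X^T X
K_poly = (1.0 + G)**3                      # (d) inhomogeneous polynomial kernel, d = 3
sq = np.sum(X**2, axis=0)
D2 = sq[:, None] + sq[None, :] - 2.0 * G   # pairwise squared distances ||x_i - x_j||_2^2
K_gauss = np.exp(-0.5 * D2)                # (f) Gaussian kernel with gamma = 0.5
print(K_poly.shape, K_gauss.shape)         # both m x m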

16.4 Shift-Invariant, Isotropic, and Smoothed Kernels


16.4.1 Shift-Invariant Kernels
A kernel k(x, z) is shift-invariant if for some function f : Rn → R, k(x, z) = f (x − z). So k(x, z) depends
on the relative location of x and z, not on absolute position. Some examples include: the Gaussian kernel
k(x, z) = e−γkx−zk22 , the exponential kernel k(x, z) = e−γkx−zk2 , and the Laplacian kernel k(x, z) =
e−γkx−zk1 . Additional examples are given in the Exercises.

16.4.2 Isotropic kernels


A kernel k(x, z) on Rn is isotropic with respect to the Euclidean norm k · k2 on Rn if there exists a function
f : R → R such that k(x, z) = f (kx − zk2 ). So k(x, z) depends on the Euclidean distance between x and
z but not on direction. Examples of isotropic kernels include the Gaussian kernel k(x, z) = e−γkx−zk22 , and
the exponential kernel k(x, z) = e−γkx−zk2 . Additional examples are given in Exercise 16.36.

16.4.3 Smoothed Kernels


Some kernels have a free parameter which the user can select. For example, the Gaussian kernel k(x, z) =
e−tkx−zk22 , t ≥ 0, is a family of kernels indexed by the nonnegative parameter t. For a finite set of values
t1 , . . . , tj and nonnegative scalars α1 , . . . , αj , the weighted sum Σ_{i=1}^j αi e−ti kx−zk22 is also a kernel. This


remains true if we take a limit of weighted sums. If the weights are appropriately selected, this yields a
kernel formed as a weighted integral:
k(x, z) = ∫_a^b p(t) e−tkx−zk22 dt,

where 0 ≤ a ≤ b, and p(t) is a non-negative integrable function on [a, b]. This is called a smoothed kernel.
In this example, one can regard t (or more appropriately 1/t) as a length scale parameter. By scaling the
term kx − zk22 , t determines when points are regarded as close versus distant. Roughly, the division between
close and distant is determined by tkx − zk22 = 1. Smoothing over a range of t can incorporate a range of
length scales into the kernel.
More generally, one can prove the following result.

Theorem 16.4.1. Let k(t, x, z) be a family of kernels on Rn parameterized by t ∈ [a, b]. Assume that
for all x, y ∈ Rn , and t ∈ [a, b], |k(t, x, y)| ≤ B(x, y) < ∞. Then for any real-valued, non-negative,
integrable function p(t) defined over [a, b],
κ(x, z) = ∫_a^b p(t) k(t, x, z) dt,

is a kernel on Rn .
Proof. Let m ≥ 1, {xj }m j=1 ⊂ Rn , and set K = [κ(xi , xj )] = [ ∫_a^b p(t)k(t, xi , xj )dt ]. Clearly K is
symmetric. For any a ∈ Rm ,

aT Ka = ∫_a^b p(t) aT [k(t, xi , xj )] a dt = ∫_a^b p(t) ( Σ_{i=1}^m Σ_{j=1}^m ai aj k(t, xi , xj ) ) dt.

The term (. . . ) in the second integral is non-negative and bounded above by maxi,j B(xi , xj )kak21 . Since
p(t) is integrable, the integral is well defined, and since p(t) is non-negative, it yields a (finite) non-negative
number. Hence aT Ka ≥ 0. Thus K is PSD and κ(x, z) is a kernel.

Example 16.4.1. Here are some examples of smoothed kernels:


ka,b (x, z) = ∫_a^b e−t e−tkx−zk22 dt = ( e−a(1+kx−zk22 ) − e−b(1+kx−zk22 ) ) / ( 1 + kx − zk22 )

kb (x, z) = ( 1 − e−b(1+kx−zk22 ) ) / ( 1 + kx − zk22 )

k(x, z) = 1 / ( 1 + kx − zk22 ).
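The first closed form above can be checked numerically; the sketch below (SciPy quadrature, with arbitrary points and interval chosen for illustration) compares the integral defining ka,b with the stated expression.

import numpy as np
from scipy.integrate import quad

a, b = 0.5, 2.0
x = np.array([1.0, -0.5])
z = np.array([0.2, 0.3])
d2 = np.sum((x - z)**2)                    # ||x - z||_2^2

numeric, _ = quad(lambda t: np.exp(-t) * np.exp(-t * d2), a, b)
closed = (np.exp(-a * (1 + d2)) - np.exp(-b * (1 + d2))) / (1 + d2)
print(np.isclose(numeric, closed))         # True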

Notes
The material in this chapter is standard and can be found in most modern books and tutorials on machine
learning. See, for example, [42, Part III], [5, Chapter 6], [32, Chapter 14], [48, Chapter 11] and the tutorial
paper [21].


Exercises

Miscellaneous

Exercise 16.1. Prove the results in Corollary 16.2.1. Specifically, if k(x, z) is a kernel on X , then

(a) For any integer d ≥ 1, k(x, z)d is a kernel on X .

(b) For a polynomial q(s) with nonnegative coefficients, q(k(x, z)) is a kernel on X .

(c) ek(x,z) is a kernel on X .

Exercise 16.2. Let KX = {k : X × X → R, k is a kernel} denote the family of kernels on the set X . Show that KX
is a closed convex cone.

Exercise 16.3. Let A be a finite set and for each subset U ⊆ A let |U| denote the number of elements in U. For
U, V ⊂ A, let k(U, V) = |U ∩ V|. By finding a suitable feature map, show that k(·, ·) is a kernel on the power set
P(A) of all subsets of A.

Simple kernels on R

Exercise 16.4. For a ≥ 1, show that k(x, z) = axz is a kernel on R.

Exercise 16.5. Let a > 0 and L2 [0, a] denote the set of real valued square integrable functions on the interval [0, a].
L2 [0, a] is a Hilbert space under the inner product <g, h> = ∫_0^a g(s)h(s)ds. For f ∈ L2 [0, a], let g(t) = ∫_0^t f 2 (s)ds
and h(t) = ∫_t^a f 2 (s)ds, where t ∈ [0, a]. Show that

(a) k(x, z) = g(min(x, z)) is a kernel on [0, a]

(b) k(x, z) = h(max(x, z)) is a kernel on [0, a].

Exercise 16.6. Show that if for each a > 0, if k(x, z) is a kernel on [−a, a], then k(x, z) is a kernel on R.

Exercise 16.7. Show that for all x, z ∈ R, |x − z| = max(x, z) − min(x, z).

Exercise 16.8. Show that for each a > 0,

(a) k(x, z) = min(x, z) is a kernel on [0, a]

(b) k(x, z) = a − max(x, z) is a kernel on [0, a]

(c) k(x, z) = e−(max(x,z)−min(x,z)) is a kernel on [0, a]

(d) for each a > 0, and γ ≥ 0, k(x, z) = e−γ|x−z| is a kernel on [−a, a]

(e) for each γ ≥ 0, k(x, z) = e−γ|x−z| is a kernel on R

Exercise 16.9. Show that

(a) x2 + z 2 − (x − z)2 is a kernel on R.


(b) For γ ≥ 0, e−γ(x−z)2 is a kernel on R.


Shift-invariant kernels on R
It will be convenient to consider kernels on R taking complex values. In this case, the feature space H is a Hilbert
space over the complex field C, and for some feature map φ : X → H, k(x, z) = <φ(x), φ(z)>. Any kernel matrix
K is then required to be hermitian (K = K ∗ = K̄ T ) and PSD.
Exercise 16.10. A complex-valued function r : R → C is said to be positive semidefinite if for every integer m ≥ 1

and every set of m points {xj ∈ R, j ∈ [1 : m]}, the matrix K = [r(xi − xj )] is hermitian (K = K ∗ = K̄ T ) and
positive semidefinite (∀a ∈ Cm , aT K ā ≥ 0). Show that r(x − z) is a shift-invariant kernel if and only if r(t) is a
positive semidefinite function.
Exercise 16.11. If r is a positive semi-definite function, show that
(a) r(0) ≥ 0
(b) for all t ∈ R, r(−t) = r(t) (Hint: consider K for 2 points.)
(c) for all t ∈ R, r(0) ≥ |r(t)|
(d) for all s, t ∈ R, |r(t) − r(s)|2 ≤ r(0)|r(0) − r(t − s)| (Hint: consider K for 3 points t, s, 0.)
(e) if r(t) is continuous at t = 0, then r(t) is uniformly continuous on R
Exercise 16.12. Show that each of the following operations on PSD functions yields a PSD function.
(a) non-negative scaling: r(t) a PSD function and α ≥ 0, implies αr(t) is a PSD function.
(b) addition: r1 (t), r2 (t) PSD functions implies r1 (t) + r2 (t) is a PSD function.
(c) product: r1 (t), r2 (t) PSD functions implies r1 (t)r2 (t) is a PSD function.
(d) limit: rj (t), j ≥ 1, PSD functions and limj→∞ rj (t) = r(t) implies r(t) is a PSD function.
Exercise 16.13. Determine, with justification, which of the following functions are PSD:
(a) r(t) = eiωt , for each ω ∈ R
(b) r(t) = eiω1 t + eiω2 t , for general ω1 , ω2 ∈ R
(c) r(t) = cos(ωt), for each ω ∈ R
(d) r(t) = cos2 (ωt), for each ω ∈ R
(e) dn (t) = 1 + 2 Σ_{k=1}^n cos(kt) = sin((n + 1/2)t) / sin(t/2)
(f) r(t) = Σ_{k=−∞}^∞ αk eikωt , for ω ∈ R, αk ≥ 0, and pointwise series convergence
(g) r(t) = sin2 (ωt), for each ω ∈ R
(h) r(t) = 1 for −1/2 ≤ t ≤ 1/2, and 0 otherwise
(i) r(t) = |t|.

Exercise 16.14. (Simplified Bochner’s Theorem) Prove that if r̂(ω) is an integrable nonnegative function, then
r(t) = (1/2π) ∫_R r̂(ω) eiωt dω        (16.4)

is a continuous PSD function. If in addition, r̂(ω) is an even function, r(t) is real-valued.


Note that (16.4) expresses r(t) as the inverse Fourier transform of r̂(ω). If r̂(ω) is nonzero, it can be normalized to a
probability density. Hence, up to a positive scale factor, r(t) is the inverse Fourier transform of a probability density
on sinusoidal frequencies. Bochner’s theorem also states that this condition necessarily holds for a continuous PSD
function r(t). The proof of the second part of Bochner’s theorem is more advanced.
Exercise 16.15. Determine, with justification, which of the following functions are PSD:
(a) r(t) = sinc(t) ≜ sin(πt)/(πt)
(b) r(t) = e−|t|
(c) r(t) = e−(1/2)t2
(d) r(t) = t2 e−(1/2)t2
(e) r(t) = 1/(1 + t2 )
(f) r(t) = 1/(t4 + 2t2 + 1)


Exercise 16.16. Let L1 (R) denote the set of real-valued integrable functions on R. The convolution of f, g ∈ L1 (R)
is the function f ∗ g : R → R defined by
(f ∗ g)(t) = ∫_R f (s)g(t − s)ds.        (16.5)

If f, g ∈ L1 (R), then h = (f ∗ g) is well defined and is also in L1 (R). In addition, the Fourier transforms fˆ, ĝ, ĥ are
well defined, and ĥ(ω) = fˆ(ω)ĝ(ω) (convolution theorem).
Show that if r1 , r2 ∈ L1 (R) are continuous PSD functions, then r = (r1 ∗ r2 ) is a continuous PSD function.
Hence the convolution of integrable, continuous PSD functions is an integrable, continuous PSD function.

Shift-invariant kernels and autocorrelation functions on R

Exercise 16.17. Let L2 (R) denote the set of real-valued, square-integrable functions on R. L2 (R) is a Hilbert space
under the inner product <f, g> ≜ ∫_R f (s)g(s)ds. Define the autocorrelation function rh of h ∈ L2 (R) by

rh (τ ) = ∫_R h(s)h(s + τ )ds.        (16.6)

By determining a corresponding feature map, show that rh (x − z) is a shift-invariant kernel on R. Hence rh (τ ) is a


real-valued PSD function.

Exercise 16.18. The convolution f ∗ g given by (16.5) of f, g ∈ L2 (R) is well defined. The convolution operation is
commutative, f ∗ g = g ∗ f . It is also associative: (f ∗ g) ∗ h = f ∗ (g ∗ h), provided the stated convolutions are well

defined. For h ∈ L2 (R), let hr (t) = h(−t).
Show that:
(a) For h ∈ L2 (R), rh (t) = (h ∗ hr )(t). Hence for each h ∈ L2 (R), h ∗ hr is a shift-invariant kernel on R

(b) For g, h ∈ L2 (R), (g ∗ h)r = g r ∗ hr

(c) If g, h, rg , rh ∈ L2 (R), then rg ∗ rh = rg∗h , and hence rg ∗ rh is a shift-invariant kernel on R

Exercise 16.19. Every function h ∈ L2 (R) has a Fourier transform ĥ(ω) with ĥ(ω) a complex valued, square inte-
grable function on R. Moreover, the inverse Fourier transform of ĥ(ω) yields a function that is equal to h(t) almost
everywhere. Let rh (τ ) denote the autocorrelation function of h ∈ L2 (R).

(a) Show that rh (τ ) has a Fourier transform and r̂h (ω) = ĥ(ω)ĥ(−ω)

(b) Without using Bochner’s theorem, show that r̂h (ω) is a real-valued, even, non-negative, integrable function.
Exercise 16.20. Use the result of Exercise 16.17 to prove that for γ ≥ 0,
(a) k(x, z) = e−γ|x−z| is a kernel on R

(b) k(x, z) = (1 + γ|x − z|)e−γ|x−z| is a kernel on R


(c) k(x, z) = e−γ(x−z)2 is a kernel on R
Exercise 16.21. Use Exercise 16.17 to prove that k(x, z) = Ta (x − z) is a shift-invariant kernel on R, where a > 0
and

Ta (s) =  { 0,      s < −a;
          { s + a,  −a ≤ s < 0;        (16.7)
          { a − s,  0 ≤ s < a;
          { 0,      a ≤ s.


Exercise 16.22. Consider the real-valued integrable function


p(t) =  { 1,  t ∈ [−1/2, 1/2];        (16.8)
        { 0,  otherwise.

For each function below, sketch the function and determine (with a justification) if it is a shift-invariant kernel on R:
(a) r(t) = p(t) (c) r(t) = (p ∗ p ∗ p)(t)
(b) r(t) = (p ∗ p)(t) (d) r(t) = (p ∗ p ∗ p ∗ p)(t)

Kernels on Rn
Exercise 16.23. (Separable Kernels) Let κ(u, v) be a kernel on R. Show that the following functions are kernels on
Rn : (a) k(x, z) = Π_{i=1}^n κ(xi , zi ) and (b) k(x, z) = Σ_{i=1}^n κ(xi , zi ).
Exercise 16.24. For all γ ≥ 0, show that k(x, z) = e−γkx−zk1 is a separable kernel on Rn .
Exercise 16.25. For all γ ≥ 0, show that k(x, z) = e−γkx−zk22 is a separable kernel on Rn .
Exercise 16.26. Determine, with justification, which of the following functions are kernels on Rn :
(a) Σ_{j=1}^n eiωj (xj −zj ) , ωj ∈ R
(b) k(x, z) = Σ_{j=1}^n cos(xj − zj )
(c) k(x, z) = Σ_{j=1}^n cos2 (xj − zj )
(d) ei Σ_{j=1}^n ωj (xj −zj ) , ωj ∈ R
(e) Σ_{k=1}^q cos( Σ_{j=1}^n ωjq (xj − zj ) ), ωjq ∈ R
(f) cos( Σ_j xj − Σ_j zj )

Exercise 16.27. Let k(x, z) be a kernel on Rn . Show that the following function is a kernel on the same space:
k̃(x, z) =  { k(x, z) / √( k(x, x)k(z, z) ),  if k(x, x), k(z, z) 6= 0;
           { 0,  otherwise.

Exercise 16.28. Let kj be a kernel on X with feature map φj : X → Rq , j = 1, 2. In each part below, find a simple
feature map for the kernel k in terms of feature maps for the kernels kj . By this means, give an interpretation for the
new kernel k.
(a) k(x, z) = k1 (x, z) + k2 (x, z)
(b) k(x, z) = k1 (x, z)k2 (x, z)
(c) k(x, z) = k1 (x, z)/√( k1 (x, x)k1 (z, z) )
Exercise 16.29. Find a feature map for the 4th order homogeneous polynomial kernel on R3 .
Exercise 16.30. You want to learn an unknown function f : [0, 1] → R using a set of noisy measurements (xj , yj ),
with yj = f (xj ) + j , j ∈ [1 : m]. To do so, you plan to approximate f (·) by a Fourier series on [0, 1] with q ∈ N
terms:
fq (x) ≜ a0 /2 + Σ_{k=1}^q ( ak cos(2πkx) + bk sin(2πkx) ).

Then learn the coefficients ak , bk using regularized regression (see Exercise 9.17). Give an expression for the feature
map being used and determine its kernel in the simplest form.
Exercise 16.31. Let P ∈ Rn×n be symmetric PD. Show that k(x, z) = e−(1/2)(x−z)T P (x−z) is a kernel on Rn .
Exercise 16.32. Show that the following functions are kernels on Rn : (a) k(x, z) = kxk22 + kzk22 − kx − zk22 , and (b)
k(x, z) = kx + zk22 − kx − zk22 .
Exercise 16.33. Which of the following functions are kernels on Rn and why/why not?


(a) Let αj ≥ 0, Σ_{j=1}^k αj = 1, and {Pj }kj=1 ⊂ Rn×n be symmetric PSD. Set k(x, z) = e^( Σ_{j=1}^k αj (xT Pj z) ).
(b) k(x, z) = (x e−xT x )T (z e−zT z ).
(c) k(x, z) = 1/2 ( 1 + (x/kxk2 )T (z/kzk2 ) ).
(d) k(x, y) = Πni=1 |xi yi |.
(e) k(x, y) = ht (QT x)T ht (QT y) where Q ∈ Rn×n is orthogonal and ht is a thresholding function that maps
z = [zi ] to ht (z) = [z̃i ] with z̃i = zi if |zi | > t and 0 otherwise.

Smoothed kernels on Rn
Exercise 16.34. Let p(t) be a real-valued, non-negative, integrable function on [0, ∞). Show that the following function is a kernel on R^n:
k(x, z) = ∫_0^∞ p(t) e^{−t‖x−z‖_2^2} dt
Exercise 16.35. From ∫_{−∞}^∞ e^{−t^2} dt = √π one can derive the identity ∫_0^∞ e^{−(a^2 x^2 + b^2/x^2)} dx = (1/2)(√π/a) e^{−2ab} (see, e.g., [17]). Use this identity to show that k(x, z) = e^{−α‖x−z‖_2} is a kernel on R^n. Is k(x, z) = e^{−α‖x−z‖_2^{3/2}} a kernel on R^n?
Exercise 16.36. Let t ≥ 0 and σ > 0. Show that the following shift-invariant functions are kernels on R^n:
(a) k(x, z) = σ^2/(σ^2 + ‖x−z‖_2^2).
(b) k(x, z) = σ/(σ + ‖x−z‖_2).
(c) k(x, z) = σ/(σ + ‖x−z‖_1).


Chapter 17

Machine Learning with Kernels

We now explore using kernels to form nonlinear extensions of learning algorithms. For mathematical sim-
plicity, we consider feature maps φ : Rn → Rq for finite q. This allows us to continue to use vector and
matrix notation to derive a kernel extension of an existing machine learning method. The equations derived
will be useful in their own right but will also suggest the extension to infinite dimensional feature spaces
(e.g., a Gaussian kernel), but we omit the proofs in the infinite dimensional setting.
We begin with a general observation. Suppose we decide to employ a feature map φ : Rn → Rq ,
and then apply a known machine learning method in the feature space Rq . If the kernel function kφ (x, z)
is known, and can be efficiently evaluated, it gives a means of bypassing computing the feature vectors
φ(x), and the inner products of feature vectors, by directly computing inner products using the kernel:
k(x, z) = <φ(x), φ(z)>. Any machine learning algorithm using training and testing examples only within
inner products can be applied in feature space using such kernel evaluations. This gives a way to efficiently
extend various machine learning methods to include nonlinear feature maps. This is the path we explore.

17.1 Kernel SVM


We begin by examining how to use a feature map φ (or its kernel) with the linear support vector machine.
The appropriate place to focus is the dual SVM problem:

max_{α∈R^m}  1^T α − (1/2) α^T Z^T Z α
s.t.  y^T α = 0,  0 ≤ α ≤ C1.   (17.1)

The matrix Z = [yi xi ] ∈ Rn×m contains the m label-weighted examples as its columns. So Z T Z ∈ Rm×m
has i, j-th entry yi (xTi xj )yj . Notice that the training examples appear in the form of inner products.
We first examine how the SVM problem can be solved directly in the feature space. To do so, bring
in a feature map φ : Rn → Rq and map the training examples {(xi , yi )}m i=1 into feature space to obtain
m
{(φ(xi ), yi )}i=1 . Let φ(Z) denote the q × m matrix [yj φ(xj )]. Then in feature space, the dual SVM
problem requires the matrix

K = φ(Z)T φ(Z) = [yi yj <φ(xi ), φ(xj )>].


Feature Map Based SVM                           | Kernel Based SVM
A) Training                                     | A) Training
  1) Pick a map φ : R^n → H                     |   1) Pick a kernel k : R^n × R^n → R
  2) Compute φ(x_i), i ∈ [1 : m]                |   2) --
  3) Compute <φ(x_i), φ(x_j)>, i, j ∈ [1 : m]   |   3) Compute k(x_i, x_j), i, j ∈ [1 : m]
  4) Solve dual SVM problem                     |   4) Solve dual SVM problem
B) Classifying                                  | B) Classifying
  5) Compute φ(x)                               |   5) --
  6) Compute <φ(x_i), φ(x)>, i ∈ A              |   6) Compute k(x_i, x), i ∈ A
  7) Classify x                                 |   7) Classify x

Figure 17.1: A comparison of the operations required to: (left) compute the feature map of the training data, use the SVM to learn a classifier in the high dimensional space, then classify new data in the high dimensional space; versus: (right) picking a kernel and using that directly in the SVM to learn and subsequently classify in the original space.

Thus in feature space the dual linear SVM problem is simply

max_{α∈R^m}  1^T α − (1/2) α^T K α
s.t.  y^T α = 0,  0 ≤ α ≤ C1.   (17.2)

The label-weighted kernel matrix K is m × m with ij-th entry yi <φ(xi ), φ(xj )>yj . So the dual problem is
no larger, and in this sense no harder to solve. However, we need to spend extra time and energy computing
each φ(xi ) and the inner products <φ(xi ), φ(xj )> in the higher dimensional space to obtain K. If the
feature map has a known kernel function, then some of this work can be avoided by using direct evaluations
of the kernel function: K = [yi yj k(xi , xj )].
Once the dual problem is solved, let A = {i : αi? > 0} denote the indices of the support vectors. In the
feature map approach, a new test example x is then classified by computing
ŷ(x) = sign( ∑_{i∈A} α_i^⋆ y_i <φ(x_i), φ(x)> + b^⋆ ).

So we first compute φ(x) and then compute |A| feature space inner products <φ(x_i), φ(x)>, i ∈ A. Typically |A| is much less than the number of training examples. Alternatively, if the kernel function of φ is known, we can simply compute |A| kernel evaluations k(x_i, x), i ∈ A, to obtain
ŷ(x) = sign( ∑_{i∈A} α_i^⋆ y_i k(x_i, x) + b^⋆ ).

Computation in the higher dimensional feature space will generally be more expensive. Hence in terms of
computation, direct and efficient evaluation of the kernel k(x, z) is critical. The kernels of some useful fea-
ture maps can indeed be efficiently evaluated without requiring any computation in the higher dimensional
space. For example, the homogeneous and inhomogeneous quadratic kernels, polynomial kernels, and the
Gaussian kernel.
The above discussion reinforces an important idea. Rather than picking a feature map φ, instead pick
a kernel function that is useful, and efficiently computable in the original space. Then kernel evaluations
don’t require computation in the higher dimensional space. Sometimes this is called the “kernel trick”, but


it’s more of a clever observation than a trick. Figure 17.1 illustrates the idea for the SVM by contrasting the two ways of organizing the computational process: in feature space, and in the original space making direct use of a known kernel function.
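The sketch below illustrates the kernel-based path of Figure 17.1 in Python, assuming scikit-learn is available to solve the dual problem; the Gaussian kernel, the helper name gaussian_kernel_matrix, and the toy data are illustrative choices, not part of the text.

# A minimal sketch of the "pick a kernel" workflow of Figure 17.1, assuming
# scikit-learn is available. The Gaussian kernel is evaluated directly in the
# original space; no feature vectors are ever formed.
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel_matrix(X1, X2, gamma=1.0):
    # X1: n x m1, X2: n x m2 (examples as columns, as in the text).
    # Returns the m1 x m2 matrix of evaluations exp(-gamma ||x_i - z_j||^2).
    sq = (np.sum(X1**2, axis=0)[:, None] + np.sum(X2**2, axis=0)[None, :]
          - 2.0 * X1.T @ X2)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 40))                 # columns are training examples
y = np.sign(X[0] * X[1])                         # a nonlinearly separable +/-1 labeling

K_train = gaussian_kernel_matrix(X, X)           # step 3) of Figure 17.1
clf = SVC(C=1.0, kernel="precomputed").fit(K_train, y)   # step 4): solve the dual

X_test = rng.standard_normal((2, 5))
K_test = gaussian_kernel_matrix(X_test, X)       # step 6): k(x_i, x) for each test x
y_hat = clf.predict(K_test)                      # step 7): classify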

17.2 Kernel Ridge Regression


Consider a set of labelled training data {(xj , yj )}m n
j=1 with xj ∈ R and yj ∈ R, j ∈ [1 : m]. Set X =
[x1 , . . . , xm ] ∈ Rn×m and y = [y1 , . . . , ym ]T ∈ Rm . Recall that ridge regression uses this training data to
form a predictor ŷ(x) for a new test example x ∈ Rn by setting

w^⋆ = arg min_{w∈R^n} ‖y − X^T w‖_2^2 + λ‖w‖_2^2   (17.3)
ŷ(x) = w^{⋆T} x.   (17.4)

This can be solved in closed form yielding

w? = (λIn + XX T )−1 Xy (17.5)


ŷ(x) = y T X T (λIn + XX T )−1 x. (17.6)

Notice that in these formulas the examples appear as XX T rather than the Gram matrix X T X and x
does not appear in the form X T x. However, this is easily remedied. First we make a side observation. If
A ∈ Rn×n and B ∈ Rm×m are invertible and AM = M B for some M ∈ Rn×m , then A−1 M = M B −1
(exercise). Using this together with the simple equality (XX T + λIn )X = X(X T X + λIm ), we conclude
that
(XX T + λIn )−1 X = X(X T X + λIm )−1 .
Applying this equality to (17.5) and (17.6) yields

w^⋆ = X(λI_m + X^T X)^{-1} y   (17.7)
ŷ(x) = y^T (λI_m + X^T X)^{-1} X^T x.   (17.8)

These formulas indicate how a kernel can be introduced, and also remind us of some known properties of the
solution. First, by (17.7) we see that w^⋆ lies in the range of X. So there exists a vector a^⋆ ∈ R^m such that w^⋆ = Xa^⋆. Then from (17.8), ŷ(x) = a^{⋆T} X^T x = ∑_{j=1}^m a^⋆(j) x_j^T x. Hence the ridge regression predictor ŷ(x) is formed by taking a linear combination of the inner products x_j^T x. These properties were discussed previously - see below.

A Representer Theorem

The following theorem (an instance of a “representer theorem”) yields the above observations directly from
(17.3). The theorem was previously stated as Lemma 9.3.1 in Chapter 9 on least squares.

Theorem 17.2.1. The solution w^⋆ of the ridge regression problem (17.3) can be represented as w^⋆ = Xa^⋆, for some a^⋆ ∈ R^m, and the corresponding ridge regression predictor is given by
ŷ(·) = ∑_{j=1}^m a^⋆(j) <x_j, ·>.


Proof. For each w ∈ Rn write w = ŵ + w̃ where ŵ ∈ R(X) and w̃ ∈ R(X)⊥ . Then ky − X T wk22 =
ky −X T ŵk22 . So w and ŵ give the same value for the first term in the ridge regression objective. The second
term is kwk22 = kŵk22 + kw̃k22 ≥ kŵk22 . Hence w? ∈ R(X). So w? = Xa? for some a? ∈ Rm . The second
claim then follows from ŷ(x) = w? T x = (Xa? )T x.
By Theorem 17.2.1 we can substitute w = Xa into (17.3) to obtain the following problem for a^⋆:
a^⋆ = arg min_{a∈R^m} (1/2)‖y − X^T X a‖_2^2 + λ‖Xa‖_2^2.

This is a Tikhonov regularized least squares problem with corresponding normal equations
X T X(λIm + X T X)a? = X T Xy. (17.9)
It is clear that a? = (λIm + X T X)−1 y is a particular solution. Hence w? = X(λIm + X T X)−1 y. This
gives (17.7), from which (17.8) immediately follows.
If the columns of X are linearly dependent, a? is not unique. Indeed, if a? is a solution to (17.9), so is
a? + v for every v ∈ N (X). The good news is that the predictor is unique even when a? is not. This follows
by noting that w? defines the predictor and w? = Xa? . So the component of a? in N (X) can’t influence
the predictor.

Kernel Ridge Regression


Now bring in a feature map φ : Rn → Rq with kernel k(x, y) = <φ(x), φ(y)>. Let K denote the kernel
matrix for the training examples and k(x) denote the vector of kernel evaluations k(xi , x), i ∈ [1 : m], for a
test example x ∈ Rn . So
K = [k(xi , xj )] ∈ Rm×m and k(x) = [k(xi , x)] ∈ Rm .
Finally, let φ(X) = [φ(x1 ), . . . , φ(xm )] ∈ Rq×m . Using the same label vector y, the ridge regression
problem in feature space can be stated as:
w^⋆ = arg min_{w∈R^q} (1/2)‖y − φ(X)^T w‖_2^2 + λ‖w‖_2^2   (17.10)
ŷ(x) = w^{⋆T} φ(x).
Applying Theorem 17.2.1 we conclude that
w^⋆ = φ(X) a^⋆,  a^⋆ ∈ R^m   (17.11)
ŷ(x) = a^{⋆T} k(x).   (17.12)
The first equation indicates that determining w^⋆ requires a computation in feature space. The second equation follows by noting that ∑_{i=1}^m a^⋆(i) <φ(x_i), φ(x)> = ∑_{i=1}^m a^⋆(i) k(x_i, x) = a^{⋆T} k(x). Once we know a^⋆, this can be computed in the ambient space of the examples.
It only remains to determine a particular solution a? . To do so, substitute w = φ(X)a into (17.10) and
simplify. This yields the Tikhonov regularized problem

a? = arg minm 1/2ky − Kak22 + λk Kak22 ,
a∈R

with normal equations


K(λIm + K)a? = Ky. (17.13)
Note that a? = (λI + K)−1 y is a particular solution. Hence the unique kernel ridge regression predictor is
ŷ(x) = y T (λIm + K)−1 k(x). (17.14)
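A minimal NumPy sketch of the kernel ridge regression predictor (17.14) follows, using a Gaussian kernel; the helper names and the toy sine-curve data are illustrative assumptions, not part of the text.

import numpy as np

def rbf(X1, X2, gamma=1.0):
    # Gaussian kernel matrix for examples stored as columns.
    sq = (np.sum(X1**2, axis=0)[:, None] + np.sum(X2**2, axis=0)[None, :]
          - 2.0 * X1.T @ X2)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    # Solve (lam I + K) a = y rather than forming the inverse explicitly.
    K = rbf(X, X, gamma)                                  # m x m kernel matrix
    return np.linalg.solve(lam * np.eye(K.shape[0]) + K, y)

def kernel_ridge_predict(X, a, X_new, gamma=1.0):
    # yhat(x) = a^T k(x); columns of X_new are the test examples.
    return a @ rbf(X, X_new, gamma)

# Usage: learn a noisy sine curve on [0, 1].
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(1, 50))
y = np.sin(2 * np.pi * X[0]) + 0.1 * rng.standard_normal(50)
a = kernel_ridge_fit(X, y, lam=0.1, gamma=20.0)
yhat = kernel_ridge_predict(X, a, np.linspace(0, 1, 5).reshape(1, -1), gamma=20.0)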


17.3 Kernel PCA


Let the columns of X = [x_1, . . . , x_m] ∈ R^{n×m} be a set of centered data points in R^n (i.e., ∑_{i=1}^m x_i = 0), and let d be a positive integer with d ≤ rank(X). Recall that the PCA projection to dimension d seeks
U ∈ Rn×d with d orthonormal columns (i.e., U ∈ Vn,d ) so that the projection of the data onto the range of
U maximizes the captured variance of the data. This projection is U U T X and the variance of the projected
data is proportional to the trace of the (scaled) empirical covariance matrix:
trace( (U U^T X)(U U^T X)^T ) = trace(U^T X X^T U).


Hence to find the d-dimensional PCA projection we seek a solution U ? of the matrix Rayleigh quotient
problem (see (8.3)):
max_{U∈R^{n×d}}  trace(U^T X X^T U)
s.t.  U^T U = I_d.   (17.15)

We then project each example xi to its d-dimensional coordinates zi = U ? T xi with respect to U ? . Let
Z = [z1 , . . . , zm ] ∈ Rd×m be the matrix of these coordinates. Then
Z = U ? T X. (17.16)
Keep in mind that the objective of PCA dimensionality reduction is to compute Z. U ? is just an intermediate
variable.
By Theorem D.3.1, a solution U ? of (17.15) can be obtained by letting the columns of U ? be the leading
d eigenvectors of XX T (or equivalently the leading d left singular vectors of X). However, it is not imme-
diately clear how to add a kernel to this procedure since it involves XX T rather than X T X. To remedy this
we bring in the following representation result (a representer theorem).

Theorem 17.3.1. For every solution U ? of (17.15) there exists A? ∈ Rm×d such that U ? = XA? .

Proof. By Theorem D.3.1, U ? = Ue Q where the d orthonormal columns of Ue are eigenvectors of XX T ,


and Q is a d × d orthogonal matrix. Hence XX T Ue = Ue Λ, where the diagonal matrix Λ has the d largest
eigenvalues of XX T down the diagonal. Since d < rank(XX T ) = rank(X), Λ is invertible. It follows
that XX T (Ue Q) = Ue Q(QT ΛQ). Rearranging this yields U ? = X(X T Ue Λ−1 Q) = XA? .

We now use (17.16) and Theorem 17.3.1 to write


Z = U ? T X = A? T X T X = A? T G, (17.17)
where G = X T X is the m × m Gram matrix of the data. This looks more promising. If we can find A? ,
then we can perform PCA dimensionality reduction using the Gram matrix X T X. To find A? we substitute
the representation U = XA into the PCA problem (17.15) to obtain a problem in A:
max trace(AT G2 A)
A∈Rm×d (17.18)
s.t. AT GA = Id .

The change of variables V = √G A simplifies this to
max_{V∈R^{m×d}}  trace(V^T G V)
s.t.  V^T V = I_d.   (17.19)


This is another instance of a matrix Rayleigh quotient problem. One solution is obtained by selecting the columns of V^⋆ to be orthonormal eigenvectors of G for its d largest eigenvalues. This V^⋆ satisfies GV^⋆ = V^⋆Λ where Λ ∈ R^{d×d} is the diagonal matrix with the d largest eigenvalues of G listed in decreasing order on the diagonal. These are the same as the d largest eigenvalues of XX^T. From the change of coordinates we have V^⋆ = √G A^⋆ and hence GA^⋆ = √G V^⋆. Using √G V^⋆ = V^⋆ √Λ then yields
GA^⋆ = √G V^⋆ = V^⋆ √Λ.
Finally, substituting the above expression into (17.17) we obtain:
Z = √Λ V^{⋆T}.   (17.20)

In summary, PCA projection to d dimensions can be accomplished by finding the largest d eigenvalues Λ = diag(λ_1, . . . , λ_d) and corresponding orthonormal eigenvectors V^⋆ = [v_1, . . . , v_d] of the Gram matrix G = X^T X. The PCA projections of the data to dimension d are then given by
\[
Z = \begin{bmatrix} z_1 & z_2 & \ldots & z_m \end{bmatrix}
= \begin{bmatrix}
\sqrt{\lambda_1}\, v_1(1) & \sqrt{\lambda_1}\, v_1(2) & \ldots & \sqrt{\lambda_1}\, v_1(m)\\
\sqrt{\lambda_2}\, v_2(1) & \sqrt{\lambda_2}\, v_2(2) & \ldots & \sqrt{\lambda_2}\, v_2(m)\\
\vdots & \vdots & & \vdots\\
\sqrt{\lambda_d}\, v_d(1) & \sqrt{\lambda_d}\, v_d(2) & \ldots & \sqrt{\lambda_d}\, v_d(m)
\end{bmatrix}. \tag{17.21}
\]
Note that this alternative method of computing Z uses only the Gram matrix G = X^T X.

17.3.1 Kernel PCA

Now bring in a feature map φ : Rn → Rq with kernel k(·, ·). We seek the projection of the mapped data
{φ(xj )}m
j=1 in feature space onto its first d principal components. Normally the first step is to center the
data. For the moment assume this has been done. After discussing the main steps of computing a kernel
PCA projection, we show how to modify the procedure to include centering.
Let φ(X) = [φ(x1 ), . . . , φ(xm )] ∈ Rq×m , and K denote the m × m kernel matrix on the training data:

K = [k(xi , xj )] = [<φ(xi ), φ(xj )>] = φ(X)T φ(X)

Assuming the kernel function is given, the matrix K can be computed in the ambient space of the data.
To obtain the PCA projection from feature space to d dimensions we first solve problem (17.19) for
the data mapped into feature space. The only required modification of (17.19) is the replacement of G by
the kernel matrix K. A solution of the modified problem is obtained by letting Λ be the diagonal d × d
matrix with the largest eigenvalues of K listed in decreasing order down the diagonal and forming V ? from
the corresponding orthonormal eigenvectors of K. The desired PCA projection is then given by (17.20).
So the kernel PCA projection of the data is obtained by finding the leading d eigenvectors of the PSD
kernel matrix K and the corresponding eigenvalues λj , j ∈ [1 : d]. Then training example xj ∈ Rn is
projected to the point zj ∈ Rd by taking the j-th entries of the eigenvectors scaled by the square roots of
the corresponding eigenvalues as shown in (17.21). This only uses the m × m kernel matrix K and can be
computed in the ambient space of the examples.


Centering Data in Feature Space


We now return to the task of centering the data prior to computing the PCA projections. We want to handle
this without computation in the feature space. The centered feature space data is given by
φ̃(x_j) = φ(x_j) − (1/m) ∑_{i=1}^m φ(x_i),  j ∈ [1 : m].
We can write this as a matrix equation by noting that ∑_{i=1}^m φ(x_i) = φ(X) 1_m. Then the mean can be subtracted from each φ(x_j) to form the centered data as follows
φ̃(X) = φ(X) − (1/m) φ(X) 1_m 1_m^T = φ(X)(I_m − (1/m) 1_m 1_m^T).

So the kernel matrix of the centered training data in feature space is

K̃ = <φ̃(X), φ̃(X)>
= <φ(X)(Im − 1/m 1m 1Tm ), φ(X)(Im − 1/m 1m 1Tm )>
= (Im − 1/m 1m 1Tm )K(Im − 1/m 1m 1Tm )
= K − 1/m 1m 1Tm K − 1/m K1m 1m T + 1/m2 (1Tm K1m ) 1m 1Tm . (17.22)

Hence to perform kernel PCA with centered data we simply follow the method outlined previously using
the kernel matrix K̃ in place of K. The matrix K̃ can be computed directly from K without the need for
any computation in feature space.
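The following minimal NumPy sketch implements centered kernel PCA by combining (17.20)-(17.22); the function name kernel_pca and the clipping of tiny negative eigenvalues caused by round-off are implementation choices, not part of the text.

import numpy as np

def kernel_pca(K, d):
    # K: m x m kernel matrix of the training data; returns Z in R^{d x m},
    # whose j-th column is the d-dimensional projection of example x_j.
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m          # I_m - (1/m) 1 1^T
    K_tilde = J @ K @ J                           # centered kernel matrix (17.22)
    evals, evecs = np.linalg.eigh(K_tilde)        # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:d]             # pick the d largest
    lam, V = evals[idx], evecs[:, idx]
    return np.sqrt(np.maximum(lam, 0.0))[:, None] * V.T   # Z = sqrt(Lambda) V^T (17.20)

# Usage, given a kernel matrix K computed from the training examples:
# Z = kernel_pca(K, d=2)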

17.4 Kernel Nearest Centroid


Let {(xi , yi )}m
i=1 be a set of labelled training data with c classes. Recall that the nearest centroid classifier
first computes the sample mean for each class, then assigns to a new example x the class label of the
nearest sample mean. This aggregates the training data into just the c class sample means. Hence when a
classification is done, only c inner products need to be evaluated.
ŷ(x) = arg min_j ‖x − µ_j‖_2^2 = arg max_j { µ_j^T x − (1/2)‖µ_j‖_2^2 }.
Let there be mj examples in class j, and gather the examples in class j into the columns of a matrix Xj ,
j ∈ [1 : c]. Then the sample mean of class j is µj = 1/mj Xj 1, and the classifier can be written as

ŷ(x) = arg max_{j∈[1:c]} { µ_j^T x − (1/2) µ_j^T µ_j }
     = arg max_{j∈[1:c]} { (1/m_j) 1^T X_j^T x − (1/(2m_j^2)) 1^T X_j^T X_j 1 }   (17.23)

Now we implement the same classifier in feature space. Let φ : Rn → Rq be a feature map with kernel
k(·, ·), and φ(Xj ) ∈ Rq×mj be obtained by applying φ to each column of Xj . Using (17.23), the nearest
sample mean classifier in feature space can be written as

ŷ(x) = arg max_{j∈[1:c]} { (1/m_j) 1^T φ(X_j)^T φ(x) − (1/(2m_j^2)) 1^T φ(X_j)^T φ(X_j) 1 }
     = arg max_{j∈[1:c]} { (1/m_j) 1^T k_j(x) − (1/(2m_j^2)) 1^T K_j 1 },   (17.24)


where
kj (x) = φ(Xj )T φ(x) = [k(xi , x)] ∈ Rmj

denotes the vector of kernel evaluations using x and the examples xi in class j, and Kj = φ(Xj )T φ(Xj ) =
[k(xi1 , xi2 )] is the kernel matrix of the examples in class j. Equation (17.24) gives an expression for kernel
nearest centroid classification in terms of the kernel of the feature map. This is directly computable in the
original space. However, there is a penalty. For each classification, a kernel evaluation k(x_i, x) is done for
every training example. So we must now remember every example in the training set and each classification
requires m kernel evaluations.
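A minimal NumPy sketch of the kernel nearest centroid rule (17.24); the argument layout (one matrix X_j per class, examples as columns) mirrors the text, while the function names are illustrative assumptions.

import numpy as np

def kernel_nearest_centroid(kernel, class_blocks, x):
    # class_blocks[j] is the n x m_j matrix X_j of training examples in class j;
    # x is a single test example of shape (n,). Returns the predicted class index.
    scores = []
    for Xj in class_blocks:
        mj = Xj.shape[1]
        kj = kernel(Xj, x.reshape(-1, 1)).ravel()   # k_j(x) = [k(x_i, x)], x_i in class j
        Kj = kernel(Xj, Xj)                         # class-j kernel matrix
        scores.append(kj.sum() / mj - Kj.sum() / (2.0 * mj**2))   # (17.24)
    return int(np.argmax(scores))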

17.5 Kernel Nearest Neighbor


Consider a set of labeled training data {(xj , yj )}m n
j=1 with xj ∈ R and yj ∈ {1, . . . , c}, j ∈ [1 : m]. The
nearest neighbor classifier finds the closest point among the training examples to x and uses the label of that
point as the estimated label of x:

j^⋆ = arg min_{j∈[1:m]} ‖x − x_j‖_2^2
ŷ(x) = y_{j^⋆}.   (17.25)

Let’s see how to form a kernel version of the nearest neighbor classifier. We bring in a feature map
φ : Rn → Rq and then use nearest neighbor classification on the mapped training examples {(φ(xi ), yi )}m
i=1 .
This results in the classifier:

j^⋆ = arg min_{j∈[1:m]} ‖φ(x) − φ(x_j)‖_2^2
ŷ(x) = y_{j^⋆}.

The core computation (finding distances) can be written as

‖φ(x) − φ(x_j)‖_H^2 = <φ(x) − φ(x_j), φ(x) − φ(x_j)>
                    = <φ(x), φ(x)> − 2<φ(x), φ(x_j)> + <φ(x_j), φ(x_j)>
                    = k(x, x) − 2k(x_j, x) + k(x_j, x_j)

where k(·, ·) is the kernel function of φ. Noting that k(xj , xj ) = Kjj , where K = [k(xi , xj )] is the kernel
matrix of k on the training data, yields the following kernelized version of the nearest neighbor classifier:

j^⋆ = arg max_{j∈[1:m]} { 2k(x_j, x) − K_{jj} }
ŷ(x) = y_{j^⋆}.

For each xj , k(xj , ·) is a function from Rn into R. The kernelized nearest neighbor classifier evaluates each
of these functions at x and forms the estimated label of x as a function of these evaluations:

ŷ(·) = f (k(x1 , ·), . . . , k(xm , ·)).

Each test classification requires at most m kernel evaluations.
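A minimal NumPy sketch of the kernelized nearest neighbor rule above; the function signature and names are illustrative assumptions.

import numpy as np

def kernel_nearest_neighbor(kernel, X_train, y_train, x):
    # X_train: n x m training examples as columns; y_train: length-m labels;
    # x: a single test example of shape (n,).
    k_vec = kernel(X_train, x.reshape(-1, 1)).ravel()          # k(x_j, x), j = 1..m
    K_diag = np.array([kernel(X_train[:, [j]], X_train[:, [j]])[0, 0]
                       for j in range(X_train.shape[1])])      # K_jj = k(x_j, x_j)
    j_star = int(np.argmax(2.0 * k_vec - K_diag))
    return y_train[j_star]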


Notes
For extra reading, see [42] and [21]. Kernel PCA is first mentioned in [43] and in [30]. A more refined
perspective is given in [42] and [40].

Exercises
Exercise 17.1. A binary labelled set of data in R2 is used to learn a SVM using the homogeneous quadratic kernel.
By writing the equation for the decision boundary in terms of a quadratic form, reason about the types of decision
boundaries that are possible in R2 using this framework. In each case, give a neat sketch.
Exercise 17.2. The solution of kernel PCA projection is not unique. Characterize the set of solutions.
Exercise 17.3. Let {(xi , yi )}m
i=1 be a set of labeled training data. Suppose we select a kernel k(·, ·) and use this to
perform kernel PCA dimensionality reduction to dimension d. This yields a new training set {(zi , yi )}m i=1 . Is training
a linear SVM on the reduced dimension training data equivalent to using a kernel SVM on the original dataset? If so,
is the SVM kernel distinct from k?

Part I

Appendices

Appendix A

Vector Spaces

A.1 Definition
The concept of a vector space generalizes the algebraic structure of cartesian space Rn . A vector space
consists of a set X of vectors and a field F of scalars together with an operation of vector addition x + y for
x, y ∈ X and an operation of scalar multiplication αx for x ∈ X , α ∈ F. The operations of vector addition
and scalar multiplication must satisfy the following axioms.
For vector addition:

1. (∀x1 , x2 ∈ X ), x1 + x2 = x2 + x1

2. (∀x1 , x2 , x3 ∈ X ), x1 + (x2 + x3 ) = (x1 + x2 ) + x3

3. ∃0 ∈ X such that ∀x ∈ X , x + 0 = 0 + x = x

4. ∀x ∈ X , ∃ − x ∈ X such that x + (−x) = 0

These axioms ensure that X together with the operation of vector addition forms a commutative group.
For scalar multiplication:

1. (∀α, β ∈ F)(∀x ∈ X ), α(βx) = (αβ)x

2. (∀x, y ∈ X )(∀α ∈ F), α(x + y) = αx + αy

3. (∀x ∈ X )(∀α, β ∈ F), (α + β)x = αx + βx

4. (∀x ∈ X ), 1x = x & 0x = 0

We denote generic vector spaces by upper case script letters U, V, . . . ; generic vectors by lower case
roman letters u, v, . . . ; and generic scalars by lower case greek letters α, β, . . . . If extra clarity is required,
we will emphasize that u is a vector by writing u or ~u. However, usually, context will distinguish the
intended meaning.
Example A.1.1. Vector addition on Cn and scalar multiplication by α ∈ C is defined similarly to the
corresponding operations on Rn . Under these operations, the vectors Cn and scalars C form a vector space.
Example A.1.2. Vector addition on Rn×m and scalar multiplication by α ∈ R are defined by (2.2). Under
these operations, the set of matrices Rn×m and scalars R form a vector space.


A.2 Functions between Vector Spaces


Let f denote a function from a vector space X into vector space Y. We display this as f : X → Y and
denote the value of f at x ∈ X by f (x). The function f is:

a) onto (or surjective) if for each y ∈ Y there exists x ∈ X with f (x) = y. In this case, every point
in Y is the image of some point in X .

b) one-to-one (or injective) if for each x_1, x_2 ∈ X , with x_1 ≠ x_2, we have f (x_1) ≠ f (x_2). In this case,
no two distinct points in X map to the same point in Y.

c) invertible if f is both onto and one-to-one. In this case, there exists a function f −1 : Y → X such that
∀x ∈ X , f −1 (f (x)) = x and ∀y ∈ Y, f (f −1 (y)) = y.

A.2.1 Linear Functions and Vector Space Isomorphisms


A function f : Rm → Rn is linear if the following two properties hold:

(∀x1 , x2 ∈ Rm ) : f (x1 + x2 ) = f (x1 ) + f (x2 )


(∀α ∈ R)(∀x ∈ Rm ) : f (αx) = αf (x).

Sometimes we also call a linear function a linear map. More generally, if X and Y are vector spaces over
the same field, a function from X to Y that satisfies the above two properties is called a linear function or
linear map.
An invertible linear function is called an isomorphism [from the Greek isos (equal) and morphe (shape)].
Two vector spaces are isomorphic if there exists an isomorphism f from one to the other.
In this case, the two vector spaces have the same algebraic structure. The isomorphism f verifies this by
giving a correspondence between the vectors of the two spaces that matches (or respects) the vector space
operations.

Appendix B

Matrix Inverses and the Schur Complement

B.1 Block Matrices and the Schur Complement


Let A ∈ Rp×p and D ∈ Rq×q be square matrices and consider block matrices of the form

\[
M = \begin{bmatrix} A & B\\ C & D \end{bmatrix}. \tag{B.1}
\]

Assuming A is invertible, we can zero the block below A by left matrix multiplication:
\[
\begin{bmatrix} I_p & 0\\ -CA^{-1} & I_q \end{bmatrix}
\begin{bmatrix} A & B\\ C & D \end{bmatrix}
= \begin{bmatrix} A & B\\ 0 & D - CA^{-1}B \end{bmatrix}. \tag{B.2}
\]

This is just a 2 × 2 block matrix version of Gaussian elimination. Similarly, we can zero the block to the right of A by right matrix multiplication:
\[
\begin{bmatrix} A & B\\ C & D \end{bmatrix}
\begin{bmatrix} I_p & -A^{-1}B\\ 0 & I_q \end{bmatrix}
= \begin{bmatrix} A & 0\\ C & D - CA^{-1}B \end{bmatrix}. \tag{B.3}
\]

The same matrices operating together yield a block diagonal result:
\[
\begin{bmatrix} I_p & 0\\ -CA^{-1} & I_q \end{bmatrix}
\begin{bmatrix} A & B\\ C & D \end{bmatrix}
\begin{bmatrix} I_p & -A^{-1}B\\ 0 & I_q \end{bmatrix}
= \begin{bmatrix} A & 0\\ 0 & S_A \end{bmatrix}, \tag{B.4}
\]
where the q × q matrix
S_A = D − CA^{-1}B   (B.5)
is called the Schur complement of A in M .
The following lemma indicates that the matrices used above are easily invertible.

Lemma B.1.1. For all X ∈ R^{q×p} and Y ∈ R^{p×q}:
\[
\begin{bmatrix} I_p & 0\\ X & I_q \end{bmatrix}^{-1} = \begin{bmatrix} I_p & 0\\ -X & I_q \end{bmatrix}
\quad\text{and}\quad
\begin{bmatrix} I_p & Y\\ 0 & I_q \end{bmatrix}^{-1} = \begin{bmatrix} I_p & -Y\\ 0 & I_q \end{bmatrix}.
\]

Proof. Multiply.


We can use Lemma B.1.1 and (B.4) to factor M as follows
\[
\begin{bmatrix} A & B\\ C & D \end{bmatrix}
= \begin{bmatrix} I_p & 0\\ CA^{-1} & I_q \end{bmatrix}
\begin{bmatrix} A & 0\\ 0 & S_A \end{bmatrix}
\begin{bmatrix} I_p & A^{-1}B\\ 0 & I_q \end{bmatrix}.
\]

On the RHS, the first and third matrices are invertible and the submatrix A of M is assumed to be invertible.
Hence if SA is also invertible, then M must be invertible and the above equality can be used to find an
expression for M −1 . Alternatively, we can assume that M is invertible. Then SA must be invertible, and we
again arrive at an expression for M −1 .

Lemma B.1.2. If M and A are invertible, then S_A is invertible and
\[
\begin{bmatrix} A & B\\ C & D \end{bmatrix}^{-1}
= \begin{bmatrix} I_p & -A^{-1}B\\ 0 & I_q \end{bmatrix}
\begin{bmatrix} A^{-1} & 0\\ 0 & S_A^{-1} \end{bmatrix}
\begin{bmatrix} I_p & 0\\ -CA^{-1} & I_q \end{bmatrix}
= \begin{bmatrix} A^{-1} + A^{-1}B S_A^{-1} CA^{-1} & -A^{-1}B S_A^{-1}\\ -S_A^{-1} CA^{-1} & S_A^{-1} \end{bmatrix}. \tag{B.6}
\]

Proof. If M and A are invertible, then (B.4) and Lemma B.1.1 imply that SA is invertible. The result then
follows by taking the inverse of both sides of (B.4) and using Lemma B.1.1.

A parallel set of results arises by assuming that D is invertible. In this case,
\[
\begin{bmatrix} I_p & -BD^{-1}\\ 0 & I_q \end{bmatrix}
\begin{bmatrix} A & B\\ C & D \end{bmatrix}
\begin{bmatrix} I_p & 0\\ -D^{-1}C & I_q \end{bmatrix}
= \begin{bmatrix} S_D & 0\\ 0 & D \end{bmatrix}, \tag{B.7}
\]
where the p × p matrix
S_D = A − BD^{-1}C   (B.8)
is called the Schur complement of D in M. These derivations yield the following result.

Lemma B.1.3. If M and D are invertible, then S_D is invertible and
\[
\begin{bmatrix} A & B\\ C & D \end{bmatrix}^{-1}
= \begin{bmatrix} I_p & 0\\ -D^{-1}C & I_q \end{bmatrix}
\begin{bmatrix} S_D^{-1} & 0\\ 0 & D^{-1} \end{bmatrix}
\begin{bmatrix} I_p & -BD^{-1}\\ 0 & I_q \end{bmatrix}
= \begin{bmatrix} S_D^{-1} & -S_D^{-1}BD^{-1}\\ -D^{-1}C S_D^{-1} & D^{-1} + D^{-1}C S_D^{-1} B D^{-1} \end{bmatrix}. \tag{B.9}
\]

Proof. Exercise.

Lemmas B.1.2 and B.1.3 give distinct expressions for M −1 . So the result of applying any function to
each of these matrix expressions must yield the same result. This and other properties are explored in the
exercises below.
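As a quick sanity check, the following NumPy sketch verifies the block inverse formula (B.6) numerically on a random example; the diagonal shifts are only there to make the blocks comfortably invertible, and the variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
A = rng.standard_normal((p, p)) + 5 * np.eye(p)
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
D = rng.standard_normal((q, q)) + 5 * np.eye(q)

M = np.block([[A, B], [C, D]])
Ainv = np.linalg.inv(A)
SA = D - C @ Ainv @ B                       # Schur complement of A in M, (B.5)
SAinv = np.linalg.inv(SA)

Minv_blocks = np.block([
    [Ainv + Ainv @ B @ SAinv @ C @ Ainv, -Ainv @ B @ SAinv],
    [-SAinv @ C @ Ainv,                   SAinv]])          # RHS of (B.6)

print(np.allclose(Minv_blocks, np.linalg.inv(M)))           # True, up to round-off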

Notes
For further reading on block matrices and matrix identities see the comprehensive summary in [24, Appendix A].


Exercises
Exercise B.1. Let M be given by (B.1). Show that:
(a) M is invertible if and only if A and SA = D − CA−1 B are invertible.
(b) M is invertible if and only if D and SD = A − BD−1 C are invertible.
Exercise B.2. Let A ∈ R^{p×p} and D ∈ R^{q×q}. Assuming A or D is invertible, as appropriate, show that
\[
\det \begin{bmatrix} A & B\\ C & D \end{bmatrix}
= \begin{cases} \det(A)\,\det(D - CA^{-1}B)\\ \det(D)\,\det(A - BD^{-1}C). \end{cases}
\]

Exercise B.3. Let A ∈ Rp×p , D ∈ Rq×q , and B, C be conforming matrices as in (B.1).


(a) Assuming A and D, and at least one of SA or SD is invertible, derive the equality:

(A − BD−1 C)−1 = A−1 + A−1 B(D − CA−1 B)−1 CA−1 .

(b) Using (a), show that if A and D, and at least one of A + BDC or D−1 + CA−1 B is invertible, then:

(A + BDC)−1 = A−1 − A−1 B(D−1 + CA−1 B)−1 CA−1 .

Equalities of this form are called the Woodbury identity or Matrix Inversion Lemma.
Exercise B.4. Let P ∈ R^{n×n} have known inverse P^{-1} and u, v ∈ R^n. Derive the following identity:
(P + uv^T)^{-1} = P^{-1} − (P^{-1} u v^T P^{-1}) / (1 + v^T P^{-1} u).
Exercise B.5. Let P ∈ R^{n×n} have known inverse P^{-1} and U, V ∈ R^{n×r} with r ≪ n. Show that
(P + U V^T)^{-1} = P^{-1} − P^{-1} U (I + V^T P^{-1} U)^{-1} V^T P^{-1}.
What is the complexity of finding the inverse on the LHS compared with using the formula on the RHS?
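A minimal NumPy check of the rank-one identity in Exercise B.4 on a random example; the diagonal shift is only there to keep P well conditioned, and the names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 4
P = rng.standard_normal((n, n)) + 5 * np.eye(n)
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

Pinv = np.linalg.inv(P)
lhs = np.linalg.inv(P + u @ v.T)
rhs = Pinv - (Pinv @ u @ v.T @ Pinv) / (1.0 + float(v.T @ Pinv @ u))
print(np.allclose(lhs, rhs))    # True, up to round-off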

Appendix C

QR-Factorization

C.1 The Gram-Schmidt Procedure


Let U be a subspace of Rn with a given basis {a1 , . . . , ak }. An orthonormal basis {q1 , . . . , qk } for U can be
found using the Gram-Schmidt Procedure:
(1) r1 = a1 q1 = r1 /kr1 k
(2) r2 = (In − q1 q1T )a2 q2 = r2 /kr2 k
.. ..
. .
(j) rj = (In − j−1 T
P
i=1 qi qi )aj qj = rj /krj k
.. ..
. .
(k) rk = (In − k−1 T
P
i=1 qi qi )ak qk = rk /krk k
The first step selects a1 as the first vector in the new basis, and scales it to unit norm. The second step
projects a2 onto span{q1 } and forms the residual r2 . This residual cannot be zero, since a2 is not in the
span of a1 . The residual is then scaled to unit norm to produce q2 . Since the residual r2 is orthogonal to
q1 , we have q2 ⊥ q1 . At step j we project aj onto the span of q1 , . . . , qj−1 and subtract this from aj to
form the residual rj . This residual is orthogonal to qi , i < j. We then scale rj to form qj , and so on. At
each step we always have r_j ≠ 0, otherwise a_j would be in the span of a_1, . . . , a_{j−1} and this would violate the assumption of linear independence. Moreover, by construction we have r_j ⊥ {q_1, . . . , q_{j−1}}. Hence
qj ⊥ {q1 , . . . , qj−1 }.
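A minimal NumPy sketch of the procedure above, assuming the columns of A are linearly independent; the function name is illustrative.

import numpy as np

def gram_schmidt(A):
    # A: n x k with linearly independent columns. Returns Q with orthonormal columns.
    n, k = A.shape
    Q = np.zeros((n, k))
    for j in range(k):
        r = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])   # residual of a_j off span{q_1,...,q_{j-1}}
        Q[:, j] = r / np.linalg.norm(r)                   # scale the residual to unit norm
    return Q

# Usage: Q = gram_schmidt(A); Q.T @ Q is (numerically) the identity and
# the columns of Q span the same subspace as the columns of A.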
Theorem C.1.1. The set of vectors {q1 , . . . , qk } produced by the Gram-Schmidt procedure is an orthonor-
mal basis for the subspace U.

Proof. It is clear that {q_1, . . . , q_k} is an ON set. Hence we need only show that span{a_1, . . . , a_k} = span{q_1, . . . , q_k}. We first show that q_j ∈ span{a_1, . . . , a_k}, j ∈ [1 : k], using induction. Note that q_1 ∈ U. Now assume q_i ∈ U, i < j. Then r_j = a_j − ∑_{i=1}^{j−1} (q_i^T a_j) q_i ∈ U. Hence q_j ∈ U. We now show that a_j ∈ span{q_1, . . . , q_k}. From step (j) of the Gram-Schmidt procedure,
a_j = ‖r_j‖ q_j + ∑_{i=1}^{j−1} q_i q_i^T a_j = ∑_{i=1}^{j} r_{ij} q_i.   (C.1)
By taking the inner product of both sides of (C.1) with q_i we have
r_{ij} = q_i^T a_j,  1 ≤ i ≤ j.   (C.2)
Thus span{q_1, . . . , q_k} = span{a_1, . . . , a_k}.

C.2 QR-Factorization
The matrix version of the Gram-Schmidt procedure is called QR-factorization.
Theorem C.2.1. If A ∈ Rn×k has linearly independent columns, then there exists a matrix Q ∈ Vn,k and
an upper triangular and invertible matrix R ∈ Rk×k such that A = QR.

Proof. Let A = [a_1 . . . a_k] ∈ R^{n×k} with {a_1, . . . , a_k} a linearly independent set. Writing equations (C.1) in matrix form yields:
\[
A = \begin{bmatrix} a_1 & a_2 & \ldots & a_k \end{bmatrix}
= \begin{bmatrix} q_1 & q_2 & \ldots & q_k \end{bmatrix}
\begin{bmatrix}
r_{11} & r_{12} & \ldots & r_{1k}\\
0 & r_{22} & & r_{2k}\\
\vdots & & \ddots & \vdots\\
0 & \ldots & 0 & r_{kk}
\end{bmatrix} = QR, \tag{C.3}
\]
where Q ∈ V_{n,k} and the matrix R ∈ R^{k×k} is upper triangular with the terms r_{ij} given by (C.2). In particular, r_{ii} = ‖r_i‖_2 > 0, i ∈ [1 : k]. So R has positive diagonal entries and is invertible.

By construction, the columns of Q form an ON basis for the range of A. One can also see this from
R(A) = R(QR) and since R is invertible R(QR) = R(Q).
Theorem C.2.1 is a particular case of the following result.
Theorem C.2.2 (General QR-Factorization). Let A ∈ Rn×m and r = rank(A). Then A can be written in
the form
AP = QR
where P ∈ Rm×m is a permutation matrix, Q ∈ Vn,r , and the first r × r block of R ∈ Rr×m is upper
triangular and invertible.

Proof. Perform Gram-Schmidt as before, except when a_j ∈ span(a_1, . . . , a_{j−1}), add the coefficients to a matrix R̂ but do not add a new column to Q. This yields A = QR̂ where Q ∈ V_{n,r}, and R̂ ∈ R^{r×m} has a staircase form of the type
\[
\hat{R} = \begin{bmatrix}
\bullet & \times & \times & \times & \times & \times & \times & \times & \times\\
0 & \bullet & \times & \times & \times & \times & \times & \times & \times\\
0 & 0 & 0 & \bullet & \times & \times & \times & \times & \times\\
0 & 0 & 0 & 0 & \bullet & \times & \times & \times & \times\\
0 & 0 & 0 & 0 & 0 & 0 & \bullet & \times & \times
\end{bmatrix}.
\]
Here • indicates a positive entry, and × indicates a possibly nonzero entry. Now use a permutation matrix P to permute the columns of R̂ and A so the r linearly independent columns are first:
\[
R = \hat{R}P = \begin{bmatrix}
\bullet & \times & \times & \times & \times & \times & \times & \times & \times\\
0 & \bullet & \times & \times & \times & \times & \times & \times & \times\\
0 & 0 & \bullet & \times & \times & \times & \times & \times & \times\\
0 & 0 & 0 & \bullet & \times & \times & \times & \times & \times\\
0 & 0 & 0 & 0 & \bullet & \times & \times & \times & \times
\end{bmatrix}.
\]
The first r × r block of R is upper triangular with positive diagonal entries. So R has r linearly independent rows and hence rank(R) = r. Finally, AP = QR with R = R̂P of the required form.


C.2.1 The Time Complexity of QR-Factorization


If the columns of A ∈ R^{n×k} are linearly independent, then a QR-factorization of A requires at most ∑_{j=1}^k (jn + 2) = (1/2)k(k + 1)n + 2k multiply-add operations. In this case, computing a QR-factorization has O(kn) time complexity. When the columns of A ∈ R^{n×m} are not linearly independent, some computations are not required. But the time complexity remains O(mn).

C.2.2 The Time Complexity of SVD via QR-Factorization


An interesting situation that arises frequently is when A ∈ R^{n×k} with k ≪ n. Thus A is a tall and narrow matrix with rank r ≤ k. Directly computing a compact SVD of A has time complexity O(nk^2). However, first computing a QR-factorization A = QR with Q ∈ V_{n,r} and R ∈ R^{r×k}, then computing an SVD of R = UΣV^T, yields
A = QR = (QU)ΣV^T.
This is a compact SVD of A with U ∈ R^{r×r}, QU ∈ V_{n,r}, Σ ∈ R^{r×r}, and V ∈ R^{k×r}. The time complexity of this approach is O(nk + kr^2). This approach has a speed advantage roughly when k^2/(k−1) < n. This certainly holds when 1 < k ≪ n.
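A minimal NumPy sketch of this QR-then-SVD approach for a tall matrix with full column rank; np.linalg.qr and np.linalg.svd are used for the two factorizations, and the function name is illustrative.

import numpy as np

def svd_via_qr(A):
    # A: n x k with k << n and full column rank (for simplicity).
    Q, R = np.linalg.qr(A, mode="reduced")   # Q: n x k, R: k x k upper triangular
    U_r, s, Vt = np.linalg.svd(R)            # SVD of the small k x k factor
    return Q @ U_r, s, Vt                    # A = (Q U_r) diag(s) Vt

# Usage check:
# U, s, Vt = svd_via_qr(A); np.allclose(A, U @ np.diag(s) @ Vt) is True.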

Exercises
QR-Factorization
Exercise C.1. Let {a_1, . . . , a_k} ⊂ R^n be a linearly independent set of vectors. Show that the Gram-Schmidt procedure can be written in the form:
r_{11} q_1 = a_1
r_{22} q_2 = (I_n − Q_1 Q_1^T) a_2,    Q_1 = [q_1]
r_{33} q_3 = (I_n − Q_2 Q_2^T) a_3,    Q_2 = [q_1 q_2]
 ⋮
r_{kk} q_k = (I_n − Q_{k−1} Q_{k−1}^T) a_k,    Q_{k−1} = [q_1 · · · q_{k−1}].

Exercise C.2. Let A_1 ∈ R^{n×k} have rank k and QR-factorization A_1 = Q_1 R_1. Let A_2 ∈ R^{n×m} be such that A = [A_1 A_2] has linearly independent columns. Show that the QR-factorization of A is
\[
\begin{bmatrix} A_1 & A_2 \end{bmatrix}
= \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}
\begin{bmatrix} R_1 & U\\ 0 & \bar{R}_2 \end{bmatrix}
\]
where U = Q_1^T A_2 and Q_2 R̄_2 is a QR-factorization of (I − Q_1 Q_1^T) A_2.


Exercise C.3. Let A ∈ R^{m×n} have linearly independent columns and a given QR-factorization. We want to remove a contiguous block of columns of A and obtain a new QR-factorization.
(a) Write the QR-factorization of A as:
\[
A = \begin{bmatrix} A_1 & a & A_2 \end{bmatrix}
= \begin{bmatrix} Q_1 & q & Q_2 \end{bmatrix}
\begin{bmatrix} R_1 & b & U\\ 0 & r & c^T\\ 0 & 0 & R_2 \end{bmatrix}
\]
where a is a column we want to remove, and q is the corresponding column in Q. Show that the QR-factorization of [A_1 A_2] is
\[
\begin{bmatrix} A_1 & A_2 \end{bmatrix}
= \begin{bmatrix} Q_1 & \bar{Q}_2 \end{bmatrix}
\begin{bmatrix} R_1 & U\\ 0 & \bar{R}_2 \end{bmatrix}
\]
where Q̄_2 R̄_2 is a QR-factorization of q c^T + Q_2 R_2.
(b) Now write the given QR-factorization of A as:
\[
A = \begin{bmatrix} A_1 & A_2 & A_3 \end{bmatrix}
= \begin{bmatrix} Q_1 & Q_2 & Q_3 \end{bmatrix}
\begin{bmatrix} R_1 & U & V\\ 0 & R_2 & W\\ 0 & 0 & R_3 \end{bmatrix}
\]
where A_2 is the block of columns we want to remove and Q_2 is the block of corresponding columns in Q. Show that the QR-factorization of [A_1 A_3] is
\[
\begin{bmatrix} A_1 & A_3 \end{bmatrix}
= \begin{bmatrix} Q_1 & \bar{Q}_3 \end{bmatrix}
\begin{bmatrix} R_1 & V\\ 0 & \bar{R}_3 \end{bmatrix}
\]
where Q̄_3 R̄_3 is a QR-factorization of Q_2 W + Q_3 R_3.

Appendix D

Rayleigh Quotient Problems

For convenient reference, this Appendix gathers a set of Rayleigh quotient problems together in one place.

D.1 The Rayleigh Quotient


Let P ∈ R^{n×n} be a given symmetric positive semidefinite matrix, and consider the problem of selecting x ∈ R^n to maximize the quotient
(x^T P x)/(x^T x).
This is called a Rayleigh quotient. Since the quotient is invariant to a scaling of x, we can assume that ‖x‖_2 = 1. The problem can then be stated as the constrained optimization problem,
x^⋆ = arg max_{x∈R^n} x^T P x
s.t.  x^T x = 1.   (D.1)

Theorem D.1.1. Let the eigenvalues of P be λ1 ≥ λ2 ≥ · · · ≥ λn . Problem (D.1) has the optimal value
λ1 and this is achieved if and only if x? is a unit norm eigenvector of P for λ1 . If λ1 > λ2 , this solution is
unique up to the sign of x? .

Proof. We want to maximize xT P x subject to xT x = 1. Bring in a dual variable µ ∈ R and form the
Lagrangian L(x, µ) = xT P x + µ(1 − xT x). Take the derivative of L(x, µ) with respect to x. Setting the
derivative equal to zero yields the necessary condition P x = µx. Hence µ must be an eigenvalue of P with
x a corresponding eigenvector normalized so that xT x = 1. For such x, xT P x = µxT x = µ. Hence the
maximum achievable value of the objective is λ1 and this is achieved when u is a corresponding unit norm
eigenvector of P . Conversely, if u is any unit norm eigenvector of P for λ1 , then uT P u = λ1 and hence u
is a solution.

D.2 The Generalized Rayleigh Quotient


Given two symmetric matrices P, Q ∈ Rn×n , with P positive semidefinite and Q positive definite, consider
the problem of selecting x ∈ Rn to maximize the generalized Rayleigh quotient

(x^T P x)/(x^T Q x).
Since the quotient is invariant to a scaling of x, we can set x^T Q x = 1 and state the problem as
x^⋆ = arg max_{x∈R^n} x^T P x
s.t.  x^T Q x = 1.   (D.2)

The following lemma will be useful.

Lemma D.2.1. If P and Q are symmetric positive semidefinite matrices and Q is invertible, then Q−1 P has
real nonnegative eigenvalues.

Proof. A similarity transformation of a square matrix A leaves its eigenvalues invariant, i.e., for an invertible matrix B, A and BAB^{-1} have the same eigenvalues. Let Q^{1/2} denote the symmetric PD square root of Q. It follows from the above that Q^{-1}P and Q^{1/2}(Q^{-1}P)Q^{-1/2} = Q^{-1/2}PQ^{-1/2} have the same eigenvalues. The second matrix is symmetric PSD. Hence the eigenvalues of Q^{-1}P are real and nonnegative.

We can now state and prove the following result.

Theorem D.2.1 (Generalized Rayleigh Quotient). Denote the real nonnegative eigenvalues of Q−1 P by
λ1 ≥ · · · ≥ λn . Then the solution of problem (D.2) is given by any unit norm eigenvector x? for the
maximum eigenvalue λ1 of Q−1 P . If λ1 > λ2 , this solution is unique up to sign.

Proof. Let Q^{1/2} denote the symmetric PD square root of Q and set y = Q^{1/2}x. So x = Q^{-1/2}y. Making this substitution in (D.2) yields the equivalent problem
y^⋆ = arg max_{y∈R^n} y^T (Q^{-1/2} P Q^{-1/2}) y
s.t.  y^T y = 1.   (D.3)
By Theorem D.1.1, any unit norm eigenvector y^⋆ of Q^{-1/2}PQ^{-1/2} for its largest eigenvalue is a solution of (D.3). Then Q^{-1/2}PQ^{-1/2} y^⋆ = λ y^⋆ implies Q^{-1}P(Q^{-1/2}y^⋆) = λ(Q^{-1/2}y^⋆). It follows that x^⋆ = Q^{-1/2}y^⋆ is an eigenvector of Q^{-1}P with eigenvalue λ_1.
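A minimal NumPy sketch of solving (D.2) by the reduction used in the proof (forming the symmetric square root of Q); the function name is illustrative.

import numpy as np

def generalized_rayleigh_max(P, Q):
    # P symmetric PSD, Q symmetric PD. Returns (lambda_1, x_star) where x_star is
    # an eigenvector of Q^{-1} P for its largest eigenvalue, scaled so x^T Q x = 1.
    w, V = np.linalg.eigh(Q)
    Q_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T      # symmetric Q^{-1/2}
    M = Q_inv_sqrt @ P @ Q_inv_sqrt                        # Q^{-1/2} P Q^{-1/2}
    evals, evecs = np.linalg.eigh(M)                       # ascending eigenvalues
    y_star = evecs[:, -1]                                  # unit eigenvector for the largest
    x_star = Q_inv_sqrt @ y_star                           # x = Q^{-1/2} y, so x^T Q x = 1
    return evals[-1], x_star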

D.3 Matrix Rayleigh Quotient


Let P ∈ R^{n×n} be a given symmetric positive semidefinite matrix, and d ≤ rank(P). Consider the problem:
W^⋆ ∈ arg max_{W∈R^{n×d}} trace(W^T P W)
s.t.  W^T W = I_d.   (D.4)

This problem is a matrix Rayleigh quotient problem.

Theorem D.3.1. Let P ∈ Rn×n and d be specified as above. Then every solution of D.4 has the form
W ? = We Q where the columns of We are orthonormal eigenvectors of P for its d largest (hence nonzero)
eigenvalues, and Q is a d × d orthogonal matrix.


Proof. Form the Lagrangian L = trace(W^T P W) − trace(Ω(W^T W − I_d)). Here Ω is a real symmetric matrix. Setting the derivative of L with respect to W acting on V ∈ R^{n×d} equal to zero yields
0 = trace(2W^T P V − 2ΩW^T V) = 2<W^T P − ΩW^T, V>.
Since this holds for all V, we conclude that a solution W^⋆ must satisfy the necessary condition
P W^⋆ = W^⋆ Ω.   (D.5)
By the symmetry of Ω ∈ R^{d×d}, there exists an orthogonal matrix Q ∈ O_d such that Ω = QΛQ^T with Λ a diagonal matrix with the real eigenvalues of Ω listed in decreasing order on the diagonal. Substituting this expression into (D.5) and rearranging yields
P(W^⋆ Q) = (W^⋆ Q)Λ,   (D.6)
trace(Λ) = trace( (W^⋆ Q)^T P (W^⋆ Q) ) = trace(W^{⋆T} P W^⋆).   (D.7)
The last term in (D.7) is the optimal value of (D.4). Thus the optimal value of (D.4) is trace(Λ), and W^⋆ Q is also a solution of (D.4). Lastly, we note that by (D.6) the columns of W_e = W^⋆ Q are orthonormal eigenvectors of P. By optimality, the diagonal entries of Λ must be the d largest eigenvalues of P. Since d ≤ rank(P), all of these eigenvalues are positive.

D.4 Generalized Matrix Rayleigh Quotient


Let P, Q ∈ Rn×n be symmetric positive semidefinite matrices, with Q invertible, and let d ≤ rank(P ).
Consider the problem:
W ? ∈ arg max trace(W T P W )
W ∈Rn×d (D.8)
s.t. W T QW = Id .
This problem is a generalized matrix Rayleigh quotient problem.
Theorem D.4.1. Let P, Q ∈ Rn×n and d be specified as above. Then every solution of D.8 has the form
W ? = Ue R where the columns of Ue are eigenvectors of Q−1 P for its d largest eigenvalues, with the
eigenvectors scaled to ensure UeT QUe = Id , and R is a d × d orthogonal matrix.
1 1
Proof. Let Q 2 denote the symmetric PD root of Q, and set V = Q 2 W . In terms of V , problem D.8
becomes
1 1
V ? ∈ arg max trace(V T (Q− 2 P Q− 2 )V )
V ∈Rn×d (D.9)
s.t. V T V = Id .
By Theorem D.3.1, every solution of D.9 has the form V ? = Ve R where the columns of Ve are orthonormal
1 1
eigenvectors of Q− 2 P Q− 2 for its d largest (hence nonzero) eigenvalues, and R is a d × d orthogonal
1 1 1
matrix. It follows that W ? = Q− 2 Ve R. Note that Q− 2 P Q− 2 Ve = Ve Λ where Λ is a diagonal matrix with
1 1 1 1
the ordered d largest eigenvalues of Q− 2 P Q− 2 on the diagonal. Hence (Q−1 P )(Q− 2 Ve ) = (Q− 2 Ve )Λ.
1 1
The matrices Q−1 P and Q− 2 P Q− 2 are similar (see the proof of Lemma D.2.1), and hence have the same
1
eigenvalues. Thus the columns of Ue = Q− 2 Ve are eigenvectors of Q−1 P for its first d largest eigenvalues,
with the eigenvectors scaled to ensure UeT QUe = Id . Finally, W ? = Ue R, for R any d × d orthogonal
matrix.

Appendix E

Schur Product Theorem

Let ⊗ denote the Schur product of matrices. If A, B ∈ Rn×n are symmetric, it is clear that A ⊗ B is also
symmetric. What is less obvious is that if A, B are also positive semidefinite, so is A ⊗ B. This can been
shown using the following elementary properties of the Schur product. For u, v, x ∈ Rn ,

(uuT ) ⊗ (vv T ) = (u ⊗ v)(u ⊗ v)T (E.1)


(u ⊗ v)T x = uT (v ⊗ x) . (E.2)

The first of these properties shows that the stated claim holds for symmetric rank one matrices.

E.1 The Theorem and its Proof


Theorem E.1.1 (Schur Product Theorem). If A, B ∈ Rn×n are symmetric positive semidefinite, so is A⊗B.
Moreover, if A, B are symmetric positive definite, so is A ⊗ B.

Proof. By (E.1), the theorem holds if A and B have rank 1. More generally, compact SVDs give: A = ∑_{i=1}^{r_a} σ_i u_i u_i^T, B = ∑_{j=1}^{r_b} ρ_j v_j v_j^T. Hence using (E.1),
A ⊗ B = ( ∑_{i=1}^{r_a} σ_i u_i u_i^T ) ⊗ ( ∑_{j=1}^{r_b} ρ_j v_j v_j^T ) = ∑_{i=1}^{r_a} ∑_{j=1}^{r_b} σ_i ρ_j (u_i ⊗ v_j)(u_i ⊗ v_j)^T   (E.3)
Since σ_i ρ_j > 0, and (u_i ⊗ v_j)(u_i ⊗ v_j)^T is symmetric PSD, it follows that A ⊗ B is symmetric PSD.
Now assume that A and B are symmetric PD. So ra = rb = n and {ui }ni=1 and {vi }ni=1 are orthonormal
bases for Rn . Since A ⊗ B is symmetric PSD, we only need to show that xT (A ⊗ B)x = 0 implies x = 0.
From the SVD expansion (E.3), xT (A ⊗ B)x = 0 implies ∀i∀j, (ui ⊗ vj )T x = 0. Using (E.2), this means
that ∀i∀j, uTi (vj ⊗ x) = 0. Since {ui }ni=1 is an ON basis, it follows that ∀j, vj ⊗ x = 0. Now since {vj }nj=1
is a basis, for no index k can it be that ∀j, vj (k) = 0. Hence x = 0.
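A minimal NumPy check of the theorem on a random example: the entrywise (Schur) product of two symmetric PSD matrices has no negative eigenvalues, up to round-off. The construction of A and B is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
n = 5
Ra, Rb = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A, B = Ra @ Ra.T, Rb @ Rb.T             # symmetric PSD (in fact PD with probability 1)

H = A * B                               # entrywise (Schur) product
print(np.min(np.linalg.eigvalsh(H)))    # nonnegative, up to numerical round-off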

This statement and proof of the Schur product theorem follows that in the first edition of Horn and
Johnson. See also Horn and Johnson [22, p. 479].

Exercises
Exercise E.1. Prove the two results listed in equations (E.1) and (E.2).

Appendix F

Gaussian Random Vectors

Let the random vector X take values in Rn and have mean µ ∈ Rn and covariance matrix Σ ∈ Rn×n . We
say that X is a non-degenerate Gaussian random vector if Σ is positive definite, and X has the density

\[
f_X(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}. \tag{F.1}
\]
This is called a multivariate Gaussian density. The matrix K = Σ^{-1} is called the precision matrix of the density. Clearly, K is also a symmetric positive definite matrix.
We can use (F.1) to write,

ln f_X(x) = −(1/2)(x − µ)^T Σ^{-1}(x − µ) + C = −(1/2) x^T K x + x^T Kµ + C′,   (F.2)

where C and C 0 are constants that do not depend on x. In expression (F.2), the quadratic term specifies K,
and the linear term specifies Kµ. Since K is known from the quadratic term, Kµ specifies the mean µ. So
if fX (x) is known to be a Gaussian density, then the precision matrix K and the mean µ can be extracted
from the quadratic and linear terms in an expansion of ln fX (x). A converse result is given in the following
lemma.
Lemma F.0.1. If fX (x) is a density and ln fX (x) has the form (F.2), with K symmetric positive definite,
then fX (x) is a Gaussian density with precision matrix K and mean µ.

Proof. Exercise.

F.1 Jointly Gaussian Random Variables


Consider a Gaussian random vector Z that is partitioned into two component vectors X and Y, and concor-
dantly partition the mean and covariance of Z as follows
\[
Z = \begin{bmatrix} X\\ Y \end{bmatrix}, \quad
\mu = \begin{bmatrix} \mu_X\\ \mu_Y \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} \Sigma_X & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}. \tag{F.3}
\]

ΣXY is called the cross covariance of X and Y and ΣYX is called the cross covariance of Y and X. Clearly
ΣTXY = ΣYX . Since X and Y need not have the same dimensions, in general ΣXY is not a square matrix.


Let f_XY(x, y) denote the density of Z = (X, Y), and f_X(x) and f_Y(y) denote the marginal densities of X and Y, respectively. We will think of X as a random vector that generates an example, and Y as a random vector that generates its corresponding target value. Given the value of X, we want to predict the value of Y.

F.2 Some Preliminary Lemmas


Let Σ denote the symmetric positive definite matrix (F.3).
Lemma F.2.1. If Σ is symmetric positive definite then Σ_X and Σ_Y are symmetric positive definite.
Proof. Taking the transpose of Σ shows that Σ_X and Σ_Y are symmetric. For any x ∈ R^n and y ∈ R^q, with (x, y) ≠ 0, Σ PD implies that
\[
\begin{bmatrix} x^T & y^T \end{bmatrix}
\begin{bmatrix} \Sigma_X & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}
\begin{bmatrix} x\\ y \end{bmatrix} > 0.
\]
Hence for any x ≠ 0 and y = 0,
\[
\begin{bmatrix} x^T & 0^T \end{bmatrix}
\begin{bmatrix} \Sigma_X & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}
\begin{bmatrix} x\\ 0 \end{bmatrix} = x^T \Sigma_X x > 0.
\]
Hence Σ_X is PD. A similar argument shows that Σ_Y is PD.

Lemma F.2.2. If Σ is symmetric positive definite, then so is S_{Σ_X} = Σ_Y − Σ_{YX} Σ_X^{-1} Σ_{XY}. Moreover,
\[
\begin{bmatrix} \Sigma_X & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}^{-1}
= \begin{bmatrix} I & -\Sigma_X^{-1}\Sigma_{XY}\\ 0 & I \end{bmatrix}
\begin{bmatrix} \Sigma_X^{-1} & 0\\ 0 & S_{\Sigma_X}^{-1} \end{bmatrix}
\begin{bmatrix} I & 0\\ -\Sigma_{YX}\Sigma_X^{-1} & I \end{bmatrix}. \tag{F.4}
\]
Proof. By Lemma F.2.1, Σ_X is invertible. Hence, we can use the results in Appendix B to write
\[
\begin{bmatrix} I & 0\\ -\Sigma_{YX}\Sigma_X^{-1} & I \end{bmatrix}
\begin{bmatrix} \Sigma_X & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}
\begin{bmatrix} I & -\Sigma_X^{-1}\Sigma_{XY}\\ 0 & I \end{bmatrix}
= \begin{bmatrix} \Sigma_X & 0\\ 0 & S_{\Sigma_X} \end{bmatrix},
\]
where S_{Σ_X} = Σ_Y − Σ_{YX} Σ_X^{-1} Σ_{XY}. The LHS has the form A^T Σ A, with A invertible. Since Σ is symmetric PD, and A is invertible, A^T Σ A is also symmetric PD. Hence so is the matrix on the RHS. Thus S_{Σ_X} is symmetric PD. Taking the inverse of both sides of the above matrix equation and rearranging the results yields (F.4).

Lemma F.2.3.
ln f_XY(x, y) = −(1/2)(x − µ_X)^T Σ_X^{-1}(x − µ_X)
  − (1/2)(y − µ_Y − Σ_{YX}Σ_X^{-1}(x − µ_X))^T S_{Σ_X}^{-1} (y − µ_Y − Σ_{YX}Σ_X^{-1}(x − µ_X)) + C,   (F.5)
where C does not depend on x or y.

Proof. Since f_XY(x, y) is a non-degenerate Gaussian density, we can write
\[
\ln f_{XY}(x, y) = -\tfrac{1}{2}
\begin{bmatrix} x-\mu_X\\ y-\mu_Y \end{bmatrix}^T
\begin{bmatrix} \Sigma_X & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}^{-1}
\begin{bmatrix} x-\mu_X\\ y-\mu_Y \end{bmatrix} + C,
\]
where Σ is symmetric PD, and C does not depend on x or y. Then use (F.4) to write
\[
\ln f_{XY}(x, y) = -\tfrac{1}{2}
\begin{bmatrix} x-\mu_X\\ y-\mu_Y \end{bmatrix}^T
\begin{bmatrix} I & -\Sigma_X^{-1}\Sigma_{XY}\\ 0 & I \end{bmatrix}
\begin{bmatrix} \Sigma_X^{-1} & 0\\ 0 & S_{\Sigma_X}^{-1} \end{bmatrix}
\begin{bmatrix} I & 0\\ -\Sigma_{YX}\Sigma_X^{-1} & I \end{bmatrix}
\begin{bmatrix} x-\mu_X\\ y-\mu_Y \end{bmatrix} + C
\]
where S_{Σ_X} = Σ_Y − Σ_{YX}Σ_X^{-1}Σ_{XY}. Evaluating the right and left most products first, and using the symmetry of Σ_X we obtain
\[
\ln f_{XY}(x, y) = -\tfrac{1}{2}
\begin{bmatrix} x-\mu_X\\ (y-\mu_Y) - \Sigma_{YX}\Sigma_X^{-1}(x-\mu_X) \end{bmatrix}^T
\begin{bmatrix} \Sigma_X & 0\\ 0 & S_{\Sigma_X} \end{bmatrix}^{-1}
\begin{bmatrix} x-\mu_X\\ (y-\mu_Y) - \Sigma_{YX}\Sigma_X^{-1}(x-\mu_X) \end{bmatrix} + C
\]
= −(1/2)(x − µ_X)^T Σ_X^{-1}(x − µ_X) − (1/2)(y − µ_Y − Σ_{YX}Σ_X^{-1}(x − µ_X))^T S_{Σ_X}^{-1} (y − µ_Y − Σ_{YX}Σ_X^{-1}(x − µ_X)) + C.
This is the stated result.

F.3 The Marginal Densities fX (x) and fY (y)


We are now ready to say more about the marginal and conditional densities of fXY (x, y). The marginal
densities fX (x) and fY (y) are obtained by integrating fXY (x, y) over x and y, respectively. These densities
have the following properties.
Theorem F.3.1. If the joint density fXY (x, y) is a non-degenerate Gaussian, so are the marginal densities
fX (x) and fY (y). Moreover, the marginal means and covariances are µX , ΣX and µY , ΣY , respectively.

Proof. Using the expression (F.5) for ln f_XY(x, y) yields
f_XY(x, y) = C e^{−(1/2)(x−µ_X)^T Σ_X^{-1}(x−µ_X)} e^{−(1/2)(y−µ_Y−Σ_{YX}Σ_X^{-1}(x−µ_X))^T S_{Σ_X}^{-1}(y−µ_Y−Σ_{YX}Σ_X^{-1}(x−µ_X))}.
Integrating this expression over y gives f_X(x) = C C′ e^{−(1/2)(x−µ_X)^T Σ_X^{-1}(x−µ_X)}, where C′ does not depend on x or y. This proves the result for f_X(x). The result for f_Y(y) follows by a symmetric argument.

F.4 The Conditional Density fX|Y (x|y)


Of particular interest is the conditional density fY|X (y|x) = fXY (x, y)/fX (x). This density has the follow-
ing properties.
Theorem F.4.1. When f_XY(x, y) is a non-degenerate Gaussian density, so is the conditional density f_Y|X(y|x). Moreover, its mean µ_{Y|X} and covariance Σ_{Y|X} are given by
µ_{Y|X} = µ_Y + Σ_{YX}Σ_X^{-1}(x − µ_X)   (F.6)
Σ_{Y|X} = Σ_Y − Σ_{YX}Σ_X^{-1}Σ_{XY}.   (F.7)
Proof. Use ln f_XY(x, y) = ln f_Y|X(y|x) + ln f_X(x), and (F.5), to write
ln f_Y|X(y|x) = ln f_XY(x, y) − ln f_X(x)
  = −(1/2)(y − µ_Y − Σ_{YX}Σ_X^{-1}(x − µ_X))^T S_{Σ_X}^{-1} (y − µ_Y − Σ_{YX}Σ_X^{-1}(x − µ_X)) + C′
where C′ does not depend on x or y. As a function of y, f_Y|X(y|x) is a density, and ln f_Y|X(y|x) has the form (F.2). In addition, by Lemma F.2.2, S_{Σ_X}^{-1} is symmetric PD. Hence by Lemma F.0.1, f_Y|X(y|x) is a Gaussian density. Equations (F.6) and (F.7) then follow directly from the previous expression.

Notice that the conditional covariance of f_Y|X(y|x) does not depend on x. But, as expected, the conditional mean µ_{Y|X} does depend on x.
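A minimal NumPy sketch computing the conditional mean and covariance (F.6)-(F.7) from a partitioned mean and covariance; the function name and argument layout are illustrative assumptions.

import numpy as np

def gaussian_conditional(mu_x, mu_y, S_x, S_xy, S_y, x):
    # Returns the mean and covariance of Y given X = x, using (F.6) and (F.7).
    S_yx = S_xy.T
    mu_y_given_x = mu_y + S_yx @ np.linalg.solve(S_x, x - mu_x)          # (F.6)
    Sigma_y_given_x = S_y - S_yx @ np.linalg.solve(S_x, S_xy)            # (F.7)
    return mu_y_given_x, Sigma_y_given_x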


F.5 Maximum Likelihood for the Gaussian Density


Let X be a non-degenerate Gaussian random vector with density (F.1). If µ and Σ are unknown, then we can
try to learn these parameters using a set of training data {x_i ∈ R^n}_{i=1}^m, where each example is assumed to be drawn independently from the density f_X(x). Under the model (F.1), the joint density of the m independent draws from f_X is
\[
L(x_1, \ldots, x_m; \mu, \Sigma) = \prod_{i=1}^m f_X(x_i)
= \prod_{i=1}^m \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)}. \tag{F.8}
\]

When we fix the data x1 , . . . , xm , and regard µ and Σ as the variables, this function is called the likelihood
function. It measures the likelihood of the observed training data under each set of parameters. Maximum
likelihood estimation selects estimates for the unknown parameters by maximizing the likelihood function,
or equivalently by maximizing the log-likelihood ln(L). For (F.8), the log-likelihood is
ln(L) = −(1/2) m ln det(Σ) − (1/2) ∑_{i=1}^m (x_i − µ)^T Σ^{-1}(x_i − µ) + C,
where the constant C does not depend on the data, µ, or Σ. Thus the problem of maximizing the log-likelihood is equivalent to:
min_{µ∈R^n, Σ∈R^{n×n}}  J(µ, Σ) = m ln det(Σ) + ∑_{i=1}^m (x_i − µ)^T Σ^{-1}(x_i − µ)
s.t.  Σ is symmetric PD.   (F.9)

Problem (F.9) can be solved as follows. First set the derivative of J(µ, Σ) with respect to µ equal to
zero. This gives
D_µ J(µ, Σ)(h) = −∑_{i=1}^m [ h^T Σ^{-1}(x_i − µ) + (x_i − µ)^T Σ^{-1} h ]
             = −2 ( ∑_{i=1}^m (x_i − µ) )^T Σ^{-1} h
             = 0.
Since this holds for all h ∈ R^n, we have ∑_{i=1}^m (x_i − µ)^T Σ^{-1} = 0. Multiplying both sides of this expression by Σ and rearranging gives the maximum likelihood estimate
µ̂ = (1/m) ∑_{i=1}^m x_i.   (F.10)

This is just the empirical mean of the training data. Note that this expression does not depend on Σ.
We can now substitute µ̂ for µ in J(µ, Σ) to obtain a new objective that is only a function of Σ. It is convenient to do this by setting z_i = x_i − µ̂ and S = ∑_{i=1}^m z_i z_i^T. Noting that z_i^T Σ^{-1} z_i = trace(z_i z_i^T Σ^{-1}), gives ∑_{i=1}^m z_i^T Σ^{-1} z_i = trace(SΣ^{-1}). The new problem can now be written as
min_{Σ∈R^{n×n}}  J(Σ) = m ln det(Σ) + trace(SΣ^{-1})
s.t.  Σ is symmetric PD.   (F.11)


The symmetric matrices in Rn×n form a subspace of dimension (n + 1)n/2. The symmetric positive
semidefinite matrices constitute a closed subset C of this subspace and the symmetric positive definite ma-
trices form the interior of C. If F.9 has a positive definite solution, then this lies in the interior of C and we
can use calculus to try to find it.
To take the derivative of the objective function in (F.11) with respect to Σ we use the derivatives of the
functions f : Rn×n → R with f (M ) = det(M ) and g : Rn×n → Rn×n with g(M ) = M −1 (where M is
assumed to be invertible). Expressions for these derivatives for general M (not necessarily symmetric) were
given in Lemma 6.3.1 and Example 6.4.1. For convenience, these results are shown in the box below.

Matrix Derivatives For invertible M ∈ Rn×n :

(a) If g(M ) = M −1 , then Dg(M )(H) = −M −1 HM −1 .

(b) If f (M ) = det(M ), then Df (M )(H) = det(M ) trace(M −1 H).

Now we return to (F.11) and set the derivative of the objective function with respect to Σ equal to zero. This gives
DJ(Σ)(H) = m (1/det(Σ)) det(Σ) trace(Σ^{-1}H) − trace(SΣ^{-1}HΣ^{-1})
         = trace( (mΣ^{-1} − Σ^{-1}SΣ^{-1}) H )
         = 0.
Thus for all H, <mΣ^{-1} − Σ^{-1}SΣ^{-1}, H> = 0. It follows that mΣ^{-1} − Σ^{-1}SΣ^{-1} = 0. Multiplying both sides of this expression on the right and left by Σ and rearranging yields the candidate maximum likelihood estimate
Σ̂ = (1/m) S = (1/m) ∑_{i=1}^m (x_i − µ̂)(x_i − µ̂)^T.   (F.12)

This estimate is just the empirical covariance of the training data. It is symmetric and positive semidefinite
but it might fail to be positive definite. Assuming fX (x) is non-degenerate, if Σ̂ fails to be positive definite,
then we have used insufficient training data.
We have proved the following result.
Theorem F.5.1. Let {x_i}_{i=1}^m be independent samples from a non-degenerate multivariate Gaussian density with mean µ ∈ R^n and covariance Σ ∈ R^{n×n}. Then the maximum likelihood estimates of µ and Σ based on the given samples are
µ̂ = (1/m) ∑_{i=1}^m x_i   and   Σ̂ = (1/m) ∑_{i=1}^m (x_i − µ̂)(x_i − µ̂)^T.
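A minimal NumPy sketch of the estimates in Theorem F.5.1, with one sample per column of X; the function name is illustrative.

import numpy as np

def gaussian_mle(X):
    # X: n x m, one sample per column. Returns (mu_hat, Sigma_hat).
    m = X.shape[1]
    mu_hat = X.mean(axis=1, keepdims=True)     # (F.10): empirical mean
    Z = X - mu_hat                             # centered samples
    Sigma_hat = (Z @ Z.T) / m                  # (F.12): empirical covariance (1/m, not 1/(m-1))
    return mu_hat.ravel(), Sigma_hat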

Appendix G

Gershgorin Circle Theorem

One often needs to bound the eigenvalues of a matrix. A simple result that is often useful for this purpose is
Gershgorin’s Circle Theorem. It gives estimates for the locations of the eigenvalues in terms of discs in the
complex plane known to contain the eigenvalues. You will see Gershgorin’s name spelt in various ways, and
the theorem is sometimes called the Gershgorin disc theorem.

G.1 The Theorem and its Proof


Theorem G.1.1 (Gershgorin 1931). For A = [aij ] ∈ Cn×n let ri =
P
j6=i |aij |. Then every eigenvalue of
A lies in at least one of the discs

Di = {z : |z − aii | ≤ ri }, i = 1, . . . , n.

Proof. Let λ be an eigenvalue of A with eigenvector x ∈ C^n. Then λx − Ax = 0. Writing out the i-th row
of this equation and separating the a_ii term gives λx_i − a_ii x_i = ∑_{j≠i} a_ij x_j. Thus

    |λ − a_ii| |x_i| ≤ ∑_{j≠i} |a_ij| |x_j|.                              (G.1)

Now select i so that |x_i| = max_j |x_j|. Since x ≠ 0, |x_i| ≠ 0. Dividing both sides of (G.1) by |x_i| yields

    |λ − a_ii| ≤ ∑_{j≠i} |a_ij| (|x_j|/|x_i|) ≤ ∑_{j≠i} |a_ij|.

Hence λ lies in the Gershgorin disc D_i.

The Gershgorin disc D_i is centered at a_ii and its radius is the sum of the absolute values of the entries in
the i-th row of A, excluding the diagonal entry. If A is diagonal, say A = diag(a_11, . . . , a_nn), then the
eigenvalues are the diagonal entries; in this case the discs have radius zero and the eigenvalues sit exactly at
the disc centers.
One can apply the same reasoning to the columns of A by applying the theorem to A^T. Since A and A^T
have the same eigenvalues and the same diagonal entries, the only possible change is the radius of each
Gershgorin disc.
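As an illustration, the following numpy sketch (our own helper functions, not from the text) computes the row disc centers and radii and confirms that every computed eigenvalue lies in at least one disc.

import numpy as np

def gershgorin_discs(A):
    """Return (centers, radii) of the row Gershgorin discs of a square matrix A."""
    A = np.asarray(A)
    centers = np.diag(A)
    radii = np.abs(A).sum(axis=1) - np.abs(centers)   # row sums of |a_ij| excluding the diagonal
    return centers, radii

def eigenvalues_in_discs(A, tol=1e-12):
    """Check that every eigenvalue of A lies in at least one Gershgorin disc."""
    centers, radii = gershgorin_discs(A)
    return all(np.any(np.abs(lam - centers) <= radii + tol)
               for lam in np.linalg.eigvals(A))

# Small usage example; applying the same functions to A.T gives the column discs instead.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.0, 0.5, 2.0]])
print(gershgorin_discs(A))
print(eigenvalues_in_discs(A))   # True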


G.2 Examples
Example G.2.1. The 5 × 5 matrix

    A = [ 0.5000  −0.4000   0         0         0
          0.4500   0.4500   0         0         0
          0         0      −0.6500    0         0.3000
          0.1000    0       0        −0.2500    0.4000
          0.1000    0       0.1000   −0.3000   −0.3000 ]

has the Gershgorin discs indicated in Figure G.1. All eigenvalues of A (shown using ∗) are inside the unit
circle and all of the Gershgorin discs are within the unit circle. The same plot for A^T is shown on the right
of the figure. The eigenvalues and disc centers remain the same, but this time the Gershgorin discs are not
contained in the unit circle.

[Figure G.1 appears here: three plots of the complex plane (real axis vs. imaginary axis), each showing
Gershgorin discs, disc centers, eigenvalues, and the unit circle.]

Figure G.1: The Gershgorin discs for Examples G.2.1 and G.2.2. Left: For A and A^T in Example G.2.1. Right: For F
in Example G.2.2. Disc centers shown as • and eigenvalues as ∗. In all plots, the unit circle is indicated by the dashed
blue curve.

Example G.2.2. The symmetric matrix

    F = [ 0.5  0.5  0    0
          0.5  0.3  0.2  0
          0    0.2  0.4  0.4
          0    0    0.4  0.6 ]                                            (G.2)

has the Gershgorin discs shown on the right in Figure G.1. The matrix has one eigenvalue at 1 and all other
eigenvalues inside the unit circle.
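These claims are easy to check numerically; a small sketch (ours, not from the text) follows. Since each row of F sums to 1, the all-ones vector is an eigenvector with eigenvalue 1.

import numpy as np

F = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.3, 0.2, 0.0],
              [0.0, 0.2, 0.4, 0.4],
              [0.0, 0.0, 0.4, 0.6]])

print(np.allclose(F @ np.ones(4), np.ones(4)))   # True: rows sum to 1, so F 1 = 1
eigvals = np.linalg.eigvalsh(F)                  # real eigenvalues since F is symmetric
print(eigvals)                                   # one eigenvalue equals 1; the rest have magnitude < 1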

Exercises
Exercise G.1. Show that a symmetric matrix S ∈ R^{n×n} with 0 < ∑_{j≠i} |S_ij| < S_ii, i ∈ [1 : n], is PD.
Exercise G.2. Let P ∈ R^{n×n} be symmetric, have nonnegative entries, and satisfy P 1 = 1. Show that if P_ii > 1/2,
i ∈ [1 : n], then P is PD.

Appendix H

Maximum Likelihood Estimation

H.1 The Likelihood Function


Let X be a random vector with density f_X(x; θ) that depends on a parameter θ. Our objective is to learn
the value of the parameter θ using a set of training data {x_i ∈ R^n}_{i=1}^m, where each training example is drawn
independently from the density f_X(x; θ). The joint density of the training examples is hence

    f_{X_1,...,X_m}(x_1, . . . , x_m; θ) = ∏_{i=1}^m f_X(x_i; θ).         (H.1)

When we fix the data x_1, . . . , x_m and regard θ as the variable, this function is called the likelihood function
and is denoted by L(θ). It measures the likelihood of the training data under each value of the parameter.
Maximum likelihood estimation selects the value of the unknown parameter by maximizing the likelihood
function, or equivalently by maximizing the log-likelihood function ln L(θ). Notice that, subject only to the
assumed form of the density f_X, this method gives the data total control: the assumed form of the density
is the only means of regularizing the estimate.
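As an aside (a sketch of our own, not part of the text), when no closed form is convenient the log-likelihood can be maximized numerically. The example below assumes an exponential density f(x; λ) = λe^{−λx} and minimizes the negative log-likelihood with scipy; the result should match the closed-form estimate 1/x̄ derived in Example H.1.1 below.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=1000)    # i.i.d. samples, true lambda = 2.5

def neg_log_likelihood(lam):
    # -ln L(lambda) = -m ln(lambda) + lambda * sum_i x_i  for f(x; lambda) = lambda e^{-lambda x}
    return -x.size * np.log(lam) + lam * x.sum()

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())    # numerical maximizer vs. the closed-form estimate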

H.1.1 Maximum Likelihood for an Exponential Family


Assume the density f_X(x) takes the form

    f_X(x; θ) = (h(x)/Z(θ)) e^{<θ, t(x)>}.                                (H.2)

The likelihood function for θ given m i.i.d. draws from f_X is then L(θ) = (1/Z(θ)^m) ∏_{i=1}^m h(x_i) e^{<θ, t(x_i)>},
and the log likelihood is ln L(θ) = −m ln(Z(θ)) + ∑_{i=1}^m ln h(x_i) + <θ, ∑_{i=1}^m t(x_i)>. From the log likelihood
we see that the maximum likelihood estimate of the natural parameter θ is obtained by solving

    θ̂ = arg min_{θ∈Θ}  m ln(Z(θ)) − <θ, ∑_{i=1}^m t(x_i)>.               (H.3)

The objective function is convex, and strictly convex if the density parameterization is non-redundant. Hence
any local minimum is a global minimum. To obtain a solution we take the derivative of the log likelihood
and set it equal to zero. This gives

    ∇ ln(Z(θ)) = (1/m) ∑_{i=1}^m t(x_i).


We also know that ∇ ln(Z(θ)) = E_θ[t(X)]. Hence the maximum likelihood estimate of θ satisfies

    ∇ ln(Z(θ̂)) = E_{θ̂}[t(X)] = (1/m) ∑_{i=1}^m t(x_i).                   (H.4)
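To make (H.3) and (H.4) concrete, the following sketch (ours, not from the text) minimizes the convex objective in (H.3) numerically for the Bernoulli family of Example H.1.3 below, where Z(θ) = 1 + e^θ and t(x) = x, and compares the result with the closed-form estimate θ̂ = ln(p̂/(1 − p̂)).

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=500)             # i.i.d. Bernoulli(p = 0.3) samples

def objective(theta):
    # m ln Z(theta) - theta * sum_i t(x_i), with Z(theta) = 1 + e^theta and t(x) = x
    return x.size * np.log1p(np.exp(theta)) - theta * x.sum()

theta_hat = minimize_scalar(objective).x        # 1-D convex minimization (Brent's method)
p_hat = x.mean()
print(theta_hat, np.log(p_hat / (1 - p_hat)))   # the two estimates agree closely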

H.1.2 Examples
Example H.1.1 (Exponential Density). The scalar exponential density is f(x) = λe^{−λx}, x ∈ [0, ∞). Here
λ > 0 is a fixed parameter. This density has the form (H.2) with h(x) = 1, t(x) = −x, θ = λ, Z(θ) = 1/θ.
Using (H.4), the maximum likelihood estimate of θ given m i.i.d. examples drawn from the density must
satisfy ∇ ln(Z(θ)) = −1/θ = −(1/m) ∑_{i=1}^m x_i. Thus the maximum likelihood estimate is

    λ̂ = θ̂ = 1 / ((1/m) ∑_{i=1}^m x_i).
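A quick numerical check (our own sketch): the closed-form estimate 1/x̄ agrees with scipy's maximum likelihood fit of the exponential distribution when the location parameter is fixed at zero.

import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 4.0, size=2000)   # i.i.d. samples, true lambda = 4

lam_hat = 1 / x.mean()                          # closed-form ML estimate
loc, scale = expon.fit(x, floc=0)               # ML fit with location fixed at 0; scale = 1/lambda
print(lam_hat, 1 / scale)                       # the two estimates coincide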

Example H.1.2 (Poisson pmf). The Poisson pmf is f(k) = (λ^k / k!) e^{−λ}, k ∈ N. Here λ > 0 is a fixed
parameter. This density has the form (H.2) with h(k) = 1/k!, t(k) = k, θ = ln λ, Z(θ) = e^λ = e^{e^θ}. Using
(H.4), the maximum likelihood estimate of θ given m i.i.d. examples k_1, . . . , k_m drawn from the density must
satisfy ∇ ln(Z(θ)) = e^θ = λ = (1/m) ∑_{i=1}^m k_i. Thus the maximum likelihood estimate is

    λ̂ = (1/m) ∑_{i=1}^m k_i   and   θ̂ = ln λ̂.

Example H.1.3 (Bernoulli pmf). The Bernoulli pmf is given by f(x) = p^x (1 − p)^{1−x}, where p is a
parameter and x ∈ {0, 1}. This takes the form (H.2) with h(x) = 1, t(x) = x, θ = ln(p/(1 − p)), and
Z(θ) = 1 + e^θ. By (H.4), the maximum likelihood estimate of θ given m i.i.d. examples drawn from the
density is

    p̂ = e^{θ̂}/(1 + e^{θ̂}) = (1/m) ∑_{i=1}^m x_i   and   θ̂ = ln(p̂/(1 − p̂)).

Notice that p̂ is simply the fraction of successes in the m i.i.d. trials.


Example H.1.4 (Binomial pmf). The binomial pmf is f(k) = (n choose k) p^k (1 − p)^{n−k}, k ∈ [0 : n]. When the
number of trials n is fixed, the binomial pmf has the form (H.2) with θ = ln(p/(1 − p)), t(k) = k, h(k) = (n choose k),
and Z(θ) = (1 + e^θ)^n. By (H.4), the maximum likelihood estimate of θ given m i.i.d. examples k_1, . . . , k_m drawn from
the density is

    p̂ = e^{θ̂}/(1 + e^{θ̂}) = (1/(nm)) ∑_{i=1}^m k_i   and   θ̂ = ln(p̂/(1 − p̂)).


Example H.1.5 (Univariate Gaussian). As shown in Example 12.7.5, the univariate Gaussian density has
the form (H.2) with h(x) = 1, t(x) = (x, x²), θ = (µ/σ², −1/(2σ²)), and

    Z(θ) = √(2πσ²) e^{µ²/(2σ²)} = √(π/(−θ(2))) e^{−θ(1)²/(4θ(2))}.

The maximum likelihood estimate of θ given m i.i.d. examples drawn from the density satisfies (H.4) with

    (1/m) ∑_{i=1}^m t(x_i) = ( (1/m) ∑_{i=1}^m x_i , (1/m) ∑_{i=1}^m x_i² ).

We know that E_θ[t(X)] = (µ, σ² + µ²). Hence the maximum likelihood estimates of µ and σ² are

    µ̂ = (1/m) ∑_{i=1}^m x_i   and   σ̂² = (1/m) ∑_{i=1}^m x_i² − µ̂² = (1/m) ∑_{i=1}^m (x_i − µ̂)².

If desired, one can then obtain θ̂ using the known relationship between θ, µ and σ².
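In code (a sketch of ours, not from the text), these are just the sample mean and the biased (1/m) sample variance:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # i.i.d. samples, true mu = 2, sigma = 1.5

mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()         # equivalently np.var(x, ddof=0), the 1/m version
print(mu_hat, sigma2_hat)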
Example H.1.6 (Multivariate Gaussian). As shown in Example 12.7.6, the multivariate Gaussian density
has the form (H.2) with h(x) = 1, t(x) = (x, xx^T), θ = (Σ^{-1}µ, −(1/2)Σ^{-1}), and

    Z(θ) = (2π)^{n/2} |Σ|^{1/2} e^{(1/2) µ^T Σ^{-1} µ}.

Here t(x), θ ∈ R^n × S^n with the inner product <(x, M), (y, N)> = <x, y> + <M, N>. The maximum
likelihood estimate of θ given m i.i.d. examples drawn from the density satisfies (H.4) with

    (1/m) ∑_{i=1}^m t(x_i) = ( (1/m) ∑_{i=1}^m x_i , (1/m) ∑_{i=1}^m x_i x_i^T ).

Using E_θ[t(X)] = (µ, Σ + µµ^T) yields the maximum likelihood estimates of µ and Σ:

    µ̂ = (1/m) ∑_{i=1}^m x_i   and   Σ̂ = (1/m) ∑_{i=1}^m x_i x_i^T − µ̂µ̂^T = (1/m) ∑_{i=1}^m (x_i − µ̂)(x_i − µ̂)^T.

One can obtain θ̂ using the known relationship between θ, µ and Σ. You might like to contrast this deriva-
tion of the maximum likelihood estimates for a multivariate Gaussian with the derivation from scratch in
Appendix F.
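The algebraic identity used in the last display, (1/m) ∑ x_i x_i^T − µ̂µ̂^T = (1/m) ∑ (x_i − µ̂)(x_i − µ̂)^T, is easy to confirm numerically (our own sketch):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))                   # rows are samples x_i in R^3
m = X.shape[0]
mu_hat = X.mean(axis=0)

second_moment_form = (X.T @ X) / m - np.outer(mu_hat, mu_hat)
centered_form = ((X - mu_hat).T @ (X - mu_hat)) / m
print(np.allclose(second_moment_form, centered_form))   # True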

