GIUSEPPE CALAFIORE AND LAURENT EL GHAOUI
OPTIMIZATION MODELS
EXERCISES
CAMBRIDGE
Contents
2. Vectors
3. Matrices
4. Symmetric matrices
5. Singular Value Decomposition
6. Linear Equations
7. Matrix Algorithms
8. Convexity
9. Linear, Quadratic and Geometric Models
10. Second-Order Cone and Robust Models
11. Semidefinite Models
12. Introduction to Algorithms
13. Learning from Data
14. Computational Finance
15. Control Problems
16. Engineering Design
2. Vectors
Exercise 2.1 (Subspaces and dimensions) Consider the set S of points
such that
x1 + 2x2 + 3x3 = 0, 3x1 + 2x2 + x3 = 0.
Show that S is a subspace. Determine its dimension, and find a basis
for it.
Exercise 2.2 (Affine sets and projections) Consider the set in R3 de-
fined by the equation
P = { x ∈ R3 : x1 + 2x2 + 3x3 = 1 }.
1. Show that the set P is an affine set of dimension 2. To this end,
express it as x (0) + span( x (1) , x (2) ), where x (0) ∈ P , and x (1) , x (2)
are linearly independent vectors.
2. Find the minimum Euclidean distance from 0 to the set P , and a
point that achieves the minimum distance.
Exercise 2.3 (Angles, lines and projections)
1. Find the projection z of the vector x = (2, 1) on the line that passes
through x0 = (1, 2) and with direction given by vector u = (1, 1).
2. Determine the angle between the following two vectors:
x = (1, 2, 3),   y = (3, 2, 1).
Are these vectors linearly independent?
Exercise 2.4 (Inner product) Let x, y ∈ Rn . Under which condition
on α ∈ Rn does the function
f ( x, y) = ∑_{k=1}^n αk xk yk
define an inner product on Rn ?
Exercise 2.5 (Orthogonality) Let x, y ∈ Rn be two unit-norm vectors,
that is, such that k x k2 = kyk2 = 1. Show that the vectors x − y and
x + y are orthogonal. Use this to find an orthogonal basis for the
subspace spanned by x and y.
Exercise 2.6 (Norm inequalities)
1. Show that the following inequalities hold for any vector x:
(1/√n) k x k2 ≤ k x k∞ ≤ k x k2 ≤ k x k1 ≤ √n k x k2 ≤ n k x k∞ .
Hint: use the Cauchy–Schwarz inequality.
2. Show that for any nonzero vector x,
card( x ) ≥ k x k1² / k x k2² ,
where card( x ) is the cardinality of the vector x, defined as the num-
ber of nonzero elements in x. Find vectors x for which the lower
bound is attained.
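A quick numerical check of Exercise 2.6 can be reassuring before attempting the proofs. The sketch below (an illustration only, not part of the exercise) draws random sparse vectors and verifies both the norm chain of part 1 and the cardinality bound of part 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
for _ in range(5):
    x = rng.normal(size=n)
    x[rng.random(n) < 0.5] = 0.0          # zero out some entries to vary card(x)
    if not np.any(x):
        continue
    l1 = np.abs(x).sum()
    l2 = np.linalg.norm(x)
    linf = np.abs(x).max()
    eps = 1e-12
    # part 1: (1/sqrt(n))||x||_2 <= ||x||_inf <= ||x||_2 <= ||x||_1 <= sqrt(n)||x||_2 <= n||x||_inf
    assert l2 / np.sqrt(n) <= linf + eps
    assert linf <= l2 + eps and l2 <= l1 + eps
    assert l1 <= np.sqrt(n) * l2 + eps and np.sqrt(n) * l2 <= n * linf + eps
    # part 2: card(x) >= ||x||_1^2 / ||x||_2^2 (tight when all nonzero entries have equal magnitude)
    assert np.count_nonzero(x) >= l1 ** 2 / l2 ** 2 - eps
print("all norm inequalities and the cardinality bound hold on these samples")
```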
Exercise 2.7 (Hölder inequality) Prove Hölder’s inequality (2.4).
Hint: consider the normalized vectors u = x/k x k p , v = y/kykq ,
and observe that
| x> y | = k x k p k y kq · | u> v | ≤ k x k p k y kq ∑k | uk vk |.
Then, apply Young’s inequality (see Example 8.10) to the products
|uk vk | = |uk ||vk |.
Exercise 2.8 (Linear functions)
1. For an n-vector x, with n = 2m − 1 odd, we define the median of
x as the scalar value xa such that exactly m of the values in x are
≤ xa and m are ≥ xa (i.e., xa leaves half of the values in x to its left,
and half to its right). Now consider the function f : Rn → R, with
values f ( x ) = xa − (1/n) ∑_{i=1}^n xi . Express f as a scalar product, that is,
find a ∈ Rn such that f ( x ) = a> x for every x. Find a basis for the
set of points x such that f ( x ) = 0.
2. For α ∈ R2 , we consider the “power law” function f : R2++ → R,
with values f ( x ) = x1^α1 x2^α2 . Justify the statement: “the coefficients
αi provide the ratio of the relative error in f to the relative
error in xi ”.
Exercise 2.9 (Bound on a polynomial’s derivative) In this exercise,
you derive a bound on the largest absolute value of the derivative
of a polynomial of a given order, in terms of the size of the coeffi-
cients (see the discussion on regularization in Section 13.2.3 for an
application of this result). For w ∈ Rk+1 , we define the polynomial
pw , with values
pw ( x ) = w1 + w2 x + · · · + wk+1 x^k .
Show that, for any p ≥ 1,
∀ x ∈ [−1, 1] :   | dpw ( x ) / dx | ≤ C (k, p) kvk p ,
where v = (w2 , . . . , wk+1 ) ∈ Rk , and
C (k, p) = k if p = 1,   k^{3/2} if p = 2,   k(k + 1)/2 if p = ∞.
Hint: you may use Hölder’s inequality (2.4) or the results from Exer-
cise 2.6.
3. Matrices
Exercise 3.1 (Derivatives of composite functions)
1. Let f : Rm → Rk and g : Rn → Rm be two maps. Let h : Rn → Rk
be the composite map h = f ◦ g, with values h( x ) = f ( g( x )) for
x ∈ Rn . Show that the derivatives of h can be expressed via a
matrix–matrix product, as Jh ( x ) = J f ( g( x )) · Jg ( x ), where Jh ( x ) is
the Jacobian matrix of h at x, i.e., the matrix whose (i, j) element
is ∂hi ( x )/∂xj .
2. Let g be an affine map of the form g( x ) = Ax + b, for A ∈ Rm,n ,
b ∈ Rm . Show that the Jacobian of h( x ) = f ( g( x )) is
Jh ( x ) = J f ( g( x )) · A.
3. Let g be an affine map as in the previous point, let f : Rn → R (a
scalar-valued function), and let h( x ) = f ( g( x )). Show that
∇ x h( x ) = A> ∇ g f ( g( x )),
∇2x h( x ) = A> ∇2g f ( g( x )) A.
Exercise 3.2 (Permutation matrices) A matrix P ∈ Rn,n is a permu-
tation matrix if its columns are a permutation of the columns of the
n × n identity matrix.
1. For an n × n matrix A, we consider the products PA and AP. De-
scribe in simple terms what these matrices look like with respect
to the original matrix A.
2. Show that P is orthogonal.
Exercise 3.3 (Linear maps) Let f : Rn → Rm be a linear map. Show
how to compute the (unique) matrix A such that f ( x ) = Ax for every
x ∈ Rn , in terms of the values of f at appropriate vectors, which you
will determine.
Exercise 3.4 (Linear dynamical systems) Linear dynamical systems
are a common way to (approximately) model the behavior of physical
phenomena, via recurrence equations of the form (such models are the focus of Chapter 15)
x (t + 1) = Ax (t) + Bu(t),   y(t) = Cx (t),   t = 0, 1, 2, . . . ,
where t is the (discrete) time, x (t) ∈ Rn describes the state of the
system at time t, u(t) ∈ R p is the input vector, and y(t) ∈ Rm is the
output vector. Here, the matrices A, B, C are given.
1. Assuming that the system has initial condition x (0) = 0, ex-
press the output vector at time T as a linear function of u(0), . . .,
u( T − 1); that is, determine a matrix H such that y( T ) = HU ( T ),
where
U ( T ) = ( u(0), . . . , u( T − 1) )
contains all the inputs up to and including time T − 1.
2. What is the interpretation of the range of H?
Exercise 3.5 (Nullspace inclusions and range) Let A, B ∈ Rm,n be
two matrices. Show that the fact that the nullspace of B is contained
in that of A implies that the range of B> contains that of A> .
Exercise 3.6 (Rank and nullspace) Consider the image in Figure 3.1,
a gray-scale rendering of a painting by Mondrian (1872–1944). We
build a 256 × 256 matrix A of pixels based on this image by ignoring
grey zones, assigning +1 to horizontal or vertical black lines, +2 at
the intersections, and zero elsewhere. The horizontal lines occur at
row indices 100, 200 and 230, and the vertical ones at column indices
50, 230.
1. What is the nullspace of the matrix?
2. What is its rank?
(Figure 3.1: a gray-scale rendering of a painting by Mondrian.)
Exercise 3.7 (Range and nullspace of A> A) Prove that, for any matrix A ∈ Rm,n , it holds that
N ( A> A ) = N ( A ),   R ( A> A ) = R ( A> ).   (3.1)
Hint: use the fundamental theorem of linear algebra.
Exercise 3.8 (Cayley–Hamilton theorem) Let A ∈ Rn,n and let
p(λ) = det(λIn − A) = λn + cn−1 λn−1 + · · · + c1 λ + c0
be the characteristic polynomial of A.
1. Assume A is diagonalizable. Prove that A annihilates its own
characteristic polynomial, that is
p( A) = An + cn−1 An−1 + · · · + c1 A + c0 In = 0.
Hint: use Lemma 3.3.
2. Prove that p( A) = 0 holds in general, i.e., also for non-diagona-
lizable square matrices. Hint: use the facts that polynomials are
continuous functions, and that diagonalizable matrices are dense
in Rn,n , i.e., for any ε > 0 there exists ∆ ∈ Rn,n with k∆kF ≤ ε such
that A + ∆ is diagonalizable.
Exercise 3.9 (Frobenius norm and random inputs) Let A ∈ Rm,n be
a matrix. Assume that u ∈ Rn is a vector-valued random variable,
with zero mean and covariance matrix In . That is, E{u} = 0, and
E{uu> } = In .
1. What is the covariance matrix of the output, y = Au?
2. Define the total output variance as E{ky − ŷk22 }, where ŷ = E{y}
is the output’s expected value. Compute the total output variance
and comment.
Exercise 3.10 (Adjacency matrices and graphs) For a given undirected
graph G with no self-loops and at most one edge between any
pair of nodes (i.e., a simple graph), as in Figure 3.2, we associate
an n × n matrix A, such that
Aij = 1 if there is an edge between node i and node j, and Aij = 0 otherwise.
This matrix is called the adjacency matrix of the graph. (Figure 3.2:
an undirected graph with n = 5 vertices; its adjacency matrix is
A =
0 1 0 1 1
1 0 0 1 1
0 0 0 0 1
1 1 0 0 0
1 1 1 0 0 .)
1. Prove the following result: for a positive integer k, the matrix Ak
has an interesting interpretation: the entry in row i and column j
gives the number of walks of length k (i.e., a collection of k edges)
leading from vertex i to vertex j. Hint: prove this by induction on
k, and look at the matrix–matrix product Ak−1 A.
2. A triangle in a graph is defined as a subgraph composed of three
vertices, where each vertex is reachable from each other vertex
(i.e., a triangle forms a complete subgraph of order 3). In the
graph of Figure 3.2, for example, nodes {1, 2, 4} form a triangle.
Show that the number of triangles in G is equal to the trace of A3
divided by 6. Hint: For each node in a triangle in an undirected
graph, there are two walks of length 3 leading from the node to
itself, one corresponding to a clockwise walk, and the other to a
counter-clockwise walk.
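As a sanity check on both parts of Exercise 3.10 (a numerical illustration, not a proof), the sketch below uses the adjacency matrix of the graph in Figure 3.2 to count walks via powers of A and triangles via trace(A³)/6.

```python
import numpy as np

# adjacency matrix of the graph in Figure 3.2 (5 vertices)
A = np.array([[0, 1, 0, 1, 1],
              [1, 0, 0, 1, 1],
              [0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0]])

A2 = A @ A      # (A^2)[i, j] = number of walks of length 2 from vertex i+1 to vertex j+1
A3 = A2 @ A     # (A^3)[i, j] = number of walks of length 3

print("walks of length 2 from node 1 to node 2:", A2[0, 1])

# each triangle contributes 6 closed walks of length 3 (3 starting vertices, 2 directions)
print("number of triangles:", np.trace(A3) // 6)   # triangles {1,2,4} and {1,2,5}, hence 2
```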
Exercise 3.11 (Nonnegative and positive matrices) A matrix A ∈ Rn,n
is said to be non-negative (resp. positive) if aij ≥ 0 (resp. aij > 0) for
all i, j = 1, . . . , n. The notation A ≥ 0 (resp. A > 0) is used to denote
non-negative (resp. positive) matrices.
A non-negative matrix is said to be column (resp. row) stochastic,
if the sum of the elements along each column (resp. row) is equal to
one, that is if 1> A = 1> (resp. A1 = 1). Similarly, a vector x ∈ Rn
is said to be non-negative if x ≥ 0 (element-wise), and it is said to
be a probability vector, if it is non-negative and 1> x = 1. The set of
probability vectors in Rn is thus the set S = { x ∈ Rn : x ≥ 0, 1> x =
1}, which is called the probability simplex. The following points you
are requested to prove are part of a body of results known as the
Perron–Frobenius theory of non-negative matrices.
1. Prove that a non-negative matrix A maps non-negative vectors
into non-negative vectors (i.e., that Ax ≥ 0 whenever x ≥ 0), and
that a column stochastic matrix A ≥ 0 maps probability vectors
into probability vectors.
2. Prove that if A > 0, then its spectral radius ρ( A) is positive. Hint:
use the Cayley–Hamilton theorem.
3. Show that it holds for any matrix A and vector x that
| Ax | ≤ | A|| x |,
where | A| (resp. | x |) denotes the matrix (resp. vector) of moduli
of the entries of A (resp. x). Then, show that if A > 0 and λi , vi is
an eigenvalue/eigenvector pair for A, then
|λi ||vi | ≤ A|vi |.
4. Prove that if A > 0 then ρ( A) is actually an eigenvalue of A (i.e., A
has a positive real eigenvalue λ = ρ( A), and all other eigenvalues
of A have modulus no larger than this “dominant” eigenvalue),
and that there exists a corresponding eigenvector v > 0. Further,
the dominant eigenvalue is simple (i.e., it has unit algebraic mul-
tiplicity), but you are not requested to prove this latter fact.
Hint: For proving this claim you may use the following fixed-point
theorem due to Brouwer: if S is a compact and convex set in Rn (see
Section 8.1 for definitions of compact and convex sets), and f : S → S
is a continuous map, then there exists an x ∈ S such that f ( x ) = x.
Apply this result to the continuous map f ( x ) = Ax / (1> Ax ),
with S being the probability simplex (which is indeed convex and
compact).
5. Prove that if A > 0 and it is column or row stochastic, then its
dominant eigenvalue is λ = 1.
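Parts 1, 4 and 5 of Exercise 3.11 can be checked numerically on a random positive, column-stochastic matrix. The sketch below is only an illustration of the statements, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.random((n, n)) + 0.1        # positive matrix
A = A / A.sum(axis=0)               # normalize columns: column stochastic, 1^T A = 1^T

x = rng.random(n)
x = x / x.sum()                     # a probability vector
y = A @ x
print("A x is again a probability vector:", bool(np.all(y >= 0)), bool(np.isclose(y.sum(), 1.0)))

eigvals, eigvecs = np.linalg.eig(A)
i = np.argmax(np.abs(eigvals))      # dominant eigenvalue (spectral radius)
v = np.real(eigvecs[:, i])
v = v / v.sum()                     # scale the eigenvector so that it sums to one
print("dominant eigenvalue (should be 1 here):", np.round(np.real(eigvals[i]), 6))
print("corresponding eigenvector is positive:", bool(np.all(v > 0)))
```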
4. Symmetric matrices
Exercise 4.1 (Eigenvectors of a symmetric 2 × 2 matrix) Let p, q ∈ Rn
be two linearly independent vectors, with unit norm (k pk2 = kqk2 = 1).
Define the symmetric matrix A = pq> + qp> . In your derivations,
it may be useful to use the notation c = p> q.
1. Show that p + q and p − q are eigenvectors of A, and determine
the corresponding eigenvalues.
2. Determine the nullspace and rank of A.
3. Find an eigenvalue decomposition of A, in terms of p, q. Hint: use
the previous two parts.
4. What is the answer to the previous part if p, q are not normalized?
Exercise 4.2 (Quadratic constraints) For each of the following cases,
determine the shape of the region generated by the quadratic con-
straint x> Ax ≤ 1.
1. A = [ 2 1 ; 1 2 ].
2. A = [ 1 −1 ; −1 1 ].
3. A = [ −1 0 ; 0 −1 ].
Hint: use the eigenvalue decomposition of A, and discuss depending
on the sign of the eigenvalues.
Exercise 4.3 (Drawing an ellipsoid)
1. How would you efficiently draw an ellipsoid in R2 , if the ellipsoid
is described by a quadratic inequality of the form
E = { x : x> Ax + 2b> x + c ≤ 0 },
where A is 2 × 2 and symmetric, positive definite, b ∈ R2 , and
c ∈ R? Describe your method as precisely as possible.
2. Draw the ellipsoid
E = { x ∈ R2 : 4x1² + 2x2² + 3x1 x2 + 4x1 + 5x2 + 3 ≤ 1 }.
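One possible way to carry out the drawing (a sketch of my own, assuming A ≻ 0, not the only valid method): complete the square to find the center x0 = −A⁻¹b, rewrite the set as (x − x0)> A (x − x0) ≤ b> A⁻¹ b − c, and map the unit circle through the eigenvectors of A scaled by the semi-axis lengths.

```python
import numpy as np

def ellipse_points(A, b, c, num=200):
    """Boundary points of E = {x : x^T A x + 2 b^T x + c <= 0}, A symmetric positive definite."""
    x0 = -np.linalg.solve(A, b)              # center, from completing the square
    rhs = b @ np.linalg.solve(A, b) - c      # (x - x0)^T A (x - x0) <= rhs
    if rhs <= 0:
        raise ValueError("the set is empty or reduces to a point")
    lam, U = np.linalg.eigh(A)               # A = U diag(lam) U^T, lam > 0
    axes = np.sqrt(rhs / lam)                # semi-axis lengths along the eigenvectors
    t = np.linspace(0.0, 2.0 * np.pi, num)
    circle = np.vstack([np.cos(t), np.sin(t)])
    return (x0[:, None] + U @ (axes[:, None] * circle)).T

# the ellipsoid of part 2: 4 x1^2 + 2 x2^2 + 3 x1 x2 + 4 x1 + 5 x2 + 3 <= 1
A = np.array([[4.0, 1.5], [1.5, 2.0]])       # cross term 3 x1 x2 gives off-diagonal entries 3/2
b = np.array([2.0, 2.5])                     # linear term 4 x1 + 5 x2 corresponds to 2 b^T x
c = 3.0 - 1.0                                # move the right-hand side 1 into the constant
pts = ellipse_points(A, b, c)
print("center:", -np.linalg.solve(A, b))
# the rows of `pts` can be plotted, e.g. with matplotlib: plt.plot(pts[:, 0], pts[:, 1])
```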
Exercise 4.4 (Minimizing a quadratic function) Consider the uncon-
strained optimization problem
p∗ = min_x (1/2) x> Qx − c> x
where Q = Q> ∈ Rn,n , Q ⪰ 0, and c ∈ Rn are given. The goal of this
exercise is to determine the optimal value p∗ and the set of optimal
solutions, X opt , in terms of c and the eigenvalues and eigenvectors
of the (symmetric) matrix Q.
1. Assume that Q ≻ 0. Show that the optimal set is a singleton, and
that p∗ is finite. Determine both in terms of Q, c.
2. Assume from now on that Q is not invertible. Assume further
that Q is diagonal: Q = diag (λ1 , . . . , λn ), with λ1 ≥ . . . ≥ λr >
λr+1 = . . . = λn = 0, where r is the rank of Q (1 ≤ r < n). Solve
the problem in that case (you will have to distinguish between two
cases).
3. Now we do not assume that Q is diagonal anymore. Under what
conditions (on Q, c) is the optimal value finite? Make sure to ex-
press your result in terms of Q and c, as explicitly as possible.
4. Assuming that the optimal value is finite, determine the optimal
value and optimal set. Be as specific as you can, and express your
results in terms of the pseudo-inverse of Q (see Section 5.2.3).
Exercise 4.5 (Interpretation of covariance matrix) As in Example 4.2,
we are given m points x (1) , . . . , x (m) in Rn , and denote by Σ the sam-
ple covariance matrix:
Σ = (1/m) ∑_{i=1}^m ( x (i) − x̂ )( x (i) − x̂ )> ,
where x̂ ∈ Rn is the sample average of the points:
x̂ = (1/m) ∑_{i=1}^m x (i) .
We assume that the average and variance of the data projected along
a given direction do not change with the direction. In this exercise
we will show that the sample covariance matrix is then proportional
to the identity.
We formalize this as follows. To a given normalized direction
w ∈ Rn , kwk2 = 1, we associate the line with direction w passing
through the origin, L(w) = {tw : t ∈ R }. We then consider the
projection of the points x (i) , i = 1, . . . , m, on the line L(w), and look
at the associated coordinates of the points on the line. These projected
values are given by
ti (w) = arg min_t ktw − x (i) k2 ,   i = 1, . . . , m.
We assume that for any w, the sample average t̂(w) of the projected
values ti (w), i = 1, . . . , m, and their sample variance σ2 (w), are both
constant, independent of the direction w. Denote by t̂ and σ2 the
(constant) sample average and variance. Justify your answer to the
following questions as carefully as you can.
1. Show that ti (w) = w> x (i) , i = 1, . . . , m.
2. Show that the sample average x̂ of the data points is zero.
3. Show that the sample covariance matrix Σ of the data points is
of the form σ2 In . Hint: the largest eigenvalue λmax of the matrix
Σ can be written as: λmax = maxw {w> Σw : w> w = 1}, and a
similar expression holds for the smallest eigenvalue.
Exercise 4.6 (Connected graphs and the Laplacian) We are given a
graph as a set of vertices in V = {1, . . . , n}, with an edge joining any
pair of vertices in a set E ⊆ V × V. We assume that the graph is
undirected (without arrows), meaning that (i, j) ∈ E implies ( j, i ) ∈
E. As in Section 4.1, we define the Laplacian matrix by
Lij = −1 if (i, j) ∈ E,   d(i ) if i = j,   0 otherwise.
(Figure 4.3: example of an undirected graph.)
Here, d(i ) is the number of edges adjacent to vertex i. For example,
d(4) = 3 and d(6) = 1 for the graph in Figure 4.3.
1. Form the Laplacian for the graph shown in Figure 4.3.
2. Turning to a generic graph, show that the Laplacian L is symmet-
ric.
3. Show that L is positive-semidefinite, proving the following iden-
tity, valid for any u ∈ Rn :
u> Lu = q(u) = (1/2) ∑_{(i,j)∈E} ( ui − uj )² .
Hint: find the values q(ek ), q(ek ± el ), for two unit vectors ek , el such
that (k, l ) ∈ E.
4. Show that 0 is always an eigenvalue of L, and exhibit an eigenvector.
Hint: consider a matrix square-root of L (see Section 4.4.4).
5. The graph is said to be connected if there is a path joining any
pair of vertices. Show that if the graph is connected, then the zero
eigenvalue is simple, that is, the dimension of the nullspace of L
is 1. Hint: prove that if u> Lu = 0, then ui = u j for every pair
(i, j) ∈ E.
Exercise 4.7 (Component-wise product and PSD matrices) Let A, B
∈ Sn be two symmetric matrices. Define the component-wise product
of A, B, by a matrix C ∈ Sn with elements Cij = Aij Bij , 1 ≤ i, j ≤ n.
Show that C is positive semidefinite, provided both A, B are. Hint:
prove the result when A is rank-one, and extend to the general case
via the eigenvalue decomposition of A.
Exercise 4.8 (A bound on the eigenvalues of a product) Let A, B ∈
Sn be such that A ≻ 0, B ≻ 0.
1. Show that all eigenvalues of BA are real and positive (despite the
fact that BA is not symmetric, in general).
2. Let A ≻ 0, and let B−1 = diag( k a1> k1 , . . . , k an> k1 ), where ai> ,
i = 1, . . . , n, are the rows of A. Prove that
0 < λi ( BA) ≤ 1, ∀ i = 1, . . . , n.
3. With all terms defined as in the previous point, prove that
ρ( I − αBA) < 1, ∀α ∈ (0, 2).
Exercise 4.9 (Hadamard’s inequality) Let A ∈ Sn be positive semidef-
inite. Prove that
det A ≤ ∏_{i=1}^n aii .
Hint: Distinguish the cases det A = 0 and det A ≠ 0. In the latter
case, consider the normalized matrix à = DAD, where D =
diag( a11^{−1/2} , . . . , ann^{−1/2} ), and use the geometric–arithmetic mean
inequality (see Example 8.9).
Exercise 4.10 (A lower bound on the rank) Let A ∈ Sn+ be a sym-
metric, positive semidefinite matrix.
1. Show that the trace, trace A, and the Frobenius norm, k AkF , de-
pend only on its eigenvalues, and express both in terms of the
vector of eigenvalues.
2. Show that
(trace A)2 ≤ rank( A)k Ak2F .
3. Identify classes of matrices for which the corresponding lower
bound on the rank is attained.
Exercise 4.11 (A result related to Gaussian distributions) Let Σ ∈ Sn++
be a symmetric, positive definite matrix. Show that
∫_{Rn} e^{−(1/2) x> Σ−1 x} dx = (2π )^{n/2} √(det Σ) .
You may assume known that the result holds true when n = 1. The
above shows that the function p : Rn → R with (non-negative) val-
ues
p( x ) = ( 1 / ( (2π )^{n/2} √(det Σ) ) ) e^{−(1/2) x> Σ−1 x}
integrates to one over the whole space. In fact, it is the density func-
tion of a probability distribution called the multivariate Gaussian (or
normal) distribution, with zero mean and covariance matrix Σ. Hint:
you may use the fact that for any integrable function f , and invertible
n × n matrix P, we have
∫_{x∈Rn} f ( x ) dx = | det P| · ∫_{z∈Rn} f ( Pz) dz .
5. Singular Value Decomposition
Exercise 5.1 (SVD of an orthogonal matrix) Consider the matrix
A = (1/3) [ −1 2 2 ; 2 −1 2 ; 2 2 −1 ].
1. Show that A is orthogonal.
2. Find a singular value decomposition of A.
Exercise 5.2 (SVD of a matrix with orthogonal columns) Assume a
matrix A = [ a1 , . . . , am ] has columns ai ∈ Rn , i = 1, . . . , m that are
orthogonal to each other: ai> aj = 0 for 1 ≤ i ≠ j ≤ m. Find an SVD
for A, in terms of the ai s. Be as explicit as you can.
Exercise 5.3 (Singular values of augmented matrix) Let A ∈ Rn,m ,
with n ≥ m, have singular values σ1 , . . . , σm .
1. Show that the singular values of the (n + m) × m matrix
Ã = [ A ; Im ]   (A stacked on top of the m × m identity matrix)
are σ̃i = √(1 + σi²) , i = 1, . . . , m.
2. Find an SVD of the matrix Ã.
Exercise 5.4 (SVD of score matrix) An exam with m questions is gi-
ven to n students. The instructor collects all the grades in a n × m
matrix G, with Gij the grade obtained by student i on question j. We
would like to assign a difficulty score to each question, based on the
available data.
1. Assume that the grade matrix G is well approximated by a rank-
one matrix sq> , with s ∈ Rn and q ∈ Rm (you may assume that
both s, q have non-negative components). Explain how to use the
approximation to assign a difficulty level to each question. What
is the interpretation of vector s?
2. How would you compute a rank-one approximation to G? State
precisely your answer in terms of the SVD of G.
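For part 2, the best rank-one approximation of G in the Frobenius (or spectral) norm is obtained from the leading singular triplet. The sketch below illustrates this on a synthetic grade matrix (the data is made up purely for demonstration).

```python
import numpy as np

rng = np.random.default_rng(2)
# synthetic grades: student "ability" times question "easiness", plus a little noise
ability = rng.uniform(0.5, 1.0, size=8)       # one value per student
easiness = rng.uniform(0.3, 1.0, size=5)      # one value per question
G = np.outer(ability, easiness) + 0.02 * rng.normal(size=(8, 5))

U, S, Vt = np.linalg.svd(G, full_matrices=False)
u1, v1 = U[:, 0], Vt[0, :]
if u1.sum() < 0:                              # fix the SVD sign ambiguity
    u1, v1 = -u1, -v1
s = np.sqrt(S[0]) * u1                        # student scores
q = np.sqrt(S[0]) * v1                        # question scores (a low value suggests a hard question)
G1 = np.outer(s, q)                           # best rank-one approximation s q^T

print("relative approximation error:", np.linalg.norm(G - G1) / np.linalg.norm(G))
print("question scores q:", np.round(q, 3))
```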
Exercise 5.5 (Latent semantic indexing) Latent semantic indexing is
an SVD-based technique that can be used to discover text documents
similar to each other. Assume that we are given a set of m docu-
ments D1 , . . . , Dm . Using a “bag-of-words” technique described in
Example 2.1, we can represent each document D j by an n-vector d j ,
where n is the total number of distinct words appearing in the whole
set of documents. In this exercise, we assume that the vectors d j
are constructed as follows: d j (i ) = 1 if word i appears in document
D j , and 0 otherwise. We refer to the n × m matrix M = [d1 , . . . , dm ]
as the “raw” term-by-document matrix. We will also use a normal-
ized version of that matrix: M̃ = [d̃1 , . . . , d̃m ], where d̃j = dj /kdj k2 ,
j = 1, . . . , m. (In practice, other numerical representations of text
documents can be used; for example, we may use the relative frequencies
of words in each document, instead of the `2 -norm normalization
employed here.)
Assume we are given another document, referred to as the “query
document,” which is not part of the collection. We describe that
query document as an n-dimensional vector q, with zeros every-
where, except a 1 at indices corresponding to the terms that appear
in the query. We seek to retrieve documents that are “most similar”
to the query, in some sense. We denote by q̃ the normalized vector
q̃ = q/kqk2 .
1. A first approach is to select the documents that contain the largest
number of terms in common with the query document. Explain
how to implement this approach, based on a certain matrix–vector
product, which you will determine.
2. Another approach is to find the closest document by selecting the
index j such that kq − d j k2 is the smallest. This approach can
introduce some biases, if for example the query document is much
shorter than the other documents. Hence a measure of similarity
based on the normalized vectors, kq̃ − d˜j k2 , has been proposed,
under the name of “cosine similarity”. Justify the use of this name
for that method, and provide a formulation based on a certain
matrix–vector product, which you will determine.
3. Assume that the normalized matrix M̃ has an SVD M̃ = UΣV > ,
with Σ an n × m matrix containing the singular values, and the
unitary matrices U = [u1 , . . . , un ], V = [v1 , . . . , vm ] of size n × n,
m × m respectively. What could be an interpretation of the vectors
ul , vl , l = 1, . . . , r? Hint: discuss the case when r is very small, and
the vectors ul , vl , l = 1, . . . , r, are sparse.
4. With real-life text collections, it is often observed that M is effec-
tively close to a low-rank matrix. Assume that an optimal rank-k
approximation (k ≪ min(n, m)) of M̃, denoted M̃k , is known. In the
latent semantic indexing approach to document similarity, the idea
is to first project the documents and the query onto the subspace
generated by the singular vectors u1 , . . . , uk , and then apply the
cosine similarity approach to the projected vectors. (In practice, it
is often observed that this method produces better results than
cosine similarity in the original space, as in part 2.) Find an expression
for the measure of similarity.
Exercise 5.6 (Fitting a hyperplane to data) We are given m data points
d1 , . . . , dm ∈ Rn , and we seek a hyperplane
H(c, b) = { x ∈ Rn : c> x = b},
where c ∈ Rn , c ≠ 0, and b ∈ R, that best “fits” the given points,
in the sense of a minimum sum of squared distances criterion, see
Figure 5.4. (Figure 5.4: fitting a hyperplane to data.)
Formally, we need to solve the optimization problem
min_{c,b} ∑_{i=1}^m dist² ( di , H(c, b) ) : kck2 = 1,
where dist(d, H) is the Euclidean distance from a point d to H. Here
the constraint on c is imposed without loss of generality, in a way
that does not favor a particular direction in space.
1. Show that the distance from a given point d ∈ Rn to H is given by
dist(d, H(c, b)) = |c> d − b|.
2. Show that the problem can be expressed as
min f 0 (b, c),
b,c : kck2 =1
where f 0 is a certain quadratic function, which you will determine.
3. Show that the problem can be reduced to
min c> ( D̃ D̃ > )c
c
s.t.: kck2 = 1,
where D̃ is the matrix of centered data points: the i-th column of
D̃ is di − d̄ , where d̄ = (1/m) ∑_{i=1}^m di is the average of the data
points. Hint: you can exploit the fact that at optimum, the partial
derivative of the objective function with respect to b must be zero,
a fact justified in Section 8.4.1.
4. Explain how to find the hyperplane via SVD.
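A possible construction consistent with parts 1-3 (a sketch, using synthetic data made up for the illustration): center the data, take the left singular vector of the centered matrix associated with the smallest singular value as the normal c, and set b = c> d̄.

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic points scattered around the line x1 + 2 x2 = 1 (made-up data)
m = 100
x2 = rng.normal(size=m)
x1 = 1.0 - 2.0 * x2 + 0.05 * rng.normal(size=m)
D = np.vstack([x1, x2])                      # n x m matrix of data points

d_bar = D.mean(axis=1, keepdims=True)
D_tilde = D - d_bar                          # centered data points
U, S, Vt = np.linalg.svd(D_tilde, full_matrices=False)
c = U[:, -1]                                 # direction of smallest spread of the centered data
b = float(c @ d_bar.ravel())                 # offset chosen so the plane passes near the mean
print("estimated normal c:", np.round(c, 3))  # proportional to (1, 2), up to sign
print("estimated offset b:", round(b, 3))
```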
Exercise 5.7 (Image deformation) A rigid transformation is a map-
ping from Rn to Rn that is the composition of a translation and a
rotation. Mathematically, we can express a rigid transformation φ as
φ( x ) = Rx + r, where R is an n × n orthogonal transformation and
r ∈ Rn a vector.
We are given a set of pairs of points ( xi , yi ) in Rn , i = 1, . . . , m,
and wish to find a rigid transformation that best matches them. We
can write the problem as
min_{R∈Rn,n , r∈Rn} ∑_{i=1}^m k Rxi + r − yi k2² : R> R = In ,   (5.2)
where In is the n × n identity matrix.
The problem arises in image processing, to provide ways to de-
form an image (represented as a set of two-dimensional points) based
on the manual selection of a few points and their transformed coun-
terparts.
(Figure 5.5: image deformation via rigid transformation. The image on the left is the original image, and that on the right is the deformed image. Dots indicate points for which the deformation is chosen by the user.)
1. Assume that R is fixed in problem (5.2). Express an optimal r as a
function of R.
2. Show that the corresponding optimal value (now a function of R
only) can be written as the original objective function, with r = 0
and xi , yi replaced with their centered counterparts,
x̄i = xi − x̂ ,   x̂ = (1/m) ∑_{j=1}^m xj ,   ȳi = yi − ŷ ,   ŷ = (1/m) ∑_{j=1}^m yj .
3. Show that the problem can be written as
min k RX − Y kF : R> R = In ,
R
for appropriate matrices X, Y, which you will determine. Hint:
explain why you can square the objective; then expand.
4. Show that the problem can be further written as
max trace RZ : R> R = In ,
R
for an appropriate n × n matrix Z, which you will determine.
5. Show that R = VU > is optimal, where Z = USV > is the SVD of
Z. Hint: reduce the problem to the case when Z is diagonal, and
use without proof the fact that when Z is diagonal, In is optimal
for the problem.
6. Show the result you used in the previous question: assume Z is
diagonal, and show that R = In is optimal for the problem above.
Hint: show that R> R = In implies | Rii | ≤ 1, i = 1, . . . , n, and
using that fact, prove that the optimal value is less than or equal
to trace Z.
20
7. How would you apply this technique to make Mona Lisa smile
more? Hint: in Figure 5.5, the two-dimensional points xi are given
(as dots) on the left panel, while the corresponding points yi are
shown on the right panel. These points are manually selected. The
problem is to find how to transform all the other points in the
original image.
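The construction outlined in parts 1-6 can be tested numerically. In the sketch below (my own test harness), Z = X Y> is used as one choice consistent with the expansion of part 3; working out the appropriate Z is precisely part 4 of the exercise.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 50
theta = 0.5                                           # made-up ground-truth rigid transformation
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
r_true = np.array([1.0, -2.0])

X_pts = rng.normal(size=(2, m))                       # points x_i as columns
Y_pts = R_true @ X_pts + r_true[:, None]              # y_i = R x_i + r, noiseless

x_hat = X_pts.mean(axis=1, keepdims=True)             # centers, as in part 2
y_hat = Y_pts.mean(axis=1, keepdims=True)
X = X_pts - x_hat
Y = Y_pts - y_hat

Z = X @ Y.T                                           # candidate Z (see part 4)
U, S, Vt = np.linalg.svd(Z)
R = Vt.T @ U.T                                        # R = V U^T, as stated in part 5
r = (y_hat - R @ x_hat).ravel()                       # optimal translation from part 1

print("recovered R matches the true rotation:", bool(np.allclose(R, R_true)))
print("recovered r matches the true translation:", bool(np.allclose(r, r_true)))
```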
6. Linear Equations
Exercise 6.1 (Least squares and total least squares) Find the least-
squares line and the total least-squares line (see Section 6.7.5) for the data points ( xi , yi ),
i = 1, . . . , 4, with x = (−1, 0, 1, 2), y = (0, 0, 1, 1). Plot both lines on
the same set of axes.
Exercise 6.2 (Geometry of least squares) Consider a least-squares
problem
p∗ = min k Ax − yk2 ,
x
where A ∈ Rm,n , y ∈ Rm . We assume that y ∉ R( A), so that p∗ > 0.
Show that, at optimum, the residual vector r = y − Ax is such that
r > y > 0, A> r = 0. Interpret the result geometrically. Hint: use the
SVD of A. You can assume that m ≥ n, and that A is full column
rank.
Exercise 6.3 (Lotka’s law and least squares) Lotka’s law describes
the frequency of publication by authors in a given field. It states
that X^a Y = b, where X is the number of publications, Y the relative
frequency of authors with X publications, and a and b are constants
(with b > 0) that depend on the specific field. Assume that we have
data points ( Xi , Yi ), i = 1, . . . , m, and seek to estimate the constants a
and b.
1. Show how to find the values of a, b according to a linear least-
squares criterion. Make sure to define the least-squares problem
involved precisely.
2. Is the solution always unique? Formulate a condition on the data
points that guarantees unicity.
Exercise 6.4 (Regularization for noisy data) Consider a least-squares
problem
min k Ax − yk22 ,
x
in which the data matrix A ∈ Rm,n is noisy. Our specific noise model
assumes that each row ai> ∈ Rn has the form ai = âi + ui , where
the noise vector ui ∈ Rn has zero mean and covariance matrix σ2 In ,
with σ a measure of the size of the noise. Therefore, now the matrix
A is a function of the uncertain vector u = (u1 , . . . , um ), which we
denote by A(u). We will write  to denote the matrix with rows âi> ,
i = 1, . . . , m. We replace the original problem with
min Eu {k A(u) x − yk22 },
x
where Eu denotes the expected value with respect to the random
variable u. Show that this problem can be written as
min k Âx − yk22 + λk x k22 ,
x
where λ ≥ 0 is some regularization parameter, which you will deter-
mine. That is, regularized least squares can be interpreted as a way
to take into account uncertainties in the matrix A, in the expected
value sense. Hint: compute the expected value of (( âi + ui )> x − yi )2 ,
for a specific row index i.
Exercise 6.5 (Deleting a measurement in least squares) In this exer-
cise, we revisit Section 6.3.5, and assume now that we would like to
delete a measurement, and update the least-squares solution accord-
ingly. (This is useful in the context of cross-validation methods, as evoked in Section 13.2.2.)
We are given a full column rank matrix A ∈ Rm,n , with rows ai> ,
i = 1, . . . , m, a vector y ∈ Rm , and a solution to the least-squares
problem
x∗ = arg min_x ∑_{i=1}^m ( ai> x − yi )² = arg min_x k Ax − yk2 .
Assume now we delete the last measurement, that is, replace ( am , ym )
by (0, 0). We assume that the matrix obtained after deleting any one
of the measurements is still full column rank.
1. Express the solution to the problem after deletion, in terms of
the original solution, similar to the formula (6.15). Make sure to
explain why any quantities you invert are positive.
2. In the so-called leave-one-out analysis, we would like to efficiently
compute all the m solutions corresponding to deleting one of the
m measurements. Explain how you would compute those solu-
tions computationally efficiently. Detail the number of operations
(flops) needed. You may use the fact that to invert an n × n matrix
costs O(n3 ).
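One way to organize part 2 (a sketch; the rank-one update below is the standard leverage-based leave-one-out formula, written from scratch rather than quoted from (6.15)): form (AᵀA)⁻¹ once in O(n³), after which each deleted-measurement solution costs only O(n²).

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 50, 4
A = rng.normal(size=(m, n))
y = rng.normal(size=m)

K_inv = np.linalg.inv(A.T @ A)           # computed once, O(n^3)
x_star = K_inv @ (A.T @ y)               # full least-squares solution

for i in range(3):                       # a few of the m leave-one-out solutions
    a_i = A[i]
    h_i = a_i @ K_inv @ a_i              # leverage of row i (strictly below 1 here)
    x_loo = x_star + K_inv @ a_i * (a_i @ x_star - y[i]) / (1.0 - h_i)

    # brute-force check: re-solve with the row actually removed
    A_del, y_del = np.delete(A, i, axis=0), np.delete(y, i)
    x_check, *_ = np.linalg.lstsq(A_del, y_del, rcond=None)
    print(i, "rank-one update matches the re-solve:", bool(np.allclose(x_loo, x_check)))
```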
Exercise 6.6 The Michaelis–Menten model for enzyme kinetics re-
lates the rate y of an enzymatic reaction to the concentration x of a
substrate, as follows:
y = β1 x / ( β2 + x ) ,
where β i , i = 1, 2, are positive parameters.
1. Show that the model can be expressed as a linear relation between
the values 1/y and 1/x.
2. Use this expression to find an estimate β̂ of the parameter vector
β using linear least squares, based on m measurements ( xi , yi ),
i = 1, . . . , m.
3. The above approach has been found to be quite sensitive to errors
in input data. Can you experimentally confirm this opinion?
Exercise 6.7 (Least norm estimation on traffic flow networks) You
want to estimate the traffic (in San Francisco for example, but we’ll
start with a smaller example). You know the road network as well as
the historical average of flows on each road segment.
1. We call qi the flow of vehicles on each road segment i ∈ I. Write
down the linear equation that corresponds to the conservation of
vehicles at each intersection j ∈ J. Hint: think about how you
might represent the road network in terms of matrices, vectors,
etc.
2. The goal of the estimation is to estimate the traffic flow on each of
the road segments. The flow estimates should satisfy the conserva-
tion of vehicles exactly at each intersection. Among the solutions
that satisfy this constraint, we are searching for the estimate that
is the closest to the historical average, q, in the `2 -norm sense. The
vector q has size I and the i-th element represents the average for
the road segment i. Pose the optimization problem.
3. Explain how to solve this problem mathematically. Detail your
answer (do not only give a formula but explain where it comes
from).
(Figure 6.6: example of the traffic estimation problem. The intersections are labeled a to h. The road segments are labeled 1 to 22. The arrows indicate the direction of traffic.)
4. Formulate the problem for the small example of Figure 6.6 and
solve it using the historical average given in Table 6.1. What is the
flow that you estimate on road segments 1, 3, 6, 15 and 22?
5. Now, assume that besides the historical averages, you are also
given some flow measurements on some of the road segments of
the network. You assume that these flow measurements are correct
and want your estimate of the flow to match these measurements
perfectly (besides matching the conservation of vehicles of course).
The right column of Table 6.1 lists the road segments for which we
have such flow measurements. Do you estimate a different flow
on some of the links? Give the difference in flow you estimate for
road segments 1, 3, 6, 15 and 22. Also check that your estimate
gives you the measured flow on the road segments for which you
have measured the flow.

Table 6.1: Table of flows: historical averages q (center column), and some measured flows (right column).
segment   average   measured
1         2047.6    2028
2         2046.0    2008
3         2002.6    2035
4         2036.9
5         2013.5    2019
6         2021.1
7         2027.4
8         2047.1
9         2020.9    2044
10        2049.2
11        2015.1
12        2035.1
13        2033.3
14        2027.0    2043
15        2034.9
16        2033.3
17        2008.9
18        2006.4
19        2050.0    2030
20        2008.6    2025
21        2001.6
22        2028.1    2045

Exercise 6.8 (A matrix least-squares problem) We are given a set of
points p1 , . . . , pm ∈ Rn , which are collected in the n × m matrix P =
[ p1 , . . . , pm ]. We consider the problem
min_X F ( X ) = ∑_{i=1}^m k xi − pi k2² + (λ/2) ∑_{1≤i,j≤m} k xi − xj k2² ,
where λ ≥ 0 is a parameter. In the above, the variable is an n × m
matrix X = [ x1 , . . . , xm ], with xi ∈ Rn the i-th column of X, i =
1, . . . , m. The above problem is an attempt at clustering the points
pi ; the first term encourages the cluster center xi to be close to the
corresponding point pi , while the second term encourages the xi s to
be close to each other, with a higher grouping effect as λ increases.
1. Show that the problem belongs to the family of ordinary least-
squares problems. You do not need to be explicit about the form
of the problem.
2. Show that
(1/2) ∑_{1≤i,j≤m} k xi − xj k2² = trace( XHX> ) ,
where H = mIm − 11> is an m × m matrix, with Im the m × m
identity matrix, and 1 the vector of ones in Rm .
3. Show that H is positive semidefinite.
4. Show that the gradient of the function F at a matrix X is the n × m
matrix given by
∇ F ( X ) = 2( X − P + λXH ).
Hint: for the second term, find the first-order expansion of the
function ∆ → trace(( X + ∆) H ( X + ∆)> ), where ∆ ∈ Rn,m .
5. As mentioned in Remark 6.1, optimality conditions for a least-
squares problem are obtained by setting the gradient of the objec-
tive to zero. Using the formula (3.10), show that optimal points
are of the form
xi = ( 1/(mλ + 1) ) pi + ( mλ/(mλ + 1) ) p̂ ,   i = 1, . . . , m,
where p̂ = (1/m)( p1 + . . . + pm ) is the center of the given points.
6. Interpret your results. Do you believe the model considered here
is a good one to cluster points?
7. Matrix Algorithms
Exercise 7.1 (Sparse matrix–vector product) Recall from Section 3.4.2
that a matrix is said to be sparse if most of its entries are zero. More
formally, assume an m × n matrix A has sparsity coefficient γ( A) ≪ 1,
where γ( A) = d( A)/s( A), d( A) is the number of nonzero elements
in A, and s( A) is the size of A (in this case, s( A) = mn).
1. Evaluate the number of operations (multiplications and additions)
that are required to form the matrix–vector product Ax, for any
given vector x ∈ Rn and generic, non-sparse A. Show that this
number is reduced by a factor γ( A), if A is sparse.
2. Now assume that A is not sparse, but is a rank-one modification of
a sparse matrix. That is, A is of the form à + uv> , where à ∈ Rm,n
is sparse, and u ∈ Rm , v ∈ Rn are given. Devise a method to
compute the matrix–vector product Ax that exploits sparsity.
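For part 2, the useful identity is Ax = Ãx + u(v>x): one sparse matrix-vector product, one inner product, and one scaled vector, so the dense A never needs to be formed. The sketch below illustrates this; the use of scipy.sparse is my choice for the illustration, not prescribed by the exercise.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(6)
m, n = 2000, 1500
A_tilde = sp.random(m, n, density=0.01, format="csr", random_state=7)   # sparse part
u = rng.normal(size=m)
v = rng.normal(size=n)
x = rng.normal(size=n)

# A = A_tilde + u v^T is dense, but A x can be computed without ever building A
y = A_tilde @ x + u * (v @ x)        # ~2 d(A_tilde) + O(m + n) operations

y_check = (A_tilde.toarray() + np.outer(u, v)) @ x   # naive dense computation, ~2 m n operations
print("identical results:", bool(np.allclose(y, y_check)))
```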
Exercise 7.2 (A random inner product approximation) Computing
the standard inner product between two vectors a, b ∈ Rn requires n
multiplications and additions. When the dimension n is huge (say,
e.g., of the order of 1012 , or larger), even computing a simple inner
product can be computationally prohibitive.
Let us define a random vector r ∈ Rn constructed as follows:
choose uniformly at random an index i ∈ {1, . . . , n}, and set ri = 1,
and rj = 0 for j ≠ i. Consider the two scalar random numbers ã, b̃
that represent the “random projections” of the original vectors a, b
along r:
ã = r> a = ai ,   b̃ = r> b = bi .
Prove that
nE{ ãb̃} = a> b,
that is, n ãb̃ is an unbiased estimator of the value of the inner product
a> b. Observe that computing n ãb̃ requires very little effort, since it
is just equal to nai bi , where i is the randomly chosen index. Notice,
however, that the variance of such an estimator can be large, as it is
given by
var{ n ãb̃ } = n ∑_{i=1}^n ai² bi² − ( a> b )²
(prove also this latter formula). Hint: let ei denote the i-th standard
basis vector of Rn ; the random vector r has discrete probability distri-
bution Prob{r = ei } = 1/n, i = 1, . . . , n, hence E{r } = (1/n) 1. Further,
observe that the products rk rj are equal to zero for k ≠ j and that the
vector r² = [ r1² , . . . , rn² ]> has the same distribution as r.
Generalizations of this idea to random projections onto k-dimen-
sional subspaces are indeed applied for matrix-product approxima-
tion, SVD factorization and PCA on huge-scale problems. The key
theoretical tool underlying these results is known as the Johnson–
Lindenstrauss lemma.
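A short simulation illustrating the unbiasedness claim and the (possibly large) variance of the estimator; the dimensions and vectors below are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 10_000
a = rng.normal(size=n)
b = rng.normal(size=n)
exact = a @ b

idx = rng.integers(0, n, size=200_000)      # many independent draws of the random index i
estimates = n * a[idx] * b[idx]             # the estimator n * a_i * b_i

var_theory = n * np.sum(a ** 2 * b ** 2) - exact ** 2   # variance formula from the exercise
print("exact inner product:  ", round(exact, 2))
print("mean of the estimates:", round(float(estimates.mean()), 2))   # close to exact, up to sampling noise
print("empirical variance:   ", round(float(estimates.var()), 2))
print("theoretical variance: ", round(var_theory, 2))
```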
Exercise 7.3 (Power iteration for SVD with centered, sparse data)
In many applications such as principal component analysis (see Sec-
tion 5.3.2), one needs to find the few largest singular values of a
centered data matrix. Specifically, we are given an n × m matrix X =
[ x1 , . . . , xm ] of m data points in Rn , i = 1, . . . , m, and define the cen-
tered matrix X̃ to be
X̃ = [ x̃1 · · · x̃m ],   x̃i = xi − x̄ ,   i = 1, . . . , m,
with x̄ = (1/m) ∑_{i=1}^m xi the average of the data points. In general, X̃ is
dense, even if X itself is sparse. This means that each step of the
power iteration method involves two matrix–vector products, with a
dense matrix. Explain how to modify the power iteration method in
order to exploit sparsity, and avoid dense matrix–vector multiplica-
tions.
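The centering can be kept implicit: X̃v = Xv − x̄(1ᵀv) and X̃ᵀw = Xᵀw − 1(x̄ᵀw), so each power-iteration step only multiplies by X (and X̃ is never formed). A sketch of this idea, with a dense numpy array standing in for a sparse matrix:

```python
import numpy as np

def top_singular_centered(X, iters=200, seed=0):
    """Largest singular value of X_tilde = X - x_bar 1^T, using only products with X."""
    n, m = X.shape
    x_bar = (X @ np.ones(m)) / m                   # average of the data points (columns of X)
    w = np.random.default_rng(seed).normal(size=n)
    for _ in range(iters):
        v = X.T @ w - np.ones(m) * (x_bar @ w)     # v = X_tilde^T w, without forming X_tilde
        w = X @ v - x_bar * v.sum()                # w = X_tilde v
        w = w / np.linalg.norm(w)
    v = X.T @ w - np.ones(m) * (x_bar @ w)
    return np.linalg.norm(v)                       # = sigma_1(X_tilde) at convergence

rng = np.random.default_rng(9)
X = rng.normal(size=(30, 80))
X_tilde = X - X.mean(axis=1, keepdims=True)
print("power iteration:", round(top_singular_centered(X), 6))
print("direct SVD:     ", round(float(np.linalg.svd(X_tilde, compute_uv=False)[0]), 6))
```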
Exercise 7.4 (Exploiting structure in linear equations) Consider the
linear equation in x ∈ Rn
Ax = y,
where A ∈ Rm,n , y ∈ Rm . Answer the following questions to the best
of your knowledge.
1. The time required to solve the general system depends on the
sizes m, n and the entries of A. Provide a rough estimate of that
time as a function of m, n only. You may assume that m, n are of
the same order.
2. Assume now that A = D + uv> , where D is diagonal, invertible,
and u ∈ Rm , v ∈ Rn . How would you exploit this structure to
solve the above linear system, and what is a rough estimate of the
complexity of your algorithm?
3. What if A is upper-triangular?
Exercise 7.5 (Jacobi method for linear equation) Let A = ( aij ) ∈ Rn,n ,
b ∈ Rn , with aii 6= 0 for every i = 1, . . . , n. The Jacobi method for solv-
ing the square linear system
Ax = b
consists of decomposing A as a sum: A = D + R, where D =
diag ( a11 , . . . , ann ), and R contains the off-diagonal elements of A,
and then applying the recursion
x (k+1) = D −1 (b − Rx (k) ), k = 0, 1, 2, . . . ,
with initial point x̂ (0) = D −1 b.
The method is part of a class of methods known as matrix splitting,
where A is decomposed as a sum of a “simple” invertible matrix and
another matrix; the Jacobi method uses a particular splitting of A.
1. Find conditions on D, R that guarantee convergence from an arbi-
trary initial point. Hint: assume that M = − D −1 R is diagonaliz-
able.
2. The matrix A is said to be strictly row diagonally dominant if
∀ i = 1, . . . , n :   | aii | > ∑_{j≠i} | aij | .
Show that when A is strictly row diagonally dominant, the Jacobi
method converges.
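A minimal implementation of the Jacobi recursion, checked on a randomly generated strictly row diagonally dominant matrix (so that convergence is guaranteed by part 2); an illustration only.

```python
import numpy as np

def jacobi(A, b, iters=200):
    """Jacobi iterations x(k+1) = D^{-1} (b - R x(k)), starting from x(0) = D^{-1} b."""
    D = np.diag(A)                 # diagonal entries a_11, ..., a_nn
    R = A - np.diag(D)             # off-diagonal part
    x = b / D
    for _ in range(iters):
        x = (b - R @ x) / D
    return x

rng = np.random.default_rng(10)
n = 8
A = rng.normal(size=(n, n))
np.fill_diagonal(A, 0.0)
np.fill_diagonal(A, np.abs(A).sum(axis=1) + 1.0)   # diagonal strictly dominates each row
b = rng.normal(size=n)

x_jacobi = jacobi(A, b)
print("max error vs direct solve:", np.abs(x_jacobi - np.linalg.solve(A, b)).max())
```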
Exercise 7.6 (Convergence of linear iterations) Consider linear iter-
ations of the form
x (k + 1) = Fx (k ) + c, k = 0, 1, . . . , (7.3)
where F ∈ Rn,n , c ∈ Rn , and the iterations are initialized with x (0) =
x0 . We assume that the iterations admit a stationary point, i.e., that
there exists x̄ ∈ Rn such that
( I − F ) x̄ = c. (7.4)
In this exercise, we derive conditions under which x (k) tends to a
finite limit for k → ∞. We shall use these results in Exercise 7.7,
to set up a linear iterative algorithm for solving systems of linear
equations.
1. Show that the following expressions hold for all k = 0, 1, . . .:
x ( k + 1) − x ( k ) = F^k ( I − F )( x̄ − x0 ),   (7.5)
x (k) − x̄ = F^k ( x0 − x̄ ).   (7.6)
2. Prove that, for all x0 , limk→∞ x (k) converges to a finite limit if and
only if F k is convergent (see Theorem 3.5). When x (k ) converges,
its limit point x̄ satisfies (7.4).
Exercise 7.7 (A linear iterative algorithm) In this exercise we intro-
duce some “equivalent” formulations of a system of linear equations
Ax = b, A ∈ Rm,n , (7.7)
and then study a linear recursive algorithm for solution of this sys-
tem.
1. Consider the system of linear equations
Ax = AA† b, (7.8)
where A† is any pseudoinverse of A (that is, a matrix such that
AA† A = A). Prove that (7.8) always admits a solution. Show
that every solution of equations (7.7) is also a solution for (7.8).
Conversely, prove that if b ∈ R( A), then every solution to (7.8) is
also a solution for (7.7).
2. Let R ∈ Rn,m be any matrix such that N ( RA) = N ( A). Prove that
A† = ( RA)† R
is indeed a pseudoinverse of A.
3. Consider the system of linear equations
RAx = Rb, (7.9)
where R ∈ Rn,m is any matrix such that N ( RA) = N ( A) and Rb ∈
R( RA). Prove that, under these hypotheses, the set of solutions of
(7.9) coincides with the set of solutions of (7.8), for A† = ( RA)† R.
4. Under the setup of the previous point, consider the following lin-
ear iterations: for k = 0, 1, . . .,
x (k + 1) = x (k ) + αR(b − Ax (k)), (7.10)
where α ≠ 0 is a given scalar. Show that if limk→∞ x (k) = x̄, then
x̄ is a solution for the system of linear equations (7.9). State appro-
priate conditions under which x (k) is guaranteed to converge.
5. Suppose A is positive definite (i.e., A ∈ Sn , A ≻ 0). Discuss
how to find a suitable scalar α and matrix R ∈ Rn,n satisfying the
conditions of point 3, and such that the iterations (7.10) converge
to a solution of (7.9). Hint: use Exercise 4.8.
6. Explain how to apply the recursive algorithm (7.10) for finding
a solution to the linear system Ãx = b̃, where à ∈ Rm,n with
m ≥ n and rank à = n. Hint: apply the algorithm to the normal
equations.
8. Convexity
Exercise 8.1 (Quadratic inequalities) Consider the set defined by the
following inequalities:
( x1 ≥ x2 − 1 and x2 ≥ 0) or ( x1 ≤ x2 − 1 and x2 ≤ 0) .
1. Draw the set. Is it convex?
2. Show that it can be described as a single quadratic inequality of
the form q( x ) = x > Ax + 2b> x + c ≤ 0, for a matrix A = A> ∈
R2,2 , b ∈ R2 and c ∈ R which you will determine.
3. What is the convex hull of this set?
Exercise 8.2 (Closed functions and sets) Show that the indicator
function IX of a convex set X is convex. Show that this function
is closed whenever X is a closed set.
Exercise 8.3 (Convexity of functions)
1. For x, y both positive scalars, show that
y e^{x/y} = max_{α>0} [ α( x + y) − yα · ln α ] .
Use the above result to prove that the function f defined as
f ( x, y) = y e^{x/y} if x > 0, y > 0;   +∞ otherwise,
is convex.
2. Show that for r ≥ 1, the function fr : Rm+ → R, with values
fr (v) = ( ∑_{j=1}^m vj^{1/r} )^r
is concave. Hint: show that the Hessian of − f takes the form
κdiag (y) − zz> for appropriate vectors y ≥ 0, z ≥ 0, and scalar
κ ≥ 0, and use Schur complements (see Section 4.4.7) to prove that the Hessian is
positive semidefinite.
Exercise 8.4 (Some simple optimization problems) Solve the follow-
ing optimization problems. Make sure to determine an optimal pri-
mal solution.
1. Show that, for given scalars α, β,
f (α, β) = min_{d>0} ( αd + β²/d ) = −∞ if α ≤ 0,   2| β| √α otherwise.
2. Show that for an arbitrary vector z ∈ Rm ,
kzk1 = min_{d>0} (1/2) ∑_{i=1}^m ( di + zi²/di ) .   (8.11)
3. Show that for an arbitrary vector z ∈ Rm , we have
kzk1² = min_d ∑_{i=1}^m zi²/di : d > 0, ∑_{i=1}^m di = 1.
Exercise 8.5 (Minimizing a sum of logarithms) Consider the follow-
ing problem:
p∗ = max_{x∈Rn} ∑_{i=1}^n αi ln xi
s.t.: x ≥ 0, 1> x = c,
where c > 0 and αi > 0, i = 1, . . . , n. Problems of this form arise, for
instance, in maximum-likelihood estimation of the transition proba-
bilities of a discrete-time Markov chain. Determine in closed-form a
maximizer, and show that the optimal objective value of this problem
is
p∗ = α ln(c/α) + ∑_{i=1}^n αi ln αi ,
where α = ∑_{i=1}^n αi .
Exercise 8.6 (Monotonicity and locality) Consider the optimization
problems (no assumption of convexity here)
p1∗ = min_{x∈X1} f0 ( x ),
p2∗ = min_{x∈X2} f0 ( x ),
p13∗ = min_{x∈X1∩X3} f0 ( x ),
p23∗ = min_{x∈X2∩X3} f0 ( x ),
where X1 ⊆ X2 .
1. Prove that p1∗ ≥ p2∗ (i.e., enlarging the feasible set cannot worsen
the optimal objective).
2. Prove that, if p1∗ = p2∗ , then it holds that
p13∗ = p1∗   ⇒   p23∗ = p2∗ .
3. Assume that all problems above attain unique optimal solutions.
Prove that, under such a hypothesis, if p1∗ = p2∗ , then it holds that
p23∗ = p2∗   ⇒   p13∗ = p1∗ .
Exercise 8.7 (Some matrix norms) Let X = [ x1 , . . . , xm ] ∈ Rn,m , and
p ∈ [1, +∞]. We consider the problem
φp ( X ) = max_u k X> u k p : u> u = 1.
If the data is centered, that is, X1 = 0, the above amounts to finding
a direction of largest “deviation” from the origin, where deviation is
measured using the l p -norm.
1. Is φ p a (matrix) norm?
2. Solve the problem for p = 2. Find an optimal u.
3. Solve the problem for p = ∞. Find an optimal u.
4. Show that
φ p ( X ) = max k Xvk2 : kvkq ≤ 1,
v
where 1/p + 1/q = 1 (hence, φ p ( X ) depends only on X > X). Hint:
you can use the fact that the norm dual to the l p -norm is the lq -
norm and vice versa, in the sense that, for any scalars p ≥ 1, q ≥ 1
with 1/p + 1/q = 1, we have
max u> v = kuk p .
v: kvkq ≤1
Exercise 8.8 (Norms of matrices with non-negative entries) Let X ∈
Rn,m+ be a matrix with non-negative entries, and p, r ∈ [1, +∞], with
p ≥ r. We consider the problem
φ p,r ( X ) = max k Xvkr : kvk p ≤ 1.
v
1. Show that the function fX : Rm+ → R, with values
fX (u) = ∑_{i=1}^n ( ∑_{j=1}^m Xij uj^{1/p} )^r
is concave when p ≥ r.
2. Use the previous result to formulate an efficiently solvable convex
problem that has φ p,r ( X )r as optimal value.
Exercise 8.9 (Magnitude least squares) For given n-vectors a1 , . . . , am ,
we consider the problem
p∗ = min_x ∑_{i=1}^m ( | ai> x | − 1 )² .
1. Is the problem convex? If so, can you formulate it as an ordinary
least-squares problem? An LP? A QP? A QCQP? An SOCP? None
of the above? Justify your answers precisely.
2. Show that the optimal value p∗ depends only on the matrix K =
A> A, where A = [ a1 , . . . , am ] is the n × m matrix of data points
(that is, if two different matrices A1 , A2 satisfy A1> A1 = A2> A2 ,
then the corresponding optimal values are the same).
Exercise 8.10 (Eigenvalues and optimization) Given an n × n sym-
metric matrix Q, define
w1 = arg min_{k x k2 =1} x> Qx,   and   µ1 = min_{k x k2 =1} x> Qx,
and for k = 1, 2, . . . , n − 1:
wk+1 = arg min_{k x k2 =1} x> Qx such that wi> x = 0, i = 1, . . . , k,
µk+1 = min_{k x k2 =1} x> Qx such that wi> x = 0, i = 1, . . . , k.
Using optimization principles and theory:
1. show that µ1 ≤ µ2 ≤ · · · ≤ µn ;
2. show that the vectors w1 , . . . , wn are linearly independent, and
form an orthonormal basis of Rn ;
3. show how µ1 can be interpreted as a Lagrange multiplier, and that
µ1 is the smallest eigenvalue of Q;
4. show how µ2 , . . . , µn can also be interpreted as Lagrange multipli-
ers. Hint: show that µk+1 is the smallest eigenvalue of Wk> QWk ,
where Wk = [wk+1 , . . . , wn ].
Exercise 8.11 (Block norm penalty) In this exercise we partition vec-
tors x ∈ Rn into p blocks x = ( x1 , . . . , x p ), with xi ∈ Rni , n1 + · · · +
n p = n. Define the function ρ : Rn → R with values
ρ( x ) = ∑_{i=1}^p k xi k2 .
1. Prove that ρ is a norm.
2. Find a simple expression for the “dual norm,” ρ∗ ( x ) = sup_{z: ρ(z)=1} z> x.
3. What is the dual of the dual norm?
34
4. For a scalar λ ≥ 0, matrix A ∈ Rm,n and vector y ∈ Rm , we
consider the optimization problem
p∗ (λ) = min_x k Ax − yk2 + λρ( x ).
Explain the practical effect of a high value of λ on the solution.
5. For the problem above, show that λ > σmax ( Ai ) implies that we
can set xi = 0 at optimum. Here, Ai ∈ Rm,ni corresponds to the
i-th block of columns in A, and σmax refers to the largest singular
value.
9. Linear, Quadratic and Geometric Models
Exercise 9.1 (Formulating problems as LPs or QPs) Formulate the
problem
p∗j = min_x fj ( x ),
for different functions fj , j = 1, . . . , 5, with values given in Table 9.2,
as QPs or LPs, or, if you cannot, explain why. In our formulations,
we always use x ∈ Rn as the variable, and assume that A ∈ Rm,n ,
y ∈ Rm , and k ∈ {1, . . . , m} are given. If you obtain an LP or QP
formulation, make sure to put the problem in standard form, stating
precisely what the variables, objective and constraints are. Hint: for
the last one, see Example 9.10.

Table 9.2: Table of the values of different functions f . |z|[i] denotes the element in a vector z that has the i-th largest magnitude.
f1 ( x ) = k Ax − y k∞ + k x k1
f2 ( x ) = k Ax − y k2² + k x k1
f3 ( x ) = k Ax − y k2² − k x k1
f4 ( x ) = k Ax − y k2² + k x k1²
f5 ( x ) = ∑_{i=1}^k | Ax − y |[i] + k x k2²
Exercise 9.2 (A slalom problem) A two-dimensional skier must sla-
lom down a slope, by going through n parallel gates of known posi-
tion ( xi , yi ), and of width ci , i = 1, . . . , n. The initial position ( x0 , y0 )
is given, as well as the final one, ( xn+1 , yn+1 ). Here, the x-axis repre-
sents the direction down the slope, from left to right, see Figure 9.7.
1. Find the path that minimizes the total length of the path. Your
answer should come in the form of an optimization problem.
2. Try solving the problem numerically, with the data given in Ta-
ble 9.3.
(Figure 9.7: slalom problem with n = 5 obstacles. “Uphill” (resp.
“downhill”) is on the left (resp. right) side. The middle path is
dashed; initial and final positions are not shown.)

Table 9.3: Problem data for Exercise 9.2.
i    xi    yi    ci
0    0     4     N/A
1    4     5     3
2    8     4     2
3    12    6     2
4    16    5     1
5    20    7     2
6    24    4     N/A

Exercise 9.3 (Minimum distance to a line segment) The line segment
linking two points p, q ∈ Rn (with p ≠ q) is the set L = {λp + (1 −
λ)q : 0 ≤ λ ≤ 1}.
1. Show that the minimum distance D∗ from a point a ∈ Rn to the
line segment L can be written as a QP in one variable:
min_λ kλc + d k2² : 0 ≤ λ ≤ 1,
for appropriate vectors c, d, which you will determine. Explain
why we can always assume a = 0.
2. Prove that the minimum distance is given by
D∗² = q> q − ( q> ( p − q) )² / k p − q k2²   if p> q ≤ min(q> q, p> p),
D∗² = q> q   if p> q > q> q,
D∗² = p> p   if p> q > p> p.
(Notice that the conditions expressing D∗² are mutually exclusive,
since | p> q | ≤ k p k2 k q k2 .)
| p > q | ≤ k p k2 k q k2 .
3. Interpret the result geometrically.
Exercise 9.4 (Univariate LASSO) Consider the problem
min_{x∈R} f ( x ) = (1/2) k ax − y k2² + λ| x | ,
where λ ≥ 0, a ∈ Rm , y ∈ Rm are given, and x ∈ R is a scalar
variable. This is a univariate version of the LASSO problem discussed
in Section 9.6.2. Assume that y ≠ 0 and a ≠ 0 (since otherwise the
optimal solution of this problem is simply x = 0). Prove that the
optimal solution of this problem is
x∗ = 0 if | a> y| ≤ λ,
x∗ = xls − sgn( xls ) λ / k a k2² if | a> y| > λ,
where xls = a> y / k a k2²
corresponds to the solution of the problem for λ = 0. Verify that this
solution can be expressed more compactly as x∗ = sthr_{λ/k a k2²} ( xls ),
where sthr is the soft threshold function defined in (12.65).
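The closed form stated in Exercise 9.4 can be checked against a brute-force minimization of f over a fine grid (the grid search below is only for verification, with made-up data).

```python
import numpy as np

rng = np.random.default_rng(11)
m = 30
a = rng.normal(size=m)
y = rng.normal(size=m)
lam = 2.0

x_ls = (a @ y) / (a @ a)                   # solution for lambda = 0
if abs(a @ y) <= lam:                      # closed form from the exercise
    x_star = 0.0
else:
    x_star = x_ls - np.sign(x_ls) * lam / (a @ a)

grid = np.linspace(-5.0, 5.0, 400001)      # brute-force check of the minimizer
f = 0.5 * ((a @ a) * grid ** 2 - 2.0 * (a @ y) * grid + y @ y) + lam * np.abs(grid)
print("closed form:  ", round(x_star, 4))
print("grid minimum: ", round(float(grid[np.argmin(f)]), 4))
```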
Exercise 9.5 (An optimal breakfast) We are given a set of n = 3
types of food, each of which has the nutritional characteristics de-
scribed in Table 9.4. Find the optimal composition (amount of serv-
ings per each food) of a breakfast having minimum cost, number
of calories between 2000 and 2250, amount of vitamin between 5000
and 10000, and sugar level no larger than 1000, assuming that the
maximum number of servings is 10.
Table 9.4: Food costs and nutritional values per serving.
Food    Cost   Vitamin   Sugar   Calories
Corn    0.15   107       45      70
Milk    0.25   500       40      121
Bread   0.05   0         60      65
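Exercise 9.5 is a small diet-style LP; a possible formulation sketch with scipy.optimize.linprog (the variable is the number of servings of each food; all bounds and constraint values are taken from the exercise statement).

```python
import numpy as np
from scipy.optimize import linprog

cost     = np.array([0.15, 0.25, 0.05])      # corn, milk, bread
vitamin  = np.array([107.0, 500.0, 0.0])
sugar    = np.array([45.0, 40.0, 60.0])
calories = np.array([70.0, 121.0, 65.0])

# all inequalities in the form A_ub x <= b_ub
A_ub = np.vstack([calories, -calories, vitamin, -vitamin, sugar])
b_ub = np.array([2250.0, -2000.0, 10000.0, -5000.0, 1000.0])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 10)] * 3)
print("servings (corn, milk, bread):", np.round(res.x, 3))
print("minimum cost:", round(float(res.fun), 3))
```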
Exercise 9.6 (An LP with wide matrix) Consider the LP
p∗ = min c> x : l ≤ Ax ≤ u,
x
where A ∈ Rm,n , c ∈ Rn , and l, u ∈ Rm , with l ≤ u. We assume that
A is wide, and full rank, that is: m ≤ n, m = rank( A). We are going
to develop a closed-form solution to the LP.
1. Explain why the problem is always feasible.
2. Assume that c 6∈ R( A> ). Using the result of Exercise 6.2, show
that p∗ = −∞. Hint: set x = x0 + tr, where x0 is feasible, r is such
that Ar = 0, c> r > 0, and let t → −∞.
3. Now assume that there exists d ∈ Rm such that c = A> d. Using
the fundamental theorem of linear algebra (see Section 3.2.4), any
vector x can be written as x = A> y + z for some pair (y, z) with
Az = 0. Use this fact, and the result of the previous part, to
express the problem in terms of the variable y only.
4. Reduce further the problem to one of the form
min d> v : l ≤ v ≤ u.
v
Make sure to justify any change of variable you may need. Write
the solution to the above in closed form. Make sure to express the
solution steps of the method clearly.
Exercise 9.7 (Median versus average) For a given vector v ∈ Rn , the
average can be found as the solution to the optimization problem
min kv − x1k22 , (9.12)
x ∈R
where 1 is the vector of ones in Rn . Similarly, it turns out that the
median (any value x such that there is an equal number of values in
v above or below x) can be found via
min kv − x1k1 . (9.13)
x ∈R
We consider a robust version of the average problem (9.12):
min_x max_{u : kuk∞ ≤ λ} kv + u − x1k2² ,   (9.14)
in which we assume that the components of v can be independently
perturbed by a vector u whose magnitude is bounded by a given
number λ ≥ 0.
1. Is the robust problem (9.14) convex? Justify your answer precisely,
based on expression (9.14), and without further manipulation.
2. Show that problem (9.14) can be expressed as
min_{x∈R} ∑_{i=1}^n ( |vi − x | + λ )² .
3. Express the problem as a QP. State precisely the variables, and
constraints if any.
4. Show that when λ is large, the solution set approaches that of the
median problem (9.13).
5. It is often said that the median is a more robust notion of “middle”
value than the average, when noise is present in v. Based on the
previous part, justify this statement.
Exercise 9.8 (Convexity and concavity of optimal value of an LP) Consider the linear programming problem

  p* := min_x cᵀx : Ax ≤ b,

where c ∈ R^n, A ∈ R^{m,n}, b ∈ R^m. Prove the following statements, or provide a counter-example.
1. The optimal value p* is a concave function of c.
2. The optimal value p* is a convex function of b (you may assume that the problem is feasible).
3. The optimal value p* is a concave function of A.
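Before attempting a proof, it can help to probe the statements numerically, e.g., by checking a midpoint inequality on random instances. A minimal sketch for statement 1 (scipy; the bounded feasible set below is our own choice, so that the LP always has a finite value):

import numpy as np
from scipy.optimize import linprog

A = np.vstack([np.eye(3), -np.eye(3)])       # feasible set: the box -1 <= x <= 1
b = np.ones(6)
pstar = lambda c: linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 3).fun

rng = np.random.default_rng(1)
c1, c2 = rng.normal(size=3), rng.normal(size=3)
# concavity in c requires p*((c1+c2)/2) >= (p*(c1) + p*(c2))/2
print(pstar((c1 + c2) / 2) >= (pstar(c1) + pstar(c2)) / 2 - 1e-9)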
Exercise 9.9 (Variational formula for the dominant eigenvalue) Recall from Exercise 3.11 that a positive matrix A > 0 has a dominant eigenvalue λ = ρ(A) > 0, and corresponding left eigenvector w > 0 and right eigenvector v > 0 (i.e., wᵀA = λwᵀ, Av = λv) which belong to the probability simplex S = {x ∈ R^n : x ≥ 0, 1ᵀx = 1}. In this exercise, we shall prove that the dominant eigenvalue has an optimization-based characterization, similar in spirit to the "variational" characterization of the eigenvalues of symmetric matrices. Define the function f : S → R++ with values

  f(x) := min_{i=1,...,n} (a_iᵀx)/x_i,  for x ∈ S,

where a_iᵀ is the i-th row of A, and we set a_iᵀx/x_i = +∞ whenever x_i = 0.
1. Prove that, for all x ∈ S and A > 0, it holds that Ax ≥ f(x) x ≥ 0.
2. Prove that f(x) ≤ λ for all x ∈ S.
3. Show that f(v) = λ, and hence conclude that

  λ = max_{x∈S} f(x),

which is known as the Collatz–Wielandt formula for the dominant eigenvalue of a positive matrix. This formula actually holds more generally for non-negative matrices,13 but you are not asked to prove this fact.

13 For a non-negative matrix A ≥ 0, an extension of the results stated in Exercise 3.11 for positive matrices holds. More precisely, if A ≥ 0, then λ = ρ(A) ≥ 0 is still an eigenvalue of A, with a corresponding eigenvector v ≥ 0 (the difference here being that λ could be zero, and not simple, and that v may not be strictly positive). The stronger results of λ > 0 and simple, and v > 0, are recovered under the additional assumption that A ≥ 0 is primitive, that is, there exists an integer k such that A^k > 0 (Perron–Frobenius theorem).
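The Collatz–Wielandt formula is easy to check numerically: sampling points in the simplex gives values of f that stay below ρ(A) and approach it from below. A small sketch (numpy; the data is a random positive matrix of our choosing):

import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(4, 4))             # a positive matrix
rho = np.max(np.abs(np.linalg.eigvals(A)))         # dominant eigenvalue

f = lambda x: np.min((A @ x) / x)                  # Collatz-Wielandt ratio (x > 0)
X = rng.dirichlet(np.ones(4), size=100000)         # random points in the simplex
best = max(f(x) for x in X)
print(rho, best)                                   # best stays <= rho and is close to it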
Exercise 9.10 (LS with uncertain A matrix) Consider a linear least-squares problem where the matrix involved is random. Precisely, the residual vector is of the form A(δ)x − b, where the m × n matrix A is affected by stochastic uncertainty. In particular, assume that

  A(δ) = A0 + ∑_{i=1}^p A_i δ_i,

where δ_i, i = 1, . . . , p, are i.i.d. random variables with zero mean and variance σ_i². The standard least-squares objective function ‖A(δ)x − b‖₂² is now random, since it depends on δ. We seek to determine x such that the expected value (with respect to the random variable δ) of ‖A(δ)x − b‖₂² is minimized. Is such a problem convex? If yes, to which class does it belong (LP, LS, QP, etc.)?
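Whatever deterministic reformulation you derive, it can be validated against a Monte Carlo estimate of the expected objective. A minimal sketch of such an estimator (numpy; the function name and the Gaussian choice for δ are ours; any zero-mean distribution with the given variances would do):

import numpy as np

def expected_obj_mc(A0, As, sigmas, x, b, n_samples=20000, seed=0):
    # Monte Carlo estimate of E ||A(delta) x - b||_2^2
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_samples):
        delta = rng.normal(0.0, sigmas)
        A = A0 + sum(d * Ai for d, Ai in zip(delta, As))
        vals.append(np.sum((A @ x - b) ** 2))
    return np.mean(vals)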
10. Second-Order Cone and Robust Models
Exercise 10.1 (Squaring SOCP constraints) When considering a se-
cond-order cone constraint, a temptation might be to square it in
order to obtain a classical convex quadratic constraint. This might
not always work. Consider the constraint
x1 + 2x2 ≥ k x k2 ,
and its squared counterpart:
( x1 + 2x2 )2 ≥ k x k22 .
Is the set defined by the second inequality convex? Discuss.
Exercise 10.2 (A complicated function) We would like to minimize the function f : R³ → R, with values

  f(x) = max{ x1 + x2 − min( min(x1 + 2, x2 + 2x1 − 5), x3 − 6 ),  ((x1 − x3)² + 2x2²)/(1 − x1) },

with the constraint ‖x‖∞ < 1. Explain precisely how to formulate the problem as an SOCP in standard form.
Exercise 10.3 (A minimum time path problem) Consider Figure 10.8, in which a point starting at 0 must move to reach point p = [4 2.5]ᵀ, crossing three layers of fluid having different densities.

Figure 10.8: A minimum-time path problem.

In the first layer, the point can travel at a maximum speed v1, while in the second and third layers it may travel at lower maximum speeds, respectively v2 = v1/η2 and v3 = v1/η3, with η2, η3 > 1. Assume v1 = 1, η2 = 1.5, η3 = 1.2. You have to determine the fastest (i.e., minimum time) path from 0 to p. Hint: you may use the path leg lengths ℓ1, ℓ2, ℓ3 as variables, and observe that, in this problem, equality constraints of the type ℓi = "something" can be equivalently substituted by inequality constraints ℓi ≥ "something" (explain why).
Exercise 10.4 (k-ellipses) Consider k points x1 , . . . , xk in R2 . For a
given positive number d, we define the k-ellipse with radius d as the
set of points x ∈ R2 such that the sum of the distances from x to the
points xi is equal to d.
1. How do k-ellipses look when k = 1 or k = 2? Hint: for k = 2,
show that you can assume x1 = − x2 = p, k pk2 = 1, and describe
the set in an orthonormal basis of Rn such that p is the first unit
vector.
2. Express the problem of computing the geometric median, which is
the point that minimizes the sum of the distances to the points xi ,
i = 1, . . . , k, as an SOCP in standard form.
3. Write a code with input X = ( x1 , . . . , xk ) ∈ R2,k and d > 0 that
plots the corresponding k-ellipse.
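For part 3, one possible minimal implementation draws the k-ellipse as a level set of the sum-of-distances function, evaluated on a grid (numpy/matplotlib; function and parameter names are ours):

import numpy as np
import matplotlib.pyplot as plt

def plot_k_ellipse(X, d, margin=1.0, n_grid=400):
    # plot {x in R^2 : sum_i ||x - x_i||_2 = d}; X has shape (2, k)
    lo, hi = X.min(axis=1) - margin - d, X.max(axis=1) + margin + d
    u = np.linspace(lo[0], hi[0], n_grid)
    v = np.linspace(lo[1], hi[1], n_grid)
    U, V = np.meshgrid(u, v)
    G = sum(np.sqrt((U - X[0, i]) ** 2 + (V - X[1, i]) ** 2) for i in range(X.shape[1]))
    plt.contour(U, V, G, levels=[d])
    plt.plot(X[0], X[1], "k.")
    plt.axis("equal")
    plt.show()

# example: three points and radius d = 6
plot_k_ellipse(np.array([[0.0, 2.0, 1.0], [0.0, 0.0, 2.0]]), 6.0)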
Exercise 10.5 (A portfolio design problem) The returns on n = 4 assets are described by a Gaussian (normal) random vector r ∈ R^n, having the following expected value r̂ and covariance matrix Σ:

  r̂ = [0.12, 0.10, 0.07, 0.03]ᵀ,

  Σ = [  0.0064   0.0008  −0.0011   0
         0.0008   0.0025   0        0
        −0.0011   0        0.0004   0
         0        0        0        0 ].

The last (fourth) asset corresponds to a risk-free investment. An investor wants to design a portfolio mix with weights x ∈ R^n (each weight x_i is non-negative, and the sum of the weights is one) so as to obtain the best possible expected return r̂ᵀx, while guaranteeing that: (i) no single asset has a weight larger than 40%; (ii) the risk-free asset has a weight of at most 20%; (iii) no asset has a weight smaller than 5%; (iv) the probability of experiencing a return lower than q = −3% is no larger than ε = 10⁻⁴. What is the maximal achievable expected return, under the above constraints?
Exercise 10.6 (A trust-region problem) A version of the so-called (convex) trust-region problem amounts to finding the minimum of a convex quadratic function over a Euclidean ball, that is

  min_x (1/2) xᵀHx + cᵀx + d
  s.t.: xᵀx ≤ r²,

where H ≻ 0, and r > 0 is the given radius of the ball. Prove that the optimal solution to this problem is unique and is given by

  x(λ*) = −(H + λ*I)⁻¹ c,

where λ* = 0 if ‖H⁻¹c‖₂ ≤ r, and otherwise λ* is the unique value such that ‖(H + λ*I)⁻¹c‖₂ = r.
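Since the form of the solution is given in the statement, it can be turned directly into a short numerical routine: if the unconstrained minimizer is feasible take λ* = 0, otherwise find λ* by bisection on the monotone function λ ↦ ‖(H + λI)⁻¹c‖₂. A sketch (numpy; names are ours):

import numpy as np

def trust_region(H, c, r, tol=1e-10):
    x0 = -np.linalg.solve(H, c)
    if np.linalg.norm(x0) <= r:
        return x0                                    # lambda* = 0
    norm_x = lambda lam: np.linalg.norm(np.linalg.solve(H + lam * np.eye(len(c)), c))
    lo, hi = 0.0, 1.0
    while norm_x(hi) > r:                            # bracket lambda*
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if norm_x(mid) > r else (lo, mid)
    lam = (lo + hi) / 2
    return -np.linalg.solve(H + lam * np.eye(len(c)), c)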
Exercise 10.7 (Univariate square-root LASSO) Consider the problem

  min_{x∈R}  f(x) := ‖ax − y‖₂ + λ|x|,

where λ ≥ 0, a ∈ R^m, y ∈ R^m are given, and x ∈ R is a scalar variable. This is a univariate version of the square-root LASSO problem introduced in Example 8.23. Assume that y ≠ 0 and a ≠ 0 (since otherwise the optimal solution of this problem is simply x = 0). Prove that the optimal solution of this problem is

  x* = 0    if |aᵀy| ≤ λ‖y‖₂,
  x* = x_ls − sgn(x_ls) (λ/‖a‖₂²) √( (‖a‖₂²‖y‖₂² − (aᵀy)²) / (‖a‖₂² − λ²) )    if |aᵀy| > λ‖y‖₂,

where

  x_ls := aᵀy / ‖a‖₂².
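As with Exercise 9.4, the stated closed form can be checked against a grid search; a short sketch (numpy; names are ours, and the formula assumes λ < ‖a‖₂, which is implied by |aᵀy| > λ‖y‖₂):

import numpy as np

def sqrt_lasso_1d(a, y, lam):
    aa, ay, yy = a @ a, a @ y, y @ y
    if abs(ay) <= lam * np.sqrt(yy):
        return 0.0
    x_ls = ay / aa
    shrink = (lam / aa) * np.sqrt((aa * yy - ay ** 2) / (aa - lam ** 2))
    return x_ls - np.sign(x_ls) * shrink

rng = np.random.default_rng(0)
a, y, lam = rng.normal(size=6), rng.normal(size=6), 0.8
xs = np.linspace(-3, 3, 600001)
f = np.linalg.norm(np.outer(xs, a) - y, axis=1) + lam * np.abs(xs)
print(sqrt_lasso_1d(a, y, lam), xs[np.argmin(f)])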
Exercise 10.8 (Proving convexity via duality) Consider the function f : R++^n → R, with values

  f(x) = 2 max_t ( t − ∑_{i=1}^n √(x_i + t²) ).

1. Explain why the problem that defines f is a convex optimization problem (in the variable t). Formulate it as an SOCP.
2. Is f convex?
3. Show that the function g : R++^n → R, with values

  g(y) = ∑_{i=1}^n 1/y_i − 1/(∑_{i=1}^n y_i),

is convex. Hint: for a given y ∈ R++^n, show that

  g(y) = max_{x>0} ( −xᵀy − f(x) ).

Make sure to justify any use of strong duality.
Exercise 10.9 (Robust sphere enclosure) Let Bi , i = 1, . . . , m, be m
given Euclidean balls in Rn , with centers xi and radii ρi ≥ 0. We
wish to find a ball B of minimum radius that contains all the Bi ,
i = 1, . . . , m. Explain how to cast this problem into a known convex
optimization format.
11. Semidefinite Models
Exercise 11.1 (Minimum distance to a line segment revisited) In this exercise, we revisit Exercise 9.3, and approach it using the S-procedure of Section 11.3.3.1.
1. Show that the minimum distance from the line segment L to the origin is above a given number R ≥ 0 if and only if

  ‖λ(p − q) + q‖₂² ≥ R²  whenever  λ(1 − λ) ≥ 0.

2. Apply the S-procedure, and prove that the above is in turn equivalent to the LMI in τ ≥ 0:

  [ ‖p − q‖₂² + τ       qᵀ(p − q) − τ/2 ]
  [ qᵀ(p − q) − τ/2     qᵀq − R²        ]  ⪰ 0.

3. Using the Schur complement rule,14 show that the above is consistent with the result given in Exercise 9.3.

14 See Theorem 4.9.
Exercise 11.2 (A variation on principal component analysis) Let X = [x1, . . . , x_m] ∈ R^{n,m}. For p = 1, 2, we consider the problem

  φ_p(X) := max_u ∑_{i=1}^m |x_iᵀu|^p : uᵀu = 1.  (11.15)

If the data is centered, the case p = 1 amounts to finding a direction of largest "deviation" from the origin, where deviation is measured using the ℓ1-norm; arguably, this is less sensitive to outliers than the case p = 2, which corresponds to principal component analysis.
1. Find an expression for φ2, in terms of the singular values of X.
2. Show that the problem, for p = 1, can be approximated via an SDP, as φ1(X) ≤ ψ1(X), where

  ψ1(X) := max_U ∑_{i=1}^m √(x_iᵀ U x_i) : U ⪰ 0, trace U = 1.
Is ψ1 a norm?
3. Formulate a dual to the above expression. Does strong duality
hold? Hint: introduce new variables zi = xi> Uxi , i = 1, . . . , m, and
dualize the corresponding constraints.
4. Use the identity (8.11) to approximate, via weak duality, the prob-
lem (11.15). How does your bound compare with ψ1 ?
5. Show that

  ψ1(X)² = min_D trace D : D diagonal, D ⪰ 0, D ⪰ XᵀX.

Hint: scale the variables in the dual problem and optimize over the scaling. That is, set D = α D̄, with λ_max(X D̄⁻¹ Xᵀ) = 1 and α > 0, and optimize over α. Then argue that we can replace the equality constraint on D̄ by a convex inequality, and use Schur complements to handle the corresponding inequality.
6. Show that

  φ1(X) = max_{v : ‖v‖∞ ≤ 1} ‖Xv‖₂.

Is the maximum always attained with a vector v such that |v_i| = 1 for every i? Hint: use the fact that

  ‖z‖₁ = max_{v : ‖v‖∞ ≤ 1} zᵀv.
7. A result by Yu. Nesterov15 shows that for any symmetric matrix Q ∈ R^{m,m}, the problem

  p* = max_{v : ‖v‖∞ ≤ 1} vᵀQv

can be approximated within π/2 relative value via SDP. Precisely, (2/π) d* ≤ p* ≤ d*, where

  d* = min_D trace D : D diagonal, D ⪰ Q.  (11.16)

Use this result to show that

  √(2/π) ψ1(X) ≤ φ1(X) ≤ ψ1(X).

15 Yu. Nesterov, Quality of semidefinite relaxation for nonconvex quadratic optimization, discussion paper, CORE, 1997.
That is, the SDP approximation is within ≈ 80% of the true value,
irrespective of the problem data.
8. Discuss the respective complexity of the problems of computing
φ2 and ψ1 (you can use the fact that, for a given m × m symmetric
matrix Q, the SDP (11.16) can be solved in O(m3 )).
Exercise 11.3 (Robust principal component analysis) The following problem is known as robust principal component analysis:16

  p* := min_X ‖A − X‖* + λ‖X‖₁,

where ‖·‖* stands for the nuclear norm,17 and ‖·‖₁ here denotes the sum of the absolute values of the elements of a matrix. The interpretation is the following: A is a given data matrix and we would like to decompose it as a sum of a low-rank matrix and a sparse matrix. The nuclear norm and ℓ1-norm penalties are respective convex heuristics for these two properties. At optimum, X* will be the sparse component and A − X* will be the low-rank component, such that their sum gives A.

16 See Section 13.5.4.
17 The nuclear norm is the sum of the singular values of the matrix; see Section 11.4.1.4 and Section 5.2.2.
1. Find a dual for this problem. Hint: we have, for any matrix W:

  ‖W‖* = max_Y trace(WᵀY) : ‖Y‖₂ ≤ 1,

where ‖·‖₂ is the largest singular value norm.
2. Transform the primal or dual problem into a known programming class (i.e., LP, SOCP, SDP, etc.). Determine the number of variables and constraints. Hint: we have

  ‖Y‖₂ ≤ 1 ⟺ I − YYᵀ ⪰ 0,

where I is the identity matrix.
3. Using the dual, show that when λ > 1, the optimal solution is the
zero matrix. Hint: if Y ∗ is the optimal dual variable, the comple-
mentary slackness condition states that |Yij∗ | < λ implies Xij∗ = 0
at optimum.
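The primal problem is directly expressible in a modeling language, which is convenient for experimenting with the effect of λ (part 3, for instance, predicts X* = 0 when λ > 1). A minimal sketch (cvxpy assumed available; the synthetic low-rank-plus-sparse data is ours):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
L = np.outer(rng.normal(size=8), rng.normal(size=8))            # low-rank part
S = np.zeros((8, 8)); S[rng.integers(0, 8, 5), rng.integers(0, 8, 5)] = 5.0
A = L + S

X = cp.Variable((8, 8))
lam = 0.3
cp.Problem(cp.Minimize(cp.normNuc(A - X) + lam * cp.sum(cp.abs(X)))).solve()
print(np.linalg.matrix_rank(A - X.value, tol=1e-3))             # rank of the recovered low-rank part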
Exercise 11.4 (Boolean least squares) Consider the following problem, known as Boolean least squares:

  φ = min_x ‖Ax − b‖₂² : x_i ∈ {−1, 1}, i = 1, . . . , n.

Here, the variable is x ∈ R^n, where A ∈ R^{m,n} and b ∈ R^m are given. This is a basic problem arising, for instance, in digital communications. A brute-force solution is to check all 2^n possible values of x, which is usually impractical.
1. Show that the problem is equivalent to

  φ = min_{X,x} trace(AᵀAX) − 2bᵀAx + bᵀb
      s.t.: X = xxᵀ,
            X_ii = 1, i = 1, . . . , n,

in the variables X = Xᵀ ∈ R^{n,n} and x ∈ R^n.
2. The constraint X = xxᵀ, i.e., the set of rank-1 matrices, is not convex, therefore the problem is still hard. However, an efficient approximation can be obtained by relaxing this constraint to X ⪰ xxᵀ, as discussed in Section 11.3.3, obtaining

  φ ≥ φ_sdp = min_{X,x} trace(AᵀAX) − 2bᵀAx + bᵀb
              s.t.: [ X    x ]
                    [ xᵀ   1 ]  ⪰ 0,
                    X_ii = 1, i = 1, . . . , n.
The relaxation produces a lower-bound to the original problem.
Once that is done, an approximate solution to the original problem
can be obtained by rounding the solution: xsdp = sgn( x ∗ ), where
x ∗ is the optimal solution of the semidefinite relaxation.
3. Another approximation method is to relax the non-convex con-
straints xi ∈ {−1, 1} to convex interval constraints −1 ≤ xi ≤ 1
for all i, which can be written k x k∞ ≤ 1. Therefore, a different
lower bound is given by:
  φ ≥ φ_int := min_x ‖Ax − b‖₂² : ‖x‖∞ ≤ 1.

Once that problem is solved, we can round the solution by x_int = sgn(x*) and compute the original objective value ‖Ax_int − b‖₂².
4. Which one of φsdp and φint produces the closest approximation to
φ? Justify your answer carefully.
5. Use now 100 independent realizations with normally distributed data, A ∈ R^{10,10} (independent entries with mean zero) and b ∈ R^{10} (independent entries with mean 1). Plot and compare the histograms of ‖Ax_sdp − b‖₂² of part 2, ‖Ax_int − b‖₂² of part 3, and the objective corresponding to a naïve method ‖Ax_ls − b‖₂², where x_ls = sgn((AᵀA)⁻¹Aᵀb) is the rounded ordinary least-squares solution. Briefly discuss accuracy and computation time (in seconds) of the three methods.
6. Assume that, for some problem instance, the optimal solution
( x, X ) found via the SDP approximation is such that x belongs
to the original non-convex constraint set { x : xi ∈ {−1, 1}, i =
1, . . . , n}. What can you say about the SDP approximation in that
case?
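A sketch of the numerical comparison requested in part 5, for a single realization, is given below (cvxpy and numpy assumed; all helper names are ours). The full experiment simply repeats this over 100 realizations and collects histograms.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 10))
b = rng.normal(loc=1.0, size=10)
obj = lambda x: np.sum((A @ x - b) ** 2)

# naive method: rounded ordinary least squares
x_ls = np.sign(np.linalg.solve(A.T @ A, A.T @ b))

# interval relaxation ||x||_inf <= 1, then rounding
xr = cp.Variable(10)
cp.Problem(cp.Minimize(cp.sum_squares(A @ xr - b)), [cp.norm(xr, "inf") <= 1]).solve()
x_int = np.sign(xr.value)

# SDP relaxation with Z = [X x; x' 1] >> 0, X_ii = 1, then rounding
Z = cp.Variable((11, 11), symmetric=True)
cons = [Z >> 0, Z[10, 10] == 1] + [Z[i, i] == 1 for i in range(10)]
objective = cp.trace(A.T @ A @ Z[:10, :10]) - 2 * b @ (A @ Z[:10, 10]) + b @ b
cp.Problem(cp.Minimize(objective), cons).solve()
x_sdp = np.sign(Z.value[:10, 10])

print(obj(x_ls), obj(x_int), obj(x_sdp))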
Exercise 11.5 (Auto-regressive process model) We consider a process
described by the difference equation
y(t + 2) = α1 (t)y(t + 1) + α2 (t)y(t) + α3 (t)u(t), t = 0, 1, 2, . . . ,
where u(t) ∈ R is the input, y(t) ∈ R the output, and the coefficient vector α(t) ∈ R³ is time-varying. We seek to compute bounds on the vector α(t) that are (a) independent of t, and (b) consistent with
some given historical data.
The specific problem we consider is: given the values of u(t) and
y(t) over a time period 1 ≤ t ≤ T, find the smallest ellipsoid E in R3
such that, for every t, 1 ≤ t ≤ T, the equation above is satisfied for
some α(t) ∈ E .
1. What is a geometrical interpretation of the problem, in the space
of αs?
2. Formulate the problem as a semidefinite program. You are free to
choose the parameterization, as well as the measure of the size of
E that you find most convenient.
3. Assume we restrict our search to spheres instead of ellipsoids.
Show that the problem can be reduced to a linear program.
4. In the previous setting, α(t) is allowed to vary with time arbi-
trarily fast, which may be unrealistic. Assume that a bound is
imposed on the variation of α(t), such as kα(t + 1) − α(t)k2 ≤ β,
where β > 0 is given. How would you solve the problem with this
added restriction?
Exercise 11.6 (Non-negativity of polynomials) A second-degree polynomial with values p(x) = y0 + y1 x + y2 x² is non-negative everywhere if and only if

  ∀x :  [x 1] [ y0, y1/2 ; y1/2, y2 ] [x 1]ᵀ ≥ 0,

which in turn can be written as an LMI in y = (y0, y1, y2):

  [ y0     y1/2 ]
  [ y1/2   y2   ]  ⪰ 0.
In this exercise, you show a more general result, which applies to
any polynomial of even degree 2k (polynomials of odd degree can’t
be non-negative everywhere). To simplify, we only examine the case
k = 2, that is, fourth-degree polynomials; the method employed here
can be generalized to k > 2.
1. Show that a fourth-degree polynomial p is non-negative everywhere if and only if it is a sum of squares, that is, it can be written as

  p(x) = ∑_{i=1}^4 q_i(x)²,

where the q_i are polynomials of degree at most two. Hint: show that p is non-negative everywhere if and only if it is of the form

  p(x) = p0 ((x − a1)² + b1²) ((x − a2)² + b2²),

for some appropriate real numbers a_i, b_i, i = 1, 2, and some p0 ≥ 0.
2. Using the previous part, show that if a fourth-degree polynomial is a sum of squares, then it can be written as

  p(x) = [1  x  x²] Q [1  x  x²]ᵀ  (11.17)

for some positive semidefinite matrix Q.
3. Show the converse: if a positive semidefinite matrix Q satisfies
condition (11.17) for every x, then p is a sum of squares. Hint: use
a factorization of Q of the form Q = AA> , for some appropriate
matrix A.
4. Show that a fourth-degree polynomial p(x) = y0 + y1 x + y2 x² + y3 x³ + y4 x⁴ is non-negative everywhere if and only if there exists a 3 × 3 matrix Q such that

  Q ⪰ 0,   y_{l−1} = ∑_{i+j=l+1} Q_{ij},   l = 1, . . . , 5.

Hint: equate the coefficients of the powers of x on the left and right sides of equation (11.17).
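The characterization in part 4 is an SDP feasibility test, which can be coded directly (cvxpy assumed; the function name and test polynomials are ours):

import numpy as np
import cvxpy as cp

def quartic_is_nonnegative(y):
    # y = (y0, ..., y4): coefficients of p(x) = y0 + y1 x + ... + y4 x^4
    Q = cp.Variable((3, 3), symmetric=True)
    cons = [Q >> 0]
    for l in range(1, 6):
        cons.append(sum(Q[i - 1, j - 1] for i in range(1, 4) for j in range(1, 4)
                        if i + j == l + 1) == y[l - 1])
    prob = cp.Problem(cp.Minimize(0), cons)
    prob.solve()
    return prob.status == "optimal"

print(quartic_is_nonnegative([1, 0, 0, 0, 1]))     # 1 + x^4: True
print(quartic_is_nonnegative([-1, 0, 0, 0, 1]))    # x^4 - 1: False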
Exercise 11.7 (Sum of top eigenvalues) For X ∈ S^n and i ∈ {1, . . . , n}, we denote by λ_i(X) the i-th largest eigenvalue of X. For k ∈ {1, . . . , n}, we define the function f_k : S^n → R with values

  f_k(X) = ∑_{i=1}^k λ_i(X).
This function is an intermediate between the largest eigenvalue (ob-
tained with k = 1) and the trace (obtained with k = n).
1. Show that for every t ∈ R, we have f_k(X) ≤ t if and only if there exist Z ∈ S^n and s ∈ R such that

  t − ks − trace(Z) ≥ 0,   Z ⪰ 0,   Z − X + sI ⪰ 0.

Hint: for the sufficiency part, think about the interlacing property18 of the eigenvalues.

18 See Eq. (4.6).
2. Show that f k is convex. Is it a norm?
3. How would you generalize these results to the function that as-
signs the sum of the top k singular values to a general rectangular
m × n matrix, with k ≤ min(m, n)? Hint: for X ∈ Rm,n , consider
the symmetric matrix

  X̃ := [ 0    X ]
        [ Xᵀ   0 ].
12. Introduction to Algorithms
Exercise 12.1 (Successive projections for linear inequalities) Con-
sider a system of linear inequalities Ax ≤ b, with A ∈ Rm,n , where
ai> , i = 1, . . . , m, denote the rows of A, which are assumed, without
loss of generality, to be nonzero. Each inequality ai> x ≤ bi can be
normalized by dividing both terms by k ai k2 , hence we shall further
assume without loss of generality that k ai k2 = 1, i = 1, . . . , m.
Consider now the case when the polyhedron described by these inequalities, P := {x : Ax ≤ b}, is nonempty, that is, there exists at least one point x̄ ∈ P. In order to find a feasible point (i.e., a point in P), we propose the following simple algorithm. Let k denote the iteration number and initialize the algorithm with any initial point x_k = x0 at k = 0. If a_iᵀ x_k ≤ b_i holds for all i = 1, . . . , m, then we have found the desired point, hence we return x_k, and finish. If instead there exists i_k such that a_{i_k}ᵀ x_k > b_{i_k}, then we set s_k := a_{i_k}ᵀ x_k − b_{i_k}, we update19 the current point as

  x_{k+1} = x_k − s_k a_{i_k},

and we iterate the whole process.

19 This algorithm is a version of the so-called Agmon–Motzkin–Schoenberg relaxation method for linear inequalities, which dates back to 1953.
1. Give a simple geometric interpretation of this algorithm.
2. Prove that this algorithm either finds a feasible solution in a fi-
nite number of iterations, or it produces a sequence of solutions
{ xk } that converges asymptotically (i.e., for k → ∞) to a feasible
solution (if one exists).
3. The problem of finding a feasible solution for linear inequalities can also be put in relation with the minimization of the non-smooth function f0(x) = max_{i=1,...,m} (a_iᵀx − b_i). Develop a subgradient-type algorithm for this version of the problem, discuss the hypotheses that need to be assumed to guarantee convergence, and clarify the relations and similarities with the previous algorithm.
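The algorithm of this exercise is fully specified and short to implement. A possible sketch (numpy; here the violated inequality is chosen as the most violated one, which is one admissible choice):

import numpy as np

def relaxation_method(A, b, x0, max_iter=10000, tol=1e-9):
    # rows of A are assumed normalized to unit 2-norm
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        viol = A @ x - b
        i = int(np.argmax(viol))
        if viol[i] <= tol:
            return x                      # feasible point found
        x = x - viol[i] * A[i]            # project onto the violated hyperplane
    return x

# example: the box 0 <= x <= 1, written as Ax <= b with unit-norm rows
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([1.0, 1.0, 0.0, 0.0])
print(relaxation_method(A, b, np.array([5.0, -3.0])))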
Exercise 12.2 (Conditional gradient method) Consider a constrained minimization problem

  p* = min_{x∈X} f0(x),  (12.18)

where f0 is convex and smooth and X ⊆ R^n is convex and compact. Clearly, a projected gradient or proximal gradient algorithm could be applied to this problem, if the projection onto X is easy to compute. When this is not the case, the following alternative algorithm has been proposed.20 Initialize the iterations with some x0 ∈ X, and set k = 0. Determine the gradient g_k := ∇f0(x_k) and solve

  z_k = arg min_{x∈X} g_kᵀ x.

20 Versions of this algorithm are known as the Frank–Wolfe algorithm, which was developed in 1956 for quadratic f0, or as the Levitin–Polyak conditional gradient algorithm (1966).
Then update the current point as

  x_{k+1} = (1 − γ_k) x_k + γ_k z_k,

where γ_k ∈ [0, 1], and, in particular, we choose

  γ_k = 2/(k + 2),  k = 0, 1, . . .

Assume that f0 has a Lipschitz continuous gradient with Lipschitz constant21 L, and that ‖x − y‖₂ ≤ R for every x, y ∈ X. In this exercise, you shall prove that

  δ_k := f0(x_k) − p* ≤ 2LR²/(k + 2),  k = 1, 2, . . .  (12.19)

21 As defined in Section 12.1.1.
1. Using the inequality

  f0(x) − f0(x_k) ≤ ∇f0(x_k)ᵀ(x − x_k) + (L/2)‖x − x_k‖₂²,

which holds for any convex f0 with Lipschitz continuous gradient,22 prove that

  f0(x_{k+1}) ≤ f0(x_k) + γ_k ∇f0(x_k)ᵀ(z_k − x_k) + γ_k² LR²/2.

Hint: write the inequality condition above for x = x_{k+1}.

22 See Lemma 12.1.

2. Show that the following recursion holds for δ_k:

  δ_{k+1} ≤ (1 − γ_k) δ_k + γ_k² C,  k = 0, 1, . . . ,

for C := LR²/2. Hint: use the optimality condition for z_k, and the convexity inequality f0(x*) ≥ f0(x_k) + ∇f0(x_k)ᵀ(x* − x_k).
3. Prove by induction on k the desired result (12.19).
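As a concrete instance, the method is easy to run when X is the probability simplex, since the linear subproblem is solved by a vertex. A sketch (numpy; the example projects a point onto the simplex, and all names are ours):

import numpy as np

def frank_wolfe_simplex(grad_f, n, iters=200):
    x = np.ones(n) / n
    for k in range(iters):
        g = grad_f(x)
        z = np.zeros(n); z[np.argmin(g)] = 1.0      # argmin of g^T x over the simplex
        gamma = 2.0 / (k + 2)
        x = (1 - gamma) * x + gamma * z
    return x

p = np.array([0.9, 0.5, -0.3, 0.1])
x = frank_wolfe_simplex(lambda x: 2 * (x - p), 4)    # minimize ||x - p||_2^2 over the simplex
print(x, x.sum())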
Exercise 12.3 (Bisection method) The bisection method applies to one-dimensional convex problems23 of the form

  min_x f(x) : x_l ≤ x ≤ x_u,

where x_l < x_u are both finite, and f : R → R is convex. The algorithm is initialized with running lower and upper bounds on x, x_lo = x_l and x_up = x_u, and the initial x is set as the midpoint

  x = (x_lo + x_up)/2.

Then the algorithm updates the bounds as follows: a subgradient g of f at x is evaluated; if g < 0, we set x_lo = x; otherwise,24 we set x_up = x. Then the midpoint x is recomputed, and the process is iterated until convergence.

23 See an application in Section 11.4.1.3.
24 Actually, if g = 0 then the algorithm may stop and return x as an optimal solution.
1. Show that the bisection method locates a solution x* within accuracy ε in at most log₂((x_u − x_l)/ε) − 1 steps.
2. Propose a variant of the bisection method for solving the uncon-
strained problem minx f ( x ), for convex f .
3. Write a code to solve the problem with the specific class of functions f : R → R, with values

  f(x) = ∑_{i=1}^n max_{1≤j≤m} ( (1/2) A_ij x² + B_ij x + C_ij ),
where A, B, C are given n × m matrices, with every element of A
non-negative.
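For part 3, note that each term of f is a pointwise maximum of convex quadratics (since A ≥ 0), so a subgradient at x is obtained by picking, for each i, a maximizing index j. A possible sketch (numpy; all names are ours):

import numpy as np

def subgrad(A, B, C, x):
    # one subgradient of f(x) = sum_i max_j (0.5*A_ij x^2 + B_ij x + C_ij)
    vals = 0.5 * A * x ** 2 + B * x + C
    j = np.argmax(vals, axis=1)
    rows = np.arange(A.shape[0])
    return np.sum(A[rows, j] * x + B[rows, j])

def bisection(A, B, C, x_lo, x_up, tol=1e-8):
    while x_up - x_lo > tol:
        x = (x_lo + x_up) / 2
        if subgrad(A, B, C, x) < 0:
            x_lo = x
        else:
            x_up = x
    return (x_lo + x_up) / 2

rng = np.random.default_rng(0)
A, B, C = rng.uniform(0, 1, (5, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
print(bisection(A, B, C, -10.0, 10.0))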
Exercise 12.4 (KKT conditions) Consider the optimization problem25

  min_{x∈R^n}  ∑_{i=1}^n ( (1/2) d_i x_i² + r_i x_i )
  s.t.:  aᵀx = 1,  x_i ∈ [−1, 1],  i = 1, . . . , n,

where a ≠ 0 and d > 0.

25 Problem due to Suvrit Sra (2013).
1. Verify if strong duality holds for this problem, and write down
the KKT optimality conditions.
2. Use the KKT conditions and/or the Lagrangian to come up with
the fastest algorithm you can to solve this optimization problem.
3. Analyze the running time complexity of your algorithm. Does the
empirical performance of your method agree with your analysis?
Exercise 12.5 (Sparse Gaussian graphical models) We consider the following problem in a symmetric n × n matrix variable X:

  max_X  log det X − trace(SX) − λ‖X‖₁ : X ≻ 0,

where S ⪰ 0 is a (given) empirical covariance matrix, ‖X‖₁ denotes the sum of the absolute values of the elements of the positive definite matrix X, and λ > 0 encourages sparsity in the solution X. The problem arises when fitting a multivariate Gaussian graphical model to data.26 The ℓ1-norm penalty encourages the random variables in the model to become conditionally independent.

26 See Section 13.5.5.
1. Show that the dual of the problem takes the form

  min_U − log det(S + U) : |U_ij| ≤ λ.
2. We employ a block-coordinate descent method to solve the dual. Show that if we optimize over one column and row of U at a time, we obtain a sub-problem of the form

  min_x xᵀQx : ‖x − x0‖∞ ≤ 1,

where Q ≻ 0 and x0 ∈ R^{n−1} are given. Make sure to provide the expressions of Q and x0 as functions of the initial data, and the index of the row/column that is to be updated.
3. Show how you can solve the constrained QP problem above us-
ing the following methods. Make sure to state precisely the algo-
rithm’s steps.
• Coordinate descent.
• Dual coordinate ascent.
• Projected subgradient.
• Projected subgradient method for the dual.
• Interior-point method (any flavor will do).
Compare the performance (e.g., theoretical complexity, running
time/convergence time on synthetic data) of these methods.
4. Solve the problem (using block-coordinate descent with five up-
dates of each row/column, each step requiring the solution of the
QP above) for a data file of your choice. Experiment with different
values of λ, report on the graphical model obtained.
Exercise 12.6 (Polynomial fitting with derivative bounds)
In Section 13.2, we examined the problem of fitting a polynomial of
degree d through m data points (ui , yi ) ∈ R2 , i = 1, . . . , m. With-
out loss of generality, we assume that the input satisfies |ui | ≤ 1,
i = 1, . . . , m. We parameterize a polynomial of degree d via its coef-
ficients:
p w ( u ) = w0 + w1 u + · · · + w d u d ,
where w ∈ R^{d+1}. The problem can be written as

  min_w ‖Φᵀw − y‖₂²,
where the matrix Φ has columns φi = (1, ui , . . . , uid ), i = 1, . . . , m. As
detailed in Section 13.2.3, in practice it is desirable to encourage poly-
nomials that are not too rapidly varying over the interval of interest.
To that end, we modify the above problem as follows:

  min_w ‖Φᵀw − y‖₂² + λ b(w),  (12.20)

where λ > 0 is a regularization parameter, and b(w) is a bound on the size of the derivative of the polynomial over [−1, 1]:

  b(w) = max_{u : |u|≤1} | (d/du) p_w(u) |.
1. Is the penalty function b convex? Is it a norm?
2. Explain how to compute a subgradient of b at a point w.
3. Use your result to code a subgradient method for solving prob-
lem (12.20).
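For parts 2–3, a practical (approximate) way to evaluate b and obtain a subgradient is to maximize |p_w'(u)| over a fine grid of [−1, 1] and use the maximizer; a sketch (numpy; names are ours, and the grid makes this an approximation):

import numpy as np

def b_and_subgrad(w, n_grid=2001):
    # b(w) = max_{|u|<=1} |p_w'(u)|, with p_w'(u) = sum_{k>=1} k w_k u^{k-1}
    d = len(w) - 1
    u = np.linspace(-1, 1, n_grid)
    dp = sum(k * w[k] * u ** (k - 1) for k in range(1, d + 1))
    i = np.argmax(np.abs(dp))
    g = np.zeros(d + 1)
    g[1:] = np.sign(dp[i]) * np.arange(1, d + 1) * u[i] ** np.arange(0, d)
    return np.abs(dp[i]), g            # value of b and one (approximate) subgradient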
Exercise 12.7 (Methods for LASSO) Consider the LASSO problem, discussed in Section 9.6.2:

  min_x (1/2)‖Ax − y‖₂² + λ‖x‖₁.

Compare the following algorithms. Try to write your code in a way that minimizes computational requirements; you may find the result in Exercise 9.4 useful.
1. A coordinate-descent method.
2. A subgradient method, as in Section 12.4.1.
3. A fast first-order algorithm, as in Section 12.3.4.
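For the coordinate-descent variant, the scalar update is exactly the soft-thresholding rule of Exercise 9.4 applied to the residual with the current coordinate removed. A possible sketch (numpy; names and the synthetic data are ours):

import numpy as np

def lasso_cd(A, y, lam, iters=100):
    m, n = A.shape
    x = np.zeros(n)
    col_sq = np.sum(A ** 2, axis=0)
    r = y - A @ x
    for _ in range(iters):
        for j in range(n):
            r = r + A[:, j] * x[j]                 # remove coordinate j from the residual
            rho = A[:, j] @ r
            x[j] = 0.0 if abs(rho) <= lam else (rho - np.sign(rho) * lam) / col_sq[j]
            r = r - A[:, j] * x[j]                 # put it back
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
xtrue = np.zeros(20); xtrue[:3] = [2.0, -1.0, 1.5]
y = A @ xtrue + 0.01 * rng.normal(size=50)
print(np.round(lasso_cd(A, y, 2.0), 2))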
Exercise 12.8 (Non-negative terms that sum to one) Let x_i, i = 1, . . . , n, be given real numbers, which we assume without loss of generality to be ordered as x1 ≤ x2 ≤ · · · ≤ x_n, and consider the scalar equation in the variable ν that we encountered in Section 12.3.3.3:

  f(ν) = 1,  where  f(ν) := ∑_{i=1}^n max(x_i − ν, 0).
1. Show that f is continuous and strictly decreasing for ν ≤ xn .
2. Show that a solution ν∗ to this equation exists, it is unique, and it
must belong to the interval [ x1 − 1/n, xn ].
3. This scalar equation could be easily solved for ν using, e.g., the
bisection method. Describe a simpler, “closed-form” method for
finding the optimal ν.
Exercise 12.9 (Eliminating linear equality constraints) We consider a problem with linear equality constraints

  min_x f0(x) : Ax = b,

where A ∈ R^{m,n}, with A full row rank: rank A = m ≤ n, and where we assume that the objective function f0 is decomposable, that is,

  f0(x) = ∑_{i=1}^n h_i(x_i),

with each h_i a convex, twice differentiable function. This problem can
be addressed via different approaches, as detailed in Section 12.2.6.
1. Use the constraint elimination approach of Section 12.2.6.1, and
consider the function f˜0 defined in Eq. (12.33). Express the Hessian
of f˜0 in terms of that of f 0 .
2. Compare the computational effort27 required to solve the problem using the Newton method via the constraint elimination technique, versus using the feasible update Newton method of Section 12.2.6.3, assuming that m ≪ n.

27 See the related Exercise 7.4.
13. Learning from Data
Exercise 13.1 (SVD for text analysis) Assume you are given a data
set in the form of an n × m term-by-document matrix X correspond-
ing to a large collection of news articles. Precisely, the (i, j) entry in
X is the frequency of the word i in the document j. We would like to
visualize this data set on a two-dimensional plot. Explain how you
would do the following (describe your steps carefully in terms of the
SVD of an appropriately centered version of X).
1. Plot the different news sources as points in word space, with max-
imal variance of the points.
2. Plot the different words as points in news-source space, with max-
imal variance of the points.
Exercise 13.2 (Learning a factor model) We are given a data matrix
X = [ x (1) , . . . , x (m) ], with x (i) ∈ Rn , i = 1, . . . , m. We assume that the
data is centered: x^(1) + · · · + x^(m) = 0. An (empirical) estimate of the covariance matrix is28

  Σ = (1/m) ∑_{i=1}^m x^(i) x^(i)ᵀ.

28 See Example 4.2.
In practice, one often finds that the above estimate of the covariance
matrix is noisy. One way to remove noise is to approximate the co-
variance matrix as Σ ≈ λI + FFᵀ, where F is an n × k matrix containing the so-called "factor loadings," with k ≪ n the number of factors, and λ ≥ 0 is the "idiosyncratic noise" variance. The stochastic model
that corresponds to this setup is
x = F f + σe,
where x is the (random) vector of centered observations, ( f , e) is a
random variable with zero mean and unit covariance matrix, and σ = √λ is the standard deviation of the idiosyncratic noise component σe.
The interpretation of the stochastic model is that the observations are
a combination of a small number k of factors, plus a noise part that
affects each dimension independently.
To fit F, λ to the data, we seek to solve
  min_{F, λ≥0} ‖Σ − λI − FFᵀ‖_F.  (13.21)
1. Assume λ is known and less than λk (the k-th largest eigenvalue
of the empirical covariance matrix Σ). Express an optimal F as a
function of λ, which we denote by F (λ). In other words: you are
asked to solve for F, with fixed λ.
2. Show that the error E(λ) = ‖Σ − λI − F(λ)F(λ)ᵀ‖_F, with F(λ) the matrix you found in the previous part, can be written as

  E(λ)² = ∑_{i=k+1}^n (λ_i − λ)².
Find a closed-form expression for the optimal λ that minimizes
the error, and summarize your solution to the estimation prob-
lem (13.21).
3. Assume that we wish to estimate the risk (as measured by vari-
ance) involved in a specific direction in data space. Recall from
Example 4.2 that, given a unit-norm n-vector w, the variance along
the direction w is w> Σw. Show that the rank-k approximation to
Σ results in an under-estimate of the directional risk, as compared
with using Σ. How about the approximation based on the factor
model above? Discuss.
Exercise 13.3 (Movement prediction for a time-series) We have a
historical data set containing the values of a time-series r (1), . . . , r ( T ).
Our goal is to predict if the time-series is going up or down. The ba-
sic idea is to use a prediction based on the sign of the output of an
auto-regressive model that uses n past data values (here, n is fixed).
That is, the prediction at time t of the sign of the value r (t + 1) − r (t)
is of the form
ŷw,b (t) = sgn (w1 r (t) + · · · + wn r (t − n + 1) + b) ,
In the above, w ∈ R^n is our classifier coefficient, b is a bias term, and n ≪ T determines how far back into the past we use the data to make the prediction.
1. As a first attempt, we would like to solve the problem

  min_{w,b} ∑_{t=n}^{T−1} ( ŷ_{w,b}(t) − y(t) )²,
where y(t) = sgn(r (t + 1) − r (t)). In other words, we are trying to
match, in a least-squares sense, the prediction made by the classi-
fier on the training set, with the observed truth. Can we solve the
above with convex optimization? If not, why?
2. Explain how you would set up the problem and train a classi-
fier using convex optimization. Make sure to define precisely
the learning procedure, the variables in the resulting optimization
problem, and how you would find the optimal variables to make
a prediction.
Exercise 13.4 (A variant of PCA) Return to the variant of PCA ex-
amined in Exercise 11.2. Using a (possibly synthetic) data set of your
choice, compare the classical PCA and the variant examined here, es-
pecially in terms of its sensitivity to outliers. Make sure to establish
an evaluation protocol that is as rigorous as possible. Discuss your
results.
Exercise 13.5 (Squared vs. non-squared penalties) We consider the problems

  P(λ):  p(λ) := min_x f(x) + λ‖x‖,
  Q(µ):  q(µ) := min_x f(x) + (µ/2)‖x‖²,
where f is a convex function, k · k is an arbitrary vector norm, and
λ > 0, µ > 0 are parameters. Assume that for every choice of these
parameters, the corresponding problems have a unique solution.
In general, the solutions for the above problems for fixed λ and µ
do not coincide. This exercise shows that we can scan the solutions
to the first problem, and get the set of solutions to the second, and
vice versa.
1. Show that both p, q are concave functions, and q̃ with values q̃(µ) =
q(1/µ) is convex, on the domain R+ .
2. Show that

  p(λ) = min_{µ>0} ( q(µ) + λ²/(2µ) ),   q(µ) = max_{λ>0} ( p(λ) − λ²/(2µ) ).
For the second expression, you may assume that dom f has a
nonempty interior.
3. Deduce from the first part that the paths of solutions coincide.
That is, if we solve the first problem for every λ > 0, for any
µ > 0 the optimal point we thus find will be optimal for the second
problem; and vice versa. It will be convenient to denote by x*(λ) (resp. z*(µ)) the (unique) solution to P(λ) (resp. Q(µ)).
4. State and prove a similar result concerning a third function,

  r(κ) := min_x f(x) : ‖x‖ ≤ κ.
5. What can you say if we remove the uniqueness assumption?
Exercise 13.6 (Cardinality-penalized least squares) We consider the problem

  φ(λ) := min_w ‖Xᵀw − y‖₂² + ρ²‖w‖₂² + λ card(w),
where X ∈ Rn,m , y ∈ Rm , ρ > 0 is a regularization parameter, and
λ ≥ 0 allows us to control the cardinality (number of nonzeros) in
the solution. This in turn allows better interpretability of the results.
The above problem is hard to solve in general. In this exercise, we
denote by ai> , i = 1, . . . , n the i-th row of X, which corresponds to a
particular “feature” (that is, dimension of the variable w).
1. First assume that no cardinality penalty is present, that is, λ = 0. Show that

  φ(0) = yᵀ ( I + (1/ρ²) ∑_{i=1}^n a_i a_iᵀ )⁻¹ y.
2. Now consider the case λ > 0. Show that

  φ(λ) = min_{u∈{0,1}^n}  yᵀ ( I_m + (1/ρ²) ∑_{i=1}^n u_i a_i a_iᵀ )⁻¹ y + λ ∑_{i=1}^n u_i.
3. A natural relaxation to the problem obtains upon replacing the constraints u ∈ {0, 1}^n with interval ones: u ∈ [0, 1]^n. Show that the resulting lower bound φ_lb(λ) ≤ φ(λ) is the optimal value of the convex problem

  φ_lb(λ) = max_v  2yᵀv − vᵀv − ∑_{i=1}^n ( (a_iᵀv)²/ρ² − λ )₊.

How would you recover a suboptimal sparsity pattern from a solution v* to the above problem?
4. Express the above problem as an SOCP.
5. Form a dual to the SOCP, and show that it can be reduced to the expression

  φ_lb(λ) = min_w ‖Xᵀw − y‖₂² + 2λ ∑_{i=1}^n B( ρ w_i / √λ ),

where B is the (convex) reverse Huber function: for ξ ∈ R,

  B(ξ) := (1/2) min_{0≤z≤1} ( z + ξ²/z ) = |ξ| if |ξ| ≤ 1, and (ξ² + 1)/2 otherwise.

Again, how would you recover a suboptimal sparsity pattern from a solution w* to the above problem?
6. A classical way to handle cardinality penalties is to replace them
with the `1 -norm. How does the above approach compare with
the `1 -norm relaxation one? Discuss.
14. Computational Finance
Exercise 14.1 (Diversification) You have $12,000 to invest at the be-
ginning of the year, and three different funds from which to choose.
The municipal bond fund has a 7% yearly return, the local bank’s
Certificates of Deposit (CDs) have an 8% return, and a high-risk ac-
count has an expected (hoped-for) 12% return. To minimize risk, you
decide not to invest any more than $2,000 in the high-risk account.
For tax reasons, you need to invest at least three times as much in
the municipal bonds as in the bank CDs. Denote by x, y, z be the
amounts (in thousands) invested in bonds, CDs, and high-risk ac-
count, respectively. Assuming the year-end yields are as expected,
what are the optimal investment amounts for each fund?
Exercise 14.2 (Portfolio optimization problems) We consider a sin-
gle-period optimization problem involving n assets, and a decision
vector x ∈ Rn which contains our position in each asset. Determine
which of the following objectives or constraints can be modeled using
convex optimization.
1. The level of risk (measured by portfolio variance) is equal to a
given target t (the covariance matrix is assumed to be known).
2. The level of risk (measured by portfolio variance) is below a given
target t.
3. The Sharpe ratio (defined as the ratio of portfolio return to port-
folio standard deviation) is above a target t ≥ 0. Here both the
expected return vector and the covariance matrix are assumed to
be known.
4. Assuming that the return vector follows a known Gaussian dis-
tribution, ensure that the probability of the portfolio return being
less than a target t is less than 3%.
5. Assume that the return vector r ∈ Rn can take three values r (i) ,
i = 1, 2, 3. Enforce the following constraint: the smallest portfolio
return under the three scenarios is above a target level t.
6. Under similar assumptions as in part 5: the average of the small-
est two portfolio returns is above a target level t. Hint: use new
variables si = x > r (i) , i = 1, 2, 3, and consider the function s →
s[2] + s[3] , where for k = 1, 2, 3, s[k] denotes the k-th largest element
in s.
7. The transaction cost (under a linear transaction cost model, and
with initial position xinit = 0) is below a certain target.
8. The number of transactions from the initial position xinit = 0 to
the optimal position x is below a certain target.
9. The absolute value of the difference between the expected portfo-
lio return and a target return t is less than a given small number e
(here, the expected return vector r̂ is assumed to be known).
10. The expected portfolio return is either above a certain value tup ,
or below another value tlow .
Exercise 14.3 (Median risk) We consider a single-period portfolio op-
timization problem with n assets. We use past samples, consisting of
single-period return vectors r1 , . . . , r N , where rt ∈ Rn contains the
returns of the assets from period t − 1 to period t. We denote by r̂ := (1/N)(r1 + · · · + r_N) the vector of sample averages; it is an estimate of the expected return, based on the past samples.
As a measure of risk, we use the following quantity. Denote by ρ_t(x) the return at time t (if we had held the position x at that time). Our risk measure is

  R1(x) := (1/N) ∑_{t=1}^N |ρ_t(x) − ρ̂(x)|,

where ρ̂(x) is the portfolio's sample average return.
1. Show that R1 ( x ) = k R> x k1 , with R an n × N matrix that you will
determine. Is the risk measure R1 convex?
2. Show how to minimize the risk measure R1 , subject to the con-
dition that the sample average of the portfolio return is greater
than a target µ, using linear programming. Make sure to put the
problem in standard form, and define precisely the variables and
constraints.
3. Comment on the qualitative difference between the resulting portfolio and one that would use the more classical, variance-based risk measure, given by

  R2(x) := (1/N) ∑_{t=1}^N (ρ_t(x) − ρ̂(x))².
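Both risk measures are straightforward to evaluate from the sample returns, which is useful when comparing the resulting portfolios. A short sketch (numpy; the synthetic return data is ours):

import numpy as np

def risk_measures(returns, x):
    # returns has shape (N, n); rows are the sampled single-period return vectors
    rho = returns @ x                      # per-period portfolio returns rho_t(x)
    dev = rho - rho.mean()
    return np.mean(np.abs(dev)), np.mean(dev ** 2)     # (R1, R2)

rng = np.random.default_rng(0)
returns = rng.normal(0.01, 0.05, size=(250, 4))
x = np.array([0.4, 0.3, 0.2, 0.1])
print(risk_measures(returns, x))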
Exercise 14.4 (Portfolio optimization with factor models – 1)
1. Consider the following portfolio optimization problem:

  p* = min_x xᵀΣx
       s.t.: r̂ᵀx ≥ µ,

where r̂ ∈ R^n is the expected return vector, Σ ∈ S^n, Σ ⪰ 0, is the return covariance matrix, and µ is a target level of expected portfolio return. Assume that the random return vector r follows a simplified factor model of the form

  r = F(f + f̂),   r̂ = F f̂,

where F ∈ R^{n,k}, k ≪ n, is a factor loading matrix, f̂ ∈ R^k is given, and f ∈ R^k is such that E{f} = 0 and E{f fᵀ} = I. The above
optimization problem is a convex quadratic problem that involves
n decision variables. Explain how to cast this problem into an
equivalent form that involves only k decision variables. Interpret
the reduced problem geometrically. Find a closed-form solution to
the problem.
2. Consider the following variation on the previous problem:

  p* = min_x xᵀΣx − γ r̂ᵀx
       s.t.: x ≥ 0,
where γ > 0 is a tradeoff parameter that weights the relevance in
the objective of the risk term and of the return term. Due to the
presence of the constraint x ≥ 0, this problem does not admit, in
general, a closed-form solution.
Assume that r is specified according to a factor model of the form
r = F ( f + fˆ) + e,
where F, f, and f̂ are as in the previous point, and e is an idiosyncratic noise term, which is uncorrelated with f (i.e., E{f eᵀ} = 0) and such that E{e} = 0 and E{eeᵀ} = D² := diag(d1², . . . , d_n²) ≻ 0.
Suppose we wish to solve the problem using a logarithmic barrier
method of the type discussed in Section 12.3.1. Explain how to ex-
ploit the factor structure of the returns to improve the numerical
performance of the algorithm. Hint: with the addition of suitable
slack variables, the Hessian of the objective (plus barrier) can be
made diagonal.
Exercise 14.5 (Portfolio optimization with factor models – 2) Consider again the problem and setup of point 2 of Exercise 14.4. Let z := Fᵀx, and verify that the problem can be rewritten as

  p* = min_{x≥0, z} xᵀD²x + zᵀz − γ r̂ᵀx
       s.t.: Fᵀx = z.
Consider the Lagrangian

  L(x, z, λ) = xᵀD²x + zᵀz − γ r̂ᵀx + λᵀ(z − Fᵀx),

and the dual function

  g(λ) := min_{x≥0, z} L(x, z, λ).
Strong duality holds, since the primal problem is convex and strictly
feasible, thus p∗ = d∗ = maxλ g(λ).
1. Find a closed-form expression for the dual function g(λ).
2. Express the primal optimal solution x ∗ in terms of the dual opti-
mal variable λ∗ .
3. Determine a subgradient of − g(λ).
Exercise 14.6 (Kelly’s betting strategy) A gambler has a starting
capital W0 and repeatedly bets his whole available capital on a game
where with probability p ∈ [0, 1] he wins the stake, and with prob-
ability 1 − p he loses it. His wealth Wk after k bets is a random
variable:

  W_k = 2^k W0   with probability p^k,
        0        with probability 1 − p^k.
1. Determine the expected wealth of the gambler after k bets. Determine the probability with which the gambler eventually goes broke at some k.
2. The results of the previous point should have convinced you that the strategy described above is a ruinous one. Suppose now
that the gambler gets more cautious, and decides to bet, at each
step, only a fraction x of his capital. Denoting by w and ℓ the (random) numbers of times the gambler wins and loses a bet, respectively, we have that his wealth at time k is given by

  W_k = (1 + x)^w (1 − x)^ℓ W0,
where x ∈ [0, 1] is the betting fraction, and w + ℓ = k. Define the exponential rate of growth of the gambler's capital as

  G = lim_{k→∞} (1/k) log₂(W_k / W0).
(a) Determine an expression for the exponential rate of growth G
as a function of x. Is this function concave?
(b) Find the value of x ∈ [0, 1] that maximizes the exponential rate of growth G. Betting according to this optimal fraction is known as Kelly's optimal gambling strategy.29

29 After J. L. Kelly, who introduced it in 1956.
3. Consider a more general situation, in which an investor can in-
vest a fraction of his capital on an investment opportunity that
may have different payoffs, with different probabilities. Specifi-
cally, if W0 x dollars are invested, then the wealth after the out-
come of the investment is W = (1 + rx )W0 , where r denotes the
return of the investment, which is assumed to be a discrete ran-
dom variable taking values r1 , . . . , rm with respective probabilities
p1 , . . . , pm (pi ≥ 0, ri ≥ −1, for i = 1, . . . , m, and ∑i pi = 1).
The exponential rate of growth G introduced in point 2 of this
exercise is nothing but the expected value of the log-gain of the
investment, that is
G = E{log(W/W0 )} = E{log(1 + rx )}.
The particular case considered in point 2 corresponds to taking
m = 2 (two possible investment outcomes), with r1 = 1, r2 = −1,
p1 = p, p2 = 1 − p.
(a) Find an explicit expression for G as a function of x ∈ [0, 1].
(b) Devise a simple computational scheme for finding the optimal
investment fraction x that maximizes G.
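For part 3(b), the scalar function G can be maximized numerically over [0, 1]; a sketch under an assumed toy distribution of returns (scipy/numpy; the numbers r, p below are illustrative only, not from the text):

import numpy as np
from scipy.optimize import minimize_scalar

r = np.array([1.0, -0.5, 0.1])                # possible returns (illustrative)
p = np.array([0.4, 0.3, 0.3])                 # their probabilities
G = lambda x: np.sum(p * np.log(1 + r * x))   # expected log-gain
res = minimize_scalar(lambda x: -G(x), bounds=(0.0, 1.0), method="bounded")
print(res.x, G(res.x))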
Exercise 14.7 (Multi-period investments) We consider a multi-stage, single-asset investment decision problem over n periods. For any given time period i = 1, . . . , n, we denote by y_i the predicted return, σ_i the associated variance, and u_i the dollar position invested. Assuming our initial position is u0 = w, the investment problem is

  φ(w) := max_u ∑_{i=1}^{n+1} ( y_i u_i − λ σ_i² u_i² − c|u_i − u_{i−1}| ) : u0 = w, u_{n+1} = 0,
where the first term represents profit, the second, risk, and the third,
approximate transaction costs. Here, c > 0 is the unit transaction
cost and λ > 0 a risk-return trade-off parameter. (We assume λ = 1
without loss of generality.)
1. Find a dual for this problem.
2. Show that φ is concave, and find a subgradient of −φ at w. If φ is
differentiable at w, what is its gradient at w?
3. What is the sensitivity of φ with respect to the initial position w? Precisely, provide a tight upper bound on |φ(w + ε) − φ(w)| for arbitrary ε > 0, and with y, σ, c fixed. You may assume φ is differentiable at every point of [w, w + ε].
Exercise 14.8 (Personal finance problem) Consider the following
personal finance problem. You are to be paid for a consulting job,
for a total of C = $30, 000, over the next six months. You plan to
use this payment to cover some past credit card debt, which amounts
to D = $7000. The credit card’s APR (annual interest rate) is r1 =
15.95%. You have the following items to consider:
• At the beginning of each month, you can transfer any portion of
the credit card debt to another card with a lower APR of r2 = 2.9%.
This transaction costs r3 = 0.2% of the total amount transferred.
You cannot borrow any more from either credit card; only transfer of debt from card 1 to card 2 is allowed.
• The employer allows you to choose the schedule of payments: you
can distribute the payments over a maximum of six months. For
liquidity reasons, the employer limits any month’s pay to 4/3 ×
(C/6).
• You are paid a base salary of B = $70, 000 per annum. You can-
not use the base salary to pay off the credit card debt; however it
affects how much tax you pay (see next).
• The first three months are the last three months of the current
fiscal year and the last three months are the first three months
of the next fiscal year. So if you choose to be paid a lot in the
current fiscal year (first three months of consulting), the tax costs
are high; they are lower if you choose to distribute the payments
over several periods. The precise tax due depends on your gross
annual total income G, which is your base salary, plus any extra
income. The marginal tax rate schedule is given in Table 14.5.
• The risk-free rate (interest rate from savings) is zero.
• Time line of events: all events occur at the beginning of each
month, i.e. at the beginning of each month, you are paid the cho-
sen amount, and immediately you decide how much of each credit
card to pay off, and transfer any debt from card 1 to card 2. Any
outstanding debt accumulates interest at the end of the current
month.
• Your objective is to maximize the total wealth at the end of the two
fiscal years whilst paying off all credit card debt.
Table 14.5: Marginal tax rate schedule.

  Total gross income G     Marginal tax rate   Total tax
  $0 ≤ G ≤ $80,000         10%                 10% × G
  G ≥ $80,000              28%                 28% × (G − $80,000) plus $8,000 (= 10% × $80,000)
1. Formulate the decision-making problem as an optimization prob-
lem. Make sure to define the variables and constraints precisely.
To describe the tax, use the following constraint:
Ti = 0.1 min( Gi , α) + 0.28 max( Gi − α, 0), (14.22)
where Ti is the total tax paid, Gi is the total gross income in years
i = 1, 2 and α = 80, 000 is the tax threshold parameter.
2. Is the problem a linear program? Explain.
3. Under what conditions on α and Gi can the tax constraint (14.22)
be replaced by the following set of constraints? Is it the case for
our problem? Can you replace (14.22) by (14.23) in your problem?
Explain.
Ti = 0.1d1,i + 0.28d2,i , (14.23)
d2,i ≥ Gi − α,
d2,i ≥ 0,
d1,i ≥ Gi − d2,i ,
d1,i ≥ d2,i − α.
4. Is the new problem formulation, with (14.23), convex? Justify your
answer.
5. Solve the problem using your favorite solver. Write down the opti-
mal schedules for receiving payments and paying off/transferring
credit card debt, and the optimal total wealth at the end of two
years. What is your total wealth W?
6. Compute an optimal W for α ∈ [70k, 90k ] and plot α vs. W in this
range. Can you explain the plot?
Exercise 14.9 (Transaction costs and market impact) We consider the following portfolio optimization problem:

  max_x  r̂ᵀx − λ xᵀCx − c · T(x − x0) : x ≥ 0, x ∈ X,  (14.24)
where C is the empirical covariance matrix, λ > 0 is a risk parameter,
and r̂ is the time-average return for each asset for the given period.
Here, the constraint set X is determined by the following conditions.
• No shorting is allowed.
• There is a budget constraint x1 + · · · + xn = 1.
In the above, the function T represents transaction costs and market
impact, c ≥ 0 is a parameter that controls the size of these costs,
while x0 ∈ Rn is the vector of initial positions. The function T has
the form

  T(x) = ∑_{i=1}^n B_M(x_i),
where the function B M is piece-wise linear for small x, and quadratic
for large x; that way we seek to capture the fact that transaction
costs are dominant for smaller trades, while market impact kicks in
for larger ones. Precisely, we define B M to be the so-called “reverse
Hüber” function with cut-off parameter M: for a scalar z, the func-
tion value is
|z| if |z| ≤ M,
.
B M (z) = z 2 + M2
otherwise.
2M
The scalar M > 0 describes where the transition from a linearly
shaped to a quadratically shaped penalty takes place.
1. Show that B_M can be expressed as the solution to an optimization problem:

  B_M(z) = min_{v,w}  v + w + w²/(2M) : |z| ≤ v + w, v ≤ M, w ≥ 0.

Explain why the above representation proves that B_M is convex.
2. Show that, for given x ∈ R^n:

  T(x) = min_{v,w}  1ᵀ(v + w) + (1/(2M)) wᵀw : v ≤ M1, w ≥ 0, |x − x0| ≤ v + w,

where, in the above, v, w are now n-dimensional vector variables, 1 is the vector of ones, and the inequalities are component-wise.
3. Formulate the optimization problem (14.24) in convex format. Does
the problem fall into one of the categories (LP, QP, SOCP, etc.) seen
in Chapter 8?
4. Draw the efficient frontier of the portfolio corresponding to M =
0.01, 0.05, 0.1, 1, 5, with c = 5 × 10−4 . Comment on the qualitative
differences between the optimal portfolio for two different values
of M = 0.01, 1.
Exercise 14.10 (Optimal portfolio execution) This exercise deals with an optimal portfolio execution problem, where we seek to optimally liquidate a portfolio given as a list of n asset names and the initial number of shares in each asset. The problem is stated over a given time
horizon T, and shares are to be traded at fixed times t = 1, . . . , T.
In practice, the dimension of the problem may range from n = 20 to
n = 6000.
The initial list of shares is given by a vector x0 ∈ R^n, and the final target is to liquidate our portfolio. The initial position is given by a price vector p ∈ R^n, and a vector s that gives the side of each asset (1 to indicate long, −1 to indicate short). We denote by w = p ◦ s the so-called price weight vector, where ◦ denotes the component-wise product.30

30 For two n-vectors u, v, the notation u ◦ v denotes the vector with components u_i v_i, i = 1, . . . , n.

Our decision variable is the execution schedule, an n × T matrix X, with X_it the amount of shares (in hundreds, say) of asset i to be sold
at time t. We will not account for discretization effects and treat X
as a real-valued matrix. For t = 1, . . . , T, we denote by xt ∈ Rn the
t-th column of X; x_t encapsulates all the trading that takes place at period t.
In our problem, X is constrained via upper and lower bounds:
we express this as X l ≤ X ≤ X u , where inequalities are understood
component-wise, and X l , X u are given n × T matrices (for example, a
no short selling condition is enforced with X l = 0). These upper and
lower bounds can be used to make sure we attain our target at time
t = T: we simply assume that the last columns of X l , X u are both
equal to the target vector, which is zero in the case we seek to fully
liquidate the portfolio.
We may have additional linear equality or inequality constraints.
For example, we may enforce upper and lower bounds on the trading:

  0 ≤ y_t := x_{t−1} − x_t ≤ y_t^u,  t = 1, . . . , T,
where Y = [y1 , . . . , y T ] ∈ Rn,T will be referred to as the trading ma-
trix, and Y u = [y1u , . . . , yuT ] is a given (non-negative) n × T matrix that
bounds the elements of Y from above. The lower bound ensures that
trading decreases over time; the second constraint can be used to
enforce a maximum participation rate, as specified by the user.
We will denote by X ⊆ Rn,T our feasible set, that is, the set of
n × T matrices X = [ x1 , . . . , x T ] that satisfy the constraints above,
including the upper and lower bounds on X.
We also want to enforce a dollar-neutral strategy at each time step. This requires having the same dollar position both long and short. This can be expressed with the conditions wᵀx_t = 0, t = 1, . . . , T, where w = p ◦ s ∈ R^n contains the price weight of each asset. We can write the dollar-neutral constraint compactly as Xᵀw = 0.
Our objective function involves three terms, referred to as impact,
risk, and alpha respectively. The impact function is modeled as
  I(X) = ∑_{t=1}^T ∑_{i=1}^n V_ti (X_ti − X_{t−1,i})²,
where V = [v1, . . . , v_T] is an n × T matrix of non-negative numbers that
model the impact of transactions (the matrix V has to be estimated
with historical data, but we consider it to be fully known here). In
the above, the n-vector of initial conditions x0 = ( X0,i )1≤i≤n is given.
The risk function has the form

  R(X) = ∑_{t=1}^T (w ◦ x_t)ᵀ Σ (w ◦ x_t),
where ◦ is the component-wise product, w = p ◦ s is the price weight vector, and Σ is a positive semidefinite matrix that describes the daily market risk. In this problem, we assume that Σ has a "diagonal-plus-low-rank" structure, corresponding to a factor model. Specifically, Σ = D² + FFᵀ, where D is an n × n diagonal positive definite matrix, and F is an n × k "factor loading" matrix, with k ≈ 10–100 the number of factors in the model (typically, k ≪ n). We can write the risk function as
  R(X) = ∑_{t=1}^T x_tᵀ (D_w² + F_w F_wᵀ) x_t,

where D_w := diag(w) D is diagonal, positive definite, and F_w := diag(w) F.
Finally, the alpha function accounts for views on the asset returns themselves, and is a linear function of X, which we write as

  C(X) = ∑_{t=1}^T c_tᵀ x_t,
where C = [c1 , . . . , c T ] ∈ Rn,T is a given matrix that depends on
α ∈ Rn , which contains our return predictions for the day. Precisely,
ct = αt ◦ p, where p ∈ Rn is the price vector, and αt is a vector of
predicted returns.
1. Summarize the problem data, and their sizes.
2. Write the portfolio execution problem as a QP. Make sure to define
precisely the variables, objective and constraints.
3. Explain how to take advantage of the factor model to speed up
computation. Hint: look at Exercise 12.9.
15. Control Problems
Exercise 15.1 (Stability and eigenvalues) Prove that the continuous-
time LTI system (15.20) is asymptotically stable (or stable, for short)
if and only if all the eigenvalues of the A matrix, λi ( A), i = 1, . . . , n,
have (strictly) negative real parts.
Prove that the discrete-time LTI system (15.28) is stable if and only
if all the eigenvalues of the A matrix, λi ( A), i = 1, . . . , n, have moduli
(strictly) smaller than one.
Hint: use the expression x (t) = e At x0 for the free response of
the continuous-time system, and the expression x (k ) = Ak x0 for the
free response of the discrete-time system. You may derive your proof
under the assumption that A is diagonalizable.
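A quick numerical sanity check of these statements, not a substitute for the proof (the matrices below are arbitrary examples):

```python
import numpy as np
from scipy.linalg import expm

# Continuous time: eigenvalues with strictly negative real parts.
A = np.array([[-1.0, 2.0],
              [0.0, -0.5]])
x0 = np.array([1.0, 1.0])
print("eigenvalues of A:", np.linalg.eigvals(A))
for t in [0.0, 1.0, 5.0, 20.0]:
    xt = expm(A * t) @ x0               # free response x(t) = e^{At} x0
    print(f"t = {t:5.1f}  ||x(t)|| = {np.linalg.norm(xt):.2e}")

# Discrete time: eigenvalue moduli strictly smaller than one.
Ad = np.array([[0.5, 0.3],
               [0.0, 0.8]])
print("moduli of eig(Ad):", np.abs(np.linalg.eigvals(Ad)))
print("||Ad^50 x0|| =", np.linalg.norm(np.linalg.matrix_power(Ad, 50) @ x0))
```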
Exercise 15.2 (Signal norms) A continuous-time signal w(t) is a func-
tion mapping time t ∈ R to values w(t) in either Cm or Rm . The
energy content of a signal w(t) is defined as
E(w) := ‖w‖₂² = ∫_{−∞}^{∞} ‖w(t)‖₂² dt,
where ‖w‖₂ is the 2-norm of the signal. The class of finite-energy signals contains signals for which the above 2-norm is finite.
Periodic signals typically have infinite energy. For a signal with
period T, we define its power content as
P(w) := (1/T) ∫_{t₀}^{t₀+T} ‖w(t)‖₂² dt.
1. Evaluate the energy of the harmonic signal w(t) = v e^{jωt}, v ∈ R^m, and of the causal exponential signal w(t) = v e^{at}, for a < 0, t ≥ 0 (w(t) = 0 for t < 0).
2. Evaluate the power of the harmonic signal w(t) = v e^{jωt} and of the sinusoidal signal w(t) = v sin(ωt).
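As an informal numerical check for the causal exponential case, one can compare a quadrature of the energy integral with the value ‖v‖₂²/(2|a|) that direct integration suggests (deriving that value is part of the exercise):

```python
import numpy as np
from scipy.integrate import quad

v = np.array([1.0, 2.0])
a = -0.7

# Energy of w(t) = v e^{a t} for t >= 0 (and w(t) = 0 for t < 0).
integrand = lambda t: np.linalg.norm(v) ** 2 * np.exp(2 * a * t)
E_num, _ = quad(integrand, 0, np.inf)
E_closed = np.linalg.norm(v) ** 2 / (2 * abs(a))
print(E_num, E_closed)   # the two values should agree
```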
Exercise 15.3 (Energy upper bound on the system’s state evolution)
Consider a continuous-time LTI system ẋ (t) = Ax (t), t ≥ 0, with no
input (such a system is said to be autonomous), and output y(t) = Cx(t).
We wish to evaluate the energy contained in the system’s output, as
measured by the index
J(x_0) := ∫_0^∞ y(t)^T y(t) dt = ∫_0^∞ x(t)^T Q x(t) dt,
where Q := C^T C ⪰ 0.
1. Show that if the system is stable, then J ( x0 ) < ∞, for any given
x0 .
2. Show that if the system is stable and there exists a matrix P ⪰ 0 such that
A^T P + PA + Q ⪯ 0,
then it holds that J(x_0) ≤ x_0^T P x_0. Hint: consider the quadratic form V(x(t)) = x(t)^T P x(t), and evaluate its derivative with respect to time.
3. Explain how to compute a minimal upper bound on the state en-
ergy, for the given initial conditions.
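A small numerical experiment, not a proof, may help build intuition here: solve the Lyapunov equation A^T P + PA + Q = 0 (the equality case of the condition above) with SciPy and compare x_0^T P x_0 with a numerically integrated value of J(x_0). The data below are arbitrary.

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov
from scipy.integrate import quad

A = np.array([[-1.0, 1.0],
              [0.0, -2.0]])
C = np.array([[1.0, 0.5]])
Q = C.T @ C
x0 = np.array([1.0, -1.0])

# Solve A^T P + P A = -Q (Lyapunov equation).
P = solve_continuous_lyapunov(A.T, -Q)

def integrand(t):
    xt = expm(A * t) @ x0               # free response of the autonomous system
    return xt @ Q @ xt

J_num, _ = quad(integrand, 0, 60)       # the tail beyond t = 60 is negligible here
print("J(x0) ≈", J_num, "   x0^T P x0 =", x0 @ P @ x0)
```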
Exercise 15.4 (System gain) The gain of a system is the maximum
energy amplification from the input signal to output. Any input sig-
nal u(t) having finite energy is mapped by a stable system to an
output signal y(t) which also has finite energy. Parseval’s identity
relates the energy of a signal w(t) in the time domain to the energy
of the same signal in the Fourier domain (see Remark 15.1), that is
E(w) := ‖w‖₂² = ∫_{−∞}^{∞} ‖w(t)‖₂² dt = (1/2π) ∫_{−∞}^{∞} ‖Ŵ(ω)‖₂² dω =: ‖Ŵ‖₂².
The energy gain of system (15.26) is defined as
energy gain := sup_{u: ‖u‖₂ < ∞, u ≠ 0} ‖y‖₂² / ‖u‖₂².
1. Using the above information, prove that, for a stable system,
energy gain ≤ sup_{ω ≥ 0} ‖H(jω)‖₂²,
where ‖H(jω)‖₂ is the spectral norm of the transfer matrix of system (15.26), evaluated at s = jω. The (square-root of the) energy
gain of the system is also known as the H∞ -norm, and it is de-
noted by ‖H‖∞.
Hint: use Parseval’s identity and then suitably bound a certain in-
tegral. Notice that equality actually holds in the previous formula,
but you are not asked to prove this.
2. Assume that system (15.26) is stable, x (0) = 0, and D = 0. Prove
that if there exists P ⪰ 0 such that
[ A^T P + PA + C^T C    PB    ]
[ B^T P                −γ² I  ]  ⪯ 0,                    (15.25)
then it holds that
‖H‖∞ ≤ γ.
Devise a computational scheme that provides you with the lowest
possible upper bound γ∗ on the energy gain of the system.
Hint: define a quadratic function V(x) = x^T P x, and observe that the derivative in time of V, along the trajectories of system (15.26), is
dV(x)/dt = x^T P ẋ + ẋ^T P x.
Then show that the LMI condition (15.25) is equivalent to the con-
dition that
dV(x)/dt + ‖y‖₂² − γ²‖u‖₂² ≤ 0,   ∀ x, u satisfying (15.26),
and that this implies in turn that ‖H‖∞ ≤ γ.
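For the last part, the following cvxpy sketch illustrates one possible computational scheme: since γ² enters (15.25) linearly, the smallest certified bound can be obtained from a single SDP, without bisection. The system matrices below are placeholders.

```python
import numpy as np
import cvxpy as cp

# Placeholder stable system with D = 0 (as assumed in the exercise).
A = np.array([[-1.0, 1.0],
              [0.0, -2.0]])
B = np.array([[1.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
n, m = A.shape[0], B.shape[1]

P = cp.Variable((n, n), symmetric=True)
gamma2 = cp.Variable(nonneg=True)        # stands for γ²

lmi = cp.bmat([[A.T @ P + P @ A + C.T @ C, P @ B],
               [B.T @ P, -gamma2 * np.eye(m)]])
constraints = [P >> 0, lmi << 0]

prob = cp.Problem(cp.Minimize(gamma2), constraints)
prob.solve(solver=cp.SCS)
print("certified bound on ||H||_inf:", np.sqrt(gamma2.value))
```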
Exercise 15.5 (Extended superstable matrices) A matrix A ∈ Rn,n is
said to be continuous-time extended superstable³¹ (which we denote by A ∈ E_c) if there exists d ∈ R^n such that
∑_{j≠i} |a_ij| d_j < −a_ii d_i,   d_i > 0,   i = 1, . . . , n.
³¹ See B. T. Polyak, Extended superstability in control theory, Automation and Remote Control, 2004.
Similarly, a matrix A ∈ Rn,n is said to be discrete-time extended
superstable (which we denote by A ∈ Ed ) if there exists d ∈ Rn such
that
∑_{j=1}^{n} |a_ij| d_j < d_i,   d_i > 0,   i = 1, . . . , n.
If A ∈ Ec , then all its eigenvalues have real parts smaller than zero,
hence the corresponding continuous-time LTI system ẋ = Ax is sta-
ble. Similarly, if A ∈ Ed , then all its eigenvalues have moduli smaller
than one, hence the corresponding discrete-time LTI system x (k +
1) = Ax (k ) is stable. Extended superstability thus provides a suffi-
cient condition for stability, which has the advantage of being check-
able via feasibility of a set of linear inequalities.
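The basic feasibility check behind this definition is a linear program. A small cvxpy sketch for the continuous-time case follows; the margin value and the normalization d ≥ 1 are implementation choices, not part of the definition.

```python
import numpy as np
import cvxpy as cp

def is_extended_superstable_ct(A, margin=1e-6):
    """Look for d > 0 with sum_{j != i} |a_ij| d_j < -a_ii d_i for all i."""
    n = A.shape[0]
    d = cp.Variable(n)
    absA = np.abs(A)
    constraints = [d >= 1]    # the condition is homogeneous in d, so normalize
    for i in range(n):
        off_diag = sum(absA[i, j] * d[j] for j in range(n) if j != i)
        constraints.append(off_diag + A[i, i] * d[i] <= -margin)
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve()
    return prob.status == cp.OPTIMAL, d.value

A = np.array([[-3.0, 1.0, 0.5],
              [0.2, -2.0, 0.3],
              [0.1, 0.4, -1.0]])
ok, d = is_extended_superstable_ct(A)
print("extended superstable:", ok, "  d =", d)
```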
1. Given a continuous-time system ẋ = Ax + Bu, with x ∈ Rn ,
u ∈ Rm , describe your approach for efficiently designing a state-
feedback control law of the form u = −Kx, such that the controlled
system is extended superstable.
2. Given a discrete-time system x (k + 1) = Ax (k ) + Bu(k), assume
that matrix A is affected by interval uncertainty, that is
aij = âij + δij , i, j = 1, . . . , n,
where âij is the given nominal entry, and δij is an uncertainty term,
which is only known to be bounded in amplitude as |δij | ≤ ρrij , for
given rij ≥ 0. Define the radius of extended superstability as the
largest value ρ∗ of ρ ≥ 0 such that A is extended superstable for all
the admissible uncertainties. Describe a computational approach
for determining such a ρ∗ .
16. Engineering Design
Exercise 16.1 (Network congestion control) A network of n = 6
peer-to-peer computers is shown in Figure 16.9. Each computer can
upload or download data at a certain rate on the connection links
shown in the figure. Let b^+ ∈ R^8 be the vector containing the packet transmission rates on the links numbered in the figure, and let b^− ∈ R^8 be the vector containing the packet transmission rates on the reverse links, where it must hold that b^+ ≥ 0 and b^− ≥ 0.
Define an arc–node incidence matrix for this network:
     [  1   0   1   1   0   0   0   0 ]
     [ −1   1   0   0   0   0   0   0 ]
A := [  0   0   0  −1   1   0   0   0 ]
     [  0  −1  −1   0   0  −1  −1   0 ]
     [  0   0   0   0  −1   1   0   1 ]
     [  0   0   0   0   0   0   1  −1 ]

and let A^+ := max(A, 0) (the positive part of A), A^− := min(A, 0) (the
negative part of A). Then the total output (upload) rate at the nodes
is given by v_upl = A^+ b^+ − A^− b^−, and the total input (download) rate at the nodes is given by v_dwl = A^+ b^− − A^− b^+. The net outflow at
nodes is hence given by
v_net = v_upl − v_dwl = A b^+ − A b^−,
and the flow balance equations require that [v_net]_i = f_i, where f_i = 0 if computer i is not generating or sinking packets (it just passes on the received packets, i.e., it is acting as a relay station), f_i > 0 if computer i is generating packets, or f_i < 0 if it is sinking packets at an assigned rate f_i.
Figure 16.9: A small network.
Each computer can download data at a maximum rate of v̄dwl =
20 Mbit/s and upload data at a maximum rate of v̄upl = 10 Mbit/s
(these limits refer to the total download or upload rates of a com-
puter, through all its connections). The level of congestion of each
connection is defined as
c_j = max(0, b_j^+ + b_j^− − 4),   j = 1, . . . , 8.
Assume that node 1 must transmit packets to node 5 at a rate f 1 = 9
Mbit/s, and that node 2 must transmit packets to node 6 at a rate
f 2 = 8 Mbit/s. Find the rate on all links such that the average con-
gestion level of the network is minimized.
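A minimal cvxpy sketch of one possible formulation follows. The incidence matrix is the one displayed above; treating nodes 1 and 2 as sources (f_1 = 9, f_2 = 8), nodes 5 and 6 as the corresponding sinks, and nodes 3 and 4 as relays is our reading of the problem statement.

```python
import numpy as np
import cvxpy as cp

A = np.array([[ 1,  0,  1,  1,  0,  0,  0,  0],
              [-1,  1,  0,  0,  0,  0,  0,  0],
              [ 0,  0,  0, -1,  1,  0,  0,  0],
              [ 0, -1, -1,  0,  0, -1, -1,  0],
              [ 0,  0,  0,  0, -1,  1,  0,  1],
              [ 0,  0,  0,  0,  0,  0,  1, -1]], dtype=float)
Ap, Am = np.maximum(A, 0), np.minimum(A, 0)

f = np.array([9.0, 8.0, 0.0, 0.0, -9.0, -8.0])   # net generation/sinking per node

bp = cp.Variable(8, nonneg=True)      # rates b+ on the forward links
bm = cp.Variable(8, nonneg=True)      # rates b- on the reverse links

v_upl = Ap @ bp - Am @ bm             # total upload rate at each node
v_dwl = Ap @ bm - Am @ bp             # total download rate at each node
congestion = cp.pos(bp + bm - 4)      # c_j = max(0, b+_j + b-_j - 4)

constraints = [A @ bp - A @ bm == f,  # flow balance
               v_upl <= 10,           # upload cap, Mbit/s
               v_dwl <= 20]           # download cap, Mbit/s

prob = cp.Problem(cp.Minimize(cp.sum(congestion) / 8), constraints)
prob.solve()
print("average congestion:", prob.value)
print("b+ =", np.round(bp.value, 2))
print("b- =", np.round(bm.value, 2))
```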
Exercise 16.2 (Design of a water reservoir) We need to design a wa-
ter reservoir for water and energy storage, as depicted in Figure 16.10.
Figure 16.10: A water reservoir on a concrete basement.
The concrete basement has a square cross-section of side length
b1 and height h0 , while the reservoir itself has a square cross-section
of side length b2 and height h. Some useful data is reported in Ta-
ble 16.6.
Table 16.6: Data for reservoir problem.

Quantity   Value               Units    Description
g          9.8                 m/s²     gravity acceleration
E          30 × 10⁹            N/m²     basement longitudinal elasticity modulus
ρw         10 × 10³            N/m³     specific weight of water
ρb         25 × 10³            N/m³     specific weight of basement
J          b1⁴/12              m⁴       basement moment of inertia
Ncr        π² J E/(2h0)²       N        basement critical load limit
The critical load limit Ncr of the basement should be at least twice the weight of the stored water. The structural specification h0/b1² ≤ 35 should hold. The form factor of the reservoir should be such that 1 ≤ b2/h ≤ 2. The total height of the structure should be no larger than 30 m. The total weight of the structure (basement plus reservoir full of water) should not exceed 9.8 × 10⁵ N. The problem is to find the dimensions b1, b2, h0, h such that the potential energy Pw of the stored water is maximal (assume Pw = (ρw h b2²) h0). Explain if and
how the problem can be modeled as a convex optimization problem
and, in the positive case, find the optimal design.
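To experiment with whether a geometric-programming model captures these specifications, which is essentially what the exercise asks you to decide, one can try a DGP formulation in cvxpy along the following lines. The algebraic form of each constraint below is our own reading of the text and should be re-derived carefully before trusting the output.

```python
import numpy as np
import cvxpy as cp

E = 30e9        # N/m^2, elasticity modulus
rho_w = 10e3    # N/m^3, specific weight of water
rho_b = 25e3    # N/m^3, specific weight of basement

b1 = cp.Variable(pos=True)   # basement side length
b2 = cp.Variable(pos=True)   # reservoir side length
h0 = cp.Variable(pos=True)   # basement height
h = cp.Variable(pos=True)    # reservoir height

J = b1 ** 4 / 12                               # moment of inertia
Ncr = np.pi ** 2 * E / 4 * J * h0 ** -2        # critical load, pi^2 J E / (2 h0)^2
W_water = rho_w * h * b2 ** 2                  # weight of the stored water
W_total = rho_b * h0 * b1 ** 2 + W_water       # basement + full reservoir
Pw = W_water * h0                              # potential energy, as defined in the text

constraints = [2 * W_water <= Ncr,             # critical load at least twice water weight
               h0 <= 35 * b1 ** 2,             # structural specification h0 / b1^2 <= 35
               h <= b2, b2 <= 2 * h,           # form factor 1 <= b2 / h <= 2
               h0 + h <= 30,                   # total height
               W_total <= 9.8e5]               # total weight

prob = cp.Problem(cp.Maximize(Pw), constraints)
prob.solve(gp=True)
print("Pw* =", prob.value)
print("b1, b2, h0, h =", b1.value, b2.value, h0.value, h.value)
```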
Exercise 16.3 (Wire sizing in circuit design) Interconnects in modern
electronic chips can be modeled as conductive surface areas deposited on a substrate. A “wire” can thus be thought of as a sequence of rectangular segments, as shown in Figure 16.11. We assume that the lengths of these segments are fixed, while the widths x_i need to be sized according to the criteria explained next. A common approach is to model the wire as the cascade connection of RC stages, where, for each stage, S_i = 1/R_i and C_i are, respectively, the conductance and the capacitance of the i-th segment; see Figure 16.12.
Figure 16.11: A wire is represented as a sequence of rectangular surfaces on a substrate. Lengths ℓ_i are fixed, and the widths x_i of the segments are the decision variables. This example has three wire segments.
The values of Si , Ci are proportional to the surface area of the wire
segment, hence, since the lengths ℓ_i are assumed known and fixed,
they are affine functions of the widths, i.e.,
S_i = S_i(x_i) = σ_i^(0) + σ_i x_i,   C_i = C_i(x_i) = c_i^(0) + c_i x_i,
where σ_i^(0), σ_i, c_i^(0), c_i are given positive constants.
Figure 16.12: RC model of a three-segment wire.
For the three-segment
wire model illustrated in the figures, one can write the following set
of dynamic equations that describe the evolution in time of the node
voltages vi (t), i = 1, . . . , 3:
[ C1  C2  C3 ]            [  S1    0    0 ]          [ S1 ]
[  0  C2  C3 ] v̇(t) = −  [ −S2   S2    0 ] v(t) +   [  0 ] u(t).
[  0   0  C3 ]            [   0  −S3   S3 ]          [  0 ]
These equations are actually expressed in a more useful form if we
introduce a change of variables
v(t) = Q z(t),    Q = [ 1  0  0 ]
                      [ 1  1  0 ]
                      [ 1  1  1 ],
from which we obtain
C(x) ż(t) = −S(x) z(t) + [S1, 0, 0]^T u(t),
where
C(x) := [ C1+C2+C3   C2+C3   C3 ]
        [ C2+C3      C2+C3   C3 ] ,      S(x) := diag(S1, S2, S3).
        [ C3         C3      C3 ]
Clearly, C( x ), S( x ) are symmetric matrices whose entries depend
affinely on the decision variable x = ( x1 , x2 , x3 ). Further, one may
observe that C( x ) is nonsingular whenever x ≥ 0 (as is physically
the case in our problem), hence the evolution of z(t) is represented
by (we next assume u(t) = 0, i.e., we consider only the free-response
time evolution of the system)
ż(t) = −C(x)⁻¹ S(x) z(t).
The dominant time constant of the circuit is defined as
τ = 1 / λ_min(C(x)⁻¹ S(x)),
and it provides a measure of the “speed” of the circuit (the smaller
τ, the faster is the response of the circuit).
Describe a computationally efficient method for sizing the wire
so as to minimize the total area occupied by the wire, while guaran-
teeing that the dominant time constant does not exceed an assigned
level η > 0.
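A hedged sketch of one possible approach (the reformulation below should be verified as part of the exercise): since C(x) and S(x) are positive definite for x ≥ 0, the requirement τ ≤ η is equivalent to λ_min(C(x)⁻¹S(x)) ≥ 1/η, i.e., to the linear matrix inequality η S(x) ⪰ C(x), whose left- and right-hand sides are affine in x. Minimizing the total area ∑ ℓ_i x_i under this LMI is then an SDP; the cvxpy code below uses hypothetical segment data.

```python
import numpy as np
import cvxpy as cp

# Hypothetical segment data.
ell = np.array([1.0, 2.0, 1.5])                    # segment lengths
sigma0, sigma = np.full(3, 0.1), np.full(3, 1.0)   # S_i(x_i) = sigma0_i + sigma_i x_i
c0, c = np.full(3, 0.2), np.full(3, 0.1)           # C_i(x_i) = c0_i + c_i x_i
eta = 2.0                                          # required bound on the time constant

x = cp.Variable(3, nonneg=True)             # segment widths
S_vals = sigma0 + cp.multiply(sigma, x)     # conductances S_i
C_vals = c0 + cp.multiply(c, x)             # capacitances C_i

# Affine matrices C(x) and S(x) with the structure given in the text.
E1 = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0]])
E2 = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
E3 = np.ones((3, 3))
Cmat = C_vals[0] * E1 + C_vals[1] * E2 + C_vals[2] * E3
Smat = cp.diag(S_vals)

area = ell @ x                              # total wire area (lengths are fixed)
prob = cp.Problem(cp.Minimize(area), [eta * Smat >> Cmat])
prob.solve(solver=cp.SCS)
print("widths:", x.value, "  area:", prob.value)
```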