Nonlinear Optimization (18799 B, PP): IST-CMU PhD Course, Spring 2011
    [ x_1 ]
    [ x_2 ]
    [  ⋮  ]
    [ x_n ].
For vectors x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) in R^n, the compact notations x ≤ y, x ≥ y, x < y and x > y mean that the inequalities hold componentwise. For example:

    x ≤ y  ⇔  x_i ≤ y_i  for i = 1, 2, ..., n
    x > y  ⇔  x_i > y_i  for i = 1, 2, ..., n.
The sets of non-negative and strictly positive vectors are denoted by

    R^n_+  = {x ∈ R^n : x ≥ 0}
    R^n_++ = {x ∈ R^n : x > 0}.
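These componentwise conventions map directly onto elementwise array comparisons; a minimal numpy sketch (the vectors are made-up examples):

```python
import numpy as np

x = np.array([3.0, 2.0, 5.0])
y = np.array([1.0, 2.0, 4.0])

x_geq_y = bool(np.all(x >= y))   # x >= y holds componentwise
x_gt_y = bool(np.all(x > y))     # strict version fails: x_2 == y_2

in_nonneg_orthant = bool(np.all(x >= 0))  # x in R^n_+
in_pos_orthant = bool(np.all(x > 0))      # x in R^n_++
```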
2. Linear space spanned by a set of vectors. The linear space spanned by the set of vectors v_1, v_2, ..., v_k ∈ R^n is defined as

    span{v_1, v_2, ..., v_k} = {a_1 v_1 + a_2 v_2 + ... + a_k v_k : a_i ∈ R for i = 1, 2, ..., k}.
For v ≠ 0, note that span{v} is the straight line spanned by v (it contains the origin).
Another example:

    span{(1, 0, 0), (0, 1, 0)} = {(x_1, x_2, x_3) ∈ R^3 : x_3 = 0}.
Note also that

    span{(1, 0, 0), (0, 1, 0), (1, 1, 0)} = span{(1, 0, 0), (0, 1, 0)} = span{(1, 0, 0), (1, 1, 0)}.
3. Subspace of R^n. A subset V of R^n is said to be a subspace if av + bw ∈ V for any v, w ∈ V and a, b ∈ R. Example:

    V = {(x_1, x_2, x_3) ∈ R^3 : 3x_1 − 2x_2 + x_3 = 0}

is a subspace, but

    U = {(x_1, x_2, x_3) ∈ R^3 : 3x_1 − 2x_2 + x_3 = 1}

is not. [Note: the origin always belongs to a subspace.]
4. Linearly independent vectors. The vectors v_1, v_2, ..., v_k ∈ R^n are linearly independent if and only if

    a_1 v_1 + a_2 v_2 + ... + a_k v_k = 0  ⇒  a_1 = a_2 = ... = a_k = 0.
Example: (1, 1, 1) and (1, 1, −1) are linearly independent, but (1, 1, 1) and (2, 2, 2) are not.
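Linear independence can be checked numerically by stacking the vectors into a matrix and computing its rank; a small numpy sketch (the sign in the independent pair is an assumption, since the extracted notes lost it):

```python
import numpy as np

# Vectors are linearly independent iff the matrix stacking them as rows
# has rank equal to the number of vectors.
V1 = np.array([[1, 1, 1], [1, 1, -1]])   # independent pair (assumed signs)
V2 = np.array([[1, 1, 1], [2, 2, 2]])    # dependent pair: second = 2 * first

rank1 = np.linalg.matrix_rank(V1)
rank2 = np.linalg.matrix_rank(V2)
```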
5. Basis and dimension of subspaces. Let V be a subspace of R^n. We say that v_1, v_2, ..., v_k is a basis for V if v_1, v_2, ..., v_k are linearly independent and they span V, that is,

    span{v_1, v_2, ..., v_k} = V.

Example: {(1, 0, 0), (0, 1, 0), (0, 0, 1)} is a basis for R^3 but {(1, 0, 0), (0, 1, 0), (1, 1, 0)} is not.
All bases of a given subspace V have the same number of vectors. This number is called the dimension of V and it is denoted by dim V. Note that dim R^n = n.
6. Inner product, orthogonality and norm. The inner product of v = (v_1, v_2, ..., v_n) and w = (w_1, w_2, ..., w_n) is given by

    ⟨v, w⟩ = v^T w = ∑_{i=1}^n v_i w_i.

The vectors v and w are said to be orthogonal if ⟨v, w⟩ = 0. The norm of v is given by

    ‖v‖ = √⟨v, v⟩ = √(∑_{i=1}^n v_i^2).    (1)
7. Orthonormal bases. The set {v_1, v_2, ..., v_k} of a subspace V ⊂ R^n is said to be an orthonormal basis if v_1, v_2, ..., v_k is a basis of V and

    ⟨v_i, v_j⟩ = 1 if i = j,  0 if i ≠ j.

{(1, 0, 0), (0, 1, 0), (0, 0, 1)} is an orthonormal basis of R^3, but {(1, 0, 0), (0, 1, 0), (0, 0, 2)} is not.
8. ℓ_p-norms. For p ≥ 1, the ℓ_p-norm is defined as

    ‖·‖_p : R^n → R,  ‖x‖_p = (∑_{i=1}^n |x_i|^p)^{1/p},

where x = (x_1, x_2, ..., x_n). Note that the norm in (1) corresponds to the case p = 2.
For p = ∞, the ℓ_∞-norm is defined as

    ‖·‖_∞ : R^n → R,  ‖x‖_∞ = max{|x_1|, |x_2|, ..., |x_n|}.

Properties of the ℓ_p-norms:
(positive definite) ‖x‖_p ≥ 0, with equality if and only if x = 0
(homogeneous) ‖ax‖_p = |a| ‖x‖_p for all a ∈ R and x ∈ R^n
(triangle inequality) ‖x + y‖_p ≤ ‖x‖_p + ‖y‖_p for any x, y ∈ R^n
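The ℓ_p-norms are available directly in numpy; a quick sketch with hand-checkable values:

```python
import numpy as np

x = np.array([3.0, -4.0])
y = np.array([1.0, 1.0])

norm1 = np.linalg.norm(x, 1)          # |3| + |-4| = 7
norm2 = np.linalg.norm(x, 2)          # sqrt(9 + 16) = 5
norm_inf = np.linalg.norm(x, np.inf)  # max(|3|, |-4|) = 4

# Triangle inequality for p = 2
triangle_ok = np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
```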
9. Cauchy-Schwarz inequality. For v, w ∈ R^n and 1 ≤ p ≤ ∞ there holds

    |⟨v, w⟩| ≤ ‖v‖_p ‖w‖_q,

where

    q = ∞ if p = 1,  q = p/(p − 1) if 1 < p < ∞,  q = 1 if p = ∞.

(For general p this is usually called Hölder's inequality; p = 2 gives the classical Cauchy-Schwarz inequality.) Some important special cases: |⟨v, w⟩| ≤ ‖v‖_2 ‖w‖_2 and |⟨v, w⟩| ≤ ‖v‖_1 ‖w‖_∞.
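The inequality is easy to sanity-check numerically; a sketch with random vectors (dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(5)
w = rng.standard_normal(5)

lhs = abs(v @ w)                                             # |<v, w>|
rhs_22 = np.linalg.norm(v, 2) * np.linalg.norm(w, 2)         # p = q = 2
rhs_1inf = np.linalg.norm(v, 1) * np.linalg.norm(w, np.inf)  # p = 1, q = inf
```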
Matrices
1. R^{n×m}. The set of matrices of size n × m with real entries is denoted by R^{n×m}.
2. Identity matrix. The identity matrix of size n × n is denoted by

    I_n = [ 1 0 ... 0 ]
          [ 0 1 ... 0 ]
          [ ⋮     ⋱ ⋮ ]
          [ 0 0 ... 1 ].
3. Transpose, symmetric matrices, trace, determinant and inverse. Let

    A = [ a_11 a_12 ... a_1m ]
        [ a_21 a_22 ... a_2m ]
        [  ⋮    ⋮    ⋱   ⋮  ]
        [ a_n1 a_n2 ... a_nm ]  ∈ R^{n×m}.
Its transpose is given by

    A^T = [ a_11 a_21 ... a_n1 ]
          [ a_12 a_22 ... a_n2 ]
          [  ⋮    ⋮    ⋱   ⋮  ]
          [ a_1m a_2m ... a_nm ]  ∈ R^{m×n}.
A square matrix A is said to be symmetric if A = A^T.
The trace of a square matrix A : n × n is the sum of its diagonal entries:

    tr(A) = ∑_{i=1}^n a_ii.
Note that tr(BC) = tr(CB) for any matrices B : p × q and C : q × p.
The determinant of A is denoted by det(A). Properties:
det(A) = det(A^T)
det(BC) = det(B) det(C) for square matrices B and C of the same size.
The determinant is not a linear operator: in general, det(A + B) ≠ det(A) + det(B).
If det(A) ≠ 0 then A is invertible (non-singular) and its inverse is denoted by A^{−1}:

    A A^{−1} = A^{−1} A = I_n.
Let A be written in columns, A = [ a_1 a_2 ... a_n ] (each a_i ∈ R^n); then det(A) ≠ 0 if and only if a_1, a_2, ..., a_n are linearly independent.
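The determinant and inverse properties can be verified on a small made-up example with numpy:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])   # det(A) = 5 != 0
B = np.array([[0.0, 1.0], [1.0, 1.0]])   # det(B) = -1

det_transpose_ok = np.isclose(np.linalg.det(A), np.linalg.det(A.T))
det_product_ok = np.isclose(np.linalg.det(A @ B),
                            np.linalg.det(A) * np.linalg.det(B))

A_inv = np.linalg.inv(A)                 # exists since det(A) != 0
inverse_ok = np.allclose(A @ A_inv, np.eye(2))
```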
4. Ker A, Im A and rank(A). Let A ∈ R^{n×m}. The nullspace or kernel of A is the subspace

    Ker A = {v ∈ R^m : Av = 0}.
Let A be written in rows,

    A = [ a_1 ]
        [ a_2 ]
        [  ⋮  ]
        [ a_n ]   (a_i ∈ R^m);

then Ker A = {v ∈ R^m : ⟨a_i, v⟩ = 0 for i = 1, 2, ..., n}. That is, Ker A is the subspace of vectors which are orthogonal to all the rows of A.
The range space or image of A is the subspace

    Im A = {Av : v ∈ R^m}.

Let A be written in columns, A = [ a_1 a_2 ... a_m ] (a_i ∈ R^n); then Im A = span{a_1, a_2, ..., a_m}. That is, Im A is the subspace spanned by the columns of A.
The rank of A is the dimension of the subspace Im A: rank(A) = dim Im A.
Properties:
rank(A) = rank(A^T)
rank(A) is the maximum number of linearly independent columns (or rows) of A
m = dim Ker A + dim Im A
rank(AB) ≤ min{rank(A), rank(B)}.
Note that rank(A) ≤ min{n, m}. A matrix is said to be full-rank if rank(A) = min{n, m}. Examples of full-rank matrices:

    A = [ 1 0  1 ]      B = [ 0 ]
        [ 1 0 −1 ]          [ 2 ].
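The rank-nullity identity m = dim Ker A + dim Im A can be checked numerically; the sketch below uses a 2 × 3 full-rank example (the −1 entry is an assumption, since the extraction lost a sign and full rank requires independent rows):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, -1.0]])

r = np.linalg.matrix_rank(A)     # min(n, m) = 2 -> full rank
dim_ker = A.shape[1] - r         # rank-nullity: m = dim Ker A + dim Im A
```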
5. Orthogonal matrices. A square matrix Q : n × n is said to be orthogonal if

    Q Q^T = I_n.

Note that, in that case, Q^{−1} = Q^T and Q^T Q = I_n.
Let Q be written in columns, Q = [ q_1 q_2 ... q_n ]; then Q is orthogonal if and only if its columns q_1, q_2, ..., q_n constitute an orthonormal basis of R^n.
Since Q orthogonal implies Q Q^T = I_n, it follows that det(Q) det(Q^T) = (det(Q))^2 = det(I_n) = 1, hence det(Q) = ±1.
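A convenient way to produce an orthogonal matrix numerically is the QR factorization; a sketch checking Q Q^T = I and det(Q) = ±1 (the size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# QR factorization of a generic random square matrix yields an orthogonal Q.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

orthogonal = np.allclose(Q @ Q.T, np.eye(4)) and np.allclose(Q.T @ Q, np.eye(4))
det_Q = np.linalg.det(Q)   # must be +1 or -1 up to rounding
```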
6. Frobenius norm. The Frobenius norm of a matrix A ∈ R^{n×m} is defined as

    ‖A‖ = √(∑_{i=1}^n ∑_{j=1}^m a_ij^2).
Note that

    ‖A‖ = √(tr(A^T A)) = √(tr(A A^T)).

It corresponds to the usual norm of vectors, if A is interpreted as a vector in R^{nm} (for example, by stacking all its columns).
There holds ‖Av‖ ≤ ‖A‖ ‖v‖ for any A ∈ R^{n×m} and v ∈ R^m.
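A quick numpy check of the trace identity and of the bound ‖Av‖ ≤ ‖A‖ ‖v‖, on a made-up matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, -1.0])

fro = np.linalg.norm(A, 'fro')           # sqrt(1 + 4 + 9 + 16) = sqrt(30)
via_trace = np.sqrt(np.trace(A.T @ A))   # same value by the identity above
bound_ok = np.linalg.norm(A @ v) <= fro * np.linalg.norm(v) + 1e-12
```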
7. Symmetric matrices: eigenvalues and eigenvectors. The set of symmetric matrices of size n × n is denoted by

    S^n = {A ∈ R^{n×n} : A = A^T}.

Note that S^n is a subspace of R^{n×n}.
Let A ∈ S^n. The number λ ∈ R is said to be an eigenvalue of A if and only if there exists a non-zero vector v ∈ R^n such that

    Av = λv.

In that case, the vector v is said to be an eigenvector of A associated with the eigenvalue λ.
Properties:
if v_i and v_j are eigenvectors of A associated with distinct eigenvalues λ_i and λ_j (λ_i ≠ λ_j), then they are orthogonal: ⟨v_i, v_j⟩ = 0;
A is singular (non-invertible) if and only if A has a zero eigenvalue.
8. Spectral decomposition theorem. Let A be a symmetric matrix of size n × n. Then, there exist an orthogonal matrix Q : n × n and a diagonal matrix

    Λ = [ λ_1           ]
        [     λ_2       ]
        [         ⋱     ]
        [           λ_n ]  : n × n

containing the eigenvalues of A, such that

    A = Q Λ Q^T.

Note that A q_i = λ_i q_i, where q_i ∈ R^n denotes the ith column of Q = [ q_1 q_2 ... q_n ]. Thus, each q_i is an eigenvector of A associated with the eigenvalue λ_i.
The spectral decomposition theorem implies:

    tr(A) = ∑_{i=1}^n λ_i,   det(A) = ∏_{i=1}^n λ_i.
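numpy's `eigh` computes exactly this factorization for symmetric matrices; a sketch verifying A = Q Λ Q^T and the trace/determinant identities on a made-up example:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # symmetric, eigenvalues 1 and 3

lam, Q = np.linalg.eigh(A)               # A = Q diag(lam) Q^T
reconstruction_ok = np.allclose(Q @ np.diag(lam) @ Q.T, A)
trace_ok = np.isclose(np.trace(A), lam.sum())
det_ok = np.isclose(np.linalg.det(A), lam.prod())
```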
9. Inequalities for symmetric matrices. If A : n × n is a symmetric matrix, then

    λ_min(A) ‖v‖^2 ≤ v^T A v ≤ λ_max(A) ‖v‖^2

for all v ∈ R^n. Note: the equalities are achieved by choosing v as an eigenvector associated with λ_min(A) or λ_max(A).
Thus,

    λ_min(A) = min_{v≠0} (v^T A v)/(v^T v) = min_{‖v‖=1} v^T A v

and

    λ_max(A) = max_{v≠0} (v^T A v)/(v^T v) = max_{‖v‖=1} v^T A v.
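The Rayleigh-quotient bounds can be probed with random vectors; a small sketch (matrix, sample count and seed are arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam = np.linalg.eigvalsh(A)              # ascending: lam[0] = min, lam[-1] = max

rng = np.random.default_rng(2)
inside = True
for _ in range(200):
    v = rng.standard_normal(2)
    rayleigh = (v @ A @ v) / (v @ v)     # v^T A v / v^T v
    inside = inside and (lam[0] - 1e-9 <= rayleigh <= lam[-1] + 1e-9)
```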
10. Theorem about the continuity of eigenvalues. To each symmetric matrix A ∈ S^n correspond n eigenvalues (not necessarily distinct), which we sort in non-decreasing order:

    λ_min(A) = λ_1(A) ≤ λ_2(A) ≤ ... ≤ λ_n(A) = λ_max(A).

Thus, there exist n functions defined on S^n:

    λ_1 : S^n → R,  λ_2 : S^n → R,  ...,  λ_n : S^n → R.

The function λ_i : S^n → R corresponds to the map A ↦ λ_i(A).
Theorem: each function λ_i is continuous, that is, for any A_0 ∈ S^n, there holds

    ∀ε>0 ∃δ>0 : ‖A − A_0‖ < δ and A ∈ S^n  ⇒  |λ_i(A) − λ_i(A_0)| < ε.
11. Positive definite and semidefinite matrices. A symmetric matrix

    A = [ a_11 a_12 a_13 ... a_1n ]
        [ a_21 a_22 a_23 ... a_2n ]
        [ a_31 a_32 a_33 ... a_3n ]
        [  ⋮    ⋮    ⋮    ⋱   ⋮  ]
        [ a_n1 a_n2 a_n3 ... a_nn ]

is said to be positive semidefinite if v^T A v ≥ 0 for all v ∈ R^n. The notation A ⪰ 0 means that A is positive semidefinite.
The set of positive semidefinite matrices of size n × n is denoted by

    S^n_+ = {A ∈ S^n : A ⪰ 0}.

Equivalent characterizations of A ⪰ 0:
λ_min(A) ≥ 0
All the principal minors of A are nonnegative. For example, for n = 2:
    A = [ a_11 a_12 ]  ⪰ 0   ⇔   a_11 ≥ 0,  a_22 ≥ 0,  det [ a_11 a_12 ] ≥ 0.
        [ a_21 a_22 ]                                       [ a_21 a_22 ]
For n = 3:

    A = [ a_11 a_12 a_13 ]
        [ a_21 a_22 a_23 ]  ⪰ 0
        [ a_31 a_32 a_33 ]

if and only if

    a_11 ≥ 0,  a_22 ≥ 0,  a_33 ≥ 0,
    det [ a_11 a_12 ] ≥ 0,  det [ a_11 a_13 ] ≥ 0,  det [ a_22 a_23 ] ≥ 0,
        [ a_21 a_22 ]           [ a_31 a_33 ]           [ a_32 a_33 ]
    det A ≥ 0.
A symmetric matrix A is said to be positive definite if v^T A v > 0 for all non-zero v ∈ R^n; the notation A ≻ 0 means that A is positive definite.

12. Negative definite and semidefinite matrices. A symmetric matrix

    A = [ a_11 a_12 a_13 ... a_1n ]
        [ a_21 a_22 a_23 ... a_2n ]
        [ a_31 a_32 a_33 ... a_3n ]
        [  ⋮    ⋮    ⋮    ⋱   ⋮  ]
        [ a_n1 a_n2 a_n3 ... a_nn ]
is said to be negative semidefinite if v^T A v ≤ 0 for all v ∈ R^n. The notation A ⪯ 0 means that A is negative semidefinite.
The set of negative semidefinite matrices of size n × n is denoted by

    S^n_− = {A ∈ S^n : A ⪯ 0}.
Equivalent characterizations of A ⪯ 0:
−A ⪰ 0
λ_max(A) ≤ 0.
A matrix A is said to be negative definite if v^T A v < 0 for all non-zero v ∈ R^n. The notation A ≺ 0 means that A is negative definite, and the set of negative definite matrices is denoted by

    S^n_−− = {A ∈ S^n : A ≺ 0}.
Equivalent characterizations of A ≺ 0:
−A ≻ 0
λ_max(A) < 0
The leading principal minors of A are alternately negative and positive:

    a_11 < 0,   det [ a_11 a_12 ] > 0,   det [ a_11 a_12 a_13 ] < 0,   ...,   (−1)^n det(A) > 0.
                    [ a_21 a_22 ]            [ a_21 a_22 a_23 ]
                                             [ a_31 a_32 a_33 ]
Properties of negative definite matrices:
if A ≺ 0 then A is invertible
if A ≺ 0 then A^{−1} ≺ 0
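Definiteness checks translate directly into eigenvalue and minor tests; a sketch on a made-up 2 × 2 positive definite matrix and its negation:

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # positive definite example

pd_by_eigs = np.linalg.eigvalsh(A)[0] > 0                   # lambda_min > 0
pd_by_minors = A[0, 0] > 0 and np.linalg.det(A) > 0         # n = 2 leading minors

# N = -A is then negative definite: leading minors alternate -, +
N = -A
nd_by_minors = N[0, 0] < 0 and np.linalg.det(N) > 0
nd_inverse = np.linalg.eigvalsh(np.linalg.inv(N))[-1] < 0   # N^{-1} also ND
```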
13. Singular value decomposition. Any matrix A ∈ R^{n×m} can be factored as

    A = U Σ V^T,

where U ∈ R^{n×n} and V ∈ R^{m×m} are orthogonal matrices and

    Σ = [ D 0 ]
        [ 0 0 ]  ∈ R^{n×m}.

The matrix D ∈ R^{r×r}, where r = rank(A), is diagonal,

    D = [ σ_1             ]
        [      σ_2        ]
        [           ⋱     ]
        [             σ_r ]

and σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0.
The σ_i's are called the (nonzero) singular values of A. The columns of U (V) are called the left (right) singular vectors of A.
Other variants:
If n ≥ m, then r = rank(A) ≤ m and A can be written as

    A = Ũ Σ̃ V^T

with Ũ ∈ R^{n×m} obtained from U by discarding its n − m rightmost columns, and

    Σ̃ = diag(σ_1, ..., σ_r, σ_{r+1}, ..., σ_m) ∈ R^{m×m},  σ_{r+1} = ... = σ_m = 0,

obtained from Σ by discarding its last n − m rows.
If n ≤ m, then r = rank(A) ≤ n and A can be written as

    A = U Σ̃ Ṽ^T

with Ṽ ∈ R^{m×n} obtained from V by discarding its m − n rightmost columns, and

    Σ̃ = diag(σ_1, ..., σ_r, σ_{r+1}, ..., σ_n) ∈ R^{n×n},  σ_{r+1} = ... = σ_n = 0,

obtained from Σ by discarding its m − n rightmost columns.
[Note: when m = n, the matrices U, Σ and V are all square. However, the singular value decomposition (SVD) gives a different decomposition from the spectral decomposition theorem, since the latter only applies to symmetric matrices.]
Properties:
A ∈ R^{n×m} ⇒ A^T A ∈ S^m. By the spectral decomposition theorem, A^T A = V Λ V^T, where V ∈ R^{m×m} is orthogonal and Λ ∈ R^{m×m} is diagonal with the eigenvalues of A^T A in its diagonal. On the other hand, the SVD of A gives A = U Σ V^T and

    A^T A = V Σ^T U^T U Σ V^T = V (Σ^T Σ) V^T

with

    Σ^T Σ = diag(σ_1^2, σ_2^2, ..., σ_m^2),

that is, the nonzero singular values of A are the positive square roots of the nonzero eigenvalues of A^T A.
Similarly, A ∈ R^{n×m} ⇒ A A^T ∈ S^n, and the matrix U ∈ R^{n×n} of the SVD of A (A = U Σ V^T) diagonalizes A A^T: A A^T = U (Σ Σ^T) U^T, where Σ Σ^T contains the squares of the singular values in its diagonal, so the nonzero singular values of A are also the positive square roots of the nonzero eigenvalues of A A^T.
A = ∑_{i=1}^r σ_i u_i v_i^T, where u_1, u_2, ..., u_n are the columns of U and v_1, v_2, ..., v_m are the columns of V.
The largest singular value of A, σ_max := σ_1, is given by

    σ_max = max_{‖u‖=1, ‖v‖=1} u^T A v.

‖Ax‖ ≤ σ_max ‖x‖ for any x.
‖A‖ = √(∑_{i=1}^r σ_i^2).
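numpy's `svd` returns exactly this factorization; a sketch verifying the rank, the rank-one expansion A = ∑ σ_i u_i v_i^T, and the Frobenius identity, on a small made-up matrix:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, -1.0]])

U, s, Vt = np.linalg.svd(A)          # full SVD: A = U diag(s) V^T, s descending
r = int(np.sum(s > 1e-12))           # number of nonzero singular values = rank

# Rank-r expansion: A = sum_i s_i * u_i v_i^T
A_rec = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))

fro_ok = np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2)))
sigma_max = s[0]                     # also bounds ||Ax|| <= sigma_max * ||x||
```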
Analysis
1. Compact sets. A subset K ⊂ R^n is said to be compact if it is closed and bounded. Examples:
K = {x ∈ R^n : ‖x‖ ≤ 5} is compact
K = {x = (x_1, x_2) ∈ R^2 : 1 ≤ x_1 ≤ 3, x_2 ≥ 0, x_1 + x_2 ≤ 1} is compact
the set K = {x ∈ R^n : ‖x‖ < 5} is not compact because it is not closed
the set K = {x ∈ R^2 : 1 ≤ x_1 ≤ 3, x_2 ≥ 0} is not compact because it is not bounded.
Equivalent characterization: K ⊂ R^n is compact if and only if for any sequence {x_k : k ∈ N} in K there exists a subsequence {x_{m_k} : k ∈ N} converging to a point of K.
Common application of the previous result: let {x_k : k ∈ N} be a sequence of points in the sphere K = {x ∈ R^n : ‖x‖ = 1}. Note that K is compact. Then, there exist a subsequence {x_{m_k} : k ∈ N} and a point x_0 ∈ K such that x_{m_k} → x_0.
2. Weierstrass theorem. Let f : R^n → R be a continuous function. Then, for each compact subset K ⊂ R^n there exist x_1 ∈ K and x_2 ∈ K such that

    f(x_1) = min_{x∈K} f(x)   and   f(x_2) = max_{x∈K} f(x).

That is, a continuous function on a compact set achieves its infimum and supremum over this set.
3. Differentiability. A function f : R^n → R is said to be of class C^k (k = 0, 1, 2, ...) if all partial derivatives of order ≤ k exist and are continuous (a function of class C^0 is a continuous function).
For example, if f is of class C^1 then its gradient

    ∇f : R^n → R^n,  ∇f(x) = [ ∂f/∂x_1(x) ]
                             [ ∂f/∂x_2(x) ]
                             [     ⋮      ]
                             [ ∂f/∂x_n(x) ]

is a continuous map (each function R^n ∋ x ↦ ∂f/∂x_i(x) ∈ R is continuous). If f is of class C^2 then its Hessian

    ∇²f : R^n → R^{n×n},
    ∇²f(x) = [ ∂²f/∂x_1²(x)     ∂²f/∂x_1∂x_2(x) ... ∂²f/∂x_1∂x_n(x) ]
             [ ∂²f/∂x_2∂x_1(x)  ∂²f/∂x_2²(x)    ... ∂²f/∂x_2∂x_n(x) ]
             [       ⋮                ⋮          ⋱        ⋮          ]
             [ ∂²f/∂x_n∂x_1(x)  ∂²f/∂x_n∂x_2(x) ... ∂²f/∂x_n²(x)    ]

is a continuous map (each function R^n ∋ x ↦ ∂²f/∂x_i∂x_j(x) ∈ R is continuous).
Schwarz theorem: if f : R^n → R is of class C^2, then its Hessian matrix ∇²f(x) is symmetric for any x ∈ R^n.
A function f : R^n → R is said to be smooth if it is of class C^k for all k = 0, 1, 2, ...
A map F : R^n → R^m, F(x) = (f_1(x), f_2(x), ..., f_m(x)), is said to be of class C^k (respectively, smooth) if each component function f_i : R^n → R is of class C^k (respectively, smooth).
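A standard way to validate hand-derived gradients is central finite differences; a sketch with an ad-hoc C^2 function (the function and evaluation point are made up for illustration):

```python
import numpy as np

# f(x) = x_1^2 + 3 x_1 x_2, with known gradient and constant Hessian.
def f(x):
    return x[0]**2 + 3.0 * x[0] * x[1]

def grad_f(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

hess_f = np.array([[2.0, 3.0],
                   [3.0, 0.0]])   # symmetric, as Schwarz's theorem requires

x = np.array([1.0, 2.0])
eps = 1e-6
# Central differences along each coordinate direction.
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])

grad_ok = np.allclose(fd_grad, grad_f(x), atol=1e-5)
hessian_symmetric = np.allclose(hess_f, hess_f.T)
```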
Taylor theorems
1. Taylor expansions. Consider a function f : R^n → R and a nominal point x_0 ∈ R^n.
(1st order expansion) If f is differentiable, then

    f(x_0 + h) = f(x_0) + ∇f(x_0)^T h + o(‖h‖),

that is,

    ∀ε>0 ∃δ>0 : ‖h‖ ≤ δ  ⇒  |f(x_0 + h) − [f(x_0) + ∇f(x_0)^T h]| ≤ ε ‖h‖.
(2nd order expansion) If f is twice-differentiable, then

    f(x_0 + h) = f(x_0) + ∇f(x_0)^T h + (1/2) h^T ∇²f(x_0) h + o(‖h‖²),

that is,

    ∀ε>0 ∃δ>0 : ‖h‖ ≤ δ  ⇒  |f(x_0 + h) − [f(x_0) + ∇f(x_0)^T h + (1/2) h^T ∇²f(x_0) h]| ≤ ε ‖h‖².
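The second-order expansion can be checked numerically: for a smooth function the model error at step h is of order ‖h‖³, comfortably o(‖h‖²). A sketch with an ad-hoc function and point:

```python
import numpy as np

# f(x) = exp(x_1) + x_1 * x_2^2, expanded around x0 = (0, 1).
def f(x):
    return np.exp(x[0]) + x[0] * x[1]**2

x0 = np.array([0.0, 1.0])
g = np.array([np.exp(x0[0]) + x0[1]**2, 2.0 * x0[0] * x0[1]])   # gradient at x0
H = np.array([[np.exp(x0[0]), 2.0 * x0[1]],
              [2.0 * x0[1], 2.0 * x0[0]]])                      # Hessian at x0

h = 1e-3 * np.array([1.0, -1.0])
model = f(x0) + g @ h + 0.5 * h @ H @ h
remainder = abs(f(x0 + h) - model)   # o(||h||^2): here ~ ||h||^3
```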
2. Important properties. Let F : R^n → R^m, F(x) = (f_1(x), f_2(x), ..., f_m(x)), be a map of class C^1. Then, for any x, y ∈ R^n there holds

    F(y) = F(x) + ∫_0^1 DF(x + t(y − x)) (y − x) dt,

where

    DF(z) = [ ∂f_1/∂x_1(z)  ∂f_1/∂x_2(z) ... ∂f_1/∂x_n(z) ]     [ ∇f_1(z)^T ]
            [ ∂f_2/∂x_1(z)  ∂f_2/∂x_2(z) ... ∂f_2/∂x_n(z) ]  =  [ ∇f_2(z)^T ]
            [      ⋮              ⋮               ⋮        ]     [     ⋮     ]
            [ ∂f_m/∂x_1(z)  ∂f_m/∂x_2(z) ... ∂f_m/∂x_n(z) ]     [ ∇f_m(z)^T ]

denotes the derivative of F at the point z.
Application: for the particular case of a function f : R^n → R (i.e., m = 1) there holds:
if f is of class C^1,

    f(y) = f(x) + ∫_0^1 ∇f(x + t(y − x))^T (y − x) dt;

if f is of class C^2,

    ∇f(y) = ∇f(x) + ∫_0^1 ∇²f(x + t(y − x)) (y − x) dt;

for any x, y ∈ R^n.
3. Mean-value theorem for continuous functions. If f : R → R is continuous then

    ∫_a^b f(x) dx = f(c)(b − a)

for some c ∈ [a, b].
4. Mean-value theorems for differentiable functions. Let f : R^n → R and x, y ∈ R^n.
(1st order expansion) If f is of class C^1 then

    f(y) = f(x) + ∇f(z)^T (y − x)

for some z ∈ [x, y]. The notation [x, y] denotes the line segment which runs from x to y, that is,

    [x, y] = {(1 − t)x + ty : t ∈ [0, 1]}.

(2nd order expansion) If f is of class C^2 then

    f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(z)(y − x)

for some z ∈ [x, y].