EECS 127/227AT Course Reader
Spring 2024
Acknowledgements
This reader is based on lectures from Spring 2021, Fall 2022, Spring 2023, and Fall 2023 iterations of EECS 127/227A
by Prof. Gireeja Ranade. The reader was mostly written by Spring 2023 GSIs Druv Pai, Arwa Alanqary, and Aditya
Ramabadran, and reviewed by Prof. Ranade. Fall 2022 tutor Jeffrey Wu collaborated with Druv on a writeup about the
Eckart-Young theorem which was folded into the reader. Contributions from Prof. Venkat Anantharam and Chih-Yuan
Chiu were also added in Fall 2023.
© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission.
Contents

1 Introduction
  1.1 What is Optimization?
  1.2 Least Squares
  1.3 Solution Concepts and Notation
  1.4 (OPTIONAL) Infimum Versus Minimum

3 Vector Calculus
  3.1 Gradient, Jacobian, and Hessian
  3.2 Taylor's Theorems
  3.3 The Main Theorem
  3.4 Directional Derivatives
  3.5 (OPTIONAL) Matrix Calculus

5 Convexity
  5.1 Convex Sets
  5.2 Convex Functions

7 Duality
  7.1 Lagrangian
  7.2 Weak Duality
  7.3 Strong Duality
  7.4 Karush-Kuhn-Tucker (KKT) Conditions
  7.5 (OPTIONAL) Conic Duality

11 Applications
  11.1 Deterministic Control and Linear-Quadratic Regulator
  11.2 Support Vector Machines
Chapter 1
Introduction
• [1] Chapter 1.
• [2] Chapter 1.

Optimization problems appear all over engineering and science. Consider the following examples:

• A statistical model, such as a neural network, is trained using finitely many data samples.
• A robot learns a control strategy by interacting with its environment, so that it behaves as desired.
• A major gas company decides what mixture of different fuels to process in order to maximize profit.
• The EECS department decides how to set class sizes in order to maximize the number of credits offered subject to budget constraints.
While it might seem that these four examples are very distinct, they can all be formulated as minimizing an objective
function over a feasible set. Thus, they can all be put into the framework of optimization.
To develop the basics of optimization, including precisely defining an objective function and a feasible set, we use
some motivating examples from the third and fourth “problems”. (The first and second “problems” will be discussed
at the very end of the course.)
Example 1 (Oil and Gas). Say that we are a gas company with $10^5$ barrels of crude oil that we must refine by an expiration date. There are two refineries: one which processes crude oil into jet fuel, and one which processes crude oil into gasoline. We can sell a barrel of jet fuel for \$0.10, while we can sell a barrel of gasoline for \$0.20. So, letting $x_1$ be a variable denoting the number of barrels of jet fuel produced, and $x_2$ be a variable denoting the number of barrels of gasoline produced, we aim to solve the problem:
$$\begin{aligned}
\max_{x_1, x_2} \quad & \frac{1}{10} x_1 + \frac{1}{5} x_2 && (1.1) \\
\text{s.t.} \quad & x_1 \geq 0 \\
& x_2 \geq 0 \\
& x_1 + x_2 = 10^5.
\end{aligned}$$
That is, we aim to choose $x_1$ and $x_2$ which maximize the objective function $\frac{1}{10} x_1 + \frac{1}{5} x_2$, but with the caveat that they must obey the constraints $x_1 \geq 0$, $x_2 \geq 0$, and $x_1 + x_2 = 10^5$. The feasible set is the set of all $(x_1, x_2)$ pairs which obey the constraints. As you may have noticed, constraints can be equalities or inequalities in the $x_i$, which we formalize shortly.
The solution to this problem can be seen to be $(x_1^\star, x_2^\star) = (0, 10^5)$, which corresponds to refining all the crude oil
into gasoline. This makes sense – after all, gasoline sells for more! And with all else equal between gasoline and jet
fuel, to maximize our profit, we just need to produce gasoline.
To model another constraint, say that we need at least $10^3$ barrels of jet fuel and $5 \cdot 10^2$ barrels of gasoline; we can directly incorporate these requirements into the constraint set:
$$\begin{aligned}
\max_{x_1, x_2} \quad & \frac{1}{10} x_1 + \frac{1}{5} x_2 && (1.2) \\
\text{s.t.} \quad & x_1 \geq 0 \\
& x_2 \geq 0 \\
& x_1 \geq 10^3 \\
& x_2 \geq 5 \cdot 10^2 \\
& x_1 + x_2 = 10^5.
\end{aligned}$$
We then notice that $x_1 \geq 0$ is made redundant by the constraint $x_1 \geq 10^3$: every pair $(x_1, x_2)$ which satisfies $x_1 \geq 10^3$ automatically satisfies $x_1 \geq 0$. Thus, we can eliminate the latter constraint, since doing so defines the same feasible set. We can do the same thing for the constraints $x_2 \geq 0$ and $x_2 \geq 5 \cdot 10^2$, the latter making the former redundant. Thus, we can simplify the above problem to only include the non-redundant constraints:
$$\begin{aligned}
\max_{x_1, x_2} \quad & \frac{1}{10} x_1 + \frac{1}{5} x_2 && (1.3) \\
\text{s.t.} \quad & x_1 \geq 10^3 \\
& x_2 \geq 5 \cdot 10^2 \\
& x_1 + x_2 = 10^5.
\end{aligned}$$
Let's say that we want to incorporate one final business need. So far, we have been modeling transportation of the oil as free, since no objective or constraint term involves its cost. Now, let us say that we can transport a total of $2 \cdot 10^6$ "barrel-miles" – that is, the number of barrels times the number of miles transported can be no greater than $2 \cdot 10^6$. Let us further say that the jet fuel refinery is 10 miles away from the crude oil storage, and the gasoline refinery is 30 miles away from the crude oil storage. We can incorporate this further constraint into the constraint set directly:
$$\begin{aligned}
\max_{x_1, x_2} \quad & \frac{1}{10} x_1 + \frac{1}{5} x_2 && (1.4) \\
\text{s.t.} \quad & x_1 \geq 10^3 \\
& x_2 \geq 5 \cdot 10^2 \\
& 10 x_1 + 30 x_2 \leq 2 \cdot 10^6 \\
& x_1 + x_2 = 10^5.
\end{aligned}$$
This is a good first problem; we have a non-trivial objective function, non-trivial inequality and equality constraints, and we even got to manipulate constraints (so as to remove redundant ones)!
This type of optimization problem is called a linear program. We will learn more about how to formulate and solve
linear programs later in the course.
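To make this concrete, here is a minimal sketch of how one might solve (1.4) numerically. It assumes numpy and scipy are available (neither is prescribed by this reader), and since linprog minimizes by convention, we negate the objective.

```python
# A sketch of solving the oil LP (1.4) with scipy (an assumed dependency).
import numpy as np
from scipy.optimize import linprog

# linprog minimizes, so negate the per-barrel profits to maximize them.
c = np.array([-0.10, -0.20])
A_ub = np.array([[10.0, 30.0]])      # 10*x1 + 30*x2 <= 2e6 barrel-miles
b_ub = np.array([2e6])
A_eq = np.array([[1.0, 1.0]])        # x1 + x2 = 1e5: refine all the crude
b_eq = np.array([1e5])
bounds = [(1e3, None), (5e2, None)]  # x1 >= 1e3, x2 >= 5e2

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x, -res.fun)  # optimal barrels (x1, x2) and maximum profit
```

Unlike in (1.1), producing only gasoline is no longer feasible here: the transport constraint caps $x_2$ at $5 \cdot 10^4$, so the optimum splits production between the two fuels.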
A more generic reformulation of the above optimization problem is the following "standard form":

$$\begin{aligned}
\min_{\vec{x} \in \mathbb{R}^n} \quad & f_0(\vec{x}) && (1.5) \\
\text{s.t.} \quad & f_i(\vec{x}) \leq 0, \quad \forall i \in \{1, \ldots, m\} \\
& h_j(\vec{x}) = 0, \quad \forall j \in \{1, \ldots, p\}.
\end{aligned}$$

Here:
• fi are inequality constraint functions; the expression “fi (~x) ≤ 0” is an inequality constraint.
• Similarly, hj are equality constraint functions, and the expression “hj (~x) = 0” is an equality constraint.
• The feasible set, i.e., the set of all $\vec{x}$ that satisfy all constraints, is

$$\Omega \doteq \left\{ \vec{x} \in \mathbb{R}^n \;\middle|\; \begin{array}{l} f_i(\vec{x}) \leq 0, \ \forall i \in \{1, \ldots, m\} \\ h_j(\vec{x}) = 0, \ \forall j \in \{1, \ldots, p\} \end{array} \right\}. \qquad (1.6)$$
• A solution to this optimization problem is any $\vec{x}^\star \in \Omega$ which attains the minimum value of $f_0(\vec{x})$ across all $\vec{x} \in \Omega$. Correspondingly, $\vec{x}^\star$ is also called a minimizer of $f_0$ over $\Omega$.
It’s perfectly fine if m = 0 (in which case there are no inequality constraints) and/or p = 0 (in which case there are no
equality constraints). If there are no constraints, then Ω = Rn and the problem is called unconstrained; otherwise it is
called constrained.
Let us try another example now, which has vector-valued quantities.
Class    Size    Credits    Resources
 127     x_1     c_1        r_1
 126     x_2     c_2        r_2
 182     x_3     c_3        r_3
 189     x_4     c_4        r_4
 162     x_5     c_5        r_5
 188     x_6     c_6        r_6
 ...     ...     ...        ...

Suppose there are $n$ classes in total. Let $\vec{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^\top \in \mathbb{R}^n$ be the decision variable, and let $\vec{c} \doteq \begin{bmatrix} c_1 & c_2 & \cdots & c_n \end{bmatrix}^\top \in \mathbb{R}^n$ and $\vec{r} \doteq \begin{bmatrix} r_1 & r_2 & \cdots & r_n \end{bmatrix}^\top \in \mathbb{R}^n$ be constants. Then, in order to maximize the total number of credit hours subject to a total resource budget $b$, we set up the linear program

$$\begin{aligned}
\max_{\vec{x} \in \mathbb{R}^n} \quad & \vec{c}^\top \vec{x} \\
\text{s.t.} \quad & \vec{r}^\top \vec{x} \leq b \\
& x_i \geq 0, \quad \forall i \in \{1, \ldots, n\}.
\end{aligned}$$
As notation, instead of the last set of constraints xi ≥ 0, we can write the vector constraint ~x ≥ ~0.
More generally, recall that if we have a vector equality constraint ~h(~x) = ~0, it can be viewed as short-hand for
the several scalar equality constraints h1 (~x) = 0, . . . , hp (~x) = 0. Correspondingly, we define the vector inequality
constraint f~(~x) ≤ ~0 to be short-hand for the several scalar inequality constraints f1 (~x) ≤ 0, . . . , fm (~x) ≤ 0.
1.2 Least Squares

We will use the Euclidean norm $\|\vec{z}\|_2 \doteq \sqrt{\vec{z}^\top \vec{z}} = \sqrt{\sum_{i=1}^n z_i^2}$; it is labeled with the 2 for a reason we will see later in the course.

Given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $\vec{y} \in \mathbb{R}^m$, the least squares problem is

$$\min_{\vec{x} \in \mathbb{R}^n} \|A\vec{x} - \vec{y}\|_2^2. \qquad (1.9)$$

Suppose that $A$ is tall ($m > n$) with full column rank, so that $A^\top A$ is invertible. Then the solution to (1.9) is given by

$$\vec{x}^\star = (A^\top A)^{-1} A^\top \vec{y}. \qquad (1.10)$$

Proof. The idea is to find $A\vec{x} \in R(A)$ which is closest to $\vec{y}$. Here $R(A)$ is the range, or column space, or column span, of $A$. In general, we have no guarantee that $\vec{y} \in R(A)$, so there is not necessarily an $\vec{x}$ such that $A\vec{x} = \vec{y}$. Instead, we are finding an approximate solution to the equation $A\vec{x} = \vec{y}$.
Recall that $R(A)$ is a subspace, and that $\vec{y}$ itself may not belong to $R(A)$. Thus we can visualize the geometry of the problem as the following picture:

[Figure: the vector $\vec{y}$ sitting above the subspace $R(A)$, which contains $\vec{0}$.]
We can now solve this problem using ideas from geometry. We claim that the closest point to $\vec{y}$ contained in $R(A)$ is the orthogonal projection of $\vec{y}$ onto $R(A)$; call this point $\vec{z}$. Also, define $\vec{e} \doteq \vec{y} - \vec{z}$. This gives the following diagram.

[Figure: $\vec{y}$ above $R(A)$, its projection $\vec{z} \in R(A)$, and the residual $\vec{e} = \vec{y} - \vec{z}$ perpendicular to $R(A)$.]
From this diagram, we see that $\vec{e}$ is orthogonal to any vector in $R(A)$. But remember that we still have to prove that $\vec{z}$ is the closest point to $\vec{y}$ within $R(A)$. To see this, consider any other point $\vec{u} \in R(A)$ and define $\vec{v} \doteq \vec{y} - \vec{u}$. This gives the following diagram:

[Figure: as before, now with another point $\vec{u} \in R(A)$ and the vector $\vec{v} = \vec{y} - \vec{u}$.]
To complete our proof, we define $\vec{w} \doteq \vec{z} - \vec{u}$, noting that the angle $\vec{u} \to \vec{z} \to \vec{y}$ is a right angle; in other words, $\vec{w}$ and $\vec{e}$ are orthogonal. This gives the following picture.

[Figure: as before, now also showing $\vec{w} = \vec{z} - \vec{u}$ along $R(A)$, so that $\vec{v} = \vec{w} + \vec{e}$.]

Then, for any $\vec{u} \neq \vec{z}$,

$$\begin{aligned}
\|\vec{y} - \vec{u}\|_2^2 &= \|\vec{v}\|_2^2 && (1.11) \\
&= \|\vec{w}\|_2^2 + \|\vec{e}\|_2^2 && (1.12) \\
&= \underbrace{\|\vec{z} - \vec{u}\|_2^2}_{>0} + \|\vec{e}\|_2^2 && (1.13) \\
&> \|\vec{e}\|_2^2 && (1.14) \\
&= \|\vec{y} - \vec{z}\|_2^2. && (1.15)
\end{aligned}$$

Thus $\vec{z}$ is the unique closest point to $\vec{y}$ in $R(A)$, and the optimal $\vec{x}^\star$ satisfies $A\vec{x}^\star = \vec{z}$. Since $\vec{e} = \vec{y} - A\vec{x}^\star$ is orthogonal to every vector in $R(A)$, in particular to every column of $A$, we have $A^\top(\vec{y} - A\vec{x}^\star) = \vec{0}$, i.e., $A^\top A \vec{x}^\star = A^\top \vec{y}$, whence $\vec{x}^\star = (A^\top A)^{-1} A^\top \vec{y}$.
We'll conclude with a statistical application of least squares to linear regression. Suppose we are given data $(x_1, y_1), \ldots, (x_n, y_n)$, and want to fit an affine model $y = mx + b$ through these data points. This corresponds to approximately solving the system

$$\begin{aligned}
m x_1 + b &= y_1 \\
m x_2 + b &= y_2 \\
&\ \vdots && (1.20) \\
m x_n + b &= y_n.
\end{aligned}$$
In the case where the data is noisy or inconsistent with the model, as in the below figure, the linear system will be overdetermined and have no solutions. Then, we find an approximate solution – a line of best fit – via least squares on the above system.

[Figure: a scatter of noisy data points with a line of best fit through them.]
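As a quick illustration, the following sketch (assuming numpy, with made-up data) sets up the system (1.20) as a least squares problem and solves it both via the formula (1.10) and via numpy's built-in solver.

```python
# A sketch of the line of best fit via least squares (assumes numpy;
# the data points are made up for illustration).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])    # roughly y = 2x + 1, with noise

# The system (1.20) in matrix form: rows are [x_i, 1], unknowns are (m, b).
A = np.column_stack([x, np.ones_like(x)])

# Formula (1.10): x* = (A^T A)^{-1} A^T y, computed via a linear solve.
m, b = np.linalg.solve(A.T @ A, A.T @ y)

# np.linalg.lstsq solves the same problem more stably.
m2, b2 = np.linalg.lstsq(A, y, rcond=None)[0]
assert np.allclose([m, b], [m2, b2])
print(m, b)  # slope and intercept of the line of best fit
```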
As a last note, solving least squares (and similar problems) is easy because it is a so-called convex problem. Convex
problems are easy to solve because any local optimum is a global optimum, which allows us to use a variety of simple
techniques to find global optima. It is generally much more difficult to solve non-convex problems, though we solve a
few during this course.
We discuss much more about convexity and convex problems later in the course.
1.3 Solution Concepts and Notation

On the other hand, in the framework of (1.7) and using the definition of $\Omega$ in (1.6), we may write¹

$$p^\star = \min_{\vec{x} \in \Omega} f_0(\vec{x}).$$

¹For the case where the minimum does not exist, but the infimum is finite, please see Section 1.4.
As an example, consider the two-element set $\Omega = \{0, 1\}$ and $f_0(x) = 3x^2 + 2$. Then $p^\star = \min\{f_0(0), f_0(1)\} = \min\{2, 5\} = 2$. We emphasize that $p^\star$ is a real number, not a vector.
To extract the minimizers, i.e., the points ~x ∈ Ω which minimize f0 (~x), we use the argmin notation, which gives
us the set of arguments which minimize our objective function. Formally, we define:
$$\operatorname*{argmin}_{\vec{x} \in \Omega} f_0(\vec{x}) \doteq \left\{ \vec{x} \in \Omega \;\middle|\; f_0(\vec{x}) = \min_{\vec{u} \in \Omega} f_0(\vec{u}) \right\}. \qquad (1.25)$$
We emphasize that the argmin is a set of vectors, any of which is an optimal solution, i.e., a minimizer, of the optimization problem at hand. It is possible for the argmin to contain zero vectors (in which case the minimum value is not attained and the problem has no global optima), any positive number of vectors, or infinitely many vectors.
Let us consider the same example as before. In particular, consider the two-element set $\Omega = \{0, 1\}$ and $f_0(x) = 3x^2 + 2$. Then $\operatorname*{argmin}_{x \in \Omega} f_0(x) = \{0\}$. But, in different scenarios, the argmin can have zero elements; for example, if $f_0(x) = 3x$, then $\operatorname*{argmin}_{x \in \mathbb{R}} f_0(x) = \emptyset$. And it can have multiple elements; for example, if $f_0(x) = 3x^2(x-1)^2$, then $\operatorname*{argmin}_{x \in \mathbb{R}} f_0(x) = \{0, 1\}$. It can even have infinitely many elements; for example, if $f_0(x) = 0$, then $\operatorname*{argmin}_{x \in \mathbb{R}} f_0(x) = \mathbb{R}$.
Though technically argmin is a set, in the problems we study it usually contains exactly one element. Thus, instead of writing, for example, $\vec{x}^\star \in \operatorname*{argmin}_{\vec{x} \in \Omega} f_0(\vec{x})$, we may also write $\vec{x}^\star = \operatorname*{argmin}_{\vec{x} \in \Omega} f_0(\vec{x})$. The former expression is technically more correct, but both usages are fine when (and only when) the argmin in question contains exactly one element.
1.4 (OPTIONAL) Infimum Versus Minimum

When the minimum of $f_0$ over $\Omega$ is not attained, we instead define the optimal value using the infimum:

$$p^\star = \inf_{\vec{x} \in \Omega} f_0(\vec{x}). \qquad (1.29)$$
However, the argmin retains the same definition. In fact, one can prove that if we replaced the min in the argmin definition (1.25) with inf, this "new" argmin would be exactly equivalent in every case to the "old" argmin, and it is the definition we use henceforth. The analogous quantity to the infimum for maximization — that is, the appropriate generalization of max — is the supremum, denoted sup.
Interested readers are encouraged to consult a real analysis textbook such as [3] for a more comprehensive coverage.
Though we have gone over the technical details here, for the rest of the course we will omit them for simplicity, and
stick to using min and max (meaning inf and sup when the minimum and maximum do not exist).
Chapter 2
• [1] Appendix A.
• [2] Chapters 2, 3, 4, 5.
2.1 Norms
2.1.1 Definitions
Definition 5 (Norm)
Let $V$ be a vector space over $\mathbb{R}$. A function $f \colon V \to \mathbb{R}$ is a norm if:

• Positive definiteness: $f(\vec{x}) \geq 0$ for all $\vec{x} \in V$, and $f(\vec{x}) = 0$ if and only if $\vec{x} = \vec{0}$.

• Absolute homogeneity: $f(\alpha \vec{x}) = |\alpha| f(\vec{x})$ for all $\vec{x} \in V$ and $\alpha \in \mathbb{R}$.

• Triangle inequality: $f(\vec{x} + \vec{y}) \leq f(\vec{x}) + f(\vec{y})$ for all $\vec{x}, \vec{y} \in V$.
We can check that the familiar Euclidean norm $\|\cdot\|_2 \colon \vec{x} \mapsto \sqrt{\sum_{i=1}^n x_i^2}$ satisfies these properties. A generalization of the Euclidean norm is the following very useful class of norms: for $p \geq 1$, the $\ell_p$-norm is defined by

$$\|\vec{x}\|_p \doteq \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}. \qquad (2.1)$$
(a) The Euclidean norm, given by $\|\vec{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$, is an $\ell_p$-norm for $p = 2$. (This is why we gave the subscript 2 to the Euclidean norm previously.)

(b) The $\ell_1$-norm, given by $\|\vec{x}\|_1 = \sum_{i=1}^n |x_i|$, is an $\ell_p$-norm for $p = 1$.

(c) The $\ell_\infty$-norm, given by $\|\vec{x}\|_\infty = \max_{i \in \{1, \ldots, n\}} |x_i|$, is the limit of the $\ell_p$-norms as $p \to \infty$:

$$\|\vec{x}\|_\infty = \lim_{p \to \infty} \|\vec{x}\|_p.$$
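Here is a small numerical check of this limit, assuming numpy and an arbitrary example vector.

```python
# A sketch (assuming numpy) showing ||x||_p approaching ||x||_inf as p grows.
import numpy as np

x = np.array([3.0, -4.0, 1.0])
for p in [1, 2, 4, 10, 100]:
    print(p, np.sum(np.abs(x) ** p) ** (1.0 / p))  # the l_p-norm (2.1)
print("inf", np.max(np.abs(x)))  # the l_inf-norm: max |x_i| = 4
```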
2.1.2 Inequalities
There are a variety of useful inequalities associated with the $\ell_p$-norms. Before we provide them, we will take a moment to discuss the importance of inequalities for optimization.

A priori, it may not be clear why we should care about inequalities; why does it matter whether one arrangement of variables is always greater or less than another? It turns out that such inequalities let us obtain upper and lower bounds on the minimum and maximum of a given quantity, which is exactly what optimization asks for.
With that out of the way, let us get to the first major inequality: the Cauchy-Schwarz inequality, which states that for all $\vec{x}, \vec{y} \in \mathbb{R}^n$,

$$|\vec{x}^\top \vec{y}| \leq \|\vec{x}\|_2 \|\vec{y}\|_2.$$

We can get this result for $\ell_2$-norms. A natural next question is whether we can generalize it to $\ell_p$-norms for $p \neq 2$. It turns out that we can: Hölder's inequality states that for $1 \leq p, q \leq \infty$ with $\frac{1}{p} + \frac{1}{q} = 1$,

$$|\vec{x}^\top \vec{y}| \leq \|\vec{x}\|_p \|\vec{y}\|_q.$$

This inequality collapses to the Cauchy-Schwarz inequality when $p = q = 2$. The proof is out of scope for now since it uses convexity.
It is initially difficult to see how to proceed, so let us simplify the problem to get back onto familiar territory. We start with $p = 2$, so that the problem becomes:

$$\max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_2 \leq 1}} \vec{x}^\top \vec{y}. \qquad (2.10)$$
For $n = 2$, the feasible set and $\vec{y}$ together look like the following:

[Figure: the unit $\ell_2$-ball (a disk) in $\mathbb{R}^2$, together with the vector $\vec{y}$.]
Writing $\vec{x}^\top \vec{y} = \|\vec{x}\|_2 \|\vec{y}\|_2 \cos\theta$, where $\theta$ is the angle between $\vec{x}$ and $\vec{y}$, this term is maximized when $\cos\theta = 1$, or equivalently $\theta = 0$. Thus $\vec{x}$ and $\vec{y}$ must point in the same direction, i.e., $\vec{x}$ is a scalar multiple of $\vec{y}$. And since we want to maximize this dot product, we must choose $\vec{x}$ to maximize $\|\vec{x}\|_2$ subject to the constraint $\|\vec{x}\|_2 \leq 1$. Thus, we choose an $\vec{x}$ which has $\|\vec{x}\|_2 = 1$ and points in the same direction as $\vec{y}$. This gives $\vec{x}^\star = \vec{y} / \|\vec{y}\|_2$. Thus,
$$\max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_2 \leq 1}} \vec{x}^\top \vec{y} = (\vec{x}^\star)^\top \vec{y} = \left( \frac{\vec{y}}{\|\vec{y}\|_2} \right)^\top \vec{y} = \frac{\vec{y}^\top \vec{y}}{\|\vec{y}\|_2} = \frac{\|\vec{y}\|_2^2}{\|\vec{y}\|_2} = \|\vec{y}\|_2. \qquad (2.12)$$
Next consider $p = \infty$. For $n = 2$, the feasible set and $\vec{y}$ together look like the following:

[Figure: the unit $\ell_\infty$-ball (a square) in $\mathbb{R}^2$, together with the vector $\vec{y}$.]
Motivated by this diagram, we see that the constraint $\|\vec{x}\|_\infty \leq 1$ is equivalent to the $2n$ constraints $-1 \leq x_i$ and $x_i \leq 1$. Also, we write out the objective function:

$$\vec{x}^\top \vec{y} = \sum_{i=1}^n x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \qquad (2.14)$$
This problem has an interesting structure that will recur several times in the problems we discuss in this class. Namely, the objective function is the sum of several terms, each of which involves only one $x_i$. And the constraints can be partitioned into groups, where the constraints in each group constrain only one $x_i$. Thus, this problem is separable into $n$ different scalar problems, such that the optimal solutions for each scalar problem form an optimal solution for the vector problem. Namely, the problems are
$$\max_{\substack{x_i \in \mathbb{R} \\ -1 \leq x_i \leq 1}} x_i y_i. \qquad (2.16)$$
We solve this much simpler problem by hand. If $y_i > 0$ then $x_i^\star = 1$; if $y_i < 0$ then $x_i^\star = -1$; and if $y_i = 0$, any feasible $x_i$ is optimal. To summarize, we may take $x_i^\star = \operatorname{sgn}(y_i)$, so that $x_i^\star y_i = |y_i|$.
Putting all the scalar problems together, we see that $\vec{x}^\star = \operatorname{sgn}(\vec{y})$, and the vector problem's optimal value is given by

$$\max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_\infty \leq 1}} \vec{x}^\top \vec{y} = (\vec{x}^\star)^\top \vec{y} = \sum_{i=1}^n x_i^\star y_i = \sum_{i=1}^n \operatorname{sgn}(y_i) y_i = \sum_{i=1}^n |y_i| = \|\vec{y}\|_1. \qquad (2.17)$$
Finally consider $p = 1$. For $n = 2$, the feasible set and $\vec{y}$ together look like the following:

[Figure: the unit $\ell_1$-ball (a diamond) in $\mathbb{R}^2$, together with the vector $\vec{y}$.]
For any feasible $\vec{x}$ (i.e., with $\|\vec{x}\|_1 \leq 1$), we have

$$\begin{aligned}
\vec{x}^\top \vec{y} &\leq \left| \sum_{i=1}^n x_i y_i \right| \leq \sum_{i=1}^n |x_i y_i| && \text{by triangle inequality} \quad (2.21) \\
&= \sum_{i=1}^n |x_i| |y_i| && (2.22) \\
&\leq \sum_{i=1}^n |x_i| \left( \max_{i \in \{1, \ldots, n\}} |y_i| \right) && (2.23) \\
&= \left( \max_{i \in \{1, \ldots, n\}} |y_i| \right) \sum_{i=1}^n |x_i| && (2.24) \\
&= \|\vec{x}\|_1 \|\vec{y}\|_\infty && (2.25) \\
&\leq \|\vec{y}\|_\infty. && (2.26)
\end{aligned}$$

Thus we have

$$\max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_1 \leq 1}} \vec{x}^\top \vec{y} \leq \|\vec{y}\|_\infty. \qquad (2.27)$$
This inequality is actually an equality. To show this, we need to show the reverse inequality

$$\max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_1 \leq 1}} \vec{x}^\top \vec{y} \geq \|\vec{y}\|_\infty.$$

And showing this inequality amounts to choosing, for our fixed $\vec{y}$, an $\vec{x}$ such that $\|\vec{x}\|_1 \leq 1$ and $\vec{x}^\top \vec{y} \geq \|\vec{y}\|_\infty$. This is also called "showing the maximum is attained". To do this, we can find an $\vec{x}$ such that $\|\vec{x}\|_1 \leq 1$ and all the inequalities in the chain are met with equality.
• First, the inequality in (2.21) is a triangle inequality with the absolute value, i.e., $\left| \sum_{i=1}^n x_i y_i \right| \leq \sum_{i=1}^n |x_i y_i|$. To make sure this is an equality, it's enough to make sure that all terms $x_i y_i$ have the same sign or are 0.

• Next, the inequality in (2.23) says that $\sum_{i=1}^n |x_i| |y_i| \leq \sum_{i=1}^n |x_i| \left( \max_{i \in \{1, \ldots, n\}} |y_i| \right)$. The most obvious instance in which this inequality is met with equality is when $|y_i| = \max_{j \in \{1, \ldots, n\}} |y_j|$ for all $i$. But we can't choose $\vec{y}$, as it's fixed, so we can't be assured that this holds. An alternate way in which this holds is that $|x_i| = 0$ for all $i$ for which $|y_i| \neq \max_{j \in \{1, \ldots, n\}} |y_j|$, i.e., $i \notin \operatorname*{argmax}_{j \in \{1, \ldots, n\}} |y_j|$.
• Finally, the inequality in (2.26) says that $\|\vec{x}\|_1 \|\vec{y}\|_\infty \leq \|\vec{y}\|_\infty$; to meet this inequality with equality, it is sufficient to have $\|\vec{x}\|_1 = 1$.

To meet all three of these constraints, we can construct $\vec{x}^\star$ via the following process:

• For each $i \notin \operatorname*{argmax}_{j \in \{1, \ldots, n\}} |y_j|$, set $\tilde{x}_i = 0$, as per the second bullet point above.

• For each $i \in \operatorname*{argmax}_{j \in \{1, \ldots, n\}} |y_j|$, set $\tilde{x}_i = \operatorname{sgn}(y_i)$, so that every term $\tilde{x}_i y_i$ is non-negative, as per the first bullet point above.

• Finally, set $\vec{x}^\star = \tilde{\vec{x}} / \|\tilde{\vec{x}}\|_1$, so that $\|\vec{x}^\star\|_1 = 1$, as per the third bullet point above.

One can check directly that this $\vec{x}^\star$ achieves $(\vec{x}^\star)^\top \vec{y} = \|\vec{y}\|_\infty$.
This pattern, where the $\ell_2$-norm constraint leads to the $\ell_2$-norm objective, the $\ell_\infty$-norm constraint leads to the $\ell_1$-norm objective, and the $\ell_1$-norm constraint leads to the $\ell_\infty$-norm objective, hints at something more general. Indeed, one can show that for $1 \leq p, q \leq \infty$ such that $\frac{1}{p} + \frac{1}{q} = 1$, an $\ell_p$-norm constraint leads to an $\ell_q$-norm objective:

$$\max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_p \leq 1}} \vec{x}^\top \vec{y} = \|\vec{y}\|_q. \qquad (2.30)$$
As before, we can prove this equality by proving the two constituent inequalities. The proof of the first inequality ($\leq$) follows from applying Hölder's inequality to the objective function:

$$\max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_p \leq 1}} \vec{x}^\top \vec{y} \leq \max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_p \leq 1}} \|\vec{x}\|_p \|\vec{y}\|_q = \|\vec{y}\|_q \cdot \max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_p \leq 1}} \|\vec{x}\|_p = \|\vec{y}\|_q. \qquad (2.32)$$
The second inequality ($\geq$) can follow if, for our fixed choice of $\vec{y}$, we produce some $\vec{x}$ such that $\|\vec{x}\|_p \leq 1$ and $\vec{x}^\top \vec{y} \geq \|\vec{y}\|_q$, i.e., "the maximum is attained". This is more complicated to do, and we won't do it here.
The above equality (2.30) means that the norms $\|\cdot\|_p$ and $\|\cdot\|_q$ are so-called dual norms. We will explore aspects of duality later in the course, though frankly we are just scratching the surface.
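As a sanity check, the following sketch (assuming numpy, on a random vector) evaluates the maximizers constructed above for the three pairs $(p, q) \in \{(2, 2), (\infty, 1), (1, \infty)\}$ and confirms each attains the dual norm.

```python
# A numerical check (assuming numpy) of (2.30) via the maximizers above.
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(5)

# p = 2, q = 2: x* = y / ||y||_2 attains ||y||_2, per (2.12).
x = y / np.linalg.norm(y, 2)
assert np.isclose(x @ y, np.linalg.norm(y, 2))

# p = inf, q = 1: x* = sgn(y) attains ||y||_1, per (2.17).
x = np.sign(y)
assert np.isclose(x @ y, np.linalg.norm(y, 1))

# p = 1, q = inf: put all weight on a coordinate of largest |y_i|.
i = np.argmax(np.abs(y))
x = np.zeros_like(y)
x[i] = np.sign(y[i])
assert np.isclose(x @ y, np.linalg.norm(y, np.inf))
```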
These problems, which are short and easy to state, contain a couple of core ideas within their solutions, ideas which generalize broadly to many optimization problems. For your convenience, we discuss these explicitly below.
Problem Solving Strategy 11 (Separating Vector Problems into Scalar Problems). When trying to simplify an optimization problem, try to see if you can separate it into several independent scalar problems. Then solve each scalar problem — this is usually much easier than solving the whole vector problem at once. The optimal solutions to each scalar problem will then form the optimal solution to the whole vector problem.
Problem Solving Strategy 12 (Proving Optimality in an Optimization Problem). To solve an optimization problem,
you can use inequalities to bound the objective function, and then try to show that this bound is tight by finding a
feasible choice of optimization variable which makes all the inequalities into equalities.
2.2 Gram-Schmidt and QR Decomposition

Given linearly independent vectors $\vec{a}_1, \vec{a}_2, \ldots$, the Gram-Schmidt algorithm constructs orthonormal vectors $\vec{q}_1, \vec{q}_2, \ldots$ spanning the same subspace. We start with $\vec{a}_1$ and $\vec{a}_2$:

[Figure: two linearly independent vectors $\vec{a}_1$ and $\vec{a}_2$ based at $\vec{0}$.]
To form $\vec{q}_1$, we simply normalize: $\vec{q}_1 = \vec{a}_1 / \|\vec{a}_1\|_2$. This satisfies what we need from $\vec{q}_1$:

• it's orthogonal to all the $\vec{q}_i$ which came before it — which is none of them, so we don't have to worry; and

• it's normalized.

[Figure: $\vec{q}_1$, the normalization of $\vec{a}_1$, pointing along $\vec{a}_1$.]
Then we go to $\vec{a}_2$. To find $\vec{q}_2$ which is orthogonal to all the $\vec{q}_i$ before it — that is, $\vec{q}_1$ — we subtract off the orthogonal projection of $\vec{a}_2$ onto $\vec{q}_1$ from $\vec{a}_2$. The orthogonal projection of $\vec{a}_2$ onto $\vec{q}_1$ is given by

$$\vec{p}_2 \doteq \vec{q}_1 (\vec{q}_1^\top \vec{a}_2), \qquad (2.34)$$

and the residual is

$$\vec{s}_2 \doteq \vec{a}_2 - \vec{p}_2. \qquad (2.35)$$

Note that these formulas only hold because $\vec{q}_1$ is normalized, i.e., has norm 1.
[Figure: $\vec{a}_2$ decomposed into its projection $\vec{p}_2$ along $\vec{q}_1$ and the residual $\vec{s}_2$ perpendicular to $\vec{q}_1$.]
While $\vec{s}_2$ is orthogonal to $\vec{q}_1$, because we want a $\vec{q}_2$ that is normalized, we normalize $\vec{s}_2$ to get $\vec{q}_2$:

$$\vec{q}_2 \doteq \frac{\vec{s}_2}{\|\vec{s}_2\|_2}. \qquad (2.36)$$
[Figure: as before, now with $\vec{q}_2$, the normalization of $\vec{s}_2$, orthogonal to $\vec{q}_1$.]
If we had a vector $\vec{a}_3$ (and weren't limited by drawing in 2D space), we would ensure that $\vec{q}_3$ were orthogonal to $\vec{q}_1$ and $\vec{q}_2$, as well as normalized, in a similar way as before. First we would compute the projection

$$\vec{p}_3 \doteq \vec{q}_1 (\vec{q}_1^\top \vec{a}_3) + \vec{q}_2 (\vec{q}_2^\top \vec{a}_3). \qquad (2.37)$$
These projection formulas only hold because $\{\vec{q}_1, \vec{q}_2\}$ is an orthonormal set. And then we could compute the residual and normalize:

$$\vec{s}_3 \doteq \vec{a}_3 - \vec{p}_3, \qquad \vec{q}_3 \doteq \frac{\vec{s}_3}{\|\vec{s}_3\|_2}. \qquad (2.39)$$
And so on. The general algorithm proceeds similarly.
This algorithm has the following two properties, which you can formally prove as an exercise.
In particular, {~a1 , . . . , ~ak } spans the same subspace as {~q1 , . . . , ~qk }, as was stated in our original goal.
The Gram-Schmidt algorithm leads to something called the QR decomposition. Because, for each $i$, we have $\operatorname{span}(\vec{a}_1, \ldots, \vec{a}_i) = \operatorname{span}(\vec{q}_1, \ldots, \vec{q}_i)$, we can write $\vec{a}_i$ as a linear combination of the $\vec{q}_j$:

$$\vec{a}_i = r_{1i} \vec{q}_1 + r_{2i} \vec{q}_2 + \cdots + r_{ii} \vec{q}_i = \sum_{j=1}^i r_{ji} \vec{q}_j. \qquad (2.41)$$
More generally, we can decompose every tall matrix with full column rank as a product $A = QR$ of a tall matrix $Q$ with orthonormal columns and an upper-triangular matrix $R$.
As a final note, there are various alterations to the QR decomposition that work for matrices which are wide and/or do
not have full column rank. Those are out of scope, but the idea is the same.
The QR decomposition is also relevant in numerical linear algebra, where it can be used to solve tall linear systems
A~x = ~y efficiently, especially if the underlying matrix A has special structure. All such connections are out of scope.
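As a concrete sketch of the procedure (assuming numpy), the following runs Gram-Schmidt on the columns of a tall full-column-rank matrix and assembles the corresponding Q and R.

```python
# A minimal numpy sketch of Gram-Schmidt, assembling the QR decomposition.
import numpy as np

def gram_schmidt_qr(A):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        R[:i, i] = Q[:, :i].T @ A[:, i]    # coefficients r_{ji} for j < i
        s = A[:, i] - Q[:, :i] @ R[:i, i]  # subtract the projection p_i
        R[i, i] = np.linalg.norm(s)        # r_{ii} = ||s_i||_2 ...
        Q[:, i] = s / R[i, i]              # ... and normalize to get q_i
    return Q, R

A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(2))
# np.linalg.qr computes the same factorization (up to column signs).
```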
• Every vector ~x ∈ Rn can be written as ~x = ~x1 + ~x2 , where ~x1 ∈ U and ~x2 ∈ V .
• Furthermore, this decomposition is unique, in the sense that if ~x = ~x1 + ~x2 = ~y1 + ~y2 are two instances of
the above decomposition, then ~x1 = ~y1 and ~x2 = ~y2 .
Note that we cannot replace $R(A^\top)$ by $R(A)$, since vectors in $R(A)$ and $N(A)$ do not even have the same number of entries or lie in the same Euclidean space. If we want to make a statement about $R(A)$, we can replace $A$ by $A^\top$ in the above theorem to get the following corollary.
To prove the fundamental theorem of linear algebra, we use a tool called the orthogonal decomposition theorem.
To prove this claim, suppose first that $U \oplus V = \mathbb{R}^n$. Then every vector $\vec{x} \in \mathbb{R}^n$ can be written as $\vec{x} = \vec{x}_1 + \vec{x}_2$, where $\vec{x}_1 \in U$ and $\vec{x}_2 \in V$. It remains to prove that $U \cap V = \{\vec{0}\}$. Suppose for the sake of contradiction that there exists $\vec{y} \neq \vec{0}$ such that $\vec{y} \in U \cap V$. Then

$$\vec{x} = \vec{x}_1 + \vec{x}_2 \qquad \text{and} \qquad \vec{x} = (\vec{x}_1 + \vec{y}) + (\vec{x}_2 - \vec{y}) \qquad (2.47)$$

are two distinct ways to write $\vec{x}$ as the sum of vectors from $U$ and $V$, so it cannot be true that $U \oplus V = \mathbb{R}^n$, a contradiction.
Towards the other direction, suppose that every vector $\vec{x} \in \mathbb{R}^n$ can be written as $\vec{x} = \vec{x}_1 + \vec{x}_2$, where $\vec{x}_1 \in U$ and $\vec{x}_2 \in V$, and $U \cap V = \{\vec{0}\}$. The only thing remaining to prove is that if

$$\vec{x} = \vec{x}_1 + \vec{x}_2 = \vec{z}_1 + \vec{z}_2,$$

where $\vec{x}_1, \vec{z}_1 \in U$ and $\vec{x}_2, \vec{z}_2 \in V$, then we must have $\vec{x}_1 = \vec{z}_1$ and $\vec{x}_2 = \vec{z}_2$. Suppose again for the sake of contradiction that there exists $\vec{x} \in \mathbb{R}^n$, $\vec{x}_1, \vec{z}_1 \in U$, and $\vec{x}_2, \vec{z}_2 \in V$ such that $\vec{x} = \vec{x}_1 + \vec{x}_2 = \vec{z}_1 + \vec{z}_2$ but $\vec{x}_1 \neq \vec{z}_1$. Rearranging gives $\vec{x}_1 - \vec{z}_1 = \vec{z}_2 - \vec{x}_2 \neq \vec{0}$. Since $\vec{x}_1, \vec{z}_1 \in U$, we have $\vec{x}_1 - \vec{z}_1 \in U$, and since $\vec{x}_2, \vec{z}_2 \in V$, we have $\vec{z}_2 - \vec{x}_2 \in V$. Since these two vectors are equal and nonzero, we have $\vec{x}_1 - \vec{z}_1 \in U \cap V$ and nonzero. Thus $U \cap V \neq \{\vec{0}\}$, a contradiction.
This proves the above claim. Now to prove the actual theorem, we note that every vector $\vec{x} \in \mathbb{R}^n$ can be written as

$$\vec{x} = \operatorname{proj}_S(\vec{x}) + (\vec{x} - \operatorname{proj}_S(\vec{x})).$$

By definition, $\operatorname{proj}_S(\vec{x}) \in S$, and because the projection residual is orthogonal to the subspace, we have $\vec{x} - \operatorname{proj}_S(\vec{x}) \in S^\perp$. Thus every vector in $\mathbb{R}^n$ can be written as the sum of a vector in $S$ and a vector in $S^\perp$. It is an exercise to show that $S \cap S^\perp = \{\vec{0}\}$. Invoking the claim above completes the proof.
Using this theorem, the only thing we need to show to prove the fundamental theorem of linear algebra is that $N(A)$ and $R(A^\top)$ are orthogonal complements. We do this below.
Proof of Theorem 16. By Theorem 19, the only thing we need to show is that $N(A) = R(A^\top)^\perp$. This is a set equality, which we prove by showing containment in both directions.

We first want to show that $N(A) \subseteq R(A^\top)^\perp$. That is, we want to show that for any $\vec{x} \in N(A)$ we have $\vec{x} \in R(A^\top)^\perp$. That is, for any $\vec{y} \in R(A^\top)$, we want to show that $\vec{y}^\top \vec{x} = 0$.

Since $\vec{y} \in R(A^\top)$ we can write $\vec{y} = A^\top \vec{w}$ for some $\vec{w} \in \mathbb{R}^m$. Then, since $\vec{x} \in N(A)$ we have $A\vec{x} = \vec{0}$, so

$$\begin{aligned}
\vec{y}^\top \vec{x} &= (A^\top \vec{w})^\top \vec{x} && (2.54) \\
&= \vec{w}^\top A \vec{x} && (2.55) \\
&= \vec{w}^\top \vec{0} && (2.56) \\
&= 0. && (2.57)
\end{aligned}$$

Thus $\vec{x}$ and $\vec{y}$ are orthogonal, so $\vec{x} \in R(A^\top)^\perp$, which shows that $N(A) \subseteq R(A^\top)^\perp$.
We now want to show that $R(A^\top)^\perp \subseteq N(A)$. That is, for any $\vec{x} \in R(A^\top)^\perp$, we want to show that $\vec{x} \in N(A)$, i.e., that $A\vec{x} = \vec{0}$.

By definition, for every $\vec{y} \in R(A^\top)$, we have $\vec{y}^\top \vec{x} = 0$. By writing $\vec{y} = A^\top \vec{w}$ for arbitrary $\vec{w} \in \mathbb{R}^m$, we get that for every $\vec{w} \in \mathbb{R}^m$ we have $(A^\top \vec{w})^\top \vec{x} = 0$. But the left-hand side is $\vec{w}^\top A \vec{x}$, so we have that $\vec{w}^\top A \vec{x} = 0$ for every $\vec{w} \in \mathbb{R}^m$. Since this is true for all $\vec{w} \in \mathbb{R}^m$, it is true for the specific choice $\vec{w} = A\vec{x}$, which yields

$$\begin{aligned}
0 &= \vec{w}^\top A \vec{x} && (2.58) \\
&= (A\vec{x})^\top A \vec{x} && (2.59) \\
&= \|A\vec{x}\|_2^2 && (2.60) \\
\implies A\vec{x} &= \vec{0}. && (2.61)
\end{aligned}$$

Thus, we have shown that $N(A) = R(A^\top)^\perp$, and so by Theorem 19 we have $N(A) \oplus R(A^\top) = \mathbb{R}^n$.
This will help us solve a very important optimization problem, which is considered "dual" to least squares in some sense. Recall that least squares helps us find an approximate solution to the linear system $A\vec{x} = \vec{y}$ when $A$ is a tall matrix with full column rank. In other words, the linear system is over-determined: there are many more equations than unknowns, and there are generally no exact solutions, so we pick the solution with minimum squared error.

What about when $A$ is a wide matrix with full row rank? There are now more unknowns than equations, and infinitely many exact solutions. So how do we pick one solution in particular? It really depends on which engineering problem we are solving. One common choice is the minimum-energy or minimum-norm solution, which is the solution to the optimization problem:
$$\begin{aligned}
\min_{\vec{x} \in \mathbb{R}^n} \quad & \|\vec{x}\|_2^2 && (2.62) \\
\text{s.t.} \quad & A\vec{x} = \vec{y}.
\end{aligned}$$
Note that this principle of choosing the smallest or simplest solution — the "Occam's Razor" principle — generalizes far beyond finding solutions to linear systems, and is used within control theory and machine learning. But we deal with just this linear system case for now.
Let $A \in \mathbb{R}^{m \times n}$ have full row rank, and let $\vec{y} \in \mathbb{R}^m$. Then the solution to (2.62), i.e., the solution to

$$\begin{aligned}
\min_{\vec{x} \in \mathbb{R}^n} \quad & \|\vec{x}\|_2^2 \\
\text{s.t.} \quad & A\vec{x} = \vec{y},
\end{aligned}$$

is given by

$$\vec{x}^\star = A^\top (A A^\top)^{-1} \vec{y}. \qquad (2.63)$$
Proof. Observe that the constraint $A\vec{x} = \vec{y}$ under-specifies $\vec{x}$ — in particular, any component of $\vec{x}$ in $N(A)$ does not affect the constraint, only the objective. In this sense, it is "wasteful", and we should intuitively remove it. This motivates using Theorem 16 to decompose $\vec{x}$ into a component inside $N(A)$ — which we want to remove — and a component inside $R(A^\top)$ — which we will optimize over.

Accordingly, write $\vec{x} = \vec{u} + \vec{v}$, where $\vec{u} \in N(A)$ and $\vec{v} = A^\top \vec{w} \in R(A^\top)$ for some $\vec{w} \in \mathbb{R}^m$. The constraint becomes

$$\begin{aligned}
\vec{y} &= A\vec{x} && (2.64) \\
&= A(\vec{u} + \vec{v}) && (2.65) \\
&= A\vec{u} + A\vec{v} && (2.66) \\
&= \vec{0} + A A^\top \vec{w} && (2.67) \\
&= A A^\top \vec{w}. && (2.68)
\end{aligned}$$
Meanwhile, the objective becomes

$$\begin{aligned}
\|\vec{x}\|_2^2 &= \|\vec{u} + \vec{v}\|_2^2 && (2.69) \\
&= \vec{u}^\top \vec{u} + 2 \vec{u}^\top \vec{v} + \vec{v}^\top \vec{v} && (2.70) \\
&= \|\vec{u}\|_2^2 + 2 \vec{v}^\top \vec{u} + \|\vec{v}\|_2^2 && (2.71) \\
&= \|\vec{u}\|_2^2 + 2 (A^\top \vec{w})^\top \vec{u} + \|\vec{v}\|_2^2 && (2.72) \\
&= \|\vec{u}\|_2^2 + 2 \vec{w}^\top A \vec{u} + \|\vec{v}\|_2^2 && (2.73) \\
&= \|\vec{u}\|_2^2 + 2 \vec{w}^\top \vec{0} + \|\vec{v}\|_2^2 && (2.74) \\
&= \|\vec{u}\|_2^2 + 2 \cdot 0 + \|\vec{v}\|_2^2 && (2.75) \\
&= \|\vec{u}\|_2^2 + \|\vec{v}\|_2^2 && (2.76) \\
&= \|\vec{u}\|_2^2 + \|A^\top \vec{w}\|_2^2. && (2.77)
\end{aligned}$$

Thus the optimization problem can be rewritten in terms of $\vec{u}$ and $\vec{w}$:

$$\begin{aligned}
\min_{\vec{u}, \vec{w}} \quad & \|\vec{u}\|_2^2 + \|A^\top \vec{w}\|_2^2 \\
\text{s.t.} \quad & \vec{y} = A A^\top \vec{w} \\
& A\vec{u} = \vec{0}.
\end{aligned}$$
Now, because $A$ has full row rank, $A A^\top$ is invertible, so the first constraint implies that $\vec{w}^\star = (A A^\top)^{-1} \vec{y}$, and hence $\vec{v}^\star = A^\top \vec{w}^\star = A^\top (A A^\top)^{-1} \vec{y}$. And because we are trying to minimize the objective, which involves $\vec{u}$ only through $\|\vec{u}\|_2^2$, the ideal solution is to set $\vec{u}^\star = \vec{0}$, which also satisfies the second constraint and so is feasible. Thus $\vec{x}^\star = \vec{v}^\star = A^\top (A A^\top)^{-1} \vec{y}$ as desired.
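Here is a quick numerical sketch (assuming numpy, with a random wide matrix) checking that (2.63) is feasible and agrees with numpy's pseudoinverse.

```python
# A sketch (assuming numpy) of the minimum-norm solution (2.63).
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))   # wide; full row rank with probability 1
y = rng.standard_normal(3)

x_star = A.T @ np.linalg.solve(A @ A.T, y)   # x* = A^T (A A^T)^{-1} y
assert np.allclose(A @ x_star, y)                  # exactly feasible
assert np.allclose(x_star, np.linalg.pinv(A) @ y)  # agrees with pinv
```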
2.4 Symmetric Matrices

Example 23 (Covariance Matrices). Any matrix of the form $A = B B^\top$, such as the covariance matrices we will discuss in the next section, is a symmetric matrix, since

$$A^\top = (B B^\top)^\top = (B^\top)^\top B^\top = B B^\top = A.$$
Example 24 (Adjacency Matrix). Consider an undirected connected graph $G = (V, E)$, for example the following:

[Figure: a 4-node graph with vertices 1, 2, 3, 4 and edges (1,2), (2,3), (3,4), (4,1).]
Its adjacency matrix $A$ has coordinate $A_{ij} = 1$ if $(i, j) \in E$, and $A_{ij} = 0$ otherwise; in the above example, we have

$$A = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}. \qquad (2.80)$$
Since the graph is undirected, (i, j) ∈ E if and only if (j, i) ∈ E, so Aij = Aji , and so A is a symmetric matrix.
Why do we care about symmetric matrices? Symmetric matrices have two nice properties: real eigenvalues, and guaranteed diagonalizability.

In general, a (non-symmetric) matrix need not be diagonalizable. For example, the matrix $A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ is not diagonalizable. How can we characterize the diagonalizability of a matrix, then?
First, we will need the following definitions.
Definition 25 (Multiplicities)
Let A ∈ Rn×n , and let λ be an eigenvalue of A.
(a) The algebraic multiplicity $\mu$ of eigenvalue $\lambda$ in $A$ is the number of times $\lambda$ is a root of the characteristic polynomial $p_A(x) \doteq \det(x I - A)$ of $A$, i.e., it is the power of $(x - \lambda)$ in the factorization of $p_A(x)$.
(b) The geometric multiplicity $\phi$ of eigenvalue $\lambda$ in $A$ is the dimension of the null space $\Phi \doteq N(\lambda I - A)$.
Theorem 26 (Diagonalizability)
A square matrix A ∈ Rn×n is diagonalizable if and only if every eigenvalue of A has equal algebraic and geometric
multiplicities.
" #
1 1
Example 27 (Multiplicities of Degenerate Matrix). We were earlier told that the matrix A = is not diagonal-
0 1
izable. To check this, let us compute its eigenvalues, algebraic multiplicities, and geometric multiplicities.
First, its characteristic polynomial is
Thus, A has only one eigenvalue λ = 1. Since (x − 1) has power 2 in the factorization of pA , the eigenvalue λ = 1
has algebraic multiplicity µ = 2.
The corresponding null space is
Φ = N (λI − A) (2.84)
" #!
1 − 1 −1
=N (2.85)
0 1−1
" #!
0 −1
=N (2.86)
0 0
" #!
1
= span (2.87)
0
which has dimension φ = 1. Thus, for λ = 1, we have µ 6= φ and the matrix is indeed not diagonalizable.
Theorem 28
Let $A \in \mathcal{S}^n$ be a symmetric matrix. Then:

(a) All eigenvalues of $A$ are real.

(b) Eigenspaces corresponding to different eigenvalues are orthogonal: $\Phi_i$ and $\Phi_j$ are orthogonal subspaces, i.e., for every $\vec{p}_i \in \Phi_i$ and $\vec{p}_j \in \Phi_j$ we have $\vec{p}_i^\top \vec{p}_j = 0$.

(c) $A$ is diagonalizable.

(d) $A$ is orthonormally diagonalizable; there exists an orthonormal matrix $U \in \mathbb{R}^{n \times n}$ and diagonal matrix $\Lambda \in \mathbb{R}^{n \times n}$ such that $A = U \Lambda U^\top$.
Recall that orthonormal matrices are matrices whose columns are orthonormal, i.e., are pairwise orthogonal and unit-norm. Orthonormal matrices $U$ have the nice property that $U^\top U = I$, and if $U$ is square, then $U^\top = U^{-1}$.
Proof of Theorem 28. Part (a) might be left to homework; part (b) will definitely be left to homework; we prove parts (c)
and (d) here. In particular, we assume that parts (a) and (b) are true, and attempt to prove (d). Note that (d) implies (c),
as an orthonormal diagonalization is a type of diagonalization, and so the existence of an orthonormal diagonalization
must require the algebraic and geometric multiplicities to be equal.
Our proof strategy is to use induction on $n$, the size of the matrix. The base case of our induction is $1 \times 1$ matrices, for which the diagonalization is trivial. Now consider the inductive step. Our hope is, given $A \in \mathcal{S}^n$ which has eigenvalue $\lambda$, to get a decomposition of the form

$$A = V \begin{bmatrix} \lambda & \vec{0}^\top \\ \vec{0} & B \end{bmatrix} V^\top \qquad \text{or equivalently} \qquad V^\top A V = \begin{bmatrix} \lambda & \vec{0}^\top \\ \vec{0} & B \end{bmatrix}, \qquad (2.88)$$

where $V \in \mathbb{R}^{n \times n}$ is orthonormal and $B \in \mathcal{S}^{n-1}$ is symmetric. If we can do that, then we can inductively diagonalize $B = W \Gamma W^\top$, and finally use that to construct a diagonalization $A = U \Lambda U^\top$, where $U \in \mathbb{R}^{n \times n}$ is orthonormal and $\Lambda \doteq \begin{bmatrix} \lambda & \vec{0}^\top \\ \vec{0} & \Gamma \end{bmatrix}$.
Let $\vec{u}$ be a unit-norm eigenvector of $A$ corresponding to eigenvalue $\lambda$. Remember that we want an orthonormal matrix $V \in \mathbb{R}^{n \times n}$ which "isolates" $\lambda$. This motivates using $\vec{u}$ and a basis of the orthogonal complement of $\operatorname{span}(\vec{u})$ to form $V$. To construct this matrix $V$, we run Gram-Schmidt on the columns of the matrix $\begin{bmatrix} \vec{u} & I \end{bmatrix} \in \mathbb{R}^{n \times (n+1)}$, throwing out the single vector which will have 0 projection residual (there must be exactly one such vector by a counting argument; to get $n$ linearly independent vectors from a spanning set of $n + 1$ vectors, we need to remove exactly one vector), and obtaining the orthonormal matrix $V = \begin{bmatrix} \vec{u} & V_1 \end{bmatrix} \in \mathbb{R}^{n \times n}$, where $V_1 \in \mathbb{R}^{n \times (n-1)}$ is itself orthonormal. By construction, we have $V_1^\top \vec{u} = \vec{0}$ and $\vec{u}^\top V_1 = \vec{0}^\top$. Thus,
$$\begin{aligned}
V^\top A V &= \begin{bmatrix} \vec{u} & V_1 \end{bmatrix}^\top A \begin{bmatrix} \vec{u} & V_1 \end{bmatrix} && (2.89) \\
&= \begin{bmatrix} \vec{u}^\top \\ V_1^\top \end{bmatrix} A \begin{bmatrix} \vec{u} & V_1 \end{bmatrix} && (2.90) \\
&= \begin{bmatrix} \vec{u}^\top \\ V_1^\top \end{bmatrix} \begin{bmatrix} A\vec{u} & A V_1 \end{bmatrix} && (2.91) \\
&= \begin{bmatrix} \vec{u}^\top \\ V_1^\top \end{bmatrix} \begin{bmatrix} \lambda \vec{u} & A V_1 \end{bmatrix} && (2.92) \\
&= \begin{bmatrix} \lambda \vec{u}^\top \vec{u} & \vec{u}^\top A V_1 \\ \lambda V_1^\top \vec{u} & V_1^\top A V_1 \end{bmatrix} && (2.93) \\
&= \begin{bmatrix} \lambda \|\vec{u}\|_2^2 & (A^\top \vec{u})^\top V_1 \\ \vec{0} & V_1^\top A V_1 \end{bmatrix} && (2.94) \\
&= \begin{bmatrix} \lambda & (A\vec{u})^\top V_1 \\ \vec{0} & V_1^\top A V_1 \end{bmatrix} && (2.95) \\
&= \begin{bmatrix} \lambda & \lambda \vec{u}^\top V_1 \\ \vec{0} & V_1^\top A V_1 \end{bmatrix} && (2.96) \\
&= \begin{bmatrix} \lambda & \vec{0}^\top \\ \vec{0} & V_1^\top A V_1 \end{bmatrix} && (2.97) \\
&= \begin{bmatrix} \lambda & \vec{0}^\top \\ \vec{0} & B \end{bmatrix}, && (2.98)
\end{aligned}$$

where $B \doteq V_1^\top A V_1$, in accordance with our proof outline. Now we need to check that $B$ is symmetric; indeed, we have

$$B^\top = (V_1^\top A V_1)^\top = V_1^\top A^\top V_1 = V_1^\top A V_1 = B.$$
By induction, we can orthonormally diagonalize $B$ as $B = W \Gamma W^\top$, where $W \in \mathbb{R}^{(n-1) \times (n-1)}$ is orthonormal and $\Gamma \in \mathbb{R}^{(n-1) \times (n-1)}$ is diagonal. Thus, by using $W^{-1} = W^\top$, we have

$$\begin{aligned}
\Gamma &= W^\top B W && (2.103) \\
&= W^\top V_1^\top A V_1 W && (2.104) \\
&= (V_1 W)^\top A (V_1 W). && (2.105)
\end{aligned}$$
" #
λ ~0>
We want an orthonormal matrix U ∈ Rn×n such that U > AU = Λ = . Thus, the above calculation motivates
~0 Γ
h i
the choice U = ~u V1 W ∈ Rn×n . Thus
h i> h i
U > AU = ~u V1 W A ~u V1 W (2.106)
" #
~u> h i
= A ~u V 1 W (2.107)
W > V1>
" #
~u> h i
= > >
A~u AV1 W (2.108)
W V1
" #
~u> h i
= λ~
u AV 1 W (2.109)
W > V1>
" #
λ~u> ~u ~u> AV1 W
= (2.110)
λW > V1> ~u W > V1> AV1 W
" #
2
λ k~uk2 (A> ~u)> V1 W
= (2.111)
λW >~0 W > BW
" #
λ (A~u)> V1 W
= (2.112)
~0 Γ
" #
λ λ~u> V1 W
= (2.113)
~0 Γ
" #
λ λ~0> W
= (2.114)
~0 Γ
" #
λ ~0>
= (2.115)
~0 Γ
= Λ, (2.116)
as desired. Thus A = U ΛU > is an orthonormal diagonalization of A. This proves (d), and hence (c).
One nice thing about diagonalization is that we can read off the eigenvalues and eigenvectors from the components of the diagonalization.

Proposition 29
Let $A \in \mathcal{S}^n$ have orthonormal diagonalization $A = U \Lambda U^\top$, where $U = \begin{bmatrix} \vec{u}_1 & \cdots & \vec{u}_n \end{bmatrix} \in \mathbb{R}^{n \times n}$ is square orthonormal, and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^{n \times n}$ is diagonal. Then for each $i$, the pair $(\lambda_i, \vec{u}_i)$ is an eigenvalue-eigenvector pair for $A$.

Proof. We have the chain of equivalences

$$\begin{aligned}
A &= U \Lambda U^\top && (2.117) \\
A U &= U \Lambda && (2.118) \\
A \begin{bmatrix} \vec{u}_1 & \cdots & \vec{u}_n \end{bmatrix} &= \begin{bmatrix} \vec{u}_1 & \cdots & \vec{u}_n \end{bmatrix} \operatorname{diag}(\lambda_1, \ldots, \lambda_n) && (2.119) \\
\begin{bmatrix} A \vec{u}_1 & \cdots & A \vec{u}_n \end{bmatrix} &= \begin{bmatrix} \lambda_1 \vec{u}_1 & \cdots & \lambda_n \vec{u}_n \end{bmatrix}, && (2.120)
\end{aligned}$$

and comparing the columns of the last equality gives $A \vec{u}_i = \lambda_i \vec{u}_i$ for each $i$.
Using this, we can work with another nice property of the orthonormal diagonalization. Namely, we can read off bases for $N(A)$ and $R(A)$. That is, a basis for $N(A)$ is the set of eigenvectors $\vec{u}_i$ corresponding to the eigenvalues $\lambda_i$ of $A$ which are equal to 0. Since $U$ is orthonormal, the remaining eigenvectors $\vec{u}_i$ span the orthogonal complement of $N(A)$. But by the fundamental theorem of linear algebra (Theorem 16), we have $N(A)^\perp = R(A^\top) = R(A)$, so these eigenvectors form a basis for $R(A)$. Soon, we'll discover the singular value decomposition, which allows for this kind of decomposition of a matrix into its range and null spaces, except for arbitrary matrices.
Before we get into those, we will first state and solve a quick optimization problem which yields the eigenvalues of a symmetric matrix. This optimization problem turns out to be quite useful for further study of optimization.

Theorem 30
Let $A \in \mathcal{S}^n$. Then

$$\lambda_{\max}\{A\} = \max_{\substack{\vec{x} \in \mathbb{R}^n \\ \vec{x} \neq \vec{0}}} \frac{\vec{x}^\top A \vec{x}}{\vec{x}^\top \vec{x}} = \max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_2 = 1}} \vec{x}^\top A \vec{x}, \qquad (2.121)$$

$$\lambda_{\min}\{A\} = \min_{\substack{\vec{x} \in \mathbb{R}^n \\ \vec{x} \neq \vec{0}}} \frac{\vec{x}^\top A \vec{x}}{\vec{x}^\top \vec{x}} = \min_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_2 = 1}} \vec{x}^\top A \vec{x}. \qquad (2.122)$$

The term $\frac{\vec{x}^\top A \vec{x}}{\vec{x}^\top \vec{x}}$ is called the Rayleigh quotient of $A$; it is a function of $\vec{x} \in \mathbb{R}^n$.
Proof. Before we start trying to prove any equalities, let us simplify the crucial term $\vec{x}^\top A \vec{x}$; the intuition is that it is the term most amenable to the orthonormal diagonalization. Writing $A = U \Lambda U^\top$, we have

$$\vec{x}^\top A \vec{x} = \vec{x}^\top U \Lambda U^\top \vec{x} = (U^\top \vec{x})^\top \Lambda (U^\top \vec{x}) = \vec{y}^\top \Lambda \vec{y} = \sum_{i=1}^n \lambda_i y_i^2,$$

with the invertible change of variables $\vec{y} \doteq U^\top \vec{x} \iff \vec{x} = U \vec{y}$. Also we note that this change of variables preserves the norm, i.e.,

$$\|\vec{y}\|_2^2 = \|U^\top \vec{x}\|_2^2 = \vec{x}^\top U U^\top \vec{x} = \vec{x}^\top \vec{x} = \|\vec{x}\|_2^2. \qquad (2.127)$$
We now turn to the first equality chain (with max). Because the norm of our optimization variable $\vec{x}$ does not matter, in that it only affects the objective through its normalization $\vec{x} / \|\vec{x}\|_2$, it is equivalent to optimize over only unit-norm $\vec{x}$, so

$$\max_{\substack{\vec{x} \in \mathbb{R}^n \\ \vec{x} \neq \vec{0}}} \frac{\vec{x}^\top A \vec{x}}{\vec{x}^\top \vec{x}} = \max_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_2 = 1}} \vec{x}^\top A \vec{x}. \qquad (2.130)$$
With the invertible change of variables ~y = U > ~x already discussed, we can write
n
(2.131)
X
maxn ~x> A~x = maxn λi yi2
x∈R
~ y ∈R
~
k~
xk2 =1 y k2 =1 i=1
k~
n
(2.132)
X
≤ λmax {A} · maxn yi2
y ∈R
~
y k2 =1 i=1
k~
(2.133)
2
= λmax {A} · maxn k~y k2
y ∈R
~
k~
y k2 =1
It is left to exhibit a ~y which makes this inequality an equality; indeed, it is achieved when yi = 1 for one i such that
λi {A} = λmax {A} and yi = 0 otherwise. The achieving ~x can be recovered by ~x = U~y . Note that since this ~y is a
standard basis vector ~y = ~ei for some i such that λi {A} = λmax {A}, then ~x = U~y = U~ei = ~ui , i.e., the ith column
of U , is an eigenvector of A corresponding to the maximum eigenvalue of A.
The analysis for λmin {A} goes exactly analogously.
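A quick numerical check of (2.121), assuming numpy and a random symmetric matrix:

```python
# A sketch (assuming numpy) checking the Rayleigh quotient bound (2.121).
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                    # a random symmetric matrix

lams, U = np.linalg.eigh(A)          # eigh returns ascending eigenvalues
u_max = U[:, -1]                     # unit eigenvector for lambda_max
assert np.isclose(u_max @ A @ u_max, lams[-1])  # the maximum is attained

for _ in range(1000):                # random unit vectors never beat it
    x = rng.standard_normal(4)
    x /= np.linalg.norm(x)
    assert x @ A @ x <= lams[-1] + 1e-12
```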
This characterization motivates defining a new sub-class (or really several new sub-classes) of matrices.
Definition 31 (Positive (Semi)Definite Matrices)
A symmetric matrix $A \in \mathcal{S}^n$ is positive semidefinite (PSD), written $A \in \mathcal{S}^n_+$, if $\vec{x}^\top A \vec{x} \geq 0$ for all $\vec{x} \in \mathbb{R}^n$; it is positive definite (PD), written $A \in \mathcal{S}^n_{++}$, if $\vec{x}^\top A \vec{x} > 0$ for all $\vec{x} \neq \vec{0}$.

There are also negative semidefinite (NSD) and negative definite (ND) symmetric matrices, defined analogously. There are also indefinite symmetric matrices, which are none of the above. It is easy to see that PD matrices are themselves PSD.
Proposition 32
We have $A \in \mathcal{S}^n_+$ if and only if each eigenvalue of $A$ is non-negative. Also, $A \in \mathcal{S}^n_{++}$ if and only if each eigenvalue of $A$ is positive.

Proof. By Theorem 30, $\lambda_{\min}\{A\} = \min_{\|\vec{x}\|_2 = 1} \vec{x}^\top A \vec{x}$. If each eigenvalue of $A$ is non-negative, then

$$\min_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_2 = 1}} \vec{x}^\top A \vec{x} = \lambda_{\min}\{A\} \geq 0,$$

which implies that $\vec{x}^\top A \vec{x} \geq 0$ for all $\vec{x}$ with unit norm, and by scaling we see that $\vec{x}^\top A \vec{x} \geq 0$ for all $\vec{x} \neq \vec{0}$, while the inequality certainly holds for $\vec{x} = \vec{0}$. Thus $\vec{x}^\top A \vec{x} \geq 0$ for all $\vec{x}$, so $A \in \mathcal{S}^n_+$. Conversely, if $A$ is PSD then $\lambda_{\min}\{A\} = \min_{\|\vec{x}\|_2 = 1} \vec{x}^\top A \vec{x} \geq 0$, so each eigenvalue is non-negative.

Similarly, if each eigenvalue of $A$ is positive, we have

$$\lambda_{\min}\{A\} = \min_{\substack{\vec{x} \in \mathbb{R}^n \\ \|\vec{x}\|_2 = 1}} \vec{x}^\top A \vec{x} > 0, \qquad (2.138)$$

which implies that $\vec{x}^\top A \vec{x} > 0$ for all $\vec{x}$ with unit norm, and by scaling we see that $\vec{x}^\top A \vec{x} > 0$ for all $\vec{x} \neq \vec{0}$, so $A \in \mathcal{S}^n_{++}$. Conversely, if $A$ is PD then $\lambda_{\min}\{A\} > 0$, so each eigenvalue is positive.
The final construction we discuss is that of the positive semidefinite square root.

Proposition 33
Let $A \in \mathcal{S}^n_+$. Then there exists a unique symmetric PSD matrix $B \in \mathcal{S}^n_+$, usually denoted $B = A^{1/2}$, such that $A = B^2$.

Proof. Discussion or homework. Note that there are non-symmetric matrices $B$ such that $A = B^2$, but there is a unique PSD $B$.
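A sketch of the usual construction (assuming numpy): diagonalize $A = U \Lambda U^\top$ and take $A^{1/2} = U \Lambda^{1/2} U^\top$.

```python
# A sketch (assuming numpy) of the PSD square root via diagonalization.
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = M @ M.T                            # PSD, as in Example 23

lams, U = np.linalg.eigh(A)
lams = np.clip(lams, 0.0, None)        # clip tiny negative round-off
B = U @ np.diag(np.sqrt(lams)) @ U.T   # B = U Lambda^{1/2} U^T
assert np.allclose(B, B.T) and np.allclose(B @ B, A)
```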
2.5 Principal Component Analysis

This idea of reducing the dimension of data while preserving its structure has many use cases. For example, in modern machine learning, most data has thousands or millions of dimensions. In order to visualize it properly, we need to reduce its dimension to a reasonable number, in order to get an idea about the underlying structure of the data.
Let us first lay out some notation and definitions. Suppose we have the data points $\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d$. We organize these into a data matrix $X$ where the data points form the rows:¹

$$X = \begin{bmatrix} \vec{x}_1^\top \\ \vdots \\ \vec{x}_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad \text{so that} \qquad X^\top = \begin{bmatrix} \vec{x}_1 & \cdots & \vec{x}_n \end{bmatrix} \in \mathbb{R}^{d \times n}. \qquad (2.140)$$
[Figure: a 2-D scatter of data points lying close to a line through the origin.]

While this dataset is clearly fully two-dimensional, there is equally clearly some inherent 1-dimensional linear structure to the data. So when we want to look for an underlying low-dimensional structure, we're looking for something like this. Here, we could find the direction $\vec{w}$ as below:
¹Different textbooks handle this differently. For instance, some textbooks define a data matrix as one where the data points form the columns. If
you’re unsure, work it out from first principles.
[Figure: the same scatter with the direction $\vec{w}$ drawn along the data.]
And, if we are in generic $\mathbb{R}^d$ space, we want to find orthonormal vectors $\vec{w}_1, \ldots, \vec{w}_p$ such that projection onto them uncovers the underlying data structure. The process of accurately characterizing these $\vec{w}_i$ is what we will discuss in what follows.
We begin with a motivating example. Consider the MNIST dataset of handwritten digits. Each image is a 28-pixel by 28-pixel grid, with each numerical entry in the grid denoting the greyscale value at that grid point. This can be represented by a $28 \times 28$ matrix, or alternatively unrolled into a $28^2 = 784$-dimensional vector. It is impossible to directly visualize 784-dimensional space, so we seek to find $\vec{w}_1, \ldots, \vec{w}_8 \in \mathbb{R}^{784}$ such that the projections onto the $\vec{w}_i$ preserve a lot of structure. Say that we take $\vec{w}_i = \vec{e}_i$, where $\vec{e}_i$ is the $i$th standard basis vector in $\mathbb{R}^{784}$. Then for most images, the projection onto the $\vec{w}_i$'s will be 0 or near-0. Thus, the projection of all of the data onto the $\vec{w}_i$ preserves almost none of the structure and collapses all points in the dataset to just a few points in $\mathbb{R}^8$. There is instead a much more principled way to choose the $\vec{w}_i$ that will preserve most of the structure.
We now discuss how to choose the first principal component $\vec{w}_1 \in \mathbb{R}^d$. To preserve the structure of the underlying data as much as possible, we want the vectors $\vec{x}_i$ projected onto the span of $\vec{w}_1$ to be as close as possible to the original vectors $\vec{x}_i$. We also want $\|\vec{w}_1\|_2 = 1$. Thus, the error of the projection across all data points is

$$\operatorname{err}(\vec{w}_1) = \frac{1}{n} \sum_{i=1}^n \left\| \vec{x}_i - \vec{w}_1 (\vec{w}_1^\top \vec{x}_i) \right\|_2^2.$$
Expanding, we have

$$\begin{aligned}
\operatorname{err}(\vec{w}_1) &= \frac{1}{n} \sum_{i=1}^n \left\| \vec{x}_i - \vec{w}_1 (\vec{w}_1^\top \vec{x}_i) \right\|_2^2 && (2.142) \\
&= \frac{1}{n} \sum_{i=1}^n (\vec{x}_i - \vec{w}_1 (\vec{w}_1^\top \vec{x}_i))^\top (\vec{x}_i - \vec{w}_1 (\vec{w}_1^\top \vec{x}_i)) && (2.143) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \vec{x}_i^\top \vec{x}_i - \vec{x}_i^\top \vec{w}_1 (\vec{w}_1^\top \vec{x}_i) - (\vec{w}_1 (\vec{w}_1^\top \vec{x}_i))^\top \vec{x}_i + (\vec{w}_1 (\vec{w}_1^\top \vec{x}_i))^\top (\vec{w}_1 (\vec{w}_1^\top \vec{x}_i)) \right) && (2.144) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \vec{x}_i^\top \vec{x}_i - 2 (\vec{x}_i^\top \vec{w}_1)^2 + (\vec{w}_1^\top \vec{w}_1)(\vec{w}_1^\top \vec{x}_i)^2 \right) && (2.145) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \|\vec{x}_i\|_2^2 - 2 (\vec{x}_i^\top \vec{w}_1)^2 + (\vec{w}_1^\top \vec{x}_i)^2 \right) && (2.146) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \|\vec{x}_i\|_2^2 - (\vec{x}_i^\top \vec{w}_1)^2 \right), && (2.147)
\end{aligned}$$

where (2.146) uses $\vec{w}_1^\top \vec{w}_1 = \|\vec{w}_1\|_2^2 = 1$.
Minimizing $\operatorname{err}(\vec{w}_1)$ over unit-norm $\vec{w}_1$ is therefore equivalent to maximizing $\frac{1}{n} \sum_{i=1}^n (\vec{x}_i^\top \vec{w}_1)^2 = \vec{w}_1^\top \left( \frac{X^\top X}{n} \right) \vec{w}_1$, which by Theorem 30 is at most $\lambda_{\max}\{C\}$ for $C = X^\top X / n$, with the $\vec{w}_1$ achieving this upper bound being the eigenvector $\vec{u}_{\max}$ corresponding to the eigenvalue $\lambda_{\max}\{C\}$. Thus, the first principal component is exactly an eigenvector corresponding to the largest eigenvalue of the dot product matrix $C = X^\top X / n$.

This computation is a special case of the singular value decomposition, which is used in practice to compute the PCA of a dataset; understanding this decomposition will allow us to neatly compute the other principal components (i.e., second, third, fourth, ...) as well.
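A short sketch (assuming numpy, on synthetic nearly-one-dimensional data) computing the first principal component as the top eigenvector of $C = X^\top X / n$:

```python
# A sketch (assuming numpy) of the first principal component on toy data.
import numpy as np

rng = np.random.default_rng(4)
w_true = np.array([3.0, 4.0]) / 5.0             # hidden unit direction
X = np.outer(rng.standard_normal(200), w_true)  # points along w_true ...
X += 0.05 * rng.standard_normal(X.shape)        # ... plus small noise

C = X.T @ X / X.shape[0]             # the dot product matrix
lams, U = np.linalg.eigh(C)          # ascending eigenvalues
w1 = U[:, -1]                        # eigenvector for lambda_max{C}
print(w1)                            # close to +/- w_true
```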
2.6 Singular Value Decomposition

Definition 34 (SVD)
Let $A \in \mathbb{R}^{m \times n}$ have rank $r$. A singular value decomposition (SVD) of $A$ is a decomposition of the form

$$\begin{aligned}
A &= U \Sigma V^\top && (2.156) \\
&= \begin{bmatrix} U_r & U_{m-r} \end{bmatrix} \begin{bmatrix} \Sigma_r & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{bmatrix} \begin{bmatrix} V_r^\top \\ V_{n-r}^\top \end{bmatrix} && (2.157) \\
&= U_r \Sigma_r V_r^\top && (2.158) \\
&= \sum_{i=1}^r \sigma_i \vec{u}_i \vec{v}_i^\top, && (2.159)
\end{aligned}$$

where:
• $U \in \mathbb{R}^{m \times m}$, $U_r \in \mathbb{R}^{m \times r}$, $U_{m-r} \in \mathbb{R}^{m \times (m-r)}$, $V \in \mathbb{R}^{n \times n}$, $V_r \in \mathbb{R}^{n \times r}$, and $V_{n-r} \in \mathbb{R}^{n \times (n-r)}$ are orthonormal matrices, where $U = \begin{bmatrix} U_r & U_{m-r} \end{bmatrix}$ has columns $\vec{u}_1, \ldots, \vec{u}_m$ (left singular vectors) and $V = \begin{bmatrix} V_r & V_{n-r} \end{bmatrix}$ has columns $\vec{v}_1, \ldots, \vec{v}_n$ (right singular vectors).

• $\Sigma_r = \operatorname{diag}(\sigma_1, \ldots, \sigma_r) \in \mathbb{R}^{r \times r}$ is a diagonal matrix with ordered positive entries $\sigma_1 \geq \cdots \geq \sigma_r > 0$ (singular values), and the zero matrices in $\Sigma = \begin{bmatrix} \Sigma_r & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{bmatrix}$ are shaped to ensure that $\Sigma \in \mathbb{R}^{m \times n}$.
Suppose that $A$ is tall (so $m > n$) with full column rank $n$. Then the SVD looks like the following:

$$A = U \begin{bmatrix} \Sigma_n \\ 0_{(m-n) \times n} \end{bmatrix} V^\top. \qquad (2.160)$$

On the other hand, if $A$ is wide (so $m < n$) with full row rank $m$, then the SVD looks like the following:

$$A = U \begin{bmatrix} \Sigma_m & 0_{m \times (n-m)} \end{bmatrix} V^\top. \qquad (2.161)$$
The last (summation) form of the SVD is called the dyadic SVD; this is because terms of the form $\vec{p} \vec{q}^\top$ are called dyads, and the dyadic SVD expresses the matrix $A$ as the sum of dyads.
All forms of the SVD are useful conceptually and computationally, depending on the problem we are working on.
We now discuss a method to construct the SVD. Suppose $A \in \mathbb{R}^{m \times n}$ has rank $r$. We consider the symmetric matrix $A^\top A$, which has rank $r$ and thus $r$ nonzero eigenvalues, all of which are positive. We can order its eigenvalues as $\lambda_1 \geq \cdots \geq \lambda_r > \lambda_{r+1} = \cdots = \lambda_n = 0$, say with corresponding orthonormal eigenvectors $\vec{v}_1, \ldots, \vec{v}_n$.

Then, for $i \in \{1, \ldots, r\}$, we define $\sigma_i \doteq \sqrt{\lambda_i} > 0$ and $\vec{u}_i \doteq A \vec{v}_i / \sigma_i$. This only gives us $r$ vectors $\vec{u}_i$, but we need $m$ of them to construct $U \in \mathbb{R}^{m \times m}$. To find the remaining $\vec{u}_i$ we use Gram-Schmidt on the matrix $\begin{bmatrix} \vec{u}_1 & \cdots & \vec{u}_r & I \end{bmatrix} \in \mathbb{R}^{m \times (r+m)}$, throwing out the $r$ vectors whose projection residual onto previously processed vectors is 0.
More formally, we can write an algorithm:
It’s clear that Algorithm 2 gives an orthonormal basis {~v1 , . . . , ~vn } for Rn that can be constructed into the orthonor-
mal V matrix, and that it gives singular values σ1 ≥ · · · ≥ σr > 0. We aim to show two things: the {~u1 , . . . , ~um } are
orthonormal, and that A = U ΣV > where U , Σ, and V are constructed using the returned vectors and scalars.
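As a sketch of this construction (assuming numpy, and restricting to a full-column-rank $A$ so that the Gram-Schmidt completion of $U$ is unnecessary):

```python
# A numpy sketch of building an SVD from the eigendecomposition of A^T A.
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))          # tall, rank 3 almost surely

lams, V = np.linalg.eigh(A.T @ A)        # ascending eigenvalues
order = np.argsort(lams)[::-1]           # reorder to descending
lams, V = lams[order], V[:, order]

sigma = np.sqrt(lams)                    # sigma_i = sqrt(lambda_i)
U_r = A @ V / sigma                      # columns u_i = A v_i / sigma_i
assert np.allclose(U_r.T @ U_r, np.eye(3))         # Proposition 35
assert np.allclose(U_r @ np.diag(sigma) @ V.T, A)  # A = U_r Sigma_r V_r^T
```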
Proposition 35
In the context of Algorithm 2, $\{\vec{u}_1, \ldots, \vec{u}_m\}$ is an orthonormal set.

Proof. From our invocation of Gram-Schmidt, $\{\vec{u}_{r+1}, \ldots, \vec{u}_m\}$ is an orthonormal set which spans a subspace orthogonal to the span of $\{\vec{u}_1, \ldots, \vec{u}_r\}$. Thus, we need to show that $\{\vec{u}_1, \ldots, \vec{u}_r\}$ are orthonormal.

Indeed, take $1 \leq i < j \leq r$. Then since the $\vec{v}_j$ are orthonormal eigenvectors of $A^\top A$, we have

$$\begin{aligned}
\vec{u}_i^\top \vec{u}_j &= \left( \frac{A \vec{v}_i}{\sigma_i} \right)^\top \frac{A \vec{v}_j}{\sigma_j} && (2.162) \\
&= \frac{\vec{v}_i^\top A^\top A \vec{v}_j}{\sigma_i \sigma_j} && (2.163) \\
&= \frac{\lambda_j \vec{v}_i^\top \vec{v}_j}{\sigma_i \sigma_j} && (2.164) \\
&= \frac{\lambda_j}{\sigma_i \sigma_j} \underbrace{\vec{v}_i^\top \vec{v}_j}_{=0} && (2.165) \\
&= 0. && (2.166)
\end{aligned}$$

On the other hand, for a specific $i \in \{1, \ldots, r\}$, using that $\sigma_i^2 = \lambda_i$, we have

$$\begin{aligned}
\|\vec{u}_i\|_2^2 &= \left\| \frac{A \vec{v}_i}{\sigma_i} \right\|_2^2 && (2.167) \\
&= \left( \frac{A \vec{v}_i}{\sigma_i} \right)^\top \frac{A \vec{v}_i}{\sigma_i} && (2.168) \\
&= \frac{\vec{v}_i^\top A^\top A \vec{v}_i}{\sigma_i^2} && (2.169) \\
&= \frac{\lambda_i \vec{v}_i^\top \vec{v}_i}{\sigma_i^2} && (2.170) \\
&= \underbrace{\frac{\lambda_i}{\sigma_i^2}}_{=1} \underbrace{\vec{v}_i^\top \vec{v}_i}_{=1} && (2.171) \\
&= 1. && (2.172)
\end{aligned}$$

Thus the set $\{\vec{u}_1, \ldots, \vec{u}_r\}$ is orthonormal, so the whole set $\{\vec{u}_1, \ldots, \vec{u}_m\}$ is orthonormal.
Proposition 36
In the context of Algorithm 2, we have $A = U \Sigma V^\top$.

Proof. For $i \in \{1, \ldots, r\}$, we have $A \vec{v}_i = \sigma_i \vec{u}_i$ by the definition of $\vec{u}_i$; for $i \in \{r+1, \ldots, n\}$, we have $A \vec{v}_i = \vec{0}$, since then $\|A \vec{v}_i\|_2^2 = \vec{v}_i^\top A^\top A \vec{v}_i = \lambda_i \|\vec{v}_i\|_2^2 = 0$. This gives us

$$\begin{aligned}
A V &= A \begin{bmatrix} \vec{v}_1 & \cdots & \vec{v}_r & \vec{v}_{r+1} & \cdots & \vec{v}_n \end{bmatrix} && (2.175) \\
&= \begin{bmatrix} A \vec{v}_1 & \cdots & A \vec{v}_r & A \vec{v}_{r+1} & \cdots & A \vec{v}_n \end{bmatrix} && (2.176) \\
&= \begin{bmatrix} \sigma_1 \vec{u}_1 & \cdots & \sigma_r \vec{u}_r & \vec{0} & \cdots & \vec{0} \end{bmatrix} && (2.177) \\
&= \begin{bmatrix} U_r \Sigma_r & 0 \end{bmatrix} && (2.178) \\
&= U \Sigma.
\end{aligned}$$

Multiplying both sides on the right by $V^\top$ and using $V V^\top = I$ gives $A = U \Sigma V^\top$.
The SVD is not unique: the Gram-Schmidt process could have used any basis for $\mathbb{R}^m$ that wasn't the columns of $I$ and still have been valid; if you had multiple eigenvectors of $A^\top A$ with the same eigenvalue, then the choice of eigenvectors in the diagonalization would not be unique; and even if you didn't have multiple eigenvectors with the same eigenvalue, the eigenvectors would only be determined up to a sign change $\vec{v} \mapsto -\vec{v}$ anyway. So there is a lot of ambiguity in the construction, which reflects the non-uniqueness.
We now discuss the geometry of the SVD, especially how each component of the SVD acts on vectors. To do this we will fix $A \in \mathbb{R}^{2 \times 2}$ with SVD $A = U \Sigma V^\top$, and find the behavior of $A \vec{x}$ for all $\vec{x}$ on the unit circle (i.e., with norm 1). We will analyze the behavior of $A \vec{x}$ by using the behavior of $V^\top \vec{x}$, then $\Sigma V^\top \vec{x}$, and finally $U \Sigma V^\top \vec{x}$. In the end, we will interpret $U$ as a rotation or reflection, $\Sigma$ as a scaling, and $V^\top$ as another rotation or reflection.
Before we start, let us discuss what different types of matrices look like as linear transformations. Consider our friendly unit circle:

[Figure: the unit circle in $\mathbb{R}^2$.]

Multiplying each vector in the unit circle by an orthonormal matrix rotates and/or reflects it, so the unit circle is mapped to itself.

[Figure: the unit circle mapped by an orthonormal matrix to itself.]
On the other hand, multiplying each vector in the unit circle by the same diagonal matrix will scale the vectors in the coordinate directions. This means that the unit circle will be mapped, in general, to an axis-aligned ellipse.

[Figure: the unit circle mapped by a diagonal matrix to an axis-aligned ellipse.]

In general, matrices aren't orthonormal or diagonal, and so they will both rotate and scale in various ways. This means that the unit circle will be mapped to an ellipse which isn't necessarily axis-aligned.

[Figure: the unit circle mapped by a generic matrix to a rotated ellipse.]
Let us now systematically study such $A = U \Sigma V^\top$ through the lens of the unit circle, as well as where the right singular vectors $\vec{v}_1$ and $\vec{v}_2$ are mapped by $A$.
Since $V^\top$ is an orthonormal matrix, it represents a rotation and/or reflection, and so it maps the unit circle to the unit circle, much like our observed figure. Specifically, the right singular vectors $\vec{v}_i$ have $V^\top \vec{v}_i = \vec{e}_i$, so they get mapped onto the standard basis by $V^\top$. This gives the following picture.

[Figure: $V^\top$ maps the unit circle to itself, sending $\vec{v}_1, \vec{v}_2$ to $\vec{e}_1, \vec{e}_2$.]
Now, the diagonal matrix Σ will scale each ~ei by σi , obtaining an ellipse.
[Figure: applying $\Sigma$ after $V^\top$ maps the circle to an axis-aligned ellipse with axes $\sigma_1 \vec{e}_1$ and $\sigma_2 \vec{e}_2$.]
Finally, the orthonormal matrix U will map this axis-aligned ellipse to an ellipse which isn’t necessarily axis-
aligned. Specifically, the vectors that we’re looking at, i.e., σi~ei , have U (σi~ei ) = σi U~ei = σi ~ui . These ~ui will be the
axes of the resulting ellipse in the same sense as σi~ei were the axes of the axis-aligned ellipse.
[Figure: the full composition $U\Sigma V^\top$ sends $\vec{v}_1, \vec{v}_2$ to $\sigma_1\vec{u}_1, \sigma_2\vec{u}_2$, the axes of the final ellipse.]
Recall that we originally started with a depiction of a generic matrix that didn't single out any particular vectors, yet obtained the same resulting ellipse.
To understand the impact of A on any general vector ~x, we write it in the V basis: ~x = α1~v1 + α2~v2 , and use
linearity to obtain A~x = α1 σ1 ~u1 + α2 σ2 ~u2 . One can draw this graphically using scaled versions of the above ellipses.
This perspective also says that σ1 is the maximum scaling of any vector obtained by multiplication by A, and σr
is the minimum nonzero scaling. (If r < n, i.e., A is not full column rank, then there are some nonzero vectors in Rn
which are sent to ~0 by A, so the minimum scaling is 0.) You will formally prove this in homework.
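To make this geometry concrete, here is a short numerical sketch (Python with NumPy; the matrix $A$ is an arbitrary choice for illustration) that maps the unit circle through $V^\top$, $\Sigma$, and $U$ and checks that the extreme scalings are $\sigma_1$ and $\sigma_r$:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # arbitrary example matrix
U, s, Vt = np.linalg.svd(A)  # A = U @ diag(s) @ Vt

# Points on the unit circle.
theta = np.linspace(0, 2 * np.pi, 1000)
X = np.vstack([np.cos(theta), np.sin(theta)])    # shape (2, 1000)

# Apply the three factors in sequence: rotate/reflect, scale, rotate/reflect.
after_Vt = Vt @ X                     # still the unit circle
after_Sigma = np.diag(s) @ after_Vt   # axis-aligned ellipse
after_U = U @ after_Sigma             # final (rotated) ellipse, equals A @ X

assert np.allclose(after_U, A @ X)

# The largest/smallest scalings of any unit vector are sigma_1 and sigma_r.
scalings = np.linalg.norm(A @ X, axis=0)
print(scalings.max(), s[0])   # both approximately sigma_1
print(scalings.min(), s[-1])  # both approximately sigma_2
```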
To formally talk about a compression algorithm that stores a compressed version of the data with minimal error, we need to decide what kind of errors are appropriate to discuss in the context of matrices. In the case of vectors, we can use the $\ell_2$ norm, or more generally the $\ell_p$ norm, to define a distance function; the error is then the distance between the true and the perturbed vectors. This motivates thinking about matrix norms, which allow us to quantify the distance between matrices, and thus create error functions.

There are two ways to think about a matrix. The first way is as a block of numbers. Similarly to how we thought of a vector as a block of numbers and defined its norm accordingly, we can think of the matrix as a big list of vectors and take the norm. This norm is called the Frobenius norm, and it corresponds to unrolling an $m \times n$ matrix into a length-$mn$ vector and taking its $\ell_2$ norm.
Proposition 38
For a matrix $A \in \mathbb{R}^{m \times n}$, we have $\|A\|_F^2 = \operatorname{tr}(A^\top A)$.
Proposition 39
For a matrix $A \in \mathbb{R}^{m \times n}$ and orthonormal matrices $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$, we have
\[
\|U A V\|_F = \|A\|_F.
\]
Proposition 40
For a matrix $A \in \mathbb{R}^{m \times n}$ with rank $r$ and singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$, we have
\[
\|A\|_F^2 = \sum_{i=1}^{r} \sigma_i^2. \tag{2.184}
\]
Proof. Let $A = U\Sigma V^\top$ be the SVD of $A$; then, we use the previous proposition to get
\[
\|A\|_F^2 = \|U \Sigma V^\top\|_F^2 \tag{2.185}
\]
\[
= \|\Sigma\|_F^2 \tag{2.186}
\]
\[
= \sum_{i=1}^{r} \sigma_i^2. \tag{2.187}
\]
Under this perspective, $\|A - B\|_F$ is small if each entry of $A$ is close to the corresponding entry of $B$; that is, $A$ and $B$ are very similar in terms of the block-of-numbers interpretation.

The second way to think about a matrix is as a linear transformation. In this case, the matrix is defined by how it acts on vectors via multiplication. A suitable notion of size in this case is the largest scaling factor of the matrix on any unit vector; this is called the spectral norm or the matrix $\ell_2$ norm:
\[
\|A\|_2 = \max_{\|\vec{x}\|_2 = 1} \|A\vec{x}\|_2. \tag{2.188}
\]
Fortunately, this optimization problem has a solution — what's more, we've actually seen this solution before.

Proposition 42
For a matrix $A \in \mathbb{R}^{m \times n}$ with rank $r$ and singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$, we have
\[
\|A\|_2 = \sigma_1. \tag{2.189}
\]
Connecting back to the ellipse transformations, the spectral norm captures how much the ellipse is stretched in its most stretched direction; here the matrix is viewed as a linear map.
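The two norms and their singular-value characterizations (Propositions 38, 40, and 42) are easy to sanity-check numerically; a small sketch (Python/NumPy; the random matrix is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)  # singular values, descending

# Frobenius norm: block-of-numbers view.
fro = np.linalg.norm(A, "fro")
assert np.isclose(fro**2, np.trace(A.T @ A))  # Proposition 38
assert np.isclose(fro**2, np.sum(s**2))       # Proposition 40

# Spectral norm: largest scaling of a unit vector.
spec = np.linalg.norm(A, 2)
assert np.isclose(spec, s[0])                 # Proposition 42
```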
To present our main theorems about how to approximate the matrix well under these norms, we need to define notation. Fix a matrix $A \in \mathbb{R}^{m \times n}$. For convenience, let $p := \min\{m, n\}$. Suppose that $A$ has rank $r \leq p$, and that $A$ has SVD
\[
A = \sum_{i=1}^{p} \sigma_i \vec{u}_i \vec{v}_i^\top \tag{2.196}
\]
where we note that $\sigma_1 \geq \cdots \geq \sigma_r$ and define $\sigma_{r+1} = \sigma_{r+2} = \cdots = 0$. Then, for $k \leq p$, we can define
\[
A_k := \sum_{i=1}^{k} \sigma_i \vec{u}_i \vec{v}_i^\top. \tag{2.197}
\]
Note that if $k \ll p$, then $A_k$ can be stored much more efficiently than $A$. For instance, $A_k$ needs to store $k$ scalars (the $\sigma_i$), $k$ vectors in $\mathbb{R}^m$ (the $\vec{u}_i$), and $k$ vectors in $\mathbb{R}^n$ (the $\vec{v}_i$), for a total storage of $k(m + n + 1)$ floats. On the other hand, storing $A$ requires $mn$ floats naively, and even storing the (full) SVD of $A$ is not much better. So the former is much more efficient to store. The only thing left to do is to show that $A_k$ actually is a good approximation to $A$ (otherwise what would be the point of storing it instead of $A$?).
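A minimal sketch of this truncation (Python/NumPy; the shape is an arbitrary choice) that builds $A_k$ and compares the float counts:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 600, 400, 20
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation A_k = sum_{i <= k} sigma_i u_i v_i^T.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("floats for A:  ", m * n)            # 240000
print("floats for A_k:", k * (m + n + 1))  # 20020
print("relative spectral error:", np.linalg.norm(A - A_k, 2) / s[0])
```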
It turns out that Ak indeed well-approximates A in the sense of the two norms — the Frobenius norm and the
spectral norm — that we discussed previously. The two results are collectively known as the Eckart-Young (sometimes
Eckart-Young-Mirsky) theorem(s). We state and prove these now.
We begin with the Eckart-Young theorem for the spectral norm, since it will help us prove the analogous result for
Frobenius norms.
Theorem 43 (Eckart-Young, Spectral Norm)
In the above notation, $A_k$ is a best rank-$k$ approximation of $A$ in the spectral norm:
\[
A_k \in \operatorname*{argmin}_{\substack{B \in \mathbb{R}^{m \times n} \\ \operatorname{rank}(B) \leq k}} \|A - B\|_2, \tag{2.198}
\]
or, equivalently,
\[
\|A - A_k\|_2 \leq \|A - B\|_2, \quad \forall B \in \mathbb{R}^{m \times n} \colon \operatorname{rank}(B) \leq k. \tag{2.199}
\]
Proof of Theorem 43. The proof of the Eckart-Young theorem for the spectral norm is partitioned into two parts which together show the conclusion. First, we explicitly calculate $\|A - A_k\|_2$; then, for an arbitrary $B$ of rank $\leq k$, we show that $\|A - B\|_2$ is lower-bounded by this quantity.

Part 1. From the definitions of $A$ and $A_k$, we have
\[
A - A_k = \sum_{i=k+1}^{p} \sigma_i \vec{u}_i \vec{v}_i^\top.
\]
Since $\sigma_{k+1} \geq \sigma_{k+2} \geq \cdots$, the $\vec{u}_i$ are orthonormal, and the $\vec{v}_i$ are orthonormal, this is a valid singular value decomposition of the rank $r - k$ matrix $A - A_k$. Thus $A - A_k$ has largest singular value equal to $\sigma_{k+1}$, so
\[
\|A - A_k\|_2 = \sigma_{k+1}. \tag{2.202}
\]
Part 2. We now show that for any matrix $B \in \mathbb{R}^{m \times n}$ of rank $\leq k$, we have $\|A - B\|_2 \geq \sigma_{k+1}$. Before we commence with the proof, let us first discuss the overall argument structure.

We have two characterizations of the spectral norm: the maximum singular value, and the maximal Rayleigh coefficient. At this point, we don't have any advanced machinery, such as singular value inequalities, to directly show that the maximum singular value of $A - B$ is lower-bounded by some constant. However, we do understand matrix-vector products, and how to characterize their optima, pretty well, so this motivates the use of the maximal Rayleigh coefficient. In particular, let $f(\vec{x}) = \|(A - B)\vec{x}\|_2 / \|\vec{x}\|_2$. Then for any specific value of $\vec{x}_0$, we have $\|A - B\|_2 = \max_{\vec{x} \neq \vec{0}} f(\vec{x}) \geq f(\vec{x}_0)$. Thus, one way to get a lower bound on $\|A - B\|_2$ is to plug any $\vec{x}_0$ into $f$. The trick is, given $B$, to find the right value of $\vec{x}_0$ such that $f(\vec{x}_0) \geq \sigma_{k+1}$.

The first part of the argument will be finding an appropriate $\vec{x}_0$; the next will show that it works.
Okay, let’s start with the proof.
Step 1. Let $V_{k+1} := \begin{bmatrix} \vec{v}_1 & \cdots & \vec{v}_{k+1} \end{bmatrix} \in \mathbb{R}^{n \times (k+1)}$, so that $\dim(\mathcal{R}(V_{k+1})) = k + 1$. Since $\operatorname{rank}(B) \leq k$, the rank-nullity theorem gives $\dim(\mathcal{N}(B)) \geq n - k$. Thus
\[
\dim(\mathcal{N}(B)) + \dim(\mathcal{R}(V_{k+1})) \geq (n - k) + (k + 1) = n + 1 > n. \tag{2.205}
\]
Claim. There exists a unit vector in N (B) ∩ R(Vk+1 ).
Proof. First, we note that since N (B) and R(Vk+1 ) are subspaces of Rn , their intersection is
a subspace of Rn . Thus, it is closed under scalar multiplication. Thus the existence of a unit
vector is equivalent to the existence of a nonzero vector, since one can scale this nonzero vector
by its (inverse) norm to get the unit vector.
Suppose for the sake of contradiction that $\mathcal{N}(B)$ and $\mathcal{R}(V_{k+1})$ have no nonzero vectors in common. Fix a basis $S_1$ for $\mathcal{N}(B)$ and a basis $S_2$ for $\mathcal{R}(V_{k+1})$. By assumption, $S_1 \cup S_2$ is a linearly independent set, and in particular $S_1 \cap S_2 = \emptyset$. Here, there are two cases.

A. $S_1 \cup S_2$ contains $< n + 1$ vectors. Since $|S_1| + |S_2| \geq n + 1$, this means that $S_1$ and $S_2$ have a vector in common, so $S_1 \cap S_2 \neq \emptyset$ and we have reached a contradiction.

B. $S_1 \cup S_2$ contains $\geq n + 1$ vectors. As a set of more than $n$ vectors in $\mathbb{R}^n$, $S_1 \cup S_2$ is linearly dependent and we have reached a contradiction.

In both cases we have reached a contradiction, so there is a nonzero vector in $\mathcal{N}(B) \cap \mathcal{R}(V_{k+1})$. We may scale this vector by its inverse norm to get a unit vector in $\mathcal{N}(B) \cap \mathcal{R}(V_{k+1})$.
Let ~x0 ∈ N (B) ∩ R(Vk+1 ) be any such unit vector. We use this vector for the Rayleigh coefficient.
Step 2. Now we show that our choice of ~x0 plugged into the Rayleigh coefficient gives σk+1 as a lower bound.
Indeed, we have
\[
\|A - B\|_2 = \max_{\vec{x} \neq \vec{0}} \frac{\|(A - B)\vec{x}\|_2}{\|\vec{x}\|_2} \tag{2.206}
\]
\[
\geq \frac{\|(A - B)\vec{x}_0\|_2}{\|\vec{x}_0\|_2} \tag{2.207}
\]
\[
= \|(A - B)\vec{x}_0\|_2 \tag{2.208}
\]
\[
= \|A\vec{x}_0 - B\vec{x}_0\|_2. \tag{2.209}
\]
Since $\vec{x}_0 \in \mathcal{R}(V_{k+1})$, there are constants $\alpha_1, \ldots, \alpha_{k+1}$, not all zero, such that $\vec{x}_0 = \sum_{i=1}^{k+1} \alpha_i \vec{v}_i$.
Moreover, since $\vec{x}_0 \in \mathcal{N}(B)$, we have $B\vec{x}_0 = \vec{0}$, so
\[
\|A\vec{x}_0 - B\vec{x}_0\|_2 = \|A\vec{x}_0\|_2 \tag{2.212}
\]
\[
= \left\| \left( \sum_{i=1}^{p} \sigma_i \vec{u}_i \vec{v}_i^\top \right) \left( \sum_{i=1}^{k+1} \alpha_i \vec{v}_i \right) \right\|_2 \tag{2.213}
\]
\[
= \left\| \sum_{i=1}^{p} \sum_{j=1}^{k+1} \sigma_i \alpha_j \vec{u}_i \vec{v}_i^\top \vec{v}_j \right\|_2. \tag{2.214}
\]
By vector algebra, the fact that the $\vec{u}_i$ are orthonormal, and the fact that the $\vec{v}_i$ are orthonormal (so $\vec{v}_i^\top \vec{v}_j = 1$ if $i = j$ and $0$ otherwise), one can mechanically show that
\[
\|A - B\|_2 \geq \left\| \sum_{i=1}^{p} \sum_{j=1}^{k+1} \sigma_i \alpha_j \vec{u}_i \vec{v}_i^\top \vec{v}_j \right\|_2 \tag{2.215}
\]
\[
= \left\| \sum_{i=1}^{k+1} \alpha_i \sigma_i \vec{u}_i \right\|_2 \tag{2.216}
\]
\[
= \sqrt{\left\| \sum_{i=1}^{k+1} \alpha_i \sigma_i \vec{u}_i \right\|_2^2} \tag{2.217}
\]
\[
= \sqrt{\sum_{i=1}^{k+1} \alpha_i^2 \sigma_i^2} \tag{2.218}
\]
\[
\geq \sqrt{\sigma_{k+1}^2 \sum_{i=1}^{k+1} \alpha_i^2} \tag{2.219}
\]
\[
= \sigma_{k+1} \sqrt{\sum_{i=1}^{k+1} \alpha_i^2} \tag{2.220}
\]
\[
= \sigma_{k+1},
\]
where the last equality holds because $\sum_{i=1}^{k+1} \alpha_i^2 = \|\vec{x}_0\|_2^2 = 1$. This completes the proof.
We now state and prove the other result, which is the Eckart-Young theorem for the Frobenius norm.
Theorem 44 (Eckart-Young, Frobenius Norm)
In the above notation, $A_k$ is a best rank-$k$ approximation of $A$ in the Frobenius norm:
\[
A_k \in \operatorname*{argmin}_{\substack{B \in \mathbb{R}^{m \times n} \\ \operatorname{rank}(B) \leq k}} \|A - B\|_F^2, \tag{2.223}
\]
or, equivalently,
\[
\|A - A_k\|_F^2 \leq \|A - B\|_F^2, \quad \forall B \in \mathbb{R}^{m \times n} \colon \operatorname{rank}(B) \leq k. \tag{2.224}
\]
Proof of Theorem 44. As in the spectral norm case, the proof is partitioned into two parts which together show the conclusion. First, we explicitly calculate $\|A - A_k\|_F^2$; then, for an arbitrary $B$ of rank $\leq k$, we show that $\|A - B\|_F^2$ is lower-bounded by this quantity.

Part 1. We have
\[
A - A_k = \sum_{i=1}^{p} \sigma_i \vec{u}_i \vec{v}_i^\top - \sum_{i=1}^{k} \sigma_i \vec{u}_i \vec{v}_i^\top \tag{2.225}
\]
\[
= \sum_{i=k+1}^{p} \sigma_i \vec{u}_i \vec{v}_i^\top. \tag{2.226}
\]
Since $\sigma_{k+1} \geq \sigma_{k+2} \geq \cdots$, the $\vec{u}_i$ are orthonormal, and the $\vec{v}_i$ are orthonormal, this is a valid singular value decomposition of the rank $r - k$ matrix $A - A_k$. Thus the nonzero singular values of $A - A_k$ are $\sigma_{k+1}, \ldots, \sigma_r$, so by Proposition 40,
\[
\|A - A_k\|_F^2 = \sum_{i=k+1}^{p} \sigma_i^2. \tag{2.227}
\]
Part 2. We now show that for any matrix $B \in \mathbb{R}^{m \times n}$ of rank $\leq k$, we have $\|A - B\|_F^2 \geq \sum_{i=k+1}^{p} \sigma_i^2$. Again, let us begin by discussing the overall argument structure.
Our goal is to show an inequality between two sums of squares of singular values. To do this, it is simplest to
match up terms in the two sums and show the inequality between each pair of terms. Then summing over all
terms preserves the inequality.
More concretely, define $C := A - B$. Suppose $C$ has SVD $C = \sum_{i=1}^{p} \gamma_i \vec{y}_i \vec{z}_i^\top$, with $\gamma_1 \geq \gamma_2 \geq \cdots$. We want to show
\[
\|A - B\|_F^2 = \sum_{i=1}^{p} \gamma_i^2 \underbrace{\geq}_{\text{still need to show}} \sum_{i=k+1}^{p} \sigma_i^2. \tag{2.228}
\]
Thus, we claim that $\gamma_i \geq \sigma_{i+k}$ for every $i \in \{1, \ldots, p\}$, with the understanding that $\sigma_{p+1} = \sigma_{p+2} = \cdots = 0$. To use our earlier knowledge about the spectral norm to our advantage, we write the singular value $\gamma_i$ as the spectral norm of a low-rank approximation error matrix:
\[
\gamma_i = \|C - C_{i-1}\|_2, \tag{2.229}
\]
where $C_{i-1}$ is the rank-$(i-1)$ truncation of $C$ (this is exactly the content of Part 1 of the spectral norm proof, applied to $C$).
Now this looks somewhat like the low-rank approximation setup, because we are computing the spectral norm of the difference between $A$ and some matrix. Indeed,
\[
C - C_{i-1} = (A - B) - C_{i-1} = A - (B + C_{i-1}).
\]
But what is the rank of the subtracted matrix? We have
\[
\operatorname{rank}(B + C_{i-1}) \leq \operatorname{rank}(B) + \operatorname{rank}(C_{i-1}) \leq k + i - 1.
\]
Thus, by applying the Eckart-Young theorem for the spectral norm with rank budget $k + i - 1$, we get
\[
\gamma_i = \|A - (B + C_{i-1})\|_2 \geq \sigma_{k+i}, \quad \text{so} \quad \gamma_i^2 \geq \sigma_{i+k}^2, \quad \forall i \in \{1, \ldots, p\}. \tag{2.233}
\]
Summing over all $i$,
\[
\sum_{i=1}^{p} \gamma_i^2 \geq \sum_{i=1}^{p} \sigma_{i+k}^2, \tag{2.234}
\]
\[
\|A - B\|_F^2 = \sum_{i=1}^{p} \gamma_i^2 \geq \sum_{i=k+1}^{p} \sigma_i^2, \tag{2.235}
\]
as desired.
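Here is a small numerical sanity check of both Eckart-Young statements (Python/NumPy; the random matrix and the rank-$k$ competitors $B$ are arbitrary choices): the truncation $A_k$ should beat randomly generated rank-$k$ matrices in both norms, with error exactly $\sigma_{k+1}$ (spectral) and $\sum_{i>k} \sigma_i^2$ (squared Frobenius).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 30, 20, 5
A = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Exact error formulas from Part 1 of each proof.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
assert np.isclose(np.linalg.norm(A - A_k, "fro") ** 2, np.sum(s[k:] ** 2))

# A_k is at least as good as random rank-k competitors B = X @ Y.
for _ in range(1000):
    B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
    assert np.linalg.norm(A - A_k, 2) <= np.linalg.norm(A - B, 2) + 1e-12
    assert np.linalg.norm(A - A_k, "fro") <= np.linalg.norm(A - B, "fro") + 1e-12
```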
2.8 (OPTIONAL) Block Matrix Identities

The following identities about products of block matrices are often useful; each can be verified entrywise. Let $\vec{x}_1, \ldots, \vec{x}_n$ and $\vec{y}_1, \ldots, \vec{y}_n$ be vectors of compatible sizes.
\[
\begin{bmatrix} \vec{x}_1 & \cdots & \vec{x}_n \end{bmatrix}
\begin{bmatrix} \vec{y}_1^\top \\ \vdots \\ \vec{y}_n^\top \end{bmatrix}
= \sum_{i=1}^{n} \vec{x}_i \vec{y}_i^\top \tag{2.241}
\]
\[
\begin{bmatrix} \vec{x}_1^\top \\ \vdots \\ \vec{x}_n^\top \end{bmatrix}
\begin{bmatrix} \vec{y}_1 & \cdots & \vec{y}_n \end{bmatrix}
= \begin{bmatrix} \vec{x}_1^\top \vec{y}_1 & \cdots & \vec{x}_1^\top \vec{y}_n \\ \vdots & \ddots & \vdots \\ \vec{x}_n^\top \vec{y}_1 & \cdots & \vec{x}_n^\top \vec{y}_n \end{bmatrix} \tag{2.242}
\]
\[
\vec{e}_i^\top \begin{bmatrix} \vec{x}_1^\top \\ \vdots \\ \vec{x}_n^\top \end{bmatrix} = \vec{x}_i^\top \tag{2.243}
\]
\[
\begin{bmatrix} \vec{x}_1 & \cdots & \vec{x}_n \end{bmatrix} \vec{e}_i = \vec{x}_i \tag{2.244}
\]
\[
A \begin{bmatrix} \vec{x}_1 & \cdots & \vec{x}_n \end{bmatrix} = \begin{bmatrix} A\vec{x}_1 & \cdots & A\vec{x}_n \end{bmatrix} \tag{2.245}
\]
\[
\begin{bmatrix} \vec{x}_1^\top \\ \vdots \\ \vec{x}_n^\top \end{bmatrix} A = \begin{bmatrix} \vec{x}_1^\top A \\ \vdots \\ \vec{x}_n^\top A \end{bmatrix} \tag{2.246}
\]
\[
A \begin{bmatrix} B & C \end{bmatrix} = \begin{bmatrix} AB & AC \end{bmatrix} \tag{2.247}
\]
\[
\begin{bmatrix} A \\ B \end{bmatrix} C = \begin{bmatrix} AC \\ BC \end{bmatrix} \tag{2.248}
\]
Block-diagonal matrices multiply block-by-block:
\[
\begin{bmatrix} A_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_n \end{bmatrix}
\begin{bmatrix} B_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & B_n \end{bmatrix}
= \begin{bmatrix} A_1 B_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_n B_n \end{bmatrix}
\]
Finally, for quadratic forms,
\[
\vec{x}^\top A \vec{y} = \sum_{i} \sum_{j} A_{ij} x_i y_j \tag{2.251}
\]
\[
\begin{bmatrix} \vec{x} \\ \vec{y} \end{bmatrix}^\top
\begin{bmatrix} A & B \\ C & D \end{bmatrix}
\begin{bmatrix} \vec{x} \\ \vec{y} \end{bmatrix}
= \vec{x}^\top A \vec{x} + \vec{x}^\top B \vec{y} + \vec{y}^\top C \vec{x} + \vec{y}^\top D \vec{y}. \tag{2.252}
\]
Chapter 3
Vector Calculus
Recall the definition of the derivative for a scalar function $f \colon \mathbb{R} \to \mathbb{R}$:
\[
\frac{df}{dx}(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}. \tag{3.1}
\]
In this chapter, we generalize this notion to two types of functions:
1. Multivariate functions $f \colon \mathbb{R}^n \to \mathbb{R}$ which take a vector $\vec{x} \in \mathbb{R}^n$ as input and produce a scalar $f(\vec{x}) \in \mathbb{R}$ as output. Familiar examples of such functions include $f(\vec{x}) = \|\vec{x}\|_p$ and $f(\vec{x}) = \vec{a}^\top \vec{x}$.
2. Vector-valued functions $\vec{f} \colon \mathbb{R}^n \to \mathbb{R}^m$ which take a vector $\vec{x} \in \mathbb{R}^n$ as input and produce another vector $\vec{f}(\vec{x}) \in \mathbb{R}^m$ as output. A familiar example of such functions is $\vec{f}(\vec{x}) = A\vec{x}$.
One tool that allows us to compute derivatives of scalar functions is the chain rule, which describes the derivative
of the composition of two functions.
For differentiable $f, g \colon \mathbb{R} \to \mathbb{R}$ and $h(x) = f(g(x))$, the chain rule states
\[
\frac{dh}{dx}(x) = \frac{df}{dg}(g(x)) \cdot \frac{dg}{dx}(x). \tag{3.2}
\]
Here, we use some — perhaps unfamiliar — notation not previously introduced. In particular, the derivative $\frac{df}{dg}(g(x))$ is a little strange. When we write $\frac{df}{dg}$, we take the derivative of $f$ with respect to the output of $g$. In this case, we know that the input of $f$ is exactly the output of $g$, so really $\frac{df}{dg}$ takes the derivative of $f$ with respect to its input. Thus we can more compactly write the chain rule using the $f'$ notation as
\[
h'(x) = f'(g(x)) \cdot g'(x). \tag{3.3}
\]
We will see in the next section that the chain rule can be generalized to settings of multivariate functions. To aid
our study of such generalizations, we will introduce here a computational graph perspective of the chain rule. At the
moment this graphical perspective looks trivial, but it will help us understand more complicated cases.
\[
x \longrightarrow g \longrightarrow f \longrightarrow h(x)
\]
In this computational graph, one computes the derivative by summing along all paths from $x$ to $h(x)$. There is only one path, so we get
\[
\frac{dh}{dx}(x) = \frac{df}{dg}(g(x)) \cdot \frac{dg}{dx}(x). \tag{3.4}
\]
For example, for the linear function $f(\vec{x}) = \vec{a}^\top \vec{x}$, the partial derivatives are
\[
\frac{\partial f}{\partial x_i}(\vec{x}) = \frac{\partial}{\partial x_i} \vec{a}^\top \vec{x} \tag{3.7}
\]
\[
= \frac{\partial}{\partial x_i} \sum_{j=1}^{n} a_j x_j \tag{3.8}
\]
\[
= \frac{\partial}{\partial x_i} (a_1 x_1 + \cdots + a_n x_n) \tag{3.9}
\]
\[
= a_i. \tag{3.10}
\]
Let us consider the case where the input $\vec{x}$ is not an independent vector but rather depends on another variable $t \in \mathbb{R}$, i.e., $\vec{x} \colon \mathbb{R} \to \mathbb{R}^n$ is a function. In this case the function $f(\vec{x}(t))$ has one independent input, which is $t$. If we are interested in finding the derivative of $f$ with respect to $t$, we can use a chain rule to do so.
Here we again review the computational graph perspective. There are now $n$ separate paths from $x$ to $h(x)$, one through each intermediate node $g_i$:
\[
x \longrightarrow (g_1, \ldots, g_n) \longrightarrow f \longrightarrow h(x)
\]
In this computational graph, one computes the derivative by summing across the $n$ paths from $x$ to $h(x)$, obtaining
\[
\frac{dh}{dx}(x) = \sum_{i=1}^{n} \frac{\partial f}{\partial g_i}(\vec{g}(x)) \cdot \frac{dg_i}{dx}(x). \tag{3.12}
\]
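A quick numerical check of this multi-path chain rule (Python/NumPy; the particular $f$ and $g_i$ are arbitrary smooth choices), comparing the formula against a finite-difference approximation:

```python
import numpy as np

# g: R -> R^2 and f: R^2 -> R, so h(x) = f(g(x)) is scalar-to-scalar.
def g(x):
    return np.array([np.sin(x), x**2])

def f(z):
    return z[0] * z[1] + z[1]**2

def h(x):
    return f(g(x))

x = 0.7
# Chain rule: dh/dx = sum_i (df/dg_i)(g(x)) * (dg_i/dx)(x).
z = g(x)
df_dg = np.array([z[1], z[0] + 2 * z[1]])  # partials of f at g(x)
dg_dx = np.array([np.cos(x), 2 * x])       # derivatives of each g_i
chain = df_dg @ dg_dx

eps = 1e-6
finite_diff = (h(x + eps) - h(x - eps)) / (2 * eps)
print(chain, finite_diff)  # should agree to ~1e-9
```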
3.1.2 Gradient
We will now use the definition of partial derivatives to introduce the gradient of multivariate functions.
Definition 51 (Gradient)
Let $f \colon \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. The gradient of $f$ is the function $\nabla f \colon \mathbb{R}^n \to \mathbb{R}^n$ defined by
\[
\nabla f(\vec{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\vec{x}) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\vec{x}) \end{bmatrix}. \tag{3.13}
\]
Note that the gradient is a column vector. The transpose of the gradient is (confusingly!) referred to as the derivative
of the function. We will now list two important geometric properties of the gradient. The first can be stated straight
away:
Proposition 52
Let ~x ∈ Rn . The gradient ∇f (~x) points in the direction of steepest ascent at ~x, i.e., the direction around ~x in
which f has the maximum rate of change. Furthermore, this rate of change is quantified by the norm k∇f (~x)k2 .
Proof. Let $\vec{u} \in \mathbb{R}^n$ be a unit vector (i.e., representing an arbitrary direction in $\mathbb{R}^n$). The rate of change of $f$ at $\vec{x}$ in direction $\vec{u}$ is $[\nabla f(\vec{x})]^\top \vec{u}$. Using the Cauchy-Schwarz inequality we can write
\[
[\nabla f(\vec{x})]^\top \vec{u} \leq \|\nabla f(\vec{x})\|_2 \|\vec{u}\|_2 = \|\nabla f(\vec{x})\|_2,
\]
so the maximum value that the expression can take is $\|\nabla f(\vec{x})\|_2$. Now it remains to show that this value is attained for the choice $\vec{u} = \frac{\nabla f(\vec{x})}{\|\nabla f(\vec{x})\|_2}$. Indeed,
\[
[\nabla f(\vec{x})]^\top \frac{\nabla f(\vec{x})}{\|\nabla f(\vec{x})\|_2} = \frac{\|\nabla f(\vec{x})\|_2^2}{\|\nabla f(\vec{x})\|_2} \tag{3.16}
\]
\[
= \|\nabla f(\vec{x})\|_2. \tag{3.17}
\]
Proposition 54
Let ~x ∈ Rn and suppose f (~x) = α. Then ∇f (~x) is orthogonal to the hyperplane which is tangent at ~x to the
α-level set of f .
Example 55 (Gradient of the Squared $\ell_2$ Norm). In this example we will compute and visualize the gradient of the function $f(\vec{x}) = \|\vec{x}\|_2^2$ where $\vec{x} \in \mathbb{R}^2$. We have
\[
\nabla f(\vec{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\vec{x}) \\ \frac{\partial f}{\partial x_2}(\vec{x}) \end{bmatrix} \tag{3.21}
\]
\[
= \begin{bmatrix} \frac{\partial}{\partial x_1}(x_1^2 + x_2^2) \\ \frac{\partial}{\partial x_2}(x_1^2 + x_2^2) \end{bmatrix} \tag{3.22}
\]
\[
= \begin{bmatrix} 2x_1 \\ 2x_2 \end{bmatrix} \tag{3.23}
\]
\[
= 2\vec{x}. \tag{3.24}
\]
This function has a paraboloid-shaped graph, as shown in Figure 3.3a. Let us now find the $\alpha$-level set of the function $f$ for some constant $\alpha$:
\[
\alpha = f(\vec{x}) \tag{3.25}
\]
\[
= x_1^2 + x_2^2. \tag{3.26}
\]
For a given $\alpha \geq 0$, the $\alpha$-level set is a circle centered at the origin which has radius $\sqrt{\alpha}$. Now we evaluate the gradient at a few points on these level sets.
We plot the level sets for α = 1 and α = 4 and visualize the gradient directions in Figure 3.3b.
1. At each point on a given level set, the gradient vector at that point is orthogonal to the line tangent to the level
set at that point.
2. The length of the gradient vector increases as we move away from the origin. This means that the function gets
steeper in that direction (i.e., it changes more rapidly).
Example 56 (Gradient of Linear Function). In this example we will compute and comment on the gradient of the linear function $f \colon \mathbb{R}^n \to \mathbb{R}$ defined by $f(\vec{x}) = \vec{a}^\top \vec{x}$, where $\vec{a} \in \mathbb{R}^n$ is fixed. By the partial derivative computation above, $\nabla f(\vec{x}) = \vec{a}$. We make two observations:
1. The α-level sets of this function are the hyperplanes given by all ~x such that ~a> ~x = α. These hyperplanes have
normal vector ~a. Notice that the normal vector (which is orthogonal to all vectors in the hyperplane) is exactly
the gradient vector.
2. The gradient is constant, meaning that the function has a constant rate of change everywhere.
Example 57 (Gradient of the Quadratic Form). Let $A \in \mathbb{R}^{n \times n}$. In this example we will compute the gradient of the quadratic function $f \colon \mathbb{R}^n \to \mathbb{R}$ defined by $f(\vec{x}) = \vec{x}^\top A \vec{x}$. Fix $k \in \{1, \ldots, n\}$; we find the partial derivative with respect to $x_k$. We have
\[
f(\vec{x}) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j
\]
\[
= \sum_{j=1}^{n} A_{kj} x_k x_j + \sum_{\substack{i=1 \\ i \neq k}}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j
\]
\[
= \sum_{j=1}^{n} A_{kj} x_k x_j + \sum_{\substack{i=1 \\ i \neq k}}^{n} A_{ik} x_i x_k + \sum_{\substack{i=1 \\ i \neq k}}^{n} \sum_{\substack{j=1 \\ j \neq k}}^{n} A_{ij} x_i x_j
\]
\[
= A_{kk} x_k^2 + \sum_{\substack{j=1 \\ j \neq k}}^{n} A_{kj} x_k x_j + \sum_{\substack{i=1 \\ i \neq k}}^{n} A_{ik} x_i x_k + \sum_{\substack{i=1 \\ i \neq k}}^{n} \sum_{\substack{j=1 \\ j \neq k}}^{n} A_{ij} x_i x_j
\]
\[
= A_{kk} x_k^2 + x_k \sum_{\substack{j=1 \\ j \neq k}}^{n} A_{kj} x_j + x_k \sum_{\substack{i=1 \\ i \neq k}}^{n} A_{ik} x_i + \sum_{\substack{i=1 \\ i \neq k}}^{n} \sum_{\substack{j=1 \\ j \neq k}}^{n} A_{ij} x_i x_j.
\]
Only the first three terms depend on $x_k$, so
\[
\frac{\partial f}{\partial x_k}(\vec{x}) = 2 A_{kk} x_k + \sum_{\substack{j=1 \\ j \neq k}}^{n} A_{kj} x_j + \sum_{\substack{i=1 \\ i \neq k}}^{n} A_{ik} x_i = \sum_{j=1}^{n} A_{kj} x_j + \sum_{i=1}^{n} A_{ik} x_i = \left[ (A + A^\top) \vec{x} \right]_k.
\]
Thus computing the gradient by stacking the partial derivatives gives
\[
\nabla f(\vec{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\vec{x}) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\vec{x}) \end{bmatrix} \tag{3.43}
\]
\[
= \begin{bmatrix} [(A + A^\top)\vec{x}]_1 \\ \vdots \\ [(A + A^\top)\vec{x}]_n \end{bmatrix} \tag{3.44}
\]
\[
= (A + A^\top)\vec{x}. \tag{3.45}
\]
That was a very involved computation! Luckily, in the near future, we will see ways to simplify the process of computing
gradients.
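A short numerical check of this identity (Python/NumPy; the random non-symmetric $A$ and the test point are arbitrary), comparing $(A + A^\top)\vec{x}$ with finite differences of $f(\vec{x}) = \vec{x}^\top A \vec{x}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))  # deliberately not symmetric
x = rng.standard_normal(n)

def f(x):
    return x @ A @ x

grad_formula = (A + A.T) @ x

# Finite-difference gradient, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.max(np.abs(grad_formula - grad_fd)))  # ~1e-9
```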
3.1.3 Jacobian
We now have the tools to generalize the notion of derivatives to vector-valued functions f~ : Rn → Rm .
Definition 58 (Jacobian)
Let $\vec{f} \colon \mathbb{R}^n \to \mathbb{R}^m$ be a differentiable function. The Jacobian of $\vec{f}$ is the function $D\vec{f} \colon \mathbb{R}^n \to \mathbb{R}^{m \times n}$ defined as
\[
D\vec{f}(\vec{x}) = \begin{bmatrix} \nabla f_1(\vec{x})^\top \\ \vdots \\ \nabla f_m(\vec{x})^\top \end{bmatrix}
= \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(\vec{x}) & \cdots & \frac{\partial f_1}{\partial x_n}(\vec{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\vec{x}) & \cdots & \frac{\partial f_m}{\partial x_n}(\vec{x}) \end{bmatrix}. \tag{3.46}
\]
One big thing to note is that the Jacobian is different from the gradient! If f : Rn → R1 = R, then its Jacobian
Df : Rn → R1×n is a function which outputs a row vector. This row vector is the transpose of the gradient.
We can develop a nice and general chain rule with the Jacobian: if $\vec{g} \colon \mathbb{R}^n \to \mathbb{R}^p$ and $\vec{f} \colon \mathbb{R}^p \to \mathbb{R}^m$ are differentiable and $\vec{h}(\vec{x}) = \vec{f}(\vec{g}(\vec{x}))$, then
\[
D\vec{h}(\vec{x}) = D\vec{f}(\vec{g}(\vec{x})) \cdot D\vec{g}(\vec{x}).
\]
Here, as before, the notation $D\vec{f}(\vec{g}(\vec{x}))$ means that we compute $D\vec{f}$ and then evaluate it at the point $\vec{g}(\vec{x})$.
This is a broad chain rule which we can apply to many problems, but we must always remember that for a function
f : Rn → R, its Jacobian is the transpose of its gradient. One typical chain rule we can derive from the general one
follows below.
Example 61. In this example we will use the chain rule to compute the gradient of the function $h(\vec{x}) = \|A\vec{x} - \vec{y}\|_2^2$. It can be written as $h(\vec{x}) = f(\vec{g}(\vec{x}))$ where $f(\vec{x}) = \|\vec{x}\|_2^2$ and $\vec{g}(\vec{x}) = A\vec{x} - \vec{y}$. We have that $Df(\vec{x}) = [\nabla f(\vec{x})]^\top = 2\vec{x}^\top$ and $D\vec{g}(\vec{x}) = A$, so by the chain rule,
\[
Dh(\vec{x}) = Df(\vec{g}(\vec{x})) \cdot D\vec{g}(\vec{x}) = 2(A\vec{x} - \vec{y})^\top A,
\]
and therefore
\[
\nabla h(\vec{x}) = [Dh(\vec{x})]^\top = 2A^\top(A\vec{x} - \vec{y}),
\]
as desired. This gradient matches what would be computed if we had used the componentwise partial derivatives.
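This particular gradient drives every least-squares method in the course, so a quick check is worthwhile (Python/NumPy; random data is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))
y = rng.standard_normal(6)
x = rng.standard_normal(3)

def h(x):
    return np.sum((A @ x - y) ** 2)

grad_formula = 2 * A.T @ (A @ x - y)

eps = 1e-6
grad_fd = np.array([
    (h(x + eps * e) - h(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(grad_formula - grad_fd)))  # ~1e-9
```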
Finally, we again revisit the computational graph perspective. Here, the graph has layers of nodes,
\[
\vec{x} = (x_1, \ldots, x_n) \longrightarrow (g_1, \ldots, g_p) \longrightarrow (f_1, \ldots, f_m) \longrightarrow \vec{h}(\vec{x}),
\]
with an edge from each node in one layer to each node in the next. Summing over all paths from $x_i$ to $f_k$ gives
\[
[D\vec{h}(\vec{x})]_{k,i} = \frac{\partial h_k}{\partial x_i}(\vec{x}) \tag{3.53}
\]
\[
= \sum_{j=1}^{p} \frac{\partial f_k}{\partial g_j}(\vec{g}(\vec{x})) \cdot \frac{\partial g_j}{\partial x_i}(\vec{x}) \tag{3.54}
\]
\[
= \sum_{j=1}^{p} [D\vec{f}(\vec{g}(\vec{x}))]_{k,j} \cdot [D\vec{g}(\vec{x})]_{j,i}, \tag{3.55}
\]
which is exactly the $(k, i)$ entry of the matrix product $D\vec{f}(\vec{g}(\vec{x})) \cdot D\vec{g}(\vec{x})$.
Example 62 (Neural Networks and Backpropagation). In this example, we will develop a basic understanding of what
deep neural networks are and how to train them via gradient-based optimization methods, such as those we will study
in this class.
The most basic class of neural networks are "multi-layer perceptrons," or functions of the form $\vec{f} \colon \mathbb{R}^n \times \Theta \to \mathbb{R}^p$, where $\Theta$ is the so-called "parameter space", which have the form
\[
\vec{f}(\vec{x}, \boldsymbol{\theta}) = W^{(m)} \vec{\sigma}^{(m)}(W^{(m-1)}(\cdots \vec{\sigma}^{(1)}(W^{(0)}\vec{x} + \vec{b}^{(0)}) \cdots) + \vec{b}^{(m-1)}) + \vec{b}^{(m)} \tag{3.57}
\]
\[
\text{where } \boldsymbol{\theta} = (W^{(0)}, \vec{b}^{(0)}, W^{(1)}, \vec{b}^{(1)}, W^{(2)}, \vec{b}^{(2)}, \ldots, W^{(m)}, \vec{b}^{(m)}) \in \Theta, \tag{3.58}
\]
and ~σ (1) , . . . , ~σ (m) are “activation functions,” which we treat as generic differentiable nonlinear functions. Here θ is
bolded because it should not be thought of as a scalar, vector, or matrix, but rather as a collection of matrices and
vectors, an object we haven’t discussed so far in this class.
More precisely, if we define the functions $(\vec{z}^{(i)})_{i=0}^{m}$ and $(\vec{f}^{(i)})_{i=0}^{m}$ by
\[
\vec{z}^{(0)}(\vec{x}, \boldsymbol{\theta}) = \vec{f}^{(0)}(\vec{x}, \boldsymbol{\theta}) = W^{(0)}\vec{x} + \vec{b}^{(0)}, \tag{3.59}
\]
\[
\vec{z}^{(i)}(\vec{x}, \boldsymbol{\theta}) = \vec{f}^{(i)}(\vec{z}^{(i-1)}(\vec{x}, \boldsymbol{\theta}), \boldsymbol{\theta}) = W^{(i)} \vec{\sigma}^{(i)}(\vec{z}^{(i-1)}) + \vec{b}^{(i)}, \quad \forall i \in \{1, \ldots, m\}, \tag{3.60}
\]
then $\vec{z}^{(m)}(\vec{x}, \boldsymbol{\theta}) = \vec{f}(\vec{x}, \boldsymbol{\theta})$. \; (3.61)
When we say that we “train” this neural network, we mean just that we optimize θ to minimize some loss function
evaluated on the data. Given a (differentiable) loss function ` : Rp × Rp → R and a data point (~x, ~y ) ∈ Rn × Rp , we
evaluate the loss on this data point as `(f~(~x, θ), ~y ). The true “empirical loss” across a batch of q data points (~xi , ~yi ) is
q
. X ~
L(θ) = `(f (~xi , θ), ~yi ), (3.62)
i=1
We now explore how to compute ∇L(θ). In this situation, we are only differentiating with respect to θ, so we fix ~x
and ~y , which gives the following computational graph:¹
[Computational graph: $\boldsymbol{\theta}$ and $\vec{x}$ feed into $\vec{z}^{(0)}, \ldots, \vec{z}^{(m)}$ in sequence, and $\vec{z}^{(m)}$ feeds into the loss $L(\boldsymbol{\theta})$.]
Since we have not defined gradients with respect to matrices (yet), we will focus on computing ∇~b(i) L(θ), i.e., the
gradient of the loss function L with respect to ~b(i) , as well as ∇~z(i) L(θ), i.e., the gradient of the loss function L with
respect to ~z(i) (~x, θ). They may be interpreted as sub-vectors (i.e., literally subsets of the entries) of the full gradient
∇L(θ) = ∇θ L(θ). This subscript notation is common in more involved problems which demand derivatives with
respect to multiple vector-valued variables.
We first compute $\nabla_{\vec{z}^{(m)}} L(\boldsymbol{\theta})$. We have
\[
\nabla_{\vec{z}^{(m)}} L(\boldsymbol{\theta}) = [\nabla_{\text{first argument}} \, \ell](\vec{z}^{(m)}(\vec{x}, \boldsymbol{\theta}), \vec{y}).
\]
Indeed we cannot simplify this further, it being just the gradient of ` in its first argument.
Now we handle general $i \in \{1, \ldots, m\}$. By the chain rule, we have the recursion
\[
\nabla_{\vec{z}^{(i-1)}} L(\boldsymbol{\theta}) = [D_{\vec{z}^{(i-1)}} \vec{z}^{(i)}]^\top [\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})]. \tag{3.71}
\]
This gives us
\[
D_{\vec{z}^{(i-1)}} \vec{z}^{(i)} = W^{(i)} \cdot D\vec{\sigma}^{(i)}(\vec{z}^{(i-1)}), \tag{3.72}
\]
and so
\[
\nabla_{\vec{z}^{(i-1)}} L(\boldsymbol{\theta}) = [W^{(i)} \cdot D\vec{\sigma}^{(i)}(\vec{z}^{(i-1)})]^\top [\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})]. \tag{3.73}
\]
To see the true value of this computation, we now compute $\nabla_{\vec{b}^{(i)}} L(\boldsymbol{\theta})$ for each $i \in \{0, \ldots, m\}$. By another application of the chain rule, we obtain
\[
\nabla_{\vec{b}^{(i)}} L(\boldsymbol{\theta}) = [D_{\vec{b}^{(i)}} \vec{z}^{(i)}]^\top [\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})] = \nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta}),
\]
since $\vec{z}^{(i)}$ depends on $\vec{b}^{(i)}$ additively, i.e., $D_{\vec{b}^{(i)}} \vec{z}^{(i)} = I$.
Now to do automatic differentiation by backpropagation, your favorite machine learning framework (such as Py-
Torch) computes all the gradients ∇~z(i) L(θ), starting at i = m and recursing until i = 0. Computing the gradients in
this order saves a lot of work, since each derivative is only computed once — this is the main idea of backpropagation.
Once these derivatives are computed, one can use them to compute ∇~b(i) L(θ). If we can also compute ∇W (i) L(θ),
then we can compute ∇θ L(θ), and thus will be able to optimize L via gradient-based optimization algorithms such as
gradient descent, as discussed later in the class. However, this computation will have to wait until slightly later when
we cover matrix calculus.
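The recursions above translate almost line-for-line into code. Below is a minimal sketch (plain NumPy rather than PyTorch; the two-layer sizes, the $\tanh$ activation, and the squared-error loss are arbitrary illustrative choices) of the backward pass for the bias and weight gradients, checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(5)
n, hidden, p = 3, 4, 2
W0, b0 = rng.standard_normal((hidden, n)), rng.standard_normal(hidden)
W1, b1 = rng.standard_normal((p, hidden)), rng.standard_normal(p)
x, y = rng.standard_normal(n), rng.standard_normal(p)

def forward(W0, b0, W1, b1):
    z0 = W0 @ x + b0                 # z^(0)
    z1 = W1 @ np.tanh(z0) + b1       # z^(1) = f(x, theta), here m = 1
    loss = np.sum((z1 - y) ** 2)     # l(z1, y) = ||z1 - y||^2
    return z0, z1, loss

z0, z1, _ = forward(W0, b0, W1, b1)

# Backward pass: gradient wrt z^(1), then recurse down to z^(0).
g1 = 2 * (z1 - y)                          # gradient of l in its first argument
g0 = (W1 * (1 - np.tanh(z0) ** 2)).T @ g1  # [W^(1) D sigma(z^(0))]^T g1

grad_b1, grad_b0 = g1, g0              # gradients wrt biases equal those wrt z
grad_W1 = np.outer(g1, np.tanh(z0))    # previews Example 84's weight formula
grad_W0 = np.outer(g0, x)

# Finite-difference check on one entry of b0.
eps = 1e-6
b0p, b0m = b0.copy(), b0.copy()
b0p[0] += eps; b0m[0] -= eps
fd = (forward(W0, b0p, W1, b1)[2] - forward(W0, b0m, W1, b1)[2]) / (2 * eps)
print(grad_b0[0], fd)  # should agree
```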
3.1.4 Hessian
So far, we have appropriately generalized the notion of first derivative to vector-valued functions, i.e., functions f~ : Rn →
Rm . We now turn to doing the same with second derivatives.
It turns out that for general vector-valued functions $\vec{f}$, defining a second derivative is possible, but such an object lives in $\mathbb{R}^{m \times n \times n}$ and thus is hard to work with using the linear algebra we have discussed in this class. However,
for multivariate functions f : Rn → R, defining this second derivative as a particular matrix becomes possible; this
matrix, called the Hessian, has great conceptual and computational importance.
Recall that to find the second derivative of a scalar-valued function, we merely take the derivative of the derivative.
Our notion of the gradient suffices as a first derivative; to take the derivative of this, we need to use the Jacobian.
Indeed, the Hessian is exactly the Jacobian of the gradient, and defined precisely below.
Definition 63 (Hessian)
Let $f \colon \mathbb{R}^n \to \mathbb{R}$ be twice differentiable. The Hessian of $f$ is the function $\nabla^2 f \colon \mathbb{R}^n \to \mathbb{R}^{n \times n}$ defined by
\[
\nabla^2 f(\vec{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(\vec{x}) & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_1}(\vec{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_1 \partial x_n}(\vec{x}) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(\vec{x}) \end{bmatrix}.
\]
It turns out that under some mild conditions, the Hessian is symmetric; this is called Clairaut's theorem.
It turns out that under some mild conditions, the Hessian is symmetric; this is called Clairaut’s theorem.
The vast majority of functions we work with in this course are twice continuously differentiable, with some notable exceptions (e.g., the $\ell_1$ norm is not even once-differentiable). Thus, in most cases, the Hessian is symmetric.
Example 65 (Hessian of Squared $\ell_2$ Norm). In this example we will compute the Hessian of the function $f(\vec{x}) = \|\vec{x}\|_2^2$. From Example 55, $\nabla f(\vec{x}) = 2\vec{x}$, so the Hessian is the Jacobian of $\vec{x} \mapsto 2\vec{x}$, namely $\nabla^2 f(\vec{x}) = 2I$.
Example 66. In this example we will compute the gradient and Hessian of the function $h(\vec{x}) = \log(1 + \|\vec{x}\|_2^2)$. Let $f \colon \mathbb{R} \to \mathbb{R}$ be defined by $f(x) = \log(1 + x)$, and $g \colon \mathbb{R}^n \to \mathbb{R}$ be defined as $g(\vec{x}) = \|\vec{x}\|_2^2$. Then $Df = f'$ is the derivative of $f$, and $\nabla g(\vec{x}) = 2\vec{x}$ by previous examples. By the Jacobian chain rule, we have
\[
\nabla h(\vec{x}) = f'(g(\vec{x})) \cdot \nabla g(\vec{x}) = \frac{2\vec{x}}{1 + \|\vec{x}\|_2^2}. \tag{3.89}
\]
We compute the Jacobian of this gradient, hence the desired Hessian, componentwise, and obtain
\[
[\nabla^2 h(\vec{x})]_{j,k} = \left[ D\left( \frac{2\vec{x}}{1 + \|\vec{x}\|_2^2} \right) \right]_{j,k} \tag{3.92}
\]
\[
= \frac{\partial}{\partial x_k} \left( \frac{2 x_j}{1 + \|\vec{x}\|_2^2} \right) \tag{3.93}
\]
\[
= 2 \cdot \frac{(1 + \|\vec{x}\|_2^2) \frac{\partial}{\partial x_k}(x_j) - x_j \frac{\partial}{\partial x_k}(1 + \|\vec{x}\|_2^2)}{(1 + \|\vec{x}\|_2^2)^2} \tag{3.94}
\]
\[
= 2 \cdot \frac{(1 + \|\vec{x}\|_2^2) \frac{\partial}{\partial x_k}(x_j) - x_j \frac{\partial}{\partial x_k}(\|\vec{x}\|_2^2)}{(1 + \|\vec{x}\|_2^2)^2} \tag{3.95}
\]
\[
= 2 \cdot \frac{(1 + \|\vec{x}\|_2^2) \frac{\partial x_j}{\partial x_k} - 2 x_j x_k}{(1 + \|\vec{x}\|_2^2)^2} \tag{3.96}
\]
\[
= -\frac{4 x_j x_k}{(1 + \|\vec{x}\|_2^2)^2} + \frac{2}{1 + \|\vec{x}\|_2^2} \cdot \frac{\partial x_j}{\partial x_k} \tag{3.97}
\]
\[
= -\frac{4 x_j x_k}{(1 + \|\vec{x}\|_2^2)^2} + \begin{cases} \frac{2}{1 + \|\vec{x}\|_2^2}, & \text{if } j = k \\ 0, & \text{if } j \neq k. \end{cases} \tag{3.98}
\]
This gives
\[
[\nabla^2 h(\vec{x})]_{j,k} = -\frac{4 x_j x_k}{(1 + \|\vec{x}\|_2^2)^2}, \quad \forall j \neq k \tag{3.99}
\]
\[
[\nabla^2 h(\vec{x})]_{jj} = \frac{2}{1 + \|\vec{x}\|_2^2} - \frac{4 x_j^2}{(1 + \|\vec{x}\|_2^2)^2}, \quad \forall j. \tag{3.100}
\]
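Formulas like these are easy to mistype, so a finite-difference check is worthwhile; a sketch (Python/NumPy, arbitrary test point):

```python
import numpy as np

def hessian_formula(x):
    q = 1 + np.sum(x**2)
    return 2 * np.eye(len(x)) / q - 4 * np.outer(x, x) / q**2

def grad(x):  # gradient 2x / (1 + ||x||^2) from (3.89)
    return 2 * x / (1 + np.sum(x**2))

rng = np.random.default_rng(6)
x = rng.standard_normal(3)

# Central finite differences of the gradient, one column per coordinate.
eps = 1e-6
H_fd = np.column_stack([
    (grad(x + eps * e) - grad(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(hessian_formula(x) - H_fd)))  # ~1e-9
```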
3.2 Taylor's Theorems
The degree-$k$ Taylor approximation of a $k$-times differentiable function $f \colon \mathbb{R} \to \mathbb{R}$ around a point $x_0$ is
\[
\hat{f}_k(x; x_0) = f(x_0) + \frac{1}{1!} \frac{df}{dx}(x_0) \cdot (x - x_0) + \cdots + \frac{1}{k!} \frac{d^k f}{dx^k}(x_0) \cdot (x - x_0)^k \tag{3.102}
\]
\[
= \sum_{i=0}^{k} \frac{1}{i!} \frac{d^i f}{dx^i}(x_0) \cdot (x - x_0)^i. \tag{3.103}
\]
In particular, the first two approximations are
\[
\hat{f}_1(x; x_0) = f(x_0) + \frac{df}{dx}(x_0) \cdot (x - x_0) \tag{3.104}
\]
\[
\hat{f}_2(x; x_0) = f(x_0) + \frac{df}{dx}(x_0) \cdot (x - x_0) + \frac{1}{2} \frac{d^2 f}{dx^2}(x_0) \cdot (x - x_0)^2. \tag{3.105}
\]
We will derive multivariable versions of these approximations later.
Example 68 (Taylor Approximation of Cubic Function). Let us approximate the function f (x) = x3 around the fixed
point x0 = 1 using Taylor approximations of different degrees.
\[
\hat{f}_1(x; 1) = f(x_0) + \frac{df}{dx}(x_0) \cdot (x - x_0) \tag{3.106}
\]
\[
= x_0^3 + 3x_0^2 \cdot (x - x_0) \tag{3.107}
\]
\[
= 1^3 + 3 \cdot 1^2 \cdot (x - 1) \tag{3.108}
\]
\[
= 3(x - 1) + 1 \tag{3.109}
\]
\[
= 3x - 2. \tag{3.110}
\]
\[
\hat{f}_2(x; 1) = \hat{f}_1(x; 1) + \frac{1}{2} \frac{d^2 f}{dx^2}(x_0) \cdot (x - x_0)^2 \tag{3.111}
\]
\[
= 3x - 2 + 3 \cdot 1 \cdot (x - 1)^2 \tag{3.112}
\]
\[
= 3x^2 - 3x + 1. \tag{3.113}
\]
\[
\hat{f}_3(x; 1) = \hat{f}_2(x; 1) + \frac{1}{6} \frac{d^3 f}{dx^3}(x_0) \cdot (x - x_0)^3 \tag{3.114}
\]
\[
= 3x^2 - 3x + 1 + (x - 1)^3 \tag{3.115}
\]
\[
= x^3. \tag{3.116}
\]
• The first-order Taylor approximation $\hat{f}_1(\cdot\,; x_0)$ is the best linear approximation to $f$ around $x = x_0 = 1$. In particular, its graph is the tangent line to the graph of $f$ at the point $(x_0, f(x_0)) = (1, 1)$, as observed in Figure 3.6.
• The second-order Taylor approximation $\hat{f}_2(\cdot\,; x_0)$ is the best quadratic approximation to $f$ around $x = x_0 = 1$. It is the parabola whose graph passes through the point $(x_0, f(x_0)) = (1, 1)$, as observed in Figure 3.6, and it has the same first and second derivatives as $f$ at $x_0$. Using the intuition that the second derivative models curvature, we see that the second-order Taylor approximation captures the local curvature of the graph of the function. This intuition will be helpful later when discussing convexity.
• The third-degree Taylor approximation $\hat{f}_3(\cdot\,; x_0)$ is the best cubic approximation to $f$; because $f$ is itself a cubic function, the best cubic approximation is just $f$, and indeed we have $\hat{f}_3(\cdot\,; x_0) = f$.
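A short check of these three polynomials (Python), confirming that the approximation error near $x_0 = 1$ shrinks at the rate Taylor's theorem predicts:

```python
f = lambda x: x**3
f1 = lambda x: 3 * x - 2             # degree-1 Taylor approximation at x0 = 1
f2 = lambda x: 3 * x**2 - 3 * x + 1  # degree-2
f3 = lambda x: x**3                  # degree-3 recovers f exactly

for delta in [1e-1, 1e-2, 1e-3]:
    x = 1 + delta
    # Errors scale like delta^2, delta^3, and 0 respectively.
    print(delta, abs(f(x) - f1(x)), abs(f(x) - f2(x)), abs(f(x) - f3(x)))
```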
Figure 3.6: First and second degree Taylor approximations of the function f (x) = x3 .
Taylor approximation gives us the degree k polynomial that approximates the function f (x) around the fixed point
x = x0 . Taylor’s theorem quantifies the bounds for the error of this approximation.
Taylor's theorem states: if $f$ is $k$-times continuously differentiable, then
\[
f(x) = \hat{f}_k(x; x_0) + o(|x - x_0|^k), \tag{3.117}
\]
where the term $o(|x - x_0|^k)$ (i.e., the remainder) denotes a function, say $R_k(x; x_0)$, such that
\[
\lim_{x \to x_0} \frac{R_k(x; x_0)}{|x - x_0|^k} = 0. \tag{3.118}
\]
We use this remainder notation because we don’t really care about what it is precisely, only its limiting behavior as
x → x0 , and the little-o notation allows us to not worry too much about the exact form of the remainder.
This theorem certifies that the Taylor approximations $\hat{f}_k$ are good approximations to $f$. Another way to write this result is often more useful and simpler:
\[
f(x + \delta) = \underbrace{f(x) + \frac{df}{dx}(x) \cdot \delta}_{= \hat{f}_1(x + \delta;\, x)} + \, o(|\delta|) \tag{3.119}
\]
\[
= \underbrace{f(x) + \frac{df}{dx}(x) \cdot \delta + \frac{1}{2} \frac{d^2 f}{dx^2}(x) \cdot \delta^2}_{= \hat{f}_2(x + \delta;\, x)} + \, o(\delta^2) \tag{3.120}
\]
\[
= \cdots. \tag{3.121}
\]
We will never need to quantitatively work with the remainder in this course; we will usually write $f \approx \hat{f}_k$ and leave it at that.
• If $f$ is continuously differentiable, then its first-order Taylor approximation around $\vec{x}_0$ is the function $\hat{f}_1(\cdot\,; \vec{x}_0) \colon \mathbb{R}^n \to \mathbb{R}$ given by
\[
\hat{f}_1(\vec{x}; \vec{x}_0) = f(\vec{x}_0) + [\nabla f(\vec{x}_0)]^\top (\vec{x} - \vec{x}_0). \tag{3.122}
\]
• If $f$ is twice continuously differentiable, then its second-order Taylor approximation around $\vec{x}_0$ is the function $\hat{f}_2(\cdot\,; \vec{x}_0) \colon \mathbb{R}^n \to \mathbb{R}$ given by
\[
\hat{f}_2(\vec{x}; \vec{x}_0) = f(\vec{x}_0) + [\nabla f(\vec{x}_0)]^\top (\vec{x} - \vec{x}_0) + \frac{1}{2} (\vec{x} - \vec{x}_0)^\top [\nabla^2 f(\vec{x}_0)] (\vec{x} - \vec{x}_0). \tag{3.123}
\]
The graph of the first-order Taylor approximation is the hyperplane tangent to the graph of f at the point (~x0 , f (~x0 )).
This hyperplane has normal vector ∇f (~x0 ).
We could define higher-order Taylor approximations fbk , but to express them concisely would require generalizations
of matrices, called tensors. For example, the third derivative of a function f : Rn → R is a rank-3 tensor, i.e., an object
which lives in Rn×n×n . These are out of scope for this course, and anyways we will only need the first two derivatives.
We can also state an analogous Taylor’s theorem.
We can re-write this result in the following, more useful, way for $k = 1$ and $k = 2$:
\[
f(\vec{x} + \vec{\delta}) = \underbrace{f(\vec{x}) + [\nabla f(\vec{x})]^\top \vec{\delta}}_{= \hat{f}_1(\vec{x} + \vec{\delta};\, \vec{x})} + \, o(\|\vec{\delta}\|_2)
\]
\[
= \underbrace{f(\vec{x}) + [\nabla f(\vec{x})]^\top \vec{\delta} + \frac{1}{2} \vec{\delta}^\top [\nabla^2 f(\vec{x})] \vec{\delta}}_{= \hat{f}_2(\vec{x} + \vec{\delta};\, \vec{x})} + \, o(\|\vec{\delta}\|_2^2)
\]
\[
= \cdots. \tag{3.127}
\]
Example 72 (Taylor Approximation of the Squared $\ell_2$ Norm). In this example we will compute and visualize the first and second degree Taylor approximations of the squared $\ell_2$ norm function $f(\vec{x}) = \|\vec{x}\|_2^2$ for $\vec{x} \in \mathbb{R}^2$ around the vector $\vec{x} = \vec{x}_0$. First recall the gradient and Hessian of the function, which are computed in Examples 55 and 65, respectively: $\nabla f(\vec{x}) = 2\vec{x}$ and $\nabla^2 f(\vec{x}) = 2I$. Then
\[
\hat{f}_1(\vec{x}; \vec{x}_0) = f(\vec{x}_0) + [\nabla f(\vec{x}_0)]^\top (\vec{x} - \vec{x}_0) \tag{3.128}
\]
\[
= \|\vec{x}_0\|_2^2 + [2\vec{x}_0]^\top (\vec{x} - \vec{x}_0) \tag{3.129}
\]
\[
= 2\vec{x}_0^\top \vec{x} - \|\vec{x}_0\|_2^2. \tag{3.130}
\]
We plot this function in Figure 3.7 and notice that the graph of the first order approximation is the plane tangent
to the paraboloid at the point (1, 0, f (1, 0)) = (1, 0, 1).
Similarly, for the second-order approximation,
\[
\hat{f}_2(\vec{x}; \vec{x}_0) = f(\vec{x}_0) + [\nabla f(\vec{x}_0)]^\top (\vec{x} - \vec{x}_0) + \frac{1}{2} (\vec{x} - \vec{x}_0)^\top (2I) (\vec{x} - \vec{x}_0)
\]
\[
= 2\vec{x}_0^\top \vec{x} - \vec{x}_0^\top \vec{x}_0 + (\vec{x} - \vec{x}_0)^\top (\vec{x} - \vec{x}_0) \tag{3.134}
\]
\[
= 2\vec{x}_0^\top \vec{x} - \vec{x}_0^\top \vec{x}_0 + \vec{x}^\top \vec{x} - \vec{x}_0^\top \vec{x} - \vec{x}^\top \vec{x}_0 + \vec{x}_0^\top \vec{x}_0 \tag{3.135}
\]
\[
= 2\vec{x}_0^\top \vec{x} - \vec{x}_0^\top \vec{x}_0 + \vec{x}^\top \vec{x} - 2\vec{x}_0^\top \vec{x} + \vec{x}_0^\top \vec{x}_0 \tag{3.136}
\]
\[
= \vec{x}^\top \vec{x} \tag{3.137}
\]
\[
= \|\vec{x}\|_2^2. \tag{3.138}
\]
Thus $\hat{f}_2 = f$ independently of the choice of $\vec{x}_0$, which makes sense since $f$ is a quadratic function.

Figure 3.7: First degree Taylor approximation of the function $f(\vec{x}) = \|\vec{x}\|_2^2$.
Example 73. We can also compute gradients using Taylor's theorem by pattern matching; this is sometimes much neater than taking componentwise gradients. At first glance this seems circular, but we will see how it is possible. Take for example the function $f \colon \mathbb{R}^n \to \mathbb{R}$ given by $f(\vec{x}) = \vec{x}^\top A \vec{x}$. We can perturb $f$ around $\vec{x}$ to obtain
\[
f(\vec{x} + \vec{\delta}) = (\vec{x} + \vec{\delta})^\top A (\vec{x} + \vec{\delta}) = \vec{x}^\top A \vec{x} + [(A + A^\top)\vec{x}]^\top \vec{\delta} + \vec{\delta}^\top A \vec{\delta}.
\]
Pattern matching against the second-order expansion $f(\vec{x}) + [\nabla f(\vec{x})]^\top \vec{\delta} + \frac{1}{2} \vec{\delta}^\top [\nabla^2 f(\vec{x})] \vec{\delta}$ gives $\nabla f(\vec{x}) = (A + A^\top)\vec{x}$ and $\nabla^2 f(\vec{x}) = A + A^\top$; for the latter, note that
\[
\vec{\delta}^\top (2A) \vec{\delta} = \vec{\delta}^\top A \vec{\delta} + \vec{\delta}^\top A \vec{\delta} = \vec{\delta}^\top A \vec{\delta} + (\vec{\delta}^\top A \vec{\delta})^\top = \vec{\delta}^\top A \vec{\delta} + \vec{\delta}^\top A^\top \vec{\delta} = \vec{\delta}^\top (A + A^\top) \vec{\delta}. \tag{3.144}
\]
We conclude by introducing a more general version of a first-order Taylor approximation, a corresponding Taylor’s
theorem, and giving an example of when it is useful.
Again, higher-order approximations will require higher-order derivatives, which requires tensors.
Example 76. Taylor’s theorem can be used to compute gradients by pattern matching, even when the function is
not linear or quadratic. For instance, we now use it to derive the chain rule (albeit with stronger assumptions on the
functions). Let f~ : Rp → Rm and ~g : Rn → Rp be continuously differentiable. Let ~h : Rn → Rm be defined as
$\vec{h}(\vec{x}) = \vec{f}(\vec{g}(\vec{x}))$. Then we compute $\vec{h}$ on a perturbation around $\vec{x}$ and expand:
\[
\vec{h}(\vec{x} + \vec{\delta}) = \vec{f}(\vec{g}(\vec{x} + \vec{\delta})) = \vec{f}\big(\vec{g}(\vec{x}) + [D\vec{g}(\vec{x})]\vec{\delta} + o(\|\vec{\delta}\|_2)\big) = \vec{f}(\vec{g}(\vec{x})) + [D\vec{f}(\vec{g}(\vec{x}))][D\vec{g}(\vec{x})]\vec{\delta} + o(\|\vec{\delta}\|_2).
\]
The first Taylor expansion is an expansion of ~g around the point ~x with perturbation ~δ; the second Taylor expansion is
an expansion of f~ around the point g(~x) with perturbation [D~g (~x)]~δ.
Meanwhile, Taylor’s theorem says that
As a last practical note, remembering the formula for Taylor approximations helps us confirm our understanding of
the dimensions of each vector. For instance, every term should multiply to a scalar. This makes it simpler to remember
that, for a function f : Rn → R, the gradient ∇f outputs column vectors in Rn , the Hessian ∇2 f outputs square
matrices in Rn×n , etc.
3.3 The Main Theorem

Theorem (First-Order Necessary Condition for Optimality). Let $\Omega \subseteq \mathbb{R}^n$ be an open set and let $f \colon \Omega \to \mathbb{R}$ be differentiable. If $\vec{x}^\star$ is an optimal solution of $\min_{\vec{x} \in \Omega} f(\vec{x})$, then $\nabla f(\vec{x}^\star) = \vec{0}$.

This theorem gives a necessary condition for a point to be an optimal solution of this optimization problem. It says that any point that is optimal must necessarily have gradient equal to zero.
Proof. We prove this for scalar functions f : R → R only; the vector case is a bit more complicated and is left as an
exercise.
Using the Taylor approximation of the function around the optimal point:
\[
f(x) = f(x^\star) + \frac{df}{dx}(x^\star) \cdot (x - x^\star) + o(|x - x^\star|). \tag{3.156}
\]
Since $f(x^\star) \leq f(x)$ for all $x \in \Omega$, we have
\[
f(x) \leq f(x) + \frac{df}{dx}(x^\star) \cdot (x - x^\star) + o(|x - x^\star|) \tag{3.157}
\]
\[
\implies 0 \leq \frac{df}{dx}(x^\star) \cdot (x - x^\star) + o(|x - x^\star|). \tag{3.158}
\]
Since $\Omega$ is an open set, there exists some ball of positive radius $r > 0$ around $x^\star$ such that $B_r(x^\star) \subseteq \Omega$. Formally,
\[
B_r(x^\star) := \{x \in \mathbb{R} \mid |x - x^\star| < r\} \subseteq \Omega. \tag{3.159}
\]
Let us partition $B_r(x^\star)$ into $B_+$, the set of all $x \in B_r(x^\star)$ such that $x - x^\star \geq 0$, and $B_-$, the set of all $x \in B_r(x^\star)$ such that $x - x^\star < 0$.
For all $x \in B_+$, we have
\[
0 \leq \frac{df}{dx}(x^\star) \cdot (x - x^\star) + o(|x - x^\star|) \tag{3.160}
\]
\[
= \frac{df}{dx}(x^\star) \cdot |x - x^\star| + o(|x - x^\star|) \tag{3.161}
\]
\[
\implies 0 \leq \frac{df}{dx}(x^\star) + \frac{o(|x - x^\star|)}{|x - x^\star|}. \tag{3.162}
\]
Taking the limit as $x \to x^\star$ within $B_+$,
\[
0 \leq \lim_{\substack{x \to x^\star \\ x \in B_+}} \left[ \frac{df}{dx}(x^\star) + \frac{o(|x - x^\star|)}{|x - x^\star|} \right] \tag{3.163}
\]
\[
= \frac{df}{dx}(x^\star) + \lim_{\substack{x \to x^\star \\ x \in B_+}} \frac{o(|x - x^\star|)}{|x - x^\star|} \tag{3.164}
\]
\[
= \frac{df}{dx}(x^\star). \tag{3.165}
\]
Thus we have $0 \leq \frac{df}{dx}(x^\star)$. On the other hand, for all $x \in B_-$, we have
\[
0 \leq \frac{df}{dx}(x^\star) \cdot (x - x^\star) + o(|x - x^\star|) \tag{3.166}
\]
\[
= -\frac{df}{dx}(x^\star) \cdot |x - x^\star| + o(|x - x^\star|) \tag{3.167}
\]
\[
\implies 0 \geq \frac{df}{dx}(x^\star) - \frac{o(|x - x^\star|)}{|x - x^\star|}. \tag{3.168}
\]
Taking the limit as $x \to x^\star$ within $B_-$,
\[
0 \geq \lim_{\substack{x \to x^\star \\ x \in B_-}} \left[ \frac{df}{dx}(x^\star) - \frac{o(|x - x^\star|)}{|x - x^\star|} \right] \tag{3.169}
\]
\[
= \frac{df}{dx}(x^\star) - \lim_{\substack{x \to x^\star \\ x \in B_-}} \frac{o(|x - x^\star|)}{|x - x^\star|} \tag{3.170}
\]
\[
= \frac{df}{dx}(x^\star). \tag{3.171}
\]
Thus we have $0 \geq \frac{df}{dx}(x^\star)$, and so $\frac{df}{dx}(x^\star) = 0$.
3.4 Directional Derivatives
The directional derivative of a function $f \colon \mathbb{R}^n \to \mathbb{R}$ at $\vec{x}$ in the direction $\vec{u}$ is
\[
Df(\vec{x})[\vec{u}] := \lim_{t \to 0} \frac{f(\vec{x} + t\vec{u}) - f(\vec{x})}{t}.
\]
If we know the directional derivative in every direction, we know the gradient; similarly, if we know the gradient, we know the directional derivative in every direction. The way to connect the two is given by the following proposition, whose proof is left as an exercise.

Proposition 79
Let $f \colon \mathbb{R}^n \to \mathbb{R}$ be differentiable, and fix $\vec{u} \in \mathbb{R}^n$ such that $\|\vec{u}\|_2 = 1$. Then
\[
Df(\vec{x})[\vec{u}] = [\nabla f(\vec{x})]^\top \vec{u}.
\]
In particular, $Df(\vec{x})[\vec{e}_i] = \frac{\partial f}{\partial x_i}(\vec{x})$.
3.5 (OPTIONAL) Matrix Calculus

Definition 80 (Gradient)
Let $f \colon \mathbb{R}^{m \times n} \to \mathbb{R}$ be differentiable. The gradient of $f$ is the function $\nabla f \colon \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ defined as
\[
\nabla f(X) = \begin{bmatrix} \frac{\partial f}{\partial X_{11}}(X) & \cdots & \frac{\partial f}{\partial X_{1n}}(X) \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{m1}}(X) & \cdots & \frac{\partial f}{\partial X_{mn}}(X) \end{bmatrix}. \tag{3.174}
\]
There exists a general chain rule for matrix-valued functions, which is provable by flattening out all matrices into
vectors and applying the vector chain rule.
As before, the notation $\frac{\partial F_{ij}}{\partial G_{ab}}$ means to take the derivative of the $ij$-th output of $F$ with respect to its $ab$-th input. A more specific version of this chain rule is given below for scalar-valued functions of matrices.
Proposition 82
Let $f \colon \mathbb{R}^{p \times q} \to \mathbb{R}$ and $G \colon \mathbb{R}^{m \times n} \to \mathbb{R}^{p \times q}$ be differentiable functions. Let $h \colon \mathbb{R}^{m \times n} \to \mathbb{R}$ be defined by $h(X) = f(G(X))$ for all $X \in \mathbb{R}^{m \times n}$. Then $h$ is differentiable, and for all $k, \ell$, we have
\[
\frac{\partial h}{\partial X_{k\ell}}(X) = \sum_{a} \sum_{b} [\nabla f(G(X))]_{ab} \cdot \frac{\partial G_{ab}}{\partial X_{k\ell}}(X). \tag{3.176}
\]
We also are able to define a first-order Taylor expansion without having to use tensor notation.
There is a corresponding Taylor’s theorem certifying the Taylor approximation accuracy, but we don’t state it here.
Finally, note that the general recipe for computing all quantities such as the gradient, Jacobian, and gradient matrix
is the same: consider each input component and each output component separately and organize their partial derivatives
in vector or matrix form with a standard layout.
Example 84 (Finishing Example 62). Now that we know how to take gradients with respect to matrices, we complete the example of Neural Networks and Backpropagation. Before reading the following, please review the lengthy setup of that example. We promised there a way to compute $\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta})$, or more precisely a way to compute $\nabla_{W^{(i)}} L(\boldsymbol{\theta})$. We now have the tools to do this using the chain rule. Recall that we have access to $\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})$ by backpropagation. Then we can compute the components of $\nabla_{W^{(i)}} L(\boldsymbol{\theta})$ by
\[
[\nabla_{W^{(i)}} L(\boldsymbol{\theta})]_{j,k} = \frac{\partial L}{\partial (W^{(i)})_{j,k}}(\boldsymbol{\theta}) \tag{3.178}
\]
\[
= \sum_{a} [\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})]_a \cdot \frac{\partial (\vec{z}^{(i)})_a}{\partial (W^{(i)})_{j,k}} \tag{3.179}
\]
\[
= \sum_{a} [\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})]_a \cdot \begin{cases} [\vec{\sigma}^{(i)}(\vec{z}^{(i-1)})]_k, & \text{if } a = j \text{ and } i \in \{1, \ldots, m\} \\ x_k, & \text{if } a = j \text{ and } i = 0 \\ 0, & \text{otherwise} \end{cases} \tag{3.180}
\]
\[
= \begin{cases} [\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})]_j \, [\vec{\sigma}^{(i)}(\vec{z}^{(i-1)})]_k, & \text{if } i \in \{1, \ldots, m\} \\ [\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})]_j \, [\vec{x}]_k, & \text{if } i = 0 \end{cases} \tag{3.181}
\]
\[
= \begin{cases} \left[ \{\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})\} \, \vec{\sigma}^{(i)}(\vec{z}^{(i-1)})^\top \right]_{j,k}, & \text{if } i \in \{1, \ldots, m\} \\ \left[ \{\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})\} \, \vec{x}^\top \right]_{j,k}, & \text{if } i = 0. \end{cases} \tag{3.182}
\]
This gives
\[
\nabla_{W^{(i)}} L(\boldsymbol{\theta}) = \begin{cases} [\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})] \, \vec{\sigma}^{(i)}(\vec{z}^{(i-1)})^\top, & \text{if } i \in \{1, \ldots, m\} \\ [\nabla_{\vec{z}^{(i)}} L(\boldsymbol{\theta})] \, \vec{x}^\top, & \text{if } i = 0. \end{cases} \tag{3.183}
\]
In combination with the expression for ∇~b(i) from Example 62, we can efficiently compute ∇θ L(θ), and are able to
train our neural network via gradient-based optimization methods such as gradient descent.
Chapter 4
• [2] Chapter 6.
4.1 Impact of Perturbations on Linear Regression

Consider an invertible matrix $A \in \mathbb{R}^{n \times n}$ and the linear system $A\vec{x} = \vec{y}$. If the measurement $\vec{y}$ is perturbed to $\vec{y} + \vec{\delta}_{\vec{y}}$, the solution is perturbed to $\vec{x} + \vec{\delta}_{\vec{x}}$, where $A\vec{\delta}_{\vec{x}} = \vec{\delta}_{\vec{y}}$, i.e., $\vec{\delta}_{\vec{x}} = A^{-1}\vec{\delta}_{\vec{y}}$. Thus
\[
\|\vec{\delta}_{\vec{x}}\|_2 = \|A^{-1}\vec{\delta}_{\vec{y}}\|_2 \leq \max_{\substack{\vec{z} \in \mathbb{R}^n \\ \|\vec{z}\|_2 = \|\vec{\delta}_{\vec{y}}\|_2}} \|A^{-1}\vec{z}\|_2 \tag{4.7}
\]
\[
= \max_{\substack{\vec{z} \in \mathbb{R}^n \\ \|\vec{z}\|_2 = 1}} \|A^{-1}\vec{z}\|_2 \, \|\vec{\delta}_{\vec{y}}\|_2 \tag{4.8}
\]
\[
= \|A^{-1}\|_2 \, \|\vec{\delta}_{\vec{y}}\|_2. \tag{4.9}
\]
In order to upper-bound the relative change $\frac{\|\vec{\delta}_{\vec{x}}\|_2}{\|\vec{x}\|_2}$, we also need to lower-bound $\|\vec{x}\|_2$. Applying the same matrix norm inequality to the original linear system $A\vec{x} = \vec{y}$ gives
\[
A\vec{x} = \vec{y} \tag{4.10}
\]
\[
\|A\vec{x}\|_2 = \|\vec{y}\|_2 \tag{4.11}
\]
\[
\|A\|_2 \|\vec{x}\|_2 \geq \|\vec{y}\|_2 \tag{4.12}
\]
\[
\|\vec{x}\|_2 \geq \frac{\|\vec{y}\|_2}{\|A\|_2}. \tag{4.13}
\]
Thus we’ve bounded the relative change in ~x by the relative change in ~y . If the relative change in ~y is small, then the
relative change in ~x will be small, and so on. But we’d like to say something more about kAk2 A−1 2
, and indeed we
can:
σ1 {A}
kAk2 A−1 2
= σ1 {A} · σ1 {A−1 } = , (4.16)
σn {A}
where again, σn {A} 6= 0 because A is invertible. This quantity
. σ1 {A}
κ(A) = (4.17)
σn {A}
is called the condition number of a matrix. In general, for non-invertible systems, this can be infinite, but has the same
definition.
. σ1 {A}
κ(A) = . (4.18)
σn {A}
If $\kappa(A)$ is large, then even a small relative change in our measurement $\vec{y}$ may result in a huge relative change in our variable $\vec{x}$. If $\kappa(A)$ is small (i.e., close to $1$), then small relative changes in $\vec{y}$ can only result in comparably small relative changes in $\vec{x}$.
It seems unlikely that in general, the equations that define our system will be square. Most likely we will have a
least-squares type tall system. But this is resolved by using the so-called normal equations to represent the least squares
solution:
\[
A^\top A \vec{x} = A^\top \vec{y}. \tag{4.19}
\]
The condition number of this linear system is $\kappa(A^\top A)$. Since $A^\top A$ is symmetric and positive semidefinite, its eigenvalues are also its singular values, and so we have
\[
\kappa(A^\top A) = \frac{\lambda_{\max}\{A^\top A\}}{\lambda_{\min}\{A^\top A\}} = \frac{\sigma_1\{A\}^2}{\sigma_n\{A\}^2} = \kappa(A)^2. \tag{4.20}
\]
That is, passing to the normal equations squares the condition number, which can make the least squares problem much more sensitive to perturbations.
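A small demonstration of this squaring effect (Python/NumPy; the matrix is an arbitrary ill-conditioned example built from prescribed singular values):

```python
import numpy as np

rng = np.random.default_rng(7)
# Build a matrix with prescribed singular values 100, 10, 1.
U, _ = np.linalg.qr(rng.standard_normal((50, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = U @ np.diag([100.0, 10.0, 1.0]) @ V.T

print(np.linalg.cond(A))        # ~100   = sigma_1 / sigma_3
print(np.linalg.cond(A.T @ A))  # ~10000 = kappa(A)^2
```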
4.2 Ridge Regression

One remedy is to regularize the least squares problem. The ridge regression problem, for a parameter $\lambda > 0$, is
\[
\min_{\vec{x} \in \mathbb{R}^n} \|A\vec{x} - \vec{y}\|_2^2 + \lambda \|\vec{x}\|_2^2, \tag{4.21}
\]
and its unique solution is given by
\[
\vec{x}^\star = (A^\top A + \lambda I)^{-1} A^\top \vec{y}. \tag{4.22}
\]
Proof. Let $f(\vec{x}) := \|A\vec{x} - \vec{y}\|_2^2 + \lambda\|\vec{x}\|_2^2$. By taking gradients, we get
\[
\nabla_{\vec{x}} f(\vec{x}) = \nabla_{\vec{x}} \left\{ \|A\vec{x} - \vec{y}\|_2^2 + \lambda \|\vec{x}\|_2^2 \right\} \tag{4.23}
\]
\[
= \nabla_{\vec{x}} \{ \vec{x}^\top A^\top A \vec{x} - 2\vec{y}^\top A \vec{x} + \vec{y}^\top \vec{y} + \lambda \vec{x}^\top \vec{x} \} \tag{4.24}
\]
\[
= 2A^\top A \vec{x} - 2A^\top \vec{y} + 2\lambda \vec{x} \tag{4.25}
\]
\[
= 2(A^\top A + \lambda I)\vec{x} - 2A^\top \vec{y}. \tag{4.26}
\]
Setting the gradient to zero, the optimal point is determined by solving the linear system
\[
(A^\top A + \lambda I)\vec{x} = A^\top \vec{y}. \tag{4.27}
\]
Since $A^\top A$ is PSD and $\lambda > 0$, we have that $A^\top A + \lambda I$ is positive definite and thus invertible. Therefore
\[
\vec{x}^\star = (A^\top A + \lambda I)^{-1} A^\top \vec{y} \tag{4.28}
\]
is the unique solution to the above linear system, and therefore the unique solution to the optimization problem.
Note that we haven’t proved that a convex function (such as the above ridge regression objective) is minimized when
its derivative is 0; we prove this in subsequent lectures, but for now let us take it for granted.
Proof. Another way to solve the same problem is to consider the augmented system
\[
\begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} \vec{x} = \begin{bmatrix} \vec{y} \\ \vec{0} \end{bmatrix}. \tag{4.29}
\]
This augmented matrix has full column rank (its bottom block is $\sqrt{\lambda} I$), so we can use the least squares solution to get a unique solution for $\vec{x}$. We get
\[
\vec{x} = \left( \begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix}^\top \begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} \right)^{-1} \begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix}^\top \begin{bmatrix} \vec{y} \\ \vec{0} \end{bmatrix} \tag{4.30}
\]
\[
= \left( \begin{bmatrix} A^\top & \sqrt{\lambda} I \end{bmatrix} \begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} \right)^{-1} \begin{bmatrix} A^\top & \sqrt{\lambda} I \end{bmatrix} \begin{bmatrix} \vec{y} \\ \vec{0} \end{bmatrix} \tag{4.31}
\]
\[
= \left( \begin{bmatrix} A^\top & \sqrt{\lambda} I \end{bmatrix} \begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} \right)^{-1} \left( A^\top \vec{y} + \sqrt{\lambda} I \cdot \vec{0} \right) \tag{4.32}
\]
\[
= \left( \begin{bmatrix} A^\top & \sqrt{\lambda} I \end{bmatrix} \begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} \right)^{-1} A^\top \vec{y} \tag{4.33}
\]
\[
= \left( A^\top A + \lambda I \right)^{-1} A^\top \vec{y}. \tag{4.34}
\]
In the ridge regression objective, the first term $\|A\vec{x} - \vec{y}\|_2^2$ measures fit to the data, while the second term $\lambda \|\vec{x}\|_2^2$ is called a regularizer; this is because it regulates or regularizes our problem by making it better-conditioned.
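A compact check that the closed form and the augmented least squares system agree (Python/NumPy; the random data is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
lam = 0.5

# Closed form: (A^T A + lambda I)^{-1} A^T y.
x_closed = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)

# Augmented least squares: [A; sqrt(lambda) I] x = [y; 0].
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(5)])
y_aug = np.concatenate([y, np.zeros(5)])
x_aug, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)

print(np.max(np.abs(x_closed - x_aug)))  # ~1e-15
```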
Thus, we get
\[
\vec{x}^\star = V \begin{bmatrix} (\Sigma_r^2 + \lambda I)^{-1} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} U^\top \vec{y} \tag{4.45}
\]
\[
= \left( \sum_{i=1}^{r} \frac{\sigma_i\{A\}}{\sigma_i\{A\}^2 + \lambda} \vec{v}_i \vec{u}_i^\top \right) \vec{y} \tag{4.46}
\]
\[
= \sum_{i=1}^{r} \frac{\sigma_i\{A\}}{\sigma_i\{A\}^2 + \lambda} (\vec{u}_i^\top \vec{y}) \cdot \vec{v}_i. \tag{4.47}
\]
To understand what $\lambda$ is doing here, we contrast two examples. Let $A \in \mathbb{R}^{n \times 3}$ for some large $n \gg 3$.

Suppose first that $\sigma_1\{A\} = \sigma_2\{A\} = \sigma_3\{A\} = 1$. Then
\[
\vec{x}^\star = \sum_{i=1}^{3} \frac{1}{1 + \lambda} (\vec{u}_i^\top \vec{y}) \vec{v}_i = \frac{1}{1 + \lambda} \tilde{\vec{x}},
\]
where $\tilde{\vec{x}}$ is the solution of the corresponding least squares linear regression problem, namely the ridge problem with $\lambda = 0$. In this way, the $\lambda$ parameter decays the solution in each principal direction equally, pulling the whole $\tilde{\vec{x}}$ vector towards $\vec{0}$. This is interesting precisely because a first-level examination of the ridge regression objective function — and namely the $\|\vec{x}\|_2^2$ term, which by itself penalizes every direction of $\vec{x}$ equally — may make it seem like this is always the case, but it turns out not to be, as we will see shortly.
Now suppose that $\sigma_1\{A\} = 100$, $\sigma_2\{A\} = 10$, and $\sigma_3\{A\} = 1$. Then
\[
\vec{x}^\star = \frac{100}{100^2 + \lambda} (\vec{u}_1^\top \vec{y}) \vec{v}_1 + \frac{10}{10^2 + \lambda} (\vec{u}_2^\top \vec{y}) \vec{v}_2 + \frac{1}{1 + \lambda} (\vec{u}_3^\top \vec{y}) \vec{v}_3.
\]
Thus, the different terms are now impacted differently by $\lambda$; writing the coefficients as $\frac{1}{100 + \lambda/100}$, $\frac{1}{10 + \lambda/10}$, and $\frac{1}{1 + \lambda}$, we see that to shift the first term's denominator by a certain amount, one needs to change $\lambda$ by $100$ times the amount required to shift the last term's denominator. Namely, if we set $\lambda$ to be large, say $\lambda = 1000$, the coefficient of the first term becomes $1/110$, while the coefficient of the last term becomes $1/1001$, which is much lower. More generally, for a larger example, setting $\lambda$ to be large effectively zeros out the last few terms while effectively not changing the first few terms. Thus, setting $\lambda$ to be large effectively performs a "soft thresholding" of the singular values, making the terms associated with smaller singular values nearly $0$ while preserving the terms associated with larger singular values. More quantitatively, for large $\lambda$, we have
associated with larger singular values. More quantitatively, for large λ, we have
1 1 1
~x? = (~u>
1~y )~v1 + (~u>
2~y )~v2 + (~u> ~y )~v3 (4.54)
100 + λ/100 10 + λ/10 1+λ 3
\[
\approx \frac{1}{100 + \lambda/100} (\vec{u}_1^\top \vec{y}) \vec{v}_1 + \frac{1}{10 + \lambda/10} (\vec{u}_2^\top \vec{y}) \vec{v}_2, \tag{4.55}
\]
and for even larger $\lambda$ we simply have
\[
\vec{x}^\star = \frac{1}{100 + \lambda/100} (\vec{u}_1^\top \vec{y}) \vec{v}_1 + \frac{1}{10 + \lambda/10} (\vec{u}_2^\top \vec{y}) \vec{v}_2 + \frac{1}{1 + \lambda} (\vec{u}_3^\top \vec{y}) \vec{v}_3 \tag{4.56}
\]
\[
\approx \frac{1}{100 + \lambda/100} (\vec{u}_1^\top \vec{y}) \vec{v}_1. \tag{4.57}
\]
Since the terms form a linear combination of the $\vec{v}_i$, setting the terms associated with small singular values to (nearly) $0$ is similar to performing PCA, where we only use the $\vec{v}_i$ associated with the largest few singular values. Thus our conclusion is that ridge regression behaves qualitatively like a soft form of PCA.
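The shrinkage factors $\sigma_i / (\sigma_i^2 + \lambda)$ are easy to tabulate (Python/NumPy; the singular values match the example above):

```python
import numpy as np

sigmas = np.array([100.0, 10.0, 1.0])
for lam in [0.0, 1.0, 1000.0, 1e6]:
    coef = sigmas / (sigmas**2 + lam)
    # coef * sigma = sigma^2 / (sigma^2 + lambda) is the fraction of each
    # least-squares component that ridge retains.
    print(lam, coef * sigmas)
```

For $\lambda = 1000$, the three retained fractions are roughly $0.91$, $0.09$, and $0.001$: the small-singular-value directions are nearly zeroed out while the large one is barely touched, exactly the soft-PCA behavior described above.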
4.4 Tikhonov Regression

Recall that ridge regression, via the augmented system
\[
\begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} \vec{x} = \begin{bmatrix} \vec{y} \\ \vec{0} \end{bmatrix}, \tag{4.58}
\]
which had full column rank, tried to find an $\vec{x}$ such that $A\vec{x} \approx \vec{y}$ while $\vec{x} \approx \vec{0}$ — in other words, an $\vec{x}$ that is small. Suppose that we wanted to instead enforce that $\vec{x}$ be close to some other vector $\vec{x}_0 \in \mathbb{R}^n$. Then we would set up the system
\[
\begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} \vec{x} = \begin{bmatrix} \vec{y} \\ \sqrt{\lambda} \vec{x}_0 \end{bmatrix}, \tag{4.59}
\]
whose least squares solution minimizes
\[
\|A\vec{x} - \vec{y}\|_2^2 + \lambda \|\vec{x} - \vec{x}_0\|_2^2. \tag{4.60}
\]
The final generalization of this is to put different weights on each row of $A\vec{x} - \vec{y}$ and $\vec{x} - \vec{x}_0$. If, for example, we really want to get row $i$ of $A\vec{x}$ close to $y_i$, we can multiply the squared difference $(A\vec{x} - \vec{y})_i^2$ by a large weight in the loss function, and the solutions will bias towards ensuring that $(A\vec{x} - \vec{y})_i \approx 0$. Similarly, if we really are sure that the true $\vec{x}$ has $i$th coordinate $(\vec{x}_0)_i$, then we can attach a large weight to the difference $(\vec{x} - \vec{x}_0)_i^2$ as well. Mathematically, this gives us the following objective function:
\[
\|W_1 (A\vec{x} - \vec{y})\|_2^2 + \|W_2 (\vec{x} - \vec{x}_0)\|_2^2, \tag{4.61}
\]
where $W_1 \in \mathbb{R}^{m \times m}$ and $W_2 \in \mathbb{R}^{n \times n}$ are diagonal matrices representing the weights. Notice how this is a generalization of ridge regression, with $W_1 = I$, $W_2 = \sqrt{\lambda} I$, and $\vec{x}_0 = \vec{0}$. This general regression is called Tikhonov regression.
The solution to the Tikhonov regression problem
\[
\min_{\vec{x} \in \mathbb{R}^n} \|W_1 (A\vec{x} - \vec{y})\|_2^2 + \|W_2 (\vec{x} - \vec{x}_0)\|_2^2 \tag{4.62}
\]
is given by
\[
\vec{x}^\star = (A^\top W_1^2 A + W_2^2)^{-1} (A^\top W_1^2 \vec{y} + W_2^2 \vec{x}_0). \tag{4.63}
\]
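A quick numerical check of this closed form against a brute-force minimizer (Python, using NumPy and scipy.optimize.minimize; all data is random and illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
m, n = 12, 4
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
x0 = rng.standard_normal(n)
W1 = np.diag(rng.uniform(0.5, 2.0, m))
W2 = np.diag(rng.uniform(0.5, 2.0, n))

def objective(x):
    return (np.sum((W1 @ (A @ x - y))**2)
            + np.sum((W2 @ (x - x0))**2))

# Closed form (4.63); W**2 is the matrix square since W is diagonal.
x_closed = np.linalg.solve(A.T @ W1**2 @ A + W2**2,
                           A.T @ W1**2 @ y + W2**2 @ x0)

x_numeric = minimize(objective, np.zeros(n)).x
print(np.max(np.abs(x_closed - x_numeric)))  # small, ~1e-6
```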
4.5 Maximum Likelihood Estimation (MLE)

Suppose we observe measurements
\[
y_i = \vec{a}_i^\top \vec{x} + w_i, \quad \forall i \in \{1, \ldots, m\}, \tag{4.64}
\]
where $w_1, \ldots, w_m$ are independent Gaussian random variables; in particular, $w_i \sim \mathcal{N}(0, \sigma_i^2)$. Or, in short, we have
\[
\vec{y} = A\vec{x} + \vec{w}, \tag{4.65}
\]
where $\vec{w} := \begin{bmatrix} w_1 & \cdots & w_m \end{bmatrix}^\top \in \mathbb{R}^m$. In this case, we say that $\vec{w} \sim \mathcal{N}(\vec{0}, \Sigma_{\vec{w}})$, where $\Sigma_{\vec{w}} = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_m^2)$.
In this setup, the maximum likelihood estimate (MLE) for $\vec{x}$ turns out to be exactly a solution to a Tikhonov regression problem. The maximum likelihood estimate is the parameter choice which makes the data most likely, in that it has the highest probability or probability density out of all choices of the parameter. It is a meaningful and popular statistical estimator; thus the fact that we can reduce its computation to a ridge regression-type problem is both interesting and useful.
Henceforth, we use p to denote probability densities, and use p~x to denote probability densities for a fixed value of
~x. In the above model, ~x is not a random variable, so it doesn’t quite make formal sense to condition on it (though —
spoilers! — we will soon put a probabilistic prior on it, and then it makes sense to condition).
Theorem. In the above model, the maximum likelihood estimate of $\vec{x}$ is
\[
\hat{\vec{x}}_{\mathrm{MLE}} = \operatorname*{argmax}_{\vec{x} \in \mathbb{R}^n} p_{\vec{x}}(\vec{y}) = \operatorname*{argmin}_{\vec{x} \in \mathbb{R}^n} \left\| \Sigma_{\vec{w}}^{-1/2} (A\vec{x} - \vec{y}) \right\|_2^2.
\]

Proof. Since the logarithm is monotonically increasing, $\operatorname{argmax}_{\vec{x}} f(\vec{x}) = \operatorname{argmax}_{\vec{x}} \log(f(\vec{x}))$ for any positive function $f$, and so, using the independence of the $w_i$,
\[
\hat{\vec{x}}_{\mathrm{MLE}} = \operatorname*{argmax}_{\vec{x} \in \mathbb{R}^n} \log p_{\vec{x}}(\vec{y}) = \operatorname*{argmax}_{\vec{x} \in \mathbb{R}^n} \sum_{i=1}^{m} \log p_{\vec{x}}(y_i)
\]
\[
= \operatorname*{argmax}_{\vec{x} \in \mathbb{R}^n} \sum_{i=1}^{m} \left[ \underbrace{\log \frac{1}{\sqrt{2\pi\sigma_i^2}}}_{\text{independent of } \vec{x}} + \log \exp\left( -\frac{(y_i - \vec{a}_i^\top \vec{x})^2}{2\sigma_i^2} \right) \right] \tag{4.71}
\]
\[
= \operatorname*{argmax}_{\vec{x} \in \mathbb{R}^n} \sum_{i=1}^{m} \log \exp\left( -\frac{(y_i - \vec{a}_i^\top \vec{x})^2}{2\sigma_i^2} \right) \tag{4.72}
\]
\[
= \operatorname*{argmax}_{\vec{x} \in \mathbb{R}^n} \sum_{i=1}^{m} -\frac{(y_i - \vec{a}_i^\top \vec{x})^2}{2\sigma_i^2} \tag{4.73}
\]
\[
= \operatorname*{argmax}_{\vec{x} \in \mathbb{R}^n} \left\{ -\frac{1}{2} \sum_{i=1}^{m} \frac{(y_i - \vec{a}_i^\top \vec{x})^2}{\sigma_i^2} \right\} \tag{4.74}
\]
\[
= \operatorname*{argmin}_{\vec{x} \in \mathbb{R}^n} \sum_{i=1}^{m} \frac{(y_i - \vec{a}_i^\top \vec{x})^2}{\sigma_i^2} \tag{4.75}
\]
\[
= \operatorname*{argmin}_{\vec{x} \in \mathbb{R}^n} \left\| \Sigma_{\vec{w}}^{-1/2} (A\vec{x} - \vec{y}) \right\|_2^2. \tag{4.76}
\]
4.6 Maximum A Posteriori Estimation (MAP)

Now suppose that, in addition to the measurement model above, we have prior knowledge that $\vec{x}$ is random and distributed around a known vector $\vec{x}_0$:
\[
\vec{x} = \vec{x}_0 + \vec{v}, \tag{4.78}
\]
where $\vec{v} := \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix}^\top \in \mathbb{R}^n$ is distributed as $\vec{v} \sim \mathcal{N}(\vec{0}, \Sigma_{\vec{v}})$, where $\Sigma_{\vec{v}} = \operatorname{diag}(\tau_1^2, \ldots, \tau_n^2)$.
In this setup, the maximum likelihood estimate may still be useful, but another quantity that is perhaps more relevant
is the maximum a posteriori estimate (MAP). The MAP estimate is the value of ~x which is most likely, i.e., having
the highest conditional probability or conditional probability density, conditioned on the observed data. It is also a
meaningful and popular statistical estimator. It turns out that we can derive a similar result as in the MLE case.
Proof. Using Bayes’ rule and the computations from before, we have
© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 77
EECS 127/227AT Course Reader 4.6. Maximum A Posteriori Estimation (MAP) 2024-04-27 21:08:09-07:00
= argmax log(p(~y | ~x)) + log(p(~x)) − log(p(~y )) (4.83)
x∈Rn
~ | {z }
independent of ~
x
as desired.
Chapter 5
Convexity
• [1] Chapter 4.
• [2] Chapter 8.
A convex combination of points $\vec{x}_1, \ldots, \vec{x}_k \in \mathbb{R}^n$ is any point of the form $\sum_{i=1}^{k} \theta_i \vec{x}_i$, where $\theta_i \geq 0$ for each $i$ and $\sum_{i=1}^{k} \theta_i = 1$. We can think of each $\theta_i$ as a weight on the corresponding $\vec{x}_i$. Since the weights are non-negative numbers which sum to $1$, we can also interpret them as probabilities.
Geometrically, a set $C$ is convex if for every two points $\vec{x}_1, \vec{x}_2 \in C$, the line segment $\{\theta \vec{x}_1 + (1 - \theta)\vec{x}_2 \mid \theta \in [0, 1]\}$ is contained in $C$. This means that, for example, the midpoint between $\vec{x}_1$ and $\vec{x}_2$, i.e., $\frac{1}{2}\vec{x}_1 + \frac{1}{2}\vec{x}_2$, is contained in $C$, as well as the point $\frac{1}{3}$ of the way from $\vec{x}_1$ to $\vec{x}_2$, i.e., $\frac{2}{3}\vec{x}_1 + \frac{1}{3}\vec{x}_2$, etc. More generally, as we vary $\theta$, we go along the line segment connecting $\vec{x}_1$ and $\vec{x}_2$.
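This segment condition is easy to probe numerically for sets with a membership test; a sketch (Python/NumPy; the unit disk and an annulus are arbitrary example sets) that samples random segments:

```python
import numpy as np

rng = np.random.default_rng(10)

def in_disk(p):       # convex set: unit disk, ||p|| <= 1
    return np.linalg.norm(p) <= 1

def in_annulus(p):    # non-convex set: 0.5 <= ||p|| <= 1
    return 0.5 <= np.linalg.norm(p) <= 1

def segment_check(member, trials=10000):
    """Search for two members whose connecting segment leaves the set."""
    for _ in range(trials):
        p1, p2 = rng.uniform(-1, 1, (2, 2))
        if member(p1) and member(p2):
            theta = rng.uniform()
            if not member(theta * p1 + (1 - theta) * p2):
                return False  # found a counterexample to convexity
    return True  # no counterexample found (evidence, not a proof!)

print(segment_check(in_disk))     # True
print(segment_check(in_annulus))  # False: e.g. midpoint of (-0.75, 0), (0.75, 0)
```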
Figure 5.1: Two sets C1 , C2 ⊆ R2 . C1 is not convex, but C2 is. To visualize the behavior of the line segments, we also plot a point
on each line segment along with its associated θ.
Algebraically, a set C is convex if for any ~x1 , . . . , ~xk ∈ C, any convex combination of ~x1 , . . . , ~xk is contained in
C.
One way to generate a convex set from any (possibly non-convex) set, including finite and infinite sets, is to take its convex hull: the convex hull $\operatorname{conv}(S)$ of a set $S \subseteq \mathbb{R}^n$ is the set of all convex combinations of finitely many points of $S$. Here are some properties of the convex hull; the proof is left as an exercise.
Proposition 93
Let $S \subseteq \mathbb{R}^n$ be a set.
(b) $\operatorname{conv}(S)$ is the minimal convex set which contains $S$, i.e.,
\[
\operatorname{conv}(S) = \bigcap_{\substack{C \supseteq S \\ C \text{ is a convex set}}} C. \tag{5.3}
\]
(c) $\operatorname{conv}(S)$ is the union of convex hulls of all finite subsets of $S$, i.e.,
\[
\operatorname{conv}(S) = \bigcup_{\substack{A \subseteq S \\ A \text{ is a finite set}}} \operatorname{conv}(A). \tag{5.4}
\]
Actually, the last statement can be strengthened to a separate, more quantitative result, which gives a fundamental characterization of convex sets.

Theorem 94 (Carathéodory). Let $S \subseteq \mathbb{R}^n$ be a set. Then every point of $\operatorname{conv}(S)$ is a convex combination of at most $n + 1$ points of $S$, i.e.,
\[
\operatorname{conv}(S) = \bigcup_{\substack{A \subseteq S \\ |A| \leq n + 1}} \operatorname{conv}(A).
\]
The proof of this theorem is left as an exercise; interested students can reference the proof in Bertsekas [5, Proposition B.6], for example.
Below, we visualize the convex hull of a finite set S. By the above proposition, the convex hull of an infinite set S 0
is the union of convex hulls of all finite subsets of S 0 .
[Figure: a finite set $S$ and its convex hull $\operatorname{conv}(S)$.]
The following content is optional/out of scope for this semester. Regardless, it may be helpful to read it to
gain context, or get a deeper understanding of various results.
The conic hull of a set $S \subseteq \mathbb{R}^n$ is the set of all conic combinations of points of $S$, i.e., $\operatorname{conic}(S) := \{\sum_{i=1}^{k} \theta_i \vec{x}_i \mid \vec{x}_i \in S, \, \theta_i \geq 0, \, k \in \mathbb{N}\}$. Geometrically, the conic hull of a set $S$ is the set of all rays from the origin that pass through $\operatorname{conv}(S)$.
Figure 5.3: A finite set S and its convex and conic hulls.
A set $S \subseteq \mathbb{R}^n$ is affine if for every $\vec{x}_1, \vec{x}_2 \in S$ and every $\theta \in \mathbb{R}$, we have $\theta \vec{x}_1 + (1 - \theta)\vec{x}_2 \in S$. Note the difference between affine and convex sets: in the latter, $\theta$ is restricted to $[0, 1]$. Geometrically this restriction corresponds to the (finite) line segment connecting $\vec{x}_1$ and $\vec{x}_2$ being contained in $S$; in the former, however, $\theta$ can be any real number, corresponding to the whole (infinite) line connecting $\vec{x}_1$ and $\vec{x}_2$ being contained in $S$.
Note that an affine set is a translation of a subspace. This intuition is one of the most helpful ways to understand
affine sets.
Proposition 97
For a set $A \subseteq \mathbb{R}^n$, define the translation $A + \vec{x} := \{\vec{a} + \vec{x} \mid \vec{a} \in A\}$.
(a) Let $S \subseteq \mathbb{R}^n$ be a nonempty affine set. Then there is a subspace $U \subseteq \mathbb{R}^n$ such that, for any $\vec{x} \in S$, we have $S = U + \vec{x}$.
(b) For any subspace $U \subseteq \mathbb{R}^n$ and vector $\vec{x} \in \mathbb{R}^n$, the set $U + \vec{x}$ is an affine set.
Proof.
(a) Let ~x ∈ S be any vector in S, and define U := S + (−~x) = {~s − ~x | ~s ∈ S}. We claim that U is a subspace.
Indeed, since ~x ∈ S, we see that ~0 ∈ U = S + (−~x). We show that U is closed under addition. Let ~u1 , ~u2 ∈ U .
By definition of U, there exist ~s1, ~s2 ∈ S such that ~u1 = ~s1 − ~x and ~u2 = ~s2 − ~x. Then
~u1 + ~u2 = (~s1 − ~x) + (~s2 − ~x) = (~s1 + ~s2 − ~x) − ~x.
Since S is affine, it is closed under affine combinations, and the coefficients of ~s1 + ~s2 − ~x sum to 1, so ~s1 + ~s2 − ~x ∈ S; thus ~u1 + ~u2 ∈ S + (−~x) = U and U is closed under addition. Next, let α ∈ R and ~u ∈ U, so that ~u = ~s − ~x for some ~s ∈ S. Then
α~u = α~s − α~x = [α~s + (1 − α)~x] − ~x.
Since S is affine, α~s + (1 − α)~x ∈ S. Thus α~u ∈ S + (−~x) = U, so U is closed under scalar
multiplication. We have shown that U is closed under linear combinations and contains ~0, so U is a subspace
and the claim is proved.
(b) Let α ∈ R and let ~s1, ~s2 ∈ U + ~x. By definition of U + ~x, there exist ~u1, ~u2 ∈ U such that ~s1 = ~u1 + ~x and ~s2 = ~u2 + ~x. Then
α~s1 + (1 − α)~s2 = α(~u1 + ~x) + (1 − α)(~u2 + ~x) = [α~u1 + (1 − α)~u2] + ~x.
Since U is a subspace, α~u1 +(1−α)~u2 ∈ U . Thus, from above, we have α~s1 +(1−α)~s2 = [α~u1 +(1−α)~u2 ]+~x ∈
U + ~x. We have shown that U + ~x is closed under affine combinations, so it is an affine set.
Here are some properties of the affine hull; the proof is left as an exercise.
Proposition 99
Let S ⊆ Rn be a set.
(a) aff(S) is an affine set which contains S.
(b) aff(S) is the minimal affine set which contains S, i.e.,
aff(S) = ⋂ { C | C ⊇ S, C is an affine set }.  (5.14)
(c) aff(S) is the union of affine hulls of all finite subsets of S, i.e.,
aff(S) = ⋃ { aff(A) | A ⊆ S, A is a finite set }.  (5.15)
Corollary 100. Let S ⊆ Rn be a set, and let aff(S) be the translation of a linear subspace of dimension d ≤ n. Then aff(S) is the union of affine hulls of all finite subsets of S of size at most d + 1, i.e.,
aff(S) = ⋃ { aff(A) | A ⊆ S, |A| ≤ d + 1 }.  (5.16)
(Note the d + 1: an affine set of dimension d is affinely spanned by d + 1 points; e.g., a line, with d = 1, is spanned by two points.)
Proof. Suppose that aff(S) = U + ~x, where ~x ∈ S and U ⊆ Rn is a subspace of dimension d; such a decomposition exists by Proposition 97 applied to the affine set aff(S), taking the base point in S since S ⊆ aff(S). We prove the two inclusions; together they give the claimed equality.
We show the easier inclusion first. We have from earlier results that
aff(S) = ⋃ { aff(A) | A ⊆ S, A is a finite set } ⊇ ⋃ { aff(A) | A ⊆ S, |A| ≤ d + 1 }.  (5.17)
Towards showing the reverse inclusion, let ~s1, . . . , ~sd be elements of S such that, if we define ~ui := ~si − ~x, then ~u1, . . . , ~ud is a basis for U. (Such ~si have to exist: the vectors {~s − ~x | ~s ∈ S} span U, since otherwise they would span a strictly smaller subspace U′ ⊊ U, and then U′ + ~x would be an affine set containing S strictly smaller than aff(S) = U + ~x, a contradiction; so some d of them form a basis.) Now taking A = {~x, ~s1, . . . , ~sd}, a set of size at most d + 1, we see that aff(A) = ~x + span{~u1, . . . , ~ud} = ~x + U = aff(S). Thus we have
aff(S) = aff(A) ⊆ ⋃ { aff(A′) | A′ ⊆ S, |A′| ≤ d + 1 }.  (5.18)
Therefore we have
aff(S) = ⋃ { aff(A) | A ⊆ S, |A| ≤ d + 1 },  (5.19)
as desired.
Figure 5.4: (a) The set S1 := {(3/2, 2), (3, 1)} of two points in R²; (b) The convex hull S2 := conv(S1) of S1, which is the closed line segment connecting the two points in S1; (c) The conic hull S3 := conic(S1) of S1, which is the union of all rays passing through S1; (d) The affine hull S4 := aff(S1) of S1, which is the infinite line connecting the two points in S1. Note that we also have S3 = conic(S2) and S4 = aff(S2); this relationship can be shown to hold in general from definitions.
Next, given a set S ⊆ Rn, we sometimes wish to distinguish points that lie on the “boundary” of S from points that lie in the “interior” of S. For instance, consider the set S := [0, 1) ⊆ R. In this case, 0 and 1 lie on the “boundary” of S, since they are infinitely close to both points inside S and points outside S. On the other hand, 1/10 lies in the interior of S, since all points within a sufficiently small distance of it lie in S. Although 0 and 1 can both be geometrically interpreted as points on the boundary of S = [0, 1), note that 0 ∈ S while 1 ∉ S. In general, a set may contain either all, some, or none of the points on its boundary.
Below, we formalize the notion of interior points¹.
(b) (Interior.) We say that ~x is an interior point of S when there exists some r > 0 such that Nr(~x) ⊆ S, where Nr(~x) := {~y ∈ Rn | ||~y − ~x||₂ < r} denotes the open ball of radius r centered at ~x.ᵃ The set of all interior points of S is called the interior of S and denoted int(S).
ᵃ In the definition of an interior point, it does not matter whether we use ⊆ or ⊂. Think about why; this is a good exercise to internalize the definitions.
¹The definitions provided below can be generalized to spaces more abstract than Rn or even general finite-dimensional vector spaces, such as
metric or topological spaces.
In words, given a set S ⊆ Rn , we say that ~x ∈ Rn is an interior point of S if it is contained inside an open ball in
Rn that is in turn entirely contained in S. A mental picture is provided in Figure 5.5.
Figure 5.5: The vector ~x is an interior point of the set S ⊆ R², since there exists a 2-dimensional ball Nr(~x) (red) centered at ~x that is contained in S.
Some sets represent geometric shapes of strictly lower dimension than the Euclidean space they are embedded in, and therefore have empty interior. As an example, consider the set S2 defined in Figure 5.4(b), which connects the points (3/2, 2) and (3, 1) in R². This is a one-dimensional line segment embedded in a Euclidean space of dimension 2. Indeed, if one claims that a point in S2, say, the midpoint (9/4, 3/2), is an interior point of S2, then one would have to show the existence of a two-dimensional open ball centered at (9/4, 3/2) that lies entirely in S2. This is impossible, since S2 is a one-dimensional line segment, and so S2 has empty interior.
However, it may still be geometrically meaningful to classify points in such a set as points on the “edge” of the
set, or points “inside” the set. In the context of the line segment S2 , the end points S1 = {(3/2, 2), (3, 1)} appear at
the “edge” of S2 , while the remaining points S2 \ S1 are located “inside” S2 . As we explained above, this cannot be
captured using the definitions of interior points and the interior presented in Definition 101. Roughly speaking, this
is because S2 \ S1 can only be considered points “inside” the line segment S2 from a one-dimensional perspective,
e.g., relative to the line S4 = aff(S1) = aff(S2) that contains S2. This motivates the definition of the relative interior, provided below.
In words, given a set S ⊆ Rn , we say that ~x ∈ Rn is a relative interior point of S if it is contained inside an open
ball in Rn whose intersection with aff (S) is entirely contained in S.
Figure 5.6: A similar setup to Figure 5.4. Here S1 := {(3/2, 2), (3, 1)}, S2 := conv(S1) = {θ(3/2, 2) + (1 − θ)(3, 1) | θ ∈ [0, 1]}, and S3 := aff(S1) = aff(S2) = {θ(3/2, 2) + (1 − θ)(3, 1) | θ ∈ R}. Thus relint(S2) = S2 \ S1. In other words, S2 is the line segment connecting (3/2, 2) and (3, 1), S3 = aff(S2) is the extension of S2 into a line, and relint(S2) is the open (i.e., excluding the endpoints) line segment connecting (3/2, 2) and (3, 1). This illustrates the description of the relative interior of a set as its interior when viewed as a subset of its own affine hull.
Figure 5.7: A set S and its affine hull aff(S). While the interior of S is the empty set, its relative interior is nonempty.
Next, we use the concept of relative interior to characterize strictly convex sets.
Figure 5.8: Left: a strictly convex set S1. Middle: a convex set S2 which is not strictly convex. Right: a non-convex set S3. All three sets are defined to include their boundaries. In particular, S2 is not strictly convex because some sections of its boundary consist of line segments. For any two points along the same line segment, each convex combination of these points will lie on the boundary of S2.
The above content is optional/out of scope for this semester, but now we resume the required/in scope content.
The equations ~a>~x = b and ~a>(~x − ~x0) = 0 are connected: if we define b := ~a>~x0, then the second equation reduces to the first; and if we take ~x0 to be any vector such that ~a>~x0 = b, then the first equation reduces to the second.
Example 105. Hyperplanes are convex. Consider a hyperplane H := {~x ∈ Rn | ~a>~x = b}. Let ~x1, ~x2 ∈ H and θ ∈ [0, 1]. Then
~a>(θ~x1 + (1 − θ)~x2) = θ~a>~x1 + (1 − θ)~a>~x2 = θb + (1 − θ)b = b,
so θ~x1 + (1 − θ)~x2 ∈ H, and H is convex.
To show that a set C is convex, we need to show that for every ~x1, ~x2 ∈ C and every θ ∈ [0, 1], we have θ~x1 + (1 − θ)~x2 ∈ C.
To show that C is not convex, we just need to come up with one choice of ~x1, ~x2 ∈ C and one θ ∈ [0, 1] such that θ~x1 + (1 − θ)~x2 ∉ C. Note that even if C is non-convex, there could be some choices of ~x1, ~x2 ∈ C, θ ∈ [0, 1] such that θ~x1 + (1 − θ)~x2 ∈ C; but if C is non-convex, there is at least one choice of ~x1, ~x2 ∈ C, θ ∈ [0, 1] such that θ~x1 + (1 − θ)~x2 ∉ C.
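This recipe can be mechanized: given a membership oracle for C, sampling pairs of points and mixing weights will often produce a certificate of non-convexity. Below is a hedged sketch; the oracle in_C is a hypothetical illustration (here the annulus 1 ≤ ||~x||₂ ≤ 2, which is not convex), not a set from the text.

```python
# A minimal sketch of searching for a convexity violation by sampling.
import numpy as np

rng = np.random.default_rng(0)

def in_C(x):
    # Membership oracle for the annulus 1 <= ||x||_2 <= 2 (a non-convex set).
    return 1.0 <= np.linalg.norm(x) <= 2.0

for _ in range(10_000):
    x1, x2 = rng.uniform(-2, 2, size=(2, 2))   # two candidate points
    theta = rng.uniform()                      # mixing weight in [0, 1]
    if in_C(x1) and in_C(x2) and not in_C(theta * x1 + (1 - theta) * x2):
        print("violation found:", x1, x2, theta)
        break
```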
The mental picture we have for these hyperplanes and half-spaces is the following. Let ~x0 ∈ Rn and define
H := {~x ∈ Rn | ~a>(~x − ~x0) = 0}  (5.27)
H+ := {~x ∈ Rn | ~a>(~x − ~x0) ≥ 0}  (5.28)
H− := {~x ∈ Rn | ~a>(~x − ~x0) ≤ 0}.  (5.29)
[Diagram: the hyperplane H through ~x0 with normal vector ~a, with half-space H− on one side and H+ on the other.]
In words, the positive and negative half-spaces partition Rn . Looking at some individual vectors, say ~x1 ∈ H−
and ~x2 ∈ H+ , we have the picture
[Diagram: a point ~x1 ∈ H− and a point ~x2 ∈ H+ on either side of the hyperplane through ~x0.]
If we draw lines connecting ~x0 with ~x1 and ~x2 , they are not themselves representations of ~x1 and ~x2 , unless ~x0 = ~0.
Instead, they are representations of the displacements of ~x1 and ~x2 from ~x0 . Thus, we see the following picture:
[Diagram: the displacement vectors ~x1 − ~x0 and ~x2 − ~x0 drawn from ~x0.]
This gives us a clearer understanding of what's going on: ~x1 − ~x0 forms an obtuse angle with ~a, indicating a negative dot product, whereas ~x2 − ~x0 forms an acute angle with ~a, indicating a positive dot product. This sign test is exactly what determines membership in H+ and H−.
This allows us to consider what it means for a hyperplane to separate two sets: for every vector ~x in the first set, the dot product ~a>(~x − ~x0) is non-positive, and for every vector ~x in the second set, it is non-negative.
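The sign test is one line of code. A small numpy sketch classifying points into H, H+, or H− (the particular ~a and ~x0 below are arbitrary illustrative choices):

```python
import numpy as np

a = np.array([1.0, 2.0])    # normal vector of the hyperplane
x0 = np.array([0.5, 0.5])   # a point on the hyperplane

def side(x):
    s = a @ (x - x0)        # the dot product a^T (x - x0)
    return "H" if np.isclose(s, 0) else ("H+" if s > 0 else "H-")

print(side(np.array([0.5, 0.5])))    # H
print(side(np.array([2.0, 2.0])))    # H+
print(side(np.array([-1.0, -1.0])))  # H-
```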
Example 107 (Set of PSD Matrices is Convex). Consider Sn+ , the set of all symmetric positive semidefinite (PSD)
matrices. We want to show that Sn+ is convex. Take A1 , A2 ∈ Sn+ and θ ∈ [0, 1]. We want to show that θA1 +(1−θ)A2 ∈
Sn+ .
One of the ways to tell if a matrix A is PSD is to check whether ~x>A~x ≥ 0 for all ~x ∈ Rn. Checking this for our convex combination, we get
~x>(θA1 + (1 − θ)A2)~x = θ~x>A1~x + (1 − θ)~x>A2~x ≥ 0,  (5.31)
since θ, 1 − θ ≥ 0 and both quadratic forms are non-negative.
" #
1 0
Note that it is possible to come up with linear combinations of PSD matrices that are not PSD; indeed, and
0 1
" # " #
2 0 −1 0
are PSD, yet their difference is not PSD. But all convex combinations of PSD matrices are PSD,
0 2 0 −1
as we have confirmed above.
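A quick numerical confirmation: the smallest eigenvalue of a convex combination of randomly generated PSD matrices stays non-negative, while that of a difference need not. A minimal numpy sketch (the matrices are random illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B1, B2 = rng.standard_normal((2, n, n))
A1, A2 = B1 @ B1.T, B2 @ B2.T     # B B^T is always symmetric PSD

theta = 0.3
comb = theta * A1 + (1 - theta) * A2
print(np.linalg.eigvalsh(comb).min())     # >= 0 (up to rounding): PSD
print(np.linalg.eigvalsh(A1 - A2).min())  # may well be negative: not PSD
```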
Moreover, if C is closed (containing its boundary points) and D is closed and bounded, then there exists a hyperplane that separates C and D without intersecting either set, i.e., there exist ~a, ~x0 ∈ Rn such that ~a>(~x − ~x0) > 0 for all ~x ∈ C and ~a>(~x − ~x0) < 0 for all ~x ∈ D.
[Diagram: disjoint convex sets C and D, to be separated by a hyperplane.]
Proof. We prove the part of the theorem statement in the case where C is closed and bounded and D is closed.
Even though our theorem statement concerns existence of such ~a and ~x0, we will prove it by construction, i.e., we will construct ~a and ~x0 which separate C and D. This proof strategy is very powerful and will show up frequently.
Since C and D are disjoint, any points in C and D are separated by some positive distance; since C is closed and bounded and D is closed, this distance has a positive lower bound which is attained.² Define
dist(C, D) := min_{~c∈C, ~d∈D} ||~c − ~d||₂.  (5.36)
Note that dist(C, D) > 0, and there exist some ~c ∈ C and ~d ∈ D such that ||~c − ~d||₂ = dist(C, D).³
[Diagram: the closest points ~c ∈ C and ~d ∈ D, with the vector ~c − ~d drawn between them.]
This signals that we want ~c − ~d to be the normal vector of our hyperplane, that is, our ~a vector. To find the other point ~x0 which the hyperplane passes through, we can just have it pass through the midpoint of ~c and ~d, i.e., (~c + ~d)/2. This
gives the following diagram.
[Diagram: the same picture, with the midpoint (~c + ~d)/2 marked on the segment between ~c and ~d.]
That is, we choose
~a = ~c − ~d,  ~x0 = (~c + ~d)/2.  (5.37)
It yields the following picture, where the hyperplane is a dotted line.
²Proving this requires some mathematical analysis and is out of scope of the course.
³Same as the above footnote. The fact that C is closed and bounded and D is closed will not be used from this point onwards.
[Diagram: the separating hyperplane (dotted) through ~x0 with normal vector ~a, between C and D.]
Notice that there are many separating hyperplanes, such as the one discussed before the theorem. But we just need
to prove that this hyperplane separates C and D.
The equation for this hyperplane is
~a>(~x − ~x0) = (~c − ~d)>(~x − (~c + ~d)/2)  (5.38)
= (~c − ~d)>~x − (~c − ~d)>(~c + ~d)/2  (5.39)
= (~c − ~d)>~x − (~c>~c − ~d>~d)/2  (5.40)
= (~c − ~d)>~x − (||~c||₂² − ||~d||₂²)/2.  (5.41)
Thus the given hyperplane is also available in (~a, b) form as
~a = ~c − ~d,  b = (||~c||₂² − ||~d||₂²)/2.  (5.42)
Define f(~x) := ~a>(~x − ~x0). We will show that f(~u) < 0 for every ~u ∈ D (the argument that f(~x) > 0 for every ~x ∈ C is symmetric). For the sake of contradiction, suppose there exists ~u ∈ D such that f(~u) ≥ 0. We can write
0 ≤ f(~u)  (5.44)
= (~c − ~d)>(~u − (~c + ~d)/2)  (5.45)
= (~c − ~d)>((~u − ~d) − (~c − ~d)/2)  (5.46)
= (~c − ~d)>(~u − ~d) − (~c − ~d)>(~c − ~d)/2  (5.47)
= (~c − ~d)>(~u − ~d) − (1/2)||~c − ~d||₂².  (5.48)
Thus
0 ≤ (~c − ~d)>(~u − ~d) − (1/2)||~c − ~d||₂² < (~c − ~d)>(~u − ~d).  (5.49)
This means that ~c − ~d and ~u − ~d form an acute angle. It also means that ~u ≠ ~d, since otherwise the dot product would be 0. Going back to our picture, this means that ~u would have to be positioned similarly to the following:
[Diagram: a candidate ~u ∈ D placed so that ~u − ~d forms an acute angle with ~c − ~d.]
At least from the diagram, it seems hard to imagine a ~u ∈ D such that ~u − ~d and ~c − ~d form an acute angle. Namely, any vector ~x ∈ Rn (with ~x − ~d of reasonably small norm, such as the ~u in the figure) such that ~x − ~d and ~c − ~d form an acute angle seems to be closer to ~c than ~d is to ~c.
Why do we need the “reasonably small norm” condition? Consider the following possible ~x:
[Diagram: a point ~x far from ~d which still forms an acute angle, but is farther from ~c than ~d is.]
Certainly, this ~x is farther from ~c than ~d is, and so no contradiction would be derived.
If we can prove that our ~u, which we assume is in D, is closer to ~c than ~d is, then we can derive a contradiction with the fact that ~d is the closest vector in D to ~c. But we can't prove this for our ~u directly, because ||~u − ~d||₂ may be large as in the above figure, so instead we take another vector ~x which is close to ~d, where the displacement between ~x and ~d points in the direction of ~u. We will show that this ~x is in D yet is closer to ~c than ~d is, thus deriving a contradiction.
Here are the details. Let ~p : [0, 1] → Rn trace out the line segment from ~d to ~u; namely, let ~p(t) = ~d + t(~u − ~d) = t~u + (1 − t)~d. Since ~u, ~d ∈ D by assumption, and D is convex, we have that ~p(t) ∈ D for all t ∈ [0, 1]. Now we see that
||~p(t) − ~c||₂² = ||~d + t(~u − ~d) − ~c||₂²
= ||(~d − ~c) + t(~u − ~d)||₂²
= ||~d − ~c||₂² + 2t(~d − ~c)>(~u − ~d) + t²||~u − ~d||₂²
= ||~c − ~d||₂² − 2t(~c − ~d)>(~u − ~d) + t²||~u − ~d||₂².
We want to show that there exists t such that ||~p(t) − ~c||₂² < ||~c − ~d||₂², i.e.,
−2t(~c − ~d)>(~u − ~d) + t²||~u − ~d||₂² < 0,  (5.50)
i.e.,
2(~c − ~d)>(~u − ~d) − t||~u − ~d||₂² > 0,  (5.51)
where the first term is positive by (5.49). Now for all 0 < t < 2(~c − ~d)>(~u − ~d)/||~u − ~d||₂², i.e., for t small enough, the above inequality holds, so we have for this t that
||~p(t) − ~c||₂² = ||~c − ~d||₂² − 2t(~c − ~d)>(~u − ~d) + t²||~u − ~d||₂² < ||~c − ~d||₂².
Since ~p(t) ∈ D, this contradicts the fact that ~d attains the minimum distance to ~c among points of D. Thus no such ~u exists, and f(~u) < 0 for every ~u ∈ D, as desired.
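The objects in this proof can be computed for concrete sets. The sketch below, assuming the cvxpy package is available (not part of the text), takes C and D to be convex hulls of two illustrative finite point sets, finds the closest points ~c and ~d, and forms the separating hyperplane via ~a = ~c − ~d and ~x0 = (~c + ~d)/2.

```python
# A minimal sketch, assuming cvxpy and numpy are installed.
import cvxpy as cp
import numpy as np

P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # C = conv(rows of P)
Q = np.array([[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])  # D = conv(rows of Q)

lam = cp.Variable(3, nonneg=True)   # simplex weights parameterizing C
mu = cp.Variable(3, nonneg=True)    # simplex weights parameterizing D
c, d = P.T @ lam, Q.T @ mu
cp.Problem(cp.Minimize(cp.sum_squares(c - d)),
           [cp.sum(lam) == 1, cp.sum(mu) == 1]).solve()

a = c.value - d.value               # normal vector ~a = ~c - ~d
x0 = (c.value + d.value) / 2        # midpoint ~x0 = (~c + ~d)/2
# A linear function attains its extremes over a hull at the vertices,
# so checking the vertices suffices: f > 0 on C and f < 0 on D.
print(min(a @ (p - x0) for p in P), max(a @ (q - x0) for q in Q))
```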
The following content is optional/out of scope for this semester. Regardless, it may be helpful to read it to
gain context, or get a deeper understanding of various results.
(c) We call K a pointed cone if it contains no line through the origin, i.e., if for each nonzero ~v ∈ K, there exists some α ∈ R such that α~v ∉ K.
(d) We call K a solid cone if it has non-empty interior, i.e., if there exists some ~v ∈ K and some r > 0 such that the open ball in Rn of radius r centered at ~v is contained in K: namely, we have {~w ∈ Rn | ||~w − ~v||₂ < r} ⊆ K.
(e) We call K a closed cone if it is a closed set, i.e., it contains its boundary points.
Note that non-empty cones must contain the zero vector, which corresponds to the case of taking α = 0 in the
definition of a cone.
The definition of proper cones is motivated by their connection to generalized inequalities in convex optimization,
which will be discussed later in the course in the context of second-order cone programs (SOCPs) and semidefinite
programming (SDPs). For this we require the above definitions to apply to a slightly broader context. We would need
to replace Rn with a generic vector space V and the ||·||₂ norm with any norm ||·||V on this vector space. In fact, for the following results to hold we additionally need an inner product ⟨·, ·⟩V on this vector space that is compatible with the norm, i.e., ⟨~x, ~x⟩V = ||~x||V²; that is, we would need an inner product space. One can check (as an exercise) that Rn is an example of such an inner product space, with the ℓ2 norm ||·||₂ and the usual inner product. Thus, in order to generalize the results introduced in this section, we would replace Rn with V, replace the norm ||~x||₂ with ||~x||V, and replace the inner product ~x>~y with ⟨~x, ~y⟩V.⁴
For now, we return to working over Rn so as to not introduce additional complexity in the definitions.
Definition 111
(a) A set of the form
C := {(~x, t) ∈ Rn+1 | A~x ≤ t~y, t ≥ 0}  (5.54)
is called a polyhedral cone, and in particular corresponds to the polyhedron {~x ∈ Rn | A~x ≤ ~y}.
(b) A set of the form
C := {(~x, t) ∈ Rn+1 | ||A~x − t~y||₂ ≤ zt, t ≥ 0}
is called an ellipsoidal cone, and in particular corresponds to the ellipse {~x ∈ Rn | ||A~x − ~y||₂ ≤ z}.ᵃ
ᵃ The ellipsoidal cone corresponding to the unit circle (which is, after all, an ellipse) is the second-order cone, to be discussed later.
Proposition 112
Polyhedral and ellipsoidal cones are convex cones.
⁴In this class, the issue really only comes up when discussing vector spaces where each element is a matrix, where the norm is the Frobenius
norm, and the inner product is a corresponding “Frobenius inner product,” to be defined later. This is relevant in semidefinite programming, for
example.
Proposition 113
Let K ⊆ Rn be a cone. Define the dual cone of K by
K⋆ := {~y ∈ Rn | ~y>~x ≥ 0 for each ~x ∈ K}.  (5.56)
Then K⋆ is a closed convex cone.
Proof. First, let ~y1, ~y2 ∈ K⋆ and α, β ≥ 0. For any ~x ∈ K, we have (α~y1 + β~y2)>~x = α~y1>~x + β~y2>~x ≥ 0, so α~y1 + β~y2 ∈ K⋆. In particular, this holds for β = 0 (so that K⋆ is a cone), and α ∈ [0, 1] and β = 1 − α (so that K⋆ is convex). Thus, K⋆ is a convex cone.
Now we want to show that K⋆ contains its limits. Let (~yk)∞k=1 be a sequence in K⋆ that converges to some ~y ∈ Rn. We want to show that ~y ∈ K⋆. Indeed, for any ~x ∈ K, we want to show that ~y>~x ≥ 0. But this is true because we have
~y>~x = (limk→∞ ~yk)>~x = limk→∞ ~yk>~x ≥ 0,  (5.58)
where each ~yk>~x ≥ 0. Since ~x was arbitrary, we have ~y ∈ K⋆. Thus K⋆ contains its limits and is a closed cone.
A geometric interpretation of the dual cone is that K⋆ is the intersection of the half-spaces H~x := {~y ∈ Rn | ~y>~x ≥ 0} defined by each vector ~x in K.
Below, we provide some examples of cones and their dual cones. The reader is encouraged to verify the following
statements.
Example 114.
(a) The set Rn+ := {~x ∈ Rn | xi ≥ 0 for all i ∈ {1, . . . , n}} is a convex cone, and its dual cone in Rn is itself.
(b) Let S := {~x ∈ R² | x1 = 0 or x2 = 0}. Then S is a cone but is not a convex cone, and the dual cone of S is {~0}, the singleton set comprised of the 2-dimensional zero vector.
(c) Let S ⊆ Rn be a subspace. Then S is a convex cone, and the orthogonal complement S⊥ of S is the dual cone of S.
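These claims can be probed numerically. Dual-cone membership requires ~y>~x ≥ 0 against every ~x in the cone, so testing against many sampled cone elements gives a necessary-condition check (not a proof). A minimal sketch for parts (a) and (b):

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_in_dual(y, cone_samples, tol=1e-9):
    # Necessary condition for y in K*: y^T x >= 0 on all sampled x in K.
    return all(y @ x >= -tol for x in cone_samples)

# (a) K = R^2_+ : its dual cone is itself.
K = rng.uniform(0, 1, size=(1000, 2))
print(maybe_in_dual(np.array([1.0, 2.0]), K))    # True: nonneg entries
print(maybe_in_dual(np.array([1.0, -0.1]), K))   # False

# (b) S = the two coordinate axes: its dual cone is {0}.
ts = rng.uniform(-1, 1, size=1000)
S = [np.array([t, 0.0]) for t in ts] + [np.array([0.0, t]) for t in ts]
print(maybe_in_dual(np.array([0.3, 0.3]), S))    # False: only ~0 passes
```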
Two proper cones with interesting properties that are widely used in convex optimization are the cone of symmetric
positive semi-definite matrices and the second-order cone. The propositions below explore their properties.
Proposition 115
Let Sn be the vector space of n × n real-valued symmetric matrices, equipped with the Frobenius inner product
⟨A, B⟩F := tr(AB) = Σ_{i=1}^n Σ_{j=1}^n Aij Bij, for any A, B ∈ Sn,  (5.59)
and the Frobenius norm ||·||F. Let Sn+ denote the set of all n × n positive semidefinite matrices.
(a) Sn+ is a proper cone in Sn, i.e., a convex cone which is pointed, solid, and closed.
(b) The dual cone of Sn+ in Sn is Sn+ itself.
To start with, this is an instance of the earlier discussion: not every application of cones will be with reference to Rn, but could be with reference to another inner product space. Here it is the vector space Sn with the appropriate inner product and norm. The intuition for why we can do this is that Sn is a subspace of Rn×n, the space of n × n matrices. By stacking up the entries of an n × n matrix we get an n²-dimensional vector, i.e., an element of Rn². Indeed, the Frobenius norm and inner product on matrices are exactly the ℓ2 norm and inner product applied to the “unrolled” matrices in Rn². Thus, one can informally view Sn as a subspace of Rn² (though remember that it is a vector space in its own right, so that we can define things like interiors and dual cones with respect to it instead of its “parent” space Rn²), so the same proof techniques and intuitions carry over.
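This “unrolling” identification is easy to verify numerically: the Frobenius inner product and norm coincide with the ℓ2 inner product and norm of the flattened matrices. A quick numpy check (random symmetric matrices, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = (A + A.T) / 2   # symmetric
B = rng.standard_normal((3, 3)); B = (B + B.T) / 2

print(np.isclose(np.trace(A @ B), A.reshape(-1) @ B.reshape(-1)))  # True
print(np.isclose(np.linalg.norm(A, "fro"),
                 np.linalg.norm(A.reshape(-1))))                   # True
```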
Proof.
(a) To show that Sn+ is a convex cone, let A, B ∈ Sn+ and α, β ≥ 0 be given. We wish to show that αA + βB ∈ Sn+ ,
which will confirm that Sn+ is a convex cone. Indeed, αA + βB is symmetric as the linear combination of two
symmetric matrices. To show that it is positive semidefinite, let ~v ∈ Rn be arbitrarily given. Then we have
~v>(αA + βB)~v = α~v>A~v + β~v>B~v ≥ 0,
since α, β ≥ 0 and A, B are positive semidefinite. Since ~v was arbitrarily given, αA + βB is positive semidefinite. Thus αA + βB ∈ Sn+. This holds for β = 0,
so Sn+ is a cone, and also for α ∈ [0, 1] and β = 1 − α, so Sn+ is convex.
We now show that Sn+ is pointed, i.e., it contains no lines through the origin. Let A ∈ Sn+ be a nonzero matrix. Then −A ∉ Sn+, because there exists ~v ∈ Rn such that ~v>A~v > 0, at which point ~v>(−A)~v = −~v>A~v < 0, so −A is not positive semidefinite. Thus for any nonzero A ∈ Sn+ there exists α ∈ R (namely α = −1) such that αA ∉ Sn+, so Sn+ contains no lines through the origin and is pointed.
We now show that Sn+ is solid, i.e., has nonempty interior. We show that the open ball in Sn defined by
B := { A ∈ Sn : ||A − I||F < 1/2 }  (5.61)
is contained in Sn+. Indeed, let A ∈ B. By definition of B, we have that A is symmetric. Moreover, for each ~v ∈ Rn we have
~v>A~v = ~v>~v + ~v>(A − I)~v ≥ ||~v||₂² − ||A − I||F ||~v||₂² ≥ (1/2)||~v||₂².
For ~v nonzero, we have (1/2)||~v||₂² > 0, and so ~v>A~v > 0. Thus A ∈ Sn+, so B ⊆ Sn+ and Sn+ is solid.
We now show that Sn+ is closed, i.e., contains its limits. Let (Ak)∞k=1 be a sequence in Sn+ converging to some A ∈ Sn. We want to show that A ∈ Sn+. As the limit of symmetric matrices, A is symmetric. Now for any ~v ∈ Rn we have
~v>A~v = ~v>(limk→∞ Ak)~v = limk→∞ ~v>Ak~v ≥ 0,  (5.69)
since each ~v>Ak~v ≥ 0. Thus A ∈ Sn+, and Sn+ is closed.
(b) We now show that the dual cone of Sn+ in Sn is itself. That is, defining the dual cone as (Sn+)⋆ := {A ∈ Sn | ⟨A, B⟩F ≥ 0 for all B ∈ Sn+}, we want to show that (Sn+)⋆ = Sn+. To do this, we show that (Sn+)⋆ ⊆ Sn+ and that (Sn+)⋆ ⊇ Sn+.
We first show that (Sn+)⋆ ⊆ Sn+. Fix A ∈ (Sn+)⋆, and let ~v ∈ Rn be given arbitrarily. Then, since ~v~v> ∈ Sn+, we have
~v>A~v = tr(~v>A~v)
= tr(A~v~v>)  (5.71)
= ⟨A, ~v~v>⟩F  (5.72)
≥ 0,  (5.73)
where in the first line we use the fact that the trace of a scalar is the scalar itself, in the second line we use the cyclic property of the trace, and the last inequality is justified because ~v~v> ∈ Sn+ and A ∈ (Sn+)⋆. Since ~v was selected arbitrarily, ~v>A~v ≥ 0 for all ~v ∈ Rn. This (along with the fact that A is symmetric) proves that A ∈ Sn+. Since A was selected arbitrarily, (Sn+)⋆ ⊆ Sn+.
Now we show that Sn+ ⊆ (Sn+)⋆. Let B ∈ Sn+. We aim to show that B ∈ (Sn+)⋆, i.e., that ⟨B, C⟩F ≥ 0 for any C ∈ Sn+. By the spectral theorem, we may diagonalize C = Σ_{i=1}^n λi ~vi~vi>, where λi ≥ 0 are the eigenvalues of C and ~vi are corresponding orthonormal eigenvectors. Then
⟨B, C⟩F = Σ_{i=1}^n λi ⟨B, ~vi~vi>⟩F = Σ_{i=1}^n λi ~vi>B~vi ≥ 0,  (5.80)
since each λi ≥ 0 and each ~vi>B~vi ≥ 0 because B is PSD. Thus ⟨B, C⟩F ≥ 0. Since C ∈ Sn+ was arbitrary, we have B ∈ (Sn+)⋆. Since B was arbitrary, we have Sn+ ⊆ (Sn+)⋆.
Thus, (Sn+)⋆ = Sn+.
The next example of a cone will be useful when discussing the eponymous second-order cone programs (SOCPs).
Proposition 117
Let K := {(~x, t) ∈ Rn × R | ||~x||₂ ≤ t} be the second-order cone in Rn+1.
(a) K is a proper cone, i.e., a convex cone which is pointed, solid, and closed.
(b) The dual cone of K in Rn+1 is K itself.
Proof.
(a) We first show that K is a convex cone. Let (~x1, t1), (~x2, t2) ∈ K and let α1, α2 ≥ 0. Then
||α1~x1 + α2~x2||₂ ≤ α1||~x1||₂ + α2||~x2||₂ ≤ α1t1 + α2t2,
so α1(~x1, t1) + α2(~x2, t2) ∈ K, where the first inequality is by the triangle inequality and the second is by definition of the second-order cone. This holds for α2 = 0, showing that K is indeed a cone, and for α1 ∈ [0, 1] and α2 = 1 − α1, showing that K is convex. Thus K is a convex cone.
We show that K is pointed, i.e., contains no lines through the origin. Indeed, let (~x, t) ∈ K be nonzero. Then either ~x is nonzero or t is nonzero (or both); in the first case, ||~x||₂ > 0, so since t ≥ ||~x||₂ we have t > 0 as well; in the second case, t ≠ 0 and t ≥ ||~x||₂ ≥ 0 force t > 0. Thus t > 0 in all cases. Thus we certainly do not have ||−~x||₂ ≤ −t (indeed, norms can never be negative, while −t < 0), so that −(~x, t) = (−~x, −t) ∉ K. Thus for any nonzero (~x, t) ∈ K there exists α ∈ R such that α(~x, t) ∉ K, so K is pointed.
We now show that K is solid, i.e., has nonempty interior. We claim that the open ball in Rn+1 of radius 1 centered at (~0, 2), where ~0 is the n-dimensional zero vector, is contained in K. Formally, define
B := {(~x, t) ∈ Rn+1 | ||(~x, t) − (~0, 2)||₂ < 1}.  (5.84)
Let (~x, t) ∈ B. Then
||(~x, t) − (~0, 2)||₂ < 1  (5.85)
⟹ ||(~x, t) − (~0, 2)||₂² < 1² = 1  (5.86)
⟹ ||~x||₂² + (t − 2)² < 1,  (5.87)
which implies that ||~x||₂² < 1 and (t − 2)² < 1, i.e., t ∈ (1, 3). Thus ||~x||₂ < 1 < t, so ||~x||₂ ≤ t, so (~x, t) ∈ K as desired. Since (~x, t) ∈ B was arbitrarily chosen, B ⊆ K and K is solid.
We now show that K is closed, i.e., contains its limits. Let ((~xk, tk))∞k=1 be a sequence in K that converges to some (~x, t) ∈ Rn+1. We want to show that (~x, t) ∈ K. Indeed, we have that (~x, t) ∈ K if and only if t − ||~x||₂ ≥ 0. Since the norm is continuous, we have
t − ||~x||₂ = limk→∞ (tk − ||~xk||₂) ≥ 0,  (5.90)
since each tk − ||~xk||₂ ≥ 0. Thus (~x, t) ∈ K, and K is closed.
(b) We show that the dual cone of K in Rn+1 is K itself. Let K⋆ = {(~y, s) ∈ Rn+1 | (~y, s)>(~x, t) ≥ 0 for all (~x, t) ∈ K} be the dual cone of K. We first show that K⋆ ⊆ K, then show that K⋆ ⊇ K.
First, to show that K⋆ ⊆ K, fix (~y, s) ∈ K⋆. We want to show that s ≥ ||~y||₂, so that (~y, s) ∈ K. Since (~0, 1) ∈ K, by definition of K⋆ we have 0 ≤ (~y, s)>(~0, 1) = s; this settles the case ~y = ~0. If ~y ≠ ~0, then since (−~y, ||~y||₂) ∈ K, we have
0 ≤ (~y, s)>(−~y, ||~y||₂) = −||~y||₂² + s||~y||₂  (5.92)
⟹ s||~y||₂ ≥ ||~y||₂²  (5.93)
⟹ s ≥ ||~y||₂.  (5.94)
This shows that (~y, s) ∈ K, so that K⋆ ⊆ K.
Next, to show that K ⊆ K⋆, fix (~y, s) ∈ K and let (~x, t) ∈ K be arbitrary. Then
(~y, s)>(~x, t) = ~y>~x + st ≥ −||~y||₂||~x||₂ + st ≥ −st + st = 0,
where the first inequality follows by the Cauchy–Schwarz inequality, and the second inequality follows from the fact that since (~x, t), (~y, s) ∈ K we have ||~x||₂ ≤ t and ||~y||₂ ≤ s. This shows that (~y, s) ∈ K⋆, so that K ⊆ K⋆.
We have shown that K⋆ ⊆ K and K ⊆ K⋆, so K = K⋆.
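The same sampling probe used for the earlier dual-cone examples, applied to the second-order cone in R³, is consistent with self-duality (again a necessary-condition check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points (x, t) of K = {(x, t) : ||x||_2 <= t} in R^3.
samples = []
while len(samples) < 2000:
    x, t = rng.uniform(-1, 1, size=2), rng.uniform(0, 2)
    if np.linalg.norm(x) <= t:
        samples.append(np.concatenate([x, [t]]))

def maybe_in_dual(z, tol=1e-9):
    return all(z @ p >= -tol for p in samples)

print(maybe_in_dual(np.array([0.3, 0.4, 1.0])))  # True:  ||(0.3, 0.4)|| <= 1
print(maybe_in_dual(np.array([0.8, 0.8, 1.0])))  # False (with high probability)
```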
Theorem 118
Let K ⊆ Rn be a non-empty closed convex cone. Then (K⋆)⋆ = K.
Proof. The inclusion K ⊆ (K⋆)⋆ is immediate: for any ~x ∈ K and any ~y ∈ K⋆ we have ~y>~x ≥ 0 by definition of K⋆, which is exactly the condition for ~x ∈ (K⋆)⋆. For the reverse inclusion, suppose for the sake of contradiction that there is some ~y ∈ (K⋆)⋆ with ~y ∉ K. Since K is non-empty, closed, and convex, the separating hyperplane theorem gives a nonzero ~w ∈ Rn and c ∈ R such that ~w>~x > c for all ~x ∈ K, and ~w>~y < c. Since ~0 ∈ K, we have 0 = ~w>~0 > c, i.e., c < 0, so ~w>~y < c < 0. Since K is a cone, for any α > 0 we have α~x ∈ K for each ~x ∈ K, so ~w>(α~x) > c, i.e., ~w>~x > c/α; letting α → ∞ gives ~w>~x ≥ 0 for all ~x ∈ K, i.e., ~w ∈ K⋆. But then ~y ∈ (K⋆)⋆ would require ~w>~y ≥ 0, contradicting ~w>~y < 0. Thus (K⋆)⋆ ⊆ K, completing the proof.
The above content is optional/out of scope for this semester, but now we resume the required/in scope content.
Equation (5.99) is also called Jensen's inequality and is equivalent to the following, seemingly more general statement: for any ~x1, . . . , ~xk ∈ Ω and any θ1, . . . , θk ≥ 0 with θ1 + · · · + θk = 1, we have f(θ1~x1 + · · · + θk~xk) ≤ θ1 f(~x1) + · · · + θk f(~xk).
[Figure: graph(f) for a convex f on Ω; the point on the chord, (θ~x1 + (1 − θ)~x2, θf(~x1) + (1 − θ)f(~x2)), lies above the point on the graph, (θ~x1 + (1 − θ)~x2, f(θ~x1 + (1 − θ)~x2)).]
The prototypical convex function has a “bowl-shaped” graph, and taking a weighted average of two (or any finite number) of points will mean we land in the bowl. In particular, taking a weighted average of any number of function values f(~x1), . . . , f(~xk) will always give a number at least as large as applying the function f to the same weighted average of the points ~x1, . . . , ~xk. Put more simply, if f is convex then the chord joining the points (~x1, f(~x1)) and (~x2, f(~x2)) always lies above the graph of f. Similarly, if f is concave then the chord joining the points (~x1, f(~x1)) and (~x2, f(~x2)) always lies below the graph of f.
From the picture, it may be intuitively clear that it is hard to construct convex functions f with multiple global
minima. We will come back to this idea later.
It may be useful to connect the notion of convex function and convex set. For this, we will define the epigraph epi(f) := {(~x, t) ∈ Ω × R | t ≥ f(~x)}. Geometrically, the epigraph is all points in Ω × R that lie on or above the graph of f.
[Figure: graph(f) and the shaded region epi(f) above it.]
Proposition 122
Let Ω ⊆ Rn be a convex set and let f : Ω → R be a function. Then f is a convex function if and only if epi(f ) is
a convex set.
Theorem 123 (First-Order Characterization of Convexity)
Let Ω ⊆ Rn be an open convex set and let f : Ω → R be differentiable. Then f is convex if and only if
f(~y) ≥ f(~x) + [∇f(~x)]>(~y − ~x), for all ~x, ~y ∈ Ω.
Note that the right-hand side is the first-order Taylor expansion of f around ~x evaluated at ~y, i.e., f̂1(~y; ~x) = f(~x) + [∇f(~x)]>(~y − ~x). The graph of f̂1(·; ~x) is the tangent line to the graph of f at the point (~x, f(~x)). So another characterization of convex functions is that their graphs lie above their tangent lines.
[Figure: the graph of f over Ω lies above the graph of its tangent f̂1(·; ~x) at (~x, f(~x)); at ~y, the value f(~y) sits above f̂1(~y; ~x).]
Proof. First suppose f is convex. Then for any h ∈ (0, 1), we have
f(~x + h(~y − ~x)) = f((1 − h)~x + h~y) ≤ (1 − h)f(~x) + hf(~y),
so that, rearranging,
f(~y) ≥ f(~x) + (f(~x + h(~y − ~x)) − f(~x))/h.
Taking the limit as h → 0⁺ gives
f(~y) ≥ f(~x) + limh→0⁺ (f(~x + h(~y − ~x)) − f(~x))/h = f(~x) + [∇f(~x)]>(~y − ~x),
as desired. Here the last equality is because the limit is interpreted as a directional derivative, and it has already been shown that directional derivatives are equal to inner products of the gradient with the direction vector.
For the other direction, let θ ∈ [0, 1] and let ~z = θ~x + (1 − θ)~y. We have
f(~x) ≥ f(~z) + [∇f(~z)]>(~x − ~z)  and  f(~y) ≥ f(~z) + [∇f(~z)]>(~y − ~z).
Adding θ times the first inequality to (1 − θ) times the second inequality, we get
θf(~x) + (1 − θ)f(~y) ≥ θf(~z) + (1 − θ)f(~z) + θ[∇f(~z)]>(~x − ~z) + (1 − θ)[∇f(~z)]>(~y − ~z)  (5.116)
= f(~z) + [∇f(~z)]>(θ~x + (1 − θ)~y − ~z)  (5.117)
= f(~z) + [∇f(~z)]>(~z − ~z)  (5.118)
= f(~z) + [∇f(~z)]>~0  (5.119)
= f(~z)  (5.120)
= f(θ~x + (1 − θ)~y).  (5.121)
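The first-order characterization is easy to test numerically for a concrete convex function; below, a least-squares objective (an illustrative choice) is compared against its tangent plane at random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)

f = lambda x: np.sum((A @ x - b) ** 2)       # a convex function
grad = lambda x: 2 * A.T @ (A @ x - b)

for _ in range(1000):
    x, y = rng.standard_normal((2, 3))
    tangent = f(x) + grad(x) @ (y - x)       # first-order Taylor at x
    assert f(y) >= tangent - 1e-9            # graph lies above the tangent
print("first-order condition holds on all samples")
```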
Corollary 125. Let Q ∈ Sn be a symmetric matrix, let ~b ∈ Rn, and let c ∈ R. The quadratic form f(~x) := ~x>Q~x + ~b>~x + c is convex if and only if Q is positive semidefinite.
Lastly, we identify a strengthened condition of convexity which allows for stronger guarantees later down the line.
Notice that this is not an if-and-only-if! As an example, take the scalar function f(x) = x⁴. Then f″(0) = 0, so it is not true that f″(x) > 0 for all x ∈ Ω = R, but f is strictly convex.
(b) A function f~ : Rn → Rm is said to be affine if there exists some matrix A ∈ Rm×n and some vector ~b ∈ Rm
such that for any ~x ∈ Rn , we have
f~(~x) = A~x + ~b. (5.127)
(c) A function f : Rm×n → R is said to be affine if there exists some matrix A ∈ Rm×n and scalar b ∈ R such that for any X ∈ Rm×n, we have
f(X) = Σ_{i=1}^m Σ_{j=1}^n Aij Xij + b = tr(A>X) + b.  (5.128)
Note that a given function f : Rn → R is affine if and only if the function g : Rn → R defined by g(~x) = f (~x)−f (~0)
is linear. An analogous result holds for other types of affine functions.
Below, we show that a scalar-valued affine function is one that is both convex and concave, while a vector-valued
affine function is one whose component functions are all both convex and concave. Analogous results hold for affine
functions whose inputs and outputs are both matrices.
Proposition 130
(a) A function f : Rn → R is affine if and only if it is both convex and concave, i.e., for any α ∈ [0, 1] and
~x, ~y ∈ Rn , we have
f (α~x + (1 − α)~y ) = αf (~x) + (1 − α)f (~y ). (5.129)
(b) A function f~ : Rn → Rm is affine if and only if each component function of f~ is both convex and concave, i.e., for any α ∈ [0, 1] and ~x, ~y ∈ Rn, we have
f~(α~x + (1 − α)~y) = αf~(~x) + (1 − α)f~(~y).  (5.130)
(c) A function f : Rm×n → R is affine if and only if it is both convex and concave, i.e., for any α ∈ [0, 1] and
X, Y ∈ Rm×n , we have
f (αX + (1 − α)Y ) = αf (X) + (1 − α)f (Y ). (5.131)
Proof. We prove (a); the claims (b) and (c) follow similarly.
Suppose first that f is affine. Then there exist ~a ∈ Rn and b ∈ R such that for each ~x, ~y ∈ Rn and α ∈ [0, 1], we have
f(α~x + (1 − α)~y) = ~a>(α~x + (1 − α)~y) + b = α(~a>~x + b) + (1 − α)(~a>~y + b) = αf(~x) + (1 − α)f(~y),
so f is both convex and concave. Conversely, suppose that
f(α~x + (1 − α)~y) = αf(~x) + (1 − α)f(~y), for all α ∈ [0, 1] and ~x, ~y ∈ Rn.  (5.136)
To show that f is affine, it suffices to show that g : Rn → R defined by g(~x) = f (~x) − f (~0) is linear (and thus can be
written as an inner product against a vector ~a). We first show that g(r~x) = rg(~x) for any r ∈ R and ~x ∈ Rn . We break
this problem up into three cases, each building on the other.
Case 1. Suppose that r ∈ [0, 1]. Then r~x can be expressed as a convex combination of ~0 and ~x: that is,
r~x = r~x + (1 − r)~0.
(Yes, this is a simpler step, but we build on it in the later parts.) With this, we have
f(r~x) = f(r~x + (1 − r)~0) = rf(~x) + (1 − r)f(~0).
Thus, we obtain
g(r~x) = f(r~x) − f(~0) = rf(~x) + (1 − r)f(~0) − f(~0) = r(f(~x) − f(~0)) = rg(~x).
Case 2. Now suppose that r ∈ (1, ∞). Then ~x can be expressed as a convex combination of ~0 and r~x: that is,
~x = α(r~x) + (1 − α)~0, where α = 1/r ∈ (0, 1).  (5.144)
Thus we have
f(~x) = f((1/r)(r~x) + (1 − 1/r)~0) = (1/r)f(r~x) + (1 − 1/r)f(~0).  (5.145)
Multiplying both sides by r and plugging it into the previous calculation, we get
g(r~x) = f(r~x) − f(~0)
= rf(~x) − r(1 − 1/r)f(~0) − f(~0)  (5.147)
= rf(~x) − (r − 1)f(~0) − f(~0)  (5.148)
= rf(~x) − rf(~0)  (5.149)
= r(f(~x) − f(~0))  (5.150)
= rg(~x).  (5.151)
Case 3. Now suppose that r ∈ (−∞, 0). Then ~0 can be expressed as a convex combination of ~x and r~x: that is,
~0 = α(r~x) + (1 − α)~x, where α = 1/(1 − r) ∈ (0, 1).  (5.152)
Thus we have
f(~0) = (1/(1 − r))f(r~x) + (1 − 1/(1 − r))f(~x) = (1/(1 − r))f(r~x) − (r/(1 − r))f(~x).  (5.153)
Multiplying both sides by 1 − r and rearranging as before, we get g(r~x) = rg(~x).
Thus g(r~x) = rg(~x) for every r ∈ R. To conclude that g is linear, it remains to check additivity: for any ~x, ~y ∈ Rn,
g(~x + ~y) = 2g((1/2)(~x + ~y)) = 2[f((1/2)~x + (1/2)~y) − f(~0)] = 2[(1/2)f(~x) + (1/2)f(~y) − f(~0)] = g(~x) + g(~y),
where the first equality uses homogeneity and the third uses Equation (5.136) with α = 1/2. Thus we proved that g is linear, so f is affine. This is a full proof of (a), and (b) and (c) can be proved in almost exactly the same way.
Note that this applies to other kinds of constraint sets too; in particular, those of the “standard” form “fi(~x) ≤ 0 for all i and hj(~x) = 0 for all j” still define a feasible region Ω and thus can furnish a convex optimization problem. In particular, we have the following result: the problem
min_{~x∈Rn} f0(~x)  s.t. fi(~x) ≤ 0, ∀i ∈ {1, . . . , m},  hj(~x) = 0, ∀j ∈ {1, . . . , p}
is a convex optimization problem if f0, f1, . . . , fm are convex functions and h1, . . . , hp are affine functions.
Now, we can establish the first-order condition for optimality within a convex problem. This is one of the main
theorems of convex analysis.
A generalization of this statement is that all local minimizers are global minimizers.
Theorem 134 (For Convex Functions, Local Minima are Global Minima)
Let Ω ⊆ Rn be a convex set and let f : Ω → R be a convex function. Let ~x⋆ ∈ Ω be such that there exists some ε > 0 such that if ~x ∈ Ω has ||~x − ~x⋆||₂ ≤ ε then f(~x⋆) ≤ f(~x). Then ~x⋆ is a minimizer of f over Ω.
When is the global minimizer unique? We can justify this using strict convexity.
For an example of a strictly convex function with one global minimizer, take f(x) = x⁴, which is minimized at x = 0. For an example of a strictly convex function with no global minimizers, take f(x) = eˣ.
Fix a feasible ~x0 ∈ Rn . The inequality constraint fk (~x) ≤ 0 is active at ~x0 if fk (~x0 ) = 0, and inactive at ~x0
otherwise, i.e., fk (~x0 ) < 0.
We can use this to formulate a strategy for solving convex optimization problems. Recall that for a convex problem min~x∈Ω f0(~x) which has a solution ~x⋆, either ∇f0(~x⋆) = ~0 or ~x⋆ is on the boundary of Ω. A point is on the boundary of Ω when at least one inequality constraint is active there. This allows us to systematically find solutions to convex optimization problems.
Problem Solving Strategy 137. To solve a convex optimization program, we can do the following.
1. Iterate through all 2^m subsets S ⊆ {1, . . . , m} of constraints which might be active at optimum.
2. For each S:
(i) Solve the equality-constrained problem
min_{~x∈Rn} f0(~x)
s.t. fi(~x) = 0, ∀i ∈ S
hj(~x) = 0, ∀j ∈ {1, . . . , p},
i.e., solve the problem where you pretend that all inequality constraints in S are met with equality and pretend that the other inequality constraints don't exist. This gives some solutions ~x⋆S.
(ii) If there is a solution ~x?S which is feasible for the original problem, write down the value of f0 (~x?S ). Other-
wise, ignore it.
(iii) After iterating through all ~x?S which are feasible for the original problem, take the one(s) with the best
objective value f0 (~x?S ) as the optimal solution(s) to the original problem.
Predictably, this problem solving strategy is exponentially hard as the number of inequality constraints increases.
Even if solving the “inner” equality-constrained minimization problems is easy (as it often is), the whole procedure is
untenable for large-scale problems. In future chapters, we will develop better analytic and algorithmic ways to solve
convex optimization problems.
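For intuition, here is a hedged sketch of Strategy 137 on a toy problem, assuming the cvxpy package: enumerate candidate active sets, solve each equality-constrained subproblem, keep the candidates feasible for the original problem, and return the best. The objective and constraints below are illustrative choices, not from the text.

```python
# A minimal sketch of Problem Solving Strategy 137, assuming cvxpy.
# Toy problem: min ||x - p||^2  s.t.  x1 - 1 <= 0,  x2 - 1 <= 0.
import itertools
import numpy as np
import cvxpy as cp

p = np.array([2.0, 0.5])
fs = [lambda x: x[0] - 1, lambda x: x[1] - 1]   # constraints f_i(x) <= 0

best_val, best_x = np.inf, None
subsets = itertools.chain.from_iterable(
    itertools.combinations(range(2), k) for k in range(3))
for S in subsets:
    x = cp.Variable(2)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(x - p)),
                      [fs[i](x) == 0 for i in S])   # pretend S is active
    prob.solve()
    if x.value is not None and all(fs[i](x.value) <= 1e-8 for i in range(2)):
        if prob.value < best_val:                   # feasible: keep the best
            best_val, best_x = prob.value, x.value
print("optimal x:", best_x)  # expect (1, 0.5): only the first constraint active
```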
Proposition 138
Let Ω ⊆ Rn and let f0 : Ω → R. If φ : R → R is monotonically increasing on f0(Ω), then
argmin_{~x∈Ω} f0(~x) = argmin_{~x∈Ω} φ(f0(~x))  and  argmax_{~x∈Ω} f0(~x) = argmax_{~x∈Ω} φ(f0(~x)).  (5.173)
If ψ : R → R is monotonically decreasing on f0(Ω), then
argmin_{~x∈Ω} f0(~x) = argmax_{~x∈Ω} ψ(f0(~x))  and  argmax_{~x∈Ω} f0(~x) = argmin_{~x∈Ω} ψ(f0(~x)).  (5.174)
Proof. We only prove the very first equality; the rest follow similarly. For any ~x⋆ ∈ Ω, we have ~x⋆ ∈ argmin_{~x∈Ω} f0(~x) if and only if f0(~x⋆) ≤ f0(~x) for all ~x ∈ Ω, which (since φ is monotonically increasing on f0(Ω)) holds if and only if φ(f0(~x⋆)) ≤ φ(f0(~x)) for all ~x ∈ Ω, i.e., if and only if ~x⋆ ∈ argmin_{~x∈Ω} φ(f0(~x)).
The first equality will be by far the most important for us, though the others might also be situationally useful. This proposition is why doing things like squaring the norm in least squares won't affect the solutions we get, i.e.,
argmin_{~x∈Rn} ||A~x − ~b||₂ = argmin_{~x∈Rn} ||A~x − ~b||₂².  (5.179)
But the fact that φ is monotonic when restricted to f0(Ω) is quite crucial; indeed, for a general f0 it can happen that
argmin_{~x∈Ω} f0(~x) ≠ argmin_{~x∈Ω} f0(~x)².
(For instance, f0(x) = x on Ω = [−1, 1] is minimized at x = −1, while f0(x)² = x² is minimized at x = 0.) This is because the function u ↦ u² is not monotonic in general, although it is monotonic on the non-negative real numbers R+. It just so happens that ||·||₂ only outputs non-negative numbers, so actually in the case of least squares we have f0(Ω) ⊆ R+ and the proposition applies.
Example 139 (Logistic Regression). A more non-trivial example is logistic regression. First, define σ : R → (0, 1) by
σ(x) := 1/(1 + e^{−x}).  (5.181)
Suppose we have data points ~x1, . . . , ~xn ∈ Rd and accompanying labels y1, . . . , yn ∈ {0, 1}. Suppose that the conditional probability that yi = 1 given ~xi is given by
P~w0[yi = 1 | ~xi] = σ(~xi>~w0)
for some ~w0 ∈ Rd. We wish to recover ~w0, and thus recover the generative model P~w0[y | ~x]. We do this by maximum likelihood estimation. Writing out the problem, we get
argmax_{~w∈Rd} ∏_{i=1}^n P~w[yi | ~xi]  (5.184)
= argmax_{~w∈Rd} ∏_{i : yi=0} P~w[yi = 0 | ~xi] · ∏_{i : yi=1} P~w[yi = 1 | ~xi]  (5.185)
= argmax_{~w∈Rd} ∏_{i=1}^n P~w[yi = 1 | ~xi]^{yi} · P~w[yi = 0 | ~xi]^{1−yi}  (5.186)
= argmax_{~w∈Rd} ∏_{i=1}^n σ(~xi>~w)^{yi} (1 − σ(~xi>~w))^{1−yi}.  (5.187)
Now, the σ function is very non-convex. The product of σ's is also very non-convex. Thus, it seems intractable to solve this problem. But note that the objective function takes values in (0, 1). We use the above proposition with the function x ↦ log(x), which is monotonically increasing on (0, 1). We obtain
argmax_{~w∈Rd} ∏_{i=1}^n σ(~xi>~w)^{yi} (1 − σ(~xi>~w))^{1−yi}  (5.188)
= argmax_{~w∈Rd} log( ∏_{i=1}^n σ(~xi>~w)^{yi} (1 − σ(~xi>~w))^{1−yi} )  (5.189)
= argmax_{~w∈Rd} Σ_{i=1}^n [ yi log σ(~xi>~w) + (1 − yi) log(1 − σ(~xi>~w)) ]  (5.190)
= argmin_{~w∈Rd} −Σ_{i=1}^n [ yi log σ(~xi>~w) + (1 − yi) log(1 − σ(~xi>~w)) ]  (5.191)
= argmin_{~w∈Rd} Σ_{i=1}^n [ −yi log σ(~xi>~w) − (1 − yi) log(1 − σ(~xi>~w)) ].  (5.192)
In the penultimate line we used another one of the equalities in the proposition with the monotonically decreasing function ψ(x) = −x. Thus, logistic regression reduces to minimizing the objective function
f0(~w) = Σ_{i=1}^n [ −yi log σ(~xi>~w) − (1 − yi) log(1 − σ(~xi>~w)) ].  (5.193)
One can compute the Hessian of this objective as
∇²f0(~w) = Σ_{i=1}^n σ(~xi>~w)(1 − σ(~xi>~w)) ~xi~xi>.
Because σ(x) ∈ (0, 1), the above Hessian is a non-negative weighted sum of positive semidefinite matrices ~xi~xi> and is thus positive semidefinite. By the second-order conditions, f0 is convex! Thus we have turned an extremely non-convex problem into an unconstrained convex minimization problem just by a neat application of monotone functions. We can efficiently solve this problem algorithmically using iterative methods such as gradient descent.
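A minimal numpy sketch of minimizing (5.193) by gradient descent on synthetic data, using the standard gradient Σi (σ(~xi>~w) − yi)~xi; the data, step size, and iteration count below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

sigma = lambda z: 1 / (1 + np.exp(-z))

def grad_f0(w):
    # Gradient of the negative log-likelihood (5.193): X^T (sigma(Xw) - y).
    return X.T @ (sigma(X @ w) - y)

w, eta = np.zeros(d), 0.01
for _ in range(5000):
    w = w - eta * grad_f0(w)
print("recovered w:", w)   # roughly aligned with w_true
```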
Actually, we can do better than gradient descent for this particular example! If we define X ∈ Rn×d as the matrix whose i-th row is ~xi>, ~y := (y1, . . . , yn) ∈ Rn, and ~p(~w) := (σ(~x1>~w), . . . , σ(~xn>~w)) ∈ Rn,  (5.196)
then the gradient of f0 can be written compactly as ∇f0(~w) = X>(~p(~w) − ~y).
Let S ⊆ {1, . . . , m} be any subset. The above problem is equivalent to the following problem:
min_{~x∈Rn, ~s∈RS+} f0(~x)
s.t. fi(~x) + si = 0, ∀i ∈ S
fi(~x) ≤ 0, ∀i ∈ {1, . . . , m} \ S
hj(~x) = 0, ∀j ∈ {1, . . . , p}.
Here the notation RS+ := {(xi)i∈S | xi ≥ 0 ∀i ∈ S}, and ~s is called a slack variable.
One can choose to create slack variables si for only a subset of the inequality constraints, or all of them. When we
work with more advanced optimization algorithms later, sometimes this parameterization is crucial (e.g. for equality-
constrained Newton’s method).
Example 141. If we have a problem of the form
min_{~x∈Rn} f0(~x)  s.t. fi(~x) ≤ 0, ∀i ∈ {1, . . . , m},
but our solver could only handle equality constraints, then it would be equivalent to solve the problem
min_{~x∈Rn, ~s∈Rm+} f0(~x)  s.t. fi(~x) + si = 0, ∀i ∈ {1, . . . , m},
and, upon solving this problem and obtaining (~x⋆, ~s⋆), the solution to the original problem would be this same ~x⋆.
Definition 142
Consider a problem of the form
min_{t∈R, ~x∈Rn} t  (5.207)
s.t. t ≥ f0(~x)
fi(~x) ≤ 0, ∀i ∈ {1, . . . , m}
hj(~x) = 0, ∀j ∈ {1, . . . , p}.
This is called the epigraph form of the problem of minimizing f0(~x) subject to the same constraints; at an optimum, t = f0(~x), so the two problems have the same optimal ~x.
The epigraph objective is always a linear and differentiable function of the decision variables (t, ~x). However, the
constraint can become complicated if f0 (~x) is non-linear. This transformation is especially useful in the case of
quadratically-constrained quadratic programs (QCQP).
Example 143 (Elastic-Net Regularization). This example uses the two previously-discussed techniques in tandem to figure out how to handle a regularizer with both smooth and non-smooth components (i.e., ℓ2 and ℓ1 norms).
Let A ∈ Rm×n and ~y ∈ Rm. Suppose that we have a problem of the form
min_{~x∈Rn} { ||A~x − ~y||₂² + α||~x||₂² + β||~x||₁ }.  (5.208)
The regularizer α||~x||₂² + β||~x||₁ is called the elastic net regularizer and encourages “sparse” and small ~x; this regularizer has some use in the analysis of high-dimensional and structured data.
Suppose that our solver can only handle differentiable objectives, but is able to handle constraints so long as they
are also differentiable. Then we cannot solve the problem out-right using our solver, so we need to reformulate it. We
can first start by using a modification of the epigraph reformulation:
min_{t∈R, ~x∈Rn} ||A~x − ~y||₂² + α||~x||₂² + βt  (5.209)
s.t. t ≥ ||~x||₁.  (5.210)
Now the constraint is non-differentiable, so we are no longer able to exactly solve this constrained problem. However, the objective is now convex and differentiable. Let us rewrite this problem using the |xi|:
min_{t∈R, ~x∈Rn} ||A~x − ~y||₂² + α||~x||₂² + βt  (5.211)
s.t. t ≥ Σ_{i=1}^n |xi|.  (5.212)
The main insight that goes into resolving the non-differentiability of this constraint is that |xi| = max{xi, −xi}, and in particular si ≥ |xi| if and only if si ≥ xi and si ≥ −xi. Thus we add more “slack-type” variables and obtain
min_{t∈R, ~x∈Rn, ~s∈Rn+} ||A~x − ~y||₂² + α||~x||₂² + βt  (5.213)
s.t. t ≥ Σ_{i=1}^n si  (5.214)
si ≥ xi, ∀i ∈ {1, . . . , n}  (5.215)
si ≥ −xi, ∀i ∈ {1, . . . , n}.  (5.216)
Now the constraints are all differentiable (in fact, affine) and the problem may be solved. This reformulation is exactly equivalent to the earlier problem: because t is minimized in the objective, each si is pushed down via the first constraint until si = |xi| at optimum, recovering the original elastic-net regression problem.
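Assuming the cvxpy package, the reformulation (5.213)-(5.216) can be written nearly verbatim; the sketch below also solves the original problem directly (cvxpy handles the ℓ1 norm itself) and confirms that the two solutions agree on random illustrative data.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 10
A, y = rng.standard_normal((m, n)), rng.standard_normal(m)
alpha, beta = 0.1, 0.5

# Reformulation (5.213)-(5.216): differentiable objective, affine constraints.
x = cp.Variable(n); s = cp.Variable(n, nonneg=True); t = cp.Variable()
obj = cp.sum_squares(A @ x - y) + alpha * cp.sum_squares(x) + beta * t
cp.Problem(cp.Minimize(obj), [t >= cp.sum(s), s >= x, s >= -x]).solve()

# Direct formulation, for comparison.
x2 = cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum_squares(A @ x2 - y)
                       + alpha * cp.sum_squares(x2)
                       + beta * cp.norm1(x2))).solve()

print(np.allclose(x.value, x2.value, atol=1e-4))  # True: same solution
```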
Chapter 6
Gradient Descent
Recall from Theorem 123 that the first order condition for (usual) convexity requires the function to be bigger than its
linear (first order) Taylor approximation centered at any point. µ-strong convexity imposes a stronger requirement on
the function: it needs to be bigger than its linear approximation plus a non-negative quadratic term that has a Hessian
matrix µI. This becomes more obvious if we write Equation (6.1) in the equivalent form
f(~y) ≥ f(~x) + [∇f(~x)]>(~y − ~x) + (µ/2)(~y − ~x)>I(~y − ~x),  (6.2)
where the first two terms on the right-hand side form the first-order Taylor approximation of f around ~x, and the last term is a non-negative quadratic.
Below, we visualize the µ-strong convexity property and compare it to the first order condition for convexity.
[Figure: a µ-strongly convex function f; its graph lies above the quadratic lower bound f̂1(·; ~x) + (µ/2)||· − ~x||₂², which in turn lies above the tangent f̂1(·; ~x).]
Therefore, µ-strong convexity of the function is a very important feature. It guarantees that the function will always
have enough curvature (at least as much as its quadratic lower bound) and thus will never become too flat anywhere on
its domain.
If the function f is twice-differentiable, we can formalize this notion by giving the following equivalent condition for µ-strong convexity: f is µ-strongly convex if and only if ∇²f(~x) ⪰ µI for all ~x in its domain.
An important property of µ-strongly convex functions is that they are strictly convex and thus they have at most one
minimizer. In fact, one can show that they have exactly one minimizer.
The second property we want to introduce is L-smoothness, which describes quadratic upper bounds on the function
f.
If f is L-smooth, then the function f is upper bounded by its first-order Taylor approximation plus a non-negative quadratic term:
f(~y) ≤ f(~x) + [∇f(~x)]>(~y − ~x) + (L/2)(~y − ~x)>I(~y − ~x).  (6.5)
We can visualize the L-smoothness condition similarly to the strongly convex condition, as below:
[Figure: an L-smooth function f; its graph lies below the quadratic upper bound f̂1(·; ~x) + (L/2)||· − ~x||₂² and above the tangent f̂1(·; ~x).]
L-smoothness provides a quadratic upper bound on f . This upper bound ensures that the function doesn’t have too
much curvature anywhere on its domain (at most as much as its upper bound). We will see later in this chapter that this
actually translates into an upper bound on the rate at which the gradient of the function changes.
Finally, we visualize the behavior of the µ-strongly convex and L-smooth bounds together.
[Figure: f sandwiched between the µ-strong convexity lower bound f̂1(·; ~x) + (µ/2)||· − ~x||₂² and the L-smoothness upper bound f̂1(·; ~x) + (L/2)||· − ~x||₂².]
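For a concrete f these bounds are checkable. For the quadratic f(~x) = (1/2)~x>Q~x with Q ≻ 0, the tight constants are µ = λmin(Q) and L = λmax(Q); a quick numpy check of the sandwich at random pairs of points (Q below is a random illustrative matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T + 0.1 * np.eye(n)            # positive definite

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
mu, L = np.linalg.eigvalsh(Q)[[0, -1]]   # smallest and largest eigenvalues

for _ in range(1000):
    x, y = rng.standard_normal((2, n))
    taylor = f(x) + grad(x) @ (y - x)
    q = np.sum((y - x) ** 2)
    assert taylor + 0.5 * mu * q <= f(y) + 1e-9  # strong convexity lower bound
    assert f(y) <= taylor + 0.5 * L * q + 1e-9   # smoothness upper bound
print("both bounds hold on all samples")
```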
The general idea behind the gradient descent algorithm is that it starts with some initial guess ~x0 ∈ Rn and produces
a sequence of refined guesses ~x1 , ~x2 , . . ., called iterates. In each iteration t = 0, 1, 2, . . . , the algorithm updates its
guess according to the following rule:
~xt+1 = ~xt + η~vt (6.7)
• The vector ~vt , or the search direction, which specifies a good direction to move.
• The scalar η, or the step size, which specifies how far we move in the direction of ~vt .
For the gradient descent algorithm, we assume that at every point ~x ∈ Rn we can get two pieces of information about
the function we are optimizing: the value of the function f (~x) ∈ R as well as its gradient ∇f (~x) ∈ Rn . Next, we
will use this available information to come up with a good search direction ~vt . The choice of the step size η is a more
difficult task. There is no universal choice of η that is good for all problems and a good choice of η is problem-specific.
We will discuss the choice of the step size later in the section and show the important role it plays in the algorithm.
In fact, among all unit-norm directions, the normalized negative gradient achieves the steepest instantaneous decrease of f:
−∇f(~x)/||∇f(~x)||₂ ∈ argmin_{~v∈Rn, ||~v||₂=1} Df(~x)[~v].  (6.8)
We also want to use the norm of the gradient in our update. For example, if f(x) = x² then f′(x) = 2x, which has large norm when x is far from the optimal point x⋆ = 0. In this way, if the gradient is large then we are usually far away from an optimum, and we want our update η~vt to be large. This motivates choosing ~vt = −∇f(~xt).
We can formalize this in an algorithm which terminates after T iterations for a user-set T .
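In code, the algorithm is a short loop; a minimal sketch is below (the step size eta and horizon T are user-set, as discussed, and the quadratic in the usage line is an illustrative choice):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, T):
    """Run T iterations of the update x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)
    return x

# Usage: minimize f(x) = x^2, whose gradient is 2x, starting from x0 = 5.
print(gradient_descent(lambda x: 2 * x, 5.0, eta=0.1, T=100))  # near x* = 0
```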
It is important to note that with the choice ~vt = −∇f (~xt ), descent is not guaranteed. Namely, for a fixed η > 0, it
is not true in general that
f (~xt+1 ) = f (~xt − η∇f (~xt )) ≤ f (~xt ). (6.10)
Rather, from the proof of the above theorem, we only have that, for any given t, there exists some ηt > 0 such that f(~xt − ηt∇f(~xt)) ≤ f(~xt).
This ηt , in general, is very small, and depends heavily on t and the local geometry of f around ~xt . None of these details
are known by any realistic implementation of the gradient descent algorithm. In what follows, we will study gradient
descent with a constant step size; this setting is most common in practice.
Example 149 (Gradient Descent for Least Squares). In this example we explore the convergence properties of the
gradient descent algorithm by applying it to the least squares problem. Let A ∈ Rm×n have full column rank, and
~y ∈ Rm .
Consider the problem
min_{~x∈Rn} ||A~x − ~y||₂².  (6.12)
Recall that, in this setting, the least squares problem has the unique closed-form solution
~x⋆ = (A>A)−1A>~y.
We will use our knowledge of the true solution to analyze the convergence of the gradient descent algorithm. To apply gradient descent to this problem, let us first compute the gradient of f(~x) := ||A~x − ~y||₂², which we see is
∇f(~x) = 2A>(A~x − ~y).
Now let ~x0 ∈ Rn be the initial guess. For t a non-negative integer, we can write the gradient descent step:
~xt+1 = ~xt − η∇f(~xt) = ~xt − 2ηA>(A~xt − ~y).  (6.17)
We aim to set η which achieves the following two desired properties for the gradient descent iterates ~xt:
• We make progress, i.e., in every iteration we get closer to the optimal solution: ||~xt+1 − ~x⋆||₂ < ||~xt − ~x⋆||₂.
• We converge to the optimal solution: limt→∞ ||~xt − ~x⋆||₂ = 0.
To study this, let us write out the relationship between ~xt+1 − ~x⋆ and ~xt − ~x⋆. We do this by subtracting ~x⋆ from both sides of Equation (6.17) and doing some algebraic manipulations. Here we used the trick of introducing I = (A>A)(A>A)−1; this is because we wanted to group terms with A>A, and we also wanted to introduce instances of ~x⋆ = (A>A)−1A>~y. Thus, while this was a trick, it was a motivated trick. In the end we get the following relationship:
~xt+1 − ~x⋆ = (I − 2ηA>A)(~xt − ~x⋆),  (6.29)
so that, whenever η is chosen small enough that σmax{I − 2ηA>A} < 1 (it suffices to take 0 < η < 1/λmax(A>A)),
||~xt+1 − ~x⋆||₂ ≤ σmax{I − 2ηA>A} ||~xt − ~x⋆||₂ < ||~xt − ~x⋆||₂.  (6.30)
Thus we are guaranteed to make progress at each step. For the convergence guarantee, we recursively apply Equation (6.29) to get
limt→∞ ||~xt − ~x⋆||₂ ≤ limt→∞ σmax{I − 2ηA>A}^t ||~x0 − ~x⋆||₂  (6.35)
= 0,  (6.37)
since σmax{I − 2ηA>A} < 1.
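A numerical check of (6.29)-(6.30): each gradient descent step on a random least squares instance (an illustrative choice of A, ~y, and η) contracts the error by at least the factor σmax{I − 2ηA>A}.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 5
A, y = rng.standard_normal((m, n)), rng.standard_normal(m)
x_star = np.linalg.solve(A.T @ A, A.T @ y)               # closed-form solution

eta = 0.9 / (2 * np.linalg.eigvalsh(A.T @ A)[-1])        # eta < 1 / lambda_max
rho = np.linalg.norm(np.eye(n) - 2 * eta * A.T @ A, 2)   # sigma_max < 1

x = np.zeros(n)
for _ in range(50):
    err_old = np.linalg.norm(x - x_star)
    x = x - 2 * eta * A.T @ (A @ x - y)                  # the step (6.17)
    assert np.linalg.norm(x - x_star) <= rho * err_old + 1e-12
print("final error:", np.linalg.norm(x - x_star))
```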
Let us now generalize this analysis to a larger class of functions that includes least squares, as well as more general
functions. We will focus our attention on the class of L-smooth and µ-strongly convex functions. Similar to what we
did for least squares, we will use the optimal solution ~x? in our analysis. For a general function f , the quantity ~x? is
almost always not known. But in the case of L-smooth and µ-strongly convex functions, a unique global minimizer
exists. We just need the fact that it exists to show that for some small enough choice of the step size η, the gradient
descent algorithm converges to this optimal solution. Before we formally state and prove this, we want to introduce
one property of L-smooth functions that will become useful in our following proof.
Lemma 150. Let f : R^n → R be an L-smooth function which is bounded below. Then for every ~x ∈ R^n,

‖∇f(~x)‖₂² ≤ 2L (f(~x) − min_{~x′∈R^n} f(~x′)).

This lemma says that the magnitude of the gradient gets smaller as we get closer to the optimal solution.

Proof. Recall that L-smoothness gives f(~y) ≤ f(~x) + [∇f(~x)]^⊤(~y − ~x) + (L/2)‖~y − ~x‖₂². This is true for all points ~x, ~y ∈ R^n, so let us fix ~x and set ~y = ~x − ∇f(~x)/L. We get

f(~x − ∇f(~x)/L) ≤ f(~x) + [∇f(~x)]^⊤(−∇f(~x)/L) + (L/2)‖∇f(~x)/L‖₂²   (6.40)
= f(~x) − (1/L)‖∇f(~x)‖₂² + (1/2L)‖∇f(~x)‖₂²   (6.41)
= f(~x) − (1/2L)‖∇f(~x)‖₂².   (6.42)

Now we can use the fact that min_{~x′} f(~x′) ≤ f(~z) for all ~z ∈ R^n to get

min_{~x′∈R^n} f(~x′) ≤ f(~x − ∇f(~x)/L)   (6.43)
≤ f(~x) − (1/2L)‖∇f(~x)‖₂².   (6.44)

Rearranging, we have

‖∇f(~x)‖₂² ≤ 2L (f(~x) − min_{~x′∈R^n} f(~x′))   (6.45)

as desired.
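As a quick numerical illustration (the instance below is arbitrary): for the quadratic f(~x) = (1/2)~x^⊤Q~x with Q ⪰ 0, f is L-smooth with L = λ_max(Q) and min f = 0, so the bound of the lemma can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T                          # PSD, so f(x) = 0.5 * x^T Q x has minimum value 0
L = np.linalg.eigvalsh(Q).max()      # f is L-smooth with L = lambda_max(Q)

for _ in range(5):
    x = rng.standard_normal(n)
    grad = Q @ x                     # gradient of f at x
    f = 0.5 * x @ Q @ x
    assert grad @ grad <= 2 * L * f + 1e-9   # ||grad f(x)||^2 <= 2L (f(x) - min f)
```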
Now we have all the needed tools to prove the following property of gradient descent: for any µ-strongly convex
and L-smooth function, the gradient descent algorithm converges exponentially fast to the true solution ~x? .
Theorem 151 (Convergence of Gradient Descent for Smooth Strongly Convex Functions)
Let µ, L > 0. Let f : Rn → R be an L-smooth, µ-strongly convex function. Consider the following optimization
problem:
p? = min_{~x∈R^n} f(~x)   (6.46)

which has optimal solution ~x?. Then the constant step size η = 1/L is such that, if we apply the gradient descent algorithm to generate the sequence of points ~x_{t+1} = ~x_t − η∇f(~x_t), the following hold:

(a) (Descent at every step.) ‖~x_{t+1} − ~x?‖₂ ≤ ‖~x_t − ~x?‖₂ for all non-negative integers t and initializations ~x_0.

(b) (Convergence.) lim_{t→∞} ‖~x_t − ~x?‖₂ = 0.

Proof. We first expand the distance to the optimum after one step:

‖~x_{t+1} − ~x?‖₂² = ‖~x_t − η∇f(~x_t) − ~x?‖₂²   (6.49)
= ‖[~x_t − ~x?] − η∇f(~x_t)‖₂²   (6.50)
= ‖~x_t − ~x?‖₂² − 2η[∇f(~x_t)]^⊤(~x_t − ~x?) + η²‖∇f(~x_t)‖₂²   (6.51)
= ‖~x_t − ~x?‖₂² + 2η[∇f(~x_t)]^⊤(~x? − ~x_t) + η²‖∇f(~x_t)‖₂².   (6.52)

By µ-strong convexity, [∇f(~x_t)]^⊤(~x? − ~x_t) ≤ f(~x?) − f(~x_t) − (µ/2)‖~x_t − ~x?‖₂², and by the preceding lemma,

‖∇f(~x_t)‖₂² ≤ 2L(f(~x_t) − f(~x?)).   (6.53)

Substituting these two bounds, we obtain

‖~x_{t+1} − ~x?‖₂² = ‖~x_t − ~x?‖₂² + 2η[∇f(~x_t)]^⊤(~x? − ~x_t) + η²‖∇f(~x_t)‖₂²   (6.57)
≤ ‖~x_t − ~x?‖₂² + 2η[f(~x?) − f(~x_t) − (µ/2)‖~x_t − ~x?‖₂²] + η²[2L(f(~x_t) − f(~x?))]   (6.58)
= (1 − ηµ)‖~x_t − ~x?‖₂² + 2η(ηL − 1)(f(~x_t) − f(~x?)).   (6.59)

With η = 1/L, the second term vanishes, and we conclude

‖~x_{t+1} − ~x?‖₂² ≤ (1 − µ/L)‖~x_t − ~x?‖₂².   (6.60)
The first claim follows by taking square roots in (6.60): since 0 < µ/L ≤ 1, we have

‖~x_{t+1} − ~x?‖₂ ≤ √(1 − µ/L) ‖~x_t − ~x?‖₂ < ‖~x_t − ~x?‖₂.   (6.64)
The second claim follows since

‖~x_t − ~x?‖₂² ≤ (1 − µ/L)‖~x_{t−1} − ~x?‖₂²   (6.65)
≤ (1 − µ/L)² ‖~x_{t−2} − ~x?‖₂²   (6.66)
≤ ···   (6.67)
≤ (1 − µ/L)^t ‖~x_0 − ~x?‖₂²   (6.68)

and so

lim_{t→∞} ‖~x_t − ~x?‖₂² ≤ lim_{t→∞} (1 − µ/L)^t ‖~x_0 − ~x?‖₂²   (6.69)
= ‖~x_0 − ~x?‖₂² · lim_{t→∞} (1 − µ/L)^t   (6.70)
= ‖~x_0 − ~x?‖₂² · 0 = 0.   (6.71)
Writing c ≐ 1 − µ/L and D ≐ ‖~x_0 − ~x?‖₂², the bound (6.68) says that ‖~x_T − ~x?‖₂² ≤ c^T D after T iterations. To reach a given accuracy ε > 0, it therefore suffices that

c^T D ≤ ε   (6.73)
⟹ c^T ≤ ε/D   (6.74)
⟹ T log(c) ≤ log(ε) − log(D)   (6.75)
⟹ T ≥ (log(ε) − log(D)) / log(c)   (6.76)
= (log(1/ε) + log(D)) / log(1/c).   (6.77)
Thus, the lower the value of c is, the faster we get convergence towards a given accuracy.
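As a sanity check of (6.77), the following snippet computes a sufficient T for illustrative (arbitrary) values of c, D, and ε:

```python
import math

c, D, eps = 0.99, 100.0, 1e-6   # contraction factor, initial squared distance, target accuracy
T = (math.log(1 / eps) + math.log(D)) / math.log(1 / c)
print(math.ceil(T))             # number of iterations sufficient for c^T * D <= eps
```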
Finally, it is important to point out that this is not the only class of functions for which the gradient descent algorithm
converges. For example, for the class of functions that are L-smooth and convex (but not µ-strongly convex), the gradient
descent algorithm still converges; in particular it converges to within accuracy ε after T = O(1/ε) iterations. This is
vastly slower than the convergence achieved for µ-strongly convex functions.
This class of functions is very common in applications that involve learning from data, one example being the least
squares problem. Recall that we can write the least squares problem as
min_{~x∈R^n} (1/m)‖A~x − ~y‖₂² = min_{~x∈R^n} (1/m) Σ_{i=1}^m (~a_i^⊤~x − b_i)²,   (6.81)

where the left-hand side objective is f(~x) and each summand (~a_i^⊤~x − b_i)² is f_i(~x).
In each iteration, SGD selects an index i ∈ {1, …, m} and takes a step along the negative gradient of the single function f_i, i.e., ~x_{t+1} = ~x_t − η∇f_i(~x_t). If this index i is drawn uniformly at random, we can see that the expected value of the estimated gradient is equal to the full gradient:
E[∇f_i(~x)] = (1/m) Σ_{i=1}^m ∇f_i(~x)   (6.83)
= ∇f(~x).   (6.84)
Note that the direction −∇fi (~x) is not guaranteed to be the direction of steepest descent of the function f (~x). In
fact, it might not be a descent direction at all, and the value of the function f might even increase by taking a step in
this direction. Further, it is not guaranteed to satisfy the first order optimality conditions; that is, when ∇f (~x) = ~0,
the gradient of a single function ∇fi (~x) is generally not zero. This already tells us that the notion of convergence we
studied (which is called “last-iterate convergence”) is practically impossible to achieve — and in particular, the type of
convergence proof that we studied for gradient descent will not work here. Instead, the type of convergence guarantee
we can hope for in this setting is that the average of the iterates converges to the optimal value, i.e.,

lim_{T→∞} f((1/T) Σ_{t=1}^T ~x_t) = min_{~x∈R^n} f(~x).   (6.85)
Further, to prove convergence of SGD we need to use a variable step size ηt . Using a fixed step size (like we did
for gradient descent) doesn’t guarantee convergence. The intuition behind this is that the gradient directions ∇fi (~x)
for different i might be competing with each other and want to pull the solution in different directions. This will cause
the sequence ~xt to bounce back and forth. Using a variable step size ηt such that limt→∞ ηt = 0 makes it possible
to converge to a single point. It is not trivial to show that the point that the algorithm will converge to is the optimal
solution, but this can be proven under some assumptions on the objective function. For formal proofs, a good reference
is [6].
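Below is a minimal numpy sketch of SGD with the variable step size η_t = 1/t on a finite-sum least squares objective of the form (6.81). The data, iteration count, and use of a running average are illustrative choices for this sketch; with such simple step sizes the averaged iterate approaches the least squares solution only slowly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x = np.zeros(n)
x_sum = np.zeros(n)
T = 5000
for t in range(1, T + 1):
    i = rng.integers(m)                      # index drawn uniformly at random
    grad_i = 2 * (A[i] @ x - b[i]) * A[i]    # gradient of f_i(x) = (a_i^T x - b_i)^2
    x = x - (1.0 / t) * grad_i               # variable step size eta_t = 1/t
    x_sum += x

x_avg = x_sum / T                            # averaged iterate
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_avg - x_star))        # small, and shrinking as T grows
```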
Example 153 (Finding the Centroid Using SGD). Fix points p~1 , . . . , p~m ∈ Rn . Let us consider the problem of finding
their centroid, which we formulate as
min_{~x∈R^n} (1/m) Σ_{i=1}^m (1/2)‖~x − ~p_i‖₂²,   (6.86)

where f_i(~x) ≐ (1/2)‖~x − ~p_i‖₂².
The optimal solution for this problem is just the average of the points
~x? = (1/m) Σ_{i=1}^m ~p_i.   (6.87)
Let us apply SGD on this problem with the time-varying step size η_t = 1/t and initial guess ~x_0 = ~0. We first compute
the gradients: ∇f_i(~x) = ~x − ~p_i.
To simplify the notation we will apply SGD by selecting the indices in order.

Iteration 1: ~x_1 = ~x_0 − (1/1)(~x_0 − ~p_1) = ~p_1   (6.89)
Iteration 2: ~x_2 = ~x_1 − (1/2)(~x_1 − ~p_2) = (~p_1 + ~p_2)/2   (6.90)
Iteration 3: ~x_3 = ~x_2 − (1/3)(~x_2 − ~p_3) = (~p_1 + ~p_2 + ~p_3)/3   (6.91)
⋮   (6.92)
Iteration m: ~x_m = ~x_{m−1} − (1/m)(~x_{m−1} − ~p_m) = (Σ_{i=1}^m ~p_i)/m.   (6.93)
So the SGD algorithm which takes a step along the gradient of a single ∇fi (~x) in every iteration converges to the true
solution in m iterations. It is important to note that this is a toy example that is used to illustrate the application of
SGD. As we discussed earlier, this type of convergence behavior is not common or guaranteed when applying SGD to
more complicated real world problems.
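A few lines of numpy reproduce this calculation (the points below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 3
p = rng.standard_normal((m, n))          # points p_1, ..., p_m

x = np.zeros(n)
for t in range(1, m + 1):                # select indices in order, step size eta_t = 1/t
    x = x - (1.0 / t) * (x - p[t - 1])   # gradient of f_i(x) = 0.5*||x - p_i||^2 is x - p_i

print(np.allclose(x, p.mean(axis=0)))    # True: x_m equals the centroid
```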
Finally, we note that this set up of the SGD algorithm allows us to think of the optimization scheme as an online
algorithm. Instead of randomly selecting an index to compute the gradient, assume that the data points (i.e., in the
above example the p~i ) that define the functions fi (~x) arrive one at a time. Following the same idea of SGD, we can
perform an optimization step as each new data point becomes available. In this setting it is important to normalize by
the number of data points m in the objective function; otherwise the objective function will keep growing as we get
more data, regardless of whether our optimization algorithm finds a good solution.
Consider now a constrained problem min_{~x∈Ω} f(~x), where Ω ⊆ R^n is a closed convex feasible set. For these types of problems we want to limit our search to points inside the set Ω. If we start with an initial guess
~x_0 ∈ Ω and apply the (unconstrained) gradient descent algorithm ~x_{t+1} = ~x_t − η∇f(~x_t),
the point ~xt+1 might end up outside of the feasible set Ω, even when ~xt is in Ω. In this section we will consider two
variants of the gradient descent algorithm that propose techniques to deal with constraints.
The first variant, projected gradient descent, relies on the projection operator onto Ω, defined as

proj_Ω(~y) ≐ argmin_{~x∈Ω} ‖~x − ~y‖₂².   (6.96)
One can show that, since Ω is closed and convex, the projection is unique. In words, proj_Ω(~y) is the closest point in Ω
to ~y. From this definition one sees that if ~y ∈ Ω then proj_Ω(~y) = ~y. Using this definition we can write the step of the
projected gradient descent algorithm as

~x_{t+1} = proj_Ω(~x_t − η∇f(~x_t)).
Note that the projection operator itself solves a convex optimization problem. It is only meaningful to consider the
projected gradient descent algorithm if this projection problem is simple enough to be solved in every iteration. We
have seen examples of projections onto simple sets. For example, the least squares problem computes the projection
of a vector onto the subspace R(A), and we know how to efficiently solve this projection problem. However, that’s not
always the case, and the projection problem might be nearly as difficult to solve as our original optimization problem.
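For illustration, here is a minimal projected gradient descent sketch with Ω taken to be the unit Euclidean ball, a set whose projection has a simple closed form; the objective and constants are arbitrary choices for this sketch.

```python
import numpy as np

def proj_ball(y, r=1.0):
    """Projection onto {x : ||x||_2 <= r}: rescale y if it lies outside."""
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

rng = np.random.default_rng(0)
n = 5
c = rng.standard_normal(n) * 3.0   # unconstrained minimizer of f, likely outside the ball

# f(x) = 0.5*||x - c||^2, so grad f(x) = x - c
x = np.zeros(n)
eta = 0.5
for _ in range(200):
    x = proj_ball(x - eta * (x - c))   # gradient step, then project back onto Omega

# for this objective the constrained optimum is the projection of c onto the ball
print(np.linalg.norm(x - proj_ball(c)))   # ~0
```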
The idea of the conditional gradient descent algorithm is to limit our search direction to the feasible set Ω and find a vector inside the
set that minimizes [∇f(~x_t)]^⊤~v. Thus, given ~x_t ∈ Ω, we can define the search direction of conditional gradient
descent as

~v_t = argmin_{~v∈Ω} [∇f(~x_t)]^⊤~v.   (6.99)
Once we find this direction ~v_t, we need to use it in a way that ensures that we don't leave the set. Taking a step along this
direction, ~x_{t+1} = ~x_t + η~v_t, doesn't guarantee this. But we can do something different and take a convex combination
of the current point and the search direction. That is, for δ_t ∈ [0, 1] we can update

~x_{t+1} = (1 − δ_t)~x_t + δ_t~v_t.
By convexity of the set, this guarantees that if we start with a feasible initial guess ~x_0 ∈ Ω, we get a sequence of points
~x_t ∈ Ω for all t ≥ 0. While this property holds for any choice of δ_t ∈ [0, 1], to prove convergence of the algorithm we
require lim_{t→∞} δ_t = 0; a conventional choice is δ_t = 1/t.
As a last note, we point out that the problem of finding a search direction ~vt is a constrained optimization problem
within Ω.
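Here is a minimal sketch of conditional gradient descent over the probability simplex, an illustrative choice of Ω for which the linear subproblem (6.99) can be solved by inspection: a linear function over the simplex is minimized at one of the standard basis vectors. The objective is an arbitrary quadratic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
c = rng.random(n)                     # target point (illustrative)

# minimize f(x) = 0.5*||x - c||^2 over the simplex {x >= 0, sum(x) = 1}
x = np.ones(n) / n                    # feasible initial guess
for t in range(1, 2000):
    grad = x - c
    v = np.zeros(n)
    v[np.argmin(grad)] = 1.0          # argmin of grad^T v over the simplex is a vertex e_j
    delta = 1.0 / t                   # conventional choice delta_t = 1/t
    x = (1 - delta) * x + delta * v   # convex combination keeps x inside the simplex

print(x, x.sum())                     # x stays feasible; it approaches the constrained optimum
```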
Chapter 7
Duality
• [1] Chapter 5.
7.1 Lagrangian
This chapter develops the theory of duality for optimization problems. This is a technique that can help us solve or
bound the optimal values for constrained optimization problems by solving the related dual problem. The dual problem
might not always give us a direct solution to our original (“primal”) problem, but it will always give us a bound. For
this, we first define the Lagrangian for a problem.
Let us start with our generic constrained optimization problem, which we will call problem P (which stands for
primal):

p? = min_{~x∈R^n} f_0(~x)
s.t. f_i(~x) ≤ 0, ∀i ∈ {1, …, m}
h_j(~x) = 0, ∀j ∈ {1, …, p}.
We know how to algorithmically approach unconstrained problems, say for instance using gradient descent, which
we discussed in the previous chapter. Thus, our question is whether there is a way to incorporate the constraints of
the problem P into the objective function itself. Trying to understand if this is possible is one motivation for the
Lagrangian.
To this end, we construct the indicator function 1[·], which for a condition C(~x) is defined as

1[C(~x)] ≐ 0 if C(~x) is true, and 1[C(~x)] ≐ +∞ if C(~x) is false.   (7.3)
To explain the chain of equalities leading to (7.6), say we consider the function
F_0(~x) ≐ f_0(~x) + Σ_{i=1}^m 1[f_i(~x) ≤ 0] + Σ_{j=1}^p 1[h_j(~x) = 0]   (7.7)

evaluated at an ~x ∉ Ω, i.e., there is some i such that f_i(~x) > 0 or some j such that h_j(~x) ≠ 0. Then F_0(~x) = ∞.
Therefore no solution to the minimization in (7.6) will ever be outside of Ω, i.e., all solutions to (7.6) will be contained
in Ω. And for ~x ∈ Ω, we have F0 (~x) = f0 (~x), so that min~x∈Ω f0 (~x) = min~x∈Rn F0 (~x) (and the argmins are equal
too). Hence the problem in (7.6) is equivalent to the original problem P.
Thus through this reformulation, we have removed the explicit constraints of the problem and incorporated them
into the objective function of the problem, and we can nominally write the problem as an unconstrained problem. If
we had some magic algorithm which allowed us to solve all unconstrained problems, then this reduction would let us
solve constrained problems as well.
Unfortunately, we do not have such a magic algorithm. Our usual algorithms for solving unconstrained problems,
such as gradient descent, require differentiable objective functions that are well-defined. Thus, it falls to us to express
our indicator functions in a differentiable way.
For this we consider approximating the indicator functions by other functions that are differentiable.
The key idea here is that, for a given indicator 1[f_i(~x) ≤ 0], we can express it as

1[f_i(~x) ≤ 0] = max_{λ_i∈R_+} λ_i f_i(~x),   (7.8)

where R_+ is the set of non-negative real numbers. Why does this equality hold? Let's break it up into cases. Suppose
that fi (~x) > 0. Then λi being more positive would make λi fi (~x) more positive, so the maximization will send
λi → ∞, making λi fi (~x) = ∞. Now suppose that fi (~x) ≤ 0. Then λi being more positive would make λi fi (~x)
more negative, so the maximization will keep λi as its lower bound 0, making λi fi (~x) = 0. For a similar reason, for
an indicator 1 [hj (~x) = 0], we have
1[h_j(~x) = 0] = max_{ν_j∈R} ν_j h_j(~x).   (7.9)
Substituting these maxima into F_0 expresses the primal problem as

p? = min_{~x∈R^n} max_{~λ∈R^m_+, ~ν∈R^p} [f_0(~x) + Σ_{i=1}^m λ_i f_i(~x) + Σ_{j=1}^p ν_j h_j(~x)].

This interior function is called the Lagrangian, and is much easier to work with than the original unconstrained problem
with indicator functions. Thus we have made our constrained problem into a (mostly) unconstrained problem,¹ at the
expense of adding an extra max. We will see how to cope with this extra difficulty shortly.
More formally, the Lagrangian of P is the function L : R^n × R^m × R^p → R given by

L(~x, ~λ, ~ν) ≐ f_0(~x) + Σ_{i=1}^m λ_i f_i(~x) + Σ_{j=1}^p ν_j h_j(~x),

where the λ_i and ν_j are called Lagrange multipliers (or dual variables).
The important intuition behind the Lagrange multipliers is that they are penalties for violating their corresponding
constraint. In particular, just like how the indicator function 1 [fi (~x) ≤ 0] assigns an infinite penalty for violating the
constraint fi (~x) ≤ 0, the term λi fi (~x) assigns a penalty λi fi (~x) for violating the constraint fi (~x) ≤ 0. One can show
that λi fi (~x) ≤ 1 [fi (~x) ≤ 0], so the Lagrangian penalty is a smooth lower bound to the indicator penalty, which is
something like a hard-threshold. We can do a similar analysis for the equality constraints h_j(~x) = 0.
As derived before, the primal problem can be expressed in terms of the Lagrangian as

p? = min_{~x∈R^n} max_{~λ∈R^m_+, ~ν∈R^p} L(~x, ~λ, ~ν).
Proposition 156
For every ~x ∈ R^n, the function (~λ, ~ν) ↦ L(~x, ~λ, ~ν) is an affine function, and hence a concave function. (We also
say that L is affine (resp. concave) in ~λ and ~ν.)

Proof. For fixed ~x, the function L(~x, ~λ, ~ν) = f_0(~x) + Σ_{i=1}^m λ_i f_i(~x) + Σ_{j=1}^p ν_j h_j(~x) is affine in ~λ and ~ν.
s.t. 2x³ ≤ 8.

This problem is not convex, but we can still find its Lagrangian as before, by adding the term λ(2x³ − 8) to the objective.
¹"Mostly" unconstrained because the inner maximization retains the constraint ~λ ⪰ ~0; such non-negativity constraints are much simpler to handle than arbitrary constraints, in part because the constraint set is convex.
In general, the quantities p? and d? do not have to be equal! And swapping the min and the max is the generic definition
of "duality": the dual problem D is

d? = max_{~λ∈R^m_+, ~ν∈R^p} min_{~x∈R^n} L(~x, ~λ, ~ν) = max_{~λ∈R^m_+, ~ν∈R^p} g(~λ, ~ν), where g(~λ, ~ν) ≐ min_{~x∈R^n} L(~x, ~λ, ~ν)

is called the dual function. Thus, g(~λ, ~ν) can be computed as an unconstrained optimization problem over ~x; in particular, g(~λ, ~ν) is the minimum
value of L(~x, ~λ, ~ν) over all ~x.
There are several important properties of the dual problem and dual function, which we summarize below.
Proposition 159
The dual function g is a concave function of (~λ, ~ν ), regardless of any properties of P.
Proof. We already showed that L(~x, ~λ, ~ν) is an affine (and thus concave) function of ~λ and ~ν. Thus the function g(~λ, ~ν) = min_{~x} L(~x, ~λ, ~ν) is a pointwise minimum of concave functions, and is hence concave.
Corollary 160. The dual problem D is always a convex problem, no matter what the primal problem P is.
Note that the minimization of a convex function or the maximization of a concave function are both considered
“convex” problems.
This means we have analytic and algorithmic ways to solve the dual problem D. If we can connect the solutions to
the dual problem D to the primal problem P, this means that we have reduced the process of solving P to the process
of solving a convex optimization problem, and this is something we know how to do. In the rest of this chapter, we
discuss when this reduction is possible.
Example 161. Consider the problem

p? = min_{x∈R} 5x²
s.t. 3x ≤ 5.
The Lagrangian is
L(x, λ) = 5x2 + λ(3x − 5). (7.24)
For each λ ≥ 0, the function L(x, λ) = 5x² + λ(3x − 5) is bounded below (in x), convex, and differentiable, so to
minimize it we set its derivative to 0. The optimal x? is a function of the Lagrange multiplier λ, so we write it as x?(λ).
We have

0 = ∂L/∂x (x?(λ), λ) = 10x?(λ) + 3λ ⟹ x?(λ) = −3λ/10,

so the dual function is g(λ) = L(x?(λ), λ) = 5(3λ/10)² + λ(3(−3λ/10) − 5) = −(9/20)λ² − 5λ.
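As a quick numerical illustration of weak duality on this example (the grids below are arbitrary), one can compare the best primal objective value over feasible x against the best dual value over λ ≥ 0; here the two coincide.

```python
import numpy as np

f0 = lambda x: 5 * x**2
g = lambda lam: -(9 / 20) * lam**2 - 5 * lam   # dual function derived above

xs = np.linspace(-10, 5 / 3, 10001)            # feasible set: 3x <= 5
lams = np.linspace(0, 10, 10001)               # dual feasible set: lam >= 0

p_star = f0(xs).min()                          # ~0, attained near x = 0
d_star = g(lams).max()                         # ~0, attained at lam = 0
print(p_star, d_star)                          # d_star <= p_star, and here they are equal
```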
We also have some more bounds for the Lagrangian in terms of f0 , g, p? , and d? .
Proposition 162
Let ~x ∈ Ω, let ~λ ∈ R^m_+, and let ~ν ∈ R^p, so that ~x is feasible for P and (~λ, ~ν) is feasible for D. Then we have:

p? = min_{~x′∈Ω} f_0(~x′) ≤ f_0(~x)   (7.34)
g(~λ, ~ν) = min_{~x′∈R^n} L(~x′, ~λ, ~ν) ≤ L(~x, ~λ, ~ν)   (7.36)

and additionally

L(~x, ~λ, ~ν) = f_0(~x) + Σ_{i=1}^m λ_i f_i(~x) + Σ_{j=1}^p ν_j h_j(~x) ≤ f_0(~x),

since each λ_i f_i(~x) ≤ 0 and each h_j(~x) = 0 for feasible ~x.
From all these inequalities, the relationship between p? and d? might seem unclear. Actually, their relationship is very
simple; it is captured in a condition called "weak duality", which asserts that d? ≤ p? always.
The fact that weak duality always holds is actually a corollary of a slightly more generic result called the minimax
inequality, which we now state and prove.
Theorem (Minimax Inequality). Let X and Y be sets and let F : X × Y → R. Then min_{x∈X} max_{y∈Y} F(x, y) ≥ max_{y∈Y} min_{x∈X} F(x, y).

Proof. For any x ∈ X and y ∈ Y, we have

F(x, y) ≥ min_{x′∈X} F(x′, y).   (7.43)
Taking the maximum over y on both sides (we will discuss why we can do this at the end of the proof), we have

max_{y′∈Y} F(x, y′) ≥ max_{y′∈Y} min_{x′∈X} F(x′, y′).   (7.44)

Finally, the left-hand side is a function of x but the right-hand side is a constant, so we might as well take the minimum
in x on the left-hand side to get

min_{x′∈X} max_{y′∈Y} F(x′, y′) ≥ max_{y′∈Y} min_{x′∈X} F(x′, y′).   (7.45)

It remains to justify taking maxima over y on both sides of an inequality f(y) ≥ g(y) that holds for all y ∈ Y: letting ỹ be a maximizer of g,

max_{y′∈Y} f(y′) ≥ f(ỹ) ≥ g(ỹ) = max_{y′∈Y} g(y′) ⟹ max_{y′∈Y} F(x, y′) ≥ max_{y′∈Y} min_{x′∈X} F(x′, y′) for all x ∈ X,   (7.46)

as desired.
Weak duality is useful because it gives a bound on how close to optimum a given decision variable is. More
specifically, we have for all ~x ∈ Ω and all (~λ, ~ν) ∈ R^m_+ × R^p that

g(~λ, ~ν) ≤ d? ≤ p? ≤ f_0(~x).
Thus, if f_0(~x) − g(~λ, ~ν) ≤ ε then we know that f_0(~x) − p? ≤ ε. This is called a certificate of optimality. We can use
this certificate as a stopping condition for algorithms such as gradient descent or a broad class of algorithms known as
primal-dual algorithms.
Suppose that the problem P is convex. If there exists any ~x ∈ relint(Ω)ᵃ which is strictly feasible, i.e., such that all of the following hold:

• for all i ∈ {1, …, m} such that f_i is not an affine function, we have f_i(~x) < 0;
• for all i ∈ {1, …, m} such that f_i is an affine function, we have f_i(~x) ≤ 0;
• for all j ∈ {1, …, p}, we have h_j(~x) = 0;

then strong duality holds for P and its dual D, i.e., the duality gap is 0.
aRecall that this notation means the relative interior of the feasible set Ω, defined formally in Definition 102. Since the formal definition
of the relative interior is out of scope for this semester, you may just consider it to be the interior of Ω, i.e., points not on the boundary of Ω,
and we will not give you problems where the distinction matters.
This result is sometimes called refined Slater’s condition as it is a slight modification of another condition which is also
called Slater’s condition.
We will not prove this in class; a proof sketch is in [1]. It uses the separating hyperplane theorem we proved earlier.
This result is actually one of the main payoffs of the separating hyperplane theorem that we will see in this course.
Again, observe that we only need a single point which fulfills the three conditions in order to declare that Slater’s
condition holds. Moreover, this point need not be optimal for the problem — any point at all which satisfies the
conditions will do.
Slater’s condition has a geometric interpretation: if there is a point in the relative interior of the feasible set which
is strictly feasible, then strong duality holds. In problems we are able to visualize, this makes it relatively easy to
determine whether Slater’s condition holds.
p? = min_{~x∈R^n} ‖~x‖₂²   (7.49)
s.t. A~x = ~y.
This problem only has equality constraints, and so it only has the Lagrange multiplier ~ν . The Lagrangian of this problem
is
L(~x, ~ν) = ‖~x‖₂² + Σ_{j=1}^p ν_j (A~x − ~y)_j   (7.50)
= ‖~x‖₂² + ~ν^⊤(A~x − ~y).   (7.51)
The Lagrangian is convex in ~x and bounded below, and so it is minimized wherever its gradient in ~x is ~0. The optimal
~x? is a function of ~ν, so we write it as ~x?(~ν). We have

~0 = ∇_~x L(~x?(~ν), ~ν) = 2~x?(~ν) + A^⊤~ν ⟹ ~x?(~ν) = −(1/2)A^⊤~ν.

Substituting back into the Lagrangian gives the dual function:
g(~ν) = ‖−(1/2)A^⊤~ν‖₂² + ~ν^⊤(A(−(1/2)A^⊤~ν) − ~y)   (7.59)
= (1/4)~ν^⊤AA^⊤~ν − (1/2)~ν^⊤AA^⊤~ν − ~ν^⊤~y   (7.60)
= −(1/4)~ν^⊤AA^⊤~ν − ~ν^⊤~y.   (7.61)
Thus the dual problem is
d? = max_{~ν∈R^p} [−(1/4)~ν^⊤AA^⊤~ν − ~ν^⊤~y] = −min_{~ν∈R^p} [(1/4)~ν^⊤AA^⊤~ν + ~ν^⊤~y].   (7.62)
There are no constraints on the dual problem, because there are no inequality constraints on the primal problem. Since
this dual problem is a convex unconstrained problem whose objective value is bounded below, it can be solved by
setting the gradient of the objective to ~0 and solving for the optimal dual variable. This yields
~0 = ∇g(~ν?)   (7.63)
= −(1/2)AA^⊤~ν? − ~y   (7.64)
⟹ ~ν? = −2(AA^⊤)^{−1}~y.   (7.65)
Our original primal problem is convex, and all constraints are affine. As long as there is a feasible point ~x, i.e., a point
~x such that A~x = ~y , then Slater’s condition implies that strong duality holds. Suppose that there is such a feasible point
(i.e., the primal problem is feasible). Then strong duality holds. Because the primal problem is convex and strong
duality holds, we can recover the optimal primal variable ~x? from the optimal dual variable ~ν ? as
~x? = ~x?(~ν?) = −(1/2)A^⊤(−2(AA^⊤)^{−1}~y) = A^⊤(AA^⊤)^{−1}~y.   (7.66)
This corresponds to the familiar minimum-norm solution.
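A short numpy sketch verifying this dual-based recovery on a random instance (assuming A has full row rank so that AA^⊤ is invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 6
A = rng.standard_normal((p, n))   # full row rank with probability 1
y = rng.standard_normal(p)

nu_star = -2 * np.linalg.solve(A @ A.T, y)   # optimal dual variable, per (7.65)
x_star = -0.5 * A.T @ nu_star                # primal recovery x*(nu*), per (7.66)

print(np.allclose(A @ x_star, y))                              # primal feasibility
print(np.allclose(x_star, A.T @ np.linalg.solve(A @ A.T, y)))  # matches the min-norm solution
```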
p? = min_{~x∈R^n} ~c^⊤~x
s.t. A~x = ~y
~x ≥ ~0
where the last constraint means xi ≥ 0 for all i. Slater’s condition implies that strong duality holds for this problem
provided there is a feasible point. The Lagrangian is
L(~x, ~λ, ~ν) = ~c^⊤~x + Σ_{i=1}^n λ_i(−x_i) + Σ_{j=1}^m ν_j(A~x − ~y)_j.   (7.68)
This function is affine in ~x, hence convex in ~x, so its dual function is
g(~λ, ~ν) = −~ν^⊤~y if ~c + A^⊤~ν − ~λ = ~0, and g(~λ, ~ν) = −∞ otherwise.   (7.73)
(It is a good exercise to figure out why this last equality is correct.) Thus, the dual is

d? = max_{~λ∈R^n, ~ν∈R^m} −~ν^⊤~y
s.t. λ_i ≥ 0, ∀i ∈ {1, …, n}
~c + A^⊤~ν − ~λ = ~0.
Example 169 (Shadow Prices). In this example, we determine an economic interpretation of Lagrange multipliers.
Suppose we have 200 kilos of merlot grapes and 300 kilos of shiraz grapes. Consider the following possible blends: the first blend uses 4 kilos of merlot and 1 kilo of shiraz per bottle and sells for $20 per bottle, while the second blend uses 2 kilos of merlot and 3 kilos of shiraz per bottle and sells for $15 per bottle.
We want to maximize our profits. Suppose we make q_1 bottles of the first blend, and q_2 of the second. Our optimization
problem is:

max_{q_1, q_2 ∈ R_+} 20q_1 + 15q_2
s.t. 4q_1 + 2q_2 ≤ 200
q_1 + 3q_2 ≤ 300.
In reality, we can’t make fractional bottles of wine, so we can round q1 and q2 to the nearest integer if needed. Now
consider the following modification. Suppose that we actually want to sell off some of the grapes instead of turning
them into wine. We can earn λ_1 dollars per kilo of merlot and λ_2 per kilo of shiraz. This yields a new optimization
problem

max_{q_1, q_2 ∈ R_+} {20q_1 + 15q_2 + λ_1(200 − 4q_1 − 2q_2) + λ_2(300 − q_1 − 3q_2)}
= max_{q_1, q_2 ∈ R_+} {(20 − 4λ_1 − λ_2)q_1 + (15 − 2λ_1 − 3λ_2)q_2 + 200λ_1 + 300λ_2}.   (7.78)
If the coefficient of q1 is negative, we shouldn’t make any of the first blend. If the coefficient of q2 is negative, we
shouldn’t make any of the second blend. If both are positive, we should make as many of each as possible. What
happens when 20 − 4λ_1 − λ_2 = 0 and 15 − 2λ_1 − 3λ_2 = 0? This is an indifference point, i.e. the point at which our
profit is the same no matter how many bottles of either wine we make. Under this condition, the minimum profit we
could possibly make is 200λ_1 + 300λ_2, the value of simply selling off all of the grapes.
To pay out as little as possible at an indifference point, we solve

min_{λ_1, λ_2 ≥ 0} 200λ_1 + 300λ_2
s.t. 20 − 4λ_1 − λ_2 = 0
15 − 2λ_1 − 3λ_2 = 0.
This problem is the dual to our original (primal) problem. The λi ’s are called shadow prices, and capture how much
we’re willing to pay to violate our constraints.
Let x̃ ∈ R^n, λ̃ ∈ R^m, ν̃ ∈ R^p. We say that (x̃, λ̃, ν̃) satisfy the KKT conditions if all of the following hold:

• (Stationarity.) ∇_~x L(x̃, λ̃, ν̃) = ∇f_0(x̃) + Σ_{i=1}^m λ̃_i ∇f_i(x̃) + Σ_{j=1}^p ν̃_j ∇h_j(x̃) = ~0.

• (Primal feasibility.) x̃ is feasible for P, i.e.,
f_i(x̃) ≤ 0, ∀i ∈ {1, …, m}   (7.80)
and h_j(x̃) = 0, ∀j ∈ {1, …, p}.   (7.81)

• (Dual feasibility.) (λ̃, ν̃) is feasible for D, i.e.,
λ̃_i ≥ 0, ∀i ∈ {1, …, m}.   (7.82)

• (Complementary slackness.)
λ̃_i f_i(x̃) = 0, ∀i ∈ {1, …, m}.   (7.83)
Given an arbitrary optimization problem, the KKT conditions do not have to be related to the optimality conditions.
But it turns out that in many cases, they are related.
Theorem 171 (If Strong Duality Holds, then KKT Conditions are Necessary for Optimality)
Suppose P is a primal problem with dual D. Suppose that the objective function f0 and constraint functions
f1 , . . . , fm , h1 , . . . , hp are differentiable, strong duality holds, and (~x? , ~λ? , ~ν ? ) are optimal primal and dual vari-
ables. Then (~x? , ~λ? , ~ν ? ) fulfill the KKT conditions.
Proof. By assumption, ~x? is feasible for P and (~λ?, ~ν?) is feasible for D. This implies that λ?_i f_i(~x?) ≤ 0 for all i, and that ν?_j h_j(~x?) = 0 for all j. Now we compute:
d? = g(~λ?, ~ν?)   (7.86)
= min_{~x∈R^n} L(~x, ~λ?, ~ν?)   (7.87)
≤ L(~x?, ~λ?, ~ν?)   (7.88)
= f_0(~x?) + Σ_{i=1}^m λ?_i f_i(~x?) + Σ_{j=1}^p ν?_j h_j(~x?)   (7.89)
≤ f_0(~x?)   (7.90)
= p?.   (7.91)
Because p? = d? , all inequalities in the above chain are actually equalities. This means that ~x? minimizes L(~x, ~λ? , ~ν ? )
over ~x ∈ Rn . By the main theorem of vector calculus, this implies that ∇~x L(~x? , ~λ? , ~ν ? ) = ~0, which is the stationarity
condition. It also implies that Σ_{i=1}^m λ?_i f_i(~x?) = 0. But each term in the sum is ≤ 0, so they must all be 0. This means
that λ?i fi (~x? ) = 0 for each i. This gives the complementary slackness condition. Thus the KKT conditions hold for
(~x? , ~λ? , ~ν ? ).
Theorem 172 (If Convexity Holds, then KKT Conditions are Sufficient for Optimality)
Suppose P is a primal problem with dual D. Suppose that the objective function f0 and constraint functions
f1 , . . . , fm , h1 , . . . , hp of P are differentiable. Suppose that f0 , f1 , . . . , fm are convex and h1 , . . . , hp are affine.
Suppose that (x̃, λ̃, ν̃) fulfill the KKT conditions. Then P is convex, strong duality holds, and (x̃, λ̃, ν̃) are optimal
primal and dual variables.
Proof. By assumption, x̃ is feasible for P and (λ̃, ν̃) is feasible for D. Since the f_i are convex and the h_j are affine, P
is convex, and

L(~x, λ̃, ν̃) = f_0(~x) + Σ_{i=1}^m λ̃_i f_i(~x) + Σ_{j=1}^p ν̃_j h_j(~x)   (7.92)

is a convex function of ~x (since each λ̃_i ≥ 0). By stationarity, ∇_~x L(x̃, λ̃, ν̃) = ~0, so because this function is convex, we have that x̃ minimizes L(~x, λ̃, ν̃) over all ~x ∈ R^n. Thus
g(λ̃, ν̃) = min_{~x∈R^n} L(~x, λ̃, ν̃)   (7.94)
= L(x̃, λ̃, ν̃)   (7.95)
= f_0(x̃) + Σ_{i=1}^m λ̃_i f_i(x̃) + Σ_{j=1}^p ν̃_j h_j(x̃)   (7.96)
= f_0(x̃),   (7.97)

where each λ̃_i f_i(x̃) = 0 by complementary slackness and each ν̃_j h_j(x̃) = 0 by primal feasibility.
Thus f_0(x̃) = g(λ̃, ν̃), so the duality gap is 0 and strong duality holds, with (x̃, λ̃, ν̃) being optimal primal and dual
variables.
In this course, most problems will be convex and strong duality will hold; in this case the above two theorems apply
and the KKT conditions are equivalent to the optimality conditions. This is the source of the intuition
that convex problems are easier to optimize.
Corollary 173 (If Convexity and Strong Duality Hold, then KKT Conditions are Necessary and Sufficient for Optimal-
ity). Suppose P is a primal problem with dual D. Suppose strong duality holds for P and D. Suppose that the objective
function f0 and constraint functions f1 , . . . , fm , h1 , . . . , hp for P are differentiable. Suppose that f0 , f1 , . . . , fm are
convex and h_1, …, h_p are affine. Let (~x, ~λ, ~ν) ∈ R^n × R^m × R^p be primal and dual variables. Then (~x, ~λ, ~ν) are
optimal primal and dual variables if and only if they fulfill the KKT conditions.
The following is a generic sequence of steps that you can try for any convex optimization problem.
Problem Solving Strategy 174 (Solving Convex Optimization Problems Using KKT Conditions). Suppose you have
a problem P with dual D.
1. Show that P is convex and the objective and constraint functions are differentiable.
2. Show Slater's condition holds and/or that strong duality holds for P and D.
3. Write down the KKT conditions.
4. Solve for the optimal primal and dual variables using the KKT conditions.
Even for more complicated problems where it is not possible to solve the KKT conditions analytically, many algo-
rithms can be interpreted as solving the KKT conditions under various conditions.
Example 175 (Example 161, with KKT Conditions). In this example, we apply the KKT conditions to the problem in
Example 161. Consider the following problem:
p? = min_{x∈R} 5x²   (7.98)
s.t. 3x ≤ 5.   (7.99)
Its Lagrangian is
L(x, λ) = 5x2 + λ(3x − 5). (7.100)
The problem is convex, and there exists a strictly feasible point in the relative interior of the feasible set, e.g., x = 0
(notice that this is also the global optimum, but it does not have to be; x = −1 would have sufficed just as well to be a
strictly feasible point in the relative interior of the feasible set). Thus Slater’s condition holds, strong duality holds,
and the KKT conditions are necessary and sufficient for optimality. Now we solve for the global optimum using the
KKT conditions.
Let (x̃, λ̃) solve the KKT conditions. Then they must obey:

• Primal feasibility: 3x̃ ≤ 5.
• Dual feasibility: λ̃ ≥ 0.
• Stationarity: 0 = ∇_x L(x̃, λ̃) = 10x̃ + 3λ̃.
From stationarity, λ̃ = −(10/3)x̃.   (7.101)
Thus by complementary slackness we have

0 = λ̃(3x̃ − 5) = −(10/3)x̃(3x̃ − 5),   (7.102)

so that

x̃ = 0 or x̃ = 5/3.   (7.103)
Supposing that x̃ = 5/3, we would have λ̃ = −50/9 < 0, which violates dual feasibility. Thus we must have x̃ = 0, implying
λ̃ = 0. And this indeed satisfies all of the KKT conditions. In particular, we can tell by inspection that at least the
primal variable x̃ = 0 = x? is globally optimal.
Note that even when solving the KKT system correctly, we (momentarily) ended up with a candidate that would not
satisfy the KKT conditions in the first place. This is one way to solve such systems: simplify the system as much as
possible, generate two or several candidate solutions (x̃, λ̃), and check again to see which one(s) still make the KKT
conditions hold.
The following content is optional/out of scope for this semester. Regardless, it may be helpful to read it to
gain context, or get a deeper understanding of various results.
Given a proper cone K ⊆ R^n and ~u, ~v ∈ R^n, we write ~u ⪰_K ~v to mean ~u − ~v ∈ K. It is possible that ~u − ~v ∉ K, in which case there is no generalized inequality with respect to K between them.
Example 177. While the above may look scary, we have already seen generalized inequalities in this course. The set of
positive semidefinite matrices S^n_+ is a proper cone in the set of symmetric matrices S^n. Thus when we write A ⪰ 0 for a
symmetric matrix A, we are really using the generalized inequality with respect to the cone K = S^n_+. Correspondingly,
we can write A ⪰ B to mean A − B is positive semidefinite.
Generalized inequalities can also help simplify or generalize several concepts introduced earlier in the course.
Suppose for example that we have a familiar convex optimization problem:
(One can also add equality constraints to this problem without issue.) Now consider the set K = R^m_+ = {~x ∈ R^m : x_i ≥ 0 ∀i ∈ {1, …, m}}, i.e., the non-negative orthant. Indeed, K is a proper cone (the proof is an exercise). If we
collect all constraints into a single vector-valued constraint function ~f : R^n → R^m given by

~f(~x) = (f_1(~x), …, f_m(~x))^⊤   (7.106)

then the constraint ~f(~x) ⪯_K ~0 means that −~f(~x) ∈ K, that is, each −f_i(~x) ≥ 0, or equivalently f_i(~x) ≤ 0.
These forms of the m constraints are absolutely equivalent. Thus, our original familiar convex optimization problem
is equivalent to the following problem:

p? = min_{~x∈R^n} f_0(~x)
s.t. ~f(~x) ⪯_K ~0.
Observe that, for any proper cone K ⊆ R^n and given ~v_1, ~v_2, ~v_3 ∈ R^n, if we have ~v_1 ⪯_K ~v_2 and ~v_2 ⪯_K ~v_3, then
~v_1 ⪯_K ~v_3.
Now we use this generalized inequality system to define more general convex optimization problems. Let f_0 : R^n →
R be the objective function. For each i ∈ {1, …, m}, let ~f_i : R^n → R^{d_i} be a vector-valued inequality constraint
function (we will see shortly how this works), and let K_i ⊆ R^{d_i} be a proper cone (this could be called the constraint
cone). For convenience, let d ≐ Σ_{i=1}^m d_i. Finally, for each j ∈ {1, …, p}, let h_j : R^n → R be a (usual scalar-valued)
equality constraint function.

Consider the following primal optimization problem over R^n with generalized inequality constraints:

p? = min_{~x∈R^n} f_0(~x)
s.t. ~f_i(~x) ⪯_{K_i} ~0, ∀i ∈ {1, …, m}
h_j(~x) = 0, ∀j ∈ {1, …, p}.
Recall that we showed, en route to the Lagrangian duality presented in Section 7.1, that 1[f_i(~x) ≤ 0] = max_{λ_i∈R_+} λ_i f_i(~x) and 1[h_j(~x) = 0] = max_{ν_j∈R} ν_j h_j(~x).
The latter equality will help us here again, but the former will not, and we have to derive an analogue. Indeed, we claim
that

1[~f_i(~x) ⪯_{K_i} ~0] = max_{~λ_i ∈ R^{d_i}, ~λ_i ⪰_{K_i*} ~0} ~λ_i^⊤ ~f_i(~x),   (7.112)
where K_i* denotes the dual cone of K_i in R^{d_i}. Why is this true? It turns out to be for a similar reason as the more
special case above. If ~f_i(~x) ⪯_{K_i} ~0, then −~f_i(~x) ∈ K_i. By definition, for ~λ_i ∈ K_i*, we must have ~λ_i^⊤(−~f_i(~x)) ≥ 0, i.e., ~λ_i^⊤ ~f_i(~x) ≤ 0.
Equality in this case is obtained by selecting ~λ_i = ~0, which we are assured is legal because, by definition of a proper
cone, ~0 ∈ K_i*. Thus we have shown that the maximum in (7.112) equals 0 in this case.
In the other case, suppose ~f_i(~x) ⋠_{K_i} ~0. Then we have −~f_i(~x) ∉ K_i. Since K_i is a proper cone, it is closed and convex,
so K_i = (K_i*)*. We thus have −~f_i(~x) ∉ (K_i*)*. Namely, there exists some ~λ_i ∈ K_i* such that ~λ_i^⊤(−~f_i(~x)) < 0, or
equivalently ~λ_i^⊤ ~f_i(~x) > 0. Since K_i* is a cone, it is closed under positive scalar multiplication; thus, for any α_i > 0,
we have α_i~λ_i ∈ K_i*, so

(α_i~λ_i)^⊤ ~f_i(~x) = α_i(~λ_i^⊤ ~f_i(~x)) > 0.   (7.115)

Taking α_i → ∞ obtains that the maximum over ~λ_i is +∞ in this case. Namely, we have shown that the maximum in (7.112) equals +∞ in this case, completing the proof of the claim.
The above discussion inspires the following definition of a generalized Lagrangian L : R^n × R^d × R^p → R for the given
primal optimization problem with generalized inequalities.²

L(~x, ~λ, ~ν) ≐ f_0(~x) + Σ_{i=1}^m ~λ_i^⊤ ~f_i(~x) + Σ_{j=1}^p ν_j h_j(~x).   (7.118)
²If the optimization problem under study were over a general inner product space, then the expressions ~λ_i^⊤ ~f_i(~x) should be replaced with
⟨~λ_i, ~f_i(~x)⟩. This occurs primarily in semidefinite programming, where ~f_i could produce a symmetric matrix as an output. In this case, ~λ_i
would similarly be a symmetric matrix, and the inner product would be the Frobenius inner product.
Note that weak duality, i.e., d? ≤ p? , always holds, by the minimax inequality.
An analog of Slater’s condition holds for optimization problems with generalized inequalities induced by proper
cones. We first require the following definition, which extends the notion of the convexity of a function to generalized
convexity, where the inequalities in the definition of convexity are turned into general inequalities induced by proper
cones.
(a) We say that ~f is K-convex if for any α ∈ [0, 1] and ~x, ~y ∈ R^n, we have ~f(α~x + (1 − α)~y) ⪯_K α~f(~x) + (1 − α)~f(~y).

(b) We say that ~f is strictly K-convex if for any α ∈ (0, 1) and ~x, ~y ∈ R^n with ~x ≠ ~y, we have ~f(α~x + (1 − α)~y) ≺_K α~f(~x) + (1 − α)~f(~y).
Consider the primal optimization problem P with generalized inequalities defined above, where
• for each i ∈ {1, . . . , m}, the function f~i : Rn → Rdi is Ki -convex, where Ki ⊆ Rdi is a proper cone;
If there exists any point ~x ∈ relint(Ω) which is strictly feasible, i.e., such that ~f_i(~x) ≺_{K_i} ~0 for all i ∈ {1, …, m} and h_j(~x) = 0 for all j ∈ {1, …, p}, then strong duality holds for P and its dual D, i.e., the duality gap is 0.ᵃ
aAnalogous conditions hold for the general case where the primal optimization problem is defined over a vector space equipped with an
inner product and a norm, as in the case of semidefinite programming.
Chapter 8
• [1] Chapter 4.
p? = min_{~x∈R^n} ~c^⊤~x
s.t. A~x = ~y
~x ≥ ~0.
There are many equivalent forms of a linear program; in particular, the following proposition (whose proof we leave
as an exercise) can be shown using slack variables.
Proposition 181
Any linear program is equivalent to a standard form linear program.
Putting linear programs in standard form is important because the standard form is commonly accepted by opti-
mization algorithms and implementations. Usually if you provide a linear program that isn’t in standard form to a
solver, it will convert it to standard form before running its algorithms. Conversions to and from standard
form may increase the number of variables in the problem and the eventual algorithmic complexity of solving it. One
example of this conversion is done below.
s.t. x1 + x2 ≥ 3 (8.3)
3x1 + 2x2 = 14 (8.4)
x1 ≥ 0. (8.5)
x1 + x2 ≥ 3 ⇐⇒ −x1 − x2 ≤ −3 (8.6)
s.t. − x1 − x2 ≤ −3 (8.8)
3x1 + 2x2 = 14 (8.9)
x1 ≥ 0. (8.10)
s.t. − x1 − x2 + x3 = −3 (8.12)
3x1 + 2x2 = 14 (8.13)
x1 ≥ 0 (8.14)
x3 ≥ 0. (8.15)
The only reason this is not in standard form is that x_2 is unconstrained. We can represent any real number as the
difference of non-negative numbers; one such construction is x_2 = x_2^+ − x_2^−, where x_2^+ = max{x_2, 0} and x_2^− =
−min{x_2, 0}, but there are many others. Thus we can replace x_2 by x_4 − x_5 and add the constraints that x_4 ≥ 0 and
x_5 ≥ 0, obtaining the problem
s.t. − x1 + x3 − x4 + x5 = −3 (8.17)
3x1 + 2x4 − 2x5 = 14 (8.18)
x1 ≥ 0 (8.19)
x3 ≥ 0 (8.20)
x4 ≥ 0 (8.21)
x5 ≥ 0. (8.22)
Finally, we notice that x2 is neither in the objective nor the constraints and so can be eliminated.
s.t. − x1 + x2 − x3 + x4 = −3 (8.24)
3x1 + 2x3 − 2x4 = 14 (8.25)
x1 ≥ 0 (8.26)
x2 ≥ 0 (8.27)
x3 ≥ 0 (8.28)
x4 ≥ 0. (8.29)
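As an illustration of handing a standard form LP to an off-the-shelf solver, here is a scipy sketch using the converted constraints (8.24)–(8.25). Since the objective of Example 182 is not shown here, the cost vector below is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.optimize import linprog

# equality constraints from (8.24)-(8.25); all four variables are non-negative
A_eq = np.array([[-1.0, 1.0, -1.0,  1.0],
                 [ 3.0, 0.0,  2.0, -2.0]])
b_eq = np.array([-3.0, 14.0])

c = np.array([1.0, 1.0, 1.0, 1.0])   # illustrative cost vector (not from the example)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 4, method="highs")
print(res.status, res.x, res.fun)    # status 0 means an optimal solution was found
```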
Linear programs are convex problems, and in particular the feasible set of a linear program is a convex set.
Proposition 183
Any linear program is a convex optimization problem.
Proposition 184
The dual of the standard form linear program

p? = min_{~x∈R^n} ~c^⊤~x
s.t. A~x = ~y
~x ≥ ~0

is

d? = max_{~λ∈R^n, ~ν∈R^m} −~ν^⊤~y
s.t. ~c − ~λ + A^⊤~ν = ~0
~λ ≥ ~0.
as desired.
One very relevant question is how to solve linear programs efficiently. There are several methods available to us —
for example, any constrained convex optimization solver, such as projected gradient descent, will solve the problem,
given an appropriate learning rate and efficient projection method. The algorithm most used in practice is the interior
point method; we will learn the basics of interior point methods later in this course.¹ But there is one algorithm which
was used for many years in practice, that truly exploits the structure of linear programs to efficiently solve them. It is
called the simplex algorithm, which was invented by George Dantzig in 1947.
The core idea behind the simplex algorithm is that at least one optimal point of a linear program is a "vertex"
of its feasible set, so long as this feasible set is bounded. We need to prove this idea, so far stated informally, and to do
this we will first need to characterize the feasible set of a linear program.
From the definition of a linear program, we see that its feasible region must be a polyhedron. For the standard form
linear program, we can check explicitly that the feasible set is the intersection of the three classes of half-spaces {~x ∈
R^n | ~a_i^⊤~x ≤ y_i}, {~x ∈ R^n | ~a_i^⊤~x ≥ y_i}, and {~x ∈ R^n | x_j ≥ 0}, where the ~a_i^⊤ are the rows of A, the y_i are the entries of ~y,
and the x_j are the entries of ~x. This feasible region can be unbounded or bounded; if it is bounded, it will be a polygon.
If the feasible set of a linear program is an unbounded polyhedron, then there are examples where the optimal value
is not achieved at a vertex, as demonstrated in the following example.
p? = min_{~x∈R²} −x_2
s.t. x_1 ≥ 0
x_2 ≥ 0
x_1 = 1.

Define ~x_α ≐ (1, α). One can check that ~x_α is feasible for all α ≥ 0, and that the objective value at ~x_α is −α. By sending
α → ∞, we get p? = −∞, and the optimum is not achieved at a vertex (or indeed by any ~x ∈ R²).
With this understanding, we now seek to prove the main idea we had earlier. There are several proofs, but the
cleanest uses the following intuitive fact:
Proposition 188
A polygon has finitely many vertices and is the convex hull of its vertices.
Unfortunately, the proof is (surprisingly?) quite complicated, so we omit it. A complete proof may be found in Ziegler's
Lectures on Polytopes, for example.
Consider the standard form linear program

p? = min_{~x∈R^n} ~c^⊤~x
s.t. A~x = ~y
~x ≥ ~0.

Suppose that the feasible set Ω ≐ {~x ∈ R^n | A~x = ~y, ~x ≥ ~0} is bounded. Then the optimal value is achieved at
a vertex.
Namely, one can find an optimal point which is a vertex. There may be optimal points that are not vertices. The simplest
example is to set the objective as ~c = ~0, so every feasible point is optimal with objective value 0, but there are other
examples which are a bit more complicated to set up.
Proof. Since the feasible set Ω is a bounded polyhedron, it is a polygon, and so it is the convex hull of its vertices, say
~v1 , . . . , ~vk . Thus any ~x ∈ Ω can be written as a convex combination of the vertices ~vi , namely,
~x = Σ_{i=1}^k α_i ~v_i   (8.45)

for some α_i ≥ 0 with Σ_{i=1}^k α_i = 1. Thus

p? = min over all such convex combination weights of Σ_{i=1}^k α_i (~c^⊤~v_i).   (8.46)
Now we have

Σ_{i=1}^k α_i (~c^⊤~v_i) ≥ Σ_{i=1}^k α_i min_{j∈{1,…,k}} ~c^⊤~v_j   (8.49)
= (min_{j∈{1,…,k}} ~c^⊤~v_j) Σ_{i=1}^k α_i   (8.50)
= min_{j∈{1,…,k}} ~c^⊤~v_j,

since Σ_{i=1}^k α_i = 1.
Let i? ∈ {1, . . . , k} be an index such that ~c>~vi? achieves the above minimum, i.e., ~c>~vi? = minj∈{1,...,k} ~c>~vj . Then
the above lower bound is achieved when αi? = 1 and αi = 0 for i 6= i? , for example. Thus ~vi? is an optimal point for
the original linear program, concluding the proof.
This theorem says that to solve a linear program, we only need to check the vertices of the constraint polyhedron.
This reduces an optimization problem over R^n to an optimization problem over the finite set of vertices. This reduction
motivates a "greedy-like heuristic" solver for linear programs with bounded feasible sets Ω, which is called the simplex
method. The simplex method is the following procedure:
• Start at a vertex ~v of Ω.
• If some neighboring vertex of ~v has a better objective value, move to it and repeat.
• When there are no neighboring vertices with better optima, stop and return ~v.
There are (rather more technical) modifications one can make to this algorithm to solve linear programs with unbounded
feasible sets. But the main idea is just the same as gradient descent: iteratively search locally for another point with
better objective value, and move to it.
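For tiny problems, the theorem even suggests a brute-force alternative to the simplex method: enumerate all basic feasible solutions of {A~x = ~y, ~x ≥ ~0} (these include the vertices) and keep the best one. The following sketch does exactly this; it is exponential in n and purely illustrative.

```python
import itertools
import numpy as np

def solve_lp_by_vertices(c, A, y):
    """Check all basic feasible solutions of {Ax = y, x >= 0}; assumes a bounded feasible set."""
    m, n = A.shape
    best_x, best_val = None, np.inf
    for cols in itertools.combinations(range(n), m):   # choose a basis of m columns
        B = A[:, cols]
        if np.linalg.matrix_rank(B) < m:
            continue                                   # singular basis: no vertex here
        x_B = np.linalg.solve(B, y)
        if np.any(x_B < -1e-9):
            continue                                   # basic solution must be feasible
        x = np.zeros(n)
        x[list(cols)] = x_B
        val = c @ x
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# tiny instance: minimize x1 + x2 subject to x1 + x2 + x3 = 1, x >= 0
c = np.array([1.0, 1.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
y = np.array([1.0])
print(solve_lp_by_vertices(c, A, y))   # optimum 0 at the vertex (0, 0, 1)
```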
A standard form quadratic program is a problem of the form

p? = min_{~x∈R^n} (1/2)~x^⊤H~x + ~c^⊤~x
s.t. A~x ≤ ~y
C~x = ~z,

where H ∈ S^n.
In the standard form, we do not lose any generality by enforcing H ∈ S^n. In particular, for any H we have

(1/2)~x^⊤H~x + ~c^⊤~x = (1/2)~x^⊤((H + H^⊤)/2)~x + ~c^⊤~x,   (8.53)

and the matrix (H + H^⊤)/2 (i.e., the symmetric part of H) is always symmetric. So if we have a non-symmetric H we
can just replace it with its symmetric part, and thus obtain a standard form quadratic program.
Quadratic programs may or may not be convex.
Proposition 191
Consider the following standard form quadratic program:

p? = min_{~x∈R^n} (1/2)~x^⊤H~x + ~c^⊤~x   (8.54)
s.t. A~x ≤ ~y
C~x = ~z,

where H ∈ S^n. Then the following are equivalent:

(a) the problem is a convex optimization problem;
(b) H ∈ S^n_+.
Thus we have

p? ≤ lim_{t→∞} [(1/2)λt² + t‖~c‖₂].   (8.62)

Since λ < 0, the term inside the limit is a concave (i.e., downward-facing) quadratic function of t, and so its
limit as t → ∞ is −∞. Thus p? ≤ −∞, so p? = −∞.
Case 2. Suppose that H ∈ S^n_+, and suppose that ~c ∈ N(H) \ {~0}. Then H has (at least) an eigenvalue λ equal to 0,
and in particular, by the spectral theorem, ~c can be written as a linear combination of eigenvectors of H with
eigenvalue 0. Let ~v be any unit eigenvector with eigenvalue 0 such that ~c^⊤~v ≠ 0. Let ~x_t = −t · sgn(~c^⊤~v) · ~v.
Then

p? ≤ lim_{t→∞} [(1/2)~x_t^⊤H~x_t + ~c^⊤~x_t]   (8.63)
= lim_{t→∞} [(1/2)t²~v^⊤H~v − t · sgn(~c^⊤~v) · ~c^⊤~v]   (8.64)
= lim_{t→∞} [(1/2)t²~v^⊤~0 − t|~c^⊤~v|]   (8.65)
= −∞.   (8.68)

Thus p? ≤ −∞, so p? = −∞.
Case 3. Suppose that H ∈ S^n_+ with ~c ∈ R(H). Then there is some ~x_0 such that ~c = −H~x_0. Rewriting the
objective, we obtain

(1/2)~x^⊤H~x + ~c^⊤~x = (1/2)~x^⊤H~x − ~x_0^⊤H~x   (8.69)
= (1/2)~x^⊤H~x − ~x_0^⊤H~x + (1/2)~x_0^⊤H~x_0 − (1/2)~x_0^⊤H~x_0   (8.70)
= (1/2)[~x^⊤H~x − 2~x_0^⊤H~x + ~x_0^⊤H~x_0] − (1/2)~x_0^⊤H~x_0   (8.71)
= (1/2)(~x − ~x_0)^⊤H(~x − ~x_0) − (1/2)~x_0^⊤H~x_0.   (8.72)

Since H ∈ S^n_+, the minimizer is at any ~x such that ~x − ~x_0 ∈ N(H). One can write this as ~x ∈ ~x_0 + N(H).
A particular solution in terms of problem parameters is ~x = −H†~c, where H† is the Moore-Penrose pseudoinverse
of H. Recall that we discussed the Moore-Penrose pseudoinverse in more generality in homework,
where we derived the solution to the least-norm least-squares problem; but one can show that if H = UΛU^⊤
then H† = UΛ†U^⊤, where Λ† is the diagonal matrix whose entries are Λ†_ii = 1/Λ_ii if Λ_ii ≠ 0, and Λ†_ii = 0 if Λ_ii = 0.
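A small numpy check of Case 3 (the instance below is arbitrary): we build a PSD H with a nontrivial nullspace, choose ~c ∈ R(H), and verify that ~x = −H†~c satisfies the stationarity condition H~x + ~c = ~0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
U = rng.standard_normal((n, 3))
H = U @ U.T                       # PSD with rank <= 3, so N(H) is nontrivial
x0 = rng.standard_normal(n)
c = -H @ x0                       # guarantees c lies in the range of H (Case 3)

x = -np.linalg.pinv(H) @ c        # particular minimizer x = -H^dagger c
print(np.allclose(H @ x + c, 0))  # stationarity: the gradient Hx + c vanishes
```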
The previous example shows that we can solve unconstrained quadratic programs directly and read off the solutions.
It turns out that one can transform any quadratic program with equality constraints into an unconstrained quadratic
program. So really, this analysis encapsulates a huge class of quadratic programs.
Computing the dual of a quadratic program has a similar number of cases; it is an exercise which is left to homework.
Example 193 (Linear-Quadratic Regulator). Suppose we have a discrete-time dynamical system of the form ~x_{k+1} = A~x_k + B~u_k, where ~x_k is the state and ~u_k is the control input at time k.
For a fixed terminal time T , we want to reach goal state ~g . Namely, we want to solve the problem
min_{~x_0, …, ~x_T, ~u_0, …, ~u_{T−1}} ‖~x_T − ~g‖₂² + Σ_{k=0}^{T−1} ‖~u_k‖₂²   (8.76)
s.t. ~x_{k+1} = A~x_k + B~u_k, ∀k ∈ {0, …, T − 1}.
This is a quadratic program since the objective function is a quadratic function of each ~xt , and the constraints are affine
equations relating the ~xt and the ~ut .
As a last note, problems with quadratic objectives and quadratic inequality constraints are called quadratically
constrained quadratic programs (QCQPs). Like quadratic programs, QCQPs can be convex or non-convex.
Proposition 195
Consider the following standard form quadratically-constrained quadratic program:
p? = min_{~x∈R^n} (1/2)~x^⊤H~x + ~c^⊤~x   (8.80)
s.t. (1/2)~x^⊤P_i~x + ~b_i^⊤~x + c_i ≤ 0, ∀i ∈ {1, …, m}
(1/2)~x^⊤Q_i~x + ~d_i^⊤~x + f_i = 0, ∀i ∈ {1, …, p},

where H, P_1, …, P_m, Q_1, …, Q_p ∈ S^n. If H, P_1, …, P_m ∈ S^n_+ and Q_1 = ··· = Q_p = 0, then the problem is
convex.
Second-order cone constraints are strictly broader than affine constraints; to encode an affine constraint A_i~x = ~y_i as
a second-order cone constraint, pick the corresponding ~b_i = ~0 and z_i = 0. This makes the constraint ‖A_i~x − ~y_i‖₂ ≤ 0,
or equivalently A_i~x = ~y_i.
Proposition 197
Second-order cone problems are convex optimization problems.
Proof. Each second-order cone constraint ‖A_i~x − ~y_i‖₂ ≤ ~b_i^⊤~x + z_i can be alternatively formulated as constraining the
tuple (A_i~x − ~y_i, ~b_i^⊤~x + z_i) ∈ R^{n+1} to lie within the second-order cone in R^{n+1}. But this tuple is an affine transformation
of ~x, in particular

(A_i~x − ~y_i, ~b_i^⊤~x + z_i) = (A_i; ~b_i^⊤)~x + (−~y_i; z_i),   (8.83)

stacking the matrix A_i on top of the row vector ~b_i^⊤, and the vector −~y_i on top of the scalar z_i.
Since the second-order cone is convex and the tuple is an affine transformation of ~x, it follows that {~x ∈ R^n |
‖A_i~x − ~y_i‖₂ ≤ ~b_i^⊤~x + z_i} is a convex set. Thus the feasible set is convex (as the intersection of convex sets). The
objective function is linear in ~x, so the second-order cone problem is convex.
One can formulate this as a second-order cone program by using slack variables:

p? = min_{~x∈R^n, ~s∈R^m} Σ_{i=1}^m s_i   (8.85)
s.t. ‖A_i~x − ~y_i‖₂ ≤ s_i, ∀i ∈ {1, …, m}.
We can use a similar slack variable reformulation to formulate the problem of minimizing the maximum of the norms, min_{~x} max_{i} ‖A_i~x − ~y_i‖₂, as a second-order cone program:

p? = min_{~x∈R^n, s∈R} s   (8.88)
s.t. ‖A_i~x − ~y_i‖₂ ≤ s, ∀i ∈ {1, …, m}.
These problems can be formulated in terms of route planning – more specifically, finding the route which minimizes
the total length between waypoints (in the first problem), or the route which minimizes the maximum length between
waypoints (in the second problem).
Example 199 (LPs, QPs, and QCQPs as SOCPs). One can see how LPs are QPs and how QPs are QCQPs, because
in each transition the set of properties becomes more permissive — first a linear objective can become a quadratic
objective, then linear constraints can become quadratic constraints. It is less clear how LPs, QPs, and QCQPs are
SOCPs. In this example we derive a way to write QCQPs as SOCPs, which is also applicable to LPs and QPs (since
LPs and QPs are QCQPs).
Consider a QCQP of the form

p? = min_{~x∈R^n} (1/2)~x^⊤H~x + ~c^⊤~x   (8.90)
s.t. (1/2)~x^⊤P_i~x + ~b_i^⊤~x + c_i ≤ 0, ∀i ∈ {1, …, m}
C~x = ~z,

with H, P_1, …, P_m ∈ S^n_+.
Multiplying each inequality constraint by 2, it becomes

~x^⊤P_i~x + 2(~b_i^⊤~x + c_i) ≤ 0.   (8.93)
In order to write each term as a square, we first write ~x^⊤P_i~x = ‖P_i^{1/2}~x‖₂². We now use a difference of squares identity: for any u ∈ R,

2u = (1/2 + u)² − (1/2 − u)²,

applied with u = ~b_i^⊤~x + c_i,
which is equivalent to

‖P_i^{1/2}~x‖₂² + (1/2 + ~b_i^⊤~x + c_i)² ≤ (1/2 − ~b_i^⊤~x − c_i)².   (8.97)
Now we would like to take square roots and write things in terms of the ℓ₂-norm. For this, we need to show that
1/2 − ~b_i^⊤~x − c_i ≥ 0. This follows because, since P_i is positive semidefinite, we have 1/2 ≥ 0 ≥ −(1/2)~x^⊤P_i~x, and so

1/2 − ~b_i^⊤~x − c_i ≥ −(1/2)~x^⊤P_i~x − ~b_i^⊤~x − c_i ≥ 0.   (8.98)
Now, taking square roots and writing things in terms of the ℓ₂-norm, we have

‖(P_i^{1/2}~x, 1/2 + ~b_i^⊤~x + c_i)‖₂ ≤ 1/2 − ~b_i^⊤~x − c_i,   (8.99)

which is a second-order cone constraint.
Finally, the affine equality constraint C~x = ~z can be encoded as the second-order cone constraint ‖C~x − ~z‖₂ ≤ 0.
Below, we establish that the dual of an SOCP is an SOCP. This fact can either be proved via conic duality, or proved
directly.
Theorem 200
Let ~c ∈ Rn , and for i ∈ {1, . . . , m} let Ai ∈ Rdi ×n , ~yi ∈ Rdi , ~bi ∈ Rn , and zi ∈ R. The dual of the following
SOCP in standard form:

p? = min_{~x∈R^n} ~c^⊤~x
s.t. ‖A_i~x − ~y_i‖₂ ≤ ~b_i^⊤~x + z_i, ∀i ∈ {1, …, m}

is itself an SOCP, given explicitly in (8.113)–(8.115) below.

Proof via Conic Duality. Let K_i ≐ {(~u, r) ∈ R^{d_i} × R | ‖~u‖₂ ≤ r} denote the second-order cone in R^{d_i+1}, and let
d ≐ Σ_{i=1}^m d_i. Then the standard form SOCP can be written as:

p? = min_{~x∈R^n} ~c^⊤~x
s.t. (A_i~x − ~y_i, ~b_i^⊤~x + z_i) ∈ K_i, ∀i ∈ {1, …, m}.
The Lagrangian L : R^n × R^d × R^m → R can thus be defined as follows. For each ~x ∈ R^n, ~λ = (~λ_1, …, ~λ_m) ∈ R^d
(with ~λ_i ∈ R^{d_i} for each i ∈ {1, …, m}), and ~µ ∈ R^m:

L(~x, ~λ, ~µ) = ~c^⊤~x − Σ_{i=1}^m [~λ_i^⊤(A_i~x − ~y_i) + µ_i(~b_i^⊤~x + z_i)]   (8.105)
= (~c − Σ_{i=1}^m (A_i^⊤~λ_i + µ_i~b_i))^⊤~x + Σ_{i=1}^m (~λ_i^⊤~y_i − µ_i z_i).   (8.106)
Next, define the dual function g : R^d × R^m → R by minimizing over the primal variable ~x ∈ R^n:

g(~λ, ~µ) = min_{~x∈R^n} L(~x, ~λ, ~µ)   (8.107)
= Σ_{i=1}^m (~λ_i^⊤~y_i − µ_i z_i) if Σ_{i=1}^m (A_i^⊤~λ_i + µ_i~b_i) = ~c, and −∞ otherwise.   (8.108)
The last equality follows by noticing that the objective is linear in ~x: if its coefficient ~c − Σ_{i=1}^m (A_i^⊤~λ_i + µ_i~b_i) = ~0, then
the objective value is the sum Σ_{i=1}^m (~λ_i^⊤~y_i − µ_i z_i) regardless of the value of ~x, while if ~c − Σ_{i=1}^m (A_i^⊤~λ_i + µ_i~b_i) ≠ ~0
then we can make the objective value as low as we want by picking ~x appropriately. For instance, let K > 0 be a large
positive number; then

~x = −K(~c − Σ_{i=1}^m (A_i^⊤~λ_i + µ_i~b_i)) ⟹ L(~x, ~λ, ~µ) = −K‖~c − Σ_{i=1}^m (A_i^⊤~λ_i + µ_i~b_i)‖₂² + Σ_{i=1}^m (~λ_i^⊤~y_i − µ_i z_i),   (8.109)
which we can drive down to −∞ by increasing K to +∞. Recalling that the dual variables must satisfy (~λ_i, µ_i) ∈ K_i* for each i, the dual problem is given by:

d? = max_{~λ∈R^d, ~µ∈R^m} Σ_{i=1}^m (~λ_i^⊤~y_i − µ_i z_i)   (8.110)
s.t. Σ_{i=1}^m (A_i^⊤~λ_i + µ_i~b_i) = ~c   (8.111)
(~λ_i, µ_i) ∈ K_i*, ∀i ∈ {1, …, m},
where the last line uses the fact that each K_i is its own conic dual, i.e., K_i* = K_i. In standard SOCP form, the dual is:

d? = max_{~λ∈R^d, ~µ∈R^m} Σ_{i=1}^m (~λ_i^⊤~y_i − µ_i z_i)   (8.113)
s.t. ‖Σ_{i=1}^m (A_i^⊤~λ_i + µ_i~b_i) − ~c‖₂ ≤ 0   (8.114)
‖~λ_i‖₂ ≤ µ_i, ∀i ∈ {1, …, m}.   (8.115)
We now give a second, direct proof. We add some variables to simplify. Namely, we introduce ~u_i ∈ R^{d_i} and w_i ∈ R for each i ∈ {1, …, m}. For
convenience, we define ~u = (~u_1, …, ~u_m) ∈ R^{d_1} × ··· × R^{d_m} = R^d, where again d = Σ_{i=1}^m d_i, and also define
~w = (w_1, …, w_m) ∈ R^m. The primal SOCP is then equivalent to minimizing ~c^⊤~x subject to ‖~u_i‖₂ ≤ w_i for all i together with
wi = ~b>
i ~
x + zi , ∀i ∈ {1, . . . , m}. (8.121)
We can thus define a Lagrangian for this system, say with dual variables ~λ ∈ Rm , ~η ∈ Rd (with ~ηi ∈ Rdi for each i)
and ~ν ∈ Rm . We have
  L(~x, ~u, ~w, ~λ, ~η, ~ν) = ~c⊤~x + Σ_{i=1}^m λ_i(‖~u_i‖₂ − w_i) + Σ_{i=1}^m ~η_i⊤(~u_i − A_i~x + ~y_i) + Σ_{i=1}^m ν_i(w_i − ~b_i⊤~x − z_i)  (8.122)
                           = (~c − Σ_{i=1}^m (A_i⊤~η_i + ν_i~b_i))⊤~x + Σ_{i=1}^m (λ_i‖~u_i‖₂ + ~η_i⊤~u_i) + Σ_{i=1}^m (ν_i − λ_i)w_i + Σ_{i=1}^m (~η_i⊤~y_i − ν_i z_i).  (8.123)
Now define the dual function g : Rm × Rd × Rm → R by minimizing over the primal variables (~x, ~u, ~w) ∈ Rn × Rd × Rm:

  g(~λ, ~η, ~ν) = min_{~x, ~u, ~w} L(~x, ~u, ~w, ~λ, ~η, ~ν)
              = { Σ_{i=1}^m (~η_i⊤~y_i − ν_i z_i),  if ~c = Σ_{i=1}^m (A_i⊤~η_i + ν_i~b_i), λ_i ≥ ‖~η_i‖₂, and ν_i = λ_i for all i,
                { −∞,  otherwise.

The last equality looks complicated and a bit magical, but we methodically justify it here.
(a) As a function of ~x alone, the Lagrangian has the form

  L = (~c − Σ_{i=1}^m (A_i⊤~η_i + ν_i~b_i))⊤~x + other terms not involving ~x,  (8.126)

so unless the coefficient ~c − Σ_{i=1}^m (A_i⊤~η_i + ν_i~b_i) is ~0, we can make the Lagrangian arbitrarily negative by varying ~x while keeping ~u and ~w fixed. For instance, let K > 0 be a large positive number. Then setting ~x = −K(~c − Σ_{i=1}^m (A_i⊤~η_i + ν_i~b_i)) contributes −K‖~c − Σ_{i=1}^m (A_i⊤~η_i + ν_i~b_i)‖₂² to the Lagrangian, which goes to −∞ as K → ∞.
(b) Towards minimizing this expression over ~u, we aim to solve, for each i, the problem

  min_{~u_i∈R^{d_i}} (λ_i‖~u_i‖₂ + ~η_i⊤~u_i)

and collect the results for each i at the end. At first glance, it may seem hard to imagine this term blowing up at all. Towards finding a possible blow-up case, if any, we use Cauchy-Schwarz to try to make the sum as small as possible. In particular, by Cauchy-Schwarz we have

  λ_i‖~u_i‖₂ + ~η_i⊤~u_i ≥ λ_i‖~u_i‖₂ − ‖~η_i‖₂‖~u_i‖₂ = (λ_i − ‖~η_i‖₂)‖~u_i‖₂,  (8.131)

with equality when ~u_i points in the opposite direction to ~η_i, that is, ~u_i = −K~η_i for some K ≥ 0. With this value of ~u_i (for varying K → ∞) we shall try to make the Lagrangian go to −∞. Indeed, in this case, we have

  λ_i‖~u_i‖₂ + ~η_i⊤~u_i = K(λ_i − ‖~η_i‖₂)‖~η_i‖₂.  (8.132)

First suppose that ‖~η_i‖₂ = 0. Since λ_i ≥ 0 in the Lagrangian formulation, we automatically have λ_i ≥ ‖~η_i‖₂. The optimal ~u_i is ~u_i = −K~η_i = ~0 (independently of the value of K), at which point the term in the Lagrangian becomes 0. We now deal with the non-edge case, assuming that ~η_i ≠ ~0.

Suppose that λ_i − ‖~η_i‖₂ < 0. Then by sending K → ∞ with this choice of ~u_i = −K~η_i we drive the Lagrangian to −∞. On the other hand, if λ_i − ‖~η_i‖₂ ≥ 0, then the minimizing choice of K is K = 0, so that ~u_i = ~0, and the term in the Lagrangian becomes 0. Thus,

  min_{~u_i∈R^{d_i}} (λ_i‖~u_i‖₂ + ~η_i⊤~u_i) = { 0,  if λ_i ≥ ‖~η_i‖₂,
                                              { −∞,  otherwise.  (8.133)
(c) Suppose that ~c = Σ_{i=1}^m (A_i⊤~η_i + ν_i~b_i) and λ_i ≥ ‖~η_i‖₂ for each i. Then the Lagrangian has the form

  min_{~x∈Rn, ~u∈Rd} L = Σ_{i=1}^m (ν_i − λ_i)w_i + other terms not involving ~w.  (8.135)

Towards minimizing this expression over ~w, we aim to solve, for each i, the problem min_{w_i∈R} (ν_i − λ_i)w_i and collect the results at the end. Thankfully this is much simpler than the rest of the calculations, since the objective is an unconstrained minimization of a linear function of a scalar w_i. If the coefficient ν_i − λ_i is nonzero, then we can drive the objective to −∞ by choosing w_i accordingly. Namely, if ν_i ≠ λ_i then the choice w_i = −K(ν_i − λ_i) for a positive scalar K > 0 gives the objective value −K(ν_i − λ_i)². Since (ν_i − λ_i)² > 0, taking K → ∞ shows that the optimal value of the objective is −∞. On the other hand, if ν_i = λ_i then the objective has value 0 independent of the choice of w_i. We have shown that

  min_{w_i∈R} (ν_i − λ_i)w_i = { 0,  if ν_i = λ_i,
                               { −∞,  otherwise.  (8.137)
Putting the three cases together, the dual problem d⋆ = max_{~λ⪰~0, ~η, ~ν} g(~λ, ~η, ~ν) simplifies to

  d⋆ = max_{~λ∈Rm, ~η∈Rd, ~ν∈Rm} Σ_{i=1}^m (~η_i⊤~y_i − ν_i z_i)  (8.140)
       s.t. ~c = Σ_{i=1}^m (A_i⊤~η_i + ν_i~b_i),  (8.141)
            λ_i ≥ ‖~η_i‖₂, ∀i ∈ {1, ..., m},  (8.142)
            ~ν = ~λ,  (8.143)
            ~λ ⪰ ~0.  (8.144)
Note that the constraint λ_i ≥ ‖~η_i‖₂ already implies λ_i ≥ 0 since ‖~η_i‖₂ ≥ 0. Thus, we can rewrite the problem again as

  d⋆ = max_{~λ∈Rm, ~η∈Rd, ~ν∈Rm} Σ_{i=1}^m (~η_i⊤~y_i − ν_i z_i)  (8.145)
       s.t. ~c = Σ_{i=1}^m (A_i⊤~η_i + ν_i~b_i),  (8.146)
            λ_i ≥ ‖~η_i‖₂, ∀i ∈ {1, ..., m},  (8.147)
            ~ν = ~λ.  (8.148)
Now note that the last constraint forces ~λ = ~ν. Thus, we can eliminate one of them; we choose arbitrarily to eliminate ~ν by replacing it everywhere with ~λ. This gives the dual problem as

  d⋆ = max_{~λ∈Rm, ~η∈Rd} Σ_{i=1}^m (~η_i⊤~y_i − λ_i z_i)  (8.149)
       s.t. ~c = Σ_{i=1}^m (A_i⊤~η_i + λ_i~b_i),  (8.150)
            λ_i ≥ ‖~η_i‖₂, ∀i ∈ {1, ..., m}.  (8.151)
To write this in SOCP form, we can write the affine constraint as a norm, obtaining

  d⋆ = max_{~λ∈Rm, ~η∈Rd} Σ_{i=1}^m (~η_i⊤~y_i − λ_i z_i)  (8.152)
       s.t. ‖~c − Σ_{i=1}^m (A_i⊤~η_i + λ_i~b_i)‖₂ ≤ 0,  (8.153)
            ‖~η_i‖₂ ≤ λ_i, ∀i ∈ {1, ..., m},  (8.154)

which is an SOCP, as desired.
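As a quick numerical sanity check of Theorem 200, the following sketch builds a small SOCP and its dual in the form (8.152)-(8.154) and compares their optimal values. It uses the cvxpy modeling library, which is not part of the course material, and the instance data is our own toy choice (the first cone bounds the feasible set so the primal value is finite).

```python
import numpy as np
import cvxpy as cp

# Toy data: two second-order cone constraints in R^2.
A = [np.eye(2), np.array([[1.0, 2.0], [0.0, 1.0]])]
y = [np.zeros(2), np.array([1.0, 1.0])]
b = [np.zeros(2), np.array([0.1, 0.0])]
z = [10.0, 5.0]          # the first constraint is ||x||_2 <= 10
c = np.array([1.0, -2.0])
m = 2

# Primal SOCP: min c^T x  s.t.  ||A_i x - y_i||_2 <= b_i^T x + z_i.
x = cp.Variable(2)
primal = cp.Problem(cp.Minimize(c @ x),
                    [cp.SOC(b[i] @ x + z[i], A[i] @ x - y[i]) for i in range(m)])
primal.solve()

# Dual SOCP as in (8.152)-(8.154): max sum_i (eta_i^T y_i - lam_i z_i)
# s.t. c = sum_i (A_i^T eta_i + lam_i b_i) and ||eta_i||_2 <= lam_i.
eta = [cp.Variable(2) for _ in range(m)]
lam = cp.Variable(m)
dual = cp.Problem(cp.Maximize(sum(eta[i] @ y[i] - lam[i] * z[i] for i in range(m))),
                  [c == sum(A[i].T @ eta[i] + lam[i] * b[i] for i in range(m))]
                  + [cp.SOC(lam[i], eta[i]) for i in range(m)])
dual.solve()

print(primal.value, dual.value)  # equal up to solver tolerance (strong duality)
```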
The following content is optional/out of scope for this semester. Regardless, it may be helpful to read it to
gain context, or get a deeper understanding of various results.
An SDP in inequality form is an optimization problem of the form

  min_{~x∈Rn} ~c⊤~x
  s.t. F_0 + Σ_{i=1}^n x_iF_i ⪯ 0,

where ~c ∈ Rn, and F_0, F_1, ..., F_n ∈ Sn. The constraint F_0 + Σ_{i=1}^n x_iF_i ⪯ 0 is referred to as a linear matrix inequality. The constraint set, i.e., the set of ~x ∈ Rn satisfying this linear matrix inequality, is a convex set.
Notice that we only require one linear matrix inequality in the definition. What if we had multiple? Suppose that we actually wanted to solve the problem

  min_{~x∈Rn} ~c⊤~x
  s.t. F_0^{(j)} + Σ_{i=1}^n x_iF_i^{(j)} ⪯ 0, ∀j ∈ {1, ..., k}.

This could be phrased using a single linear matrix inequality, by stacking the matrices for each j block-diagonally, and the problem would be

  min_{~x∈Rn} ~c⊤~x
  s.t. blkdiag(F_0^{(1)}, ..., F_0^{(k)}) + Σ_{i=1}^n x_i · blkdiag(F_i^{(1)}, ..., F_i^{(k)}) ⪯ 0.

(If this reduction isn't clear to you, it's totally fine; try to prove it as an exercise.)
We now introduce another major standard form of SDPs. An SDP in standard form is an optimization problem of the form

  min_{X∈Sn} tr(CX)
  s.t. tr(A_kX) = b_k, ∀k ∈ {1, ..., m},
       X ⪰ 0,

where C, A_1, ..., A_m ∈ Sn, and b_1, ..., b_m ∈ R.
The first main theorem below establishes that the inequality and standard forms of an SDP are equivalent, in the
sense that either can be reformulated as the other.
Theorem 203
An SDP in inequality form can be reformulated as an SDP in standard form, and vice versa.
Proof. Just for this proof, we introduce the notation vec : R^{m×n} → R^{mn}, which takes an m × n matrix and unrolls it into an mn-length vector. With this notation, in fact, for two symmetric matrices A, B ∈ Sn, we can write tr(AB) = Σ_{i=1}^n Σ_{j=1}^n A_ijB_ij = vec(A)⊤vec(B). On the other hand, we will sometimes need to access the element of vec(A) corresponding to A_ij; we denote this by vec(A)_{i,j} (where the comma makes it clear that the index is not the product of i and j). We will also use the notation diag : Rn → R^{n×n}, which takes a vector and returns a diagonal matrix whose diagonal is the entries of this vector. This notation will greatly simplify things to follow.
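The identity tr(AB) = vec(A)⊤vec(B) for symmetric matrices is easy to spot-check numerically; here is a minimal numpy sketch (our own illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
A, B = (M + M.T) / 2, (N + N.T) / 2     # two symmetric matrices

lhs = np.trace(A @ B)
rhs = A.flatten() @ B.flatten()         # vec(A)^T vec(B)
assert np.isclose(lhs, rhs)
```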
“Inequality form ⟹ Standard form”: Let ~c ∈ Rn and F_0, F_1, ..., F_n ∈ Sd, and consider the following SDP in inequality form:

  min_{~x∈Rn} ~c⊤~x
  s.t. F_0 + Σ_{i=1}^n x_iF_i ⪯ 0.
First, towards introducing the positive semidefinite constraint, we introduce a new variable Y ∈ Sd, associated with −(F_0 + Σ_{i=1}^n x_iF_i). That is, our original problem has the form

  min_{~x∈Rn, Y∈Sd} ~c⊤~x
  s.t. Y = −(F_0 + Σ_{i=1}^n x_iF_i),
       Y ⪰ 0.  (8.174)

Since we have a linear matrix equality, we can write it as a bunch of scalar equations to get it closer to the desired form, say in the following way:

  min_{~x∈Rn, Y∈Sd} ~c⊤~x
  s.t. Y_jk + (F_0)_jk + Σ_{i=1}^n x_i(F_i)_jk = 0, ∀j, k ∈ {1, ..., d},
       Y ⪰ 0.  (8.177)
But even this isn't quite right: after all, we require all decision variables to be encapsulated in a positive semidefinite matrix. The simplest way to do this is to form a block diagonal matrix where each block is an embedding of a decision variable into a positive semidefinite matrix; the large matrix will also be positive semidefinite in this case. Towards converting ~x to a positive semidefinite block, one could consider its diagonal matrix equivalent diag(~x), but this would not be positive semidefinite unless all entries of ~x were non-negative. To ensure that this happens, we use slack variables, akin to the proof that general linear programs can be written in standard form.

Namely, associate vectors ~x+, ~x− ∈ Rn by the following formulae:

  x_i^+ = { x_i,  x_i > 0,        x_i^- = { 0,     x_i ≥ 0,
          { 0,    x_i ≤ 0,               { −x_i,  x_i < 0.  (8.178)
In this case ~x = ~x+ − ~x−. Thus the original problem is equivalent to the reformulation

  min ~c⊤(~x+ − ~x−)
  s.t. Y_jk + (F_0)_jk + Σ_{i=1}^n (x_i^+ − x_i^-)(F_i)_jk = 0, ∀j, k ∈ {1, ..., d},
       x_i^+ ≥ 0, ∀i ∈ {1, ..., n},  (8.181)
       x_i^- ≥ 0, ∀i ∈ {1, ..., n},  (8.182)
       Y ⪰ 0.  (8.183)
Now collect all the decision variables into the block diagonal matrix Z = blkdiag(diag(~x+), diag(~x−), Y) and impose Z ⪰ 0. This is the positive semidefiniteness constraint we want, so the decision variable is Z ∈ S^{2n+d}. As notation, let Z_{1,i} = diag(~x+)_{ii} = x_i^+ be the i-th element of the first block, Z_{2,i} = diag(~x−)_{ii} = x_i^- be the i-th element of the second block, and Z_{3,ij} = Y_ij be the (i, j)-th element of the third block. As notation for later, let O be the set of all indices in {1, ..., 2n + d} × {1, ..., 2n + d} which are not on the diagonal or part of the Y block, and thus must be set to zero; formally O = {(i, j) | 1 ≤ i, j ≤ 2n + d, i ≠ j, and i ≤ 2n or j ≤ 2n}.
Now, we have written our problem in the form

  min_{Z∈S^{2n+d}} Σ_{i=1}^n c_i(Z_{1,i} − Z_{2,i})  (8.185)
  s.t. Z_{3,jk} + (F_0)_jk + Σ_{i=1}^n (Z_{1,i} − Z_{2,i})(F_i)_jk = 0, ∀j, k ∈ {1, ..., d},  (8.186)
       Z_ij = 0, ∀(i, j) ∈ O,
       Z ⪰ 0.
Notice that all constraints are affine in Z, apart from the positive semidefiniteness constraint, and our objective is affine; by our discussion of affine functions, the affine constraints can be written in the form tr(A_kZ) = b_k, and the objective can be written in the form tr(CZ), for some symmetric matrices A_k, C and scalars b_k, with k ∈ {1, ..., m} where m = d² + |O|.² Thus we can write our problem as

  min_{Z∈S^{2n+d}} tr(CZ)
  s.t. tr(A_kZ) = b_k, ∀k ∈ {1, ..., m},
       Z ⪰ 0,

as desired.
“Standard form ⟹ Inequality form”: Let C, A_1, ..., A_m ∈ Sn be fixed symmetric matrices, and let b_1, ..., b_m ∈ R be fixed scalars. Consider the following SDP in standard form:

  min_{X∈Sn} tr(CX)
  s.t. tr(A_kX) = b_k, ∀k ∈ {1, ..., m},
       X ⪰ 0.

Notice by our notation that tr(CX) = vec(C)⊤vec(X) and that tr(A_kX) = vec(A_k)⊤vec(X). Thus, letting ~x ∈ R^{n²} be defined as ~x = vec(X), our objective is linear in ~x, since it is ~c⊤~x where ~c = vec(C). Furthermore, each equality constraint becomes vec(A_k)⊤~x = b_k, which we may express as the pair of scalar linear matrix inequalities

  vec(A_k)⊤~x − b_k ⪯ 0,   b_k − vec(A_k)⊤~x ⪯ 0,
²Careful readers may notice that the discussion on affine functions ensured something slightly different; namely, for an affine function f on symmetric matrices (or indeed all of R^{n×n}), there was some matrix A and scalar b such that f(X) = tr(A⊤X) + b. In particular, the result did not guarantee that such an A could be symmetric. But certainly the matrix (A + A⊤)/2 is symmetric, and for Z symmetric, we have tr(A⊤Z) = tr([(A + A⊤)/2]Z), so indeed, for an affine function f : Sn → R there exists some matrix A ∈ Sn and scalar b ∈ R such that f(X) = tr(AX) + b.
where we are using ⪯ for the ordering on the space of 1 × 1 symmetric matrices, i.e., scalars. These are bona fide linear matrix inequalities and will be combined with others, later, to form the full linear matrix inequality constraint for our problem.
The only constraint remaining that cannot easily be expressed in vectorized form is the constraint X ⪰ 0. For this, we note that we are allowed to have a linear matrix inequality constraint, so we want to express X ⪰ 0 in terms of a linear matrix inequality involving ~x. This is difficult at first, so we handle the case n = 2 as an example. Write

  X = [x1  x2; x3  x4],   ~x = (x1, x2, x3, x4)⊤.  (8.200)
Notice that, since X is symmetric (and so x2 = x3), we can write X in terms of a linear combination of constant symmetric matrices, as follows:

  X = [x1  x2; x2  x4]  (8.201)
    = x1[1 0; 0 0] + x2[0 1; 1 0] + x4[0 0; 0 1]  (8.202)
    = x1[1 0; 0 0] + x2 · (1/2)[0 1; 1 0] + x3 · (1/2)[0 1; 1 0] + x4[0 0; 0 1]  (8.203)
    = x1E^{11} + (1/2)x2(E^{12} + E^{21}) + (1/2)x3(E^{12} + E^{21}) + x4E^{22},  (8.204)
where E^{ij} is defined as the n × n matrix with 1 in the (i, j)-th coordinate and 0 elsewhere. Thus the positive semidefinite constraint can be replaced by the linear matrix inequality

  X ⪰ 0  ⟺  −(x1E^{11} + (1/2)x2(E^{12} + E^{21}) + (1/2)x3(E^{12} + E^{21}) + x4E^{22}) ⪯ 0.  (8.205)
The general case goes the same way. We can say

  X ⪰ 0  ⟺  −(Σ_{i=1}^n x_{i,i}E^{ii} + (1/2)Σ_{i=1}^n Σ_{j=1, j≠i}^n x_{i,j}(E^{ij} + E^{ji})) ⪯ 0,  (8.206)

where again x_{i,j} refers to the element of ~x corresponding to the entry X_ij.
This gives a linear matrix inequality for the last constraint, and so all constraints can be represented by some linear matrix inequalities. Thus, by the discussion on reducing several linear matrix inequalities to a single one, all constraints can be represented as a single linear matrix inequality of the form F_0 + Σ_{i=1}^m x_iF_i ⪯ 0. Thus the original problem can be represented as

  min_{~x∈Rm} ~c⊤~x
  s.t. F_0 + Σ_{i=1}^m x_iF_i ⪯ 0,

where m = n², ~c = vec(C), and the linear matrix inequality constraint is constructed in the aforementioned way.
We now compute the conic dual of the inequality-form SDP. We know that the dual cone of Sd₊ in Sd (equipped with the Frobenius inner product ⟨A, B⟩_F = tr(AB) and corresponding Frobenius norm) is simply Sd₊ itself. Thus we can define the Lagrangian L : Rn × Sd₊ → R as

  L(~x, Λ) = ~c⊤~x + ⟨Λ, F_0 + Σ_{i=1}^n x_iF_i⟩_F  (8.211)
           = ~c⊤~x + tr(Λ(F_0 + Σ_{i=1}^n x_iF_i))  (8.212)
           = ~c⊤~x + tr(ΛF_0) + Σ_{i=1}^n x_i tr(ΛF_i)  (8.213)
           = Σ_{i=1}^n (c_i + tr(ΛF_i))x_i + tr(ΛF_0).  (8.214)
Now, define the dual function g : Sd₊ → R by minimizing over the primal variable ~x:

  g(Λ) = min_{~x∈Rn} L(~x, Λ) = { tr(ΛF_0),  if c_i + tr(ΛF_i) = 0 for all i ∈ {1, ..., n},
                                { −∞,  otherwise.

The last equality is because in each individual term (c_i + tr(ΛF_i))x_i, when minimizing over x_i, if c_i + tr(ΛF_i) ≠ 0 then we can always drive it to −∞ by picking x_i to be large and of the opposite sign. Thus the dual problem is

  d⋆ = max_{Λ∈Sd} tr(ΛF_0)
       s.t. tr(ΛF_i) = −c_i, ∀i ∈ {1, ..., n},
            Λ ⪰ 0,

which (after negating the objective) is an SDP in standard form.
SDPs generalize all previously introduced classes of convex optimization problems: LPs, (convex) QPs, (convex)
QCQPs, and SOCPs.
Theorem 205
SOCPs can be reformulated as SDPs.
Proof. We use the following useful characterization of second-order cone constraints as semidefinite constraints.

Claim. For (~x, t) ∈ R^{m+1}, we have

  ‖~x‖₂ ≤ t  ⟺  [tI  ~x; ~x⊤  t] ⪰ 0.  (8.222)
To prove the claim, recall that the matrix is positive semidefinite if and only if its quadratic form is non-negative: for every (~a, b) ∈ R^{m+1},

  [~a; b]⊤ [tI  ~x; ~x⊤  t] [~a; b] = t(‖~a‖₂² + b²) + 2b~a⊤~x ≥ 0.

By Cauchy-Schwarz we have

  t(‖~a‖₂² + b²) + 2b~a⊤~x ≥ t(‖~a‖₂² + b²) − 2|b|‖~a‖₂‖~x‖₂,  (8.226)

with equality (for b ≥ 0) when ~a = −K~x for some scalar K ≥ 0; the case b < 0 is symmetric. Thus

  [tI  ~x; ~x⊤  t] ⪰ 0  ⟺  t(‖~a‖₂² + b²) − 2|b|‖~a‖₂‖~x‖₂ ≥ 0, ∀(~a, b) ∈ R^{m+1}.  (8.227)
Now by the AM-GM inequality (i.e., expanding the square in (‖~a‖₂ − |b|)² ≥ 0), we have ‖~a‖₂² + b² ≥ 2|b|‖~a‖₂, so whenever t ≥ ‖~x‖₂ the right-hand side of (8.227) is non-negative for every (~a, b). Conversely, taking ‖~a‖₂ = |b| = 1 with equality in (8.226) shows that positive semidefiniteness forces t ≥ ‖~x‖₂. This proves the claim.
Let K^d be the second-order cone in Rd. Notice that, using the claim, each cone constraint ‖A_i~x − ~y_i‖₂ ≤ ~b_i⊤~x + z_i can be written in the form

  [(~b_i⊤~x + z_i)I   A_i~x − ~y_i;  (A_i~x − ~y_i)⊤   ~b_i⊤~x + z_i] ⪰ 0,

and the left-hand side is affine in ~x: it equals [z_iI  −~y_i; −~y_i⊤  z_i] + Σ_j x_j [(~b_i)_jI  (A_i)_j; (A_i)_j⊤  (~b_i)_j], where (~b_i)_j is the j-th entry of ~b_i, and (A_i)_j is the j-th column of A_i. In any case, this is a linear matrix inequality (after some reshuffling of terms).
The conversion from SOCP to SDP consists of converting all second-order cone constraints to small linear matrix inequalities, then combining them to form one larger linear matrix inequality, which defines the constraint set of the resulting SDP. The objective function is already linear, so the resulting SDP is in inequality form. Thus, we have reduced the original SOCP to an SDP.
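The key claim (8.222) is also easy to check numerically. The following numpy sketch (our own; the tolerance and sample count are arbitrary) compares membership in the second-order cone against positive semidefiniteness of the arrow matrix, whose eigenvalues are t (with multiplicity m − 1) and t ± ‖~x‖₂:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(200):
    x = rng.standard_normal(3)
    t = 2.0 * rng.standard_normal()
    M = np.block([[t * np.eye(3), x[:, None]],
                  [x[None, :],    np.array([[t]])]])
    in_cone = np.linalg.norm(x) <= t
    psd = np.linalg.eigvalsh(M).min() >= -1e-9   # PSD up to numerical tolerance
    assert in_cone == psd
```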
Note that in practice, this reduction is often extremely costly; SDPs are hard to solve at large scale, while SOCPs
are much easier.
The above content is optional/out of scope for this semester, but now we resume the required/in scope content.
LPs ⊂ Convex QPs ⊂ Convex QCQPs ⊂ SOCPs ⊂ SDPs ⊂ Convex Problems (8.237)
All inclusions are strict, i.e., none of the classes is equivalent to any of the others.
For extra optional reading, you may also look into geometric programs (GPs), which are nonconvex programs that
can be turned into convex programs with a change of variables; and mixed-integer programs (MIPs), which are useful
in practice to incorporate integer constraints, but difficult to solve exactly. All such material is out of scope of the
course.
Chapter 9
This is different from the OLS problem due to the additional λ‖~x‖₂² term, which can be thought of as a regularizer (i.e., a penalty) for having large ~x values. The λ parameter controls the strength of the penalty and is usually called a regularization parameter. In this sense, ridge regression is regularized least squares. More generally, we may define regularization as follows.

For a given function R : Ω → R₊ (the regularizer) and a regularization parameter λ > 0, the regularized version of the above problem is the problem

  p⋆_λ = min_{~x∈Ω} {f_0(~x) + λR(~x)}.  (9.3)
In general, the original problem and the regularized problem do not have the same solutions, nor do versions of the regularized problem with different λ parameters. One need only consider ridge regression to keep this in mind: for fixed A and ~y, increasing λ will decrease the norm of the solution to the ridge regression problem, and sending λ to 0 (i.e., recovering unregularized least squares) will increase the norm of the solution.
One example of regularization is the ℓ2-norm penalty R(~x) = ‖~x‖₂², which (when combined with f_0(~x) = ‖A~x − ~y‖₂²) yields ridge regression. Another example is elastic-net regression, which we covered briefly as an example when discussing convexity. But the main objective of this chapter is to look at the so-called LASSO regression problem, which uses an ℓ1-norm regularizer. Recall that for a vector ~x ∈ Rn, its ℓ1-norm is defined as ‖~x‖₁ = Σ_{i=1}^n |x_i|.
Proposition 208
Consider the LASSO regression problem

  min_{~x∈Rn} f_0(~x),  where f_0(~x) = ‖A~x − ~y‖₂² + λ‖~x‖₁.  (9.5)

(a) The function f_0 is convex.
(b) If A has full column rank then f_0 is μ-strongly convex with μ = 2σ_n{A}².
(c) The above problem has at least one solution.
(d) If A has full column rank then the above solution is unique.
This picture is very different from ridge regression, where we are guaranteed that a solution always exists, is unique, and is solvable in closed form. The question then becomes: why do we even care about the LASSO problem at all? The basic answer is that it induces sparsity in the solution, i.e., solutions to LASSO tend to have few nonzero entries. This sparsity is useful for applications in high-dimensional statistics and machine learning, as it reveals a certain structure: in words, it points out which "features" are the most relevant to the regression. In the following sections, we will observe how this sparsity emerges, both geometrically and algebraically.
9.2 Understanding the Difference Between the `2 -Norm and the `1 -Norm
In this section, we attempt to build more intuition about the difference between the `2 -norm and the `1 -norm. We do
this by solving some problems which use the `2 norm, then replace it with the `1 norm and solve this new problem.
Besides giving us intuition, it will help us learn how to analyze the LASSO problem.
Here is a diagram of the norm balls of the ℓ1 (blue) and ℓ2 (red) norms in n = 2 dimensions:
Figure 9.1: The ℓ1 and ℓ2 norm balls in n = 2 dimensions. Recall that the ℓp-norm ball is defined as the set of vectors ~v such that ‖~v‖_p ≤ 1.
The borders of the norm balls are the points where each norm is equal to 1. Notice the difference in the geometry of these norm balls. The ℓ2 norm ball is circular, while the ℓ1 norm ball has distinctive corners.

In fact, these corners hint at a key difference between these norms: ‖~x‖₂ is differentiable everywhere except the origin, while ‖~x‖₁ is non-differentiable whenever any coordinate x_i = 0. These corners will help us understand how the ℓ1 norm regularizer induces sparsity in the solution, and also inform our analysis of problems involving the ℓ1-norm, including LASSO.
Recall the minimum-norm problem

  min_{~x∈Rn} ‖~x‖₂²  (9.6)
  s.t. A~x = ~y.

Using the KKT conditions, namely stationarity, we found an explicit solution to this problem: ~x⋆ = A⊤(AA⊤)^{-1}~y. Now let us replace the ℓ2 norm with an ℓ1 norm; we obtain the problem

  min_{~x∈Rn} ‖~x‖₁  (9.7)
  s.t. A~x = ~y.
We cannot apply stationarity to this problem because the objective is non-differentiable. Thus, this problem seems intractable to solve by hand, at least for the moment. Instead, let us formulate it as a linear program. As before, we represent each x_i as the difference of non-negative numbers which sum to |x_i|. More formally, we introduce slack variables ~x+, ~x− ∈ Rn such that for each i ∈ {1, ..., n} we have x_i^+ ≥ 0, x_i^- ≥ 0, x_i^+ − x_i^- = x_i, and x_i^+ + x_i^- = |x_i|. Thus we can rewrite the problem as the following linear program:

  min_{~x+, ~x−∈Rn} Σ_{i=1}^n (x_i^+ + x_i^-)  (9.10)
  s.t. A(~x+ − ~x−) = ~y,
       ~x+ ⪰ ~0, ~x− ⪰ ~0.
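Since this is an ordinary linear program, it can be handed to any LP solver. Below is a minimal sketch using scipy.optimize.linprog with the stacked variable ~z = (~x+, ~x−); the instance data is randomly generated for illustration:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 3, 6                        # underdetermined system A x = y
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

cost = np.ones(2 * n)              # objective: sum_i (x_i^+ + x_i^-)
A_eq = np.hstack([A, -A])          # encodes A (x^+ - x^-) = y
res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None))

x = res.x[:n] - res.x[n:]          # recover x = x^+ - x^-
print(np.round(x, 4))              # the minimizer tends to have few nonzero entries
```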
Similarly, for the ℓ1 regression problem min_{~x∈Rn} ‖A~x − ~y‖₁, we can introduce the slack variable ~e = A~x − ~y and obtain the problem

  min_{~x∈Rn, ~e∈Rm} ‖~e‖₁  s.t.  A~x − ~e = ~y,

which is an equality-constrained ℓ1 minimization problem, and thus a linear program as demonstrated above.
Example 210 (Mean Versus Median). Let k be a positive integer. Suppose we have points ~x_1, ..., ~x_k ∈ Rn. Consider the problem

  min_{~x∈Rn} Σ_{i=1}^k ‖~x − ~x_i‖₂².  (9.17)

This is an unconstrained strongly convex differentiable problem, so it has a unique solution ~x⋆_1 which we may find by setting the derivative of the objective to ~0. We obtain

  ~0 = 2 Σ_{i=1}^k (~x⋆_1 − ~x_i)  (9.18)
  ⟹ ~0 = Σ_{i=1}^k (~x⋆_1 − ~x_i) = k · ~x⋆_1 − Σ_{i=1}^k ~x_i  (9.19)
  ⟹ ~x⋆_1 = (1/k) Σ_{i=1}^k ~x_i.  (9.20)

This computation implies that the sample mean is the point which minimizes the total squared distance to all points in the dataset.
Now suppose that we instead consider the problem

  min_{~x∈Rn} Σ_{i=1}^k ‖~x − ~x_i‖₂.  (9.21)

The solution to this problem is the sample median of the points. To see this, suppose that n = 1, i.e., all our data x_i are scalar-valued. Then we obtain the problem

  min_{x∈R} Σ_{i=1}^k |x − x_i|.  (9.22)
This is an unconstrained, convex, non-differentiable problem. Let us examine all critical points, that is, points where the derivative is 0 or undefined. The derivative of the objective is

  (d/dx) Σ_{i=1}^k |x − x_i| = Σ_{i=1}^k (d/dx)|x − x_i|  (9.23)
  = Σ_{i=1}^k { 1,  if x > x_i;  −1,  if x < x_i;  undefined,  if x = x_i }  (9.24)
  = { Σ_{i : x>x_i} 1 + Σ_{i : x<x_i} (−1),  if x ∉ {x_1, ..., x_k},
    { undefined,  if x ∈ {x_1, ..., x_k}  (9.25)
  = { |{i ∈ {1, ..., k} : x > x_i}| − |{i ∈ {1, ..., k} : x < x_i}|,  if x ∉ {x_1, ..., x_k},
    { undefined,  if x ∈ {x_1, ..., x_k}.  (9.26)
Thus if x is such that |{i ∈ {1, ..., k} : x > x_i}| = |{i ∈ {1, ..., k} : x < x_i}|, then the derivative is 0, so this x is a candidate solution. To put this convoluted-looking condition in words, notice that the first term in the equality is just the number of x_i which are smaller than x, and the second term is the number of x_i which are larger than x. Thus the condition says that there are the same number of points in the set which are smaller than x as there are points which are larger than x. This x would fulfill the traditional definition of "median" as the middle of the sorted list of points.

To formally solve this problem, one must also check all the values x = x_i and compare the objective values. But after doing all this, one recovers that the optimal solutions are exactly the medians of the dataset.
Because the median is defined using the |·| function instead of (·)², it inherits several different properties. The most striking is its robustness: the median is much more robust than the mean. The mean is very sensitive to outliers, while the median is less sensitive (i.e., if we blow up an outlier point, the mean will change a lot, while the median will be unaffected).
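This robustness is easy to see numerically; a one-outlier example in numpy (our own toy data):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.mean(data), np.median(data))   # 3.0 3.0

data[-1] = 500.0                        # blow up a single outlier
print(np.mean(data), np.median(data))   # 102.0 3.0: the median is unchanged
```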
We now specialize the LASSO analysis to the scalar case n = 1: given ~a, ~y ∈ Rm and λ > 0, consider minimizing f_LASSO(x) = (1/2)‖~a x − ~y‖₂² + λ|x| over x ∈ R. Let x⋆ be a critical point of this problem. We solve for what x⋆ should be using casework.
Case 1. If x⋆ > 0, then the derivative is well-defined, so it must be equal to 0. Thus we have

  0 = (df_LASSO/dx)(x⋆)  (9.35)
    = ~a⊤(~a x⋆ − ~y) + λ  (9.36)
  ⟹ x⋆ = (~a⊤~y − λ)/‖~a‖₂².  (9.37)

Thus if x⋆ > 0 then x⋆ = (~a⊤~y − λ)/‖~a‖₂². Thus x⋆ > 0 if and only if ~a⊤~y > λ.

Case 2. If x⋆ < 0, then the derivative is well-defined, so it must be equal to 0. Thus we have

  0 = (df_LASSO/dx)(x⋆)  (9.38)
    = ~a⊤(~a x⋆ − ~y) − λ  (9.39)
  ⟹ x⋆ = (~a⊤~y + λ)/‖~a‖₂².  (9.40)

Thus if x⋆ < 0 then x⋆ = (~a⊤~y + λ)/‖~a‖₂². Thus x⋆ < 0 if and only if ~a⊤~y < −λ.
Case 3. If x⋆ = 0 then it is neither > 0 nor < 0. Thus we must have −λ ≤ ~a⊤~y ≤ λ.

In all, we have:
• x⋆ > 0 ⟺ ~a⊤~y > λ, in which case x⋆ = (~a⊤~y − λ)/‖~a‖₂²;
• x⋆ < 0 ⟺ ~a⊤~y < −λ, in which case x⋆ = (~a⊤~y + λ)/‖~a‖₂²; and
• x⋆ = 0 ⟺ −λ ≤ ~a⊤~y ≤ λ,

or in other words,

  x⋆ = { (~a⊤~y − λ)/‖~a‖₂²,  if ~a⊤~y > λ,
       { (~a⊤~y + λ)/‖~a‖₂²,  if ~a⊤~y < −λ,
       { 0,  if −λ ≤ ~a⊤~y ≤ λ.  (9.41)
Figure 9.2: The plot of the function mapping ~a⊤~y ↦ x⋆_LASSO; it is identically zero on the interval [−λ, λ] and linear with slope 1/‖~a‖₂² outside it.
If we plot the least squares solution in red on the same graph, it has the same nonzero slope as the LASSO solution,
and looks like this:
Figure 9.3: In red, we add the plot of the function which maps ~a⊤~y ↦ x⋆_LS, where the latter is the solution to our scalar least squares problem. Note that the LASSO solution (in blue) is always closer to zero than the least squares solution, and is set directly to zero when ~a⊤~y ∈ [−λ, λ].
This illustrates a concept called soft thresholding: in the regime where the least squares solution x?LS is already
close to zero, x?LASSO becomes exactly zero. Meanwhile, ridge regression does not do this: x?RR = 0 if and only if
~a> ~y = 0, which is exactly when the unregularized least squares solution itself is zero. This fundamental difference is
why the solutions to LASSO regression tend to be sparse, i.e., have many entries set to 0.
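The soft-thresholding map (9.41) is simple to implement; below is a numpy sketch, written for ‖~a‖₂ = 1 so that the least squares solution is simply ~a⊤~y (the function name is our own):

```python
import numpy as np

def soft_threshold(s, lam):
    # Shrink s toward 0 by lam; values inside [-lam, lam] are set to exactly 0.
    return np.sign(s) * np.maximum(np.abs(s) - lam, 0.0)

s = np.linspace(-3, 3, 7)            # stand-in values of a^T y
print(soft_threshold(s, lam=1.0))    # zero on [-1, 1], shifted toward 0 outside
```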
Theorem 211
Let f_0 : Rn → R be strictly convex and such that lim_{t→∞} f_0(~x_t) = ∞ for all sequences (~x_t)_{t=0}^∞ such that lim_{t→∞} ‖~x_t‖₂ = ∞,ᵃ and let R : Rn → R₊ be convex and take non-negative values. Further suppose that there exists ~x_0 ∈ Rn such that R(~x_0) = 0.

For λ ≥ 0 and k ≥ 0, let R(λ) and C(k) be the sets of solutions to the "regularized" and "constrained" programs:

  R(λ) = argmin_{~x∈Rn} {f_0(~x) + λR(~x)},  (9.42)
  C(k) = argmin_{~x∈Rn : R(~x)≤k} f_0(~x).  (9.43)

Then:
(a) for every λ ≥ 0 there exists k ≥ 0 such that R(λ) = C(k); and
(b) for every k > 0 there exists λ ≥ 0 such that R(λ) = C(k).

ᵃThis assumption is called "coercivity".
This shows that in some sense, regularized convex problems are equivalent to constrained convex problems; and in this equivalence, the regularizer for the regularized problem shapes the constraint set of the constrained problem. In particular, regularized least squares (f_0(~x) = ‖A~x − ~y‖₂²) with full column rank is equivalent to constrained least squares, where the constraint takes the form R(~x) ≤ k.
Now, we sketch the feasible sets and level sets of the objective function for the constrained problems corresponding
to both ridge regression and LASSO regression.
Figure 9.4: Geometric differences between LASSO and ridge regression. On the left side, the blue diamond depicts the feasible region for an ℓ1-norm constraint such as ‖~x‖₁ ≤ t, while the circle on the right side is the feasible region for an ℓ2-norm constraint such as ‖~x‖₂² ≤ t. On both graphs, the red line is a level set of our objective function; specifically, the minimal level set that still intersects the feasible region. The intersection of this level set with the feasible region is the solution to our constrained problem and thus to an equivalent regularized problem.
Note how with the `1 -norm constraint, the intersection of the feasible region with the minimal level set is more likely
to be at a corner of the feasible region, which is a point where some coordinates are set exactly to zero. Meanwhile,
with the `2 -norm constraint, the intersection can be at an arbitrary point on the circle (or sphere in higher dimensions),
and likely isn’t at a corner. This is why LASSO induces sparsity in ~x, due to the distinctive corners we saw earlier in its
norm ball. Meanwhile, although ridge regression compresses ~x?RR to be smaller, it doesn’t necessarily induce sparsity
in ~x?RR .
Chapter 10
Many iterative optimization algorithms produce iterates of the form

  ~x^{(t+1)} = ~x^{(t)} + η~v^{(t)}  (10.1)

for some search direction ~v^{(t)} and step size η. In Chapter 6 we covered the gradient descent method, which uses the negative gradient of the function as the search direction. In this chapter we will revisit descent-based optimization methods and introduce alternative update rules.
In this section we will introduce coordinate descent, a class of descent-based algorithms that finds a minimizer of a multivariate function by iteratively minimizing it along one coordinate at a time. Consider the unconstrained convex optimization problem

  p⋆ = min_{~x∈Rn} f(~x).  (10.2)

Here, for indices i ≤ j, we write ~x_{i:j} = (x_i, ..., x_j). In each iteration t, coordinate descent updates one coordinate at a time by solving

  x_i^{(t+1)} ∈ argmin_{x_i∈R} f(~x^{(t+1)}_{1:i−1}, x_i, ~x^{(t)}_{i+1:n}).  (10.4)

In full, one pass of the algorithm performs the updates
  x_1^{(t+1)} ∈ argmin_{x_1∈R} f(x_1, ~x^{(t)}_{2:n})  (10.5)
  x_2^{(t+1)} ∈ argmin_{x_2∈R} f(x_1^{(t+1)}, x_2, ~x^{(t)}_{3:n})  (10.6)
  ⋮  (10.7)
  x_n^{(t+1)} ∈ argmin_{x_n∈R} f(~x^{(t+1)}_{1:n−1}, x_n).  (10.8)
This is a sequential process: after finding the minimizer along the i-th coordinate (i.e., x_i^{(t+1)}) we use its value when minimizing along subsequent coordinates. Also, we note that the order of the coordinates is arbitrary. We formalize this update in the following algorithm.
Algorithm 6 CoordinateDescent
1: function CoordinateDescent(f, ~x^{(0)}, T)
2:   for t = 0, 1, ..., T − 1 do
3:     for i = 1, ..., n do
4:       x_i^{(t+1)} ← argmin_{x_i∈R} f(~x^{(t+1)}_{1:i−1}, x_i, ~x^{(t)}_{i+1:n})
5:     end for
6:   end for
7:   return ~x^{(T)}
8: end function
The algorithm breaks down the difficult multivariate optimization problem into a sequence of simpler univariate optimization problems.

We first want to discuss the issue of well-posedness of the algorithm. Any of the argmins used may fail to exist, in which case the algorithm is not well-defined, and so we cannot even think about its behavior or convergence. Nevertheless, for a large class of problems, which have many different characterizations, the argmins are well-defined; we say in this case that the coordinate descent algorithm is well-posed.
We now want to address the question of convergence. It is not obvious that minimizing the function f(~x) can be achieved by minimizing along each direction separately. In fact, the algorithm is not guaranteed to converge to an optimal solution for general convex functions. However, under some additional assumptions on the function, we can guarantee convergence. To build intuition for what additional assumptions are needed, we consider the following question. Let f(~x) be a convex differentiable function. Suppose that x⋆_i ∈ argmin_{x_i∈R} f(~x⋆_{1:i−1}, x_i, ~x⋆_{i+1:n}) for all i. Can we conclude that ~x⋆ is a global minimizer of f(~x)? The answer to this question is yes. We can prove this by recalling the first-order optimality condition for unconstrained convex functions and the definition of partial derivatives. If ~x⋆ is a minimizer of f(~x) along the direction ~e_i then we have

  (∂f/∂x_i)(~x⋆) = 0.  (10.9)
If this is true for all i then ∇f (~x? ) = ~0, implying that ~x? is a global minimizer for f . This discussion forms a proof of
the following theorem, which is Theorem 12.4 in [2].
Theorem 212
If f : Rn → R is convex and differentiable, and the problem

  p⋆ = min_{~x∈Rn} f(~x)  (10.10)

has a solution, then the sequence of iterates ~x^{(0)}, ~x^{(1)}, ... generated by the coordinate descent algorithm converges to an optimal solution to (10.10).
The coordinate descent algorithm may not converge to an optimal solution for general non-differentiable functions, even if they are convex. However, we can still prove that coordinate descent converges for a special class of functions of the form

  f(~x) = g(~x) + Σ_{i=1}^n h_i(x_i),  (10.11)

where g : Rn → R is convex and differentiable, and each h_i : R → R is convex (but not necessarily differentiable). This form includes various ℓ1-regularization problems (such as LASSO regression) which have a separable non-differentiable component. The provable convergence of the coordinate descent algorithm makes it an attractive choice for this class of problems.
Example 213. In this example we will consider the LASSO regression problem and examine how the coordinate descent algorithm can be applied to solve it. Note that the LASSO objective follows the form described in (10.11). For A ∈ R^{m×n} which has columns ~a_1, ..., ~a_n ∈ Rm, and ~y ∈ Rm, we consider the LASSO objective

  f(~x) = (1/2)‖A~x − ~y‖₂² + λ‖~x‖₁.  (10.12)

We aim to use coordinate descent to minimize this function. Let ~x^{(0)} be the initial guess. Then we perform the coordinate descent update by solving the following optimization problem:

  x_i^{(t+1)} = argmin_{x_i∈R} f(~x^{(t+1)}_{1:i−1}, x_i, ~x^{(t)}_{i+1:n}).  (10.13)

Each of these optimization problems will be solved similarly to what we did in Section 9.3. For notational clarity, let us instead solve the generic problem of minimizing f over a single coordinate x_i, with all other coordinates held fixed.
We now introduce the additional notation A_{i:j} ∈ R^{m×(j−i+1)} for the sub-matrix of A whose columns are the i-th through j-th columns of A (inclusive). Using this notation, we can simplify the first term as

  ~a_i⊤(A~x − ~y) = ~a_i⊤A~x − ~a_i⊤~y  (10.19)
               = ~a_i⊤(Σ_{j=1}^n ~a_jx_j − ~y)  (10.20)
               = ‖~a_i‖₂² x_i + ~a_i⊤(Σ_{j=1, j≠i}^n ~a_jx_j − ~y)  (10.21)
               = ‖~a_i‖₂² x_i + ~a_i⊤(A_{1:i−1}~x_{1:i−1} + A_{i+1:n}~x_{i+1:n} − ~y).  (10.22)
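Carrying the computation through as in Section 9.3 yields a soft-thresholding update for each coordinate. The following numpy sketch implements the resulting coordinate descent loop for the objective (10.12); the function name, iteration count, and data are our own choices for illustration:

```python
import numpy as np

def lasso_coordinate_descent(A, y, lam, T=100):
    m, n = A.shape
    x = np.zeros(n)
    col_sq = np.sum(A**2, axis=0)              # ||a_i||_2^2 for each column
    for _ in range(T):
        for i in range(n):
            r = y - A @ x + A[:, i] * x[i]     # residual with coordinate i removed
            s = A[:, i] @ r                    # a_i^T (y - sum_{j != i} a_j x_j)
            x[i] = np.sign(s) * max(abs(s) - lam, 0.0) / col_sq[i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
y = A[:, :3] @ np.array([1.0, -2.0, 3.0])      # only 3 relevant features
print(np.round(lasso_coordinate_descent(A, y, lam=5.0), 3))  # mostly exact zeros
```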
Recall that in the gradient descent algorithm, we assumed that the objective function f (~x) is differentiable. Further-
more, we assumed that at every point ~x ∈ Rn we can compute f (~x) as well as ∇f (~x). Here, we make the additional
assumption that f (~x) is twice differentiable and that we can compute the Hessian ∇2 f (~x). We wish to use the Hessian
to choose a good search direction and accelerate convergence. Optimization algorithms that utilize second derivatives
(e.g. the Hessian) are called second-order methods.
One of the most famous second-order methods is Newton’s method. Newton’s method is based on the following
idea for minimizing strictly-convex functions with positive definite Hessians: first, start with an initial guess ~x(0) . Then
in each iteration t = 1, 2, 3, . . ., approximate the objective function with its second-order Taylor approximation around
the point ~x(t) . The minimizer of this quadratic approximation is then chosen as the next iterate ~x(t+1) .
More formally, let us assume that f is strictly convex and twice-differentiable with positive definite Hessian at ~x^{(t)}, and let us write the second-order Taylor approximation of the function f(~x) around the point ~x^{(t)}. We obtain

  f̂_2(~x; ~x^{(t)}) = f(~x^{(t)}) + [∇f(~x^{(t)})]⊤(~x − ~x^{(t)}) + (1/2)(~x − ~x^{(t)})⊤[∇²f(~x^{(t)})](~x − ~x^{(t)}).  (10.33)

Since the Hessian ∇²f(~x^{(t)}) ≻ 0, we can solve the problem

  min_{~x∈Rn} f̂_2(~x; ~x^{(t)}),

which is a convex quadratic program, using our (by now) standard techniques. Setting the gradient (in ~x) to ~0, we obtain the update

  ~x^{(t+1)} = ~x^{(t)} − [∇²f(~x^{(t)})]^{-1}[∇f(~x^{(t)})].

We call the vector [∇²f(~x^{(t)})]^{-1}[∇f(~x^{(t)})] the Newton direction. Here, we do not choose a step-size η. Instead, we take a full step in the Newton direction towards the minimizer of the quadratic approximation of the objective function. This is the basic version of Newton's method; it is not guaranteed to converge in general. To achieve convergence, we can introduce a step-size η > 0 to the Newton update, obtaining the so-called damped Newton's method, which has the iteration

  ~x^{(t+1)} = ~x^{(t)} − η[∇²f(~x^{(t)})]^{-1}[∇f(~x^{(t)})].  (10.39)
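A minimal numpy sketch of damped Newton's method (10.39) on a strictly convex test function; the test function, step size, and stopping rule are our own choices:

```python
import numpy as np

def grad(x):   # gradient of f(x) = x1^4 + x2^4 + x1^2 + x2^2
    return 4 * x**3 + 2 * x

def hess(x):   # Hessian is diagonal and positive definite everywhere
    return np.diag(12 * x**2 + 2)

x = np.array([2.0, -3.0])
eta = 1.0                                   # eta = 1 recovers the basic method
for _ in range(50):
    v = np.linalg.solve(hess(x), grad(x))   # Newton direction
    x = x - eta * v
    if np.linalg.norm(grad(x)) < 1e-12:
        break
print(x)   # converges to the unique minimizer at the origin
```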
We will not discuss convergence proofs of Newton's method in this course; you may use [7] for further reading.

All of our discussion thus far has been under the assumption ∇²f(~x^{(t)}) ≻ 0. If the Hessian is not positive definite, one may adapt the algorithm accordingly, forming a new class of methods called quasi-Newton methods. Discussion of quasi-Newton methods is out of scope of the course.
Finally, we discuss the algorithmic complexity of Newton’s method. In every iteration, we need to compute and
invert the Hessian ∇2 f (~x(t) ) to obtain the search direction. This is much more expensive than computing the gradient
∇f (~x(t) ), which is used both in the gradient descent method and in Newton’s method. However, this expensive step
is not without justification; in many convex optimization problems, Newton’s method can be shown to converge to the
optimal solution much faster (i.e., in fewer iterations) than gradient descent.
We will use the same approach as with the unconstrained Newton's method; that is, we will take the second-order Taylor approximation around ~x^{(t)} and minimize it over the constraint set Ω = {~x ∈ Rn : A~x = ~y}. This method gives the following constrained convex quadratic program:

  min_{~x∈Rn} f(~x^{(t)}) + [∇f(~x^{(t)})]⊤(~x − ~x^{(t)}) + (1/2)(~x − ~x^{(t)})⊤[∇²f(~x^{(t)})](~x − ~x^{(t)})  (10.42)
  s.t. A~x = ~y.  (10.43)
Note that the quadratic program is convex and, if the original problem is feasible (i.e., Ω is nonempty), then strong duality holds by Slater's condition. Thus, we can solve this QP by solving the KKT conditions, as they are necessary and sufficient for global optimality. We begin by writing the Lagrangian L : Rn × Rm → R associated with this quadratic program, which is defined as

  L(~x, ~ν) = f(~x^{(t)}) + [∇f(~x^{(t)})]⊤(~x − ~x^{(t)}) + (1/2)(~x − ~x^{(t)})⊤[∇²f(~x^{(t)})](~x − ~x^{(t)}) + ~ν⊤(A~x − ~y).  (10.44)

Suppose that (~x⋆, ~ν⋆) are globally optimal for the constrained quadratic program. Then they must satisfy the KKT conditions, which are:

• Primal feasibility:
  A~x⋆ = ~y.  (10.45)
• Stationarity/first-order condition:
  ∇f(~x^{(t)}) + [∇²f(~x^{(t)})](~x⋆ − ~x^{(t)}) + A⊤~ν⋆ = ~0.

Let us define a vector ~v^{(t)} = ~x⋆ − ~x^{(t)}. Since ~x^{(t)} is feasible, we have A~x^{(t)} = ~y, and thus A~v^{(t)} = A~x⋆ − A~x^{(t)} = ~0. Thus, if we write the system in terms of ~v^{(t)} instead of ~x⋆, we have the system of equations

  [∇²f(~x^{(t)})  A⊤; A  0] [~v^{(t)}; ~ν⋆] = [−∇f(~x^{(t)}); ~0].

After solving this system of equations for ~v^{(t)}, our update rule becomes

  ~x^{(t+1)} = ~x^{(t)} + ~v^{(t)},

which is equivalent to setting the new iterate as the minimizer of the constrained QP. The formal iteration is given in Algorithm 9. There also exist damped versions of this algorithm, but their analysis is out of scope of the course.
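One step of this method amounts to solving the KKT system above; a numpy sketch with made-up problem data (a fixed matrix and vector standing in for ∇²f(~x^{(t)}) and ∇f(~x^{(t)})):

```python
import numpy as np

H = np.array([[2.0, 0.5], [0.5, 1.0]])   # stand-in for the Hessian at x^(t)
g = np.array([1.0, -2.0])                # stand-in for the gradient at x^(t)
A = np.array([[1.0, 1.0]])               # constraint A x = y
x = np.array([0.3, 0.7])                 # feasible iterate: x1 + x2 = 1

# KKT system: [H A^T; A 0] [v; nu] = [-g; 0]
K = np.block([[H, A.T], [A, np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([-g, np.zeros(1)]))
v = sol[:2]

x_next = x + v
print(x_next, A @ x_next)                # the new iterate remains feasible
```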
In this section, we introduce a class of algorithms to handle convex optimization problems with inequality constraints. Precisely, we will introduce the interior point method, which allows us to solve convex optimization problems of the following form:

  min_{~x∈Rn} f_0(~x)
  s.t. f_i(~x) ≤ 0, ∀i ∈ {1, ..., m},  (10.56)
       A~x = ~y,

where f_0, f_1, ..., f_m are all convex twice-differentiable functions. Interior point methods (IPM) are a class of algorithms which solve the problem (10.56) by solving a sequence of convex optimization problems with linear constraints using Newton's method. The key idea used in IPM is the barrier function, which we introduce next.
Using the indicator function I(z) = 0 for z ≤ 0 and I(z) = +∞ for z > 0, we can fold the inequality constraints into the objective, obtaining min_{~x} f_0(~x) + Σ_{i=1}^m I(f_i(~x)) s.t. A~x = ~y. This gives us an optimization problem with only linear equality constraints that is equivalent to the original optimization problem (10.56) (i.e., they have the same solution). However, introducing the indicator function makes the objective function non-differentiable, so we can no longer apply Newton's method to solve this problem. To overcome this, we will instead approximate the indicator function with a differentiable function φ, which we call a barrier function.

There are several choices for φ, i.e., good approximations of the indicator function, but they must all have something in common. Namely, φ should be a convex increasing function on R₋₋ such that lim_{z↗0} φ(z) = +∞, just like the indicator function I. There are many candidate functions that satisfy these criteria. One of the most used barrier functions, which we will use here, is the logarithmic barrier function, which, for some α > 0, takes the form
  φ_α(z) = −(1/α) log(−z),  z < 0.  (10.62)

The parameter α controls the accuracy of the approximation: as α grows larger, the logarithmic barrier function becomes a better and better approximation of the indicator function.
Using this logarithmic barrier, we can define an approximate optimization problem P̂(α) to P by the following:

  problem P̂(α):  p̂⋆_α = min_{~x∈Rn} f_0(~x) + Σ_{i=1}^m φ_α(f_i(~x))  (10.63)
                 s.t. A~x = ~y.  (10.64)

This optimization problem has a convex twice-differentiable objective function and linear equality constraints, so Newton's method can be applied.
¹This easy-to-hard solution process is one instance of a more general algorithmic paradigm called homotopy continuation or homotopy analysis,
which is used to precisely simulate very unstable dynamical systems.
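As a tiny worked example of the barrier idea (our own, much simpler than the general problem class): minimize f_0(x) = x subject to f_1(x) = 1 − x ≤ 0. The barrier problem is min_x x − (1/α) log(x − 1), whose solution x = 1 + 1/α approaches the true optimum x⋆ = 1 as α grows:

```python
import numpy as np
from scipy.optimize import minimize_scalar

for alpha in [1.0, 10.0, 100.0, 1000.0]:
    obj = lambda x: x - np.log(x - 1.0) / alpha
    res = minimize_scalar(obj, bounds=(1.0 + 1e-9, 10.0), method="bounded")
    print(alpha, res.x)   # matches the closed form 1 + 1/alpha
```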
Chapter 11
Applications
In this chapter, we will discuss some applications of the theory we have developed so far in this class. Our explo-
ration will include deterministic control and the linear-quadratic regulator, stochastic control and the policy gradient
algorithm, and support vector machines.
Example 214 (Vertical Rocket System). Consider a vertical rocket. Our goal is to maximize its
height by time T . Let x1,t denote its height, x2,t denote its vertical velocity, and x3,t denote the weight of the rocket
(which we will approximate as the weight of the fuel), all at time t. The weight of the fuel will go down over time,
and that can affect the rocket’s velocity. The forces pushing the rocket down are drag and gravity, and the upward force
comes from the rocket’s thrust.
The forces at time t have the following expressions. (Here ẋ is the time-derivative of x.)
• Drag: c_D ρ(x_{1,t}) x_{2,t}², where c_D is a numerical constant and ρ(x_1) is the density of the air at height x_1.
Given the input u_t = −ẋ_{3,t}, we want to write our dynamical system in the standard form for continuous dynamics, i.e., ~x˙_t = f~(~x_t, u_t); the components of f~ follow from the force expressions above.
Now we have a dynamical system of the form ~x˙ t = f~(~xt , ut ). Recall that we want to maximize the height of the rocket,
x1,T , at some terminal time T > 0, given an initial condition ~x0 = ξ~ for some ξ~ ∈ R3 . We can set up an optimization
problem to determine the (ut )t∈[0,T ] which accomplishes this:
  max_{(u_t)_{t∈[0,T]}} x_{1,T}
  s.t. ~x˙_t = f~(~x_t, u_t), ∀t ∈ [0, T],
       ~x_0 = ξ~.
In practice, we can use a numerical solver to solve this problem; conceptually one can solve many such systems by hand
using the so-called calculus of variations, which is a sort of infinite-dimensional optimization paradigm which we do
not explore more in this course.
Even though our dynamics are continuous-time and complex, we can discretize and locally linearize our system, obtaining an approximate system which is discrete linear time-invariant, i.e., of the form

  ~x_{k+1} = A~x_k + B~u_k, ∀k ∈ {0, ..., K − 1},

where A and B are matrices of the appropriate sizes. This is a linear system because it is linear in (~x_k)_{k=0}^K and (~u_k)_{k=0}^{K−1}, which we conceptually think of as very long (but finite-length) vectors. It is time-invariant because the matrices A and B do not depend on the discrete-time index k.
This particular type of control problem is ubiquitous within science and engineering, and thus deserves a special name: it is called the linear quadratic regulator (LQR) problem. A standard form is

  min_{(~x_k), (~u_k)} (1/2) Σ_{k=0}^{K−1} (~x_k⊤Q~x_k + ~u_k⊤R~u_k) + (1/2)~x_K⊤Q_f~x_K  (11.5)
  s.t. ~x_{k+1} = A~x_k + B~u_k, ∀k ∈ {0, ..., K − 1},
       ~x_0 = ξ~,

where Q, Q_f ⪰ 0 and R ≻ 0. Equation (11.5) is actually a quadratic program, since the objective is quadratic in the variables (~x_k)_{k=0}^K and (~u_k)_{k=0}^{K−1}, and the constraints are linear. We are able to solve this with the methods we already know. However, this problem is very large, having (K + 1)n + Km variables (as well as K + 1 constraints), and for n, m, K large this quickly becomes intractable. Our saving grace is that this problem has significant additional structure. The traditional way to solve it is using the dynamic programming approach and Bellman's equation. However, in this section, we will solve it using the KKT conditions and the Riccati equation.
We will show that the optimal inputs are given by

  ~u⋆_k = −R^{-1}B⊤(I + P_{k+1}BR^{-1}B⊤)^{-1}P_{k+1}A~x⋆_k, ∀k ∈ {0, ..., K − 1},  (11.6)
where the matrices P_k are defined by the backwards recursion

  P_K = Q_f,  (11.7)
  P_k = A⊤(I + P_{k+1}BR^{-1}B⊤)^{-1}P_{k+1}A + Q, ∀k ∈ {0, ..., K − 1}.  (11.8)
Since the objective function of Equation (11.5) is convex and the constraints are affine, Slater’s condition holds auto-
matically. Thus strong duality holds. Since Equation (11.5) is a convex problem and strong duality holds, the KKT
conditions are necessary and sufficient for global optimality.
Let ((~x⋆_k)_{k=0}^K, (~u⋆_k)_{k=0}^{K−1}, (~λ⋆_k)_{k=1}^K, ~ν⋆) be globally optimal primal and dual variables for Equation (11.5), hence satisfying the KKT conditions. Then we have
1. Primal feasibility:
  ~x⋆_0 = ξ~, and ~x⋆_{k+1} = A~x⋆_k + B~u⋆_k, ∀k ∈ {0, ..., K − 1}.

(There are no inequality constraints, so dual feasibility and complementary slackness hold vacuously.)

4. Lagrangian stationarity:
  R~u⋆_k + B⊤~λ⋆_{k+1} = ~0, ∀k ∈ {0, ..., K − 1},
  ~λ⋆_k = Q~x⋆_k + A⊤~λ⋆_{k+1}, ∀k ∈ {1, ..., K − 1},
  ~λ⋆_K = Q_f~x⋆_K.
However, these update dynamics for ~λ⋆_k go backwards in time, from k = K to k = 1. This is in contrast to the update dynamics for ~x_k, which go forwards in time.

Once we find the ~λ⋆_k, we are able to find the ~u⋆_k, since a stationarity equation gives

  ~u⋆_k = −R^{-1}B⊤~λ⋆_{k+1}, ∀k ∈ {0, ..., K − 1}.

This motivates solving for ~λ⋆_k, which we do via (backwards) induction. Our induction hypothesis is of the form

  ~λ⋆_k = P_k~x⋆_k,  (11.14)
which we aim to show for k ∈ {1, ..., K}. The base case is k = K, whence we have ~λ⋆_K = Q_f~x⋆_K, so that P_K = Q_f. For the inductive step, for k ∈ {1, ..., K − 1}, a chain of computations (substituting the stationarity conditions, the induction hypothesis, and the dynamics) culminates in

  ~λ⋆_k = P_k~x⋆_k.  (11.24)
Above, to show that (11.20) follows from (11.19), we need to confirm that I + P_{k+1}BR^{-1}B⊤ is invertible. To show this, we will explicitly construct its inverse in terms of the inverses of other matrices which we know are invertible. First, observe that R + B⊤P_{k+1}B is invertible, since it is symmetric positive definite: R is symmetric positive definite and B⊤P_{k+1}B is symmetric positive semidefinite, so their sum is symmetric positive definite. Next, we claim that (I + P_{k+1}BR^{-1}B⊤)^{-1} = I − P_{k+1}B(R + B⊤P_{k+1}B)^{-1}B⊤. This follows from the Sherman-Morrison-Woodbury identity, but for the sake of completeness we prove it here. Indeed,

  (I + P_{k+1}BR^{-1}B⊤)(I − P_{k+1}B(R + B⊤P_{k+1}B)^{-1}B⊤)  (11.25)
  = I + P_{k+1}BR^{-1}B⊤ − (I + P_{k+1}BR^{-1}B⊤)P_{k+1}B(R + B⊤P_{k+1}B)^{-1}B⊤  (11.26)
  = I + P_{k+1}BR^{-1}B⊤ − P_{k+1}B(R + B⊤P_{k+1}B)^{-1}B⊤ − P_{k+1}BR^{-1}B⊤P_{k+1}B(R + B⊤P_{k+1}B)^{-1}B⊤  (11.27)
  = I + P_{k+1}BR^{-1}B⊤ − P_{k+1}BR^{-1}R(R + B⊤P_{k+1}B)^{-1}B⊤ − P_{k+1}BR^{-1}B⊤P_{k+1}B(R + B⊤P_{k+1}B)^{-1}B⊤  (11.28)
  = I + P_{k+1}BR^{-1}B⊤ − P_{k+1}BR^{-1}(R + B⊤P_{k+1}B)(R + B⊤P_{k+1}B)^{-1}B⊤  (11.29)
  = I + P_{k+1}BR^{-1}B⊤ − P_{k+1}BR^{-1}B⊤  (11.30)
  = I.  (11.31)
This confirms that I + P_{k+1}BR^{-1}B⊤ is invertible. Thus we have for k ∈ {0, ..., K − 1} that

  ~u⋆_k = −R^{-1}B⊤(I + P_{k+1}BR^{-1}B⊤)^{-1}P_{k+1}A~x⋆_k,

as desired.
Now, note that we have written our recurrence for Pk in terms of Pk+1 . Starting from PK , we can compute each
Pk in a backwards order, completely offline (i.e., without processing any iterations of the forward system). Once we
have the Pk , we can then compute each ~xk , ~λk , and ~uk directly. Therefore, we have a way to solve the LQR problem
just by using matrix multiplication.
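A numpy sketch of the full procedure: the backward Riccati recursion (11.7)-(11.8) followed by a forward rollout applying the optimal inputs (11.6). The system matrices, horizon, and costs are our own toy choices:

```python
import numpy as np

n_x, n_u, K = 2, 1, 20
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, Qf, R = np.eye(n_x), np.eye(n_x), np.eye(n_u)
Rinv = np.linalg.inv(R)

# Backward pass: P_K = Qf, P_k = A^T (I + P_{k+1} B R^{-1} B^T)^{-1} P_{k+1} A + Q.
P = [None] * (K + 1)
P[K] = Qf
for k in range(K - 1, -1, -1):
    M = np.linalg.inv(np.eye(n_x) + P[k + 1] @ B @ Rinv @ B.T) @ P[k + 1]
    P[k] = A.T @ M @ A + Q

# Forward pass: u_k = -R^{-1} B^T (I + P_{k+1} B R^{-1} B^T)^{-1} P_{k+1} A x_k.
x = np.array([1.0, 0.0])
for k in range(K):
    M = np.linalg.inv(np.eye(n_x) + P[k + 1] @ B @ Rinv @ B.T) @ P[k + 1]
    u = -Rinv @ B.T @ M @ A @ x
    x = A @ x + B @ u
print(x)   # the state is driven toward the origin
```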
Hard-Margin SVM

First, let us work through the case that the data are strictly linearly separable; that is, there exists some (~w, b) such that y_i g_{~w,b}(~x_i) > 0 for all i. This hypothesis is unreasonable in many cases, but it allows us to build intuition for the problem. We will later remove this hypothesis, but many tools remain the same.

In this case, we would like to find one (~w, b) pair for which y_i g_{~w,b}(~x_i) > 0 for all i. However, there can be many such pairs. Thus, we have to determine which one we would like to pick.
One possible heuristic¹ is to pick a pair (~w, b) with the largest margin, that is, the largest distance between the hyperplane H_{~w,b} and the closest point to it. Thus we want to solve the problem:

  max_{~w∈Rd, b∈R} min_{i∈{1,...,n}} dist(H_{~w,b}, ~x_i)  (11.34)
  s.t. y_i g_{~w,b}(~x_i) > 0, ∀i ∈ {1, ..., n}.
This problem is called the hard-margin SVM, so named because we do not allow any misclassification of the training
points — no training data can “cross the margin.”
Unfortunately, the problem in Equation (11.34) seems intractable. Let us go about simplifying it. First, the distance between a point ~x and H_{~w,b} is defined as

  dist(H_{~w,b}, ~x) = |~w⊤~x − b| / ‖~w‖₂ = |g_{~w,b}(~x)| / ‖~w‖₂.  (11.35)
¹This heuristic makes sense from the perspective of robustness and generalization. We haven’t sampled all the data that exists, but we hypothesize
that the data we haven’t sampled is geometrically close to the data we have sampled. We want to make sure that the largest fraction possible of all
data (sampled and unsampled) is correctly classified by the w ~ and b we learn. Thus, to capture the most unsampled data possible, we require the
classification to be as robust to these geometric deviations as possible.
Substituting (11.35) into (11.34), we obtain

  max_{~w∈Rd, b∈R} min_{i∈{1,...,n}} |g_{~w,b}(~x_i)| / ‖~w‖₂
  s.t. y_i g_{~w,b}(~x_i) > 0, ∀i ∈ {1, ..., n}.
Adding the real-valued slack variable s to denote the minimizing component, we have

  max_{~w∈Rd, b∈R, s∈R₊₊} s / ‖~w‖₂  (11.37)
  s.t. y_i g_{~w,b}(~x_i) > 0, ∀i ∈ {1, ..., n},
       |g_{~w,b}(~x_i)| ≥ s, ∀i ∈ {1, ..., n}.
Since y_i ∈ {−1, +1} and s > 0, we have that for any scalar u_i, |u_i| ≥ s and y_iu_i > 0 if and only if y_iu_i ≥ s. This relation certainly holds for u_i = g_{~w,b}(~x_i), and so the above problem simplifies to the following:

  max_{~w∈Rd, b∈R, s∈R₊₊} s / ‖~w‖₂  (11.38)
  s.t. y_i g_{~w,b}(~x_i) ≥ s, ∀i ∈ {1, ..., n}.
Now we use the scale-invariance of the problem: since s / ‖~w‖₂ = 1 / ‖~w/s‖₂ and g_{~w,b}(~x_i) ≥ s is equivalent to g_{~w/s,b/s}(~x_i) ≥ 1, we may divide through by s. After renaming ~w/s ↦ ~w and b/s ↦ b, the problem becomes

  max_{~w∈Rd, b∈R} 1 / ‖~w‖₂
  s.t. y_i g_{~w,b}(~x_i) ≥ 1, ∀i ∈ {1, ..., n}.
Finally, we can obtain an equivalent problem which is a convex minimization problem by applying the transformation x ↦ 1/(2x²), which is monotonically decreasing on R₊₊, to the objective function, whence (after expanding the form of g_{~w,b}) we obtain the problem

  min_{~w∈Rd, b∈R} (1/2)‖~w‖₂²  (11.41)
  s.t. y_i(~w⊤~x_i − b) ≥ 1, ∀i ∈ {1, ..., n}.
The problem (11.41) is by far the most common and simplified form of the hard-margin SVM problem. Moreover, it is a quadratic program in (~w, b), which we know how to solve.²
²It may seem natural to make gw,b
~ more complex and thus obtain a recipe for learning more complicated types of classifiers, but note that the
above simplification no longer holds if we do, so the final problem may be much more complex and not efficiently solvable.
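A minimal sketch of solving the hard-margin QP (11.41) with the cvxpy modeling library (not part of the course tooling); the toy data here is our own and is generated to be comfortably separable:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X_pos = 0.5 * rng.standard_normal((20, 2)) + np.array([2.0, 2.0])   # class +1
X_neg = 0.5 * rng.standard_normal((20, 2)) - np.array([2.0, 2.0])   # class -1
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(20), -np.ones(20)])

w, b = cp.Variable(2), cp.Variable()
cons = [cp.multiply(y, X @ w - b) >= 1]         # y_i (w^T x_i - b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), cons)
prob.solve()

lam = cons[0].dual_value                        # KKT multipliers lambda_i
print("support vectors:", np.sum(lam > 1e-6))   # typically only a handful
```

Inspecting the dual variables returned by the solver previews the support-vector phenomenon analyzed later via the KKT conditions: most λ_i are zero, and only the points on the margin carry positive multipliers.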
Soft-Margin SVM

Most of our real-world data isn't actually linearly separable. In that case, our hard-margin SVM problem would simply be infeasible. But maybe we still want to try to separate the points, even if our classifier is not perfect on the training data. To develop this relaxed SVM problem, let us consider the hard-margin SVM:
  min_{~w∈Rd, b∈R} (1/2)‖~w‖₂²  (11.42)
  s.t. y_i g_{~w,b}(~x_i) ≥ 1, ∀i ∈ {1, ..., n}.
It is a constrained optimization problem, so we can reformulate it into an unconstrained problem using indicator variables:

  min_{~w∈Rd, b∈R} ( (1/2)‖~w‖₂² + Σ_{i=1}^n ℓ_{0−∞}(1 − y_i g_{~w,b}(~x_i)) ),  (11.43)

where ℓ_{0−∞}(z) = 0 for z ≤ 0 and +∞ otherwise. To obtain a tractable relaxation, we replace the infinite penalty ℓ_{0−∞} with the finite penalty z ↦ C max{0, z} for a parameter C > 0, yielding

  min_{~w∈Rd, b∈R} (1/2)‖~w‖₂² + C Σ_{i=1}^n max{0, 1 − y_i g_{~w,b}(~x_i)}.
We can introduce some slack variables ξ~ to model this maximization term and make the problem differentiable. After expanding the form of g_{~w,b}, we have the program

  min_{~w∈Rd, b∈R, ξ~∈Rn} (1/2)‖~w‖₂² + C Σ_{i=1}^n ξ_i  (11.48)
  s.t. y_i(~w⊤~x_i − b) ≥ 1 − ξ_i, ∀i ∈ {1, ..., n},
       ξ_i ≥ 0, ∀i ∈ {1, ..., n}.
This is the usual form of the soft margin SVM problem, and the solutions are parameterized by C.
In accordance with our derivation, if C is large then we allow only small violations to the margin, because the
second term becomes a better approximation of the sum of indicators in Equation (11.43). If C is small, then we allow
larger violations to the margin.
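A sketch of the soft-margin problem in cvxpy, using the equivalent hinge form ξ_i = max{0, 1 − y_i(~w⊤~x_i − b)}; the overlapping toy data and the value of C are our own choices:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
shift = np.outer(np.repeat([1.0, -1.0], 20), [1.0, 1.0])
X = rng.standard_normal((40, 2)) + shift        # overlapping classes
y = np.repeat([1.0, -1.0], 20)

w, b, C = cp.Variable(2), cp.Variable(), 1.0
slack = cp.pos(1 - cp.multiply(y, X @ w - b))   # optimal xi_i at any (w, b)
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(slack)))
prob.solve()

acc = np.mean(np.sign(X @ w.value - b.value) == y)
print("training accuracy:", acc)   # high, but typically below 1 on overlapping data
```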
Depending on the perspective, either the first term or the second term of the loss can be viewed as the regularizer.
The first term works to maximize the margin, while the second term works to penalize the margin violations. Both
work together to form an approximate maximum-margin classifier.
KKT Conditions
It turns out that we get significant insight into the solutions to the hard-margin and soft-margin SVM using the KKT conditions. First, let us consider the hard-margin SVM:

  min_{~w∈Rd, b∈R} (1/2)‖~w‖₂²  (11.51)
  s.t. y_i(~w⊤~x_i − b) ≥ 1, ∀i ∈ {1, ..., n}.
The Lagrangian is

  L(~w, b, ~λ) = (1/2)‖~w‖₂² + Σ_{i=1}^n λ_i(1 − y_i(~w⊤~x_i − b)).  (11.52)
This problem is convex, and the constraints are affine, so if the problem is feasible (i.e., the data are strictly linearly
separable), then Slater’s condition holds so strong duality holds. Since the problem is convex and strong duality holds,
the KKT conditions are necessary and sufficient for optimality.
Suppose that (~w⋆, b⋆, ~λ⋆) satisfy the KKT conditions. In particular, stationarity gives

  ~w⋆ = Σ_{i=1}^n λ⋆_i y_i ~x_i,  and  Σ_{i=1}^n λ⋆_i y_i = 0.
We say that (~x_i, y_i) is a support vector if λ⋆_i > 0. To see why, we consider the following cases:

1. If λ⋆_i = 0 then, since ~w⋆ = Σ_{i=1}^n λ⋆_i y_i ~x_i, we see that (~x_i, y_i) does not contribute to the optimal solution.

2. If λ⋆_i > 0 then by complementary slackness we have y_i((~w⋆)⊤~x_i − b⋆) = 1. Thus ~x_i is on the margin of the SVM. Furthermore, since ~w⋆ = Σ_{i=1}^n λ⋆_i y_i ~x_i, we see that (~x_i, y_i) does contribute to the optimal solution.
Now let us consider the analogous notion for soft-margin SVMs. Consider the soft-margin SVM problem:

  min_{~w∈Rd, b∈R, ξ~∈Rn} (1/2)‖~w‖₂² + C Σ_{i=1}^n ξ_i  (11.53)
  s.t. y_i(~w⊤~x_i − b) ≥ 1 − ξ_i, ∀i ∈ {1, ..., n},
       ξ_i ≥ 0, ∀i ∈ {1, ..., n}.
It has Lagrangian

  L(~w, b, ξ~, ~λ, ~μ) = (1/2)‖~w‖₂² + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n μ_i(−ξ_i) + Σ_{i=1}^n λ_i(1 − y_i(~w⊤~x_i − b) − ξ_i).  (11.56)
This problem is convex, and the constraints are affine; one can show that it is always feasible, so Slater’s condition
holds and strong duality holds. Since the problem is convex with strong duality, the KKT conditions are both necessary
and sufficient for optimality.
Suppose that (~w⋆, b⋆, ξ~⋆, ~λ⋆, ~μ⋆) satisfy the KKT conditions. In particular, stationarity gives

  ~w⋆ = Σ_{i=1}^n λ⋆_i y_i ~x_i,  Σ_{i=1}^n λ⋆_i y_i = 0,  and  λ⋆_i + μ⋆_i = C for each i ∈ {1, ..., n}.
Similarly to the case of the hard-margin SVM, we say that (~x_i, y_i) is a support vector if λ⋆_i > 0. To see why, we consider the following cases:

1. If λ⋆_i = 0 then μ⋆_i = C. Thus μ⋆_i > 0, so ξ⋆_i = 0. Thus the point ~x_i does not violate the margin. Also, since ~w⋆ = Σ_{i=1}^n λ⋆_i y_i ~x_i, we see that (~x_i, y_i) does not contribute to the optimal solution.

2. If λ⋆_i = C then μ⋆_i = 0. Thus we cannot say anything about ξ⋆_i. But since λ⋆_i > 0, the other complementary slackness condition says that y_i((~w⋆)⊤~x_i − b⋆) = 1 − ξ⋆_i. Thus (~x_i, y_i) is either on the margin or violates the margin. Also, since ~w⋆ = Σ_{i=1}^n λ⋆_i y_i ~x_i, we see that (~x_i, y_i) does contribute to the optimal solution.

3. If λ⋆_i ∈ (0, C), then μ⋆_i ∈ (0, C) as well. Thus by complementary slackness, we have ξ⋆_i = 0. Applying this to the other complementary slackness condition, we have y_i((~w⋆)⊤~x_i − b⋆) = 1. Thus (~x_i, y_i) is exactly on the margin. Also, since ~w⋆ = Σ_{i=1}^n λ⋆_i y_i ~x_i, we see that (~x_i, y_i) does contribute to the optimal solution.

In general, the support vectors contribute to the optimal solution, and they are on or violate the margin.
Bibliography
[1] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[2] G. Calafiore and L. El Ghaoui, Optimization Models. Cambridge University Press, 2014.
[3] C. C. Pugh, Real Mathematical Analysis. Springer, 2002, vol. 2011.
[4] P. Varaiya et al., Lecture notes on optimization. Unpublished manuscript, University of California, Department of
Electrical Engineering and Computer Science, 1998.
[5] D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–
334, 1997.
[6] G. Garrigos and R. M. Gower, “Handbook of convergence theorems for (stochastic) gradient methods,” arXiv
preprint arXiv:2301.11235, 2023.
[7] Y. Nesterov et al., Lectures on convex optimization. Springer, 2018, vol. 137.