
EECS 127/227AT

Optimization Models in Engineering

Course Reader
Spring 2024

Acknowledgements
This reader is based on lectures from Spring 2021, Fall 2022, Spring 2023, and Fall 2023 iterations of EECS 127/227A
by Prof. Gireeja Ranade. The reader was mostly written by Spring 2023 GSIs Druv Pai, Arwa Alanqary, and Aditya
Ramabadran, and reviewed by Prof. Ranade. Fall 2022 tutor Jeffrey Wu collaborated with Druv on a writeup about the
Eckart-Young theorem which was folded into the reader. Contributions from Prof. Venkat Anantharam and Chih-Yuan
Chiu were also added in Fall 2023.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission.
Contents

1 Introduction
  1.1 What is Optimization?
  1.2 Least Squares
  1.3 Solution Concepts and Notation
  1.4 (OPTIONAL) Infimum Versus Minimum

2 Linear Algebra Review
  2.1 Norms
  2.2 Gram-Schmidt and QR Decomposition
  2.3 Fundamental Theorem of Linear Algebra
  2.4 Symmetric Matrices
  2.5 Principal Component Analysis
  2.6 Singular Value Decomposition
  2.7 Low-Rank Approximation
  2.8 (OPTIONAL) Block Matrix Identities

3 Vector Calculus
  3.1 Gradient, Jacobian, and Hessian
  3.2 Taylor's Theorems
  3.3 The Main Theorem
  3.4 Directional Derivatives
  3.5 (OPTIONAL) Matrix Calculus

4 Linear and Ridge Regression
  4.1 Impact of Perturbations on Linear Regression
  4.2 Ridge Regression
  4.3 Principal Components Regression
  4.4 Tikhonov Regression
  4.5 Maximum Likelihood Estimation (MLE)
  4.6 Maximum A Posteriori Estimation (MAP)

5 Convexity
  5.1 Convex Sets
  5.2 Convex Functions
  5.3 Convex Optimization Problems
  5.4 Solving Convex Optimization Problems
  5.5 Problem Transformations and Reparameterizations

6 Gradient Descent
  6.1 Strong Convexity and Smoothness
  6.2 Gradient Descent
  6.3 Variations: Stochastic Gradient Descent
  6.4 Variations: Gradient Descent for Constrained Optimization

7 Duality
  7.1 Lagrangian
  7.2 Weak Duality
  7.3 Strong Duality
  7.4 Karush-Kuhn-Tucker (KKT) Conditions
  7.5 (OPTIONAL) Conic Duality

8 Types of Optimization Problems
  8.1 Linear Programs
  8.2 Quadratic Programs
  8.3 Quadratically-Constrained Quadratic Programs
  8.4 Second-Order Cone Programs
  8.5 Semidefinite Programming
  8.6 General Taxonomy

9 Regularization and Sparsity
  9.1 Recapping Ridge Regression and Defining LASSO
  9.2 Understanding the Difference Between the $\ell_2$-Norm and the $\ell_1$-Norm
  9.3 Analysis of LASSO Regression
  9.4 Geometry of LASSO Regression

10 Advanced Descent Methods
  10.1 Coordinate Descent
  10.2 Newton's Method
  10.3 Newton's Method with Linear Equality Constraints
  10.4 (OPTIONAL) Interior Point Method

11 Applications
  11.1 Deterministic Control and Linear-Quadratic Regulator
  11.2 Support Vector Machines
Chapter 1

Introduction

Relevant sections of the textbooks:

• [1] Chapter 1.

• [2] Chapter 1.

1.1 What is Optimization?


Try to see what the following “problems” have in common.

• A statistical model, such as a neural network, trains using finite data samples.

• A robot learns a strategy using the environment, so that it does what you want.

• A major gas company decides what mixture of different fuels to process in order to get maximum profit.

• The EECS department decides how to set class sizes in order to maximize the number of credits offered subject
to budget constraints.

While it might seem that these four examples are very distinct, they can all be formulated as minimizing an objective
function over a feasible set. Thus, they can all be put into the framework of optimization.
To develop the basics of optimization, including precisely defining an objective function and a feasible set, we use
some motivating examples from the third and fourth “problems”. (The first and second “problems” will be discussed
at the very end of the course.)

Example 1 (Oil and Gas). Say that we are a gas company with $10^5$ barrels of crude oil that we must refine by an expiration date. There are two refineries: one which processes crude oil into jet fuel, and one which processes crude oil into gasoline. We can sell a barrel of jet fuel to consumers for \$0.10, while we can sell a barrel of gasoline for \$0.20. So, letting $x_1$ be a variable denoting the number of barrels of jet fuel produced, and $x_2$ be a variable denoting the number of barrels of gasoline produced, we aim to solve the problem:

$$\max_{x_1, x_2} \quad \frac{1}{10}x_1 + \frac{1}{5}x_2 \tag{1.1}$$
$$\text{s.t.} \quad x_1 \geq 0, \quad x_2 \geq 0, \quad x_1 + x_2 = 10^5.$$

That is, we aim to choose $x_1$ and $x_2$ which maximize the objective function $\frac{1}{10}x_1 + \frac{1}{5}x_2$, but with the caveat that they must obey the constraints $x_1 \geq 0$, $x_2 \geq 0$, and $x_1 + x_2 = 10^5$. The feasible set is the set of all $(x_1, x_2)$ pairs which obey the constraints. As you may have noticed, constraints can be equalities or inequalities in the $x_i$, which we formalize shortly.

The solution to this problem can be seen to be $(x_1^\star, x_2^\star) = (0, 10^5)$, which corresponds to refining all the crude oil into gasoline. This makes sense – after all, gasoline sells for more! And with all else equal between gasoline and jet fuel, to maximize our profit, we just need to produce gasoline.

To model another constraint, say that we need at least $10^3$ barrels of jet fuel and $5 \cdot 10^2$ barrels of gasoline, we can directly incorporate these demands into the constraint set:

$$\max_{x_1, x_2} \quad \frac{1}{10}x_1 + \frac{1}{5}x_2 \tag{1.2}$$
$$\text{s.t.} \quad x_1 \geq 0, \quad x_2 \geq 0, \quad x_1 \geq 10^3, \quad x_2 \geq 5 \cdot 10^2, \quad x_1 + x_2 = 10^5.$$

We then notice that $x_1 \geq 0$ is made redundant by the constraint $x_1 \geq 10^3$. That is, every pair $(x_1, x_2)$ which satisfies $x_1 \geq 10^3$ automatically satisfies $x_1 \geq 0$. Thus, we can eliminate the constraint $x_1 \geq 0$, since the remaining constraints define the same feasible set. We can do the same thing for the constraints $x_2 \geq 0$ and $x_2 \geq 5 \cdot 10^2$, the latter making the former redundant. Thus, we can simplify the above problem to include only the non-redundant constraints:

$$\max_{x_1, x_2} \quad \frac{1}{10}x_1 + \frac{1}{5}x_2 \tag{1.3}$$
$$\text{s.t.} \quad x_1 \geq 10^3, \quad x_2 \geq 5 \cdot 10^2, \quad x_1 + x_2 = 10^5.$$

Let's say that we want to incorporate one final business need. Before, we were modeling that transporting the oil to the refineries is free, since we don't have an objective or constraint term which involves this cost. Now, let us say that we can transport a total of $2 \cdot 10^6$ "barrel-miles" – that is, the number of barrels times the number of miles we can transport is no greater than $2 \cdot 10^6$. Let us further say that the jet fuel refinery is 10 miles away from the crude oil storage, and the gasoline refinery is 30 miles away from the crude oil storage. We can incorporate this further constraint into the constraint set directly:

$$\max_{x_1, x_2} \quad \frac{1}{10}x_1 + \frac{1}{5}x_2 \tag{1.4}$$
$$\text{s.t.} \quad x_1 \geq 10^3, \quad x_2 \geq 5 \cdot 10^2, \quad 10x_1 + 30x_2 \leq 2 \cdot 10^6, \quad x_1 + x_2 = 10^5.$$

This is a good first problem; we have a non-trivial objective function, non-trivial inequality and equality constraints, and even got to work with manipulating constraints (so as to remove redundant ones)!


This type of optimization problem is called a linear program. We will learn more about how to formulate and solve
linear programs later in the course.
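To make this concrete, here is a minimal sketch of how one could solve problem (1.4) numerically with scipy.optimize.linprog (assuming numpy and scipy are available; since linprog minimizes, we negate the objective):

```python
import numpy as np
from scipy.optimize import linprog

# Problem (1.4): maximize (1/10) x1 + (1/5) x2, i.e. minimize the negation.
c = np.array([-0.1, -0.2])

# Inequality constraint: 10 x1 + 30 x2 <= 2e6 (barrel-miles budget).
A_ub = np.array([[10.0, 30.0]])
b_ub = np.array([2e6])

# Equality constraint: x1 + x2 = 1e5 (all crude oil must be refined).
A_eq = np.array([[1.0, 1.0]])
b_eq = np.array([1e5])

# Demand constraints x1 >= 1e3, x2 >= 5e2 become variable bounds.
bounds = [(1e3, None), (5e2, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x)     # optimal (x1, x2); here (5e4, 5e4)
print(-res.fun)  # optimal profit; here 1.5e4
```

The barrel-miles constraint is what prevents refining everything into gasoline; at the optimum it is active, splitting production evenly.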

A more generic reformulation of the above optimization problem is the following “standard form”.

Definition 2 (Standard Form of Optimization Problem)

We say that an optimization problem is written in standard form if it is of the form

$$\min_{\vec{x} \in \mathbb{R}^n} \quad f_0(\vec{x}) \tag{1.5}$$
$$\text{s.t.} \quad f_i(\vec{x}) \leq 0, \ \forall i \in \{1, \dots, m\}$$
$$\quad\ \ \ h_j(\vec{x}) = 0, \ \forall j \in \{1, \dots, p\}.$$

Here:

• $\vec{x} \in \mathbb{R}^n$ is the optimization variable.

• $f_1, \dots, f_m$ and $h_1, \dots, h_p$ are functions $\mathbb{R}^n \to \mathbb{R}$.

• $f_0$ is the objective function.

• The $f_i$ are inequality constraint functions; the expression "$f_i(\vec{x}) \leq 0$" is an inequality constraint.

• Similarly, the $h_j$ are equality constraint functions, and the expression "$h_j(\vec{x}) = 0$" is an equality constraint.

• The feasible set, i.e., the set of all $\vec{x}$ that satisfy all constraints, is

$$\Omega \doteq \left\{ \vec{x} \in \mathbb{R}^n \ \middle|\ f_i(\vec{x}) \leq 0 \ \forall i \in \{1, \dots, m\}, \quad h_j(\vec{x}) = 0 \ \forall j \in \{1, \dots, p\} \right\}. \tag{1.6}$$

We can thus also write the problem (1.5) as

$$\min_{\vec{x} \in \Omega} f_0(\vec{x}). \tag{1.7}$$

• A solution to this optimization problem is any $\vec{x}^\star \in \Omega$ which attains the minimum value of $f_0(\vec{x})$ across all $\vec{x} \in \Omega$. Correspondingly, $\vec{x}^\star$ is also called a minimizer of $f_0$ over $\Omega$.

It’s perfectly fine if m = 0 (in which case there are no inequality constraints) and/or p = 0 (in which case there are no
equality constraints). If there are no constraints, then Ω = Rn and the problem is called unconstrained; otherwise it is
called constrained.
Let us try another example now, which has vector-valued quantities.

Example 3. Consider the following table of EECS courses:

Class   Size   Credits   Resources per Student
127     x1     c1        r1
126     x2     c2        r2
182     x3     c3        r3
189     x4     c4        r4
162     x5     c5        r5
188     x6     c6        r6
...     ...    ...       ...

Suppose there are $n$ classes in total. Let $\vec{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^\top \in \mathbb{R}^n$ be the decision variable, and let $\vec{c} = \begin{bmatrix} c_1 & c_2 & \cdots & c_n \end{bmatrix}^\top \in \mathbb{R}^n$ and $\vec{r} = \begin{bmatrix} r_1 & r_2 & \cdots & r_n \end{bmatrix}^\top \in \mathbb{R}^n$ be constants. Then, in order to maximize the total number of credit hours subject to a total resource budget $b$, we set up the linear program

$$\max_{\vec{x} \in \mathbb{R}^n} \quad \vec{c}^\top \vec{x} \tag{1.8}$$
$$\text{s.t.} \quad \vec{r}^\top \vec{x} \leq b, \quad x_i \geq 0, \ \forall i \in \{1, \dots, n\}.$$

As notation, instead of the last set of constraints $x_i \geq 0$, we can write the vector constraint $\vec{x} \geq \vec{0}$.
More generally, recall that if we have a vector equality constraint $\vec{h}(\vec{x}) = \vec{0}$, it can be viewed as short-hand for the several scalar equality constraints $h_1(\vec{x}) = 0, \dots, h_p(\vec{x}) = 0$. Correspondingly, we define the vector inequality constraint $\vec{f}(\vec{x}) \leq \vec{0}$ to be short-hand for the several scalar inequality constraints $f_1(\vec{x}) \leq 0, \dots, f_m(\vec{x}) \leq 0$.

1.2 Least Squares


We begin with one of the simplest optimization problems, that of least squares. We've probably seen this formulation before. Mathematically, we are given a data matrix $A \in \mathbb{R}^{m \times n}$ and a vector of outcomes $\vec{y} \in \mathbb{R}^m$, and attempt to find a parameter vector $\vec{x} \in \mathbb{R}^n$ which minimizes the residual $\|A\vec{x} - \vec{y}\|_2^2$. Here $\|\cdot\|_2$ is the standard Euclidean norm $\|\vec{z}\|_2 \doteq \sqrt{\vec{z}^\top \vec{z}} = \sqrt{\sum_{i=1}^n z_i^2}$; it is labeled with the 2 for a reason we will see later in the course.
More precisely, we attempt to solve the following optimization problem:

$$\min_{\vec{x} \in \mathbb{R}^n} \|A\vec{x} - \vec{y}\|_2^2. \tag{1.9}$$

Theorem 4 (Least Squares Solution)

Let $A \in \mathbb{R}^{m \times n}$ have full column rank, and let $\vec{y} \in \mathbb{R}^m$. Then the solution to (1.9), i.e., the solution to

$$\min_{\vec{x} \in \mathbb{R}^n} \|A\vec{x} - \vec{y}\|_2^2,$$

is given by
$$\vec{x}^\star = (A^\top A)^{-1} A^\top \vec{y}. \tag{1.10}$$

Proof. The idea is to find $A\vec{x} \in \mathcal{R}(A)$ which is closest to $\vec{y}$. Here $\mathcal{R}(A)$ is the range, or column space, or column span, of $A$. In general, we have no guarantee that $\vec{y} \in \mathcal{R}(A)$, so there is not necessarily an $\vec{x}$ such that $A\vec{x} = \vec{y}$. Instead, we are finding an approximate solution to the equation $A\vec{x} = \vec{y}$.


Recall that $\mathcal{R}(A)$ is a subspace, and that $\vec{y}$ itself may not belong to $\mathcal{R}(A)$. Thus we can visualize the geometry of the problem as the following picture:

[Figure: the vector $\vec{y}$ sitting outside the subspace $\mathcal{R}(A)$.]

We can now solve this problem using ideas from geometry. We claim that the closest point to $\vec{y}$ contained in $\mathcal{R}(A)$ is the orthogonal projection of $\vec{y}$ onto $\mathcal{R}(A)$; call this point $\vec{z}$. Also, define $\vec{e} \doteq \vec{y} - \vec{z}$. This gives the following diagram.

[Figure: $\vec{z}$ is the orthogonal projection of $\vec{y}$ onto $\mathcal{R}(A)$, with residual $\vec{e} = \vec{y} - \vec{z}$ perpendicular to $\mathcal{R}(A)$.]

From this diagram, we see that $\vec{e}$ is orthogonal to any vector in $\mathcal{R}(A)$. But remember that we still have to prove that $\vec{z}$ is the closest point to $\vec{y}$ within $\mathcal{R}(A)$. To see this, consider any other point $\vec{u} \in \mathcal{R}(A)$ and define $\vec{v} \doteq \vec{y} - \vec{u}$. This gives the following diagram:

[Figure: an arbitrary point $\vec{u} \in \mathcal{R}(A)$, with $\vec{v} = \vec{y} - \vec{u}$.]

To complete our proof, we define $\vec{w} \doteq \vec{z} - \vec{u}$, noting that the angle $\vec{u} \to \vec{z} \to \vec{y}$ is a right angle; in other words, $\vec{w}$ and $\vec{e}$ are orthogonal. This gives the following picture.

[Figure: the right triangle with legs $\vec{w} = \vec{z} - \vec{u}$ and $\vec{e}$, and hypotenuse $\vec{v} = \vec{y} - \vec{u}$.]

By the Pythagorean theorem, we see that

$$\|\vec{y} - \vec{u}\|_2^2 = \|\vec{v}\|_2^2 \tag{1.11}$$
$$= \|\vec{w}\|_2^2 + \|\vec{e}\|_2^2 \tag{1.12}$$
$$= \underbrace{\|\vec{z} - \vec{u}\|_2^2}_{> 0} + \|\vec{e}\|_2^2 \tag{1.13}$$
$$> \|\vec{e}\|_2^2 \tag{1.14}$$
$$= \|\vec{y} - \vec{z}\|_2^2. \tag{1.15}$$

Therefore, $\vec{z}$ is the closest point to $\vec{y}$ within $\mathcal{R}(A)$.


Now, we want to find $\vec{z} \in \mathcal{R}(A)$, i.e., the orthogonal projection of $\vec{y}$ onto $\mathcal{R}(A)$, such that $\vec{e} = \vec{y} - \vec{z}$ is orthogonal to all vectors in $\mathcal{R}(A)$. By the definition of $\mathcal{R}(A)$, it's equivalent to find $\vec{x}^\star \in \mathbb{R}^n$ such that $\vec{y} - A\vec{x}^\star$ is orthogonal to all vectors in $\mathcal{R}(A)$. Since the columns of $A$ form a spanning set for $\mathcal{R}(A)$, it's equivalent to find $\vec{x}^\star \in \mathbb{R}^n$ such that $\vec{y} - A\vec{x}^\star$ is orthogonal to all columns of $A$. This implies

$$\vec{0} = A^\top (\vec{y} - A\vec{x}^\star) \tag{1.16}$$
$$= A^\top \vec{y} - A^\top A \vec{x}^\star \tag{1.17}$$
$$\implies A^\top A \vec{x}^\star = A^\top \vec{y} \tag{1.18}$$
$$\implies \vec{x}^\star = (A^\top A)^{-1} A^\top \vec{y}. \tag{1.19}$$

Here $A^\top A$ is invertible because $A$ has full column rank.
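As a quick numerical illustration (a minimal sketch assuming numpy; the data here is random), the normal-equation solution (1.19) matches numpy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))   # tall data matrix with full column rank
y = rng.standard_normal(50)

# Normal equations (1.18)-(1.19): solve (A^T A) x = A^T y.
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Library least-squares solver for comparison.
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # True
```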

We’ll conclude with a statistical application of least squares to linear regression. Suppose we are given data
(x1 , y1 ), . . . , (xn , yn ), and want to fit an affine model y = mx + b through these data points. This corresponds to
approximately solving the system
$$m x_1 + b = y_1$$
$$m x_2 + b = y_2$$
$$\vdots \tag{1.20}$$
$$m x_n + b = y_n.$$


Formulating it in terms of vectors and matrices, we have

$$\begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}. \tag{1.21}$$

In the case where the data is noisy or inconsistent with the model, as in the below figure, the linear system will be overdetermined and have no solutions. Then, we find an approximate solution – a line of best fit – via least squares on the above system.

[Figure: noisy data points in the $(x, y)$-plane together with the least-squares line of best fit.]
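Concretely, the line of best fit comes from applying least squares to (1.21) (a sketch assuming numpy; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(20)  # noisy line

# Build the matrix from (1.21): columns [x, 1].
A = np.column_stack([x, np.ones_like(x)])
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(m, b)  # close to the true slope 2.0 and intercept 0.5
```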

As a last note, solving least squares (and similar problems) is easy because it is a so-called convex problem. Convex
problems are easy to solve because any local optimum is a global optimum, which allows us to use a variety of simple
techniques to find global optima. It is generally much more difficult to solve non-convex problems, though we solve a
few during this course.
We discuss much more about convexity and convex problems later in the course.

1.3 Solution Concepts and Notation


Sometimes we assign values to our optimization problems. For example, in the framework of (1.5) we may write

$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} \quad f_0(\vec{x}) \tag{1.22}$$
$$\text{s.t.} \quad f_i(\vec{x}) \leq 0 \ \forall i \in \{1, \dots, m\}$$
$$\quad\ \ \ h_j(\vec{x}) = 0 \ \forall j \in \{1, \dots, p\}.$$

On the other hand, in the framework of (1.7) and using the definition of $\Omega$ in (1.6), we may write¹

$$p^\star = \min_{\vec{x} \in \Omega} f_0(\vec{x}). \tag{1.23}$$

This means that $p^\star \in \mathbb{R}$ is the minimum value of $f_0$ over all $\vec{x} \in \Omega$; formally,

$$p^\star \doteq \min_{\vec{x} \in \Omega} f_0(\vec{x}) = \min\{f_0(\vec{x}) \mid \vec{x} \in \Omega\}. \tag{1.24}$$

¹For the case where the minimum does not exist, but the infimum is finite, please see Section 1.4.


As an example, consider the two-element set $\Omega = \{0, 1\}$ and $f_0(x) = 3x^2 + 2$. Then $p^\star = \min\{f_0(0), f_0(1)\} = \min\{2, 5\} = 2$. We emphasize that $p^\star$ is a real number, not a vector.
To extract the minimizers, i.e., the points $\vec{x} \in \Omega$ which minimize $f_0(\vec{x})$, we use the argmin notation, which gives us the set of arguments which minimize our objective function. Formally, we define:

$$\operatorname*{argmin}_{\vec{x} \in \Omega} f_0(\vec{x}) \doteq \left\{ \vec{x} \in \Omega \ \middle|\ f_0(\vec{x}) = \min_{\vec{u} \in \Omega} f_0(\vec{u}) \right\} \tag{1.25}$$

We can thus write the set of solutions to (1.5) as

$$\operatorname*{argmin}_{\vec{x} \in \mathbb{R}^n} \quad f_0(\vec{x}) \tag{1.26}$$
$$\text{s.t.} \quad f_i(\vec{x}) \leq 0 \ \forall i \in \{1, \dots, m\}$$
$$\quad\ \ \ h_j(\vec{x}) = 0 \ \forall j \in \{1, \dots, p\}.$$

And, as just discussed, we can write the set of solutions to (1.7) as

$$\operatorname*{argmin}_{\vec{x} \in \Omega} f_0(\vec{x}). \tag{1.27}$$

We emphasize that the argmin is a set of vectors, any of which is an optimal solution, i.e., a minimizer, of the optimization problem at hand. It is possible for the argmin to contain 0 vectors (in which case the minimum value is not attained and the problem has no global optima), any positive number of vectors, or an infinite number of vectors.
Let us consider the same example as before. In particular, consider the two-element set $\Omega = \{0, 1\}$ and $f_0(x) = 3x^2 + 2$. Then $\operatorname{argmin}_{x \in \Omega} f_0(x) = \{0\}$. But, in different scenarios, the argmin can have zero elements; for example, if $f_0(x) = 3x$, then $\operatorname{argmin}_{x \in \mathbb{R}} f_0(x) = \emptyset$. And it can have multiple elements; for example, if $f_0(x) = 3x^2(x - 1)^2$, then $\operatorname{argmin}_{x \in \mathbb{R}} f_0(x) = \{0, 1\}$. It can even have infinitely many elements; for example, if $f_0(x) = 0$, then $\operatorname{argmin}_{x \in \mathbb{R}} f_0(x) = \mathbb{R}$.
Though we must keep in mind that technically argmin is a set, in the problems we study, it usually contains exactly one element. Thus, instead of writing, for example, $\vec{x}^\star \in \operatorname{argmin}_{\vec{x} \in \Omega} f_0(\vec{x})$, we may also write $\vec{x}^\star = \operatorname{argmin}_{\vec{x} \in \Omega} f_0(\vec{x})$. The former expression is technically more correct, but both usages are fine if – and only if – the argmin in question contains exactly one element.

1.4 (OPTIONAL) Infimum Versus Minimum


There is one remaining issue with our formulation, which we can conceptually consider as a “corner case”. What
happens if the minimum does not exist? This may seem like a very esoteric case, yet one can construct a straightforward
example, such as the following. We know that the minimum of any set of numbers must be contained in the set. But
what happens if we try to find the minimum of the open interval (0, 1)? For any x ∈ (0, 1) which we claim to be our
minimum, we see that $x/2$ is also contained in $(0, 1)$ and is smaller than $x$, which is a contradiction to our claim. Thus the set $(0, 1)$ has no minimum.
It seems like 0 is a useful notion of “minimum” for this set — that is, it’s the largest number which is ≤ all numbers
in the set, i.e., its “greatest lower bound” — but it isn’t contained in the set and thus cannot be the minimum. Fortunately,
this notion of greatest lower bound of a set is formalized in real analysis as the concept of an “infimum”, denoted inf.
For our purposes, we can think of the infimum as a generalization of the minimum which takes care of these corner
cases and always exists. When the minimum exists, it is always equal to the infimum.
Based on this discussion, we can write our optimization problems as

$$p^\star = \inf_{\vec{x} \in \mathbb{R}^n} \quad f_0(\vec{x}) \tag{1.28}$$
$$\text{s.t.} \quad f_i(\vec{x}) \leq 0 \ \forall i \in \{1, \dots, m\}$$
$$\quad\ \ \ h_j(\vec{x}) = 0 \ \forall j \in \{1, \dots, p\},$$

and

$$p^\star = \inf_{\vec{x} \in \Omega} f_0(\vec{x}). \tag{1.29}$$

However, the argmin retains the same definition. In fact, one can prove that if we replaced the min in the argmin
definition (1.25) with inf, that this “new” argmin would be exactly equivalent in every case to the “old” argmin, which
we use henceforth. The analogous quantity to infimum for maximization — that is, the appropriate generalization of
max — is the supremum, denoted sup.
Interested readers are encouraged to consult a real analysis textbook such as [3] for a more comprehensive coverage.
Though we have gone over the technical details here, for the rest of the course we will omit them for simplicity, and
stick to using min and max (meaning inf and sup when the minimum and maximum do not exist).

Chapter 2

Linear Algebra Review

Relevant sections of the textbooks:

• [1] Appendix A.

• [2] Chapters 2, 3, 4, 5.

2.1 Norms
2.1.1 Definitions
Definition 5 (Norm)
Let V be a vector space over R. A function f : V → R is a norm if:

• Positive definiteness: f (~x) ≥ 0 for all ~x ∈ V, and f (~x) = 0 if and only if ~x = ~0.

• Positive homogeneity: f (α~x) = |α| f (~x) for all α ∈ R and ~x ∈ V.

• Triangle inequality: f (~x + ~y ) ≤ f (~x) + f (~y ) for all ~x, ~y ∈ V.

We can check that the familiar Euclidean norm $\|\cdot\|_2 : \vec{x} \mapsto \sqrt{\sum_{i=1}^n x_i^2}$ satisfies these properties. A generalization of the Euclidean norm is the following very useful class of norms.

Definition 6 ($\ell_p$ Norms)

Let $1 \leq p < \infty$. The $\ell_p$-norm on $\mathbb{R}^n$ is given by

$$\|\vec{x}\|_p \doteq \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}. \tag{2.1}$$

The $\ell_\infty$-norm on $\mathbb{R}^n$ is given by

$$\|\vec{x}\|_\infty \doteq \max_{i \in \{1, \dots, n\}} |x_i|. \tag{2.2}$$

Example 7 (Examples of $\ell_p$ Norms).

(a) The Euclidean norm, given by $\|\vec{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$, is an $\ell_p$-norm for $p = 2$. (This is why we gave the subscript 2 to the Euclidean norm previously.)

(b) The $\ell_1$-norm is given by $\|\vec{x}\|_1 = \sum_{i=1}^n |x_i|$.

(c) The $\ell_\infty$-norm, given by $\|\vec{x}\|_\infty = \max_{i \in \{1, \dots, n\}} |x_i|$, is the limit of the $\ell_p$-norms as $p \to \infty$:

$$\|\vec{x}\|_\infty = \lim_{p \to \infty} \|\vec{x}\|_p. \tag{2.3}$$

We do not prove this here; it is left as an exercise.
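One can check these definitions numerically (a minimal sketch assuming numpy, whose np.linalg.norm implements the $\ell_p$-norms directly):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, 1))       # l1 norm: 8.0
print(np.linalg.norm(x, 2))       # l2 norm: sqrt(26)
print(np.linalg.norm(x, np.inf))  # l_inf norm: 4.0

# The l_p norms approach the l_inf norm as p grows, as in (2.3).
for p in [1, 2, 4, 8, 16, 32]:
    print(p, np.linalg.norm(x, p))
```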

2.1.2 Inequalities
There are a variety of useful inequalities associated with the $\ell_p$-norms. Before we state them, let us take a second to discuss the importance of inequalities for optimization.
A priori, it may not be clear why we should care about inequalities; why does it matter whether one arrangement of variables is always greater or less than another? It turns out that such inequalities give us upper and lower bounds on quantities of interest, which is exactly what we need to characterize minima and maxima, and hence to solve optimization problems.
With that out of the way, let us get to the first major inequality.

Theorem 8 (Cauchy-Schwarz Inequality)

For any $\vec{x}, \vec{y} \in \mathbb{R}^n$, we have
$$|\vec{x}^\top \vec{y}| \leq \|\vec{x}\|_2 \|\vec{y}\|_2. \tag{2.4}$$

Proof. Let $\theta$ be the angle between $\vec{x}$ and $\vec{y}$. We write

$$|\vec{x}^\top \vec{y}| = |\|\vec{x}\|_2 \|\vec{y}\|_2 \cos\theta| \tag{2.5}$$
$$= \|\vec{x}\|_2 \|\vec{y}\|_2 |\cos\theta| \tag{2.6}$$
$$\leq \|\vec{x}\|_2 \|\vec{y}\|_2. \tag{2.7}$$

This result is for the $\ell_2$-norm. A natural next question is whether we can generalize it to $\ell_p$-norms for $p \neq 2$. It turns out that we can, as we demonstrate shortly.

Theorem 9 (Hölder's Inequality)

Let $1 \leq p, q \leq \infty$ such that $\frac{1}{p} + \frac{1}{q} = 1$ (such pairs $(p, q)$ are called Hölder conjugates). Then for any $\vec{x}, \vec{y} \in \mathbb{R}^n$, we have

$$|\vec{x}^\top \vec{y}| \leq \sum_{i=1}^n |x_i y_i| \leq \|\vec{x}\|_p \|\vec{y}\|_q. \tag{2.8}$$

This inequality collapses to the Cauchy-Schwarz inequality when $p = q = 2$. The proof is out of scope for now since it uses convexity.


Example 10 (Dual Norms). Fix $\vec{y} \in \mathbb{R}^n$. Let us solve the problem:

$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_p \leq 1} \vec{x}^\top \vec{y}. \tag{2.9}$$

It is initially difficult to see how to proceed, so let us simplify the problem to get back onto familiar territory. We start with $p = 2$, so that the problem becomes:

$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_2 \leq 1} \vec{x}^\top \vec{y}. \tag{2.10}$$

For $n = 2$, the feasible set and $\vec{y}$ together look like the following:

[Figure: the $\ell_2$ unit ball (a disk) in the $(x_1, x_2)$-plane, with the vector $\vec{y}$.]

For any $\vec{x} \in \mathbb{R}^n$, with $\theta$ the angle between $\vec{x}$ and $\vec{y}$, we have

$$\vec{x}^\top \vec{y} = \|\vec{x}\|_2 \|\vec{y}\|_2 \cos\theta. \tag{2.11}$$

This term is maximized when $\cos\theta = 1$, or equivalently $\theta = 0$. Thus $\vec{x}$ and $\vec{y}$ must point in the same direction, i.e., $\vec{x}$ is a scalar multiple of $\vec{y}$. And since we want to maximize this dot product, we must choose $\vec{x}$ to maximize $\|\vec{x}\|_2$ subject to the constraint $\|\vec{x}\|_2 \leq 1$. Thus, we choose an $\vec{x}$ which has $\|\vec{x}\|_2 = 1$ and points in the same direction as $\vec{y}$. This gives $\vec{x}^\star = \vec{y} / \|\vec{y}\|_2$. Thus,

$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_2 \leq 1} \vec{x}^\top \vec{y} = (\vec{x}^\star)^\top \vec{y} = \left( \frac{\vec{y}}{\|\vec{y}\|_2} \right)^{\!\top} \vec{y} = \frac{\vec{y}^\top \vec{y}}{\|\vec{y}\|_2} = \frac{\|\vec{y}\|_2^2}{\|\vec{y}\|_2} = \|\vec{y}\|_2. \tag{2.12}$$

Now let us try $p = \infty$. The problem becomes

$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_\infty \leq 1} \vec{x}^\top \vec{y}. \tag{2.13}$$

The feasible set and $\vec{y}$ are given by the following diagram.

[Figure: the $\ell_\infty$ unit ball (a square) in the $(x_1, x_2)$-plane, with the vector $\vec{y}$.]

Motivated by this diagram, we see that the constraint $\|\vec{x}\|_\infty \leq 1$ is equivalent to the $2n$ constraints $-1 \leq x_i$ and $x_i \leq 1$. Also, writing out the objective function

$$\vec{x}^\top \vec{y} = \sum_{i=1}^n x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n, \tag{2.14}$$

we see that the problem is

$$\max_{\vec{x} \in \mathbb{R}^n} \quad (x_1 y_1 + x_2 y_2 + \cdots + x_n y_n) \tag{2.15}$$
$$\text{s.t.} \quad -1 \leq x_i \leq 1, \ \forall i \in \{1, \dots, n\}.$$

This problem has an interesting structure that will recur several times in the problems we discuss in this class. Namely, the objective function is a sum of terms, each of which involves only one $x_i$; and the constraints can be partitioned into groups, where each group constrains only one $x_i$. Thus, this problem is separable into $n$ different scalar problems, such that the optimal solutions of the scalar problems together form an optimal solution for the vector problem. Namely, the scalar problems are

$$\max_{x_i \in \mathbb{R}, \ -1 \leq x_i \leq 1} x_i y_i. \tag{2.16}$$

We solve this much simpler problem by hand. If $y_i > 0$ then $x_i^\star = 1$; on the other hand, if $y_i \leq 0$ then $x_i^\star = -1$. To summarize, $x_i^\star = \operatorname{sgn}(y_i)$, so that $x_i^\star y_i = |y_i|$.
Putting all the scalar problems together, we see that $\vec{x}^\star = \operatorname{sgn}(\vec{y})$, and the vector problem's optimal value is given by

$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_\infty \leq 1} \vec{x}^\top \vec{y} = (\vec{x}^\star)^\top \vec{y} = \sum_{i=1}^n x_i^\star y_i = \sum_{i=1}^n \operatorname{sgn}(y_i) y_i = \sum_{i=1}^n |y_i| = \|\vec{y}\|_1. \tag{2.17}$$

As a final exercise, we consider $p = 1$, so that the problem becomes

$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_1 \leq 1} \vec{x}^\top \vec{y}. \tag{2.18}$$

For $n = 2$, the feasible set and $\vec{y}$ together look like the following:

[Figure: the $\ell_1$ unit ball (a diamond) in the $(x_1, x_2)$-plane, with the vector $\vec{y}$.]

We now bound the objective as

$$\vec{x}^\top \vec{y} \leq |\vec{x}^\top \vec{y}| \tag{2.19}$$
$$= \left| \sum_{i=1}^n x_i y_i \right| \tag{2.20}$$
$$\leq \sum_{i=1}^n |x_i y_i| \quad \text{by triangle inequality} \tag{2.21}$$
$$= \sum_{i=1}^n |x_i| |y_i| \tag{2.22}$$
$$\leq \sum_{i=1}^n |x_i| \left( \max_{i \in \{1, \dots, n\}} |y_i| \right) \tag{2.23}$$
$$= \left( \max_{i \in \{1, \dots, n\}} |y_i| \right) \sum_{i=1}^n |x_i| \tag{2.24}$$
$$= \|\vec{y}\|_\infty \|\vec{x}\|_1 \tag{2.25}$$
$$\leq \|\vec{y}\|_\infty. \tag{2.26}$$

Thus we have
$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_1 \leq 1} \vec{x}^\top \vec{y} \leq \|\vec{y}\|_\infty. \tag{2.27}$$

This inequality is actually an equality. To show this, we need to show the reverse inequality

$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_1 \leq 1} \vec{x}^\top \vec{y} \geq \|\vec{y}\|_\infty. \tag{2.28}$$

Showing this inequality amounts to choosing, for our fixed $\vec{y}$, an $\vec{x}$ such that $\|\vec{x}\|_1 \leq 1$ and $\vec{x}^\top \vec{y} \geq \|\vec{y}\|_\infty$. This is also called "showing the maximum is attained". To do this, we can find an $\vec{x}$ such that $\|\vec{x}\|_1 \leq 1$ and all the inequalities in the chain are met with equality.

• First, the inequality in (2.21) is a triangle inequality with the absolute value, i.e., $\left| \sum_{i=1}^n x_i y_i \right| \leq \sum_{i=1}^n |x_i y_i|$. To make sure this is an equality, it's enough to make sure that all terms $x_i y_i$ are the same sign or 0.

• Next, the inequality in (2.23) says that $\sum_{i=1}^n |x_i| |y_i| \leq \sum_{i=1}^n |x_i| \left( \max_{i \in \{1, \dots, n\}} |y_i| \right)$. The most obvious instance in which this inequality is met with equality is when $|y_i| = \max_{j \in \{1, \dots, n\}} |y_j|$ for all $i$. But we can't choose $\vec{y}$, as it's fixed, so we can't be assured that this holds. An alternate way in which this holds is that $|x_i| = 0$ for all $i$ for which $|y_i| \neq \max_{j \in \{1, \dots, n\}} |y_j|$, i.e., $i \notin \operatorname{argmax}_{j \in \{1, \dots, n\}} |y_j|$.

• Finally, the inequality in (2.26) says that $\|\vec{x}\|_1 \|\vec{y}\|_\infty \leq \|\vec{y}\|_\infty$; to meet this inequality with equality, it is sufficient to have $\|\vec{x}\|_1 = 1$.

To meet all three of these constraints, we can construct $\vec{x}^\star$ via the following process:

• For each $i \notin \operatorname{argmax}_{j \in \{1, \dots, n\}} |y_j|$, set $\tilde{x}_i = 0$, as per the second bullet point above.

• For each $i \in \operatorname{argmax}_{j \in \{1, \dots, n\}} |y_j|$, set $\tilde{x}_i = \operatorname{sgn}(y_i)$, as per the first bullet point above.

• To get the true solution vector $\vec{x}^\star$, divide $\tilde{\vec{x}}$ by $\|\tilde{\vec{x}}\|_1$; that is, $\vec{x}^\star = \tilde{\vec{x}} / \|\tilde{\vec{x}}\|_1$. This ensures that $\|\vec{x}^\star\|_1 = 1$, as per the third bullet point above.

This $\vec{x}^\star$ "achieves the maximum", showing that

$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_1 \leq 1} \vec{x}^\top \vec{y} = \|\vec{y}\|_\infty. \tag{2.29}$$

This notion, where the $\ell_2$-norm constraint leads to the $\ell_2$-norm objective, the $\ell_\infty$-norm constraint leads to the $\ell_1$-norm objective, and the $\ell_1$-norm constraint leads to the $\ell_\infty$-norm objective, hints at a greater pattern. Indeed, one can show that for $1 \leq p, q \leq \infty$ such that $\frac{1}{p} + \frac{1}{q} = 1$, an $\ell_p$-norm constraint leads to an $\ell_q$-norm objective:

$$\max_{\vec{x} \in \mathbb{R}^n, \ \|\vec{x}\|_p \leq 1} \vec{x}^\top \vec{y} = \|\vec{y}\|_q. \tag{2.30}$$

As before, we can prove this equality by proving the two constituent inequalities:

$$\max_{\|\vec{x}\|_p \leq 1} \vec{x}^\top \vec{y} \leq \|\vec{y}\|_q \quad \text{and} \quad \max_{\|\vec{x}\|_p \leq 1} \vec{x}^\top \vec{y} \geq \|\vec{y}\|_q. \tag{2.31}$$

The proof of the first inequality ($\leq$) follows from applying Hölder's inequality to the objective function:

$$\max_{\|\vec{x}\|_p \leq 1} \vec{x}^\top \vec{y} \leq \max_{\|\vec{x}\|_p \leq 1} \|\vec{x}\|_p \|\vec{y}\|_q = \|\vec{y}\|_q \cdot \max_{\|\vec{x}\|_p \leq 1} \|\vec{x}\|_p = \|\vec{y}\|_q. \tag{2.32}$$

The second inequality ($\geq$) follows if, for our fixed choice of $\vec{y}$, we produce some $\vec{x}$ such that $\|\vec{x}\|_p \leq 1$ and $\vec{x}^\top \vec{y} \geq \|\vec{y}\|_q$, i.e., "the maximum is attained". This is more complicated to do, and we won't do it here.
The above equality (2.30) means that the norms $\|\cdot\|_p$ and $\|\cdot\|_q$ are so-called dual norms. We will explore aspects of duality later in the course, though frankly we are just scratching the surface.
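Numerically, one can check (2.30) by evaluating the maximizers derived in this example (a sketch assuming numpy; the vector $\vec{y}$ is an arbitrary choice):

```python
import numpy as np

y = np.array([1.0, -3.0, 2.0])

# p = 2: maximizer x = y / ||y||_2 attains ||y||_2.
x2 = y / np.linalg.norm(y, 2)
print(x2 @ y, np.linalg.norm(y, 2))

# p = inf: maximizer x = sgn(y) attains ||y||_1.
xinf = np.sign(y)
print(xinf @ y, np.linalg.norm(y, 1))

# p = 1: maximizer puts all weight on the largest |y_i|, attaining ||y||_inf.
x1 = np.zeros_like(y)
i = np.argmax(np.abs(y))
x1[i] = np.sign(y[i])
print(x1 @ y, np.linalg.norm(y, np.inf))
```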

These problems, which are short and easy to state, contain a couple of core ideas within their solutions, which are
broadly generalizable to a lot of optimization problems. For your convenience, we discuss these explicitly below.

Problem Solving Strategy 11 (Separating Vector Problems into Scalar Problems). When trying to simplify an opti-
mization problem, try to see if you can simplify it into several independent scalar problems. Then solve each scalar
problem — this is usually much easier than solving the whole vector problem at once. The optimal solutions to each
scalar problem will then form the optimal solution to the whole vector problem.

Problem Solving Strategy 12 (Proving Optimality in an Optimization Problem). To solve an optimization problem,
you can use inequalities to bound the objective function, and then try to show that this bound is tight by finding a
feasible choice of optimization variable which makes all the inequalities into equalities.

2.2 Gram-Schmidt and QR Decomposition


The Gram-Schmidt algorithm is a way to turn a linearly independent set $\{\vec{a}_1, \dots, \vec{a}_k\}$ of vectors into an orthonormal set $\{\vec{q}_1, \dots, \vec{q}_k\}$ which spans the same space. To reiterate, an orthonormal set is a set of vectors in which each vector has norm 1 and is orthogonal to all the others in the set.
Suppose for simplicity that $n = k = 2$, and that we have the following vectors.

[Figure: two linearly independent vectors $\vec{a}_1$ and $\vec{a}_2$ in the plane.]

We begin with $\vec{a}_1$. We want to construct a vector $\vec{q}_1$ such that

• it's orthogonal to all the $\vec{q}_i$ which came before it — which is none of them, so we don't have to worry; and

• it has unit norm, so $\|\vec{q}_1\|_2 = 1$.

To achieve this, the simplest choice is

$$\vec{q}_1 \doteq \frac{\vec{a}_1}{\|\vec{a}_1\|_2}. \tag{2.33}$$

[Figure: $\vec{q}_1$ is the unit vector in the direction of $\vec{a}_1$.]

Then we go to $\vec{a}_2$. To find $\vec{q}_2$ which is orthogonal to all the $\vec{q}_i$ before it — that is, $\vec{q}_1$ — we subtract off the orthogonal projection of $\vec{a}_2$ onto $\vec{q}_1$ from $\vec{a}_2$. The orthogonal projection of $\vec{a}_2$ onto $\vec{q}_1$ is given by

$$\vec{p}_2 \doteq \vec{q}_1 (\vec{q}_1^\top \vec{a}_2) \tag{2.34}$$

and so the projection residual is given by

$$\vec{s}_2 \doteq \vec{a}_2 - \vec{p}_2 = \vec{a}_2 - \vec{q}_1 (\vec{q}_1^\top \vec{a}_2). \tag{2.35}$$

Note that these formulas only hold because $\vec{q}_1$ is normalized, i.e., has norm 1.

[Figure: the projection $\vec{p}_2$ of $\vec{a}_2$ onto $\vec{q}_1$, and the residual $\vec{s}_2 = \vec{a}_2 - \vec{p}_2$.]

While $\vec{s}_2$ is orthogonal to $\vec{q}_1$, because we want a $\vec{q}_2$ that is normalized, we normalize $\vec{s}_2$ to get $\vec{q}_2$:

$$\vec{q}_2 \doteq \frac{\vec{s}_2}{\|\vec{s}_2\|_2}. \tag{2.36}$$

[Figure: $\vec{q}_2$ is the residual $\vec{s}_2$ normalized to unit length.]

If we had a third vector $\vec{a}_3$ (and weren't limited by drawing in 2D space), we would ensure that $\vec{q}_3$ were orthogonal to $\vec{q}_1$ and $\vec{q}_2$, as well as normalized, in a similar way as before. First we would compute the projection

$$\vec{p}_3 \doteq \vec{q}_1 (\vec{q}_1^\top \vec{a}_3) + \vec{q}_2 (\vec{q}_2^\top \vec{a}_3) \tag{2.37}$$

and the residual

$$\vec{s}_3 \doteq \vec{a}_3 - \vec{p}_3 = \vec{a}_3 - \vec{q}_1 (\vec{q}_1^\top \vec{a}_3) - \vec{q}_2 (\vec{q}_2^\top \vec{a}_3). \tag{2.38}$$

These projection formulas only hold because $\{\vec{q}_1, \vec{q}_2\}$ is an orthonormal set. And then we could compute

$$\vec{q}_3 \doteq \frac{\vec{s}_3}{\|\vec{s}_3\|_2}. \tag{2.39}$$

And so on. The general algorithm goes similarly.

Algorithm 1 Gram-Schmidt algorithm.

1: function GramSchmidtAlgorithm(linearly independent set $\{\vec{a}_1, \dots, \vec{a}_k\}$)
2:   $\vec{q}_1 \doteq \vec{a}_1 / \|\vec{a}_1\|_2$
3:   for $i \in \{2, 3, \dots, k\}$ do
4:     $\vec{p}_i \doteq \sum_{j=1}^{i-1} \vec{q}_j (\vec{q}_j^\top \vec{a}_i)$
5:     $\vec{s}_i \doteq \vec{a}_i - \vec{p}_i$
6:     $\vec{q}_i \doteq \vec{s}_i / \|\vec{s}_i\|_2$
7:   end for
8:   return orthonormal set $\{\vec{q}_1, \dots, \vec{q}_k\}$
9: end function

This algorithm has the following two properties, which you can formally prove as an exercise.

Proposition 13 (Gram-Schmidt Algorithm)


Algorithm 1 has the following properties:

1. For each i ∈ {1, . . . , k}, we have

span(~a1 , . . . , ~ai ) = span(~q1 , . . . , ~qi ). (2.40)

In particular, {~a1 , . . . , ~ak } spans the same subspace as {~q1 , . . . , ~qk }, as was stated in our original goal.

2. {~q1 , . . . , ~qk } is an orthonormal set.
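Algorithm 1 translates almost line-for-line into code (a minimal numpy sketch; the input columns are assumed linearly independent):

```python
import numpy as np

def gram_schmidt(A):
    """Orthonormalize the columns of A (assumed linearly independent)."""
    n, k = A.shape
    Q = np.zeros((n, k))
    Q[:, 0] = A[:, 0] / np.linalg.norm(A[:, 0])
    for i in range(1, k):
        p = Q[:, :i] @ (Q[:, :i].T @ A[:, i])  # projection onto span(q_1, ..., q_{i-1})
        s = A[:, i] - p                        # projection residual
        Q[:, i] = s / np.linalg.norm(s)
    return Q

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
Q = gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(3)))  # orthonormal columns: True
```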

The Gram-Schmidt algorithm leads to something called the QR decomposition. Because, for each $i$, we have $\operatorname{span}(\vec{a}_1, \dots, \vec{a}_i) = \operatorname{span}(\vec{q}_1, \dots, \vec{q}_i)$, we can write $\vec{a}_i$ as a linear combination of $\vec{q}_1, \dots, \vec{q}_i$:

$$\vec{a}_i = r_{1i} \vec{q}_1 + r_{2i} \vec{q}_2 + \cdots + r_{ii} \vec{q}_i = \sum_{j=1}^i r_{ji} \vec{q}_j. \tag{2.41}$$

Putting all $k$ equations in matrix form, we can write

$$\begin{bmatrix} \vec{a}_1 & \cdots & \vec{a}_k \end{bmatrix} = \begin{bmatrix} \vec{q}_1 & \cdots & \vec{q}_k \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1k} \\ 0 & r_{22} & \cdots & r_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & r_{kk} \end{bmatrix}. \tag{2.42}$$

More generally, we can decompose every tall matrix with full column rank into a product of a tall matrix $Q$ with orthonormal columns and an upper-triangular matrix $R$.


Theorem 14 (QR Decomposition)


Let A ∈ Rn×k where k ≤ n (so A is tall). Suppose A has full column rank. Then there is a matrix Q ∈ Rn×k
with orthonormal columns, and a matrix R ∈ Rk×k which is upper triangular, such that A = QR.

As a final note, there are various alterations to the QR decomposition that work for matrices which are wide and/or do
not have full column rank. Those are out of scope, but the idea is the same.
The QR decomposition is also relevant in numerical linear algebra, where it can be used to solve tall linear systems
A~x = ~y efficiently, especially if the underlying matrix A has special structure. All such connections are out of scope.
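For instance, here is a sketch (assuming numpy) of using np.linalg.qr to solve a tall least-squares system via $R\vec{x} = Q^\top \vec{y}$:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 3))  # tall, full column rank
y = rng.standard_normal(50)

Q, R = np.linalg.qr(A)            # A = QR; Q is 50x3 orthonormal, R is 3x3 upper triangular
x = np.linalg.solve(R, Q.T @ y)   # solve R x = Q^T y

# Agrees with the normal-equation solution (1.10).
print(np.allclose(x, np.linalg.solve(A.T @ A, A.T @ y)))  # True
```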

2.3 Fundamental Theorem of Linear Algebra


The fundamental theorem of linear algebra is a tool for understanding what happens to vectors and vector spaces under
a linear transformation. Matrix multiplication transforms one vector space into another. This is helpful for allowing us
to change our coordinate system, which tells us more about the problem.

Definition 15 (Direct Sum)


Let U, V ⊆ Rn be subspaces. We say that U and V direct sum to Rn , denoted U ⊕ V = Rn , if and only if:

• Every vector ~x ∈ Rn can be written as ~x = ~x1 + ~x2 , where ~x1 ∈ U and ~x2 ∈ V .

• Furthermore, this decomposition is unique, in the sense that if ~x = ~x1 + ~x2 = ~y1 + ~y2 are two instances of
the above decomposition, then ~x1 = ~y1 and ~x2 = ~y2 .

Theorem 16 (Fundamental Theorem of Linear Algebra)

Let $A \in \mathbb{R}^{m \times n}$. Then
$$\mathcal{N}(A) \oplus \mathcal{R}(A^\top) = \mathbb{R}^n. \tag{2.43}$$

Note that we cannot replace $\mathcal{R}(A^\top)$ by $\mathcal{R}(A)$, since vectors in $\mathcal{R}(A)$ and $\mathcal{N}(A)$ do not even have the same number of entries or lie in the same Euclidean space. If we want to make a statement about $\mathcal{R}(A)$, we can replace $A$ by $A^\top$ in the above theorem to get the following corollary.

Corollary 17. Let $A \in \mathbb{R}^{m \times n}$. Then
$$\mathcal{N}(A^\top) \oplus \mathcal{R}(A) = \mathbb{R}^m. \tag{2.44}$$


To prove the fundamental theorem of linear algebra, we use a tool called the orthogonal decomposition theorem.

Definition 18 (Orthogonal Complement)


Let $S \subseteq \mathbb{R}^n$ be a subspace. The orthogonal complement of $S$, denoted $S^\perp$, is
$$S^\perp \doteq \{\vec{x} \in \mathbb{R}^n \mid \vec{s}^\top \vec{x} = 0 \text{ for all } \vec{s} \in S\}. \tag{2.45}$$


Theorem 19 (Orthogonal Decomposition Theorem, Theorem 2.1 of [2])


Let S ⊆ Rn be a subspace. Then
S ⊕ S ⊥ = Rn . (2.46)

Proof. To prove this, we first need to prove the following claim:

Let U, V ⊆ Rn be subspaces. Then U ⊕ V = Rn if and only if every vector ~x ∈ Rn can be written as


~x = ~x1 + ~x2 , where ~x1 ∈ U and ~x2 ∈ V , and U ∩ V = {~0}.

To prove this claim, suppose first that $U \oplus V = \mathbb{R}^n$. Then every vector $\vec{x} \in \mathbb{R}^n$ can be written as $\vec{x} = \vec{x}_1 + \vec{x}_2$, where $\vec{x}_1 \in U$ and $\vec{x}_2 \in V$. It remains to prove that $U \cap V = \{\vec{0}\}$. Suppose for the sake of contradiction that there exists $\vec{y} \neq \vec{0}$ such that $\vec{y} \in U \cap V$. Then
~x = (~x1 + ~y ) + (~x2 − ~y ). (2.47)

Since ~y ∈ U , we have ~x1 + ~y ∈ U ; since ~y ∈ V , we have ~x2 − ~y ∈ V . Thus

~x = ~x1 + ~x2 = (~x1 + ~y ) + (~x2 − ~y ) (2.48)

are two distinct ways to write ~x as the sum of vectors from U and V , so it cannot be true that U ⊕ V = Rn , a
contradiction.
Towards the other direction, suppose that every vector ~x ∈ Rn can be written as ~x = ~x1 + ~x2 , where ~x1 ∈ U and
~x2 ∈ V , and U ∩ V = {~0}. The only thing remaining to prove is that if

~x = ~x1 + ~x2 = ~z1 + ~z2 (2.49)

where ~x1 , ~z1 ∈ U and ~x2 , ~z2 ∈ V , then we must have ~x1 = ~z1 and ~x2 = ~z2 . Suppose again for the sake of contradiction
that there exists ~x ∈ Rn , ~x1 , ~z1 ∈ U , and ~x2 , ~z2 ∈ V such that

~x = ~x1 + ~x2 = ~z1 + ~z2 (2.50)

but $\vec{x}_1 \neq \vec{z}_1$ or $\vec{x}_2 \neq \vec{z}_2$. Then we have

$$\vec{0} = \vec{x} - \vec{x} = \vec{x}_1 + \vec{x}_2 - \vec{z}_1 - \vec{z}_2 = (\vec{x}_1 - \vec{z}_1) + (\vec{x}_2 - \vec{z}_2). \tag{2.51}$$

Thus, we have that

$$\vec{x}_1 - \vec{z}_1 = \vec{z}_2 - \vec{x}_2 \neq \vec{0}. \tag{2.52}$$

Since $\vec{x}_1, \vec{z}_1 \in U$, we have $\vec{x}_1 - \vec{z}_1 \in U$, and since $\vec{x}_2, \vec{z}_2 \in V$, we have $\vec{z}_2 - \vec{x}_2 \in V$. Since they are equal, we have $\vec{x}_1 - \vec{z}_1 \in U \cap V$ and nonzero. Thus $U \cap V \neq \{\vec{0}\}$, a contradiction.
This proves the above claim. Now to prove the actual theorem, we note that every vector ~x ∈ Rn can be written as

~x = projS (~x) + (~x − projS (~x)). (2.53)

By definition, projS (~x) ∈ S, and because the projection residual is orthogonal to the subspace, we have ~x −projS (~x) ∈
S ⊥ . Thus every vector in Rn can be written as the sum of a vector in S and S ⊥ . It is an exercise to show that
S ∩ S ⊥ = {~0}. Invoking the quoted claim completes the proof.

Using this theorem, the only thing we need to show to prove the fundamental theorem of linear algebra is that $\mathcal{N}(A)$ and $\mathcal{R}(A^\top)$ are orthogonal complements. We do this below.



Proof of Theorem 16. By Theorem 19, the only thing we need to show is that $\mathcal{N}(A) = \mathcal{R}(A^\top)^\perp$. This is a set equality; we show it by showing that $\mathcal{N}(A) \subseteq \mathcal{R}(A^\top)^\perp$ and that $\mathcal{N}(A) \supseteq \mathcal{R}(A^\top)^\perp$.

We first want to show that $\mathcal{N}(A) \subseteq \mathcal{R}(A^\top)^\perp$. That is, we want to show that for any $\vec{x} \in \mathcal{N}(A)$ we have $\vec{x} \in \mathcal{R}(A^\top)^\perp$. That is, for any $\vec{y} \in \mathcal{R}(A^\top)$, we want to show that $\vec{y}^\top \vec{x} = 0$.
Since $\vec{y} \in \mathcal{R}(A^\top)$ we can write $\vec{y} = A^\top \vec{w}$ for some $\vec{w} \in \mathbb{R}^m$. Then, since $\vec{x} \in \mathcal{N}(A)$ we have $A\vec{x} = \vec{0}$, so

$$\vec{y}^\top \vec{x} = (A^\top \vec{w})^\top \vec{x} \tag{2.54}$$
$$= \vec{w}^\top A \vec{x} \tag{2.55}$$
$$= \vec{w}^\top \vec{0} \tag{2.56}$$
$$= 0. \tag{2.57}$$

Thus $\vec{x}$ and $\vec{y}$ are orthogonal, so $\vec{x} \in \mathcal{R}(A^\top)^\perp$, which shows that $\mathcal{N}(A) \subseteq \mathcal{R}(A^\top)^\perp$.

We now want to show that $\mathcal{R}(A^\top)^\perp \subseteq \mathcal{N}(A)$. That is, for any $\vec{x} \in \mathcal{R}(A^\top)^\perp$, we want to show that $\vec{x} \in \mathcal{N}(A)$, i.e., that $A\vec{x} = \vec{0}$.
By definition, for every $\vec{y} \in \mathcal{R}(A^\top)$, we have $\vec{y}^\top \vec{x} = 0$. By writing $\vec{y} = A^\top \vec{w}$ for arbitrary $\vec{w} \in \mathbb{R}^m$, we get that for every $\vec{w} \in \mathbb{R}^m$ we have $(A^\top \vec{w})^\top \vec{x} = 0$. But the left-hand side is $\vec{w}^\top A \vec{x}$, so we have that $\vec{w}^\top A \vec{x} = 0$ for every $\vec{w} \in \mathbb{R}^m$. Since this is true for all $\vec{w} \in \mathbb{R}^m$, it is true for the specific choice $\vec{w} = A\vec{x}$, which yields

$$0 = \vec{w}^\top A \vec{x} \tag{2.58}$$
$$= (A\vec{x})^\top A \vec{x} \tag{2.59}$$
$$= \|A\vec{x}\|_2^2 \tag{2.60}$$
$$\implies A\vec{x} = \vec{0}. \tag{2.61}$$

This implies that $\vec{x} \in \mathcal{N}(A)$ as desired, so $\mathcal{R}(A^\top)^\perp \subseteq \mathcal{N}(A)$.

Thus, we have shown that $\mathcal{N}(A) = \mathcal{R}(A^\top)^\perp$, and so by Theorem 19 we have $\mathcal{N}(A) \oplus \mathcal{R}(A^\top) = \mathbb{R}^n$.
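The decomposition in Theorem 16 can be computed explicitly from the SVD (a sketch assuming numpy; the right singular vectors split into bases for $\mathcal{R}(A^\top)$ and $\mathcal{N}(A)$):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 5))        # wide matrix, rank 3
_, S, Vt = np.linalg.svd(A)
r = int(np.sum(S > 1e-10))             # numerical rank

row_basis = Vt[:r].T                   # orthonormal basis of R(A^T)
null_basis = Vt[r:].T                  # orthonormal basis of N(A)

# Decompose an arbitrary x in R^5 as x = x_row + x_null.
x = rng.standard_normal(5)
x_row = row_basis @ (row_basis.T @ x)
x_null = null_basis @ (null_basis.T @ x)
print(np.allclose(x, x_row + x_null))  # True
print(np.allclose(A @ x_null, 0))      # x_null is in the null space: True
```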

This will help us solve a very important optimization problem, which is considered "dual" to least squares in some sense. Recall that least squares helps us find an approximate solution to the linear system $A\vec{x} = \vec{y}$ when $A$ is a tall matrix with full column rank. In other words, the linear system is over-determined: there are many more equations than unknowns, and there are generally no exact solutions, so we pick the solution with minimum squared error.
What about when $A$ is a wide matrix with full row rank? There are now more unknowns than equations, and infinitely many exact solutions. So how do we pick one solution in particular? It really depends on which engineering problem we are solving. One common choice is the minimum-energy or minimum-norm solution, which solves the optimization problem:

$$\min_{\vec{x} \in \mathbb{R}^n} \quad \|\vec{x}\|_2^2 \tag{2.62}$$
$$\text{s.t.} \quad A\vec{x} = \vec{y}.$$

Note that this principle of choosing the smallest or simplest solution — the "Occam's Razor" principle — generalizes far beyond the case of finding solutions to linear systems, and is used within control theory and machine learning. But we deal with just this linear system case for now.

Theorem 20 (Minimum-Norm Solution)

Let $A \in \mathbb{R}^{m \times n}$ have full row rank, and let $\vec{y} \in \mathbb{R}^m$. Then the solution to (2.62), i.e., the solution to

$$\min_{\vec{x} \in \mathbb{R}^n} \|\vec{x}\|_2^2 \quad \text{s.t.} \quad A\vec{x} = \vec{y},$$

is given by
$$\vec{x}^\star = A^\top (A A^\top)^{-1} \vec{y}. \tag{2.63}$$

Proof. Observe that the constraint $A\vec{x} = \vec{y}$ under-specifies $\vec{x}$ — in particular, any component of $\vec{x}$ in $\mathcal{N}(A)$ will not affect the constraint, only the objective. In this sense, it is "wasteful", and we should intuitively remove it. This motivates using Theorem 16 to decompose $\vec{x}$ into a component inside $\mathcal{N}(A)$ — which we want to remove — and a component inside $\mathcal{R}(A^\top)$ — which we will optimize over.
Indeed, write $\vec{x} = \vec{u} + \vec{v}$, where $\vec{u} \in \mathcal{N}(A)$ and $\vec{v} \in \mathcal{R}(A^\top)$. Thus, there exists $\vec{w} \in \mathbb{R}^m$ such that $\vec{v} = A^\top \vec{w}$. The constraint becomes

$$\vec{y} = A\vec{x} \tag{2.64}$$
$$= A(\vec{u} + \vec{v}) \tag{2.65}$$
$$= A\vec{u} + A\vec{v} \tag{2.66}$$
$$= \vec{0} + A A^\top \vec{w} \tag{2.67}$$
$$= A A^\top \vec{w}. \tag{2.68}$$

And the objective function becomes

$$\|\vec{x}\|_2^2 = \|\vec{u} + \vec{v}\|_2^2 \tag{2.69}$$
$$= \vec{u}^\top \vec{u} + 2\vec{u}^\top \vec{v} + \vec{v}^\top \vec{v} \tag{2.70}$$
$$= \|\vec{u}\|_2^2 + 2\vec{v}^\top \vec{u} + \|\vec{v}\|_2^2 \tag{2.71}$$
$$= \|\vec{u}\|_2^2 + 2(A^\top \vec{w})^\top \vec{u} + \|\vec{v}\|_2^2 \tag{2.72}$$
$$= \|\vec{u}\|_2^2 + 2\vec{w}^\top A \vec{u} + \|\vec{v}\|_2^2 \tag{2.73}$$
$$= \|\vec{u}\|_2^2 + 2\vec{w}^\top \vec{0} + \|\vec{v}\|_2^2 \tag{2.74}$$
$$= \|\vec{u}\|_2^2 + 2 \cdot 0 + \|\vec{v}\|_2^2 \tag{2.75}$$
$$= \|\vec{u}\|_2^2 + \|\vec{v}\|_2^2 \tag{2.76}$$
$$= \|\vec{u}\|_2^2 + \|A^\top \vec{w}\|_2^2. \tag{2.77}$$

Thus, the minimum-norm problem can be reformulated in terms of $\vec{u}$ and $\vec{w}$:

$$\min_{\vec{u} \in \mathbb{R}^n, \ \vec{w} \in \mathbb{R}^m} \quad \|\vec{u}\|_2^2 + \|A^\top \vec{w}\|_2^2 \tag{2.78}$$
$$\text{s.t.} \quad \vec{y} = A A^\top \vec{w}, \quad A\vec{u} = \vec{0}.$$

Now, because $A$ has full row rank, $A A^\top$ is invertible, so the first constraint implies that $\vec{w}^\star = (A A^\top)^{-1} \vec{y}$, so $\vec{v}^\star = A^\top \vec{w}^\star = A^\top (A A^\top)^{-1} \vec{y}$. And because we are trying to minimize the objective, which only involves $\vec{u}$ through $\|\vec{u}\|_2^2$, the ideal solution is to set $\vec{u}^\star = \vec{0}$, which also satisfies the second constraint and so is feasible. Thus $\vec{x}^\star = \vec{v}^\star = A^\top (A A^\top)^{-1} \vec{y}$ as desired.

2.4 Symmetric Matrices


Symmetric matrices are a sub-class of matrices which have many special properties, and in engineering applications
one usually tries to work with symmetric matrices as much as possible.

Definition 21 (Symmetric Matrix)


Let A ∈ Rn×n be a square matrix. We say that A is symmetric if A = A> . The set of all symmetric matrices is
denoted Sn .

Equivalently, Aij = Aji for all i and j.


" #
a b
Example 22. The 2 × 2 matrix is symmetric.
b c

Example 23 (Covariance Matrices). Any matrix of the form $A = BB^\top$, such as the covariance matrices we will discuss in the next section, is a symmetric matrix, since

$$A^\top = (BB^\top)^\top = (B^\top)^\top B^\top = BB^\top = A. \tag{2.79}$$

Example 24 (Adjacency Matrix). Consider an undirected connected graph $G = (V, E)$, for example the following:

[Figure: an undirected cycle graph on the four vertices 1, 2, 3, 4.]

Its adjacency matrix $A$ has coordinate $A_{ij} = 1$ if $(i, j) \in E$, and $A_{ij} = 0$ otherwise; in the above example, we have

$$A = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}. \tag{2.80}$$

Since the graph is undirected, $(i, j) \in E$ if and only if $(j, i) \in E$, so $A_{ij} = A_{ji}$, and so $A$ is a symmetric matrix.

Why do we care about symmetric matrices? Symmetric matrices have two nice properties: real eigenvalues, and guaranteed diagonalizability.
In general, a (non-symmetric) matrix need not be diagonalizable. For example, the matrix $A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ is not diagonalizable. How can we characterize the diagonalizability of a matrix, then?
First, we will need the following definitions.

Definition 25 (Multiplicities)
Let $A \in \mathbb{R}^{n \times n}$, and let $\lambda$ be an eigenvalue of $A$.

(a) The algebraic multiplicity $\mu$ of eigenvalue $\lambda$ in $A$ is the number of times $\lambda$ is a root of the characteristic polynomial $p_A(x) \doteq \det(xI - A)$ of $A$, i.e., it is the power of $(x - \lambda)$ in the factorization of $p_A(x)$.

(b) The geometric multiplicity $\phi$ of eigenvalue $\lambda$ in $A$ is the dimension of the null space $\Phi \doteq \mathcal{N}(\lambda I - A)$.

Theorem 26 (Diagonalizability)
A square matrix $A \in \mathbb{R}^{n \times n}$ is diagonalizable if and only if every eigenvalue of $A$ has equal algebraic and geometric multiplicities.

Example 27 (Multiplicities of Degenerate Matrix). We were earlier told that the matrix $A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ is not diagonalizable. To check this, let us compute its eigenvalues, algebraic multiplicities, and geometric multiplicities.
First, its characteristic polynomial is

$$p_A(x) = \det(xI - A) \tag{2.81}$$
$$= \det \begin{bmatrix} x - 1 & -1 \\ 0 & x - 1 \end{bmatrix} \tag{2.82}$$
$$= (x - 1)^2. \tag{2.83}$$

Thus, $A$ has only one eigenvalue $\lambda = 1$. Since $(x - 1)$ has power 2 in the factorization of $p_A$, the eigenvalue $\lambda = 1$ has algebraic multiplicity $\mu = 2$.
The corresponding null space is

$$\Phi = \mathcal{N}(\lambda I - A) \tag{2.84}$$
$$= \mathcal{N}\left( \begin{bmatrix} 1 - 1 & -1 \\ 0 & 1 - 1 \end{bmatrix} \right) \tag{2.85}$$
$$= \mathcal{N}\left( \begin{bmatrix} 0 & -1 \\ 0 & 0 \end{bmatrix} \right) \tag{2.86}$$
$$= \operatorname{span}\left( \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right), \tag{2.87}$$

which has dimension $\phi = 1$. Thus, for $\lambda = 1$, we have $\mu \neq \phi$ and the matrix is indeed not diagonalizable.
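One can see this degeneracy numerically (a sketch assuming numpy; the geometric multiplicity is computed as $n - \operatorname{rank}(\lambda I - A)$):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])

lam = 1.0  # the only eigenvalue, with algebraic multiplicity 2
geo = 2 - np.linalg.matrix_rank(lam * np.eye(2) - A)
print(geo)  # geometric multiplicity: 1, so A is not diagonalizable
```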

This allows us to formally state the spectral theorem.

Theorem 28 (Spectral Theorem)

Let $A \in \mathbb{S}^n$ have eigenvalues $\lambda_i$ with algebraic multiplicities $\mu_i$, eigenspaces $\Phi_i \doteq \mathcal{N}(\lambda_i I - A)$, and geometric multiplicities $\phi_i \doteq \dim(\Phi_i)$.

(a) All eigenvalues are real: $\lambda_i \in \mathbb{R}$ for each $i$.

(b) Eigenspaces corresponding to different eigenvalues are orthogonal: $\Phi_i$ and $\Phi_j$ are orthogonal subspaces, i.e., for every $\vec{p}_i \in \Phi_i$ and $\vec{p}_j \in \Phi_j$ we have $\vec{p}_i^\top \vec{p}_j = 0$.

(c) $A$ is diagonalizable: $\mu_i = \phi_i$ for each $i$.

(d) $A$ is orthonormally diagonalizable: there exists an orthonormal matrix $U \in \mathbb{R}^{n \times n}$ and a diagonal matrix $\Lambda \in \mathbb{R}^{n \times n}$ such that $A = U \Lambda U^\top$.


Recall that orthonormal matrices are matrices whose columns are orthonormal, i.e., are pairwise orthogonal and unit-norm. Orthonormal matrices $U$ have the nice property that $U^\top U = I$, and if $U$ is square, then $U^\top = U^{-1}$.

Proof of Theorem 28. Part (a) might be left to homework; part (b) will definitely be left to homework; we prove parts (c)
and (d) here. In particular, we assume that parts (a) and (b) are true, and attempt to prove (d). Note that (d) implies (c),
as an orthonormal diagonalization is a type of diagonalization, and so the existence of an orthonormal diagonalization
must require the algebraic and geometric multiplicities to be equal.
Our proof strategy is to use induction on n, the size of the matrix. The base case of our induction is 1 × 1 matrices,
for which the diagonalization is trivial. Now consider the inductive step. Our hope is, given A ∈ Sn which has
eigenvalue λ, to get a decomposition of the form
" # " #
λ ~0> λ ~0>
A=V V> or equivalently >
V AV = (2.88)
~0 B ~0 B

where V ∈ Rn×n is orthonormal and B ∈ Sn−1 is symmetric. If we can do that, then we can inductively diagonalize
" , and # finally use that to construct a diagonalization for A = U ΛU , where U ∈ R is orthonormal
> > n×n
B = W ΓW
. λ ~0 >
and Λ = .
~0 Γ
Let ~u be a unit-norm eigenvector of A corresponding to eigenvalue λ. Remember that we want an orthonormal
matrix V ∈ Rn×n which “isolates” λ. This motivates using ~u and a basis of the orthogonal complement
h i of span(~u)
to form V . To construct this matrix V , we run Gram-Schmidt on the columns of the matrix ~u I ∈ Rn×(n+1) ,
throwing out the single vector which will have 0 projection residual (there must be exactly one such vector by a counting
argument; to get n linearly independent vectors from
h a spanning
i set of n + 1 vectors, we need to remove exactly one
vector), and obtaining the orthonormal matrix V = ~u V1 ∈ Rn×n where V1 ∈ Rn×(n−1) is itself orthonormal. By
construction, we have V1> ~u = ~0 and ~u> V1 = ~0> . Thus,
h i> h i
V > AV = ~u V1 A ~u V1 (2.89)
" #
~u> h i
= A ~u V 1 (2.90)
V1>
" #
~u> h i
= A~u AV 1 (2.91)
V1>
" #
~u> h i
= λ~
u AV 1 (2.92)
V1>
" #
λ~u> ~u ~u> AV1
= (2.93)
λV1> ~u V1> AV1
" #
2
λ k~uk2 (A> ~u)> V1
= (2.94)
~0 V1> AV1
" #
λ (A~u)> V1
= (2.95)
~0 V1> AV1
" #
λ λ~u> V1
= (2.96)
~0 V1> AV1
" #
λ ~0>
= (2.97)
~0 V1> AV1

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 27
EECS 127/227AT Course Reader 2.4. Symmetric Matrices 2024-04-27 21:08:09-07:00

" #
λ ~0>
= , (2.98)
~0 B
.
where B = V1> AV1 in accordance with our proof outline. Now we need to check that B is symmetric; indeed, we have

$$\begin{aligned}
B^\top &= (V_1^\top A V_1)^\top && (2.99) \\
&= (V_1)^\top A^\top (V_1^\top)^\top && (2.100) \\
&= V_1^\top A V_1 && (2.101) \\
&= B. && (2.102)
\end{aligned}$$

By induction, we can orthonormally diagonalize this matrix as B = W ΓW > ∈ R(n−1)×(n−1) , where W ∈ R(n−1)×(n−1)
is orthonormal and Γ ∈ R(n−1)×(n−1) is diagonal. Thus, by using W −1 = W > , we have

$$\begin{aligned}
\Gamma &= W^\top B W && (2.103) \\
&= W^\top V_1^\top A V_1 W && (2.104) \\
&= (V_1 W)^\top A (V_1 W). && (2.105)
\end{aligned}$$

We want an orthonormal matrix U ∈ Rn×n such that $U^\top A U = \Lambda = \begin{bmatrix} \lambda & \vec 0^\top \\ \vec 0 & \Gamma \end{bmatrix}$. Thus, the above calculation motivates the choice U = [~u V1W] ∈ Rn×n. Then
$$\begin{aligned}
U^\top A U &= \begin{bmatrix} \vec u & V_1 W \end{bmatrix}^\top A \begin{bmatrix} \vec u & V_1 W \end{bmatrix} && (2.106) \\
&= \begin{bmatrix} \vec u^\top \\ W^\top V_1^\top \end{bmatrix} A \begin{bmatrix} \vec u & V_1 W \end{bmatrix} && (2.107) \\
&= \begin{bmatrix} \vec u^\top \\ W^\top V_1^\top \end{bmatrix} \begin{bmatrix} A \vec u & A V_1 W \end{bmatrix} && (2.108) \\
&= \begin{bmatrix} \vec u^\top \\ W^\top V_1^\top \end{bmatrix} \begin{bmatrix} \lambda \vec u & A V_1 W \end{bmatrix} && (2.109) \\
&= \begin{bmatrix} \lambda \vec u^\top \vec u & \vec u^\top A V_1 W \\ \lambda W^\top V_1^\top \vec u & W^\top V_1^\top A V_1 W \end{bmatrix} && (2.110) \\
&= \begin{bmatrix} \lambda \|\vec u\|_2^2 & (A^\top \vec u)^\top V_1 W \\ \lambda W^\top \vec 0 & W^\top B W \end{bmatrix} && (2.111) \\
&= \begin{bmatrix} \lambda & (A \vec u)^\top V_1 W \\ \vec 0 & \Gamma \end{bmatrix} && (2.112) \\
&= \begin{bmatrix} \lambda & \lambda \vec u^\top V_1 W \\ \vec 0 & \Gamma \end{bmatrix} && (2.113) \\
&= \begin{bmatrix} \lambda & \lambda \vec 0^\top W \\ \vec 0 & \Gamma \end{bmatrix} && (2.114) \\
&= \begin{bmatrix} \lambda & \vec 0^\top \\ \vec 0 & \Gamma \end{bmatrix} && (2.115) \\
&= \Lambda, && (2.116)
\end{aligned}$$

as desired. Thus A = U ΛU > is an orthonormal diagonalization of A. This proves (d), and hence (c).
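As a numerical companion to the theorem, here is a short sketch (our own illustration, not part of the original text; it assumes numpy): numpy's eigh routine returns exactly such an orthonormal diagonalization for symmetric inputs.

```python
import numpy as np

# Minimal numerical illustration of the spectral theorem (our own sketch):
# for symmetric A, eigh returns real eigenvalues and an orthonormal U with
# A = U diag(lambda) U^T.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                              # symmetrize to get A in S^5

lam, U = np.linalg.eigh(A)                     # eigh is specialized to symmetric matrices
print(np.all(np.isreal(lam)))                  # True: eigenvalues are real
print(np.allclose(U.T @ U, np.eye(5)))         # True: U is orthonormal
print(np.allclose(U @ np.diag(lam) @ U.T, A))  # True: A = U Lambda U^T
```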


One nice thing about diagonalization is that we can read off the eigenvalues and eigenvectors from the components
of the diagonalization.

Proposition 29
Let A ∈ Sn have orthonormal diagonalization A = U ΛU>, where U = [~u1 · · · ~un] ∈ Rn×n is square orthonormal, and Λ = diag(λ1, . . . , λn) ∈ Rn×n is diagonal. Then for each i, the pair (λi, ~ui) is an eigenvalue-eigenvector pair for A.

Proof. By using U > = U −1 , we have

$$\begin{aligned}
A &= U \Lambda U^\top && (2.117) \\
A U &= U \Lambda && (2.118) \\
A \begin{bmatrix} \vec u_1 & \cdots & \vec u_n \end{bmatrix} &= \begin{bmatrix} \vec u_1 & \cdots & \vec u_n \end{bmatrix} \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} && (2.119) \\
\begin{bmatrix} A \vec u_1 & \cdots & A \vec u_n \end{bmatrix} &= \begin{bmatrix} \lambda_1 \vec u_1 & \cdots & \lambda_n \vec u_n \end{bmatrix}. && (2.120)
\end{aligned}$$

Therefore, each (λi , ~ui ) is an eigenvalue-eigenvector pair of A.

Using this, we can work with another nice property of the orthonormal diagonalization. Namely, we can read off
bases for N (A) and R(A). That is, a basis for N (A) is the set of eigenvectors ~ui corresponding to the eigenvalues
λi of A which are equal to 0. Since U is orthonormal, the remaining eigenvectors ~ui span the orthogonal complement
to N (A). But by the fundamental theorem of linear algebra (Theorem 16), we have N (A)⊥ = R(A>) = R(A) (using that A is symmetric), so these eigenvectors form a basis for R(A). Soon, we’ll discover the singular value decomposition, which allows for this
kind of decomposition of a matrix into its range and null spaces, except for arbitrary matrices.
Before we get into those, we will first state and solve a quick optimization problem which yields the eigenvalues of
a symmetric matrix. This optimization problem turns out to be quite useful for further study of optimization.

Theorem 30 (Variational Characterization of Eigenvalues)

Let A ∈ Sn. Let λmin{A} and λmax{A} be the minimum and maximum eigenvalues of A (which are well-defined since, by the spectral theorem, all eigenvalues of A are real). Then

$$\lambda_{\max}\{A\} = \max_{\substack{\vec x \in \mathbb R^n \\ \vec x \neq \vec 0}} \frac{\vec x^\top A \vec x}{\vec x^\top \vec x} = \max_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \vec x^\top A \vec x \tag{2.121}$$

$$\lambda_{\min}\{A\} = \min_{\substack{\vec x \in \mathbb R^n \\ \vec x \neq \vec 0}} \frac{\vec x^\top A \vec x}{\vec x^\top \vec x} = \min_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \vec x^\top A \vec x. \tag{2.122}$$

The term $\frac{\vec x^\top A \vec x}{\vec x^\top \vec x}$ is called the Rayleigh quotient of A; it is a function of ~x ∈ Rn.

Proof. Before we start trying to prove any equalities, let us try to simplify the crucial term ~x> A~x; the intuition behind
this is that it looks like the easiest term to use the orthonormal diagonalization on and achieve results.


Let A = U ΛU> be an orthonormal diagonalization of A. We have

$$\begin{aligned}
\vec x^\top A \vec x &= \vec x^\top U \Lambda U^\top \vec x && (2.123) \\
&= (U^\top \vec x)^\top \Lambda (U^\top \vec x) && (2.124) \\
&= \vec y^\top \Lambda \vec y && (2.125) \\
&= \sum_{i=1}^n \lambda_i\{A\}\, y_i^2 && (2.126)
\end{aligned}$$

with the invertible change of variables ~y := U>~x ⟺ ~x = U~y. Also we note that this change of variables preserves the norm, i.e.,

$$\|\vec y\|_2^2 = \|U^\top \vec x\|_2^2 = \vec x^\top U U^\top \vec x = \vec x^\top \vec x = \|\vec x\|_2^2. \tag{2.127}$$

We now turn to the first equality chain (with max). Immediately, we have

$$\begin{aligned}
\max_{\substack{\vec x \in \mathbb R^n \\ \vec x \neq \vec 0}} \frac{\vec x^\top A \vec x}{\vec x^\top \vec x} &= \max_{\substack{\vec x \in \mathbb R^n \\ \vec x \neq \vec 0}} \frac{\vec x^\top A \vec x}{\|\vec x\|_2^2} && (2.128) \\
&= \max_{\substack{\vec x \in \mathbb R^n \\ \vec x \neq \vec 0}} \left( \frac{\vec x}{\|\vec x\|_2} \right)^\top A \left( \frac{\vec x}{\|\vec x\|_2} \right). && (2.129)
\end{aligned}$$

Now, because the norm of our optimization variable ~x does not matter, in that it only affects the objective through its normalization ~x/‖~x‖2, it is equivalent to optimize over only unit-norm ~x, so

$$\max_{\substack{\vec x \in \mathbb R^n \\ \vec x \neq \vec 0}} \frac{\vec x^\top A \vec x}{\vec x^\top \vec x} = \max_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \vec x^\top A \vec x. \tag{2.130}$$

With the invertible change of variables ~y = U>~x already discussed, we can write

$$\begin{aligned}
\max_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \vec x^\top A \vec x &= \max_{\substack{\vec y \in \mathbb R^n \\ \|\vec y\|_2 = 1}} \sum_{i=1}^n \lambda_i y_i^2 && (2.131) \\
&\le \lambda_{\max}\{A\} \cdot \max_{\substack{\vec y \in \mathbb R^n \\ \|\vec y\|_2 = 1}} \sum_{i=1}^n y_i^2 && (2.132) \\
&= \lambda_{\max}\{A\} \cdot \max_{\substack{\vec y \in \mathbb R^n \\ \|\vec y\|_2 = 1}} \|\vec y\|_2^2 && (2.133) \\
&= \lambda_{\max}\{A\} \cdot \max_{\substack{\vec y \in \mathbb R^n \\ \|\vec y\|_2 = 1}} 1 && (2.134) \\
&= \lambda_{\max}\{A\}. && (2.135)
\end{aligned}$$

It is left to exhibit a ~y which makes this inequality an equality; indeed, it is achieved when yi = 1 for one i such that
λi {A} = λmax {A} and yi = 0 otherwise. The achieving ~x can be recovered by ~x = U~y . Note that since this ~y is a
standard basis vector ~y = ~ei for some i such that λi {A} = λmax {A}, then ~x = U~y = U~ei = ~ui , i.e., the ith column
of U , is an eigenvector of A corresponding to the maximum eigenvalue of A.
The analysis for λmin {A} goes exactly analogously.
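Here is a small numerical illustration of the variational characterization (our own sketch, assuming numpy, not part of the original text): the Rayleigh quotient of random unit vectors never escapes the interval [λmin, λmax], and the extremes are attained at the corresponding eigenvectors.

```python
import numpy as np

# Our own sketch checking Theorem 30 on a random symmetric matrix.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2
lam, U = np.linalg.eigh(A)                 # lam sorted ascending

X = rng.standard_normal((4, 1000))
X /= np.linalg.norm(X, axis=0)             # 1000 random unit vectors
rayleigh = np.einsum('in,ij,jn->n', X, A, X)   # x_n^T A x_n for each column
print(lam[0] <= rayleigh.min(), rayleigh.max() <= lam[-1])   # True True
print(np.isclose(U[:, -1] @ A @ U[:, -1], lam[-1]))          # max attained at u_max
```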

This characterization motivates defining a new sub-class (or really several new sub-classes) of matrices.


Definition 31 (Positive Semidefinite and Positive Definite Matrices)


Let A ∈ Sn . We say that A is positive semidefinite (PSD), denoted A ∈ Sn+ , if ~x> A~x ≥ 0 for all ~x. We say that
A is positive definite (PD), denoted A ∈ Sn++ , if ~x> A~x > 0 for all nonzero ~x.

There are also negative semidefinite (NSD) and negative definite (ND) symmetric matrices, defined analogously. There are also indefinite symmetric matrices, which are none of the above. Clearly, PD matrices are themselves PSD.

Proposition 32
We have A ∈ Sn+ if and only if each eigenvalue of A is non-negative. Also, A ∈ Sn++ if and only if each eigenvalue
of A is positive.

Proof. If A is PSD, we have

$$\lambda_{\min}\{A\} = \min_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \vec x^\top A \vec x \ge 0. \tag{2.136}$$

Now suppose that each eigenvalue of A is non-negative. Then

$$0 \le \lambda_{\min}\{A\} = \min_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \vec x^\top A \vec x, \tag{2.137}$$

which implies that ~x>A~x ≥ 0 for all ~x with unit norm, and by scaling we see that ~x>A~x ≥ 0 for all ~x ≠ ~0, while the inequality certainly holds for ~x = ~0. Thus ~x>A~x ≥ 0 for all ~x, so A ∈ Sn+.

If A is PD, we have

$$\lambda_{\min}\{A\} = \min_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \vec x^\top A \vec x > 0. \tag{2.138}$$

Now suppose that each eigenvalue of A is positive. Then

$$0 < \lambda_{\min}\{A\} = \min_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \vec x^\top A \vec x, \tag{2.139}$$

which implies that ~x>A~x > 0 for all ~x with unit norm, and by scaling we see that ~x>A~x > 0 for all ~x ≠ ~0, so A ∈ Sn++.

The final construction we discuss is that of the positive semidefinite square root.

Proposition 33
Let A ∈ Sn+ . Then there exists a unique symmetric PSD matrix B ∈ Sn+ , usually denoted B = A1/2 , such that
A = B2.

Proof. Discussion or homework. Note that there are non-symmetric matrices B such that A = B 2 , but there is a
unique PSD B.
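The construction suggested here is easy to carry out numerically. The following sketch (our own, assuming numpy; not part of the original text) builds A^{1/2} from an orthonormal diagonalization:

```python
import numpy as np

# Our own sketch of the PSD square root: if A = U diag(lam) U^T with lam >= 0,
# then B = U diag(sqrt(lam)) U^T is symmetric PSD and satisfies B @ B = A.
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M @ M.T                            # PSD by construction
lam, U = np.linalg.eigh(A)
lam = np.clip(lam, 0, None)            # guard against tiny negative round-off
B = U @ np.diag(np.sqrt(lam)) @ U.T
print(np.allclose(B, B.T), np.allclose(B @ B, A))  # True True
```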

2.5 Principal Component Analysis


Principal components analysis is a way to recover the eponymous principal components of the data. These principal components are those that are most representative of the data structure. Formally, if we have data in Rd, we want to find an underlying p-dimensional linear structure, where p ≪ d.


This idea has many use cases. For example, in modern machine learning, most data has thousands or millions of dimensions. To visualize such data properly, we need to reduce its dimension to a reasonable number, so as to get an idea about the underlying structure of the data.
Let us first lay out some notation and definitions. Suppose we have the data points ~x1, . . . , ~xn ∈ Rd. We organize these into a data matrix X where the data points form the rows:¹

$$X = \begin{bmatrix} \vec x_1^\top \\ \vdots \\ \vec x_n^\top \end{bmatrix} \in \mathbb R^{n \times d} \quad \text{so that} \quad X^\top = \begin{bmatrix} \vec x_1 & \cdots & \vec x_n \end{bmatrix} \in \mathbb R^{d \times n}. \tag{2.140}$$

We define the covariance matrix C ∈ Rd×d by

$$C = \frac 1n X^\top X = \frac 1n \begin{bmatrix} \vec x_1 & \cdots & \vec x_n \end{bmatrix} \begin{bmatrix} \vec x_1^\top \\ \vdots \\ \vec x_n^\top \end{bmatrix} = \frac 1n \sum_{i=1}^n \vec x_i \vec x_i^\top. \tag{2.141}$$

We see that C is symmetric since X>X is symmetric, so really C ∈ Sd.


Before we progress further, let us try to get some intuition for what we might want to be doing. Consider the case
d = 2 and p = 1. Suppose we have the dataset given as below.
[Figure: a two-dimensional scatter of data points lying close to a line.]

While this dataset is clearly fully two-dimensional, there is equally clearly some inherent 1-dimensional linear
structure to the data. So when we want to look for an underlying low-dimensional structure, we’re looking for something
like this. Here, if we could find the direction ~w as in below:
¹Different textbooks handle this differently. For instance, some textbooks define a data matrix as one where the data points form the columns. If
you’re unsure, work it out from first principles.


[Figure: the same scatter with the direction ~w drawn along the 1-dimensional structure of the data.]

And, if we are in generic Rd space, we want to find orthonormal vectors ~w1, . . . , ~wp such that projection onto them uncovers the underlying data structure. The process of accurately characterizing these ~wi is what we will discuss in what follows.
We begin with a motivating example. Consider the MNIST dataset of handwritten digits. Each image is a 28-pixel by 28-pixel grid, with each numerical entry in the grid denoting the greyscale value at that grid point. This can be represented by a 28 × 28 matrix, or alternatively unrolled into a 28² = 784-dimensional vector. It is impossible to directly visualize 784-dimensional space, so we seek to find ~w1, . . . , ~w8 ∈ R784 such that the projections onto the ~wi preserve a lot of structure. Say that we take ~wi = ~ei, where ~ei is the ith standard basis vector in R784. Then for most images, the projection onto the ~wi’s will be 0 or near-0. Thus, the projection of all of the data onto the ~wi preserves almost none of the structure and collapses all points in the dataset to just a few points in R8. There is instead a much more principled way to choose the ~wi that will preserve most of the structure.
We now discuss how to choose the first principal component ~w1 ∈ Rd. To preserve the structure of the underlying data as much as possible, we want the vectors ~xi projected onto the span of ~w1 to be as close as possible to the original vectors ~xi. We also want ‖~w1‖2 = 1. Thus, the error of the projection across all data points is

$$\operatorname{err}(\vec w_1) = \frac 1n \sum_{i=1}^n \left\| \vec x_i - \vec w_1 (\vec w_1^\top \vec x_i) \right\|_2^2.$$

Expanding, we have

$$\begin{aligned}
\operatorname{err}(\vec w_1) &= \frac 1n \sum_{i=1}^n \left\| \vec x_i - \vec w_1(\vec w_1^\top \vec x_i) \right\|_2^2 && (2.142) \\
&= \frac 1n \sum_{i=1}^n (\vec x_i - \vec w_1(\vec w_1^\top \vec x_i))^\top (\vec x_i - \vec w_1(\vec w_1^\top \vec x_i)) && (2.143) \\
&= \frac 1n \sum_{i=1}^n \left( \vec x_i^\top \vec x_i - \vec x_i^\top \vec w_1 (\vec w_1^\top \vec x_i) - (\vec w_1(\vec w_1^\top \vec x_i))^\top \vec x_i + (\vec w_1(\vec w_1^\top \vec x_i))^\top (\vec w_1(\vec w_1^\top \vec x_i)) \right) && (2.144) \\
&= \frac 1n \sum_{i=1}^n \left( \vec x_i^\top \vec x_i - 2(\vec x_i^\top \vec w_1)^2 + (\vec w_1^\top \vec w_1)(\vec w_1^\top \vec x_i)^2 \right) && (2.145) \\
&= \frac 1n \sum_{i=1}^n \left( \|\vec x_i\|_2^2 - 2(\vec x_i^\top \vec w_1)^2 + (\vec w_1^\top \vec x_i)^2 \right) && (2.146) \\
&= \frac 1n \sum_{i=1}^n \left( \|\vec x_i\|_2^2 - (\vec x_i^\top \vec w_1)^2 \right). && (2.147)
\end{aligned}$$


Now solving the principal components optimization problem gives

$$\begin{aligned}
\min_{\substack{\vec w_1 \in \mathbb R^d \\ \|\vec w_1\|_2 = 1}} \operatorname{err}(\vec w_1) &= \min_{\substack{\vec w_1 \in \mathbb R^d \\ \|\vec w_1\|_2 = 1}} \frac 1n \sum_{i=1}^n \left( \|\vec x_i\|_2^2 - (\vec x_i^\top \vec w_1)^2 \right) && (2.148) \\
&= \frac 1n \sum_{i=1}^n \|\vec x_i\|_2^2 + \min_{\substack{\vec w_1 \in \mathbb R^d \\ \|\vec w_1\|_2 = 1}} \frac 1n \sum_{i=1}^n \left( -(\vec x_i^\top \vec w_1)^2 \right) && (2.149) \\
&= \frac 1n \sum_{i=1}^n \|\vec x_i\|_2^2 - \max_{\substack{\vec w_1 \in \mathbb R^d \\ \|\vec w_1\|_2 = 1}} \frac 1n \sum_{i=1}^n (\vec x_i^\top \vec w_1)^2 && (2.150) \\
&= \frac 1n \sum_{i=1}^n \|\vec x_i\|_2^2 - \max_{\substack{\vec w_1 \in \mathbb R^d \\ \|\vec w_1\|_2 = 1}} \frac 1n \sum_{i=1}^n \vec w_1^\top \vec x_i \vec x_i^\top \vec w_1 && (2.151) \\
&= \frac 1n \sum_{i=1}^n \|\vec x_i\|_2^2 - \max_{\substack{\vec w_1 \in \mathbb R^d \\ \|\vec w_1\|_2 = 1}} \vec w_1^\top \left( \frac 1n \sum_{i=1}^n \vec x_i \vec x_i^\top \right) \vec w_1 && (2.152) \\
&= \frac 1n \sum_{i=1}^n \|\vec x_i\|_2^2 - \max_{\substack{\vec w_1 \in \mathbb R^d \\ \|\vec w_1\|_2 = 1}} \vec w_1^\top \left( \frac 1n X^\top X \right) \vec w_1 && (2.153) \\
&= \frac 1n \sum_{i=1}^n \|\vec x_i\|_2^2 - \max_{\substack{\vec w_1 \in \mathbb R^d \\ \|\vec w_1\|_2 = 1}} \vec w_1^\top C \vec w_1 && (2.154) \\
&= \frac 1n \sum_{i=1}^n \|\vec x_i\|_2^2 - \lambda_{\max}\{C\} && (2.155)
\end{aligned}$$

with the ~w1 achieving this bound being the eigenvector ~umax corresponding to the eigenvalue λmax{C}. Thus, the first principal component is exactly an eigenvector corresponding to the largest eigenvalue of the covariance matrix C = X>X/n.
This computation is a special case of the singular value decomposition, which is used in practice to compute the
PCA of a dataset; understanding this decomposition will allow us to neatly compute the other principal components
(i.e., second, third, fourth,...), as well.
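A minimal numpy sketch of this recipe (our own illustration, not from the original text; the synthetic dataset is made up for the demo) recovers a planted direction as the top eigenvector of C:

```python
import numpy as np

# Our own PCA sketch following the derivation above: the first principal
# component is a top eigenvector of C = X^T X / n (rows = data points).
rng = np.random.default_rng(3)
w_true = np.array([3.0, 1.0]) / np.sqrt(10.0)   # hidden 1-D direction
t = rng.standard_normal(500)                    # latent coordinates
X = np.outer(t, w_true) + 0.05 * rng.standard_normal((500, 2))

C = X.T @ X / X.shape[0]                        # covariance matrix
lam, U = np.linalg.eigh(C)                      # ascending eigenvalues
w1 = U[:, -1]                                   # eigenvector for lambda_max
print(np.abs(w1 @ w_true))                      # close to 1: w1 is +/- w_true
```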

2.6 Singular Value Decomposition

Definition 34 (SVD)
Let A ∈ Rm×n have rank r. A singular value decomposition (SVD) of A is a decomposition of the form

$$\begin{aligned}
A &= U \Sigma V^\top && (2.156) \\
&= \begin{bmatrix} U_r & U_{m-r} \end{bmatrix} \begin{bmatrix} \Sigma_r & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{bmatrix} \begin{bmatrix} V_r^\top \\ V_{n-r}^\top \end{bmatrix} && (2.157) \\
&= U_r \Sigma_r V_r^\top && (2.158) \\
&= \sum_{i=1}^r \sigma_i \vec u_i \vec v_i^\top, && (2.159)
\end{aligned}$$

where:

• U ∈ Rm×m, Ur ∈ Rm×r, Um−r ∈ Rm×(m−r), V ∈ Rn×n, Vr ∈ Rn×r, and Vn−r ∈ Rn×(n−r) are orthonormal matrices, where U = [Ur Um−r] has columns ~u1, . . . , ~um (left singular vectors) and V = [Vr Vn−r] has columns ~v1, . . . , ~vn (right singular vectors).

• Σr = diag(σ1, . . . , σr) ∈ Rr×r is a diagonal matrix with ordered positive entries σ1 ≥ · · · ≥ σr > 0 (singular values), and the zero matrices in $\Sigma = \begin{bmatrix} \Sigma_r & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{bmatrix}$ are shaped to ensure that Σ ∈ Rm×n.

Suppose that A is tall (so m > n) with full column rank n. Then the SVD looks like the following:

$$A = U \begin{bmatrix} \Sigma_n \\ 0_{(m-n)\times n} \end{bmatrix} V^\top. \tag{2.160}$$

On the other hand, if A is wide (so m < n) with full row rank m, then the SVD looks like the following:

$$A = U \begin{bmatrix} \Sigma_m & 0_{m \times (n-m)} \end{bmatrix} V^\top. \tag{2.161}$$

The last (summation) form of the SVD is called the dyadic SVD; this is because terms of the form ~p ~q> are called dyads, and the dyadic SVD expresses the matrix A as the sum of dyads.
All forms of the SVD are useful conceptually and computationally, depending on the problem we are working on.
We now discuss a method to construct the SVD. Suppose A ∈ Rm×n has rank r. We consider the symmetric
matrix A> A which has rank r and thus r nonzero eigenvalues, which are positive. We can order its eigenvalues as
λ1 ≥ · · · ≥ λr > λr+1 = · · · = λn = 0, say with corresponding orthonormal eigenvectors ~v1 , . . . , ~vn .
Then, for i ∈ {1, . . . , r}, we define σi := √λi > 0 and ~ui := A~vi/σi. This only gives us r vectors ~ui, but we need m of them to construct U ∈ Rm×m. To find the remaining ~ui we use Gram-Schmidt on the matrix [~u1 · · · ~ur I] ∈ Rm×(r+m), throwing out the r vectors whose projection residual onto previously processed vectors is 0.
More formally, we can write an algorithm:

Algorithm 2 Construction of the SVD.

function SVD(A ∈ Rm×n)
    r := rank(A)
    (λ1, ~v1), . . . , (λn, ~vn) := Eigenpairs(A>A)    ▷ λ1 ≥ · · · ≥ λr > λr+1 = · · · = λn = 0, and ~vi orthonormal.
    for i ∈ {1, . . . , r} do
        σi := √λi    ▷ σi > 0
        ~ui := A~vi/σi
    end for
    ~u1, . . . , ~ur, ~ur+1, . . . , ~um := GramSchmidt([~u1 · · · ~ur I])
    return {~u1, . . . , ~um}, {σ1, . . . , σr}, {~v1, . . . , ~vn}
end function

It’s clear that Algorithm 2 gives an orthonormal basis {~v1 , . . . , ~vn } for Rn that can be constructed into the orthonor-
mal V matrix, and that it gives singular values σ1 ≥ · · · ≥ σr > 0. We aim to show two things: the {~u1 , . . . , ~um } are
orthonormal, and that A = U ΣV > where U , Σ, and V are constructed using the returned vectors and scalars.


Proposition 35
In the context of Algorithm 2, {~u1 , . . . , ~um } is an orthonormal set.

Proof. From our invocation of Gram-Schmidt, {~ur+1 , . . . , ~um } is an orthonormal set which spans an orthogonal sub-
space to the span of {~u1 , . . . , ~ur }. Thus, we need to show that {~u1 , . . . , ~ur } are orthonormal.
Indeed, take 1 ≤ i < j ≤ r. Then since the ~vj are orthonormal eigenvectors of A>A, we have

$$\begin{aligned}
\vec u_i^\top \vec u_j &= \left( \frac{A \vec v_i}{\sigma_i} \right)^\top \left( \frac{A \vec v_j}{\sigma_j} \right) && (2.162) \\
&= \frac{\vec v_i^\top A^\top A \vec v_j}{\sigma_i \sigma_j} && (2.163) \\
&= \frac{\lambda_j \vec v_i^\top \vec v_j}{\sigma_i \sigma_j} && (2.164) \\
&= \frac{\lambda_j}{\sigma_i \sigma_j} \underbrace{\vec v_i^\top \vec v_j}_{=0} && (2.165) \\
&= 0. && (2.166)
\end{aligned}$$

On the other hand, for a specific i ∈ {1, . . . , r}, using that σi² = λi, we have

$$\begin{aligned}
\|\vec u_i\|_2^2 &= \left\| \frac{A \vec v_i}{\sigma_i} \right\|_2^2 && (2.167) \\
&= \left( \frac{A \vec v_i}{\sigma_i} \right)^\top \left( \frac{A \vec v_i}{\sigma_i} \right) && (2.168) \\
&= \frac{\vec v_i^\top A^\top A \vec v_i}{\sigma_i^2} && (2.169) \\
&= \frac{\lambda_i \vec v_i^\top \vec v_i}{\sigma_i^2} && (2.170) \\
&= \underbrace{\frac{\lambda_i}{\sigma_i^2}}_{=1} \underbrace{\vec v_i^\top \vec v_i}_{=1} && (2.171) \\
&= 1. && (2.172)
\end{aligned}$$

Thus the set {~u1, . . . , ~ur} is orthonormal, so the whole set {~u1, . . . , ~um} is orthonormal.

Proposition 36
In the context of Algorithm 2, we have A = U ΣV > .

Proof. By construction, we have

$$\begin{aligned}
A \vec v_i &= \sigma_i \vec u_i \quad \text{for all } i \in \{1, \dots, r\}, && (2.173) \\
A \vec v_i &= \vec 0 \quad \text{for all } i \in \{r+1, \dots, n\}. && (2.174)
\end{aligned}$$

This gives us

$$\begin{aligned}
A V &= A \begin{bmatrix} \vec v_1 & \cdots & \vec v_r & \vec v_{r+1} & \cdots & \vec v_n \end{bmatrix} && (2.175) \\
&= \begin{bmatrix} A \vec v_1 & \cdots & A \vec v_r & A \vec v_{r+1} & \cdots & A \vec v_n \end{bmatrix} && (2.176) \\
&= \begin{bmatrix} \sigma_1 \vec u_1 & \cdots & \sigma_r \vec u_r & \vec 0 & \cdots & \vec 0 \end{bmatrix} && (2.177) \\
&= \begin{bmatrix} U_r \Sigma_r & 0 \end{bmatrix}. && (2.178)
\end{aligned}$$

On the other hand, we have

$$\begin{aligned}
U \Sigma &= \begin{bmatrix} U_r & U_{m-r} \end{bmatrix} \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} && (2.179) \\
&= \begin{bmatrix} U_r \Sigma_r + U_{m-r} \cdot 0 & U_r \cdot 0 + U_{m-r} \cdot 0 \end{bmatrix} && (2.180) \\
&= \begin{bmatrix} U_r \Sigma_r & 0 \end{bmatrix}. && (2.181)
\end{aligned}$$

Thus AV = UΣ. Since V is orthonormal, V> = V−1, so A = UΣV>.
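To make Algorithm 2 concrete, here is a small numpy rendering (our own sketch, not part of the original text; the function name svd_via_gram and the use of a QR factorization to stand in for the Gram-Schmidt completion step are our choices, and production SVD routines avoid forming A>A for numerical reasons):

```python
import numpy as np

def svd_via_gram(A, tol=1e-10):
    """Sketch of Algorithm 2: build an SVD of A from the eigenpairs of A^T A."""
    m, n = A.shape
    lam, V = np.linalg.eigh(A.T @ A)      # ascending eigenvalues, orthonormal V
    lam, V = lam[::-1], V[:, ::-1]        # reorder so lambda_1 >= ... >= lambda_n
    r = int(np.sum(lam > tol))            # numerical rank
    sigma = np.sqrt(lam[:r])              # sigma_i = sqrt(lambda_i) > 0
    U_r = A @ V[:, :r] / sigma            # u_i = A v_i / sigma_i
    # Complete {u_1, ..., u_r} to an orthonormal basis of R^m; QR applied to
    # [U_r I] plays the role of Gram-Schmidt here.
    Q, _ = np.linalg.qr(np.hstack([U_r, np.eye(m)]))
    U = np.hstack([U_r, Q[:, r:]])
    Sigma = np.zeros((m, n))
    Sigma[:r, :r] = np.diag(sigma)
    return U, Sigma, V

A = np.random.default_rng(4).standard_normal((5, 3))
U, Sigma, V = svd_via_gram(A)
print(np.allclose(U @ Sigma @ V.T, A))      # True: A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(5)))      # True: U is orthonormal
```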

The SVD is not unique: the Gram-Schmidt process could have used any basis for Rm that wasn’t the columns
of I and still have been valid; if you had multiple eigenvectors of A> A with the same eigenvalue then the choice of
eigenvectors in the diagonalization would not be unique; and even if you didn’t have multiple eigenvectors with the
same eigenvalue, the eigenvectors would only be determined up to a sign change ~v 7→ −~v anyways. So there is a lot of
ambiguity in the construction, which reflects the non-uniqueness.
We now discuss the geometry of the SVD, especially how each component of the SVD acts on vectors. To do this
we will fix A ∈ R2×2 with SVD A = U ΣV > , and find the behavior of A~x for all ~x on the unit circle (i.e., with norm
1). We will analyze the behavior of A~x by using the behavior of V > ~x, ΣV > ~x and finally U ΣV > ~x. In the end, we will
interpret U as a rotation or reflection, Σ as a scaling, and V > as another rotation or reflection.
Before we start, let us discuss what different types of matrices look like as linear transformations. Consider our
friendly unit circle:

This unit circle consists of all vectors in R2 with norm 1.


Multiplying each vector in that circle by the same orthonormal matrix will not change the norm of any vector in
that circle, and thus will amount to nothing more than a rotation of the circle, thus not changing its shape.
[Figure: the unit circle mapped by an orthonormal matrix; the circle is unchanged.]


On the other hand, multiplying each vector in the unit circle by the same diagonal matrix will scale the vectors in
the coordinate directions. This means that the unit circle will be mapped, in general, to an axis-aligned ellipse.

[Figure: the unit circle mapped by a diagonal matrix to an axis-aligned ellipse.]

In general, matrices aren’t orthonormal or diagonal, and so they will both rotate and scale in various ways. This
means that the unit circle will be mapped to an ellipse which isn’t necessarily axis-aligned.

[Figure: the unit circle mapped by a generic matrix to an ellipse which is not axis-aligned.]

Let us now systematically study such A = U ΣV > through the lens of the unit circle, as well as where the right
singular vectors ~v1 and ~v2 are mapped by A.
Since V > is an orthonormal matrix, it represents a rotation and/or reflection, and so it maps the unit circle to the unit
circle, much like our observed figure. Specifically, the right singular vectors ~vi have V >~vi = ~ei , so they get mapped
onto the standard basis by V > . This gives the following picture.

[Figure: V> maps the unit circle to itself, sending ~v1 ↦ ~e1 and ~v2 ↦ ~e2.]

Now, the diagonal matrix Σ will scale each ~ei by σi , obtaining an ellipse.


[Figure: after V>, the matrix Σ scales ~e1 by σ1 and ~e2 by σ2, producing an axis-aligned ellipse.]

Finally, the orthonormal matrix U will map this axis-aligned ellipse to an ellipse which isn’t necessarily axis-
aligned. Specifically, the vectors that we’re looking at, i.e., σi~ei , have U (σi~ei ) = σi U~ei = σi ~ui . These ~ui will be the
axes of the resulting ellipse in the same sense as σi~ei were the axes of the axis-aligned ellipse.

[Figure: the full composition U ΣV> maps ~v1 ↦ σ1~u1 and ~v2 ↦ σ2~u2; the vectors σi~ui are the axes of the final ellipse.]

Recall that we originally started with a depiction that didn’t have any fine-grained description of any vectors, yet obtained the same result: a generic matrix maps the unit circle to an ellipse.

To understand the impact of A on any general vector ~x, we write it in the V basis: ~x = α1~v1 + α2~v2 , and use
linearity to obtain A~x = α1 σ1 ~u1 + α2 σ2 ~u2 . One can draw this graphically using scaled versions of the above ellipses.
This perspective also says that σ1 is the maximum scaling of any vector obtained by multiplication by A, and σr
is the minimum nonzero scaling. (If r < n, i.e., A is not full column rank, then there are some nonzero vectors in Rn
which are sent to ~0 by A, so the minimum scaling is 0.) You will formally prove this in homework.

2.7 Low-Rank Approximation


Sometimes, in real datasets, matrices have billions or trillions of entries. Storing all of them would be prohibitively
expensive, and we would need a way to compress them down into their most important parts. This turns out to be
doable via the SVD, as we will see.


To formally talk about a compression algorithm that stores a compressed version of the data with minimal error,
we need to talk about what kind of errors are appropriate to discuss in the context of matrices. In the case of vectors,
we can use the `2 -norm, or more generally the `p norm, to define a distance function; then, the error would just be the
distance between the true and the perturbed vectors. This motivates thinking about matrix norms, which allow us to
quantify the distance between matrices, and thus create error functions.
There are two ways to think about a matrix. The first way is as a block of numbers. Similarly to how we thought of
a vector as a block of numbers and found the norm based on this, we can think of the matrix as a big list of vectors and
take the norm. This norm is called the Frobenius norm, and it corresponds to unrolling an m × n matrix into a length
m · n vector and taking its `2 -norm.

Definition 37 (Frobenius Norm)


For a matrix A ∈ Rm×n, its Frobenius norm is defined as

$$\|A\|_F := \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2}. \tag{2.182}$$

The following property will help us when we study vector calculus.

Proposition 38
For a matrix A ∈ Rm×n, we have ‖A‖F² = tr(A>A).

The following pair of results will help us now.

Proposition 39
For a matrix A ∈ Rm×n and orthonormal matrices U ∈ Rm×m , V ∈ Rn×n , we have

kU AV kF = kU AkF = kAV kF = kAkF . (2.183)

We do not prove this property now, though it might be on homework.

Proposition 40
For a matrix A ∈ Rm×n with rank r and singular values σ1 ≥ · · · ≥ σr > 0, we have
For a matrix A ∈ Rm×n with rank r and singular values σ1 ≥ · · · ≥ σr > 0, we have

$$\|A\|_F^2 = \sum_{i=1}^r \sigma_i^2. \tag{2.184}$$

Proof. Let A = U ΣV> be the SVD of A; then, we use the previous proposition to get

$$\begin{aligned}
\|A\|_F^2 &= \|U \Sigma V^\top\|_F^2 && (2.185) \\
&= \|\Sigma\|_F^2 && (2.186) \\
&= \sum_{i=1}^r \sigma_i^2. && (2.187)
\end{aligned}$$


Under this perspective, kA − BkF is small if each component of A and B is close; that is, they are very similar in
terms of the block-of-numbers interpretation.
The second way to think about a matrix is as a linear transformation. In this case, the matrix is defined by how it
acts on vectors via multiplication. A suitable notion of size in this case is the largest scaling factor of the matrix on any
unit vector; this is called the spectral norm or the matrix `2 -norm.

Definition 41 (Spectral Norm)


For a matrix A ∈ Rm×n, its spectral norm is defined by

$$\|A\|_2 := \max_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \|A \vec x\|_2. \tag{2.188}$$

Fortunately, this optimization problem has a solution — what’s more, we’ve actually seen this solution before.

Proposition 42
For a matrix A ∈ Rm×n with rank r and singular values σ1 ≥ · · · ≥ σr > 0, we have

kAk2 = σ1 . (2.189)

Proof. We use the Rayleigh quotient to say that

$$\begin{aligned}
\|A\|_2 &= \max_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \|A \vec x\|_2 && (2.190) \\
&= \max_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \sqrt{\|A \vec x\|_2^2} && (2.191) \\
&= \sqrt{\max_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \|A \vec x\|_2^2} && (2.192) \\
&= \sqrt{\max_{\substack{\vec x \in \mathbb R^n \\ \|\vec x\|_2 = 1}} \vec x^\top A^\top A \vec x} && (2.193) \\
&= \sqrt{\lambda_{\max}\{A^\top A\}} && (2.194) \\
&= \sigma_1. && (2.195)
\end{aligned}$$

Connecting back to the ellipse transformations, the spectral norm captures how much the ellipse is stretched in its most stretched direction; here the matrix is viewed as a linear map.
To present our main theorems about how to approximate the matrix well under these norms, we need to define notation. Fix a matrix A ∈ Rm×n. For convenience, let p := min{m, n}. Suppose that A has rank r ≤ p, and that A has SVD

$$A = \sum_{i=1}^p \sigma_i \vec u_i \vec v_i^\top \tag{2.196}$$

where we note that σ1 ≥ · · · ≥ σr and define σr+1 = σr+2 = · · · = 0. Then, for k ≤ p, we can define

$$A_k := \sum_{i=1}^k \sigma_i \vec u_i \vec v_i^\top. \tag{2.197}$$


Note that if k ≪ p, then Ak can be stored much more efficiently than A. For instance, Ak needs to store k scalars (σs), k vectors in Rm (~us), and k vectors in Rn (~vs), for a total storage of k(m + n + 1) floats. On the other hand, storing A requires mn floats naively, and even storing the (full) SVD of A is not much better. So the former is much more efficient to store. The only thing left to do is to show that Ak actually is a good approximation of A (otherwise what would be the point of storing it instead of A?).
It turns out that Ak indeed well-approximates A in the sense of the two norms — the Frobenius norm and the
spectral norm — that we discussed previously. The two results are collectively known as the Eckart-Young (sometimes
Eckart-Young-Mirsky) theorem(s). We state and prove these now.
We begin with the Eckart-Young theorem for the spectral norm, since it will help us prove the analogous result for
Frobenius norms.

Theorem 43 (Eckart-Young Theorem for Spectral Norm)

We have

$$A_k \in \operatorname*{argmin}_{\substack{B \in \mathbb R^{m \times n} \\ \operatorname{rank}(B) \le k}} \|A - B\|_2, \tag{2.198}$$

or, equivalently,

$$\|A - A_k\|_2 \le \|A - B\|_2, \quad \forall B \in \mathbb R^{m \times n} : \operatorname{rank}(B) \le k. \tag{2.199}$$

Proof of Theorem 43. The proof of the Eckart-Young theorem for the spectral norm is split into two parts which together show the conclusion. First, we explicitly calculate ‖A − Ak‖2; then, for an arbitrary B of rank ≤ k, we show that ‖A − B‖2 is lower-bounded by this quantity.

Part 1. First, we calculate ‖A − Ak‖2. Indeed, we have

$$\begin{aligned}
A - A_k &= \left( \sum_{i=1}^p \sigma_i \vec u_i \vec v_i^\top \right) - \left( \sum_{i=1}^k \sigma_i \vec u_i \vec v_i^\top \right) && (2.200) \\
&= \sum_{i=k+1}^p \sigma_i \vec u_i \vec v_i^\top. && (2.201)
\end{aligned}$$

Since σk+1 ≥ σk+2 ≥ · · ·, the ~ui are orthonormal, and the ~vi are orthonormal, this is a valid singular value decomposition of the rank r − k matrix A − Ak. Thus A − Ak has largest singular value equal to σk+1, so

$$\|A - A_k\|_2 = \sigma_{k+1}. \tag{2.202}$$

Part 2. We now show that for any matrix B ∈ Rm×n of rank ≤ k, we have ‖A − B‖2 ≥ σk+1. Before we commence with the proof, let us first discuss the overall argument structure.

We have two characterizations of the spectral norm: the maximum singular value, and the maximal Rayleigh quotient. Up until now, we don’t have any advanced machinery such as singular value inequalities to directly show that the maximum singular value of A − B is lower-bounded by some constant. However, we do understand matrix-vector products, and how to characterize their optima, pretty well, so this motivates the use of the maximal Rayleigh quotient. In particular, suppose f(~x) = ‖(A − B)~x‖2/‖~x‖2. Then we know that for any specific value of ~x0, we have ‖A − B‖2 = max~x f(~x) ≥ f(~x0). Thus, one way to get a lower bound on ‖A − B‖2 is to plug in any ~x0 into f. The trick is, given B, to find the right value of ~x0 such that f(~x0) ≥ σk+1. The first part of the argument will be finding an appropriate ~x0; the next will show that it works.

Okay, let’s start with the proof.


Step 1. Here we show the existence of an appropriate choice for ~x0.

Because B has rank ≤ k, the rank-nullity theorem gives

$$\dim(N(B)) = n - \operatorname{rank}(B) \ge n - k > 0. \tag{2.203}$$

Now, define Vk+1 := [~v1, . . . , ~vk+1]. We know by the SVD that the ~vi are orthonormal, and so are linearly independent. Thus

$$\operatorname{rank}(V_{k+1}) = \dim(R(V_{k+1})) = k + 1. \tag{2.204}$$

Thus

$$\dim(N(B)) + \dim(R(V_{k+1})) \ge (n - k) + (k + 1) = n + 1 > n. \tag{2.205}$$

Claim. There exists a unit vector in N(B) ∩ R(Vk+1).

Proof. First, we note that since N(B) and R(Vk+1) are subspaces of Rn, their intersection is a subspace of Rn. Thus, it is closed under scalar multiplication. Thus the existence of a unit vector is equivalent to the existence of a nonzero vector, since one can scale this nonzero vector by its (inverse) norm to get the unit vector.

Suppose for the sake of contradiction that N(B) and R(Vk+1) have no nonzero vectors in common. Fix a basis S1 for N(B) and a basis S2 for R(Vk+1). By assumption, we have that S1 ∩ S2 = ∅, and S1 ∪ S2 is a linearly independent set. Here, there are two cases.

A. S1 ∪ S2 contains < n + 1 vectors. This means that S1 and S2 have a vector in common, so S1 ∩ S2 ≠ ∅ and we have reached a contradiction.

B. S1 ∪ S2 has at least n + 1 vectors. As a set of more than n vectors in Rn, S1 ∪ S2 is linearly dependent and we have reached a contradiction.

In both cases we have reached a contradiction, so there is a nonzero vector in N(B) ∩ R(Vk+1). We may scale this vector by its inverse norm to get a unit vector in N(B) ∩ R(Vk+1).

Let ~x0 ∈ N(B) ∩ R(Vk+1) be any such unit vector. We use this vector in f below.
Step 2. Now we show that our choice of ~x0 plugged into f gives σk+1 as a lower bound. Indeed, we have

$$\begin{aligned}
\|A - B\|_2 &= \max_{\vec x \neq \vec 0} \frac{\|(A - B)\vec x\|_2}{\|\vec x\|_2} && (2.206) \\
&\ge \frac{\|(A - B)\vec x_0\|_2}{\|\vec x_0\|_2} && (2.207) \\
&= \|(A - B)\vec x_0\|_2 && (2.208) \\
&= \|A \vec x_0 - B \vec x_0\|_2. && (2.209)
\end{aligned}$$

Since ~x0 ∈ N(B), we have B~x0 = ~0, so

$$\|A - B\|_2 \ge \|A \vec x_0 - B \vec x_0\|_2 = \|A \vec x_0\|_2. \tag{2.210}$$

Since ~x0 ∈ R(Vk+1), there are constants α1, . . . , αk+1, not all zero, such that $\vec x_0 = \sum_{i=1}^{k+1} \alpha_i \vec v_i$. Thus

$$\begin{aligned}
\|A - B\|_2 &\ge \|A \vec x_0\|_2 && (2.211) \\
&= \left\| A \sum_{i=1}^{k+1} \alpha_i \vec v_i \right\|_2 && (2.212) \\
&= \left\| \left( \sum_{i=1}^p \sigma_i \vec u_i \vec v_i^\top \right) \left( \sum_{i=1}^{k+1} \alpha_i \vec v_i \right) \right\|_2 && (2.213) \\
&= \left\| \sum_{i=1}^p \sum_{j=1}^{k+1} \sigma_i \alpha_j \vec u_i \vec v_i^\top \vec v_j \right\|_2. && (2.214)
\end{aligned}$$

By vector algebra, the fact that the ~ui are orthonormal, and the fact that the ~vi are orthonormal, one can mechanically show that

$$\begin{aligned}
\|A - B\|_2 &\ge \left\| \sum_{i=1}^p \sum_{j=1}^{k+1} \sigma_i \alpha_j \vec u_i \vec v_i^\top \vec v_j \right\|_2 && (2.215) \\
&= \left\| \sum_{i=1}^{k+1} \alpha_i \sigma_i \vec u_i \right\|_2 && (2.216) \\
&= \sqrt{\left\| \sum_{i=1}^{k+1} \alpha_i \sigma_i \vec u_i \right\|_2^2} && (2.217) \\
&= \sqrt{\sum_{i=1}^{k+1} \alpha_i^2 \sigma_i^2} && (2.218) \\
&\ge \sqrt{\sigma_{k+1}^2 \sum_{i=1}^{k+1} \alpha_i^2} && (2.219) \\
&= \sigma_{k+1} \sqrt{\sum_{i=1}^{k+1} \alpha_i^2} && (2.220) \\
&= \sigma_{k+1} \|\vec x_0\|_2 && (2.221) \\
&= \sigma_{k+1}. && (2.222)
\end{aligned}$$

This proves the claim.

We now state and prove the other result, which is the Eckart-Young theorem for the Frobenius norm.

Theorem 44 (Eckart-Young Theorem for Frobenius Norm)

We have

$$A_k \in \operatorname*{argmin}_{\substack{B \in \mathbb R^{m \times n} \\ \operatorname{rank}(B) \le k}} \|A - B\|_F, \tag{2.223}$$

or, equivalently,

$$\|A - A_k\|_F^2 \le \|A - B\|_F^2, \quad \forall B \in \mathbb R^{m \times n} : \operatorname{rank}(B) \le k. \tag{2.224}$$

Proof of Theorem 44. As in the spectral norm case, the proof is split into two parts which together show the conclusion. First, we explicitly calculate ‖A − Ak‖F²; then, for an arbitrary B of rank ≤ k, we show that ‖A − B‖F² is lower-bounded by this quantity.


Part 1. First, we calculate ‖A − Ak‖F². Indeed, we have

$$\begin{aligned}
A - A_k &= \left( \sum_{i=1}^p \sigma_i \vec u_i \vec v_i^\top \right) - \left( \sum_{i=1}^k \sigma_i \vec u_i \vec v_i^\top \right) && (2.225) \\
&= \sum_{i=k+1}^p \sigma_i \vec u_i \vec v_i^\top. && (2.226)
\end{aligned}$$

Since σk+1 ≥ σk+2 ≥ · · ·, the ~ui are orthonormal, and the ~vi are orthonormal, this is a valid singular value decomposition of the rank r − k matrix A − Ak. Thus the singular values of A − Ak are σk+1, . . . , σp (dropping the zero terms), so

$$\|A - A_k\|_F^2 = \sum_{i=k+1}^p \sigma_i^2. \tag{2.227}$$

Part 2. We now show that for any matrix B ∈ Rm×n of rank ≤ k, we have $\|A - B\|_F^2 \ge \sum_{i=k+1}^p \sigma_i^2$. Again, let us begin by discussing the overall argument structure.

Our goal is to show an inequality between two sums of squares of singular values. To do this, it is simplest to match up terms in the two sums and show the inequality between each pair of terms. Then summing over all terms preserves the inequality.

More concretely, define C := A − B. Suppose C has SVD $C = \sum_{i=1}^p \gamma_i \vec y_i \vec z_i^\top$. We want to show

$$\|A - B\|_F^2 = \sum_{i=1}^p \gamma_i^2 \underbrace{\ge}_{\text{still need to show}} \sum_{i=k+1}^p \sigma_i^2. \tag{2.228}$$

Thus, we claim that γi ≥ σi+k for every i ∈ {1, . . . , p}, with the understanding that σp+1 = σp+2 = · · · = 0. To use our earlier knowledge about the spectral norm to our advantage, we write the singular value γi as the spectral norm of the approximation error matrix:

$$\gamma_i = \|C - C_{i-1}\|_2. \tag{2.229}$$

We can expand to get

$$\gamma_i = \|C - C_{i-1}\|_2 = \|(A - B) - C_{i-1}\|_2 = \|A - (B + C_{i-1})\|_2. \tag{2.230}$$

Now this looks somewhat like the low-rank approximation setup, because we are computing the spectral norm of the difference between A and some matrix. But what is the rank of this matrix? We have

$$\operatorname{rank}(B + C_{i-1}) \le \operatorname{rank}(B) + \operatorname{rank}(C_{i-1}) \le k + i - 1. \tag{2.231}$$

Thus, applying the Eckart-Young theorem for the spectral norm to the rank i + k − 1 approximation of A gives

$$\gamma_i = \|A - (B + C_{i-1})\|_2 \ge \|A - A_{i+k-1}\|_2 = \sigma_{i+k}, \tag{2.232}$$

which is what we wanted to show.


To finish the proof, we use the following inequalities:

$$\begin{aligned}
\gamma_i^2 &\ge \sigma_{i+k}^2, \quad \forall i \in \{1, \dots, p\} && (2.233) \\
\sum_{i=1}^p \gamma_i^2 &\ge \sum_{i=1}^p \sigma_{i+k}^2 \quad \text{summing over all } i, && (2.234) \\
\|A - B\|_F^2 &\ge \sum_{i=k+1}^p \sigma_i^2 && (2.235)
\end{aligned}$$

as desired.
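The following numpy sketch (our own illustration, not part of the reader’s development) demonstrates both Eckart-Young theorems on a random matrix: the truncated SVD attains the stated errors, and a randomly chosen rank-k competitor does no better.

```python
import numpy as np

# Our own demo: truncating the SVD at rank k gives error sigma_{k+1} in the
# spectral norm and sqrt(sum_{i>k} sigma_i^2) in the Frobenius norm.
rng = np.random.default_rng(6)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]            # truncated SVD A_k
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))                     # True
print(np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2))))  # True

B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))  # random rank-k B
print(np.linalg.norm(A - B, 'fro') >= np.linalg.norm(A - A_k, 'fro'))   # True
```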

2.8 (OPTIONAL) Block Matrix Identities


In this section, we list many ways to manipulate block matrices. Since each fact in here is something you can derive
yourself using definitions (e.g. of matrix multiplication), you may use any of them without proof.

2.8.1 Transposes of Block Matrices

$$\begin{bmatrix} \vec x_1 & \cdots & \vec x_n \end{bmatrix}^\top = \begin{bmatrix} \vec x_1^\top \\ \vdots \\ \vec x_n^\top \end{bmatrix} \tag{2.236}$$

$$\begin{bmatrix} \vec x_1^\top \\ \vdots \\ \vec x_n^\top \end{bmatrix} = \begin{bmatrix} \vec x_1 & \cdots & \vec x_n \end{bmatrix}^\top \tag{2.237}$$

$$\begin{bmatrix} A & B \end{bmatrix}^\top = \begin{bmatrix} A^\top \\ B^\top \end{bmatrix} \tag{2.238}$$

$$\begin{bmatrix} A \\ B \end{bmatrix}^\top = \begin{bmatrix} A^\top & B^\top \end{bmatrix} \tag{2.239}$$

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^\top = \begin{bmatrix} A^\top & C^\top \\ B^\top & D^\top \end{bmatrix}. \tag{2.240}$$

2.8.2 Block Matrix Products


In the following, ~ei is the ith standard basis vector – it has a 1 in the ith coordinate and 0 in all other coordinates.

$$\begin{bmatrix} \vec x_1 & \cdots & \vec x_n \end{bmatrix} \begin{bmatrix} \vec y_1^\top \\ \vdots \\ \vec y_n^\top \end{bmatrix} = \sum_{i=1}^n \vec x_i \vec y_i^\top \tag{2.241}$$

$$\begin{bmatrix} \vec x_1^\top \\ \vdots \\ \vec x_n^\top \end{bmatrix} \begin{bmatrix} \vec y_1 & \cdots & \vec y_n \end{bmatrix} = \begin{bmatrix} \vec x_1^\top \vec y_1 & \cdots & \vec x_1^\top \vec y_n \\ \vdots & \ddots & \vdots \\ \vec x_n^\top \vec y_1 & \cdots & \vec x_n^\top \vec y_n \end{bmatrix} \tag{2.242}$$

$$\vec e_i^\top \begin{bmatrix} \vec x_1^\top \\ \vdots \\ \vec x_n^\top \end{bmatrix} = \vec x_i^\top \tag{2.243}$$

$$\begin{bmatrix} \vec x_1 & \cdots & \vec x_n \end{bmatrix} \vec e_i = \vec x_i \tag{2.244}$$

$$A \begin{bmatrix} \vec x_1 & \cdots & \vec x_n \end{bmatrix} = \begin{bmatrix} A \vec x_1 & \cdots & A \vec x_n \end{bmatrix} \tag{2.245}$$

$$\begin{bmatrix} \vec x_1^\top \\ \vdots \\ \vec x_n^\top \end{bmatrix} A = \begin{bmatrix} \vec x_1^\top A \\ \vdots \\ \vec x_n^\top A \end{bmatrix} \tag{2.246}$$

$$A \begin{bmatrix} B & C \end{bmatrix} = \begin{bmatrix} AB & AC \end{bmatrix} \tag{2.247}$$

$$\begin{bmatrix} A \\ B \end{bmatrix} C = \begin{bmatrix} AC \\ BC \end{bmatrix}. \tag{2.248}$$

2.8.3 Block Diagonal Matrices

$$\begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{bmatrix} \begin{bmatrix} \vec x_1^\top \\ \vec x_2^\top \\ \vdots \\ \vec x_n^\top \end{bmatrix} = \begin{bmatrix} d_1 \vec x_1^\top \\ \vdots \\ d_n \vec x_n^\top \end{bmatrix} \tag{2.249}$$

$$\begin{bmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_n \end{bmatrix} \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix} = \begin{bmatrix} A_1 B_1 \\ A_2 B_2 \\ \vdots \\ A_n B_n \end{bmatrix} \tag{2.250}$$

2.8.4 Quadratic Forms

$$\vec x^\top A \vec y = \sum_i \sum_j A_{ij} x_i y_j \tag{2.251}$$

$$\begin{bmatrix} \vec x \\ \vec y \end{bmatrix}^\top \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} \vec x \\ \vec y \end{bmatrix} = \vec x^\top A \vec x + \vec x^\top B \vec y + \vec y^\top C \vec x + \vec y^\top D \vec y. \tag{2.252}$$
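Since each identity can be checked mechanically, here is a quick numpy spot-check of the quadratic-form identity (2.252) (our own sketch, not part of the original text; np.block builds the block matrix):

```python
import numpy as np

# Our own numerical spot-check of identity (2.252).
rng = np.random.default_rng(7)
n = 3
A, B, C, D = (rng.standard_normal((n, n)) for _ in range(4))
x, y = rng.standard_normal(n), rng.standard_normal(n)

M = np.block([[A, B], [C, D]])
z = np.concatenate([x, y])
lhs = z @ M @ z
rhs = x @ A @ x + x @ B @ y + y @ C @ x + y @ D @ y
print(np.isclose(lhs, rhs))  # True
```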

Chapter 3

Vector Calculus

Relevant sections of the textbooks:

• [1] Appendix A.4.

3.1 Gradient, Jacobian, and Hessian


To motivate this section, we start with a familiar concept which should have been covered in the prerequisites: the
derivatives of a scalar function f : R → R which takes in scalar input and produces a scalar output. The derivative
quantifies the (instantaneous) rate of change of the function due to the change of its input. We recall the limit definition
of the derivative.

Definition 45 (Derivative for Scalar Functions)


Let f : R → R be differentiable. The derivative of f with respect to x is the function df/dx : R → R defined by

$$\frac{df}{dx}(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}. \tag{3.1}$$

The derivative of f has several other notations, including f′ and ḟ.


In this section, we aim to generalize the concept of derivatives beyond scalar functions. We will focus on two types
of functions:

1. Multivariate functions f : Rn → R which take a vector ~x ∈ Rn as input and produce a scalar f (~x) ∈ R as
output. Familiar examples of such functions include f (~x) = k~xkp and f (~x) = ~a> ~x.

2. Vector-valued functions ~f : Rn → Rm which take a vector ~x ∈ Rn as input and produce another vector ~f(~x) ∈ Rm as output. A familiar example of such functions is ~f(~x) = A~x.

One tool that allows us to compute derivatives of scalar functions is the chain rule, which describes the derivative
of the composition of two functions.

Theorem 46 (Chain Rule for Scalar Functions)


Let f : R → R and g : R → R be two differentiable scalar functions, and let h : R → R be defined as h(x) = f(g(x)) for all x ∈ R. Then h is differentiable, and

$$\frac{dh}{dx}(x) = \frac{df}{dg}(g(x)) \cdot \frac{dg}{dx}(x). \tag{3.2}$$

Here, we use some — perhaps unfamiliar — notation not previously introduced. In particular, the derivative df/dg(g(x)) is a little strange. When we write df/dg we take the derivative of f with respect to the output of g. In this case, we know that the input of f is exactly the output of g, so really df/dg takes the derivative of f with respect to its input. Thus we can more compactly write the chain rule using the f′ notation as

$$h'(x) = f'(g(x)) \cdot g'(x). \tag{3.3}$$

We will see in the next section that the chain rule can be generalized to settings of multivariate functions. To aid
our study of such generalizations, we will introduce here a computational graph perspective of the chain rule. At the
moment this graphical perspective looks trivial, but it will help us understand more complicated cases.

x → g → f → h(x)

Figure 3.1: Graphical depiction of the single-variable chain rule corresponding to h = f ◦ g.

In this computational graph, one computes the derivative by summing along all paths from x to h(x). There is only
one path, so we get
$$\frac{dh}{dx}(x) = \frac{df}{dg}(g(x)) \cdot \frac{dg}{dx}(x). \tag{3.4}$$

3.1.1 Partial Derivatives


For multivariate functions f : Rn → R, when we talk about the rate of change of the function with respect to its input,
we need to specify which input we are talking about. Partial derivatives quantify this and give us the rate of change
of the function due to the change of one of its inputs, say xi , while keeping all other inputs fixed. Keeping all but one
input fixed renders a scalar function, for which we know how to compute the derivative. To formalize this we introduce
the limit definition of the partial derivative.

Definition 47 (Partial Derivative)


Let f : Rn → R be differentiable. The partial derivative of f with respect to xi is the function ∂f/∂xi : Rn → R defined by

$$\frac{\partial f}{\partial x_i}(\vec x) = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(\vec x)}{h}, \tag{3.5}$$

or equivalently,

$$\frac{\partial f}{\partial x_i}(\vec x) = \lim_{h \to 0} \frac{f(\vec x + h \cdot \vec e_i) - f(\vec x)}{h}, \tag{3.6}$$

where ~ei is the ith standard basis vector.

This limit definition gives an alternative way of interpreting partial derivatives: ∂f/∂xi is the rate of change of the function along the direction of the standard basis vector ~ei.
In practice, we do not use the limit definition to compute regular derivatives. Similarly, we do not use the limit
definition to compute partial derivatives. The main way to compute partial derivatives uses the following tip.


Problem Solving Strategy 48. To compute the partial derivative ∂f/∂xi, pretend that all xj for j ≠ i are constants, then take the ordinary derivative in xi.

Example 49. Consider the function f(~x) = ~a>~x. Then

$$\begin{aligned}
\frac{\partial f}{\partial x_i}(\vec x) &= \frac{\partial}{\partial x_i} \vec a^\top \vec x && (3.7) \\
&= \frac{\partial}{\partial x_i} \sum_{j=1}^n a_j x_j && (3.8) \\
&= \frac{\partial}{\partial x_i} (a_1 x_1 + \cdots + a_n x_n) && (3.9) \\
&= a_i. && (3.10)
\end{aligned}$$

Let us consider the case where the input ~x is not an independent vector but rather depends on another variable t ∈ R, i.e., ~x : R → Rn is a function. In this case the function f(~x(t)) has one independent input, which is t. If we are interested in finding the derivative of f with respect to t, we can use a chain rule to do so.

Theorem 50 (Chain Rule For Multivariate Functions)


Let f : Rn → R and ~g : R → Rn be differentiable functions. Define the function h : R → R by h(x) = f (~g (x))
for all x ∈ R. Then h is differentiable and has derivative
$$\frac{dh}{dx}(x) = \sum_{i=1}^n \frac{\partial f}{\partial g_i}(\vec g(x)) \cdot \frac{dg_i}{dx}(x). \tag{3.11}$$

Here we again review this computational graph perspective. There are now n separate paths from x to h, like so:

x → (g1, . . . , gn) → f → h(x)

Figure 3.2: Graphical depiction of the multivariate chain rule corresponding to h = f ◦ ~g .

In this computational graph, one computes the derivative by summing across the n paths from x to h(x), obtaining

$$\frac{dh}{dx}(x) = \sum_{i=1}^n \frac{\partial f}{\partial g_i}(\vec g(x)) \cdot \frac{dg_i}{dx}(x). \tag{3.12}$$

Here again we use the notation ∂f/∂gi to denote the derivative of f with respect to the output of gi. Since the output of gi is really the ith input of f, this derivative is just the derivative of f with respect to its ith input.

3.1.2 Gradient
We will now use the definition of partial derivatives to introduce the gradient of multivariate functions.


Definition 51 (Gradient)
Let f : Rn → R be a differentiable function. The gradient of f is the function ∇f : Rn → Rn defined by
$$\nabla f(\vec x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\vec x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\vec x) \end{bmatrix}. \tag{3.13}$$

Note that the gradient is a column vector. The transpose of the gradient is (confusingly!) referred to as the derivative
of the function. We will now list two important geometric properties of the gradient. The first can be stated straight
away:

Proposition 52
Let ~x ∈ Rn . The gradient ∇f (~x) points in the direction of steepest ascent at ~x, i.e., the direction around ~x in
which f has the maximum rate of change. Furthermore, this rate of change is quantified by the norm k∇f (~x)k2 .

Proof. Let ~u ∈ Rn be a unit vector (i.e., representing an arbitrary direction in Rn). Using the Cauchy-Schwarz inequality we can write:

$$\begin{aligned}
\vec u^\top [\nabla f(\vec x)] &\le \|\vec u\|_2 \|\nabla f(\vec x)\|_2 && (3.14) \\
&= \|\nabla f(\vec x)\|_2, && (3.15)
\end{aligned}$$

so the maximum value that the expression can take is ‖∇f(~x)‖2. Now it remains to show that this value is attained for the choice ~u = ∇f(~x)/‖∇f(~x)‖2:

$$\begin{aligned}
\left( \frac{\nabla f(\vec x)}{\|\nabla f(\vec x)\|_2} \right)^\top [\nabla f(\vec x)] &= \frac{\|\nabla f(\vec x)\|_2^2}{\|\nabla f(\vec x)\|_2} && (3.16) \\
&= \|\nabla f(\vec x)\|_2. && (3.17)
\end{aligned}$$
To list the second property, first we need a quick definition.

Definition 53 (Level Set)


Let f : Rn → R be a function, and α ∈ R be a scalar.

• The α-level set of f is the set of points ~x such that f (~x) = α:

Lα (f ) = {~x ∈ Rn | f (~x) = α}. (3.18)

• The α-sublevel set of f is the set of points ~x such that f (~x) ≤ α:

L≤α (f ) = {~x ∈ Rn | f (~x) ≤ α}. (3.19)


• The α-superlevel set of f is the set of points ~x such that f (~x) ≥ α:

L≥α (f ) = {~x ∈ Rn | f (~x) ≥ α}. (3.20)

Proposition 54
Let ~x ∈ Rn and suppose f (~x) = α. Then ∇f (~x) is orthogonal to the hyperplane which is tangent at ~x to the
α-level set of f .

We illustrate the two properties through examples.

Example 55 (Gradient of the Squared `2 Norm). In this example we will compute and visualize the gradient of the function f(~x) = ‖~x‖2² where ~x ∈ R2.

" #
∂f
x)
∂x1 (~
∇f (~x) = ∂f
(3.21)
∂x2 x)
(~
" #
∂ 2
∂x1 (x1 + x22 )
= ∂ 2
(3.22)
∂x2 (x1 + x22 )
" #
2x1
= (3.23)
2x2
= 2~x. (3.24)

This function has a paraboloid-shaped graph, as shown in Figure 3.3a. Let us now find the α-level set of the function f for some constant α:

$$\begin{aligned}
\alpha &= f(\vec x) && (3.25) \\
&= x_1^2 + x_2^2. && (3.26)
\end{aligned}$$

For a given α ≥ 0, the α-level set is a circle centered at the origin which has radius √α. Now we evaluate the gradient at a few points on these level sets:

∇f (−1, 0) = (−2, 0) (3.27)


√ √ √ √
∇f (1/ 2, 1/ 2) = ( 2, 2) (3.28)
∇f (2, 0) = (4, 0) (3.29)
√ √ √ √
∇f (− 2, 2) = (−2 2, 2 2) (3.30)

We plot the level sets for α = 1 and α = 4 and visualize the gradient directions in Figure 3.3b.


Figure 3.3: Example 55. (a) Plot of the function f(~x) = ‖~x‖2². (b) The level sets of the function f(~x) = ‖~x‖2² = α for α = 1 and α = 4, and the gradients at some points along the level sets.

We make the following observations:

1. At each point on a given level set, the gradient vector at that point is orthogonal to the line tangent to the level
set at that point.

2. The length of the gradient vector increases as we move away from the origin. This means that the function gets
steeper in that direction (i.e., it changes more rapidly).

Example 56 (Gradient of Linear Function). In this example we will compute and comment on the gradient of the linear function f : Rn → R defined by f(~x) = ~a>~x where ~a ∈ Rn is fixed. We have

$$\begin{aligned}
f(\vec x) &= \vec a^\top \vec x, && (3.31) \\
\nabla f(\vec x) &= \begin{bmatrix} \frac{\partial f}{\partial x_1}(\vec x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\vec x) \end{bmatrix} && (3.32) \\
&= \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} && (3.33) \\
&= \vec a. && (3.34)
\end{aligned}$$

We make the following observations:

1. The α-level sets of this function are the hyperplanes given by all ~x such that ~a> ~x = α. These hyperplanes have
normal vector ~a. Notice that the normal vector (which is orthogonal to all vectors in the hyperplane) is exactly
the gradient vector.

2. The gradient is constant, meaning that the function has a constant rate of change everywhere.


Example 57 (Gradient of the Quadratic Form). Let A ∈ Rn×n. In this example we will compute the gradient of the quadratic function f : Rn → R defined by f(~x) = ~x>A~x. Indeed, we have

$$\begin{aligned}
f(\vec x) &= \vec x^\top A \vec x && (3.35) \\
&= \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j. && (3.36)
\end{aligned}$$

Now fix k ∈ {1, . . . , n}, and we find the partial derivative with respect to xk. We have

$$\begin{aligned}
f(\vec x) &= \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j \\
&= \sum_{j=1}^n A_{kj} x_k x_j + \sum_{\substack{i=1 \\ i \neq k}}^n \sum_{j=1}^n A_{ij} x_i x_j \\
&= \sum_{j=1}^n A_{kj} x_k x_j + \sum_{\substack{i=1 \\ i \neq k}}^n A_{ik} x_i x_k + \sum_{\substack{i=1 \\ i \neq k}}^n \sum_{\substack{j=1 \\ j \neq k}}^n A_{ij} x_i x_j \\
&= A_{kk} x_k^2 + \sum_{\substack{j=1 \\ j \neq k}}^n A_{kj} x_k x_j + \sum_{\substack{i=1 \\ i \neq k}}^n A_{ik} x_i x_k + \sum_{\substack{i=1 \\ i \neq k}}^n \sum_{\substack{j=1 \\ j \neq k}}^n A_{ij} x_i x_j \\
&= A_{kk} x_k^2 + x_k \sum_{\substack{j=1 \\ j \neq k}}^n A_{kj} x_j + x_k \sum_{\substack{i=1 \\ i \neq k}}^n A_{ik} x_i + \sum_{\substack{i=1 \\ i \neq k}}^n \sum_{\substack{j=1 \\ j \neq k}}^n A_{ij} x_i x_j.
\end{aligned}$$

Then, taking the derivatives, we have

$$\begin{aligned}
\frac{\partial f}{\partial x_k}(\vec x) &= \frac{\partial}{\partial x_k}\left( A_{kk} x_k^2 + x_k \sum_{\substack{j=1 \\ j \neq k}}^n A_{kj} x_j + x_k \sum_{\substack{i=1 \\ i \neq k}}^n A_{ik} x_i + \sum_{\substack{i=1 \\ i \neq k}}^n \sum_{\substack{j=1 \\ j \neq k}}^n A_{ij} x_i x_j \right) && (3.37) \\
&= 2 A_{kk} x_k + \sum_{\substack{j=1 \\ j \neq k}}^n A_{kj} x_j + \sum_{\substack{i=1 \\ i \neq k}}^n A_{ik} x_i && (3.38) \\
&= \left( A_{kk} x_k + \sum_{\substack{j=1 \\ j \neq k}}^n A_{kj} x_j \right) + \left( A_{kk} x_k + \sum_{\substack{i=1 \\ i \neq k}}^n A_{ik} x_i \right) && (3.39) \\
&= \sum_{j=1}^n A_{kj} x_j + \sum_{i=1}^n A_{ik} x_i && (3.40) \\
&= (A \vec x)_k + (A^\top \vec x)_k && (3.41) \\
&= ((A + A^\top) \vec x)_k. && (3.42)
\end{aligned}$$

Thus computing the gradient via stacking the partial derivatives gets

$$\begin{aligned}
\nabla f(\vec x) &= \begin{bmatrix} \frac{\partial f}{\partial x_1}(\vec x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\vec x) \end{bmatrix} && (3.43) \\
&= \begin{bmatrix} ((A + A^\top)\vec x)_1 \\ \vdots \\ ((A + A^\top)\vec x)_n \end{bmatrix} && (3.44) \\
&= (A + A^\top) \vec x. && (3.45)
\end{aligned}$$

That was a very involved computation! Luckily, in the near future, we will see ways to simplify the process of computing
gradients.
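A quick numerical sanity check of this result (our own sketch, assuming numpy, not part of the original text): central finite differences of f(~x) = ~x>A~x should match (A + A>)~x.

```python
import numpy as np

# Our own finite-difference check of Example 57.
rng = np.random.default_rng(8)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda v: v @ A @ v
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])            # central differences
print(np.allclose(grad_fd, (A + A.T) @ x, atol=1e-5))  # True
```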

3.1.3 Jacobian
We now have the tools to generalize the notion of derivatives to vector-valued functions f~ : Rn → Rm .

Definition 58 (Jacobian)
Let ~f : Rn → Rm be a differentiable function. The Jacobian of ~f is the function D~f : Rn → Rm×n defined as

$$D\vec f(\vec x) = \begin{bmatrix} \nabla f_1(\vec x)^\top \\ \vdots \\ \nabla f_m(\vec x)^\top \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(\vec x) & \cdots & \frac{\partial f_1}{\partial x_n}(\vec x) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\vec x) & \cdots & \frac{\partial f_m}{\partial x_n}(\vec x) \end{bmatrix}. \tag{3.46}$$

One big thing to note is that the Jacobian is different from the gradient! If f : Rn → R1 = R, then its Jacobian
Df : Rn → R1×n is a function which outputs a row vector. This row vector is the transpose of the gradient.
We can develop a nice and general chain rule with the Jacobian.

Theorem 59 (Chain Rule for Vector-Valued Functions)


Let f~ : Rp → Rm and ~g : Rn → Rp be differentiable functions. Let ~h : Rn → Rm be defined as ~h(~x) = f~(~g (~x))
for all ~x ∈ Rn . Then ~h is differentiable, and

D~h(~x) = [Df~(~g (~x))][D~g (~x)]. (3.47)

Here, as before, the notation Df~(~g (~x)) means that we compute Df~ and then evaluate it on the point ~g (~x).
This is a broad chain rule which we can apply to many problems, but we must always remember that for a function
f : Rn → R, its Jacobian is the transpose of its gradient. One typical chain rule we can derive from the general one
follows below.

Corollary 60. Let f : Rp → R and ~g : Rn → Rp be differentiable functions. Let h : Rn → R be defined as h(~x) =


f (~g (~x)) for all ~x ∈ Rn . Then h is differentiable, and

∇h(~x) = [D~g (~x)]> ∇f (~g (~x)). (3.48)

Example 61. In this example we will use the chain rule to compute the gradient of the function h(~x) = ‖A~x − ~y‖2². It can be written as h(~x) = f(~g(~x)) where f(~x) = ‖~x‖2² and ~g(~x) = A~x − ~y. We have that

$$D\vec g(\vec x) = A \tag{3.49}$$

(the proof is in discussion or homework), and also from earlier

$$\nabla f(\vec x) = 2 \vec x. \tag{3.50}$$


Thus applying the chain rule obtains

∇h(~x) = [D~g (~x)]> ∇f (~g (~x)) (3.51)


= 2A> (A~x − ~y ) (3.52)

as desired. This gradient matches what would be computed if we had used the componentwise partial derivatives.
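As a preview of how this gradient gets used later in the class, here is a small sketch (our own, assuming numpy; the step size choice is an assumption for this demo, not from the text) of gradient descent on h(~x) = ‖A~x − ~y‖2² using the gradient just derived:

```python
import numpy as np

# Our own demo: plain gradient descent with grad h = 2 A^T (Ax - y)
# converges to the least-squares solution.
rng = np.random.default_rng(9)
A = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

x = np.zeros(5)
eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)    # assumed step size, eta = 1/L
for _ in range(2000):
    x -= eta * 2 * A.T @ (A @ x - y)           # gradient step
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x, x_star, atol=1e-6))       # True: matches least squares
```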

Finally, we again revisit the computational graph perspective. Here, the graph looks like:

~x = (x1, . . . , xn) → (g1, . . . , gp) → (f1, . . . , fm) → ~h(~x)

Figure 3.4: Graphical depiction of the multivariate chain rule corresponding to ~h = f~ ◦ ~g .

To find the partial derivative ∂fk/∂xi(~x), one sums across all paths in the graph from xi to fk, i.e.,

$$\begin{aligned}
[D\vec h(\vec x)]_{k,i} &= \frac{\partial f_k}{\partial x_i}(\vec x) && (3.53) \\
&= \sum_{j=1}^p \frac{\partial f_k}{\partial g_j}(\vec g(\vec x)) \cdot \frac{\partial g_j}{\partial x_i}(\vec x) && (3.54) \\
&= \sum_{j=1}^p [D\vec f(\vec g(\vec x))]_{k,j} \cdot [D\vec g(\vec x)]_{j,i} && (3.55) \\
&= \{[D\vec f(\vec g(\vec x))][D\vec g(\vec x)]\}_{k,i}. && (3.56)
\end{aligned}$$

Here again we use the notation ∂fk/∂gj to denote the derivative of fk with respect to the output of gj. Since the output of gj is really the jth input of fk, this derivative is just the derivative of fk with respect to its jth input.

Example 62 (Neural Networks and Backpropagation). In this example, we will develop a basic understanding of what
deep neural networks are and how to train them via gradient-based optimization methods, such as those we will study
in this class.
The most basic class of neural networks are “multi-layer perceptrons,” or functions of the form f~ : Rn × Θ → Rp ,
where Θ is the so-called “parameter space”, and have the form

$$\vec f(\vec x, \boldsymbol\theta) = W^{(m)} \vec\sigma^{(m)}(W^{(m-1)}(\cdots \vec\sigma^{(1)}(W^{(0)} \vec x + \vec b^{(0)}) \cdots) + \vec b^{(m-1)}) + \vec b^{(m)} \tag{3.57}$$

$$\text{where } \boldsymbol\theta = (W^{(0)}, \vec b^{(0)}, W^{(1)}, \vec b^{(1)}, W^{(2)}, \vec b^{(2)}, \dots, W^{(m)}, \vec b^{(m)}) \in \Theta, \tag{3.58}$$

and ~σ (1) , . . . , ~σ (m) are “activation functions,” which we treat as generic differentiable nonlinear functions. Here θ is
bolded because it should not be thought of as a scalar, vector, or matrix, but rather as a collection of matrices and
vectors, an object we haven’t discussed so far in this class.
More precisely, if we define the functions (~z(i))_{i=0}^m and (~f(i))_{i=0}^m as

$$\vec z^{(0)}(\vec x, \boldsymbol\theta) = \vec f^{(0)}(\vec x, \boldsymbol\theta) = W^{(0)} \vec x + \vec b^{(0)} \tag{3.59}$$


$$\vec z^{(i)}(\vec x, \boldsymbol\theta) = \vec f^{(i)}(\vec z^{(i-1)}(\vec x, \boldsymbol\theta), \boldsymbol\theta) = W^{(i)} \vec\sigma^{(i)}(\vec z^{(i-1)}) + \vec b^{(i)}, \quad \forall i \in \{1, \dots, m\}, \tag{3.60}$$

then this formulation gives

$$\vec f(\vec x, \boldsymbol\theta) = \vec z^{(m)}(\vec x, \boldsymbol\theta) = \vec f^{(m)} \circ \cdots \circ \vec f^{(0)}(\vec x, \boldsymbol\theta). \tag{3.61}$$

When we say that we “train” this neural network, we mean just that we optimize θ to minimize some loss function
evaluated on the data. Given a (differentiable) loss function ` : Rp × Rp → R and a data point (~x, ~y ) ∈ Rn × Rp , we
evaluate the loss on this data point as `(f~(~x, θ), ~y ). The true “empirical loss” across a batch of q data points (~xi , ~yi ) is
$$L(\boldsymbol\theta) := \sum_{i=1}^q \ell(\vec f(\vec x_i, \boldsymbol\theta), \vec y_i), \tag{3.62}$$

and a “training algorithm” attempts to optimize θ to minimize L.


For now, we concentrate on the case where we have q = 1, i.e., we have a single data point (~x, ~y ). Here we have

L(θ) = `(f~(~x, θ), ~y ). (3.63)

We now explore how to compute ∇L(θ). In this situation, we are only differentiating with respect to θ, so we fix ~x
and ~y , which gives the following computational graph:¹

[Figure 3.5 shows the chain ~z(0)(~x, θ) → ~z(1)(~x, θ) → · · · → ~z(m)(~x, θ) → `(~f(~x, θ), ~y), with the parameters (W(i), ~b(i)) feeding into the ith stage.]

Figure 3.5: Computational graph corresponding to the multi-layer perceptron.

Since we have not defined gradients with respect to matrices (yet), we will focus on computing ∇~b(i) L(θ), i.e., the
gradient of the loss function L with respect to ~b(i) , as well as ∇~z(i) L(θ), i.e., the gradient of the loss function L with
respect to ~z(i) (~x, θ). They may be interpreted as sub-vectors (i.e., literally subsets of the entries) of the full gradient
∇L(θ) = ∇θ L(θ). This subscript notation is common in more involved problems which demand derivatives with
respect to multiple vector-valued variables.
We first compute ∇~z(m) L(θ). We have

∇~z(m) L(θ) = ∇~z(m) `(~z(m) , ~y ). (3.64)


¹Note that we did not break the graph down to matrix multiplications, vector additions, and applications of ~σ(i); we do this for clarity, because the full graph would be rather messy.


Indeed we cannot simplify this further, it being just the gradient of ℓ in its first argument.
Now we handle general i ∈ {1, . . . , m}. By chain rule we have the recursion:

∇~z(i−1) L(θ) = [D~z(i−1) L(θ)]>    (3.65)
= [{D~z(i) L(θ)}{D~z(i−1) ~z(i)(~x, θ)}]>    (3.66)
= [D~z(i−1) ~z(i)(~x, θ)]> [∇~z(i) L(θ)].    (3.67)

Now we want to compute D~z(i−1) ~z(i) . Recall that

~z(i) = W (i)~σ (i) (~z(i−1) ) + ~b(i) . (3.68)

For convenience, write


~u(i) = ~σ(i)(~z(i−1)), so that ~z(i) = W(i)~u(i) + ~b(i).    (3.69)

Now we compute derivatives, obtaining by the chain rule:

D~z(i−1) ~z(i) = [D~u(i) ~z(i) ][D~z(i−1) ~u(i) ] (3.70)


= W (i) · D~σ (i) (~z(i−1) ). (3.71)

This gives us
D~z(i−1) ~z(i) = W (i) · D~σ (i) (~z(i−1) ), (3.72)

and so
∇~z(i−1) L(θ) = [W (i) · D~σ (i) (~z(i−1) )]> [∇~z(i) L(θ)]. (3.73)

To see the true value of this computation, we now compute ∇~b(i) L(θ), for each i ∈ {0, . . . , m}. By another
application of chain rule, we obtain

∇~b(i) L(θ) = [D~b(i) L(θ)]>    (3.74)
= [{D~z(i) L(θ)}{D~b(i) ~z(i)}]>    (3.75)
= [{D~z(i) L(θ)} · I]>    (3.76)
= [D~z(i) L(θ)]>    (3.77)
= ∇~z(i) L(θ).    (3.78)

Now to do automatic differentiation by backpropagation, your favorite machine learning framework (such as Py-
Torch) computes all the gradients ∇~z(i) L(θ), starting at i = m and recursing until i = 0. Computing the gradients in
this order saves a lot of work, since each derivative is only computed once — this is the main idea of backpropagation.
Once these derivatives are computed, one can use them to compute ∇~b(i) L(θ). If we can also compute ∇W (i) L(θ),
then we can compute ∇θ L(θ), and thus will be able to optimize L via gradient-based optimization algorithms such as
gradient descent, as discussed later in the class. However, this computation will have to wait until slightly later when
we cover matrix calculus.

3.1.4 Hessian
So far, we have appropriately generalized the notion of first derivative to vector-valued functions, i.e., functions f~ : Rn →
Rm . We now turn to doing the same with second derivatives.
It turns out that for general vector-valued functions ~f, defining a second derivative is possible, but such an object will live in Rm×n×n and thus will be hard to work with using the linear algebra we have discussed in this class. However,


for multivariate functions f : Rn → R, defining this second derivative as a particular matrix becomes possible; this
matrix, called the Hessian, has great conceptual and computational importance.
Recall that to find the second derivative of a scalar-valued function, we merely take the derivative of the derivative.
Our notion of the gradient suffices as a first derivative; to take the derivative of this, we need to use the Jacobian.
Indeed, the Hessian is exactly the Jacobian of the gradient, and defined precisely below.

Definition 63 (Hessian)
Let f : Rn → R be twice differentiable. The Hessian of f is the function ∇2 f : Rn → Rn×n defined by
∇2f(~x) = D(∇f)(~x) =
[ ∂2f/∂x1^2(~x)     · · ·   ∂2f/∂xn∂x1(~x) ]
[       ...          ...          ...      ]    (3.79)
[ ∂2f/∂x1∂xn(~x)    · · ·   ∂2f/∂xn^2(~x)  ]

It turns out that under some mild conditions, the Hessian is symmetric; this is called Clairaut’s theorem.

Theorem 64 (Clairaut’s Theorem)


Let f : Rn → R be twice continuously differentiable, and fix ~x ∈ Rn . Then ∇2 f (~x) is a symmetric matrix, i.e.,
for every 1 ≤ i, j ≤ n we have
∂2f/∂xi∂xj (~x) = ∂2f/∂xj∂xi (~x).    (3.80)

The vast majority of functions we work with in this course are twice continuously differentiable, with some notable exceptions (e.g., the ℓ1 norm is not even once differentiable). Thus, in most cases, the Hessian is symmetric.

Example 65 (Hessian of Squared ℓ2 Norm). In this example we will compute the Hessian of the function f(~x) = k~xk2^2 where ~x ∈ R2. Recall the gradient of this function computed in Example 55 as

∇f(~x) = [ 2x1 ]
         [ 2x2 ]    (3.81)

The Hessian can then be computed as

∇2f(~x) = D(∇f)(~x) = [ 2  0 ]
                      [ 0  2 ]    (3.82)

Example 66. In this example we will compute the gradient and Hessian of the function h(~x) = log(1 + k~xk2^2). Let f : R → R be defined by f(x) = log(1 + x), and g : Rn → R be defined as g(~x) = k~xk2^2. Then Df = f′ is the derivative of f, and ∇g(~x) = 2~x by previous examples. By the Jacobian chain rule, we have

∇h(~x) = (Dh(~x))>    (3.83)
= (D(f ◦ g)(~x))>    (3.84)
= ([Df(g(~x))][Dg(~x)])>    (3.85)
= [Dg(~x)]>[Df(g(~x))]>    (3.86)
= [∇g(~x)][Df(g(~x))]    (3.87)
= [∇g(~x)][f′(g(~x))]    (3.88)
= 2~x / (1 + k~xk2^2).    (3.89)

For the Hessian, we have

∇2h(~x) = D(∇h)(~x)    (3.90)
= D( 2~x / (1 + k~xk2^2) ).    (3.91)

We compute this Jacobian, hence the desired Hessian, componentwise, and obtain

[∇2h(~x)]j,k = [ D( 2~x / (1 + k~xk2^2) ) ]j,k    (3.92)
= ∂/∂xk ( 2xj / (1 + k~xk2^2) )    (3.93)
= 2 · [ (1 + k~xk2^2) ∂xj/∂xk − xj · ∂(1 + k~xk2^2)/∂xk ] / (1 + k~xk2^2)^2    (3.94)
= 2 · [ (1 + k~xk2^2) ∂xj/∂xk − xj · ∂(k~xk2^2)/∂xk ] / (1 + k~xk2^2)^2    (3.95)
= 2 · [ (1 + k~xk2^2) ∂xj/∂xk − 2xj xk ] / (1 + k~xk2^2)^2    (3.96)
= −4xj xk / (1 + k~xk2^2)^2 + ( 2 / (1 + k~xk2^2) ) ∂xj/∂xk    (3.97)
= −4xj xk / (1 + k~xk2^2)^2 + { 2 / (1 + k~xk2^2), if j = k;  0, if j ≠ k }.    (3.98)

This gives

[∇2h(~x)]j,k = −4xj xk / (1 + k~xk2^2)^2, ∀ j ≠ k    (3.99)
[∇2h(~x)]jj = 2 / (1 + k~xk2^2) − 4xj^2 / (1 + k~xk2^2)^2, ∀ j.    (3.100)

We can write this using vectors as

∇2h(~x) = ( 2 / (1 + k~xk2^2) ) I − 4~x~x> / (1 + k~xk2^2)^2.    (3.101)
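Closed-form derivations like this are easy to sanity-check with finite differences. The short Python/NumPy sketch below is our own illustration (the reader itself prescribes no software); the helper names h, grad_h, hess_h, and num_jac are hypothetical:

import numpy as np

def h(x):
    return np.log(1 + x @ x)

def grad_h(x):
    # Closed form (3.89): 2x / (1 + ||x||_2^2)
    return 2 * x / (1 + x @ x)

def hess_h(x):
    # Closed form (3.101): (2/s) I - 4 x x^T / s^2, with s = 1 + ||x||_2^2
    s = 1 + x @ x
    return (2 / s) * np.eye(x.size) - 4 * np.outer(x, x) / s**2

def num_jac(f, x, eps=1e-6):
    # Central-difference Jacobian (works for scalar or vector outputs)
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=-1)

x = np.array([0.3, -1.2, 0.7])
print(np.allclose(num_jac(h, x), grad_h(x), atol=1e-6))       # gradient matches
print(np.allclose(num_jac(grad_h, x), hess_h(x), atol=1e-4))  # Hessian matches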

3.2 Taylor’s Theorems


In this section, we will introduce Taylor approximation and Taylor's theorem for the familiar scalar-function case. Then we will generalize the idea of Taylor approximation to multivariate functions. Taylor approximation is a tool for finding polynomial approximations of a function, using the value of the function at a point along with the values of its first, second, and higher-order derivatives there.

Definition 67 (Taylor Approximation)


Let f : R → R be a k-times continuously differentiable function, and fix x0 ∈ R. The k-th degree Taylor approximation around x0 is the function f̂k(·; x0) : R → R given by

f̂k(x; x0) = f(x0) + (1/1!)(df/dx)(x0) · (x − x0) + · · · + (1/k!)(dkf/dxk)(x0) · (x − x0)^k    (3.102)
= Σ_{i=0}^{k} (1/i!)(dif/dxi)(x0) · (x − x0)^i.    (3.103)

In particular, the first-order and second-order Taylor approximations of f around x0 are

f̂1(x; x0) = f(x0) + (df/dx)(x0) · (x − x0)    (3.104)
f̂2(x; x0) = f(x0) + (df/dx)(x0) · (x − x0) + (1/2)(d2f/dx2)(x0) · (x − x0)^2.    (3.105)
We will derive multivariable versions of these approximations later.

Example 68 (Taylor Approximation of Cubic Function). Let us approximate the function f(x) = x^3 around the fixed point x0 = 1 using Taylor approximations of different degrees.

f̂1(x; 1) = f(x0) + (df/dx)(x0) · (x − x0)    (3.106)
= x0^3 + 3x0^2 · (x − x0)    (3.107)
= 1^3 + 3 · 1^2 · (x − 1)    (3.108)
= 3(x − 1) + 1    (3.109)
= 3x − 2.    (3.110)

f̂2(x; 1) = f̂1(x; 1) + (1/2)(d2f/dx2)(x0) · (x − x0)^2    (3.111)
= 3x − 2 + 3 · 1 · (x − 1)^2    (3.112)
= 3x^2 − 3x + 1.    (3.113)

f̂3(x; 1) = f̂2(x; 1) + (1/6)(d3f/dx3)(x0) · (x − x0)^3    (3.114)
= 3x^2 − 3x + 1 + (x − 1)^3    (3.115)
= x^3.    (3.116)

We notice the following take-aways:

• The first-order Taylor approximation f̂1(·; x0) is the best linear approximation to f around x = x0 = 1. In particular, its graph is the tangent line to the graph of f at the point (x0, f(x0)) = (1, 1), as observed in Figure 3.6.

• The second-order Taylor approximation f̂2(·; x0) is the best quadratic approximation to f around x = x0 = 1. It is the parabola whose graph passes through the point (x0, f(x0)) = (1, 1), as observed in Figure 3.6, and it has the same first and second derivatives as f at x0. Using the intuition that the second derivative models curvature, we see that the second-order Taylor approximation captures the local curvature of the graph of the function. This intuition will be helpful later when discussing convexity.

• The third-degree Taylor approximation f̂3(·; x0) is the best cubic approximation to f; because f is itself a cubic function, the best cubic approximation is just f, and indeed we have f̂3(·; x0) = f.
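These approximations are also easy to tabulate numerically. A minimal Python/NumPy sketch (our own illustration; the function names are ours) evaluates the three approximations near x0 = 1 and shows the error shrinking faster for higher degrees:

import numpy as np

f  = lambda x: x**3
f1 = lambda x: 3*x - 2               # first-degree approximation (3.110)
f2 = lambda x: 3*x**2 - 3*x + 1      # second-degree approximation (3.113)
f3 = lambda x: x**3                  # third-degree approximation (3.116)

for x in [2.0, 1.5, 1.1, 1.01]:
    # By Taylor's theorem, the k-th error behaves like |x - 1|^(k+1) as x -> 1
    print(x, abs(f(x) - f1(x)), abs(f(x) - f2(x)), abs(f(x) - f3(x)))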


Figure 3.6: First and second degree Taylor approximations of the function f(x) = x^3.

Taylor approximation gives us the degree k polynomial that approximates the function f (x) around the fixed point
x = x0 . Taylor’s theorem quantifies the bounds for the error of this approximation.

Theorem 69 (Taylor’s Theorem)


Let f : R → R be a function which is k-times continuously differentiable, and fix x0 ∈ R. Then for all x ∈ R we have

f(x) = f̂k(x; x0) + o(|x − x0|^k)    (3.117)

where the term o(|x − x0|^k) (i.e., the remainder) denotes a function, say Rk(x; x0), such that

lim_{x→x0} Rk(x; x0) / |x − x0|^k = 0.    (3.118)

We use this remainder notation because we don’t really care about what it is precisely, only its limiting behavior as
x → x0 , and the little-o notation allows us to not worry too much about the exact form of the remainder.
This theorem certifies that the Taylor approximations f̂k are good approximations to f. Another way to write this result is generally more useful or simpler:

f(x + δ) = f(x) + (df/dx)(x) · δ + o(|δ|), where the first two terms are f̂1(x + δ; x)    (3.119)
= f(x) + (df/dx)(x) · δ + (1/2)(d2f/dx2)(x) · δ^2 + o(δ^2), where the first three terms are f̂2(x + δ; x)    (3.120)
= . . . .    (3.121)

We will never need to work quantitatively with the remainder in this course; we will usually write f ≈ f̂k and leave it at that.

3.2.1 Taylor Approximation of Multivariate Functions


Using the definitions we introduced for the gradient and Hessian of multivariate functions, we can generalize the idea
of Taylor’s approximation to these functions.


Definition 70 (Multivariate Taylor Approximations)


Let f : Rn → R and fix ~x0 ∈ Rn .

• If f is continuously differentiable, then its first-order Taylor approximation around ~x0 is the function f̂1(·; ~x0) : Rn → R given by

f̂1(~x; ~x0) = f(~x0) + [∇f(~x0)]>(~x − ~x0).    (3.122)

• If f is twice continuously differentiable, then its second-order Taylor approximation around ~x0 is the function f̂2(·; ~x0) : Rn → R given by

f̂2(~x; ~x0) = f(~x0) + [∇f(~x0)]>(~x − ~x0) + (1/2)(~x − ~x0)>[∇2f(~x0)](~x − ~x0).    (3.123)

The graph of the first-order Taylor approximation is the hyperplane tangent to the graph of f at the point (~x0 , f (~x0 )).
This hyperplane has normal vector ∇f (~x0 ).
We could define higher-order Taylor approximations f̂k, but to express them concisely would require generalizations
of matrices, called tensors. For example, the third derivative of a function f : Rn → R is a rank-3 tensor, i.e., an object
which lives in Rn×n×n . These are out of scope for this course, and anyways we will only need the first two derivatives.
We can also state an analogous Taylor’s theorem.

Theorem 71 (Taylor’s Theorem)


Let f : Rn → R be a function which is k-times continuously differentiable, and fix ~x0 ∈ Rn. Then for all ~x ∈ Rn we have

f(~x) = f̂k(~x; ~x0) + o(k~x − ~x0k2^k).    (3.124)

We can re-write this result in the following, more useful, way for k = 1 and k = 2:

f(~x + ~δ) = f(~x) + [∇f(~x)]>~δ + o(k~δk2), where the first two terms are f̂1(~x + ~δ; ~x)    (3.125)
= f(~x) + [∇f(~x)]>~δ + (1/2)~δ>[∇2f(~x)]~δ + o(k~δk2^2), where the first three terms are f̂2(~x + ~δ; ~x)    (3.126)
= . . . .    (3.127)

Example 72 (Taylor Approximation of the Squared ℓ2 Norm). In this example we will compute and visualize the first and second degree Taylor approximations of the squared ℓ2 norm function f(~x) = k~xk2^2 for ~x ∈ R2 around the vector ~x = ~x0. First recall the gradient and Hessian of the function, which are computed in Examples 55 and 65, respectively.

• First degree approximation:

f̂1(~x; ~x0) = f(~x0) + [∇f(~x0)]>(~x − ~x0)    (3.128)
= k~x0k2^2 + [2~x0]>(~x − ~x0)    (3.129)
= 2~x0>~x − k~x0k2^2.    (3.130)


Recall that the graph of f(~x) has a paraboloid shape. Let us now evaluate and visualize the first degree approximation of the function around ~x0 = (1, 0):

f̂1(~x; ~x0) = 2x1 − 1.    (3.131)

We plot this function in Figure 3.7 and notice that the graph of the first order approximation is the plane tangent to the paraboloid at the point (1, 0, f(1, 0)) = (1, 0, 1).

• Second degree approximation:

f̂2(~x; ~x0) = f̂1(~x; ~x0) + (1/2)(~x − ~x0)>[∇2f(~x0)](~x − ~x0)    (3.132)
= 2~x0>~x − k~x0k2^2 + (1/2)(~x − ~x0)>[2I](~x − ~x0)    (3.133)
= 2~x0>~x − ~x0>~x0 + (~x − ~x0)>(~x − ~x0)    (3.134)
= 2~x0>~x − ~x0>~x0 + ~x>~x − ~x0>~x − ~x>~x0 + ~x0>~x0    (3.135)
= 2~x0>~x − ~x0>~x0 + ~x>~x − 2~x0>~x + ~x0>~x0    (3.136)
= ~x>~x    (3.137)
= k~xk2^2.    (3.138)

Thus f̂2 = f independently of the choice of ~x0, which makes sense since f is a quadratic function.

Figure 3.7: First degree Taylor approximation of the function f(~x) = k~xk2^2.

Example 73. We can also compute gradients using Taylor’s theorem by pattern matching; this is sometimes much
neater than taking componentwise gradients. At first glance this seems circular, but we will see how it is possible. Take
for example the function f : Rn → R given by f (~x) = ~x> A~x. We can perturb f around ~x to obtain

f(~x + ~δ) = (~x + ~δ)>A(~x + ~δ)    (3.139)
= ~x>A~x + ~δ>A~x + ~x>A~δ + ~δ>A~δ    (3.140)
= f(~x) + (~x>A> + ~x>A)~δ + ~δ>A~δ    (3.141)
= f(~x) + ((A + A>)~x)>~δ + (1/2)~δ>(A + A>)~δ.    (3.142)

However, Taylor's theorem tells us that

f(~x + ~δ) = f(~x) + [∇f(~x)]>~δ + (1/2)~δ>[∇2f(~x)]~δ + o(k~δk2^2).    (3.143)

By pattern matching, we see that ∇f(~x) = (A + A>)~x (as obtained in a previous example), ∇2f(~x) = A + A>, and the remainder term is 0.
One final note is that we changed 2A → A + A> in Equation (3.142); this is to ensure that the Hessian is symmetric, which is a consequence of Theorem 64; and we are able to do this because

~δ>(2A)~δ = ~δ>A~δ + ~δ>A~δ = ~δ>A~δ + (~δ>A~δ)> = ~δ>A~δ + ~δ>A>~δ = ~δ>(A + A>)~δ.    (3.144)
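The pattern-matched gradient can be checked numerically as well. A minimal sketch (Python/NumPy, using a deliberately non-symmetric A; the names are ours) compares (A + A>)~x to a central-difference approximation:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))     # deliberately not symmetric
x = rng.standard_normal(4)
f = lambda z: z @ A @ z             # f(x) = x^T A x

eps = 1e-6
I = np.eye(4)
num_grad = np.array([(f(x + eps*I[i]) - f(x - eps*I[i])) / (2*eps)
                     for i in range(4)])
print(np.allclose(num_grad, (A + A.T) @ x, atol=1e-6))   # True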

We conclude by introducing a more general version of a first-order Taylor approximation, a corresponding Taylor’s
theorem, and giving an example of when it is useful.

Definition 74 (Vector-Valued Taylor Approximation)
Let ~f : Rn → Rm and fix ~x0 ∈ Rn. If ~f is continuously differentiable, then its first-order Taylor approximation around ~x0 is the function ~f̂1 : Rn → Rm given by

~f̂1(~x; ~x0) = ~f(~x0) + [D~f(~x0)](~x − ~x0).    (3.145)

Again, higher-order approximations will require higher-order derivatives, which require tensors.

Theorem 75 (Vector-Valued Taylor’s Theorem)


Let f~ : Rn → Rm be a continuously differentiable function, and fix ~x0 ∈ Rn . Then for all ~x ∈ Rn we have
ˆ
f~(~x) = f~1 (~x; ~x0 ) + ~o(k~x − ~x0 k2 ). (3.146)

We can again re-write this result in a more workable form:

f~(~x + ~δ) = f~(~x) + [Df~(~x)]~δ + o(k~δk2 ). (3.147)

Example 76. Taylor’s theorem can be used to compute gradients by pattern matching, even when the function is
not linear or quadratic. For instance, we now use it to derive the chain rule (albeit with stronger assumptions on the
functions). Let f~ : Rp → Rm and ~g : Rn → Rp be continuously differentiable. Let ~h : Rn → Rm be defined as
~h(~x) = f~(~g (~x)). Then we compute ~h on a perturbation around ~x and expand:

~h(~x + ~δ) = f~(~g (~x + ~δ)) (3.148)


≈ f~(~g (~x) + [D~g (~x)]~δ)) (3.149)
≈ f~(~g (~x)) + [Df~(~g (~x))]([D~g (~x)]~δ)) (3.150)
≈ f~(~g (~x)) + [Df~(~g (~x))][D~g (~x)]~δ. (3.151)

The first Taylor expansion is an expansion of ~g around the point ~x with perturbation ~δ; the second Taylor expansion is an expansion of ~f around the point ~g(~x) with perturbation [D~g(~x)]~δ.
Meanwhile, Taylor’s theorem says that

~h(~x + ~δ) ≈ ~h(~x) + [D~h(~x)]~δ. (3.152)


Thus by pattern matching we find that


D~h(~x) = [Df~(~g (~x))][D~g (~x)] (3.153)

which is precisely the chain rule!


Note that here we did not invoke the little-o notation because it turns out to be quite messy, but it is indeed possible
to do the required rigorous manipulations and get the same result.

As a last practical note, remembering the formula for Taylor approximations helps us confirm our understanding of
the dimensions of each vector. For instance, every term should multiply to a scalar. This makes it simpler to remember
that, for a function f : Rn → R, the gradient ∇f outputs column vectors in Rn , the Hessian ∇2 f outputs square
matrices in Rn×n , etc.

3.3 The Main Theorem


In this section, we will use the concepts introduced in the previous sections to state and prove one of the fundamental
ideas in optimization.

Theorem 77 (The Main Theorem [4])
Let f : Rn → R be a differentiable function, and let Ω ⊆ Rn be an open set.a Consider the optimization problem

min_{~x∈Ω} f(~x).    (3.154)

Let ~x? be a solution to this optimization problem. Then

∇f(~x?) = ~0.    (3.155)

a “Open sets” are analogous to open intervals (a, b) — i.e., not containing boundary points.

This theorem gives a necessary condition for a point to be an optimal solution of this optimization problem. It says
that any point that is optimal must necessarily have gradient equal to zero.

Proof. We prove this for scalar functions f : R → R only; the vector case is a bit more complicated and is left as an exercise.
Using the Taylor approximation of the function around the optimal point:

f(x) = f(x?) + (df/dx)(x?) · (x − x?) + o(|x − x?|).    (3.156)

Since f(x?) ≤ f(x) for all x ∈ Ω, we have

f(x) ≤ f(x) + (df/dx)(x?) · (x − x?) + o(|x − x?|)    (3.157)
⟹ 0 ≤ (df/dx)(x?) · (x − x?) + o(|x − x?|).    (3.158)

Since Ω is an open set, there exists some ball of positive radius r > 0 around x? such that Br(x?) ⊆ Ω. Formally,

Br(x?) = {x ∈ R | |x − x?| ≤ r}.    (3.159)

Let us partition Br(x?) into B+, the set of all x ∈ Br(x?) such that x − x? ≥ 0, and B−, the set of all x ∈ Br(x?) such that x − x? < 0.
For all x ∈ B+, we have

0 ≤ (df/dx)(x?) · (x − x?) + o(|x − x?|)    (3.160)
= (df/dx)(x?) · |x − x?| + o(|x − x?|)    (3.161)
⟹ 0 ≤ (df/dx)(x?) + o(|x − x?|) / |x − x?|.    (3.162)

Taking the limit as x → x? within B+, we have

0 ≤ lim_{x→x?, x∈B+} [ (df/dx)(x?) + o(|x − x?|) / |x − x?| ]    (3.163)
= (df/dx)(x?) + lim_{x→x?, x∈B+} o(|x − x?|) / |x − x?|    (3.164)
= (df/dx)(x?).    (3.165)

Thus we have 0 ≤ (df/dx)(x?). On the other hand, for all x ∈ B−, we have

0 ≤ (df/dx)(x?) · (x − x?) + o(|x − x?|)    (3.166)
= −(df/dx)(x?) · |x − x?| + o(|x − x?|)    (3.167)
⟹ 0 ≥ (df/dx)(x?) − o(|x − x?|) / |x − x?|.    (3.168)

Taking the limit x → x? within B−, we have

0 ≥ lim_{x→x?, x∈B−} [ (df/dx)(x?) − o(|x − x?|) / |x − x?| ]    (3.169)
= (df/dx)(x?) − lim_{x→x?, x∈B−} o(|x − x?|) / |x − x?|    (3.170)
= (df/dx)(x?).    (3.171)

Thus we have 0 ≥ (df/dx)(x?), and so (df/dx)(x?) = 0.

3.4 Directional Derivatives


Recall the definition of the partial derivative of a multivariate function (Definition 47), which represents the rate of
change of the function f (~x) along one of the standard basis vectors. We do not need to restrict our treatment to the
standard basis vectors; in fact, we can compute the rate of change of the function in any arbitrary direction. This is
called the directional derivative.

Definition 78 (Directional Derivative)
Let f : Rn → R be differentiable, and fix ~u ∈ Rn such that k~uk2 = 1. The directional derivative of f along ~u is the function Df(·)[~u] : Rn → R defined by

Df(~x)[~u] = lim_{h→0} ( f(~x + h · ~u) − f(~x) ) / h.    (3.172)

If we know the directional derivative in any direction, we know the gradient; similarly, if we know the gradient, we
know the directional derivative. The way to connect the two is given by the following proposition, whose proof is left
as an exercise.

Proposition 79
Let f : Rn → R be differentiable, and fix ~u ∈ Rn such that k~uk2 = 1. Then

Df (~x)[~u] = ~u> [∇f (~x)]. (3.173)

In particular, Df(~x)[~ei] = (∂f/∂xi)(~x).
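Proposition 79 is likewise simple to verify numerically; a sketch in Python/NumPy (our own illustration), using the function from Example 66 as the test function:

import numpy as np

f = lambda x: np.log(1 + x @ x)
grad_f = lambda x: 2 * x / (1 + x @ x)   # from Example 66

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
u = rng.standard_normal(3)
u /= np.linalg.norm(u)                   # unit-norm direction

h = 1e-6
directional = (f(x + h*u) - f(x - h*u)) / (2*h)   # Definition 78 (symmetrized)
print(np.isclose(directional, u @ grad_f(x)))     # Proposition 79: True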

3.5 (OPTIONAL) Matrix Calculus


So far we have only discussed derivatives of three types of functions, all of which have scalars or vectors as their inputs and outputs. We can think of a more general class of functions that also involve matrices. In this section, we will generalize the idea of derivatives to such functions. We will focus our attention on functions of the form f : Rm×n → R, which take a matrix X ∈ Rm×n as input and produce a scalar f(X) as output. Familiar examples of such functions include matrix norms, the determinant, and the trace.

Definition 80 (Gradient)
Let f : Rm×n → R be differentiable. The gradient of f is the function ∇f : Rm×n → Rm×n which is defined as

∇f(X) = [ ∂f/∂X11(X)   · · ·   ∂f/∂X1n(X) ]
        [     ...       ...        ...    ]    (3.174)
        [ ∂f/∂Xm1(X)   · · ·   ∂f/∂Xmn(X) ]

There exists a general chain rule for matrix-valued functions, which is provable by flattening out all matrices into
vectors and applying the vector chain rule.

Theorem 81 (Chain Rule)
Let F : Rp×q → Rr×s and G : Rm×n → Rp×q be differentiable functions. Let H : Rm×n → Rr×s be defined by H(X) = F(G(X)) for all X ∈ Rm×n. Then H is differentiable, and for all i, j, k, ℓ, we have

∂Hij/∂Xkℓ (X) = Σ_a Σ_b ∂Fij/∂Gab (G(X)) · ∂Gab/∂Xkℓ (X).    (3.175)

As before, the notation ∂Fij/∂Gab means to take the derivative of the ij-th output of F by its ab-th input. A more specific version of this chain rule is given below for functions f : Rm×n → R.


Proposition 82
Let f : Rp×q → R and G : Rm×n → Rp×q be differentiable functions. Let h : Rm×n → R be defined by h(X) = f(G(X)) for all X ∈ Rm×n. Then h is differentiable, and for all k, ℓ, we have

∂h/∂Xkℓ (X) = Σ_a Σ_b [∇f(G(X))]ab · ∂Gab/∂Xkℓ (X).    (3.176)

We also are able to define a first-order Taylor expansion without having to use tensor notation.

Definition 83 (Matrix Taylor Approximation)
Let f : Rm×n → R and fix X0 ∈ Rm×n. If f is continuously differentiable, then its first-order Taylor approximation around X0 is the function f̂1(·; X0) : Rm×n → R given by

f̂1(X; X0) = f(X0) + tr( [∇f(X0)]>(X − X0) ).    (3.177)
There is a corresponding Taylor’s theorem certifying the Taylor approximation accuracy, but we don’t state it here.
Finally, note that the general recipe for computing all quantities such as the gradient, Jacobian, and gradient matrix
is the same: consider each input component and each output component separately and organize their partial derivatives
in vector or matrix form with a standard layout.

Example 84 (Finishing Example 62). Now that we know how to take matrix-valued gradients, we complete the example of Neural Networks and Backpropagation. Before reading the following, please review the lengthy setup of this example.
We promised in this example a way to compute ∇θ L(θ), or more precisely a way to compute ∇W(i) L(θ). We now have the tools to do this using the chain rule. Recall that we have access to ∇~z(i) L(θ) by backpropagation. Then we can compute the components of ∇W(i) L(θ) by

[∇W(i) L(θ)]j,k = ∂L/∂(W(i))j,k (θ)    (3.178)
= Σ_a [∇~z(i) L(θ)]a · ∂(~z(i))a/∂(W(i))j,k    (3.179)
= Σ_a [∇~z(i) L(θ)]a · { [~σ(i)(~z(i−1))]k, if a = j and i ∈ {1, . . . , m};  xk, if a = j and i = 0;  0, otherwise }    (3.180)
= { [∇~z(i) L(θ)]j [~σ(i)(~z(i−1))]k, if i ∈ {1, . . . , m};  [∇~z(i) L(θ)]j [~x]k, if i = 0 }    (3.181)
= { [{∇~z(i) L(θ)}~σ(i)(~z(i−1))>]j,k, if i ∈ {1, . . . , m};  [{∇~z(i) L(θ)}~x>]j,k, if i = 0. }    (3.182)

This gives

∇W(i) L(θ) = { [∇~z(i) L(θ)]~σ(i)(~z(i−1))>, if i ∈ {1, . . . , m};  [∇~z(i) L(θ)]~x>, if i = 0. }    (3.183)

In combination with the expression for ∇~b(i) from Example 62, we can efficiently compute ∇θ L(θ), and are able to train our neural network via gradient-based optimization methods such as gradient descent.
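The recursions derived in Example 62 and completed here translate directly into code. The following Python/NumPy sketch is our own minimal implementation (not PyTorch's internals): it runs a forward pass storing ~z(0), . . . , ~z(m), backpropagates ∇~z(i) L via Equation (3.73), and reads off ∇~b(i) L and ∇W(i) L via Equations (3.78) and (3.183). We assume the squared-error loss ℓ(~z, ~y) = k~z − ~yk2^2 and tanh activations; a finite-difference check on one weight entry is included.

import numpy as np

rng = np.random.default_rng(0)
sigma  = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z)**2            # elementwise derivative

# A small MLP with m = 2: layer widths 3 -> 4 -> 4 -> 2
Ws = [rng.standard_normal((4, 3)),
      rng.standard_normal((4, 4)),
      rng.standard_normal((2, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(2)]
x, y = rng.standard_normal(3), rng.standard_normal(2)

def forward(Ws, bs, x):
    # Forward pass, storing z^(0), ..., z^(m) as in Equations (3.59)-(3.60)
    zs = [Ws[0] @ x + bs[0]]
    for W, b in zip(Ws[1:], bs[1:]):
        zs.append(W @ sigma(zs[-1]) + b)
    return zs

zs = forward(Ws, bs, x)
m = len(Ws) - 1

# Backward pass for L(theta) = ||z^(m) - y||_2^2
gz = 2 * (zs[-1] - y)                             # nabla_{z^(m)} L
gW, gb = [None] * (m + 1), [None] * (m + 1)
for i in range(m, -1, -1):
    gb[i] = gz                                    # Equation (3.78)
    inp = sigma(zs[i - 1]) if i >= 1 else x
    gW[i] = np.outer(gz, inp)                     # Equation (3.183)
    if i >= 1:
        # Equation (3.73); D sigma^(i)(z^(i-1)) is diagonal, so scaling the
        # columns of W^(i) by dsigma(z^(i-1)) implements W^(i) D sigma^(i)
        gz = (Ws[i] * dsigma(zs[i - 1])).T @ gz

# Finite-difference check of one weight gradient
eps, (j, k) = 1e-6, (1, 2)
Ws[0][j, k] += eps;     lp = np.sum((forward(Ws, bs, x)[-1] - y)**2)
Ws[0][j, k] -= 2 * eps; lm = np.sum((forward(Ws, bs, x)[-1] - y)**2)
Ws[0][j, k] += eps                                # restore the original weight
print(np.isclose((lp - lm) / (2 * eps), gW[0][j, k]))   # True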

Chapter 4

Linear and Ridge Regression

Relevant sections of the textbooks:

• [2] Chapter 6.

4.1 Impact of Perturbations on Linear Regression


Before we start thinking about generic convex analysis, we will first study a particularly instructive, useful, and interesting linear-algebraic convex optimization problem.
Let A ∈ Rn×n be invertible and ~y ∈ Rn. Consider the generic linear system A~x = ~y, perhaps representing measurements of some physical system. There is exactly one ~x which solves this system — that being ~x = A−1~y. We want to understand how sensitive this system is to perturbations in the output. That is, if ~y is perturbed by ~δ~y with k~δ~yk2 small (say, representing noise in the measurements), then the ~x that solves the system is also perturbed, say by ~δ~x. So in the end, we have

A(~x + ~δ~x) = (~y + ~δ~y).    (4.1)

We want to compute the relative change in ~x, that is, k~δ~xk2 / k~xk2, in terms of k~δ~yk2, as well as other properties of the system. In the context of our physical measurement system, we would much rather have this ratio be small; this means that the solutions to the equations governing our physical system are robust to measurement errors, thus assuring us that our model is relatively accurate to the real-life physical system. Thus, at the least we want to upper-bound k~δ~xk2 / k~xk2.
The first part of upper-bounding k~δ~xk2 / k~xk2 is to upper-bound k~δ~xk2. We have

A(~x + ~δ~x) = ~y + ~δ~y    (4.2)
A~x + A~δ~x = ~y + ~δ~y    (4.3)
A~δ~x = ~δ~y    (4.4)
~δ~x = A−1~δ~y.    (4.5)

Then by taking norms on both sides,

k~δ~xk2 = kA−1~δ~yk2    (4.6)
≤ max_{~z∈Rn, k~zk2 = k~δ~yk2} kA−1~zk2    (4.7)
= ( max_{~z∈Rn, k~zk2 = 1} kA−1~zk2 ) k~δ~yk2    (4.8)
= kA−1k2 k~δ~yk2.    (4.9)

In order to upper-bound k~δ~xk2 / k~xk2, we also need to lower-bound k~xk2. Applying the same matrix norm inequality to the original linear system A~x = ~y gives

A~x = ~y    (4.10)
kA~xk2 = k~yk2    (4.11)
kAk2 k~xk2 ≥ k~yk2    (4.12)
k~xk2 ≥ k~yk2 / kAk2    (4.13)

where kAk2 ≠ 0 because A is invertible. Plugging in both bounds, we have

k~δ~xk2 / k~xk2 ≤ kA−1k2 k~δ~yk2 / ( k~yk2 / kAk2 )    (4.14)
= kAk2 kA−1k2 · k~δ~yk2 / k~yk2.    (4.15)

Thus we’ve bounded the relative change in ~x by the relative change in ~y . If the relative change in ~y is small, then the
relative change in ~x will be small, and so on. But we’d like to say something more about kAk2 A−1 2
, and indeed we
can:
σ1 {A}
kAk2 A−1 2
= σ1 {A} · σ1 {A−1 } = , (4.16)
σn {A}
where again, σn {A} 6= 0 because A is invertible. This quantity

. σ1 {A}
κ(A) = (4.17)
σn {A}

is called the condition number of a matrix. In general, for non-invertible systems, this can be infinite, but has the same
definition.

Definition 85 (Condition Number)
Let A ∈ Rn×n. The condition number of A, denoted κ(A), is given by

κ(A) = σ1{A} / σn{A}.    (4.18)

If κ(A) is large, then even a small relative change in our measurement ~y can result in a huge relative change in our variable ~x. If κ(A) is small, then the relative change in ~x is at most a small multiple of the relative change in ~y, so the system is robust to measurement errors.
It seems unlikely that in general, the equations that define our system will be square. Most likely we will have a
least-squares type tall system. But this is resolved by using the so-called normal equations to represent the least squares
solution:
A> A~x = A> ~y . (4.19)


The condition number of this linear system is κ(A>A). Since A>A is symmetric and positive semidefinite, its eigenvalues are also its singular values, and so we have

κ(A>A) = λmax{A>A} / λmin{A>A}.    (4.20)
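These quantities are simple to inspect numerically; a sketch in Python/NumPy (our own illustration; np.linalg.cond computes exactly σ1{A}/σn{A} for the default 2-norm):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))

s = np.linalg.svd(A, compute_uv=False)
print(s[0] / s[-1], np.linalg.cond(A))   # kappa(A), computed two ways

# The conditioning of the normal equations squares that of A:
print(np.isclose(np.linalg.cond(A.T @ A), np.linalg.cond(A)**2))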

4.2 Ridge Regression


Sometimes we have least squares systems with κ(A> A) = ∞, or even finite but very large. How do we make this
system solvable, robust, or better conditioned? The answer goes as follows: suppose we could add some number λ to all the eigenvalues of A>A. Then, since λ1{A>A} and λn{A>A} go up by the same amount, λn{A>A} becomes a larger fraction of λ1{A>A}, so the condition number κ(A>A) becomes lower.
This is perhaps easier to see with a numerical example, which we provide now. Suppose that A>A has λ1{A>A} = 5 and λn{A>A} = 0.01. Then κ(A>A) = 500. But if we add 3 to all eigenvalues of A>A, then λ1{A>A} = 8 and λn{A>A} = 3.01, so κ(A>A) = 8/3.01 ≈ 2.66. This is a much better-conditioned problem.
The question is now how to add λ to all eigenvalues of A> A. Using the shift property of eigenvalues, we see that we
can add λI to A> A, so that instead of solving the system A> A~x = A> ~y we instead solve the system (A> A + λI)~x =
A> ~y . The problem of finding ~x which solves this system is called ridge regression. It turns out to be equivalent to the
following formulation.

Theorem 86 (Ridge Regression)
Let A ∈ Rm×n, ~y ∈ Rm, and λ > 0. The unique solution to the ridge regression problem

min_{~x∈Rn} { kA~x − ~yk2^2 + λ k~xk2^2 }    (4.21)

is given by

~x? = (A>A + λI)−1 A>~y.    (4.22)

Proof. Let f(~x) = kA~x − ~yk2^2 + λ k~xk2^2. By taking gradients, we get

∇~x f(~x) = ∇~x { kA~x − ~yk2^2 + λ k~xk2^2 }    (4.23)
= ∇~x { ~x>A>A~x − 2~y>A~x + ~y>~y + λ~x>~x }    (4.24)
= 2A>A~x − 2A>~y + 2λ~x    (4.25)
= 2(A>A + λI)~x − 2A>~y.    (4.26)

Thus we get that the optimal point is determined by solving the linear system

(A>A + λI)~x = A>~y.    (4.27)

Since A>A is PSD and λ > 0, we have that A>A + λI is PD and thus invertible. Therefore

~x? = (A>A + λI)−1 A>~y    (4.28)

is the unique solution to the above linear system and therefore the unique solution to the optimization problem.

Note that we haven’t proved that a convex function (such as the above ridge regression objective) is minimized when
its derivative is 0; we prove this in subsequent lectures, but for now let us take it for granted.


Proof. Another way to solve the same problem is to consider the augmented system

[ A ; √λ I ] ~x = [ ~y ; ~0 ],    (4.29)

where [ A ; √λ I ] denotes A stacked on top of √λ I, and [ ~y ; ~0 ] denotes ~y stacked on top of ~0. This augmented matrix has full column rank, so we can use the least squares solution to get a unique solution for ~x. We get

~x = ( [A ; √λ I]> [A ; √λ I] )−1 [A ; √λ I]> [~y ; ~0]    (4.30)
= ( [A> √λ I] [A ; √λ I] )−1 [A> √λ I] [~y ; ~0]    (4.31)
= ( [A> √λ I] [A ; √λ I] )−1 ( A>~y + √λ I · ~0 )    (4.32)
= ( [A> √λ I] [A ; √λ I] )−1 A>~y    (4.33)
= ( A>A + λI )−1 A>~y.    (4.34)

In the ridge regression objective

kA~x − ~yk2^2 + λ k~xk2^2,    (4.35)

the second term λ k~xk2^2 is called a regularizer; this is because it regulates or regularizes our problem by making it better-conditioned.
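Both derivations can be checked in a few lines. The sketch below (Python/NumPy; the dimensions and λ are arbitrary choices of ours) solves the ridge problem via the closed form (4.22) and via the augmented system (4.29), and confirms that the two agree:

import numpy as np

rng = np.random.default_rng(0)
A, y, lam = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1

# Closed form (4.22): x* = (A^T A + lam I)^{-1} A^T y
x_closed = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)

# Augmented system (4.29): stack A over sqrt(lam) I, and y over 0
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(5)])
y_aug = np.concatenate([y, np.zeros(5)])
x_aug, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)

print(np.allclose(x_closed, x_aug))   # True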

4.3 Principal Components Regression


We can gain more understanding of the ridge regression solution by looking at it through the SVD of A. Indeed, let A = UΣV>. Then the ridge regression solution is

~x? = ( A>A + λI )−1 A>~y    (4.36)
= ( (UΣV>)>(UΣV>) + λI )−1 (UΣV>)>~y    (4.37)
= ( VΣ>U>UΣV> + λI )−1 VΣ>U>~y    (4.38)
= ( VΣ>ΣV> + λI )−1 VΣ>U>~y    (4.39)
= ( VΣ>ΣV> + V(λI)V> )−1 VΣ>U>~y    (4.40)
= ( V(Σ>Σ + λI)V> )−1 VΣ>U>~y    (4.41)
= V( Σ>Σ + λI )−1 V>VΣ>U>~y    (4.42)
= V( Σ>Σ + λI )−1 Σ>U>~y    (4.43)
= V [ (Σr^2 + λI)−1Σr  0 ;  0  0 ] U>~y.    (4.44)

Looking at the middle matrix a bit more, we see that it is the r × r diagonal matrix

(Σr^2 + λI)−1Σr = diag( σ1{A} / (σ1{A}^2 + λ), . . . , σr{A} / (σr{A}^2 + λ) ).

Thus, we get

~x? = V [ (Σr^2 + λI)−1Σr  0 ;  0  0 ] U>~y    (4.45)
= ( Σ_{i=1}^{r} ( σi{A} / (σi{A}^2 + λ) ) ~vi~ui> ) ~y    (4.46)
= Σ_{i=1}^{r} ( σi{A} / (σi{A}^2 + λ) ) (~ui>~y) · ~vi.    (4.47)

To understand what λ is doing here, we contrast two examples. Let A ∈ Rn×3 for some large n ≫ 3.
Suppose first that σ1{A} = σ2{A} = σ3{A} = 1. Then

~x? = ( σ1{A} / (σ1{A}^2 + λ) )(~u1>~y)~v1 + ( σ2{A} / (σ2{A}^2 + λ) )(~u2>~y)~v2 + ( σ3{A} / (σ3{A}^2 + λ) )(~u3>~y)~v3    (4.48)
= ( 1 / (1 + λ) ) { (~u1>~y)~v1 + (~u2>~y)~v2 + (~u3>~y)~v3 }    (4.49)
= ( 1 / (1 + λ) ) ~xe    (4.50)

where ~xe is the solution of the corresponding least squares linear regression problem, namely the ridge problem with λ = 0. In this way, the λ parameter decays the solution in each principal direction equally, pulling the whole ~xe vector towards 0. This is interesting precisely because a first-level examination of the ridge regression objective function — and namely the k~xk2^2 term, which by itself penalizes every direction of ~x equally — may make it seem like this is always the case, but it turns out not to be, as we will see shortly.
Now suppose that σ1{A} = 100, σ2{A} = 10, and σ3{A} = 1. Then

~x? = ( σ1{A} / (σ1{A}^2 + λ) )(~u1>~y)~v1 + ( σ2{A} / (σ2{A}^2 + λ) )(~u2>~y)~v2 + ( σ3{A} / (σ3{A}^2 + λ) )(~u3>~y)~v3    (4.51)
= ( 100 / (10000 + λ) )(~u1>~y)~v1 + ( 10 / (100 + λ) )(~u2>~y)~v2 + ( 1 / (1 + λ) )(~u3>~y)~v3    (4.52)
= ( 1 / (100 + λ/100) )(~u1>~y)~v1 + ( 1 / (10 + λ/10) )(~u2>~y)~v2 + ( 1 / (1 + λ) )(~u3>~y)~v3.    (4.53)

Thus, the different terms are now impacted differently based on λ; in particular, to impact the first term by a certain amount, one needs to change λ by 100 times the amount required to change the last term. Namely, if we set λ to be large, say λ = 10000, the coefficient of the first term becomes 1/200, while the coefficient of the last term becomes 1/10001, which is much lower. More generally, for a larger example, setting λ to be large effectively zeros out the last few terms while effectively not changing the first few terms. Thus, setting λ to be large effectively performs a “soft thresholding” of the singular values, making the terms associated with smaller singular values nearly 0 while preserving the terms associated with larger singular values. More quantitatively, for large λ, we have

~x? = ( 1 / (100 + λ/100) )(~u1>~y)~v1 + ( 1 / (10 + λ/10) )(~u2>~y)~v2 + ( 1 / (1 + λ) )(~u3>~y)~v3    (4.54)
≈ ( 1 / (100 + λ/100) )(~u1>~y)~v1 + ( 1 / (10 + λ/10) )(~u2>~y)~v2    (4.55)

and for even larger λ we simply have

~x? = ( 1 / (100 + λ/100) )(~u1>~y)~v1 + ( 1 / (10 + λ/10) )(~u2>~y)~v2 + ( 1 / (1 + λ) )(~u3>~y)~v3    (4.56)
≈ ( 1 / (100 + λ/100) )(~u1>~y)~v1.    (4.57)

Since the terms form a linear combination of the ~vi, setting the terms associated with small singular values to (nearly) 0 is similar to performing PCA, where we only use the ~vi associated with the largest few singular values. Thus our conclusion is — ridge regression behaves qualitatively like a soft form of PCA.
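This “soft PCA” behavior is easy to make visible numerically. The sketch below (Python/NumPy; the construction of A with prescribed singular values is our own) reproduces the σ1 = 100, σ2 = 10, σ3 = 1 example and prints the coefficients of ~x? in the basis ~v1, ~v2, ~v3 as λ grows:

import numpy as np

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 3)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = U @ np.diag([100.0, 10.0, 1.0]) @ V.T           # singular values 100, 10, 1
y = rng.standard_normal(50)

for lam in [0.0, 1.0, 1e4]:
    x = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ y)
    print(lam, V.T @ x)   # coefficients along v_1, v_2, v_3
# As lam grows, the v_3 (then v_2) coefficients are driven toward 0 first,
# while the v_1 coefficient is comparatively preserved.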

4.4 Tikhonov Regression


Recall that our earlier augmented system

[ A ; √λ I ] ~x = [ ~y ; ~0 ],    (4.58)

which had full column rank, tried to find an ~x such that A~x ≈ ~y while ~x ≈ ~0 — in other words, an ~x that is small. Suppose that we wanted to instead enforce that ~x be close to some other vector ~x0 ∈ Rn. Then we would set up the system

[ A ; √λ I ] ~x = [ ~y ; √λ ~x0 ].    (4.59)

This would yield the least-squares type objective function

kA~x − ~yk2^2 + λ k~x − ~x0k2^2.    (4.60)

The final generalization of this is to put different weights on each row of A~x − ~y and ~x − ~x0. If, for example, we really want to get row i of A~x close to yi, we can multiply the squared difference (A~x − ~y)i^2 by a large weight in the loss function, and the solutions will bias towards ensuring that (A~x − ~y)i ≈ 0. Similarly, if we really are sure that the true ~x has i-th coordinate (~x0)i, then we can attach a large weight to the difference (~x − ~x0)i^2 as well. Mathematically, this gives us the following objective function:

kW1(A~x − ~y)k2^2 + kW2(~x − ~x0)k2^2,    (4.61)

where W1 ∈ Rm×m and W2 ∈ Rn×n are diagonal matrices representing the weights. Notice how this is a generalization of ridge regression, with W1 = I, W2 = √λ I, and ~x0 = ~0. This general regression is called Tikhonov regression.

Theorem 87 (Tikhonov Regression)
Let A ∈ Rm×n, ~x0 ∈ Rn, and ~y ∈ Rm, and let W1 ∈ Rm×m and W2 ∈ Rn×n be diagonal. Then the unique solution to the Tikhonov regression problem

min_{~x∈Rn} { kW1(A~x − ~y)k2^2 + kW2(~x − ~x0)k2^2 }    (4.62)

is given by

~x? = (A>W1^2 A + W2^2)−1 (A>W1^2 ~y + W2^2 ~x0).    (4.63)


Proof. Left as an exercise.

This expression looks complicated, so we do a sanity check: if W1 = I, W2 = √λ I, and ~x0 = ~0, then we get exactly the ridge regression solution.
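A direct implementation of Theorem 87 can be checked against a generic least-squares solve of the stacked system [W1A ; W2]~x ≈ [W1~y ; W2~x0]. A sketch (Python/NumPy; the weights are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 4
A, y = rng.standard_normal((m, n)), rng.standard_normal(m)
x0 = rng.standard_normal(n)
W1 = np.diag(rng.uniform(0.5, 2.0, size=m))
W2 = np.diag(rng.uniform(0.5, 2.0, size=n))

# Closed form (4.63)
x_star = np.linalg.solve(A.T @ W1 @ W1 @ A + W2 @ W2,
                         A.T @ W1 @ W1 @ y + W2 @ W2 @ x0)

# Equivalent stacked least-squares problem
A_stack = np.vstack([W1 @ A, W2])
b_stack = np.concatenate([W1 @ y, W2 @ x0])
x_ls, *_ = np.linalg.lstsq(A_stack, b_stack, rcond=None)

print(np.allclose(x_star, x_ls))   # True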

4.5 Maximum Likelihood Estimation (MLE)


Previously, we talked about incorporating side information (like ~x = ~x0 ) deterministically. Now we discuss a way to
incorporate probabilistic information into our model.
Namely, suppose that the rows of our A matrix are vectors ~a1, . . . , ~am ∈ Rn, and that the entries of our ~y vector are y1, . . . , ym ∈ R. Now suppose we have the probabilistic model

yi = ~ai>~x + wi, ∀ i ∈ {1, . . . , m}    (4.64)

where w1, . . . , wm are independent Gaussian random variables; in particular, wi ∼ N(0, σi^2). Or in short, we have

~y = A~x + ~w    (4.65)

where ~w = [w1 · · · wm]> ∈ Rm. In this case, we say that ~w ∼ N(~0, Σw) where Σw = diag(σ1^2, . . . , σm^2).


In this setup, the maximum likelihood estimate (MLE) for ~x turns out to be exactly a solution to a Tikhonov regres-
sion problem. The maximum likelihood estimate is the parameter choice which makes the data most likely, in that it
has the highest probability or probability density out of all choices of the parameter. It is a meaningful and popular sta-
tistical estimator; thus the fact that we can reduce its computation to a ridge regression-type problem is both interesting
and useful.
Henceforth, we use p to denote probability densities, and use p~x to denote probability densities for a fixed value of
~x. In the above model, ~x is not a random variable, so it doesn’t quite make formal sense to condition on it (though —
spoilers! — we will soon put a probabilistic prior on it, and then it makes sense to condition).

Proposition 88 (MLE as Tikhonov Regression)
In the above probabilistic model, we have

argmax_{~x∈Rn} p~x(~y) = argmin_{~x∈Rn} kΣw^{−1/2}(A~x − ~y)k2^2.    (4.66)

Proof. Since the logarithm is monotonically increasing, argmax~x f(~x) = argmax~x log(f(~x)) for all functions f, and so

argmax_{~x∈Rn} p~x(~y) = argmax_{~x∈Rn} log(p~x(~y))    (4.67)
= argmax_{~x∈Rn} log( Π_{i=1}^{m} p~x(yi) )    (4.68)
= argmax_{~x∈Rn} Σ_{i=1}^{m} log(p~x(yi))    (4.69)
= argmax_{~x∈Rn} Σ_{i=1}^{m} log( (1/√(2πσi^2)) exp( −(yi − ~ai>~x)^2 / (2σi^2) ) )    (4.70)
= argmax_{~x∈Rn} Σ_{i=1}^{m} [ log(1/√(2πσi^2)) + log( exp( −(yi − ~ai>~x)^2 / (2σi^2) ) ) ]    (4.71)
= argmax_{~x∈Rn} Σ_{i=1}^{m} log( exp( −(yi − ~ai>~x)^2 / (2σi^2) ) )    (4.72)
= argmax_{~x∈Rn} Σ_{i=1}^{m} ( −(yi − ~ai>~x)^2 / (2σi^2) )    (4.73)
= argmax_{~x∈Rn} ( −(1/2) Σ_{i=1}^{m} (yi − ~ai>~x)^2 / σi^2 )    (4.74)
= argmin_{~x∈Rn} Σ_{i=1}^{m} (yi − ~ai>~x)^2 / σi^2    (4.75)
= argmin_{~x∈Rn} kΣw^{−1/2}(A~x − ~y)k2^2,    (4.76)

where the first term in (4.71) is dropped because it is independent of ~x.
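Concretely, then, computing the MLE amounts to a weighted least-squares solve. The sketch below (Python/NumPy; all model parameters are arbitrary choices of ours) simulates the model and recovers ~x by minimizing kΣw^{−1/2}(A~x − ~y)k2^2:

import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
A = rng.standard_normal((m, n))
x_true = np.array([1.0, -2.0, 0.5])
sig = rng.uniform(0.1, 2.0, size=m)              # per-measurement noise levels
y = A @ x_true + sig * rng.standard_normal(m)    # y_i = a_i^T x + w_i

# MLE: whiten each row by 1/sigma_i, then solve ordinary least squares
Aw, yw = A / sig[:, None], y / sig
x_mle, *_ = np.linalg.lstsq(Aw, yw, rcond=None)
print(x_mle)                                     # close to x_true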

4.6 Maximum A Posteriori Estimation (MAP)


Now we consider the same probabilistic model as above, except this time suppose that we believe ~x is also random, in the sense that

xj = µj + vj, ∀ j ∈ {1, . . . , n}    (4.77)

where v1, . . . , vn are independent Gaussian random variables; in particular vj ∼ N(0, τj^2). Or in short, we have

~x = ~x0 + ~v    (4.78)

where ~x0 = [µ1 · · · µn]> and ~v = [v1 · · · vn]> ∈ Rn is distributed as ~v ∼ N(~0, Σv), where Σv = diag(τ1^2, . . . , τn^2).


In this setup, the maximum likelihood estimate may still be useful, but another quantity that is perhaps more relevant
is the maximum a posteriori estimate (MAP). The MAP estimate is the value of ~x which is most likely, i.e., having
the highest conditional probability or conditional probability density, conditioned on the observed data. It is also a
meaningful and popular statistical estimator. It turns out that we can derive a similar result as in the MLE case.

Theorem 89 (MAP as Tikhonov Regression)
In the above probabilistic model, we have

argmax_{~x∈Rn} p(~x | ~y) = argmin_{~x∈Rn} { kΣw^{−1/2}(A~x − ~y)k2^2 + kΣv^{−1/2}(~x − ~x0)k2^2 }.    (4.79)

Proof. Using Bayes’ rule and the computations from before, we have

argmax p(~x | ~y ) (4.80)


x∈Rn
~

= argmax log(p(~x | ~y ) (4.81)


x∈Rn
~
 
p(~y | ~x)p(~x)
= argmax log (4.82)
x∈Rn
~ p(~y )

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 77
EECS 127/227AT Course Reader 4.6. Maximum A Posteriori Estimation (MAP) 2024-04-27 21:08:09-07:00

 

 

= argmax log(p(~y | ~x)) + log(p(~x)) − log(p(~y )) (4.83)
x∈Rn
~  | {z } 
independent of ~
 
x

= argmax {log(p(~y | ~x)) + log(p(~x))} (4.84)


x∈Rn
~
(
m
)  n 
Y 
(4.85)
Y
= argmax log  p(yi | ~x) · p(xj ) 
x∈Rn
~ i=1

j=1

 
X m n 
(4.86)
X
= argmax log(p(yi | ~x)) + log(p(xj ))
x∈Rn  i=1
~ j=1

  !
m  > 2
! X n 2
X 1 (yi − ~ai ~x) q 1 (xj − (~x0 )j ) 
= argmax log p exp − 2 + log exp − (4.87)
x∈Rn  i=1
~ 2πσi2 2σi j=1 2πτj2 2τj2 
 !
m  n
(yi − ~a> 2 2 
 X
X ~
x ) (x − (~x ) )
(4.88)
i j 0 j
= argmax − + −
x∈Rn  i=1
~ 2σi2 j=1
2τ 2
j 
 ! 
m  n
(yi − ~a> x)2 (xj − (~x0 )j )2 
 X
i ~
X
= argmin + (4.89)
x∈Rn  i=1
~ σi2 j=1
τj2 
 
2 2
(4.90)
−1/2 −1/2
= argmin Σw~ (A~x − ~y ) + Σ~v (~x − ~x0 )
x∈Rn
~ 2 2

as desired.
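Equivalently, the MAP estimate is the Tikhonov solution of Theorem 87 with W1 = Σw^{−1/2} and W2 = Σv^{−1/2}. A sketch (Python/NumPy; the prior and noise levels are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
A = rng.standard_normal((m, n))
sig = rng.uniform(0.5, 1.5, size=m)       # noise standard deviations
tau = np.array([0.1, 0.1, 10.0])          # prior std devs: tight on x_1, x_2
x0 = np.zeros(n)

x = x0 + tau * rng.standard_normal(n)             # draw x from the prior
y = A @ x + sig * rng.standard_normal(m)          # observe y

Sw_inv = np.diag(1.0 / sig**2)                    # Sigma_w^{-1}
Sv_inv = np.diag(1.0 / tau**2)                    # Sigma_v^{-1}
x_map = np.linalg.solve(A.T @ Sw_inv @ A + Sv_inv,
                        A.T @ Sw_inv @ y + Sv_inv @ x0)
print(x, x_map)   # coordinates with small tau are pulled strongly toward x0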

Chapter 5

Convexity

Relevant sections of the textbooks:

• [1] Chapter 4.

• [2] Chapter 8.

5.1 Convex Sets


5.1.1 Basics
First, we want to define a special type of linear combination called a convex combination.

Definition 90 (Convex Combination)
Let ~x1, . . . , ~xk ∈ Rn. The sum

~x = Σ_{i=1}^{k} θi~xi    (5.1)

is a convex combination of ~x1, . . . , ~xk if each θi ≥ 0 and Σ_{i=1}^{k} θi = 1.

We can think of each θi as a weight on the corresponding ~xi . Since they are non-negative numbers which sum to
1, we can also interpret them as probabilities.

Definition 91 (Convex Set)


Let C ⊆ Rn . We say that C is a convex set if it is closed under convex combinations: for all ~x1 , ~x2 ∈ C and all
θ ∈ [0, 1], we have θ~x1 + (1 − θ)~x2 ∈ C.

Geometrically, a set C is convex if for every two points ~x1, ~x2 ∈ C, the line segment {θ~x1 + (1 − θ)~x2 | θ ∈ [0, 1]} is contained in C. This means that, for example, the midpoint between ~x1 and ~x2, i.e., (1/2)~x1 + (1/2)~x2, is contained in C, as well as the point 1/3 of the way from ~x1 to ~x2, i.e., (2/3)~x1 + (1/3)~x2, etc. More generally, as we vary θ, we go along the line segment connecting ~x1 and ~x2.


Figure 5.1: Two sets C1, C2 ⊆ R2. C1 is not convex, but C2 is. To visualize the behavior of the line segments, we also plot a point on each line segment along with its associated θ.

Algebraically, a set C is convex if for any ~x1 , . . . , ~xk ∈ C, any convex combination of ~x1 , . . . , ~xk is contained in
C.
One way to generate a convex set from any (possibly non-convex) set, including finite and infinite sets, is to take
its convex hull.

Definition 92 (Convex Hull)
Let S ⊆ Rn be a set. The convex hull of S, denoted conv(S), is the set of all convex combinations of points in S, i.e.,

conv(S) = { Σ_{i=1}^{k} θi~xi | k ∈ N, θ1, . . . , θk ≥ 0, Σ_{i=1}^{k} θi = 1, ~x1, . . . , ~xk ∈ S }.    (5.2)

Here are some properties of the convex hull; the proof is left as an exercise.

Proposition 93
Let S ⊆ Rn be a set.

(a) conv(S) is a convex set.

(b) conv(S) is the minimal convex set which contains S, i.e.,

conv(S) = ∩ { C | C ⊇ S, C is a convex set }.    (5.3)

Thus if S is convex then conv(S) = S.

(c) conv(S) is the union of convex hulls of all finite subsets of S, i.e.,

conv(S) = ∪ { conv(A) | A ⊆ S, A is a finite set }.    (5.4)

Actually, the last statement can be strengthened to a separate, more quantitative result, which gives a fundamental
characterization of convex sets.


Theorem 94 (Carathéodory’s Theorem)


Let S ⊆ Rn be a set. Then conv(S) is the union of convex hulls of all finite subsets of S of size at most n + 1,
i.e.,
(5.5)
[
conv(S) = conv(A).
A⊆S
|A|≤n+1

The proof of this theorem is left as an exercise; interested students can reference the proof in Bertsekas [5, Proposition
B.6], for example.
Below, we visualize the convex hull of a finite set S. By the above proposition, the convex hull of an infinite set S 0
is the union of convex hulls of all finite subsets of S 0 .

Figure 5.2: A finite set S and its convex hull.
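Convex hulls of finite point sets can also be computed with standard tools; for instance, the sketch below uses SciPy (assuming it is available) to compute the hull of a random finite set and to check that a convex combination lies inside it:

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
pts = rng.standard_normal((20, 2))     # a finite set S in R^2

hull = ConvexHull(pts)
print(pts[hull.vertices])              # the corner points of conv(S)

# Any convex combination of points of S lies in conv(S): it satisfies
# every facet inequality n^T p + c <= 0 of the hull.
theta = rng.uniform(size=20); theta /= theta.sum()
p = theta @ pts
print(np.all(hull.equations[:, :2] @ p + hull.equations[:, 2] <= 1e-9))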

The following content is optional/out of scope for this semester. Regardless, it may be helpful to read it to
gain context, or get a deeper understanding of various results.

5.1.2 (OPTIONAL) Conic Hull, Affine Hull, and Relative Interior


As stated above, given a set S that is not necessarily convex, taking arbitrary convex combinations of vectors in S
generates the convex hull conv(S) of S, which is the smallest convex set containing S. Below, we present two other
methods of generating convex sets containing S that look similar to the definition of the convex hull. Each method has
its own geometric interpretation.

Definition 95 (Conic Hull)
Let S ⊆ Rn be a set. The conic hull of S, denoted conic(S), is defined as the set of conic combinations of vectors in S, i.e., linear combinations of vectors in S with non-negative coefficients:

conic(S) = { Σ_{i=1}^{k} θi~xi | k ∈ N, θ1, . . . , θk ≥ 0, ~x1, . . . , ~xk ∈ S }.    (5.6)

Geometrically, the conic hull of a set S is the set of all rays from the origin that pass through conv(S).


Figure 5.3: A finite set S and its convex and conic hulls.

Definition 96 (Affine Set)


Let S ⊆ Rn be a set. We say that S is an affine set if it is closed under affine combinations: for each ~x1 , ~x2 ∈ S,
and any θ ∈ R, we have θ~x1 + (1 − θ)~x2 ∈ S.

Note the difference between affine and convex sets. In the latter, θ is restricted to [0, 1]. Geometrically this restriction
corresponds to the (finite) line segment connecting ~x1 and ~x2 being contained in S. In the former, however, θ can be
any real number, corresponding to the whole (infinite) line connecting ~x1 and ~x2 being contained in S.
Note that an affine set is a translation of a subspace. This intuition is one of the most helpful ways to understand
affine sets.

Proposition 97
For a set A ⊆ Rn, define the translation A + ~x = {~a + ~x | ~a ∈ A}.

(a) Let S ⊆ Rn be a nonempty affine set. Then there is a subspace U ⊆ Rn such that, for any ~x ∈ S, we have S = U + ~x.

(b) For any subspace U ⊆ Rn and vector ~x ∈ Rn, the set U + ~x is an affine set.

Proof.

(a) Let ~x ∈ S be any vector in S, and define U := S + (−~x) = {~s − ~x | ~s ∈ S}. We claim that U is a subspace. Indeed, since ~x ∈ S, we see that ~0 ∈ U = S + (−~x). We show that U is closed under addition. Let ~u1, ~u2 ∈ U. By definition of U, there exist ~s1, ~s2 ∈ S such that ~u1 = ~s1 − ~x and ~u2 = ~s2 − ~x. Then

~u1 + ~u2 = ~s1 − ~x + ~s2 − ~x    (5.7)
= [ 2 · ( (~s1 + ~s2)/2 ) + (1 − 2)~x ] − ~x.    (5.8)

Now because S is affine it is convex, so (~s1 + ~s2)/2 ∈ S. As an affine combination of elements of S, we have 2 · ((~s1 + ~s2)/2) + (1 − 2)~x ∈ S. Thus we have ~u1 + ~u2 ∈ S + (−~x) = U, so that U is closed under vector addition.
To show that U is closed under scalar multiplication, let α ∈ R and ~u ∈ U. By definition of U, there exists


~s ∈ S such that ~u = ~s − ~x. Then

α~u = α(~s − ~x)    (5.9)
= [α~s + (1 − α)~x] − ~x.    (5.10)

Since S is affine, the affine combination α~s + (1 − α)~x ∈ S. Thus α~u ∈ S + (−~x) = U, so U is closed under scalar multiplication. We have shown that U is closed under linear combinations and contains ~0, so U is a subspace and the claim is proved.

(b) Let α ∈ R and let ~s1, ~s2 ∈ U + ~x. By definition of U + ~x, there exist ~u1, ~u2 ∈ U such that ~s1 = ~u1 + ~x and ~s2 = ~u2 + ~x. Then

α~s1 + (1 − α)~s2 = α(~u1 + ~x) + (1 − α)(~u2 + ~x)    (5.11)
= [α~u1 + (1 − α)~u2] + ~x.    (5.12)

Since U is a subspace, α~u1 + (1 − α)~u2 ∈ U. Thus, from above, we have α~s1 + (1 − α)~s2 = [α~u1 + (1 − α)~u2] + ~x ∈ U + ~x. We have shown that U + ~x is closed under affine combinations, so it is an affine set.

Definition 98 (Affine Hull)
Let S ⊆ Rn be a set. The affine hull of S, denoted aff(S), is defined as the set of affine combinations of vectors in S, i.e., linear combinations of vectors in S with coefficients which sum to 1:

aff(S) = { Σ_{i=1}^{k} θi~xi | k ∈ N, θ1, . . . , θk ∈ R, Σ_{i=1}^{k} θi = 1, ~x1, . . . , ~xk ∈ S }.    (5.13)

Here are some properties of the affine hull; the proof is left as an exercise.

Proposition 99
Let S ⊆ Rn be a set.

(a) aff(S) is an affine set.

(b) aff(S) is the minimal affine set which contains S, i.e.,

aff(S) = ∩ { C | C ⊇ S, C is an affine set }.    (5.14)

Thus if S is affine then aff(S) = S.

(c) aff(S) is the union of affine hulls of all finite subsets of S, i.e.,

aff(S) = ∪ { aff(A) | A ⊆ S, A is a finite set }.    (5.15)

We can actually get an elementary refinement of (c) above.


Corollary 100. Let S ⊆ Rn be a set, and let aff(S) be the translation of a linear subspace of dimension d ≤ n. Then aff(S) is the union of affine hulls of all finite subsets of S of size at most d, i.e.,

aff(S) = ∪ { aff(A) | A ⊆ S, |A| ≤ d }.    (5.16)

Proof. Suppose that aff(S) = U + ~x where U ⊆ Rn is a subspace of dimension d. We prove both subset relations, using that A ⊆ B and B ⊆ A implies A = B.
We show the quicker subset relation first. We have from earlier results that

aff(S) = ∪ { aff(A) | A ⊆ S, A is a finite set } ⊇ ∪ { aff(A) | A ⊆ S, |A| ≤ d }.    (5.17)

Towards showing the reverse inclusion, let ~s1, . . . , ~sd be elements of S such that, if we define ~ui := ~si − ~x, then ~u1, . . . , ~ud is a basis for U. (Such ~si have to exist; if they don't, then there are no ~s1, . . . , ~sd whose translates by ~x span U, a contradiction with the definition of aff(S) = U + ~x.) Now taking A = {~s1, . . . , ~sd}, we see that aff(A) = aff(S). Thus we have

aff(S) = aff({~s1, . . . , ~sd}) ⊆ ∪ { aff(A) | A ⊆ S, |A| ≤ d }.    (5.18)

Therefore we have

aff(S) = ∪ { aff(A) | A ⊆ S, |A| ≤ d },    (5.19)

as desired.


Figure 5.4: (a) The set S1 = {(3/2, 2), (3, 1)} of two points in R2; (b) the convex hull S2 = conv(S1) of S1, which is the closed line segment connecting the two points in S1; (c) the conic hull S3 = conic(S1) of S1, which is the union of all rays passing through S1; (d) the affine hull S4 = aff(S1) of S1, which is the infinite line connecting the two points in S1. Note that we also have S3 = conic(S2) and S4 = aff(S2); this relationship can be shown to hold in general from the definitions.

Next, given a set S ⊆ Rn, we sometimes wish to distinguish points that lie on the “boundary” of S from points that lie in the “interior” of S. For instance, consider the set S = [0, 1) ⊆ R. In this case, 0 and 1 lie on the “boundary” of S, since they are infinitely close to both points inside S and points outside S. On the other hand, 1/10 lies in the interior of S, since all points within a sufficiently small distance of it lie in S. Although 0 and 1 can both be geometrically interpreted as points on the boundary of S = [0, 1), note that 0 ∈ S while 1 ∉ S. In general, a set may contain either all, some, or none of the points on its boundary.
Below, we formalize the notion of interior points¹.

Definition 101 (Interior)
Let S ⊆ Rn and let ~x ∈ Rn.

(a) (Open ball.) Let r > 0. We call Nr(~x) = {~y ∈ Rn | k~y − ~xk2 < r} the open ball in Rn of radius r centered at ~x.

(b) (Interior.) We say that ~x is an interior point of S when there exists some r > 0 such that Nr(~x) ⊆ S.a The set of all interior points of S is called the interior of S and is denoted int(S).

a In the definition of an interior point, it does not matter whether we use ⊆ or ⊂. Think about why; this is a good exercise to internalize the definitions.

¹The definitions provided below can be generalized to spaces more abstract than Rn or even general finite-dimensional vector spaces, such as
metric or topological spaces.


In words, given a set S ⊆ Rn , we say that ~x ∈ Rn is an interior point of S if it is contained inside an open ball in
Rn that is in turn entirely contained in S. A mental picture is provided in Figure 5.5.

Figure 5.5: The vector ~x is an interior point of the set S ⊆ R2, since there exists a 2-dimensional ball (red) centered at ~x that is contained in S.

Some sets represent geometric shapes embedded in a Euclidean space of strictly higher dimension, and therefore must have empty interior. As an example, consider the set S2 defined in Figure 5.4(b), which connects the points (3/2, 2) and (3, 1) in R2. This is a one-dimensional line segment embedded in a Euclidean space of dimension 2. Indeed, if one claims that a point in S2, say the midpoint (9/4, 3/2), is an interior point of S2, then one would have to show the existence of a two-dimensional open ball centered at (9/4, 3/2) that lies entirely in S2. This is impossible, since S2 is a one-dimensional line segment, and so S2 has empty interior.
However, it may still be geometrically meaningful to classify points in such a set as points on the “edge” of the set, or points “inside” the set. In the context of the line segment S2, the end points S1 = {(3/2, 2), (3, 1)} appear at the “edge” of S2, while the remaining points S2 \ S1 are located “inside” S2. As we explained above, this cannot be captured using the definitions of interior points and the interior presented in Definition 101. Roughly speaking, this is because the points of S2 \ S1 can only be considered points “inside” the line segment S2 from a one-dimensional perspective, e.g., relative to the line S4 = aff(S1) = aff(S2) that contains S2. This motivates the definition of the relative interior, provided below.

Definition 102 (Relative Interior Points, Relative Interior)


Let S ⊆ Rn be a set, and let ~x ∈ S. We say that ~x is a relative interior point of S when there exists some r > 0
such that Nr (~x)∩aff (S) ⊆ S. The relative interior of S is the set of all relative interior points of S, and is denoted
relint(S).

In words, given a set S ⊆ Rn , we say that ~x ∈ Rn is a relative interior point of S if it is contained inside an open
ball in Rn whose intersection with aff (S) is entirely contained in S.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 86
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

S3

S2

S1

(0, 0)

. .
Figure 5.6: A similar setup to Figure 5.4. Here S1 = {(3/2, 2), (3, 1)}, S2 = conv(S1 ) = {θ(3/2, 2)+(1−θ)(3, 1) | θ ∈ [0, 1]},
.
and S3 = aff (S1 ) = aff (S2 ) = {θ(3/2, 2) + (1 − θ)(3, 1) | θ ∈ R}. Thus relint(S2 ) = S2 \ S1 . In other words, S2 is the line
segment connecting (3/2, 2) and (3, 1), S3 = aff (S2 ) is the extension of S2 into a line, and relint(S2 ) is the open (i.e., excluding
the endpoints) line segment connecting (3/2, 2) and (3, 1). This illustrates the description of the relative interior of a set as its
interior when viewed as a subset of its own affine hull.

S aff (S)

~0

Figure 5.7: A set S and its affine hull. While the interior of S is the empty set, its relative interior is nonempty.

Next, we use the concept of relative interior to characterize strictly convex sets.

Definition 103 (Strictly Convex Sets)


Let C ⊆ Rn be a set. We say that C is a strictly convex set if for every ~x1 , ~x2 ∈ C and each θ ∈ (0, 1), we have
θ~x1 + (1 − θ)~x2 ∈ relint(C).

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 87
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

~x1 ~x1 ~x2

S1 S2
~x2
S3

~x1 ~x2

Figure 5.8: Left: a strictly convex set S1 . Middle: a convex set S2 which is not strictly convex. Right: A non-convex set S3 . All
three sets are defined to include their boundaries. In particular, S2 is not strictly convex because some sections of its boundary
consist of line segments. For any two points along the same line segment, each convex combination of these points will lie on the
boundary of S2 .

The above content is optional/out of scope for this semester, but now we resume the required/in scope content.

5.1.3 Hyperplane and Half-Spaces

Definition 104 (Hyperplane)


Let ~a, ~x0 ∈ Rn and b ∈ R. A hyperplane is a set of the form

{~x ∈ Rn | ~a> ~x = b} (5.20)

or, equivalently, a set of the form


{~x ∈ Rn | ~a> (~x − ~x0 ) = 0}. (5.21)

The equations ~a> ~x = b and ~a> (~x − ~x0 ) = 0 are connected, because if we define b = ~a> ~x0 , then the second equation
resolves to the first equation; and if we take ~x0 to be any vector such that ~a> ~x0 = b, then the first equation resolves to
the second equation.
.
Example 105. Hyperplanes are convex. Consider a hyperplane H = {~x ∈ Rn | ~a> ~x = b}. Let ~x1 , ~x2 ∈ H and
θ ∈ [0, 1]. Then

~a> (θ~x1 + (1 − θ)~x2 ) = θ~a> ~x1 + (1 − θ)~a> ~x2 (5.22)


= θb + (1 − θ)b (5.23)
=b (5.24)

so θ~x1 + (1 − θ)~x2 ∈ H. Thus H is convex.

To show that a set C is convex, we need to show that for every ~x1 , ~x2 ∈ C and every θ ∈ [0, 1], that θ~x1 +(1−θ)~x2 ∈
C.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 88
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

To show that C is not convex, we just need to come up with one choice of ~x1 , ~x2 ∈ C and one θ ∈ [0, 1] such
/ C. Note that even if C is non-convex, there could be some choices of ~x1 , ~x2 ∈ C, θ ∈ [0, 1]
that θ~x1 + (1 − θ)~x2 ∈
such that θ~x1 + (1 − θ)~x2 ∈ C; but if C is non-convex, there is at least one choice of ~x1 , ~x2 ∈ C, θ ∈ [0, 1] such that
θ~x1 + (1 − θ)~x2 ∈
/ C.

Definition 106 (Half-Space)


Let ~a, ~x0 ∈ Rn and b ∈ R. A positive half-space is a set of the form

{~x ∈ Rn | ~a> ~x ≥ b} or {~x ∈ Rn | ~a> (~x − ~x0 ) ≥ 0}. (5.25)

A negative half-space is a set of the form

{~x ∈ Rn | ~a> ~x ≤ b} or {~x ∈ Rn | ~a> (~x − ~x0 ) ≤ 0}. (5.26)

The mental picture we have for these hyperplanes and half-spaces is the following. Let ~x0 ∈ Rn and define
.
H = {~x ∈ Rn | ~a> (~x − ~x0 ) = 0} (5.27)
.
H+ = {~x ∈ Rn | ~a> (~x − ~x0 ) ≥ 0} (5.28)
.
H− = {~x ∈ Rn | ~a> (~x − ~x0 ) ≤ 0}. (5.29)

Then the alignment of these objects looks like the following:

~a

H− H H+
~x0

In words, the positive and negative half-spaces partition Rn . Looking at some individual vectors, say ~x1 ∈ H−
and ~x2 ∈ H+ , we have the picture

~a
~x2

~x1 ~x0

If we draw lines connecting ~x0 with ~x1 and ~x2 , they are not themselves representations of ~x1 and ~x2 , unless ~x0 = ~0.
Instead, they are representations of the displacements of ~x1 and ~x2 from ~x0 . Thus, we see the following picture:

~a
~x2 − ~x0 ~x2
~x1 ~x1 − ~x0 ~x0

And this gives us a clearer understanding of what’s going on — ~x1 − ~x0 forms an obtuse angle with ~a, indicating a
negative dot product, whereas ~x2 − ~x0 forms an acute angle with ~a, indicating a positive dot product. And this is how
H+ and H− are computed.
This allows us to consider what it means for a hyperplane to separate two sets. It means that for every vector in the
first set, the dot product is non-positive, and for every vector in the second set, the dot product is non-negative.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 89
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

Example 107 (Set of PSD Matrices is Convex). Consider Sn+ , the set of all symmetric positive semidefinite (PSD)
matrices. We want to show that Sn+ is convex. Take A1 , A2 ∈ Sn+ and θ ∈ [0, 1]. We want to show that θA1 +(1−θ)A2 ∈
Sn+ .
One of the ways to tell if a matrix A is PSD is to check whether ~x> A~x ≥ 0 for all ~x ∈ Rn . Checking this for our
convex combination, we get

~x> (θA1 + (1 − θ)A2 )~x = θ ~x> A1 ~x +(1 − θ) ~x> A2 ~x (5.30)


| {z } | {z }
≥0 ≥0

≥ 0. (5.31)
" #
1 0
Note that it is possible to come up with linear combinations of PSD matrices that are not PSD; indeed, and
0 1
" # " #
2 0 −1 0
are PSD, yet their difference is not PSD. But all convex combinations of PSD matrices are PSD,
0 2 0 −1
as we have confirmed above.

Theorem 108 (Separating Hyperplane Theorem)


Let C, D ⊆ Rn be two nonempty disjoint convex sets, i.e., C ∩ D = ∅. Then there exists a hyperplane that
separates C and D, i.e., there exists ~a, ~x0 ∈ Rn such thata

~a> (~x − ~x0 ) ≥ 0, ∀~x ∈ C (5.32)


~a> (~x − ~x0 ) ≤ 0, ∀~x ∈ D. (5.33)

Moreover, if C is closed (containing its boundary points) and D is closed and bounded, then there exists a hyper-
plane that separates C and D without intersecting either set, i.e., there exists ~a, ~x0 ∈ Rn such that

~a> (~x − ~x0 ) > 0, ∀~x ∈ C (5.34)


>
~a (~x − ~x0 ) < 0, ∀~x ∈ D. (5.35)
.
aBy defining b = ~a> ~
x0 , one can express ~a> (~ x0 ) as ~a> ~
x−~ x − ~b if desired.

The mental picture we want to have is the following.

C
D

Proof. We prove the part of the theorem statement in the case where C is closed and bounded and D is closed.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 90
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

Even though our theorem statement concerns existence of such ~a and ~x0 , we will prove it by construction, i.e., we
will construct a ~a and ~x0 which separate C and D. This proof strategy is very powerful and will show up frequently.
Since C and D are disjoint, any points in C and D are separated by some positive distance; since they are compact,
this distance has a finite lower bound.² Define
.
dist(C, D) = min ~c − d~ . (5.36)
c∈C
~ 2
~
d∈D

Note that dist(C, D) > 0, and there exists some c ∈ C and d ∈ D such that ~c − d~ = dist(C, D).³
2

C
~c d~ D
~c − d~

This signals that we want ~c − d~ to be the normal vector of our hyperplane — that is, our ~a vector. To find the other
point ~x0 which the hyperplane passes through, we can just have it pass through the midpoint of ~c and d,~ i.e., ~c+d~ . This
2
gives the following diagram.

C c+d~
~
2
~c d~ D
~c − d~

Thus our proposed hyperplane has ~a and ~x0 equal to

~c + d~
~
~a = ~c − d, ~x0 = . (5.37)
2
It yields the following picture, where the hyperplane is a dotted line.
²Proving this requires some mathematical analysis and is out of scope of the course.
³Same as the above footnote. The fact that C is closed and bounded and D is closed will not be used from this point onwards.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 91
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

C ~x0
~c d~ D
~a

Notice that there are many separating hyperplanes, such as the one discussed before the theorem. But we just need
to prove that this hyperplane separates C and D.
The equation for this hyperplane is
!
~c + d~
> ~
~a (~x − ~x0 ) = (~c − d) >
~x − (5.38)
2
~> ~
~ > ~x − (~c − d) (~c + d)
= (~c − d) (5.39)
2
> ~>~
~ > ~x − ~c ~c − d d
= (~c − d) (5.40)
2
2
2
k~ck2 − d~
~ > ~x −
= (~c − d) 2
. (5.41)
2
Thus the given hyperplane is also available in (~a, b) form as
2
2
k~ck2 − d~
~
~a = ~c − d, b= 2
. (5.42)
2

Now we prove that it actually separates ~c and d.


~ Define f : Rn → R by
!
~c + d~
~
f (~x) = (~c − d)>
~x − . (5.43)
2

For the sake of contradiction, suppose there exists ~u ∈ D such that f (~u) ≥ 0. We can write

0 ≤ f (~u) (5.44)
!
~c + d~
~ > ~u −
= (~c − d) (5.45)
2
!
~
~ > ~u − d~ − ~c − d
= (~c − d) (5.46)
2
~> ~
~ − (~c − d) (~c − d)
~ > (~u − d)
= (~c − d) (5.47)
2
1 2
~ > ~
= (~c − d) (~u − d) − ~c − d~ . (5.48)
2 2

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 92
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

Thus
2
0 ≤ (~c − d) ~ − 1 ~c − d~ < (~c − d)
~ > (~u − d) ~ > (~u − d).
~ (5.49)
2 2

This means that ~c − d~ and ~u − d~ form an acute angle. It also means that ~u 6= d,
~ since otherwise the dot product would
be 0. Going back to our picture, this means that ~u would have to be positioned similarly to the following:

~u
C ~x0
~c d~ D
~a

At least from the diagram, it seems hard to imagine a ~u ∈ D such that ~u − d~ and ~c − d~ form an acute angle. Namely,
any vector ~x ∈ Rn (of reasonably small norm, such as the ~u in the figure) such that ~x − d~ and ~c − d~ form an acute
angle, seems to be closer to ~c than d~ is to ~c.
Why do we need the “reasonably small norm” condition? Consider the following possible ~x:

~x

C ~x0
~c d~ D
~a

Certainly, this ~x is farther from ~c than d~ is, and so no contradiction would be derived.
If we can prove that our ~u, which we assume is in D, is closer to ~c than d~ is, then we can derive a contradiction with
the fact that d~ is the closest vector in D to ~c. But we can’t prove this for our ~u directly, because ~u − d~ may be large
2
as in the above figure, so instead we take another vector ~x which is close to d, ~ where the displacement between ~x and
d~ points in the direction of ~u. We will show that this ~x is in D yet is closer to ~c than d~ is, thus deriving a contradiction.
Here are the details. Let p~ : [0, 1] → Rn trace out the line from d~ to ~u; namely, let p~(t) = d+t(~
~ u−d)~ = t~u+(1−t)d.
~
Since ~u, d ∈ D by assumption, and D is convex, we have that p~(t) ∈ D for all t ∈ [0, 1]. Now we see that
~
2
2
p(t) − ~ck2 = d~ + t(~u − d)
k~ ~ − ~c
2
2
= (d~ − ~c) + t(~u − d)
~
2
2 2
= d~ − ~c + 2t(d~ − ~c)> (~u − d)
~ + t2 ~u − d~
2 2
2 2
= ~c − d~ ~ (~u − d)
− 2t(~c − d) ~ +t > 2
~u − d~ .
2 2

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 93
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

2
We want to show that there exists t such that k~
p(t) − ~ck2 < ~c − d~ , i.e.,
2
2

2
~ > (~u − d)
−2t(~c − d) ~ + t2 ~u − d~ < 0, (5.50)
2

i.e.,
2
~ > (~u − d)
2 (~c − d) ~ −t ~u − d~ > 0. (5.51)
| {z } 2
>0
~ > (~ ~
Now for all 0 < t < c−d)
2(~
~
u−d)
2 , i.e., t small enough, the above inequality holds, so we have for this t that
u−d
~
2

2 2
2
p(t) − ~ck2 = ~c − d~
k~ ~ > (~u − d)
− 2t(~c − d) ~ + t2 ~u − d~
2 2
2
< ~c − d~ .
2

However, p~(t) ∈ D, a contradiction.

The following content is optional/out of scope for this semester. Regardless, it may be helpful to read it to
gain context, or get a deeper understanding of various results.

5.1.4 (OPTIONAL) Cones

Definition 109 (Cones, Proper Cones)


Let K ⊆ Rn .

(a) We call K a cone if, for any ~v ∈ K and α ∈ R+ a, we have α~v ∈ K.

(b) We call K a convex cone if it is both a cone and a convex set.

(c) We call K a pointed cone if it contains no line through the origin, i.e., if for each nonzero ~v ∈ K, there
exists some α ∈ R such that α~v ∈
/ K.

(d) We call K a solid cone if it has non-empty interior, i.e., if there exists some ~v ∈ K and some r > 0 such that
the open ball in Rn of radius r centered at ~v is contained in K: namely, we have {w
~ ∈ Rn | kw
~ − ~v k2 <
r} ⊆ K.

(e) We call K a closed cone if it is a closed set, i.e., it contains its boundary points.

(f) We call K a proper cone if it is convex, pointed, solid, and closed.


aThat is, α ∈ R and α ≥ 0

Note that non-empty cones must contain the zero vector, which corresponds to the case of taking α = 0 in the
definition of a cone.
The definition of proper cones is motivated by their connection to generalized inequalities in convex optimization,
which will be discussed later in the course in the context of second-order cone programs (SOCPs) and semidefinite
programming (SDPs). For this we require the above definitions to apply to a slightly broader context. We would need
to replace Rn with a generic vector space V and the k·k2 norm with any norm k·kV on this vector space. In fact, for the
following results to hold we additionally need to have an inner product on this vector space h·, ·iV that is compatible
with the norm, i.e., h~x, ~xiV = k~xkV ; that is, we would need an inner product space. One can check (as an exercise)
2

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 94
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

that Rn is an example of such an inner product space, with the `2 norm k·k2 and usual inner product. Thus, in order to
generalize the results introduced in this section, we would replace Rn with V , replace the norm k~xk2 with k~xkV , and
replace the inner product ~x> ~y with h~x, ~y iV .⁴
For now, we return to working over Rn so as to not introduce additional complexity in the definitions.

Example 110. The sets


.
C1 = {(~x, y) ∈ Rn+1 | k~xk2 ≤ y} (5.52)
.
C2 = {(x, y) ∈ R2 | y ≥ 0} (5.53)

are cones. If n = 1 then C1 looks like:


y

C1

Now, we discuss some important classes of cones.

Definition 111
(a) A set of the form
.
C = {(~x, t) ∈ Rn+1 | A~x ≤ t~y , t ≥ 0} (5.54)

is called a polyhedral cone, and in particular corresponds to the polyhedron {~x ∈ Rn | A~x ≤ ~y }.

(b) A set of the form


.
C = {(~x, t) ∈ Rn+1 | kA~x − t~y k2 ≤ tz, t ≥ 0} (5.55)

is called a ellipsoidal cone, and in particular corresponds to the ellipse {~x ∈ Rn | kA~x − ~y k2 ≤ z}.a
aThe ellipsoidal cone corresponding to the unit circle — which is, after all, an ellipse — is the second order cone, to be discussed later.

Proposition 112
Polyhedral and ellipsoidal cones are convex cones.

Proof. Left as an exercise.

⁴In this class, the issue really only comes up when discussing vector spaces where each element is a matrix, where the norm is the Frobenius
norm, and the inner product is a corresponding “Frobenius inner product,” to be defined later. This is relevant in semidefinite programming, for
example.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 95
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

Proposition 113
Let K ⊆ Rn be a cone. Define
.
K ? = {~y ∈ Rn | ~y > ~x ≥ 0 for each ~x ∈ K}. (5.56)

Then K ? is a closed convex cone. We call K ? the dual cone of K.

Proof. Let ~y , ~z ∈ K ? and let α, β ≥ 0. Then, for any ~x ∈ K, we have

(α~y + β~z)> ~x = |{z}


α (~y > ~x) + β (~z> ~x) ≥ 0. (5.57)
| {z } |{z} | {z }
≥0 ≥0 ≥0 ≥0

In particular, this holds for β = 0 (so that K ? is a cone), and α ∈ [0, 1] and β = 1 − α (so that K ? is convex). Thus,
K ? is a convex cone.
Now we want to show that K contains its limits. Let (~yk )∞
k=1 be a sequence in K that converges to some ~
?
y ∈ Rn .
We want to show that ~y ∈ K ? . Indeed, for any ~x ∈ K, we want to show that ~y > ~x ≥ 0. But this is true because we
have  >
>
~y ~x = lim ~yk ~x = lim ~yk> ~x ≥ 0. (5.58)
k→∞ k→∞ |{z}
≥0

Since ~x was arbitrary, we have ~y ∈ K . Thus K contains its limits and is a closed cone.
? ?

.
A geometric interpretation of the dual cone is that K ? is the intersection of the half-spaces H~x = {~y ∈ Rn | ~y > ~x ≥
0} defined by each vector ~x in K.
Below, we provide some examples of cones and their dual cones. The reader is encouraged to verify the following
statements.

Example 114.
.
(a) The set Rn+ = {~x ∈ Rn | xi ≥ 0 for all i ∈ {1, . . . , n}} is a convex cone, and its dual cone in Rn is itself.
.
(b) Let S = {~x ∈ R2 | x1 = 0 or x2 = 0}. Then S is a cone but is not a convex cone, and the dual cone of S is
{~0}, the singleton set comprised of the 2-dimensional zero vector.

(c) Let S ⊆ Rn be a subspace. Then S is a convex cone, and the orthogonal complement S ⊥ of S is the dual cone
of S.

Two proper cones with interesting properties that are widely used in convex optimization are the cone of symmetric
positive semi-definite matrices and the second-order cone. The propositions below explore their properties.

Proposition 115
Let Sn be the vector space of n × n real-valued symmetric matrices equipped with the Frobenius inner product:
n n
.
for any A, B ∈ Sn . (5.59)
XX
hA, BiF = tr(AB) = Aij Bij ,
i=1 j=1

and the Frobenius norm k·kF . Let Sn+ denote the set of all n × n positive semidefinite matrices.

(a) Sn+ is a proper cone in Sn ;

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 96
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

(b) The dual cone of Sn+ in Sn is itself.

To start with, this is an instance of the earlier discussion: not every application of cones will be with reference to Rn ,
but could be with reference to another inner product space. Here it is the vector space Sn with the appropriate inner
product and norm. The intuition for why we can do this is that Sn is a subspace of Rn×n , the space of n × n matrices.
But by stacking up the entries in an n × n matrix we get an n2 -dimensional vector, i.e., an element of Rn . Indeed,
2

the Frobenius norm and inner product on matrices are exactly the `2 inner product and norm applied to the “unrolled”
matrices in Rn . Thus, one can informally view Sn as a subspace of Rn (though remember that it is a vector space in
2 2

its own right, so that we can define things like interiors and dual cones with respect to it instead of its “parent” space
Rn ), so the same proof techniques and intuitions carry over.
2

Proof.

(a) To show that Sn+ is a convex cone, let A, B ∈ Sn+ and α, β ≥ 0 be given. We wish to show that αA + βB ∈ Sn+ ,
which will confirm that Sn+ is a convex cone. Indeed, αA + βB is symmetric as the linear combination of two
symmetric matrices. To show that it is positive semidefinite, let ~v ∈ Rn be arbitrarily given. Then we have

~v > (αA + βB)~v = |{z}


α ~|v >{zA~v} + β ~|v >{z
B~v} ≥ 0. (5.60)
|{z}
≥0 ≥0 ≥0 ≥0

Since ~v was arbitrarily given, αA + βB is positive semidefinite. Thus αA + βB ∈ Sn+ . This holds for β = 0,
so Sn+ is a cone, and also for α ∈ [0, 1] and β = 1 − α, so Sn+ is convex.
We now show that Sn+ is pointed, i.e., it contains no lines through the origin. Let A ∈ Sn+ be a nonzero matrix.
Then −A ∈
/ Sn+ , because there exists ~v ∈ Rn such that ~v > A~v > 0, at which point ~v > (−A)~v = −~v > A~v < 0, so
−A is not positive semidefinite. Thus −A ∈
/ Sn+ , so that for any A ∈ Sn+ there exists α ∈ R such that αA ∈
/ Sn+ .
Thus Sn+ contains no lines through the origin and is pointed.
We now show that Sn+ is solid, i.e., has nonempty interior. We show that the open ball in Sn defined by
 
. 1
n
B = A ∈ S kA − IkF < (5.61)
2

is contained in Sn+ . Indeed, let A ∈ B. By definition of B, we have that A is symmetric. Moreover, for each
~v ∈ Rn we have

~v > A~v = ~v > ((A − I) + I)~v (5.62)


= ~v > (A − I)~v + ~v >~v (5.63)
(5.64)
> 2
= ~v (A − I)~v + k~v k2
(5.65)
2 2
≥ − kA − Ik2 k~v k2 + k~v k2
(5.66)
2 2
≥ − kA − IkF k~v k2 + k~v k2
1
(5.67)
2 2
> − k~v k2 + k~v k2
2
1
(5.68)
2
= k~v k2 .
2

For ~v nonzero, we have k~v k2 , and so ~v > A~v > 0. Thus, A ∈ Sn+ .
1 2
2

Finally, we need to show that Sn+ is a closed cone. Let (Ak )∞


k=1 be a sequence in S+ that converges to some
n

A ∈ Sn . We want to show that A ∈ Sn+ . As the limit of symmetric matrices, A is symmetric. Now for any

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 97
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

~v ∈ Rn we have  
~v > A~v = ~v > lim Ak ~v = lim ~v > Ak~v ≥ 0. (5.69)
k→∞ k→∞ | {z }
≥0

Thus ~v A~v ≥ 0 for all ~v ∈ R , so that A ∈


> n
Sn+ . Thus Sn+ contains its limits and is a closed cone.
We have thus proved that Sn+ is a convex, pointed, solid, and closed cone, so it is a proper cone.

(b) We now show that the dual cone of Sn+ in Sn is itself. That is, defining the dual cone as (Sn+ )? = {A ∈ Sn |
hA, BiF ≥ 0 for all B ∈ Sn+ }, we want to show that (Sn+ )? = Sn+ . To do this, we show that (Sn+ )? ⊆ Sn+ and
that (Sn+ )? ⊇ Sn+ .
We first show that (Sn+ )? ⊆ Sn+ . Fix A ∈ (Sn+ )? , and let ~v ∈ Rn be given arbitrarily. Then, since ~v~v > ∈ Sn+ , we
have

~v > A~v = tr ~v > A~v (5.70)




>
(5.71)

= tr A~v~v
= A, ~v~v > F
(5.72)
≥ 0, (5.73)

where in the first line we use the fact that the trace of a scalar is a scalar, in the second line we use the cyclic trace
inequality,and the last inequality is justified because ~v~v > ∈ Sn+ and A ∈ (Sn+ )? . Since ~v was selected arbitrarily,
~v > A~v ≥ 0 for all ~v ∈ Rn . This (along with the fact that A is symmetric) proves that A ∈ Sn+ . Since A was
selected arbitrarily, (Sn+ )? ⊆ Sn+ .
Now we show that Sn+ ⊆ (Sn+ )? . Let B ∈ Sn+ . We aim to show that B ∈ (Sn+ )? , i.e., that hB, CiF ≥ 0 for any
C ∈ Sn+ . By the spectral theorem, we may diagonalize C = i=1 λi~vi~vi> , where λi ≥ 0 are the eigenvalues of
Pn

C and ~vi ∈ Rn are orthonormal eigenvectors of C. Thus we have

hB, CiF = tr(BC) (5.74)


n
!!
(5.75)
X
= tr B λi~vi~vi>
i=1
n
!
(5.76)
X
= tr λi B~vi~vi>
i=1
n
(5.77)
X
λi tr B~vi~vi>

=
i=1
n
(5.78)
X
λi tr ~vi> B~vi

=
i=1
n
(5.79)
X
= λi ~vi> B~vi
|{z} | {z }
i=1 ≥0
≥0

≥ 0. (5.80)

Thus we have hB, CiF ≥ 0. Since C ∈ Sn+ were arbitrary, we have B ∈ (Sn+ )? . Since B were arbitrary, we
have Sn+ ⊆ (Sn+ )? .
Thus, (Sn+ )? = Sn+ .

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 98
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

The next example of a cone will be useful when discussing the eponymous second-order cone programs (SOCPs).

Definition 116 (Second Order Cone)


The second-order cone in Rn+1 is the set:
.
K = {(~x, t) ∈ Rn × R | k~xk2 ≤ t}. (5.81)

Proposition 117
Let K be the second-order cone in Rn+1 .

(a) K is a proper cone.

(b) The dual cone of K in Rn+1 is itself.

Proof.

(a) We first show that K is a convex cone. Let (~x1 , t1 ), (~x2 , t2 ) ∈ K and let α1 , α2 ≥ 0. Then

kα1 ~x1 + α2 ~x2 k2 ≤ α1 k~x1 k2 + α2 k~x2 k2 (5.82)


≤ α1 t1 + α2 t2 (5.83)

where the first inequality is by triangle inequality and the second is by definition of a second-order cone. This
holds for α2 = 0, showing that K is indeed a cone, and α1 ∈ [0, 1] and α2 = 1 − α1 , showing that K is convex.
Thus K is a convex cone.
We show that K is pointed, i.e., contains no lines through the origin. Indeed, let (~x, t) ∈ K be nonzero. Then
either ~x is nonzero or t is nonzero (or both); in the first case, k~xk2 > 0 so since t ≥ k~xk2 , we have t > 0 as
well. Thus t > 0 in all cases. Thus we certainly do not have k−~xk2 ≤ −t(in fact, norms can never be negative)
so that −(~x, t) = (−~x, −t) ∈
/ K. Thus for any (~x, t) ∈ K there exists α ∈ R such that α(~x, t) ∈
/ K, so K is
pointed.
We now show that K is solid, i.e., has nonempty interior. We claim that the open ball in Rn+1 of radius 1
centered at (~0, 2), where ~0 is the n-dimensional zero vector, is contained in K. Formally, define
.
B = {(~x, t) ∈ Rn+1 | (~x, t) − (~0, 2) 2
< 1}. (5.84)

Let (~x, t) ∈ B; we show that (~x, t) ∈ K. Indeed, we have

(~x, t) − (~0, 2) 2
<1 (5.85)
2
=⇒ (~x, t) − (~0, 2) 2
< 12 = 1 (5.86)
(5.87)
2 2
=⇒ k~xk2 + (t − 2) < 1,

which implies that k~xk2 < 1 and (t − 2)2 < 1, namely t ∈ (1, 3). Thus k~xk2 < 1 < t, so k~xk2 < 1 < t, so
2 2

k~xk2 ≤ t, so (~x, t) ∈ K as desired. Since (~x, t) ∈ B were arbitrarily chosen, B ⊆ K and K is solid.
We now show that K is closed, i.e., contains its limits. Let ((~xk , tk ))∞
k=1 be a sequence in K that converges
to some (~x, t) ∈ Rn+1
. We want to show that (~x, t) ∈ K. Indeed, we have that (~x, t) ∈ K if and only if
t − k~xk2 ≥ 0. We have

t − k~xk2 = lim tk − lim ~xk (5.88)


k→∞ k→∞ 2

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 99
EECS 127/227AT Course Reader 5.1. Convex Sets 2024-04-27 21:08:09-07:00

= lim (tk − k~xk k2 ) (5.89)


k→∞ | {z }
≥0

≥ 0. (5.90)

Thus (~x, t) ∈ K, so K contains its limits and is closed.


We have proved that K is a convex, pointed, solid, and closed cone, so it is proper.

(b) We show that the dual cone of K in Rn+1 is K itself. Let K ? = {(~y , s) ∈ Rn+1 | (~y , s)> (~x, t) ≥ 0 for all (~x, t) ∈
K} be the dual cone of K. We first show that K ? ⊆ K, then show that K ? ⊇ K.
First, to show that K ? ⊆ K, fix (~y , s) ∈ K ? . We want to show that s ≥ k~y k2 , so that (~y , s) ∈ K ? . Since
(~0, 1) ∈ K, by definition of K ? we have

0 ≤ (~y , s)> (~0, 1) = s. (5.91)

Thus, if k~y k2 = 0 (i.e., if ~y = ~0), then we have s ≥ k~y k2 , so that (~y , s) ∈ K.


Now suppose that ~y 6= 0, so that k~y k2 > 0. Then (−~y , k~y k2 ) ∈ K, so that

(5.92)
2
0 ≤ (~y , s)> (−~y , k~y k2 ) = − k~y k2 + s k~y k2
(5.93)
2
⇒ s k~y k2 ≥ k~y k2
⇒ s ≥ k~y k2 . (5.94)

Therefore, (~y , s) ∈ K. Since (~y , s) ∈ K ? were arbitrary, we have K ? ⊆ K.


Now we want to show that K ⊆ K ? . To this end, let (~y , s) ∈ K. We want to show that (~y , s) ∈ K ? , or
equivalently, (~x, t)> (~y , s) ≥ 0 for all (~x, t) ∈ K. Indeed,

(~x, t)> (~y , s) = ~x> ~y + st ≥ − k~xk2 k~y k2 − st ≥ 0 (5.95)

where the first inequality follows by Cauchy-Schwarz inequality, and the second inequality follows from the fact
that since (~x, t), (~y , s) ∈ K we have k~xk2 ≤ t and k~y k2 ≤ s. This shows that (~y , s) ∈ K ? , so that K ⊆ K ? .
We have shown that K ⊇ K ? and K ⊆ K ? , so K = K ? .

Theorem 118
Let K ⊆ Rn be a non-empty closed convex cone. Then (K ? )? = K.

Proof. We want to show that K ⊆ (K ? )? and K ⊇ (K ? )? .


First, we want to show that K ⊆ (K ? )? . Fix ~x ∈ K. We want to show that ~x ∈ (K ? )? , which means that ~x> ~y ≥ 0
for any ~y ∈ K ? . For this ~y , we have ~y > ~z ≥ 0 for all ~z ∈ K. But this includes ~z = ~x, so ~x> ~y = ~y > ~x ≥ 0. Thus
~x ∈ (K ? )? . We conclude that K ⊆ (K ? )? . (Note that K ⊆ (K ? )? actually holds for any cone K, not just non-empty
closed convex cones.)
Next, we want to show that (K ? )? ⊆ K. Suppose for the sake of contradiction that (K ? )? 6⊆ K. Then there exists
~y ∈ (K ? )? such that ~y ∈
/ K. Since ~y ∈
/ K, we have ~y 6= ~0. Because K is a closed convex cone, it is a closed convex set.
Since {~y } and K are two disjoint closed convex sets, the Separating Hyperplane Theorem tells us that there exists some

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 100
EECS 127/227AT Course Reader 5.2. Convex Functions 2024-04-27 21:08:09-07:00

nonzero w
~ ∈ Rn and c ∈ R such that w
~ > ~x > c for all ~x ∈ K, and w
~ > ~y < c. Since ~0 ∈ K, we have 0 = w
~ >~0 > c,
i.e., c < 0, so w
~ > ~y < c < 0. Since K is a cone, for any α > 0 we have

~ > (α~x) > c


w (5.96)
⇒ αw~ > ~x > c (5.97)
c
⇒w~ > ~x > . (5.98)
α
By taking α → ∞ we get that w
~ > ~x ≥ 0 for any ~x ∈ K. Thus w
~ ∈ K ? . But we must have w
~ > ~y < c < 0, so
/ (K ? )? , a contradiction. Thus (K ? )? ⊆ K and so (K ? )? = K.
~y ∈

The above content is optional/out of scope for this semester, but now we resume the required/in scope content.

5.2 Convex Functions


In this section, we define convex and concave functions, and introduce their properties. At the end, we also deliberate
on the special and important example of affine functions.

Definition 119 (Convex and Concave Functions)


Let f : Rn → R be a function. We say that f is convex if the domain of f , say Ω, is convex, and, for any ~x1 , ~x2 ∈ Ω
and θ ∈ [0, 1], we have
f (θ~x1 + (1 − θ)~x2 ) ≤ θf (~x1 ) + (1 − θ)f (~x2 ). (5.99)

A function f : Rn → R is concave if −f is convex.

Equation (5.99) is also called Jensen’s inequality and is equivalent to the following, seemingly more general statement.

Theorem 120 (Jensen’s Inequality)


Let Ω be a convex set and let f : Ω → R be a convex function. Let ~x1 , . . . , ~xk ∈ Ω, and let θ1 , . . . , θk ∈ [0, 1]
such that i=1 θi = 1. Then
Pk
k
! k
(5.100)
X X
f θi ~xi ≤ θi f (~xi ).
i=1 i=1

We can think about this result in terms of a picture.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 101
EECS 127/227AT Course Reader 5.2. Convex Functions 2024-04-27 21:08:09-07:00

graph(f )
(~x1 , f (~x1 )) (θ~x1 + (1 − θ)~x2 ,
f (~x1 ) θf (~x1 ) + (1 − θ)f (~x2 ))

θf (~x1 ) + (1 − θ)f (~x2 )

f (~x2 )
(~x2 , f (~x2 ))
(θ~x1 + (1 − θ)~x2 ,
f (θ~x1 + (1 − θ)~x2 ))

f (θ~x1 + (1 − θ)~x2 )


~x1 θ~x1 + (1 − θ)~x2 ~x2

The prototypical convex function has a “bowl-shaped” graph, and taking a weighted average of two (or any finite
number) of points will mean we land in the bowl. In particular, taking a weighted average of any number of function
values f (~x1 ), . . . , f (~xk ) will always give a larger number than applying the function f to the same weighted averages
of the points ~x1 , . . . , ~xk . Put more simply, if f is convex then the chord joining the points (~x1 , f (~x1 )) and (~x2 , f (~x2 ))
always lies above the graph of f . Similarly, if f is concave then the chord joining the points (~x1 , f (~x1 )) and (~x2 , f (~x2 ))
always lies below the graph of f .
From the picture, it may be intuitively clear that it is hard to construct convex functions f with multiple global
minima. We will come back to this idea later.
It may be useful to connect the notion of convex function and convex set. For this, we will define the epigraph.

Definition 121 (Epigraph)


Let Ω be a convex set and let f : Ω → R be a convex function. The epigraph of f , denoted epi(f ) ⊆ Ω × R, is
defined as
epi(f ) = {(~x, t) | ~x ∈ Ω, t ≥ f (~x)}. (5.101)

Geometrically, the epigraph is all points in Ω × R that lie above the graph of f .

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 102
EECS 127/227AT Course Reader 5.2. Convex Functions 2024-04-27 21:08:09-07:00

graph(f )

epi(f )

Proposition 122
Let Ω ⊆ Rn be a convex set and let f : Ω → R be a function. Then f is a convex function if and only if epi(f ) is
a convex set.

Proof. Left as an exercise.

Theorem 123 (First-Order Condition for Convexity)


Let Ω ⊆ Rn be a convex set and let f : Ω → R be a differentiable function. Then f is convex if and only if for all
~x, ~y ∈ Ω, we have
f (~y ) ≥ f (~x) + [∇f (~x)]> (~y − ~x). (5.102)

Note that this latter term is the first-order Taylor expansion of f around ~x evaluated at ~y , i.e., fb1 (~y ; ~x) = f (~x) +
[∇f (~x)]> (~y − ~x). The graph of fb1 (·; ~x) is the tangent line to the graph of f at the point (~x, f (~x)). So another
characterization of convex functions is that their graphs lie above their tangent lines.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 103
EECS 127/227AT Course Reader 5.2. Convex Functions 2024-04-27 21:08:09-07:00

graph(f )

f (~y )
(~y , f (~y ))

(~x, f (~x))
f (~x)


~x ~y
fb1 (~y ; ~x)
(~y , fb1 (~y ; ~x)) graph(fb1 (·; x))

Proof. First suppose f is convex. Then for any h ∈ (0, 1), we have

f (h~y + (1 − h)~x) ≤ hf (~y ) + (1 − h)f (~x) (5.103)


= hf (~y ) + f (~x) − hf (~x) (5.104)
= f (~x) + h(f (~y ) − f (~x)) (5.105)
=⇒ f (h~y + (1 − h)~x) − f (~x) ≤ h(f (~y ) − f (~x)) (5.106)
f (h~y + (1 − h)~x) − f (~x)
=⇒ ≤ f (~y ) − f (~x) (5.107)
h
f (h~y + (1 − h)~x) − f (~x)
=⇒ f (~y ) ≥ f (~x) + (5.108)
h
f (~x + h(~y − ~x)) − f (~x)
= f (~x) + . (5.109)
h
To summarize, for any h ∈ (0, 1) we have that
f (~x + h(~y − ~x)) − f (~x)
f (~y ) ≥ f (~x) + . (5.110)
h
Taking the limit h → 0 on both sides, we get
 
f (~x + h(~y − ~x)) − f (~x)
f (~y ) ≥ lim f (~x) + (5.111)
h→0 h
f (~x + h(~y − ~x)) − f (~x)
= f (~x) + lim (5.112)
h→0 h
= f (~x) + [∇f (~x)]> (~y − ~x) (5.113)

as desired. Here the last equality is because the limit is interpreted as a directional derivative, and it has already been
shown that directional derivatives are equal to inner products of the gradient with the direction vector.
For the other direction, let θ ∈ [0, 1] and let ~z = θ~x + (1 − θ)~y . We have

f (~x) ≥ f (~z) + [∇f (~z)]> (~x − ~z) (5.114)

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 104
EECS 127/227AT Course Reader 5.2. Convex Functions 2024-04-27 21:08:09-07:00

f (~y ) ≥ f (~z) + [∇f (~z)]> (~y − ~z). (5.115)

Adding θ times the first equation to (1 − θ) times the second equation, we get

θf (~x) + (1 − θ)f (~y ) ≥ θf (~z) + (1 − θ)f (~z) + θ[∇f (~z)]> (~x − ~z) + (1 − θ)[∇f (~z)]> (~y − ~z) (5.116)
>
= f (~z) + [∇f (~z)] (θ~x + (1 − θ)~y − ~z) (5.117)
= f (~z) + [∇f (~z)]> (~z − ~z) (5.118)
= f (~z) + [∇f (~z)] 0 >~
(5.119)
= f (~z) (5.120)
= f (θ~x + (1 − θ)~y ). (5.121)

We also have a corresponding second-order condition.

Theorem 124 (Second-Order Condition for Convexity)


Let Ω ⊆ Rn be a convex set and let f : Ω → R be a twice-differentiable function. Then f is convex if and only if
for all ~x ∈ Ω, we have
∇2 f (~x)  0, (5.122)

i.e., ∇2 f (~x) is PSD for each ~x ∈ Ω.

Corollary 125. Let Q ∈ Sn be a symmetric matrix, let ~b ∈ Rn , and let c ∈ R. The quadratic form

f (~x) = ~x> Q~x + ~b> ~x + c (5.123)

is convex if and only if Q is PSD.

Lastly, we identify a strengthened condition of convexity which allows for stronger guarantees later down the line.

Definition 126 (Strictly Convex Function)


Let f : Rn → R be a function. We say that f is strictly convex if the domain of f , say Ω, is convex, and, for any
~x1 6= ~x2 ∈ Ω and θ ∈ (0, 1), we have

f (θ~x1 + (1 − θ)~x2 ) < θf (~x1 ) + (1 − θ)f (~x2 ). (5.124)

A function f : Rn → R is strictly concave if −f is strictly convex.

And correspondingly, we have the first-order and second-order conditions.

Theorem 127 (First-Order Condition for Strict Convexity)


Let Ω ⊆ Rn be a convex set and let f : Ω → R be a differentiable function. Then f is strictly convex if and only
if for all ~x 6= ~y ∈ Ω, we have
f (~y ) > f (~x) + [∇f (~x)]> (~y − ~x). (5.125)

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 105
EECS 127/227AT Course Reader 5.2. Convex Functions 2024-04-27 21:08:09-07:00

Theorem 128 (Second-Order Condition for Strict Convexity)


Let Ω ⊆ Rn be a convex set and let f : Ω → R be a twice-differentiable function such that ∇2 f (~x)  0 for all
~x ∈ Ω, i.e., ∇2 f (~x) is PD for all ~x ∈ Ω. Then f is strictly convex.

Notice that this is not an if-and-only-if! As an example, take the scalar function f (x) = x4 . Then f 00 (0) = 0, so it
is not true that f 00 (x)  0 for all x ∈ Ω = R, but f is strictly convex.

5.2.1 Affine Functions


Finally, we spend some time on the special case of affine functions, which are the only functions that are both convex
and concave. These are of special importance in the remainder of the course, being used or referenced in almost every
topic henceforth.

Definition 129 (Affine Functions)


(a) A function f : Rn → R is said to be affine if there exists some vector ~a ∈ Rn and some scalar b ∈ R such
that for any ~x ∈ Rn , we have
f (~x) = ~a> ~x + b. (5.126)

(b) A function f~ : Rn → Rm is said to be affine if there exists some matrix A ∈ Rm×n and some vector ~b ∈ Rm
such that for any ~x ∈ Rn , we have
f~(~x) = A~x + ~b. (5.127)

(c) A function f : Rm×n → R is said to be affine if there exists some matrix A ∈ Rm×n and scalar b ∈ R such
that for any X ∈ Rm×n , we have
m X
n
(5.128)
X
Aij Xij + b = tr A> X + b.

f (X) =
i=1 j=1

Note that a given function f : Rn → R is affine if and only if the function g : Rn → R defined by g(~x) = f (~x)−f (~0)
is linear. An analogous result holds for other types of affine functions.
Below, we show that a scalar-valued affine function is one that is both convex and concave, while a vector-valued
affine function is one whose component functions are all both convex and concave. Analogous results hold for affine
functions whose inputs and outputs are both matrices.

Proposition 130
(a) A function f : Rn → R is affine if and only if it is both convex and concave, i.e., for any α ∈ [0, 1] and
~x, ~y ∈ Rn , we have
f (α~x + (1 − α)~y ) = αf (~x) + (1 − α)f (~y ). (5.129)

(b) A function f~ : Rn → Rm is affine if and only each component function of f is both convex and concave,
i.e., for any α ∈ [0, 1] and ~x, ~y ∈ Rn , we have

f~(α~x + (1 − α)~y ) = αf~(~x) + (1 − α)f~(~y ). (5.130)

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 106
EECS 127/227AT Course Reader 5.2. Convex Functions 2024-04-27 21:08:09-07:00

(c) A function f : Rm×n → R is affine if and only if it is both convex and concave, i.e., for any α ∈ [0, 1] and
X, Y ∈ Rm×n , we have
f (αX + (1 − α)Y ) = αf (X) + (1 − α)f (Y ). (5.131)

Proof. We prove (a); the claims (b) and (c) follow similarly.
Suppose first that f is affine. Then there exists ~a ∈ Rn and b ∈ R such that for each ~x, ~y ∈ Rn and α ∈ [0, 1], we
have

f (α~x + (1 − α)~y ) = ~a> (α~x + (1 − α)~y ) + b (5.132)


= α~a> ~x + (1 − α)~a> ~y + b (5.133)
>
= α(~a ~x + b) + (1 − α)(~a ~y + b) >
(5.134)
= αf (~x) + (1 − α)f (~y ). (5.135)

Conversely, suppose that

f (α~x + (1 − α)~y ) = αf (~x) + (1 − α)f (~y ), for all α ∈ [0, 1] and ~x, ~y ∈ Rn . (5.136)

To show that f is affine, it suffices to show that g : Rn → R defined by g(~x) = f (~x) − f (~0) is linear (and thus can be
written as an inner product against a vector ~a). We first show that g(r~x) = rg(~x) for any r ∈ R and ~x ∈ Rn . We break
this problem up into three cases, each building on the other.

Case 1. Suppose that r ∈ [0, 1]. Then r~x can be expressed as a convex combination of ~0 and ~x: that is,

r~x = α~x + (1 − α)~0, where α = r ∈ (0, 1). (5.137)

(Yes, this is a simpler step, but we build on it in the later parts.) With this, we have

f (r~x) = f (r~x + (1 − r)~0) = rf (~x) + (1 − r)f (~0). (5.138)

Thus, we obtain

g(r~x) = f (r~x) − f (~0) (5.139)


= rf (~x) + (1 − r)f (~0) − f (~0) (5.140)
= rf (~x) − rf (~0) (5.141)
= r(f (~x) − f (~0)) (5.142)
= rg(~x). (5.143)

Case 2. Now suppose that r ∈ (1, ∞). Then ~x can be expressed as a convex combination of ~0 and r~x: that is,
1
~x = α(r~x) + (1 − α)~0, where α = ∈ (0, 1). (5.144)
r
Thus we have
       
1 1 1 ~ 1 1
f (~x) = f · r~x = f · r~x + 1 − 0 = f (r~x) + 1 − f (~0). (5.145)
r r r r r
Multiplying both sides by r and plugging it into the previous calculation, we get

g(r~x) = f (r~x) − f (~0) (5.146)

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 107
EECS 127/227AT Course Reader 5.3. Convex Optimization Problems 2024-04-27 21:08:09-07:00

 
1
= rf (~x) − r 1 − f (~0) − f (~0) (5.147)
r
= rf (~x) − (r − 1)f (~0) − f (~0) (5.148)
= rf (~x) − rf (~0) (5.149)
= r(f (~x) − f (~0)) (5.150)
= rg(~x). (5.151)

Case 3. Now suppose that r ∈ (−∞, 0). Then ~0 can be expressed as a convex combination of ~x and r~x: that is,
1
0 = α(r~x) + (1 − α)~x, where α = ∈ (0, 1). (5.152)
1−r
Thus we have
   
~ 1 1 1 1−r−1 1 r
f (0) = f (r~x) + 1 − ~x = f (r~x) + f (~x) = f (r~x) − f (~x).
1−r 1−r 1−r 1−r 1−r 1−r
(5.153)
Multiplying both sides by 1 − r and plugging it into the previous calculation, we get

g(r~x) = f (r~x) − f (~0) (5.154)


= rf (~x) + (1 − r)f (~0) − f (~0) (5.155)
= rf (~x) − rf (~0) (5.156)
= r(f (~x) − f (~0)) (5.157)
= rg(~x). (5.158)

This proves that, no matter the value of r, we have g(r~x) = rg(~x).


Now we show that g(~x + ~y ) = g(~x) + g(~y ). With the preceding result in hand, this proof is much shorter:
 
1 1
g(~x + ~y ) = 2g ~x + ~y (5.159)
2 2
   
1 1
=2 f ~x + ~y − f (~0) (5.160)
2 2
 
1 1
= 2 f (~x) + f (~y ) − f (~0) (5.161)
2 2
 
1 1
= 2 (f (~x) − f (~0)) + (f (~y ) − f (~0)) (5.162)
2 2
 
1 1
= 2 g(~x) + g(~y ) (5.163)
2 2
= g(~x) + g(~y ). (5.164)

Thus we proved that g is linear, so f is affine. This is a full proof of (a), and (b) and (c) can be proved in almost exactly
the same way.

5.3 Convex Optimization Problems


This section will lay out some of the key properties of the main class of optimization problems we are interested in —
convex optimization problems.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 108
EECS 127/227AT Course Reader 5.3. Convex Optimization Problems 2024-04-27 21:08:09-07:00

Definition 131 (Convex Optimization Problem)


Let Ω ⊆ Rn be a set and let f : Ω → R be a function. We say that the problem

min f (~x) (5.165)


x∈Ω
~

is a convex optimization problem if Ω is a convex set and f is a convex function.

Note that this applies to other kinds of constraint sets too — in particular, those of the “standard” form “fi (~x) ≤ 0
for all i and hi (~x) = 0 for all j” still define a feasible region Ω and thus can furnish a convex optimization problem. In
particular, we have the following result.

Theorem 132 (When is Standard Form Optimization Problem Convex?)


Let f1 , . . . , fm , h1 , . . . , hp : Rn → R be functions. The feasible set
( )
fi (~x) ≤ 0, ∀i ∈ {1, . . . , m}
Ω = ~x ∈ R n
(5.166)
hj (~x) = 0, ∀j ∈ {1, . . . , p}

is convex if each fi is convex and each hj is affine. Consequently, the problem

min f0 (~x) (5.167)


x∈Rn
~

s.t. fi (~x) ≤ 0, ∀i ∈ {1, . . . , m}


hj (~x) = 0, ∀j ∈ {1, . . . , p}.

is a convex optimization problem if f0 , f1 , . . . , fm are convex functions and h1 , . . . , hp are affine functions.

Proof. Left as exercise.

Now, we can establish the first-order condition for optimality within a convex problem. This is one of the main
theorems of convex analysis.

Theorem 133 (First-Order Conditions for Optimality in Convex Problem)


Let Ω ⊆ Rn be a convex set and let f : Ω → R be a differentiable convex function. Let ~x? ∈ Ω be such that
∇f (~x? ) = ~0. Then ~x? ∈ argmin~x∈Ω f (~x), i.e., ~x? is a global minimizer of f .

Proof. Let ~y be any other point in Ω. Then

f (~y ) ≥ f (~x? ) + [∇f (~x? )]> (~y − ~x? ) (5.168)


? ~>
= f (~x ) + 0 (~y − ~x ) ?
(5.169)
= f (~x? ). (5.170)

This proves that ~x? ∈ argmin~x∈Ω f (~x) as desired.

A generalization of this statement is that all local minimizers are global minimizers.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 109
EECS 127/227AT Course Reader 5.4. Solving Convex Optimization Problems 2024-04-27 21:08:09-07:00

Theorem 134 (For Convex Functions, Local Minima are Global Minima)
Let Ω ⊆ Rn be a convex set and let f : Ω → R be a convex function. Let ~x? ∈ Ω be such that there exists some
 > 0 such that if ~x ∈ Ω has k~x − ~x? k2 ≤  then f (~x? ) ≤ f (~x). Then ~x? is a minimizer of f over Ω.

Proof. Left as exercise.

When is the global minimizer unique? We can justify this using strict convexity.

Theorem 135 (Strictly Convex Functions have Unique Minimizers)


Let Ω ⊆ Rn be a convex set and let f : Ω → R be a strictly convex function. Then f has at most one global
minimizer.

Proof. Left as exercise.

For an example of a strictly convex function with one global minimizer, take f (x) = x4 , which is minimized at
x = 0. For an example of a strictly convex function with no global minimizers, take f (x) = ex .

5.4 Solving Convex Optimization Problems


To construct a first attempt at systematically solving convex optimization problems, we first need to define active and
inactive constraints.

Definition 136 (Types of Constraints)


Consider a problem of the form

min f0 (~x) (5.171)


x∈Rn
~

s.t. fi (~x) ≤ 0, ∀i ∈ {1, . . . , m}


hj (~x) = 0, ∀j ∈ {1, . . . , p}.

Fix a feasible ~x0 ∈ Rn . The inequality constraint fk (~x) ≤ 0 is active at ~x0 if fk (~x0 ) = 0, and inactive at ~x0
otherwise, i.e., fk (~x0 ) < 0.

We can use this to formulate a strategy for solving convex optimization problems. Recall that for a convex problem
min~x∈Ω f0 (~x) which has a solution ~x? , either ∇f (~x? ) = ~0 or ~x? is on the boundary of Ω. The boundary of Ω is any
point in which any inequality constraint is active. This allows us to systematically find solutions to convex optimization
problems.

Problem Solving Strategy 137. To solve a convex optimization program, we can do the following.

1. Iterate through all 2m subsets S ⊆ {1, . . . , m} of constraints which might be active at optimum.

2. For each S:

(i) Solve the modified problem

min f0 (~x) (5.172)


x∈Rn
~

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 110
EECS 127/227AT Course Reader
5.5. Problem Transformations and Reparameterizations 2024-04-27 21:08:09-07:00

s.t. fi (~x) = 0, ∀i ∈ S
hj (~x) = 0, ∀j ∈ {1, . . . , p},

i.e., solve the problem where you pretend that all inequality constraints in S are met with equality and
pretend that the other inequality constraints don’t exist. This gives some solutions ~x?S .
(ii) If there is a solution ~x?S which is feasible for the original problem, write down the value of f0 (~x?S ). Other-
wise, ignore it.
(iii) After iterating through all ~x?S which are feasible for the original problem, take the one(s) with the best
objective value f0 (~x?S ) as the optimal solution(s) to the original problem.

Predictably, this problem solving strategy is exponentially hard as the number of inequality constraints increases.
Even if solving the “inner” equality-constrained minimization problems is easy (as it often is), the whole procedure is
untenable for large-scale problems. In future chapters, we will develop better analytic and algorithmic ways to solve
convex optimization problems.

5.5 Problem Transformations and Reparameterizations


In this section, we discuss various transformations, or reparameterizations, of generic optimization problems, which
allow us to go between many equivalent formulations of a problem. Some of them will even allow us to turn non-convex
problems into convex problems.

5.5.1 Monotone Transformations of the Objective Function


The core of this technique is the following result.

Proposition 138 (Positive Monotone Transformations Don’t Affect Optimizers)


.
Let Ω be any set, let f0 : Ω → R be any function. Define f0 (Ω) = {f0 (~x) | ~x ∈ Ω} ⊆ R.

(a) Let φ : f0 (Ω) → R be any monotonically increasing function. Then

argmin f0 (~x) = argmin φ(f0 (~x)) and argmax f0 (~x) = argmax φ(f0 (~x)). (5.173)
x∈Ω
~ x∈Ω
~ x∈Ω
~ x∈Ω
~

(b) Let ψ : f0 (Ω) → R be any monotonically decreasing function. Then

argmin f0 (~x) = argmax ψ(f0 (~x)) and argmax f0 (~x) = argmin ψ(f0 (~x)). (5.174)
x∈Ω
~ x∈Ω
~ x∈Ω
~ x∈Ω
~

Proof. We only prove the very first equality; the rest follow similarly. We have

~x? ∈ argmin f0 (~x) (5.175)


x∈Ω
~

⇐⇒ f0 (~x? ) ≤ f0 (~x), ∀~x ∈ Ω (5.176)


⇐⇒ φ(f0 (~x? )) ≤ φ(f0 (~x)), ∀~x ∈ Ω (5.177)
?
⇐⇒ ~x ∈ argmin φ(f0 (~x)). (5.178)
x∈Ω
~

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 111
EECS 127/227AT Course Reader
5.5. Problem Transformations and Reparameterizations 2024-04-27 21:08:09-07:00

The first equality will be by far the most important for us, though the others might also be situationally useful. This
proposition is why doing things like squaring the norm in least squares won’t affect the solutions we get, i.e.,
2
argmin A~x − ~b = argmin A~x − ~b . (5.179)
x∈Rn
~ 2 x∈Rn
~ 2

But the fact that φ is monotonic when restricted to f0 (Ω) is quite crucial; indeed, we have that

argmin x = ∅ but argmin x2 = {0}. (5.180)


x∈R x∈R

This is because the function u 7→ u2 is not monotonic in general, although it is monotonic on the non-negative real
numbers R+ . It just so happens that k·k2 only outputs non-negative numbers, so actually in the case of least squares
we have f0 (Ω) = R+ and the proposition applies.

Example 139 (Logistic Regression). A more non-trivial example is logistic regression. First, define σ : R → (0, 1) by

. 1
σ(x) = . (5.181)
1 + e−x
Suppose we have data points ~x1 , . . . , ~xn ∈ Rd and accompanying labels y1 , . . . , yn ∈ {0, +1}. Suppose that the
conditional probability that yi = 1 given ~xi is given by

Pw~ 0 [~yi = 1 | ~xi ] = σ(~x>


i w
~ 0) (5.182)

for some w
~ 0 ∈ Rd . We wish to recover w
~ 0 , and thus recover the generative model Pw~ 0 [y | ~x]. We do this by maximum
likelihood estimation. Writing out the problem, we get

~ ? ∈ argmax Pw~ [y1 , . . . , yn | ~x1 , . . . , ~xn ]


w (5.183)
w∈R
~ d

n
(5.184)
Y
= argmax Pw~ [yi | ~xi ]
w∈R
~ d
i=1
  
n n
(5.185)
Y Y
= argmax  Pw~ [yi = 0 ~xi ]  Pw~ [yi = 1 ~xi ]
  
w∈R
~ d
i=1 i=1
yi =0 yi =1
n
(5.186)
Y
= argmax Pw~ [yi = 1 | ~xi ]yi Pw~ [yi = 0 | ~xi ]1−yi
w∈R
~ d
i=1
n
(5.187)
Y
= argmax σ(~x> ~ yi (1 − σ(~x>
i w) ~ 1−yi .
i w))
w∈R
~ d
i=1

Now, the σ function is very non-convex. The product of σs is also very non-convex. Thus, it seems intractable to solve
this problem. But note that the objective function takes values in (0, 1). We use the above proposition with the function
x 7→ log(x), which is monotonically increasing on (0, 1). We obtain
n
(5.188)
Y
argmax σ(~x> ~ yi (1 − σ(~x>
i w) ~ 1−yi
i w))
w∈R
~ d
i=1
n
!
(5.189)
Y
= argmax log σ(~x> ~ yi (1 − σ(~x>
i w) ~ 1−yi
i w))
w∈R
~ d
i=1
n
(5.190)
X
yi log σ(~x> ~ + (1 − yi ) log 1 − σ(~x>
 
= argmax i w) i w)
~
w∈R
~ d
i=1

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 112
EECS 127/227AT Course Reader
5.5. Problem Transformations and Reparameterizations 2024-04-27 21:08:09-07:00

( n
)
(5.191)
X
σ(~x> σ(~x>
 
= argmin − yi log i w)
~ + (1 − yi ) log 1 − i w)
~
w∈R
~ d
i=1
n
(5.192)
X
−yi log σ(~x> ~ − (1 − yi ) log 1 − σ(~x>
 
= argmin i w) i w)
~ .
w∈R
~ d
i=1

In the penultimate line we used another one of the equalities in the proposition with the monotonically decreasing
function ψ(x) = −x. Thus, logistic regression reduces to minimizing the objective function
n
(5.193)
X
−yi log σ(~x> ~ − (1 − yi ) log 1 − σ(~x>
 
f0 (w)
~ = i w) i w)
~ .
i=1

This is the so-called cross-entropy loss.


Computing the gradient and Hessian of this function is an exercise, but the result is
n
(5.194)
X
∇f0 (w)
~ = ~xi (σ(~x> ~ − yi )
i w)
i=1
n
(5.195)
X
∇2 f0 (w)
~ = σ(~x>
i w)(1
~ − σ(~x> ~ xi ~x>
i w))~ i .
i=1

Because σ(x) ∈ (0, 1), the above Hessian is a non-negative weighted sum of positive semidefinite matrices ~xi ~x>
i and is
thus positive semidefinite. By the second order conditions, f0 is convex! Thus we have turned an extremely non-convex
problem into an unconstrained convex minimization problem just by a neat application of monotone functions. We can
efficiently solve this problem algorithmically using convex optimization solvers such as gradient descent.
Actually, we can do better than gradient descent for this particular example! If we define
     
~x>
1 y1 σ(~x>1 w)
~
 .  . ..
.  n×d . n  ∈ Rn (5.196)
 
X=  . ∈R ,  . ∈R ,
~y =  p~(w)
~ = . 
> >
~xn yn σ(~xn w)
~

then the gradient looks like


~ = X > (~
∇f0 (w) ~ − ~y ).
p(w) (5.197)

Notice the similarity to the gradient of least squares, X > (X w


~ − ~y ). In fact, we can exploit this similarity to obtain a
specialized optimization algorithm called iteratively reweighted least squares, which can efficiently solve for maximum
likelihood estimates within this type of statistical model (i.e., a generalized linear model).

5.5.2 Slack Variables


The purpose of slack variables is to turn equality constraints into inequality constraints and vice-versa.

Definition 140 (Slack Variables)


Consider a problem of the form

min f0 (~x) (5.198)


x∈Rn
~

s.t. fi (~x) ≤ 0, ∀i ∈ {1, . . . , m}


hj (~x) = 0, ∀j ∈ {1, . . . , p}.

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 113
EECS 127/227AT Course Reader
5.5. Problem Transformations and Reparameterizations 2024-04-27 21:08:09-07:00

Let S ⊆ {1, . . . , m} be any subset. The above problem is equivalent to the following problem:

min f0 (~x) (5.199)


x∈Rn
~
s∈RS
~ +

s.t. fi (~x) + si = 0, ∀i ∈ S
fi (~x) ≤ 0, ∀i ∈ {1, . . . , m} \ S
hj (~x) = 0, ∀j ∈ {1, . . . , p}.

Here the notation RS+ = {(xi )i∈S | xi ≥ 0 ∀i ∈ S}, and ~s is called a slack variable.

One can choose to create slack variables si for only a subset of the inequality constraints, or all of them. When we
work with more advanced optimization algorithms later, sometimes this parameterization is crucial (e.g. for equality-
constrained Newton’s method).
Example 141. If we have a problem of the form

min f0 (~x) (5.200)


x∈Rn
~

s.t. 3x21 + 4x22 ≤ 0 (5.201)


2x21 + 5x22 ≤0 (5.202)

but our solver could only handle equality constraints, then it would be equivalent to solve the problem

min f0 (~x) (5.203)


x∈R2
~
s∈R2+
~

s.t. 3x21 + 4x22 + s1 = 0 (5.204)


2x21 + 5x22 + s2 = 0 (5.205)

and, upon solving this problem and obtaining (~x? , ~s? ), the solution to the original problem would be this same ~x? .

5.5.3 Epigraph Reformulation


The epigraph reformulation is a way to convert between non-linearity in the objective and a constraint.

Definition 142
Consider a problem of the form

min f0 (~x) (5.206)


x∈Rn
~

s.t. fi (~x) ≤ 0, ∀i ∈ {1, . . . , m}


hj (~x) = 0, ∀j ∈ {1, . . . , p}.

Its epigraph reformulation is the problem

min t (5.207)
t∈R
x∈Rn
~

s.t. t ≥ f0 (~x)
fi (~x) ≤ 0, ∀i ∈ {1, . . . , m}

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 114
EECS 127/227AT Course Reader
5.5. Problem Transformations and Reparameterizations 2024-04-27 21:08:09-07:00

hj (~x) = 0, ∀j ∈ {1, . . . , p}.

The epigraph objective is always a linear and differentiable function of the decision variables (t, ~x). However, the
constraint can become complicated if f0 (~x) is non-linear. This transformation is especially useful in the case of
quadratically-constrained quadratic programs (QCQP).

Example 143 (Elastic-Net Regularization). This example uses the two previously-discussed techniques in tandem to figure out how to handle a regularizer with both smooth and non-smooth components (i.e., $\ell_1$ and $\ell_2$ norms).
Let $A \in \mathbb{R}^{m \times n}$, and $\vec{y} \in \mathbb{R}^m$. Suppose that we have a problem of the form
$$\min_{\vec{x} \in \mathbb{R}^n} \left\{ \|A\vec{x} - \vec{y}\|_2^2 + \alpha \|\vec{x}\|_2^2 + \beta \|\vec{x}\|_1 \right\}. \tag{5.208}$$
The regularizer $\alpha \|\vec{x}\|_2^2 + \beta \|\vec{x}\|_1$ is called the elastic net regularizer and encourages "sparse" and small $\vec{x}$; this regularizer has some use in the analysis of high-dimensional and structured data.
Suppose that our solver can only handle differentiable objectives, but is able to handle constraints so long as they are also differentiable. Then we cannot solve the problem outright using our solver, so we need to reformulate it. We can first start by using a modification of the epigraph reformulation:
$$\min_{t \in \mathbb{R}, \ \vec{x} \in \mathbb{R}^n} \|A\vec{x} - \vec{y}\|_2^2 + \alpha \|\vec{x}\|_2^2 + \beta t \tag{5.209}$$
$$\text{s.t.} \quad t \geq \|\vec{x}\|_1. \tag{5.210}$$
Now the constraint is non-differentiable, so we are no longer able to exactly solve this constrained problem. However, the objective is now convex and differentiable. Let us rewrite this problem using the $|x_i|$:
$$\min_{t \in \mathbb{R}, \ \vec{x} \in \mathbb{R}^n} \|A\vec{x} - \vec{y}\|_2^2 + \alpha \|\vec{x}\|_2^2 + \beta t \tag{5.211}$$
$$\text{s.t.} \quad t \geq \sum_{i=1}^n |x_i|. \tag{5.212}$$
The main insight that goes into resolving the non-differentiability of this constraint is that $|x_i| = \max\{x_i, -x_i\}$ and in particular $s_i \geq |x_i|$ if and only if $s_i \geq x_i$ and $s_i \geq -x_i$. Thus we add more "slack-type" variables and obtain
$$\min_{t \in \mathbb{R}, \ \vec{x} \in \mathbb{R}^n, \ \vec{s} \in \mathbb{R}^n_+} \|A\vec{x} - \vec{y}\|_2^2 + \alpha \|\vec{x}\|_2^2 + \beta t \tag{5.213}$$
$$\text{s.t.} \quad t \geq \sum_{i=1}^n s_i \tag{5.214}$$
$$s_i \geq x_i, \quad \forall i \in \{1, \dots, n\} \tag{5.215}$$
$$s_i \geq -x_i, \quad \forall i \in \{1, \dots, n\}. \tag{5.216}$$
Now the constraints are all differentiable (in fact, affine) and the problem may be solved. This problem is exactly equivalent to the above problem because $t$ is being minimized in the objective, so each $s_i$ is being minimized by way of the first constraint, meaning that $s_i = |x_i|$ at the optimum, and this is equivalent to the original elastic-net regression problem.
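For readers who want to experiment, here is a minimal sketch of the reformulated problem using the CVXPY modeling library; CVXPY is our own assumed tool (the reader does not prescribe a solver), and the variable names are illustrative.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
alpha, beta = 0.1, 0.5

x = cp.Variable(n)
t = cp.Variable()
s = cp.Variable(n, nonneg=True)

# Epigraph + slack reformulation (5.213)-(5.216): differentiable objective, affine constraints.
objective = cp.Minimize(cp.sum_squares(A @ x - y) + alpha * cp.sum_squares(x) + beta * t)
constraints = [t >= cp.sum(s), s >= x, s >= -x]
cp.Problem(objective, constraints).solve()

# Sanity check: at the optimum, s should equal |x| elementwise.
print(np.allclose(s.value, np.abs(x.value), atol=1e-6))
```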

Chapter 6

Gradient Descent

Relevant sections of the textbooks:

• [2] Section 12.2.

6.1 Strong Convexity and Smoothness


In this section, we will introduce two properties of functions that will become useful in analyzing optimization algo-
rithms. The first property is a notion of convexity of functions that is stronger than the ones previously introduced.

Definition 144 (µ-Strongly Convex Function)
Let $\mu \geq 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. We say that $f$ is $\mu$-strongly convex if the domain of $f$, say $\Omega$, is convex, and, for any $\vec{x}, \vec{y} \in \Omega$, we have
$$f(\vec{y}) \geq f(\vec{x}) + [\nabla f(\vec{x})]^\top (\vec{y} - \vec{x}) + \frac{\mu}{2} \|\vec{y} - \vec{x}\|_2^2. \tag{6.1}$$
A function $f : \mathbb{R}^n \to \mathbb{R}$ is $\mu$-strongly concave if $-f$ is $\mu$-strongly convex.

Recall from Theorem 123 that the first order condition for (usual) convexity requires the function to be bigger than its linear (first order) Taylor approximation centered at any point. $\mu$-strong convexity imposes a stronger requirement on the function: it needs to be bigger than its linear approximation plus a non-negative quadratic term that has a Hessian matrix $\mu I$. This becomes more obvious if we write Equation (6.1) in the equivalent form
$$f(\vec{y}) \geq \underbrace{f(\vec{x}) + [\nabla f(\vec{x})]^\top (\vec{y} - \vec{x})}_{\text{first-order Taylor approximation}} + \underbrace{(\vec{y} - \vec{x})^\top \left(\frac{\mu}{2} I\right) (\vec{y} - \vec{x})}_{\text{non-negative quadratic term}}. \tag{6.2}$$

Below, we visualize the µ-strong convexity property and compare it to the first order condition for convexity.


[Figure: the graph of $f$, with the points $(\vec{x}, f(\vec{x}))$ and $(\vec{y}, f(\vec{y}))$ marked, lies above the quadratic lower bound $\hat{f}_1(\cdot\,; \vec{x}) + \frac{\mu}{2}\|\cdot - \vec{x}\|_2^2$, which in turn lies above the first-order Taylor approximation $\hat{f}_1(\cdot\,; \vec{x})$.]

Therefore, µ-strong convexity of the function is a very important feature. It guarantees that the function will always
have enough curvature (at least as much as its quadratic lower bound) and thus will never become too flat anywhere on
its domain.
If the function f is twice-differentiable we can formalize this notion by giving the following equivalent condition
for µ-strong convexity.

Theorem 145 (Second Order Condition for µ-Strong Convexity)


Let Ω ⊆ Rn be a convex set and let f : Ω → R be a twice-differentiable function. Then f is µ-strongly convex if
and only if for all ~x ∈ Ω we have
$$\nabla^2 f(\vec{x}) - \mu I \succeq 0, \tag{6.3}$$

i.e., ∇2 f (~x) − µI is PSD for each ~x ∈ Ω.

An important property of µ-strongly convex functions is that they are strictly convex and thus they have at most one
minimizer. In fact, one can show that they have exactly one minimizer.

Theorem 146 (Strongly Convex Functions have Unique Global Minimizers)


Let µ > 0. Let Ω ⊆ Rn be a convex set and let f : Ω → R be a µ-strongly convex function. Then f has exactly
one global minimizer.

The second property we want to introduce is L-smoothness, which describes quadratic upper bounds on the function
f.

Definition 147 (L-Smooth Function)
Let $L \geq 0$. Let $\Omega \subseteq \mathbb{R}^n$ be a set. Let $f : \Omega \to \mathbb{R}$ be a differentiable function. We say that $f$ is $L$-smooth if, for any $\vec{x}, \vec{y} \in \Omega$, we have
$$f(\vec{y}) \leq f(\vec{x}) + [\nabla f(\vec{x})]^\top (\vec{y} - \vec{x}) + \frac{L}{2} \|\vec{y} - \vec{x}\|_2^2. \tag{6.4}$$

If $f$ is $L$-smooth, then the function $f$ is upper bounded by its first order Taylor approximation plus a non-negative quadratic term:
$$f(\vec{y}) \leq \underbrace{f(\vec{x}) + [\nabla f(\vec{x})]^\top (\vec{y} - \vec{x})}_{\text{first-order Taylor approximation}} + \underbrace{(\vec{y} - \vec{x})^\top \left(\frac{L}{2} I\right) (\vec{y} - \vec{x})}_{\text{non-negative quadratic term}}. \tag{6.5}$$

We can visualize the L-smoothness condition similarly to the strongly convex condition, as below:

[Figure: the graph of $f$ lies below the quadratic upper bound $\hat{f}_1(\cdot\,; \vec{x}) + \frac{L}{2}\|\cdot - \vec{x}\|_2^2$ and above the first-order Taylor approximation $\hat{f}_1(\cdot\,; \vec{x})$.]

L-smoothness provides a quadratic upper bound on f . This upper bound ensures that the function doesn’t have too
much curvature anywhere on its domain (at most as much as its upper bound). We will see later in this chapter that this
actually translates into an upper bound on the rate at which the gradient of the function changes.
Finally, we visualize the behavior of the µ-strongly convex and L-smooth bounds together.


[Figure: the graph of $f$ is sandwiched between the strong convexity lower bound $\hat{f}_1(\cdot\,; \vec{x}) + \frac{\mu}{2}\|\cdot - \vec{x}\|_2^2$ and the smoothness upper bound $\hat{f}_1(\cdot\,; \vec{x}) + \frac{L}{2}\|\cdot - \vec{x}\|_2^2$.]
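As a concrete illustration of these two constants, consider the least squares objective $f(\vec{x}) = \|A\vec{x} - \vec{y}\|_2^2$, whose Hessian is $2A^\top A$; by the second order conditions, $f$ is $\mu$-strongly convex and $L$-smooth with $\mu = 2\sigma_{\min}(A)^2$ and $L = 2\sigma_{\max}(A)^2$ when $A$ has full column rank. Below is a minimal NumPy sketch computing these constants (the variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))

# The Hessian of f(x) = ||Ax - y||_2^2 is 2 A^T A, so mu and L are its extreme
# eigenvalues, i.e., twice the squared extreme singular values of A.
sigmas = np.linalg.svd(A, compute_uv=False)
mu = 2 * sigmas.min() ** 2
L = 2 * sigmas.max() ** 2
print(mu, L)
```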

6.2 Gradient Descent


We now have all the tools we need to introduce, understand, and analyze gradient descent, one of the most ubiquitous optimization algorithms. The algorithm is used to numerically solve differentiable and unconstrained optimization problems.
More formally, let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. We attempt to solve the problem:
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}). \tag{6.6}$$
The general idea behind the gradient descent algorithm is that it starts with some initial guess $\vec{x}_0 \in \mathbb{R}^n$ and produces a sequence of refined guesses $\vec{x}_1, \vec{x}_2, \dots$, called iterates. In each iteration $t = 0, 1, 2, \dots$, the algorithm updates its guess according to the following rule:
$$\vec{x}_{t+1} = \vec{x}_t + \eta \vec{v}_t \tag{6.7}$$
for some $\vec{v}_t \in \mathbb{R}^n$ and $\eta \geq 0$.


There are two quantities that we need to specify to complete the definition of the algorithm:

• The vector ~vt , or the search direction, which specifies a good direction to move.

• The scalar η, or the step size, which specifies how far we move in the direction of ~vt .

For the gradient descent algorithm, we assume that at every point ~x ∈ Rn we can get two pieces of information about
the function we are optimizing: the value of the function f (~x) ∈ R as well as its gradient ∇f (~x) ∈ Rn . Next, we
will use this available information to come up with a good search direction ~vt . The choice of the step size η is a more
difficult task. There is no universal choice of η that is good for all problems and a good choice of η is problem-specific.
We will discuss the choice of the step size later in the section and show the important role it plays in the algorithm.


6.2.1 Search Direction


To choose ~vt , we want to ensure that f (~xt + η~vt ) ≤ f (~xt ) for some small η. This motivates taking ~vt as the direction
of steepest descent, i.e., the direction in which f decays the fastest. The degree (i.e., the steepness) of the rate of change
can be characterized by the directional derivative Df (~x)[~v ] = ~v > [∇f (~x)].

Theorem 148 (Negative Gradient is Direction of Steepest Descent)
Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function, and let $\vec{x} \in \mathbb{R}^n$. Then
$$-\frac{\nabla f(\vec{x})}{\|\nabla f(\vec{x})\|_2} \in \operatorname*{argmin}_{\vec{v} \in \mathbb{R}^n, \ \|\vec{v}\|_2 = 1} Df(\vec{x})[\vec{v}]. \tag{6.8}$$

Proof. Left as exercise.

We also want to use the norm of the gradient in our update. For example, if $f(x) = x^2$ then $f'(x) = 2x$, which has large magnitude when $x$ is far from the optimal point $x^\star = 0$. In this way, if the gradient is large then we are usually far away from an optimum, and we want our update $\eta \vec{v}_t$ to be large. This motivates choosing $\vec{v}_t = -\nabla f(\vec{x}_t)$.

6.2.2 Defining Gradient Descent


This choice of ~vt gives the famous gradient descent iteration scheme:

~xt+1 = ~xt − η∇f (~xt ), ∀t = 0, 1, 2, . . . . (6.9)

We can formalize this in an algorithm which terminates after T iterations for a user-set T .

Algorithm 3 Gradient Descent.


1: function GradientDescent(f, ~x0 , η, T )
2: for t = 0, 1, . . . , T − 1 do
3: ~xt+1 ← ~xt − η∇f (~xt )
4: end for
5: return ~xT
6: end function
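As a sanity check, here is a direct NumPy translation of Algorithm 3; it is a minimal sketch, and the function names and test objective are our own:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, T):
    # Direct translation of Algorithm 3: x_{t+1} = x_t - eta * grad_f(x_t).
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimizer is 0.
x_T = gradient_descent(lambda x: 2 * x, x0=[5.0, -3.0], eta=0.1, T=100)
print(x_T)  # should be very close to [0, 0]
```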

It is important to note that with the choice $\vec{v}_t = -\nabla f(\vec{x}_t)$, descent is not guaranteed. Namely, for a fixed $\eta > 0$, it is not true in general that
$$f(\vec{x}_{t+1}) = f(\vec{x}_t - \eta \nabla f(\vec{x}_t)) \leq f(\vec{x}_t). \tag{6.10}$$
Rather, from the proof of the above theorem, we only have that, for any given $t$ there exists $\eta_t > 0$ such that
$$f(\vec{x}_t - \eta_t \nabla f(\vec{x}_t)) \leq f(\vec{x}_t). \tag{6.11}$$

This ηt , in general, is very small, and depends heavily on t and the local geometry of f around ~xt . None of these details
are known by any realistic implementation of the gradient descent algorithm. In what follows, we will study gradient
descent with a constant step size; this setting is most common in practice.


6.2.3 Convergence Analysis of Gradient Descent


When applying the gradient descent algorithm to a problem, we need to make sure that we are making progress and that we eventually converge to the optimal solution. This question is closely related to the choice of the step size. For the remainder of the section we will show that, for certain classes of functions $f : \mathbb{R}^n \to \mathbb{R}$ and certain chosen step sizes $\eta > 0$, the gradient descent algorithm is guaranteed to make progress in every iteration and converge to the optimal solution. We will first explore these questions through a familiar example.

Example 149 (Gradient Descent for Least Squares). In this example we explore the convergence properties of the gradient descent algorithm by applying it to the least squares problem. Let $A \in \mathbb{R}^{m \times n}$ have full column rank, and let $\vec{y} \in \mathbb{R}^m$. Consider
$$\min_{\vec{x} \in \mathbb{R}^n} \|A\vec{x} - \vec{y}\|_2^2. \tag{6.12}$$
Recall that, in this setting, the least squares problem has the unique closed form solution
$$\vec{x}^\star = (A^\top A)^{-1} A^\top \vec{y}. \tag{6.13}$$
We will use our knowledge of the true solution to analyze the convergence of the gradient descent algorithm. To apply gradient descent to this problem, let us first compute the gradient:
$$\nabla f(\vec{x}) = 2A^\top (A\vec{x} - \vec{y}). \tag{6.14}$$
Now let $\vec{x}_0 \in \mathbb{R}^n$ be the initial guess. For $t$ a non-negative integer, we can write the gradient descent step:
$$\vec{x}_{t+1} = \vec{x}_t - \eta \nabla f(\vec{x}_t) \tag{6.15}$$
$$= \vec{x}_t - 2\eta A^\top (A\vec{x}_t - \vec{y}) \tag{6.16}$$
$$= (I - 2\eta A^\top A)\vec{x}_t + 2\eta A^\top \vec{y}. \tag{6.17}$$

We aim to set $\eta$ to achieve the following two desired properties for the gradient descent iterates $\vec{x}_t$:

• We make progress, i.e., in every iteration we get closer to the optimal solution:
$$\|\vec{x}_{t+1} - \vec{x}^\star\|_2 \leq \|\vec{x}_t - \vec{x}^\star\|_2. \tag{6.18}$$

• We eventually converge to the optimal solution:
$$\lim_{t \to \infty} \vec{x}_t = \vec{x}^\star \iff \lim_{t \to \infty} \|\vec{x}_t - \vec{x}^\star\|_2 = 0. \tag{6.19}$$

To study this, let us write out the relationship between $\vec{x}_{t+1} - \vec{x}^\star$ and $\vec{x}_t - \vec{x}^\star$. We do this by subtracting $\vec{x}^\star$ from both sides of Equation (6.17) and doing some algebraic manipulations:
$$\vec{x}_{t+1} - \vec{x}^\star = (I - 2\eta A^\top A)\vec{x}_t + 2\eta A^\top \vec{y} - \vec{x}^\star \tag{6.20}$$
$$= (I - 2\eta A^\top A)\vec{x}_t + 2\eta A^\top \vec{y} - (A^\top A)^{-1} A^\top \vec{y} \tag{6.21}$$
$$= (I - 2\eta A^\top A)\vec{x}_t + 2\eta (A^\top A)(A^\top A)^{-1} A^\top \vec{y} - (A^\top A)^{-1} A^\top \vec{y} \tag{6.22}$$
$$= (I - 2\eta A^\top A)\vec{x}_t + 2\eta A^\top A \vec{x}^\star - \vec{x}^\star \tag{6.23}$$
$$= (I - 2\eta A^\top A)\vec{x}_t - (I - 2\eta A^\top A)\vec{x}^\star \tag{6.24}$$
$$= (I - 2\eta A^\top A)(\vec{x}_t - \vec{x}^\star). \tag{6.25}$$
Here we used the trick of introducing $I = (A^\top A)(A^\top A)^{-1}$; this is because we wanted to group terms with $A^\top A$, and we also wanted to introduce instances of $\vec{x}^\star = (A^\top A)^{-1} A^\top \vec{y}$. Thus, while this was a trick, it was a motivated trick. In the end we get the following relationship:
$$\vec{x}_{t+1} - \vec{x}^\star = (I - 2\eta A^\top A)(\vec{x}_t - \vec{x}^\star). \tag{6.26}$$

If we take the norm of both sides, we have
$$\|\vec{x}_{t+1} - \vec{x}^\star\|_2 = \left\|(I - 2\eta A^\top A)(\vec{x}_t - \vec{x}^\star)\right\|_2 \tag{6.27}$$
$$\leq \left\|I - 2\eta A^\top A\right\|_2 \|\vec{x}_t - \vec{x}^\star\|_2 \tag{6.28}$$
$$= \sigma_{\max}\{I - 2\eta A^\top A\} \|\vec{x}_t - \vec{x}^\star\|_2. \tag{6.29}$$
Thus, if $\sigma_{\max}\{I - 2\eta A^\top A\} < 1$, then we have
$$\|\vec{x}_{t+1} - \vec{x}^\star\|_2 \leq \sigma_{\max}\{I - 2\eta A^\top A\} \|\vec{x}_t - \vec{x}^\star\|_2 < \|\vec{x}_t - \vec{x}^\star\|_2. \tag{6.30}$$

Thus we are guaranteed to make progress at each step. For the convergence guarantee, we recursively apply Equation (6.29) to get
$$\|\vec{x}_{t+1} - \vec{x}^\star\|_2 \leq \sigma_{\max}\{I - 2\eta A^\top A\} \|\vec{x}_t - \vec{x}^\star\|_2 \tag{6.31}$$
$$\leq \sigma_{\max}\{I - 2\eta A^\top A\}^2 \|\vec{x}_{t-1} - \vec{x}^\star\|_2 \tag{6.32}$$
$$\leq \cdots \tag{6.33}$$
$$\leq \sigma_{\max}\{I - 2\eta A^\top A\}^{t+1} \|\vec{x}_0 - \vec{x}^\star\|_2. \tag{6.34}$$
Thus taking the limit we get
$$\lim_{t \to \infty} \|\vec{x}_t - \vec{x}^\star\|_2 \leq \lim_{t \to \infty} \sigma_{\max}\{I - 2\eta A^\top A\}^t \|\vec{x}_0 - \vec{x}^\star\|_2 \tag{6.35}$$
$$= \|\vec{x}_0 - \vec{x}^\star\|_2 \cdot \underbrace{\lim_{t \to \infty} \sigma_{\max}\{I - 2\eta A^\top A\}^t}_{=0} \tag{6.36}$$
$$= 0. \tag{6.37}$$
Thus our convergence guarantee holds as well, so long as $\sigma_{\max}\{I - 2\eta A^\top A\} < 1$.
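The condition $\sigma_{\max}\{I - 2\eta A^\top A\} < 1$ can be checked numerically. Since the eigenvalues of $I - 2\eta A^\top A$ are $1 - 2\eta \sigma_i(A)^2$, any $0 < \eta < 1/\sigma_{\max}(A)^2$ works when $A$ has full column rank. Here is a minimal NumPy sketch of this step size choice (the data and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

sigma_max = np.linalg.svd(A, compute_uv=False).max()
eta = 0.9 / sigma_max**2  # any eta in (0, 1/sigma_max^2) gives a contraction

x_star = np.linalg.solve(A.T @ A, A.T @ y)  # closed form least squares solution
x = np.zeros(5)
for _ in range(500):
    x = x - eta * 2 * A.T @ (A @ x - y)     # gradient descent step (6.16)
print(np.linalg.norm(x - x_star))            # should be tiny
```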

Let us now generalize this analysis to a larger class of functions that includes least squares, as well as more general
functions. We will focus our attention on the class of L-smooth and µ-strongly convex functions. Similar to what we
did for least squares, we will use the optimal solution ~x? in our analysis. For a general function f , the quantity ~x? is
almost always not known. But in the case of L-smooth and µ-strongly convex functions, a unique global minimizer
exists. We just need the fact that it exists to show that for some small enough choice of the step size η, the gradient
descent algorithm converges to this optimal solution. Before we formally state and prove this, we want to introduce
one property of L-smooth functions that will become useful in our following proof.

Lemma 150 (Gradient Bound for L-Smooth Functions)
Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $L$-smooth function. For all $\vec{x} \in \mathbb{R}^n$, we have
$$\|\nabla f(\vec{x})\|_2^2 \leq 2L \left( f(\vec{x}) - \min_{\vec{x}' \in \mathbb{R}^n} f(\vec{x}') \right). \tag{6.38}$$


This lemma says that the magnitude of the gradient gets smaller as we get closer to the optimal solution.

Proof. Recall the definition of $L$-smooth functions:
$$f(\vec{y}) \leq f(\vec{x}) + [\nabla f(\vec{x})]^\top (\vec{y} - \vec{x}) + \frac{L}{2} \|\vec{y} - \vec{x}\|_2^2. \tag{6.39}$$
This is true for all points $\vec{x}, \vec{y} \in \mathbb{R}^n$, so let us fix $\vec{x}$ and set $\vec{y} = \vec{x} - \frac{\nabla f(\vec{x})}{L}$. We get
$$f\left(\vec{x} - \frac{\nabla f(\vec{x})}{L}\right) \leq f(\vec{x}) + [\nabla f(\vec{x})]^\top \left(-\frac{\nabla f(\vec{x})}{L}\right) + \frac{L}{2} \left\|-\frac{\nabla f(\vec{x})}{L}\right\|_2^2 \tag{6.40}$$
$$= f(\vec{x}) - \frac{1}{L} \|\nabla f(\vec{x})\|_2^2 + \frac{1}{2L} \|\nabla f(\vec{x})\|_2^2 \tag{6.41}$$
$$= f(\vec{x}) - \frac{1}{2L} \|\nabla f(\vec{x})\|_2^2. \tag{6.42}$$
Now we can use the fact that $\min_{\vec{x}'} f(\vec{x}') \leq f(\vec{z})$ for all $\vec{z} \in \mathbb{R}^n$ to get
$$\min_{\vec{x}' \in \mathbb{R}^n} f(\vec{x}') \leq f\left(\vec{x} - \frac{\nabla f(\vec{x})}{L}\right) \tag{6.43}$$
$$\leq f(\vec{x}) - \frac{1}{2L} \|\nabla f(\vec{x})\|_2^2. \tag{6.44}$$
Rearranging, we have
$$\|\nabla f(\vec{x})\|_2^2 \leq 2L \left( f(\vec{x}) - \min_{\vec{x}' \in \mathbb{R}^n} f(\vec{x}') \right) \tag{6.45}$$
as desired.

Now we have all the needed tools to prove the following property of gradient descent: for any µ-strongly convex
and L-smooth function, the gradient descent algorithm converges exponentially fast to the true solution ~x? .

Theorem 151 (Convergence of Gradient Descent for Smooth Strongly Convex Functions)
Let $\mu, L > 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $L$-smooth, $\mu$-strongly convex function. Consider the following optimization problem:
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}) \tag{6.46}$$
which has optimal solution $\vec{x}^\star$. Then the constant step size $\eta = \frac{1}{L}$ is such that, if applying the gradient descent algorithm generates the sequence of points
$$\vec{x}_{t+1} = \vec{x}_t - \eta \nabla f(\vec{x}_t), \tag{6.47}$$
then for all initializations $\vec{x}_0$ and non-negative integers $t$,
$$\|\vec{x}_{t+1} - \vec{x}^\star\|_2^2 \leq \left(1 - \frac{\mu}{L}\right) \|\vec{x}_t - \vec{x}^\star\|_2^2. \tag{6.48}$$

Proof. We want to upper bound $\|\vec{x}_{t+1} - \vec{x}^\star\|_2^2$, so let us write it out as
$$\|\vec{x}_{t+1} - \vec{x}^\star\|_2^2 = \|\vec{x}_t - \eta \nabla f(\vec{x}_t) - \vec{x}^\star\|_2^2 \tag{6.49}$$
$$= \|[\vec{x}_t - \vec{x}^\star] - \eta \nabla f(\vec{x}_t)\|_2^2 \tag{6.50}$$
$$= \|\vec{x}_t - \vec{x}^\star\|_2^2 - 2\eta [\nabla f(\vec{x}_t)]^\top (\vec{x}_t - \vec{x}^\star) + \eta^2 \|\nabla f(\vec{x}_t)\|_2^2 \tag{6.51}$$
$$= \|\vec{x}_t - \vec{x}^\star\|_2^2 + 2\eta [\nabla f(\vec{x}_t)]^\top (\vec{x}^\star - \vec{x}_t) + \eta^2 \|\nabla f(\vec{x}_t)\|_2^2. \tag{6.52}$$
Because $f$ is $L$-smooth, the lemma gives
$$\|\nabla f(\vec{x}_t)\|_2^2 \leq 2L(f(\vec{x}_t) - f(\vec{x}^\star)). \tag{6.53}$$
Because $f$ is $\mu$-strongly convex, we can write
$$f(\vec{x}^\star) \geq f(\vec{x}_t) + [\nabla f(\vec{x}_t)]^\top (\vec{x}^\star - \vec{x}_t) + \frac{\mu}{2} \|\vec{x}^\star - \vec{x}_t\|_2^2 \tag{6.54}$$
$$\implies [\nabla f(\vec{x}_t)]^\top (\vec{x}_t - \vec{x}^\star) \geq f(\vec{x}_t) - f(\vec{x}^\star) + \frac{\mu}{2} \|\vec{x}_t - \vec{x}^\star\|_2^2 \tag{6.55}$$
$$\implies [\nabla f(\vec{x}_t)]^\top (\vec{x}^\star - \vec{x}_t) \leq f(\vec{x}^\star) - f(\vec{x}_t) - \frac{\mu}{2} \|\vec{x}_t - \vec{x}^\star\|_2^2. \tag{6.56}$$
Plugging these two estimates in, we get
$$\|\vec{x}_{t+1} - \vec{x}^\star\|_2^2 = \|\vec{x}_t - \vec{x}^\star\|_2^2 + 2\eta [\nabla f(\vec{x}_t)]^\top (\vec{x}^\star - \vec{x}_t) + \eta^2 \|\nabla f(\vec{x}_t)\|_2^2 \tag{6.57}$$
$$\leq \|\vec{x}_t - \vec{x}^\star\|_2^2 + 2\eta \left[ f(\vec{x}^\star) - f(\vec{x}_t) - \frac{\mu}{2} \|\vec{x}_t - \vec{x}^\star\|_2^2 \right] + \eta^2 \left[ 2L(f(\vec{x}_t) - f(\vec{x}^\star)) \right] \tag{6.58}$$
$$= (1 - \eta\mu) \|\vec{x}_t - \vec{x}^\star\|_2^2 + 2\eta(\eta L - 1)(f(\vec{x}_t) - f(\vec{x}^\star)). \tag{6.59}$$
So as to make the second term 0, we choose $\eta = \frac{1}{L}$. This gives
$$\|\vec{x}_{t+1} - \vec{x}^\star\|_2^2 \leq \left(1 - \frac{\mu}{L}\right) \|\vec{x}_t - \vec{x}^\star\|_2^2. \tag{6.60}$$

Corollary 152. Consider the setting of Theorem 151. We have that $0 \leq 1 - \frac{\mu}{L} < 1$, and consequently:

(a) (Descent at every step.) $\|\vec{x}_{t+1} - \vec{x}^\star\|_2 \leq \|\vec{x}_t - \vec{x}^\star\|_2$ for all non-negative integers $t$ and initializations $\vec{x}_0$.

(b) (Convergence.) $\lim_{t \to \infty} \vec{x}_t = \vec{x}^\star$.

Proof. We need to show that $1 - \frac{\mu}{L} \in [0, 1)$, or in other words that $0 < \frac{\mu}{L} \leq 1$. By assumption $\mu > 0$ so $\frac{\mu}{L} > 0$. The upper bound is given by the fact that $f$ is $\mu$-strongly convex and $L$-smooth, so
$$f(\vec{x}) + [\nabla f(\vec{x})]^\top (\vec{y} - \vec{x}) + \frac{\mu}{2} \|\vec{y} - \vec{x}\|_2^2 \leq f(\vec{y}) \leq f(\vec{x}) + [\nabla f(\vec{x})]^\top (\vec{y} - \vec{x}) + \frac{L}{2} \|\vec{y} - \vec{x}\|_2^2 \tag{6.61}$$
which implies that
$$f(\vec{x}) + [\nabla f(\vec{x})]^\top (\vec{y} - \vec{x}) + \frac{\mu}{2} \|\vec{y} - \vec{x}\|_2^2 \leq f(\vec{x}) + [\nabla f(\vec{x})]^\top (\vec{y} - \vec{x}) + \frac{L}{2} \|\vec{y} - \vec{x}\|_2^2 \tag{6.62}$$
which implies that
$$\frac{\mu}{2} \|\vec{y} - \vec{x}\|_2^2 \leq \frac{L}{2} \|\vec{y} - \vec{x}\|_2^2 \tag{6.63}$$
which finally implies that $\mu \leq L$, i.e., $0 < \frac{\mu}{L} \leq 1$. This immediately yields the first claim since $1 - \frac{\mu}{L} \in [0, 1)$ so $\sqrt{1 - \frac{\mu}{L}} \in [0, 1)$, therefore
$$\|\vec{x}_{t+1} - \vec{x}^\star\|_2 \leq \sqrt{1 - \frac{\mu}{L}} \|\vec{x}_t - \vec{x}^\star\|_2 < \|\vec{x}_t - \vec{x}^\star\|_2. \tag{6.64}$$
The second claim follows since
$$\|\vec{x}_t - \vec{x}^\star\|_2^2 \leq \left(1 - \frac{\mu}{L}\right) \|\vec{x}_{t-1} - \vec{x}^\star\|_2^2 \tag{6.65}$$
$$\leq \left(1 - \frac{\mu}{L}\right)^2 \|\vec{x}_{t-2} - \vec{x}^\star\|_2^2 \tag{6.66}$$
$$\leq \cdots \tag{6.67}$$
$$\leq \left(1 - \frac{\mu}{L}\right)^t \|\vec{x}_0 - \vec{x}^\star\|_2^2 \tag{6.68}$$
and so
$$\lim_{t \to \infty} \|\vec{x}_t - \vec{x}^\star\|_2^2 \leq \lim_{t \to \infty} \left(1 - \frac{\mu}{L}\right)^t \|\vec{x}_0 - \vec{x}^\star\|_2^2 \tag{6.69}$$
$$= \|\vec{x}_0 - \vec{x}^\star\|_2^2 \cdot \underbrace{\lim_{t \to \infty} \left(1 - \frac{\mu}{L}\right)^t}_{=0} \tag{6.70}$$
$$= 0. \tag{6.71}$$

In this case the quantity $c = \sqrt{1 - \frac{\mu}{L}}$ is important. If we wanted to run gradient descent for $T$ iterations to within accuracy $\epsilon$, we would have
$$\|\vec{x}_T - \vec{x}^\star\|_2 \leq c^T \|\vec{x}_0 - \vec{x}^\star\|_2. \tag{6.72}$$
If we set $D := \|\vec{x}_0 - \vec{x}^\star\|_2$, then we can ensure that $\|\vec{x}_T - \vec{x}^\star\|_2 \leq \epsilon$ by bounding the right-hand side $c^T \|\vec{x}_0 - \vec{x}^\star\|_2 = c^T D$ by $\epsilon$, getting
$$c^T D \leq \epsilon \tag{6.73}$$
$$\implies c^T \leq \frac{\epsilon}{D} \tag{6.74}$$
$$\implies T \log(c) \leq \log(\epsilon) - \log(D) \tag{6.75}$$
$$\implies T \geq \frac{\log(\epsilon) - \log(D)}{\log(c)} \tag{6.76}$$
$$= \frac{\log(1/\epsilon) + \log(D)}{\log(1/c)}. \tag{6.77}$$
Thus, the lower the value of $c$ is, the faster we get convergence towards a given accuracy.
Finally, it is important to point out that this is not the only class of functions for which the gradient descent algorithm converges. For example, for the class of functions that are $L$-smooth and convex (i.e., not $\mu$-strongly convex), the gradient descent algorithm does converge; in particular it converges to within accuracy $\epsilon$ after $T \geq O(1/\epsilon)$ iterations. This is vastly slower than the convergence achieved for $\mu$-strongly convex functions.

6.3 Variations: Stochastic Gradient Descent


In this section we will cover the stochastic gradient descent (SGD) algorithm. This variant is motivated by optimization problems in which computing the gradient can be computationally expensive. The idea behind SGD is to use random sampling to find an estimate of the gradient that is easy to compute. Let us consider the class of unconstrained differentiable optimization problems that take the form
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}) = \min_{\vec{x} \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(\vec{x}). \tag{6.78}$$

The gradient of the objective function is given by
$$\nabla f(\vec{x}) = \frac{1}{m} \sum_{i=1}^m \nabla f_i(\vec{x}), \tag{6.79}$$
and the gradient descent step is given by
$$\vec{x}_{t+1} = \vec{x}_t - \eta \cdot \frac{1}{m} \sum_{i=1}^m \nabla f_i(\vec{x}_t). \tag{6.80}$$

This class of functions is very common in applications that involve learning from data, one example being the least squares problem. Recall that we can write the least squares problem as
$$\min_{\vec{x} \in \mathbb{R}^n} \underbrace{\frac{1}{m} \|A\vec{x} - \vec{y}\|_2^2}_{=f(\vec{x})} = \min_{\vec{x} \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m \underbrace{(\vec{a}_i^\top \vec{x} - y_i)^2}_{=f_i(\vec{x})} \tag{6.81}$$
where $\vec{a}_i^\top$ is the $i$-th row of $A$. Note that we multiplied the objective function by the constant $\frac{1}{m}$, which does not change the solutions to the optimization problem. We will see shortly that this constant is important in the SGD setting.
If the number of functions $m$ is large and if each gradient $\nabla f_i(\vec{x})$ is complex to evaluate, computing the full gradient in Equation (6.79) can become prohibitively expensive. To overcome this, SGD proposes to take a random sample from the set of functions $\{f_1, f_2, \dots, f_m\}$, evaluate the gradient of that sample only, and take the step
$$\vec{x}_{t+1} = \vec{x}_t - \eta \nabla f_i(\vec{x}_t). \tag{6.82}$$

If this index $i$ is drawn uniformly at random, we can see that the expected value of the estimated gradient is equal to the full gradient:
$$\mathbb{E}[\nabla f_i(\vec{x})] = \sum_{i=1}^m \frac{1}{m} \nabla f_i(\vec{x}) \tag{6.83}$$
$$= \nabla f(\vec{x}). \tag{6.84}$$

Note that the direction $-\nabla f_i(\vec{x})$ is not guaranteed to be the direction of steepest descent of the function $f(\vec{x})$. In fact, it might not be a descent direction at all, and the value of the function $f$ might even increase by taking a step in this direction. Further, it is not guaranteed to satisfy the first order optimality conditions; that is, when $\nabla f(\vec{x}) = \vec{0}$, the gradient of a single function $\nabla f_i(\vec{x})$ is generally not zero. This already tells us that the notion of convergence we studied (which is called "last-iterate convergence") is practically impossible to achieve, and in particular, the type of convergence proof that we studied for gradient descent will not work here. Instead, the type of convergence guarantee we can hope for in this setting is that the average of the iterates converges to the optimal value, i.e.,
$$\lim_{T \to \infty} f\left( \frac{1}{T} \sum_{t=1}^T \vec{x}_t \right) = \min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}). \tag{6.85}$$

Further, to prove convergence of SGD we need to use a variable step size ηt . Using a fixed step size (like we did
for gradient descent) doesn’t guarantee convergence. The intuition behind this is that the gradient directions ∇fi (~x)
for different i might be competing with each other and want to pull the solution in different directions. This will cause
the sequence ~xt to bounce back and forth. Using a variable step size ηt such that limt→∞ ηt = 0 makes it possible
to converge to a single point. It is not trivial to show that the point that the algorithm will converge to is the optimal
solution, but this can be proven under some assumptions on the objective function. For formal proofs, a good reference
is [6].
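Here is a minimal NumPy sketch of SGD with a decaying step size $\eta_t = 1/t$ applied to the least squares objective above; it is illustrative only, and the data and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

x = np.zeros(n)
for t in range(1, 5001):
    i = rng.integers(m)                      # sample an index uniformly at random
    grad_i = 2 * (A[i] @ x - y[i]) * A[i]    # gradient of f_i(x) = (a_i^T x - y_i)^2
    x = x - (1 / t) * grad_i                 # decaying step size eta_t = 1/t

# Distance to the least squares solution; should be small after many iterations.
print(np.linalg.norm(x - np.linalg.lstsq(A, y, rcond=None)[0]))
```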


Example 153 (Finding the Centroid Using SGD). Fix points $\vec{p}_1, \dots, \vec{p}_m \in \mathbb{R}^n$. Let us consider the problem of finding their centroid, which we formulate as
$$\min_{\vec{x} \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m \underbrace{\frac{1}{2} \|\vec{x} - \vec{p}_i\|_2^2}_{=f_i(\vec{x})}. \tag{6.86}$$
The optimal solution for this problem is just the average of the points
$$\vec{x}^\star = \frac{1}{m} \sum_{i=1}^m \vec{p}_i. \tag{6.87}$$
Let us apply SGD on this problem with the time varying step size $\eta_t = \frac{1}{t}$ and initial guess $\vec{x}_0 = \vec{0}$. We first compute the gradients
$$\nabla f_i(\vec{x}) = \vec{x} - \vec{p}_i. \tag{6.88}$$
To simplify the notation we will apply SGD by selecting the indices in order.
$$\text{Iteration 1:} \quad \vec{x}_1 = \vec{x}_0 - \frac{1}{1}(\vec{x}_0 - \vec{p}_1) = \vec{p}_1 \tag{6.89}$$
$$\text{Iteration 2:} \quad \vec{x}_2 = \vec{x}_1 - \frac{1}{2}(\vec{x}_1 - \vec{p}_2) = \frac{\vec{p}_1 + \vec{p}_2}{2} \tag{6.90}$$
$$\text{Iteration 3:} \quad \vec{x}_3 = \vec{x}_2 - \frac{1}{3}(\vec{x}_2 - \vec{p}_3) = \frac{\vec{p}_1 + \vec{p}_2 + \vec{p}_3}{3} \tag{6.91}$$
$$\vdots \tag{6.92}$$
$$\text{Iteration } m\text{:} \quad \vec{x}_m = \vec{x}_{m-1} - \frac{1}{m}(\vec{x}_{m-1} - \vec{p}_m) = \frac{\sum_{i=1}^m \vec{p}_i}{m}. \tag{6.93}$$
So the SGD algorithm which takes a step along the gradient of a single ∇fi (~x) in every iteration converges to the true
solution in m iterations. It is important to note that this is a toy example that is used to illustrate the application of
SGD. As we discussed earlier, this type of convergence behavior is not common or guaranteed when applying SGD to
more complicated real world problems.

Finally, we note that this setup of the SGD algorithm allows us to think of the optimization scheme as an online algorithm. Instead of randomly selecting an index to compute the gradient, assume that the data points (i.e., in the above example the $\vec{p}_i$) that define the functions $f_i(\vec{x})$ arrive one at a time. Following the same idea of SGD, we can perform an optimization step as each new data point becomes available. In this setting it is important to normalize by the number of data points $m$ in the objective function; otherwise the objective function will keep growing as we get more data, regardless of whether our optimization algorithm finds a good solution.

6.4 Variations: Gradient Descent for Constrained Optimization


So far we have seen how gradient descent can be applied to different problems, all of which are unconstrained. There are variants of the gradient descent algorithm that can be used to solve problems with simple constraints. Let $\Omega \subseteq \mathbb{R}^n$ be a convex set, let $f : \Omega \to \mathbb{R}$ be a differentiable function, and consider the problem
$$p^\star = \min_{\vec{x} \in \Omega} f(\vec{x}). \tag{6.94}$$
For these types of problems we want to limit our search to points inside the set $\Omega$. If we start with an initial guess $\vec{x}_0 \in \Omega$ and apply the (unconstrained) gradient descent algorithm
$$\vec{x}_{t+1} = \vec{x}_t - \eta \nabla f(\vec{x}_t), \tag{6.95}$$
the point $\vec{x}_{t+1}$ might end up outside of the feasible set $\Omega$, even when $\vec{x}_t$ is in $\Omega$. In this section we will consider two variants of the gradient descent algorithm that propose techniques to deal with constraints.

6.4.1 Projected Gradient Descent


The first variant, projected gradient descent, proposes the idea of projecting the point back into the feasible set.

Definition 154 (Projection onto a Convex Set)
Let $\Omega$ be a closed$^a$ convex set. Let $\vec{y} \in \mathbb{R}^n$ be any vector. We define the projection of the vector $\vec{y}$ onto the set $\Omega$ as
$$\operatorname{proj}_\Omega(\vec{y}) = \operatorname*{argmin}_{\vec{x} \in \Omega} \|\vec{x} - \vec{y}\|_2^2. \tag{6.96}$$
$^a$A closed set is one that contains its boundary points.

One can show that, since $\Omega$ is closed and convex, the projection is unique. In words, $\operatorname{proj}_\Omega(\vec{y})$ is the closest point in $\Omega$ to $\vec{y}$. From this definition one sees that if $\vec{y} \in \Omega$ then $\operatorname{proj}_\Omega(\vec{y}) = \vec{y}$. Using this definition we can write the step of the projected gradient descent algorithm as
$$\vec{x}_{t+1} = \operatorname{proj}_\Omega(\vec{x}_t - \eta \nabla f(\vec{x}_t)). \tag{6.97}$$

Algorithm 4 Projected Gradient Descent


1: function ProjectedGradientDescent(f, ~x0 , Ω, η, T )
2: for t = 0, 1, . . . , T − 1 do
3: ~xt+1 ← projΩ (~xt − η∇f (~xt ))
4: end for
5: return ~xT
6: end function

Note that the projection operator itself solves a convex optimization problem. It is only meaningful to consider the
projected gradient descent algorithm if this projection problem is simple enough to be solved in every iteration. We
have seen examples of projections onto simple sets. For example, the least squares problem computes the projection
of a vector onto the subspace R(A), and we know how to efficiently solve this projection problem. However, that’s not
always the case, and the projection problem might be nearly as difficult to solve as our original optimization problem.
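When $\Omega$ is simple, the projection has a closed form. For example, projecting onto a box $\Omega = [l, u]^n$ is just elementwise clipping. Here is a minimal NumPy sketch of projected gradient descent using that projection; the objective and all names are our own illustrative choices.

```python
import numpy as np

def proj_box(y, lo, hi):
    # The projection onto the box [lo, hi]^n is elementwise clipping.
    return np.clip(y, lo, hi)

# Minimize f(x) = ||x - c||^2 over the box [0, 1]^2, where c lies outside the box.
c = np.array([2.0, -0.5])
x = np.zeros(2)
eta = 0.1
for _ in range(200):
    grad = 2 * (x - c)
    x = proj_box(x - eta * grad, 0.0, 1.0)  # gradient step, then project back
print(x)  # should approach the projection of c onto the box, i.e., [1, 0]
```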

6.4.2 Conditional Gradient Descent


We introduce another variant of the gradient descent algorithm for constrained problems called conditional gradient
descent (also known as Frank-Wolfe method). Recall that in our development of the choice of the “best” search direction
for the (regular) gradient descent algorithm, we considered the quantity
f (~xt + η~vt ) − f (~xt )
D~vt f (~xt ) = lim = [∇f (~xt )]>~vt . (6.98)
η→0 η
This quantity, known as the directional derivative, is the (instantaneous) rate of change of the function f at point ~xt
along direction ~vt . We also saw that the choice ~vt = −∇f (~xt ) is the direction that minimizes this rate of change.
This is a reasonable choice for the unconstrained case since we can freely move in any direction. The idea behind the conditional gradient descent algorithm is to limit our search direction to the feasible set $\Omega$ and find a vector inside the set that minimizes $[\nabla f(\vec{x}_t)]^\top \vec{v}$. Thus, given $\vec{x}_t \in \Omega$, we can define the search direction of conditional gradient descent as
$$\vec{v}_t = \operatorname*{argmin}_{\vec{v} \in \Omega} [\nabla f(\vec{x}_t)]^\top \vec{v}. \tag{6.99}$$

Once we find this direction $\vec{v}_t$, we need to use it in a way that ensures that we don't leave the set. Taking a step along this direction, $\vec{x}_{t+1} = \vec{x}_t + \eta \vec{v}_t$, doesn't guarantee this. But we can do something different and take the convex combination between the current point and the search direction. That is, for $\delta_t \in [0, 1]$ we can update
$$\vec{x}_{t+1} = (1 - \delta_t)\vec{x}_t + \delta_t \vec{v}_t. \tag{6.100}$$
By convexity of the set, this guarantees that if we start with a feasible initial guess $\vec{x}_0 \in \Omega$, we get a sequence of points $\vec{x}_t \in \Omega$ for all $t \geq 0$. While this property holds for any choice of $\delta_t \in [0, 1]$, to prove convergence of the algorithm we require $\lim_{t \to \infty} \delta_t = 0$; a conventional choice is $\delta_t = \frac{1}{t}$.

Algorithm 5 Conditional Gradient Descent (Frank-Wolfe)


1: function ConditionalGradientDescent(f, ~x0 , Ω, δ, T )
2: for t = 0, 1, . . . , T − 1 do
3: ~vt ← argmin~v∈Ω [∇f (~xt )]>~v
4: ~xt+1 ← (1 − δt )~xt + δt~vt
5: end for
6: return ~xT
7: end function

As a last note, we point out that the problem of finding a search direction ~vt is a constrained optimization problem
within Ω.
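As a concrete instance, when $\Omega$ is the $\ell_1$ ball $\{\vec{v} : \|\vec{v}\|_1 \leq 1\}$, the linear subproblem in line 3 of Algorithm 5 is solved at a vertex of the ball: a signed standard basis vector chosen by the largest-magnitude gradient entry. Below is a minimal NumPy sketch under that assumption; the example objective and names are our own.

```python
import numpy as np

def fw_step_l1(grad):
    # argmin of grad^T v over the l1 ball is attained at a signed basis vector.
    i = np.argmax(np.abs(grad))
    v = np.zeros_like(grad)
    v[i] = -np.sign(grad[i])
    return v

# Minimize f(x) = ||x - c||^2 over the l1 ball; here c = (2, 0) lies outside,
# so the constrained optimum is the vertex e1 = (1, 0).
c = np.array([2.0, 0.0])
x = np.zeros(2)
for t in range(1, 101):
    grad = 2 * (x - c)
    v = fw_step_l1(grad)
    delta = 1 / t
    x = (1 - delta) * x + delta * v  # convex combination keeps x in the set
print(x)  # should be [1, 0]
```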

Chapter 7

Duality

Relevant sections of the textbooks:

• [1] Chapter 5.

• [2] Section 8.5.

7.1 Lagrangian
This chapter develops the theory of duality for optimization problems. This is a technique that can help us solve or
bound the optimal values for constrained optimization problems by solving the related dual problem. The dual problem
might not always give us a direct solution to our original (“primal”) problem, but it will always give us a bound. For
this, we first define the Lagrangian for a problem.
Let us start with our generic constrained optimization problem, which we will call problem P (which stands for
primal):

$$\text{problem P:} \quad p^\star = \min_{\vec{x} \in \mathbb{R}^n} f_0(\vec{x}) \tag{7.1}$$
$$\text{s.t.} \quad f_i(\vec{x}) \leq 0, \quad \forall i \in \{1, \dots, m\}$$
$$h_j(\vec{x}) = 0, \quad \forall j \in \{1, \dots, p\}.$$
Let us denote its feasible set by
$$\Omega := \left\{ \vec{x} \in \mathbb{R}^n \ \middle| \ \begin{array}{l} f_i(\vec{x}) \leq 0, \ \forall i \in \{1, \dots, m\} \\ h_j(\vec{x}) = 0, \ \forall j \in \{1, \dots, p\} \end{array} \right\} \quad \text{so that} \quad p^\star = \min_{\vec{x} \in \Omega} f_0(\vec{x}). \tag{7.2}$$

We know how to algorithmically approach unconstrained problems, say for instance using gradient descent, which
we discussed in the previous chapter. Thus, our question is whether there is a way to incorporate the constraints of
the problem P into the objective function itself. Trying to understand if this is possible is one motivation for the
Lagrangian.
To this end, we construct the indicator function $\mathbb{1}[\cdot]$, which for a condition $C(\vec{x})$ is defined as
$$\mathbb{1}[C(\vec{x})] := \begin{cases} 0, & \text{if } C(\vec{x}) \text{ is true,} \\ +\infty, & \text{if } C(\vec{x}) \text{ is false.} \end{cases} \tag{7.3}$$


Then the problem P in (7.1) is equivalent to the following unconstrained problem:
$$p^\star = \min_{\vec{x} \in \Omega} f_0(\vec{x}) \tag{7.4}$$
$$= \min_{\vec{x} \in \mathbb{R}^n} \left\{ f_0(\vec{x}) + \mathbb{1}[\vec{x} \in \Omega] \right\} \tag{7.5}$$
$$= \min_{\vec{x} \in \mathbb{R}^n} \left\{ f_0(\vec{x}) + \sum_{i=1}^m \mathbb{1}[f_i(\vec{x}) \leq 0] + \sum_{j=1}^p \mathbb{1}[h_j(\vec{x}) = 0] \right\}. \tag{7.6}$$
To explain the chain of equalities leading to (7.6), say we consider the function
$$F_0(\vec{x}) := f_0(\vec{x}) + \sum_{i=1}^m \mathbb{1}[f_i(\vec{x}) \leq 0] + \sum_{j=1}^p \mathbb{1}[h_j(\vec{x}) = 0] \tag{7.7}$$
evaluated at an $\vec{x} \notin \Omega$, i.e., there is some $i$ such that $f_i(\vec{x}) > 0$ or some $j$ such that $h_j(\vec{x}) \neq 0$. Then $F_0(\vec{x}) = \infty$. Therefore no solution to the minimization in (7.6) will ever be outside of $\Omega$, i.e., all solutions to (7.6) will be contained in $\Omega$. And for $\vec{x} \in \Omega$, we have $F_0(\vec{x}) = f_0(\vec{x})$, so that $\min_{\vec{x} \in \Omega} f_0(\vec{x}) = \min_{\vec{x} \in \mathbb{R}^n} F_0(\vec{x})$ (and the argmins are equal too). Hence the problem in (7.6) is equivalent to the original problem P.
Thus through this reformulation, we have removed the explicit constraints of the problem and incorporated them
into the objective function of the problem, and we can nominally write the problem as an unconstrained problem. If
we had some magic algorithm which allowed us to solve all unconstrained problems, then this reduction would let us
solve constrained problems as well.
Unfortunately, we do not have such a magic algorithm. Our usual algorithms for solving unconstrained problems,
such as gradient descent, require differentiable objective functions that are well-defined. Thus, it falls to us to express
our indicator functions in a differentiable way.
For this we consider approximating the indicator functions by other functions that are differentiable.
The key idea here is that, for a given indicator $\mathbb{1}[f_i(\vec{x}) \leq 0]$, we can express it as
$$\mathbb{1}[f_i(\vec{x}) \leq 0] = \max_{\lambda_i \in \mathbb{R}_+} \lambda_i f_i(\vec{x}), \tag{7.8}$$
where $\mathbb{R}_+$ is the set of non-negative real numbers. Why does this equality hold? Let's break it up into cases. Suppose that $f_i(\vec{x}) > 0$. Then $\lambda_i$ being more positive would make $\lambda_i f_i(\vec{x})$ more positive, so the maximization will send $\lambda_i \to \infty$, making $\lambda_i f_i(\vec{x}) = \infty$. Now suppose that $f_i(\vec{x}) \leq 0$. Then $\lambda_i$ being more positive would make $\lambda_i f_i(\vec{x})$ more negative, so the maximization will keep $\lambda_i$ at its lower bound 0, making $\lambda_i f_i(\vec{x}) = 0$. For a similar reason, for an indicator $\mathbb{1}[h_j(\vec{x}) = 0]$, we have
$$\mathbb{1}[h_j(\vec{x}) = 0] = \max_{\nu_j \in \mathbb{R}} \nu_j h_j(\vec{x}). \tag{7.9}$$

Thus, we can rewrite the problem P as
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} \left\{ f_0(\vec{x}) + \sum_{i=1}^m \max_{\lambda_i \in \mathbb{R}_+} \lambda_i f_i(\vec{x}) + \sum_{j=1}^p \max_{\nu_j \in \mathbb{R}} \nu_j h_j(\vec{x}) \right\} \tag{7.10}$$
$$= \min_{\vec{x} \in \mathbb{R}^n} \left\{ f_0(\vec{x}) + \max_{\vec{\lambda} \in \mathbb{R}^m_+} \sum_{i=1}^m \lambda_i f_i(\vec{x}) + \max_{\vec{\nu} \in \mathbb{R}^p} \sum_{j=1}^p \nu_j h_j(\vec{x}) \right\} \tag{7.11}$$
$$= \min_{\vec{x} \in \mathbb{R}^n} \max_{\vec{\lambda} \in \mathbb{R}^m_+, \ \vec{\nu} \in \mathbb{R}^p} \left\{ f_0(\vec{x}) + \sum_{i=1}^m \lambda_i f_i(\vec{x}) + \sum_{j=1}^p \nu_j h_j(\vec{x}) \right\}. \tag{7.12}$$


This interior function is called the Lagrangian, and is much easier to work with than the original unconstrained problem
with indicator functions. Thus we have made our constrained problem into a (mostly) unconstrained problem,¹ at the
expense of adding an extra max. We will see how to cope with this extra difficulty shortly.
More formally, we have the following definition:

Definition 155 (Lagrangian)
The Lagrangian of problem P in (7.1) is the function $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ given by
$$L(\vec{x}, \vec{\lambda}, \vec{\nu}) := f_0(\vec{x}) + \sum_{i=1}^m \lambda_i f_i(\vec{x}) + \sum_{j=1}^p \nu_j h_j(\vec{x}). \tag{7.13}$$
Additionally, $\lambda_i, \nu_j \in \mathbb{R}$ are called Lagrange multipliers.

The important intuition behind the Lagrange multipliers is that they are penalties for violating their corresponding
constraint. In particular, just like how the indicator function 1 [fi (~x) ≤ 0] assigns an infinite penalty for violating the
constraint fi (~x) ≤ 0, the term λi fi (~x) assigns a penalty λi fi (~x) for violating the constraint fi (~x) ≤ 0. One can show
that λi fi (~x) ≤ 1 [fi (~x) ≤ 0], so the Lagrangian penalty is a smooth lower bound to the indicator penalty, which is
something like a hard-threshold. We can do similar analysis for the equality constraints hj (~x) = 0.
As derived before, the primal problem can be expressed in terms of the Lagrangian as
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} \max_{\vec{\lambda} \in \mathbb{R}^m_+, \ \vec{\nu} \in \mathbb{R}^p} L(\vec{x}, \vec{\lambda}, \vec{\nu}). \tag{7.14}$$

We conclude with two very important properties of the Lagrangian.

Proposition 156
For every $\vec{x} \in \mathbb{R}^n$, the function $(\vec{\lambda}, \vec{\nu}) \mapsto L(\vec{x}, \vec{\lambda}, \vec{\nu})$ is an affine function, and hence a concave function. (We also say that $L$ is affine (resp. concave) in $\vec{\lambda}$ and $\vec{\nu}$.)

Proof. The proof follows from the definition of the Lagrangian:
$$L(\vec{x}, \vec{\lambda}, \vec{\nu}) = f_0(\vec{x}) + \sum_{i=1}^m \lambda_i f_i(\vec{x}) + \sum_{j=1}^p \nu_j h_j(\vec{x}) \tag{7.15}$$
is affine in $\vec{\lambda}$ and $\vec{\nu}$.

Example 157. Consider the optimization problem
$$p^\star = \min_{x \in \mathbb{R}} 3x^2 \tag{7.16}$$
$$\text{s.t.} \quad 2x^3 \leq 8.$$
This problem is not convex, but we can still find its Lagrangian as
$$L(x, \lambda) = 3x^2 + \lambda(2x^3 - 8). \tag{7.17}$$


¹Not quite — the max is actually over Rm + × R which is technically a constrained set. But this non-negativity constraint is usually much easier
p

to handle than arbitrary constraints, in part because the constraint set is convex.


7.2 Weak Duality


Our foray into duality starts by considering the primal problem:
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} \max_{\vec{\lambda} \in \mathbb{R}^m_+, \ \vec{\nu} \in \mathbb{R}^p} L(\vec{x}, \vec{\lambda}, \vec{\nu}). \tag{7.18}$$
The dual problem is obtained by swapping the min and max:
$$d^\star = \max_{\vec{\lambda} \in \mathbb{R}^m_+, \ \vec{\nu} \in \mathbb{R}^p} \min_{\vec{x} \in \mathbb{R}^n} L(\vec{x}, \vec{\lambda}, \vec{\nu}). \tag{7.19}$$
In general, the quantities $p^\star$ and $d^\star$ do not have to be equal! And swapping the min and the max is the generic definition of "duality".

Definition 158 (Dual Problem)
For a primal problem P as described in (7.1), its dual problem D is defined as
$$\text{problem D:} \quad d^\star = \max_{\vec{\lambda} \in \mathbb{R}^m, \ \vec{\nu} \in \mathbb{R}^p} g(\vec{\lambda}, \vec{\nu}) \tag{7.20}$$
$$\text{s.t.} \quad \lambda_i \geq 0, \quad \forall i \in \{1, \dots, m\}.$$
Here the function $g : \mathbb{R}^m_+ \times \mathbb{R}^p \to \mathbb{R}$ is the dual function, defined as
$$g(\vec{\lambda}, \vec{\nu}) = \min_{\vec{x} \in \mathbb{R}^n} L(\vec{x}, \vec{\lambda}, \vec{\nu}). \tag{7.21}$$

Thus, $g(\vec{\lambda}, \vec{\nu})$ can be computed as an unconstrained optimization problem over $\vec{x}$; in particular, $g(\vec{\lambda}, \vec{\nu})$ is the minimum value of $L(\vec{x}, \vec{\lambda}, \vec{\nu})$ over all $\vec{x}$.
There are several important properties of the dual problem and dual function, which we summarize below.

Proposition 159
The dual function $g$ is a concave function of $(\vec{\lambda}, \vec{\nu})$, regardless of any properties of P.

Proof. We already showed that $L(\vec{x}, \vec{\lambda}, \vec{\nu})$ is an affine (and thus concave) function of $\vec{\lambda}$ and $\vec{\nu}$. Thus the function
$$g(\vec{\lambda}, \vec{\nu}) = \min_{\vec{x} \in \mathbb{R}^n} L(\vec{x}, \vec{\lambda}, \vec{\nu}) \tag{7.22}$$
is a pointwise minimum of concave functions and is thus concave.

Corollary 160. The dual problem D is always a convex problem, no matter what the primal problem P is.

Note that the minimization of a convex function or the maximization of a concave function are both considered
“convex” problems.
This means we have analytic and algorithmic ways to solve the dual problem D. If we can connect the solutions to
the dual problem D to the primal problem P, this means that we have reduced the process of solving P to the process
of solving a convex optimization problem, and this is something we know how to do. In the rest of this chapter, we
discuss when this reduction is possible.


Example 161. Let us compute the dual of the optimization problem
$$p^\star = \min_{x \in \mathbb{R}} 5x^2 \tag{7.23}$$
$$\text{s.t.} \quad 3x \leq 5.$$
The Lagrangian is
$$L(x, \lambda) = 5x^2 + \lambda(3x - 5). \tag{7.24}$$
The dual function is
$$g(\lambda) = \min_x L(x, \lambda) = \min_x \left\{ 5x^2 + \lambda(3x - 5) \right\}. \tag{7.25}$$
For each $\lambda > 0$, the function $L(x, \lambda) = 5x^2 + \lambda(3x - 5)$ is bounded below (in $x$), convex, and differentiable, so to minimize it we set its derivative to 0. The optimal $x^\star$ is a function of the Lagrange multiplier $\lambda$, so we write it as $x^\star(\lambda)$. We have
$$0 = \nabla_x L(x^\star(\lambda), \lambda) \tag{7.26}$$
$$= 10x^\star(\lambda) + 3\lambda \tag{7.27}$$
$$\implies x^\star(\lambda) = -\frac{3}{10}\lambda. \tag{7.28}$$
Thus we can plug this optimal point back in to get $g$:
$$g(\lambda) = L(x^\star(\lambda), \lambda) \tag{7.29}$$
$$= 5(x^\star(\lambda))^2 + \lambda(3x^\star(\lambda) - 5) \tag{7.30}$$
$$= \frac{5 \cdot 9}{100}\lambda^2 + \lambda\left(-\frac{3 \cdot 3}{10}\lambda - 5\right) \tag{7.31}$$
$$= -\frac{9}{20}\lambda^2 - 5\lambda. \tag{7.32}$$
Thus the dual problem is
$$d^\star = \max_{\lambda \in \mathbb{R}_+} \left\{ -\frac{9}{20}\lambda^2 - 5\lambda \right\}. \tag{7.33}$$
One may solve this problem directly to obtain $\lambda^\star = 0$ and $d^\star = 0$.
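We can sanity-check this example numerically. Here is a minimal sketch using the CVXPY library (an assumed tool, not prescribed by the reader) that solves the primal and the dual and confirms both optimal values are 0:

```python
import cvxpy as cp

# Primal: minimize 5x^2 subject to 3x <= 5.
x = cp.Variable()
primal = cp.Problem(cp.Minimize(5 * x**2), [3 * x <= 5])
p_star = primal.solve()

# Dual: maximize -(9/20) lambda^2 - 5 lambda over lambda >= 0.
lam = cp.Variable(nonneg=True)
dual = cp.Problem(cp.Maximize(-(9 / 20) * lam**2 - 5 * lam))
d_star = dual.solve()

print(p_star, d_star)  # both should be (numerically) 0
```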

We also have some more bounds for the Lagrangian in terms of $f_0$, $g$, $p^\star$, and $d^\star$.

Proposition 162
Let $\vec{x} \in \Omega$, let $\vec{\lambda} \in \mathbb{R}^m_+$, and let $\vec{\nu} \in \mathbb{R}^p$, so that $\vec{x}$ is feasible for P and $(\vec{\lambda}, \vec{\nu})$ is feasible for D. Then we have:

(a) $f_0(\vec{x}) \geq p^\star$ and $g(\vec{\lambda}, \vec{\nu}) \leq d^\star$;

(b) $f_0(\vec{x}) \geq L(\vec{x}, \vec{\lambda}, \vec{\nu}) \geq g(\vec{\lambda}, \vec{\nu})$;

(c) $f_0(\vec{x}) \geq d^\star$ and $g(\vec{\lambda}, \vec{\nu}) \leq p^\star$.

Proof. The inequalities in (a) follow from the characterizations
$$p^\star = \min_{\vec{x}' \in \Omega} f_0(\vec{x}') \leq f_0(\vec{x}) \tag{7.34}$$
$$d^\star = \max_{\vec{\lambda}' \in \mathbb{R}^m_+, \ \vec{\nu}' \in \mathbb{R}^p} g(\vec{\lambda}', \vec{\nu}') \geq g(\vec{\lambda}, \vec{\nu}). \tag{7.35}$$
The inequalities in (b) follow from the characterizations
$$g(\vec{\lambda}, \vec{\nu}) = \min_{\vec{x}' \in \mathbb{R}^n} L(\vec{x}', \vec{\lambda}, \vec{\nu}) \leq L(\vec{x}, \vec{\lambda}, \vec{\nu}) \tag{7.36}$$
$$f_0(\vec{x}) = \max_{\vec{\lambda}' \in \mathbb{R}^m_+, \ \vec{\nu}' \in \mathbb{R}^p} L(\vec{x}, \vec{\lambda}', \vec{\nu}') \geq L(\vec{x}, \vec{\lambda}, \vec{\nu}). \tag{7.37}$$
The inequalities in (c) follow from (b), in the sense that
$$f_0(\vec{x}) \geq g(\vec{\lambda}, \vec{\nu}) \quad \forall \vec{\lambda} \in \mathbb{R}^m_+, \ \forall \vec{\nu} \in \mathbb{R}^p \tag{7.38}$$
$$\implies f_0(\vec{x}) \geq \max_{\vec{\lambda}' \in \mathbb{R}^m_+, \ \vec{\nu}' \in \mathbb{R}^p} g(\vec{\lambda}', \vec{\nu}') = d^\star, \tag{7.39}$$
and additionally
$$f_0(\vec{x}) \geq g(\vec{\lambda}, \vec{\nu}) \quad \forall \vec{x} \in \Omega \tag{7.40}$$
$$\implies p^\star = \min_{\vec{x}' \in \Omega} f_0(\vec{x}') \geq g(\vec{\lambda}, \vec{\nu}). \tag{7.41}$$

From all these inequalities, the relationship between $p^\star$ and $d^\star$ is totally unclear. Actually, their relationship is very simple; it is captured in a condition called "weak duality".

Definition 163 (Types of Duality)
Let P be a primal problem with optimum $p^\star$. Let D be the corresponding dual problem with optimum $d^\star$.

(a) If $p^\star \geq d^\star$ then we say that weak duality holds.

(b) If $p^\star = d^\star$ then we say that strong duality holds.

(c) The quantity $p^\star - d^\star$ is called the duality gap.

Proposition 164 (Weak Duality Always Holds)
For any problem, weak duality holds, i.e., the duality gap is non-negative.

The fact that weak duality always holds is actually a corollary of a slightly more generic result called the minimax
inequality, which we now state and prove.

Proposition 165 (Minimax Inequality)
Let $X$ and $Y$ be any sets, and $F : X \times Y \to \mathbb{R}$ be any function. Then
$$\min_{x \in X} \max_{y \in Y} F(x, y) \geq \max_{y \in Y} \min_{x \in X} F(x, y). \tag{7.42}$$

Proof. For any $x \in X$ and $y \in Y$, we have
$$F(x, y) \geq \min_{x' \in X} F(x', y). \tag{7.43}$$
Taking the maximum over $y$ on both sides (we will discuss why we can do this at the end of the proof), we have
$$\max_{y' \in Y} F(x, y') \geq \max_{y' \in Y} \min_{x' \in X} F(x', y'). \tag{7.44}$$
Finally, the left hand side is a function of $x$ but the right hand side is a constant, so we might as well take the minimum in $x$ on the left hand side to get
$$\min_{x' \in X} \max_{y' \in Y} F(x', y') \geq \max_{y' \in Y} \min_{x' \in X} F(x', y'). \tag{7.45}$$
This concludes the (main part of the) proof.

However, we still need to discuss why we can take maximums on both sides in Equation (7.44). We prove this in the case where the maximums are attained, though it is true in general (and can be proved via a slightly technical modification to this argument). Fix $x \in X$. Let $f : Y \to \mathbb{R}$ be defined as $f(y) := F(x, y)$, and let $g : Y \to \mathbb{R}$ be defined as $g(y) := \min_{x' \in X} F(x', y)$. Then we had shown that $f(y) \geq g(y)$ for all $y \in Y$, and we want to justify why this means that $\max_{y' \in Y} f(y') \geq \max_{y' \in Y} g(y')$. Towards this end, let $\tilde{y} \in \operatorname*{argmax}_{y' \in Y} g(y')$. Then we have
$$\max_{y' \in Y} f(y') \geq f(\tilde{y}) \geq g(\tilde{y}) = \max_{y' \in Y} g(y') \implies \max_{y' \in Y} F(x, y') \geq \max_{y' \in Y} \min_{x' \in X} F(x', y') \text{ for all } x \in X, \tag{7.46}$$
as desired.

Here is a game theoretic interpretation of the minimax inequality.


Suppose X and Y are two players choosing actions x and y. Player X chooses action x trying to minimize the
value of the function F (x, y). Player Y chooses action y trying to maximize the value of the function F (x, y). The
two players are perfectly rational and take turns — that is, the players choose their actions one after the other, and the
second player has full knowledge of the first player’s action (which is locked in) as they choose their action. (In game
theory this is called a two-player zero-sum sequential game.) The value of the game, given actions x and y, is F (x, y).
The outer optimization corresponds to the player going first, and the inner optimization corresponds to the player going second. Why? Consider the optimization problem $\min_{x \in X} \max_{y \in Y} F(x, y)$. Defining $G(x) = \max_{y \in Y} F(x, y)$, we see that given a particular $x$, $G(x)$ is the maximum value of $F(x, y)$ obtained by varying $y$. That is, it corresponds to the best play by player $Y$ given that player $X$ picks action $x$. This means that the inner maximization corresponds to the second player $Y$'s actions given that the first player $X$ chooses action $x$. And so the outer optimization corresponds to the first player $X$'s action. The situation is analogous on the right-hand side.
What the minimax inequality says is that going second is better. That is, being the second player and choosing
your action with full knowledge of the first player’s action leads to a better outcome for you than going first with no
knowledge of the second player’s action. When X is the second player, the value of the game is lower than when X is
the first player; when Y is the second player, the value of the game is higher than when Y is the first player.
Given the minimax inequality, the proof that weak duality always holds is only one line.

Proof that weak duality always holds. We have
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} \max_{\vec{\lambda} \in \mathbb{R}^m_+, \ \vec{\nu} \in \mathbb{R}^p} L(\vec{x}, \vec{\lambda}, \vec{\nu}) \geq \max_{\vec{\lambda} \in \mathbb{R}^m_+, \ \vec{\nu} \in \mathbb{R}^p} \min_{\vec{x} \in \mathbb{R}^n} L(\vec{x}, \vec{\lambda}, \vec{\nu}) = d^\star. \tag{7.47}$$

Weak duality is useful because it gives a bound on how close to optimum a given decision variable is. More specifically, we have for all $\vec{x} \in \Omega$ and all $(\vec{\lambda}, \vec{\nu}) \in \mathbb{R}^m_+ \times \mathbb{R}^p$ that
$$f_0(\vec{x}) \geq p^\star \geq d^\star \geq g(\vec{\lambda}, \vec{\nu}) \implies f_0(\vec{x}) - g(\vec{\lambda}, \vec{\nu}) \geq f_0(\vec{x}) - p^\star. \tag{7.48}$$


Thus, if $f_0(\vec{x}) - g(\vec{\lambda}, \vec{\nu}) \leq \epsilon$ then we know that $f_0(\vec{x}) - p^\star \leq \epsilon$. This is called a certificate of optimality. We can use this certificate as a stopping condition for algorithms such as gradient descent or a broad class of algorithms known as primal-dual algorithms.

7.3 Strong Duality


Since weak duality holds for all problems, it necessarily would not have too many critical implications. The stronger
condition of strong duality turns out to hold the keys to efficiently computing minima for constrained optimization
problems.
First, a natural question is: when does strong duality hold? It does not always hold, and in particular it is a stronger
condition than weak duality. You will see some example problems where strong duality does not hold in discussion.
There are many conditions under which strong duality holds. In this course, we discuss one such condition which
is relatively simple to evaluate and broadly applicable. This condition is called Slater’s condition.

Theorem 166 (Slater's Condition)
Consider a convex problem P:
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} f_0(\vec{x}) \tag{7.1}$$
$$\text{s.t.} \quad f_i(\vec{x}) \leq 0, \quad \forall i \in \{1, \dots, m\}$$
$$h_j(\vec{x}) = 0, \quad \forall j \in \{1, \dots, p\}.$$
If there exists any $\vec{x} \in \operatorname{relint}(\Omega)$$^a$ which is strictly feasible, i.e., such that all of the following hold:

• for all $i \in \{1, \dots, m\}$ such that $f_i$ is an affine function, we have $f_i(\vec{x}) \leq 0$;

• for all $i \in \{1, \dots, m\}$ such that $f_i$ is not an affine function, we have $f_i(\vec{x}) < 0$;

• and for all $j \in \{1, \dots, p\}$, we have $h_j(\vec{x}) = 0$;

then strong duality holds for P and its dual D, i.e., the duality gap is 0.
$^a$Recall that this notation means the relative interior of the feasible set $\Omega$, defined formally in Definition 102. Since the formal definition of the relative interior is out of scope for this semester, you may just consider it to be the interior of $\Omega$, i.e., points not on the boundary of $\Omega$, and we will not give you problems where the distinction matters.

This result is sometimes called refined Slater’s condition as it is a slight modification of another condition which is also
called Slater’s condition.
We will not prove this in class; a proof sketch is in [1]. It uses the separating hyperplane theorem we proved earlier.
This result is actually one of the main payoffs of the separating hyperplane theorem that we will see in this course.
Again, observe that we only need a single point which fulfills the three conditions in order to declare that Slater’s
condition holds. Moreover, this point need not be optimal for the problem — any point at all which satisfies the
conditions will do.
Slater’s condition has a geometric interpretation: if there is a point in the relative interior of the feasible set which
is strictly feasible, then strong duality holds. In problems we are able to visualize, this makes it relatively easy to
determine whether Slater’s condition holds.


Figure 7.1: Points in the convex set $\Omega = \{(x, y) \in \mathbb{R}^2 : -5 \leq x \leq 5, \ \frac{25}{4} \geq y \geq \frac{1}{4}x^2\}$. Blue points are feasible points which fulfill the conditions of Slater's condition (thus ensuring that Slater's condition holds for the problem $\min_{\vec{x} \in \Omega} f_0(\vec{x})$ provided that $f_0$ is convex). Red points are feasible points which do not fulfill the conditions of Slater's condition. Brown points are infeasible points.

Example 167. In this example, we consider the equality-constrained minimum-norm problem
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} \|\vec{x}\|_2^2 \tag{7.49}$$
$$\text{s.t.} \quad A\vec{x} = \vec{y}.$$
This problem only has equality constraints, and so it only has the Lagrange multiplier $\vec{\nu}$. The Lagrangian of this problem is
$$L(\vec{x}, \vec{\nu}) = \|\vec{x}\|_2^2 + \sum_{j=1}^p \nu_j (A\vec{x} - \vec{y})_j \tag{7.50}$$
$$= \|\vec{x}\|_2^2 + \vec{\nu}^\top (A\vec{x} - \vec{y}). \tag{7.51}$$
The dual function is
$$g(\vec{\nu}) = \min_{\vec{x} \in \mathbb{R}^n} L(\vec{x}, \vec{\nu}) \tag{7.52}$$
$$= \min_{\vec{x} \in \mathbb{R}^n} \left\{ \|\vec{x}\|_2^2 + \vec{\nu}^\top (A\vec{x} - \vec{y}) \right\}. \tag{7.53}$$
The Lagrangian is convex in $\vec{x}$ and bounded below, and so it is minimized wherever its gradient in $\vec{x}$ is $\vec{0}$. The optimal $\vec{x}^\star$ is a function of $\vec{\nu}$, so we write it as $\vec{x}^\star(\vec{\nu})$. We have
$$\vec{0} = \nabla_{\vec{x}} L(\vec{x}^\star(\vec{\nu}), \vec{\nu}) \tag{7.54}$$
$$= 2\vec{x}^\star(\vec{\nu}) + A^\top \vec{\nu} \tag{7.55}$$
$$\implies \vec{x}^\star(\vec{\nu}) = -\frac{1}{2} A^\top \vec{\nu}. \tag{7.56}$$
Thus we can plug this optimal point back in to get $g$:
$$g(\vec{\nu}) = L(\vec{x}^\star(\vec{\nu}), \vec{\nu}) \tag{7.57}$$
$$= \|\vec{x}^\star(\vec{\nu})\|_2^2 + \vec{\nu}^\top (A\vec{x}^\star(\vec{\nu}) - \vec{y}) \tag{7.58}$$
$$= \left\| -\frac{1}{2} A^\top \vec{\nu} \right\|_2^2 + \vec{\nu}^\top \left( A \left( -\frac{1}{2} A^\top \vec{\nu} \right) - \vec{y} \right) \tag{7.59}$$
$$= \frac{1}{4} \vec{\nu}^\top A A^\top \vec{\nu} - \frac{1}{2} \vec{\nu}^\top A A^\top \vec{\nu} - \vec{\nu}^\top \vec{y} \tag{7.60}$$
$$= -\frac{1}{4} \vec{\nu}^\top A A^\top \vec{\nu} - \vec{\nu}^\top \vec{y}. \tag{7.61}$$
Thus the dual problem is
$$d^\star = \max_{\vec{\nu} \in \mathbb{R}^p} \left\{ -\frac{1}{4} \vec{\nu}^\top A A^\top \vec{\nu} - \vec{\nu}^\top \vec{y} \right\} = -\min_{\vec{\nu} \in \mathbb{R}^p} \left\{ \frac{1}{4} \vec{\nu}^\top A A^\top \vec{\nu} + \vec{\nu}^\top \vec{y} \right\}. \tag{7.62}$$
There are no constraints on the dual problem, because there are no inequality constraints on the primal problem. Since this dual problem is a convex unconstrained problem whose objective value is bounded below, it can be solved by setting the gradient of the objective to $\vec{0}$ and solving for the optimal dual variable. This yields
$$\vec{0} = \nabla g(\vec{\nu}^\star) \tag{7.63}$$
$$= -\frac{1}{2} A A^\top \vec{\nu}^\star - \vec{y} \tag{7.64}$$
$$\implies \vec{\nu}^\star = -2(AA^\top)^{-1} \vec{y}. \tag{7.65}$$
Our original primal problem is convex, and all constraints are affine. As long as there is a feasible point $\vec{x}$, i.e., a point $\vec{x}$ such that $A\vec{x} = \vec{y}$, then Slater's condition implies that strong duality holds. Suppose that there is such a feasible point (i.e., the primal problem is feasible). Then strong duality holds. Because the primal problem is convex and strong duality holds, we can recover the optimal primal variable $\vec{x}^\star$ from the optimal dual variable $\vec{\nu}^\star$ as
$$\vec{x}^\star = \vec{x}^\star(\vec{\nu}^\star) = -\frac{1}{2} A^\top \left( -2(AA^\top)^{-1} \vec{y} \right) = A^\top (AA^\top)^{-1} \vec{y}. \tag{7.66}$$
This corresponds to the familiar minimum-norm solution.
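Here is a minimal NumPy check (our own illustrative setup) that the dual-derived solution $A^\top (AA^\top)^{-1} \vec{y}$ matches the minimum-norm solution returned by a standard solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))  # wide matrix, full row rank (generically)
y = rng.standard_normal(3)

x_dual = A.T @ np.linalg.solve(A @ A.T, y)  # A^T (A A^T)^{-1} y from the dual
x_pinv = np.linalg.pinv(A) @ y              # pseudoinverse gives the min-norm solution

print(np.allclose(x_dual, x_pinv))  # True
print(np.allclose(A @ x_dual, y))   # feasibility: True
```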

Example 168. Consider a so-called linear program:
$$p^\star = \min_{\vec{x} \in \mathbb{R}^n} \vec{c}^\top \vec{x} \tag{7.67}$$
$$\text{s.t.} \quad A\vec{x} = \vec{y}$$
$$\vec{x} \geq \vec{0}$$
where the last constraint means $x_i \geq 0$ for all $i$. Slater's condition implies that strong duality holds for this problem provided there is a feasible point. The Lagrangian is
$$L(\vec{x}, \vec{\lambda}, \vec{\nu}) = \vec{c}^\top \vec{x} + \sum_{i=1}^n \lambda_i (-x_i) + \sum_{j=1}^m \nu_j (A\vec{x} - \vec{y})_j \tag{7.68}$$
$$= \vec{c}^\top \vec{x} - \vec{\lambda}^\top \vec{x} + \vec{\nu}^\top (A\vec{x} - \vec{y}) \tag{7.69}$$
$$= \left( \vec{c} + A^\top \vec{\nu} - \vec{\lambda} \right)^\top \vec{x} - \vec{\nu}^\top \vec{y}. \tag{7.70}$$
This function is affine in $\vec{x}$, hence convex in $\vec{x}$, so its dual function is
$$g(\vec{\lambda}, \vec{\nu}) = \min_{\vec{x} \in \mathbb{R}^n} L(\vec{x}, \vec{\lambda}, \vec{\nu}) \tag{7.71}$$
$$= \min_{\vec{x} \in \mathbb{R}^n} \left\{ \left( \vec{c} + A^\top \vec{\nu} - \vec{\lambda} \right)^\top \vec{x} - \vec{\nu}^\top \vec{y} \right\} \tag{7.72}$$
$$= \begin{cases} -\vec{\nu}^\top \vec{y}, & \text{if } \vec{c} + A^\top \vec{\nu} - \vec{\lambda} = \vec{0} \\ -\infty, & \text{otherwise.} \end{cases} \tag{7.73}$$
(It is a good exercise to figure out why this last equality is correct.) Thus, the dual is
$$d^\star = \max_{\vec{\lambda} \in \mathbb{R}^n, \ \vec{\nu} \in \mathbb{R}^m} g(\vec{\lambda}, \vec{\nu}) \tag{7.74}$$
$$\text{s.t.} \quad \lambda_i \geq 0, \quad \forall i \in \{1, \dots, n\}.$$
Expanding out $g$, this problem is equivalent to
$$d^\star = \max_{\vec{\lambda} \in \mathbb{R}^n, \ \vec{\nu} \in \mathbb{R}^m} -\vec{\nu}^\top \vec{y} \tag{7.75}$$
$$\text{s.t.} \quad \lambda_i \geq 0, \quad \forall i \in \{1, \dots, n\}$$
$$\vec{c} + A^\top \vec{\nu} - \vec{\lambda} = \vec{0}.$$

Example 169 (Shadow Prices). In this example, we determine an economic interpretation of Lagrange multipliers.
Suppose we have 200 kilos of merlot grapes and 300 kilos of shiraz grapes. Consider the following possible blends:

• 4 kilos merlot, 1 kilo shiraz for $20 per bottle.

• 2 kilos merlot, 3 kilos shiraz for $15 per bottle.

We want to maximize our profits. Suppose we make q1 bottles of the first blend, and q2 of the second. Our optimization
problem is:

p? = max 20q1 + 15q2 (7.76)


q1 ,q2 ∈R

s.t. 4q1 + 2q2 ≤ 200


q1 + 3q2 ≤ 300
q1 ≥ 0
q2 ≥ 0.

In reality, we can’t make fractional bottles of wine, so we can round q1 and q2 to the nearest integer if needed. Now
consider the following modification. Suppose that we actually want to sell off some of the grapes instead of turning
them into wine. We can earn λ1 dollars per kilo of merlot and λ2 per kilo of shiraz. This yields a new optimization
problem

$$\max_{q_1, q_2 \in \mathbb{R}_+} \left\{ 20q_1 + 15q_2 + \lambda_1(200 - 4q_1 - 2q_2) + \lambda_2(300 - q_1 - 3q_2) \right\} \tag{7.77}$$
$$= \max_{q_1, q_2 \in \mathbb{R}_+} \left\{ (20 - 4\lambda_1 - \lambda_2)q_1 + (15 - 2\lambda_1 - 3\lambda_2)q_2 + 200\lambda_1 + 300\lambda_2 \right\}. \tag{7.78}$$

If the coefficient of q1 is negative, we shouldn’t make any of the first blend. If the coefficient of q2 is negative, we
shouldn’t make any of the second blend. If both are positive, we should make as many of each as possible. What
happens when 20 − 4λ1 − λ2 = 0 and 15 − 2λ1 − 3λ2 = 0? This is an indifference point, i.e. the point at which our
profit is the same no matter how many bottles of either wine we make. Under this condition, the minimum profit we
could possibly make is

min_{λ1,λ2∈R+} 200λ1 + 300λ2 (7.79)


s.t. 20 − 4λ1 − λ2 = 0
15 − 2λ1 − 3λ2 = 0.

This problem is the dual to our original (primal) problem. The λi's are called shadow prices, and they capture how much we would be willing to pay to relax (i.e., violate) our constraints.
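For readers who want to experiment with this interpretation, here is a minimal sketch (assuming the cvxpy package is available) that solves the wine LP and reads off the shadow prices as the dual values attached to the two supply constraints:

```python
import cvxpy as cp

# The wine-blending LP from the example, with dual variables read after solving.
q = cp.Variable(2, nonneg=True)          # bottles of blend 1 and blend 2
merlot = 4 * q[0] + 2 * q[1] <= 200      # kilos of merlot available
shiraz = 1 * q[0] + 3 * q[1] <= 300      # kilos of shiraz available

prob = cp.Problem(cp.Maximize(20 * q[0] + 15 * q[1]), [merlot, shiraz])
prob.solve()
print(prob.value)           # optimal profit
print(merlot.dual_value)    # shadow price of merlot (dollars per kilo)
print(shiraz.dual_value)    # shadow price of shiraz (dollars per kilo)
```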

7.4 Karush-Kuhn-Tucker (KKT) Conditions


We now go into some more involved theory which connects strong duality to optimality conditions.
A broadly applicable set of conditions which are sometimes necessary and sometimes sufficient for optimality is
called the Karush-Kuhn-Tucker (KKT) conditions.

Definition 170 (KKT Conditions)
Let (x̃, λ̃, ν̃) ∈ Rn × Rm × Rp be a decision variable and Lagrange multipliers. Suppose that the objective function f0 and constraint functions f1, . . . , fm, h1, . . . , hp are differentiable. We say that (x̃, λ̃, ν̃) fulfills the KKT conditions if

• (Primal feasibility.) x̃ is feasible for P, i.e.,

  fi(x̃) ≤ 0, ∀i ∈ {1, . . . , m} (7.80)
  and hj(x̃) = 0, ∀j ∈ {1, . . . , p}. (7.81)

• (Dual feasibility.) (λ̃, ν̃) is feasible for D, i.e.,

  λ̃i ≥ 0, ∀i ∈ {1, . . . , m}. (7.82)

• (Complementary slackness.)

  λ̃i fi(x̃) = 0, ∀i ∈ {1, . . . , m}. (7.83)

• (Stationarity, or first-order condition.)

  ~0 = ∇~x L(x̃, λ̃, ν̃) (7.84)
     = ∇f0(x̃) + Σ_{i=1}^m λ̃i ∇fi(x̃) + Σ_{j=1}^p ν̃j ∇hj(x̃). (7.85)

Given an arbitrary optimization problem, the KKT conditions do not have to be related to the optimality conditions.
But it turns out that in many cases, they are related.

Theorem 171 (If Strong Duality Holds, then KKT Conditions are Necessary for Optimality)
Suppose P is a primal problem with dual D. Suppose that the objective function f0 and constraint functions
f1 , . . . , fm , h1 , . . . , hp are differentiable, strong duality holds, and (~x? , ~λ? , ~ν ? ) are optimal primal and dual vari-
ables. Then (~x? , ~λ? , ~ν ? ) fulfill the KKT conditions.

Proof. By assumption, ~x? is feasible for P and (~λ? , ~ν ? ) is feasible for D. This implies that λ?i fi (~x? ) ≤ 0 for all i, and


νj? hj (~x? ) = 0 for all j. Thus, we have

d? = g(~λ?, ~ν?) (7.86)
   = min_{~x∈Rn} L(~x, ~λ?, ~ν?) (7.87)
   ≤ L(~x?, ~λ?, ~ν?) (7.88)
   = f0(~x?) + Σ_{i=1}^m λ?i fi(~x?) + Σ_{j=1}^p ν?j hj(~x?)   [first sum ≤ 0, second sum = 0] (7.89)
   ≤ f0(~x?) (7.90)
   = p?. (7.91)

Because p? = d?, all inequalities in the above chain are actually equalities. This means that ~x? minimizes L(~x, ~λ?, ~ν?) over ~x ∈ Rn. By the main theorem of vector calculus, this implies that ∇~x L(~x?, ~λ?, ~ν?) = ~0, which is the stationarity condition. It also implies that Σ_{i=1}^m λ?i fi(~x?) = 0. But each term in the sum is ≤ 0, so they must all be 0. This means that λ?i fi(~x?) = 0 for each i. This gives the complementary slackness condition. Thus the KKT conditions hold for (~x?, ~λ?, ~ν?).

Theorem 172 (If Convexity Holds, then KKT Conditions are Sufficient for Optimality)
Suppose P is a primal problem with dual D. Suppose that the objective function f0 and constraint functions f1, . . . , fm, h1, . . . , hp of P are differentiable. Suppose that f0, f1, . . . , fm are convex and h1, . . . , hp are affine. Suppose that (x̃, λ̃, ν̃) fulfill the KKT conditions. Then P is convex, strong duality holds, and (x̃, λ̃, ν̃) are optimal primal and dual variables.

Proof. By assumption, x̃ is feasible for P and (λ̃, ν̃) is feasible for D. Since the fi are convex and the hj are affine, P is convex, and

L(~x, ~λ, ~ν) = f0(~x) + Σ_{i=1}^m λi fi(~x) + Σ_{j=1}^p νj hj(~x) (7.92)

is convex in ~x. Since stationarity holds, we have

∇~x L(x̃, λ̃, ν̃) = ~0, (7.93)

so because the Lagrangian is convex in ~x, we have that x̃ minimizes L(~x, λ̃, ν̃) over all ~x ∈ Rn. Thus

g(λ̃, ν̃) = min_{~x∈Rn} L(~x, λ̃, ν̃) (7.94)
         = L(x̃, λ̃, ν̃) (7.95)
         = f0(x̃) + Σ_{i=1}^m λ̃i fi(x̃) + Σ_{j=1}^p ν̃j hj(x̃)   [each term = 0 by complementary slackness and primal feasibility] (7.96)
         = f0(x̃). (7.97)

Thus f0(x̃) = g(λ̃, ν̃), so the duality gap is 0 and strong duality holds, with (x̃, λ̃, ν̃) being optimal primal and dual variables.


In this course, most problems will be convex and strong duality will hold; in this case the above two theorems can
apply and we get that the KKT conditions are equivalent to optimality conditions. This is the source of the intuition
that convex problems are easier to optimize.

Corollary 173 (If Convexity and Strong Duality Hold, then KKT Conditions are Necessary and Sufficient for Optimality). Suppose P is a primal problem with dual D. Suppose strong duality holds for P and D. Suppose that the objective function f0 and constraint functions f1, . . . , fm, h1, . . . , hp for P are differentiable. Suppose that f0, f1, . . . , fm are convex and h1, . . . , hp are affine. Let (~x, ~λ, ~ν) ∈ Rn × Rm × Rp be primal and dual variables. Then (~x, ~λ, ~ν) are optimal primal and dual variables if and only if they fulfill the KKT conditions.

The following is a generic sequence of steps that you can try for any convex optimization problem.

Problem Solving Strategy 174 (Solving Convex Optimization Problems Using KKT Conditions). Suppose you have
a problem P with dual D.

1. Show that P is convex and the objective and constraint functions are differentiable.

2. Show Slater’s condition holds and/or that strong duality holds for P and D.

3. Compute the KKT conditions for P and D.

4. Solve for the optimal primal and dual variables using the KKT conditions.

Even for more complicated problems where it is not possible to solve the KKT conditions analytically, many algo-
rithms can be interpreted as solving the KKT conditions under various conditions.

Example 175 (Example 161, with KKT Conditions). In this example, we apply the KKT conditions to the problem in
Example 161. Consider the following problem:

p? = min_{x∈R} 5x² (7.98)
     s.t. 3x ≤ 5. (7.99)

Its Lagrangian is

L(x, λ) = 5x² + λ(3x − 5). (7.100)

The problem is convex, and there exists a strictly feasible point in the relative interior of the feasible set, e.g., x = 0 (notice that this happens to be the global optimum, but it does not have to be; x = −1 would have sufficed just as well as a strictly feasible point in the relative interior of the feasible set). Thus Slater's condition holds, strong duality holds, and the KKT conditions are necessary and sufficient for optimality. Now we solve for the global optimum using the KKT conditions.
Let (x̃, λ̃) solve the KKT conditions. Then they must obey:

• Primal feasibility: 3x̃ ≤ 5.

• Dual feasibility: λ̃ ≥ 0.

• Complementary slackness: λ̃(3x̃ − 5) = 0.

• Stationarity: 0 = ∇x L(x̃, λ̃) = 10x̃ + 3λ̃.


Now we solve for λ̃ in terms of x̃. We obtain from stationarity that

λ̃ = −(10/3) x̃. (7.101)

Thus by complementary slackness we have

0 = −(10/3) x̃ (3x̃ − 5), (7.102)

so that

x̃ = 0 or x̃ = 5/3. (7.103)

Supposing that x̃ = 5/3, then we have λ̃ = −50/9, which violates dual feasibility. Thus we must have x̃ = 0, implying λ̃ = 0. And this indeed satisfies all of the KKT conditions. In particular, we can tell from inspection that at least the primal variable x̃ = 0 = x? is globally optimal.
Note that even when solving the KKT system correctly, we (momentarily) ended up with a candidate that would not satisfy the KKT conditions in the first place. This is one way to solve such systems: simplify the system as much as possible, generate two or several candidate solutions (x̃, λ̃), and check again to see which one(s) still make the KKT conditions hold.
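This check-the-candidates step is easy to mechanize. The following sketch (plain Python, no external packages) encodes the four KKT conditions for this toy problem and filters the two candidates found above:

```python
# KKT check for p* = min 5x^2 s.t. 3x <= 5.
# Candidates from complementary slackness: (x, lambda) pairs.
candidates = [(0.0, 0.0), (5.0 / 3.0, -50.0 / 9.0)]

def kkt_ok(x, lam, tol=1e-9):
    primal = 3 * x <= 5 + tol                    # primal feasibility
    dual = lam >= -tol                           # dual feasibility
    comp_slack = abs(lam * (3 * x - 5)) <= tol   # complementary slackness
    stationary = abs(10 * x + 3 * lam) <= tol    # stationarity
    return primal and dual and comp_slack and stationary

print([c for c in candidates if kkt_ok(*c)])     # -> [(0.0, 0.0)]
```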

The following content is optional/out of scope for this semester. Regardless, it may be helpful to read it to
gain context, or get a deeper understanding of various results.

7.5 (OPTIONAL) Conic Duality


Below, we introduce how Lagrange duality can be extended to optimization problems with inequality constraints defined
via generalized inequalities, as defined below. For more details, please see [1, Section 5.9].

Definition 176 (Generalized Inequalities using Convex Cones)
Let K ⊆ Rn be a proper cone. Let int(K) be the interior of K. The generalized inequality induced by K is a relation ⪰K defined on pairs of vectors ~u, ~v ∈ Rn with the following properties:

(a) If ~u − ~v ∈ K, we write ~u ⪰K ~v and ~v ⪯K ~u.

(b) If ~u − ~v ∈ int(K), we write ~u ≻K ~v and ~v ≺K ~u.

Substituting ~0 into the above notation, we get the following notations:

(c) If ~u ∈ K, we write ~u ⪰K ~0.

(d) If −~v ∈ K, we write ~v ⪯K ~0.

(e) If ~u ∈ int(K), we write ~u ≻K ~0.

(f) If −~v ∈ int(K), we write ~v ≺K ~0.

It is possible that neither ~u − ~v ∈ K nor ~v − ~u ∈ K, in which case there is no generalized inequality with respect to K between them.

Example 177. While the above may look scary, we have already seen generalized inequalities in this course. The set of positive semidefinite matrices Sn+ is a proper cone in the set of symmetric matrices Sn. Thus when we write that A ⪰ 0 for a symmetric matrix A, we are really using the generalized inequality with respect to the cone K = Sn+. Correspondingly, we can write A ⪰ B to mean A − B is positive semidefinite.


Generalized inequalities can also help simplify or generalize several concepts introduced earlier in the course. Suppose for example that we have a familiar convex optimization problem:

min_{~x∈Rn} f0(~x) (7.104)
s.t. fi(~x) ≤ 0, ∀i ∈ {1, . . . , m}. (7.105)

(One can also add equality constraints to this problem without issue.) Now consider the set K = Rm+ = {~x ∈ Rm : xi ≥ 0 ∀i ∈ {1, . . . , m}}, i.e., the non-negative orthant. Indeed, K is a proper cone (the proof is an exercise). If we collect all constraints into a single vector-valued constraint function f~ : Rn → Rm given by

f~(~x) = (f1(~x), . . . , fm(~x)) (7.106)

then the constraint f~(~x) ⪯K ~0 means that −f~(~x) ∈ K, that is, each −fi(~x) ≥ 0, or equivalently fi(~x) ≤ 0. These forms of the m constraints are absolutely equivalent. Thus, our original familiar convex optimization problem is equivalent to the following problem:

min_{~x∈Rn} f0(~x) (7.107)
s.t. f~(~x) ⪯K ~0. (7.108)

We will work with generalizations of this problem shortly.

Observe that, for any proper cone K ⊆ Rn and given ~v1, ~v2, ~v3 ∈ Rn, if we have ~v1 ⪰K ~v2 and ~v2 ⪰K ~v3, then ~v1 ⪰K ~v3.
Now we use this generalized inequality system to define more general convex optimization problems. Let f0 : Rn → R be the objective function. For each i ∈ {1, . . . , m}, let f~i : Rn → Rdi be a vector-valued inequality constraint function (we will see shortly how this works), and let Ki ⊆ Rdi be a proper cone (this could be called the constraint cone). For convenience, let d = Σ_{i=1}^m di. Finally, for each j ∈ {1, . . . , p}, let hj : Rn → R be a (usual scalar-valued) equality constraint function.
Consider the following primal optimization problem over Rn with generalized inequality constraints:

problem P: p? = min_{~x∈Rn} f0(~x) (7.109)
           s.t. f~i(~x) ⪯Ki ~0, ∀i ∈ {1, . . . , m}
                hj(~x) = 0, ∀j ∈ {1, . . . , p}.

We aim to derive a theory of generalized Lagrangian duality for this system.


As before, let 1[·] be an indicator function that returns 0 when the input condition is true, and +∞ otherwise. Then, as before, we can look at the unconstrained variant:

p? = min_{~x∈Rn} { f0(~x) + Σ_{i=1}^m 1[f~i(~x) ⪯Ki ~0] + Σ_{j=1}^p 1[hj(~x) = 0] }. (7.110)

Recall that we showed, en route to the Lagrangian duality presented in Section 7.1, that

1[fi(~x) ≤ 0] = max_{λi∈R+} λi fi(~x),    1[hj(~x) = 0] = max_{νj∈R} νj hj(~x). (7.111)


The latter equality will help us here again, but the former will not, and we have to derive an analogue. Indeed, we claim that

1[f~i(~x) ⪯Ki ~0] = max_{~λi∈Rdi : ~λi ⪰Ki? ~0} ~λi> f~i(~x), (7.112)

where Ki? denotes the dual cone of Ki in Rdi. Why is this true? It turns out to be for a similar reason as the more special case above. If f~i(~x) ⪯Ki ~0, then −f~i(~x) ∈ Ki. By definition, for ~λi ∈ Ki?, we must have

0 ≤ ~λi>(−f~i(~x)) = −~λi> f~i(~x) =⇒ ~λi> f~i(~x) ≤ 0. (7.113)

Equality in this case is obtained by selecting ~λi = ~0, which we are assured is legal because, by definition, ~0 ∈ Ki?. Thus we have shown

f~i(~x) ⪯Ki ~0 =⇒ max_{~λi∈Rdi : ~λi ⪰Ki? ~0} ~λi> f~i(~x) = 0. (7.114)

In the other case, suppose f~i(~x) ⋠Ki ~0. Then we have −f~i(~x) ∉ Ki. Since Ki is a proper cone, it is closed and convex, so Ki = (Ki?)?. We thus have −f~i(~x) ∉ (Ki?)?. Namely, there exists some ~λi ∈ Ki? such that ~λi>(−f~i(~x)) < 0, or equivalently ~λi> f~i(~x) > 0. Since Ki? is a cone, it is closed under positive scalar multiplication; thus, for any αi > 0, we have αi~λi ∈ Ki?, so

(αi~λi)> f~i(~x) = αi(~λi> f~i(~x)) > 0. (7.115)

Taking αi → ∞ obtains that the maximum over ~λi is +∞ in this case. Namely, we have shown

f~i(~x) ⋠Ki ~0 =⇒ max_{~λi∈Rdi : ~λi ⪰Ki? ~0} ~λi> f~i(~x) = ∞. (7.116)

Thus, defining ~λ = (~λ1, . . . , ~λm) ∈ Rd1 × · · · × Rdm = Rd for convenience, we have

p? = min_{~x∈Rn} max_{~λ∈Rd, ~ν∈Rp : ~λi ⪰Ki? ~0 ∀i∈{1,...,m}} { f0(~x) + Σ_{i=1}^m ~λi> f~i(~x) + Σ_{j=1}^p νj hj(~x) }. (7.117)

The above discussion inspires the following definition of a generalized Lagrangian L : Rn × Rd × Rp → R for the given primal optimization problem with generalized inequalities.²

L(~x, ~λ, ~ν) = f0(~x) + Σ_{i=1}^m ~λi> f~i(~x) + Σ_{j=1}^p νj hj(~x). (7.118)

As before, we define the dual function g : Rd × Rp → R by

g(~λ, ~ν) = min_{~x∈Rn} L(~x, ~λ, ~ν). (7.119)

Thus, we can write the dual problem of (7.109) as

problem D: d? = max_{~λ∈Rd, ~ν∈Rp} g(~λ, ~ν) (7.120)
           s.t. ~λi ⪰Ki? ~0, ∀i ∈ {1, . . . , m}.

²If the optimization problem under study were over a general inner product space, then the expressions ~λi> f~i(~x) should be replaced with ⟨~λi, f~i(~x)⟩. This occurs primarily in semidefinite programming, where f~i could produce a symmetric matrix as an output. In this case, ~λi would similarly be a symmetric matrix, and the inner product would be the Frobenius inner product.

Note that weak duality, i.e., d? ≤ p? , always holds, by the minimax inequality.
An analog of Slater's condition holds for optimization problems with generalized inequalities induced by proper cones. We first require the following definition, which extends the notion of the convexity of a function to generalized convexity, where the inequalities in the definition of convexity are replaced by generalized inequalities induced by proper cones.

Definition 178 (Generalized Convexity)


Let f~ : Rn → Rd , and let K ⊆ Rd be a proper cone.

(a) We say that f~ is K-convex if for any α ∈ [0, 1] and ~x, ~y ∈ Rn, we have

f~(α~x + (1 − α)~y) ⪯K αf~(~x) + (1 − α)f~(~y). (7.121)

(b) We say that f~ is strictly K-convex if for any α ∈ (0, 1) and ~x, ~y ∈ Rn with ~x ≠ ~y, we have

f~(α~x + (1 − α)~y) ≺K αf~(~x) + (1 − α)f~(~y). (7.122)

Theorem 179 (Generalized Slater's Condition)
Consider a primal problem P:

problem P: p? = min_{~x∈Rn} f0(~x) (7.109)
           s.t. f~i(~x) ⪯Ki ~0, ∀i ∈ {1, . . . , m}
                hj(~x) = 0, ∀j ∈ {1, . . . , p},

where

• the function f0 : Rn → R is convex;

• for each i ∈ {1, . . . , m}, the function f~i : Rn → Rdi is Ki-convex, where Ki ⊆ Rdi is a proper cone;

• for each j ∈ {1, . . . , p}, the function hj : Rn → R is affine.

If there exists any point ~x ∈ relint(Ω) which is strictly feasible, i.e., such that all of the following hold:

• for each i ∈ {1, . . . , m}, we have f~i(~x) ≺Ki ~0;

• and for each j ∈ {1, . . . , p}, we have hj(~x) = 0;

then strong duality holds for P and its dual D, i.e., the duality gap is 0.ᵃ

ᵃAnalogous conditions hold for the general case where the primal optimization problem is defined over a vector space equipped with an inner product and a norm, as in the case of semidefinite programming.

Chapter 8

Types of Optimization Problems

Relevant sections of the textbooks:

• [1] Chapter 4.

• [2] Chapters 9, 10, 12.

8.1 Linear Programs


In this chapter we introduce a taxonomy of common optimization problems that can be efficiently solved through a
variety of ways. We use the notation ~u ≥ ~v to denote ui ≥ vi for all i.
A linear program is one with an affine objective function and affine constraint functions (both inequality and equality constraints). It is the most restrictive (i.e., smallest) class in the taxonomy we present. It has a standard form, given below.

Definition 180 (Linear Program)


A linear program (LP) is an optimization problem with an affine objective and affine constraints. A standard form
linear program is an optimization problem of the following form:a

p? = min_{~x∈Rn} ~c>~x (8.1)
     s.t. A~x = ~y
          ~x ≥ ~0.

ᵃConstant terms have been omitted from the objective function.

There are many equivalent forms of a linear program; in particular, the following proposition (whose proof we leave
as an exercise) can be shown using slack variables.

Proposition 181
Any linear program is equivalent to a standard form linear program.

Putting linear programs in standard form is important because the standard form is commonly accepted by opti-
mization algorithms and implementations. Usually if you provide a linear program that isn’t in standard form to a


solver, it will convert the program to standard form first before running its algorithms. Conversions to-and-from standard
form may increase the number of variables in the problem and the eventual algorithmic complexity of solving it. One
example of this conversion is done below.

Example 182. Suppose we have the linear program

min_{~x∈R2} 2x1 + 4x2 (8.2)
s.t. x1 + x2 ≥ 3 (8.3)
     3x1 + 2x2 = 14 (8.4)
     x1 ≥ 0. (8.5)

To convert this to standard form, first we note that

x1 + x2 ≥ 3 ⇐⇒ −x1 − x2 ≤ −3 (8.6)

which obtains the following system:

min_{~x∈R2} 2x1 + 4x2 (8.7)
s.t. −x1 − x2 ≤ −3 (8.8)
     3x1 + 2x2 = 14 (8.9)
     x1 ≥ 0. (8.10)

Adding a slack variable x3 ≥ 0 to the first constraint yields equality:

min_{~x∈R3} 2x1 + 4x2 (8.11)
s.t. −x1 − x2 + x3 = −3 (8.12)
     3x1 + 2x2 = 14 (8.13)
     x1 ≥ 0 (8.14)
     x3 ≥ 0. (8.15)

The only reason this is not in standard form is that x2 is unconstrained. We can represent any real number as the difference of non-negative numbers; one such construction is x2 = x2⁺ − x2⁻ where x2⁺ = max{x2, 0} and x2⁻ = −min{x2, 0}, but there are many others. Thus we can replace x2 by x4 − x5 and add the constraints that x4 ≥ 0 and x5 ≥ 0, obtaining the problem

min_{~x∈R5} 2x1 + 4x4 − 4x5 (8.16)
s.t. −x1 + x3 − x4 + x5 = −3 (8.17)
     3x1 + 2x4 − 2x5 = 14 (8.18)
     x1 ≥ 0 (8.19)
     x3 ≥ 0 (8.20)
     x4 ≥ 0 (8.21)
     x5 ≥ 0. (8.22)


Finally, we notice that x2 appears in neither the objective nor the constraints and so can be eliminated; relabeling the remaining variables (x3, x4, x5) as (x2, x3, x4) yields a standard form linear program:

min_{~x∈R4} 2x1 + 4x3 − 4x4 (8.23)
s.t. −x1 + x2 − x3 + x4 = −3 (8.24)
     3x1 + 2x3 − 2x4 = 14 (8.25)
     x1 ≥ 0 (8.26)
     x2 ≥ 0 (8.27)
     x3 ≥ 0 (8.28)
     x4 ≥ 0. (8.29)
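In practice, solvers perform conversions like the one above internally. As an illustration (assuming scipy is available), the following sketch feeds the original, non-standard-form LP of this example to scipy.optimize.linprog:

```python
from scipy.optimize import linprog

# The LP from the example, in linprog's input convention
# (minimize c^T x s.t. A_ub x <= b_ub, A_eq x = b_eq, variable bounds).
c = [2, 4]
A_ub = [[-1, -1]]          # x1 + x2 >= 3  <=>  -x1 - x2 <= -3
b_ub = [-3]
A_eq = [[3, 2]]            # 3x1 + 2x2 = 14
b_eq = [14]
bounds = [(0, None),       # x1 >= 0
          (None, None)]    # x2 free (the solver handles the x4 - x5 split)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x, res.fun)      # optimal point and optimal value
```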

Linear programs are convex problems, and in particular the feasible set of a linear program is a convex set.

Proposition 183
Any linear program is a convex optimization problem.

Proposition 184
The dual of the standard form linear program

p? = min_{~x∈Rn} ~c>~x (8.30)
     s.t. A~x = ~y
          ~x ≥ ~0

is

d? = max_{~λ∈Rn, ~ν∈Rm} −~y>~ν (8.31)
     s.t. ~c − ~λ + A>~ν = ~0
          ~λ ≥ ~0.

Proof. The Lagrangian is

L(~x, ~λ, ~ν) = ~c>~x − ~λ>~x + ~ν>(A~x − ~y) (8.32)
            = (~c − ~λ + A>~ν)>~x − ~ν>~y. (8.33)

The dual function is

g(~λ, ~ν) = min_{~x∈Rn} L(~x, ~λ, ~ν) (8.34)
         = min_{~x∈Rn} { (~c − ~λ + A>~ν)>~x − ~ν>~y } (8.35)
         = { −~ν>~y, if ~c − ~λ + A>~ν = ~0
           { −∞, otherwise. (8.37)


The dual problem is

d? = max_{~λ∈Rn, ~ν∈Rm} g(~λ, ~ν) (8.38)
     s.t. ~λ ≥ ~0. (8.39)

Expanding this out and adding the domain constraint, we get

d? = max_{~λ∈Rn, ~ν∈Rm} −~ν>~y (8.40)
     s.t. ~c − ~λ + A>~ν = ~0 (8.41)
          ~λ ≥ ~0 (8.42)

as desired.
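As a numerical sanity check of this proposition (assuming scipy is available; the data below is made up but chosen so the primal is feasible and bounded), one can solve a small standard form LP and its dual and compare optimal values:

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0, 1.0]])
y = np.array([1.0])
c = np.array([1.0, 2.0, 3.0])

primal = linprog(c, A_eq=A, b_eq=y, bounds=[(0, None)] * 3)
# Dual: max -y^T nu s.t. lambda = c + A^T nu >= 0, i.e. -A^T nu <= c, nu free.
dual = linprog(y, A_ub=-A.T, b_ub=c, bounds=[(None, None)])
print(primal.fun, -dual.fun)   # both equal 1.0 here (strong duality)
```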

One very relevant question is how to solve linear programs efficiently. There are several methods available to us —
for example, any constrained convex optimization solver, such as projected gradient descent, will solve the problem,
given an appropriate learning rate and efficient projection method. The algorithm most used in practice is the interior
point method; we will learn the basics of interior point methods later in this course.¹ But there is one algorithm which
was used for many years in practice, that truly exploits the structure of linear programs to efficiently solve them. It is
called the simplex algorithm, which was invented by George Dantzig in 1947.
The core idea behind the simplex algorithm is that at least one optimal point of a linear program is a “vertex”
of its feasible set, so long as this feasible set is bounded. We need to prove this idea, so far stated informally, and to do
this we will first need to characterize the feasible set of a linear program.

Definition 185 (Polyhedron, Polygon)


A polyhedron is an intersection of a finite number of half-spaces. A polygon is a bounded polyhedron.

From the definition of a linear program, we see that its feasible region must be a polyhedron. For the standard form linear program, we can check explicitly that the feasible set is the intersection of the three classes of half-spaces {~x ∈ Rn | ~ai>~x ≤ yi}, {~x ∈ Rn | ~ai>~x ≥ yi}, and {~x ∈ Rn | xj ≥ 0}, where the ~ai> are the rows of A, the yi are the entries of ~y, and the xj are the entries of ~x. This feasible region can be unbounded or bounded; if it is bounded, it will be a polygon.

Definition 186 (Extreme Point, Vertex)


Let K ⊆ Rn be a set. We say that ~x ∈ K is an extreme point of K if there does not exist ~y , ~z ∈ K \ {~x} and
θ ∈ [0, 1] such that ~x = θ~y + (1 − θ)~z.
An extreme point of a polyhedron is called a vertex of the polyhedron.

If the feasible set of a linear program is an unbounded polyhedron, then there are examples where the optimal value
is not achieved at a vertex, as demonstrated in the following example.

Example 187. Consider the following linear program over variables ~x ∈ R2 :


" #
h i x
(8.43)
1
min 0 −1
x1 ,x2 ∈R x2
¹Talk to your TA Tarun if you want to learn more about interior point methods!


s.t. x1 ≥ 0
x2 ≥ 0
x1 = 1.
" #
. 1
Define ~xα = . One can check that ~xα is feasible for all α ≥ 0, and that the objective value at ~xα is −α. By sending
α
α → ∞, we get p? = −∞, and the optimum is not achieved at a vertex (or indeed by any ~x ∈ R2 ).
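A solver detects this situation rather than returning a vertex. For instance, the following sketch (assuming scipy is available) feeds this LP to scipy.optimize.linprog, which reports an "unbounded" status (status code 3):

```python
from scipy.optimize import linprog

# The unbounded LP above: min -x2 s.t. x1 = 1, x1 >= 0, x2 >= 0.
res = linprog(c=[0, -1], A_eq=[[1, 0]], b_eq=[1],
              bounds=[(0, None), (0, None)])
print(res.status, res.message)   # status 3: problem is unbounded
```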

With this understanding, we now seek to prove the main idea we had earlier. There are several proofs, but the
cleanest uses the following intuitive fact:

Proposition 188
A polygon has finitely many vertices and is the convex hull of its vertices.

Unfortunately, the proof is (surprisingly?) quite complicated, so we omit it. A complete proof is in Ziegler's Lectures on Polytopes, for example.

Theorem 189 (Main Theorem of Linear Programming)


Consider the standard form linear program:

p? = min_{~x∈Rn} ~c>~x (8.44)
     s.t. A~x = ~y
          ~x ≥ ~0.

Suppose that the feasible set Ω = {~x ∈ Rn | A~x = ~y, ~x ≥ ~0} is bounded. Then the optimal value is achieved at a vertex.

Namely, one can find an optimal point which is a vertex. There may be optimal points that are not vertices. The simplest
example is to set the objective as ~c = ~0, so every feasible point is optimal with objective value 0, but there are other
examples which are a bit more complicated to set up.

Proof. Since the feasible set Ω is a bounded polyhedron, it is a polygon, and so it is the convex hull of its vertices, say ~v1, . . . , ~vk. Thus any ~x ∈ Ω can be written as a convex combination of the vertices ~vi, namely,

~x = Σ_{i=1}^k αi ~vi (8.45)

where each αi ≥ 0 and Σ_{i=1}^k αi = 1. It then follows that

p? = min_{α1,...,αk∈R} Σ_{i=1}^k αi (~c>~vi) (8.46)
     s.t. αi ≥ 0, ∀i ∈ {1, . . . , k} (8.47)
          Σ_{i=1}^k αi = 1. (8.48)


Now we have

Σ_{i=1}^k αi (~c>~vi) ≥ Σ_{i=1}^k αi ( min_{j∈{1,...,k}} ~c>~vj ) (8.49)
                     = ( min_{j∈{1,...,k}} ~c>~vj ) ( Σ_{i=1}^k αi ) (8.50)
                     = min_{j∈{1,...,k}} ~c>~vj, (8.51)

where the last step uses Σ_{i=1}^k αi = 1.

Let i? ∈ {1, . . . , k} be an index such that ~c>~vi? achieves the above minimum, i.e., ~c>~vi? = minj∈{1,...,k} ~c>~vj . Then
the above lower bound is achieved when αi? = 1 and αi = 0 for i 6= i? , for example. Thus ~vi? is an optimal point for
the original linear program, concluding the proof.

This theorem says that to solve a linear program, we only need to check the vertices of the constraint polyhedron. This reduces an optimization problem over Rn to an optimization problem over the finite set of vertices. This reduction motivates a “greedy-like heuristic” solver for linear programs with bounded feasible sets Ω, which is called the simplex method. The simplex method is the following procedure:

• Start at a vertex ~v of Ω.

• While there is a neighboring vertex ~w with ~c>~v > ~c>~w, move to it, i.e., set ~v ← ~w.

• When there are no neighboring vertices with better objective value, stop and return ~v.
There are (rather more technical) modifications one can make to this algorithm to solve linear programs with unbounded
feasible sets. But the main idea is just the same as gradient descent: iteratively search locally for another point with
better objective value, and move to it.
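The simplex method itself requires bookkeeping (bases, pivoting) that we do not develop here, but the underlying theorem is easy to illustrate by brute force. The following sketch (assuming numpy; the polygon data is made up, and the approach is exponential in the number of constraints, so it is purely illustrative) enumerates candidate vertices as intersections of constraint boundaries and picks the best feasible one:

```python
import itertools
import numpy as np

# Minimize c^T x over the polygon {x : Ax <= b} by checking its vertices.
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 0], [0, 1.0], [-1.0, 0], [0, -1.0], [1.0, 1.0]])
b = np.array([2.0, 2.0, 0.0, 0.0, 3.0])   # box plus a diagonal cut

vertices = []
for i, j in itertools.combinations(range(len(b)), 2):
    try:
        v = np.linalg.solve(A[[i, j]], b[[i, j]])  # intersect two boundaries
    except np.linalg.LinAlgError:
        continue                                   # parallel constraints
    if np.all(A @ v <= b + 1e-9):                  # keep feasible points only
        vertices.append(v)

best = min(vertices, key=lambda v: c @ v)
print(best, c @ best)   # an optimal vertex and the optimal value
```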

8.2 Quadratic Programs

Definition 190 (Quadratic Program)


A quadratic program (QP) is an optimization problem with a quadratic objective and affine constraints. A standard
form quadratic program is an optimization problem of the following form:
p? = min_{~x∈Rn} (1/2) ~x>H~x + ~c>~x (8.52)
     s.t. A~x ≤ ~y
          C~x = ~z,

where H ∈ Sn.

In the standard form, we do not lose any generality by enforcing H ∈ Sn. In particular, for any H we have

(1/2) ~x>H~x + ~c>~x = (1/2) ~x>((H + H>)/2)~x + ~c>~x (8.53)

and the matrix (H + H>)/2 (i.e., the symmetric part of H) is always symmetric. So if we have a non-symmetric H we can just replace it with its symmetric part, and thus obtain a standard form quadratic program.
Quadratic programs may or may not be convex.


Proposition 191
Consider the following standard form quadratic program:

p? = min_{~x∈Rn} (1/2) ~x>H~x + ~c>~x (8.54)
     s.t. A~x ≤ ~y
          C~x = ~z,

where H ∈ Sn. The following are equivalent:

(a) The problem is convex.

(b) H ∈ Sn+.

Proof. The gradient and Hessian of the objective are

∇( (1/2) ~x>H~x + ~c>~x ) = (1/2)(H + H>)~x + ~c = H~x + ~c (8.55)
∇²( (1/2) ~x>H~x + ~c>~x ) = H (8.56)

so the objective function is convex if and only if H ∈ Sn+. The constraint set is always convex because it is defined by a set of linear equations and inequalities. Thus the problem is convex if and only if H ∈ Sn+.
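Concretely, condition (b) can be checked numerically via the eigenvalues of H; a minimal sketch (assuming numpy, with made-up data):

```python
import numpy as np

# A QP objective with this H is convex iff the smallest eigenvalue of H is >= 0.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
print(np.linalg.eigvalsh(H).min() >= 0)   # True -> the QP is convex
```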

Example 192. Let us consider the unconstrained quadratic program

p? = min_{~x∈Rn} { (1/2) ~x>H~x + ~c>~x } (8.57)

where H ∈ Sn, and solve it in a variety of cases, namely that where H ∉ Sn+; H ∈ Sn+ with ~c ∈ N(H) \ {~0}; and H ∈ Sn+ with ~c ∈ N(H)⊥ = R(H>) = R(H).


Case 1. Suppose that H ∉ Sn+. Then H has a negative eigenvalue λ; let ~v be any corresponding unit eigenvector. Then, if we choose ~xt = t·~v, we get

p? ≤ lim_{t→∞} { (1/2) ~xt>H~xt + ~c>~xt } (8.58)
   = lim_{t→∞} { (1/2) t² ~v>H~v + t ~c>~v }. (8.59)

To bound the terms inside the limit, we compute

~v>H~v = ~v>(λ~v) = λ~v>~v = λ (8.60)
~c>~v ≤ ‖~c‖2 ‖~v‖2 = ‖~c‖2. (8.61)

Thus we have

p? ≤ lim_{t→∞} { (1/2) λt² + t ‖~c‖2 }. (8.62)

Since λ < 0, the term inside the limit is a concave (i.e., downward facing) quadratic function of t, and so its limit as t → ∞ is −∞. Thus p? ≤ −∞ so p? = −∞.


Case 2. Suppose that H ∈ Sn+, and suppose that ~c ∈ N(H) \ {~0}. Then H has (at least) an eigenvalue λ equal to 0, and in particular, by the spectral theorem ~c can be written as a linear combination of eigenvectors of H with eigenvalue 0. Let ~v be any unit eigenvector with eigenvalue 0 such that ~c>~v ≠ 0. Let ~xt = −t · sgn(~c>~v) · ~v. Then

p? ≤ lim_{t→∞} { (1/2) ~xt>H~xt + ~c>~xt } (8.63)
   = lim_{t→∞} { (1/2) t² ~v>H~v − t · sgn(~c>~v) · ~c>~v } (8.64)
   = lim_{t→∞} { (1/2) t² ~v>~0 − t · |~c>~v| } (8.65)
   = lim_{t→∞} ( −t · |~c>~v| ) (8.66)
   = lim_{t→∞} t · ( −|~c>~v| )   [the factor −|~c>~v| is < 0] (8.67)
   = −∞. (8.68)

Thus p? ≤ −∞ so p? = −∞.

Case 3. Suppose that H ∈ Sn+ with ~c ∈ R(H). Then there is some ~x0 such that ~c = −H~x0. Rewriting the objective, we obtain

(1/2) ~x>H~x + ~c>~x = (1/2) ~x>H~x − ~x0>H~x (8.69)
 = (1/2) ~x>H~x − ~x0>H~x + (1/2) ~x0>H~x0 − (1/2) ~x0>H~x0 (8.70)
 = (1/2) ( ~x>H~x − 2~x0>H~x + ~x0>H~x0 ) − (1/2) ~x0>H~x0 (8.71)
 = (1/2) (~x − ~x0)>H(~x − ~x0) − (1/2) ~x0>H~x0. (8.72)

Since H ∈ Sn+, the minimum is attained at any ~x such that ~x − ~x0 ∈ N(H). One can write this as ~x ∈ ~x0 + N(H). A particular solution in terms of problem parameters is ~x = −H†~c, where H† is the Moore-Penrose pseudoinverse of H. Recall that we discussed the Moore-Penrose pseudoinverse in more generality in homework, where we derived the solution to the least-norm least-squares problem; but one can show that if H = UΛU> then H† = UΛ†U>, where Λ† is the diagonal matrix whose entries are Λ†ii = 1/Λii if Λii ≠ 0, and Λ†ii = 0 if Λii = 0.
The previous example shows that we can solve unconstrained quadratic programs directly and read off the solutions.
It turns out that one can transform any quadratic program with equality constraints into an unconstrained quadratic
program. So really, this analysis encapsulates a huge class of quadratic programs.
Computing the dual of a quadratic program has a similar number of cases; it is an exercise which is left to homework.
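Here is a minimal numerical sketch of Case 3 (assuming numpy; H and ~c below are made-up data constructed so that H is PSD and ~c ∈ R(H)), using the pseudoinverse formula above:

```python
import numpy as np

# Minimize (1/2) x^T H x + c^T x for PSD H with c in the range of H.
U = np.linalg.qr(np.random.default_rng(1).standard_normal((3, 3)))[0]
H = U @ np.diag([2.0, 1.0, 0.0]) @ U.T   # PSD with a nontrivial null space
c = -H @ np.array([1.0, -2.0, 0.5])      # guarantees c is in R(H)

x_star = -np.linalg.pinv(H) @ c          # one minimizer: x = -H^+ c
print(np.allclose(H @ x_star + c, 0))    # stationarity Hx + c = 0 -> True
```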

Example 193 (Linear-Quadratic Regulator). Suppose we have a discrete-time dynamical system, of the form

~x_{t+1} = A~xt + B~ut, ∀t ≥ 0 (8.73)
~x0 = ξ~. (8.74)

Then one can show that

~xt = A^t ξ~ + Σ_{k=0}^{t−1} A^{t−k−1} B~uk. (8.75)


For a fixed terminal time T, we want to reach the goal state ~g. Namely, we want to solve the problem

min_{~x0,...,~xT, ~u0,...,~u_{T−1}} ‖~xT − ~g‖2² + Σ_{k=0}^{T−1} ‖~uk‖2² (8.76)
s.t. ~x_{t+1} = A~xt + B~ut, ∀t ∈ {0, 1, . . . , T − 1} (8.77)
     ~x0 = ξ~. (8.78)

This is a quadratic program since the objective function is a quadratic function of the ~xt and ~ut, and the constraints are affine equations relating the ~xt and the ~ut.
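A minimal sketch of this reach-a-goal QP (assuming cvxpy and numpy are available; the system matrices, initial state, goal, and horizon below are all made up):

```python
import cvxpy as cp
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # made-up double-integrator-like system
B = np.array([[0.0], [0.1]])
xi = np.array([0.0, 0.0])                # initial state
g = np.array([1.0, 0.0])                 # goal state
T = 20

x = cp.Variable((T + 1, 2))              # state trajectory
u = cp.Variable((T, 1))                  # control trajectory
cost = cp.sum_squares(x[T] - g) + cp.sum_squares(u)
constraints = [x[0] == xi]
constraints += [x[t + 1] == A @ x[t] + B @ u[t] for t in range(T)]
cp.Problem(cp.Minimize(cost), constraints).solve()
print(x.value[T])                        # terminal state, near the goal g
```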

As a last note, problems with quadratic objectives and quadratic inequality constraints are called quadratically
constrained quadratic programs (QCQPs). Like quadratic programs, QCQPs can be convex or non-convex.

8.3 Quadratically-Constrained Quadratic Programs

Definition 194 (Quadratically-Constrained Quadratic Program)


A quadratically-constrained quadratic program (QCQP) is an optimization problem with a quadratic objective and
quadratic constraints. A standard form quadratically constrained quadratic program is an optimization problem
of the following form:
p? = min_{~x∈Rn} (1/2) ~x>H~x + ~c>~x (8.79)
     s.t. (1/2) ~x>Pi~x + ~bi>~x + ci ≤ 0, ∀i ∈ {1, . . . , m}
          (1/2) ~x>Qi~x + d~i>~x + fi = 0, ∀i ∈ {1, . . . , p},

where H, P1, . . . , Pm, Q1, . . . , Qp ∈ Sn.

Proposition 195
Consider the following standard form quadratically-constrained quadratic program:
p? = min_{~x∈Rn} (1/2) ~x>H~x + ~c>~x (8.80)
     s.t. (1/2) ~x>Pi~x + ~bi>~x + ci ≤ 0, ∀i ∈ {1, . . . , m}
          (1/2) ~x>Qi~x + d~i>~x + fi = 0, ∀i ∈ {1, . . . , p},

where H, P1, . . . , Pm, Q1, . . . , Qp ∈ Sn. If H, P1, . . . , Pm ∈ Sn+ and Q1 = · · · = Qp = 0, then the problem is convex.

Proof. Left as exercise.

8.4 Second-Order Cone Programs


There is one more broad class of problems that we consider in this course, called second-order cone programs (SOCPs).
They are among the broadest class of problems that we can efficiently solve using algorithms such as the interior point


method, which we may discuss later in the course.

Definition 196 (Second-Order Cone Program)


A second-order cone program is an optimization problem with a linear objective and affine and “second-order
cone constraints”, i.e., constraints which say that an affine function of ~x is contained in the second-order cone. A
standard form second-order cone program is an optimization problem of the following form:

p? = min_{~x∈Rn} ~c>~x (8.81)
     s.t. ‖Ai~x − ~yi‖2 ≤ ~bi>~x + zi, ∀i ∈ {1, . . . , m}. (8.82)

Second-order cone constraints are strictly broader than affine constraints; to encode an affine constraint Ai~x = ~yi as a second-order cone constraint, pick the corresponding ~bi = ~0 and zi = 0. This makes the constraint ‖Ai~x − ~yi‖2 ≤ 0, or equivalently Ai~x = ~yi.

Proposition 197
Second-order cone problems are convex optimization problems.

Proof. Each second-order cone constraint ‖Ai~x − ~yi‖2 ≤ ~bi>~x + zi can be alternatively formulated as constraining the tuple (Ai~x − ~yi, ~bi>~x + zi) ∈ Rdi+1 to lie within the second-order cone in Rdi+1. But this tuple is an affine transformation of ~x, in particular

( Ai~x − ~yi , ~bi>~x + zi ) = [ Ai ; ~bi> ] ~x + ( −~yi , zi ). (8.83)

Since the second-order cone is convex and the tuple is an affine transformation of ~x, it follows that {~x ∈ Rn | ‖Ai~x − ~yi‖2 ≤ ~bi>~x + zi} is a convex set. Thus the feasible set is convex (as the intersection of convex sets). The objective function is linear in ~x, so the second-order cone problem is convex.

Example 198. Consider the following problem:


p? = min_{~x∈Rn} Σ_{i=1}^m ‖Ai~x − ~yi‖2. (8.84)

One can formulate this as a second-order cone program by using slack variables:

p? = min_{~x∈Rn, ~s∈Rm} Σ_{i=1}^m si (8.85)
     s.t. ‖Ai~x − ~yi‖2 ≤ si, ∀i ∈ {1, . . . , m}. (8.86)

The following problem is also very related:

p? = min_{~x∈Rn} max_{i∈{1,...,m}} ‖Ai~x − ~yi‖2. (8.87)

We can use a similar slack variable reformulation to formulate this problem as a second-order cone program:

p? = min_{~x∈Rn, s∈R} s (8.88)
     s.t. ‖Ai~x − ~yi‖2 ≤ s, ∀i ∈ {1, . . . , m}. (8.89)


These problems can be formulated in terms of route planning – more specifically, finding the route which minimizes
the total length between waypoints (in the first problem), or the route which minimizes the maximum length between
waypoints (in the second problem).
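A minimal sketch of the first (total-length) problem in cvxpy, which accepts norms directly and performs a slack-variable reformulation like the one above internally (assuming cvxpy and numpy are available; the waypoint data is made up, and each Ai = I so each term is a plain distance):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
A = [np.eye(2) for _ in range(4)]                    # Ai = I: plain distances
y = [rng.uniform(0, 10, size=2) for _ in range(4)]   # made-up waypoints

x = cp.Variable(2)
total = sum(cp.norm(Ai @ x - yi, 2) for Ai, yi in zip(A, y))
cp.Problem(cp.Minimize(total)).solve()
print(x.value)   # geometric median of the waypoints in this Ai = I case
```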

Example 199 (LPs, QPs, and QCQPs as SOCPs). One can see how LPs are QPs and how QPs are QCQPs, because
in each transition the set of properties becomes more permissive — first a linear objective can become a quadratic
objective, then linear constraints can become quadratic constraints. It is less clear how LPs, QPs, and QCQPs are
SOCPs. In this example we derive a way to write QCQPs as SOCPs, which is also applicable to LPs and QPs (since
LPs and QPs are QCQPs).
Consider a QCQP of the form
p? = min_{~x∈Rn} (1/2) ~x>H~x + ~c>~x (8.90)
     s.t. (1/2) ~x>Pi~x + ~bi>~x + ci ≤ 0, ∀i ∈ {1, . . . , m}
          C~x = ~z,

where H, P1 , . . . , Pm ∈ Sn+ . We use the epigraph reformulation to obtain

p? = min_{~x∈Rn, t∈R} t + ~c>~x (8.91)
     s.t. (1/2) ~x>Pi~x + ~bi>~x + ci ≤ 0, ∀i ∈ {1, . . . , m}
          (1/2) ~x>H~x ≤ t
          C~x = ~z.

Let us try to convert the quadratic constraint

(1/2) ~x>Pi~x + ~bi>~x + ci ≤ 0 (8.92)

into a second-order cone constraint. Notice that this constraint is equivalent to

~x>Pi~x + 2(~bi>~x + ci) ≤ 0. (8.93)

In order to write each term as a square, we first write ~x>Pi~x = ‖Pi^{1/2}~x‖2². We now use a difference of squares identity:

(u + v)² − (u − v)² = 4uv. (8.94)

To apply this to the above formula, we plug in u = 1/2 and v = ~bi>~x + ci, to get

2(~bi>~x + ci) = (1/2 + ~bi>~x + ci)² − (1/2 − ~bi>~x − ci)². (8.95)

This gives us the constraint

‖Pi^{1/2}~x‖2² + (1/2 + ~bi>~x + ci)² − (1/2 − ~bi>~x − ci)² ≤ 0, (8.96)

which is equivalent to

‖Pi^{1/2}~x‖2² + (1/2 + ~bi>~x + ci)² ≤ (1/2 − ~bi>~x − ci)². (8.97)


Now we would like to take square roots and write things in terms of the ℓ2-norm. For this, we need to show that 1/2 − ~bi>~x − ci ≥ 0. This follows because, since Pi is positive semidefinite, we have 1/2 ≥ 0 ≥ −(1/2)~x>Pi~x, and so

1/2 − ~bi>~x − ci ≥ −(1/2)~x>Pi~x − ~bi>~x − ci ≥ 0, (8.98)

where the last inequality is a rearrangement of the constraint (8.93). Now, taking square roots and writing things in terms of the ℓ2-norm, we have

‖( Pi^{1/2}~x , 1/2 + ~bi>~x + ci )‖2 ≤ 1/2 − ~bi>~x − ci. (8.99)

Thus we can write the QCQP as

p? = min_{~x∈Rn, t∈R} t + ~c>~x (8.100)
     s.t. ‖( Pi^{1/2}~x , 1/2 + ~bi>~x + ci )‖2 ≤ 1/2 − ~bi>~x − ci, ∀i ∈ {1, . . . , m}
          ‖( H^{1/2}~x , 1/2 − t )‖2 ≤ 1/2 + t
          ‖C~x − ~z‖2 ≤ 0.

Below, we establish that the dual of an SOCP is an SOCP. This fact can either be proved via conic duality, or proved
directly.

Theorem 200
Let ~c ∈ Rn , and for i ∈ {1, . . . , m} let Ai ∈ Rdi ×n , ~yi ∈ Rdi , ~bi ∈ Rn , and zi ∈ R. The dual of the following
SOCP in standard form:

p? = min_{~x∈Rn} ~c>~x (8.101)
     s.t. ‖Ai~x − ~yi‖2 ≤ ~bi>~x + zi, ∀i ∈ {1, . . . , m} (8.102)

can be formulated as an SOCP in standard form.

Proof via Conic Duality. Let Ki = {(~u, r) ∈ Rdi × R | ‖~u‖2 ≤ r} denote the second-order cone in Rdi+1, and let d = Σ_{i=1}^m di. Then the standard form SOCP can be written as:

min_{~x∈Rn} ~c>~x (8.103)
s.t. −(Ai~x − ~yi, ~bi>~x + zi) ⪯Ki ~0, ∀i ∈ {1, . . . , m}. (8.104)

The Lagrangian L : Rn × Rd × Rm → R can thus be defined as follows. For each ~x ∈ Rn, ~λ = (~λ1, . . . , ~λm) ∈ Rd (with ~λi ∈ Rdi for each i ∈ {1, . . . , m}), and ~µ ∈ Rm:

L(~x, ~λ, ~µ) = ~c>~x − Σ_{i=1}^m [ ~λi>(Ai~x − ~yi) + µi(~bi>~x + zi) ] (8.105)
            = ( ~c − Σ_{i=1}^m (Ai>~λi + µi~bi) )>~x + Σ_{i=1}^m (~λi>~yi − µizi). (8.106)


Next, define the dual function g : Rd × Rm → R by minimizing over the primal variable ~x ∈ Rn:

g(~λ, ~µ) = min_{~x∈Rn} L(~x, ~λ, ~µ) (8.107)
         = { Σ_{i=1}^m (~λi>~yi − µizi), if Σ_{i=1}^m (Ai>~λi + µi~bi) = ~c
           { −∞, otherwise. (8.108)

The last equality follows by noticing that the objective is linear in ~x; if its coefficient ~c − Σ_{i=1}^m (Ai>~λi + µi~bi) = ~0, then the objective value is the sum Σ_{i=1}^m (~λi>~yi − µizi) regardless of the value of ~x, while if ~c − Σ_{i=1}^m (Ai>~λi + µi~bi) ≠ ~0 then we can make the objective value as low as we want by picking ~x appropriately. For instance, let K > 0 be a large positive number; then

~x = −K ( ~c − Σ_{i=1}^m (Ai>~λi + µi~bi) ) =⇒ L(~x, ~λ, ~µ) = −K ‖~c − Σ_{i=1}^m (Ai>~λi + µi~bi)‖2² + Σ_{i=1}^m (~λi>~yi − µizi) (8.109)

which we can drive down to −∞ by increasing K to +∞. Thus, the dual problem is given by:
max_{~λ∈Rd, ~µ∈Rm} Σ_{i=1}^m (~λi>~yi − µizi) (8.110)
s.t. Σ_{i=1}^m (Ai>~λi + µi~bi) = ~c, (8.111)
     (~λi, µi) ⪰Ki? ~0, ∀i ∈ {1, . . . , m}, (8.112)

where the last line uses the fact that for each Ki its conic dual is itself, or in standard SOCP form:

max_{~λ∈Rd, ~µ∈Rm} Σ_{i=1}^m (~λi>~yi − µizi) (8.113)
s.t. ‖Σ_{i=1}^m (Ai>~λi + µi~bi) − ~c‖2 ≤ 0 (8.114)
     ‖~λi‖2 ≤ µi, ∀i ∈ {1, . . . , m}. (8.115)

Direct Proof. We re-iterate the SOCP to take the dual of:

p? = min_{~x∈Rn} ~c>~x (8.116)
     s.t. ‖Ai~x − ~yi‖2 ≤ ~bi>~x + zi, ∀i ∈ {1, . . . , m}. (8.117)

We add some variables to simplify. Namely, we introduce ~ui ∈ Rdi and wi ∈ R for each i ∈ {1, . . . , m}. For convenience, we define ~u = (~u1, . . . , ~um) ∈ Rd1 × · · · × Rdm = Rd, where again d = Σ_{i=1}^m di, and also define ~w = (w1, . . . , wm) ∈ Rm. With these definitions, the SOCP can be written as

p? = min_{~x∈Rn, ~u∈Rd, ~w∈Rm} ~c>~x (8.118)
     s.t. ‖~ui‖2 ≤ wi, ∀i ∈ {1, . . . , m} (8.119)
          ~ui = Ai~x − ~yi, ∀i ∈ {1, . . . , m} (8.120)


          wi = ~bi>~x + zi, ∀i ∈ {1, . . . , m}. (8.121)

We can thus define a Lagrangian for this system, say with dual variables ~λ ∈ Rm, ~η ∈ Rd (with ~ηi ∈ Rdi for each i) and ~ν ∈ Rm. We have

L(~x, ~u, ~w, ~λ, ~η, ~ν) = ~c>~x + Σ_{i=1}^m λi(‖~ui‖2 − wi) + Σ_{i=1}^m ~ηi>(~ui − Ai~x + ~yi) + Σ_{i=1}^m νi(wi − ~bi>~x − zi) (8.122)
 = ( ~c − Σ_{i=1}^m (Ai>~ηi + νi~bi) )>~x + Σ_{i=1}^m (λi‖~ui‖2 + ~ηi>~ui) + Σ_{i=1}^m (−λi + νi)wi + Σ_{i=1}^m (~ηi>~yi − νizi). (8.123)

Now define the dual function g : Rm × Rd × Rm → R by minimizing over the primal variables (~x, ~u, ~w) ∈ Rn × Rd × Rm:

g(~λ, ~η, ~ν) = min_{~x∈Rn, ~u∈Rd, ~w∈Rm} L(~x, ~u, ~w, ~λ, ~η, ~ν) (8.124)
 = { Σ_{i=1}^m (~ηi>~yi − νizi), if ~c = Σ_{i=1}^m (Ai>~ηi + νi~bi), ‖~ηi‖2 ≤ λi ∀i ∈ {1, . . . , m}, and λi = νi ∀i ∈ {1, . . . , m}
   { −∞, otherwise. (8.125)

The last equality looks complicated and a bit magical, but we methodically justify it here.

(a) The Lagrangian is linear in ~x, indeed having the form

L = ( ~c − Σ_{i=1}^m (Ai>~ηi + νi~bi) )>~x + other terms not involving ~x, (8.126)

so unless the coefficient ~c − Σ_{i=1}^m (Ai>~ηi + νi~bi) is ~0, we can make the Lagrangian arbitrarily negative by varying ~x while keeping ~u and ~w fixed. For instance, let K > 0 be a large positive number. Then

~x = −K ( ~c − Σ_{i=1}^m (Ai>~ηi + νi~bi) ) =⇒ L = −K ‖~c − Σ_{i=1}^m (Ai>~ηi + νi~bi)‖2² + other terms not involving ~x (8.127)

which we can drive down to −∞ by sending K → ∞. On the other hand, if ~c − Σ_{i=1}^m (Ai>~ηi + νi~bi) = ~0, then the first term in the Lagrangian is 0 regardless of the value of ~x. Thus, we have proved

min_{~x∈Rn} L = { Σ_{i=1}^m (λi‖~ui‖2 + ~ηi>~ui) + Σ_{i=1}^m (−λi + νi)wi + Σ_{i=1}^m (~ηi>~yi − νizi), if ~c = Σ_{i=1}^m (Ai>~ηi + νi~bi)
               { −∞, otherwise. (8.128)

(b) Suppose that ~c = Σ_{i=1}^m (Ai>~ηi + νi~bi). The Lagrangian has the form

min_{~x∈Rn} L = Σ_{i=1}^m (λi‖~ui‖2 + ~ηi>~ui) + other terms not involving ~u. (8.129)
x∈R
~
i=1


Towards minimizing this expression over ~u, we aim to solve the problem

min_{~ui∈Rdi} (λi‖~ui‖2 + ~ηi>~ui), (8.130)

and collect the results for each i at the end. At first glance, it may seem hard to imagine this term blowing up at all. Towards finding out a possible blow-up case, if any, we use Cauchy-Schwarz to try to make the sum as small as possible. In particular, by Cauchy-Schwarz we have

λi‖~ui‖2 + ~ηi>~ui ≥ λi‖~ui‖2 − ‖~ηi‖2‖~ui‖2 = (λi − ‖~ηi‖2)‖~ui‖2 (8.131)

with equality when ~ui points in the opposite direction from ~ηi, that is, ~ui = −K~ηi for some K ≥ 0. With this value of ~ui (for varying K → ∞) we shall try to make the Lagrangian go to −∞. Indeed, in this case, we have

λi‖~ui‖2 + ~ηi>~ui = K(λi − ‖~ηi‖2)‖~ηi‖2. (8.132)

First suppose that ‖~ηi‖2 = 0. Since λi ≥ 0 in the Lagrangian formulation, we must have λi ≥ ‖~ηi‖2, as indicated in the original equality. The optimal ~ui is ~ui = −K~ηi = ~0 (independently of the value of K), at which point the term in the Lagrangian becomes 0. We now deal with the non-edge case, assuming that ~ηi ≠ ~0.
Suppose that λi − ‖~ηi‖2 < 0. Then by sending K → ∞ with this choice of ~ui = −K~ηi we drive the Lagrangian to −∞. On the other hand, if λi − ‖~ηi‖2 ≥ 0, then the minimizing choice for K is K = 0, so that ~ui = ~0, and the term in the Lagrangian becomes 0. Thus,

min_{~ui∈Rdi} (λi‖~ui‖2 + ~ηi>~ui) = { 0, if λi ≥ ‖~ηi‖2
                                    { −∞, otherwise. (8.133)

Applying this logic to each i ∈ {1, . . . , m}, we obtain

min_{~x∈Rn, ~u∈Rd} L = { Σ_{i=1}^m (−λi + νi)wi + Σ_{i=1}^m (~ηi>~yi − νizi), if ~c = Σ_{i=1}^m (Ai>~ηi + νi~bi) and λi ≥ ‖~ηi‖2 ∀i ∈ {1, . . . , m}
                       { −∞, otherwise. (8.134)

(c) Suppose that ~c = Σ_{i=1}^m (Ai>~ηi + νi~bi) and λi ≥ ‖~ηi‖2 for each i. Then the Lagrangian has the form

min_{~x∈Rn, ~u∈Rd} L = Σ_{i=1}^m (νi − λi)wi + other terms not involving ~w. (8.135)

Towards minimizing this expression over ~w, we aim to solve the problem

min_{wi∈R} (νi − λi)wi (8.136)

and collect the results at the end. Thankfully this is much simpler than the rest of the calculations, since the objective is an unconstrained minimization of a linear function of a scalar wi. If the coefficient νi − λi is nonzero, then we can blow up the objective in either direction by choosing wi accordingly. Namely, if νi ≠ λi then the choice wi = −K(νi − λi) for some positive scalar K > 0 simplifies the objective to −K(νi − λi)². Since (νi − λi)² > 0, taking K → ∞ shows that the optimal value of the objective is −∞. On the other hand, if νi = λi then the objective has value 0 independent of the choice of wi. We have shown that

min_{wi∈R} (νi − λi)wi = { 0, if νi = λi
                          { −∞, otherwise. (8.137)


Applying this logic to all i ∈ {1, . . . , m}, we obtain

min_{~x∈Rn, ~u∈Rd, ~w∈Rm} L = { Σ_{i=1}^m (~ηi>~yi − νizi), if ~c = Σ_{i=1}^m (Ai>~ηi + νi~bi), λi ≥ ‖~ηi‖2 ∀i ∈ {1, . . . , m}, and νi = λi ∀i ∈ {1, . . . , m}
                              { −∞, otherwise. (8.138)

Now we can write down the dual problem as

d? = max_{~λ∈Rm+, ~η∈Rd, ~ν∈Rm} g(~λ, ~η, ~ν) (8.139)

which simplifies to

d? = max_{~λ∈Rm, ~η∈Rd, ~ν∈Rm} Σ_{i=1}^m (~ηi>~yi − νizi) (8.140)
     s.t. ~c = Σ_{i=1}^m (Ai>~ηi + νi~bi) (8.141)
          λi ≥ ‖~ηi‖2, ∀i ∈ {1, . . . , m} (8.142)
          λi = νi, ∀i ∈ {1, . . . , m} (8.143)
          λi ≥ 0, ∀i ∈ {1, . . . , m}. (8.144)

Note that the constraint λi ≥ ‖~ηi‖2 already implies λi ≥ 0 since ‖~ηi‖2 ≥ 0. Thus, we can rewrite the problem again as

d? = max_{~λ∈Rm, ~η∈Rd, ~ν∈Rm} Σ_{i=1}^m (~ηi>~yi − νizi) (8.145)
     s.t. ~c = Σ_{i=1}^m (Ai>~ηi + νi~bi) (8.146)
          λi ≥ ‖~ηi‖2, ∀i ∈ {1, . . . , m} (8.147)
          λi = νi, ∀i ∈ {1, . . . , m}. (8.148)

Now note that the last constraint forces ~λ = ~ν. Thus, we can eliminate one of them; we choose arbitrarily to eliminate ~ν by replacing it everywhere with ~λ. This gives the dual problem as

d? = max_{~λ∈Rm, ~η∈Rd} Σ_{i=1}^m (~ηi>~yi − λizi) (8.149)
     s.t. ~c = Σ_{i=1}^m (Ai>~ηi + λi~bi) (8.150)
          λi ≥ ‖~ηi‖2, ∀i ∈ {1, . . . , m}. (8.151)

To write this in SOCP form, we can write the affine constraint as a norm, obtaining

d? = max_{~λ∈Rm, ~η∈Rd} Σ_{i=1}^m (~ηi>~yi − λizi) (8.152)
     s.t. ‖~c − Σ_{i=1}^m (Ai>~ηi + λi~bi)‖2 ≤ 0 (8.153)
          ‖~ηi‖2 ≤ λi, ∀i ∈ {1, . . . , m}. (8.154)

Thus, we have obtained that the dual of an SOCP is another SOCP.

The following content is optional/out of scope for this semester. Regardless, it may be helpful to read it to
gain context, or get a deeper understanding of various results.

8.5 Semidefinite Programming


This section introduces the semidefinite program (SDP), one of the broadest classes of named optimization problems.
We begin with its two forms, the inequality form and standard form, and their properties.
In this section, we will use ⪰ and ⪯ to denote inequalities between symmetric matrices. These are instances of generalized inequalities, as in Definition 176, associated with the (proper) cone of positive semidefinite matrices. All you need to know is: if we write A ⪰ 0, this means A is symmetric positive semidefinite, whereas if A ⪰ B then it means A − B is symmetric positive semidefinite. On the other hand, if we write A ⪯ 0, this means −A is symmetric positive semidefinite; that is, A is symmetric negative semidefinite, with all non-positive eigenvalues.
Recall that Sn is the set of n × n symmetric matrices, and Sn+ is the set of n × n symmetric positive semidefinite
matrices.

Definition 201 (Semidefinite Program in Inequality Form)


A semidefinite program in inequality form is an optimization problem of the following form:

min_{~x∈Rn} ~c>~x (8.155)
s.t. F0 + Σ_{i=1}^n xiFi ⪯ 0, (8.156)

where ~c ∈ Rn, and F0, F1, . . . , Fn ∈ Sn.

The expression F0 + Σ_{i=1}^n xiFi is referred to as a linear matrix inequality. The constraint set, i.e., the set of ~x ∈ Rn such that F0 + Σ_{i=1}^n xiFi ⪯ 0, is called a spectrahedron.

Notice that we only require one linear matrix inequality in the definition. What if we had multiple? Suppose that we actually wanted to solve the problem

min_{~x∈Rn} ~c>~x (8.157)
s.t. F0(1) + Σ_{i=1}^n xiFi(1) ⪯ 0, (8.158)
     F0(2) + Σ_{i=1}^n xiFi(2) ⪯ 0, (8.159)
     ⋮ (8.160)
     F0(k) + Σ_{i=1}^n xiFi(k) ⪯ 0. (8.161)


This could be phrased using a single linear matrix inequality, and the problem would be

min_{~x∈Rn} ~c>~x (8.162)
s.t. ⎡F0(1)            ⎤        ⎡Fi(1)            ⎤
     ⎢      ⋱          ⎥ + Σ_{i=1}^n xi ⎢      ⋱          ⎥ ⪯ 0. (8.163)
     ⎣          F0(k)  ⎦        ⎣          Fi(k)  ⎦

(If this reduction isn't clear to you, it's totally fine; try to prove it as an exercise.)
We now introduce another major standard form of SDPs.

Definition 202 (Semidefinite Program in Standard Form)


A semidefinite program in standard form is an optimization problem of the following form:

min_{X∈Sn} tr(CX) (8.164)
s.t. tr(AkX) = bk, ∀k ∈ {1, . . . , m} (8.165)
     X ⪰ 0, (8.166)

where C, A1, . . . , Am ∈ Sn, and b1, . . . , bm ∈ R.
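A minimal sketch of a standard form SDP (assuming cvxpy and numpy are available): minimizing tr(CX) over PSD X with tr(X) = 1 recovers the smallest eigenvalue of C, which gives an easy way to check the output:

```python
import cvxpy as cp
import numpy as np

n = 3
M = np.random.default_rng(3).standard_normal((n, n))
C = (M + M.T) / 2                                 # made-up symmetric data

X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)),
                  [cp.trace(X) == 1, X >> 0])     # X >> 0 constrains X to be PSD
prob.solve()
print(prob.value, np.linalg.eigvalsh(C).min())    # these should agree
```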

The first main theorem below establishes that the inequality and standard forms of an SDP are equivalent, in the
sense that either can be reformulated as the other.

Theorem 203
An SDP in inequality form can be reformulated as an SDP in standard form, and vice versa.

Proof. Just for this proof, we introduce the notation vec : Rm×n → Rmn, which takes an m × n matrix and unrolls it into an mn-length vector. With this notation, in fact, for two symmetric matrices A, B ∈ Sn, we can write tr(AB) = Σ_{i=1}^n Σ_{j=1}^n AijBij = vec(A)>vec(B). On the other hand, we will sometimes need to access the element of vec(A) corresponding to Aij; we denote this by vec(A)i,j (where the comma makes it clear that the index is not the product of i and j). We will also use the notation diag : Rn → Rn×n which takes a vector and returns a diagonal matrix whose diagonal is the entries of this vector. This notation will greatly simplify things to follow.
“Inequality form =⇒ Standard form”: Let ~c ∈ Rn and F0, F1, . . . , Fn ∈ Sd, and consider the following SDP in inequality form:

min_{~x∈Rn} ~c>~x (8.167)
s.t. F0 + Σ_{i=1}^n xiFi ⪯ 0. (8.168)

Our goal is to write it in the form

min_{X∈Sm} tr(CX) (8.169)
s.t. tr(AkX) = bk, ∀k ∈ {1, . . . , p}, (8.170)
     X ⪰ 0. (8.171)


First, towards introducing the positive semidefinite constraint, we introduce a new variable Y ∈ Sd, associated with −(F0 + Σ_{i=1}^n xiFi). That is, our original problem has the form

min_{~x∈Rn, Y∈Sd} ~c>~x (8.172)
s.t. Y + F0 + Σ_{i=1}^n xiFi = 0, (8.173)
     Y ⪰ 0. (8.174)

Since we have a linear matrix equality, we can write it as a bunch of scalar equations to get it closer to the desired form, say in the following way:

min_{~x∈Rn, Y∈Sd} ~c>~x (8.175)
s.t. Yjk + (F0)jk + Σ_{i=1}^n xi(Fi)jk = 0, ∀j, k ∈ {1, . . . , d} (8.176)
     Y ⪰ 0. (8.177)

But even this isn't quite right – after all, we require all decision variables to be encapsulated in a positive semidefinite matrix. The simplest way to do this is to form a block diagonal matrix where each block is an embedding of a decision variable into a positive semidefinite matrix; the large matrix will also be positive semidefinite in this case. Towards converting ~x to a positive semidefinite block, one could consider its diagonal matrix equivalent diag(~x), but this would not be positive semidefinite unless all entries of ~x were non-negative. To ensure that this happens, we use slack variables, akin to the proof that general linear programs can be written in standard form.
Namely, associate vectors ~x+, ~x− ∈ Rn defined by the following formulae:

xi+ = { xi, xi > 0        xi− = { 0, xi > 0
      { 0, xi ≤ 0,              { −xi, xi ≤ 0. (8.178)

In this case ~x = ~x+ − ~x−. Thus the original problem is equivalent to the reformulation

min_{~x+∈Rn, ~x−∈Rn, Y∈Sd} ~c>(~x+ − ~x−) (8.179)
s.t. Yjk + (F0)jk + Σ_{i=1}^n (xi+ − xi−)(Fi)jk = 0, ∀j, k ∈ {1, . . . , d} (8.180)
     xi+ ≥ 0, ∀i ∈ {1, . . . , n}, (8.181)
     xi− ≥ 0, ∀i ∈ {1, . . . , n}, (8.182)
     Y ⪰ 0. (8.183)

Now we are in business; we can write all the inequality/definiteness constraints as

Z = ⎡ diag(~x+)                     ⎤
    ⎢            diag(~x−)          ⎥ ⪰ 0. (8.184)
    ⎣                          Y    ⎦

This is the positive semidefiniteness constraint we want, so the decision variable is Z ∈ S2n+d. As notation, let Z1,i = diag(~x+)ii = xi+ be the ith element of the first block, Z2,i = diag(~x−)ii = xi− be the ith element of the second block, and Z3,ij = Yij be the (i, j)th element of the third block. As notation for later, let O be the set of all indices in {1, . . . , 2n + d} × {1, . . . , 2n + d} which are not on the diagonal or part of the Y block, and thus must be set to zero; formally O = {(i, j) | 1 ≤ i, j ≤ 2n + d, i ≠ j, i ≤ 2n or j ≤ 2n}.
Now, we have written our problem in the form
\min_{Z \in \mathbb{S}^{2n+d}} \ \sum_{i=1}^{n} c_i (Z_{1,i} - Z_{2,i})    (8.185)
\text{s.t. } Z_{3,jk} + (F_0)_{jk} + \sum_{i=1}^{n} (Z_{1,i} - Z_{2,i})(F_i)_{jk} = 0, \ \forall j, k \in \{1, \dots, d\},    (8.186)
Z_{i,j} = 0, \ \forall (i, j) \in O,    (8.187)
Z \succeq 0.    (8.188)

Notice that all constraints are either affine equalities or the positive semidefiniteness constraint, and our objective is affine;
by our discussion of affine functions, the affine constraints can be written in the form tr(Ak Z) = bk , and the objective can be written in the form tr(CZ),
for some symmetric matrices Ak , C and scalars bk , with k ∈ {1, . . . , m} where m = d² + |O|.² Thus we can write our
problem as

\min_{Z \in \mathbb{S}^{2n+d}} \ \operatorname{tr}(CZ)    (8.189)
\text{s.t. } \operatorname{tr}(A_k Z) = b_k, \ \forall k \in \{1, \dots, m\},    (8.190)
Z \succeq 0,    (8.191)

as desired.
“Standard form =⇒ Inequality form”: Let C, A1 , . . . , Am ∈ Sn be fixed symmetric matrices, and let b1 , . . . , bm ∈
R be fixed scalars. Consider the following SDP in standard form:

\min_{X \in \mathbb{S}^n} \ \operatorname{tr}(CX)    (8.192)
\text{s.t. } \operatorname{tr}(A_k X) = b_k, \ \forall k \in \{1, \dots, m\},    (8.193)
X \succeq 0.    (8.194)

We want to write it in the form

\min_{\vec{x} \in \mathbb{R}^m} \ \vec{c}^\top \vec{x}    (8.195)
\text{s.t. } F_0 + \sum_{i=1}^{m} x_i F_i \preceq 0.    (8.196)

²Careful readers may notice that the discussion on affine functions ensured something slightly different; namely, for an affine function f on
symmetric matrices (or indeed all of $\mathbb{R}^{n \times n}$), there was some matrix A and scalar b such that $f(X) = \operatorname{tr}(A^\top X) + b$. In particular, the result did
not guarantee that such an A could be symmetric. But certainly the matrix $(A + A^\top)/2$ is symmetric, and for Z symmetric we have $\operatorname{tr}(A^\top Z) = \operatorname{tr}([(A + A^\top)/2] Z)$,
so indeed, for an affine function $f : \mathbb{S}^n \to \mathbb{R}$ there exists some matrix $A \in \mathbb{S}^n$ and scalar $b \in \mathbb{R}$ such that $f(X) = \operatorname{tr}(AX) + b$.

Notice by our notation that $\operatorname{tr}(CX) = \operatorname{vec}(C)^\top \operatorname{vec}(X)$ and that $\operatorname{tr}(A_k X) = \operatorname{vec}(A_k)^\top \operatorname{vec}(X)$. Thus, letting
$\vec{x} \in \mathbb{R}^{n^2}$ be defined as $\vec{x} = \operatorname{vec}(X)$, our objective is linear in $\vec{x}$, since it is $\vec{c}^\top \vec{x}$ where $\vec{c} = \operatorname{vec}(C)$. Furthermore, our
equality constraints are affine, since they are $\vec{a}_k^\top \vec{x} = b_k$ where $\vec{a}_k = \operatorname{vec}(A_k)$. We express each equality constraint as
a pair of linear matrix inequalities, since that is the only type of constraint we are permitted to have. Indeed, we have
\vec{a}_k^\top \vec{x} = b_k \iff \sum_{i=1}^{m} x_i (\vec{a}_k)_i = b_k    (8.197)
\iff -b_k + \sum_{i=1}^{m} x_i (\vec{a}_k)_i = 0    (8.198)
\iff \Big({-b_k} + \sum_{i=1}^{m} x_i (\vec{a}_k)_i\Big) \preceq 0 \ \text{ and } \ -\Big({-b_k} + \sum_{i=1}^{m} x_i (\vec{a}_k)_i\Big) \preceq 0,    (8.199)

where we are using ⪯ for ordering on the space of 1 × 1 symmetric matrices, i.e., scalars. These are bona fide linear
matrix inequalities and will be combined with others, later, to form the full linear matrix inequality constraint for our
problem.
The only constraint remaining that cannot easily be expressed in vectorized form is the constraint X ⪰ 0. For this,
we note that we are allowed to have a linear matrix inequality constraint, so we want to express X ⪰ 0 in terms of a
linear matrix inequality involving ~x. This is difficult at first, so we handle the case n = 2 as an example. Write

X = \begin{bmatrix} x_1 & x_2 \\ x_3 & x_4 \end{bmatrix}, \qquad \vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}.    (8.200)

Notice that, since X is symmetric (and so x2 = x3 ), we can write X as a linear combination of constant
symmetric matrices, as follows:

X = \begin{bmatrix} x_1 & x_2 \\ x_2 & x_4 \end{bmatrix}    (8.201)
= x_1 \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} + x_2 \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} + x_4 \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}    (8.202)
= x_1 \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} + \frac{x_2}{2} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} + \frac{x_3}{2} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} + x_4 \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}    (8.203)
= x_1 E^{11} + \frac{1}{2} x_2 (E^{12} + E^{21}) + \frac{1}{2} x_3 (E^{12} + E^{21}) + x_4 E^{22},    (8.204)
where $E^{ij}$ is defined as the n × n matrix with 1 in the (i, j)-th coordinate and 0 elsewhere. Thus the positive semidefinite
constraint can be replaced by the linear matrix inequality

X \succeq 0 \iff -\Big( x_1 E^{11} + \frac{1}{2} x_2 (E^{12} + E^{21}) + \frac{1}{2} x_3 (E^{12} + E^{21}) + x_4 E^{22} \Big) \preceq 0.    (8.205)
The general case goes the same way. We can say

X \succeq 0 \iff -\Bigg( \sum_{i=1}^{n} x_{i,i} E^{ii} + \frac{1}{2} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \ne i}}^{n} x_{i,j} (E^{ij} + E^{ji}) \Bigg) \preceq 0,    (8.206)
where again xi,j refers to the element of ~x corresponding to the entry Xij .
This gives a linear matrix inequality for the last constraint, and so all constraints can be represented by some linear
matrix inequalities. Thus, by the discussion on reducing several linear matrix inequalities to a single one, all constraints
can be represented as a single linear matrix inequality of the form $F_0 + \sum_{i=1}^{m} x_i F_i \preceq 0$. Thus the original problem can be
represented as

\min_{\vec{x} \in \mathbb{R}^m} \ \vec{c}^\top \vec{x}    (8.207)
\text{s.t. } F_0 + \sum_{i=1}^{m} x_i F_i \preceq 0,    (8.208)

where m = n², ~c = vec(C), and the linear matrix inequality constraint is constructed in the aforementioned way.
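To make the inequality form concrete, here is a small numerical sanity check. The following sketch is my own toy instance, not from the course materials, and assumes the CVXPY package with the SCS solver is installed. It computes the smallest eigenvalue of a symmetric matrix A by solving the inequality-form SDP min −t subject to F0 + tF1 ⪯ 0 with F0 = −A and F1 = I.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                      # a random symmetric matrix

# Inequality-form SDP: min c1 * t  s.t.  F0 + t * F1 <= 0,
# with c1 = -1, F0 = -A, F1 = I; the optimal t is the smallest eigenvalue of A.
t = cp.Variable()
prob = cp.Problem(cp.Minimize(-t), [-A + t * np.eye(4) << 0])
prob.solve(solver=cp.SCS)

print("SDP value:          ", t.value)
print("min eigenvalue of A:", np.linalg.eigvalsh(A).min())

The two printed numbers should agree up to solver tolerance.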

Theorem 204 (Dual of an SDP)


The dual of an SDP is an SDP.

Proof. Let ~c ∈ Rn , and let F0 , F1 , . . . , Fn ∈ Sd . Consider the following inequality-form SDP:

\min_{\vec{x} \in \mathbb{R}^n} \ \vec{c}^\top \vec{x}    (8.209)
\text{s.t. } F_0 + \sum_{i=1}^{n} x_i F_i \preceq 0.    (8.210)

We compute the conic dual of this problem. We know that the dual cone of $\mathbb{S}^d_+$ in $\mathbb{S}^d$ (equipped with the Frobenius
inner product $\langle A, B \rangle_F = \operatorname{tr}(AB)$ and corresponding Frobenius norm) is simply $\mathbb{S}^d_+$ itself. Thus we can define the
Lagrangian $L : \mathbb{R}^n \times \mathbb{S}^d_+ \to \mathbb{R}$ as

L(\vec{x}, \Lambda) = \vec{c}^\top \vec{x} + \Big\langle \Lambda, \, F_0 + \sum_{i=1}^{n} x_i F_i \Big\rangle_F    (8.211)
= \vec{c}^\top \vec{x} + \operatorname{tr}\Big( \Lambda \Big( F_0 + \sum_{i=1}^{n} x_i F_i \Big) \Big)    (8.212)
= \vec{c}^\top \vec{x} + \operatorname{tr}(\Lambda F_0) + \sum_{i=1}^{n} x_i \operatorname{tr}(\Lambda F_i)    (8.213)
= \sum_{i=1}^{n} (c_i + \operatorname{tr}(\Lambda F_i)) x_i + \operatorname{tr}(\Lambda F_0).    (8.214)

Now, define the dual function $g : \mathbb{S}^d_+ \to \mathbb{R}$ by minimizing over the primal variable $\vec{x}$:

g(\Lambda) = \min_{\vec{x} \in \mathbb{R}^n} L(\vec{x}, \Lambda)    (8.215)
= \min_{\vec{x} \in \mathbb{R}^n} \Big( \sum_{i=1}^{n} (c_i + \operatorname{tr}(\Lambda F_i)) x_i + \operatorname{tr}(\Lambda F_0) \Big)    (8.216)
= \operatorname{tr}(\Lambda F_0) + \sum_{i=1}^{n} \min_{x_i \in \mathbb{R}} (c_i + \operatorname{tr}(\Lambda F_i)) x_i    (8.217)
= \begin{cases} \operatorname{tr}(\Lambda F_0), & \text{if } \operatorname{tr}(\Lambda F_i) = -c_i, \ \forall i \in \{1, \dots, n\} \\ -\infty, & \text{otherwise.} \end{cases}    (8.218)

The last equality is because in each individual term (ci + tr(ΛFi ))xi , when minimizing over xi , if ci + tr(ΛFi ) ≠ 0
then we can always drive it to −∞ by picking xi to be large in magnitude and of the opposite sign.

We can thus write the dual problem as

d^\star = \max_{\Lambda \in \mathbb{S}^d} \ \operatorname{tr}(F_0 \Lambda)    (8.219)
\text{s.t. } \operatorname{tr}(F_i \Lambda) = -c_i, \ \forall i \in \{1, \dots, n\},    (8.220)
\Lambda \succeq 0,    (8.221)

which is an SDP in standard form. (Note the minus sign in (8.220); it comes directly from the finiteness condition in (8.218).)
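As a hedged numerical check (again my own toy instance, assuming CVXPY with the SCS solver), we can solve the primal inequality-form SDP from before and the dual standard-form SDP just derived, and observe that p⋆ = d⋆. Note the dual constraint uses −ci, matching (8.220).

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2
F0, F1, c1 = -A, np.eye(4), -1.0       # primal: min c1*t  s.t.  F0 + t*F1 <= 0

t = cp.Variable()
primal = cp.Problem(cp.Minimize(c1 * t), [F0 + t * F1 << 0])
primal.solve(solver=cp.SCS)

# Dual: max tr(F0 Lam)  s.t.  tr(F1 Lam) = -c1, Lam >= 0.
Lam = cp.Variable((4, 4), PSD=True)
dual = cp.Problem(cp.Maximize(cp.trace(F0 @ Lam)), [cp.trace(F1 @ Lam) == -c1])
dual.solve(solver=cp.SCS)

print("p* =", primal.value, " d* =", dual.value)  # equal up to solver tolerance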

SDPs generalize all previously introduced classes of convex optimization problems: LPs, (convex) QPs, (convex)
QCQPs, and SOCPs.

Theorem 205
SOCPs can be reformulated as SDPs.

Proof. We use the following useful characterization of second-order cone constraints as semidefinite constraints.
Claim. For $(\vec{x}, t) \in \mathbb{R}^{m+1}$, we have

\|\vec{x}\|_2 \le t \iff \begin{bmatrix} tI & \vec{x} \\ \vec{x}^\top & t \end{bmatrix} \succeq 0.    (8.222)

Proof of claim. We have

\begin{bmatrix} tI & \vec{x} \\ \vec{x}^\top & t \end{bmatrix} \succeq 0 \iff \begin{bmatrix} \vec{a} \\ b \end{bmatrix}^\top \begin{bmatrix} tI & \vec{x} \\ \vec{x}^\top & t \end{bmatrix} \begin{bmatrix} \vec{a} \\ b \end{bmatrix} \ge 0, \ \forall (\vec{a}, b) \in \mathbb{R}^{m+1}    (8.223)
\iff \begin{bmatrix} \vec{a} \\ b \end{bmatrix}^\top \begin{bmatrix} t\vec{a} + b\vec{x} \\ \vec{x}^\top \vec{a} + tb \end{bmatrix} \ge 0, \ \forall (\vec{a}, b) \in \mathbb{R}^{m+1}    (8.224)
\iff t\big( \|\vec{a}\|_2^2 + b^2 \big) + 2b\,\vec{a}^\top \vec{x} \ge 0, \ \forall (\vec{a}, b) \in \mathbb{R}^{m+1}.    (8.225)

By Cauchy-Schwarz we have

t\big( \|\vec{a}\|_2^2 + b^2 \big) + 2b\,\vec{a}^\top \vec{x} \ge t\big( \|\vec{a}\|_2^2 + b^2 \big) - 2|b| \, \|\vec{a}\|_2 \|\vec{x}\|_2,    (8.226)

with equality when ~a = −K~x for some positive scalar K > 0. Thus

\begin{bmatrix} tI & \vec{x} \\ \vec{x}^\top & t \end{bmatrix} \succeq 0 \iff t\big( \|\vec{a}\|_2^2 + b^2 \big) - 2|b| \, \|\vec{a}\|_2 \|\vec{x}\|_2 \ge 0, \ \forall (\vec{a}, b) \in \mathbb{R}^{m+1}.    (8.227)

Now by the AM-GM inequality (i.e., expanding the square in $(\|\vec{a}\|_2 - |b|)^2 \ge 0$), we have $\|\vec{a}\|_2^2 + b^2 \ge 2|b| \, \|\vec{a}\|_2$,
with equality when $\|\vec{a}\|_2 = |b|$. This gives

t\big( \|\vec{a}\|_2^2 + b^2 \big) - 2|b| \, \|\vec{a}\|_2 \|\vec{x}\|_2 \ge 2|b| \, \|\vec{a}\|_2 (t - \|\vec{x}\|_2)    (8.228)

with equality when $\|\vec{a}\|_2 = |b|$. Thus we have

\begin{bmatrix} tI & \vec{x} \\ \vec{x}^\top & t \end{bmatrix} \succeq 0 \iff 2|b| \, \|\vec{a}\|_2 (t - \|\vec{x}\|_2) \ge 0, \ \forall (\vec{a}, b) \in \mathbb{R}^{m+1}    (8.229)
\iff t - \|\vec{x}\|_2 \ge 0,    (8.230)

as desired, so the claim is proved.


Now fix $\vec{c} \in \mathbb{R}^n$, fix $\vec{b}_1, \dots, \vec{b}_m \in \mathbb{R}^n$, fix $A_1, \dots, A_m$ so that $A_i \in \mathbb{R}^{d_i \times n}$, fix $\vec{y}_1, \dots, \vec{y}_m$ so that $\vec{y}_i \in \mathbb{R}^{d_i}$, and
fix $z_1, \dots, z_m \in \mathbb{R}$. Consider the following generic SOCP:

\min_{\vec{x} \in \mathbb{R}^n} \ \vec{c}^\top \vec{x}    (8.231)
\text{s.t. } \|A_i \vec{x} - \vec{y}_i\|_2 \le \vec{b}_i^\top \vec{x} + z_i, \ \forall i \in \{1, \dots, m\}.    (8.232)

Let $\mathcal{K}^d$ be the second-order cone in $\mathbb{R}^d$. Notice that each cone constraint can be written in the form

\|A_i \vec{x} - \vec{y}_i\|_2 \le \vec{b}_i^\top \vec{x} + z_i    (8.233)
\iff (A_i \vec{x} - \vec{y}_i, \ \vec{b}_i^\top \vec{x} + z_i) \in \mathcal{K}^{d_i + 1}    (8.234)
\iff \begin{bmatrix} (\vec{b}_i^\top \vec{x} + z_i) I & A_i \vec{x} - \vec{y}_i \\ (A_i \vec{x} - \vec{y}_i)^\top & \vec{b}_i^\top \vec{x} + z_i \end{bmatrix} \succeq 0    (8.235)
\iff \begin{bmatrix} z_i I & -\vec{y}_i \\ -\vec{y}_i^\top & z_i \end{bmatrix} + \sum_{j=1}^{n} x_j \begin{bmatrix} (\vec{b}_i)_j I & (A_i)_j \\ (A_i)_j^\top & (\vec{b}_i)_j \end{bmatrix} \succeq 0,    (8.236)

where (~bi )j is the j th entry of ~bi , and (Ai )j is the j th column of Ai . Anyways, this is a linear matrix inequality (after
some reshuffling of terms).
The conversion from SOCP to SDP consists of converting all second-order cone constraints to small linear matrix
inequalities, then combining them to form one larger linear matrix inequality, which defines the constraint set of the
inequality-form SDP. The objective function is already linear, so the resulting SDP is in the “standard” inequality form.
Thus, we have reduced the original SOCP to an SDP.

Note that in practice, this reduction is often extremely costly; SDPs are hard to solve at large scale, while SOCPs
are much easier.
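For intuition, the key step in this reduction — the claim that ‖~x‖₂ ≤ t is equivalent to the arrow-shaped matrix in (8.222) being positive semidefinite — is also easy to check numerically. The following sketch is my own example (only NumPy assumed) and compares the two conditions on random inputs.

import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    x = rng.standard_normal(3)
    t = rng.uniform(0.0, 2.0 * np.linalg.norm(x))
    M = np.block([[t * np.eye(3), x[:, None]],
                  [x[None, :],    np.array([[t]])]])
    is_psd = np.linalg.eigvalsh(M).min() >= -1e-10   # PSD up to numerical tolerance
    print(is_psd, np.linalg.norm(x) <= t)            # the two booleans agree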
The above content is optional/out of scope for this semester, but now we resume the required/in scope content.

8.6 General Taxonomy


We conclude this chapter with a taxonomy of problems that we have discussed until now:

LPs ⊂ Convex QPs ⊂ Convex QCQPs ⊂ SOCPs ⊂ SDPs ⊂ Convex Problems (8.237)

All inclusions are strict, i.e., none of the classes is equivalent to any of the others.
For extra optional reading, you may also look into geometric programs (GPs), which are nonconvex programs that
can be turned into convex programs with a change of variables; and mixed-integer programs (MIPs), which are useful
in practice to incorporate integer constraints, but difficult to solve exactly. All such material is out of scope of the
course.

Chapter 9

Regularization and Sparsity

Relevant sections of the textbooks:

• [2] Chapters 9, 12, 13.

9.1 Recapping Ridge Regression and Defining LASSO


The first example of regularization we saw was ridge regression. In this section, we’ll review ridge regression. The
most basic perspective of ridge regression focuses on the additional term we add to the objective function. In ridge
regression, we solve the following problem:
\min_{\vec{x} \in \mathbb{R}^n} \Big\{ \|A\vec{x} - \vec{y}\|_2^2 + \lambda \|\vec{x}\|_2^2 \Big\}.    (9.1)

This is different from the OLS problem due to the additional $\lambda \|\vec{x}\|_2^2$ term, which can be thought of as a regularizer

(i.e., a penalty) for having large ~x values. The λ parameter controls the strength of the penalty and is usually called a
regularization parameter. In this sense, ridge regression is regularized least squares. More generally, we may define
regularization as follows.

Definition 206 (Regularization)


Consider the optimization problem
p^\star = \min_{\vec{x} \in \Omega} f_0(\vec{x}).    (9.2)

For a given function R : Ω → R+ (the regularizer) and a regularization parameter λ > 0, the regularized version
of the above problem is the problem
p_\lambda^\star = \min_{\vec{x} \in \Omega} \{ f_0(\vec{x}) + \lambda R(\vec{x}) \}.    (9.3)

Here λ controls the strength of the regularization.

In general, the original problem and the regularized problem do not have the same solutions, nor do versions of the
regularized problem with different values of the λ parameter. One need only consider ridge regression to keep this in mind; for a
fixed A and ~y , increasing λ will decrease the norm of the solution to the ridge regression problem, and sending it to 0
(i.e., recovering unregularized least squares) will increase the norm of the solution.
One example of regularization is the ℓ2-norm penalty $R(\vec{x}) = \|\vec{x}\|_2^2$, which (when combined with $f_0(\vec{x}) = \|A\vec{x} - \vec{y}\|_2^2$) yields ridge regression. Another example is elastic-net regression, which we covered briefly as an
example when discussing convexity. But the main objective of this chapter is to look at the so-called LASSO regression problem, which uses an ℓ1-norm regularizer. Recall that for a vector $\vec{x} \in \mathbb{R}^n$, its ℓ1-norm is defined as
$\|\vec{x}\|_1 = \sum_{i=1}^{n} |x_i|$.

Definition 207 (LASSO Regression)


Let A ∈ Rm×n , ~y ∈ Rm , and λ > 0. The LASSO regression problem is:

\min_{\vec{x} \in \mathbb{R}^n} \Big\{ \|A\vec{x} - \vec{y}\|_2^2 + \lambda \|\vec{x}\|_1 \Big\}.    (9.4)

Here are some key properties of the LASSO regression problem.

Proposition 208
Consider the LASSO regression problem
\min_{\vec{x} \in \mathbb{R}^n} f_0(\vec{x}) \quad \text{where} \quad f_0(\vec{x}) = \|A\vec{x} - \vec{y}\|_2^2 + \lambda \|\vec{x}\|_1.    (9.5)

(a) The function f0 : Rn → R is convex.

(b) If A has full column rank then f0 is µ-strongly convex with $\mu = 2\sigma_n\{A\}^2$.

(c) A solution ~x? ∈ argmin~x∈Rn f0 (~x) always exists.

(d) If A has full column rank then the above solution is unique.

This picture is very different from ridge regression, where we are guaranteed that a solution always exists, is unique,
and is solvable in closed form. The question then becomes: why do we even care about the LASSO problem at all? The
basic answer is that it induces sparsity in the solution, i.e., solutions to LASSO usually tend to have few nonzero
entries. This sparsity is useful for applications in high-dimensional statistics and machine learning, as it reveals a
certain structure — in words, it points out which “features” are the most relevant to the regression. In the following
sections, we will observe how this sparsity emerges, both geometrically and algebraically.
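As a quick preview of this sparsity phenomenon, here is a hedged sketch on synthetic data (my own example; it assumes scikit-learn, whose Lasso/Ridge objectives use a slightly different scaling of the data-fit term than (9.4)).

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
x_true = np.zeros(20)
x_true[[2, 7, 15]] = [3.0, -2.0, 1.5]           # sparse ground truth
y = A @ x_true + 0.1 * rng.standard_normal(100)

x_lasso = Lasso(alpha=0.1).fit(A, y).coef_
x_ridge = Ridge(alpha=0.1).fit(A, y).coef_
print("nonzero entries: lasso =", int((np.abs(x_lasso) > 1e-6).sum()),
      " ridge =", int((np.abs(x_ridge) > 1e-6).sum()))

Typically the LASSO solution has only a handful of nonzero entries (close to the 3 in the ground truth), while the ridge solution has all 20 entries nonzero but small.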

9.2 Understanding the Difference Between the `2 -Norm and the `1 -Norm
In this section, we attempt to build more intuition about the difference between the ℓ2-norm and the ℓ1-norm. We do
this by solving some problems which use the ℓ2 norm, then replacing it with the ℓ1 norm and solving the new problem.
Besides giving us intuition, it will help us learn how to analyze the LASSO problem.
Here is a diagram of the norm balls of the ℓ1 (blue) and ℓ2 (red) norms in n = 2 dimensions:

Figure 9.1: The `1 and `2 norm balls in n = 2 dimensions. Recall that the `p -norm ball is defined as the set of vectors ~v such that
k~v kp ≤ 1.

The border of each norm ball is the set of points where the norm equals 1. Notice the difference in the geometry of
these norm balls. The ℓ2 norm ball is circular, while the ℓ1 norm ball has distinctive corners.
In fact, these corners hint at a key difference between these norms: the squared ℓ2 norm is differentiable everywhere, but
$\|\vec{x}\|_1$ is not differentiable when any xi = 0. These corners will help us understand how the ℓ1 norm regularizer induces
sparsity in the solution, and also inform our analysis of problems involving the `1 -norm, including LASSO.

Example 209 (Least `1 -Norm). Recall that we solved the problem

\min_{\vec{x} \in \mathbb{R}^n} \ \|\vec{x}\|_2^2    (9.6)
\text{s.t. } A\vec{x} = \vec{y}.    (9.7)

Using the KKT conditions, namely stationarity, we found an explicit solution to this problem: ~x? = A> (AA> )−1 ~y .
Now let us replace the `2 norm with an `1 norm; we obtain the problem

\min_{\vec{x} \in \mathbb{R}^n} \ \|\vec{x}\|_1    (9.8)
\text{s.t. } A\vec{x} = \vec{y}.    (9.9)

We cannot apply stationarity to this problem because the objective is non-differentiable. Thus, this problem seems
intractable to solve by hand, at least for the moment. Instead, let us formulate it as a linear program. As before,
we represent each xi as the difference of non-negative numbers which sum to |xi |. More formally, we introduce slack
variables $\vec{x}^+, \vec{x}^- \in \mathbb{R}^n$ such that for each i ∈ {1, . . . , n} we have $x_i^+ \ge 0$, $x_i^- \ge 0$, $x_i^+ - x_i^- = x_i$, and $x_i^+ + x_i^- = |x_i|$.

Thus we can rewrite the problem using the following linear program:
\min_{\vec{x}^+, \vec{x}^- \in \mathbb{R}^n} \ \sum_{i=1}^{n} (x_i^+ + x_i^-)    (9.10)
\text{s.t. } A(\vec{x}^+ - \vec{x}^-) = \vec{y},    (9.11)
\vec{x}^+ \ge \vec{0},    (9.12)
\vec{x}^- \ge \vec{0}.    (9.13)

This is a linear program which is efficiently solvable.
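As a hedged illustration of this reformulation (my own toy instance; SciPy assumed), the sketch below stacks the variables as z = (x⁺, x⁻) and solves the LP (9.10)–(9.13) with scipy.optimize.linprog.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))                 # underdetermined system
y = rng.standard_normal(3)

n = A.shape[1]
c = np.ones(2 * n)                              # objective: sum(x+) + sum(x-)
A_eq = np.hstack([A, -A])                       # encodes A (x+ - x-) = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
x = res.x[:n] - res.x[n:]                       # recover x = x+ - x-
print("||x||_1 =", np.abs(x).sum(), " residual =", np.linalg.norm(A @ x - y))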


As a corollary, we can consider the `1 -norm regression problem:

\min_{\vec{x} \in \mathbb{R}^n} \ \|A\vec{x} - \vec{y}\|_1.    (9.14)

We can introduce the slack variable ~e = A~x − ~y and obtain the problem:

\min_{\vec{x} \in \mathbb{R}^n, \, \vec{e} \in \mathbb{R}^m} \ \|\vec{e}\|_1    (9.15)
\text{s.t. } A\vec{x} - \vec{y} = \vec{e},    (9.16)

which is an equality-constrained `1 minimization problem, and thus a linear program as demonstrated above.

Example 210 (Mean Versus Median). Let k be a positive integer. Suppose we have points ~x1 , . . . , ~xk ∈ Rn . Consider
the problem
\min_{\vec{x} \in \mathbb{R}^n} \ \sum_{i=1}^{k} \|\vec{x} - \vec{x}_i\|_2^2.    (9.17)

This is an unconstrained strongly convex differentiable problem, so it has a unique solution ~x?1 which we may find by
setting the derivative of the objective to ~0. We obtain
\vec{0} = 2 \sum_{i=1}^{k} (\vec{x}_1^\star - \vec{x}_i)    (9.18)
\implies \vec{0} = \sum_{i=1}^{k} (\vec{x}_1^\star - \vec{x}_i) = k \cdot \vec{x}_1^\star - \sum_{i=1}^{k} \vec{x}_i    (9.19)
\implies \vec{x}_1^\star = \frac{1}{k} \sum_{i=1}^{k} \vec{x}_i.    (9.20)

This computation implies that the sample mean is the point which minimizes the total squared distance to all points in
the dataset.
Now suppose that we instead consider the problem
\min_{\vec{x} \in \mathbb{R}^n} \ \sum_{i=1}^{k} \|\vec{x} - \vec{x}_i\|_2.    (9.21)

The solution to this problem is the sample median of the points. To see this, suppose that n = 1, i.e., all our data xi
are scalar-valued. Then we obtain the problem
\min_{x \in \mathbb{R}} \ \sum_{i=1}^{k} |x - x_i|.    (9.22)

This is an unconstrained, convex, non-differentiable problem. Let us examine all critical points – that is, points where
the derivative is 0 or undefined. The derivative of the objective is
\frac{d}{dx} \sum_{i=1}^{k} |x - x_i| = \sum_{i=1}^{k} \frac{d}{dx} |x - x_i|    (9.23)
= \sum_{i=1}^{k} \begin{cases} 1, & \text{if } x > x_i \\ -1, & \text{if } x < x_i \\ \text{undefined}, & \text{if } x = x_i \end{cases}    (9.24)
= \begin{cases} \sum_{i : x > x_i} 1 + \sum_{i : x < x_i} (-1), & \text{if } x \notin \{x_1, \dots, x_k\} \\ \text{undefined}, & \text{if } x \in \{x_1, \dots, x_k\} \end{cases}    (9.25)
= \begin{cases} |\{i \in \{1, \dots, k\} : x > x_i\}| - |\{i \in \{1, \dots, k\} : x < x_i\}|, & \text{if } x \notin \{x_1, \dots, x_k\} \\ \text{undefined}, & \text{if } x \in \{x_1, \dots, x_k\}. \end{cases}    (9.26)

Thus if x is such that |{i ∈ {1, . . . , k} : x > xi }| = |{i ∈ {1, . . . , k} : x < xi }|, then the derivative is 0, so this x is a
candidate solution. To put this convoluted-looking condition in words, notice that the first term in the equality is just
the number of xi which are larger than x, and the second term is the number of xi which are smaller than x. Thus the
condition says that there are the same number of points in the set which are larger than x as there are points which are
smaller than x. This x would fulfill the traditional definition of “median” as the middle of the sorted list of points.
To formally solve this problem, one must also check all the values x = xi and compare the objective values. But
eventually after doing all this, one recovers that the optimal solutions are all possible medians of the dataset.
Because the median is defined using the |·| function instead of the (·)² function, it inherits several different properties. The most
striking is its robustness; the median is much more robust than the mean. The mean is very sensitive to outliers, while
the median is less sensitive (i.e. if we blow up an outlier point, the mean will change a lot, while the median will be
unaffected).
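A tiny numerical illustration of this robustness claim (my own toy data; NumPy assumed):

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print("mean =", data.mean(), " median =", np.median(data))
data[-1] = 10000.0                       # blow up the outlier
print("mean =", data.mean(), " median =", np.median(data))
# The mean jumps dramatically; the median stays at 3.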

9.3 Analysis of LASSO Regression


In this section we will solve the one-dimensional LASSO problem. The ideas generalize to the vector case directly,
through a reduction of the vector LASSO problems to several one-dimensional LASSO problems. The details of this
reduction are left as a homework exercise.
First consider the scalar ridge regression problem, with $\vec{a} \ne \vec{0}$:

\min_{x \in \mathbb{R}} f_{\mathrm{RR}}(x) \quad \text{where} \quad f_{\mathrm{RR}}(x) = \frac{1}{2} \|\vec{a} x - \vec{y}\|_2^2 + \frac{1}{2} \lambda x^2.    (9.27)
By taking the derivative, we get
\frac{d f_{\mathrm{RR}}}{dx}(x) = \vec{a}^\top (\vec{a} x - \vec{y}) + \lambda x    (9.28)
= (\vec{a}^\top \vec{a} + \lambda) x - \vec{a}^\top \vec{y}    (9.29)
= (\|\vec{a}\|_2^2 + \lambda) x - \vec{a}^\top \vec{y}    (9.30)
\implies x_{\mathrm{RR}}^\star = \frac{\vec{a}^\top \vec{y}}{\|\vec{a}\|_2^2 + \lambda}.    (9.31)
By setting λ = 0 we obtain the least squares solution:
x_{\mathrm{LS}}^\star = \frac{\vec{a}^\top \vec{y}}{\|\vec{a}\|_2^2}.    (9.32)
Now consider the scalar LASSO problem
\min_{x \in \mathbb{R}} f_{\mathrm{LASSO}}(x) \quad \text{where} \quad f_{\mathrm{LASSO}}(x) = \frac{1}{2} \|\vec{a} x - \vec{y}\|_2^2 + \lambda |x|.    (9.33)
This has a derivative everywhere except x = 0. We obtain

\frac{d f_{\mathrm{LASSO}}}{dx}(x) = \vec{a}^\top (\vec{a} x - \vec{y}) + \lambda \begin{cases} 1, & \text{if } x > 0 \\ -1, & \text{if } x < 0 \\ \text{undefined}, & \text{if } x = 0. \end{cases}    (9.34)

Let x⋆ be a critical point of this problem. We determine what x⋆ must be using casework.

Case 1. If x⋆ > 0, then the derivative is well-defined, so it must be equal to 0. Thus we have

0 = \frac{d f_{\mathrm{LASSO}}}{dx}(x^\star)    (9.35)
= \vec{a}^\top (\vec{a} x^\star - \vec{y}) + \lambda    (9.36)
\implies x^\star = \frac{\vec{a}^\top \vec{y} - \lambda}{\|\vec{a}\|_2^2}.    (9.37)

Thus if x⋆ > 0 then $x^\star = (\vec{a}^\top \vec{y} - \lambda)/\|\vec{a}\|_2^2$. Thus x⋆ > 0 if and only if $\vec{a}^\top \vec{y} > \lambda$.

Case 2. If x⋆ < 0, then the derivative is well-defined, so it must be equal to 0. Thus we have

0 = \frac{d f_{\mathrm{LASSO}}}{dx}(x^\star)    (9.38)
= \vec{a}^\top (\vec{a} x^\star - \vec{y}) - \lambda    (9.39)
\implies x^\star = \frac{\vec{a}^\top \vec{y} + \lambda}{\|\vec{a}\|_2^2}.    (9.40)

Thus if x⋆ < 0 then $x^\star = (\vec{a}^\top \vec{y} + \lambda)/\|\vec{a}\|_2^2$. Thus x⋆ < 0 if and only if $\vec{a}^\top \vec{y} < -\lambda$.

Case 3. If x? = 0 then it is neither > 0 nor < 0. Thus we must have −λ ≤ ~a> ~y ≤ λ.

Thus we have three cases:

• x⋆ > 0 ⟺ $\vec{a}^\top \vec{y} > \lambda$, in which case $x^\star = (\vec{a}^\top \vec{y} - \lambda)/\|\vec{a}\|_2^2$;

• x⋆ < 0 ⟺ $\vec{a}^\top \vec{y} < -\lambda$, in which case $x^\star = (\vec{a}^\top \vec{y} + \lambda)/\|\vec{a}\|_2^2$; and

• x⋆ = 0 ⟺ $-\lambda \le \vec{a}^\top \vec{y} \le \lambda$,

or in other words,

x^\star = \begin{cases} (\vec{a}^\top \vec{y} - \lambda)/\|\vec{a}\|_2^2, & \text{if } \vec{a}^\top \vec{y} > \lambda \\ (\vec{a}^\top \vec{y} + \lambda)/\|\vec{a}\|_2^2, & \text{if } \vec{a}^\top \vec{y} < -\lambda \\ 0, & \text{if } -\lambda \le \vec{a}^\top \vec{y} \le \lambda. \end{cases}    (9.41)

As a function of ~a> ~y , the solution x? = x?LASSO looks like:

Figure 9.2: The plot of the function which maps $\vec{a}^\top \vec{y} \mapsto x^\star_{\mathrm{LASSO}}$, where the latter is the solution to our scalar LASSO
problem. When $\vec{a}^\top \vec{y} \in [-\lambda, \lambda]$, we have x⋆ = 0. The function $\vec{a}^\top \vec{y} \mapsto x^\star$ is continuous, yet not differentiable at $\vec{a}^\top \vec{y} = \pm\lambda$.

If we plot the least squares solution in red on the same graph, it has the same nonzero slope as the LASSO solution,
and looks like this:

Figure 9.3: In red, we add the plot of the function which maps $\vec{a}^\top \vec{y} \mapsto x^\star_{\mathrm{LS}}$, where the latter is the solution to our scalar
least squares problem. Note that the LASSO solution (in blue) is always closer to zero than the least squares solution, and is set
directly to zero when $\vec{a}^\top \vec{y} \in [-\lambda, \lambda]$.

This illustrates a concept called soft thresholding: in the regime where the least squares solution x?LS is already
close to zero, x?LASSO becomes exactly zero. Meanwhile, ridge regression does not do this: x?RR = 0 if and only if
~a> ~y = 0, which is exactly when the unregularized least squares solution itself is zero. This fundamental difference is
why the solutions to LASSO regression tend to be sparse, i.e., have many entries set to 0.
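The case analysis (9.41) is exactly a soft-thresholding map, which is easy to implement. A minimal sketch follows (my own code, not from the reader; plain Python):

def soft_threshold(aTy: float, lam: float, a_norm_sq: float) -> float:
    """Scalar LASSO solution x* as a function of a^T y; see Equation (9.41)."""
    if aTy > lam:
        return (aTy - lam) / a_norm_sq
    if aTy < -lam:
        return (aTy + lam) / a_norm_sq
    return 0.0

# Reproduces the flat region of Figure 9.2 around zero:
for v in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(v, soft_threshold(v, lam=2.0, a_norm_sq=1.0))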

9.4 Geometry of LASSO Regression


In this section we introduce a geometric description of LASSO. The geometry relies on a crucial theorem which unveils
a deep connection between regularization and constrained optimization for convex problems. We state the result here;
the proof uses duality theory and is left to homework.

Theorem 211
Let $f_0 : \mathbb{R}^n \to \mathbb{R}$ be strictly convex and such that $\lim_{t \to \infty} f_0(\vec{x}_t) = \infty$ for all sequences $(\vec{x}_t)_{t=0}^\infty$ such that
$\lim_{t \to \infty} \|\vec{x}_t\|_2 = \infty$,ᵃ and let $R : \mathbb{R}^n \to \mathbb{R}_+$ be convex and take non-negative values. Further suppose that there
exists $\vec{x}_0 \in \mathbb{R}^n$ such that $R(\vec{x}_0) = 0$.
For λ ≥ 0 and k ≥ 0, let R(λ) and C(k) be the sets of solutions to the “regularized” and “constrained” programs:

\mathcal{R}(\lambda) \doteq \operatorname*{argmin}_{\vec{x} \in \mathbb{R}^n} \{ f_0(\vec{x}) + \lambda R(\vec{x}) \}    (9.42)
\mathcal{C}(k) \doteq \operatorname*{argmin}_{\vec{x} \in \mathbb{R}^n, \, R(\vec{x}) \le k} f_0(\vec{x}).    (9.43)

Then:

(a) for every λ ≥ 0 there exists k ≥ 0 such that R(λ) = C(k); and

(b) for every k > 0 there exists λ ≥ 0 such that R(λ) = C(k).

ᵃThis assumption is called “coercivity”.

This shows that in some sense, regularized convex problems are equivalent to constrained convex problems; and
in this equivalence, the regularizer for the regularized problem shapes the constraint set of the constrained problem.
In particular, regularized least squares ($f_0(\vec{x}) = \|A\vec{x} - \vec{y}\|_2^2$) with A of full column rank is equivalent to constrained least
squares (with the same f0 ).

Now, we sketch the feasible sets and level sets of the objective function for the constrained problems corresponding
to both ridge regression and LASSO regression.
Figure 9.4: Geometric differences between LASSO and ridge regression. On the left side, the blue diamond depicts the feasible
region for an `1 -norm constraint such as k~xk1 ≤ t, while the circle on the right side is the feasible region for an `2 -norm constraint
such as k~xk22 ≤ t. On both graphs, the red line is a level set of our objective function; specifically, the minimal level set that still
intersects the feasible region. The intersection of this level set with the feasible region is the solution to our constrained problem
and thus to an equivalent regularized problem.

Note how with the `1 -norm constraint, the intersection of the feasible region with the minimal level set is more likely
to be at a corner of the feasible region, which is a point where some coordinates are set exactly to zero. Meanwhile,
with the `2 -norm constraint, the intersection can be at an arbitrary point on the circle (or sphere in higher dimensions),
and likely isn’t at a corner. This is why LASSO induces sparsity in ~x, due to the distinctive corners we saw earlier in its
norm ball. Meanwhile, although ridge regression compresses ~x?RR to be smaller, it doesn’t necessarily induce sparsity
in ~x?RR .

Chapter 10

Advanced Descent Methods

Relevant sections of the textbooks:

• [1] Chapter 11.

• [2] Section 12.5.

10.1 Coordinate Descent


Recall from Chapter 6 the general idea of descent-based methods for solving optimization problems. These methods
are based on the process of starting with some initial guess ~x(0) ∈ Rn , then generating a sequence of refined guesses
~x(1) , ~x(2) , ~x(3) , . . . using the general update rule

~x(t+1) = ~x(t) + η~v (t) (10.1)

for some search direction ~v (t) and step size η. In Chapter 6 we covered the gradient descent method, which uses the
negative gradient of the function as the search direction. In this chapter we will revisit descent-based optimization methods and
introduce alternative update rules.
In this section we will introduce coordinate descent, a class of descent-based algorithms that finds a minimizer of
multivariate functions by iteratively minimizing it along one direction at a time. Consider the unconstrained convex
optimization problem
p^\star = \min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}),    (10.2)

with the optimization variable

\vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.    (10.3)
Before we introduce the algorithm, we introduce some notation. For indices i and j, we introduce the notation ~xi:j =
(xi , xi+1 , . . . , xj−1 , xj ) ∈ Rj−i+1 to be the entries of ~x between indices i and j (inclusive on both ends).
Given an initial guess ~x(0) , for t ≥ 0 the coordinate descent algorithm updates the iterate ~x(t) by sequentially
minimizing the function f (~x) with respect to each coordinate, namely using the update rule

x_i^{(t+1)} \in \operatorname*{argmin}_{x_i \in \mathbb{R}} f(\vec{x}_{1:i-1}^{(t+1)}, x_i, \vec{x}_{i+1:n}^{(t)}).    (10.4)


Namely, we perform the update

x_1^{(t+1)} \in \operatorname*{argmin}_{x_1 \in \mathbb{R}} f(x_1, \vec{x}_{2:n}^{(t)})    (10.5)
x_2^{(t+1)} \in \operatorname*{argmin}_{x_2 \in \mathbb{R}} f(x_1^{(t+1)}, x_2, \vec{x}_{3:n}^{(t)})    (10.6)
\vdots    (10.7)
x_n^{(t+1)} \in \operatorname*{argmin}_{x_n \in \mathbb{R}} f(\vec{x}_{1:n-1}^{(t+1)}, x_n).    (10.8)

This is a sequential process: after finding the minimizer along the i-th coordinate (i.e., $x_i^{(t+1)}$), we use its value
when minimizing subsequent coordinates. We also note that the order of the coordinates is arbitrary. We formalize this
update in the following algorithm.

Algorithm 6 CoordinateDescent
1: function CoordinateDescent(f, ~x(0) , T )
2:     for t = 0, 1, . . . , T − 1 do
3:         for i = 1, . . . , n do
4:             $x_i^{(t+1)} \leftarrow \operatorname*{argmin}_{x_i \in \mathbb{R}} f(\vec{x}_{1:i-1}^{(t+1)}, x_i, \vec{x}_{i+1:n}^{(t)})$.
5:         end for
6:     end for
7:     return ~x(T)
8: end function

The algorithm breaks down the difficult multivariate optimization problem into a sequence of simpler univariate
optimization problems.
We first want to discuss the issue of well-posedness of the algorithm. We know that any of the argmins used may
not exist, in which case the algorithm is not well-defined, and so we cannot even think about its behavior or convergence.
Nevertheless, in a large class of problems which have many different characterizations, the argmins are well-defined.
We say in this case that the coordinate descent algorithm is well-posed.
We now want to address the question of convergence. It is not obvious that minimizing the function f (~x) can be
achieved by minimizing along each direction separately. In fact, the algorithm is not guaranteed to converge to an
optimal solution for general convex functions. However, under some additional assumptions on the function, we can
guarantee convergence. To build an intuition for what additional assumptions are needed we consider the following
question. Let f (~x) be a convex differentiable function. Suppose that x?i ∈ argminxi ∈R f (x?1:i−1 , xi , x?i+1:n ) for all
i. Can we conclude that ~x? is a global minimizer of f (~x)? The answer to this question is yes. We can prove this by
recalling the first order optimality condition for unconstrained convex functions and the definition of partial derivatives.
If ~x? is a minimizer of f (~x) along the direction ~ei then we have

∂f ?
(~x ) = 0. (10.9)
∂xi

If this is true for all i then ∇f (~x? ) = ~0, implying that ~x? is a global minimizer for f . This discussion forms a proof of
the following theorem, which is Theorem 12.4 in [2].


Theorem 212 (Convergence of Coordinate Descent for Differentiable Convex Functions)


Let f : Rn → R be a differentiable convex function which is separately strictly convex in each argument. That
is, suppose that for each i, and each fixed ~x1:i−1 and ~xi+1:n , the function xi 7→ f (~x1:i−1 , xi , ~xi+1:n ) is strictly
convex. If the coordinate descent algorithm is well-posed, and the unconstrained minimization problem

\min_{\vec{x} \in \mathbb{R}^n} f(\vec{x})    (10.10)

has a solution, then the sequence of iterates ~x(0) , ~x(1) , . . . generated by the coordinate descent algorithm converges
to an optimal solution to (10.10).

The coordinate descent algorithm may not converge to an optimal solution for general non-differentiable functions,
even if they are convex. However, we can still prove that coordinate descent converges for a special class of functions
of the form
f(\vec{x}) = g(\vec{x}) + \sum_{i=1}^{n} h_i(x_i),    (10.11)
where g : Rn → R is convex and differentiable, and each hi : R → R is convex (but not necessarily differentiable). This
form includes various `1 regularization problems (such as LASSO regression) which have a separable non-differentiable
component. The provable convergence of coordinate descent algorithm makes it an attractive choice for this class of
problems.

Example 213. In this example we will consider the LASSO regression problem and examine how the coordinate descent
algorithm can be applied to solve it. Note that the LASSO objective follows the form described in (10.11). For A ∈
Rm×n which has columns ~a1 , . . . , ~an ∈ Rm , and ~y ∈ Rm , we consider the LASSO objective

f(\vec{x}) = \frac{1}{2} \|A\vec{x} - \vec{y}\|_2^2 + \lambda \|\vec{x}\|_1.    (10.12)
We aim to use coordinate descent to minimize this function. Let ~x(0) be the initial guess. Then we perform the
coordinate descent update by solving the following optimization problem:

x_i^{(t+1)} = \operatorname*{argmin}_{x_i \in \mathbb{R}} f(\vec{x}_{1:i-1}^{(t+1)}, x_i, \vec{x}_{i+1:n}^{(t)}).    (10.13)

Each of these optimization problems will be solved similarly to what we did in Section 9.3. For notational clarity, let
us instead solve the more generic problem

x_i^\star \in \operatorname*{argmin}_{x_i \in \mathbb{R}} f(\vec{x}_{1:i-1}, x_i, \vec{x}_{i+1:n}).    (10.14)

Then we have that $\frac{\partial f}{\partial x_i}(\vec{x}_{1:i-1}, x_i^\star, \vec{x}_{i+1:n})$ is either 0 or undefined. Thus we compute each partial derivative of f .
Indeed, we have
\frac{\partial f}{\partial x_i}(\vec{x}) = \frac{1}{2} \frac{\partial}{\partial x_i} \|A\vec{x} - \vec{y}\|_2^2 + \lambda \frac{\partial}{\partial x_i} \|\vec{x}\|_1    (10.15)
= \frac{1}{2} \vec{e}_i^\top \nabla \|A\vec{x} - \vec{y}\|_2^2 + \lambda \frac{\partial}{\partial x_i} \sum_{j=1}^{n} |x_j|    (10.16)
= \vec{e}_i^\top A^\top (A\vec{x} - \vec{y}) + \lambda \frac{\partial}{\partial x_i} |x_i|    (10.17)
= \vec{a}_i^\top (A\vec{x} - \vec{y}) + \lambda \begin{cases} \operatorname{sgn}(x_i), & \text{if } x_i \ne 0 \\ \text{undefined}, & \text{if } x_i = 0. \end{cases}    (10.18)


We now introduce the additional notation Ai:j ∈ Rm×(j−i+1) as the sub-matrix of A whose columns are the ith through
j th columns of A (inclusive). Using this notation, we can simplify the first term as

\vec{a}_i^\top (A\vec{x} - \vec{y}) = \vec{a}_i^\top A\vec{x} - \vec{a}_i^\top \vec{y}    (10.19)
= \vec{a}_i^\top \Big( \sum_{j=1}^{n} \vec{a}_j x_j - \vec{y} \Big)    (10.20)
= \|\vec{a}_i\|_2^2 x_i + \vec{a}_i^\top \Big( \sum_{\substack{j=1 \\ j \ne i}}^{n} \vec{a}_j x_j - \vec{y} \Big)    (10.21)
= \|\vec{a}_i\|_2^2 x_i + \vec{a}_i^\top (A_{1:i-1} \vec{x}_{1:i-1} + A_{i+1:n} \vec{x}_{i+1:n} - \vec{y}).    (10.22)
Now there are two cases depending on the value of x?i .

Case 1. Suppose $x_i^\star \ne 0$. Then we have by the optimality condition that

0 = \frac{\partial f}{\partial x_i}(\vec{x}_{1:i-1}, x_i^\star, \vec{x}_{i+1:n})    (10.23)
= \|\vec{a}_i\|_2^2 x_i^\star + \vec{a}_i^\top (A_{1:i-1}\vec{x}_{1:i-1} + A_{i+1:n}\vec{x}_{i+1:n} - \vec{y}) + \lambda \operatorname{sgn}(x_i^\star)    (10.24)
\implies x_i^\star = \frac{1}{\|\vec{a}_i\|_2^2} \Big( \vec{a}_i^\top [\vec{y} - A_{1:i-1}\vec{x}_{1:i-1} - A_{i+1:n}\vec{x}_{i+1:n}] - \lambda \operatorname{sgn}(x_i^\star) \Big).    (10.25)

Thus in particular we have

x_i^\star > 0 \iff \vec{a}_i^\top (\vec{y} - A_{1:i-1}\vec{x}_{1:i-1} - A_{i+1:n}\vec{x}_{i+1:n}) > \lambda,    (10.26)

and the corresponding relation

x_i^\star < 0 \iff \vec{a}_i^\top (\vec{y} - A_{1:i-1}\vec{x}_{1:i-1} - A_{i+1:n}\vec{x}_{i+1:n}) < -\lambda.    (10.27)

Case 2. By contrapositive, we have

-\lambda \le \vec{a}_i^\top (\vec{y} - A_{1:i-1}\vec{x}_{1:i-1} - A_{i+1:n}\vec{x}_{i+1:n}) \le \lambda \iff x_i^\star = 0.    (10.28)

Therefore we have derived a closed-form coordinate descent update:

\vec{a}_i^\top \big( \vec{y} - A_{1:i-1}\vec{x}_{1:i-1}^{(t+1)} - A_{i+1:n}\vec{x}_{i+1:n}^{(t)} \big) > \lambda \implies x_i^{(t+1)} = \frac{1}{\|\vec{a}_i\|_2^2} \Big( \vec{a}_i^\top \big[ \vec{y} - A_{1:i-1}\vec{x}_{1:i-1}^{(t+1)} - A_{i+1:n}\vec{x}_{i+1:n}^{(t)} \big] - \lambda \Big)    (10.29)
\vec{a}_i^\top \big( \vec{y} - A_{1:i-1}\vec{x}_{1:i-1}^{(t+1)} - A_{i+1:n}\vec{x}_{i+1:n}^{(t)} \big) < -\lambda \implies x_i^{(t+1)} = \frac{1}{\|\vec{a}_i\|_2^2} \Big( \vec{a}_i^\top \big[ \vec{y} - A_{1:i-1}\vec{x}_{1:i-1}^{(t+1)} - A_{i+1:n}\vec{x}_{i+1:n}^{(t)} \big] + \lambda \Big)    (10.30)
\Big| \vec{a}_i^\top \big( \vec{y} - A_{1:i-1}\vec{x}_{1:i-1}^{(t+1)} - A_{i+1:n}\vec{x}_{i+1:n}^{(t)} \big) \Big| \le \lambda \implies x_i^{(t+1)} = 0.    (10.31)

10.2 Newton’s Method


Let us consider the unconstrained optimization problem

p^\star = \min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}).    (10.32)


Recall that in the gradient descent algorithm, we assumed that the objective function f (~x) is differentiable. Further-
more, we assumed that at every point ~x ∈ Rn we can compute f (~x) as well as ∇f (~x). Here, we make the additional
assumption that f (~x) is twice differentiable and that we can compute the Hessian ∇2 f (~x). We wish to use the Hessian
to choose a good search direction and accelerate convergence. Optimization algorithms that utilize second derivatives
(e.g. the Hessian) are called second-order methods.
One of the most famous second-order methods is Newton’s method. Newton’s method is based on the following
idea for minimizing strictly-convex functions with positive definite Hessians: first, start with an initial guess ~x(0) . Then
in each iteration t = 1, 2, 3, . . ., approximate the objective function with its second-order Taylor approximation around
the point ~x(t) . The minimizer of this quadratic approximation is then chosen as the next iterate ~x(t+1) .
More formally, let us assume that f is strictly convex and twice-differentiable with positive definite Hessian at ~x(t) ,
and let us write the second-order Taylor approximation of the function f (~x) around the point ~x(t) . We obtain
\hat{f}_2(\vec{x}; \vec{x}^{(t)}) = f(\vec{x}^{(t)}) + [\nabla f(\vec{x}^{(t)})]^\top (\vec{x} - \vec{x}^{(t)}) + \frac{1}{2} (\vec{x} - \vec{x}^{(t)})^\top [\nabla^2 f(\vec{x}^{(t)})] (\vec{x} - \vec{x}^{(t)}).    (10.33)
Since the Hessian $\nabla^2 f(\vec{x}^{(t)}) \succ 0$, we can solve the problem

\min_{\vec{x} \in \mathbb{R}^n} \hat{f}_2(\vec{x}; \vec{x}^{(t)}),    (10.34)

which is a convex quadratic program, using our (by now) standard techniques. Setting the gradient (in ~x) to ~0, we obtain

\vec{0} = \nabla \hat{f}_2(\vec{x}^\star; \vec{x}^{(t)})    (10.35)
= \nabla f(\vec{x}^{(t)}) + [\nabla^2 f(\vec{x}^{(t)})](\vec{x}^\star - \vec{x}^{(t)})    (10.36)
\implies \vec{x}^\star = \vec{x}^{(t)} - [\nabla^2 f(\vec{x}^{(t)})]^{-1} [\nabla f(\vec{x}^{(t)})].    (10.37)

This gives the Newton’s method update rule:

~x(t+1) = ~x(t) − [∇2 f (~x(t) )]−1 [∇f (~x(t) )]. (10.38)

The formal algorithm, detailed in Algorithm 7, just repeats this iteration.

Algorithm 7 Newton's Method
1: function NewtonMethod(f, ~x(0) , T )
2:     for t = 0, 1, . . . , T − 1 do
3:         $\vec{x}^{(t+1)} \leftarrow \vec{x}^{(t)} - [\nabla^2 f(\vec{x}^{(t)})]^{-1} [\nabla f(\vec{x}^{(t)})]$
4:     end for
5:     return ~x(T)
6: end function

We call the vector $[\nabla^2 f(\vec{x}^{(t)})]^{-1} [\nabla f(\vec{x}^{(t)})]$ the Newton direction. Here, we do not choose a step-size η. Instead,
we take a full step in the Newton direction towards the minimizer of the quadratic approximation of the objective func-
tion. This is the basic version of Newton's method; it is not guaranteed to converge in general. To achieve convergence,
we can introduce a step-size η > 0 to the Newton update, obtaining the so-called damped Newton's method, which has
the iteration

\vec{x}^{(t+1)} = \vec{x}^{(t)} - \eta [\nabla^2 f(\vec{x}^{(t)})]^{-1} [\nabla f(\vec{x}^{(t)})].    (10.39)

A full algorithm is very similar to Algorithm 7 and is provided in Algorithm 8.


Algorithm 8 Damped Newton's Method
1: function DampedNewtonMethod(f, ~x(0) , η, T )
2:     for t = 0, 1, . . . , T − 1 do
3:         $\vec{x}^{(t+1)} \leftarrow \vec{x}^{(t)} - \eta [\nabla^2 f(\vec{x}^{(t)})]^{-1} [\nabla f(\vec{x}^{(t)})]$
4:     end for
5:     return ~x(T)
6: end function
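To make the iteration concrete, here is a minimal sketch of the (damped) Newton update (10.39) on a classic smooth strictly convex test function; the specific f below is my own choice, not from the reader, and plain NumPy is assumed.

import numpy as np

def f(v):
    x, y = v
    return np.exp(x + 3*y - 0.1) + np.exp(x - 3*y - 0.1) + np.exp(-x - 0.1)

def grad(v):
    x, y = v
    e1, e2, e3 = np.exp(x + 3*y - 0.1), np.exp(x - 3*y - 0.1), np.exp(-x - 0.1)
    return np.array([e1 + e2 - e3, 3*e1 - 3*e2])

def hess(v):
    x, y = v
    e1, e2, e3 = np.exp(x + 3*y - 0.1), np.exp(x - 3*y - 0.1), np.exp(-x - 0.1)
    return np.array([[e1 + e2 + e3, 3*e1 - 3*e2],
                     [3*e1 - 3*e2,  9*e1 + 9*e2]])

v = np.array([-1.0, 1.0])
eta = 1.0                                     # eta = 1 recovers undamped Newton
for t in range(8):
    step = np.linalg.solve(hess(v), grad(v))  # Newton direction (no explicit inverse)
    v = v - eta * step
    print(t, f(v))                            # the objective decreases rapidly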

We will not discuss convergence proofs of Newton’s method in this course; you may use [7] for further reading.
All of our discussion thus far has been under the assumption $\nabla^2 f(\vec{x}^{(t)}) \succ 0$. If the Hessian is not positive definite,
one may adapt the algorithm accordingly, forming a new class of methods called quasi-Newton methods. Discussion
of quasi-Newton methods is out of scope of the course.
Finally, we discuss the algorithmic complexity of Newton’s method. In every iteration, we need to compute and
invert the Hessian ∇2 f (~x(t) ) to obtain the search direction. This is much more expensive than computing the gradient
∇f (~x(t) ), which is used both in the gradient descent method and in Newton’s method. However, this expensive step
is not without justification; in many convex optimization problems, Newton’s method can be shown to converge to the
optimal solution much faster (i.e., in fewer iterations) than gradient descent.

10.3 Newton’s Method with Linear Equality Constraints


Our derivation of Newton’s method can be used to handle equality-constrained optimization problems. Let A ∈ Rm×n
and ~y ∈ Rm , and let f : Rn → R be a twice-differentiable strictly convex function with positive definite Hessian.
Consider the following convex optimization problem:

p^\star = \min_{\vec{x} \in \mathbb{R}^n} f(\vec{x})    (10.40)
\text{s.t. } A\vec{x} = \vec{y}.    (10.41)

We will use the same approach as with the unconstrained Newton's method; that is, we will take the second-order Taylor
approximation around $\vec{x}^{(t)}$ and minimize it over the constraint set $\Omega \doteq \{\vec{x} \in \mathbb{R}^n : A\vec{x} = \vec{y}\}$. This method gives the
following constrained convex quadratic program:

\min_{\vec{x} \in \mathbb{R}^n} \ f(\vec{x}^{(t)}) + [\nabla f(\vec{x}^{(t)})]^\top (\vec{x} - \vec{x}^{(t)}) + \frac{1}{2} (\vec{x} - \vec{x}^{(t)})^\top [\nabla^2 f(\vec{x}^{(t)})] (\vec{x} - \vec{x}^{(t)})    (10.42)
\text{s.t. } A\vec{x} = \vec{y}.    (10.43)

Note that the quadratic program is convex and, if the original problem is feasible (i.e., Ω is nonempty), strong
duality holds by Slater's condition. Thus, we can solve this QP by solving the KKT conditions, as they are necessary
and sufficient for global optimality. We begin by writing the Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ associated with this
quadratic program, which is defined as

L(\vec{x}, \vec{\nu}) = f(\vec{x}^{(t)}) + [\nabla f(\vec{x}^{(t)})]^\top (\vec{x} - \vec{x}^{(t)}) + \frac{1}{2} (\vec{x} - \vec{x}^{(t)})^\top [\nabla^2 f(\vec{x}^{(t)})] (\vec{x} - \vec{x}^{(t)}) + \vec{\nu}^\top (A\vec{x} - \vec{y}).    (10.44)
Suppose that (~x? , ~ν ? ) are globally optimal for the constrained quadratic program. Then they must satisfy the KKT
conditions, which are:

• Primal feasibility:
A~x? = ~y . (10.45)


• Stationarity/first-order condition:

\vec{0} = \nabla_{\vec{x}} L(\vec{x}^\star, \vec{\nu}^\star)    (10.46)
= \nabla f(\vec{x}^{(t)}) + [\nabla^2 f(\vec{x}^{(t)})](\vec{x}^\star - \vec{x}^{(t)}) + A^\top \vec{\nu}^\star.    (10.47)

Let us define a vector ~v (t) = ~x? − ~x(t) . Since ~x(t) is feasible, we have A~x(t) = ~y . Thus we have

A\vec{v}^{(t)} = A(\vec{x}^\star - \vec{x}^{(t)})    (10.48)
= A\vec{x}^\star - A\vec{x}^{(t)}    (10.49)
= \vec{y} - \vec{y}    (10.50)
= \vec{0}.    (10.51)

Thus, if we write the system in terms of ~v (t) instead of ~x? , we have the system of equations

~0 = ∇f (~x(t) ) + [∇2 f (~x(t) )]~v (t) + A>~ν ? (10.52)


~0 = A~v (t) (10.53)

which can be expressed in matrix form as

\begin{bmatrix} \nabla^2 f(\vec{x}^{(t)}) & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} \vec{v}^{(t)} \\ \vec{\nu} \end{bmatrix} = \begin{bmatrix} -\nabla f(\vec{x}^{(t)}) \\ 0 \end{bmatrix}.    (10.54)

After solving this system of equations for ~v (t) , our update rule becomes

~x(t+1) = ~x? = ~x(t) + ~v (t) , (10.55)

which is equivalent to setting the new iterate as the minimizer of the constrained QP. The formal iteration is given in
Algorithm 9.

Algorithm 9 Newton's Method with Linear Equality Constraints
1: function EqualityConstrainedNewtonMethod(f, ~x(0) , T, A, ~y )
2:     for t = 0, 1, . . . , T − 1 do
3:         Solve the system $\begin{bmatrix} \nabla^2 f(\vec{x}^{(t)}) & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} \vec{v}^{(t)} \\ \vec{\nu} \end{bmatrix} = \begin{bmatrix} -\nabla f(\vec{x}^{(t)}) \\ 0 \end{bmatrix}$ for $\vec{v}^{(t)}$
4:         ~x(t+1) ← ~x(t) + ~v(t)
5:     end for
6:     return ~x(T)
7: end function

There also exist damped versions of this algorithm, but their analysis is out of scope of the course.
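A minimal sketch of one equality-constrained Newton step follows (my own toy problem; NumPy assumed): we assemble and solve the KKT system (10.54). Since the objective below is a strictly convex quadratic, a single step lands exactly on the constrained optimum.

import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2
Q = rng.standard_normal((n, n))
Q = Q @ Q.T + n * np.eye(n)                 # positive definite Hessian
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

f_grad = lambda x: Q @ x                    # f(x) = (1/2) x^T Q x
f_hess = lambda x: Q

x = np.linalg.lstsq(A, y, rcond=None)[0]    # a feasible starting point (A x = y)

KKT = np.block([[f_hess(x), A.T],
                [A, np.zeros((m, m))]])
rhs = np.concatenate([-f_grad(x), np.zeros(m)])
sol = np.linalg.solve(KKT, rhs)
x = x + sol[:n]                             # Newton step; sol[n:] holds nu
print("constraint residual:", np.linalg.norm(A @ x - y))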

10.4 (OPTIONAL) Interior Point Method


In the previous section we introduced Newton’s method, which allows us to solve convex optimization problems with
simple linear equality constraints. In this section, we will build on top of Newton’s method, introducing a new class

© UCB EECS 127/227AT, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 186
EECS 127/227AT Course Reader 10.4. (OPTIONAL) Interior Point Method 2024-04-27 21:08:09-07:00

of algorithms to handle convex optimization problems with inequality constraints. Precisely, we will introduce the
interior point method, which allows us to solve convex optimization problems of the following form

p^\star = \min_{\vec{x} \in \mathbb{R}^n} f_0(\vec{x})    (10.56)
\text{s.t. } f_i(\vec{x}) \le 0, \ \forall i = 1, 2, \dots, m,    (10.57)
A\vec{x} = \vec{y},    (10.58)

where f0 , f1 , . . . , fm are all convex twice-differentiable functions. Interior point methods (IPM) are a class of algorithms
which solve the problem (10.56) by solving a sequence of convex optimization problems with linear constraints
using Newton's method. The key idea used in IPM is the barrier function, which we introduce next.

10.4.1 Barrier Functions


Consider the problem (10.56). Our goal is to eliminate the inequality constraints and convert the problem to an equality-constrained
problem to which Newton's method can be applied. One way to do so is to augment the inequality constraints
to the objective using indicator functions, as we did when developing our theory of duality. More precisely,
consider the following problem, with only equality constraints, which we denote by P:

\text{problem } \mathcal{P}: \quad p^\star = \min_{\vec{x} \in \mathbb{R}^n} f_0(\vec{x}) + \sum_{i=1}^{m} I(f_i(\vec{x}))    (10.59)
\text{s.t. } A\vec{x} = \vec{y},    (10.60)

where $I : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is given by

I(z) = \begin{cases} 0, & \text{if } z \le 0 \\ +\infty, & \text{if } z > 0. \end{cases}    (10.61)

This gives us an optimization problem with only linear equality constraints that is equivalent to the original optimiza-
tion problem (10.56) (i.e., they have the same solution). However, introducing the indicator function now makes the
objective function non-differentiable so we can no longer apply Newton’s method to solve this problem. To overcome
this problem, we will instead approximate the indicator function with a differentiable function φ, which we call a barrier
function.
There are several choices for φ, i.e., good approximations for the indicator function, but they must all have something
in common. Namely, φ should be a convex increasing function on $\mathbb{R}_{--}$, such that $\lim_{z \nearrow 0} \varphi(z) = +\infty$, just like the
indicator function I. There are many candidate functions that satisfy these criteria. One of the most used barrier
functions, which we will introduce here, is the logarithmic barrier function, which, for some α > 0, takes the form

\varphi_\alpha(z) = -\frac{1}{\alpha} \log(-z).    (10.62)
The parameter α controls the accuracy of the approximation — as α grows larger, the logarithmic barrier function
becomes a better and better approximation to the indicator function.
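A tiny scalar illustration of this (my own example; SciPy assumed): for the problem min x subject to x ≥ 1, i.e., f1(x) = 1 − x ≤ 0, the barrier problem min_x x − (1/α) log(x − 1) has solution 1 + 1/α, which approaches the true constrained optimum x⋆ = 1 as α grows.

import numpy as np
from scipy.optimize import minimize_scalar

for alpha in [1, 10, 100, 1000]:
    obj = lambda x: x - np.log(x - 1) / alpha     # barrier objective, valid for x > 1
    res = minimize_scalar(obj, bounds=(1 + 1e-9, 10), method="bounded")
    print(alpha, res.x)                           # tends to x* = 1 as alpha grows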
Using this logarithmic barrier, we can define an approximate optimization problem $\hat{\mathcal{P}}(\alpha)$ to P by the following:

\text{problem } \hat{\mathcal{P}}(\alpha): \quad \hat{p}_\alpha^\star = \min_{\vec{x} \in \mathbb{R}^n} f_0(\vec{x}) + \sum_{i=1}^{m} \varphi_\alpha(f_i(\vec{x}))    (10.63)
\text{s.t. } A\vec{x} = \vec{y}.    (10.64)

This optimization problem has a convex twice-differentiable objective function and linear equality constraints, so New-
ton’s method can be applied.


10.4.2 Barrier Method


The problem $\hat{\mathcal{P}}(\alpha)$ is just an approximation of the original problem P. In particular, they do not necessarily have the
same solution. Thus, the natural question to ask is how close the solutions of P and $\hat{\mathcal{P}}(\alpha)$ are. The choice of the
parameter α is crucial: small values of α mean that φα does not behave like an indicator function, so $\hat{\mathcal{P}}(\alpha)$ and
P are totally different in general, whereas large values of α give much better approximations and thus allow us to expect much
better solutions. Indeed, it is possible to show, with some additional assumptions [7], more precise bounds
relating the distance between the solutions of P and the solutions of $\hat{\mathcal{P}}(\alpha)$, as a function of α. We do not go into this analysis
here, but its main takeaway is that large values of α give us good approximate solutions.
One natural question is: why don't we just take the largest possible α and get a near-perfect approximation to the
indicator functions, and thus to the original program P? This actually may not be optimal from an algorithmic standpoint,
as solving problem $\hat{\mathcal{P}}(\alpha)$ using Newton's method becomes difficult for large values of α. This is because Newton's
method relies on computing and inverting the Hessian of the objective, and with large values of α the Hessian changes
rapidly for points near the boundary of the feasible set of the original problem — even points which may be near the solution.
The barrier method overcomes this problem by solving the approximate problem for a relatively small value of α
to obtain an approximate solution ~x⋆(α). The algorithm then refines this approximate solution by using it as an initial
guess to solve the approximate problem with a larger value of α. This way, when the algorithm attempts to solve the
difficult approximate problems $\hat{\mathcal{P}}(\alpha)$ (as the value of α increases), it does so with a good initial guess. Newton's method
converges extremely fast near the optimal solution, so this procedure greatly improves the convergence of Newton's
method even when α is very large.¹
When solving constrained optimization problems, it is necessary for most algorithms to start with a feasible initial-
ization ~x(0) . This is particularly important for the barrier method; in fact, it is necessary to start at a strictly feasible
initial guess, so that the value of the approximate problem is finite and the derivative of the objective function is defined.
In Algorithm 10 we present one instance of the barrier method in which the value of α is updated by scaling it with
some constant µ > 1.

Algorithm 10 Barrier Method
1: function BarrierMethod(f0 , f1 , . . . , fm , ~x(0) , α(0) , A, ~y , µ, T )
2:     for t = 1, . . . , T do
3:         ~x(t) ← Solution of $\hat{\mathcal{P}}(\alpha^{(t-1)})$ using Newton's method (Algorithm 9) starting at ~x(t−1)
4:         α(t) ← µα(t−1)
5:     end for
6:     return ~x(T)
7: end function

¹This easy-to-hard solution process is one instance of a more general algorithmic paradigm called homotopy continuation or homotopy analysis,
which is used to precisely simulate very unstable dynamical systems.

Chapter 11

Applications

In this chapter, we will discuss some applications of the theory we have developed so far in this class. Our explo-
ration will include deterministic control and the linear-quadratic regulator, stochastic control and the policy gradient
algorithm, and support vector machines.

11.1 Deterministic Control and Linear-Quadratic Regulator


The first application we’ll discuss is in the area of deterministic control.
Although control is usually thought of as related to motion, it can apply to any dynamical system where we can
understand the state and its dependence on time based on some function such as ~xt+1 = f~(~xt , ~ut ) which takes in a
state ~xt and a control input ~ut and outputs the next state ~xt+1 . The goal of control is to choose the control inputs ~ut to
achieve some desired behavior of the system.

Example 214 (Vertical Rocket System). For example, we can consider a vertical rocket. Our goal is to maximize its
height by time T . Let x1,t denote its height, x2,t denote its vertical velocity, and x3,t denote the weight of the rocket
(which we will approximate as the weight of the fuel), all at time t. The weight of the fuel will go down over time,
and that can affect the rocket’s velocity. The forces pushing the rocket down are drag and gravity, and the upward force
comes from the rocket’s thrust.
The forces at time t have the following expressions. (Here ẋ is the time-derivative of x.)

• Inertial force: x3,t · ẍ1,t = x3,t · ẋ2,t .

• Drag: $c_D \, \rho(x_{1,t}) \, x_{2,t}^2$ where $c_D$ is a numerical constant and $\rho(x_1)$ is the density of the air at height $x_1$.

• Gravity: gx3,t where g is a numerical constant.

• Thrust: cH ẋ3,t where cH is a numerical constant.

Given the input ut = −ẋ3,t , we want to write our dynamical system in the standard form for continuous dynamics, i.e.,
~x˙ t = f~(~xt , ut ). From the force expressions, we have

\dot{x}_{1,t} = x_{2,t}    (11.1)
\dot{x}_{2,t} = \frac{1}{x_{3,t}} \big[ -c_D \rho(x_{1,t}) x_{2,t}^2 + c_H \dot{x}_{3,t} - g x_{3,t} \big]    (11.2)
\dot{x}_{3,t} = -u_t    (11.3)


Now we have a dynamical system of the form ~x˙ t = f~(~xt , ut ). Recall that we want to maximize the height of the rocket,
x1,T , at some terminal time T > 0, given an initial condition ~x0 = ξ~ for some ξ~ ∈ R3 . We can set up an optimization
problem to determine the (ut )t∈[0,T ] which accomplishes this:

\max_{(u_t)_{t \in [0,T]}} \ x_{1,T}
\text{s.t. } \dot{\vec{x}}_t = \vec{f}(\vec{x}_t, u_t), \ \forall t \in [0, T],
\vec{x}_0 = \vec{\xi}.

In practice, we can use a numerical solver to solve this problem; conceptually one can solve many such systems by hand
using the so-called calculus of variations, which is a sort of infinite-dimensional optimization paradigm which we do
not explore more in this course.
Even though our dynamics are continuous-time and complex, we can discretize and locally linearize our system,
obtaining an approximate system which is discrete linear time-invariant, i.e., of the form

~xk+1 = A~xk + B~uk , ∀k ∈ {0, 1, . . . , K − 1}. (11.4)

where A and B are matrices of the appropriate sizes. This is a linear system because it is linear in $(\vec{x}_k)_{k=0}^{K}$ and $(\vec{u}_k)_{k=0}^{K-1}$,
which we conceptually think of as very long (but finite-length) vectors. It is time-invariant because the matrices A and
B do not depend on the discrete-time index k.
This particular type of control problem is ubiquitous within science and engineering, and thus deserves a special
name — it is called the linear quadratic regulator problem.

Let’s formally define our linear quadratic regulator system.

Definition 215 (Linear Quadratic Regulator (LQR))


Let K be a positive integer. Let $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, $Q, Q_f \in \mathbb{S}^n_+$, and $R \in \mathbb{S}^m_{++}$. Let $\vec{\xi} \in \mathbb{R}^n$. A linear
quadratic regulator is an optimization problem of the form:

\min_{\substack{(\vec{x}_k)_{k=0}^{K} \in \mathbb{R}^{(K+1)n} \\ (\vec{u}_k)_{k=0}^{K-1} \in \mathbb{R}^{Km}}} \ \frac{1}{2} \sum_{k=0}^{K-1} \big( \vec{x}_k^\top Q \vec{x}_k + \vec{u}_k^\top R \vec{u}_k \big) + \frac{1}{2} \vec{x}_K^\top Q_f \vec{x}_K    (11.5)
\text{s.t. } \vec{x}_{k+1} = A\vec{x}_k + B\vec{u}_k, \ \forall k \in \{0, \dots, K-1\},
\vec{x}_0 = \vec{\xi}.

Equation (11.5) is actually a quadratic program, since the objective is quadratic in the variables $(\vec{x}_k)_{k=0}^{K}$ and $(\vec{u}_k)_{k=0}^{K-1}$,
and the constraints are linear. We are able to solve this with the methods we already know. However, this problem is
very large, having (K + 1)n + Km variables (as well as K + 1 constraints), and for n, m, K large this quickly becomes
intractable. Our saving grace is that this problem has significant additional structure. The traditional way to solve it is
using the dynamic programming approach and Bellman's equation. However, in this section, we will solve it using the
KKT conditions and the Riccati equation.

Theorem 216 (Optimal Control in LQR is Linear)


An optimal control for the LQR problem (11.5) is linear in the state, i.e.,

\vec{u}_k^\star = -R^{-1} B^\top (I + P_{k+1} B R^{-1} B^\top)^{-1} P_{k+1} A \vec{x}_k^\star, \ \forall k \in \{0, \dots, K-1\},    (11.6)

where the Pk are given by the recurrence relation

P_K = Q_f    (11.7)
P_k = A^\top (I + P_{k+1} B R^{-1} B^\top)^{-1} P_{k+1} A + Q, \ \forall k \in \{0, \dots, K-1\}.    (11.8)

Proof. The Lagrangian of Equation (11.5) is


L\big( (\vec{x}_k)_{k=0}^{K}, (\vec{u}_k)_{k=0}^{K-1}, (\vec{\lambda}_k)_{k=1}^{K}, \vec{\nu} \big) = \frac{1}{2} \sum_{k=0}^{K-1} \big( \vec{x}_k^\top Q \vec{x}_k + \vec{u}_k^\top R \vec{u}_k \big) + \frac{1}{2} \vec{x}_K^\top Q_f \vec{x}_K    (11.9)
+ \sum_{k=0}^{K-1} \vec{\lambda}_{k+1}^\top (A\vec{x}_k + B\vec{u}_k - \vec{x}_{k+1}) + \vec{\nu}^\top (\vec{x}_0 - \vec{\xi}).    (11.10)

Since the objective function of Equation (11.5) is convex and the constraints are affine, Slater’s condition holds auto-
matically. Thus strong duality holds. Since Equation (11.5) is a convex problem and strong duality holds, the KKT
conditions are necessary and sufficient for global optimality.
Let $\big( (\vec{x}_k^\star)_{k=0}^{K}, (\vec{u}_k^\star)_{k=0}^{K-1}, (\vec{\lambda}_k^\star)_{k=1}^{K}, \vec{\nu}^\star \big)$ be globally optimal primal and dual variables for Equation (11.5), hence
satisfying the KKT conditions. Then we have

1. Primal feasibility:

(a) $\vec{x}_{k+1}^\star = A\vec{x}_k^\star + B\vec{u}_k^\star$ for all k ∈ {0, . . . , K − 1}.

(b) $\vec{x}_0^\star = \vec{\xi}$.

2. Dual feasibility: N/A since all constraints are equality constraints.

3. Complementary slackness: N/A since all constraints are equality constraints.

4. Lagrangian stationarity (writing $L^\star$ for the Lagrangian evaluated at the optimal primal and dual variables):

(a) $\vec{0} = \nabla_{\vec{x}_0} L^\star = Q\vec{x}_0^\star + A^\top \vec{\lambda}_1^\star + \vec{\nu}^\star$.

(b) $\vec{0} = \nabla_{\vec{x}_k} L^\star = Q\vec{x}_k^\star + A^\top \vec{\lambda}_{k+1}^\star - \vec{\lambda}_k^\star$ for all k ∈ {1, . . . , K − 1}.

(c) $\vec{0} = \nabla_{\vec{x}_K} L^\star = Q_f \vec{x}_K^\star - \vec{\lambda}_K^\star$.

(d) $\vec{0} = \nabla_{\vec{u}_k} L^\star = R\vec{u}_k^\star + B^\top \vec{\lambda}_{k+1}^\star$ for all k ∈ {0, . . . , K − 1}.

The stationarity conditions give update dynamics for $\vec{\lambda}_k^\star$, i.e.,

\vec{\lambda}_k^\star = A^\top \vec{\lambda}_{k+1}^\star + Q\vec{x}_k^\star, \ \forall k \in \{1, \dots, K-1\},    (11.11)
\vec{\lambda}_K^\star = Q_f \vec{x}_K^\star.    (11.12)

However, these update dynamics go backwards in time from k = K to k = 1. This is in contrast to the update
dynamics for ~xk , which go forwards in time.
Once we find the $\vec{\lambda}_k^\star$, we are able to find the $\vec{u}_k^\star$, since a stationarity equation gives

\vec{u}_k^\star = -R^{-1} B^\top \vec{\lambda}_{k+1}^\star, \ \forall k \in \{0, \dots, K-1\}.    (11.13)

This motivates solving for ~λ?k , which we do via (backwards) induction. Our induction hypothesis is of the form

~λ? = Pk ~x? ,
k k (11.14)


which we aim to show for $k \in \{1, \ldots, K\}$. The base case is $k = K$, whence we have $\vec{\lambda}_K^\star = Q_f\vec{x}_K^\star$, so that $P_K = Q_f$.
For the inductive step, for $k \in \{1, \ldots, K-1\}$ we have

$$\vec{\lambda}_{k+1}^\star = P_{k+1}\vec{x}_{k+1}^\star \tag{11.15}$$
$$= P_{k+1}(A\vec{x}_k^\star + B\vec{u}_k^\star) \tag{11.16}$$
$$= P_{k+1}(A\vec{x}_k^\star - BR^{-1}B^\top\vec{\lambda}_{k+1}^\star) \tag{11.17}$$
$$= P_{k+1}A\vec{x}_k^\star - P_{k+1}BR^{-1}B^\top\vec{\lambda}_{k+1}^\star \tag{11.18}$$
$$(I + P_{k+1}BR^{-1}B^\top)\vec{\lambda}_{k+1}^\star = P_{k+1}A\vec{x}_k^\star \tag{11.19}$$
$$\vec{\lambda}_{k+1}^\star = (I + P_{k+1}BR^{-1}B^\top)^{-1}P_{k+1}A\vec{x}_k^\star. \tag{11.20}$$

Substituting this into the backwards dynamics (11.11),

$$\vec{\lambda}_k^\star = A^\top\vec{\lambda}_{k+1}^\star + Q\vec{x}_k^\star \tag{11.21}$$
$$= A^\top\big((I + P_{k+1}BR^{-1}B^\top)^{-1}P_{k+1}A\vec{x}_k^\star\big) + Q\vec{x}_k^\star \tag{11.22}$$
$$= \underbrace{\big(A^\top(I + P_{k+1}BR^{-1}B^\top)^{-1}P_{k+1}A + Q\big)}_{= P_k}\vec{x}_k^\star \tag{11.23}$$
$$= P_k\vec{x}_k^\star. \tag{11.24}$$

Above, to show that (11.20) follows from (11.19), we need to confirm that $I + P_{k+1}BR^{-1}B^\top$ is invertible. To show this, we will explicitly construct its inverse in terms of the inverses of other matrices which we know are invertible. First, observe that $R + B^\top P_{k+1}B$ is invertible, since it is symmetric positive definite: $R$ is symmetric positive definite and $B^\top P_{k+1}B$ is symmetric positive semidefinite, so their sum is symmetric positive definite. Next, we claim that $(I + P_{k+1}BR^{-1}B^\top)^{-1} = I - P_{k+1}B(R + B^\top P_{k+1}B)^{-1}B^\top$. This follows from the Sherman-Morrison-Woodbury identity, but for the sake of completeness we prove it here. Indeed,

$$(I + P_{k+1}BR^{-1}B^\top)(I - P_{k+1}B(R + B^\top P_{k+1}B)^{-1}B^\top) \tag{11.25}$$
$$= I + P_{k+1}BR^{-1}B^\top - (I + P_{k+1}BR^{-1}B^\top)P_{k+1}B(R + B^\top P_{k+1}B)^{-1}B^\top \tag{11.26}$$
$$= I + P_{k+1}BR^{-1}B^\top - P_{k+1}B(R + B^\top P_{k+1}B)^{-1}B^\top - P_{k+1}BR^{-1}B^\top P_{k+1}B(R + B^\top P_{k+1}B)^{-1}B^\top \tag{11.27}$$
$$= I + P_{k+1}BR^{-1}B^\top - P_{k+1}BR^{-1}R(R + B^\top P_{k+1}B)^{-1}B^\top - P_{k+1}BR^{-1}B^\top P_{k+1}B(R + B^\top P_{k+1}B)^{-1}B^\top \tag{11.28}$$
$$= I + P_{k+1}BR^{-1}B^\top - P_{k+1}BR^{-1}(R + B^\top P_{k+1}B)(R + B^\top P_{k+1}B)^{-1}B^\top \tag{11.29}$$
$$= I + P_{k+1}BR^{-1}B^\top - P_{k+1}BR^{-1}B^\top \tag{11.30}$$
$$= I. \tag{11.31}$$

This confirms that $I + P_{k+1}BR^{-1}B^\top$ is invertible. Thus we have for $k \in \{0, \ldots, K-1\}$ that

$$\vec{u}_k^\star = -R^{-1}B^\top\vec{\lambda}_{k+1}^\star \tag{11.32}$$
$$= -R^{-1}B^\top(I + P_{k+1}BR^{-1}B^\top)^{-1}P_{k+1}A\vec{x}_k^\star \tag{11.33}$$

as desired.
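As a quick sanity check, the inversion identity used in the proof can be verified numerically on randomly generated matrices; this sketch (with arbitrary sizes) just confirms the algebra above.

```python
# Check (I + P B R^{-1} B^T)^{-1} = I - P B (R + B^T P B)^{-1} B^T numerically,
# with P symmetric PSD (the role of P_{k+1}) and R symmetric positive definite.
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2
M = rng.standard_normal((n, n))
P = M @ M.T                        # symmetric positive semidefinite
B = rng.standard_normal((n, m))
R = np.eye(m) + np.ones((m, m))    # symmetric positive definite

lhs = np.linalg.inv(np.eye(n) + P @ B @ np.linalg.inv(R) @ B.T)
rhs = np.eye(n) - P @ B @ np.linalg.inv(R + B.T @ P @ B) @ B.T
print(np.allclose(lhs, rhs))       # expect: True
```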

Now, note that we have written our recurrence for $P_k$ in terms of $P_{k+1}$. Starting from $P_K$, we can compute each $P_k$ in backwards order, completely offline (i.e., without processing any iterations of the forward system). Once we have the $P_k$, we can then compute each $\vec{x}_k$, $\vec{\lambda}_k$, and $\vec{u}_k$ directly. Therefore, we can solve the LQR problem using only matrix multiplications and inversions, as in the sketch below.
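Here is a sketch of that procedure: a backward pass computing the $P_k$ via (11.7)-(11.8), then a forward rollout applying the optimal control (11.6). The matrices `A`, `B`, `Q`, `R`, `Qf` and the initial state `xi` are assumed given, e.g., as generated in the cvxpy sketch earlier in this section.

```python
# Riccati-based LQR: backward pass (11.7)-(11.8), forward rollout via (11.6).
import numpy as np

def lqr_riccati(A, B, Q, R, Qf, xi, K):
    n = A.shape[0]
    Rinv = np.linalg.inv(R)
    # Backward pass: P_K = Qf; P_k = A^T (I + P_{k+1} B R^{-1} B^T)^{-1} P_{k+1} A + Q.
    P = [None] * (K + 1)
    P[K] = Qf
    for k in range(K - 1, -1, -1):
        Mk = np.linalg.inv(np.eye(n) + P[k + 1] @ B @ Rinv @ B.T) @ P[k + 1]
        P[k] = A.T @ Mk @ A + Q
    # Forward rollout: u_k = -R^{-1} B^T (I + P_{k+1} B R^{-1} B^T)^{-1} P_{k+1} A x_k.
    xs, us = [xi], []
    for k in range(K):
        Kk = Rinv @ B.T @ np.linalg.inv(np.eye(n) + P[k + 1] @ B @ Rinv @ B.T) @ P[k + 1] @ A
        us.append(-Kk @ xs[k])
        xs.append(A @ xs[k] + B @ us[k])
    return np.array(xs), np.array(us)
```

On the same problem data, the cost of this rollout should match the optimal value returned by the generic QP solver, up to numerical tolerance.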


11.2 Support Vector Machines


Support vector machines are an application of optimization to the classical machine learning problem of binary clas-
sification. We will see that using optimization and a specific perspective induced by the KKT conditions will allow us
to elegantly solve this classification problem.
Let’s review the basics of classification. Suppose that we have some image, and we want to figure out what kind
of animal it is. For example, we could be classifying it as a cat or a dog. Or, it could be X-ray images, and we want
to figure out if there is a fracture present or not. Nowadays, we use a wide range of techniques such as deep neural
networks to solve these tasks. These more advanced techniques build up from much more fundamental ideas such as
support vector machines.
For the purpose of this class, we will focus on binary classification. In binary classification, we are given some data $\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d$ as well as their corresponding labels $y_1, \ldots, y_n \in \{-1, +1\}$. We want to find a function $f : \mathbb{R}^d \to \{-1, +1\}$ such that $f(\vec{x}_i) = y_i$ for all $i$. We could try solving this problem using least squares, but in many cases the least squares solution is suboptimal at classification tasks, being too susceptible to outliers. Here, we will explore a different technique, called support vector machines (SVMs).
The basic premise of support vector machines is that we want to find an affine function $g_{\vec{w},b} : \vec{x} \mapsto \vec{w}^\top\vec{x} - b$ which (approximately) separates the data into their respective classes, i.e., $y_i = \operatorname{sgn}(g_{\vec{w},b}(\vec{x}_i))$ for all $i$, so $f = \operatorname{sgn} \circ g_{\vec{w},b}$. Geometrically, $g_{\vec{w},b}$ can be thought of as a separating hyperplane for the two classes, in fact corresponding to the hyperplane $H_{\vec{w},b} \doteq \{\vec{x} \in \mathbb{R}^d \mid \vec{w}^\top\vec{x} = b\}$.
If $y_i = 1$ then we would like $g_{\vec{w},b}(\vec{x}_i) > 0$. If $y_i = -1$ then we would like $g_{\vec{w},b}(\vec{x}_i) < 0$. We can express these two desires as always wanting $y_i g_{\vec{w},b}(\vec{x}_i) > 0$, combining the two cases into one inequality.

Hard-Margin SVM

First, let us work through the case where the data are strictly linearly separable; that is, there exist some $\vec{w}, b$ such that $y_i g_{\vec{w},b}(\vec{x}_i) > 0$ for all $i$. This hypothesis is unreasonable in many cases, but it allows us to build intuition for the problem. We will later remove this hypothesis, but many tools remain the same.
In this case, we would like to find one $(\vec{w}, b)$ pair for which $y_i g_{\vec{w},b}(\vec{x}_i) > 0$ for all $i$. However, there can be many such pairs. Thus, we have to determine which one we would like to pick.
One possible heuristic¹ is to pick a pair $(\vec{w}, b)$ with the largest margin, that is, the largest distance between the hyperplane $H_{\vec{w},b}$ and the closest point to it. Thus we want to solve the problem:

$$\max_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \min_{i \in \{1, \ldots, n\}} \operatorname{dist}(H_{\vec{w},b}, \vec{x}_i) \tag{11.34}$$
$$\text{s.t.} \quad y_i g_{\vec{w},b}(\vec{x}_i) > 0, \quad \forall i \in \{1, \ldots, n\}.$$

This problem is called the hard-margin SVM, so named because we do not allow any misclassification of the training points: no training data can "cross the margin."
Unfortunately, the problem in Equation (11.34) seems intractable. Let us go about simplifying it. First, the distance between a point $\vec{x}$ and $H_{\vec{w},b}$ is defined as

$$\operatorname{dist}(H_{\vec{w},b}, \vec{x}) \doteq \frac{|\vec{w}^\top\vec{x} - b|}{\|\vec{w}\|_2} = \frac{|g_{\vec{w},b}(\vec{x})|}{\|\vec{w}\|_2}. \tag{11.35}$$
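For example (a purely illustrative calculation): with $\vec{w} = (3, 4)^\top$, $b = 5$, and $\vec{x} = (1, 1)^\top$, we have $g_{\vec{w},b}(\vec{x}) = 3 + 4 - 5 = 2$ and $\|\vec{w}\|_2 = 5$, so $\operatorname{dist}(H_{\vec{w},b}, \vec{x}) = 2/5$.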
¹This heuristic makes sense from the perspective of robustness and generalization. We haven't sampled all the data that exists, but we hypothesize that the data we haven't sampled is geometrically close to the data we have sampled. We want to make sure that the largest fraction possible of all data (sampled and unsampled) is correctly classified by the $\vec{w}$ and $b$ we learn. Thus, to capture the most unsampled data possible, we require the classification to be as robust to these geometric deviations as possible.


Thus² we obtain the slightly simplified, yet equivalent, problem

$$\max_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \frac{1}{\|\vec{w}\|_2} \min_{i \in \{1, \ldots, n\}} |g_{\vec{w},b}(\vec{x}_i)| \tag{11.36}$$
$$\text{s.t.} \quad y_i g_{\vec{w},b}(\vec{x}_i) > 0, \quad \forall i \in \{1, \ldots, n\}.$$

Adding the real-valued slack variable $s$ to denote the minimizing component, we have

$$\max_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}, \, s \in \mathbb{R}_{++}} \; \frac{s}{\|\vec{w}\|_2} \tag{11.37}$$
$$\text{s.t.} \quad y_i g_{\vec{w},b}(\vec{x}_i) > 0, \quad \forall i \in \{1, \ldots, n\},$$
$$\phantom{\text{s.t.} \quad} |g_{\vec{w},b}(\vec{x}_i)| \geq s, \quad \forall i \in \{1, \ldots, n\}.$$

Since $y_i \in \{-1, +1\}$ and $s > 0$, we have for any scalar $u_i$ that ($|u_i| \geq s$ and $y_i u_i > 0$) if and only if $y_i u_i \geq s$. This relation certainly holds for $u_i = g_{\vec{w},b}(\vec{x}_i)$, and so the above problem simplifies to the following:

$$\max_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}, \, s \in \mathbb{R}_{++}} \; \frac{s}{\|\vec{w}\|_2} \tag{11.38}$$
$$\text{s.t.} \quad y_i g_{\vec{w},b}(\vec{x}_i) \geq s, \quad \forall i \in \{1, \ldots, n\}.$$

By the form of $g_{\vec{w},b}(\vec{x}) = \vec{w}^\top\vec{x} - b$, the above problem is equivalent to the following problem:

$$\max_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}, \, s \in \mathbb{R}_{++}} \; \frac{1}{\|\vec{w}/s\|_2} \tag{11.39}$$
$$\text{s.t.} \quad y_i g_{\vec{w}/s, \, b/s}(\vec{x}_i) \geq 1, \quad \forall i \in \{1, \ldots, n\}.$$

Since the above problem depends only on the values of $\vec{w}/s$ and $b/s$, where $\vec{w}/s$ is still an optimization variable allowed to take arbitrary values in $\mathbb{R}^d$ and $b/s$ is still an optimization variable allowed to take arbitrary values in $\mathbb{R}$, we may as well remove $s$ by setting it to $1$, obtaining
$$\max_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \frac{1}{\|\vec{w}\|_2} \tag{11.40}$$
$$\text{s.t.} \quad y_i g_{\vec{w},b}(\vec{x}_i) \geq 1, \quad \forall i \in \{1, \ldots, n\}.$$

Finally, we can obtain an equivalent problem which is a convex minimization problem by applying the transformation $x \mapsto \frac{1}{2x^2}$, which is monotonically decreasing on $\mathbb{R}_{++}$, to the objective function, whence (after expanding the form of $g_{\vec{w},b}$) we obtain the problem

$$\min_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \frac{1}{2}\|\vec{w}\|_2^2 \tag{11.41}$$
$$\text{s.t.} \quad y_i(\vec{w}^\top\vec{x}_i - b) \geq 1, \quad \forall i \in \{1, \ldots, n\}.$$

The problem (11.41) is by far the most common and simplified form of the hard-margin SVM problem. Moreover, it is a quadratic program in $(\vec{w}, b)$, which we know how to solve.
²It may seem natural to make $g_{\vec{w},b}$ more complex and thus obtain a recipe for learning more complicated types of classifiers, but note that the above simplification no longer holds if we do, so the final problem may be much more complex and not efficiently solvable.
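As a numerical illustration, here is a minimal cvxpy sketch of the QP (11.41) on synthetic data. The data-generation choices are assumptions made purely for illustration; two well-separated Gaussian clusters are strictly linearly separable with overwhelming probability.

```python
# A minimal sketch: hard-margin SVM (11.41) via cvxpy on synthetic data.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 2
X = np.vstack([rng.standard_normal((n // 2, d)) + 3,    # class +1 cluster
               rng.standard_normal((n // 2, d)) - 3])   # class -1 cluster
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

w_hm = cp.Variable(d)
b_hm = cp.Variable()
margin = [cp.multiply(y, X @ w_hm - b_hm) >= 1]   # y_i (w^T x_i - b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w_hm)), margin)
prob.solve()
```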


Soft-Margin SVM

Most real-world data isn't actually linearly separable. In that case, the hard-margin SVM problem is simply infeasible. But we may still want to try to separate the points, even if our classifier is not perfect on the training data. To develop this relaxed SVM problem, let us consider the hard-margin SVM:

$$\min_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \frac{1}{2}\|\vec{w}\|_2^2 \tag{11.42}$$
$$\text{s.t.} \quad y_i g_{\vec{w},b}(\vec{x}_i) \geq 1, \quad \forall i \in \{1, \ldots, n\}.$$

It is a constrained optimization problem, so we can reformulate it into an unconstrained problem using an indicator-function penalty:

$$\min_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \frac{1}{2}\|\vec{w}\|_2^2 + \sum_{i=1}^{n} \ell_{0\text{-}\infty}\big(1 - y_i g_{\vec{w},b}(\vec{x}_i)\big) \tag{11.43}$$

where the $\ell_{0\text{-}\infty}$ loss is defined by

$$\ell_{0\text{-}\infty}(z) = \begin{cases} 0, & z \leq 0 \\ \infty, & z > 0. \end{cases} \tag{11.44}$$
This corresponds to infinitely penalizing any violation of the margin. In order to relax the penalties for margin violation, we can instead consider finite penalties which increase as the degree of violation of the margin increases. Namely, we can consider constant multiples of the so-called hinge loss

$$\ell_{\text{hinge}}(z) = \begin{cases} 0, & z \leq 0 \\ z, & z > 0 \end{cases} \;=\; \max\{z, 0\}. \tag{11.45}$$

This relaxation yields the following program:

$$\min_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \frac{1}{2}\|\vec{w}\|_2^2 + C\sum_{i=1}^{n} \ell_{\text{hinge}}\big(1 - y_i g_{\vec{w},b}(\vec{x}_i)\big), \tag{11.46}$$

or, using the definition of the hinge loss, the equivalent program

$$\min_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \frac{1}{2}\|\vec{w}\|_2^2 + C\sum_{i=1}^{n} \max\{1 - y_i g_{\vec{w},b}(\vec{x}_i), 0\}. \tag{11.47}$$

We can introduce slack variables $\vec{\xi}$ to model this maximization term and make the problem differentiable. After expanding the form of $g_{\vec{w},b}$, we have the program

$$\min_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}, \, \vec{\xi} \in \mathbb{R}^n} \; \frac{1}{2}\|\vec{w}\|_2^2 + C\sum_{i=1}^{n} \xi_i \tag{11.48}$$
$$\text{s.t.} \quad \xi_i \geq 0, \quad \forall i \in \{1, \ldots, n\}, \tag{11.49}$$
$$\phantom{\text{s.t.} \quad} \xi_i \geq 1 - y_i(\vec{w}^\top\vec{x}_i - b), \quad \forall i \in \{1, \ldots, n\}. \tag{11.50}$$

This is the usual form of the soft-margin SVM problem, and its solutions are parameterized by $C$. In accordance with our derivation, if $C$ is large then we allow only small violations of the margin, because the second term becomes a better approximation of the sum of indicators in Equation (11.43). If $C$ is small, then we allow larger violations of the margin.
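Continuing the hard-margin cvxpy sketch from earlier (reusing `X`, `y`, `n`, `d` from there; the data need not be separable now), the program (11.48)-(11.50) can be written directly. The choice $C = 1$ is an arbitrary illustration.

```python
# A minimal sketch: soft-margin SVM (11.48)-(11.50) via cvxpy.
import cvxpy as cp

C = 1.0
w_sm = cp.Variable(d)
b_sm = cp.Variable()
slack = cp.Variable(n)                                  # the slack variables xi_i
cons = [slack >= 0,                                     # (11.49)
        slack >= 1 - cp.multiply(y, X @ w_sm - b_sm)]   # (11.50)
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w_sm) + C * cp.sum(slack)), cons)
prob.solve()
```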


Depending on the perspective, either the first term or the second term of the loss can be viewed as the regularizer.
The first term works to maximize the margin, while the second term works to penalize the margin violations. Both
work together to form an approximate maximum-margin classifier.

KKT Conditions

It turns out that we get significant insight into the solutions of the hard-margin and soft-margin SVMs using the KKT conditions. First, let us consider the hard-margin SVM:

$$\min_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \frac{1}{2}\|\vec{w}\|_2^2 \tag{11.51}$$
$$\text{s.t.} \quad y_i(\vec{w}^\top\vec{x}_i - b) \geq 1, \quad \forall i \in \{1, \ldots, n\}.$$

The Lagrangian is

$$L(\vec{w}, b, \vec{\lambda}) = \frac{1}{2}\|\vec{w}\|_2^2 + \sum_{i=1}^{n} \lambda_i\big(1 - y_i(\vec{w}^\top\vec{x}_i - b)\big). \tag{11.52}$$

This problem is convex, and the constraints are affine, so if the problem is feasible (i.e., the data are strictly linearly separable), then Slater's condition holds, so strong duality holds. Since the problem is convex and strong duality holds, the KKT conditions are necessary and sufficient for optimality.
Suppose that $(\vec{w}^\star, b^\star, \vec{\lambda}^\star)$ satisfy the KKT conditions. Then:

1. Primal feasibility: $y_i\big((\vec{w}^\star)^\top\vec{x}_i - b^\star\big) \geq 1$ for all $i \in \{1, \ldots, n\}$.

2. Dual feasibility: $\lambda_i^\star \geq 0$ for all $i \in \{1, \ldots, n\}$.

3. Stationarity:

   (a) $\vec{0} = \nabla_{\vec{w}} L(\vec{w}^\star, b^\star, \vec{\lambda}^\star) = \vec{w}^\star - \sum_{i=1}^{n} \lambda_i^\star y_i \vec{x}_i$.

   (b) $0 = \nabla_b L(\vec{w}^\star, b^\star, \vec{\lambda}^\star) = \sum_{i=1}^{n} \lambda_i^\star y_i$.

4. Complementary slackness: $\lambda_i^\star\big(1 - y_i((\vec{w}^\star)^\top\vec{x}_i - b^\star)\big) = 0$ for all $i \in \{1, \ldots, n\}$.

We say that $(\vec{x}_i, y_i)$ is a support vector if $\lambda_i^\star > 0$. To see why, we consider the following cases (illustrated numerically in the sketch after this list):

1. If $\lambda_i^\star = 0$ then, since $\vec{w}^\star = \sum_{j=1}^{n} \lambda_j^\star y_j \vec{x}_j$, we see that $(\vec{x}_i, y_i)$ does not contribute to the optimal solution.

2. If $\lambda_i^\star > 0$ then by complementary slackness we have $y_i\big((\vec{w}^\star)^\top\vec{x}_i - b^\star\big) = 1$. Thus $\vec{x}_i$ is on the margin of the SVM. Furthermore, since $\vec{w}^\star = \sum_{j=1}^{n} \lambda_j^\star y_j \vec{x}_j$, we see that $(\vec{x}_i, y_i)$ does contribute to the optimal solution.
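Continuing the hard-margin cvxpy sketch from earlier, the solver's dual variables play the role of the $\lambda_i^\star$, so both the support vectors and the stationarity condition 3(a) can be inspected directly; the tolerances below are illustrative.

```python
# Continuing the hard-margin sketch: read off support vectors from the duals.
lam = margin[0].dual_value            # lambda_i^*, one per data point
support = lam > 1e-6                  # support vectors: lambda_i^* > 0
w_rec = (lam * y) @ X                 # stationarity 3(a): sum_i lambda_i^* y_i x_i
print(support.sum(), np.allclose(w_rec, w_hm.value, atol=1e-4))
```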

Now let us consider the analogous notion for soft-margin SVMs. Consider the soft-margin SVM problem:

$$\min_{\vec{w} \in \mathbb{R}^d, \, b \in \mathbb{R}, \, \vec{\xi} \in \mathbb{R}^n} \; \frac{1}{2}\|\vec{w}\|_2^2 + C\sum_{i=1}^{n} \xi_i \tag{11.53}$$
$$\text{s.t.} \quad \xi_i \geq 0, \quad \forall i \in \{1, \ldots, n\}, \tag{11.54}$$
$$\phantom{\text{s.t.} \quad} \xi_i \geq 1 - y_i(\vec{w}^\top\vec{x}_i - b), \quad \forall i \in \{1, \ldots, n\}. \tag{11.55}$$

It has Lagrangian

$$L(\vec{w}, b, \vec{\xi}, \vec{\lambda}, \vec{\mu}) = \frac{1}{2}\|\vec{w}\|_2^2 + C\sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \mu_i(-\xi_i) + \sum_{i=1}^{n} \lambda_i\big(1 - y_i(\vec{w}^\top\vec{x}_i - b) - \xi_i\big). \tag{11.56}$$


This problem is convex, and the constraints are affine; one can show that it is always feasible, so Slater's condition holds and strong duality holds. Since the problem is convex with strong duality, the KKT conditions are both necessary and sufficient for optimality.
Suppose that $(\vec{w}^\star, b^\star, \vec{\xi}^\star, \vec{\lambda}^\star, \vec{\mu}^\star)$ satisfy the KKT conditions. Then:

1. Primal feasibility: $\xi_i^\star \geq 0$ and $\xi_i^\star \geq 1 - y_i\big((\vec{w}^\star)^\top\vec{x}_i - b^\star\big)$ for all $i \in \{1, \ldots, n\}$.

2. Dual feasibility: $\lambda_i^\star \geq 0$ and $\mu_i^\star \geq 0$ for all $i \in \{1, \ldots, n\}$.

3. Stationarity:

   (a) $\vec{0} = \nabla_{\vec{w}} L(\vec{w}^\star, b^\star, \vec{\xi}^\star, \vec{\lambda}^\star, \vec{\mu}^\star) = \vec{w}^\star - \sum_{i=1}^{n} \lambda_i^\star y_i \vec{x}_i$.

   (b) $0 = \nabla_b L(\vec{w}^\star, b^\star, \vec{\xi}^\star, \vec{\lambda}^\star, \vec{\mu}^\star) = \sum_{i=1}^{n} \lambda_i^\star y_i$.

   (c) $\vec{0} = \nabla_{\vec{\xi}} L(\vec{w}^\star, b^\star, \vec{\xi}^\star, \vec{\lambda}^\star, \vec{\mu}^\star) = C\vec{1} - \vec{\mu}^\star - \vec{\lambda}^\star$.

4. Complementary slackness: $\lambda_i^\star\big(1 - y_i((\vec{w}^\star)^\top\vec{x}_i - b^\star) - \xi_i^\star\big) = 0$ and $\mu_i^\star \xi_i^\star = 0$ for all $i \in \{1, \ldots, n\}$.

Similarly to the case of the hard-margin SVM, we say that $(\vec{x}_i, y_i)$ is a support vector if $\lambda_i^\star > 0$. To see why, we consider the following cases (checked numerically in the sketch after this list):

1. If $\lambda_i^\star = 0$ then $\mu_i^\star = C$. Thus $\mu_i^\star > 0$, so $\xi_i^\star = 0$. Thus the point $\vec{x}_i$ does not violate the margin. Also, since $\vec{w}^\star = \sum_{j=1}^{n} \lambda_j^\star y_j \vec{x}_j$, we see that $(\vec{x}_i, y_i)$ does not contribute to the optimal solution.

2. If $\lambda_i^\star = C$ then $\mu_i^\star = 0$. Thus we cannot say anything about $\xi_i^\star$. But since $\lambda_i^\star > 0$, the other complementary slackness condition says that $y_i\big((\vec{w}^\star)^\top\vec{x}_i - b^\star\big) = 1 - \xi_i^\star$. Thus $(\vec{x}_i, y_i)$ is either on the margin or violates the margin. Also, since $\vec{w}^\star = \sum_{j=1}^{n} \lambda_j^\star y_j \vec{x}_j$, we see that $(\vec{x}_i, y_i)$ does contribute to the optimal solution.

3. If $\lambda_i^\star \in (0, C)$, then $\mu_i^\star \in (0, C)$ as well. Thus by complementary slackness, we have $\xi_i^\star = 0$. Applying this to the other complementary slackness condition, we have $y_i\big((\vec{w}^\star)^\top\vec{x}_i - b^\star\big) = 1$. Thus $(\vec{x}_i, y_i)$ is exactly on the margin. Also, since $\vec{w}^\star = \sum_{j=1}^{n} \lambda_j^\star y_j \vec{x}_j$, we see that $(\vec{x}_i, y_i)$ does contribute to the optimal solution.

In general, the support vectors contribute to the optimal solution, and they lie on or violate the margin.
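Continuing the soft-margin cvxpy sketch, the three cases can be checked numerically from the solver's duals; the tolerance is illustrative.

```python
# Continuing the soft-margin sketch: classify points by lambda_i^*.
lam = cons[1].dual_value                     # multipliers of constraint (11.50)
tol = 1e-6
not_support = lam <= tol                     # case 1: strictly outside the margin
at_bound = lam >= C - tol                    # case 2: on or violating the margin
on_margin = (lam > tol) & (lam < C - tol)    # case 3: exactly on the margin
print(not_support.sum(), at_bound.sum(), on_margin.sum())
```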
