Lecture 16
This section of the course focuses on solving optimization problems of the form:

$$\min_{x \in K} f(x),$$
where f is a convex function and K is a convex set. Recall that any (differentiable) convex f satisfies the following equivalent inequalities:

$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y) \quad \text{for all } x, y \text{ and } \lambda \in [0,1],$$
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y.$$
Also recall that a convex set K is any set where, for all x, y ∈ K, if z = λx + (1 − λ)y for some λ ∈ [0, 1], then z ∈ K.
$$f(\bar{x}) - f(x^\star) \le \frac{DG}{\sqrt{T}}.$$

So after $T = \frac{4D^2G^2}{\epsilon^2}$ steps, we have $f(\bar{x}) - f(x^\star) \le \frac{\epsilon}{2}$.
The only thing our proof used about f was the gradient norm bound ‖∇f(x_i)‖₂ ≤ G. In particular, f could have differed at every iteration! If we have functions f_0, …, f_T and run gradient descent with updates equal to −η∇f_i(x_i) on iteration i, (3) allows us to obtain the bound:
$$\frac{1}{T}\sum_{i=0}^{T} f_i(x_i) - \frac{1}{T}\sum_{i=0}^{T} f_i(x^\star) \le \frac{DG}{\sqrt{T}}, \qquad (4)$$

for any x^⋆ ∈ K.
This is a regret bound, just like we saw for the experts problem and multiplicative
weights update. I.e. instead of optimizing one fixed function f , suppose our goal is to
output a vector xi at time i, which corresponds to some strategy: e.g. a linear classifier for
spam prediction, or a distribution of funds over stocks, bonds, and other assets.
After playing strategy x_i, we incur some loss f_i(x_i). For example, suppose we receive a data point a_i and want to classify a_i as positive or negative (e.g. yes it's spam, no it's not) by looking at sign(a_i^T x_i). After we play x_i, the true label of a_i is revealed as b_i ∈ {−1, 1} and we pay the convex (hinge) penalty:

$$f_i(x_i) = \max(0,\, 1 - b_i \cdot a_i^T x_i).$$
We update our classifier based on ∇f_i(x_i) to obtain x_{i+1} and proceed. This procedure is referred to as online gradient descent, and the problem of minimizing our cumulative loss Σ_i f_i(x_i) is referred to as online convex optimization.
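To make the procedure concrete, here is a minimal Python sketch of online gradient descent for the hinge-loss spam example above. The projection set (a Euclidean ball of radius D), the step size η, and the data stream interface are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def project_onto_ball(x, radius):
    # Euclidean projection onto K = {x : ||x||_2 <= radius}.
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def hinge_subgradient(x, a, b):
    # A subgradient of f_i(x) = max(0, 1 - b * a^T x).
    return -b * a if b * np.dot(a, x) < 1 else np.zeros_like(a)

def online_gradient_descent(stream, d, D, eta):
    # stream yields (a_i, b_i) pairs one at a time.
    x = np.zeros(d)
    total_loss = 0.0
    for a, b in stream:
        total_loss += max(0.0, 1.0 - b * np.dot(a, x))  # pay f_i(x_i) ...
        g = hinge_subgradient(x, a, b)                  # ... then observe the gradient
        x = project_onto_ball(x - eta * g, D)           # OGD update with projection
    return total_loss, x
```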
The bound of (4) means that our average penalty is no worse than that of the best fixed classifier x^⋆, up to an additive term that shrinks as 1/√T; equivalently, our total regret grows only sublinearly, as √T. This powerful observation, due to [1], has a number of applications, included below (we did not discuss these in lecture).
One natural idea is to run multiplicative weights with one expert per s–t path, but this is an inefficient approach: there may be exponentially many paths connecting s and t in terms of the number of nodes in the graph. So the updates take exponential time and space in each step, and furthermore the algorithm needs too long to converge to the best solution.
Online gradient descent can solve this problem, once we realize that we can describe the set of all distributions x over paths P_{s,t} as a convex set K ⊆ R^{|E|}, defined by O(|E| + |V|) constraints. Then the decision maker's expected loss function would be f_i(x) = w_i^T x. The following formulation of the problem as a convex polytope allows for efficient algorithms with provable regret bounds.
$$\sum_{e=(s,w),\, w\in V} x_e \;=\; \sum_{e=(w,t),\, w\in V} x_e \;=\; 1 \qquad \text{Flow value is 1.}$$
$$\forall w \in V,\ w \neq s, t: \quad \sum_{e=(u,w)} x_e - \sum_{e=(w,u)} x_e = 0 \qquad \text{Flow conservation.}$$
$$\forall e \in E: \quad 0 \le x_e \le 1 \qquad \text{Capacity constraints.}$$
What is the meaning of the decision maker's move being a distribution over paths? It just means a fractional solution. This can be decomposed into a convex combination of paths as in the lecture on approximation algorithms. She picks a random path from this distribution; the expected regret is unchanged.
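As a sanity check on this formulation, the following sketch builds the constraint rows for a small example graph (our own, not from the lecture) and verifies that the indicator vector of an s–t path satisfies them:

```python
import numpy as np

# A tiny directed graph: s = 0, t = 3, edges listed as (u, v).
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
n, m = 4, len(edges)
s, t = 0, 3

# Flow-value constraints: flow out of s = 1 and flow into t = 1.
out_s = np.array([1.0 if u == s else 0.0 for (u, v) in edges])
in_t = np.array([1.0 if v == t else 0.0 for (u, v) in edges])

# Flow conservation at every internal node w: inflow - outflow = 0.
conservation = []
for w in range(n):
    if w in (s, t):
        continue
    row = np.array([(1.0 if v == w else 0.0) - (1.0 if u == w else 0.0)
                    for (u, v) in edges])
    conservation.append(row)

# The indicator vector of the path s -> 1 -> t is a feasible point of K.
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
assert np.isclose(out_s @ x, 1.0) and np.isclose(in_t @ x, 1.0)
assert all(np.isclose(row @ x, 0.0) for row in conservation)
assert np.all((0.0 <= x) & (x <= 1.0))
```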
Some thought shows (confirming conventional wisdom) that it can be very suboptimal
to put all money in a single stock. A strategy that works better in practice is Constant
Rebalanced Portfolio (CRB): decide upon a fixed proportion of money to put into each stock,
and buy/sell individual stocks each day to maintain this proportion.
Example 1. Say there are only two assets, stocks and bonds. One CRB strategy is to split money equally between these two. Notice what this implies: if an asset's price falls, you tend to buy more of it, and if the price rises, you tend to sell it. Thus this strategy roughly implements the age-old advice to “buy low, sell high.” Concretely, suppose the prices each day fluctuate as follows: on odd days the stock's price multiplies by 4/3 while the bond's multiplies by 3/4, and on even days the reverse.
Note that the prices go up and down by the same ratio on alternate days, so money
parked fully in stocks or fully in bonds earns nothing in the long run. (Aside: This kind of
fluctuation is not unusual; it is generally observed that bonds and stocks move in opposite
directions.) And what happens if you split your money equally between these two assets?
Each day it increases by a factor 0.5 × (4/3 + 3/4) = 0.5 × 25/12 ≈ 1.04. Thus your money
grows exponentially!
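A quick numeric check of this example (the 4/3 and 3/4 ratios are the ones above):

```python
stock_ratio, bond_ratio = 4 / 3, 3 / 4

# All-in on one asset: consecutive days cancel out, so wealth is flat.
assert abs(stock_ratio * bond_ratio - 1.0) < 1e-12

# 50-50 rebalanced daily: wealth multiplies by the average ratio each day.
daily_factor = 0.5 * (stock_ratio + bond_ratio)
print(daily_factor)          # 25/24, about 1.0417
print(daily_factor ** 250)   # after ~a trading year: exponential growth
```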
Exercise: Modify the price increases in the above example so that keeping all money
in stocks or bonds alone will cause it to drop exponentially, but the 50-50 CRB increases
money at an exponential rate.
CRB uses a fixed split among n assets, but what is this split? Wouldn’t it be great to
have an angel whisper in our ears on day 1 what this magic split is? Online optimization
is precisely such an angel. Suppose the algorithm uses the vector x^{(t)} at time t; the ith coordinate gives the proportion of money in stock i at the start of the tth day. Let r^{(t)} be the vector of that day's price relatives, so r_i^{(t)} is the ratio of stock i's price at the end of day t to its price at the start. Then the algorithm's wealth increases on day t by a factor r^{(t)} · x^{(t)}. Thus the goal is to find the x^{(t)}'s to maximize the final wealth, which is

$$\prod_t r^{(t)} \cdot x^{(t)}.$$
Taking logs, this becomes

$$\sum_t \log\left(r^{(t)} \cdot x^{(t)}\right). \qquad (5)$$
For any fixed r^{(1)}, r^{(2)}, …, this function happens to be concave, but that is fine since we are interested in maximization. Now we can try to run online gradient descent on this objective.
By Zinkevich's theorem, the quantity in (5) converges to

$$\sum_t \log\left(r^{(t)} \cdot x^\ast\right), \qquad (6)$$
where x∗ is the best money allocation in hindsight.
This analysis needs to assume very little about the r(t) ’s, except a bound on the norm
of the gradient at each step, which translates into a weak condition on price movements. In
the next homework you will apply this simple algorithm on real stock data.
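Here is a hedged sketch of this algorithm in Python: online gradient ascent on (5), with iterates kept on the probability simplex. The simplex-projection routine and the step size are standard choices assumed here, not specified in the lecture.

```python
import numpy as np

def project_onto_simplex(v):
    # Euclidean projection onto {x : x >= 0, sum_i x_i = 1}, via the
    # standard sorting-based routine (an assumed helper, not from lecture).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def online_portfolio(price_relatives, eta=0.1):
    # Online gradient ascent on (5): grad_x log(r . x) = r / (r . x).
    T, n = price_relatives.shape
    x = np.full(n, 1.0 / n)
    log_wealth = 0.0
    for r in price_relatives:
        log_wealth += np.log(r @ x)                      # day's log return
        x = project_onto_simplex(x + eta * r / (r @ x))  # ascent + projection
    return np.exp(log_wealth), x

# Example 1's alternating prices: wealth grows exponentially.
r = np.array([[4/3, 3/4], [3/4, 4/3]] * 100)
print(online_portfolio(r)[0])
```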
Now consider minimizing a function of the form f(x) = Σ_{j=1}^n g_j(x). This is a very typical structure in machine learning, where f(x) sums some convex loss function over n individual data points. We've seen the example of least squares regression:

$$f(x) = \|Ax - b\|_2^2 = \sum_{i=1}^n \left(a_i^T x - b_i\right)^2.$$
Other examples include objective functions for robust function fitting, like ℓ1 regression: f(x) = Σ_{i=1}^n |a_i^T x − b_i|, and objective functions used in linear classification, like the hinge loss: f(x) = Σ_{i=1}^n max(0, 1 − b_i · a_i^T x).
The key observation is that, when f(x) = Σ_{j=1}^n g_j(x), we have ∇f(x) = Σ_{j=1}^n ∇g_j(x). So if we select j uniformly at random, n∇g_j(x) gives an unbiased estimator for ∇f(x). Stochastic gradient descent uses this estimate in place of ∇f(x). The advantage of doing so is that the estimate is much faster to compute, typically saving a factor of n. For example, computing ∇f(x) for any of the objectives we just listed takes O(nd) time, while computing ∇g_j(x) takes O(d) time.
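For instance, for least squares, one stochastic gradient costs a single O(d) inner product (a minimal sketch; names are illustrative):

```python
import numpy as np

def stochastic_grad_least_squares(A, b, x, rng):
    # f(x) = sum_i (a_i^T x - b_i)^2 with g_j(x) = (a_j^T x - b_j)^2,
    # so grad g_j(x) = 2 (a_j^T x - b_j) a_j, computable in O(d) time.
    j = rng.integers(A.shape[0])
    return A.shape[0] * 2.0 * (A[j] @ x - b[j]) * A[j]  # n * grad g_j(x)
```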
Let G′ be an upper bound on ‖n∇g_j(x)‖₂ for all g_j, x.
Stochastic Gradient Descent

Let η = D/(G′√T).
Let x₀ be any point in K.
Repeat for i = 0 to T:
    Choose j_i uniformly from {1, …, n}.
    y_{i+1} ← x_i − η · n∇g_{j_i}(x_i)
    x_{i+1} ← projection of y_{i+1} onto K.
At the end output x̄ = (1/T) Σ_{i=0}^T x_i.
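Putting the pieces together, here is a runnable sketch of this algorithm for least squares over a Euclidean ball K = {x : ‖x‖₂ ≤ D}; the crude bound used for G′ and the synthetic data are assumptions for illustration.

```python
import numpy as np

def sgd_least_squares(A, b, D, T, seed=0):
    # Minimize f(x) = sum_i (a_i^T x - b_i)^2 over K = {x : ||x||_2 <= D}.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Crude bound G' on ||n * grad g_j(x)||_2 over the ball.
    row_norms = np.linalg.norm(A, axis=1)
    G_prime = n * 2 * np.max(row_norms * (row_norms * D + np.abs(b)))
    eta = D / (G_prime * np.sqrt(T))
    x, x_sum = np.zeros(d), np.zeros(d)
    for _ in range(T):
        j = rng.integers(n)
        g = n * 2.0 * (A[j] @ x - b[j]) * A[j]      # unbiased gradient estimate
        y = x - eta * g
        norm = np.linalg.norm(y)
        x = y if norm <= D else y * (D / norm)      # projection onto K
        x_sum += x
    return x_sum / T                                 # the average iterate x-bar

# Tiny usage example on synthetic data.
A = np.random.default_rng(1).normal(size=(100, 5))
b = A @ np.ones(5)
print(np.linalg.norm(A @ sgd_least_squares(A, b, D=5.0, T=50000) - b))
```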
The output of the SGD algorithm is a random variable, so we give a bound on its expected performance. First we use convexity and linearity of expectation:
$$\mathbb{E} f(\bar{x}) - f(x^\star) \;\le\; \frac{1}{T}\sum_{i=0}^{T} \mathbb{E} f(x_i) - f(x^\star) \;\le\; \frac{1}{T}\sum_{i=0}^{T} \mathbb{E}\, \nabla f(x_i)^T (x_i - x^\star).$$
Now, for any x_i, the expectation of n·∇g_{j_i}(x_i) over our random choice of j_i is equal to ∇f(x_i). So the expression above is equivalent to:
$$\frac{1}{T}\sum_{i=0}^{T} \mathbb{E}\, \nabla f(x_i)^T (x_i - x^\star) \;=\; \frac{1}{T}\sum_{i=0}^{T} \mathbb{E}\, n\nabla g_{j_i}(x_i)^T (x_i - x^\star).$$
Finally, for any particular realization of g_{j_1}, …, g_{j_T}, we can use (4) to bound:

$$\frac{1}{T}\sum_{i=0}^{T} \left[\, n\nabla g_{j_i}(x_i)^T x_i - n\nabla g_{j_i}(x_i)^T x^\star \right] \;\le\; \frac{DG'}{\sqrt{T}}.$$
I.e., we let f_i(y) = n∇g_{j_i}(x_i)^T y be the shifting objective function in online gradient descent.
We conclude that Ef(x̄) − f(x^⋆) ≤ DG′/√T, so we obtain ε error if we set T = O(D²G′²/ε²).
How does this bound compare to standard gradient descent? First note that ‖∇f(x)‖₂ ≤ Σ_{i=1}^n ‖∇g_i(x)‖₂ by the triangle inequality. And since Σ_{i=1}^n ‖∇g_i(x)‖₂ ≤ n·max_i ‖∇g_i(x)‖₂, we always have G′ ≥ G. This is expected – in general, using stochastic gradients will slow down the convergence of our algorithm.
However, in many cases this is more than made up for by how much we save in computing gradients. In the examples given above, as long as G′ ≤ √n · G, we get an overall runtime savings by using SGD instead of standard gradient descent.
Problem 2. Given a convex body (i.e., a closed and bounded convex set) K, find some point
x ∈ K or output EMPTY if K = ∅.
Intuitively, we can see that an algorithm for this problem can be used as a black box to minimize a convex function f(x) over a convex set. In particular, if f is convex, then the set {x : f(x) ≤ z} is convex. It follows that K ∩ {x : f(x) ≤ z} is convex. So, we can use an algorithm for Problem 2, run on K ∩ {x : f(x) ≤ z}, to binary search over values of z and find the minimum z (or close to it) such that f(x) ≤ z for some x ∈ K.
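In code, the reduction looks roughly like the following sketch, where `feasible(z)` stands in for a hypothetical algorithm for Problem 2 run on K ∩ {x : f(x) ≤ z}, and lo, hi are assumed bounds on min f:

```python
def minimize_via_feasibility(feasible, lo, hi, eps):
    # feasible(z): returns a point x in K with f(x) <= z, or None if
    # K intersected with {x : f(x) <= z} is empty (Problem 2 as a black box).
    best = None
    while hi - lo > eps:
        z = (lo + hi) / 2
        x = feasible(z)
        if x is None:
            lo = z            # the minimum lies above z
        else:
            best, hi = x, z   # record a point achieving value <= z
    return best               # f(best) is within eps of the minimum
```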
There are a number of details to think about here which we do not have time to cover. For example, we need at least upper and lower bounds on min_{x∈K} f(x) to perform binary search. As suggested in class, it might also be that K ∩ {x : f(x) ≤ z} is not bounded, which violates the input assumption of Problem 2. Nisheeth Vishnoi's lecture notes on the ellipsoid method [2] are a good place to learn about some of these points in more detail.
For now, we restrict our attention to solving Problem 2.
1
This can be a complicated issue, involving many details. E.g. for linear programming the methods we
will look at run in “weakly polynomial time”, meaning time polynomial in the input size and in L, which
is a bound on the maximum number of bits required to specify each entry of our constraint matrix if all
entries are integers specified in binary (i.e., they can’t be specified in floating point).
Figure 1: For this lecture, we restrict our attention to finding a point inside a bounded convex set K that is specified by a separation oracle. (a) Closed and bounded convex set K ⊂ R^d. (b) Unbounded convex set K ⊂ R^d.
Figure 2: Separating hyperplane theorem/Farkas' lemma: between every convex body and a point outside it, there is a hyperplane.
For any point x ∈ R^d and convex body K, one of the following holds:
1. x ∈ K.
2. There is a hyperplane that separates x from K. I.e. there is some c ∈ R^d such that c^T x > c^T y for all y ∈ K.
This claim was proven in our lecture on LP duality. It prompts the following definition:
Definition 1. A Separation Oracle for a convex set K is a procedure which, given a point x, either reports that x ∈ K or returns a vector c ∈ R^d which specifies a hyperplane that separates x from all of K.
Example 2. Consider a polytope

K = {x : Ax ≤ b}.

A separation oracle for K simply needs to compute Ax. If this vector is entrywise ≤ b, then x is in K. Alternatively, if [Ax]_i > b_i, we simply output the ith row of A, a_i, as our separating vector.
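This oracle is a few lines of code (a sketch; the None return convention is ours):

```python
import numpy as np

def polytope_separation_oracle(A, b, x):
    # Returns None if x is in K = {x : Ax <= b}; otherwise returns a
    # row a_i of A with a_i^T x > b_i, which separates x from K
    # (every y in K has a_i^T y <= b_i < a_i^T x).
    violations = A @ x - b
    i = int(np.argmax(violations))
    return None if violations[i] <= 0 else A[i]
```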
Example 3. The set of positive semidefinite matrices includes all symmetric matrices X ∈
Rd×d such that wT Xw ≥ 0 for all w. This is a convex set, which we will discuss more in
Lecture 18.
Unlike the polytope example, the set of PSD matrices is defined by infinitely many linear constraints, of the form Σ_{ij} X_{ij} w_i w_j ≥ 0. We can't check all of these constraints to find a separating hyperplane. However, since a symmetric matrix is PSD if and only if all of its eigenvalues are nonnegative, we can use an eigendecomposition to obtain a separation oracle.
If an input matrix X has any negative eigenvalues, we take an eigenvector a corresponding to one of these negative eigenvalues and return the hyperplane Σ_{ij} X_{ij} a_i a_j = 0. (Note that the a_i's are constants here.)
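Using numpy's eigendecomposition, a sketch of this oracle (the tolerance is an implementation choice of ours):

```python
import numpy as np

def psd_separation_oracle(X, tol=1e-9):
    # Returns None if the symmetric matrix X is PSD; otherwise returns an
    # eigenvector a with a^T X a < 0, which defines the separating
    # hyperplane sum_ij Y_ij a_i a_j = 0 (every PSD Y satisfies >= 0).
    eigvals, eigvecs = np.linalg.eigh(X)   # eigenvalues in ascending order
    if eigvals[0] >= -tol:
        return None
    return eigvecs[:, 0]                   # eigenvector of the most negative eigenvalue
```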
Example 4. Not all simple-to-describe convex sets have efficient (e.g. polynomial time) separation oracles. For example, consider the set of copositive matrices, which includes all (possibly asymmetric) matrices X ∈ R^{d×d} with w^T X w ≥ 0 for all entrywise nonnegative w. While similar to the set of PSD matrices, the separation problem for this set is actually NP-hard (and convex optimization over the set of copositive matrices is NP-hard).
With our general strategy in place, the only problem is to make sure that the algorithm
makes progress at every step. To do so we use the following important lemma:
Lemma 3. The minimum volume ellipsoid surrounding a half ellipsoid (i.e. E_i ∩ H⁺, where H⁺ is a halfspace as above) can be calculated in polynomial time, and

$$Vol(E_{i+1}) \le \left(1 - \frac{1}{2n}\right) Vol(E_i).$$
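For reference, here is a sketch of the standard central-cut update realizing Lemma 3; these closed-form formulas are the textbook ones and were not derived in lecture.

```python
import numpy as np

def ellipsoid_update(c, P, g):
    # One central-cut step for E = {x : (x - c)^T P^{-1} (x - c) <= 1},
    # given a separating vector g, so that K lies in {x : g^T (x - c) <= 0}.
    # Returns the standard minimum-volume ellipsoid containing that half
    # ellipsoid (requires dimension n >= 2).
    n = len(c)
    Pg = P @ g
    gtilde = Pg / np.sqrt(g @ Pg)
    c_new = c - gtilde / (n + 1)
    P_new = (n**2 / (n**2 - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(gtilde, gtilde))
    return c_new, P_new
```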
Figure 4: A few runs of the Ellipsoid method, showing the tiny convex set in blue and the containing ellipsoids. The separating hyperplanes do not pass through the centers of the ellipsoids in this figure, although for the algorithm described here they would.
It’s an interesting exercise to work through a proof of this lemma – you can follow the
proof here: http://people.csail.mit.edu/moitra/docs/6854notes12.pdf.
Thus after T steps the volume of the enclosing ellipsoid has dropped by a factor of (1 − 1/2n)^T ≤ exp(−T/2n). Our starting ellipsoid has volume O(R^n), so in T = O(n log(R^n/r^n)) = O(n² log(R/r)) steps, we can ensure that either we find a point x ∈ K, or

Vol(E_T) ≤ O(r^n).

If this is the case, we can conclude that K is empty, since we assumed that if it wasn't, it at least contained a ball of volume O(r^n).
At first glance Lemma 3 might seem weak – ideally, we might have hoped to cut the volume of our search region in half with each query. We only cut it by a factor of (1 − 1/2n), which costs us an extra n factor in iteration complexity. This loss is an inherent cost of approximating our search region by an ellipsoid.
approximating our search region by an ellipsoid. Recent work on improving the ellipsoid
method seeks to address this issue by maintaining alternative representations of the search
region [3].
References
[1] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proceedings of the International Conference on Machine Learning (ICML), 2003.
[2] Nisheeth K. Vishnoi. Algorithms for Convex Optimization. EPFL lecture notes, 2018.
https://nisheethvishnoi.wordpress.com/convex-optimization/.
[3] Yin Tat Lee, Aaron Sidford, and Sam Chiu-Wai Wong. A Faster Cutting Plane Method and its Implications for Combinatorial and Convex Optimization. FOCS 2015, pages 1049-1065.