
princeton univ. F'18 cos 521: Advanced Algorithm Design

Lecture 16: Separation, optimization, and the ellipsoid method

Lecturer: Christopher Musco

This section of the course focuses on solving optimization problems of the form:

    min_x f(x)  such that  x ∈ K,

where f is a convex function and K is a convex set. Recall that any convex f satisfies the
following equivalent inequalities:

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)   ∀x, y, λ ∈ [0, 1]        (1)

    f(x) − f(y) ≤ ∇f(x)^T (x − y)   ∀x, y.                            (2)

Also recall that a convex set is any set where, for all x, y ∈ K, if z = λx + (1 − λ)y for some
λ ∈ [0, 1], then z ∈ K.
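As a quick numerical sanity check of (1) and (2) — a sketch assuming NumPy, with an
illustrative least squares objective that reappears in Section 3:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 5))
    b = rng.normal(size=20)

    f = lambda x: np.linalg.norm(A @ x - b) ** 2
    grad_f = lambda x: 2 * A.T @ (A @ x - b)   # gradient of ||Ax - b||_2^2

    x, y = rng.normal(size=5), rng.normal(size=5)
    lam = 0.3

    # (1): f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9
    # (2): f(x) - f(y) <= grad_f(x)^T (x - y)
    assert f(x) - f(y) <= grad_f(x) @ (x - y) + 1e-9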

1 Gradient descent recap


Last lecture we analyzed the gradient descent procedure for solving a convex optimization
problem over a convex set.
Gradient Descent for Constrained Optimization
    Let η = D/(G√T).
    Let x_0 be any point in K.
    Repeat for i = 0 to T:
        y_{i+1} ← x_i − η∇f(x_i)
        x_{i+1} ← projection of y_{i+1} onto K.
    At the end output x̄ = (1/T) Σ_{i=0}^T x_i.

D is the diameter of K (or, if our problem is unconstrained, simply an upper bound on
‖x_0 − x*‖). G is an upper bound on the size of f's gradient, i.e. ‖∇f(x)‖_2 ≤ G, ∀x.
Last lecture we proved:

Lemma 1. Let x* = arg min_{x∈K} f(x). After T steps of gradient descent,

    f(x̄) − f(x*) ≤ DG/(2√T).

So after T = 4D²G²/ε² steps, we have f(x̄) − f(x*) ≤ ε.
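In code, the procedure looks like the following minimal Python sketch (assuming NumPy;
the least squares instance, the projection set, and the values of D and G are illustrative
assumptions):

    import numpy as np

    def project_unit_ball(y):
        """Euclidean projection onto K = {x : ||x||_2 <= 1}."""
        norm = np.linalg.norm(y)
        return y if norm <= 1 else y / norm

    def gradient_descent(grad_f, x0, D, G, T, project):
        eta = D / (G * np.sqrt(T))       # step size from the pseudocode
        x = x0
        total = np.zeros_like(x0)
        for i in range(T + 1):
            total += x                   # accumulate x_i for the average
            y = x - eta * grad_f(x)      # gradient step
            x = project(y)               # project back onto K
        return total / (T + 1)           # x-bar = average of x_0, ..., x_T

    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(50, 10)), rng.normal(size=50)
    grad_f = lambda x: 2 * A.T @ (A @ x - b)
    x_bar = gradient_descent(grad_f, np.zeros(10), D=2.0, G=400.0,  # rough bounds
                             T=10_000, project=project_unit_ball)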


2 Online Gradient Descent


If you look back to last lecture, you will see that we actually proved something a bit stronger.
We showed that in every step,

    f(x_i) − f(x*) ≤ (1/2η)·(‖x_i − x*‖_2² − ‖x_{i+1} − x*‖_2²) + (η/2)·G².    (3)

The only thing our proof used about f was the bound ‖∇f(x_i)‖_2 ≤ G. In particular, it could have been
that f differed at every iteration! If we have functions f_0, . . . , f_T and run gradient descent
with updates equal to −η∇f_i(x_i) on iteration i, (3) allows us to obtain the bound:

    (1/T) Σ_{i=0}^T f_i(x_i) − (1/T) Σ_{i=0}^T f_i(x*) ≤ DG/(2√T),    (4)

for any x* ∈ K.
This is a regret bound, just like we saw for the experts problem and multiplicative
weights update. I.e. instead of optimizing one fixed function f, suppose our goal is to
output a vector x_i at time i, which corresponds to some strategy: e.g. a linear classifier for
spam prediction, or a distribution of funds over stocks, bonds, and other assets.

After playing strategy x_i, we incur some loss f_i(x_i). For example, suppose we receive a
data point a_i and want to classify a_i as positive or negative (e.g. yes it's spam, no it's not)
by looking at sign(a_i^T x_i). After we play x_i, the true label of a_i is revealed as b_i ∈ {−1, 1}
and we pay the convex penalty:

    f_i(x_i) = max(0, 1 − (a_i^T x_i) · b_i).

We update our classifier based on ∇f_i(x_i) to obtain x_{i+1} and proceed. This procedure is
referred to as online gradient descent and the problem of minimizing our cumulative
loss Σ_i f_i(x_i) is referred to as online convex optimization.

The bound of (4) means that our average penalty is no worse than that of the best fixed
classifier x*, up to an additive term that shrinks as 1/√T; equivalently, our total regret grows
sublinearly in T. This powerful observation, due to
[1], has a number of applications, included below (we did not discuss these in lecture).
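Here is a minimal sketch of this online classification loop (assuming NumPy; the synthetic
data stream, hidden labeling rule, and step size are illustrative, and we leave the problem
unconstrained so no projection is needed):

    import numpy as np

    def hinge_subgradient(x, a, b):
        """A (sub)gradient of f_i(x) = max(0, 1 - b * a^T x)."""
        return -b * a if b * (a @ x) < 1 else np.zeros_like(x)

    rng = np.random.default_rng(0)
    d, T = 10, 1000
    x = np.zeros(d)                    # initial classifier x_0
    eta = 1.0 / np.sqrt(T)             # illustrative step size
    total_loss = 0.0

    for i in range(T):
        a = rng.normal(size=d)         # data point a_i arrives
        b = 1.0 if a[0] > 0 else -1.0  # true label b_i revealed after we play x_i
        total_loss += max(0.0, 1 - b * (a @ x))   # pay f_i(x_i)
        x -= eta * hinge_subgradient(x, a, b)     # update for the next round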

2.1 Case Study: Online Shortest Paths


The Online Shortest Paths problem models a commuter trying to find the best path with
fewest traffic delays. The traffic pattern changes from day to day, and she wishes to have
the smallest average delay over many days of experimentation.
We are given a graph G = (V, E) and two nodes s, t. At each time period i, the decision
maker selects one path p_i from the set P_{s,t} of all paths that connect s, t (the choice for the
day's commute). Then, an adversary independently chooses a weight function w_i : E → R
(the traffic delays). The decision maker incurs a loss equal to the weight of the path he or
she chose: Σ_{e∈p_i} w_i(e).
It would be natural to consider this problem in the context
of expert advice. We could think of every element of P_{s,t} as an expert and apply the
multiplicative weights algorithm we have seen before. There is one major flaw with this

approach: there may be exponentially many paths connecting s and t in terms of the number
of nodes in the graph. So the updates take exponential time and space in each step, and
furthermore the algorithm takes too long to converge to the best solution.
Online gradient descent can solve this problem, once we realize that we can describe
the set of all distributions x over paths P_{s,t} as a convex set K ⊆ R^m (where m = |E|), with
O(|E| + |V|) constraints. Then the decision maker's expected loss function is f_i(x) = w_i^T x. The
following formulation of the problem as a convex polytope allows for efficient algorithms
with provable regret bounds.

    Σ_{e=(s,w), w∈V} x_e = Σ_{e=(w,t), w∈V} x_e = 1              (flow value is 1)

    ∀w ∈ V, w ≠ s, t:  Σ_{e entering w} x_e − Σ_{e leaving w} x_e = 0    (flow conservation)

    ∀e ∈ E:  0 ≤ x_e ≤ 1                                         (capacity constraints)
What is the meaning of the decision maker’s move being a distribution over paths? It
just means a fractional solution. This can be decomposed into a combination of paths as in
the lecture on approximation algorithms. She picks a random path from this distribution;
the expected regret is unchanged.
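To make the polytope concrete, here is a small sketch (assuming NumPy; the five-edge
graph and the edge ordering are illustrative choices) that builds the equality constraints
above as a matrix:

    import numpy as np

    edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]   # illustrative graph, s = 0, t = 3
    n, m = 4, len(edges)
    s, t = 0, 3

    # One row per node w: inflow(w) - outflow(w), set to 0 for internal nodes,
    # and to rows encoding flow value 1 out of s and into t.
    A_eq, b_eq = [], []
    for w in range(n):
        row = np.zeros(m)
        for j, (u, v) in enumerate(edges):
            if v == w: row[j] += 1     # edge enters w
            if u == w: row[j] -= 1     # edge leaves w
        if w == s:   A_eq.append(-row); b_eq.append(1.0)  # net outflow of s is 1
        elif w == t: A_eq.append(row);  b_eq.append(1.0)  # net inflow of t is 1
        else:        A_eq.append(row);  b_eq.append(0.0)  # flow conservation
    A_eq, b_eq = np.array(A_eq), np.array(b_eq)
    # Together with 0 <= x_e <= 1, these rows define the convex set K.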

2.2 Case Study: Portfolio Management


Let's return to the portfolio management problem discussed in the context of multiplicative
weights. We are trying to invest in a set of n stocks and maximize our wealth. For
t = 1, 2, . . ., let r^(t) be the vector of relative price increases on day t; in other words,

    r_i^(t) = (price of stock i on day t) / (price of stock i on day t − 1).

Some thought shows (confirming conventional wisdom) that it can be very suboptimal
to put all money in a single stock. A strategy that works better in practice is the Constant
Rebalanced Portfolio (CRB): decide upon a fixed proportion of money to put into each stock,
and buy/sell individual stocks each day to maintain this proportion.

Example 1. Say there are only two assets, stocks and bonds. One CRB strategy is to
split money equally between these two. Notice what this implies: if an asset's price falls,
you tend to buy more of it, and if the price rises, you tend to sell it. Thus this strategy
roughly implements the age-old advice to "buy low, sell high." Concretely, suppose the prices
each day fluctuate as follows.

            Stock r^(t)    Bond r^(t)
    Day 1      4/3            3/4
    Day 2      3/4            4/3
    Day 3      4/3            3/4
    Day 4      3/4            4/3
    ...        ...            ...

Note that the prices go up and down by the same ratio on alternate days, so money
parked fully in stocks or fully in bonds earns nothing in the long run. (Aside: This kind of
fluctuation is not unusual; it is generally observed that bonds and stocks move in opposite
directions.) And what happens if you split your money equally between these two assets?
Each day it increases by a factor 0.5 × (4/3 + 3/4) = 0.5 × 25/12 ≈ 1.04. Thus your money
grows exponentially!
Exercise: Modify the price increases in the above example so that keeping all money
in stocks or bonds alone will cause it to drop exponentially, but the 50-50 CRB increases
money at an exponential rate.
CRB uses a fixed split among n assets, but what is this split? Wouldn't it be great to
have an angel whisper in our ears on day 1 what this magic split is? Online optimization
is precisely such an angel. Suppose the algorithm uses the vector x^(t) at time t; the ith
coordinate gives the proportion of money in stock i at the start of the tth day. Then the
algorithm's wealth increases on day t by a factor r^(t) · x^(t). Thus the goal is to find x^(t)'s to
maximize the final wealth, which is

    Π_t r^(t) · x^(t).

Taking logs, this becomes

    Σ_t log(r^(t) · x^(t)).    (5)
For any fixed r^(1), r^(2), . . ., this function happens to be concave, but that is fine since we are
interested in maximization. Now we can try to run online gradient descent on this objective.
By Zinkevich's theorem, the quantity in (5) converges to

    Σ_t log(r^(t) · x*),    (6)

where x* is the best money allocation in hindsight.
This analysis needs to assume very little about the r^(t)'s, except a bound on the norm
of the gradient at each step, which translates into a weak condition on price movements. In
the next homework you will apply this simple algorithm on real stock data.
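Here is a minimal sketch of the resulting online scheme (assuming NumPy; the synthetic
returns, step size, and the standard Euclidean simplex-projection routine are illustrative
choices — since (5) is concave we take ascent steps):

    import numpy as np

    def project_simplex(y):
        """Euclidean projection onto {x : x >= 0, sum(x) = 1}."""
        u = np.sort(y)[::-1]
        css = np.cumsum(u) - 1
        rho = np.nonzero(u - css / (np.arange(len(y)) + 1) > 0)[0][-1]
        return np.maximum(y - css[rho] / (rho + 1), 0)

    rng = np.random.default_rng(0)
    n, T = 5, 250
    x = np.ones(n) / n                        # start with an even split
    eta = 0.1                                 # illustrative step size
    log_wealth = 0.0
    for t in range(T):
        r = 1 + 0.02 * rng.standard_normal(n)  # synthetic relative price changes
        log_wealth += np.log(r @ x)            # wealth grows by factor r^(t) . x^(t)
        grad = r / (r @ x)                     # gradient of log(r . x)
        x = project_simplex(x + eta * grad)    # ascend, then project onto simplex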

3 Stochastic Gradient Descent


Beyond its direct applications, online gradient descent also gives a way of analyzing the
ubiquitous stochastic gradient descent method. In the most standard setting, this method
applies to convex functions that can be decomposed as the sum of simpler convex functions:

    f(x) = Σ_{j=1}^n g_j(x),  where each g_j is convex.

This is a very typical structure in machine learning, where f(x) sums some convex loss
function over n individual data points. We've seen the example of least squares regression:

    f(x) = ‖Ax − b‖_2² = Σ_{i=1}^n (a_i^T x − b_i)².

Other examples include objective functions for robust function fitting, like ℓ1 regression:
f(x) = Σ_{i=1}^n |a_i^T x − b_i|, and objective functions used in linear classification, like the hinge
loss: f(x) = Σ_{i=1}^n max(0, 1 − (a_i^T x) · b_i).
The key observation is that, when f(x) = Σ_{j=1}^n g_j(x), ∇f(x) = Σ_{j=1}^n ∇g_j(x). So if we
select j uniformly at random, n∇g_j(x) gives an unbiased estimator for ∇f(x). Stochastic
gradient descent uses this estimate in place of ∇f(x). The advantage of doing so is that the
estimate is much faster to compute, typically saving a factor of n. For example, computing
∇f(x) for any of the objectives we just listed takes O(nd) time, while computing ∇g_j(x)
takes O(d) time.
Let G′ be an upper bound on ‖n∇g_j(x)‖_2 for all g_j, x.

Stochastic Gradient Descent
    Let η = D/(G′√T).
    Let x_0 be any point in K.
    Repeat for i = 0 to T:
        Choose j_i uniformly from {1, . . . , n}.
        y_{i+1} ← x_i − η · n∇g_{j_i}(x_i)
        x_{i+1} ← projection of y_{i+1} onto K.
    At the end output x̄ = (1/T) Σ_{i=0}^T x_i.
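A minimal Python sketch of this procedure for the least squares objective above (assuming
NumPy; the instance and the constants D, G′ are rough illustrative choices, and we omit
the projection since the example is unconstrained):

    import numpy as np

    def sgd_least_squares(A, b, T, D, G_prime, rng):
        n, d = A.shape
        eta = D / (G_prime * np.sqrt(T))      # step size from the pseudocode
        x = np.zeros(d)
        total = np.zeros(d)
        for i in range(T + 1):
            total += x                        # accumulate x_i for x-bar
            j = rng.integers(n)               # choose j_i uniformly from {0,...,n-1}
            # n * grad g_j(x) = n * 2 a_j (a_j^T x - b_j): an unbiased
            # estimate of grad f(x), computable in O(d) time
            stoch_grad = n * 2 * A[j] * (A[j] @ x - b[j])
            x = x - eta * stoch_grad
        return total / (T + 1)

    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(100, 10)), rng.normal(size=100)
    x_bar = sgd_least_squares(A, b, T=50_000, D=5.0, G_prime=5_000.0, rng=rng)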
The output of the SGD algorithm is a random variable, so we give a bound on its
expected performance. First we use convexity and linearity of expectation:

    E f(x̄) − f(x*) ≤ (1/T) Σ_{i=0}^T [E f(x_i) − f(x*)] ≤ (1/T) Σ_{i=0}^T E[∇f(x_i)^T (x_i − x*)].

Now, for any x_i, the expectation of n∇g_{j_i}(x_i) over our random choice of j_i is equal to
∇f(x_i). So the expression above is equivalent to:

    (1/T) Σ_{i=0}^T E[∇f(x_i)^T (x_i − x*)] = (1/T) Σ_{i=0}^T E[n∇g_{j_i}(x_i)^T (x_i − x*)].

Finally, for any particular realization of g_{j_1}, . . . , g_{j_T}, we can use (4) to bound:

    (1/T) Σ_{i=0}^T [n∇g_{j_i}(x_i)^T x_i − n∇g_{j_i}(x_i)^T x*] ≤ DG′/√T.

I.e., we let f_i(y) = n∇g_{j_i}(x_i)^T y be the shifting objective function in online gradient descent.
We conclude that E f(x̄) − f(x*) ≤ DG′/√T, so we obtain error ε if we set T = O(D²G′²/ε²).
How does this bound compare to standard gradient descent? First note that ‖∇f(x)‖_2 ≤
Σ_{i=1}^n ‖∇g_i(x)‖_2 by the triangle inequality. And since Σ_{i=1}^n ‖∇g_i(x)‖_2 ≤ n·max_i ‖∇g_i(x)‖_2, we
always have G′ ≥ G. This is expected – in general, using stochastic gradients will slow
down the convergence of our algorithm.
However, in many cases this is more than made up for by how much we save in computing
gradients. In the examples given above, as long as G′ ≤ √n · G, we get an overall runtime
savings by using SGD instead of standard gradient descent.

4 The Ellipsoid Method


The advantage of gradient descent and related methods is their cheap per-iteration cost – in many
cases linear in the size of the input problem. Furthermore, the convergence rate of gradient
descent is dimension independent, depending loosely on specific problem parameters, and
not at all on the problem size. At the same time, at least without making additional
assumptions about f, this convergence rate is relatively slow – a dependence on 1/ε² limits
the possibility of using gradient descent to obtain high accuracy solutions to constrained
convex optimization problems.
In the remainder of this lecture and next lecture, we will look at alternative methods
for minimizing convex functions over convex sets which offer far more accurate solutions.
In fact, for many special cases like linear programming, these methods can be shown to
produce an exact solution in polynomial time.¹

The first such method we consider is the ellipsoid method, which originates in work by
Shor (1970), and Yudin and Nemirovskii (1975). In 1979 Khachiyan showed that the ellipsoid
method can be used to solve linear programs in polynomial time, giving the first polynomial
time solution to the problem. Next lecture we will look at interior point methods, which
were shown by Karmarkar to solve LPs in polynomial time in 1984. Generally, interior
point methods are considered practically faster than ellipsoid methods (and give better
asymptotic runtimes) but they also apply to less general problems.

4.1 Find a point in a convex set


The problem that the ellipsoid method actually solves is as follows:

Problem 2. Given a convex body (i.e., a closed and bounded convex set) K, find some point
x ∈ K or output EMPTY if K = ∅.

Intuitively, we can see that an algorithm for this problem can be used as a black box to
minimize a convex function f(x) over a convex set. In particular, if f is convex, then the
set {x : f(x) ≤ z} is convex. It follows that K ∩ {x : f(x) ≤ z} is convex. So, we can use
an algorithm for Problem 2, run on K ∩ {x : f(x) ≤ z}, to binary search over values of
z and find the minimum z (or close to it) such that f(x) ≤ z for some x ∈ K.
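Schematically, the reduction looks like the following sketch, where find_point is a
hypothetical black-box solver for Problem 2 and the convex set is represented abstractly
as a membership predicate:

    def minimize_via_feasibility(f, in_K, z_lo, z_hi, eps, find_point):
        """Binary search for the smallest z with K ∩ {x : f(x) <= z} nonempty.
        find_point(in_set) returns a point in the given convex set, or None
        if the set is empty (a hypothetical Problem-2 solver)."""
        best = None
        while z_hi - z_lo > eps:
            z = (z_lo + z_hi) / 2
            x = find_point(lambda y: in_K(y) and f(y) <= z)
            if x is not None:
                best, z_hi = x, z    # feasible: min f over K is at most z
            else:
                z_lo = z             # infeasible: min f over K exceeds z
        return best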
There are a number of details to think about here which we do not have time to cover.
For example, we need at least upper and lower bounds on min f (x) to perform binary
search. As suggested in class, it might also be that K ∩ {x : f (x) ≤ z} is not bounded,
which violates the input assumption of Problem 2. Nisheeth Vishnoi’s lecture notes on the
ellipsoid method [2] are a good place to learn about some of these points in more detail.
For now, we restrict our attention to solving Problem 2.
¹This can be a complicated issue, involving many details. E.g. for linear programming the methods we
will look at run in "weakly polynomial time", meaning time polynomial in the input size and in L, which
is a bound on the maximum number of bits required to specify each entry of our constraint matrix if all
entries are integers specified in binary (i.e., they can't be specified in floating point).

Figure 1: For this lecture, we restrict our attention to finding a point inside a bounded
convex set K that is specified by a separation oracle. (a) Closed and bounded convex set
K ⊂ R^d. (b) Unbounded convex set K ⊂ R^d.

4.2 Presenting a convex body: separation oracles


The first question to ask is how the input to Problem 2 is even specified. For some convex
optimization problems (e.g. linear programming) we had a convenient way of representing
K as the intersection of many halfspaces (a polytope), but this is not always the case. We
need a more generic approach.

One simple way to present a body to the algorithm is via a membership oracle: a black-
box program that, given a point x, tells us if x ∈ K. We will work with a stronger version
of the oracle, which relies upon the following fact.

Figure 2: Separating hyperplane theorem/Farkas' Lemma: Between every convex body and
a point outside it, there is a hyperplane.

Separating hyperplane theorem: If K ⊆ R^d is a closed convex set and x ∈ R^d is a
point, then one of the following holds:

1. x ∈ K

2. There is a hyperplane that separates x from K. I.e. there is some c ∈ R^d such that
c^T x > c^T y ∀y ∈ K.
This claim was proven in our lecture on LP duality. It prompts the following definition:
Definition 1. A Separation Oracle for a convex set K is a procedure which, given x,
either reports that x ∈ K or returns a vector c ∈ R^d which specifies a hyperplane that
separates x from all of K.

A separating hyperplane oracle provides a lot more information than a membership
oracle. It tells us not only that x is not in K, but why – e.g. roughly what direction we
would have to move in to get closer to K.
For many convex bodies of practical interest, it is possible to construct an efficient
separation oracle. This oracle will be used blackbox by the ellipsoid method, so its runtime
will dictate the cost of solving Problem 2.

Example 2. A polytope is a convex set described by linear constraints:

    K = {x : Ax ≤ b}.

A separation oracle for K simply needs to compute Ax. If this vector is entrywise ≤ b, x is
in K. Otherwise, if [Ax]_i > b_i, we simply output the ith row of A, a_i, as our separating
vector.
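This oracle is simple enough to state in a few lines (a sketch assuming NumPy; the
unit-box example at the end is illustrative):

    import numpy as np

    def polytope_separation_oracle(A, b, x):
        """Return (True, None) if x is in K = {x : Ax <= b}; otherwise
        (False, c), where c = a_i satisfies c^T x > c^T y for all y in K."""
        violations = A @ x - b
        i = int(np.argmax(violations))
        if violations[i] <= 0:        # Ax <= b entrywise: x is in K
            return True, None
        return False, A[i]            # row a_i separates x from K

    # Example usage on the unit box {x : 0 <= x_j <= 1}.
    A = np.vstack([np.eye(2), -np.eye(2)])
    b = np.array([1.0, 1.0, 0.0, 0.0])
    print(polytope_separation_oracle(A, b, np.array([0.5, 2.0])))  # (False, [0., 1.])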

Example 3. The set of positive semidefinite matrices includes all symmetric matrices X ∈
R^{d×d} such that w^T X w ≥ 0 for all w. This is a convex set, which we will discuss more in
Lecture 18.

Unlike the polytope example, the set of PSD matrices is defined by infinitely many linear
constraints of the form Σ_ij X_ij w_i w_j ≥ 0. We can't check all of these constraints to find a
separating hyperplane. However, since a symmetric matrix is PSD if and only if all of its
eigenvalues are nonnegative, we can use an eigendecomposition to obtain a separation oracle.

If an input matrix X has any negative eigenvalues, we take an eigenvector a corresponding
to one of these negative eigenvalues and return the hyperplane Σ_ij X_ij a_i a_j = 0. (Note
that the a_i's are constants here.)
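A sketch of this oracle (assuming NumPy; here the separating "hyperplane" is returned as
the rank-one matrix aa^T, since the linear function ⟨aa^T, Y⟩ = Σ_ij Y_ij a_i a_j is negative
at X but nonnegative on every PSD matrix Y):

    import numpy as np

    def psd_separation_oracle(X):
        """X is assumed symmetric. Returns (True, None) if X is PSD, else
        (False, a a^T) for an eigenvector a with negative eigenvalue."""
        eigvals, eigvecs = np.linalg.eigh(X)   # eigenvalues in ascending order
        if eigvals[0] >= 0:                    # smallest eigenvalue nonnegative
            return True, None
        a = eigvecs[:, 0]                      # eigenvector for most negative eigenvalue
        return False, np.outer(a, a)           # <a a^T, X> = a^T X a < 0 separates X

    X = np.array([[1.0, 2.0], [2.0, 1.0]])     # eigenvalues 3 and -1: not PSD
    in_cone, C = psd_separation_oracle(X)
    assert not in_cone and np.tensordot(C, X) < 0   # Frobenius inner product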

Example 4. Not all simple-to-describe convex sets have efficient (e.g. polynomial time)
separation oracles. For example, consider the set of copositive matrices, which includes all
(possibly asymmetric) matrices X ∈ R^{d×d} with w^T X w ≥ 0 for all entrywise nonnegative w.
While similar to the set of PSD matrices, the separation problem for this set is actually NP-hard
(and convex optimization over the set of copositive matrices is NP-hard).

4.3 Main idea


A separation oracle is not sufficient to allow an algorithm to test K for nonemptiness in
finite time. Each time the algorithm questions the oracle about a point x, the oracle could
just answer x ∉ K, since the convex body could be further from the origin than all the
(finitely many) points that the algorithm has queried about thus far. After all, space is
infinite!
Thus the algorithm needs some very rough idea of where K may lie. It needs K to lie
in some known bounding box. The bounding box could be a cube, sphere etc.
The Ellipsoid method will use an ellipsoid as a bounding box. In particular, it solves
Problem 2 under the assumption that K is contained in an n-dimensional ball of finite radius
R, and also contains an n-dimensional ball of radius r < R (which ensures that K can’t be
arbitrarily small).
Assumption:

1. K ⊆ B(α1 , R) for some known center α1 and radius R.


2. B(α2 , r) ⊆ K for some known center α2 and radius r < R.
It is not always possible to find such bounding balls a priori, but for many problems of
practical interest we can. For example, consider a polytope with integer constraints (i.e.
every entry in A and b is integral) that takes a total of L bits to represent. Because the
polytope is invariant to scaling A and b, this is equivalent to saying that the input constraints
have bounded precision. In this case, it is possible to show that R ≲ 2^L and, either K is
empty, or r ≳ 2^(−L). Again, we refer to [2] for more details, and for now assume that R and
r exist and can be determined.
The ellipsoid method is an iterative algorithm. It begins with bounding ellipse B(α_1, R)
and queries whether or not the center of that ellipse is in K. If it is, we are done. If not, it
uses the separating hyperplane returned by our separation oracle to cut B(α_1, R) in half,
reducing our search space. The new search space is itself represented by an ellipse – we
take the minimum volume ellipse that contains the half ellipse B(α_1, R) ∩ {y : c^T y ≤ c^T x}.
So overall, we produce a sequence of ellipses:

    E_0 = B(α_1, R), E_1, E_2, . . . , E_T,

where E_{i+1} is the ellipse of minimal volume containing E_i ∩ {y : c_i^T y ≤ c_i^T x_i}. Here x_i is
the center of ellipse E_i, which we pass as a query to our separation oracle, and c_i is the
separation vector returned when x_i ∉ K.
A representation for E_i can be found analytically in O(n²) time by using the fact that
any ellipse E is defined as containing all points x such that (x − α)^T A(x − α) ≤ 1 for some
symmetric PSD matrix A and center α.
Why do this? Why not just keep track of the half ellipse, query its center, and cut in
half again? Why make an approximation at each step by surrounding our search space by
a new ellipse? The issue is that it’s hard to find the center of a generic convex body, so we
very quickly would be unable to decide which point x to query next in our search region.

Figure 3: 3D-Ellipsoid and its axes

With our general strategy in place, the only problem is to make sure that the algorithm
makes progress at every step. To do so we use the following important lemma:
Lemma 3. The minimum volume ellipsoid surrounding a half ellipsoid (i.e. E_i ∩ H^+ where
H^+ is a halfspace as above) can be calculated in polynomial time and

    Vol(E_{i+1}) ≤ (1 − 1/2n) · Vol(E_i).
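For concreteness, here is a sketch of the standard closed-form update (assuming NumPy;
we use the common parametrization E = {x : (x − a)^T A^{-1} (x − a) ≤ 1}, which is the
inverse of the convention above, and state the formula without proof):

    import numpy as np

    def ellipsoid_update(a, A, c):
        """Given ellipsoid center a and matrix A, and a separating vector c
        (so we keep the half E ∩ {x : c^T x <= c^T a}), return the center and
        matrix of the minimum volume ellipsoid containing that half."""
        n = len(a)
        Ac = A @ c
        b = Ac / np.sqrt(c @ Ac)                      # direction of the cut in the A metric
        a_new = a - b / (n + 1)                       # shift center into the kept half
        A_new = (n**2 / (n**2 - 1)) * (A - (2 / (n + 1)) * np.outer(b, b))
        return a_new, A_new                           # volume shrinks by a (1 - Θ(1/n)) factor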

Figure 4: A few of runs of the Ellipsoid method showing the tiny convex set in blue and
the containing ellipsoids. The separating hyperplanes do not pass through the centers of
the ellipsoids in this figure, although for the algorithm described here they would.

It’s an interesting exercise to work through a proof of this lemma – you can follow the
proof here: http://people.csail.mit.edu/moitra/docs/6854notes12.pdf.
Thus after T steps the volume of the enclosing ellipsoid has dropped by (1 − 1/2n)^T ≤
exp(−T/2n). Our starting ellipse has volume O(R^n), so in T = O(n log(R^n/r^n)) =
O(n² log(R/r)) steps, we can ensure that either we find a point x ∈ K, or

    Vol(E_T) ≤ O(r^n).

If this is the case, we can assume that K is empty, since we assumed that if it wasn't, it at
least contained a ball of radius r, whose volume is Ω(r^n).
At first glance Lemma 3 might seem weak – ideally we could have hoped to cut the
volume of our search region in half with each query. We only cut it by a factor of (1 − 1/2n),
which costs us an extra n factor in iteration complexity. This loss is an inherent cost of
approximating our search region by an ellipsoid. Recent work on improving the ellipsoid
method seeks to address this issue by maintaining alternative representations of the search
region [3].

References
[1] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient
Ascent. Proceedings of the International Conference on Machine Learning (ICML), 2003.

[2] Nisheeth K. Vishnoi. Algorithms for Convex Optimization. EPFL lecture notes, 2018.
https://nisheethvishnoi.wordpress.com/convex-optimization/.

[3] Yin Tat Lee, Aaron Sidford, and Sam Chiu-Wai Wong. A Faster Cutting Plane Method
and its Implications for Combinatorial and Convex Optimization. FOCS 2015, 1049-1065.
