Lecture 16
This section of the course focuses on solving optimization problems of the form:

$$\min_{x \in K} f(x),$$
where f is a convex function and K is a convex set. Recall that any (differentiable) convex f satisfies the following equivalent inequalities:

$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y) \quad \text{for all } x, y \text{ and } \lambda \in [0,1],$$
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y.$$
Also recall that a convex set K is any set where, for all x, y ∈ K, if z = λx + (1 − λ)y for some λ ∈ [0, 1], then z ∈ K.
$$f(\bar{x}) - f(x^\star) \le \frac{DG}{\sqrt{T}}.$$

So after $T = \frac{4D^2G^2}{\epsilon^2}$ steps, we have $f(\bar{x}) - f(x^\star) \le \frac{\epsilon}{2}$.
The only thing our proof used about f was the gradient norm bound ‖∇f(x_i)‖₂ ≤ G. In particular, f could have differed at every iteration! If we have functions f_0, …, f_T and run gradient descent with updates equal to −η∇f_i(x_i) on iteration i, (3) allows us to obtain the bound:
$$\frac{1}{T}\sum_{i=0}^{T} f_i(x_i) - \frac{1}{T}\sum_{i=0}^{T} f_i(x^\star) \le \frac{DG}{\sqrt{T}}, \qquad (4)$$

for any x^⋆ ∈ K.
This is a regret bound, just like we saw for the experts problem and multiplicative
weights update. I.e. instead of optimizing one fixed function f , suppose our goal is to
output a vector xi at time i, which corresponds to some strategy: e.g. a linear classifier for
spam prediction, or a distribution of funds over stocks, bonds, and other assets.
After playing strategy x_i, we incur some loss f_i(x_i). For example, suppose we receive a data point a_i and want to classify a_i as positive or negative (e.g. yes it's spam, no it's not) by looking at sign(a_i^T x_i). After we play x_i, the true label of a_i is revealed as b_i ∈ {−1, 1} and we pay the convex (hinge) penalty:

$$f_i(x_i) = \max(0,\, 1 - b_i \cdot a_i^T x_i).$$
We update our classifier based on ∇f_i(x_i) to obtain x_{i+1} and proceed. This procedure is referred to as online gradient descent, and the problem of minimizing our cumulative loss Σ_i f_i(x_i) is referred to as online convex optimization.
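To make the procedure concrete, here is a minimal Python sketch of online gradient descent for the hinge-loss spam example above. The projection set (a Euclidean ball of radius D), the step size η, and the data stream interface are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def project_onto_ball(x, radius):
    # Euclidean projection onto K = {x : ||x||_2 <= radius}.
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def hinge_subgradient(x, a, b):
    # A subgradient of f_i(x) = max(0, 1 - b * a^T x).
    return -b * a if b * np.dot(a, x) < 1 else np.zeros_like(a)

def online_gradient_descent(stream, d, D, eta):
    # stream yields (a_i, b_i) pairs one at a time.
    x = np.zeros(d)
    total_loss = 0.0
    for a, b in stream:
        total_loss += max(0.0, 1.0 - b * np.dot(a, x))  # pay f_i(x_i) ...
        g = hinge_subgradient(x, a, b)                  # ... then observe the gradient
        x = project_onto_ball(x - eta * g, D)           # OGD update with projection
    return total_loss, x
```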
The bound of (4) means that our average penalty is no worse than that of the best fixed classifier x^⋆, up to an additive term that shrinks as 1/√T; equivalently, our total regret grows only sublinearly, as √T. This powerful observation, due to [1], has a number of applications, included below (we did not discuss these in lecture).
One natural idea is to run multiplicative weights with one expert per s–t path, but this is an inefficient approach: there may be exponentially many paths connecting s and t in terms of the number of nodes in the graph. So the updates take exponential time and space in each step, and furthermore the algorithm needs too long to converge to the best solution.
Online gradient descent can solve this problem, once we realize that we can describe the set of all distributions x over paths P_{s,t} as a convex set K ⊆ R^{|E|}, defined by O(|E| + |V|) constraints. Then the decision maker's expected loss function would be f_i(x) = w_i^T x. The following formulation of the problem as a convex polytope allows for efficient algorithms with provable regret bounds.
$$\sum_{e=(s,w),\, w\in V} x_e \;=\; \sum_{e=(w,t),\, w\in V} x_e \;=\; 1 \qquad \text{Flow value is 1.}$$
$$\forall w \in V,\ w \neq s, t: \quad \sum_{e=(u,w)} x_e - \sum_{e=(w,u)} x_e = 0 \qquad \text{Flow conservation.}$$
$$\forall e \in E: \quad 0 \le x_e \le 1 \qquad \text{Capacity constraints.}$$
What is the meaning of the decision maker's move being a distribution over paths? It just means a fractional solution. This can be decomposed into a convex combination of paths as in the lecture on approximation algorithms. She picks a random path from this distribution; the expected regret is unchanged.
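As a sanity check on this formulation, the following sketch builds the constraint rows for a small example graph (our own, not from the lecture) and verifies that the indicator vector of an s–t path satisfies them:

```python
import numpy as np

# A tiny directed graph: s = 0, t = 3, edges listed as (u, v).
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
n, m = 4, len(edges)
s, t = 0, 3

# Flow-value constraints: flow out of s = 1 and flow into t = 1.
out_s = np.array([1.0 if u == s else 0.0 for (u, v) in edges])
in_t = np.array([1.0 if v == t else 0.0 for (u, v) in edges])

# Flow conservation at every internal node w: inflow - outflow = 0.
conservation = []
for w in range(n):
    if w in (s, t):
        continue
    row = np.array([(1.0 if v == w else 0.0) - (1.0 if u == w else 0.0)
                    for (u, v) in edges])
    conservation.append(row)

# The indicator vector of the path s -> 1 -> t is a feasible point of K.
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
assert np.isclose(out_s @ x, 1.0) and np.isclose(in_t @ x, 1.0)
assert all(np.isclose(row @ x, 0.0) for row in conservation)
assert np.all((0.0 <= x) & (x <= 1.0))
```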
Some thought shows (confirming conventional wisdom) that it can be very suboptimal
to put all money in a single stock. A strategy that works better in practice is Constant
Rebalanced Portfolio (CRB): decide upon a fixed proportion of money to put into each stock,
and buy/sell individual stocks each day to maintain this proportion.
Example 1. Say there are only two assets, stocks and bonds. One CRB strategy is to split money equally between these two. Notice what this implies: if an asset's price falls, you tend to buy more of it, and if the price rises, you tend to sell it. Thus this strategy roughly implements the age-old advice to “buy low, sell high.” Concretely, suppose the prices each day fluctuate as follows: on odd days the stock's price multiplies by 4/3 while the bond's multiplies by 3/4, and on even days the reverse.
Note that the prices go up and down by the same ratio on alternate days, so money
parked fully in stocks or fully in bonds earns nothing in the long run. (Aside: This kind of
fluctuation is not unusual; it is generally observed that bonds and stocks move in opposite
directions.) And what happens if you split your money equally between these two assets?
Each day it increases by a factor 0.5 × (4/3 + 3/4) = 0.5 × 25/12 ≈ 1.04. Thus your money
grows exponentially!
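A quick numeric check of this example (the 4/3 and 3/4 ratios are the ones above):

```python
stock_ratio, bond_ratio = 4 / 3, 3 / 4

# All-in on one asset: consecutive days cancel out, so wealth is flat.
assert abs(stock_ratio * bond_ratio - 1.0) < 1e-12

# 50-50 rebalanced daily: wealth multiplies by the average ratio each day.
daily_factor = 0.5 * (stock_ratio + bond_ratio)
print(daily_factor)          # 25/24, about 1.0417
print(daily_factor ** 250)   # after ~a trading year: exponential growth
```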
Exercise: Modify the price increases in the above example so that keeping all money
in stocks or bonds alone will cause it to drop exponentially, but the 50-50 CRB increases
money at an exponential rate.
CRB uses a fixed split among n assets, but what is this split? Wouldn’t it be great to
have an angel whisper in our ears on day 1 what this magic split is? Online optimization
is precisely such an angel. Suppose the algorithm uses the vector x^{(t)} at time t; the ith coordinate gives the proportion of money in stock i at the start of the tth day. Let r^{(t)} be the vector of that day's price relatives, so r_i^{(t)} is the ratio of stock i's price at the end of day t to its price at the start. Then the algorithm's wealth increases on day t by a factor r^{(t)} · x^{(t)}. Thus the goal is to find the x^{(t)}'s to maximize the final wealth, which is

$$\prod_t r^{(t)} \cdot x^{(t)}.$$
Taking logs, this becomes

$$\sum_t \log\left(r^{(t)} \cdot x^{(t)}\right). \qquad (5)$$
For any fixed r^{(1)}, r^{(2)}, …, this function happens to be concave, but that is fine since we are interested in maximization. Now we can try to run online gradient descent on this objective.
By Zinkevich's theorem, the quantity in (5) converges to

$$\sum_t \log\left(r^{(t)} \cdot x^\ast\right), \qquad (6)$$
where x∗ is the best money allocation in hindsight.
This analysis needs to assume very little about the r(t) ’s, except a bound on the norm
of the gradient at each step, which translates into a weak condition on price movements. In
the next homework you will apply this simple algorithm on real stock data.
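Here is a hedged sketch of this algorithm in Python: online gradient ascent on (5), with iterates kept on the probability simplex. The simplex-projection routine and the step size are standard choices assumed here, not specified in the lecture.

```python
import numpy as np

def project_onto_simplex(v):
    # Euclidean projection onto {x : x >= 0, sum_i x_i = 1}, via the
    # standard sorting-based routine (an assumed helper, not from lecture).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def online_portfolio(price_relatives, eta=0.1):
    # Online gradient ascent on (5): grad_x log(r . x) = r / (r . x).
    T, n = price_relatives.shape
    x = np.full(n, 1.0 / n)
    log_wealth = 0.0
    for r in price_relatives:
        log_wealth += np.log(r @ x)                      # day's log return
        x = project_onto_simplex(x + eta * r / (r @ x))  # ascent + projection
    return np.exp(log_wealth), x

# Example 1's alternating prices: wealth grows exponentially.
r = np.array([[4/3, 3/4], [3/4, 4/3]] * 100)
print(online_portfolio(r)[0])
```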
Now consider minimizing a function of the form f(x) = Σ_{j=1}^n g_j(x). This is a very typical structure in machine learning, where f(x) sums some convex loss function over n individual data points. We've seen the example of least squares regression:

$$f(x) = \|Ax - b\|_2^2 = \sum_{i=1}^n \left(a_i^T x - b_i\right)^2.$$
Other examples include objective functions for robust function fitting, like ℓ1 regression: f(x) = Σ_{i=1}^n |a_i^T x − b_i|, and objective functions used in linear classification, like the hinge loss: f(x) = Σ_{i=1}^n max(0, 1 − b_i · a_i^T x).
The key observation is that, when f(x) = Σ_{j=1}^n g_j(x), we have ∇f(x) = Σ_{j=1}^n ∇g_j(x). So if we select j uniformly at random, n∇g_j(x) gives an unbiased estimator for ∇f(x). Stochastic gradient descent uses this estimate in place of ∇f(x). The advantage of doing so is that the estimate is much faster to compute, typically saving a factor of n. For example, computing ∇f(x) for any of the objectives we just listed takes O(nd) time, while computing ∇g_j(x) takes O(d) time.
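For instance, for least squares, one stochastic gradient costs a single O(d) inner product (a minimal sketch; names are illustrative):

```python
import numpy as np

def stochastic_grad_least_squares(A, b, x, rng):
    # f(x) = sum_i (a_i^T x - b_i)^2 with g_j(x) = (a_j^T x - b_j)^2,
    # so grad g_j(x) = 2 (a_j^T x - b_j) a_j, computable in O(d) time.
    j = rng.integers(A.shape[0])
    return A.shape[0] * 2.0 * (A[j] @ x - b[j]) * A[j]  # n * grad g_j(x)
```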
Let G′ be an upper bound on ‖n∇g_j(x)‖₂ for all g_j, x.
Stochastic Gradient Descent

Let η = D/(G′√T).
Let x₀ be any point in K.
Repeat for i = 0 to T:
    Choose j_i uniformly from {1, …, n}.
    y_{i+1} ← x_i − η · n∇g_{j_i}(x_i)
    x_{i+1} ← projection of y_{i+1} onto K.
At the end output x̄ = (1/T) Σ_{i=0}^T x_i.
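Putting the pieces together, here is a runnable sketch of this algorithm for least squares over a Euclidean ball K = {x : ‖x‖₂ ≤ D}; the crude bound used for G′ and the synthetic data are assumptions for illustration.

```python
import numpy as np

def sgd_least_squares(A, b, D, T, seed=0):
    # Minimize f(x) = sum_i (a_i^T x - b_i)^2 over K = {x : ||x||_2 <= D}.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Crude bound G' on ||n * grad g_j(x)||_2 over the ball.
    row_norms = np.linalg.norm(A, axis=1)
    G_prime = n * 2 * np.max(row_norms * (row_norms * D + np.abs(b)))
    eta = D / (G_prime * np.sqrt(T))
    x, x_sum = np.zeros(d), np.zeros(d)
    for _ in range(T):
        j = rng.integers(n)
        g = n * 2.0 * (A[j] @ x - b[j]) * A[j]      # unbiased gradient estimate
        y = x - eta * g
        norm = np.linalg.norm(y)
        x = y if norm <= D else y * (D / norm)      # projection onto K
        x_sum += x
    return x_sum / T                                 # the average iterate x-bar

# Tiny usage example on synthetic data.
A = np.random.default_rng(1).normal(size=(100, 5))
b = A @ np.ones(5)
print(np.linalg.norm(A @ sgd_least_squares(A, b, D=5.0, T=50000) - b))
```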
The output of the SGD algorithm is a random variable, so we give a bound on its expected performance. First we use convexity and linearity of expectation:
$$\mathbb{E} f(\bar{x}) - f(x^\star) \;\le\; \frac{1}{T}\sum_{i=0}^{T} \mathbb{E} f(x_i) - f(x^\star) \;\le\; \frac{1}{T}\sum_{i=0}^{T} \mathbb{E}\, \nabla f(x_i)^T (x_i - x^\star).$$
Now, for any x_i, the expectation of n·∇g_{j_i}(x_i) over our random choice of j_i is equal to ∇f(x_i). So the expression above is equivalent to:
$$\frac{1}{T}\sum_{i=0}^{T} \mathbb{E}\, \nabla f(x_i)^T (x_i - x^\star) \;=\; \frac{1}{T}\sum_{i=0}^{T} \mathbb{E}\, n\nabla g_{j_i}(x_i)^T (x_i - x^\star).$$
Finally, for any particular realization of g_{j_1}, …, g_{j_T}, we can use (4) to bound:

$$\frac{1}{T}\sum_{i=0}^{T} \left[\, n\nabla g_{j_i}(x_i)^T x_i - n\nabla g_{j_i}(x_i)^T x^\star \right] \;\le\; \frac{DG'}{\sqrt{T}}.$$
I.e., we let f_i(y) = n∇g_{j_i}(x_i)^T y be the shifting objective function in online gradient descent.
We conclude that Ef(x̄) − f(x^⋆) ≤ DG′/√T, so we obtain ε error if we set T = O(D²G′²/ε²).
How does this bound compare to standard gradient descent? First note that ‖∇f(x)‖₂ ≤ Σ_{i=1}^n ‖∇g_i(x)‖₂ by the triangle inequality. And since Σ_{i=1}^n ‖∇g_i(x)‖₂ ≤ n·max_i ‖∇g_i(x)‖₂, we always have G′ ≥ G. This is expected – in general, using stochastic gradients will slow down the convergence of our algorithm.
However, in many cases this is more than made up for by how much we save in computing gradients. In the examples given above, as long as G′ ≤ √n · G, we get an overall runtime savings by using SGD instead of standard gradient descent.
Problem 2. Given a convex body (i.e., a closed and bounded convex set) K, find some point
x ∈ K or output EMPTY if K = ∅.
Intuitively, we can see that an algorithm for this problem can be used as a black box to minimize a convex function f(x) over a convex set. In particular, if f is convex, then the set {x : f(x) ≤ z} is convex. It follows that K ∩ {x : f(x) ≤ z} is convex. So, we can use an algorithm for Problem 2, run on K ∩ {x : f(x) ≤ z}, to binary search over values of z and find the minimum z (or close to it) such that f(x) ≤ z for some x ∈ K.
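In code, the reduction looks roughly like the following sketch, where `feasible(z)` stands in for a hypothetical algorithm for Problem 2 run on K ∩ {x : f(x) ≤ z}, and lo, hi are assumed bounds on min f:

```python
def minimize_via_feasibility(feasible, lo, hi, eps):
    # feasible(z): returns a point x in K with f(x) <= z, or None if
    # K intersected with {x : f(x) <= z} is empty (Problem 2 as a black box).
    best = None
    while hi - lo > eps:
        z = (lo + hi) / 2
        x = feasible(z)
        if x is None:
            lo = z            # the minimum lies above z
        else:
            best, hi = x, z   # record a point achieving value <= z
    return best               # f(best) is within eps of the minimum
```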
There are a number of details to think about here which we do not have time to cover. For example, we need at least upper and lower bounds on min_{x∈K} f(x) to perform binary search. As suggested in class, it might also be that K ∩ {x : f(x) ≤ z} is not bounded, which violates the input assumption of Problem 2. Nisheeth Vishnoi's lecture notes on the ellipsoid method [2] are a good place to learn about some of these points in more detail.
For now, we restrict our attention to solving Problem 2.
1
This can be a complicated issue, involving many details. E.g. for linear programming the methods we
will look at run in “weakly polynomial time”, meaning time polynomial in the input size and in L, which
is a bound on the maximum number of bits required to specify each entry of our constraint matrix if all
entries are integers specified in binary (i.e., they can’t be specified in floating point).
Figure 1: For this lecture, we restrict our attention to finding a point inside a bounded convex set K that is specified by a separation oracle. (a) Closed and bounded convex set K ⊂ R^d. (b) Unbounded convex set K ⊂ R^d.
Figure 2: Separating hyperplane theorem/Farkas' lemma: between every convex body and a point outside it, there is a hyperplane.
For any point x ∈ R^d and convex body K, one of the following holds:
1. x ∈ K.
2. There is a hyperplane that separates x from K. I.e. there is some c ∈ R^d such that c^T x > c^T y for all y ∈ K.
This claim was proven in our lecture on LP duality. It prompts the following definition:
Definition 1. A Separation Oracle for a convex set K is a procedure which, given a point x, either reports that x ∈ K or returns a vector c ∈ R^d which specifies a hyperplane that separates x from all of K.
Example 2. Consider a polytope

K = {x : Ax ≤ b}.

A separation oracle for K simply needs to compute Ax. If this vector is entrywise ≤ b, then x is in K. Alternatively, if [Ax]_i > b_i, we simply output the ith row of A, a_i, as our separating vector.
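This oracle is a few lines of code (a sketch; the None return convention is ours):

```python
import numpy as np

def polytope_separation_oracle(A, b, x):
    # Returns None if x is in K = {x : Ax <= b}; otherwise returns a
    # row a_i of A with a_i^T x > b_i, which separates x from K
    # (every y in K has a_i^T y <= b_i < a_i^T x).
    violations = A @ x - b
    i = int(np.argmax(violations))
    return None if violations[i] <= 0 else A[i]
```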
Example 3. The set of positive semidefinite matrices includes all symmetric matrices X ∈
Rd×d such that wT Xw ≥ 0 for all w. This is a convex set, which we will discuss more in
Lecture 18.
Unlike the polytope example, the set of PSD matrices is defined by infinitely many linear constraints, of the form Σ_{ij} X_{ij} w_i w_j ≥ 0. We can't check all of these constraints to find a separating hyperplane. However, since a symmetric matrix is PSD if and only if all of its eigenvalues are nonnegative, we can use an eigendecomposition to obtain a separation oracle.
If an input matrix X has any negative eigenvalues, we take an eigenvector a corresponding to one of these negative eigenvalues and return the hyperplane Σ_{ij} X_{ij} a_i a_j = 0. (Note that the a_i's are constants here.)
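Using numpy's eigendecomposition, a sketch of this oracle (the tolerance is an implementation choice of ours):

```python
import numpy as np

def psd_separation_oracle(X, tol=1e-9):
    # Returns None if the symmetric matrix X is PSD; otherwise returns an
    # eigenvector a with a^T X a < 0, which defines the separating
    # hyperplane sum_ij Y_ij a_i a_j = 0 (every PSD Y satisfies >= 0).
    eigvals, eigvecs = np.linalg.eigh(X)   # eigenvalues in ascending order
    if eigvals[0] >= -tol:
        return None
    return eigvecs[:, 0]                   # eigenvector of the most negative eigenvalue
```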
Example 4. Not all simple-to-describe convex sets have efficient (e.g. polynomial time) separation oracles. For example, consider the set of copositive matrices, which includes all (possibly asymmetric) matrices X ∈ R^{d×d} with w^T X w ≥ 0 for all entrywise nonnegative w. While similar to the set of PSD matrices, the separation problem for this set is actually NP-hard (and convex optimization over the set of copositive matrices is NP-hard).
With our general strategy in place, the only problem is to make sure that the algorithm
makes progress at every step. To do so we use the following important lemma:
Lemma 3. The minimum volume ellipsoid surrounding a half ellipsoid (i.e. E_i ∩ H⁺, where H⁺ is a halfspace as above) can be calculated in polynomial time, and

$$Vol(E_{i+1}) \le \left(1 - \frac{1}{2n}\right) Vol(E_i).$$
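For reference, here is a sketch of the standard central-cut update realizing Lemma 3; these closed-form formulas are the textbook ones and were not derived in lecture.

```python
import numpy as np

def ellipsoid_update(c, P, g):
    # One central-cut step for E = {x : (x - c)^T P^{-1} (x - c) <= 1},
    # given a separating vector g, so that K lies in {x : g^T (x - c) <= 0}.
    # Returns the standard minimum-volume ellipsoid containing that half
    # ellipsoid (requires dimension n >= 2).
    n = len(c)
    Pg = P @ g
    gtilde = Pg / np.sqrt(g @ Pg)
    c_new = c - gtilde / (n + 1)
    P_new = (n**2 / (n**2 - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(gtilde, gtilde))
    return c_new, P_new
```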
Figure 4: A few runs of the Ellipsoid method, showing the tiny convex set in blue and the containing ellipsoids. The separating hyperplanes do not pass through the centers of the ellipsoids in this figure, although for the algorithm described here they would.
It’s an interesting exercise to work through a proof of this lemma – you can follow the
proof here: http://people.csail.mit.edu/moitra/docs/6854notes12.pdf.
Thus after T steps the volume of the enclosing ellipsoid has dropped by a factor of (1 − 1/2n)^T ≤ exp(−T/2n). Our starting ellipsoid has volume O(R^n), so in T = O(n log(R^n/r^n)) = O(n² log(R/r)) steps, we can ensure that either we find a point x ∈ K, or

Vol(E_T) ≤ O(r^n).

If this is the case, we can conclude that K is empty, since we assumed that if it wasn't, it at least contained a ball of volume O(r^n).
At first glance Lemma 3 might seem weak – ideally, we might have hoped to cut the volume of our search region in half with each query. We only cut it by a factor of (1 − 1/2n), which costs us an extra n factor in iteration complexity. This loss is an inherent cost of approximating our search region by an ellipsoid.
approximating our search region by an ellipsoid. Recent work on improving the ellipsoid
method seeks to address this issue by maintaining alternative representations of the search
region [3].
References
[1] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proceedings of the International Conference on Machine Learning (ICML), 2003.
[2] Nisheeth K. Vishnoi. Algorithms for Convex Optimization. EPFL lecture notes, 2018.
https://nisheethvishnoi.wordpress.com/convex-optimization/.
[3] Yin Tat Lee, Aaron Sidford, and Sam Chiu-Wai Wong. A Faster Cutting Plane Method and its Implications for Combinatorial and Convex Optimization. FOCS 2015, pages 1049-1065.