
ECE 490: Introduction to Optimization Fall 2018

Lecture 11
Basics of Non-convex Optimization and Some Stepsize Rules
Lecturer: Bin Hu, Date: 10/04/2018

So far we have talked about optimization of smooth convex functions. What if the
functions are not convex? Let’s talk about this topic.

11.1 One-Point Convexity


In general, the guarantees for optimization of all non-convex functions are weak. However,
some of the non-convex functions still have nice properties that can be exploited for obtaining
a global guarantee. One such property is the so-called “one-point convexity.” Recall that we
have used the following inequality to prove the linear convergence of the gradient method:
$$\begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix}^T \begin{bmatrix} -2mLI & (m+L)I \\ (m+L)I & -2I \end{bmatrix} \begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix} \ge 0. \tag{11.1}$$

When f is L-smooth and m-strongly convex, we have


$$\begin{bmatrix} x - y \\ \nabla f(x) - \nabla f(y) \end{bmatrix}^T \begin{bmatrix} -2mLI & (m+L)I \\ (m+L)I & -2I \end{bmatrix} \begin{bmatrix} x - y \\ \nabla f(x) - \nabla f(y) \end{bmatrix} \ge 0, \tag{11.2}$$

which is actually more general than (11.1). So we have actually proved that the gradient method converges linearly not only for smooth strongly-convex f but also for all f satisfying (11.1).
Comparing (11.1) with (11.2), we see that we have simply replaced the arbitrary vector y in (11.2) with the specific point x∗ in (11.1). Hence (11.1) can be viewed as a "one-point convexity" condition. For functions satisfying one-point convexity, we can still use the gradient method, which is guaranteed to achieve linear convergence. Notice that the above one-point convexity condition does not even require smoothness.
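To make the condition concrete, the following is a minimal numerical sketch (my own illustration, not from the lecture) that evaluates the quadratic form in (11.1) at sampled points. The test function, the constants m and L, and the minimizer x∗ are all hypothetical choices:

```python
import numpy as np

# Hypothetical check of the one-point convexity condition (11.1) at sampled
# points. The function f, the constants m and L, and the minimizer x_star
# below are illustrative assumptions, not part of the lecture.
m, L = 1.0, 4.0
x_star = np.zeros(2)

def grad_f(x):
    # Gradient of f(x) = (m*x1^2 + L*x2^2)/2, a function with curvature
    # between m and L; swap in any candidate function you want to test.
    return np.array([m * x[0], L * x[1]])

def one_point_gap(x):
    # Quadratic form from (11.1):
    # [e; g]^T [[-2mL*I, (m+L)*I], [(m+L)*I, -2*I]] [e; g]
    e, g = x - x_star, grad_f(x)
    return -2 * m * L * (e @ e) + 2 * (m + L) * (e @ g) - 2 * (g @ g)

rng = np.random.default_rng(0)
gaps = [one_point_gap(rng.normal(size=2)) for _ in range(1000)]
print("minimum gap over samples:", min(gaps))  # nonnegative (up to roundoff) if (11.1) holds
```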
In phase retrieval problems, a commonly-used condition is the regularity condition. The global regularity condition states that the following inequality holds for some positive µ and λ:

$$\begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix}^T \begin{bmatrix} -\lambda I & I \\ I & -\mu I \end{bmatrix} \begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix} \ge 0. \tag{11.3}$$

This is an equivalent form of the one-point convexity and has been used to show linear
convergence of the gradient method for phase retrieval problems. One technical issue is
that usually this condition only holds locally for phase retrieval problems. So a lot of phase
retrieval research focuses on how to develop good initialization techniques that guarantee the initial iterate of the gradient method lies in the region where the regularity condition holds.
Several other one-point convexity conditions include the Polyak-Lojasiewicz (PL) condition, the Quadratic Growth (QG) condition, and the restricted secant inequality. We will not cover these conditions in detail. But the take-home message is that you can expect the problem to be relatively "simple" if the function satisfies some sort of one-point convexity condition.

11.2 Optimization of General Non-Convex Functions


For general non-convex functions, even finding a local min is NP-hard in the worst case. If f is L-smooth and also bounded below by some constant C, the gradient method is guaranteed to converge to a point whose gradient is 0 (a so-called stationary point).
To see this, notice the L-smoothness directly leads to the following
$$\begin{aligned} f(x_{k+1}) &\le f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{L}{2}\|x_{k+1} - x_k\|^2 \\ &= f(x_k) - \left(\alpha - \frac{L\alpha^2}{2}\right)\|\nabla f(x_k)\|^2, \end{aligned}$$

where the equality uses the gradient update $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
Summing the above inequality from k = 0 to k = T and canceling terms, we have

$$f(x_{T+1}) \le f(x_0) - \left(\alpha - \frac{L\alpha^2}{2}\right) \sum_{k=0}^{T} \|\nabla f(x_k)\|^2.$$

This states that the following inequality holds for all T:

$$\left(\alpha - \frac{L\alpha^2}{2}\right) \sum_{k=0}^{T} \|\nabla f(x_k)\|^2 \le f(x_0) - f(x_{T+1}) \le f(x_0) - C.$$
As long as $\alpha - \frac{L\alpha^2}{2} > 0$, we know $\sum_{k=0}^{T} \|\nabla f(x_k)\|^2$ is bounded and non-decreasing as T increases. We know a bounded monotone sequence converges. Hence $\sum_{k=0}^{\infty} \|\nabla f(x_k)\|^2$ exists and $\sum_{k=T}^{\infty} \|\nabla f(x_k)\|^2$ converges to 0 as T goes to $\infty$. Notice $x_{k+T} - x_k = -\alpha \sum_{t=k}^{k+T-1} \nabla f(x_t)$. Hence we can show $\{x_k\}$ is a Cauchy sequence and converges to one point. This point has to have a zero gradient since $\|\nabla f(x_k)\| \to 0$. Therefore, we have shown the gradient method converges to a stationary point.
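As a sanity check on this argument, here is a small sketch (my own illustration, not from the lecture notes) that runs the gradient method on a non-convex, bounded-below, L-smooth function and watches the gradient norm shrink. The test function and stepsize are hypothetical choices satisfying α < 2/L:

```python
import numpy as np

# Illustrative run of gradient descent on a non-convex, bounded-below function.
# f(x) = x^2 + 3*sin(x)^2 is bounded below by 0 and L-smooth with L = 8,
# since f''(x) = 2 + 6*cos(2x) lies in [-4, 8]; it is non-convex because
# f'' is negative on part of the line.

def grad_f(x):
    return 2 * x + 3 * np.sin(2 * x)  # f'(x), using d/dx sin(x)^2 = sin(2x)

alpha = 0.1   # alpha < 2/L = 0.25 ensures alpha - L*alpha^2/2 > 0
x = 2.5       # arbitrary initialization
for k in range(201):
    g = grad_f(x)
    if k % 50 == 0:
        print(f"k={k:3d}  x={x: .6f}  |grad f(x)|={abs(g):.2e}")
    x -= alpha * g
```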
In general, a stationary point may not even be a local min. Recall that x∗ is a local min if there is a neighborhood U around x∗ such that f(x∗) ≤ f(x) for all x ∈ U. Similarly, x∗ is a local max if there is a neighborhood U around x∗ such that f(x∗) ≥ f(x) for all x ∈ U. A point x∗ is a saddle point if it is a stationary point but not a local min or max. So a natural question is whether we can at least avoid converging to some kinds of saddle points. A lot of recent research papers focus on how to escape strict saddle points. Before talking about what strict saddle points are, we first review some optimality conditions for local min.


Now we only consider twice-differentiable f. A sufficient condition guaranteeing that x∗ is a local min is ∇f(x∗) = 0 and ∇²f(x∗) > 0 (positive definite). A necessary condition satisfied by every local min x∗ is ∇f(x∗) = 0 and ∇²f(x∗) ≥ 0 (positive semidefinite). Generally speaking, if a stationary point x∗ has a positive semidefinite Hessian, it is non-trivial to decide whether it is a local min or a saddle point. If a saddle point has a positive semidefinite Hessian, then it is hard to handle. On the other hand, if the Hessian at a saddle point x∗ has a negative minimum eigenvalue, then this saddle point is a strict saddle point, and it is relatively easy to handle. By the Stable Manifold Theorem, we can guarantee that the gradient method with a random initialization does not converge to such strict saddle points (it avoids them with probability one).
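To illustrate these conditions, here is a minimal sketch (an example of my own, not from the lecture) that classifies a stationary point by the eigenvalues of its Hessian; f(x, y) = x² − y² has a strict saddle at the origin:

```python
import numpy as np

# Classify the stationary point (0, 0) of f(x, y) = x^2 - y^2 by its Hessian.
# The Hessian is diag(2, -2): the negative minimum eigenvalue makes this a
# strict saddle point, the kind that random initialization avoids almost surely.
hessian = np.diag([2.0, -2.0])
eigs = np.linalg.eigvalsh(hessian)

if np.all(eigs > 0):
    print("positive definite Hessian: local min (sufficient condition)")
elif np.all(eigs < 0):
    print("negative definite Hessian: local max")
elif eigs.min() < 0:
    print("negative minimum eigenvalue: strict saddle point")
else:
    print("positive semidefinite with a zero eigenvalue: inconclusive")
print("eigenvalues:", eigs)
```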
To summarize, one focus of cutting-edge theoretical research in non-convex optimization is how to escape certain kinds of saddle points. Escaping saddle points is still an important research topic and many people are working on it. For exposure purposes, we briefly discussed escaping saddle points here. This topic is not going to be tested in homework or exams. However, the optimality conditions for a local min are something you may be tested on in HW or exams.

11.3 Stepsize Rules


So far we have talked about the theoretical side of optimization. The theory chooses the stepsize based on the smoothness parameter L. What about practice? How do we implement these methods? Now we talk about some stepsize rules for implementing the gradient method.

1. Trial and error: grid the stepsize α and start by trying larger values of α first. Intuitively, a larger stepsize leads to faster convergence (although this is not always true). If the larger stepsize fails, divide it by a constant factor and try again. Keep shrinking the stepsize until the function value starts to decrease and converge. This is the trial-and-error approach. So in practice, you may have to try various stepsizes before you find something that works.

2. Direct line search: The line search method involves solving a one-dimensional optimization problem at every step. Specifically, choose the stepsize as follows:

$$\alpha_k = \arg\min_{\alpha \in \mathbb{R}} f(x_k - \alpha \nabla f(x_k))$$

So at every step we just try to decrease the function value as much as we can. Although we already know that being greedy at every step may not help in the long run (e.g. using momentum is helpful in the long run but may not be greedy at every step), line search is still a popular heuristic.

3. Armijo rule: This is also known as backtracking search. Fix positive β < 1 and σ < 1, along with an initial stepsize α₀ > 0, in advance. Then find the smallest nonnegative integer m such that

$$f(x_k - \alpha_0 \beta^m \nabla f(x_k)) \le f(x_k) - \sigma \alpha_0 \beta^m \|\nabla f(x_k)\|^2 \tag{11.4}$$


Here, start with m = 0, then increase m until the above inequality is satisfied, and use that m. When f is L-smooth, there always exists an integer m such that the above inequality holds. To see this, notice L-smoothness means

$$\begin{aligned} f(x_k - \alpha_0 \beta^m \nabla f(x_k)) &\le f(x_k) + \nabla f(x_k)^T \left(-\alpha_0 \beta^m \nabla f(x_k)\right) + \frac{L}{2}\left\|{-\alpha_0 \beta^m \nabla f(x_k)}\right\|^2 \\ &= f(x_k) - \left(\alpha_0 \beta^m - \frac{L \beta^{2m} \alpha_0^2}{2}\right) \|\nabla f(x_k)\|^2 \end{aligned}$$

If we choose m such that $\alpha_0 \beta^m - \frac{L \beta^{2m} \alpha_0^2}{2} \ge \sigma \alpha_0 \beta^m$ (which is equivalent to $\beta^m \le \frac{2(1-\sigma)}{\alpha_0 L}$), we guarantee that condition (11.4) is satisfied. Since β < 1, there always exists an m such that the Armijo rule can be used. A small implementation sketch follows after this list.
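Here is the implementation sketch promised above (my own illustration; the test function and the default constants α₀, β, σ are hypothetical choices, not prescribed by the lecture):

```python
import numpy as np

def armijo_step(f, grad_f, x, alpha0=1.0, beta=0.5, sigma=0.1, max_m=50):
    # Backtracking search: find the smallest m with
    #   f(x - alpha0*beta^m * g) <= f(x) - sigma*alpha0*beta^m * ||g||^2,
    # i.e. condition (11.4); alpha0, beta, sigma are illustrative defaults.
    g = grad_f(x)
    gnorm2 = float(np.dot(g, g))
    fx = f(x)
    step = alpha0
    for _ in range(max_m):
        if f(x - step * g) <= fx - sigma * step * gnorm2:
            return x - step * g, step  # accepted stepsize alpha0 * beta^m
        step *= beta                   # inequality failed: shrink and retry
    raise RuntimeError("no acceptable stepsize found")

# Hypothetical usage on a simple non-convex test function:
f = lambda x: float(x[0]**2 + 3 * np.sin(x[0])**2)
grad_f = lambda x: np.array([2 * x[0] + 3 * np.sin(2 * x[0])])

x = np.array([2.5])
for _ in range(20):
    x, used_step = armijo_step(f, grad_f, x)
print("final x:", x, " gradient:", grad_f(x), " last stepsize:", used_step)
```

Because f here is L-smooth, the backtracking loop always terminates once β^m ≤ 2(1 − σ)/(α₀L), matching the argument above.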

For machine learning problems, the learning rate (stepsize) of SGD is typically tuned using the trial-and-error approach. Another popular choice is an adaptive stepsize rule such as Adam or AMSGrad. The point is that the stepsize rules for deterministic optimization and stochastic optimization are quite different.
