Lecture 11
Basics of Non-convex Optimization and Some Stepsize Rules
Lecturer: Bin Hu, Date: 10/04/2018
So far we have talked about optimization of smooth convex functions. What if the
functions are not convex? Let’s talk about this topic.
which is actually more general than (11.1). So we have actually proved that the gradient method
converges linearly not only for smooth strongly convex f but also for all f satisfying (11.1).
Comparing (11.1) with (11.2), we can see that we just replace the arbitrary vector y in (11.2)
with a specific point x∗ in (11.1). Hence (11.1) can be viewed as a “one-point convexity”
condition. For functions satisfying one-point convexity, we can still use the gradient method,
which is guaranteed to achieve linear convergence. Notice that this one-point convexity
condition does not even require smoothness.
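To see why such a condition suffices, here is a short sketch assuming a representative one-point convexity condition with constants $\lambda, \mu > 0$ (the exact form and constants in (11.1) may differ): suppose $\nabla f(x)^T (x - x^*) \ge \frac{\lambda}{2}\|x - x^*\|^2 + \frac{\mu}{2}\|\nabla f(x)\|^2$ for all $x$. Then for the gradient method $x_{k+1} = x_k - \alpha \nabla f(x_k)$ with $0 < \alpha \le \mu$,
\[
\begin{aligned}
\|x_{k+1} - x^*\|^2
&= \|x_k - x^*\|^2 - 2\alpha \nabla f(x_k)^T (x_k - x^*) + \alpha^2 \|\nabla f(x_k)\|^2 \\
&\le (1 - \alpha\lambda)\|x_k - x^*\|^2 + \alpha(\alpha - \mu)\|\nabla f(x_k)\|^2 \\
&\le (1 - \alpha\lambda)\|x_k - x^*\|^2 ,
\end{aligned}
\]
so $\|x_k - x^*\|^2 \le (1 - \alpha\lambda)^k \|x_0 - x^*\|^2$. The condition itself forces $\lambda\mu \le 1$ (by Cauchy–Schwarz, unless f is constant), so the rate $1 - \alpha\lambda$ lies in $[0, 1)$, and smoothness of f is nowhere used in this argument.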
In phase retrieval problems, a commonly-used condition is the regularity condition. The
global regularity condition states that the following inequality holds for some positive µ and
λ:
\[
\begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix}^T
\begin{bmatrix} -\lambda I & I \\ I & -\mu I \end{bmatrix}
\begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix} \ge 0. \tag{11.3}
\]
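Written out, the quadratic form in (11.3) is just a scalar inequality:
\[
2 \nabla f(x)^T (x - x^*) \ge \lambda \|x - x^*\|^2 + \mu \|\nabla f(x)\|^2 ,
\]
which is precisely a one-point convexity condition of the form used in the sketch above.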
This is an equivalent form of the one-point convexity and has been used to show linear
convergence of the gradient method for phase retrieval problems. One technical issue is
that usually this condition only holds locally for phase retrieval problems. So a lot of phase
retrieval research focuses on how to develop good initialization techniques that guarantee
the initial condition of the gradient method is in the region where the regularity condition
holds.
Several other one-point convexity conditions include the Polyak-Lojasiewicz (PL) condi-
tion, Quadratic Growth (QG) condition, and the restricted secant inequality. We will not
cover these conditions in detail, but the take-home message is that you can expect the
problem to be relatively “simple” if the function satisfies some sort of one-point convexity
condition.
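For reference, the PL condition is commonly stated as follows (quoted here in one standard form; the lecture does not fix particular constants): there exists $\mu > 0$ such that
\[
\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu \bigl( f(x) - f(x^*) \bigr) \quad \text{for all } x,
\]
i.e., the squared gradient norm dominates the suboptimality gap. For $L$-smooth f satisfying this inequality, the gradient method with stepsize $1/L$ satisfies $f(x_{k+1}) - f(x^*) \le (1 - \mu/L)\bigl(f(x_k) - f(x^*)\bigr)$, again a linear rate.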
We now discuss several stepsize rules for choosing α in the gradient method.

1. Trial and error: Grid the stepsize α and start by trying some of the larger values first. Intuitively, a larger stepsize leads to faster convergence (although this is not always true). If the larger stepsize fails, divide it by a constant factor and try again. Keep shrinking the stepsize until the function value starts to decrease and converge. This is the trial-and-error approach, so in practice you may have to try various stepsizes before you find something that works.
2. Direct line search: The line search method involves solving a one-dimensional optimization problem at every step. Specifically, choose the stepsize as
\[
\alpha_k = \arg\min_{\alpha \ge 0} f(x_k - \alpha \nabla f(x_k)).
\]
So at every step just try to decrease the function value as much as you can. Although
we already know that being greedy at every step may not help in the long run (e.g.
using momentum is helpful in the long run but may not be greedy at every step), the
line search is still a popular heuristic.
3. Armijo rule: This is also known as backtracking line search. Fix an initial stepsize α_0 > 0 and parameters β < 1 and σ < 1 (both positive) in advance. Then find the smallest integer m ≥ 0 such that
\[
f(x_k - \alpha_0 \beta^m \nabla f(x_k)) \le f(x_k) - \sigma \alpha_0 \beta^m \|\nabla f(x_k)\|^2 . \tag{11.4}
\]
Here, start with m = 0. Then increase m until the above inequality is satisfied, and use
that m (i.e., take the stepsize α_0 β^m). When f is L-smooth, there always exists an integer m such that the above
inequality holds. To see this, notice L-smoothness means
\[
\begin{aligned}
f(x_k - \alpha_0 \beta^m \nabla f(x_k))
&\le f(x_k) + \nabla f(x_k)^T \bigl( -\alpha_0 \beta^m \nabla f(x_k) \bigr) + \frac{L}{2} \bigl\| {-\alpha_0 \beta^m \nabla f(x_k)} \bigr\|^2 \\
&= f(x_k) - \Bigl( \alpha_0 \beta^m - \frac{L \beta^{2m} \alpha_0^2}{2} \Bigr) \|\nabla f(x_k)\|^2 .
\end{aligned}
\]
If we choose m such that $\alpha_0 \beta^m - \frac{L \beta^{2m} \alpha_0^2}{2} \ge \sigma \alpha_0 \beta^m$ (which is equivalent to $\beta^m \le \frac{2(1-\sigma)}{\alpha_0 L}$),
we guarantee the condition (11.4) is satisfied. Since β < 1, there always exists m such
that the Armijo rule can be used.
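To make the backtracking procedure concrete, here is a minimal Python sketch of the Armijo rule inside a gradient descent loop. The function names, default parameter values, and the quadratic test problem are illustrative choices, not part of the lecture notes.

import numpy as np

def armijo_stepsize(f, grad_fk, xk, fk, alpha0=1.0, beta=0.5, sigma=1e-4, max_backtracks=50):
    """Return alpha0 * beta**m for the smallest m >= 0 satisfying condition (11.4):
    f(xk - alpha0*beta**m * grad_fk) <= fk - sigma*alpha0*beta**m * ||grad_fk||^2,
    with beta, sigma in (0, 1)."""
    g2 = float(np.dot(grad_fk, grad_fk))
    alpha = alpha0
    for _ in range(max_backtracks):
        if f(xk - alpha * grad_fk) <= fk - sigma * alpha * g2:
            return alpha
        alpha *= beta  # increase m by one, i.e., shrink the candidate stepsize
    return alpha  # fallback; for L-smooth f the loop terminates earlier

def gradient_descent_armijo(f, grad, x0, tol=1e-8, max_iter=1000):
    """Gradient method x_{k+1} = x_k - alpha_k * grad f(x_k) with Armijo stepsizes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        alpha = armijo_stepsize(f, g, x, f(x))
        x = x - alpha * g
    return x

# Illustrative test problem (my own choice): a strongly convex quadratic.
if __name__ == "__main__":
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad = lambda x: A @ x - b
    x_star = np.linalg.solve(A, b)
    x_hat = gradient_descent_armijo(f, grad, np.zeros(2))
    print("error:", np.linalg.norm(x_hat - x_star))

The inner loop is exactly the search over m: each multiplication by β corresponds to increasing m by one, and the argument above guarantees the loop exits when f is L-smooth.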
For machine learning problems, the learning rate (stepsize) of SGD is typically tuned
using the trial-and-error approach. Another popular choice is an adaptive stepsize rule
such as ADAM or AMSGRAD. The point is that the stepsize rules for deterministic
optimization and stochastic optimization are quite different.