
ECE 490: Introduction to Optimization Fall 2018

Lecture 11
Basics of Non-convex Optimization and Some Stepsize Rules
Lecturer: Bin Hu, Date: 10/04/2018

So far we have talked about optimization of smooth convex functions. What if the
functions are not convex? Let’s talk about this topic.

11.1 One-Point Convexity


In general, the guarantees for optimization of all non-convex functions are weak. However,
some of the non-convex functions still have nice properties that can be exploited for obtaining
a global guarantee. One such property is the so-called “one-point convexity.” Recall that we
have used the following inequality to prove the linear convergence of the gradient method:
$$\begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix}^T \begin{bmatrix} -2mLI & (m+L)I \\ (m+L)I & -2I \end{bmatrix} \begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix} \ge 0. \tag{11.1}$$

When f is L-smooth and m-strongly convex, we have


$$\begin{bmatrix} x - y \\ \nabla f(x) - \nabla f(y) \end{bmatrix}^T \begin{bmatrix} -2mLI & (m+L)I \\ (m+L)I & -2I \end{bmatrix} \begin{bmatrix} x - y \\ \nabla f(x) - \nabla f(y) \end{bmatrix} \ge 0, \tag{11.2}$$

which is actually more general than (11.1). So we have actually proved that the gradient method converges linearly not only for smooth strongly-convex f but also for all f satisfying (11.1).
Comparing (11.1) with (11.2), we see that we have simply replaced the arbitrary vector y in (11.2) with the specific point x∗ in (11.1). Hence (11.1) can be viewed as a "one-point convexity" condition. For functions satisfying one-point convexity, we can still use the gradient method, which is guaranteed to achieve linear convergence. Notice that the above one-point convexity condition does not even require smoothness.
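To make the condition concrete, the following is a minimal numerical sketch (my own illustration, not from the lecture) that evaluates the quadratic form in (11.1) at sampled points. The test function, the constants m and L, and the minimizer x∗ are all hypothetical choices:

```python
import numpy as np

# Hypothetical check of the one-point convexity condition (11.1) at sampled
# points. The function f, the constants m and L, and the minimizer x_star
# below are illustrative assumptions, not part of the lecture.
m, L = 1.0, 4.0
x_star = np.zeros(2)

def grad_f(x):
    # Gradient of f(x) = (m*x1^2 + L*x2^2)/2, a function with curvature
    # between m and L; swap in any candidate function you want to test.
    return np.array([m * x[0], L * x[1]])

def one_point_gap(x):
    # Quadratic form from (11.1):
    # [e; g]^T [[-2mL*I, (m+L)*I], [(m+L)*I, -2*I]] [e; g]
    e, g = x - x_star, grad_f(x)
    return -2 * m * L * (e @ e) + 2 * (m + L) * (e @ g) - 2 * (g @ g)

rng = np.random.default_rng(0)
gaps = [one_point_gap(rng.normal(size=2)) for _ in range(1000)]
print("minimum gap over samples:", min(gaps))  # nonnegative (up to roundoff) if (11.1) holds
```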
In phase retrieval problems, a commonly-used condition is the regularity condition. The global regularity condition states that the following inequality holds for some positive µ and λ:

$$\begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix}^T \begin{bmatrix} -\lambda I & I \\ I & -\mu I \end{bmatrix} \begin{bmatrix} x - x^* \\ \nabla f(x) \end{bmatrix} \ge 0. \tag{11.3}$$

This is an equivalent form of the one-point convexity and has been used to show linear
convergence of the gradient method for phase retrieval problems. One technical issue is
that usually this condition only holds locally for phase retrieval problems. So a lot of phase
retrieval research focuses on how to develop good initialization techniques that guarantee the initial iterate of the gradient method lies in the region where the regularity condition holds.
Several other one-point convexity conditions include the Polyak-Lojasiewicz (PL) condition, the Quadratic Growth (QG) condition, and the restricted secant inequality. We will not cover these conditions in detail. But the take-home message is that you can expect the problem to be relatively "simple" if the function satisfies some sort of one-point convexity condition.

11.2 Optimization of General Non-Convex Functions


For general non-convex functions, even finding a local min is NP-hard in the worst case. If f is L-smooth and also bounded below by some constant C, the gradient method is guaranteed to converge to a point whose gradient is 0 (a so-called stationary point).
To see this, notice the L-smoothness directly leads to the following
$$\begin{aligned} f(x_{k+1}) &\le f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{L}{2}\|x_{k+1} - x_k\|^2 \\ &= f(x_k) - \left(\alpha - \frac{L\alpha^2}{2}\right)\|\nabla f(x_k)\|^2, \end{aligned}$$

where the equality uses the gradient update $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
Summing the above inequality from k = 0 to k = T and canceling terms, we have

$$f(x_{T+1}) \le f(x_0) - \left(\alpha - \frac{L\alpha^2}{2}\right) \sum_{k=0}^{T} \|\nabla f(x_k)\|^2.$$

This states that the following inequality holds for all T:

$$\left(\alpha - \frac{L\alpha^2}{2}\right) \sum_{k=0}^{T} \|\nabla f(x_k)\|^2 \le f(x_0) - f(x_{T+1}) \le f(x_0) - C.$$
As long as $\alpha - \frac{L\alpha^2}{2} > 0$, we know $\sum_{k=0}^{T} \|\nabla f(x_k)\|^2$ is bounded and non-decreasing as T increases. We know a bounded monotone sequence converges. Hence $\sum_{k=0}^{\infty} \|\nabla f(x_k)\|^2$ exists and $\sum_{k=T}^{\infty} \|\nabla f(x_k)\|^2$ converges to 0 as T goes to $\infty$. Notice $x_{k+T} - x_k = -\alpha \sum_{t=k}^{k+T-1} \nabla f(x_t)$. Hence we can show $\{x_k\}$ is a Cauchy sequence and converges to one point. This point has to have a zero gradient since $\|\nabla f(x_k)\| \to 0$. Therefore, we have shown the gradient method converges to a stationary point.
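As a sanity check on this argument, here is a small sketch (my own illustration, not from the lecture notes) that runs the gradient method on a non-convex, bounded-below, L-smooth function and watches the gradient norm shrink. The test function and stepsize are hypothetical choices satisfying α < 2/L:

```python
import numpy as np

# Illustrative run of gradient descent on a non-convex, bounded-below function.
# f(x) = x^2 + 3*sin(x)^2 is bounded below by 0 and L-smooth with L = 8,
# since f''(x) = 2 + 6*cos(2x) lies in [-4, 8]; it is non-convex because
# f'' is negative on part of the line.

def grad_f(x):
    return 2 * x + 3 * np.sin(2 * x)  # f'(x), using d/dx sin(x)^2 = sin(2x)

alpha = 0.1   # alpha < 2/L = 0.25 ensures alpha - L*alpha^2/2 > 0
x = 2.5       # arbitrary initialization
for k in range(201):
    g = grad_f(x)
    if k % 50 == 0:
        print(f"k={k:3d}  x={x: .6f}  |grad f(x)|={abs(g):.2e}")
    x -= alpha * g
```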
In general, a stationary point may not even be a local min. Recall that x∗ is a local min if there is a neighborhood U around x∗ such that f(x∗) ≤ f(x) for all x ∈ U. Similarly, x∗ is a local max if there is a neighborhood U around x∗ such that f(x∗) ≥ f(x) for all x ∈ U. A point x∗ is a saddle point if it is a stationary point but not a local min or max. So a natural question is whether we can at least avoid converging to some kinds of saddle points. A lot of recent research papers focus on how to escape strict saddle points. Before talking about what strict saddle points are, we first review some optimality conditions for local min.


Now we only consider twice-differentiable f. A sufficient condition guaranteeing that x∗ is a local min is ∇f(x∗) = 0 and ∇²f(x∗) > 0 (positive definite). A necessary condition satisfied by every local min x∗ is ∇f(x∗) = 0 and ∇²f(x∗) ≥ 0 (positive semidefinite). Generally speaking, if a stationary point x∗ has a positive semidefinite Hessian, it is non-trivial to decide whether it is a local min or a saddle point. If a saddle point has a positive semidefinite Hessian, then it is hard to handle. On the other hand, if the Hessian at a saddle point x∗ has a negative minimum eigenvalue, then this saddle point is a strict saddle point, and it is relatively easy to handle. By the Stable Manifold Theorem, we can guarantee that the gradient method with a random initialization does not converge to such strict saddle points (it avoids them with probability one).
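To illustrate these conditions, here is a minimal sketch (an example of my own, not from the lecture) that classifies a stationary point by the eigenvalues of its Hessian; f(x, y) = x² − y² has a strict saddle at the origin:

```python
import numpy as np

# Classify the stationary point (0, 0) of f(x, y) = x^2 - y^2 by its Hessian.
# The Hessian is diag(2, -2): the negative minimum eigenvalue makes this a
# strict saddle point, the kind that random initialization avoids almost surely.
hessian = np.diag([2.0, -2.0])
eigs = np.linalg.eigvalsh(hessian)

if np.all(eigs > 0):
    print("positive definite Hessian: local min (sufficient condition)")
elif np.all(eigs < 0):
    print("negative definite Hessian: local max")
elif eigs.min() < 0:
    print("negative minimum eigenvalue: strict saddle point")
else:
    print("positive semidefinite with a zero eigenvalue: inconclusive")
print("eigenvalues:", eigs)
```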
To summarize, one focus of cutting-edge theoretical research in non-convex optimization is how to escape certain kinds of saddle points. Escaping saddle points is still an important research topic and many people are working on it. For exposure purposes, we briefly discussed escaping saddle points here. This topic is not going to be tested in homework or exams. However, the optimality conditions for a local min are something you may be tested on in HW or exams.

11.3 Stepsize Rules


So far we have talked about the theoretical side of optimization. The theory chooses the stepsize based on the smoothness parameter L. What about practice? How do we implement these methods? Now we talk about some stepsize rules for implementing the gradient method.

1. Trial and error: grid the stepsize α and start by trying larger values of α first. Intuitively, a larger stepsize leads to faster convergence (although this is not always true). If the larger stepsize fails, divide it by a constant factor and try again. Keep shrinking the stepsize until the function value starts to decrease and converge. This is the trial-and-error approach. So in practice, you may have to try various stepsizes before you find something that works.

2. Direct line search: The line search method involves solving a one-dimensional optimization problem at every step. Specifically, choose the stepsize as follows:

$$\alpha_k = \arg\min_{\alpha \in \mathbb{R}} f(x_k - \alpha \nabla f(x_k))$$

So at every step we just try to decrease the function value as much as we can. Although we already know that being greedy at every step may not help in the long run (e.g. using momentum is helpful in the long run but may not be greedy at every step), line search is still a popular heuristic.

3. Armijo rule: This is also known as backtracking search. Fix positive β < 1 and σ < 1, along with an initial stepsize α₀ > 0, in advance. Then find the smallest nonnegative integer m such that

$$f(x_k - \alpha_0 \beta^m \nabla f(x_k)) \le f(x_k) - \sigma \alpha_0 \beta^m \|\nabla f(x_k)\|^2 \tag{11.4}$$


Here, start with m = 0, then increase m until the above inequality is satisfied, and use that m. When f is L-smooth, there always exists an integer m such that the above inequality holds. To see this, notice L-smoothness means

$$\begin{aligned} f(x_k - \alpha_0 \beta^m \nabla f(x_k)) &\le f(x_k) + \nabla f(x_k)^T \left(-\alpha_0 \beta^m \nabla f(x_k)\right) + \frac{L}{2}\left\|{-\alpha_0 \beta^m \nabla f(x_k)}\right\|^2 \\ &= f(x_k) - \left(\alpha_0 \beta^m - \frac{L \beta^{2m} \alpha_0^2}{2}\right) \|\nabla f(x_k)\|^2 \end{aligned}$$

If we choose m such that $\alpha_0 \beta^m - \frac{L \beta^{2m} \alpha_0^2}{2} \ge \sigma \alpha_0 \beta^m$ (which is equivalent to $\beta^m \le \frac{2(1-\sigma)}{\alpha_0 L}$), we guarantee that condition (11.4) is satisfied. Since β < 1, there always exists an m such that the Armijo rule can be used. A small implementation sketch follows after this list.
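Here is the implementation sketch promised above (my own illustration; the test function and the default constants α₀, β, σ are hypothetical choices, not prescribed by the lecture):

```python
import numpy as np

def armijo_step(f, grad_f, x, alpha0=1.0, beta=0.5, sigma=0.1, max_m=50):
    # Backtracking search: find the smallest m with
    #   f(x - alpha0*beta^m * g) <= f(x) - sigma*alpha0*beta^m * ||g||^2,
    # i.e. condition (11.4); alpha0, beta, sigma are illustrative defaults.
    g = grad_f(x)
    gnorm2 = float(np.dot(g, g))
    fx = f(x)
    step = alpha0
    for _ in range(max_m):
        if f(x - step * g) <= fx - sigma * step * gnorm2:
            return x - step * g, step  # accepted stepsize alpha0 * beta^m
        step *= beta                   # inequality failed: shrink and retry
    raise RuntimeError("no acceptable stepsize found")

# Hypothetical usage on a simple non-convex test function:
f = lambda x: float(x[0]**2 + 3 * np.sin(x[0])**2)
grad_f = lambda x: np.array([2 * x[0] + 3 * np.sin(2 * x[0])])

x = np.array([2.5])
for _ in range(20):
    x, used_step = armijo_step(f, grad_f, x)
print("final x:", x, " gradient:", grad_f(x), " last stepsize:", used_step)
```

Because f here is L-smooth, the backtracking loop always terminates once β^m ≤ 2(1 − σ)/(α₀L), matching the argument above.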

For machine learning problems, the learning rate (stepsize) of SGD is typically tuned using the trial-and-error approach. Another popular choice is an adaptive stepsize rule such as Adam or AMSGrad. The point is that the stepsize rules for deterministic optimization and stochastic optimization are quite different.
