
AM 221: Advanced Optimization Spring 2016

Prof. Yaron Singer Lecture 9 — February 24th

1 Overview

In the previous lecture we reviewed results from multivariate calculus in preparation for our journey
into convex optimization. In this lecture we present the gradient descent algorithm for minimizing
a convex function and analyze its convergence properties.

2 The Gradient Descent Algorithm

From the previous lecture, we know that in order to minimize a convex function, we need to find
a stationary point. As we will see in this lecture as well as the upcoming ones, there are different
methods and heuristics to find a stationary point. One possible approach is to start at an arbitrary
point, move along the negative gradient at that point to the next point, and repeat until (hopefully)
converging to a stationary point. We illustrate this in Figure 1 below.

Direction and step size. In general, one can view a search for a stationary point as having
two components: the direction and the step size. The direction decides in which direction we search
next, and the step size determines how far we go in that particular direction. Such methods can be
generally described as starting at some arbitrary point x(0) and then at every step k ≥ 0 iteratively
moving in direction ∆x(k) by step size tk to the next point x(k+1) = x(k) + tk · ∆x(k). In gradient
descent, the direction we search is the negative gradient at the current point, i.e. ∆x(k) = −∇f(x(k)).
Thus, the iterative search of gradient descent can be described through the following recursive rule:

x(k+1) = x(k) − tk∇f(x(k))

Choosing a step size. Given that the search for a stationary point is currently at a certain point
x(k), how should we choose our step size tk? Since our objective is to minimize the function, one
reasonable approach is to choose the step size in a manner that minimizes the value of the new
point, i.e. find the step size that minimizes f(x(k+1)). Since x(k+1) = x(k) − t∇f(x(k)), the step size
t*k of this approach is:

t*k = argmin_{t ≥ 0} f(x(k) − t∇f(x(k)))

For now we will assume that t*k can be computed analytically, and later revisit this assumption.

The algorithm. Formally, given a desired precision ε > 0, we define gradient descent as
described below.

Algorithm 1 Gradient Descent
1: Guess x(0), set k ← 0
2: while ||∇f(x(k))|| ≥ ε do
3:   x(k+1) = x(k) − tk∇f(x(k))
4:   k ← k + 1
5: end while
6: return x(k)

Figure 1: An example of a gradient search for a stationary point (plot of f(x) with iterates x(0), x(1), x(2)).
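To make Algorithm 1 concrete, here is a minimal Python sketch (not part of the original notes). It assumes f and grad_f are supplied by the caller as NumPy-compatible functions, and it approximates the exact line search with scipy.optimize.minimize_scalar over a bounded interval [0, t_max], where t_max is an arbitrary cap chosen for this sketch:

import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad_f, x0, eps=1e-6, t_max=10.0, max_iter=10000):
    # Algorithm 1: iterate x(k+1) = x(k) - t_k * grad f(x(k)) until ||grad f(x(k))|| < eps
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:
            break
        # (approximate) exact line search: t_k = argmin over t in [0, t_max] of f(x - t * g)
        t = minimize_scalar(lambda t: f(x - t * g), bounds=(0.0, t_max), method="bounded").x
        x = x - t * g
    return x

For instance, on the worked example later in this section, gradient_descent(lambda v: 4*v[0]**2 - 4*v[0]*v[1] + 2*v[1]**2, lambda v: np.array([8*v[0] - 4*v[1], -4*v[0] + 4*v[1]]), [2.0, 3.0]) converges to (0, 0).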

Some remarks. The gradient descent algorithm we present here is for unconstrained minimiza-
tion. That is, we assume that every point we choose is feasible (inside S). In a few lectures we will
see that gradient descent can be applied to constrained minimization as well. The stopping condi-
tion where we check ‖∇f(x)‖ ≥ ε does not a priori guarantee that we are ε-close to the optimal
solution, i.e. that we are at a point x(k) for which f(x(k)) − min_{x∈S} f(x) ≤ ε. In section, however,
you will show that this is implied as a consequence of the characterization of convex functions we
showed in the previous lecture. Finally, computing the step size as shown here is called exact line
search. In some cases finding t*k is computationally expensive and different methods are used. In
your problem set this week, you will implement gradient descent and use an alternative method
called backtracking that can be implemented efficiently, as in the sketch below.
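These notes do not specify backtracking beyond its name; one standard version is Armijo backtracking, as described in Chapter 9 of Boyd and Vandenberghe. A minimal sketch, where the parameters alpha ∈ (0, 1/2) and beta ∈ (0, 1) and their default values are conventional choices, not values from this course:

import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.3, beta=0.8):
    # Shrink t geometrically until the sufficient-decrease condition
    #   f(x - t * g) <= f(x) - alpha * t * ||g||^2
    # holds; for differentiable f and g != 0 the loop terminates.
    g = grad_f(x)
    t = 1.0
    while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
        t *= beta
    return t

This replaces the argmin computation of t*k with a cheap loop: each trial step costs only one evaluation of f.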

Example. Consider the problem of minimizing f(x, y) = 4x² − 4xy + 2y² using the gradient
descent method. Notice that the optimal solution is (x, y) = (0, 0). To apply the gradient descent
algorithm let's first compute the gradient:

∇f(x, y) = (∂f(x, y)/∂x, ∂f(x, y)/∂y)ᵀ = (8x − 4y, −4x + 4y)ᵀ

We will start from the point (x(0), y(0)) = (2, 3). To find the next point (x(1), y(1)) we compute:

(x(1), y(1)) = (x(0), y(0)) − t*0 ∇f(x(0), y(0)).

To find t*0 we need to find the minimum of the function θ(t) = f((x(0), y(0)) − t∇f(x(0), y(0))). To
do this we will look for the stationary point:

θ′(t) = −∇f((x(0), y(0)) − t∇f(x(0), y(0)))ᵀ ∇f(x(0), y(0))
      = −∇f(2 − 4t, 3 − 4t)ᵀ (4, 4)ᵀ
      = −(8(2 − 4t) − 4(3 − 4t), −4(2 − 4t) + 4(3 − 4t)) (4, 4)ᵀ
      = −4(4 − 16t) − 4 · 4
      = −32 + 64t

In this case θ′(t) = 0 if and only if t = 1/2. Since the function θ(t) is convex, the stationary point is
a global minimum. Therefore, t*0 = 1/2.
The next point will be:

(x(1), y(1)) = (2, 3) − (1/2) · (4, 4) = (0, 1)

and the algorithm will continue finding the next point by performing similar calculations as above.
It is important to note that consecutive directions in which the algorithm proceeds are orthogonal. That is:

∇f(2, 3)ᵀ ∇f(0, 1) = 0

This is due to the way in which we compute the multiplier t*k:


θ′(t) = 0 ⟺ −∇f((x(k), y(k)) − t∇f(x(k), y(k)))ᵀ ∇f(x(k), y(k)) = 0
       ⟺ ∇f(x(k+1), y(k+1))ᵀ ∇f(x(k), y(k)) = 0
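These calculations can be verified numerically. A short sketch for this specific quadratic: writing f as 0.5 vᵀQv with Q = [[8, −4], [−4, 4]], the gradient is Qv, and exact line search has the standard closed form t* = gᵀg / gᵀQg for quadratics:

import numpy as np

Q = np.array([[8.0, -4.0], [-4.0, 4.0]])  # f(x, y) = 4x^2 - 4xy + 2y^2 = 0.5 * v^T Q v
grad = lambda v: Q @ v

v0 = np.array([2.0, 3.0])
g0 = grad(v0)                             # (4, 4)
t0 = (g0 @ g0) / (g0 @ Q @ g0)            # 32 / 64 = 1/2
v1 = v0 - t0 * g0                         # (0, 1)
g1 = grad(v1)                             # (-4, 4)
print(t0, v1, g0 @ g1)                    # 0.5 [0. 1.] 0.0 -- consecutive gradients are orthogonal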

3 Convergence Analysis of Gradient Descent

The convergence analysis we give holds for strongly convex functions, defined below. We
will first show some important properties of strongly convex functions, and then use these properties
in the proof of the convergence of gradient descent.

3.1 Strongly convex functions

Definition. For a convex set S ⊆ Rⁿ, a convex function f : S → R is called strongly convex
if there exist constants 0 < m < M s.t. for every x ∈ S:

mI ⪯ Hf(x) ⪯ MI

where A ⪯ B means that B − A is positive semidefinite.

It is important to observe the relationship between strictly convex and strongly convex functions,
as we do in the following claim.

Claim 1. Let f be a strongly convex function, then f is strictly convex.

Proof. For any x, y ∈ S, from the second-order Taylor expansion we know that there exists a
z ∈ [x, y] s.t.:

f(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ Hf(z)(y − x)

Strong convexity implies there exists a constant m > 0 s.t.:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²

and hence, for y ≠ x:

f(y) > f(x) + ∇f(x)ᵀ(y − x)

In the previous lecture we proved that a function is convex if and only if, for every x, y ∈ S:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)

The exact same proof shows that a function is strictly convex if and only if, for every x ≠ y ∈ S, the
above inequality is strict. Thus strong convexity indeed implies a strict inequality and hence the
function is strictly convex.

Lemma 2. Let f : S → R be a strongly convex function with parameters m, M as in the definition
above, and let α* = min_{x∈S} f(x). Then:

f(x) − (1/(2m))‖∇f(x)‖₂² ≤ α* ≤ f(x) − (1/(2M))‖∇f(x)‖₂²

Proof. For any x, y ∈ S, from the second-order Taylor expansion we know that there exists a
z ∈ [x, y] s.t.:

f(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ Hf(z)(y − x)

Strong convexity implies there exists a constant m > 0 s.t.:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²

The function:

f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²

is convex quadratic in y and minimized at ỹ = x − (1/m)∇f(x). Therefore we can apply the above
inequality to show that for any y ∈ S, f(y) is lower bounded by the value of the convex quadratic
function at y = ỹ:

f(y) ≥ f(x) + ∇f(x)ᵀ(ỹ − x) + (m/2)‖ỹ − x‖₂²
     = f(x) + ∇f(x)ᵀ(x − (1/m)∇f(x) − x) + (m/2)‖x − (1/m)∇f(x) − x‖₂²
     = f(x) − (1/m)∇f(x)ᵀ∇f(x) + (m/2) · (1/m²)‖∇f(x)‖₂²
     = f(x) − (1/m)‖∇f(x)‖₂² + (1/(2m))‖∇f(x)‖₂²
     = f(x) − (1/(2m))‖∇f(x)‖₂²

Since this holds for any y ∈ S, it holds for y* = argmin_{y∈S} f(y), which implies the first
inequality of the lemma. In a similar manner we can show the other inequality by relying
on the second-order Taylor expansion, upper bounding the Hessian Hf by MI, and choosing
ỹ = x − (1/M)∇f(x).
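For the quadratic example of Section 2, the two bounds of Lemma 2 can be checked directly: the Hessian is constant, so m and M are simply its smallest and largest eigenvalues, and α* = 0. A sketch:

import numpy as np

Q = np.array([[8.0, -4.0], [-4.0, 4.0]])  # constant Hessian of f(x, y) = 4x^2 - 4xy + 2y^2
m, M = np.linalg.eigvalsh(Q)              # eigenvalues in ascending order: m ~ 1.53, M ~ 10.47
f = lambda v: 0.5 * v @ Q @ v
grad = lambda v: Q @ v

v = np.array([2.0, 3.0])
g2 = grad(v) @ grad(v)                    # ||grad f(v)||_2^2 = 32
print(f(v) - g2 / (2 * m))                # ~ -0.47, below alpha* = 0 as the lemma requires
print(f(v) - g2 / (2 * M))                # ~  8.47, above alpha* = 0 as the lemma requires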

3.2 Convergence of gradient descent

Theorem 3. Let f : S → R be a strongly convex function with parameters m, M as in the definition
above. For any ε > 0 we have that f(x(k)) − min_{x∈S} f(x) ≤ ε after k* iterations for any k* that
satisfies:

k* ≥ log((f(x(0)) − α*)/ε) / log(1/(1 − m/M))

Proof. For a given step k define the optimal step size t*k = argmin_{t≥0} f(x(k) − t∇f(x(k))). From the
second-order Taylor expansion we have that:

f(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ Hf(z)(y − x)

Together with strong convexity we have that:

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (M/2)‖y − x‖₂²

For y = x(k) − t∇f(x(k)) and x = x(k) we get:

f(x(k) − t∇f(x(k))) ≤ f(x(k)) + ∇f(x(k))ᵀ(−t∇f(x(k))) + (M/2)‖−t∇f(x(k))‖₂²
                    = f(x(k)) − t‖∇f(x(k))‖₂² + (M/2) t² ‖∇f(x(k))‖₂²
In particular, using t = tM = 1/M we get:

f(x(k) − tM∇f(x(k))) ≤ f(x(k)) − (1/M)‖∇f(x(k))‖₂² + (M/2) · (1/M²)‖∇f(x(k))‖₂²
                     = f(x(k)) − (1/M)‖∇f(x(k))‖₂² + (1/(2M))‖∇f(x(k))‖₂²
                     = f(x(k)) − (1/(2M))‖∇f(x(k))‖₂²

By the minimality of t*k we know that f(x(k) − t*k∇f(x(k))) ≤ f(x(k) − tM∇f(x(k))) and thus:

f(x(k) − t*k∇f(x(k))) ≤ f(x(k)) − (1/(2M))‖∇f(x(k))‖₂²

Notice that x(k+1) = x(k) − t*k∇f(x(k)) and thus:

f(x(k+1)) ≤ f(x(k)) − (1/(2M))‖∇f(x(k))‖₂²

Subtracting α* = min_{x∈S} f(x) from both sides we get:

f(x(k+1)) − α* ≤ f(x(k)) − α* − (1/(2M))‖∇f(x(k))‖₂²

Applying Lemma 2 we know that ‖∇f(x(k))‖₂² ≥ 2m(f(x(k)) − α*), hence:

f(x(k+1)) − α* ≤ f(x(k)) − α* − (1/(2M))‖∇f(x(k))‖₂²
              ≤ f(x(k)) − α* − (m/M)(f(x(k)) − α*)
              = (1 − m/M)(f(x(k)) − α*)

Applying this rule recursively starting from our initial point x(0) we get:

f(x(k)) − α* ≤ (1 − m/M)^k (f(x(0)) − α*)

Thus, f(x(k)) − α* ≤ ε when

k ≥ log((f(x(0)) − α*)/ε) / log(1/(1 − m/M)).

Notice that the number of iterations needed to reach precision ε depends both on how far our
initial point was from the optimal solution and on the ratio between m and M. As m and M get
closer, we have tighter bounds on the strong convexity property of the function, and the algorithm
converges faster as a result, as the numerical sketch below illustrates.
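For the quadratic example of Section 2, Theorem 3's iteration bound can be evaluated and compared against an actual run of gradient descent with exact line search (reusing the closed-form step for quadratics from the example). A sketch:

import numpy as np

Q = np.array([[8.0, -4.0], [-4.0, 4.0]])
m, M = np.linalg.eigvalsh(Q)                          # strong convexity constants of f
f = lambda v: 0.5 * v @ Q @ v                         # alpha* = 0, attained at (0, 0)

eps = 1e-8
v = np.array([2.0, 3.0])
bound = np.log(f(v) / eps) / np.log(1 / (1 - m / M))  # Theorem 3's k*, here ~131

k = 0
while f(v) > eps:                                     # f(x(k)) - alpha* > eps
    g = Q @ v
    v = v - (g @ g) / (g @ Q @ g) * g                 # exact line search step
    k += 1
print(k, bound)                                       # the actual count is well below the bound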

4 Further Reading

For further reading on gradient descent and general descent methods please see Chapter 9 of the
Convex Optimization book by Boyd and Vandenberghe.
