
AM 221: Advanced Optimization Spring 2016

Prof. Yaron Singer Lecture 9 — February 24th

1 Overview

In the previous lecture we reviewed results from multivariate calculus in preparation for our journey
into convex optimization. In this lecture we present the gradient descent algorithm for minimizing
a convex function and analyze its convergence properties.

2 The Gradient Descent Algorithm

From the previous lecture, we know that in order to minimize a convex function, we need to find
a stationary point. As we will see in this lecture as well as the upcoming ones, there are different
methods and heuristics to find a stationary point. One possible approach is to start at an arbitrary
point, move along the negative gradient at that point to the next point, and repeat until (hopefully)
converging to a stationary point. We illustrate this in Figure 1 below.

Direction and step size. In general, one can view a search for a stationary point as having
two components: the direction and the step size. The direction decides in which direction we search
next, and the step size determines how far we go in that particular direction. Such methods can be
generally described as starting at some arbitrary point x(0) and then at every step k ≥ 0 iteratively
moving in direction ∆x(k) by step size tk to the next point x(k+1) = x(k) + tk · ∆x(k). In gradient
descent, the direction we search is the negative gradient at the current point, i.e. ∆x(k) = −∇f(x(k)).
Thus, the iterative search of gradient descent can be described through the following recursive rule:

x(k+1) = x(k) − tk∇f(x(k))

Choosing a step size. Given that the search for a stationary point is currently at a certain point
x(k), how should we choose our step size tk? Since our objective is to minimize the function, one
reasonable approach is to choose the step size in a manner that minimizes the value of the new
point, i.e. find the step size that minimizes f(x(k+1)). Since x(k+1) = x(k) − t∇f(x(k)), the step size
t*k of this approach is:

t*k = argmin_{t ≥ 0} f(x(k) − t∇f(x(k)))

For now we will assume that t*k can be computed analytically, and later revisit this assumption.

The algorithm. Formally, given a desired precision ε > 0, we define gradient descent as
described below.

Algorithm 1 Gradient Descent
1: Guess x(0), set k ← 0
2: while ||∇f(x(k))|| ≥ ε do
3:   x(k+1) = x(k) − tk∇f(x(k))
4:   k ← k + 1
5: end while
6: return x(k)

Figure 1: An example of a gradient search for a stationary point (plot of f(x) with iterates x(0), x(1), x(2)).
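To make Algorithm 1 concrete, here is a minimal Python sketch (not part of the original notes). It assumes f and grad_f are supplied by the caller as NumPy-compatible functions, and it approximates the exact line search with scipy.optimize.minimize_scalar over a bounded interval [0, t_max], where t_max is an arbitrary cap chosen for this sketch:

import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad_f, x0, eps=1e-6, t_max=10.0, max_iter=10000):
    # Algorithm 1: iterate x(k+1) = x(k) - t_k * grad f(x(k)) until ||grad f(x(k))|| < eps
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:
            break
        # (approximate) exact line search: t_k = argmin over t in [0, t_max] of f(x - t * g)
        t = minimize_scalar(lambda t: f(x - t * g), bounds=(0.0, t_max), method="bounded").x
        x = x - t * g
    return x

For instance, on the worked example later in this section, gradient_descent(lambda v: 4*v[0]**2 - 4*v[0]*v[1] + 2*v[1]**2, lambda v: np.array([8*v[0] - 4*v[1], -4*v[0] + 4*v[1]]), [2.0, 3.0]) converges to (0, 0).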

Some remarks. The gradient descent algorithm we present here is for unconstrained minimiza-
tion. That is, we assume that every point we choose is feasible (inside S). In a few lectures we will
see that gradient descent can be applied to constrained minimization as well. The stopping condi-
tion where we check ‖∇f(x)‖ ≥ ε does not a priori guarantee that we are ε-close to the optimal
solution, i.e. that we are at a point x(k) for which f(x(k)) − min_{x∈S} f(x) ≤ ε. In section, however,
you will show that this is implied as a consequence of the characterization of convex functions we
showed in the previous lecture. Finally, computing the step size as shown here is called exact line
search. In some cases finding t*k is computationally expensive and different methods are used. In
your problem set this week, you will implement gradient descent and use an alternative method
called backtracking that can be implemented efficiently, as in the sketch below.
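These notes do not specify backtracking beyond its name; one standard version is Armijo backtracking, as described in Chapter 9 of Boyd and Vandenberghe. A minimal sketch, where the parameters alpha ∈ (0, 1/2) and beta ∈ (0, 1) and their default values are conventional choices, not values from this course:

import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.3, beta=0.8):
    # Shrink t geometrically until the sufficient-decrease condition
    #   f(x - t * g) <= f(x) - alpha * t * ||g||^2
    # holds; for differentiable f and g != 0 the loop terminates.
    g = grad_f(x)
    t = 1.0
    while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
        t *= beta
    return t

This replaces the argmin computation of t*k with a cheap loop: each trial step costs only one evaluation of f.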

Example. Consider the problem of minimizing f(x, y) = 4x² − 4xy + 2y² using the gradient
descent method. Notice that the optimal solution is (x, y) = (0, 0). To apply the gradient descent
algorithm let's first compute the gradient:

∇f(x, y) = (∂f(x, y)/∂x, ∂f(x, y)/∂y)ᵀ = (8x − 4y, −4x + 4y)ᵀ

We will start from the point (x(0), y(0)) = (2, 3). To find the next point (x(1), y(1)) we compute:

(x(1), y(1)) = (x(0), y(0)) − t*0 ∇f(x(0), y(0)).

To find t*0 we need to find the minimum of the function θ(t) = f((x(0), y(0)) − t∇f(x(0), y(0))). To
do this we will look for the stationary point:

θ′(t) = −∇f((x(0), y(0)) − t∇f(x(0), y(0)))ᵀ ∇f(x(0), y(0))
      = −∇f(2 − 4t, 3 − 4t)ᵀ (4, 4)ᵀ
      = −(8(2 − 4t) − 4(3 − 4t), −4(2 − 4t) + 4(3 − 4t)) (4, 4)ᵀ
      = −4(4 − 16t) − 4 · 4
      = −32 + 64t

In this case θ′(t) = 0 if and only if t = 1/2. Since the function θ(t) is convex, the stationary point is
a global minimum. Therefore, t*0 = 1/2.
The next point will be:

(x(1), y(1)) = (2, 3) − (1/2) · (4, 4) = (0, 1)

and the algorithm will continue finding the next point by performing similar calculations as above.
It is important to note that consecutive directions in which the algorithm proceeds are orthogonal. That is:

∇f(2, 3)ᵀ ∇f(0, 1) = 0

This is due to the way in which we compute the multiplier t*k:


θ′(t) = 0 ⟺ −∇f((x(k), y(k)) − t∇f(x(k), y(k)))ᵀ ∇f(x(k), y(k)) = 0
       ⟺ ∇f(x(k+1), y(k+1))ᵀ ∇f(x(k), y(k)) = 0
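These calculations can be verified numerically. A short sketch for this specific quadratic: writing f as 0.5 vᵀQv with Q = [[8, −4], [−4, 4]], the gradient is Qv, and exact line search has the standard closed form t* = gᵀg / gᵀQg for quadratics:

import numpy as np

Q = np.array([[8.0, -4.0], [-4.0, 4.0]])  # f(x, y) = 4x^2 - 4xy + 2y^2 = 0.5 * v^T Q v
grad = lambda v: Q @ v

v0 = np.array([2.0, 3.0])
g0 = grad(v0)                             # (4, 4)
t0 = (g0 @ g0) / (g0 @ Q @ g0)            # 32 / 64 = 1/2
v1 = v0 - t0 * g0                         # (0, 1)
g1 = grad(v1)                             # (-4, 4)
print(t0, v1, g0 @ g1)                    # 0.5 [0. 1.] 0.0 -- consecutive gradients are orthogonal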

3 Convergence Analysis of Gradient Descent

The convergence analysis we give holds for strongly convex functions, defined below. We
will first show some important properties of strongly convex functions, and then use these properties
in the proof of the convergence of gradient descent.

3.1 Strongly convex functions

Definition. For a convex set S ⊆ Rⁿ, a convex function f : S → R is called strongly convex
if there exist constants 0 < m < M s.t. for every x ∈ S:

mI ⪯ Hf(x) ⪯ MI

where A ⪯ B means that B − A is positive semidefinite.

It is important to observe the relationship between strictly convex and strongly convex functions,
as we do in the following claim.

Claim 1. Let f be a strongly convex function, then f is strictly convex.

Proof. For any x, y ∈ S, from the second-order Taylor expansion we know that there exists a
z ∈ [x, y] s.t.:

f(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ Hf(z)(y − x)

Strong convexity implies there exists a constant m > 0 s.t.:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²

and hence, for y ≠ x:

f(y) > f(x) + ∇f(x)ᵀ(y − x)

In the previous lecture we proved that a function is convex if and only if, for every x, y ∈ S:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)

The exact same proof shows that a function is strictly convex if and only if, for every x ≠ y ∈ S, the
above inequality is strict. Thus strong convexity indeed implies a strict inequality and hence the
function is strictly convex.

Lemma 2. Let f : S → R be a strongly convex function with parameters m, M as in the definition
above, and let α* = min_{x∈S} f(x). Then:

f(x) − (1/(2m))‖∇f(x)‖₂² ≤ α* ≤ f(x) − (1/(2M))‖∇f(x)‖₂²

Proof. For any x, y ∈ S, from the second-order Taylor expansion we know that there exists a
z ∈ [x, y] s.t.:

f(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ Hf(z)(y − x)

Strong convexity implies there exists a constant m > 0 s.t.:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²

The function:

f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²

is convex quadratic in y and minimized at ỹ = x − (1/m)∇f(x). Therefore we can apply the above
inequality to show that for any y ∈ S, f(y) is lower bounded by the value of the convex quadratic
function at y = ỹ:

f(y) ≥ f(x) + ∇f(x)ᵀ(ỹ − x) + (m/2)‖ỹ − x‖₂²
     = f(x) + ∇f(x)ᵀ(x − (1/m)∇f(x) − x) + (m/2)‖x − (1/m)∇f(x) − x‖₂²
     = f(x) − (1/m)∇f(x)ᵀ∇f(x) + (m/2) · (1/m²)‖∇f(x)‖₂²
     = f(x) − (1/m)‖∇f(x)‖₂² + (1/(2m))‖∇f(x)‖₂²
     = f(x) − (1/(2m))‖∇f(x)‖₂²

Since this holds for any y ∈ S, it holds for y* = argmin_{y∈S} f(y), which implies the first
inequality of the lemma. In a similar manner we can show the other inequality by relying
on the second-order Taylor expansion, upper bounding the Hessian Hf by MI, and choosing
ỹ = x − (1/M)∇f(x).
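For the quadratic example of Section 2, the two bounds of Lemma 2 can be checked directly: the Hessian is constant, so m and M are simply its smallest and largest eigenvalues, and α* = 0. A sketch:

import numpy as np

Q = np.array([[8.0, -4.0], [-4.0, 4.0]])  # constant Hessian of f(x, y) = 4x^2 - 4xy + 2y^2
m, M = np.linalg.eigvalsh(Q)              # eigenvalues in ascending order: m ~ 1.53, M ~ 10.47
f = lambda v: 0.5 * v @ Q @ v
grad = lambda v: Q @ v

v = np.array([2.0, 3.0])
g2 = grad(v) @ grad(v)                    # ||grad f(v)||_2^2 = 32
print(f(v) - g2 / (2 * m))                # ~ -0.47, below alpha* = 0 as the lemma requires
print(f(v) - g2 / (2 * M))                # ~  8.47, above alpha* = 0 as the lemma requires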

3.2 Convergence of gradient descent

Theorem 3. Let f : S → R be a strongly convex function with parameters m, M as in the definition
above. For any ε > 0 we have that f(x(k)) − min_{x∈S} f(x) ≤ ε after k* iterations for any k* that
satisfies:

k* ≥ log((f(x(0)) − α*)/ε) / log(1/(1 − m/M))

Proof. For a given step k define the optimal step size t*k = argmin_{t≥0} f(x(k) − t∇f(x(k))). From the
second-order Taylor expansion we have that:

f(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ Hf(z)(y − x)

Together with strong convexity we have that:

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (M/2)‖y − x‖₂²

For y = x(k) − t∇f(x(k)) and x = x(k) we get:

f(x(k) − t∇f(x(k))) ≤ f(x(k)) + ∇f(x(k))ᵀ(−t∇f(x(k))) + (M/2)‖−t∇f(x(k))‖₂²
                    = f(x(k)) − t‖∇f(x(k))‖₂² + (M/2) t² ‖∇f(x(k))‖₂²
In particular, using t = tM = 1/M we get:

f(x(k) − tM∇f(x(k))) ≤ f(x(k)) − (1/M)‖∇f(x(k))‖₂² + (M/2) · (1/M²)‖∇f(x(k))‖₂²
                     = f(x(k)) − (1/M)‖∇f(x(k))‖₂² + (1/(2M))‖∇f(x(k))‖₂²
                     = f(x(k)) − (1/(2M))‖∇f(x(k))‖₂²

By the minimality of t*k we know that f(x(k) − t*k∇f(x(k))) ≤ f(x(k) − tM∇f(x(k))) and thus:

f(x(k) − t*k∇f(x(k))) ≤ f(x(k)) − (1/(2M))‖∇f(x(k))‖₂²

Notice that x(k+1) = x(k) − t*k∇f(x(k)) and thus:

f(x(k+1)) ≤ f(x(k)) − (1/(2M))‖∇f(x(k))‖₂²

Subtracting α* = min_{x∈S} f(x) from both sides we get:

f(x(k+1)) − α* ≤ f(x(k)) − α* − (1/(2M))‖∇f(x(k))‖₂²

Applying Lemma 2 we know that ‖∇f(x(k))‖₂² ≥ 2m(f(x(k)) − α*), hence:

f(x(k+1)) − α* ≤ f(x(k)) − α* − (1/(2M))‖∇f(x(k))‖₂²
              ≤ f(x(k)) − α* − (m/M)(f(x(k)) − α*)
              = (1 − m/M)(f(x(k)) − α*)

Applying this rule recursively starting from our initial point x(0) we get:

f(x(k)) − α* ≤ (1 − m/M)^k (f(x(0)) − α*)

Thus, f(x(k)) − α* ≤ ε when

k ≥ log((f(x(0)) − α*)/ε) / log(1/(1 − m/M)).

Notice that the number of iterations needed to reach precision ε depends both on how far our
initial point was from the optimal solution and on the ratio between m and M. As m and M get
closer, we have tighter bounds on the strong convexity property of the function, and the algorithm
converges faster as a result, as the numerical sketch below illustrates.
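For the quadratic example of Section 2, Theorem 3's iteration bound can be evaluated and compared against an actual run of gradient descent with exact line search (reusing the closed-form step for quadratics from the example). A sketch:

import numpy as np

Q = np.array([[8.0, -4.0], [-4.0, 4.0]])
m, M = np.linalg.eigvalsh(Q)                          # strong convexity constants of f
f = lambda v: 0.5 * v @ Q @ v                         # alpha* = 0, attained at (0, 0)

eps = 1e-8
v = np.array([2.0, 3.0])
bound = np.log(f(v) / eps) / np.log(1 / (1 - m / M))  # Theorem 3's k*, here ~131

k = 0
while f(v) > eps:                                     # f(x(k)) - alpha* > eps
    g = Q @ v
    v = v - (g @ g) / (g @ Q @ g) * g                 # exact line search step
    k += 1
print(k, bound)                                       # the actual count is well below the bound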

4 Further Reading

For further reading on gradient descent and general descent methods please see Chapter 9 of the
Convex Optimization book by Boyd and Vandenberghe.
