Lecture 3 Gradient Descent
Nicholas Ruozzi
University of Texas at Dallas
Gradient Descent
• Method to find local optima of a differentiable function
• Intuition: gradient tells us direction of greatest increase,
negative gradient gives us direction of greatest decrease
• Take steps in directions that reduce the function value
• Definition of derivative guarantees that if we take a small
enough step in the direction of the negative gradient, the
function will decrease in value
• How small is small enough?
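For reference, the update iterated in the examples that follow is the standard gradient descent step, where γ denotes the step size:

x^{(t+1)} \;=\; x^{(t)} - \gamma \, \nabla f\big(x^{(t)}\big)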
Gradient Descent
When do we stop?
Gradient Descent
f(x) = x²
Step size: .8
x^(0) = −4
Gradient Descent
f(x) = x²
Step size: .8
x^(0) = −4
x^(1) = −4 − .8 · 2 · (−4)
Gradient Descent
f(x) = x²
Step size: .8
x^(0) = −4
x^(1) = 2.4
Gradient Descent
f(x) = x²
Step size: .8
x^(0) = −4
x^(1) = 2.4
x^(2) = 2.4 − .8 · 2 · 2.4
Gradient Descent
f(x) = x²
Step size: .8
x^(0) = −4
x^(1) = 2.4
x^(2) = −1.44
Gradient Descent
f(x) = x²
Step size: .8
x^(0) = −4
x^(1) = 2.4
x^(2) = −1.44
x^(3) = .864
x^(4) = −0.5184
x^(5) = 0.31104
...
x^(30) = −8.84296e−07
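A few lines of Python reproduce these iterates; this is a sketch assuming the update x ← x − γ f′(x) with f(x) = x² and γ = .8, as in the example above:

# Gradient descent on f(x) = x^2 with a fixed step size gamma = .8.
def gradient_descent(x0, gamma, grad, iters):
    xs = [x0]
    for _ in range(iters):
        xs.append(xs[-1] - gamma * grad(xs[-1]))  # step against the gradient
    return xs

xs = gradient_descent(x0=-4.0, gamma=0.8, grad=lambda x: 2 * x, iters=30)
print(xs[1], xs[2], xs[30])  # 2.4, -1.44, roughly -8.84e-07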
Gradient Descent
Step size: .9
Gradient Descent
Step size: .2
Line Search
• Instead of picking a fixed step size that may or may not actually
result in a decrease in the function value, we can consider
minimizing the function along the direction specified by the
gradient to guarantee that the next iteration decreases the
function value
• In other words, choose γ_t = arg min_{γ ≥ 0} f(x^(t) − γ ∇f(x^(t)))
• This is called exact line search
• This optimization problem can be expensive to solve exactly
• However, if f is convex, this is a univariate convex optimization
problem
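A sketch of a single exact line search step; the univariate minimization is delegated to scipy.optimize.minimize_scalar here, which is an implementation choice, not something specified by the slides (the upper bound of 10 on the step size is likewise arbitrary):

import numpy as np
from scipy.optimize import minimize_scalar

# One step of gradient descent with exact line search: pick the step size that
# minimizes f along the negative gradient direction.
def exact_line_search_step(f, grad, x):
    d = -grad(x)                             # descent direction
    phi = lambda gamma: f(x + gamma * d)     # f restricted to the ray from x
    res = minimize_scalar(phi, bounds=(0.0, 10.0), method="bounded")
    return x + res.x * d

f = lambda x: float(np.dot(x, x))            # f(x) = ||x||^2
grad = lambda x: 2 * x
print(exact_line_search_step(f, grad, np.array([-4.0])))  # essentially [0.]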
Backtracking Line Search
• Instead of exact line search, we could simply use a strategy that
finds some step size that decreases the function value (one must
exist)
• Backtracking line search: start with a large step size, γ, and keep
shrinking it until the step decreases the function value
• This always guarantees a decrease, but it may not decrease as
much as exact line search
• Still, this is typically much faster in practice as it only requires
a few function evaluations
Backtracking Line Search
• Set the step size by backtracking: repeatedly shrink it by the factor β until the sufficient-decrease condition with parameter α holds (as in the sketch below)
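A sketch of backtracking line search using the common sufficient-decrease test; the exact condition on the original slide is not shown above, so the α·γ·‖∇f(x)‖² form here is the standard choice consistent with the α, β parameters appearing on the next two slides:

import numpy as np

# Backtracking line search: shrink gamma by beta until a sufficient decrease
# (parameterized by alpha) is achieved along the negative gradient.
def backtracking_step_size(f, grad, x, alpha=0.2, beta=0.99, gamma=1.0):
    g = grad(x)
    while f(x - gamma * g) > f(x) - alpha * gamma * np.dot(g, g):
        gamma *= beta
    return gamma

f = lambda x: float(np.dot(x, x))
grad = lambda x: 2 * x
print(backtracking_step_size(f, grad, np.array([-4.0])))  # about 0.79 here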
Backtracking Line Search
α = .2, β = .99
Backtracking Line Search
α = .1, β = .3
Gradient Descent: Convex Functions
• For convex functions, local optima are always global optima (this
follows from the definition of convexity)
• If gradient descent converges to a critical point, then the
result is a global minimizer
Gradients of Convex Functions
• For a differentiable convex function g(x), its gradients yield linear
underestimators
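Written out, the linear underestimator determined by the gradient at a point x is the first-order condition for convexity:

g(y) \;\ge\; g(x) + \nabla g(x)^{\top} (y - x) \qquad \text{for all } y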
Gradients of Convex Functions
• For a differentiable convex function g(x), its gradients yield linear
underestimators: a zero gradient corresponds to a global
optimum
Subgradients
• For a convex function g(x), a subgradient at a point x_0 is given by any
line that passes through (x_0, g(x_0)) and lies below g(x) for all x, i.e., it is a linear underestimator
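Equivalently, in terms of a slope v (a symbol introduced here for convenience), v is a subgradient of g at x_0 exactly when

g(x) \;\ge\; g(x_0) + v \, (x - x_0) \qquad \text{for all } x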
Subgradients
• If 0 is a subgradient at x_0, then x_0 is a global minimum
Subgradients
• If a convex function g is differentiable at a point x, then it has a
unique subgradient at that point, given by the gradient
• If a convex function is not differentiable at a point, it can have
many subgradients
• E.g., the set of subgradients of the convex function g(x) = |x| at the
point x = 0 is given by the set of slopes [−1, 1]
Subgradient Example
• If x > 0, the unique subgradient of g(x) = |x| is 1
• If x < 0, the unique subgradient of g(x) = |x| is −1
Subgradient Descent
Step Size: .9
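A sketch of the subgradient method on g(x) = |x|; the function and step sizes here are illustrative choices (not taken from the slides), meant to contrast a fixed step with the diminishing rules discussed next:

# Subgradient descent: at each step, move against any subgradient of g.
def subgradient_descent(x0, step_sizes, subgrad):
    x = x0
    for gamma in step_sizes:
        x = x - gamma * subgrad(x)
    return x

# One valid subgradient of g(x) = |x|: sign(x) away from zero, 0 at zero.
subgrad_abs = lambda x: (x > 0) - (x < 0)

print(subgradient_descent(3.0, [0.9] * 50, subgrad_abs))               # fixed step: keeps oscillating near 0
print(subgradient_descent(3.0, [1.0 / t for t in range(1, 51)], subgrad_abs))  # 1/t steps: approaches 0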
Diminishing Step Size Rules
• A fixed step size may not result in convergence for non-
differentiable functions
• Instead, we can use a diminishing step size γ_t
• Required property: the step size must decrease as the number of
iterations increases, but not so quickly that the algorithm fails
to make progress
• Common diminishing step size rules:
• γ_t = c / t for some c > 0
• γ_t = c / √t for some c > 0
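The informal requirement above is usually formalized as the standard conditions

\gamma_t \to 0 \quad \text{as } t \to \infty, \qquad \sum_{t=1}^{\infty} \gamma_t = \infty

and both rules above satisfy them.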
Theoretical Guarantees
• The hard work in convex optimization is to identify conditions
that guarantee quick convergence to within a small error of the
optimum
• Let f_best^(t) = min_{s ≤ t} f(x^(s)) denote the best objective value among the first t iterates
• For a fixed step size, γ, we are guaranteed that f_best^(t) converges to within a small error of the optimal value (a standard form of the bound is sketched below)
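The guarantee referred to above typically takes the following standard form for the subgradient method, assuming the subgradients are bounded, ‖g‖₂ ≤ G, and the starting point satisfies ‖x^(1) − x*‖₂ ≤ R (assumptions not spelled out on the slide):

f_{\text{best}}^{(t)} - f(x^{*}) \;\le\; \frac{R^{2} + G^{2}\gamma^{2} t}{2\gamma t} \;=\; \frac{R^{2}}{2\gamma t} + \frac{G^{2}\gamma}{2}

so the error decreases like 1/t until it reaches a floor of G²γ/2, which is why diminishing step sizes are needed for convergence to the optimum.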