Lecture 3 Gradient Descent


Gradient Descent

Nicholas Ruozzi
University of Texas at Dallas
Gradient Descent
• Method to find local optima of a differentiable function
• Intuition: gradient tells us direction of greatest increase,
negative gradient gives us direction of greatest decrease
• Take steps in directions that reduce the function value
• Definition of derivative guarantees that if we take a small
enough step in the direction of the negative gradient, the
function will decrease in value
• How small is small enough?

2
Gradient Descent

Gradient Descent Algorithm:

• Pick an initial point x^(0)
• Iterate until convergence:
  x^(t+1) = x^(t) − γ_t ∇f(x^(t))

where γ_t is the step size (sometimes called the learning rate)

3
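The iteration above can be sketched in a few lines of Python (function and parameter names here are illustrative, not from the slides):

```python
def gradient_descent(grad, x0, step_size, num_iters):
    """Fixed-step gradient descent: repeatedly step against the gradient."""
    x = x0
    for _ in range(num_iters):
        x = x - step_size * grad(x)  # x^(t+1) = x^(t) - step * grad f(x^(t))
    return x

# Minimize f(x) = x^2, whose gradient is 2x; the minimizer is x = 0.
x_final = gradient_descent(grad=lambda x: 2 * x, x0=-4.0, step_size=0.1, num_iters=100)
```

With step size .1 each update multiplies x by 0.8, so the iterates shrink toward 0.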
Gradient Descent

Gradient Descent Algorithm:

• Pick an initial point x^(0)
• Iterate until convergence:
  x^(t+1) = x^(t) − γ_t ∇f(x^(t))

where γ_t is the step size (sometimes called the learning rate)

When do we stop?

4
Gradient Descent

Gradient Descent Algorithm:

• Pick an initial point x^(0)
• Iterate until convergence:
  x^(t+1) = x^(t) − γ_t ∇f(x^(t))

where γ_t is the step size (sometimes called the learning rate)

Possible stopping criterion: iterate until ‖∇f(x^(t))‖ ≤ ε for some small ε > 0

How small should ε be?

5
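The stopping criterion above can be realized directly in code; this is a sketch for the one-dimensional case, where the gradient norm is just an absolute value and eps plays the role of ε:

```python
def gradient_descent_until(grad, x0, step_size, eps):
    """Iterate until the (1-D) gradient magnitude drops below eps."""
    x = x0
    while abs(grad(x)) > eps:
        x = x - step_size * grad(x)
    return x

# f(x) = x^2: the loop stops once |2x| <= eps, i.e., once |x| <= eps / 2.
x_final = gradient_descent_until(lambda x: 2 * x, x0=-4.0, step_size=0.1, eps=1e-6)
```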
Gradient Descent
f(x) = x²

Step size: .8
x^(0) = −4

6
Gradient Descent
f(x) = x²

Step size: .8
x^(0) = −4
x^(1) = −4 − .8 · 2 · (−4)

7
Gradient Descent
f(x) = x²

Step size: .8
x^(0) = −4
x^(1) = 2.4

8
Gradient Descent
f(x) = x²

Step size: .8
x^(0) = −4
x^(1) = 2.4
x^(2) = 2.4 − .8 · 2 · 2.4
9
Gradient Descent
f(x) = x²

Step size: .8
x^(0) = −4
x^(1) = 2.4
x^(2) = −1.44

10
Gradient Descent
f(x) = x²

Step size: .8
x^(0) = −4
x^(1) = 2.4
x^(2) = −1.44
x^(3) = .864
x^(4) = −0.5184
x^(5) = 0.31104
⋮
x^(30) = −8.84296e−07
11
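The iterates on this slide can be reproduced in a few lines: since f'(x) = 2x, each update multiplies x by (1 − .8 · 2) = −0.6, which is why the sign alternates while the magnitude shrinks.

```python
x = -4.0
iterates = []
for _ in range(5):
    x = x - 0.8 * (2 * x)  # gradient of x^2 is 2x; step size .8
    iterates.append(x)
# iterates is approximately [2.4, -1.44, 0.864, -0.5184, 0.31104]
```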
Gradient Descent

Step size: .9

12
Gradient Descent

Step size: .2

13
Gradient Descent

Step size matters!

14
Gradient Descent

Step size matters!

15
Line Search
• Instead of picking a fixed step size that may or may not actually
result in a decrease in the function value, we can consider
minimizing the function along the direction specified by the
gradient to guarantee that the next iteration decreases the
function value
• In other words, choose γ_t = argmin_{γ ≥ 0} f(x^(t) − γ ∇f(x^(t)))
• This is called exact line search
• This optimization problem can be expensive to solve exactly
• However, if f is convex, this is a univariate convex optimization
problem

16
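Because the line-search objective is univariate (and convex when f is convex), a simple ternary search gives an approximate exact line search. This is a sketch with illustrative names, not the lecture's code:

```python
def exact_line_search(f, x, direction, hi=1.0, iters=100):
    """Approximately minimize step -> f(x + step * direction) over [0, hi]
    by ternary search (valid when that univariate function is convex)."""
    lo = 0.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(x + m1 * direction) < f(x + m2 * direction):
            hi = m2   # the minimizer lies left of m2
        else:
            lo = m1   # the minimizer lies right of m1
    return (lo + hi) / 2

# f(x) = x^2 at x = -4, direction = -grad = 8: the best step is 0.5 (lands at 0).
step = exact_line_search(lambda x: x * x, x=-4.0, direction=8.0)
```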
Backtracking Line Search
• Instead of exact line search, could simply use a strategy that
finds some step size that decreases the function value (one must
exist)
• Backtracking line search: start with a large step size, γ, and keep
shrinking it until f(x^(t) − γ ∇f(x^(t))) < f(x^(t))
• This always guarantees a decrease, but it may not decrease as
much as exact line search
• Still, this is typically much faster in practice as it only requires
a few function evaluations

17
Backtracking Line Search

• To implement backtracking line search, choose two parameters
α ∈ (0, .5) and β ∈ (0, 1)

• Set γ = 1
• While f(x^(t) − γ ∇f(x^(t))) > f(x^(t)) − α γ ‖∇f(x^(t))‖²
  • Set γ = β γ

Iterations continue until
a step size is found that
decreases the function
“enough”
18
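A direct transcription of the loop above, specialized to one dimension so that ‖∇f(x)‖² becomes g·g (names are illustrative):

```python
def backtracking_line_search(f, grad, x, alpha=0.2, beta=0.99):
    """Shrink the step by beta until the sufficient-decrease test passes."""
    g = grad(x)
    gamma = 1.0
    while f(x - gamma * g) > f(x) - alpha * gamma * g * g:
        gamma *= beta
    return gamma

# f(x) = x^2 at x = -4, using the slide's alpha = .2, beta = .99.
gamma = backtracking_line_search(lambda x: x * x, lambda x: 2 * x, x=-4.0)
```

The returned step is guaranteed to decrease f, though possibly by less than exact line search would.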
Backtracking Line Search

α = .2, β = .99
19
Backtracking Line Search

α = .1, β = .3
20
Gradient Descent: Convex Functions

• For convex functions, local optima are always global optima (this
follows from the definition of convexity)
• If gradient descent converges to a critical point, then the
result is a global minimizer

• Not all convex functions are differentiable; can we still apply
gradient descent?

21
Gradients of Convex Functions
• For a differentiable convex function f, its gradients yield linear
underestimators: f(y) ≥ f(x) + ∇f(x)ᵀ (y − x) for all x and y

[figure: a convex function with a tangent line g(x) lying below it]
22
Gradients of Convex Functions
• For a differentiable convex function f, its gradients yield linear
underestimators: zero gradient corresponds to a global
optimum

[figure: a convex function with a horizontal tangent line g(x) at its minimum]
24
Subgradients
• For a convex function f, a subgradient at a point x₀ is given by any
line g such that g(x₀) = f(x₀) and g(x) ≤ f(x) for all x, i.e., it is a
linear underestimator

[figure: a convex function f with a subgradient line g(x) touching it at x₀]
25
Subgradients
• For a convex function f, a subgradient at a point x₀ is given by any
line g such that g(x₀) = f(x₀) and g(x) ≤ f(x) for all x, i.e., it is a
linear underestimator

If the zero-slope line is a subgradient
at x₀, then x₀ is a global
minimum

[figure: a convex function with a horizontal subgradient line g(x) at x₀]
28
Subgradients
• If a convex function f is differentiable at a point x, then it has a
unique subgradient at that point, given by the gradient
• If a convex function is not differentiable at a point x, it can have
many subgradients
  • E.g., the set of subgradients of the convex function f(x) = |x| at the
point x = 0 is given by the set of slopes [−1, 1]
• The set of all subgradients of f at x forms a convex set, i.e., if g₁ and g₂
are subgradients, then λ g₁ + (1 − λ) g₂ is also a subgradient for all λ ∈ [0, 1]
• Subgradients are only guaranteed to exist for convex functions

29
Subgradient Example

• Subgradient of max(f₁, f₂) for convex functions f₁ and f₂?

30
Subgradient Example

• Subgradient of max(f₁, f₂) for convex functions f₁ and f₂?

• If f₁(x) > f₂(x), ∇f₁(x) is a subgradient

• If f₂(x) > f₁(x), ∇f₂(x) is a subgradient

• If f₁(x) = f₂(x), ∇f₁(x) and ∇f₂(x) are both subgradients (and so are all convex
combinations of these)

31
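The case analysis above can be written as a small hypothetical helper: return the gradient of whichever function attains the max; at a tie, either choice (or any convex combination) is valid.

```python
def subgradient_of_max(f1, grad1, f2, grad2, x):
    """Return one valid subgradient of max(f1, f2) at x."""
    if f1(x) >= f2(x):      # at a tie, grad1 is one valid choice
        return grad1(x)
    return grad2(x)

# |x| = max(x, -x): the subgradient is 1 for x > 0 and -1 for x < 0.
s = subgradient_of_max(lambda x: x, lambda x: 1.0, lambda x: -x, lambda x: -1.0, 2.0)
```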
Subgradient Descent

Subgradient Descent Algorithm:

• Pick an initial point x^(0)
• Iterate until convergence:
  x^(t+1) = x^(t) − γ_t g^(t)

where γ_t is the step size and g^(t) is a subgradient of f at x^(t)

32
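A sketch of the algorithm on f(x) = |x|, whose subgradient is sign(x). Because the objective need not decrease at every step, the code also tracks the best iterate seen so far (names are illustrative):

```python
def subgradient_descent(subgrad, x0, step_sizes):
    """Run subgradient descent; also track the best iterate for f(x) = |x|."""
    x = best = x0
    for gamma in step_sizes:
        x = x - gamma * subgrad(x)
        if abs(x) < abs(best):   # best-so-far, specific to the |x| objective
            best = x
    return x, best

sign = lambda x: 0.0 if x == 0 else (1.0 if x > 0 else -1.0)
# With a fixed step of .9, the iterates eventually just oscillate around 0.
x_last, x_best = subgradient_descent(sign, x0=-4.0, step_sizes=[0.9] * 50)
```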
Subgradient Descent

Subgradient Descent Algorithm:

• Pick an initial point x^(0)
• Iterate until convergence:
  x^(t+1) = x^(t) − γ_t g^(t)

where γ_t is the step size and g^(t) is a subgradient of f at x^(t)

Can you use line search here?

33
Subgradient Descent

Step Size: .9

34
Diminishing Step Size Rules
• A fixed step size may not result in convergence for non-
differentiable functions
• Instead, can use a diminishing step size:
• Required property: step size must decrease as number of
iterations increase but not too quickly that the algorithm fails
to make progress
• Common diminishing step size rules:
• for some
• for some

35
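Repeating the |x| experiment with the diminishing rule γ_t = c/√t (c = 1 here, an illustrative choice) drives the best iterate toward the optimum, unlike the fixed-step run:

```python
import math

sign = lambda x: 0.0 if x == 0 else (1.0 if x > 0 else -1.0)

x = -4.0
best = abs(x)
for t in range(1, 2001):
    gamma = 1.0 / math.sqrt(t)   # diminishing step size rule, c = 1
    x = x - gamma * sign(x)
    best = min(best, abs(x))
# best keeps shrinking toward 0 as t grows
```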
Subgradient Descent

Diminishing Step Size

36
Theoretical Guarantees
• The hard work in convex optimization is to identify conditions
that guarantee quick convergence to within a small error of the
optimum
• Let
• For a fixed step size, , we are guaranteed that

where is some positive constant that depends on


• If is differentiable, then we have whenever is small enough

37
