Gradient Descent

The document provides an overview of machine learning algorithms, focusing on supervised and unsupervised learning, and introduces concepts from multivariate calculus essential for optimization in machine learning. It explains gradient descent and subgradient descent methods for finding local minima of functions, emphasizing the importance of step size and line search techniques. Additionally, it discusses the role of convexity in optimization and the use of subgradients for non-differentiable functions.

Recap: Machine learning algorithms

• Supervised learning: classification (discrete outputs), regression (continuous outputs)
• Unsupervised learning: clustering (discrete structure), dimensionality reduction (continuous structure)
MOTIVATION
• Gradient descent finds global minima/maxima when the function is convex
• It finds local minima/maxima when the function is not convex
• It is used for optimizing many models in machine learning, in conjunction with:
   • Linear Regression
   • Logistic Regression
   • Support Vector Machines
FUNCTION EXAMPLE

QUICKEST EVER REVIEW OF MULTIVARIATE CALCULUS
• Derivative
• Partial Derivative
• Gradient Vector
DERIVATIVE
• The derivative is the slope of the tangent line:

f(x) = x²
f′(x) = 2x
f″(x) = 2

Easy when a function is univariate.

PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
For multivariate functions (e.g. two variables) we need partial derivatives, one per dimension. Examples of multivariate functions:

f(x, y) = x² + y² (convex)    f(x, y) = −x² − y² (concave)    f(x, y) = cos²(x) + y²    f(x, y) = cos²(x) + cos²(y)
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• To visualize the partial derivative for each of the dimensions x and y, we can imagine a plane that “cuts” our surface along one of the dimensions; once again we get the slope of a tangent line.

Surface: f(x, y) = 9 − x² − y²    Plane: y = 1    Cut: f(x, 1) = 8 − x²

Slope/derivative of the cut: f′(x) = −2x
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• If we partially differentiate a function with respect to x, we pretend y is constant.

f(x, y) = 9 − x² − y²

∂f/∂x = −2x    ∂f/∂y = −2y
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• The two tangent lines that pass through a point define the tangent plane at that point.
GRADIENT VECTOR
• The gradient is the vector whose coordinates are the partial derivatives of the function:

f(x, y) = 9 − x² − y²
∂f/∂x = −2x
∂f/∂y = −2y

∇f = (∂f/∂x) i + (∂f/∂y) j = (∂f/∂x, ∂f/∂y) = (−2x, −2y)

Note: the gradient vector is not parallel to the tangent surface.
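As a quick numerical check (not from the slides), a central-difference approximation of ∇f should match (−2x, −2y); a minimal numpy sketch:

import numpy as np

def f(p):
    x, y = p
    return 9 - x**2 - y**2          # the surface from the slide

p = np.array([1.0, 2.0])            # an arbitrary test point
h = 1e-6
num_grad = np.array([(f(p + h * e) - f(p - h * e)) / (2 * h) for e in np.eye(2)])
print(num_grad)                      # approximately [-2., -4.], i.e. (-2x, -2y) at (1, 2)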


GRADIENT DESCENT

• A method to find local optima of a differentiable function f

• Intuition: the gradient tells us the direction of greatest increase; the negative gradient gives us the direction of greatest decrease
• Take steps in directions that reduce the function value

• The definition of the derivative guarantees that if we take a small enough step in the direction of the negative gradient, the function will decrease in value
• How small is small enough?
GRADIENT DESCENT
Gradient Descent Algorithm:

• Pick an initial point x₀

• Iterate until convergence:

x_{t+1} = x_t − γ_t ∇f(x_t)

where γ_t is the t-th step size (sometimes called the learning rate)

Possible stopping criterion: iterate until ‖∇f(x_t)‖ ≤ ε for some ε > 0
How small should ε be?
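A minimal Python sketch of this loop, assuming a fixed step size and the gradient-norm stopping rule above; the test function f(x, y) = x² + y² and all parameter values are illustrative:

import numpy as np

def gradient_descent(grad, x0, step=0.1, eps=1e-8, max_iters=10000):
    # Fixed-step gradient descent; stop when the gradient norm is <= eps.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - step * g
    return x

# Minimize f(x, y) = x^2 + y^2, whose gradient is (2x, 2y).
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0]))   # close to [0, 0]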
GRADIENT DESCENT: f(x) = x²

Step size: 0.8
x⁽⁰⁾ = −4
x⁽¹⁾ = −4 − 0.8 ⋅ 2 ⋅ (−4) = 2.4
x⁽²⁾ = 2.4 − 0.8 ⋅ 2 ⋅ 2.4 = −1.44
x⁽³⁾ = 0.864
x⁽⁴⁾ = −0.5184
x⁽⁵⁾ = 0.31104
…
x⁽³⁰⁾ ≈ −8.84296e−07
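These iterates can be reproduced with a short Python loop (a sketch for checking the numbers, not code from the slides):

# f(x) = x^2, f'(x) = 2x, step size 0.8, starting at x(0) = -4
x = -4.0
step = 0.8
for t in range(1, 31):
    x = x - step * 2 * x       # x(t) = x(t-1) - step * f'(x(t-1))
    if t <= 5 or t == 30:
        print(t, x)
# prints 2.4, -1.44, 0.864, -0.5184, 0.31104, ..., and ~ -8.84e-07 at t = 30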
GRADIENT DESCENT

Step size: 0.9

Step size: 0.2

Step size matters!
LINE SEARCH

• Instead of picking a fixed step size that may or may not actually result in a decrease in the function value, we can minimize the function along the direction specified by the gradient, guaranteeing that the next iterate decreases the function value
• In other words, choose x_{t+1} ∈ argmin_{γ ≥ 0} f(x_t − γ∇f(x_t))

• This is called exact line search

• This optimization problem can be expensive to solve exactly
• However, if f is convex, it is a univariate convex optimization problem
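For example, a single exact-line-search step can be approximated numerically with scipy's univariate minimizer (a sketch; the search bracket (0, 10) on γ and the test function are assumptions):

import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search_step(f, grad, x):
    g = grad(x)
    # Minimize phi(gamma) = f(x - gamma * g) over gamma >= 0.
    res = minimize_scalar(lambda gamma: f(x - gamma * g), bounds=(0.0, 10.0), method="bounded")
    return x - res.x * g

f = lambda x: float(x @ x)            # f(x) = ||x||^2
grad = lambda x: 2 * x
print(exact_line_search_step(f, grad, np.array([3.0, -2.0])))   # gamma* = 0.5 here, lands at ~[0, 0]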
BACKTRACKING LINE SEARCH

• Instead of exact line search, we can simply use a strategy that finds some step size that decreases the function value (one must exist)
• Backtracking line search: start with a large step size, γ, and keep shrinking it until f(x_t − γ∇f(x_t)) < f(x_t)

• This always guarantees a decrease, but it may not decrease as much as exact line search
• Still, this is typically much faster in practice, as it only requires a few function evaluations
BACKTRACKING LINE SEARCH

• To implement backtracking line search, choose two parameters α ∈ (0, 0.5), β ∈ (0, 1)

• Set γ = 1
• While f(x_t − γ∇f(x_t)) > f(x_t) − α ⋅ γ ⋅ ‖∇f(x_t)‖²:
   • γ = βγ
• Set x_{t+1} = x_t − γ∇f(x_t)

Iterations continue until a step size is found that decreases the function “enough”.
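A minimal Python sketch of this procedure (the quadratic test function and the values α = 0.2, β = 0.5 are illustrative):

import numpy as np

def backtracking_step(f, grad, x, alpha=0.2, beta=0.5):
    g = grad(x)
    gamma = 1.0
    # Shrink gamma until f(x - gamma*g) <= f(x) - alpha * gamma * ||g||^2.
    while f(x - gamma * g) > f(x) - alpha * gamma * np.dot(g, g):
        gamma *= beta
    return x - gamma * g

f = lambda x: float(x @ x)
grad = lambda x: 2 * x
x = np.array([3.0, -2.0])
for _ in range(20):
    x = backtracking_step(f, grad, x)
print(x)   # close to [0, 0]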
BACKTRACKING LINE SEARCH

α = 0.2, β = 0.99

α = 0.1, β = 0.3
GRADIENT DESCENT: CONVEX FUNCTIONS

• For convex functions, local optima are always global optima (this follows from the definition of convexity)
• If gradient descent converges to a critical point, then the result is a global minimizer

• Not all convex functions are differentiable; can we still apply gradient descent?
GRADIENTS OF CONVEX FUNCTIONS

• For a differentiable convex function g(x), its gradients yield linear underestimators: zero gradient corresponds to a global optimum
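Written out, this is the first-order characterization of convexity:

g(y) \ge g(x) + \nabla g(x)^{\top} (y - x) \qquad \text{for all } x, y

Setting ∇g(x) = 0 gives g(y) ≥ g(x) for all y, so x is a global minimum.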
SUBGRADIENTS

• For a convex function g(x), a subgradient at a point x₀ is given by any line l such that l(x₀) = g(x₀) and l(x) ≤ g(x) for all x, i.e., it is a linear underestimator

• If 0 is a subgradient at x₀, then x₀ is a global minimum
SUBGRADIENTS

• If a convex function is differentiable at a point x, then it has a unique subgradient at x, given by the gradient
• If a convex function is not differentiable at a point x, it can have many subgradients
   • E.g., the set of subgradients of the convex function |x| at the point x = 0 is given by the set of slopes [−1, 1]
• The set of all subgradients of f at x forms a convex set, i.e., if g, h are subgradients, then 0.5g + 0.5h is also a subgradient
• Subgradients are only guaranteed to exist for convex functions
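For the |x| example, the full set of subgradients (the subdifferential) at each point is:

\partial |x| =
\begin{cases}
\{-1\}  & x < 0, \\
[-1, 1] & x = 0, \\
\{+1\}  & x > 0.
\end{cases}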
SUBGRADIENT EXAMPLE

• Subgradient of g(x) = max(f₁(x), f₂(x)) for f₁, f₂ convex (differentiable) functions?

• If f₁(x) > f₂(x): ∇f₁(x)
• If f₂(x) > f₁(x): ∇f₂(x)
• If f₁(x) = f₂(x): ∇f₁(x) and ∇f₂(x) are both subgradients (and so are all convex combinations of these)
SUBGRADIENT DESCENT

Subgradient Descent Algorithm:

• Pick an initial point x₀

• Iterate until convergence:

x_{t+1} = x_t − γ_t s_f(x_t)

where γ_t is the t-th step size and s_f(x_t) is a subgradient of f at x_t

Can you use line search here?
SUBGRADIENT DESCENT

Step size: 0.9
DIMINISHING STEP SIZE RULES

• A fixed step size may not result in convergence for non-differentiable functions
• Instead, we can use a diminishing step size
• Required property: the step size must decrease as the number of iterations increases, but not so quickly that the algorithm fails to make progress

• Common diminishing step size rules (a sketch using the second rule follows below):
   • γ_t = a / (b + t) for some a > 0, b ≥ 0
   • γ_t = a / √t for some a > 0
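A minimal sketch of subgradient descent on f(x) = |x| with the rule γ_t = a/√t (a = 1, the starting point, and the iteration count are illustrative):

import numpy as np

def sub_abs(x):
    # A subgradient of |x|: sign(x) if x != 0; 0 is a valid choice at x = 0.
    return np.sign(x)

x = 4.0
for t in range(1, 1001):
    gamma = 1.0 / np.sqrt(t)    # diminishing step size
    x = x - gamma * sub_abs(x)
print(x)   # oscillates around the minimizer 0 with shrinking amplitude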
SUBGRADIENT DESCENT

Diminishing step size
THEORETICAL GUARANTEES
• The hard work in convex optimization is to identify conditions that guarantee quick convergence to within a small error of the optimum

• Let f_best^(t) = min_{t′ ∈ {1, …, t}} f(x_{t′})

• For a fixed step size γ, we are guaranteed that

lim_{t→∞} f_best^(t) − inf_x f(x) ≤ ε(γ)

where ε(γ) is some positive constant that depends on γ

• If f is differentiable, then ε(γ) = 0 whenever γ is small enough (more on rates of convergence later)
GRADIENT DESCENT CODE (PYTHON)

f(x) = x⁴ − 3x³ + 2

f′(x) = 4x³ − 9x²
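The code on the slide is not legible in this copy; below is a minimal sketch consistent with f and f′ above (the starting point x = 6 and step size 0.01 are assumptions):

def df(x):
    return 4 * x**3 - 9 * x**2    # f'(x)

x = 6.0
step = 0.01
for _ in range(10000):
    x = x - step * df(x)
print(x)   # approaches 2.25, where f'(x) = x^2 (4x - 9) = 0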
GRADIENT DESCENT WITH MOMENTUM

v_t = γ v_{t−1} + η ∇_θ J(θ); typically γ is chosen to be around 0.9
MINI-BATCH GRADIENT DESCENT
• For large datasets, it is wasteful to compute the loss over the entire set to perform a single parameter update
   • E.g., ImageNet has 14M images
• GD (a.k.a. vanilla GD) is replaced with mini-batch GD
• Mini-batch gradient descent approach (a schematic loop follows below):
   • Compute the loss ℒ(θ) on a batch of images, update the parameters θ, and repeat until all images are used
   • At the next epoch, shuffle the training data, and repeat the above process
• Mini-batch GD results in much faster training
   • Typical batch size: 32 to 256 images
• It works because the examples in the training data are correlated
   • I.e., the gradient from a mini-batch is a good approximation of the gradient of the entire training set
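A schematic mini-batch loop, as mentioned above (all names, parameter values, and the least-squares gradient are illustrative, not from the slides):

import numpy as np

def minibatch_gd(X, y, theta, grad_fn, lr=0.05, batch_size=32, epochs=50):
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)              # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[b], y[b])
    return theta

# Illustrative use: least-squares gradient computed on each batch.
# Setting batch_size=1 gives the SGD described on the next slide.
grad_fn = lambda th, Xb, yb: 2 * Xb.T @ (Xb @ th - yb) / len(yb)
X = np.random.randn(1000, 5)
y = X @ np.arange(5.0)
print(minibatch_gd(X, y, np.zeros(5), grad_fn))     # approaches [0, 1, 2, 3, 4]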
STOCHASTIC GRADIENT DESCENT
• SGD uses mini-batches that consist of a single input example
   • E.g., a one-image mini-batch
• Although this method is very fast, it may cause significant fluctuations in the loss function
   • Therefore, it is less commonly used, and mini-batch GD is preferred
• In most DL libraries, SGD typically means mini-batch SGD (with an option to add momentum)
PROBLEMS WITH GRADIENT DESCENT
• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points

[Figure: cost ℒ(θ) versus θ, annotated: very slow at the plateau (∇ℒ(θ) ≈ 0); stuck at a saddle point (∇ℒ(θ) = 0); stuck at a local minimum (∇ℒ(θ) = 0)]

Slide credit: Hung-yi Lee – Deep Learning Tutorial
GRADIENT DESCENT WITH MOMENTUM
• Gradient descent with momentum uses the momentum of the gradient for parameter optimization

[Figure: cost ℒ(θ) versus θ; real movement = negative of gradient + momentum. The momentum component can carry the update past points where the gradient = 0]

Slide credit: Hung-yi Lee – Deep Learning Tutorial
GRADIENT DESCENT WITH MOMENTUM
• Parameter update in GD with momentum: θ_new = θ_old − V_new
   • where V_new = β V_old + α ∇ℒ(θ_old)
• Compare to vanilla GD: θ_new = θ_old − α ∇ℒ(θ_old)

• The term V_new is called momentum
   • It accumulates the gradients from the past several steps
   • It is similar to the momentum of a heavy ball rolling down a hill
• The parameter β is referred to as the coefficient of momentum
   • A typical value of β is 0.9
• This method updates the parameters θ in the direction of the weighted average of the past gradients (a sketch follows below)