Gradient Descent
(Title-slide figure: ML taxonomy showing Supervised Learning, Unsupervised Learning, Continuous/Regression, and Dimensionality reduction)
MOTIVATION
• It is good for finding global minima/maxima if the function is convex
• It is good for finding local minima/maxima if the function is not convex
• It is used for optimizing many models in Machine Learning, in conjunction with:
  ✓ Linear Regression
  ✓ Logistic Regression
  ✓ Support Vector Machines
FUNCTION EXAMPLE
QUICKEST EVER REVIEW OF MULTIVARIATE CALCULUS
• Derivative
• Partial Derivative
• Gradient Vector
DERIVATIVE: Slope of the tangent line
f(x) = x²
f′(x) = 2x
(Figure panels: Convex, Concave)
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• To visualize the partial derivative for each of the dimensions x and y,
we can imagine a plane that “cuts” our surface along the two
dimensions and once again we get the slope of the tangent line.
∂f/∂x = −2x,  ∂f/∂y = −2y
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• The two tangent lines that pass through a point define the tangent plane at that point
GRADIENT VECTOR
• The gradient is the vector whose coordinates are the partial derivatives of the function:
f(x, y) = 9 − x² − y²
∂f/∂x = −2x
∂f/∂y = −2y
∇f = (∂f/∂x) i + (∂f/∂y) j = (∂f/∂x, ∂f/∂y) = (−2x, −2y)
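As a quick illustration, a minimal Python sketch (NumPy assumed) that evaluates this gradient and checks it against central finite differences; the evaluation point is an arbitrary choice:

import numpy as np

def f(x, y):
    # f(x, y) = 9 - x^2 - y^2
    return 9 - x**2 - y**2

def grad_f(x, y):
    # Analytic gradient: (df/dx, df/dy) = (-2x, -2y)
    return np.array([-2.0 * x, -2.0 * y])

# Central finite-difference check at an arbitrary point
x0, y0, h = 1.0, -2.0, 1e-6
fd = np.array([(f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h),
               (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)])
print(grad_f(x0, y0))  # [-2.  4.]
print(fd)              # approximately the same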
GRADIENT DESCENT
Gradient Descent Algorithm: x^(t+1) = x^(t) − γ_t ∇f(x^(t))
where γ_t is the t-th step size (sometimes called learning rate)
GRADIENT DESCENT: f(x) = x², f′(x) = 2x
Step size: .8
x^(0) = −4
x^(1) = −4 − .8 ⋅ 2 ⋅ (−4) = 2.4
x^(2) = 2.4 − .8 ⋅ 2 ⋅ 2.4 = −1.44
x^(3) = 0.864
x^(4) = −0.5184
x^(5) = 0.31104
…
x^(30) = −8.84296e−07
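A minimal Python sketch that reproduces these iterates (fixed step size, f(x) = x², f′(x) = 2x):

def gradient_descent_1d(x0, step, n_iters):
    # Gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x
    xs = [x0]
    for _ in range(n_iters):
        xs.append(xs[-1] - step * 2 * xs[-1])   # x^(t+1) = x^(t) - gamma * f'(x^(t))
    return xs

xs = gradient_descent_1d(x0=-4.0, step=0.8, n_iters=30)
print(xs[:4])    # approximately [-4.0, 2.4, -1.44, 0.864]
print(xs[30])    # roughly -8.84296e-07, as on the slide

Each step multiplies the iterate by (1 − 2γ) = −0.6, which explains both the sign flipping and the geometric shrinkage toward 0; changing the step size changes this multiplier to 1 − 2γ, which is what the next slides illustrate for γ = .9 and γ = .2.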
GRADIENT DESCENT
Step size: .9
GRADIENT DESCENT
Step size: .2
LINE SEARCH
• Instead of picking a fixed step size that may or may not actually result
in a decrease in the function value, we can consider minimizing the
function along the direction specified by the gradient to guarantee
that the next iteration decreases the function value
• In other words, choose x^(t+1) ∈ arg min_{γ ≥ 0} f(x^(t) − γ∇f(x^(t)))
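A sketch of exact line search using SciPy's bounded scalar minimizer (scipy.optimize.minimize_scalar); the quadratic objective below is only an illustrative choice:

import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: x[0]**2 + 10 * x[1]**2                 # example objective
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])   # its gradient

x = np.array([-4.0, 2.0])
for _ in range(50):
    g = grad_f(x)
    # Pick the step that minimizes f along the negative-gradient direction
    gamma = minimize_scalar(lambda s: f(x - s * g), bounds=(0.0, 1.0), method='bounded').x
    x = x - gamma * g
print(x)   # approaches the minimizer [0, 0]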
BACKTRACKING LINE SEARCH
• Instead of exact line search, we could simply use a strategy that finds some step size that decreases the function value (one must exist)
• Backtracking line search: start with a large step size, γ, and keep shrinking it until f(x^(t) − γ∇f(x^(t))) < f(x^(t))
BACKTRACKING LINE SEARCH
• Set γ = 1
• While f(x^(t) − γ∇f(x^(t))) > f(x^(t)) − α ⋅ γ ⋅ ‖∇f(x^(t))‖²
    • γ = βγ
• Set x^(t+1) = x^(t) − γ∇f(x^(t))
(Iterations continue until a step size is found that decreases the function “enough”)
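A minimal Python sketch of this loop (NumPy assumed); the quadratic objective is an illustrative choice, and α, β match the hyperparameters named on the next slides:

import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.2, beta=0.9):
    # One gradient step with backtracking (Armijo) line search
    g = grad_f(x)
    gamma = 1.0
    # Shrink gamma until the step decreases the function "enough"
    while f(x - gamma * g) > f(x) - alpha * gamma * np.dot(g, g):
        gamma = beta * gamma
    return x - gamma * g

f = lambda x: x[0]**2 + 10 * x[1]**2                 # example objective
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])

x = np.array([-4.0, 2.0])
for _ in range(50):
    x = backtracking_step(f, grad_f, x)
print(x)   # approaches the minimizer [0, 0]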
BACKTRACKING LINE SEARCH
𝛼 = .2, 𝛽 = .99
BACKTRACKING LINE SEARCH
𝛼 = .1, 𝛽 = .3
GRADIENT DESCENT: CONVEX FUNCTIONS
• For convex functions, local optima are always global optima (this
follows from the definition of convexity)
• If gradient descent converges to a critical point, then the result is a global
minimizer
• Not all convex functions are differentiable; can we still apply gradient descent?
GRADIENTS OF CONVEX FUNCTIONS
SUBGRADIENTS
(Figures: candidate subgradient lines g(x) at a point x₀)
• If 0 is a subgradient at x₀, then x₀ is a global minimum
SUBGRADIENTS
• The set of all subgradients of f at x forms a convex set, i.e., if g, h are subgradients, then .5g + .5h is also a subgradient
• Subgradients are only guaranteed to exist for convex functions
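As a quick check, using the standard definition (g is a subgradient of f at x if f(y) ≥ f(x) + gᵀ(y − x) for all y): if g and h are subgradients of f at x, then for every y
  f(y) ≥ f(x) + gᵀ(y − x)  and  f(y) ≥ f(x) + hᵀ(y − x).
Averaging the two inequalities gives
  f(y) ≥ f(x) + (.5g + .5h)ᵀ(y − x),
so .5g + .5h is again a subgradient; the same argument works for any convex combination λg + (1 − λ)h with λ ∈ [0, 1].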
SUBGRADIENT EXAMPLE
SUBGRADIENT DESCENT
Step Size: .9
DIMINISHING STEP SIZE RULES
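A minimal sketch of subgradient descent with a diminishing step size, using f(x) = |x| as an illustrative non-differentiable convex function (the function, starting point, and step-size rule are assumptions made for this example, not taken from these slides):

def subgrad_abs(x):
    # A valid subgradient of f(x) = |x|: sign(x) away from 0, and 0 at x = 0
    return 0.0 if x == 0 else (1.0 if x > 0 else -1.0)

x = -4.0
f_best = abs(x)
for t in range(1, 1001):
    gamma_t = 0.9 / t**0.5          # diminishing step size: gamma_t -> 0, sum of gamma_t diverges
    x = x - gamma_t * subgrad_abs(x)
    f_best = min(f_best, abs(x))    # track the best iterate, as in f_best below
print(f_best)   # small: the best iterate approaches the minimum even though
                # individual subgradient steps need not decrease f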
THEORETICAL GUARANTEES
• The hard work in convex optimization is to identify conditions that
guarantee quick convergence to within a small error of the optimum
• Let f_best^(t) = min_{t′ ∈ {1,…,t}} f(x^(t′))
GRADIENT DESCENT CODE (PYTHON)
f(x) = x⁴ − 3x³ + 2
f′(x) = 4x³ − 9x²
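A minimal Python sketch of gradient descent for this function (the starting point, step size, and iteration count are illustrative choices):

# f(x) = x**4 - 3*x**3 + 2, with derivative f'(x) = 4*x**3 - 9*x**2
df = lambda x: 4 * x**3 - 9 * x**2

x = 6.0            # starting point
step = 0.01        # fixed step size
for _ in range(10000):
    x = x - step * df(x)
print(x)   # about 2.25: f'(x) = x**2 * (4*x - 9) = 0 at the minimizer x = 9/4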
GRADIENT DESCENT WITH MOMENTUM
(Figure: a loss curve ℒ(θ) plotted against θ, with regions labeled ∇ℒ(θ) ≈ 0 and ∇ℒ(θ) = 0)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
GRADIENT DESCENT WITH MOMENTUM
• Gradient descent with momentum adds a momentum term (an accumulation of past gradients) to the parameter update
Movement = Negative of Gradient + Momentum
(Figure: cost ℒ(θ) vs. θ, with arrows for Negative of Gradient, Momentum, and Real Movement at a point where Gradient = 0)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
GRADIENT DESCENT WITH MOMENTUM
• Parameter update in GD with momentum: θ_new = θ_old − V_new
• Where: V_new = βV_old + α∇ℒ(θ_old)
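A minimal Python sketch of this update rule (NumPy assumed); the quadratic loss and hyperparameter values are illustrative choices:

import numpy as np

def gd_momentum(grad_L, theta0, alpha=0.01, beta=0.9, n_iters=300):
    # V_new = beta * V_old + alpha * grad_L(theta_old); theta_new = theta_old - V_new
    theta = np.asarray(theta0, dtype=float)
    V = np.zeros_like(theta)
    for _ in range(n_iters):
        V = beta * V + alpha * grad_L(theta)   # accumulate momentum
        theta = theta - V                      # parameter update
    return theta

# Example: minimize L(theta) = theta[0]**2 + 10 * theta[1]**2
grad_L = lambda th: np.array([2 * th[0], 20 * th[1]])
print(gd_momentum(grad_L, [-4.0, 2.0]))   # approaches the minimizer [0, 0]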