
CSC411: Optimization for Machine Learning

University of Toronto

September 20–26, 2018

Based on slides by Eleni Triantafillou, Ladislav Rampasek, Jake Snell, Kevin Swersky, Shenlong Wang, and others.

These notes cover gradient descent algorithms, including batch gradient descent and stochastic gradient descent, and discuss learning rates, learning-rate decay, and how the noise in stochastic gradient descent gives optimization a chance to escape bad local minima in non-convex problems.
The Optimization Problem

- Goal: find the parameters that minimize a loss function, θ* = argmin_θ L(θ)
- θ ∈ R^d are the model parameters
- L : R^d → R is the loss (objective) function
- In the maximum-likelihood setting, with training data (x, y) and a model p(y | x, θ), the loss is the negative log-likelihood −log p(y | x, θ)
- At a minimizer θ*, the gradient vanishes: ∂L(θ*)/∂θ = 0

Gradient Descent

- The gradient collects the partial derivatives with respect to each parameter: ∇_θ L = (∂L/∂θ_1, ∂L/∂θ_2, ..., ∂L/∂θ_d)
- Fixed learning rate η: initialize θ_0, then for t = 1, 2, ...:
  δ_t ← −η ∇_θ L(θ_{t−1})
  θ_t ← θ_{t−1} + δ_t
- Line search: at each iteration choose a step size η_t such that L(θ_{t−1} − η_t ∇_θ L(θ_{t−1})) < L(θ_{t−1}), then
  δ_t ← −η_t ∇_θ L(θ_{t−1})
  θ_t ← θ_{t−1} + δ_t

Momentum

- Choose a momentum coefficient α ∈ [0, 1) and a learning rate η
- Initialize θ_0 and δ_0, then for t = 1, 2, ... (see the code sketch after this list):
  δ_t ← −η ∇_θ L(θ_{t−1}) + α δ_{t−1}
  θ_t ← θ_{t−1} + δ_t

Convergence

- Iterate the update δ_t ← −η ∇_θ L(θ_{t−1}), θ_t ← θ_{t−1} + δ_t until a stopping criterion is met, e.g.
  |L(θ_{t+1}) − L(θ_t)| < ϵ
  ∥∇_θ L∥ < ϵ
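As a concrete illustration of the momentum update above, here is a minimal NumPy sketch on a toy quadratic loss; the function names and all hyperparameter values are illustrative choices, not something from the slides.

import numpy as np

def quadratic_loss(theta):
    """Toy loss L(theta) = 0.5 * ||theta||^2, returned with its gradient (which is just theta)."""
    return 0.5 * np.sum(theta ** 2), theta

def gd_momentum(loss_and_grad, theta0, lr=0.1, alpha=0.9, num_steps=200):
    """Gradient descent with momentum:
        delta_t <- -lr * grad L(theta_{t-1}) + alpha * delta_{t-1}
        theta_t <- theta_{t-1} + delta_t
    """
    theta = np.array(theta0, dtype=float)
    delta = np.zeros_like(theta)
    for _ in range(num_steps):
        _, grad = loss_and_grad(theta)
        delta = -lr * grad + alpha * delta
        theta = theta + delta
    return theta

print(gd_momentum(quadratic_loss, [5.0, -3.0]))  # close to the minimizer [0, 0]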
Learning Rate (Step Size)

In gradient descent, the learning rate α is a hyperparameter we need to tune. Here are some things that can go wrong:

- α too small: slow progress
- α too large: oscillations
- α much too large: instability

Good values are typically between 0.001 and 0.1. You should do a grid search if you want good performance (i.e. try 0.1, 0.03, 0.01, ...).
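As a minimal sketch of that grid search: run gradient descent with each candidate learning rate and keep whichever gives the lowest final loss. The toy quadratic loss, starting point, and candidate grid below are illustrative assumptions, not part of the slides.

import numpy as np

def loss_and_grad(theta):
    """Toy quadratic loss standing in for the real training loss."""
    return 0.5 * np.sum(theta ** 2), theta

def final_loss(lr, theta0, num_steps=50):
    """Run plain gradient descent with a fixed learning rate and report the final loss."""
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        _, grad = loss_and_grad(theta)
        theta = theta - lr * grad
    return loss_and_grad(theta)[0]

# Roughly logarithmically spaced candidates, as the slide suggests.
candidates = [0.1, 0.03, 0.01, 0.003, 0.001]
best_lr = min(candidates, key=lambda lr: final_loss(lr, theta0=[5.0, -3.0]))
print("best learning rate on this toy problem:", best_lr)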


Gradient Checking with Finite Differences

The gradient can be approximated numerically by perturbing one parameter at a time with a central finite difference:

∂L/∂θ_i ≈ [L(θ_1, ..., θ_i + ϵ, ..., θ_d) − L(θ_1, ..., θ_i − ϵ, ..., θ_d)] / (2ϵ)

This is far too expensive to use for training, but it is a useful check on an analytic gradient implementation.
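Below is a minimal sketch of that check; the toy loss and its hand-coded gradient are hypothetical stand-ins, not anything from the slides.

import numpy as np

def loss(theta):
    """Toy loss whose analytic gradient we want to verify."""
    return np.sum(theta ** 3)

def analytic_grad(theta):
    """Hand-coded gradient of the toy loss: d/dtheta_i of sum(theta^3) is 3 * theta_i^2."""
    return 3 * theta ** 2

def numerical_grad(f, theta, eps=1e-5):
    """Central finite differences: perturb one coordinate at a time."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

theta = np.array([1.0, -2.0, 0.5])
print(np.allclose(analytic_grad(theta), numerical_grad(loss, theta), atol=1e-6))  # True when the gradient is correct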
Stochastic Gradient Descent

Batch gradient descent moves directly downhill. SGD takes steps in a noisy direction, but moves downhill on average.

[Figure: parameter-space trajectories of batch gradient descent vs. stochastic gradient descent]

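To make the comparison concrete, here is a minimal mini-batch SGD sketch on a toy linear-regression problem; the data, model, batch size, and learning rate are illustrative assumptions rather than anything from the slides. Each step follows a noisy gradient estimate, but the average of those estimates is the full-batch gradient.

import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = X w_true + noise.
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

def minibatch_grad(w, batch_size=32):
    """Gradient of the half mean-squared error on a randomly sampled mini-batch."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(5)
lr = 0.05
for step in range(2000):
    w -= lr * minibatch_grad(w)  # noisy step, downhill on average

print(np.max(np.abs(w - w_true)))  # small: SGD approximately recovers w_true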


SGD Learning Rate

In stochastic training, the learning rate also influences the fluctuations due to the stochasticity of the gradients.

Typical strategy:
- Use a large learning rate early in training so you can get close to the optimum
- Gradually decay the learning rate to reduce the fluctuations (a simple schedule is sketched below)

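One common way to implement this strategy (an illustrative choice, not something prescribed by the slides) is an inverse-time decay schedule; the base rate and decay constant below are arbitrary.

def decayed_lr(step, base_lr=0.1, decay=0.001):
    """Inverse-time decay: close to base_lr early on, shrinking roughly like 1/step later."""
    return base_lr / (1.0 + decay * step)

# Large steps early in training, progressively smaller ones to damp the fluctuations.
for step in [0, 100, 1000, 10000, 100000]:
    print(step, decayed_lr(step))

Step decay, which cuts the rate by a fixed factor every so many epochs, is an equally common alternative.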


SGD Learning Rate

Warning: by reducing the learning rate, you reduce the fluctuations, which can appear to make the loss drop suddenly. But this can come at the expense of long-run performance.



SGD and Non-convex optimization

[Figure: stochastic gradient descent updates on a non-convex loss surface, with a local minimum and the global minimum labeled]

Stochastic methods have a chance of escaping from bad minima. Gradient descent with a small step size converges to the first minimum it finds.
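The deterministic half of this claim is easy to see in one dimension. The sketch below (a toy example, not from the slides, with an arbitrary double-well loss) runs plain gradient descent from two starting points: each run simply rolls into the nearest basin.

import numpy as np

def loss(x):
    """Toy double-well: shallow local minimum near x = 1.35, deeper global minimum near x = -1.47."""
    return x**4 - 4 * x**2 + x

def grad(x):
    return 4 * x**3 - 8 * x + 1

def gradient_descent(x0, lr=0.01, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Plain gradient descent gets stuck in whichever basin it starts in.
for x0 in [2.0, -2.0]:
    x_final = gradient_descent(x0)
    print(f"start {x0:+.1f} -> end {x_final:+.2f}, loss {loss(x_final):+.2f}")

Replacing the exact gradient with a noisy estimate, as in SGD, is what occasionally lets the iterate cross the barrier between the two wells.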
