CSC411: Optimization For Machine Learning: September 20-26, 2018
University of Toronto
Based on slides by Eleni Triantafillou, Ladislav Rampasek, Jake Snell, Kevin Swersky, Shenlong Wang, and others.
Optimization

- At the core of machine learning is an optimization problem: find the parameter setting that minimizes a loss function L(θ),

      θ* = argmin_θ L(θ),   where θ are the model parameters.

- The parameters form a real vector: θ ∈ R^d.
- The loss is a scalar-valued function: L : R^d → R.
- Maximizing L(θ) is equivalent to minimizing −L(θ), so it suffices to consider minimization.
- The goal of training is to find the optimal parameters θ*.
Optimization in machine learning

- We are given training data consisting of input-target pairs (x, y).
- A probabilistic model assigns each pair a likelihood p(y | x, θ).
- A natural loss is the negative log-likelihood, L(θ) = − log p(y | x, θ), summed over the training examples.
- Training the model means finding the θ that minimizes this loss.
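To make this concrete, here is a minimal sketch (not from the original slides) of the negative log-likelihood of a logistic regression model on a tiny synthetic batch; the names X, y, and theta are illustrative placeholders.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y):
    # Negative log-likelihood -sum_i log p(y_i | x_i, theta) for logistic regression
    p = sigmoid(X @ theta)                      # p(y = 1 | x, theta) for each example
    p = np.clip(p, 1e-12, 1 - 1e-12)            # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny synthetic batch (illustrative data only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5]])
y = np.array([1.0, 1.0, 0.0])
theta = np.zeros(2)
print(nll(theta, X, y))                         # loss at the initial parameters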
Stationary points

- At a minimum θ*, the derivative of the loss vanishes:

      ∂L(θ*)/∂θ = 0

- A point θ satisfying this condition is called a stationary point; it may be a local minimum, a global minimum, a maximum, or a saddle point.
The gradient

- For a parameter vector θ = (θ_1, ..., θ_d), the gradient collects the partial derivatives of the loss with respect to each parameter:

      ∇_θ L = (∂L/∂θ_1, ∂L/∂θ_2, ..., ∂L/∂θ_d)
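As a quick illustrative example (not from the slides): for d = 2 and L(θ) = θ_1² + 3 θ_1 θ_2,

      ∇_θ L = (∂L/∂θ_1, ∂L/∂θ_2) = (2θ_1 + 3θ_2, 3θ_1).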
Gradient descent with a fixed learning rate η

- Initialize θ_0.
- For t = 1 : T:
      δ_t ← −η ∇_θ L(θ_{t−1})
      θ_t ← θ_{t−1} + δ_t
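A minimal sketch of this loop in NumPy, assuming a simple quadratic loss chosen purely for illustration (the names grad_L, eta, and T are not from the slides):

import numpy as np

# Illustrative quadratic loss L(theta) = ||theta - c||^2 and its gradient
c = np.array([3.0, -2.0])
grad_L = lambda theta: 2.0 * (theta - c)

def gradient_descent(theta0, eta=0.1, T=100):
    # Gradient descent with a fixed learning rate eta
    theta = theta0.copy()
    for t in range(T):
        delta = -eta * grad_L(theta)            # delta_t = -eta * grad L(theta_{t-1})
        theta = theta + delta                   # theta_t = theta_{t-1} + delta_t
    return theta

print(gradient_descent(np.zeros(2)))            # approaches the minimizer c = [3, -2]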
Gradient descent with an adaptive step size

- Initialize θ_0.
- For t = 1 : T:
      choose η_t such that L(θ_{t−1} − η_t ∇_θ L(θ_{t−1})) < L(θ_{t−1})
      δ_t ← −η_t ∇_θ L(θ_{t−1})
      θ_t ← θ_{t−1} + δ_t
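One common way to realize the "choose η_t so the loss decreases" condition is backtracking: start from a large step and halve it until the loss goes down. The sketch below assumes that strategy and the same illustrative quadratic loss; it is one possible choice, not the specific line search used in the course.

import numpy as np

# Illustrative quadratic loss (not from the slides)
c = np.array([3.0, -2.0])
L = lambda theta: np.sum((theta - c) ** 2)
grad_L = lambda theta: 2.0 * (theta - c)

def gd_backtracking(theta0, eta0=1.0, T=50):
    # At every step, shrink eta_t until L(theta - eta_t * grad) < L(theta)
    theta = theta0.copy()
    for t in range(T):
        g = grad_L(theta)
        eta = eta0
        while L(theta - eta * g) >= L(theta) and eta > 1e-12:
            eta *= 0.5                          # halve the step until the loss decreases
        theta = theta - eta * g                 # theta_t = theta_{t-1} - eta_t * grad
    return theta

print(gd_backtracking(np.zeros(2)))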
Gradient descent with momentum, α ∈ [0, 1)

- Initialize θ_0.
- Initialize δ_0 = 0.
- For t = 1 : T:
      δ_t ← −η ∇_θ L(θ_{t−1}) + α δ_{t−1}
      θ_t ← θ_{t−1} + δ_t
- The momentum parameter α controls how much of the previous update δ_{t−1} carries over into the current step.
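A sketch of the momentum update on the same kind of toy quadratic; the constants eta = 0.1 and alpha = 0.9 are illustrative assumptions, not values from the slides.

import numpy as np

# Illustrative quadratic loss (not from the slides)
c = np.array([3.0, -2.0])
grad_L = lambda theta: 2.0 * (theta - c)

def gd_momentum(theta0, eta=0.1, alpha=0.9, T=300):
    # delta_t = -eta * grad L(theta_{t-1}) + alpha * delta_{t-1}
    theta = theta0.copy()
    delta = np.zeros_like(theta)                # delta_0 = 0
    for t in range(T):
        delta = -eta * grad_L(theta) + alpha * delta
        theta = theta + delta
    return theta

print(gd_momentum(np.zeros(2)))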
Stochastic gradient descent with learning rate η

- Initialize θ_0.
- For t = 1 : T, estimate the gradient on a randomly sampled training example or mini-batch rather than the full training set:
      δ_t ← −η ∇_θ L(θ_{t−1})
      θ_t ← θ_{t−1} + δ_t
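A sketch of SGD, assuming a least-squares loss on a small synthetic linear-regression dataset so the mini-batch gradient is easy to write down; the dataset, batch size, and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (illustrative only): y = X @ theta_true + noise
theta_true = np.array([2.0, -1.0])
X = rng.normal(size=(100, 2))
y = X @ theta_true + 0.1 * rng.normal(size=100)

def sgd(theta0, eta=0.05, batch_size=10, T=500):
    # Each step estimates the gradient on a randomly sampled mini-batch
    theta = theta0.copy()
    for t in range(T):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size   # gradient of the mini-batch MSE
        theta = theta - eta * grad                           # theta_t = theta_{t-1} - eta * grad
    return theta

print(sgd(np.zeros(2)))                                      # should end up near theta_true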
Stopping criteria

- When should we stop iterating? Two common convergence checks:
- The loss stops changing much between iterations: |L(θ_{t+1}) − L(θ_t)| < ϵ
- The gradient becomes small: ∥∇_θ L∥ < ϵ
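Both checks are easy to fold into the gradient-descent loop. The sketch below (toy quadratic, assumed tolerance eps = 1e-8) stops as soon as either criterion fires.

import numpy as np

# Illustrative quadratic loss and gradient (not from the slides)
c = np.array([3.0, -2.0])
L = lambda theta: np.sum((theta - c) ** 2)
grad_L = lambda theta: 2.0 * (theta - c)

def gd_until_converged(theta0, eta=0.1, eps=1e-8, max_iters=10000):
    theta = theta0.copy()
    prev_loss = L(theta)
    for t in range(max_iters):
        g = grad_L(theta)
        if np.linalg.norm(g) < eps:             # stop when ||grad L|| < eps
            break
        theta = theta - eta * g
        loss = L(theta)
        if abs(loss - prev_loss) < eps:         # stop when |L(theta_{t+1}) - L(theta_t)| < eps
            break
        prev_loss = loss
    return theta, t

theta_hat, iters = gd_until_converged(np.zeros(2))
print(theta_hat, iters)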
- The partial derivatives of L can also be approximated numerically with finite differences, which is useful for checking a gradient implementation:

      ∂L/∂θ_i ≈ [ L(θ_1, ..., θ_i + ϵ, ..., θ_d) − L(θ_1, ..., θ_i − ϵ, ..., θ_d) ] / (2ϵ)
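This approximation is handy for verifying an analytic gradient. The sketch below perturbs one coordinate at a time and compares the result with the analytic gradient of an assumed toy loss.

import numpy as np

def numerical_gradient(L, theta, eps=1e-6):
    # Centered finite differences: perturb one coordinate at a time
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (L(theta + e) - L(theta - e)) / (2 * eps)
    return grad

# Compare with the analytic gradient of an illustrative loss
c = np.array([3.0, -2.0])
L = lambda theta: np.sum((theta - c) ** 2)
grad_L = lambda theta: 2.0 * (theta - c)

theta = np.array([1.0, 1.0])
print(numerical_gradient(L, theta))             # should closely match the line below
print(grad_L(theta))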
Learning Rate (Step Size)

- The learning rate η controls how large a step each update takes: if it is too small, progress is slow; if it is too large, the updates can overshoot the minimum and oscillate or diverge.
Stochastic Gradient Descent
Typical strategy:
- Use a large learning rate early in training so you can get close to the optimum.
- Gradually decay the learning rate to reduce the fluctuations.
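One simple way to implement this strategy is an exponential decay schedule; the sketch below uses assumed constants (initial rate 0.4, decay factor 0.99) purely for illustration and plugs the schedule into plain gradient descent on a toy quadratic.

import numpy as np

def learning_rate(t, eta0=0.4, decay=0.99):
    # Exponentially decaying schedule: large steps early on, smaller steps later
    return eta0 * (decay ** t)

# Plug the schedule into plain gradient descent on an illustrative quadratic
c = np.array([3.0, -2.0])
grad_L = lambda theta: 2.0 * (theta - c)

theta = np.zeros(2)
for t in range(200):
    theta = theta - learning_rate(t) * grad_L(theta)
print(theta)                                    # approaches the minimizer c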
[Figure: a loss surface with a local minimum and the global minimum labelled.]