Introduction to Machine Learning
Lecture 13 – Differentiation
Henry Chai & Matt Gormley & Hoda Heidari
2/26/24
Recall: Neural Networks (Matrix Form)

Output:          𝑦 = σ(𝜷ᵀ 𝒛^(2) + 𝛽₀),              𝜷 ∈ ℝ^(D^(2)), 𝛽₀ ∈ ℝ
Hidden layer 2:  𝒛^(2) = σ((𝜶^(2))ᵀ 𝒛^(1) + 𝒃^(2)),  𝜶^(2) ∈ ℝ^(D^(1)×D^(2)), 𝒃^(2) ∈ ℝ^(D^(2))
Hidden layer 1:  𝒛^(1) = σ((𝜶^(1))ᵀ 𝒙 + 𝒃^(1)),      𝜶^(1) ∈ ℝ^(M×D^(1)), 𝒃^(1) ∈ ℝ^(D^(1))
Inputs:          𝑥₁, 𝑥₂, …, 𝑥_M
Recall: Neural Networks (Matrix Form)

Equivalently, each bias can be folded into its weight matrix by prepending a 1 to the layer's input:

Output:          𝑦 = σ(𝜷̃ᵀ [1; 𝒛^(2)]),             𝜷̃ = [𝛽₀; 𝜷] ∈ ℝ^(D^(2)+1)
Hidden layer 2:  𝒛^(2) = σ((𝜶̃^(2))ᵀ [1; 𝒛^(1)]),    𝜶̃^(2) = [(𝒃^(2))ᵀ; 𝜶^(2)] ∈ ℝ^((D^(1)+1)×D^(2))
Hidden layer 1:  𝒛^(1) = σ((𝜶̃^(1))ᵀ [1; 𝒙]),        𝜶̃^(1) = [(𝒃^(1))ᵀ; 𝜶^(1)] ∈ ℝ^((M+1)×D^(1))
Inputs:          𝑥₁, 𝑥₂, …, 𝑥_M
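Why the folded form matches the previous slide: prepending a 1 to a layer's input pairs it with the bias row of the augmented weight matrix, so for each layer 𝑙

    \tilde{\boldsymbol{\alpha}}^{(l)\top}
    \begin{bmatrix} 1 \\ \boldsymbol{z}^{(l-1)} \end{bmatrix}
    =
    \begin{bmatrix} \boldsymbol{b}^{(l)} & \boldsymbol{\alpha}^{(l)\top} \end{bmatrix}
    \begin{bmatrix} 1 \\ \boldsymbol{z}^{(l-1)} \end{bmatrix}
    = \boldsymbol{\alpha}^{(l)\top} \boldsymbol{z}^{(l-1)} + \boldsymbol{b}^{(l)},

and similarly for the output layer with 𝜷̃.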
Forward Propagation for Making Predictions

Inputs: weights 𝜶^(1), …, 𝜶^(L), 𝜷 and a query data point 𝒙′
Initialize 𝒛^(0) = 𝒙′
For 𝑙 = 1, …, 𝐿
    𝒂^(l) = (𝜶^(l))ᵀ 𝒛^(l−1)
    𝒛^(l) = σ(𝒂^(l))
ŷ = σ(𝜷ᵀ 𝒛^(L))
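A minimal NumPy sketch of this forward pass (illustrative names, not the course's reference code; it follows the slide's pseudocode, so biases are assumed to be folded into the weight matrices or omitted):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(alphas, beta, x):
    """Forward propagation: alphas[l-1] is the weight matrix alpha^(l)
    with shape (D^(l-1), D^(l)); beta has shape (D^(L),)."""
    z = x                         # z^(0) = x'
    for alpha in alphas:          # for l = 1, ..., L
        a = alpha.T @ z           # a^(l) = (alpha^(l))^T z^(l-1)
        z = sigmoid(a)            # z^(l) = sigma(a^(l))
    return sigmoid(beta @ z)      # y_hat = sigma(beta^T z^(L))

# Example: 3 inputs -> 4 hidden -> 2 hidden -> 1 output, random weights
rng = np.random.default_rng(0)
alphas = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
beta = rng.normal(size=2)
print(forward(alphas, beta, np.array([1.0, 2.0, 3.0])))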
Input: 𝒟 = {(𝒙^(i), 𝑦^(i))}_{i=1}^N, learning rate γ
Let 𝚯 = {𝜶^(1), …, 𝜶^(L), 𝜷} be the parameters of our neural network.
Softmax: 𝑧_j = σ(𝑎_j), where 𝑎_j = Σ_i 𝛼_{j,i} 𝑥_i
Input: 𝒟 = {(𝒙^(i), 𝑦^(i))}_{i=1}^N, learning rate γ
Initialize all weights 𝜶^(1), …, 𝜶^(L), 𝜷 (???)
While TERMINATION CRITERION is not satisfied (???)
    For 𝑖 ∈ shuffle({1, …, 𝑁})
        Compute 𝑔_𝜷 = ∇_𝜷 𝐽^(i)(𝜶^(1), …, 𝜶^(L), 𝜷)
        For 𝑙 = 1, …, 𝐿
            Compute 𝑔_{𝜶^(l)} = ∇_{𝜶^(l)} 𝐽^(i)(𝜶^(1), …, 𝜶^(L), 𝜷)
        Update 𝜷 = 𝜷 − γ 𝑔_𝜷
        For 𝑙 = 1, …, 𝐿
            Update 𝜶^(l) = 𝜶^(l) − γ 𝑔_{𝜶^(l)}

Two questions:
1. What is this loss function 𝐽^(i)?
2. How on earth do we take these gradients?
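A schematic Python version of this loop; grad_J_i is a placeholder (my own name) for whatever routine returns the per-example gradients — computing it is exactly question 2:

import numpy as np

def sgd(alphas, beta, data, grad_J_i, gamma=0.1, num_epochs=10):
    """SGD on per-example losses J^(i). `grad_J_i(alphas, beta, x, y)` is
    assumed to return (gradients w.r.t. each alpha^(l), gradient w.r.t. beta);
    it stands in for backpropagation. gamma and num_epochs are illustrative."""
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):                # a simple termination criterion
        for i in rng.permutation(len(data)):   # i in shuffle({1, ..., N})
            x, y = data[i]
            g_alphas, g_beta = grad_J_i(alphas, beta, x, y)
            beta = beta - gamma * g_beta                                    # update beta
            alphas = [A - gamma * g for A, g in zip(alphas, g_alphas)]      # update each alpha^(l)
    return alphas, beta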
Matrix Calculus: Types of Derivatives

(Table: a derivative is classified by the type of its numerator — scalar, vector, or matrix — and the type of its denominator — scalar, vector, or matrix.)
Matrix Calculus: Denominator Layout

In denominator layout, derivatives of a scalar always have the same shape as the denominator, whether the denominator is a scalar, a vector, or a matrix.
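Concretely, for a scalar 𝑦 this convention gives the following shapes (a summary consistent with the rule above):

    \frac{\partial y}{\partial x} \in \mathbb{R}, \qquad
    \frac{\partial y}{\partial \boldsymbol{x}} \in \mathbb{R}^{D} \text{ for } \boldsymbol{x} \in \mathbb{R}^{D}, \qquad
    \frac{\partial y}{\partial \boldsymbol{X}} \in \mathbb{R}^{M \times N} \text{ for } \boldsymbol{X} \in \mathbb{R}^{M \times N}.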
Approach 1: Finite Difference Method

We want ε to be small to get a good approximation, but we run into floating point issues when ε is too small.
Approach 1: Finite Difference Method (example)
>>> x = 2
>>> z = 3
Approach 2: Symbolic Differentiation
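To illustrate what a symbolic differentiator produces, here is a sketch using SymPy on the same example function; SymPy is not part of the lecture, it simply stands in for "a system with systematic knowledge of derivatives":

import sympy as sp

x, z = sp.symbols('x z', positive=True)
f = sp.exp(x * z) + (x * z) / sp.log(x) + sp.sin(sp.log(x)) / (x * z)

df_dx = sp.diff(f, x)   # symbolic expression for the partial derivative w.r.t. x
df_dz = sp.diff(f, z)   # symbolic expression for the partial derivative w.r.t. z

print(df_dx)
print(float(df_dx.subs({x: 2, z: 3})))  # evaluate at the example point x = 2, z = 3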
Chain rule: if 𝑦 = 𝑓(𝒛) and 𝒛 = 𝑔(𝑥), then

    ∂𝑦/∂𝑥 = Σ_{j=1}^{J} (∂𝑦/∂𝑧_j)(∂𝑧_j/∂𝑥)

(Diagram: 𝑥 feeds into 𝑧₁, …, 𝑧_J, which all feed into 𝑦.)
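A small worked instance of this rule (my own example): take 𝑧₁ = sin 𝑥, 𝑧₂ = 𝑥², and 𝑦 = 𝑧₁𝑧₂. Then

    \frac{\partial y}{\partial x}
    = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x}
    + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x}
    = z_2 \cos x + z_1 \cdot 2x
    = x^2 \cos x + 2x \sin x,

which matches differentiating 𝑦 = 𝑥² sin 𝑥 directly.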
If 𝑦 = 𝑓(𝒛), 𝒛 = 𝑔(𝒘), and 𝒘 = ℎ(𝑥), does the equation

    ∂𝑦/∂𝑥 = Σ_{j=1}^{J} (∂𝑦/∂𝑧_j)(∂𝑧_j/∂𝑥)

still hold?

A. Yes
B. No
C. Only on Fridays (TOXIC)
Approach 3: Automatic Differentiation (reverse mode)

Given
    𝑦 = 𝑓(𝑥, 𝑧) = e^(𝑥𝑧) + 𝑥𝑧/ln(𝑥) + sin(ln(𝑥))/(𝑥𝑧)

(Computation graph, evaluated at 𝑥 = 2, 𝑧 = 3:
    𝑎 = 𝑥𝑧,  𝑏 = ln(𝑥),  𝑐 = sin(𝑏),  𝑑 = exp(𝑎),  𝑒 = 𝑎/𝑏,  𝑓 = 𝑐/𝑎,  𝑦 = 𝑑 + 𝑒 + 𝑓)

Example courtesy of Matt Gormley
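A hand-coded reverse-mode pass over this graph, to make the bookkeeping concrete (my sketch of the idea, not the lecture's implementation): the forward pass caches the intermediate values, and the backward pass accumulates ∂y/∂(node) from the output back to the inputs.

import math

def f_and_grad(x, z):
    # Forward pass: evaluate the computation graph and cache intermediates.
    a = x * z            # a = x * z
    b = math.log(x)      # b = ln(x)
    c = math.sin(b)      # c = sin(b)
    d = math.exp(a)      # d = exp(a)
    e = a / b            # e = a / b
    f = c / a            # f = c / a
    y = d + e + f        # y = d + e + f

    # Backward pass: accumulate dy/d(node) from the output back to the inputs.
    dy_dd = 1.0
    dy_de = 1.0
    dy_df = 1.0
    dy_dc = dy_df / a                                   # f = c / a
    dy_da = dy_dd * d + dy_de / b - dy_df * c / a**2    # d = exp(a), e = a/b, f = c/a
    dy_db = -dy_de * a / b**2 + dy_dc * math.cos(b)     # e = a/b, c = sin(b)
    dy_dx = dy_da * z + dy_db / x                       # a = x*z, b = ln(x)
    dy_dz = dy_da * x                                   # a = x*z
    return y, dy_dx, dy_dz

print(f_and_grad(2, 3))  # value and gradient at the example point x = 2, z = 3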
Three Approaches to Differentiation

Given 𝑓: ℝ^D → ℝ, compute ∇_𝒙 𝑓(𝒙) = ∂𝑓(𝒙)/∂𝒙

1. Finite difference method
   - Requires the ability to call 𝑓(𝒙)
   - Great for checking accuracy of implementations of more complex differentiation methods
   - Computationally expensive for high-dimensional inputs

2. Symbolic differentiation
   - Requires systematic knowledge of derivatives
   - Can be computationally expensive if poorly implemented

3. Automatic differentiation (reverse mode)
   - Requires systematic knowledge of derivatives and an algorithm for computing 𝑓(𝒙)
   - Computational cost of computing ∂𝑓(𝒙)/∂𝒙 is proportional to the cost of computing 𝑓(𝒙)
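Since the finite difference method is recommended mainly as a check on other implementations, a gradient checker could look like this (a sketch; grad_f stands for whatever analytic or automatic gradient is being tested):

import numpy as np

def finite_difference_grad(f, x, eps=1e-5):
    """Approximate the gradient of f: R^D -> R by centered differences."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0
        grad[i] = (f(x + eps * e_i) - f(x - eps * e_i)) / (2 * eps)
    return grad

def check_gradient(f, grad_f, x, tol=1e-4):
    """Compare an implemented gradient grad_f against the finite-difference estimate."""
    approx = finite_difference_grad(f, x)
    exact = grad_f(x)
    return np.max(np.abs(approx - exact)) < tol

# Example: f(x) = ||x||^2 has gradient 2x.
x = np.array([1.0, -2.0, 0.5])
print(check_gradient(lambda x: float(x @ x), lambda x: 2 * x, x))  # True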
Automatic Differentiation

Given 𝑓: ℝ^D → ℝ^C, compute ∇_𝒙 𝑓(𝒙) = ∂𝑓(𝒙)/∂𝒙

3. Automatic differentiation (reverse mode)
   - Requires systematic knowledge of derivatives and an algorithm for computing 𝑓(𝒙)
   - Computational cost of computing ∇_𝒙 𝑓(𝒙)_c = ∂𝑓(𝒙)_c/∂𝒙 is proportional to the cost of computing 𝑓(𝒙)
   - Great for high-dimensional inputs and low-dimensional outputs (𝐷 ≫ 𝐶)

4. Automatic differentiation (forward mode)
   - Requires systematic knowledge of derivatives and an algorithm for computing 𝑓(𝒙)
   - Computational cost of computing ∂𝑓(𝒙)/∂𝑥_d is proportional to the cost of computing 𝑓(𝒙)
   - Great for low-dimensional inputs and high-dimensional outputs (𝐷 ≪ 𝐶)
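To make the comparison concrete (my summary, consistent with the slide): each pass of either mode computes a different slice of the full set of partial derivatives,

    \underbrace{\frac{\partial f(\boldsymbol{x})_c}{\partial \boldsymbol{x}}}_{\text{reverse mode: one output } c \text{ per pass}}
    \qquad \text{vs.} \qquad
    \underbrace{\frac{\partial f(\boldsymbol{x})}{\partial x_d}}_{\text{forward mode: one input } x_d \text{ per pass}},

so filling in all C·D partial derivatives takes roughly C reverse passes or D forward passes, each costing about as much as one evaluation of 𝑓; hence reverse mode wins when 𝐷 ≫ 𝐶 and forward mode when 𝐷 ≪ 𝐶.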
The diagram represents an algorithm
The diagram represents a neural network
Objectives

- Instantiate an optimization method (e.g. SGD) and a regularizer (e.g. L2) when the parameters of a model consist of several matrices corresponding to different layers of a neural network
- Use the finite difference method to evaluate the gradient of a function
- Identify when the gradient of a function can be computed at all and when it can be computed efficiently
- Employ basic matrix calculus to compute vector/matrix/tensor derivatives