
10-301/601: Introduction to Machine Learning
Lecture 13 – Differentiation
Henry Chai & Matt Gormley & Hoda Heidari
2/26/24
Recall: Neural Networks (Matrix Form)
– Output layer: $y = \sigma(\boldsymbol{\beta}^\top \mathbf{z}^{(2)} + \beta_0)$, with $\boldsymbol{\beta} \in \mathbb{R}^{D_2}$ and $\beta_0 \in \mathbb{R}$
– Second hidden layer: $\mathbf{z}^{(2)} = \sigma\big((\boldsymbol{\alpha}^{(2)})^\top \mathbf{z}^{(1)} + \mathbf{b}^{(2)}\big)$, with $\boldsymbol{\alpha}^{(2)} \in \mathbb{R}^{D_1 \times D_2}$ and $\mathbf{b}^{(2)} \in \mathbb{R}^{D_2}$
– First hidden layer: $\mathbf{z}^{(1)} = \sigma\big((\boldsymbol{\alpha}^{(1)})^\top \mathbf{x} + \mathbf{b}^{(1)}\big)$, with $\boldsymbol{\alpha}^{(1)} \in \mathbb{R}^{M \times D_1}$ and $\mathbf{b}^{(1)} \in \mathbb{R}^{D_1}$
(Diagram: inputs $x_1, \dots, x_M$ feed hidden units $z_1^{(1)}, \dots, z_{D_1}^{(1)}$, which feed $z_1^{(2)}, \dots, z_{D_2}^{(2)}$, which feed the output $y$.)
Recall: Neural Networks (Matrix Form)
Folding the bias terms into the weights:
– Output layer: $y = \sigma\big(\tilde{\boldsymbol{\beta}}^\top [1; \mathbf{z}^{(2)}]\big)$, where $\tilde{\boldsymbol{\beta}} = [\beta_0; \boldsymbol{\beta}] \in \mathbb{R}^{D_2 + 1}$
– Second hidden layer: $\mathbf{z}^{(2)} = \sigma\big((\tilde{\boldsymbol{\alpha}}^{(2)})^\top [1; \mathbf{z}^{(1)}]\big)$, where $\tilde{\boldsymbol{\alpha}}^{(2)} = [(\mathbf{b}^{(2)})^\top; \boldsymbol{\alpha}^{(2)}] \in \mathbb{R}^{(D_1 + 1) \times D_2}$
– First hidden layer: $\mathbf{z}^{(1)} = \sigma\big((\tilde{\boldsymbol{\alpha}}^{(1)})^\top [1; \mathbf{x}]\big)$, where $\tilde{\boldsymbol{\alpha}}^{(1)} = [(\mathbf{b}^{(1)})^\top; \boldsymbol{\alpha}^{(1)}] \in \mathbb{R}^{(M + 1) \times D_1}$
(Diagram: same network as the previous slide, with the biases absorbed into the weight matrices.)
Forward Propagation for Making Predictions
– Inputs: weights $\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}$ and a query data point $\mathbf{x}'$
– Initialize $\mathbf{z}^{(0)} = \mathbf{x}'$
– For $l = 1, \dots, L$:
  – $\mathbf{a}^{(l)} = (\boldsymbol{\alpha}^{(l)})^\top \mathbf{z}^{(l-1)}$
  – $\mathbf{z}^{(l)} = \sigma(\mathbf{a}^{(l)})$
– $\hat{y} = \sigma(\boldsymbol{\beta}^\top \mathbf{z}^{(L)})$
– Output: the prediction $\hat{y}$
(A Python sketch of this procedure appears below.)
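A minimal NumPy sketch of this forward pass, assuming sigmoid activations and, as on the previous slide, bias terms folded into the weights (so a leading 1 is prepended to each layer's input). The function and variable names here are illustrative, not from the lecture.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(alphas, beta, x):
    """Forward propagation with biases folded into the weights.
    alphas: list of L weight matrices; alphas[l] has shape (D_{l-1} + 1, D_l)
    beta:   output weight vector of shape (D_L + 1,)
    x:      query data point of shape (D_0,)
    """
    z = x
    for alpha in alphas:
        z_tilde = np.concatenate(([1.0], z))   # prepend 1 for the bias term
        a = alpha.T @ z_tilde                  # a^(l) = (alpha^(l))^T [1; z^(l-1)]
        z = sigmoid(a)                         # z^(l) = sigma(a^(l))
    return sigmoid(beta @ np.concatenate(([1.0], z)))  # y_hat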
Stochastic Gradient Descent for Learning
– Input: $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, learning rate $\gamma$
– Initialize all weights $\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}$ (???)
– While TERMINATION CRITERION is not satisfied (???):
  – For $i \in \text{shuffle}(1, \dots, N)$:
    – Compute $g_{\boldsymbol{\beta}} = \nabla_{\boldsymbol{\beta}} J^{(i)}(\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta})$
    – For $l = 1, \dots, L$: compute $g_{\boldsymbol{\alpha}^{(l)}} = \nabla_{\boldsymbol{\alpha}^{(l)}} J^{(i)}(\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta})$
    – Update $\boldsymbol{\beta} = \boldsymbol{\beta} - \gamma g_{\boldsymbol{\beta}}$
    – For $l = 1, \dots, L$: update $\boldsymbol{\alpha}^{(l)} = \boldsymbol{\alpha}^{(l)} - \gamma g_{\boldsymbol{\alpha}^{(l)}}$
– Output: $\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}$
(A sketch of this training loop appears below.)
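A sketch of this training loop in Python. The helpers `init_weights` and `per_example_loss_grads` are hypothetical placeholders (not from the lecture); the latter returns the per-example gradients $g_{\boldsymbol{\alpha}^{(l)}}$ and $g_{\boldsymbol{\beta}}$, which is exactly the quantity the rest of this lecture is about computing. A fixed epoch count stands in for the unspecified termination criterion.

import numpy as np

def sgd(data, layer_sizes, gamma, num_epochs, init_weights, per_example_loss_grads):
    """SGD over per-example losses J^(i); data is a list of (x, y) pairs."""
    alphas, beta = init_weights(layer_sizes)        # initialization strategy left open (???)
    for epoch in range(num_epochs):                 # one of many possible termination criteria
        for i in np.random.permutation(len(data)):  # shuffle(1, ..., N)
            x, y = data[i]
            g_alphas, g_beta = per_example_loss_grads(alphas, beta, x, y)
            beta = beta - gamma * g_beta
            alphas = [alpha - gamma * g for alpha, g in zip(alphas, g_alphas)]
    return alphas, beta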
Two questions about this procedure (the pseudocode is repeated from the previous slide):
1. What is this loss function $J^{(i)}$?
2. How on earth do we take these gradients?
Loss Functions for Neural Networks
– Let $\Theta = \{\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\}$ be the parameters of our neural network
– Regression: squared error (same as linear regression!)
  $J^{(i)}(\Theta) = \big(\hat{y}_\Theta(\mathbf{x}^{(i)}) - y^{(i)}\big)^2$
– Binary classification: cross-entropy loss (same as logistic regression!)
  – Assume $Y \in \{0, 1\}$ and $P(Y = 1 \mid \mathbf{x}, \Theta) = \hat{y}_\Theta(\mathbf{x})$
  $J^{(i)}(\Theta) = -\log P(y^{(i)} \mid \mathbf{x}^{(i)}, \Theta)$
  $\phantom{J^{(i)}(\Theta)} = -\log\Big[\hat{y}_\Theta(\mathbf{x}^{(i)})^{y^{(i)}}\big(1 - \hat{y}_\Theta(\mathbf{x}^{(i)})\big)^{1 - y^{(i)}}\Big]$
  $\phantom{J^{(i)}(\Theta)} = -\Big[y^{(i)} \log \hat{y}_\Theta(\mathbf{x}^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}_\Theta(\mathbf{x}^{(i)})\big)\Big]$
(A short code sketch of these per-example losses follows.)
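A short Python sketch of these per-example losses, where y_hat stands for the network output $\hat{y}_\Theta(\mathbf{x}^{(i)})$. The small constant eps is a numerical-stability guard added here, not something on the slide.

import numpy as np

def squared_error(y_hat, y):
    # Regression loss: J^(i) = (y_hat - y)^2
    return (y_hat - y) ** 2

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # Binary classification loss: J^(i) = -[y log(y_hat) + (1 - y) log(1 - y_hat)]
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))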
Loss Functions for Neural Networks
– Let $\Theta = \{\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\}$ be the parameters of our neural network
– Multi-class classification: cross-entropy loss again!
  – Express the label as a one-hot or one-of-$C$ vector, e.g., $\mathbf{y} = [0\ 0\ 1\ 0\ \cdots\ 0]$
  – Assume the neural network output is also a vector of length $C$, $\hat{\mathbf{y}}_\Theta(\mathbf{x})$, with $P(\mathbf{y}[c] = 1 \mid \mathbf{x}, \Theta) = \hat{\mathbf{y}}_\Theta(\mathbf{x})[c]$
  – Then the cross-entropy loss is
    $J^{(i)}(\Theta) = -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}, \Theta) = -\sum_{c=1}^{C} \mathbf{y}^{(i)}[c] \log \hat{\mathbf{y}}_\Theta(\mathbf{x}^{(i)})[c]$
Okay, but how do we get our network to output this vector?
– (Same multi-class cross-entropy setup as the previous slide: a one-hot label $\mathbf{y}$, a length-$C$ output $\hat{\mathbf{y}}_\Theta(\mathbf{x})$ with $P(\mathbf{y}[c] = 1 \mid \mathbf{x}, \Theta) = \hat{\mathbf{y}}_\Theta(\mathbf{x})[c]$, and loss $J^{(i)}(\Theta) = -\sum_{c=1}^{C} \mathbf{y}^{(i)}[c] \log \hat{\mathbf{y}}_\Theta(\mathbf{x}^{(i)})[c]$.)
Softmax
– Output layer: $y_c = \dfrac{\exp(b_c)}{\sum_{k=1}^{C} \exp(b_k)}$, where $b_c = \sum_{j=0}^{D} \beta_{c,j}\, z_j$
– Hidden layer: $z_j = \sigma(a_j)$, where $a_j = \sum_{i=0}^{M} \alpha_{j,i}\, x_i$
(A NumPy sketch of the softmax and the resulting loss appears below.)
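A minimal NumPy sketch of the softmax output layer and the resulting multi-class cross-entropy loss. The max-subtraction inside softmax and the eps term are standard numerical-stability tricks added here, not stated on the slide.

import numpy as np

def softmax(b):
    # y_c = exp(b_c) / sum_k exp(b_k), shifted by max(b) for numerical stability
    e = np.exp(b - np.max(b))
    return e / np.sum(e)

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # J = -sum_c y[c] * log(y_hat[c])
    return -np.sum(y_onehot * np.log(y_hat + eps))

# Example: scores b for C = 4 classes, true class is index 2
b = np.array([1.0, 2.0, 3.0, 0.5])
y = np.array([0, 0, 1, 0])
print(cross_entropy(y, softmax(b)))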
(The SGD pseudocode and the two questions are repeated here: question 1, the loss function $J^{(i)}$, has now been answered; what remains is question 2, how to take these gradients.)
Matrix Calculus: Types of Derivatives
(Table courtesy of Matt Gormley: derivatives are classified by whether the numerator, the quantity being differentiated, is a scalar, vector, or matrix, and by whether the denominator, the quantity being differentiated with respect to, is a scalar, vector, or matrix.)


Matrix Calculus: Denominator Layout
– In denominator layout, derivatives of a scalar always have the same shape as the entity that the derivative is being taken with respect to (scalar, vector, or matrix).
(Table courtesy of Matt Gormley. A small illustration follows.)
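A small illustration of that shape convention using the JAX library (not part of the lecture): the gradient of a scalar-valued function of a matrix comes back with the matrix's shape.

import jax
import jax.numpy as jnp

# f maps a matrix to a scalar, so (in denominator layout) its derivative
# has the same shape as the matrix we differentiate with respect to.
f = lambda W: jnp.sum(W ** 2)

W = jnp.arange(6.0).reshape(2, 3)
g = jax.grad(f)(W)
print(g.shape)                  # (2, 3): same shape as W
print(jnp.allclose(g, 2 * W))   # analytic derivative of sum(W^2) is 2W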


Matrix Calculus: Denominator Layout
(Table courtesy of Matt Gormley: the scalar and vector cases of the derivative-shape table, filled in under the denominator-layout convention.)


Three Approaches to Differentiation
– Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_\mathbf{x} f(\mathbf{x}) = \partial f(\mathbf{x}) / \partial \mathbf{x}$
1. Finite difference method
   – Requires the ability to call $f(\mathbf{x})$
   – Great for checking the accuracy of implementations of more complex differentiation methods
   – Computationally expensive for high-dimensional inputs
2. Symbolic differentiation
   – Requires systematic knowledge of derivatives
   – Can be computationally expensive if poorly implemented
3. Automatic differentiation (reverse mode)
   – Requires systematic knowledge of derivatives and an algorithm for computing $f(\mathbf{x})$
   – Computational cost of computing $\partial f(\mathbf{x}) / \partial \mathbf{x}$ is proportional to the cost of computing $f(\mathbf{x})$
Approach 1: Finite Difference Method
– Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_\mathbf{x} f(\mathbf{x}) = \partial f(\mathbf{x}) / \partial \mathbf{x}$
  $\dfrac{\partial f(\mathbf{x})}{\partial x_i} \approx \dfrac{f(\mathbf{x} + \epsilon \mathbf{d}_i) - f(\mathbf{x} - \epsilon \mathbf{d}_i)}{2\epsilon}$
  where $\mathbf{d}_i$ is a one-hot vector with a 1 in the $i$-th position
– We want $\epsilon$ to be small to get a good approximation, but we run into floating point issues when $\epsilon$ is too small
– Getting the full gradient requires computing the above approximation for each dimension of the input (see the sketch below)
(Figure: the secant through $f(x - \epsilon)$ and $f(x + \epsilon)$ approximating the slope of $f$ at $x$.)
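A minimal sketch of the full central-difference gradient in NumPy: one pair of function evaluations per input dimension, which is exactly why this is expensive for high-dimensional inputs.

import numpy as np

def finite_difference_gradient(f, x, eps=1e-5):
    """Approximate the gradient of f: R^D -> R at x via central differences."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        d = np.zeros_like(x, dtype=float)
        d[i] = 1.0                                            # one-hot direction d_i
        grad[i] = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
    return grad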
Approach 1: Finite Difference Method Example
– Given
  $y = f(x, z) = e^{xz} + \dfrac{xz}{\ln x} + \dfrac{\sin(\ln x)}{xz}$
  what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

>>> from math import exp, log, sin
>>> y = lambda x, z: exp(x*z) + (x*z)/log(x) + sin(log(x))/(x*z)
>>> x = 2
>>> z = 3
>>> e = 10**-8
>>> dydx = (y(x+e, z) - y(x-e, z))/(2*e)
>>> dydz = (y(x, z+e) - y(x, z-e))/(2*e)
>>> print(dydx, dydz)

(Example courtesy of Matt Gormley)


Approach 2: Symbolic Differentiation
– Given
  $y = f(x, z) = e^{xz} + \dfrac{xz}{\ln x} + \dfrac{\sin(\ln x)}{xz}$
  what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?
– Looks like we're gonna need the chain rule!
(Example courtesy of Matt Gormley)


The Chain Rule of Calculus
– If $y = f(z)$ and $z = g(x)$, then the corresponding computation graph is $x \to z \to y$, and
  $\dfrac{\partial y}{\partial x} = \dfrac{\partial y}{\partial z}\dfrac{\partial z}{\partial x}$
– If $y = f(z_1, z_2)$ and $z_1 = g_1(x)$, $z_2 = g_2(x)$, then $x$ feeds both $z_1$ and $z_2$, which both feed $y$, and
  $\dfrac{\partial y}{\partial x} = \dfrac{\partial y}{\partial z_1}\dfrac{\partial z_1}{\partial x} + \dfrac{\partial y}{\partial z_2}\dfrac{\partial z_2}{\partial x}$
– If $y = f(\mathbf{z})$ and $\mathbf{z} = g(x)$ with $\mathbf{z} \in \mathbb{R}^D$, then
  $\dfrac{\partial y}{\partial x} = \sum_{j=1}^{D} \dfrac{\partial y}{\partial z_j}\dfrac{\partial z_j}{\partial x}$
(A quick numerical sanity check of the multi-path case appears below.)
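A small sanity check of the multi-path chain rule, using a toy example chosen here (not from the slides): $z_1 = x^2$, $z_2 = \sin x$, $y = z_1 z_2$. The summed chain-rule expression is compared against a finite-difference estimate.

import math

x = 1.3
z1, z2 = x**2, math.sin(x)

# Chain rule: dy/dx = (dy/dz1)(dz1/dx) + (dy/dz2)(dz2/dx)
dydx_chain = z2 * (2 * x) + z1 * math.cos(x)

# Finite-difference check
eps = 1e-6
f = lambda t: (t**2) * math.sin(t)
dydx_fd = (f(x + eps) - f(x - eps)) / (2 * eps)

print(dydx_chain, dydx_fd)   # the two values should agree to several decimal places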
Poll Question 1
– If $y = f(\mathbf{z})$, $\mathbf{z} = g(\mathbf{w})$ and $\mathbf{w} = h(x)$, does the equation
  $\dfrac{\partial y}{\partial x} = \sum_{j=1}^{D} \dfrac{\partial y}{\partial z_j}\dfrac{\partial z_j}{\partial x}$
  still hold?
A. Yes
B. No
C. Only on Fridays (TOXIC)
Approach 2: Symbolic Differentiation
– Given
  $y = f(x, z) = e^{xz} + \dfrac{xz}{\ln x} + \dfrac{\sin(\ln x)}{xz}$
  what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

$\dfrac{\partial y}{\partial x} = \dfrac{\partial}{\partial x} e^{xz} + \dfrac{\partial}{\partial x}\left(\dfrac{xz}{\ln x}\right) + \dfrac{\partial}{\partial x}\left(\dfrac{\sin(\ln x)}{xz}\right) = z e^{xz} + \dfrac{z}{\ln x} - \dfrac{z}{(\ln x)^2} + \dfrac{\cos(\ln x)}{x^2 z} - \dfrac{\sin(\ln x)}{x^2 z}$

$\dfrac{\partial y}{\partial x}\bigg|_{x=2,\, z=3} = 3 e^{6} + \dfrac{3}{\ln 2} - \dfrac{3}{(\ln 2)^2} + \dfrac{\cos(\ln 2)}{12} - \dfrac{\sin(\ln 2)}{12}$

$\dfrac{\partial y}{\partial z} = \dfrac{\partial}{\partial z} e^{xz} + \dfrac{\partial}{\partial z}\left(\dfrac{xz}{\ln x}\right) + \dfrac{\partial}{\partial z}\left(\dfrac{\sin(\ln x)}{xz}\right) = x e^{xz} + \dfrac{x}{\ln x} - \dfrac{\sin(\ln x)}{x z^2}$

$\dfrac{\partial y}{\partial z}\bigg|_{x=2,\, z=3} = 2 e^{6} + \dfrac{2}{\ln 2} - \dfrac{\sin(\ln 2)}{18}$

(Example courtesy of Matt Gormley. A symbolic check with an off-the-shelf library follows.)
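For comparison, a symbolic-differentiation sketch using SymPy, an off-the-shelf computer algebra library (not used in the lecture itself):

import sympy as sp

x, z = sp.symbols('x z', positive=True)
y = sp.exp(x*z) + (x*z)/sp.log(x) + sp.sin(sp.log(x))/(x*z)

dydx = sp.diff(y, x)
dydz = sp.diff(y, z)

# Evaluate at x = 2, z = 3
print(dydx.subs({x: 2, z: 3}).evalf())
print(dydz.subs({x: 2, z: 3}).evalf())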
Approach 3: Automatic Differentiation (reverse mode)
– Given
  $y = f(x, z) = e^{xz} + \dfrac{xz}{\ln x} + \dfrac{\sin(\ln x)}{xz}$
  what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?
– First define some intermediate quantities, draw the computation graph and run the "forward" computation:
  $a = xz$
  $b = \ln x$
  $c = \sin b$
  $d = e^a$
  $e = a / b$
  $f = c / a$
  $y = d + e + f$
(Computation graph: $x = 2$ and $z = 3$ feed $a$; $x$ also feeds $b$; $b$ feeds $c$ and $e$; $a$ feeds $d$, $e$ and $f$; $c$ feeds $f$; and $d$, $e$, $f$ sum to $y$. A plain-Python transcription follows.)
(Example courtesy of Matt Gormley)
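A sketch of the forward computation over these intermediate quantities, as a plain Python transcription of the graph above:

import math

x, z = 2.0, 3.0

a = x * z            # a = xz
b = math.log(x)      # b = ln x
c = math.sin(b)      # c = sin b
d = math.exp(a)      # d = e^a
e = a / b            # e = a / b
f = c / a            # f = c / a
y = d + e + f        # y = d + e + f
print(y)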
Approach 3: Automatic Differentiation (reverse mode)
– Given
  $y = f(x, z) = e^{xz} + \dfrac{xz}{\ln x} + \dfrac{\sin(\ln x)}{xz}$
  what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?
– Then compute partial derivatives, starting from $y$ and working back through the computation graph (a sketch of this backward pass follows).
(Example courtesy of Matt Gormley)
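A sketch of that backward pass. Each adjoint g_v denotes $\partial y / \partial v$ for one intermediate variable; the update rules below are a straightforward application of the chain rule to the graph above, written out here since the slide works them out live.

import math

# Forward pass (as on the previous slide)
x, z = 2.0, 3.0
a = x * z
b = math.log(x)
c = math.sin(b)
d = math.exp(a)
e = a / b
f = c / a
y = d + e + f

# Backward pass: accumulate adjoints g_v = dy/dv from y back to the inputs
g_y = 1.0
g_d = g_y                        # y = d + e + f
g_e = g_y
g_f = g_y
g_c = g_f * (1.0 / a)            # f = c / a
g_a = g_d * d                    # d = e^a  =>  dd/da = e^a = d
g_a += g_e * (1.0 / b)           # e = a / b
g_a += g_f * (-c / a**2)         # f = c / a
g_b = g_e * (-a / b**2)          # e = a / b
g_b += g_c * math.cos(b)         # c = sin b
g_x = g_a * z + g_b * (1.0 / x)  # a = xz, b = ln x
g_z = g_a * x                    # a = xz
print(g_x, g_z)                  # dy/dx and dy/dz at x = 2, z = 3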
Automatic Differentiation
– Given $f: \mathbb{R}^D \to \mathbb{R}^C$, compute $\nabla_\mathbf{x} f(\mathbf{x}) = \partial f(\mathbf{x}) / \partial \mathbf{x}$
3. Automatic differentiation (reverse mode)
   – Requires systematic knowledge of derivatives and an algorithm for computing $f(\mathbf{x})$
   – Computational cost of computing $\nabla_\mathbf{x} f(\mathbf{x})_c = \partial f(\mathbf{x})_c / \partial \mathbf{x}$ (one output at a time) is proportional to the cost of computing $f(\mathbf{x})$
   – Great for high-dimensional inputs and low-dimensional outputs ($D \gg C$)
4. Automatic differentiation (forward mode)
   – Requires systematic knowledge of derivatives and an algorithm for computing $f(\mathbf{x})$
   – Computational cost of computing $\partial f(\mathbf{x}) / \partial x_d$ (one input at a time) is proportional to the cost of computing $f(\mathbf{x})$
   – Great for low-dimensional inputs and high-dimensional outputs ($D \ll C$)
(A brief library illustration of the two modes follows.)
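As an illustration using the JAX library (not part of the lecture), the same distinction appears as jacrev versus jacfwd: both return the full Jacobian, but their costs scale with the number of outputs C and the number of inputs D, respectively.

import jax
import jax.numpy as jnp

def f(x):
    # A toy map from R^3 to R^2
    return jnp.array([jnp.sum(x ** 2), jnp.prod(x)])

x = jnp.array([1.0, 2.0, 3.0])
J_rev = jax.jacrev(f)(x)   # reverse mode: roughly one backward pass per output (good when D >> C)
J_fwd = jax.jacfwd(f)(x)   # forward mode: roughly one forward pass per input (good when D << C)
print(J_rev)
print(J_fwd)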
Computation Graph: 10-301/601 Conventions
– The diagram represents an algorithm
– Nodes are rectangles, with one node per intermediate variable in the algorithm
– Each node is labeled with the function that it computes (inside the box) and the variable name (outside the box)
– Edges are directed and do not have labels
– For neural networks:
  – Each weight, feature value, label and bias term appears as a node
  – We can include the loss function
Neural Network Diagram Conventions
– The diagram represents a neural network
– Nodes are circles, with one node per hidden unit
– Each node is labeled with the variable corresponding to the hidden unit
– Edges are directed and each edge is labeled with its weight
– Following standard convention, the bias term is typically not shown as a node, but rather is assumed to be part of the activation function, i.e., its weight does not appear in the picture anywhere
– The diagram typically does not include any nodes related to the loss computation
Backprop Learning Objectives
You should be able to…
– Differentiate between a neural network diagram and a computation graph
– Construct a computation graph for a function as specified by an algorithm
– Carry out backpropagation on an arbitrary computation graph
– Construct a computation graph for a neural network, identifying all the given and intermediate quantities that are relevant
– Instantiate the backpropagation algorithm for a neural network
– Instantiate an optimization method (e.g., SGD) and a regularizer (e.g., L2) when the parameters of a model are comprised of several matrices corresponding to different layers of a neural network
– Use the finite difference method to evaluate the gradient of a function
– Identify when the gradient of a function can be computed at all and when it can be computed efficiently
– Employ basic matrix calculus to compute vector/matrix/tensor derivatives
