
10-301/601: Introduction to Machine Learning
Lecture 13 – Differentiation
Henry Chai & Matt Gormley & Hoda Heidari
2/26/24
Recall: Neural Networks (Matrix Form)

[Diagram: a feed-forward network with two hidden layers mapping inputs $x_1, \ldots, x_M$ to output $y$.]

$y = \sigma\left(\boldsymbol{\beta}^\top \boldsymbol{z}^{(2)} + \beta_0\right)$ with $\boldsymbol{\beta} \in \mathbb{R}^{D^{(2)}}$, $\beta_0 \in \mathbb{R}$

$\boldsymbol{z}^{(2)} = \sigma\left((\boldsymbol{\alpha}^{(2)})^\top \boldsymbol{z}^{(1)} + \boldsymbol{b}^{(2)}\right)$ with $\boldsymbol{\alpha}^{(2)} \in \mathbb{R}^{D^{(1)} \times D^{(2)}}$, $\boldsymbol{b}^{(2)} \in \mathbb{R}^{D^{(2)}}$

$\boldsymbol{z}^{(1)} = \sigma\left((\boldsymbol{\alpha}^{(1)})^\top \boldsymbol{x} + \boldsymbol{b}^{(1)}\right)$ with $\boldsymbol{\alpha}^{(1)} \in \mathbb{R}^{M \times D^{(1)}}$, $\boldsymbol{b}^{(1)} \in \mathbb{R}^{D^{(1)}}$
Recall: Neural Networks (Matrix Form)

The bias terms can be folded into the weight matrices by prepending a constant 1 to each layer's input (a sketch of this trick in code follows below):

$y = \sigma\left(\tilde{\boldsymbol{\beta}}^\top \begin{bmatrix} 1 \\ \boldsymbol{z}^{(2)} \end{bmatrix}\right)$ where $\tilde{\boldsymbol{\beta}} = \begin{bmatrix} \beta_0 \\ \boldsymbol{\beta} \end{bmatrix} \in \mathbb{R}^{D^{(2)}+1}$

$\boldsymbol{z}^{(2)} = \sigma\left((\tilde{\boldsymbol{\alpha}}^{(2)})^\top \begin{bmatrix} 1 \\ \boldsymbol{z}^{(1)} \end{bmatrix}\right)$ where $\tilde{\boldsymbol{\alpha}}^{(2)} = \begin{bmatrix} (\boldsymbol{b}^{(2)})^\top \\ \boldsymbol{\alpha}^{(2)} \end{bmatrix} \in \mathbb{R}^{(D^{(1)}+1) \times D^{(2)}}$

$\boldsymbol{z}^{(1)} = \sigma\left((\tilde{\boldsymbol{\alpha}}^{(1)})^\top \begin{bmatrix} 1 \\ \boldsymbol{x} \end{bmatrix}\right)$ where $\tilde{\boldsymbol{\alpha}}^{(1)} = \begin{bmatrix} (\boldsymbol{b}^{(1)})^\top \\ \boldsymbol{\alpha}^{(1)} \end{bmatrix} \in \mathbb{R}^{(M+1) \times D^{(1)}}$
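A minimal NumPy sketch (not from the original slides) of the bias-folding trick above; the layer sizes and variable names (alpha1, b1, beta0, etc.) are illustrative assumptions.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

M, D1, D2 = 4, 3, 2                            # illustrative layer sizes
rng = np.random.default_rng(0)
x = rng.normal(size=M)                         # input vector
alpha1, b1 = rng.normal(size=(M, D1)), rng.normal(size=D1)
alpha2, b2 = rng.normal(size=(D1, D2)), rng.normal(size=D2)
beta, beta0 = rng.normal(size=D2), rng.normal()

# Explicit-bias form
z1 = sigmoid(alpha1.T @ x + b1)
z2 = sigmoid(alpha2.T @ z1 + b2)
y = sigmoid(beta @ z2 + beta0)

# Folded-bias form: prepend a 1 to each layer input and stack the bias on top
alpha1_t = np.vstack([b1, alpha1])             # shape (M+1) x D^(1)
alpha2_t = np.vstack([b2, alpha2])             # shape (D^(1)+1) x D^(2)
beta_t = np.concatenate([[beta0], beta])       # length D^(2)+1
z1_t = sigmoid(alpha1_t.T @ np.concatenate([[1.0], x]))
z2_t = sigmoid(alpha2_t.T @ np.concatenate([[1.0], z1_t]))
y_t = sigmoid(beta_t @ np.concatenate([[1.0], z2_t]))

assert np.allclose(y, y_t)                     # both forms give the same prediction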
Forward Propagation for Making Predictions

– Inputs: weights $\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}$, $\boldsymbol{\beta}$ and a query data point $\boldsymbol{x}'$
– Initialize $\boldsymbol{z}^{(0)} = \boldsymbol{x}'$
– For $l = 1, \ldots, L$:
    – $\boldsymbol{a}^{(l)} = (\boldsymbol{\alpha}^{(l)})^\top \boldsymbol{z}^{(l-1)}$
    – $\boldsymbol{z}^{(l)} = \sigma(\boldsymbol{a}^{(l)})$
– $\hat{y} = \sigma(\boldsymbol{\beta}^\top \boldsymbol{z}^{(L)})$
– Output: the prediction $\hat{y}$
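A minimal Python/NumPy sketch (not from the slides) of this forward pass, following the pseudocode literally; bias terms are omitted here (or assumed folded into the weights as on the previous slides), and the names alphas and beta are illustrative.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(alphas, beta, x_query):
    """alphas: list of L weight matrices alpha^(1), ..., alpha^(L); beta: output weights."""
    z = x_query                   # z^(0) = x'
    for alpha_l in alphas:        # for l = 1, ..., L
        a_l = alpha_l.T @ z       # a^(l) = (alpha^(l))^T z^(l-1)
        z = sigmoid(a_l)          # z^(l) = sigma(a^(l))
    return sigmoid(beta @ z)      # y_hat = sigma(beta^T z^(L))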
Stochastic Gradient Descent for Learning

– Input: $\mathcal{D} = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, learning rate $\gamma$
– Initialize all weights $\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}$, $\boldsymbol{\beta}$ (???)
– While TERMINATION CRITERION is not satisfied (???)
    – For $i \in \text{shuffle}(1, \ldots, N)$:
        – Compute $g_{\boldsymbol{\beta}} = \nabla_{\boldsymbol{\beta}} J^{(i)}(\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta})$
        – For $l = 1, \ldots, L$: compute $g_{\boldsymbol{\alpha}^{(l)}} = \nabla_{\boldsymbol{\alpha}^{(l)}} J^{(i)}(\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta})$
        – Update $\boldsymbol{\beta} = \boldsymbol{\beta} - \gamma g_{\boldsymbol{\beta}}$
        – For $l = 1, \ldots, L$: update $\boldsymbol{\alpha}^{(l)} = \boldsymbol{\alpha}^{(l)} - \gamma g_{\boldsymbol{\alpha}^{(l)}}$
– Output: $\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}$, $\boldsymbol{\beta}$

(A sketch of this training loop in code follows below.)
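A minimal Python sketch (not from the slides) of the SGD loop above. The helpers grad_beta and grad_alpha are hypothetical placeholders standing in for "compute the gradient of $J^{(i)}$"; how to compute those gradients is exactly what the rest of the lecture addresses.

import numpy as np

def sgd(data, alphas, beta, gamma, grad_beta, grad_alpha, max_epochs=10):
    """data: list of (x, y) pairs; alphas: list of L weight matrices; beta: output weights."""
    rng = np.random.default_rng(0)
    for _ in range(max_epochs):                      # TERMINATION CRITERION: a fixed epoch budget here
        for i in rng.permutation(len(data)):         # for i in shuffle(1, ..., N)
            x_i, y_i = data[i]
            g_beta = grad_beta(alphas, beta, x_i, y_i)             # nabla_beta J^(i)
            g_alphas = [grad_alpha(l, alphas, beta, x_i, y_i)      # nabla_alpha^(l) J^(i)
                        for l in range(len(alphas))]
            beta = beta - gamma * g_beta                           # update beta
            alphas = [a - gamma * g for a, g in zip(alphas, g_alphas)]  # update each alpha^(l)
    return alphas, beta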
Two Questions

The SGD algorithm above raises two questions:

1. What is this loss function $J^{(i)}$?
2. How on earth do we take these gradients?
Loss Functions for Neural Networks

– Let $\boldsymbol{\Theta} = \{\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\}$ be the parameters of our neural network.

– Regression: squared error (same as linear regression!)

$J^{(i)}(\boldsymbol{\Theta}) = \left(\hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)}) - y^{(i)}\right)^2$

– Binary classification: cross-entropy loss (same as logistic regression!)
    – Assume $Y \in \{0, 1\}$ and $P(Y = 1 \mid \boldsymbol{x}, \boldsymbol{\Theta}) = \hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x})$

$J^{(i)}(\boldsymbol{\Theta}) = -\log P(y^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\Theta})$
$\quad = -\log\left[\hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})^{y^{(i)}} \left(1 - \hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})\right)^{1 - y^{(i)}}\right]$
$\quad = -\left[y^{(i)} \log \hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})\right)\right]$
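A minimal NumPy sketch (not from the slides) of these two per-example losses; the clipping constant eps is an implementation detail added here to avoid log(0).

import numpy as np

def squared_error(y_hat, y):
    # J^(i) = (y_hat - y)^2
    return (y_hat - y) ** 2

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # J^(i) = -[ y log(y_hat) + (1 - y) log(1 - y_hat) ]
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(squared_error(0.8, 1.0))          # ~0.04
print(binary_cross_entropy(0.8, 1.0))   # ~0.223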
Loss Functions for Neural Networks

– Let $\boldsymbol{\Theta} = \{\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\}$ be the parameters of our neural network.

– Multi-class classification: cross-entropy loss again!
    – Express the label as a one-hot or one-of-$C$ vector, e.g., $\boldsymbol{y} = \begin{bmatrix} 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}$
    – Assume the neural network output is also a vector of length $C$, $\hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x})$, with $P(\boldsymbol{y}[c] = 1 \mid \boldsymbol{x}, \boldsymbol{\Theta}) = \hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x})[c]$
    – Then the cross-entropy loss is

$J^{(i)}(\boldsymbol{\Theta}) = -\log P(\boldsymbol{y}^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\Theta}) = -\sum_{c=1}^{C} \boldsymbol{y}^{(i)}[c] \log \hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})[c]$
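A minimal NumPy sketch (not from the slides) of the multi-class cross-entropy for one example; the one-hot label and predicted distribution below are illustrative values.

import numpy as np

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # J^(i) = -sum_c y[c] log(y_hat[c]); eps avoids log(0)
    return -np.sum(y_onehot * np.log(y_hat + eps))

y = np.array([0, 0, 1, 0])               # one-hot label (class 3 of C = 4)
y_hat = np.array([0.1, 0.2, 0.6, 0.1])   # predicted distribution over the C classes
print(cross_entropy(y, y_hat))           # -log(0.6) ~ 0.511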
Okay, but how do we get our network to output this vector?

– We need the network's output $\hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x})$ to be a length-$C$ vector of probabilities, so that $P(\boldsymbol{y}[c] = 1 \mid \boldsymbol{x}, \boldsymbol{\Theta}) = \hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x})[c]$ and the cross-entropy loss

$J^{(i)}(\boldsymbol{\Theta}) = -\sum_{c=1}^{C} \boldsymbol{y}^{(i)}[c] \log \hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})[c]$

is well-defined. The answer: a softmax output layer (next slide).
Softmax

– Output layer: apply the softmax function to the output-layer scores so that the entries form a probability distribution over the $C$ classes:

$\hat{y}_c = \frac{\exp(b_c)}{\sum_{k=1}^{C} \exp(b_k)}$ where $b_c = \sum_{j} \beta_{c,j} z_j$

– Hidden layer (as before): $z_j = \sigma(a_j)$ where $a_j = \sum_{m} \alpha_{j,m} x_m$
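A minimal NumPy sketch (not from the slides) of the softmax; subtracting the maximum score before exponentiating is a numerical-stability detail added here, and it does not change the result.

import numpy as np

def softmax(b):
    # y_c = exp(b_c) / sum_k exp(b_k)
    shifted = b - np.max(b)      # avoids overflow for large scores
    exp_b = np.exp(shifted)
    return exp_b / np.sum(exp_b)

b = np.array([2.0, 1.0, 0.1])
print(softmax(b))                # ~[0.659, 0.242, 0.099], sums to 1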
Two Questions (revisited)

1. What is this loss function $J^{(i)}$? Answered: squared error or cross-entropy, as above.
2. How on earth do we take these gradients?
Matrix Calculus: Types of Derivatives

[Table: the numerator of a derivative can be a scalar, a vector, or a matrix, and so can the denominator (the quantity the derivative is taken with respect to); each combination gives a different type of derivative. Table courtesy of Matt Gormley.]


Matrix Calculus: Denominator Layout

In denominator layout, derivatives of a scalar always have the same shape as the entity that the derivative is being taken with respect to: a scalar, vector, or matrix denominator gives a scalar, vector, or matrix derivative, respectively.

(Table courtesy of Matt Gormley)


Matrix Calculus: Denominator Layout

[Table: types of derivatives with scalar and vector numerators and scalar and vector denominators, laid out in denominator layout. Table courtesy of Matt Gormley.]


Three Approaches to Differentiation

– Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \partial f(\boldsymbol{x}) / \partial \boldsymbol{x}$

1. Finite difference method
    – Requires the ability to call $f(\boldsymbol{x})$
    – Great for checking the accuracy of implementations of more complex differentiation methods
    – Computationally expensive for high-dimensional inputs

2. Symbolic differentiation
    – Requires systematic knowledge of derivatives
    – Can be computationally expensive if poorly implemented

3. Automatic differentiation (reverse mode)
    – Requires systematic knowledge of derivatives and an algorithm for computing $f(\boldsymbol{x})$
    – Computational cost of computing $\partial f(\boldsymbol{x}) / \partial \boldsymbol{x}$ is proportional to the cost of computing $f(\boldsymbol{x})$
Approach 1: Finite Difference Method

– Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \partial f(\boldsymbol{x}) / \partial \boldsymbol{x}$

$\frac{\partial f(\boldsymbol{x})}{\partial x_i} \approx \frac{f(\boldsymbol{x} + \epsilon \boldsymbol{d}_i) - f(\boldsymbol{x} - \epsilon \boldsymbol{d}_i)}{2\epsilon}$

where $\boldsymbol{d}_i$ is a one-hot vector with a 1 in the $i$-th position.

[Figure: $f(x)$ evaluated at $x - \epsilon$ and $x + \epsilon$ around the point $x$.]

– We want $\epsilon$ to be small to get a good approximation, but we run into floating point issues when $\epsilon$ is too small.
– Getting the full gradient requires computing the above approximation for each dimension of the input.
Approach 1: Finite Difference Method Example

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

>>> from math import *
>>> y = lambda x,z: exp(x*z)+(x*z)/log(x)+sin(log(x))/(x*z)
>>> x = 2
>>> z = 3
>>> e = 10**-8
>>> dydx = (y(x+e,z)-y(x-e,z))/(2*e)
>>> dydz = (y(x,z+e)-y(x,z-e))/(2*e)
>>> print(dydx, dydz)

(Example courtesy of Matt Gormley)
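Getting the full gradient requires one such central difference per input dimension. Below is a minimal sketch (not from the slides) of a general-purpose finite-difference gradient, handy for checking other differentiation code; the function and variable names are illustrative.

import numpy as np

def finite_difference_grad(f, x, eps=1e-6):
    """Approximate the gradient of f: R^D -> R at x using central differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):              # one central difference per dimension
        d = np.zeros_like(x)
        d[i] = 1.0                       # one-hot direction d_i
        grad[i] = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
    return grad

f = lambda v: np.exp(v[0] * v[1]) + (v[0] * v[1]) / np.log(v[0]) \
    + np.sin(np.log(v[0])) / (v[0] * v[1])
print(finite_difference_grad(f, [2.0, 3.0]))   # approximates [dy/dx, dy/dz]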


Approach 2: Symbolic Differentiation

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

– Looks like we're gonna need the chain rule!

(Example courtesy of Matt Gormley)


The Chain Rule of Calculus

– If $y = f(z)$ and $z = g(x)$, then the corresponding computation graph is $x \to z \to y$, and

$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z} \frac{\partial z}{\partial x}$

– If $y = f(z_1, z_2)$ and $z_1 = g_1(x)$, $z_2 = g_2(x)$, then the computation graph has two paths from $x$ to $y$ (through $z_1$ and through $z_2$), and

$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z_1} \frac{\partial z_1}{\partial x} + \frac{\partial y}{\partial z_2} \frac{\partial z_2}{\partial x}$

– If $y = f(\boldsymbol{z})$ and $\boldsymbol{z} = g(x)$ with $\boldsymbol{z} \in \mathbb{R}^D$, then

$\frac{\partial y}{\partial x} = \sum_{j=1}^{D} \frac{\partial y}{\partial z_j} \frac{\partial z_j}{\partial x}$

(A quick numerical check of the vector case appears below.)
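A minimal Python sketch (not from the slides) checking the vector-valued chain rule numerically on a made-up example, $y = z_1 z_2$ with $z_1 = x^2$ and $z_2 = \sin x$; all names and values are illustrative.

import numpy as np

x = 1.3
z1, z2 = x**2, np.sin(x)                 # z = g(x)
# chain rule: dy/dx = (dy/dz1)(dz1/dx) + (dy/dz2)(dz2/dx), with y = z1 * z2
chain = z2 * (2 * x) + z1 * np.cos(x)

# central-difference check on y(x) = x^2 sin(x)
eps = 1e-6
y = lambda t: t**2 * np.sin(t)
numeric = (y(x + eps) - y(x - eps)) / (2 * eps)
print(chain, numeric)                    # the two values agree closely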
Poll Question 1

– If $y = f(\boldsymbol{z})$, $\boldsymbol{z} = g(\boldsymbol{w})$ and $\boldsymbol{w} = h(x)$, does the equation

$\frac{\partial y}{\partial x} = \sum_{j=1}^{D} \frac{\partial y}{\partial z_j} \frac{\partial z_j}{\partial x}$

still hold?

A. Yes
B. No
C. Only on Fridays (TOXIC)
Approach 2: Symbolic Differentiation

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

$\frac{\partial y}{\partial x} = \frac{\partial}{\partial x}\left[e^{xz}\right] + \frac{\partial}{\partial x}\left[\frac{xz}{\ln x}\right] + \frac{\partial}{\partial x}\left[\frac{\sin(\ln x)}{xz}\right]$

$\frac{\partial y}{\partial x} = z e^{xz} + \frac{z}{\ln x} - \frac{z}{(\ln x)^2} + \frac{\cos(\ln x)}{x^2 z} - \frac{\sin(\ln x)}{x^2 z}$

$\left.\frac{\partial y}{\partial x}\right|_{x=2,\, z=3} = 3e^6 + \frac{3}{\ln 2} - \frac{3}{(\ln 2)^2} + \frac{\cos(\ln 2)}{12} - \frac{\sin(\ln 2)}{12}$

$\frac{\partial y}{\partial z} = \frac{\partial}{\partial z}\left[e^{xz}\right] + \frac{\partial}{\partial z}\left[\frac{xz}{\ln x}\right] + \frac{\partial}{\partial z}\left[\frac{\sin(\ln x)}{xz}\right]$

$\left.\frac{\partial y}{\partial z}\right|_{x=2,\, z=3} = 2e^6 + \frac{2}{\ln 2} - \frac{\sin(\ln 2)}{18}$

(Example courtesy of Matt Gormley)
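A minimal SymPy sketch (not part of the original slides) that carries out the same symbolic differentiation automatically; the availability of sympy is assumed, and the numeric values in the comments are approximate.

import sympy as sp

x, z = sp.symbols('x z', positive=True)
y = sp.exp(x * z) + (x * z) / sp.log(x) + sp.sin(sp.log(x)) / (x * z)

dydx = sp.diff(y, x)                     # symbolic partial derivative w.r.t. x
dydz = sp.diff(y, z)                     # symbolic partial derivative w.r.t. z

print(sp.simplify(dydx))
print(float(dydx.subs({x: 2, z: 3})))    # ~1208.4, matching the finite-difference estimate
print(float(dydz.subs({x: 2, z: 3})))    # ~809.7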
Approach 3: Automatic Differentiation (reverse mode)

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

– First define some intermediate quantities, draw the computation graph, and run the "forward" computation (with $x = 2$, $z = 3$):

$a = xz$
$b = \ln x$
$c = \sin b$
$d = e^a$
$e = a / b$
$f = c / a$
$y = d + e + f$

[Computation graph: $x$ and $z$ feed the $*$ node producing $a$; $x$ feeds the $\ln$ node producing $b$; $b$ feeds the $\sin$ node producing $c$; $a$ feeds the $\exp$ node producing $d$; $a$ and $b$ feed the $/$ node producing $e$; $c$ and $a$ feed the $/$ node producing $f$; $d$, $e$, and $f$ feed the $+$ node producing $y$.]

(Example courtesy of Matt Gormley)
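A minimal Python sketch (not from the slides) of the forward computation, storing every intermediate value so the backward sweep on the next slide can reuse them; the dictionary representation is an illustrative choice.

import math

def forward_pass(x, z):
    a = x * z
    b = math.log(x)
    c = math.sin(b)
    d = math.exp(a)
    e = a / b
    f = c / a
    y = d + e + f
    return {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f, 'y': y}

vals = forward_pass(2.0, 3.0)
print(vals['y'])     # value of f(2, 3)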
Approach 3: Automatic Differentiation (reverse mode)

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

– Then compute partial derivatives, starting from $y$ and working back through the computation graph (a sketch of this backward sweep in code follows below).

(Example courtesy of Matt Gormley)
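A minimal Python sketch (not from the slides) of the reverse sweep for this computation graph. Each adjoint variable (g_y, g_d, and so on; the names are illustrative) holds $\partial y / \partial(\text{that intermediate})$; contributions are accumulated child by child, working back from $y$.

import math

def reverse_mode(x, z):
    # forward pass (same intermediates as above)
    a = x * z
    b = math.log(x)
    c = math.sin(b)
    d = math.exp(a)
    e = a / b
    f = c / a
    y = d + e + f

    # backward pass: adjoints g_v = dy/dv, starting from g_y = 1
    g_y = 1.0
    g_d = g_y * 1.0                      # y = d + e + f
    g_e = g_y * 1.0
    g_f = g_y * 1.0
    g_c = g_f * (1.0 / a)                # f = c / a
    g_a = g_d * math.exp(a) \
        + g_e * (1.0 / b) \
        + g_f * (-c / a**2)              # a feeds d, e, and f
    g_b = g_e * (-a / b**2) \
        + g_c * math.cos(b)              # b feeds e and c
    g_x = g_a * z + g_b * (1.0 / x)      # a = x*z, b = ln(x)
    g_z = g_a * x                        # a = x*z
    return y, g_x, g_z

print(reverse_mode(2.0, 3.0))            # dy/dx ~ 1208.4, dy/dz ~ 809.7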
Automatic Differentiation

– Given $f: \mathbb{R}^D \to \mathbb{R}^C$, compute $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \partial f(\boldsymbol{x}) / \partial \boldsymbol{x}$

3. Automatic differentiation (reverse mode)
    – Requires systematic knowledge of derivatives and an algorithm for computing $f(\boldsymbol{x})$
    – Computational cost of computing $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})_c = \partial f(\boldsymbol{x})_c / \partial \boldsymbol{x}$ is proportional to the cost of computing $f(\boldsymbol{x})$
    – Great for high-dimensional inputs and low-dimensional outputs ($D \gg C$)

4. Automatic differentiation (forward mode)
    – Requires systematic knowledge of derivatives and an algorithm for computing $f(\boldsymbol{x})$
    – Computational cost of computing $\partial f(\boldsymbol{x}) / \partial x_d$ is proportional to the cost of computing $f(\boldsymbol{x})$
    – Great for low-dimensional inputs and high-dimensional outputs ($D \ll C$)
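For reference, a minimal JAX sketch (not part of the slides) contrasting the two modes: jax.jacrev builds the Jacobian with reverse mode (one sweep per output) and jax.jacfwd with forward mode (one sweep per input). The small function f below is purely illustrative.

import jax
import jax.numpy as jnp

def f(x):
    # a small R^D -> R^C function (D = 3 inputs, C = 2 outputs)
    return jnp.array([jnp.sum(x ** 2), jnp.prod(jnp.sin(x))])

x = jnp.array([2.0, 3.0, 0.5])
J_rev = jax.jacrev(f)(x)             # reverse mode: efficient when C is small
J_fwd = jax.jacfwd(f)(x)             # forward mode: efficient when D is small
print(jnp.allclose(J_rev, J_fwd))    # both give the same C x D Jacobian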
Computation Graph: 10-301/601 Conventions

– The diagram represents an algorithm.
– Nodes are rectangles, with one node per intermediate variable in the algorithm.
– Each node is labeled with the function that it computes (inside the box) and the variable name (outside the box).
– Edges are directed and do not have labels.
– For neural networks:
    – Each weight, feature value, label, and bias term appears as a node.
    – We can include the loss function.
Neural Network Diagram Conventions

– The diagram represents a neural network.
– Nodes are circles, with one node per hidden unit.
– Each node is labeled with the variable corresponding to the hidden unit.
– Edges are directed and each edge is labeled with its weight.
– Following standard convention, the bias term is typically not shown as a node; rather, it is assumed to be part of the activation function, i.e., its weight does not appear anywhere in the picture.
– The diagram typically does not include any nodes related to the loss computation.
Backprop Learning Objectives

You should be able to…
– Differentiate between a neural network diagram and a computation graph
– Construct a computation graph for a function as specified by an algorithm
– Carry out backpropagation on an arbitrary computation graph
– Construct a computation graph for a neural network, identifying all the given and intermediate quantities that are relevant
– Instantiate the backpropagation algorithm for a neural network
– Instantiate an optimization method (e.g., SGD) and a regularizer (e.g., L2) when the parameters of a model are comprised of several matrices corresponding to different layers of a neural network
– Use the finite difference method to evaluate the gradient of a function
– Identify when the gradient of a function can be computed at all and when it can be computed efficiently
– Employ basic matrix calculus to compute vector/matrix/tensor derivatives
