
10-301/601: Introduction to Machine Learning
Lecture 13 – Differentiation
Henry Chai & Matt Gormley & Hoda Heidari
2/26/24
Recall: Neural Networks (Matrix Form)

[Diagram: a feed-forward network with two hidden layers mapping inputs $x_1, \ldots, x_M$ to output $y$.]

$y = \sigma\left(\boldsymbol{\beta}^\top \boldsymbol{z}^{(2)} + \beta_0\right)$ with $\boldsymbol{\beta} \in \mathbb{R}^{D^{(2)}}$, $\beta_0 \in \mathbb{R}$

$\boldsymbol{z}^{(2)} = \sigma\left((\boldsymbol{\alpha}^{(2)})^\top \boldsymbol{z}^{(1)} + \boldsymbol{b}^{(2)}\right)$ with $\boldsymbol{\alpha}^{(2)} \in \mathbb{R}^{D^{(1)} \times D^{(2)}}$, $\boldsymbol{b}^{(2)} \in \mathbb{R}^{D^{(2)}}$

$\boldsymbol{z}^{(1)} = \sigma\left((\boldsymbol{\alpha}^{(1)})^\top \boldsymbol{x} + \boldsymbol{b}^{(1)}\right)$ with $\boldsymbol{\alpha}^{(1)} \in \mathbb{R}^{M \times D^{(1)}}$, $\boldsymbol{b}^{(1)} \in \mathbb{R}^{D^{(1)}}$
Recall: Neural Networks (Matrix Form)

The bias terms can be folded into the weight matrices by prepending a constant 1 to each layer's input (a sketch of this trick in code follows below):

$y = \sigma\left(\tilde{\boldsymbol{\beta}}^\top \begin{bmatrix} 1 \\ \boldsymbol{z}^{(2)} \end{bmatrix}\right)$ where $\tilde{\boldsymbol{\beta}} = \begin{bmatrix} \beta_0 \\ \boldsymbol{\beta} \end{bmatrix} \in \mathbb{R}^{D^{(2)}+1}$

$\boldsymbol{z}^{(2)} = \sigma\left((\tilde{\boldsymbol{\alpha}}^{(2)})^\top \begin{bmatrix} 1 \\ \boldsymbol{z}^{(1)} \end{bmatrix}\right)$ where $\tilde{\boldsymbol{\alpha}}^{(2)} = \begin{bmatrix} (\boldsymbol{b}^{(2)})^\top \\ \boldsymbol{\alpha}^{(2)} \end{bmatrix} \in \mathbb{R}^{(D^{(1)}+1) \times D^{(2)}}$

$\boldsymbol{z}^{(1)} = \sigma\left((\tilde{\boldsymbol{\alpha}}^{(1)})^\top \begin{bmatrix} 1 \\ \boldsymbol{x} \end{bmatrix}\right)$ where $\tilde{\boldsymbol{\alpha}}^{(1)} = \begin{bmatrix} (\boldsymbol{b}^{(1)})^\top \\ \boldsymbol{\alpha}^{(1)} \end{bmatrix} \in \mathbb{R}^{(M+1) \times D^{(1)}}$
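A minimal NumPy sketch (not from the original slides) of the bias-folding trick above; the layer sizes and variable names (alpha1, b1, beta0, etc.) are illustrative assumptions.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

M, D1, D2 = 4, 3, 2                            # illustrative layer sizes
rng = np.random.default_rng(0)
x = rng.normal(size=M)                         # input vector
alpha1, b1 = rng.normal(size=(M, D1)), rng.normal(size=D1)
alpha2, b2 = rng.normal(size=(D1, D2)), rng.normal(size=D2)
beta, beta0 = rng.normal(size=D2), rng.normal()

# Explicit-bias form
z1 = sigmoid(alpha1.T @ x + b1)
z2 = sigmoid(alpha2.T @ z1 + b2)
y = sigmoid(beta @ z2 + beta0)

# Folded-bias form: prepend a 1 to each layer input and stack the bias on top
alpha1_t = np.vstack([b1, alpha1])             # shape (M+1) x D^(1)
alpha2_t = np.vstack([b2, alpha2])             # shape (D^(1)+1) x D^(2)
beta_t = np.concatenate([[beta0], beta])       # length D^(2)+1
z1_t = sigmoid(alpha1_t.T @ np.concatenate([[1.0], x]))
z2_t = sigmoid(alpha2_t.T @ np.concatenate([[1.0], z1_t]))
y_t = sigmoid(beta_t @ np.concatenate([[1.0], z2_t]))

assert np.allclose(y, y_t)                     # both forms give the same prediction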
Forward Propagation for Making Predictions

– Inputs: weights $\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}$, $\boldsymbol{\beta}$ and a query data point $\boldsymbol{x}'$
– Initialize $\boldsymbol{z}^{(0)} = \boldsymbol{x}'$
– For $l = 1, \ldots, L$:
    – $\boldsymbol{a}^{(l)} = (\boldsymbol{\alpha}^{(l)})^\top \boldsymbol{z}^{(l-1)}$
    – $\boldsymbol{z}^{(l)} = \sigma(\boldsymbol{a}^{(l)})$
– $\hat{y} = \sigma(\boldsymbol{\beta}^\top \boldsymbol{z}^{(L)})$
– Output: the prediction $\hat{y}$
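A minimal Python/NumPy sketch (not from the slides) of this forward pass, following the pseudocode literally; bias terms are omitted here (or assumed folded into the weights as on the previous slides), and the names alphas and beta are illustrative.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(alphas, beta, x_query):
    """alphas: list of L weight matrices alpha^(1), ..., alpha^(L); beta: output weights."""
    z = x_query                   # z^(0) = x'
    for alpha_l in alphas:        # for l = 1, ..., L
        a_l = alpha_l.T @ z       # a^(l) = (alpha^(l))^T z^(l-1)
        z = sigmoid(a_l)          # z^(l) = sigma(a^(l))
    return sigmoid(beta @ z)      # y_hat = sigma(beta^T z^(L))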
Stochastic Gradient Descent for Learning

– Input: $\mathcal{D} = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, learning rate $\gamma$
– Initialize all weights $\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}$, $\boldsymbol{\beta}$ (???)
– While TERMINATION CRITERION is not satisfied (???)
    – For $i \in \text{shuffle}(1, \ldots, N)$:
        – Compute $g_{\boldsymbol{\beta}} = \nabla_{\boldsymbol{\beta}} J^{(i)}(\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta})$
        – For $l = 1, \ldots, L$: compute $g_{\boldsymbol{\alpha}^{(l)}} = \nabla_{\boldsymbol{\alpha}^{(l)}} J^{(i)}(\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta})$
        – Update $\boldsymbol{\beta} = \boldsymbol{\beta} - \gamma g_{\boldsymbol{\beta}}$
        – For $l = 1, \ldots, L$: update $\boldsymbol{\alpha}^{(l)} = \boldsymbol{\alpha}^{(l)} - \gamma g_{\boldsymbol{\alpha}^{(l)}}$
– Output: $\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}$, $\boldsymbol{\beta}$

(A sketch of this training loop in code follows below.)
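A minimal Python sketch (not from the slides) of the SGD loop above. The helpers grad_beta and grad_alpha are hypothetical placeholders standing in for "compute the gradient of $J^{(i)}$"; how to compute those gradients is exactly what the rest of the lecture addresses.

import numpy as np

def sgd(data, alphas, beta, gamma, grad_beta, grad_alpha, max_epochs=10):
    """data: list of (x, y) pairs; alphas: list of L weight matrices; beta: output weights."""
    rng = np.random.default_rng(0)
    for _ in range(max_epochs):                      # TERMINATION CRITERION: a fixed epoch budget here
        for i in rng.permutation(len(data)):         # for i in shuffle(1, ..., N)
            x_i, y_i = data[i]
            g_beta = grad_beta(alphas, beta, x_i, y_i)             # nabla_beta J^(i)
            g_alphas = [grad_alpha(l, alphas, beta, x_i, y_i)      # nabla_alpha^(l) J^(i)
                        for l in range(len(alphas))]
            beta = beta - gamma * g_beta                           # update beta
            alphas = [a - gamma * g for a, g in zip(alphas, g_alphas)]  # update each alpha^(l)
    return alphas, beta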
Two Questions

The SGD algorithm above raises two questions:

1. What is this loss function $J^{(i)}$?
2. How on earth do we take these gradients?
Loss Functions for Neural Networks

– Let $\boldsymbol{\Theta} = \{\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\}$ be the parameters of our neural network.

– Regression: squared error (same as linear regression!)

$J^{(i)}(\boldsymbol{\Theta}) = \left(\hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)}) - y^{(i)}\right)^2$

– Binary classification: cross-entropy loss (same as logistic regression!)
    – Assume $Y \in \{0, 1\}$ and $P(Y = 1 \mid \boldsymbol{x}, \boldsymbol{\Theta}) = \hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x})$

$J^{(i)}(\boldsymbol{\Theta}) = -\log P(y^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\Theta})$
$\quad = -\log\left[\hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})^{y^{(i)}} \left(1 - \hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})\right)^{1 - y^{(i)}}\right]$
$\quad = -\left[y^{(i)} \log \hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})\right)\right]$
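A minimal NumPy sketch (not from the slides) of these two per-example losses; the clipping constant eps is an implementation detail added here to avoid log(0).

import numpy as np

def squared_error(y_hat, y):
    # J^(i) = (y_hat - y)^2
    return (y_hat - y) ** 2

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # J^(i) = -[ y log(y_hat) + (1 - y) log(1 - y_hat) ]
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(squared_error(0.8, 1.0))          # ~0.04
print(binary_cross_entropy(0.8, 1.0))   # ~0.223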
Loss Functions for Neural Networks

– Let $\boldsymbol{\Theta} = \{\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\}$ be the parameters of our neural network.

– Multi-class classification: cross-entropy loss again!
    – Express the label as a one-hot or one-of-$C$ vector, e.g., $\boldsymbol{y} = \begin{bmatrix} 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}$
    – Assume the neural network output is also a vector of length $C$, $\hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x})$, with $P(\boldsymbol{y}[c] = 1 \mid \boldsymbol{x}, \boldsymbol{\Theta}) = \hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x})[c]$
    – Then the cross-entropy loss is

$J^{(i)}(\boldsymbol{\Theta}) = -\log P(\boldsymbol{y}^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\Theta}) = -\sum_{c=1}^{C} \boldsymbol{y}^{(i)}[c] \log \hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})[c]$
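A minimal NumPy sketch (not from the slides) of the multi-class cross-entropy for one example; the one-hot label and predicted distribution below are illustrative values.

import numpy as np

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # J^(i) = -sum_c y[c] log(y_hat[c]); eps avoids log(0)
    return -np.sum(y_onehot * np.log(y_hat + eps))

y = np.array([0, 0, 1, 0])               # one-hot label (class 3 of C = 4)
y_hat = np.array([0.1, 0.2, 0.6, 0.1])   # predicted distribution over the C classes
print(cross_entropy(y, y_hat))           # -log(0.6) ~ 0.511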
Okay, but how do we get our network to output this vector?

– We need the network's output $\hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x})$ to be a length-$C$ vector of probabilities, so that $P(\boldsymbol{y}[c] = 1 \mid \boldsymbol{x}, \boldsymbol{\Theta}) = \hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x})[c]$ and the cross-entropy loss

$J^{(i)}(\boldsymbol{\Theta}) = -\sum_{c=1}^{C} \boldsymbol{y}^{(i)}[c] \log \hat{\boldsymbol{y}}_{\boldsymbol{\Theta}}(\boldsymbol{x}^{(i)})[c]$

is well-defined. The answer: a softmax output layer (next slide).
Softmax

– Output layer: apply the softmax function to the output-layer scores so that the entries form a probability distribution over the $C$ classes:

$\hat{y}_c = \frac{\exp(b_c)}{\sum_{k=1}^{C} \exp(b_k)}$ where $b_c = \sum_{j} \beta_{c,j} z_j$

– Hidden layer (as before): $z_j = \sigma(a_j)$ where $a_j = \sum_{m} \alpha_{j,m} x_m$
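A minimal NumPy sketch (not from the slides) of the softmax; subtracting the maximum score before exponentiating is a numerical-stability detail added here, and it does not change the result.

import numpy as np

def softmax(b):
    # y_c = exp(b_c) / sum_k exp(b_k)
    shifted = b - np.max(b)      # avoids overflow for large scores
    exp_b = np.exp(shifted)
    return exp_b / np.sum(exp_b)

b = np.array([2.0, 1.0, 0.1])
print(softmax(b))                # ~[0.659, 0.242, 0.099], sums to 1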
Two Questions (revisited)

1. What is this loss function $J^{(i)}$? Answered: squared error or cross-entropy, as above.
2. How on earth do we take these gradients?
Matrix Calculus: Types of Derivatives

[Table: the numerator of a derivative can be a scalar, a vector, or a matrix, and so can the denominator (the quantity the derivative is taken with respect to); each combination gives a different type of derivative. Table courtesy of Matt Gormley.]


Matrix Calculus: Denominator Layout

In denominator layout, derivatives of a scalar always have the same shape as the entity that the derivative is being taken with respect to: a scalar, vector, or matrix denominator gives a scalar, vector, or matrix derivative, respectively.

(Table courtesy of Matt Gormley)


Matrix Calculus: Denominator Layout

[Table: types of derivatives with scalar and vector numerators and scalar and vector denominators, laid out in denominator layout. Table courtesy of Matt Gormley.]


Three Approaches to Differentiation

– Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \partial f(\boldsymbol{x}) / \partial \boldsymbol{x}$

1. Finite difference method
    – Requires the ability to call $f(\boldsymbol{x})$
    – Great for checking the accuracy of implementations of more complex differentiation methods
    – Computationally expensive for high-dimensional inputs

2. Symbolic differentiation
    – Requires systematic knowledge of derivatives
    – Can be computationally expensive if poorly implemented

3. Automatic differentiation (reverse mode)
    – Requires systematic knowledge of derivatives and an algorithm for computing $f(\boldsymbol{x})$
    – Computational cost of computing $\partial f(\boldsymbol{x}) / \partial \boldsymbol{x}$ is proportional to the cost of computing $f(\boldsymbol{x})$
Approach 1: Finite Difference Method

– Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \partial f(\boldsymbol{x}) / \partial \boldsymbol{x}$

$\frac{\partial f(\boldsymbol{x})}{\partial x_i} \approx \frac{f(\boldsymbol{x} + \epsilon \boldsymbol{d}_i) - f(\boldsymbol{x} - \epsilon \boldsymbol{d}_i)}{2\epsilon}$

where $\boldsymbol{d}_i$ is a one-hot vector with a 1 in the $i$-th position.

[Figure: $f(x)$ evaluated at $x - \epsilon$ and $x + \epsilon$ around the point $x$.]

– We want $\epsilon$ to be small to get a good approximation, but we run into floating point issues when $\epsilon$ is too small.
– Getting the full gradient requires computing the above approximation for each dimension of the input.
Approach 1: Finite Difference Method Example

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

>>> from math import *
>>> y = lambda x,z: exp(x*z)+(x*z)/log(x)+sin(log(x))/(x*z)
>>> x = 2
>>> z = 3
>>> e = 10**-8
>>> dydx = (y(x+e,z)-y(x-e,z))/(2*e)
>>> dydz = (y(x,z+e)-y(x,z-e))/(2*e)
>>> print(dydx, dydz)

(Example courtesy of Matt Gormley)
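Getting the full gradient requires one such central difference per input dimension. Below is a minimal sketch (not from the slides) of a general-purpose finite-difference gradient, handy for checking other differentiation code; the function and variable names are illustrative.

import numpy as np

def finite_difference_grad(f, x, eps=1e-6):
    """Approximate the gradient of f: R^D -> R at x using central differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):              # one central difference per dimension
        d = np.zeros_like(x)
        d[i] = 1.0                       # one-hot direction d_i
        grad[i] = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
    return grad

f = lambda v: np.exp(v[0] * v[1]) + (v[0] * v[1]) / np.log(v[0]) \
    + np.sin(np.log(v[0])) / (v[0] * v[1])
print(finite_difference_grad(f, [2.0, 3.0]))   # approximates [dy/dx, dy/dz]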


Approach 2: Symbolic Differentiation

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

– Looks like we're gonna need the chain rule!

(Example courtesy of Matt Gormley)


The Chain Rule of Calculus

– If $y = f(z)$ and $z = g(x)$, then the corresponding computation graph is $x \to z \to y$, and

$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z} \frac{\partial z}{\partial x}$

– If $y = f(z_1, z_2)$ and $z_1 = g_1(x)$, $z_2 = g_2(x)$, then the computation graph has two paths from $x$ to $y$ (through $z_1$ and through $z_2$), and

$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z_1} \frac{\partial z_1}{\partial x} + \frac{\partial y}{\partial z_2} \frac{\partial z_2}{\partial x}$

– If $y = f(\boldsymbol{z})$ and $\boldsymbol{z} = g(x)$ with $\boldsymbol{z} \in \mathbb{R}^D$, then

$\frac{\partial y}{\partial x} = \sum_{j=1}^{D} \frac{\partial y}{\partial z_j} \frac{\partial z_j}{\partial x}$

(A quick numerical check of the vector case appears below.)
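A minimal Python sketch (not from the slides) checking the vector-valued chain rule numerically on a made-up example, $y = z_1 z_2$ with $z_1 = x^2$ and $z_2 = \sin x$; all names and values are illustrative.

import numpy as np

x = 1.3
z1, z2 = x**2, np.sin(x)                 # z = g(x)
# chain rule: dy/dx = (dy/dz1)(dz1/dx) + (dy/dz2)(dz2/dx), with y = z1 * z2
chain = z2 * (2 * x) + z1 * np.cos(x)

# central-difference check on y(x) = x^2 sin(x)
eps = 1e-6
y = lambda t: t**2 * np.sin(t)
numeric = (y(x + eps) - y(x - eps)) / (2 * eps)
print(chain, numeric)                    # the two values agree closely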
Poll Question 1

– If $y = f(\boldsymbol{z})$, $\boldsymbol{z} = g(\boldsymbol{w})$ and $\boldsymbol{w} = h(x)$, does the equation

$\frac{\partial y}{\partial x} = \sum_{j=1}^{D} \frac{\partial y}{\partial z_j} \frac{\partial z_j}{\partial x}$

still hold?

A. Yes
B. No
C. Only on Fridays (TOXIC)
Approach 2: Symbolic Differentiation

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

$\frac{\partial y}{\partial x} = \frac{\partial}{\partial x}\left[e^{xz}\right] + \frac{\partial}{\partial x}\left[\frac{xz}{\ln x}\right] + \frac{\partial}{\partial x}\left[\frac{\sin(\ln x)}{xz}\right]$

$\frac{\partial y}{\partial x} = z e^{xz} + \frac{z}{\ln x} - \frac{z}{(\ln x)^2} + \frac{\cos(\ln x)}{x^2 z} - \frac{\sin(\ln x)}{x^2 z}$

$\left.\frac{\partial y}{\partial x}\right|_{x=2,\, z=3} = 3e^6 + \frac{3}{\ln 2} - \frac{3}{(\ln 2)^2} + \frac{\cos(\ln 2)}{12} - \frac{\sin(\ln 2)}{12}$

$\frac{\partial y}{\partial z} = \frac{\partial}{\partial z}\left[e^{xz}\right] + \frac{\partial}{\partial z}\left[\frac{xz}{\ln x}\right] + \frac{\partial}{\partial z}\left[\frac{\sin(\ln x)}{xz}\right]$

$\left.\frac{\partial y}{\partial z}\right|_{x=2,\, z=3} = 2e^6 + \frac{2}{\ln 2} - \frac{\sin(\ln 2)}{18}$

(Example courtesy of Matt Gormley)
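A minimal SymPy sketch (not part of the original slides) that carries out the same symbolic differentiation automatically; the availability of sympy is assumed, and the numeric values in the comments are approximate.

import sympy as sp

x, z = sp.symbols('x z', positive=True)
y = sp.exp(x * z) + (x * z) / sp.log(x) + sp.sin(sp.log(x)) / (x * z)

dydx = sp.diff(y, x)                     # symbolic partial derivative w.r.t. x
dydz = sp.diff(y, z)                     # symbolic partial derivative w.r.t. z

print(sp.simplify(dydx))
print(float(dydx.subs({x: 2, z: 3})))    # ~1208.4, matching the finite-difference estimate
print(float(dydz.subs({x: 2, z: 3})))    # ~809.7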
Approach 3: Automatic Differentiation (reverse mode)

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

– First define some intermediate quantities, draw the computation graph, and run the "forward" computation (with $x = 2$, $z = 3$):

$a = xz$
$b = \ln x$
$c = \sin b$
$d = e^a$
$e = a / b$
$f = c / a$
$y = d + e + f$

[Computation graph: $x$ and $z$ feed the $*$ node producing $a$; $x$ feeds the $\ln$ node producing $b$; $b$ feeds the $\sin$ node producing $c$; $a$ feeds the $\exp$ node producing $d$; $a$ and $b$ feed the $/$ node producing $e$; $c$ and $a$ feed the $/$ node producing $f$; $d$, $e$, and $f$ feed the $+$ node producing $y$.]

(Example courtesy of Matt Gormley)
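A minimal Python sketch (not from the slides) of the forward computation, storing every intermediate value so the backward sweep on the next slide can reuse them; the dictionary representation is an illustrative choice.

import math

def forward_pass(x, z):
    a = x * z
    b = math.log(x)
    c = math.sin(b)
    d = math.exp(a)
    e = a / b
    f = c / a
    y = d + e + f
    return {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f, 'y': y}

vals = forward_pass(2.0, 3.0)
print(vals['y'])     # value of f(2, 3)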
Approach 3: Automatic Differentiation (reverse mode)

– Given

$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$

what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

– Then compute partial derivatives, starting from $y$ and working back through the computation graph (a sketch of this backward sweep in code follows below).

(Example courtesy of Matt Gormley)
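A minimal Python sketch (not from the slides) of the reverse sweep for this computation graph. Each adjoint variable (g_y, g_d, and so on; the names are illustrative) holds $\partial y / \partial(\text{that intermediate})$; contributions are accumulated child by child, working back from $y$.

import math

def reverse_mode(x, z):
    # forward pass (same intermediates as above)
    a = x * z
    b = math.log(x)
    c = math.sin(b)
    d = math.exp(a)
    e = a / b
    f = c / a
    y = d + e + f

    # backward pass: adjoints g_v = dy/dv, starting from g_y = 1
    g_y = 1.0
    g_d = g_y * 1.0                      # y = d + e + f
    g_e = g_y * 1.0
    g_f = g_y * 1.0
    g_c = g_f * (1.0 / a)                # f = c / a
    g_a = g_d * math.exp(a) \
        + g_e * (1.0 / b) \
        + g_f * (-c / a**2)              # a feeds d, e, and f
    g_b = g_e * (-a / b**2) \
        + g_c * math.cos(b)              # b feeds e and c
    g_x = g_a * z + g_b * (1.0 / x)      # a = x*z, b = ln(x)
    g_z = g_a * x                        # a = x*z
    return y, g_x, g_z

print(reverse_mode(2.0, 3.0))            # dy/dx ~ 1208.4, dy/dz ~ 809.7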
Automatic Differentiation

– Given $f: \mathbb{R}^D \to \mathbb{R}^C$, compute $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \partial f(\boldsymbol{x}) / \partial \boldsymbol{x}$

3. Automatic differentiation (reverse mode)
    – Requires systematic knowledge of derivatives and an algorithm for computing $f(\boldsymbol{x})$
    – Computational cost of computing $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})_c = \partial f(\boldsymbol{x})_c / \partial \boldsymbol{x}$ is proportional to the cost of computing $f(\boldsymbol{x})$
    – Great for high-dimensional inputs and low-dimensional outputs ($D \gg C$)

4. Automatic differentiation (forward mode)
    – Requires systematic knowledge of derivatives and an algorithm for computing $f(\boldsymbol{x})$
    – Computational cost of computing $\partial f(\boldsymbol{x}) / \partial x_d$ is proportional to the cost of computing $f(\boldsymbol{x})$
    – Great for low-dimensional inputs and high-dimensional outputs ($D \ll C$)
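For reference, a minimal JAX sketch (not part of the slides) contrasting the two modes: jax.jacrev builds the Jacobian with reverse mode (one sweep per output) and jax.jacfwd with forward mode (one sweep per input). The small function f below is purely illustrative.

import jax
import jax.numpy as jnp

def f(x):
    # a small R^D -> R^C function (D = 3 inputs, C = 2 outputs)
    return jnp.array([jnp.sum(x ** 2), jnp.prod(jnp.sin(x))])

x = jnp.array([2.0, 3.0, 0.5])
J_rev = jax.jacrev(f)(x)             # reverse mode: efficient when C is small
J_fwd = jax.jacfwd(f)(x)             # forward mode: efficient when D is small
print(jnp.allclose(J_rev, J_fwd))    # both give the same C x D Jacobian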
Computation Graph: 10-301/601 Conventions

– The diagram represents an algorithm.
– Nodes are rectangles, with one node per intermediate variable in the algorithm.
– Each node is labeled with the function that it computes (inside the box) and the variable name (outside the box).
– Edges are directed and do not have labels.
– For neural networks:
    – Each weight, feature value, label, and bias term appears as a node.
    – We can include the loss function.
Neural Network Diagram Conventions

– The diagram represents a neural network.
– Nodes are circles, with one node per hidden unit.
– Each node is labeled with the variable corresponding to the hidden unit.
– Edges are directed and each edge is labeled with its weight.
– Following standard convention, the bias term is typically not shown as a node; rather, it is assumed to be part of the activation function, i.e., its weight does not appear anywhere in the picture.
– The diagram typically does not include any nodes related to the loss computation.
Backprop Learning Objectives

You should be able to…
– Differentiate between a neural network diagram and a computation graph
– Construct a computation graph for a function as specified by an algorithm
– Carry out backpropagation on an arbitrary computation graph
– Construct a computation graph for a neural network, identifying all the given and intermediate quantities that are relevant
– Instantiate the backpropagation algorithm for a neural network
– Instantiate an optimization method (e.g., SGD) and a regularizer (e.g., L2) when the parameters of a model are comprised of several matrices corresponding to different layers of a neural network
– Use the finite difference method to evaluate the gradient of a function
– Identify when the gradient of a function can be computed at all and when it can be computed efficiently
– Employ basic matrix calculus to compute vector/matrix/tensor derivatives
