Lecture04 Neuralnets

This document provides a summary of a lecture on computing gradients for neural networks by hand and algorithmically using backpropagation. The lecture covers matrix calculus for computing gradients efficiently, examples of computing Jacobians, and applying the chain rule to compute gradients layer-by-layer in a neural network. It also discusses representing neural network computations as a computation graph and using backpropagation to efficiently compute gradients by passing error signals backward through the graph.


Natural Language Processing
with Deep Learning
CS224N/Ling284

Christopher Manning and Richard Socher

Lecture 4: Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm)
1. Introduction
Assignment 2 is all about making sure you really understand the
math of neural networks … then we’ll let the software do it!

We’ll go through it quickly today, but also look at the readings!

This will be a tough week for some!
Make sure to get help if you need it:
• Visit office hours Friday/Tuesday
• Note: Monday is MLK Day – No office hours, sorry! But we will be on Piazza
• Read tutorial materials given in the syllabus

2
NER: Binary classification for center word being location

• We do supervised training and want high score if it’s a location

J_t(θ) = σ(s) = 1 / (1 + e^(−s))

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]


3
Remember: Stochastic Gradient Descent
Update equation:

𝜃_new = 𝜃_old − 𝛼 ∇_𝜃 J(𝜃)

𝛼 = step size or learning rate

How can we compute ∇_𝜃 J(𝜃)?


1. By hand
2. Algorithmically: the backpropagation algorithm

4
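To make the update concrete, here is a minimal sketch of one SGD step in NumPy, using a toy objective whose gradient we can write by hand (the function and variable names are illustrative, not course code):

```python
import numpy as np

def sgd_step(theta, grad, alpha=0.1):
    """One SGD update: theta_new = theta_old - alpha * grad."""
    return theta - alpha * grad

# Toy objective J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
for _ in range(50):
    grad = 2 * theta            # gradient computed "by hand" for this toy J
    theta = sgd_step(theta, grad)

print(theta)                    # approaches the minimizer at 0
```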
Lecture Plan
Lecture 4: Gradients by hand and algorithmically
1. Introduction (5 mins)
2. Matrix calculus (40 mins)
3. Backpropagation (35 mins)

5
Computing Gradients by Hand
• Matrix calculus: Fully vectorized gradients
• “multivariable calculus is just like single-variable calculus if
you use matrices”
• Much faster and more useful than non-vectorized gradients
• But doing a non-vectorized gradient can be good for
intuition; watch last week’s lecture for an example
• Lecture notes and matrix calculus notes cover this
material in more detail
• You might also review Math 51, which has a new online
textbook:
http://web.stanford.edu/class/math51/textbook.html
6
Gradients
• Given a function with 1 output and 1 input: f(x) = x³
• Its gradient (slope) is its derivative: df/dx = 3x²

“How much will the output change if we change the input a bit?”

7
Gradients
• Given a function with 1 output and n inputs
• Its gradient is a vector of partial derivatives with respect to each input

8
Jacobian Matrix: Generalization of the Gradient
• Given a function with m outputs and n inputs

• Its Jacobian is an m × n matrix of partial derivatives

9
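For reference, these definitions can be written out explicitly (standard statements, consistent with the lecture notes):

For f: ℝⁿ → ℝ,   ∇f = [ ∂f/∂x₁ , ∂f/∂x₂ , … , ∂f/∂xₙ ]

For 𝒇: ℝⁿ → ℝᵐ,  the Jacobian is the m × n matrix with entries (∂𝒇/∂𝒙)_ij = ∂f_i/∂x_j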
Chain Rule

• For one-variable functions: multiply derivatives

• For multiple variables at once: multiply Jacobians

10
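Written out, for 𝒉 = f(𝒛) and 𝒛 = g(𝒙) (so 𝒉 = f(g(𝒙))), the multivariable chain rule multiplies the Jacobians (a standard statement, matching Tip 2 later in the lecture):

∂𝒉/∂𝒙 = (∂𝒉/∂𝒛)(∂𝒛/∂𝒙)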
Example Jacobian: Elementwise activation Function

Function has n outputs and n inputs → n by n Jacobian

11–15
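The result these slides build toward can be stated compactly (check it against the lecture notes): for 𝒉 = f(𝒛) with f applied elementwise,

(∂𝒉/∂𝒛)_ij = ∂h_i/∂z_j = f′(z_i) if i = j, and 0 otherwise

i.e. ∂𝒉/∂𝒛 = diag(f′(𝒛)), a diagonal n × n matrix.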
Other Jacobians

• Compute these at home for practice!
• Check your answers with the lecture notes

Fine print on one of them: the Jacobian form is the correct Jacobian. Later we discuss the “shape convention”; using it, the answer would be h.

16–19
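For checking your practice answers against the lecture notes, the Jacobians in question are (stated here for convenience, not verbatim from the slides):

∂(𝑾𝒙 + 𝒃)/∂𝒙 = 𝑾
∂(𝑾𝒙 + 𝒃)/∂𝒃 = 𝑰   (the identity matrix)
∂(𝒖ᵀ𝒉)/∂𝒖 = 𝒉ᵀ   (the one the fine print refers to; under the shape convention the answer would be 𝒉)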
Back to our Neural Net!

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]


20
Back to our Neural Net!
• Let’s find ∂s/∂𝒃
• Really, we care about the gradient of the loss J, but we will compute the gradient of the score s for simplicity

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]


21
1. Break up equations into simple pieces

22
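Written out, the window-classifier score from the earlier slides breaks into these pieces (listed here for reference, consistent with how they are used in the rest of the lecture):

𝒙   (input: concatenated word vectors)
𝒛 = 𝑾𝒙 + 𝒃
𝒉 = f(𝒛)   (elementwise nonlinearity)
s = 𝒖ᵀ𝒉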
2. Apply the chain rule

23–26
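For the gradient of the score with respect to 𝒃, the chain rule gives the decomposition (a standard statement matching the pieces above):

∂s/∂𝒃 = (∂s/∂𝒉)(∂𝒉/∂𝒛)(∂𝒛/∂𝒃)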
3. Write out the Jacobians

Useful Jacobians from previous slide

27–31
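Plugging in the Jacobians listed above gives (check against the lecture notes):

∂s/∂𝒃 = (∂s/∂𝒉)(∂𝒉/∂𝒛)(∂𝒛/∂𝒃) = 𝒖ᵀ · diag(f′(𝒛)) · 𝑰 = 𝒖ᵀ diag(f′(𝒛))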
Re-using Computation

• Suppose we now want to compute ∂s/∂𝑾
• Using the chain rule again:

The same first two factors as before! Let’s avoid duplicated computation…

𝛿 is the local error signal

32–34
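In symbols, the shared piece is (consistent with the 𝛿 defined above):

𝜹 = ∂s/∂𝒛 = (∂s/∂𝒉)(∂𝒉/∂𝒛) = 𝒖ᵀ diag(f′(𝒛))

so ∂s/∂𝒃 = 𝜹 and ∂s/∂𝑾 = 𝜹 ∂𝒛/∂𝑾, and 𝜹 only needs to be computed once.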
Derivative with respect to Matrix: Output shape

• What does ∂s/∂𝑾 look like?
• 1 output, nm inputs: 1 by nm Jacobian?
• Inconvenient to do

• Instead we use the shape convention: the shape of the gradient is the shape of the parameters
• So ∂s/∂𝑾 is n by m:

35–36
Derivative with respect to Matrix

• Remember: ∂s/∂𝑾 = 𝜹 ∂𝒛/∂𝑾
• 𝜹 is going to be in our answer
• The other term should be 𝒙ᵀ, because 𝒛 = 𝑾𝒙 + 𝒃

• Answer is: ∂s/∂𝑾 = 𝜹ᵀ 𝒙ᵀ

𝛿 is the local error signal at 𝑧
𝑥 is the local input signal
37
Why the Transposes?

• Hacky answer: this makes the dimensions work out!


• Useful trick for checking your work!

• Full explanation in the lecture notes; intuition next


• Each input goes to each output – you get outer product
38
Why the Transposes?

39
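As a dimension check (the “hacky answer” above): with 𝑾 of shape n × m, 𝒙 ∈ ℝᵐ, and 𝜹 = ∂s/∂𝒛 a 1 × n row vector, the outer product

∂s/∂𝑾 = 𝜹ᵀ 𝒙ᵀ   has shape (n × 1)(1 × m) = n × m,

which matches the shape of 𝑾, as the shape convention requires.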
Deriving local input gradient in backprop
• For this function:
  ∂s/∂𝑾 = 𝜹 ∂𝒛/∂𝑾 = 𝜹 ∂(𝑾𝒙 + 𝒃)/∂𝑾
• Let’s consider the derivative of a single weight W_ij
• W_ij only contributes to z_i
• For example: W_23 is only used to compute z_2, not z_1

[figure: a one-layer net with inputs x_1, x_2, x_3 and bias +1, hidden units h_1 = f(z_1) and h_2 = f(z_2), and score s = 𝒖ᵀ𝒉; W_23 is the weight from x_3 to z_2, and b_2 the bias on z_2]

  ∂z_i/∂W_ij = ∂(𝑾_i· 𝒙 + b_i)/∂W_ij = ∂/∂W_ij Σ_k W_ik x_k = x_j
40
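As a sanity check of this derivation, here is a small NumPy sketch that compares the analytic gradient ∂s/∂𝑾 = 𝜹ᵀ𝒙ᵀ against a numeric gradient (the sizes, the choice of f = tanh, and the finite-difference check are illustrative assumptions, not course-provided code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6                                  # hidden size, input size (arbitrary)
W, b = rng.normal(size=(n, m)), rng.normal(size=n)
u, x = rng.normal(size=n), rng.normal(size=m)

def score(W):
    z = W @ x + b
    return u @ np.tanh(z)                    # s = u^T f(Wx + b), with f = tanh

# Analytic gradient: delta = u^T diag(f'(z)), then ds/dW = delta^T x^T
z = W @ x + b
delta = u * (1 - np.tanh(z) ** 2)            # tanh'(z) = 1 - tanh(z)^2
analytic = np.outer(delta, x)                # delta^T x^T, shape n x m

# Numeric gradient via central differences
h = 1e-4
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        numeric[i, j] = (score(Wp) - score(Wm)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))    # should be tiny (around 1e-8)
```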
What shape should derivatives be?

• ∂s/∂𝒃 is a row vector
• But convention says our gradient should be a column vector, because 𝒃 is a column vector…

• Disagreement between Jacobian form (which makes the chain rule easy) and the shape convention (which makes implementing SGD easy)
• We expect answers to follow the shape convention
• But Jacobian form is useful for computing the answers

41
What shape should derivatives be?
Two options:
1. Use Jacobian form as much as possible, reshape to follow the convention at the end:
   • What we just did. But at the end transpose ∂s/∂𝒃 to make the derivative a column vector, resulting in 𝜹ᵀ

2. Always follow the convention
   • Look at dimensions to figure out when to transpose and/or reorder terms

42
Deriving gradients: Tips
• Tip 1: Carefully define your variables and keep track of their
dimensionality!
• Tip 2: Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:
∂𝒚/∂𝒙 = (∂𝒚/∂𝒖)(∂𝒖/∂𝒙)
Keep straight what variables feed into what computations
• Tip 3: For the top softmax part of a model: first consider the derivative wrt f_c when c = y (the correct class), then consider the derivative wrt f_c when c ≠ y (all the incorrect classes)
• Tip 4: Work out element-wise partial derivatives if you’re getting confused by matrix calculus!
• Tip 5: Use the Shape Convention. Note: the error message 𝜹 that arrives at a hidden layer has the same dimensionality as that hidden layer

43
3. Backpropagation

We’ve almost shown you backpropagation


It’s taking derivatives and using the (generalized,
multivariate, or matrix) chain rule
Other trick:
We re-use derivatives computed for higher layers in
computing derivatives for lower layers to minimize
computation

44
Computation Graphs and Backpropagation

• We represent our neural net equations as a graph
  • Source nodes: inputs
  • Interior nodes: operations
  • Edges pass along the result of the operation
• “Forward Propagation”: evaluate the graph from the inputs through to the output s

[figure: computation graph for s = 𝒖ᵀ𝒉, 𝒉 = f(𝒛), 𝒛 = 𝑾𝒙 + 𝒃]

45–47
Backpropagation

• Go backwards along edges
• Pass along gradients

[figure: the same computation graph, with gradients passed backward from s toward 𝒃, 𝑾, and 𝒙]
48
Backpropagation: Single Node

• A node receives an “upstream gradient”; the goal is to pass on the correct “downstream gradient”
• Each node has a local gradient: the gradient of its output with respect to its input
• Chain rule: [downstream gradient] = [upstream gradient] × [local gradient]

[figure: a single node computing h = f(z); the upstream gradient arrives on the output edge, is multiplied by the local gradient ∂h/∂z, and the product is passed back along the input edge as the downstream gradient]

49–52
Backpropagation: Single Node

• What about nodes with multiple inputs?
• Multiple inputs → multiple local gradients, one per input; each input gets its own downstream gradient, equal to the upstream gradient times that input’s local gradient

53–54
An Example

f(x, y, z) = (x + y) · max(y, z), evaluated at x = 1, y = 2, z = 0

Forward prop steps:
  a = x + y = 1 + 2 = 3
  b = max(y, z) = max(2, 0) = 2
  f = a · b = 3 · 2 = 6

Local gradients:
  ∂f/∂a = b = 2    ∂f/∂b = a = 3
  ∂a/∂x = 1    ∂a/∂y = 1
  ∂b/∂y = 1 (since y > z)    ∂b/∂z = 0

Backward pass (upstream * local = downstream):
  ∂f/∂f = 1
  ∂f/∂a = 1 * 2 = 2    ∂f/∂b = 1 * 3 = 3
  through +:   ∂f/∂x = 2 * 1 = 2, and a contribution of 2 * 1 = 2 to ∂f/∂y
  through max: a contribution of 3 * 1 = 3 to ∂f/∂y, and ∂f/∂z = 3 * 0 = 0

55–65
Gradients sum at outward branches

• When a variable feeds into several parts of the graph, the gradients arriving back along each branch are added
• In the example, y feeds both + and max, so ∂f/∂y = 2 + 3 = 5

66–67
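A tiny script mirroring this worked example (assuming the reconstructed form f(x, y, z) = (x + y) · max(y, z); illustrative, not taken from the slides) makes the branch-summing concrete:

```python
# Forward pass at x = 1, y = 2, z = 0
x, y, z = 1.0, 2.0, 0.0
a = x + y              # 3
b = max(y, z)          # 2
f = a * b              # 6

# Backward pass: upstream * local = downstream
df_df = 1.0
df_da = df_df * b      # 2
df_db = df_df * a      # 3
df_dx = df_da * 1.0                                       # + distributes: 2
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)     # branches sum: 2 + 3 = 5
df_dz = df_db * (1.0 if z > y else 0.0)                   # max routes: 0

print(df_dx, df_dy, df_dz)   # 2.0 5.0 0.0
```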
Node Intuitions

• + “distributes” the upstream gradient to each summand
• max “routes” the upstream gradient: the larger input receives it, the other receives 0
• * “switches” the upstream gradient: each input’s downstream gradient is the upstream gradient times the other input’s value

68–70
Efficiency: compute all gradients at once

• Incorrect way of doing backprop:
  • First compute ∂s/∂𝒃
  • Then independently compute ∂s/∂𝑾
  • Duplicated computation!
• Correct way:
  • Compute all the gradients at once
  • Analogous to using 𝜹 when we computed gradients by hand

71–73
Back-Prop in General Computation Graph

Single scalar output z

1. Fprop: visit nodes in topological sort order
   • Compute the value of each node given its predecessors
2. Bprop:
   • Initialize the output gradient to 1
   • Visit nodes in reverse order: compute the gradient wrt each node using the gradients wrt its successors
     ∂z/∂x = Σ_{y ∈ successors of x} (∂z/∂y)(∂y/∂x)

Done correctly, the big-O complexity of fprop and bprop is the same.
In general our nets have a regular layer structure, and so we can use matrices and Jacobians…

74
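A minimal sketch of this general procedure, assuming each node object stores its predecessors and knows how to compute its own value and local gradients (the Node interface here is hypothetical, not a framework API):

```python
def fprop(nodes):
    """nodes: list of graph nodes in topological order.
    Each node computes its value from its predecessors' values
    (input nodes have no predecessors and just return their stored value)."""
    for node in nodes:
        node.value = node.forward([p.value for p in node.predecessors])

def bprop(nodes):
    """Visit nodes in reverse topological order; the last node is the scalar output."""
    grads = {node: 0.0 for node in nodes}
    grads[nodes[-1]] = 1.0                        # initialize output gradient = 1
    for node in reversed(nodes):
        upstream = grads[node]
        for pred, local in zip(node.predecessors, node.backward()):
            grads[pred] += upstream * local       # gradients sum at outward branches
    return grads
```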
Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop
• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output
• Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you, but mainly leave it to the layer/node writer to hand-calculate the local derivative

75
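For instance, in PyTorch the window-classifier score from this lecture can be differentiated automatically; a brief sketch (the tensor sizes are arbitrary assumptions):

```python
import torch

x = torch.randn(6)                                # input (e.g., concatenated word vectors)
W = torch.randn(4, 6, requires_grad=True)
b = torch.randn(4, requires_grad=True)
u = torch.randn(4, requires_grad=True)

s = u @ torch.tanh(W @ x + b)                     # forward pass records the computation graph
s.backward()                                      # backprop through the recorded graph

print(W.grad.shape, b.grad.shape, u.grad.shape)   # gradients follow the shape convention
```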
Backprop Implementations

76
Implementation: forward/backward API

77–78
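In the spirit of what these implementation slides illustrate, here is a minimal sketch of a forward/backward API for a single gate (class and method names are illustrative, not a specific framework's API):

```python
class MultiplyGate:
    """A node that computes z = x * y and knows its local gradients."""
    def forward(self, x, y):
        self.x, self.y = x, y          # cache inputs for the backward pass
        return x * y

    def backward(self, dz):
        # downstream = upstream * local; * "switches" the inputs
        dx = dz * self.y
        dy = dz * self.x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)           # 6.0
dx, dy = gate.backward(1.0)            # (2.0, 3.0)
print(out, dx, dy)
```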
Manual Gradient checking: Numeric Gradient

• For small h (≈ 1e-4),  f′(x) ≈ ( f(x + h) − f(x − h) ) / 2h


• Easy to implement correctly
• But approximate and very slow:
• Have to recompute f for every parameter of our model

• Useful for checking your implementation


• In the old days when we hand-wrote everything, it was key
to do this everywhere.
• Now much less needed, when throwing together layers
79
Summary

We’ve mastered the core technology of neural nets! 🎉

• Backpropagation: recursively (and hence efficiently) apply the chain rule along the computation graph
  • [downstream gradient] = [upstream gradient] x [local gradient]
• Forward pass: compute results of operations and save intermediate values
• Backward pass: apply the chain rule to compute gradients
80
Why learn all these details about gradients?
• Modern deep learning frameworks compute gradients for you!

• But why take a class on compilers or systems when they are implemented for you?
  • Understanding what is going on under the hood is useful!
• Backpropagation doesn’t always work perfectly
  • Understanding why is crucial for debugging and improving models
  • See Karpathy article (in syllabus):
    https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
  • Example in future lecture: exploding and vanishing gradients
81
