This document discusses backpropagation, which is an algorithm for efficiently computing gradients in neural networks. It motivates backpropagation by explaining that neural network training involves minimizing a loss function over parameters. It then covers numerical gradient estimation versus analytical gradients using the chain rule. The key steps of backpropagation are outlined as identifying intermediate functions during forward propagation, computing local gradients using the chain rule, and combining gradients to get the full gradient. Matrix calculus rules for derivatives with respect to vectors and matrices are presented. Tips for implementing backpropagation include writing the computation graph, tracking error signals, computing the loss derivative, and enforcing shape rules on gradients.


Backpropagation

TA: Yi Wen

April 17, 2020


CS231n Discussion Section

Slide credits: Barak Oshri, Vincent Chen, Nish Khandwala, Yi Wen


Agenda
● Motivation
● Backprop Tips & Tricks
● Matrix calculus primer
Motivation
Recall: the optimization objective is to minimize the loss.

Goal: how should we tweak the parameters to decrease the loss?


Agenda
● Motivation
● Backprop Tips & Tricks
● Matrix calculus primer
A Simple Example
Loss

Goal: Tweak the parameters to minimize loss

=> minimize a multivariable function in parameter space


A Simple Example

=> minimize a multivariable function

Plotted on WolframAlpha
Approach #1: Random Search
Intuition: take random steps in the function's domain (parameter space) and keep whichever lowers the loss.
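A minimal sketch of this approach; the loss function, step size, and iteration count below are illustrative, not part of the slides:

```python
import numpy as np

def random_search(loss, theta, step=1e-3, iters=1000):
    """Perturb the parameters randomly; keep a step only if it lowers the loss."""
    best = loss(theta)
    for _ in range(iters):
        candidate = theta + step * np.random.randn(*theta.shape)
        value = loss(candidate)
        if value < best:
            theta, best = candidate, value
    return theta

# Usage on a toy quadratic loss:
theta0 = np.random.randn(2)
theta_star = random_search(lambda t: np.sum(t ** 2), theta0)
```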
Approach #2: Numerical Gradient
Intuition: the rate of change of the function with respect to a variable, measured over a small surrounding region.

Finite Differences:
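A minimal numpy sketch of the finite-difference estimate, assuming a scalar-valued function f; the central-difference form (f(x+h) - f(x-h)) / (2h) and the step size h are the usual choices:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Estimate df/dx at x with central finite differences: (f(x+h) - f(x-h)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)            # f evaluated with x_i nudged up
        x.flat[i] = old - h
        f_minus = f(x)           # f evaluated with x_i nudged down
        x.flat[i] = old          # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad
```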
Approach #3: Analytical Gradient
Recall: partial derivative by limit definition
Recall: chain rule

Intuition: upstream gradient values propagate backwards, so we can reuse them!
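A tiny numeric sketch of this reuse, with an illustrative compound function f(x, y, z) = (x + y) * z (not necessarily the example on the slide):

```python
# Illustrative: f(x, y, z) = (x + y) * z, evaluated at a single point.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: record intermediate values.
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: local gradients combined with upstream gradients.
df_dq = z            # local gradient of f w.r.t. q
df_dz = q            # local gradient of f w.r.t. z
df_dx = df_dq * 1.0  # chain rule: reuse the upstream value df/dq
df_dy = df_dq * 1.0  # same upstream value reused again

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```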


Gradient

“direction and rate of fastest increase”

Numerical Gradient vs Analytical Gradient
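In practice the two are combined in a gradient check: compute the analytical gradient, estimate the numerical one, and compare their relative error. A minimal sketch with an illustrative function:

```python
import numpy as np

def f(x):
    return np.sum(x ** 3)            # illustrative scalar function

def analytic_grad(x):
    return 3 * x ** 2                # its known analytical gradient

x = np.random.randn(5)
h = 1e-5
numeric = np.zeros_like(x)
for i in range(x.size):
    xp, xm = x.copy(), x.copy()
    xp[i] += h
    xm[i] -= h
    numeric[i] = (f(xp) - f(xm)) / (2 * h)   # central finite difference

analytic = analytic_grad(x)
rel_error = np.abs(analytic - numeric) / np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
print(rel_error.max())               # tiny (~1e-9) when the analytical gradient is correct
```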


What about Autograd?
● Deep learning frameworks can automatically perform backprop!
● Problems related to the underlying gradients might still surface when debugging your models.

“Yes You Should Understand Backprop”

https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
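For concreteness, a minimal sketch of what a framework's autograd does for you, assuming PyTorch; the shapes and loss below are illustrative. You still want to be able to derive the gradients by hand when W.grad looks wrong:

```python
import torch

# Forward pass: operations on tensors with requires_grad build a computation graph.
W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(2)
y = torch.tensor([1.0, 0.0, -1.0])
loss = ((W @ x - y) ** 2).sum()

# Backward pass: autograd applies the chain rule through the recorded graph.
loss.backward()
print(W.grad.shape)   # torch.Size([3, 2]), same shape as W
```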
Problem Statement: Backpropagation

Given a function f of inputs x, labels y, and parameters 𝜃,
compute the gradient of the loss with respect to 𝜃.
Problem Statement: Backpropagation
An algorithm for computing the gradient of a compound function as a series of
local, intermediate gradients:

1. Identify intermediate functions (forward prop)
2. Compute local gradients (chain rule)
3. Combine with the upstream error signal to get the full gradient

Forward:  input x, parameters W, b  →  local(x, W, b) => y  (output)
Backward: upstream gradient dy  →  dx, dW, db <= grad_local(dy, x, W, b)
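A minimal numpy sketch of this modular interface, using an affine layer y = xW + b as an illustrative example; the class and method names are assumptions, not the assignment's API:

```python
import numpy as np

class Affine:
    """local(x, W, b) => y  and  grad_local(dy, x, W, b) => dx, dW, db"""
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        self.x = x                     # cache the input for the backward pass
        return x @ self.W + self.b     # y = xW + b

    def backward(self, dy):
        dx = dy @ self.W.T             # gradient passed back to the input
        dW = self.x.T @ dy             # gradient w.r.t. the weights
        db = dy.sum(axis=0)            # gradient w.r.t. the bias
        return dx, dW, db
```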
Modularity: Previous Example

Compound function

Intermediate Variables
(forward propagation)
Modularity: 2-Layer Neural Network

Compound function

Intermediate Variables
(forward propagation)

=> squared Euclidean distance between the network output and the target
Intermediate Variables (forward propagation): f(x; W, b) = Wx + b

(↑ lecture note) Input: one feature vector
(← here) Input: a batch of data (matrix)

1. intermediate functions
2. local gradients
3. full gradients

Intermediate Variables (forward propagation)  |  Intermediate Gradients (backward propagation)
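Putting the three steps together, a minimal numpy sketch of a 2-layer network with a squared-Euclidean-distance loss; the layer sizes, the ReLU nonlinearity, and the variable names are illustrative assumptions:

```python
import numpy as np

N, D, H, M = 4, 3, 5, 2                 # batch, input, hidden, output sizes
x = np.random.randn(N, D)
y = np.random.randn(N, M)               # targets
W1, b1 = np.random.randn(D, H), np.zeros(H)
W2, b2 = np.random.randn(H, M), np.zeros(M)

# 1. Intermediate functions (forward propagation)
a = x @ W1 + b1                         # first affine layer
h = np.maximum(0, a)                    # ReLU nonlinearity (assumed)
s = h @ W2 + b2                         # second affine layer (scores)
loss = 0.5 * np.sum((s - y) ** 2)       # squared Euclidean distance

# 2. Local gradients combined with upstream error signals (backward propagation)
ds = s - y                              # dloss/ds
dW2 = h.T @ ds                          # dloss/dW2, shape (H, M) like W2
db2 = ds.sum(axis=0)
dh = ds @ W2.T                          # upstream gradient for the hidden layer
da = dh * (a > 0)                       # ReLU gate: pass gradient only where a > 0

# 3. Full gradients w.r.t. the remaining parameters
dW1 = x.T @ da                          # shape (D, H) like W1
db1 = da.sum(axis=0)
```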
Agenda
● Motivation
● Backprop Tips & Tricks
● Matrix calculus primer
Derivative w.r.t. Vector

Scalar-by-Vector

Vector-by-Vector
Derivative w.r.t. Vector: Chain Rule
1. intermediate functions
2. local gradients
3. full gradients

?
Derivative w.r.t. Vector: Takeaway
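In the vector-by-vector case each local gradient is a Jacobian matrix and the chain rule becomes a product of Jacobians. A small numeric sketch with illustrative functions, materializing the Jacobians explicitly only to make the shapes visible (in practice they are rarely formed):

```python
import numpy as np

# Illustrative composition: x -> z = W x -> y = relu(z), with scalar loss L = sum(y).
D, M = 3, 4
W = np.random.randn(M, D)
x = np.random.randn(D)

z = W @ x                                 # intermediate function
y = np.maximum(0, z)                      # elementwise nonlinearity
L = y.sum()

# Local Jacobians (vector-by-vector derivatives):
dy_dz = np.diag((z > 0).astype(float))    # (M, M) diagonal Jacobian of relu
dz_dx = W                                 # (M, D) Jacobian of z = Wx

# Chain rule: dL/dx = dL/dy · dy/dz · dz/dx
dL_dy = np.ones(M)
dL_dx = dL_dy @ dy_dz @ dz_dx             # shape (D,), matches x
print(dL_dx.shape)
```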
Derivative w.r.t. Matrix

Scalar-by-Matrix

Vector-by-Matrix ?
Derivative w.r.t. Matrix: Dimension Balancing

When you take scalar-by-matrix gradients, the gradient has the shape of the denominator.

● Dimension balancing is the “cheap” but effective approach to gradient calculations in most practical settings.
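A minimal sketch of dimension balancing for y = Wx with a scalar loss L (shapes illustrative): since dL/dW must have the shape of W and dL/dx the shape of x, the outer product and the transposed multiply are the only combinations that fit:

```python
import numpy as np

D, M = 3, 4
W = np.random.randn(M, D)        # (M, D)
x = np.random.randn(D)           # (D,)
y = W @ x                        # (M,)
dL_dy = np.random.randn(M)       # upstream gradient, same shape as y

# dL/dW must have shape (M, D), the shape of the denominator W.
# The only way to build an (M, D) array from dL_dy (M,) and x (D,) is an outer product:
dL_dW = np.outer(dL_dy, x)       # (M, D)

# dL/dx must have shape (D,); the only consistent combination is W^T dL_dy:
dL_dx = W.T @ dL_dy              # (D,)
assert dL_dW.shape == W.shape and dL_dx.shape == x.shape
```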
Derivative w.r.t. Matrix: Takeaway
1. intermediate functions
2. local gradients
3. full gradients

Intermediate Variables (forward propagation)  |  Intermediate Gradients (backward propagation)
Backprop Menu for Success

1. Write down variable graph
2. Keep track of error signals
3. Compute derivative of loss function
4. Enforce shape rule on error signals, especially when deriving over a linear transformation
Vector-by-vector

?
Matrix multiplication [Backprop]

? ?
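A minimal numpy sketch of the backward pass through a matrix multiplication Y = XW (shapes illustrative); both gradients follow from dimension balancing:

```python
import numpy as np

N, D, M = 4, 3, 5
X = np.random.randn(N, D)
W = np.random.randn(D, M)

Y = X @ W                    # forward: (N, M)
dY = np.random.randn(N, M)   # upstream error signal, same shape as Y

dX = dY @ W.T                # (N, D), matches X
dW = X.T @ dY                # (D, M), matches W
assert dX.shape == X.shape and dW.shape == W.shape
```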
Elementwise function [Backprop]
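And a minimal sketch for an elementwise function, using sigmoid as an illustrative example; the backward pass multiplies the upstream gradient elementwise by the local derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.random.randn(4, 3)
Y = sigmoid(X)                 # forward: applied elementwise

dY = np.random.randn(4, 3)     # upstream error signal, same shape as Y
dX = dY * Y * (1.0 - Y)        # local derivative of sigmoid is y(1 - y), applied elementwise
assert dX.shape == X.shape
```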
