Machine Learning - Exercise 4: Companion Slides

This document provides an overview of backpropagation and training neural networks. It recaps the basics of neural networks and backpropagation, including calculating the derivative of the error with respect to the parameters. It then discusses implementing backpropagation with computational graphs and mini-batching. Examples are provided for calculating the derivatives in a linear/fully connected module both with and without batching. Tips are given for debugging gradients and expected results when training networks on MNIST data.


Machine Learning - Exercise 4

Companion Slides

Ali Athar, Sabarinath Mahadevan

December 6, 2018
Exercise Goal

Lecture: backpropagation for a fixed network → Recap: general backpropagation with computational graphs

This exercise is about
▸ Understanding backpropagation, deriving formulas, optimizing them
▸ Implementing a simple neural network framework yourself
▸ Digit recognition
Recap: Neural Networks
[Figure: network pipeline — inputs pass through linear modules (Θᵢ = (W, b)) and activation functions f_act (tanh, σ, ReLU); the network's output is fed together with the labels into f_loss, which gives the error rate (E)]


Parameters
▸ Training data (inputs) X = {xᵢ}, i = 1…N, with xᵢ ∊ 𝕀, N the batch size

▸ Training labels T = {tᵢ}, i = 1…N, with tᵢ ∊ 𝕆

▸ Network is a parametrized, (sub-)differentiable function F(X, Θ) : 𝕀 × ℙ → 𝕆

▸ e.g., 𝕆 = ℝ^Dim (regression), 𝕆 = [0, 1]^Dim (probabilistic classification)

▸ Loss (criterion) L(T, F(X, Θ)) : 𝕆 × 𝕆 → ℝ, put on top of the output to measure performance
▸ Find optimal parameters: Θ* = argmin_Θ L(T, F(X, Θ))
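
For instance (the slides do not fix a particular loss), a squared-error criterion for regression with 𝕆 = ℝ^Dim would be L(T, F(X, Θ)) = (1/N)·Σᵢ ‖tᵢ − F(xᵢ, Θ)‖², and Θ* is whatever parameter setting makes this sum smallest.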
Recap: Backpropagation
[Figure: same pipeline as above — inputs → network (linear modules and activation functions f_act) → network's output → f_loss with labels → error rate (E)]

▸ Optimize towards a lower error rate, i.e., lower E

▸ Take the derivative of E with respect to each module's parameters and follow the gradient
▸ Example: Gradient Descent: Θ ← Θ − λ·DΘ(E(x)), where λ is the learning rate and DΘ(E(x)) is the derivative of E w.r.t. the module's parameters Θ at point x
▸ We write DΘ(E(x)) = DΘ(E) for brevity
▸ How to calculate DΘ(E)?
▸ Go through the modules in reverse order
▸ Each module gets Dout(E), calculates DΘ(E), and passes Din(E) to the next module (a code sketch of this interface follows below)
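
A minimal sketch of that module contract in Python — the method names fprop/bprop/update mirror the pseudocode on the following slides, while the class itself and the choice of Python are assumptions, not part of the official exercise framework:

class Module:
    """Interface sketch: run data forwards, run gradients backwards, update parameters."""

    def fprop(self, x):
        # compute the module's output for input x (and cache whatever bprop will need)
        raise NotImplementedError

    def bprop(self, dE):
        # dE is Dout(E); store D_Theta(E) internally and return Din(E) for the previous module
        raise NotImplementedError

    def update(self, rate):
        # gradient-descent step on the module's own parameters, if it has any
        pass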
Example: Linear/Fully Connected Module

Module: x → y = Wᵀ·x + b, parameters Θ = (W, b)

Given: derivative with respect to the output, Dy(E)

Calculate:
▸ Derivatives with respect to the parameters Θ

Element-wise: ∂E/∂Wᵢⱼ = xᵢ · ∂E/∂yⱼ and ∂E/∂bⱼ = ∂E/∂yⱼ

(Without batching)
Example: Linear/Fully Connected Module

Module: x → y = Wᵀ·x + b, parameters Θ = (W, b)

Given: derivative with respect to the output, Dy(E)

Calculate:
▸ Derivative with respect to the input x

Element-wise: ∂E/∂xᵢ = Σⱼ Wᵢⱼ · ∂E/∂yⱼ

(Without batching)
Example: Linear/Fully Connected Module
Putting it together (module: x → y = Wᵀ·x + b, parameters Θ = (W, b)):

fprop(x):                    // run training data through (forwards)
    cache.x = x
    return Wᵀ*x + b

bprop(dE):                   // run gradients through (backwards)
    dW = cache.x * dE
    db = dE
    return dE * Wᵀ

update(rate):                // update the parameters (gradient descent)
    W = W - rate*dW
    b = b - rate*dbᵀ

(Without batching; x is a column vector, dE a row vector)
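
A possible numpy translation of this pseudocode (the shapes, the row-vector gradient convention, and the initialization are assumptions made to match y = Wᵀ·x + b; a sketch, not the reference solution):

import numpy as np

class Linear:
    """Fully connected module y = W^T x + b, without batching."""

    def __init__(self, n_in, n_out, sigma=0.01):
        self.W = sigma * np.random.randn(n_in, n_out)   # W in R^{n_in x n_out}
        self.b = np.zeros((n_out, 1))                   # b as a column vector

    def fprop(self, x):
        self.x = x                                      # cache.x = x;  x: (n_in, 1)
        return self.W.T @ x + self.b                    # y = W^T x + b:  (n_out, 1)

    def bprop(self, dE):
        self.dW = self.x @ dE                           # outer product x * dE:  (n_in, n_out)
        self.db = dE                                    # dE/dy as a row vector:  (1, n_out)
        return dE @ self.W.T                            # dE/dx as a row vector:  (1, n_in)

    def update(self, rate):
        self.W -= rate * self.dW                        # gradient-descent step
        self.b -= rate * self.db.T

Caching x in fprop is what allows bprop to form the outer product x·dE without re-running the forward pass.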
Mini-Batching
▸ Batch learning
▸ All training samples processed at once, parameters updated once at the end
▸ Stable, well understood, many acceleration techniques, but slow
▸ Stochastic learning
▸ Each training sample separately, parameters updated at each step
▸ Noisy (though may lead to better results), fast
▸ Mini-batching
▸ Middle ground, batches of data processed, bundled updates
▸ Combine advantages, reduce drawbacks
▸ Example (see the numpy snippet below)
▸ Linear module f with input dimension Nin and output dimension Nout, batch size n
▸ Stack the n inputs row-wise into a mini-batch matrix X ∊ ℝ^(n×Nin); then Y = X·W + 1ₙ·bᵀ ∊ ℝ^(n×Nout), i.e., b is broadcast (repeated) across the n rows
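
As a quick illustration of that broadcast (all names here are placeholders; the row-wise layout of X is the assumption from above):

import numpy as np

n, n_in, n_out = 4, 3, 2                  # batch size, input dim, output dim
X = np.random.randn(n, n_in)              # mini-batch matrix: one sample per row
W = np.random.randn(n_in, n_out)
b = np.random.randn(n_out)

Y = X @ W + b                             # numpy repeats b across the n rows
Y_rows = np.stack([W.T @ X[k] + b for k in range(n)])   # per-sample y_k = W^T x_k + b
assert np.allclose(Y, Y_rows)             # both give the same (n, n_out) output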
Batching Update Rule
▸ (Mini-)Batch learning
▸ Multiple samples processed at once
▸ Calculate gradient for each sample, but don’t update the parameters
▸ After processing the batch, update using a sum of all gradients
▸ Learning rate has to be adapted, e.g., divide E by batch size

▸ Example: Gradient Descent
Θ ← Θ − λ·Σₖ DΘ(E)(xₖ), where DΘ(E)(xₖ) is the derivative of E w.r.t. the parameters Θ at point xₖ

▸ To make things easier, we write DΘ(E) = Σₖ DΘ(E)(xₖ) for the gradient summed over the batch

Example: Linear/Fully Connected Module - Batching

Module: xₖ → yₖ = Wᵀ·xₖ + b for the k-th element in the batch, parameters Θ = (W, b)

Given: derivatives with respect to the outputs (plural!), assumed to be given row-wise as a matrix dY ∊ ℝ^(n×Nout) whose k-th row is Dyₖ(E)

Calculate:
▸ Derivatives with respect to the parameters Θ

DW(E) = Xᵀ·dY (the per-sample outer products summed over the batch), Db(E) = Σₖ dYₖ (column-wise sum of dY)
Example: Linear/Fully Connected Module - Batching

Module: xₖ → yₖ = Wᵀ·xₖ + b for the k-th element in the batch, parameters Θ = (W, b)

Given: derivatives with respect to the outputs (plural!), assumed to be given row-wise as a matrix dY ∊ ℝ^(n×Nout) whose k-th row is Dyₖ(E)

Calculate:
▸ Derivatives with respect to the inputs xₖ

DX(E) = dY·Wᵀ, whose k-th row is the derivative w.r.t. xₖ (a numerical check follows below)
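
A sketch that checks these batched formulas against the per-sample ones (all variable names are placeholders; dY stands for the row-wise output derivatives described above):

import numpy as np

n, n_in, n_out = 5, 4, 3
X = np.random.randn(n, n_in)              # inputs, row-wise
W = np.random.randn(n_in, n_out)
dY = np.random.randn(n, n_out)            # dE/dy_k, row-wise

dW = X.T @ dY                             # sum over k of the outer products x_k * dE/dy_k
db = dY.sum(axis=0)                       # sum over k of dE/dy_k
dX = dY @ W.T                             # row k holds dE/dx_k

dW_per_sample = sum(np.outer(X[k], dY[k]) for k in range(n))
assert np.allclose(dW, dW_per_sample)     # batched and per-sample gradients agree
assert np.allclose(dX[0], W @ dY[0])      # first row matches the unbatched formula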
Example: Training a Network

network = [module1, module2, …, moduleN], loss = floss

for X, T in batched(inputs, labels) do
    z = X
    for module in network do
        z = module.fprop(z)                // forward pass
    end for
    E = loss.fprop(z, T)
    dz = loss.bprop(1/batchSize)           // normalization for the batch size
    for module in reversed(network) do
        dz = module.bprop(dz)              // backward pass
    end for
    for module in network do
        module.update(rate)                // parameter update
    end for
end for
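
The same loop in Python, assuming modules with the fprop/bprop/update interface sketched earlier and a hypothetical batched() helper that yields (X, T) mini-batches (none of these names are prescribed by the exercise):

def train_epoch(network, loss, inputs, labels, rate, batch_size):
    """One pass over the training data: forward, backward, then update every module."""
    for X, T in batched(inputs, labels, batch_size):    # hypothetical mini-batch iterator
        z = X
        for module in network:                          # forward pass
            z = module.fprop(z)
        E = loss.fprop(z, T)                            # scalar error, useful for monitoring
        dz = loss.bprop(1.0 / batch_size)               # normalize for the batch size
        for module in reversed(network):                # backward pass
            dz = module.bprop(dz)
        for module in network:                          # gradient-descent step
            module.update(rate)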
Debugging Tip: Gradient Checking
Check the Jacobian J from bprop against numerical differentiation

▸ Numerical approach: column-wise, e.g., the first column is J·e₁ ≈ (f(x + ε·e₁) − f(x − ε·e₁)) / (2ε)

▸ Backprop: row-wise, e.g., calling bprop with Dout(E) = e₁ᵀ returns the first row of J

▸ Advice
▸ Use (small) random x
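
A minimal check along these lines (ε, the tolerance, and the column-vector module interface are assumptions; the central difference is one common choice):

import numpy as np

def jacobian_from_bprop(module, x, n_in, n_out):
    """Row i of J is what bprop returns for the unit row vector e_i^T."""
    rows = []
    for i in range(n_out):
        module.fprop(x)                          # refresh the cache for input x
        e = np.zeros((1, n_out)); e[0, i] = 1.0
        rows.append(module.bprop(e))             # shape (1, n_in)
    return np.vstack(rows)                       # shape (n_out, n_in)

def jacobian_numerical(module, x, n_in, n_out, eps=1e-6):
    """Column j of J is the central difference along the j-th input direction."""
    J = np.zeros((n_out, n_in))
    for j in range(n_in):
        e = np.zeros((n_in, 1)); e[j] = eps
        J[:, j] = ((module.fprop(x + e) - module.fprop(x - e)) / (2 * eps)).ravel()
    return J

def grad_check(module, n_in, n_out, tol=1e-5):
    x = 0.1 * np.random.randn(n_in, 1)           # small random input, as advised
    diff = jacobian_from_bprop(module, x, n_in, n_out) - jacobian_numerical(module, x, n_in, n_out)
    return np.max(np.abs(diff)) < tol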
Expected Results/Tips for MNIST
▸ [Linear(28x28, 10), Softmax]
▸ should give around 750 errors
▸ [Linear(28x28, 200), tanh, Linear(200, 10), Softmax]
▸ should give around 250 errors
▸ Typical learning rates
▸ λ ∊ [0.01, 0.1]
▸ Typical batch sizes
▸ NB ∊ [100, 1000]
▸ Weight initialization
▸ W ∊ ℝ^(M×N)
▸ W ~ N(0, σ²), i.e., sampled from a normal distribution around 0 with standard deviation σ
▸ b = 0
▸ Pre-process the data
▸ Divide values by 255 (= max pixel value)
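
A short sketch of this setup (σ is left as a parameter because the slides do not spell out its value; the function names are placeholders):

import numpy as np

def preprocess(images):
    """Scale raw MNIST pixel values from [0, 255] down to [0, 1]."""
    return images.astype(np.float64) / 255.0

def init_linear(n_in, n_out, sigma):
    """W ~ N(0, sigma^2) and b = 0; the value of sigma is not specified in the slides."""
    W = sigma * np.random.randn(n_in, n_out)
    b = np.zeros(n_out)
    return W, b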
