Machine Learning - Exercise 4: Companion Slides
December 6, 2018
Exercise Goal / Lecture Recap
[Diagram: a network f maps inputs to an output; a loss function measures the error rate E]
Example: Linear/Fully Connected Module
Calculate:
▸ Forward pass: y = Wᵀ·x + b, parameters Θ = (W, b)
▸ Derivatives with respect to the parameters Θ
▸ Element-wise: ∂E/∂W_ij = x_i·(∂E/∂y)_j and ∂E/∂b_j = (∂E/∂y)_j
(Without batching)
Example: Linear/Fully Connected Module
Calculate:
▸ Forward pass: y = Wᵀ·x + b, parameters Θ = (W, b)
▸ Derivative with respect to the input x
▸ Element-wise: ∂E/∂x_i = Σ_j W_ij·(∂E/∂y)_j, i.e., ∂E/∂x = W·(∂E/∂y)
(Without batching; a numerical check of these derivatives follows below)
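The derivatives above can be checked numerically. A minimal sketch, assuming numpy, the single-sample column-vector convention used above, and an illustrative quadratic toy loss; the dimensions and the target t are made up purely for the test:

```python
import numpy as np

# Toy dimensions (assumptions for illustration only)
Nin, Nout = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(Nin, Nout))
b = rng.normal(size=(Nout,))
x = rng.normal(size=(Nin,))
t = rng.normal(size=(Nout,))           # arbitrary target for the toy loss

def forward(W, b, x):
    return W.T @ x + b                 # y = W^T x + b

def loss(y):
    return 0.5 * np.sum((y - t) ** 2)  # E = 1/2 ||y - t||^2, so dE/dy = y - t

# Analytic gradients from the slide
y = forward(W, b, x)
dE_dy = y - t
dW = np.outer(x, dE_dy)                # dE/dW_ij = x_i (dE/dy)_j
db = dE_dy                             # dE/db = dE/dy
dx = W @ dE_dy                         # dE/dx = W (dE/dy)

# Finite-difference check of dE/dW
eps = 1e-6
num_dW = np.zeros_like(W)
for i in range(Nin):
    for j in range(Nout):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num_dW[i, j] = (loss(forward(Wp, b, x)) - loss(forward(Wm, b, x))) / (2 * eps)

print(np.allclose(dW, num_dW, atol=1e-5))   # expected: True
```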
Example: Linear/Fully Connected Module
Putting it together, with Θ = (W, b) and without batching:

fprop(x):            run training data through (forwards)
    cache.x = x
    return Wᵀ·x + b

bprop(dE):           run gradients through (backwards)
    dW = cache.x·dEᵀ
    db = dE
    return W·dE

update(rate):        update the parameters (grad. descent)
    W = W - rate·dW
    b = b - rate·db
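A Python/numpy sketch following the fprop/bprop/update pseudocode above; the class and attribute names and the initialization scale are illustrative assumptions, not a prescribed interface:

```python
import numpy as np

class Linear:
    """Fully connected module: y = W^T x + b (single sample, no batching)."""

    def __init__(self, n_in, n_out, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.W = rng.normal(scale=0.01, size=(n_in, n_out))  # small random init (assumed scale)
        self.b = np.zeros(n_out)
        self.cache_x = None
        self.dW = np.zeros_like(self.W)
        self.db = np.zeros_like(self.b)

    def fprop(self, x):
        self.cache_x = x              # remember the input for bprop
        return self.W.T @ x + self.b  # forwards: run the data through

    def bprop(self, dE):
        self.dW = np.outer(self.cache_x, dE)  # dE/dW = x (dE/dy)^T
        self.db = dE                          # dE/db = dE/dy
        return self.W @ dE                    # dE/dx = W (dE/dy), passed backwards

    def update(self, rate):
        self.W -= rate * self.dW      # gradient descent step
        self.b -= rate * self.db
```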
Mini-Batching
▸ Batch learning
▸ All training samples processed at once, parameters updated once at the end
▸ Stable, well understood, many acceleration techniques, but slow
▸ Stochastic learning
▸ Each training sample separately, parameters updated at each step
▸ Noisy (though may lead to better results), fast
▸ Mini-batching
▸ Middle ground, batches of data processed, bundled updates
▸ Combine advantages, reduce drawbacks
▸ Example
▸ Linear module f with input dimension Nin and output dimension Nout, batch size n
▸ The n input vectors are collected into a mini-batch matrix (e.g., X ∈ ℝ^(n×Nin), one sample per row), so the whole batch is handled with one matrix product (see the sketch below)
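A minimal sketch of the batched case, assuming numpy and a mini-batch matrix X with one sample per row; the batch size and dimensions are illustrative:

```python
import numpy as np

n, Nin, Nout = 8, 4, 3                     # batch size and dimensions (assumptions)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(Nin, Nout))
b = np.zeros(Nout)

X = rng.normal(size=(n, Nin))              # mini-batch matrix: one sample per row

# Batched forward pass: row k of Y is W^T x_k + b for the kth sample
Y = X @ W + b                              # shape (n, Nout)

# Batched backward pass, given dE of shape (n, Nout) (one gradient row per sample)
dE = rng.normal(size=(n, Nout))
dW = X.T @ dE                              # sums x_k (dE_k)^T over the batch
db = dE.sum(axis=0)                        # sums dE_k over the batch
dX = dE @ W.T                              # gradient w.r.t. each input row
```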
Batching Update Rule
▸ (Mini-)Batch learning
▸ Multiple samples processed at once
▸ Calculate gradient for each sample, but don’t update the parameters
▸ After processing the batch, update using a sum of all gradients
▸ The learning rate has to be adapted, e.g., divide E (and hence the gradients) by the batch size
▸ Update rule, with Θ = (W, b) and E_k the error of the kth element in the batch (see the sketch below):
Θ ← Θ - λ·Σ_k ∂E_k/∂Θ
▸ Advice
▸ Use (small) random x
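A minimal sketch of the batching update rule, assuming numpy, a plain linear module, and an illustrative quadratic loss (the MNIST models below use Softmax instead): gradients are computed for every sample, summed (here averaged by dividing by the batch size, as advised above), and the parameters are updated once per batch.

```python
import numpy as np

def batch_step(W, b, X, T, rate):
    """One mini-batch step for a linear module with an assumed quadratic loss.

    X: mini-batch matrix (n, Nin), T: targets (n, Nout).
    """
    n = X.shape[0]
    Y = X @ W + b                 # forward pass for the whole batch
    dE = (Y - T) / n              # dE_k/dy per sample; /n averages the summed gradients
    dW = X.T @ dE                 # sum_k dE_k/dW
    db = dE.sum(axis=0)           # sum_k dE_k/db
    W -= rate * dW                # single update after the whole batch
    b -= rate * db
    return W, b
```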
Expected Results/Tips for MNIST
▸ [Linear(28x28, 10), Softmax]
▸ should give roughly 750 errors
▸ [Linear(28x28, 200), tanh, Linear(200,10), Softmax]
▸ should give roughly 250 errors
▸ Typical learning rates
▸ λ ∊ [0.01, 0.1]
▸ Typical batch sizes
▸ NB ∊ [100, 1000]
▸ Weight initialization
▸ W ∊ ℝ^(M×N)
▸ W ~ N(0, σ²), i.e., sampled from a normal distribution around 0 with a (small) standard deviation σ
▸ b=0
▸ Pre-process the data
▸ Divide the values by 255 (= max pixel value); see the sketch below
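A minimal sketch of the initialization and pre-processing tips, assuming numpy; the deviation 0.01 is an assumed value (the slide does not fix one), and mnist_images is a hypothetical placeholder for however you load the raw 0-255 pixel data:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(n_in, n_out, sigma=0.01):
    """W sampled from a normal around 0 (sigma is an assumed value), b set to 0."""
    W = rng.normal(loc=0.0, scale=sigma, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b

def preprocess(mnist_images):
    """mnist_images: uint8 array of shape (N, 28, 28), placeholder for the raw data."""
    X = mnist_images.astype(np.float64) / 255.0   # divide by 255 (= max pixel value)
    return X.reshape(len(X), 28 * 28)             # flatten 28x28 images for Linear(28x28, ...)

# Typical hyper-parameters from the slide
rate = 0.01            # learning rate in [0.01, 0.1]
batch_size = 100       # batch size in [100, 1000]
```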