Slides 11
Varun Kanade
University of Oxford
November 14 & 16, 2016
Announcements
1
Outline
I Multi-layer perceptrons
2
Artificial Neuron : Logistic Regression
[Figure: a single unit. Inputs x1, x2 with weights w1, w2 and a bias b feed a linear function (Σ) followed by a non-linearity, giving ŷ = Pr(y = 1 | x, w, b) = σ(w1 x1 + w2 x2 + b).]
Multilayer Perceptron (MLP) : Classification
[Figure: a 2-2-1 network. Inputs x1, x2 feed two hidden units (weights W^2, biases b^2_1, b^2_2), whose outputs feed a single output unit (weights W^3, bias b^3_1), giving ŷ = Pr(y = 1 | x, W, b).]
4
Multilayer Perceptron (MLP) : Regression
[Figure: the same 2-2-1 network as above, now with output ŷ = E[y | x, W, b].]
5
A Toy Example
6
Logistic Regression Fails Badly
7
Solve using MLP
[Figure: the 2-2-1 network with hidden pre-activations z^2_1, z^2_2, hidden activations a^2_1, a^2_2, output pre-activation z^3_1 and output activation a^3_1, giving ŷ = Pr(y = 1 | x, W^i, b^i).]

a^1 = z^1 = x
z^2 = W^2 a^1 + b^2
a^2 = tanh(z^2)
z^3 = W^3 a^2 + b^3
ŷ = a^3 = σ(z^3)
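A minimal NumPy sketch of this forward pass (not from the lecture); the parameter values below are arbitrary placeholders rather than weights fitted to the toy data:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W2, b2, W3, b3):
    """Forward pass of the 2-2-1 network: tanh hidden layer, sigmoid output."""
    a1 = x                       # a^1 = z^1 = x
    z2 = W2 @ a1 + b2            # z^2 = W^2 a^1 + b^2
    a2 = np.tanh(z2)             # a^2 = tanh(z^2)
    z3 = W3 @ a2 + b3            # z^3 = W^3 a^2 + b^3
    a3 = sigmoid(z3)             # yhat = a^3 = sigma(z^3)
    return a1, z2, a2, z3, a3

# Example call with arbitrary (untrained) parameters.
W2, b2 = np.array([[1.0, -1.0], [-1.0, 1.0]]), np.zeros(2)
W3, b3 = np.array([[1.0, 1.0]]), np.zeros(1)
print(forward(np.array([0.5, -0.5]), W2, b2, W3, b3)[-1])   # Pr(y = 1 | x)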
8
Scatterplot Comparison: (x1, x2) vs (a^2_1, a^2_2)
9
Decision Boundary of the Neural Net
10
Feedforward Neural Networks
[Figure: Layer 1 (Input) → Layer 2 (Hidden) → Layer 3 (Hidden) → Layer 4 (Output); each consecutive pair of layers forms a fully connected layer.]
11
Computing Gradients on Toy Example
Would suffice to compute ∂ℓ/∂z^3_1, ∂ℓ/∂z^2_1, ∂ℓ/∂z^2_2
12
Computing Gradients on Toy Example
Let us compute the following:
1. ∂ℓ/∂a^3_1 = −y/a^3_1 + (1 − y)/(1 − a^3_1) = (a^3_1 − y) / (a^3_1 (1 − a^3_1))
2. ∂a^3_1/∂z^3_1 = a^3_1 · (1 − a^3_1)
3. ∂z^3_1/∂a^2 = [w^3_11, w^3_12]
4. ∂a^2/∂z^2 = [[1 − tanh^2(z^2_1), 0], [0, 1 − tanh^2(z^2_2)]]

∂ℓ/∂z^3_1 = ∂ℓ/∂a^3_1 · ∂a^3_1/∂z^3_1 = a^3_1 − y

∂ℓ/∂z^2 = ∂ℓ/∂a^3_1 · ∂a^3_1/∂z^3_1 · ∂z^3_1/∂a^2 · ∂a^2/∂z^2 = ∂ℓ/∂z^3_1 · ∂z^3_1/∂a^2 · ∂a^2/∂z^2
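A matching NumPy sketch (again an illustration, not the lecture's code) that evaluates steps 1-4; it assumes the cross-entropy loss ℓ = −y log a^3_1 − (1 − y) log(1 − a^3_1) implied by step 1:

import numpy as np

def toy_gradients(x, y, W2, b2, W3, b3):
    """Compute dl/dz^3_1 and dl/dz^2 for the 2-2-1 tanh/sigmoid network."""
    # Forward pass (as in the earlier sketch).
    a1 = x
    z2 = W2 @ a1 + b2
    a2 = np.tanh(z2)
    z3 = W3 @ a2 + b3
    a3 = 1.0 / (1.0 + np.exp(-z3))

    dl_dz3 = a3 - y                           # steps 1 and 2 combined: a^3_1 - y
    dz3_da2 = W3                              # step 3: [w^3_11, w^3_12]
    da2_dz2 = np.diag(1.0 - np.tanh(z2)**2)   # step 4: diagonal Jacobian of tanh
    dl_dz2 = dl_dz3 @ dz3_da2 @ da2_dz2       # chain rule: row vector of length 2
    return dl_dz3, dl_dz2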
13
[Figure: layered network from input x = a^1 through layer 2, …, layer l − 1, …, up to a^L and the loss ℓ.]
If there are n_l units in layer l, then W^l is n_l × n_{l−1}.
Backward pass to compute derivatives ∂ℓ/∂z^l.
14
[Figure: layered network from input x = a^1 through layer 2, …, layer l − 1, …, layer L − 1, up to a^L and the loss ℓ.]
Forward equations:
(1) a^1 = x (input)
(2) z^l = W^l a^{l−1} + b^l
(3) a^l = f_l(z^l)
(4) ℓ(a^L, y)
15
Output Layer
[Figure: layer L (z^L → a^L) maps a^{L−1} to a^L, which feeds the loss.]

z^L = W^L a^{L−1} + b^L
a^L = f_L(z^L)
Loss: ℓ(y, a^L)

∂ℓ/∂z^L = ∂ℓ/∂a^L · ∂a^L/∂z^L

If there are n_L (output) units in layer L, then ∂ℓ/∂a^L and ∂ℓ/∂z^L are row vectors with n_L elements and ∂a^L/∂z^L is the n_L × n_L Jacobian matrix:

∂a^L/∂z^L =
[ ∂a^L_1/∂z^L_1        ∂a^L_1/∂z^L_2        · · ·   ∂a^L_1/∂z^L_{n_L}     ]
[ ∂a^L_2/∂z^L_1        ∂a^L_2/∂z^L_2        · · ·   ∂a^L_2/∂z^L_{n_L}     ]
[       ...                   ...            ...           ...            ]
[ ∂a^L_{n_L}/∂z^L_1    ∂a^L_{n_L}/∂z^L_2    · · ·   ∂a^L_{n_L}/∂z^L_{n_L} ]
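For illustration (an assumption, not part of the slides): when f_L is an elementwise sigmoid the Jacobian ∂a^L/∂z^L is diagonal, so ∂ℓ/∂z^L is just a row-vector-times-matrix product.

import numpy as np

def output_delta_sigmoid(dl_daL, zL):
    """dl/dz^L = dl/da^L · da^L/dz^L for an elementwise sigmoid output layer."""
    aL = 1.0 / (1.0 + np.exp(-zL))
    jacobian = np.diag(aL * (1.0 - aL))   # da^L/dz^L, an n_L x n_L diagonal matrix
    return dl_daL @ jacobian              # row vector with n_L elements

# Example with squared-error loss l = ||a^L - y||^2, so dl/da^L = 2 (a^L - y).
zL, y = np.array([0.3, -1.2]), np.array([1.0, 0.0])
aL = 1.0 / (1.0 + np.exp(-zL))
print(output_delta_sigmoid(2 * (aL - y), zL))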
17
Gradients with respect to parameters
[Figure: layer l (z^l → a^l), with input a^{l−1} arriving from below and ∂ℓ/∂z^{l+1} arriving from above.]

z^l = W^l a^{l−1} + b^l   (w^l_{j,k} is the weight on the connection from the kth unit in layer l − 1 to the jth unit in layer l)

∂ℓ/∂z^l (obtained using backpropagation)

Consider ∂ℓ/∂w^l_{ij} = ∂ℓ/∂z^l_i · ∂z^l_i/∂w^l_{ij} = ∂ℓ/∂z^l_i · a^{l−1}_j

∂ℓ/∂b^l_i = ∂ℓ/∂z^l_i

More succinctly, we may write: ∂ℓ/∂W^l = (a^{l−1} · ∂ℓ/∂z^l)^T and ∂ℓ/∂b^l = ∂ℓ/∂z^l
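In NumPy these two identities are an outer product and a copy (a sketch; the names are placeholders):

import numpy as np

def param_gradients(dl_dzl, a_prev):
    """dl/dW^l and dl/db^l from dl/dz^l (length n_l) and a^{l-1} (length n_{l-1})."""
    dl_dWl = np.outer(dl_dzl, a_prev)   # dl/dw^l_ij = dl/dz^l_i * a^{l-1}_j, shape n_l x n_{l-1}
    dl_dbl = dl_dzl                     # dl/db^l_i = dl/dz^l_i
    return dl_dWl, dl_dbl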
18
[Figure: layered network from input x = a^1 up to a^L and the loss ℓ, with ∂ℓ/∂z^l flowing backwards through the layers.]

Forward Equations
(1) a^1 = x (input)
(2) z^l = W^l a^{l−1} + b^l
(3) a^l = f_l(z^l)
(4) ℓ(a^L, y)

Back-propagation Equations
(1) Compute ∂ℓ/∂z^L = ∂ℓ/∂a^L · ∂a^L/∂z^L
(2) ∂ℓ/∂z^l = ∂ℓ/∂z^{l+1} · W^{l+1} · ∂a^l/∂z^l
(3) ∂ℓ/∂W^l = (a^{l−1} · ∂ℓ/∂z^l)^T
(4) ∂ℓ/∂b^l = ∂ℓ/∂z^l
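A compact NumPy sketch of one forward and backward pass following equations (1)-(4); it assumes sigmoid units in every layer and a cross-entropy loss, so that ∂ℓ/∂z^L = a^L − y as on the toy example (an illustration, not the lecture's reference code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, bs):
    """Ws[0], bs[0] correspond to W^2, b^2 in the slides' numbering."""
    # Forward equations.
    a, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ a[-1] + b)
        a.append(sigmoid(zs[-1]))

    # Back-propagation equations.
    grads_W, grads_b = [], []
    delta = a[-1] - y                              # (1) dl/dz^L for sigmoid + cross-entropy
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(delta, a[l]))   # (3) dl/dW^l
        grads_b.insert(0, delta)                   # (4) dl/db^l
        if l > 0:
            da_dz = a[l] * (1.0 - a[l])            # diagonal of the sigmoid Jacobian
            delta = (delta @ Ws[l]) * da_dz        # (2) dl/dz^l = dl/dz^{l+1} · W^{l+1} · da^l/dz^l
    return grads_W, grads_b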
19
Computational Questions
What is the running time to compute the gradient for a single data point?
I As many matrix multiplications as there are fully connected layers
I These multiplications are performed twice: once in the forward pass and once in the backward pass
20
Training Deep Neural Networks
I Regularisation
I How do we add ℓ1 or ℓ2 regularisation? (see the sketch after this list)
I Don’t regularise bias terms
I What did we learn in the last 10 years that we didn’t know in the 80s?
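A minimal sketch of adding ℓ2 (weight-decay) terms to the gradients from the backprop sketch above; bias terms are left unregularised (the names and interface are placeholders):

def l2_regularised_gradients(grads_W, grads_b, Ws, lam):
    """Add the gradient of (lam / 2) * sum_l ||W^l||^2 to the weight gradients only."""
    reg_W = [gW + lam * W for gW, W in zip(grads_W, Ws)]
    return reg_W, grads_b     # bias gradients unchanged: don't regularise bias terms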
21
Training Feedforward Deep Networks
[Figure: Layer 1 (Input) → Layer 2 (Hidden) → Layer 3 (Hidden) → Layer 4 (Output).]
[Figure: a single sigmoid unit with input x, weight w^2_1, bias b^2_1, pre-activation z^2_1 and output a^2_1.]
Target is y = (1 − x)/2, with x ∈ {−1, 1}.

With squared loss ℓ = (a^2_1 − y)^2:
∂ℓ/∂z^2_1 = 2(a^2_1 − y) · ∂a^2_1/∂z^2_1 = 2(a^2_1 − y) σ'(z^2_1)
If x = −1, w^2_1 ≈ 5, b^2_1 ≈ 0, then σ'(z^2_1) ≈ 0
[Figure: plot of σ(z) for z ∈ [−8, 8]; the curve is nearly flat once |z| is large.]

With cross-entropy loss ℓ = −y log a^2_1 − (1 − y) log(1 − a^2_1):
∂ℓ/∂z^2_1 = (a^2_1 − y)/(a^2_1 (1 − a^2_1)) · ∂a^2_1/∂z^2_1 = a^2_1 − y
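A quick numerical check of the two expressions above at the saturating point (the parameter values mirror the example):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, w, b = -1.0, 5.0, 0.0
y = (1 - x) / 2                    # target y = (1 - x)/2, so y = 1 here
z = w * x + b
a = sigmoid(z)                     # a^2_1 is close to 0 although the target is 1

grad_squared = 2 * (a - y) * a * (1 - a)   # squared loss: tiny, the unit has saturated
grad_xent = a - y                          # cross-entropy loss: close to -1
print(grad_squared, grad_xent)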
23
Propagating Gradients Backwards
[Figure: a chain of units x = a^1_1 → Σ (w^2_1, b^2_1) → Σ (w^3_1, b^3_1) → Σ (w^4_1, b^4_1) → a^4_1.]
24
Avoiding Saturation
Use rectified linear units: f(z) = max(0, z)
[Figure: plot of the rectifier non-linearity f(z) = max(0, z) for z ∈ [−3, 3].]
Other variants: leaky ReLUs, parametric ReLUs
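A minimal sketch of the rectifier and a leaky variant (the slope 0.01 is an arbitrary choice; a parametric ReLU would instead learn it):

import numpy as np

def relu(z):
    """Rectifier: f(z) = max(0, z)."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: a small slope for z < 0 so the gradient never vanishes entirely."""
    return np.where(z > 0, z, alpha * z)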
25
Initialising Weights and Biases
26
Avoiding Overfitting
I For Problem Sheet 4, you will be asked to train an MLP for digit
recognition with 2 million parameters and only 60,000 training images
I For image detection, one of the most famous models, the neural net used by Krizhevsky, Sutskever and Hinton (2012), has 60 million parameters and was trained on 1.2 million images
27
Early Stopping
28
Add Data: Modified Data
Typically, getting additional data is either impossible or expensive
30
Other Ideas to Reduce Overfitting
Gradient Clipping
Unsupervised Pre-training
(Bengio et al.)
31
Bagging (Bootstrap Aggregation)
I Train classifiers f1, . . . , fk on bootstrap samples D1, . . . , Dk and aggregate their predictions (see the sketch below)
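A minimal sketch of bagging, assuming some training routine train(X, y) that returns a prediction function (the names and interface are placeholders):

import numpy as np

def bag(train, X, y, k, seed=0):
    """Train k classifiers f_1, ..., f_k on bootstrap samples D_1, ..., D_k of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(y), size=len(y))   # sample n points with replacement
        models.append(train(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Aggregate by averaging the k classifiers' predictions."""
    return np.mean([f(X) for f in models], axis=0)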
32
Dropout
33
Errors Made by MLP for Digit Recognition
34
Avoiding Overfitting
35
Convolutional Neural Networks (convnets)
36
Image Convolution
Source: L. W. Kheng
37
Convolution
38
Source: Krizhevsky, Sutskever, Hinton (2012)
39
Sources: Krizhevsky, Sutskever, Hinton (2012); Wikipedia
40
Source: Krizhevsky, Sutskever, Hinton (2012)
41
Source: Zeiler and Fergus (2013)
42
Source: Zeiler and Fergus (2013)
43
Convolutional Layer
Suppose that there is no zero padding and strides in both directions are 1
z^{l+1}_{i',j',f'} = b_{f'} + Σ_{i=1}^{W_{f'}} Σ_{j=1}^{H_{f'}} Σ_{f=1}^{F_l} a^l_{i'+i−1, j'+j−1, f} · w^{l+1,f'}_{i,j,f}

∂z^{l+1}_{i',j',f'} / ∂w^{l+1,f'}_{i,j,f} = a^l_{i'+i−1, j'+j−1, f}

∂ℓ/∂w^{l+1,f'}_{i,j,f} = Σ_{i',j'} ∂ℓ/∂z^{l+1}_{i',j',f'} · a^l_{i'+i−1, j'+j−1, f}
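A naive NumPy sketch of this forward pass and weight gradient, with the loops kept explicit to mirror the sums (the array layout (height, width, feature maps) and all names are assumptions, not the lecture's code):

import numpy as np

def conv_forward(a, w, b):
    """z^{l+1}_{i',j',f'} = b_{f'} + sum_{i,j,f} a^l_{i'+i-1, j'+j-1, f} w^{l+1,f'}_{i,j,f}
    a: (H, W, F_l) input maps; w: (H_f, W_f, F_l, F_{l+1}) filters; b: (F_{l+1},)."""
    Hf, Wf, Fl, Fo = w.shape
    H, W, _ = a.shape
    z = np.zeros((H - Hf + 1, W - Wf + 1, Fo))
    for i2 in range(z.shape[0]):                  # i2, j2 play the role of i', j'
        for j2 in range(z.shape[1]):
            patch = a[i2:i2 + Hf, j2:j2 + Wf, :]
            z[i2, j2, :] = b + np.tensordot(patch, w, axes=3)
    return z

def conv_weight_grad(a, dz):
    """dl/dw^{l+1,f'}_{i,j,f} = sum_{i',j'} dl/dz^{l+1}_{i',j',f'} * a^l_{i'+i-1, j'+j-1, f}."""
    Hf = a.shape[0] - dz.shape[0] + 1
    Wf = a.shape[1] - dz.shape[1] + 1
    dw = np.zeros((Hf, Wf, a.shape[2], dz.shape[2]))
    for i2 in range(dz.shape[0]):
        for j2 in range(dz.shape[1]):
            patch = a[i2:i2 + Hf, j2:j2 + Wf, :]
            dw += patch[..., None] * dz[i2, j2, :]   # accumulate over output positions
    return dw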
44
Convolutional Layer
Suppose that there is no zero padding and strides in both directions are 1
z^{l+1}_{i',j',f'} = b_{f'} + Σ_{i=1}^{W_{f'}} Σ_{j=1}^{H_{f'}} Σ_{f=1}^{F_l} a^l_{i'+i−1, j'+j−1, f} · w^{l+1,f'}_{i,j,f}

∂z^{l+1}_{i',j',f'} / ∂a^l_{i,j,f} = w^{l+1,f'}_{i−i'+1, j−j'+1, f}

∂ℓ/∂a^l_{i,j,f} = Σ_{i',j',f'} ∂ℓ/∂z^{l+1}_{i',j',f'} · w^{l+1,f'}_{i−i'+1, j−j'+1, f}
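A matching sketch for the gradient with respect to the input maps, using the same (assumed) shape conventions as above:

import numpy as np

def conv_input_grad(dz, w):
    """dl/da^l_{i,j,f} = sum_{i',j',f'} dl/dz^{l+1}_{i',j',f'} * w^{l+1,f'}_{i-i'+1, j-j'+1, f}."""
    Hf, Wf, Fl, Fo = w.shape
    da = np.zeros((dz.shape[0] + Hf - 1, dz.shape[1] + Wf - 1, Fl))
    for i2 in range(dz.shape[0]):
        for j2 in range(dz.shape[1]):
            # each output location spreads its gradient back over the input patch it saw
            da[i2:i2 + Hf, j2:j2 + Wf, :] += np.tensordot(dz[i2, j2, :], w, axes=([0], [3]))
    return da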
45
Max-Pooling Layer
Let Ω(i', j') be the set of (i, j) pairs in the previous layer that are involved in the maxpool.

s^{l+1}_{i',j'} = max_{(i,j) ∈ Ω(i',j')} a^l_{i,j}

∂s^{l+1}_{i',j'} / ∂a^l_{i,j} = I( (i, j) = argmax_{(ĩ,j̃) ∈ Ω(i',j')} a^l_{ĩ,j̃} )
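A minimal sketch for non-overlapping size × size windows Ω(i', j') on a single feature map (the window size and the single-map case are my simplifications):

import numpy as np

def maxpool_forward_backward(a, ds, size=2):
    """Return the pooled map s^{l+1} and dl/da^l given dl/ds^{l+1}."""
    H, W = a.shape
    s = np.zeros((H // size, W // size))
    da = np.zeros_like(a)
    for i2 in range(s.shape[0]):
        for j2 in range(s.shape[1]):
            window = a[i2*size:(i2+1)*size, j2*size:(j2+1)*size]   # Omega(i', j')
            s[i2, j2] = window.max()
            di, dj = np.unravel_index(window.argmax(), window.shape)
            # indicator: only the argmax inside Omega(i', j') receives the gradient
            da[i2*size + di, j2*size + dj] = ds[i2, j2]
    return s, da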
46
Next Week
47