Machine Learning: Feed-Forward Neural Networks, Backpropagation Algorithm, CNNs and RNNs
I Backpropagation Algorithm
Introduction
Perceptron (Recall)
Backpropagation Algorithm: 1
Autoencoders
2
Introduction
A Snapshot of Deep Learning
Features
3
A Snapshot of Deep Learning (cont...)
Popular Models
Tools
I PyTorch
I Theano (outdated)
I Caffe
I TensorFlow
5
A Snapshot of Deep Learning (cont...)
Consequences
6
Shallow Vs. Deep
7
Shallow Vs. Deep (Cont...)
I human speech,
I natural language,
8
Shallow Vs. Deep (Cont...)
9
Human Perception and Evidence for Layered Hierarchical Systems
10
Can we also emulate the same?
11
Can we also emulate the same? (Cont...)
12
What has changed now?
13
What has changed now? (Cont...)
14
What has changed now? (Cont...)
15
Perceptron: History
History:
16
Perceptron: History
17
Deep Learning: Advantages
I Nonlinearity
I Input-Output mapping
I Adaptivity
I Fault Tolerance
I VLSI implementability
18
Where do we start?
I Perceptron (Recall)
19
Perceptron (Recall)
Hyperplanes
I wᵀx = 0 defines a hyperplane through the origin.
I By adding a bias b ∈ R: wᵀx + b = 0. b > 0 moves the hyperplane parallel to itself along w; b < 0 moves it in the opposite direction.
20
Hyperplane based classification
I Classification rule: y = sign(wᵀx + b)
I wᵀx + b > 0 =⇒ y = +1
I wᵀx + b < 0 =⇒ y = −1
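A minimal sketch of this decision rule (the weights, bias and example inputs below are made-up values):

```python
# Minimal sketch of hyperplane-based classification, y = sign(w.x + b).
# The weight vector w, bias b and the inputs are hypothetical values.
import numpy as np

def classify(w, b, x):
    """Return +1 if w.x + b > 0, else -1."""
    return 1 if np.dot(w, x) + b > 0 else -1

w = np.array([1.0, -2.0])   # hypothetical weights
b = 0.5                     # hypothetical bias
print(classify(w, b, np.array([3.0, 1.0])))   # +1
print(classify(w, b, np.array([0.0, 1.0])))   # -1
```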
21
Hyperplane based classification
22
The Perceptron Algorithm (Rosenblatt, 1958)
23
Perceptron Algorithm
I If yn (wᵀxn + b) ≤ 0
[or sign(wᵀxn + b) ≠ yn , i.e., a mistake is made]
I wnew = wold + yn xn
I bnew = bold + yn
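A small sketch of this update loop on a made-up, linearly separable toy set (labels in {−1, +1}; the data and epoch count are arbitrary):

```python
# Sketch of the perceptron mistake-driven update rule above.
import numpy as np

def perceptron(X, y, epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xn, yn in zip(X, y):
            if yn * (np.dot(w, xn) + b) <= 0:   # mistake on (xn, yn)
                w = w + yn * xn                 # w_new = w_old + yn * xn
                b = b + yn                      # b_new = b_old + yn
    return w, b

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```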
24
Perceptron Convergence Theorem (Block and Novikoff)
25
Feed Forward Neural Networks
Some Basic Features of Multilayer Perceptrons (or Feedforward Deep Neural Networks)
26
Why hidden Layers?
27
Why hidden Layers? (Cont...)
28
Two important steps in training neural network
1 Forward step:
I Input is fed to the first layer.
I The input signal is propagated through the network layer by layer.
I Synaptic weights of the network are fixed, i.e., no learning happens in this step.
I Error is calculated at the output layer by comparing the observed output with the "desired output" (ground truth).
2 Backward step:
I The error observed at the output layer is propagated "backwards", layer by layer. (How?)
I In this step, successive adjustments are made to the synaptic weights.
29
Propagation of information in neural network
30
1 Computation of signals
2 Computation of gradients
I gradients of the "error surface" w.r.t. the weights (we will see how later)
31
Error
I Let D = {(x(n), z(n))}_{n=1}^{N} be a training sample, where x(n) is an input and z(n) is the desired output.
32
Error (contd. . . )
33
Backpropagation Algorithm: 1
The Backpropagation Algorithm
34
The Backpropagation Algorithm
35
The Backpropagation Algorithm (cont...)
36
The Backpropagation Algorithm (cont...)
37
The Backpropagation Algorithm (cont...)
The derivative
Since
I E is a function of ej ,
I ej is a function of yj , and yj is a function of vj ,
I vj is a function of wji ,
the chain rule gives ∂E/∂wji = (∂E/∂ej )(∂ej /∂yj )(∂yj /∂vj )(∂vj /∂wji ).
38
The Backpropagation Algorithm (cont...)
I The derivatives are:
I E(n) = (1/2) Σ_{j=1}^{M} ej(n)² =⇒ ∂E(n)/∂ej(n) = ej(n)
I ej(n) = zj(n) − yj(n) =⇒ ∂ej(n)/∂yj(n) = −1
I yj(n) = ϕ(vj(n)) =⇒ ∂yj(n)/∂vj(n) = ϕ′j(vj(n))
I vj(n) = Σ_{i=0}^{m} wji(n) xi(n) =⇒ ∂vj(n)/∂wji(n) = xi(n)
I =⇒ ∂E(n)/∂wji(n) = −ej(n) ϕ′j(vj(n)) xi(n)
39
The Backpropagation Algorithm (contd. . . )
I We have
∂E(n)/∂wji(n) = −ej(n) ϕ′j(vj(n)) xi(n)
I Gradient-descent update:
wji(n + 1) = wji(n) − η ∂E(n)/∂wji(n)
           = wji(n) + η ej(n) ϕ′j(vj(n)) xi(n)
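A small numeric sketch of one such update step for a single output neuron, assuming a logistic activation with a = 1 (all numbers below are made up):

```python
# Sketch: one gradient-descent step for a single output neuron, following
# w_ji(n+1) = w_ji(n) + eta * e_j(n) * phi'(v_j(n)) * x_i(n).
# Logistic activation with a = 1 is assumed; all values are made up.
import numpy as np

def phi(v):            # logistic activation
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):      # its derivative
    s = phi(v)
    return s * (1.0 - s)

x = np.array([1.0, 0.5, -1.0])   # inputs x_i(n) (hypothetical)
w = np.array([0.2, -0.1, 0.4])   # weights w_ji(n) (hypothetical)
z = 1.0                          # desired output z_j(n)
eta = 0.5                        # learning rate

v = np.dot(w, x)                 # induced local field v_j(n)
y = phi(v)                       # observed output y_j(n)
e = z - y                        # error e_j(n)
w_new = w + eta * e * phi_prime(v) * x   # update rule from the slide
print(w_new)
```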
40
The Backpropagation Algorithm (contd. . . )
Local Gradient
δj(n) = −∂E(n)/∂vj(n)
      = −(∂E(n)/∂ej(n)) (∂ej(n)/∂yj(n)) (∂yj(n)/∂vj(n))
      = ej(n) ϕ′j(vj(n))
41
BPA: Case 1: Neuron j is an output node
42
BPA: Case 1: Neuron j is an output node (Cont. . . )
I Update rule
wji(n + 1) = wji(n) − η ∂E(n)/∂wji(n)
           = wji(n) + η ej(n) ϕ′(vj(n)) xi(n)
where ej(n) ϕ′(vj(n)) = δj(n) is the local gradient.
43
BPA: Case 2: Neuron j is a hidden node
I TRICK
45
BPA: Case 2: Neuron j is a hidden node (contd. . . )
Strategy
I First compute the local gradient δj(n) for the j-th hidden neuron (how, we will see...)
I Then use an update similar to that of the output neuron.
Note: If this had been an output neuron, we would have had δj(n) = ej(n) ϕ′j(vj(n)).
47
BPA: Case 2: Neuron j is a hidden node (contd. . . )
48
BPA: Case 2: Neuron j is a hidden node (contd. . . )
I We have ∂ek(n)/∂vk(n) = −ϕ′k(vk(n))
I We have vk(n) = Σ_{l=1}^{m} wkl(n) yl(n)
I Hence ∂vk(n)/∂yj(n) = wkj(n)
49
BPA: Case 2: Neuron j is a hidden node (contd. . . )
∂E(n)/∂yj(n) = − Σ_{k∈C} ek(n) ϕ′k(vk(n)) wkj(n)
             = − Σ_{k∈C} δk(n) wkj(n)
50
BPA: Case 2: Neuron j is a hidden node (contd. . . )
We have ∂E(n)/∂yj(n) = − Σ_k δk(n) wkj(n)
and ∂yj(n)/∂vj(n) = ϕ′j(vj(n))
Hence, δj(n) = ϕ′j(vj(n)) Σ_k δk(n) wkj(n)
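A minimal sketch of this backward recursion for one hidden neuron (the deltas and weights of the next layer are made-up numbers; a logistic activation with a = 1 is assumed):

```python
# Sketch of the hidden-neuron local gradient:
# delta_j = phi'(v_j) * sum_k delta_k * w_kj
import numpy as np

def phi_prime(v):                    # derivative of the logistic function
    s = 1.0 / (1.0 + np.exp(-v))
    return s * (1.0 - s)

v_j = 0.3                                      # induced local field of hidden neuron j
delta_next = np.array([0.10, -0.05, 0.20])     # delta_k of the neurons fed by j
w_next_j = np.array([0.4, -0.6, 0.1])          # weights w_kj from j to those neurons

delta_j = phi_prime(v_j) * np.dot(delta_next, w_next_j)
print(delta_j)
```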
51
BPA: Case 2: Neuron j is a hidden node (contd. . . )
I Hence,
52
BPA: Update Rule Summary
53
BPA: Update Rule Summary
54
Online Vs Batch Learning
Batch Learning
55
Online Vs Batch Learning (Cont...)
Batch Learning
56
Online Vs Batch Learning
Online Learning
I Each adjustment to the weights is performed example by example over the training data.
I The cost function is the error obtained on each sample:
E(n) = (1/2) Σ_{j=1}^{M} ej(n)² = (1/2) Σ_{j=1}^{M} (zj(n) − yj(n))²
I The learning curve in this case is E(n) vs. epoch.
I The learning curve is significantly different from that of batch learning.
I Online learning takes advantage of redundant data (multiple copies of data).
I Online learning is simple to implement.
57
Activation Function
1 Logistic function:
ϕj(vj(n)) = 1 / (1 + exp(−a vj(n))),   a > 0
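A small sketch of the logistic activation and its derivative ϕ′(v) = a ϕ(v)(1 − ϕ(v)), which is what backpropagation actually uses (the slope a and the sample inputs below are arbitrary):

```python
# Logistic activation phi(v) = 1 / (1 + exp(-a*v)) and its derivative.
import numpy as np

def logistic(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def logistic_prime(v, a=1.0):
    s = logistic(v, a)
    return a * s * (1.0 - s)          # phi'(v) = a * phi(v) * (1 - phi(v))

v = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(logistic(v))        # saturates towards 0 and 1
print(logistic_prime(v))  # largest at v = 0, near zero in the saturated regions
```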
58
Activation Functions
ϕ(x) = 0 if x < 0
ϕ(x) = 1 if x > 0
This is useful in the case of the perceptron, which works only when the data is linearly separable.
59
Activation Functions
Sigmoid Function
61
Activation Functions (contd. . . )
62
Activation Functions (contd. . . )
63
I Shares some disadvantages of sigmoid.
Activation Functions (contd. . . )
ReLU(x) = max(0, x)
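A small sketch of ReLU and its (sub)derivative; taking the derivative to be 0 at x = 0 is a common convention, not something stated on the slide:

```python
# ReLU(x) = max(0, x) and its (sub)derivative.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_prime(x):
    return (x > 0).astype(float)     # 1 for x > 0, 0 otherwise (convention at 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_prime(x))  # [0. 0. 0. 1. 1.]
```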
64
Activation Functions (contd. . . )
I Is it non-linear? Yes.
I A linear function satisfies f(x + y) = f(x) + f(y),
but ϕ(−1) + ϕ(+1) = 0 + 1 = 1 ≠ ϕ(−1 + 1) = ϕ(0) = 0.
65
Activation Functions (contd. . . )
ReLU (contd...)
I It is an unbounded function.
I Changing from sigmoid to ReLU as the activation function in hidden layers is possible - hidden neurons need not have bounded values.
I The issue with the sigmoid function is that its derivative is very small (near zero) everywhere except near 0.
66
Activation Functions (contd. . . )
ReLU (contd...)
67
Activation Functions (contd. . . )
ReLU (contd...)
I The reason is that, in the case of the sigmoid, ϕ′(·) is always less than 1, with most values being close to 0.
=⇒ This imbalance in gradient magnitude makes it difficult to change the parameters of the neural network with stochastic gradient descent.
68
Activation Functions (contd. . . )
ReLU (contd...)
I This problem can be mitigated by using the rectified linear activation function, because the derivative of ReLU can have many non-zero values.
=⇒ This in turn means that the magnitude of the gradient is more balanced throughout the network.
I Dying ReLU: a neuron in the network is permanently dead due to its inability to fire in the forward pass.
=⇒ When the activation is zero in the forward pass, all the incoming weights get zero gradient.
=⇒ In backpropagation, the weights of such neurons never get updated.
I Using ReLU in RNNs can blow up the computations to infinity, as activations are not bounded.
Rate of Learning
70
Rate of Learning (contd. . . )
71
Stopping Criteria
72
Stopping Criteria (contd. . . )
I The Euclidean norm of ∂E/∂w reaches a sufficiently small gradient threshold.
∵ The necessary condition for w∗ to be a local or global minimum is ∂E/∂w |_{w=w∗} = 0.
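A minimal sketch of this stopping test inside a generic gradient-descent loop (grad_E, w0, the learning rate and the threshold are all placeholders):

```python
# Sketch: stop training when ||dE/dw|| falls below a small threshold.
# grad_E is a placeholder for whatever computes the gradient of the error.
import numpy as np

def train(grad_E, w0, eta=0.1, tol=1e-6, max_steps=10000):
    w = np.array(w0, dtype=float)
    for _ in range(max_steps):
        g = grad_E(w)
        if np.linalg.norm(g) < tol:      # Euclidean norm small enough: stop
            break
        w = w - eta * g                  # plain gradient-descent step
    return w

# Toy example: E(w) = ||w||^2 / 2, so dE/dw = w and the minimum is at 0.
print(train(lambda w: w, w0=[1.0, -2.0]))
```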
73
Summary of Backpropagation Learning Algorithm
I η : learning rate
I δj (n) : local gradient of neuron j
I xi (n) : i-th input to neuron j
Local gradient:
74
XOR Problem
0 ⊕ 0 = 0
0 ⊕ 1 = 1
1 ⊕ 0 = 1
1 ⊕ 1 = 0
I The classes {(0,0), (1,1)} and {(0,1), (1,0)} are not linearly separable. Hence a single-layer neural network cannot solve this problem.
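A small sketch showing that one hidden layer suffices: a 2-2-1 network with step activations and hand-picked weights computes XOR (the particular weights below are just one choice that works):

```python
# Sketch: a 2-2-1 feed-forward network with step activations computes XOR.
# The weights are hand-picked; many other choices work as well.
import numpy as np

def step(v):
    return (v > 0).astype(float)

W1 = np.array([[1.0, 1.0],     # hidden unit 1 ~ OR(x1, x2)
               [1.0, 1.0]])    # hidden unit 2 ~ AND(x1, x2)
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -1.0])     # output ~ OR AND (NOT AND) = XOR
b2 = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x, dtype=float) + b1)
    y = step(W2 @ h + b2)
    print(x, int(y))           # prints 0, 1, 1, 0
```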
75
XOR Problem (contd. . . )
76
XOR Problem (contd. . . )
77
XOR Problem (contd. . . )
78
XOR Problem (contd. . . )
79
XOR Problem (contd. . . )
80
Universal Approximation Theorem
81
Autoencoders
Introduction
82
Compression
The process...
I Transmit
84
Autoencoder
J(W1 , b1 , W2 , b2 ) = Σ_{n=1}^{N} (x̃n − xn)²
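A minimal sketch of a one-hidden-layer autoencoder and this reconstruction cost (the sizes, random weights and data are made up; sigmoid activations are assumed):

```python
# Sketch of a one-hidden-layer autoencoder: encode with (W1, b1), decode with
# (W2, b2), and sum the squared reconstruction error over the samples.
import numpy as np

rng = np.random.default_rng(0)
d, h, N = 8, 3, 5                      # input dim, code dim, number of samples
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(d, h)), np.zeros(d)
X = rng.normal(size=(N, d))            # toy data, one sample per row

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def reconstruct(x):
    code = sigmoid(W1 @ x + b1)        # encoder
    return sigmoid(W2 @ code + b2)     # decoder

# J(W1, b1, W2, b2) = sum_n ||x~_n - x_n||^2
J = sum(np.sum((reconstruct(x) - x) ** 2) for x in X)
print(J)
```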
85
Autoencoder
86
Convolutional Neural Networks
Convolutional Neural Network(Introduction)
87
Biological Connection
88
CNN
90
CNN
91
Convolution
r∗ = Σ_{i=1}^{5} Σ_{j=1}^{5} aij bij
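A small sketch of this single filter response: the element-wise product of a 5×5 image patch a and a 5×5 filter b, summed over all 25 entries (both arrays are random placeholders):

```python
# Sketch: the response at one position is the element-wise product of a
# 5x5 patch and a 5x5 filter, summed up.
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(5, 5)).astype(float)   # 5x5 image patch (toy)
b = rng.integers(0, 2, size=(5, 5)).astype(float)   # 5x5 filter (toy)

r = np.sum(a * b)           # r* = sum_{i,j} a_ij * b_ij
print(r)
```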
Convolution Layer (cont....)
I Example:
94
Convolution Layer (cont....)
Stride: The stride is the size of the shift of the filter across the image (previously we kept the stride as 1).
Ex.
95
Convolution Layer (contd. . . )
I Size of output = (Size of input − Size of filter) / Stride + 1
Convolution with stride
The filter is moved along the image, and at each position the dot product is computed.
97
Convolution Layer (contd. . . )
98
Convolution Layer (contd. . . )
In general,
size of output = (size of input − size of filter + 2 × padding) / stride + 1
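A small helper implementing this formula (the example numbers are arbitrary):

```python
# Output size of a convolution layer:
# out = (in - filter + 2 * padding) / stride + 1
def conv_output_size(in_size, filter_size, padding=0, stride=1):
    out = (in_size - filter_size + 2 * padding) / stride + 1
    assert out == int(out), "filter does not fit evenly with this stride/padding"
    return int(out)

print(conv_output_size(32, 5))                       # 28
print(conv_output_size(32, 3, padding=1, stride=1))  # 32 ("same" padding)
print(conv_output_size(32, 2, stride=2))             # 16
```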
99
Rectified Linear Unit OR ReLU
Activation functions
100
Pooling or Down Sampling Layer
Max pooling
I Max Pooling
I Average Pooling
I L2-norm Pooling
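A minimal sketch of 2×2 max pooling with stride 2 (the input below is a toy 4×4 feature map):

```python
# Sketch: 2x2 max pooling with stride 2 over a toy 4x4 feature map.
import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 6, 1, 1],
              [0, 2, 5, 7],
              [1, 0, 3, 2]], dtype=float)

pool, stride = 2, 2
out = np.zeros((x.shape[0] // stride, x.shape[1] // stride))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = x[i*stride:i*stride+pool, j*stride:j*stride+pool].max()
print(out)   # [[6. 2.]
             #  [2. 7.]]
```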
Pooling or Down Sampling Layer (contd. . . )
I Advantages
I Controls overfitting
102
Dropout Layer
I Helps as a Regularizer.
103
Architecture of LeNet-5 (LeCun et al., 1998)
Architecture of LeNet-5
105
Some popular CNNs (contd. . . )
I Error of 11.2% (ZF Net).
I More of a fine-tuning of AlexNet.
I Provided visualizations that gave better intuitions.
I ZF Net was trained using 1.3 million images.
106
Some popular CNNs (contd. . . )
107
Some popular CNNs (contd. . . )
I Error of 7.3% (VGG Net).
I 19 weight layers (16 convolutional + 3 fully connected), 3×3 filters with padding of 1, max pooling with stride of 2.
I Trained for two to three weeks.
I Error of 6.7% (GoogLeNet).
I 22-layer CNN.
I Uses Inception modules.
108
Some popular CNNs (contd. . . )
I Residual blocks.
I 152 layers.
109
Recurrent Neural Networks
What have we been doing so far?
110
What have we been doing so far? (cont. . . )
111
RNN (Introduction)
114
Vanishing Gradient Problem
116
Why are RNNs really useful for sequential data? (cont. . . )
117
Why are RNNs really useful for sequential data? (cont. . . )
120