NN 07
Agenda
• Convolution Layer
• ReLU
• Pooling Layers
• Fully Connected Layer & Classification
• Training
• Dropout
• Neural Networks in Practice: Mini-batches
• Batch Norm Layer
CNN
● A CNN consists of an input and an output layer, as well as multiple hidden
layers. The hidden layers of a CNN typically consist of convolution layers,
pooling layers, fully connected layers and normalization layers.
Convolution Layer
• Connect neurons only to local receptive fields.
• The filter must have the same depth as the input.
• Convolve the filter with the image, i.e. "slide over the image spatially,
computing dot products".
• 1 number: the result of taking a dot product between the filter and a small
5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias).
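A minimal numeric sketch of this (my own, not from the slides), computing one output value as a 75-dimensional dot product plus a bias:

```python
import numpy as np

# One output value of a convolution: dot product between a 5x5x3 filter and a
# 5x5x3 chunk of the image, plus a bias (5*5*3 = 75 multiplications).
image = np.random.rand(32, 32, 3)     # input volume
filt = np.random.rand(5, 5, 3)        # filter with the same depth as the input
bias = 0.1

chunk = image[0:5, 0:5, :]            # one local 5x5x3 receptive field
value = np.sum(chunk * filt) + bias   # 1 number
print(value)
```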
Convolution Layer
• For example, if we had 6 5x5 filters, we'll get 6 separate activation maps.
Higher-level features
• In general, the more convolution steps we have,
the more complicated features our network will
be able to learn to recognize.
• In Image Classification, a ConvNet may learn to
detect edges from raw pixels in the first layer,
then use the edges to detect simple shapes in
the second layer, and then use these shapes to
detect higher-level features, such as facial
shapes in higher layers.
Example 1
Input volume: 32x32x3
Receptive fields FxF: 5x5, stride 1
Number of filters: 10
Before:
Number of weights in such a layer: (32*32*3) * 10 * 76 = 30720 * 76 ~= 2.3 million :\
Example 1, cont.
Input volume: 32x32x3
Receptive fields FxF: 5x5, stride 2
Number of filters: 10
Example 1, cont.
Input volume: 32x32x3
Receptive fields FxF: 5x5, stride 3
Number of filters: 10
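The spatial output size of a convolution is (W − F + 2P)/S + 1. A quick check (my own sketch, not from the slides) of the three stride choices above on a 32x32 input with a 5x5 filter and no padding:

```python
# Output size of a convolution along one spatial dimension: (W - F + 2P)/S + 1.
def conv_output_size(W, F, S, P=0):
    return (W - F + 2 * P) / S + 1

for stride in (1, 2, 3):
    size = conv_output_size(W=32, F=5, S=stride)
    fits = size.is_integer()
    print(f"stride {stride}: output size {size} -> {'fits' if fits else 'does not fit'}")
# stride 1: 28.0 -> fits; stride 2: 14.5 -> does not fit; stride 3: 10.0 -> fits
```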
Example 2
Assume a 32x32x3 input image.
If we had 30 filters with receptive fields 5x5, applied with stride 1 and pad 2:
=> output volume: [32x32x30] (32*32*30 = 30720 neurons)
Each neuron has 5*5*3 + 1 (= 76) weights.
=> Number of weights in the layer (with parameter sharing): (30 * 75) + 30 = 2280
(75 weights per filter, plus 30 biases, one for each filter).
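A one-line check of this count (my own sketch): with parameter sharing, each filter contributes F*F*depth weights plus one bias.

```python
# Parameters of a conv layer with parameter sharing: per filter, F*F*depth weights + 1 bias.
def conv_params(num_filters, F, in_depth):
    return num_filters * (F * F * in_depth + 1)

print(conv_params(num_filters=30, F=5, in_depth=3))  # 30 * 76 = 2280
```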
ReLU
• It's common to apply a linear rectification nonlinearity: yᵢ = max(zᵢ, 0).
Why might we do this?
• Convolution is a linear operation.
• Therefore, we need a non-linearity, otherwise 2 convolution layers would be no
more powerful than 1.
• ReLU is typically applied after every convolution operation.
• Advantages:
• Sigmoid: activations are bounded, so they do not blow up.
• ReLU: does not suffer from vanishing gradients (for positive inputs).
• ReLU: more computationally efficient than the Sigmoid function, since ReLU
just needs to pick max(0, x) and does not perform the expensive exponential
operations used in Sigmoids.
• ReLU: in practice, networks with ReLU tend to show better convergence than
those with Sigmoid (Krizhevsky et al.).
• Disadvantages:
• Sigmoid: tends to suffer from vanishing gradients.
• ReLU: activations can blow up (there is no mechanism to constrain the output
of the neuron, as "a" itself is the output).
• ReLU: the dying ReLU problem - if too many pre-activations fall below zero,
most of the ReLU units in the network will simply output zero, in other words
die, thereby preventing learning. (This can be handled, to some extent, by using
Leaky ReLU instead; a short sketch follows below.)
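A minimal sketch (my own) of ReLU and Leaky ReLU; the small negative slope of Leaky ReLU keeps units from outputting exactly zero for every input:

```python
import numpy as np

# ReLU: max(0, x). Leaky ReLU: keep a small slope alpha for negative inputs so
# units cannot die completely (alpha = 0.01 is an assumed, commonly used value).
def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # [0.    0.    0.    1.5 ]
print(leaky_relu(z))  # [-0.02  -0.005  0.     1.5 ]
```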
Pooling Layers
• A pooling layer is another building block of a CNN.
• These layers reduce the spatial dimensionality of each feature map (but not
its depth), which reduces the amount of parameters and computation in the
network, and build in invariance to small transformations.
MAX Pooling
• Pooling retains the most important information.
• Spatial pooling can be of different types: Max, Average, Sum, L2 norm, weighted
average based on the distance from the central pixel, etc.
• The most common type is the max-pooling layer, which slides a window over the
input, like a normal convolution, and takes the largest value in the window as
the output.
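A minimal max-pooling sketch (my own, not from the slides), using a 2x2 window with stride 2 on a single feature map:

```python
import numpy as np

# 2x2 max pooling, stride 2, on one feature map: each output value is the
# largest value inside its 2x2 window. Applied separately to every feature map.
def max_pool_2x2(fmap):
    H, W = fmap.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H - 1, 2):
        for j in range(0, W - 1, 2):
            out[i // 2, j // 2] = fmap[i:i + 2, j:j + 2].max()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))  # [[ 5.  7.] [13. 15.]]
```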
Pooling layer
• The pooling operation is applied separately to each feature map.
Advantages of the Pooling layer
• Makes the input representations (feature dimension) smaller and more
manageable.
• Reduces the number of parameters and computations in the network, thereby
helping to control overfitting.
• Makes the network invariant to small transformations, distortions and
translations in the input image (a small distortion in the input will not change
the output of pooling, since we take the maximum / average value in a local
neighborhood).
Training
• Step 1: We initialize all filters and parameters / weights with random values.
• Step 2: Compute the convolution, ReLU and pooling operations along with
forward propagation through the fully connected layer, and find the output
probabilities for each class.
• Step 3: Calculate the total error at the output layer (summation over all 4
classes): Total Error = ∑ ½ (target probability − output probability)²
• Step 4: Use backpropagation to calculate the gradients and update the weights.
• Step 5: Repeat Steps 2 to 4 for all images in the training set (a sketch of
this loop follows below).
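A minimal sketch of Steps 1-5 (my own; the slides do not name a framework, so PyTorch and random stand-in data are assumptions, and the common cross-entropy loss is used in place of the squared-error formula above):

```python
import torch
import torch.nn as nn

# Step 1: weights are initialized randomly when the model is created.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 4))  # tiny stand-in model
criterion = nn.CrossEntropyLoss()                # error at the output layer (Step 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)               # stand-in training images
labels = torch.randint(0, 4, (8,))               # stand-in labels for 4 classes

for epoch in range(5):                           # Step 5: repeat over the training set
    scores = model(images)                       # Step 2: forward pass -> class scores
    loss = criterion(scores, labels)             # Step 3: total error
    optimizer.zero_grad()
    loss.backward()                              # Step 4: backpropagation (gradients)
    optimizer.step()                             # Step 4: weight update
    print(epoch, loss.item())
```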
Training
• Note: parameters like the number of filters, filter sizes, the architecture of
the network, etc. have all been fixed before Step 1 and do not change during the
training process; only the values of the filter matrices and connection weights
get updated.
Test
• When a new (unseen) image is input to the ConvNet, the network goes through
the forward propagation step and outputs a probability for each class (for a new
image, the output probabilities are calculated using the weights that have been
optimized to correctly classify the previous training examples).
• If our training set is large enough, the network will (hopefully) generalize
well to new images and classify them into the correct categories.
A common ConvNet architecture pattern:
[CONV-RELU-POOL]xN, [FC-RELU]xM, FC, SOFTMAX
or
[CONV-RELU-CONV-RELU-POOL]xN, [FC-RELU]xM, FC, SOFTMAX
with N >= 0, M >= 0.
Note: the last FC layer should not have a RELU - these are the class scores.
CIFAR-10 example
input: [32x32x3]
CONV with 10 3x3 filters, stride 1, pad 1: gives [32x32x10],
  new parameters: (3*3*3)*10 + 10 = 280
RELU
CONV with 10 3x3 filters, stride 1, pad 1: gives [32x32x10],
  new parameters: (3*3*10)*10 + 10 = 910
RELU
POOL with 2x2 filters, stride 2: gives [16x16x10], parameters: 0
CONV with 10 3x3 filters, stride 1, pad 1: gives [16x16x10],
  new parameters: (3*3*10)*10 + 10 = 910
RELU
CONV with 10 3x3 filters, stride 1, pad 1: gives [16x16x10],
  new parameters: (3*3*10)*10 + 10 = 910
RELU
POOL with 2x2 filters, stride 2: gives [8x8x10], parameters: 0
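A sketch of the same stack in PyTorch (my own; the slides give no code). The per-layer comments repeat the parameter counts above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3, stride=1, padding=1),   # (3*3*3)*10 + 10 = 280
    nn.ReLU(),
    nn.Conv2d(10, 10, kernel_size=3, stride=1, padding=1),  # (3*3*10)*10 + 10 = 910
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # 32x32 -> 16x16, 0 params
    nn.Conv2d(10, 10, kernel_size=3, stride=1, padding=1),  # 910
    nn.ReLU(),
    nn.Conv2d(10, 10, kernel_size=3, stride=1, padding=1),  # 910
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # 16x16 -> 8x8, 0 params
)

x = torch.randn(1, 3, 32, 32)                        # one CIFAR-10-sized input
print(model(x).shape)                                # torch.Size([1, 10, 8, 8])
print(sum(p.numel() for p in model.parameters()))    # 280 + 3*910 = 3010
```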
Neural Networks in Practice: Mini-batches
Gradient Descent
Stochastic Gradient Descent
Mini-batches while training
• Mini-batch: Only use a small portion of the training set to compute
the gradient.
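A sketch of mini-batch gradient descent (my own, on a toy least-squares problem; the data and model are stand-ins, not from the slides):

```python
import numpy as np

# Mini-batch gradient descent: each update uses only a small portion (a mini-batch)
# of the training set to compute the gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # 1000 examples, 5 features
y = X @ np.array([1.0, -2.0, 3.0, 0.0, 0.5])      # targets from a known linear model
w = np.zeros(5)                                   # parameters to learn
lr, batch_size = 0.1, 32

for epoch in range(20):
    order = rng.permutation(len(X))               # shuffle, then walk over mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient on this batch only
        w -= lr * grad                               # parameter update
print(w)  # approaches [1, -2, 3, 0, 0.5]
```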
• The range of values of raw training data often varies widely.
• In general, gradient descent converges much faster with feature scaling than
without it.
Internal covariate shift
• During training, the weights of the earlier layers keep changing, so the
distribution of the outputs of layer 'k−1' keeps changing. That output is also
the input for layer 'k'; in other words, that layer receives input data that has
a different distribution than before.
• It is now forced to learn to fit to this new input.
• As a result, each layer ends up trying to learn from a constantly shifting
input, thus taking longer to converge and slowing down training.
Common normalizations
Two methods are usually used for rescaling or normalizing data:
1. Scaling all numeric variables to the range [0,1]. One possible formula is
given below:
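For example, min-max scaling (one standard choice):

x′ = (x − x_min) / (x_max − x_min)

where x_min and x_max are the minimum and maximum values of that variable in the training data.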
Proposed Solution: Batch Normalization (BN)
• Batch Normalization (BN) is a normalization method/layer for neural networks.
• Batch Norm is a normalization technique done between the layers of a Neural
Network instead of on the raw data.
• Moreover, Batch Norm helps to stabilize these shifting distributions from one
iteration to the next, and thus speeds up training.
• Batch Normalization is a process that normalizes each scalar feature
independently, by making
  • it have a mean of zero and a variance of 1,
  • and then scaling and shifting the normalized value for each training
    mini-batch,
thus reducing internal covariate shift by fixing the distribution of the layer
inputs x as training progresses.
Batch
(Figure: a mini-batch of three examples x1, x2, x3; each is multiplied by W1 to
give z1, z2, z3, passed through a sigmoid to give a1, a2, a3, which then feed
W2 ...)
Batch normalization
(Figure: each example in the mini-batch is transformed as zi = W1·xi, and the
batch statistics are computed from z1, z2, z3.)
μ = (1/3) Σᵢ zi
σ = sqrt( (1/3) Σᵢ (zi − μ)² )
where the sums run over the examples of the mini-batch (here i = 1, 2, 3), so
μ and σ depend on the zi.
Batch normalization
ẑi = (zi − μ) / σ
(Figure: each zi is normalized to ẑi using the batch statistics μ and σ, and ẑi
is then passed through the sigmoid to give ai. Since μ and σ depend on all the
zi of the mini-batch, how do we do backpropagation through them?)
Batch normalization
• It is done along mini-batches instead of the full data set. It serves to speed
up training and allows higher learning rates, making learning easier.
• Normally, large learning rates may increase the scale of layer parameters,
which then amplifies the gradient during backpropagation and can lead to model
explosion.
• However, with Batch Normalization, backpropagation through a layer is
unaffected by the scale of its parameters.
• The output of the batch norm layer has γ and β parameters. Those parameters
are learned to best represent your activations; they provide a learnable scale
and shift: z̃i = γ ⊙ ẑi + β.
• γ scales the normalized value (setting its standard deviation) and β shifts it
(setting its mean).
Batch normalization
ẑi = (zi − μ) / σ,   z̃i = γ ⊙ ẑi + β
(Figure: each zi is normalized to ẑi using μ and σ, then scaled by γ and shifted
by β to give z̃i. μ and σ depend on the zi of the mini-batch; γ and β are learned.)
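A minimal sketch of the Batch Norm forward pass (my own, in NumPy, not from the slides):

```python
import numpy as np

# Batch-norm forward pass for one layer: normalize each feature with the
# mini-batch mean and std, then apply the learnable scale (gamma) and shift (beta).
def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    mu = Z.mean(axis=0)                  # per-feature mean over the mini-batch
    sigma = Z.std(axis=0)                # per-feature std over the mini-batch
    Z_hat = (Z - mu) / (sigma + eps)     # zero mean, unit variance
    return gamma * Z_hat + beta          # learnable scale and shift

Z = np.random.randn(3, 4) * 5 + 2        # a mini-batch of 3 examples, 4 features
gamma, beta = np.ones(4), np.zeros(4)
Z_tilde = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=0), Z_tilde.std(axis=0))   # ~0 and ~1 per feature
```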
The proposed solution: to add an extra layer.
Batch normalization: other benefits in practice
• BN reduces training times (because of less covariate shift and less
exploding/vanishing gradients) and makes very deep nets trainable.
• BN reduces the demand for regularization (for generalization), e.g. dropout or
the L2 norm.
• BN enables training with saturating nonlinearities in deep networks, e.g.
sigmoid (because the normalization prevents them from getting stuck in
saturating ranges, e.g. very high/low values for the sigmoid).