
Artificial Neural Network and Deep Learning
Lecture 7: Convolutional Neural Networks (CNN)

Agenda
• Convolution Layer
• ReLU
• Pooling Layers
• Fully Connected Layer & Classification
• Training
• Dropout
• Neural Networks in Practice: Mini-batches
• Batch Norm Layer
CNN
• A CNN consists of an input and an output layer, as well as multiple hidden
  layers. The hidden layers of a CNN typically consist of convolution layers,
  pooling layers, fully connected layers and normalization layers.

Before vs. now: all neural net activations are now arranged in 3 dimensions
(width, height, depth).
For example, a CIFAR-10 image is a 32x32x3 volume:
32 width, 32 height, 3 depth (RGB channels).

Convolution Layer
• Connect neurons only to local receptive fields.
• The filter must have the same depth as the input.

• image: 32x32x3 volume
• before (full connectivity): 32x32x3 weights per neuron
• now: one neuron connects to, e.g., a 5x5x3 chunk and only has 5x5x3 weights

• Convolve the filter with the image, i.e. "slide over the image spatially,
  computing dot products". Each position gives 1 number: the result of taking a
  dot product between the filter and a small 5x5x3 chunk of the image
  (i.e. a 5*5*3 = 75-dimensional dot product + bias).

Convolution Layer
• Input volume: 32x32x3, filter 5x5x3, stride 1
• Output size: (N - F) / stride + 1 = (32 - 5) / 1 + 1 = 28
• Produces a new mapping of the image, named an activation map or feature map,
  which is a 28x28 sheet of neuron outputs:
  1. Each is connected to a small region in the input.
  2. All of them share parameters.
• "5x5 filter" -> "5x5 receptive field for each neuron"
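To make the arithmetic concrete, here is a minimal sketch (assuming NumPy; the array names are illustrative) of one 5x5x3 filter sliding over a 32x32x3 image: each position yields one number, a 75-dimensional dot product plus a bias, and the resulting activation map is 28x28 as predicted by (N - F) / stride + 1.

```python
import numpy as np

N, F, stride = 32, 5, 1
image = np.random.rand(N, N, 3)              # 32x32x3 input volume
w = np.random.randn(F, F, 3)                 # one 5x5x3 filter
b = 0.1                                      # its bias

out_size = (N - F) // stride + 1             # (32 - 5)/1 + 1 = 28
activation_map = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        chunk = image[i:i+F, j:j+F, :]       # local 5x5x3 receptive field
        activation_map[i, j] = np.sum(chunk * w) + b   # one number: 75-dim dot product + bias

print(activation_map.shape)                  # (28, 28)
```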

Convolution Layer
• Consider a second filter.
• Each filter focuses on specific patterns in the image (e.g. vertical edges,
  horizontal edges, color, etc.) and produces a new mapping of the image named
  an activation map or feature map.
• For example, if we had 6 5x5 filters, we'll get 6 separate activation maps.
• E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3D grid
  (28x28x5). There will be 5 different neurons all looking at the same region
  in the input volume.
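A quick way to see the "6 separate activation maps" is to build a convolution layer with 6 filters in a framework; this sketch assumes PyTorch and random weights, so it only illustrates the shapes, not learned features.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1)
x = torch.randn(1, 3, 32, 32)     # one 32x32 RGB image (batch, channels, height, width)
maps = conv(x)
print(maps.shape)                 # torch.Size([1, 6, 28, 28]): 6 activation maps, each 28x28
```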

Preview: a ConvNet is a sequence of convolution layers, interspersed with
activation functions.

• First CONV output size: (32 - 5) / 1 + 1 = 28
• Second CONV output size: (28 - 5) / 1 + 1 = 24

Higher-level features
• In general, the more convolution steps we have,
the more complicated features our network will
be able to learn to recognize.
• In Image Classification, a ConvNet may learn to
detect edges from raw pixels in the first layer,
then use the edges to detect simple shapes in
the second layer, and then use these shapes to
detect higher-level features, such as facial
shapes in higher layers.


Example 1
Input volume: 32x32x3
Receptive fields FxF: 5x5, stride 1
Number of filters: 10

• Output size: (N - F) / stride + 1
  Output volume size: (32 - 5) / 1 + 1 = 28, so: 28x28x10

• Number of parameters in this layer?
  Each filter has 5*5*3 + 1 = 76 params (+1 for the bias).
  Now (CNN, with parameter sharing): 76*10 = 760

  Before (full connectivity, no parameter sharing), the number of weights in
  such a layer would be (32*32*3)*10*76 = 30720*76 ≈ 2.3 million :\

Example 1, cont.
Input volume: 32x32x3
Receptive fields FxF: 5x5, stride 2
Number of filters: 10

• Output volume size: ? It cannot be computed: (32 - 5) / 2 + 1 = 14.5, so a
  5x5 filter with stride 2 does not fit evenly across this input. :\


Example 1, cont.
Input volume: 32x32x3
Receptive fields FxF: 5x5, stride 3
Number of filters: 10

• Output volume size: (32 - 5) / 3 + 1 = 10, so: 10x10x10

• Number of parameters in this layer?
  Each filter has 5*5*3 + 1 = 76 params (+1 for the bias), so 76*10 = 760
  (unchanged: the parameter count does not depend on the stride).

Example 2
Assume an input 32x32x3 image.
If we had 30 filters with receptive fields 5x5, applied with stride 1 and pad 2:
=> output volume: [32x32x30] (32*32*30 = 30720 neurons)
Each neuron has 5*5*3 + 1 = 76 weights (including its bias).

=> Number of weights in the layer: (30 * 75) + 30 = 2280 (+30 biases, one for
each filter, thanks to parameter sharing).


Output volume size
• An input volume of size [W1 x H1 x D1], using K filters with receptive fields
  F x F applied at a stride of S, gives an output volume of size [W2 x H2 x D2],
  where
  W2 = (W1 - F)/S + 1
  H2 = (H1 - F)/S + 1
  D2 = K
• F*F*D1 weights per filter, for a total of (F*F*D1)*K weights and K biases.
  A small calculator for these formulas (including padding) is sketched below.
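The examples above can be checked with two small helper functions; this is a plain-Python sketch, and conv_output_size / conv_weights are illustrative names. The padded form (W - F + 2P)/S + 1 is the standard generalization needed for Example 2.

```python
def conv_output_size(W, F, S, P=0):
    """Spatial output size, or None if the filter does not fit evenly."""
    size = (W - F + 2 * P) / S + 1
    return int(size) if size == int(size) else None

def conv_weights(F, D1, K):
    """F*F*D1 weights per filter, K filters, plus K biases."""
    return F * F * D1 * K + K

print(conv_output_size(32, 5, 1))        # 28   (Example 1)
print(conv_output_size(32, 5, 2))        # None: (32 - 5)/2 + 1 = 14.5 does not fit
print(conv_output_size(32, 5, 3))        # 10   (Example 1, cont.)
print(conv_weights(5, 3, 10))            # 760  parameters
print(conv_output_size(32, 5, 1, P=2))   # 32   (Example 2, pad 2)
print(conv_weights(5, 3, 30))            # 2280 = 30*75 weights + 30 biases
```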

ReLU
• It is common to apply a linear rectification nonlinearity: y_i = max(z_i, 0).
• Why might we do this?
  • Convolution is a linear operation.
  • Therefore, we need a non-linearity; otherwise 2 convolution layers would be
    no more powerful than 1.
• ReLU is typically used after every convolution operation.

What are the advantages of ReLU over the sigmoid function in deep neural networks?
ReLU is h(a) = max(0, a), where a = Wx + b.

1. ReLU is more computationally efficient than the sigmoid function, since it
   just needs to pick max(0, a) and does not perform the expensive exponential
   operations used in sigmoids.
2. In practice, networks with ReLU tend to show better convergence performance
   than sigmoid.
3. Sigmoid tends to suffer from vanishing gradients (because there is a
   mechanism that reduces the gradient as "u" increases, where "u" is the input
   of the sigmoid function). The gradient of the sigmoid is
   S'(u) = S(u)(1 - S(u)); when u grows infinitely large,
   S'(u) = S(u)(1 - S(u)) = 1 × (1 - 1) = 0.
   ReLU does not have this vanishing gradient (reduced likelihood of vanishing
   gradient).

• Advantages:
  • Sigmoid: does not blow up the activation.
  • ReLU: no vanishing gradient.
  • ReLU: more computationally efficient to compute than sigmoid functions,
    since ReLU just needs to pick max(0, x) and does not perform the expensive
    exponential operations used in sigmoids.
  • ReLU: in practice, networks with ReLU tend to show better convergence
    performance than sigmoid. (Krizhevsky et al.)
• Disadvantages:
  • Sigmoid: tends to have vanishing gradients.
  • ReLU: tends to blow up the activation (there is no mechanism to constrain
    the output of the neuron, as "a" itself is the output).
  • ReLU: dying ReLU problem - if too many activations fall below zero, then
    most of the units (neurons) in a network with ReLU will simply output zero,
    in other words die, thereby prohibiting learning. (This can be handled, to
    some extent, by using Leaky ReLU instead.)
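A small numerical sketch (assuming NumPy) of the points above: the sigmoid gradient S'(u) = S(u)(1 - S(u)) collapses toward zero for large |u|, the ReLU gradient stays at 1 for positive inputs, and Leaky ReLU keeps a small slope for negative inputs to mitigate the dying-ReLU problem.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def relu(u):
    return np.maximum(0.0, u)

def leaky_relu(u, alpha=0.01):
    return np.where(u > 0, u, alpha * u)       # small slope for u < 0 instead of exactly zero

u = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigmoid_grad = sigmoid(u) * (1 - sigmoid(u))   # -> ~0 at both extremes (vanishing gradient)
relu_grad = (u > 0).astype(float)              # 0 for u <= 0, 1 for u > 0 (no vanishing)

print(sigmoid_grad)      # [~4.5e-05  0.197  0.25  0.197  ~4.5e-05]
print(relu_grad)         # [0. 0. 0. 1. 1.]
print(relu(u))           # [ 0.  0.  0.  1. 10.]
print(leaky_relu(u))     # [-0.1  -0.01  0.  1.  10.]
```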

Pooling Layers
• A pooling layer is another building block of a CNN.
• These layers reduce the spatial dimensionality of each feature map (but not
  its depth), which reduces the amount of parameters and computation in the
  network, and they build in invariance to small transformations.

MAX Pooling
• Pooling retains the most important information.
• Spatial pooling can be of different types: max, average, sum, L2 norm,
  weighted average based on the distance from the central pixel, etc.
• The most common type is the max-pooling layer, which slides a window, like a
  normal convolution, and takes the biggest value in the window as the output
  (see the sketch below).

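A minimal sketch (assuming NumPy; max_pool_2x2 is an illustrative helper) of max pooling with a 2x2 window and stride 2 on a single 4x4 feature map: each output value is simply the largest value in its window.

```python
import numpy as np

def max_pool_2x2(fmap):
    H, W = fmap.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            out[i // 2, j // 2] = fmap[i:i+2, j:j+2].max()   # biggest value in the 2x2 window
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]], dtype=float)
print(max_pool_2x2(fmap))
# [[6. 5.]
#  [7. 9.]]
```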

Pooling layer
• The pooling operation is applied separately to each feature map.
• An input volume of size [W1 x H1 x D1], with pooling unit receptive fields
  F x F applied at a stride of S, gives an output volume of size [W2 x H2 x D1],
  where
  W2 = (W1 - F)/S + 1
  H2 = (H1 - F)/S + 1

Advantage of Pooling layer
• Makes the input representations (feature dimension) smaller and more
manageable.
• Reduces the number of parameters and computations in the network, therefore
controlling overfitting.
• Makes the network invariant to small transformations, distortions and
translations in the input image (a small distortion in input will not change the
output of Pooling – since we take the maximum / average value in a local
neighborhood).


Fully Connected Layer & Classification
• The Fully Connected layer is a traditional Multilayer Perceptron (MLP) that
  uses a Softmax activation function in the output layer.
• The outputs of the convolutional and pooling layers represent high-level
  features of the input image. The purpose of the Fully Connected layer is to
  use these features to classify the input image into various classes based on
  the training dataset.
• Apart from classification, adding a fully-connected layer is also a (usually)
  cheap way of learning non-linear combinations of these features. Most of the
  features from the convolutional and pooling layers may be good for the
  classification task, but combinations of those features might be even better.
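A sketch (assuming PyTorch) of such a classification head: the last pooled volume is flattened and passed through a fully connected layer followed by a softmax, giving one probability per class. The 10x8x8 input shape is borrowed from the CIFAR-10 example later in this lecture.

```python
import torch
import torch.nn as nn

features = torch.randn(1, 10, 8, 8)        # output of the last pooling layer
head = nn.Sequential(
    nn.Flatten(),                          # 10*8*8 = 640 features
    nn.Linear(640, 10),                    # class scores for 10 classes
    nn.Softmax(dim=1),                     # class probabilities
)
probs = head(features)
print(probs.shape, float(probs.sum()))     # torch.Size([1, 10]) ~1.0
```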

Training
• Step 1: Initialize all filters and parameters/weights with random values.
• Step 2: Compute the convolution, ReLU and pooling operations along with
  forward propagation in the fully connected layer, and find the output
  probabilities for each class.
• Step 3: Calculate the total error at the output layer (summation over all
  classes - 4 in this example):
  Total Error = ∑ ½ (target probability – output probability)²
• Step 4: Use backpropagation to calculate the gradients and update the weights.
• Step 5: Repeat Steps 2-4 with all images in the training set.
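A minimal training-loop sketch (assuming PyTorch; the tiny model and the 4-class random data are placeholders) following Steps 1-5. The squared-error loss over class probabilities mirrors the formula above; in practice a cross-entropy loss on the class scores is more common.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                              # Step 1: random filters and weights
    nn.Conv2d(3, 10, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(10 * 16 * 16, 4), nn.Softmax(dim=1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)                  # stand-in training images
targets = torch.eye(4)[torch.randint(0, 4, (8,))]   # one-hot target probabilities

for epoch in range(5):                              # Step 5: repeat over the training set
    probs = model(images)                           # Step 2: conv/ReLU/pool + FC forward pass
    loss = 0.5 * ((targets - probs) ** 2).sum()     # Step 3: total error
    optimizer.zero_grad()
    loss.backward()                                 # Step 4: backpropagate gradients
    optimizer.step()                                #         and update the weights
    print(epoch, float(loss))
```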

Training
• Note: parameters like the number of filters, filter sizes, and the
  architecture of the network have all been fixed before Step 1 and do not
  change during the training process; only the values of the filter matrices
  and connection weights get updated.

Test
• When a new (unseen) image is input into the ConvNet, the network
would go through the forward propagation step and output a
probability for each class (for a new image, the output probabilities
are calculated using the weights which have been optimized to
correctly classify all the previous training examples).
• If our training set is large enough, the network will
(hopefully) generalize well to new images and classify them into
correct categories.


Typical ConvNets look like:

[CONV-RELU-POOL]xN,[FC-RELU]xM,FC,SOFTMAX
or
[CONV-RELU-CONV-RELU-POOL]xN,[FC-RELU]xM,FC,SOFTMAX
where N >= 0, M >= 0.

Note: the last FC layer should not have a RELU - these are the class scores.
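A sketch (assuming PyTorch) of a builder for the [CONV-RELU-POOL]xN, [FC-RELU]xM, FC pattern; make_convnet and all channel/width choices are illustrative, and the softmax is assumed to be applied to the final class scores by the loss function.

```python
import torch.nn as nn

def make_convnet(N=2, M=1, in_ch=3, ch=10, num_classes=10, in_size=32):
    layers, c, s = [], in_ch, in_size
    for _ in range(N):                                    # [CONV-RELU-POOL] x N
        layers += [nn.Conv2d(c, ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        c, s = ch, s // 2
    layers.append(nn.Flatten())
    features = c * s * s
    for _ in range(M):                                    # [FC-RELU] x M
        layers += [nn.Linear(features, 64), nn.ReLU()]
        features = 64
    layers.append(nn.Linear(features, num_classes))       # last FC: class scores, no ReLU
    return nn.Sequential(*layers)

print(make_convnet(N=2, M=1))
```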


CIFAR-10 example
input: [32x32x3]
CONV with 10 3x3 filters, stride 1, pad 1: gives [32x32x10],
  new parameters: (3*3*3)*10 + 10 = 280
RELU
CONV with 10 3x3 filters, stride 1, pad 1: gives [32x32x10],
  new parameters: (3*3*10)*10 + 10 = 910
RELU
POOL with 2x2 filters, stride 2: gives [16x16x10], parameters: 0
CONV with 10 3x3 filters, stride 1, pad 1: gives [16x16x10],
  new parameters: (3*3*10)*10 + 10 = 910
RELU
CONV with 10 3x3 filters, stride 1, pad 1: gives [16x16x10],
  new parameters: (3*3*10)*10 + 10 = 910
RELU
POOL with 2x2 filters, stride 2: gives [8x8x10], parameters: 0
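The same architecture written out as a sketch (assuming PyTorch; padding 1 is taken as implied by the [16x16x10] outputs in the second half). Summing the parameters reproduces the per-layer counts 280 + 910 + 910 + 910 = 3010.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 10, 3, stride=1, padding=1), nn.ReLU(),   # [32x32x10], 280 params
    nn.Conv2d(10, 10, 3, stride=1, padding=1), nn.ReLU(),  # [32x32x10], 910 params
    nn.MaxPool2d(2, stride=2),                              # [16x16x10], 0 params
    nn.Conv2d(10, 10, 3, stride=1, padding=1), nn.ReLU(),  # [16x16x10], 910 params
    nn.Conv2d(10, 10, 3, stride=1, padding=1), nn.ReLU(),  # [16x16x10], 910 params
    nn.MaxPool2d(2, stride=2),                              # [8x8x10],   0 params
)
x = torch.randn(1, 3, 32, 32)
print(net(x).shape)                                         # torch.Size([1, 10, 8, 8])
print(sum(p.numel() for p in net.parameters()))             # 3010
```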

Neural Networks in Practice: Mini-batches

Cost function
• The error function (cost function) is minimized by moving from the current
  solution in the direction of the negative of the gradient.
• The cost function often decomposes as a sum of per-sample loss functions.
• As the training set size grows to billions, the time taken for a single
  gradient step becomes prohibitively long.

Gradient Descent

[Figure: gradient descent illustration - © Alexander Amini and Ava Soleimany,
MIT 6.S191: Introduction to Deep Learning, IntroToDeepLearning.com]

Stochastic Gradient Descent

[Figures: stochastic gradient descent illustrations - © Alexander Amini and
Ava Soleimany, MIT 6.S191: Introduction to Deep Learning, IntroToDeepLearning.com]

Mini-batches while training
• Mini-batch: only use a small portion of the training set to compute the
  gradient.
• Common mini-batch sizes are ~100 examples.
• Compared with single-example SGD, this gives a more accurate estimation of
  the gradient:
  • smoother convergence;
  • allows for larger learning rates.
• Mini-batches lead to fast training:
  • computation can be parallelized, achieving significant speed-ups on GPUs.
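A sketch (assuming NumPy) of mini-batch gradient descent on a simple linear least-squares problem: each step estimates the gradient from a random batch of 100 examples instead of the full 10,000-sample training set. The data and model are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                                  # training inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + rng.normal(0, 0.1, size=10_000)                  # noisy targets

w, lr, batch_size = np.zeros(5), 0.1, 100
for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)      # pick a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size                  # gradient of the batch MSE
    w -= lr * grad                                                # gradient step
print(w)                                                          # close to true_w
```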

Batch Norm layer - Motivation
• The range of values of raw training data often varies widely.
• In general, gradient descent converges much faster with feature scaling than
  without it.

Internal covariate shift
• As training progresses, the parameters of layer 'k-1' are updated, so the
  distribution of its outputs changes. That output is also the input for layer
  'k'. In other words, that layer receives input data that has a different
  distribution than before.
• It is now forced to learn to fit to this new input.
• As we can see, each layer ends up trying to learn from a constantly shifting
  input, thus taking longer to converge and slowing down the training.

Common normalizations
Two methods are usually used for rescaling or normalizing data:

1. Scaling all numeric variables to the range [0,1]. One possible formula is:
   x' = (x - min(x)) / (max(x) - min(x))

2. Transforming the data to have zero mean and unit variance:
   x' = (x - mean(x)) / std(x)

• In the NN community this is called whitening.
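Both rescaling methods in a minimal sketch (assuming NumPy; the data values are arbitrary):

```python
import numpy as np

x = np.array([3.0, 10.0, 25.0, 40.0, 100.0])

minmax = (x - x.min()) / (x.max() - x.min())    # method 1: values scaled to [0, 1]
zscore = (x - x.mean()) / x.std()               # method 2: zero mean, unit variance

print(minmax)
print(round(zscore.mean(), 6), round(zscore.std(), 6))   # ~0.0 and 1.0
```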

Proposed Solution: Batch Normalization (BN)
• Batch Normalization (BN) is a normalization method/layer for neural networks.
• Batch Norm is a normalization technique done between the layers of a Neural
  Network instead of on the raw data.
• Moreover, Batch Norm helps to stabilize these shifting distributions from one
  iteration to the next, and thus speeds up training.
• Batch Normalization is a process that normalizes each scalar feature
  independently, by making it have a mean of zero and a variance of 1, and then
  scales and shifts the normalized value, for each training mini-batch, thus
  reducing internal covariate shift by fixing the distribution of the layer
  inputs x as the training progresses.

Batch

[Diagram: a mini-batch of three inputs x^1, x^2, x^3; each passes through W^1
to give z^1, z^2, z^3, then a Sigmoid to give a^1, a^2, a^3, then W^2, ...]

Batch normalization

[Diagram: x^1, x^2, x^3 pass through W^1 to give z^1, z^2, z^3; the batch mean
and standard deviation are computed from these.]

μ = (1/3) Σ_{i=1..3} z^i
σ = sqrt( (1/3) Σ_{i=1..3} (z^i - μ)² )

Note: batch normalization cannot be applied on a small batch.
μ and σ depend on z^i.

Batch normalization

z̃^i = (z^i - μ) / σ

[Diagram: each z^i is normalized to z̃^i using μ and σ, then passed through the
Sigmoid to give a^i.]

μ and σ depend on z^i. How do we do backpropagation through them?

Batch normalization
• It is done along mini-batches instead of the full data set. It serves to
  speed up training and allows higher learning rates, making learning easier.
• Normally, large learning rates may increase the scale of layer parameters,
  which then amplifies the gradient during backpropagation and leads to model
  explosion.
• However, with Batch Normalization, backpropagation through a layer is
  unaffected by the scale of its parameters.
• The output of the batch norm layer has γ and β parameters. Those parameters
  will be learned to best represent your activations. They allow a learnable
  scale and shift: ẑ^i = γ ⊙ z̃^i + β, where γ scales (sets the new standard
  deviation) and β shifts (sets the new mean).
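A forward-pass sketch of the computation above (assuming NumPy; batch_norm_forward is an illustrative name): normalize each feature over the mini-batch, then apply the learnable scale γ (gamma) and shift β (beta).

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                  # per-feature mean over the mini-batch
    sigma = z.std(axis=0)                # per-feature standard deviation over the mini-batch
    z_norm = (z - mu) / (sigma + eps)    # zero mean, unit variance
    return gamma * z_norm + beta         # learnable scale and shift

z = np.random.randn(3, 4) * 5 + 2        # mini-batch of 3 samples, 4 features each
gamma, beta = np.ones(4), np.zeros(4)    # identity transform at initialization
out = batch_norm_forward(z, gamma, beta)
print(out.mean(axis=0).round(6))         # ~0 per feature
print(out.std(axis=0).round(3))          # ~1 per feature
```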

Batch normalization

z̃^i = (z^i - μ) / σ
ẑ^i = γ ⊙ z̃^i + β

[Diagram: each z^i is normalized using μ and σ, then scaled and shifted by the
learnable γ and β to give ẑ^i.]

The proposed solution: add an extra layer
• A new layer is added so the gradient can "see" the normalization and make
  adjustments if needed.

Where to use the Batch-Norm layer in a CNN
• The batch norm layer is used after linear layers (i.e. FC, conv) and before
  the non-linear layers (ReLU), as sketched below.
• There are actually two batch norm implementations: one for FC layers and the
  other for conv layers.
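A sketch (assuming PyTorch; the layer sizes are illustrative) of that placement: batch norm right after the linear operation and before the ReLU, with BatchNorm2d for conv feature maps and BatchNorm1d for FC activations.

```python
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(3, 10, 3, padding=1),   # linear (convolution) layer
    nn.BatchNorm2d(10),               # batch norm over the 10 feature maps
    nn.ReLU(),                        # non-linearity comes after the batch norm
)
fc_block = nn.Sequential(
    nn.Linear(640, 64),               # linear (fully connected) layer
    nn.BatchNorm1d(64),               # batch norm over the 64 activations
    nn.ReLU(),
)
```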

Batch normalization: other benefits in practice
• BN reduces training times (because of less covariate shift and less
  exploding/vanishing gradients), and makes very deep nets trainable.
• BN reduces the demand for regularization (for generalization), e.g. dropout
  or L2 norm.
• BN enables training with saturating nonlinearities in deep networks, e.g.
  sigmoid (because the normalization prevents them from getting stuck in
  saturating ranges, e.g. very high/low values for sigmoid).
