Convolutional Neural Networks: Convolutions, Pooling and CNNs. Neural Architectures for Computer Vision
Alexey Artemov¹,²
¹Skoltech ²National Research University Higher School of Economics
Lecture overview
• Digital images and processing by neural networks
• Image processing operations: convolutions and pooling
• Convolutional neural networks from scratch
• Modern computer vision architectures: AlexNet, VGG, Inception and ResNets
Images as inputs to neural networks
Digital representation of an image
• A grayscale image is a matrix of pixels (picture elements).
• The dimensions of this matrix are called the image resolution (e.g. 300 x 300).
• Each pixel stores its brightness (or intensity), ranging from 0 to 255: intensity 0 corresponds to black, 255 to white.
Image as a neural network input
• Normalize input pixels: $x_{norm} = \frac{x}{255} - 0.5$
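A minimal NumPy sketch of this normalization (the function name and the 8-bit input assumption are illustrative, not from the lecture):

```python
import numpy as np

def normalize_image(img_uint8: np.ndarray) -> np.ndarray:
    """Map 8-bit pixel intensities [0, 255] to the range [-0.5, 0.5]."""
    return img_uint8.astype(np.float32) / 255.0 - 0.5
```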
Why not MLP?
• Let's say we want to train a "cat detector" with a fully connected network.
• On a given training image, the red weights $w_{ij}$ (those connected to the region where the cat appears) will change a little to better detect a cat. Since the weights are tied to fixed input locations, a cat appearing elsewhere in the image would have to be learned separately.
Convolutions will help!
• A convolution is a dot product of a kernel (or filter) and a patch of an image (local receptive field) of the same size.
[Figure: a 4x4 input, a 2x2 kernel, and the resulting output feature map; each output value is the dot product of the kernel with one image patch (local receptive field).]
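A minimal NumPy sketch of this operation (a "valid" convolution without kernel flipping, i.e. cross-correlation, as is conventional in deep learning; the names are illustrative):

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image; each output value is the dot
    product of the kernel with one image patch (local receptive field)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out
```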
Convolutions have been used for a while
Classic kernels, applied to the original image:

Edge detection:
-1 -1 -1
-1  8 -1
-1 -1 -1

Sharpening (doesn't change an image for solid fills; adds a little intensity on the edges):
 0 -1  0
-1  5 -1
 0 -1  0

Blurring:
(1/9) ×
1 1 1
1 1 1
1 1 1
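These filters can be tried directly with SciPy's 2-D convolution (a sketch; the grayscale input image here is a random placeholder):

```python
import numpy as np
from scipy.signal import convolve2d

edge    = np.array([[-1, -1, -1], [-1,  8, -1], [-1, -1, -1]])
sharpen = np.array([[ 0, -1,  0], [-1,  5, -1], [ 0, -1,  0]])
blur    = np.ones((3, 3)) / 9.0

image = np.random.rand(300, 300)  # placeholder for a real grayscale image
edges = convolve2d(image, edge, mode='same', boundary='symm')
```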
Convolution is similar to correlation
[Figure: an input image containing a pattern, a kernel matching that pattern, and the resulting output; the output response is highest where the image patch correlates with the kernel.]
Convolution is translation equivariant
[Figure: the same kernel applied to an input whose pattern has been shifted; the output feature map shifts by the same amount.]
• If the input pattern is translated, the convolution output is translated identically: the kernel responds to the pattern wherever it appears.
Backpropagation for CNN
Gradients are first calculated as if the kernel weights were not shared. Let $a$, $b$, $c$, $d$ denote the independent copies of the shared kernel weight $w_4$ (the kernel weights are $w_1, w_2, w_3, w_4$), one copy per position where the kernel is applied. The per-copy updates would be:

$$a = a - \gamma \frac{\partial L}{\partial a}, \qquad b = b - \gamma \frac{\partial L}{\partial b}, \qquad c = c - \gamma \frac{\partial L}{\partial c}, \qquad d = d - \gamma \frac{\partial L}{\partial d}$$

Because the weight is in fact shared, the gradients of all its copies are summed in the actual update:

$$w_4 = w_4 - \gamma \left( \frac{\partial L}{\partial a} + \frac{\partial L}{\partial b} + \frac{\partial L}{\partial c} + \frac{\partial L}{\partial d} \right)$$
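A toy NumPy check of this idea (a sketch with an illustrative loss, not from the lecture): the gradient of a shared kernel weight accumulates contributions from every position where the kernel was applied.

```python
import numpy as np

x = np.arange(9, dtype=float).reshape(3, 3)   # 3x3 input
w = np.array([[1., 2.], [3., 4.]])            # shared 2x2 kernel

# Forward pass: 2x2 "valid" convolution -> 2x2 output; toy loss L = sum(out)
out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        out[i, j] = np.sum(x[i:i+2, j:j+2] * w)
L = out.sum()

# Backward pass: dL/dout = 1 everywhere, so each kernel weight sums the
# gradients from all the patches it touched (as if copies, then summed).
grad_w = np.zeros_like(w)
for i in range(2):
    for j in range(2):
        grad_w += 1.0 * x[i:i+2, j:j+2]
print(grad_w)  # gradient for the shared kernel = sum over positions
```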
Convolutional vs fully connected layer
• In a convolutional layer, the same kernel is used for every output neuron; this way we share the parameters of the network and train a better model.
• For a 300x300 input, a 300x300 output and a 5x5 kernel, a convolutional layer has 26 parameters (25 weights + 1 bias), while a fully connected layer has 8.1×10⁹ parameters (each output is a perceptron over all 90,000 inputs).
• A convolutional layer can be viewed as a special case of a fully connected layer in which all the weights outside the local receptive field of each neuron equal 0 and kernel parameters are shared between neurons.
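A quick sanity check of these counts (a sketch; the fully connected count, as in the slide's estimate, omits biases):

```python
conv_params = 5 * 5 + 1                # 25 shared weights + 1 bias = 26
fc_params = (300 * 300) * (300 * 300)  # each of 90,000 outputs sees all 90,000 inputs
print(conv_params, fc_params)          # 26, 8100000000 (= 8.1e9)
```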
Intermediate summary
• We've introduced a convolutional layer, which works better than a fully connected layer for images: it has fewer parameters and acts the same way on every patch of the input.
• This layer will be used as a building block for larger neural networks!
Building convolutional neural networks for vision
A color image input
• Let's say we have a color image as an input, which is a $W \times H \times C_{in}$ tensor (multidimensional array), where
– $W$ is the image width,
– $H$ is the image height,
– $C_{in}$ is the number of input channels (e.g. 3 RGB channels).
[Figure: convolving the $W \times H \times C_{in}$ input with a kernel of size $W_k \times H_k \times C_{in}$ produces a single 2-D feature map.]
One kernel is not enough!
• We want to train $C_{out}$ kernels of size $W_k \times H_k \times C_{in}$.
• With a stride of 1 and enough zero padding we can have $W \times H \times C_{out}$ output neurons.
[Figure: each of the $C_{out}$ kernels produces its own $W \times H$ feature map; stacking the maps gives a $W \times H \times C_{out}$ output volume.]
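In PyTorch terms (a sketch with the sizes above; PyTorch orders tensors as N×C×H×W):

```python
import torch
import torch.nn as nn

# C_in = 3, C_out = 64, 3x3 kernels, stride 1, padding that preserves W and H
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 300, 300)   # one 300x300 RGB image
y = conv(x)
print(y.shape)                    # torch.Size([1, 64, 300, 300]) -> C_out feature maps
```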
One convolutional layer is not enough!
• Let's say neurons of the 1st convolutional layer look at patches of the image of size 3x3.
• What if an object of interest is bigger than that?
• We need a 2nd convolutional layer on top of the 1st!
[Figure: a neuron of the 2nd 3x3 convolutional layer has a 5x5 receptive field on the input.]
Receptive field after N convolutional layers
[Figure: stacking 1x3 convolutional layers on 1-dimensional inputs; the receptive field is 1x3 after the 1st layer, 1x5 after the 2nd, 1x7 after the 3rd, and 1x9 after the 4th.]
• If we stack $N$ convolutional layers with the same kernel size 3x3, the receptive field at the $N$-th layer will be $(2N+1) \times (2N+1)$.
• It looks like we need to stack a lot of convolutional layers! To be able to identify objects as big as a 300x300 input image we would need 150 convolutional layers!
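A quick check of this growth (a sketch; stride-1 convolutions only, where each layer adds kernel − 1 pixels to the field):

```python
def receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field after stacking stride-1 convolutions:
    each layer adds (kernel - 1) pixels in each dimension."""
    return 1 + num_layers * (kernel - 1)

print(receptive_field(4))    # 9   -> the 1x9 field after 4 layers above
print(receptive_field(150))  # 301 -> enough to cover a 300x300 image
```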
We need to grow receptive field faster!
• We can increase the stride (the step of the sliding window) in our convolutional layer to reduce the output dimensions!
[Figure: a 2x2 convolution with stride 2 applied to a 4x4 input produces a 2x2 output; each output neuron looks at a non-overlapping 2x2 patch.]
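In general, for input width $W$, kernel width $W_k$, padding $P$ and stride $S$, the output width is $\lfloor (W - W_k + 2P)/S \rfloor + 1$ (a standard formula, not stated on the slide); for the figure above: $(4 - 2 + 2 \cdot 0)/2 + 1 = 2$.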
How do we maintain translation invariance?
[Figure: two inputs containing the same pattern at different locations are convolved with the same kernel. The output feature maps differ (the response moves together with the pattern), but the maximum over each output is 2 in both cases: the max didn't change.]
Pooling layer will help!
• This layer works like a convolutional layer but doesn't have a kernel; instead, it calculates the maximum or average of input patch values.
[Figure: 2x2 max pooling with stride 2 downsamples a 200x200x64 volume to 100x100x64. On a single depth slice:
1 1 1 4
2 6 5 8   →   6 8
3 2 1 0       3 5
1 1 3 5]
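A minimal NumPy sketch of 2x2 max pooling with stride 2 (assumes even input dimensions; it reproduces the single-depth-slice example above):

```python
import numpy as np

def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2 on a 2-D array (H and W must be even)."""
    h, w = x.shape
    # Group the array into non-overlapping 2x2 blocks, then take each block's max.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 1, 4],
              [2, 6, 5, 8],
              [3, 2, 1, 0],
              [1, 1, 3, 5]])
print(max_pool_2x2(x))  # [[6 8]
                        #  [3 5]]
```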
Backpropagation for max pooling layer
• Strictly speaking, maximum is not a differentiable function!
[Figure: for the patch (6, 8; 3, 5) the maximum is 8. Changing a non-maximal input (6 → 7) leaves the maximum at 8, while changing the maximal input (8 → 9) changes the maximum to 9.]
• So the gradient flows only through the input that achieved the maximum; all other inputs get zero gradient.
Putting it all together into a simple CNN
• LeNet-5 architecture (1998) for handwritten digit recognition on the MNIST dataset:
[Figure: 32x32 input → conv1 (5x5, 6 feature maps) → 28x28x6 → pool1 (2x2) → 14x14x6 → conv2 (5x5, 16 feature maps) → 10x10x16 → pool2 (2x2) → 5x5x16 → fc1 (120) → fc2 (84) → fc3 + softmax (10 outputs).]
http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf
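A PyTorch-style sketch of this architecture (an approximation: the original LeNet-5 used sigmoid-like nonlinearities and average-pooling subsampling, replaced here by ReLU and max pooling for brevity):

```python
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6
    nn.ReLU(),
    nn.MaxPool2d(2),                  # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),  # -> 10x10x16
    nn.ReLU(),
    nn.MaxPool2d(2),                  # -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # fc1
    nn.ReLU(),
    nn.Linear(120, 84),               # fc2
    nn.ReLU(),
    nn.Linear(84, 10),                # fc3; softmax is applied in the loss
)
```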
Learning deep representations
• Neurons of deep convolutional layers learn complex representations that can be used as features for classification with an MLP.
[Figure: stacked convolutional layers perform automatic feature extraction, producing good features for an MLP classifier.]
Neural architectures for computer vision
ImageNet classification dataset
• 1000 classes, 1.2 million labeled photos
• Human top 5 error: ~5%
Jon Shlens, https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html

Inception block
[Figure: a 1x1 convolution followed by ReLU maps $C_{in}$ input channels to $C_{out}$ output channels at every spatial position.]
Basic Inception block
• All operations inside a block use stride 1 and enough padding to output the same spatial dimensions ($W \times H$) of the feature map.
• 4 different feature maps are concatenated along the depth dimension at the end.
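A simplified PyTorch sketch of such a block (branch channel counts are illustrative, not GoogLeNet's exact ones, and the 1x1 reductions before the larger convolutions are omitted):

```python
import torch
import torch.nn as nn

class BasicInceptionBlock(nn.Module):
    """Four parallel branches with stride 1 and 'same' padding,
    concatenated along the channel (depth) dimension."""
    def __init__(self, c_in: int):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, 16, kernel_size=1)              # 1x1 branch
        self.b2 = nn.Conv2d(c_in, 16, kernel_size=3, padding=1)   # 3x3 branch
        self.b3 = nn.Conv2d(c_in, 16, kernel_size=5, padding=2)   # 5x5 branch
        self.b4 = nn.Sequential(                                  # pooling branch
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(c_in, 16, kernel_size=1),
        )

    def forward(self, x):
        # All branches keep W x H, so they can be concatenated on depth.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```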
Filter decomposition
• It's known that a Gaussian blur filter can be decomposed into two 1-dimensional filters:
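For example, a standard 3x3 Gaussian-like blur kernel factors as an outer product of two 1-D filters (a common example; the lecture's exact kernel is not shown on the slide):

$$\frac{1}{16}\begin{pmatrix}1 & 2 & 1\\ 2 & 4 & 2\\ 1 & 2 & 1\end{pmatrix} = \frac{1}{4}\begin{pmatrix}1\\ 2\\ 1\end{pmatrix} \cdot \frac{1}{4}\begin{pmatrix}1 & 2 & 1\end{pmatrix}$$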
ResNet (2015)
• Introduces residual connections
• ImageNet top 5 error: 4.5% (single model), 3.5% (ensemble)
• 152 layers; a few 7x7 convolutional layers, the rest are 3x3; batch normalization, max and average pooling.
• 60 million parameters
• Trains on 8 GPUs for 2-3 weeks.
Residual connections
• We create output channels by adding a small delta $F(x)$ to the original input channels $x$, so the block outputs $x + F(x)$.
• This way we can stack thousands of layers, and gradients do not vanish thanks to the residual connections.
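A minimal residual block sketch in PyTorch (layer sizes are illustrative; a real ResNet block also handles dimension changes with a projection shortcut):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes x + F(x), where F is a small stack of convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.f(x))  # identity shortcut + residual delta
```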
Summary
• By stacking more convolutional and pooling layers you can reduce the error, as in AlexNet or VGG.
• But you cannot do that forever: you need to utilize new kinds of layers, like the Inception block or residual connections.
• You've probably noticed that one needs a lot of time to train a neural network!