
Convolutional Neural Networks

Convolutions, pooling and CNNs. Neural architectures for computer vision.

Fourth Machine Learning in High Energy Physics Summer School, MLHEP 2018, August 6--12

Alexey Artemov (Skoltech; National Research University Higher School of Economics)
Lecture overview
• Digital images and processing by neural networks
• Image processing operations: convolutions and pooling
• Convolutional neural networks from scratch
• Modern computer vision architectures: AlexNet, VGG, Inception and ResNets
Images as inputs to neural networks
Digital representation of an image
• A grayscale image is a matrix of pixels (picture elements).
• The dimensions of this matrix are called the image resolution (e.g. 300 x 300).
• Each pixel stores its brightness (or intensity), ranging from 0 to 255; intensity 0 corresponds to black, 255 to white.
Image as a neural network input
• Normalize input pixels: $x_{norm} = \frac{x}{255} - 0.5$
• Maybe an MLP will work? Feed all the pixels into one neuron:

$\sigma\left(\sum_{i,j} x_{ij} w_{ij} + b\right)$

where $x_{ij}$ are the pixels and $w_{ij}$ are the weights.

• Actually, no!
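A minimal sketch of this normalization, assuming NumPy; the random array here is a hypothetical stand-in for a real grayscale image:

```python
import numpy as np

# Fake 300x300 grayscale image with intensities in [0, 255].
image = np.random.randint(0, 256, size=(300, 300), dtype=np.uint8)

# Map raw intensities to [-0.5, 0.5], as in the formula above.
x_norm = image.astype(np.float32) / 255.0 - 0.5
print(x_norm.min(), x_norm.max())
```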
Why not MLP?
• Let's say we want to train a "cat detector".
• On one training image the weights in one region (red in the figure) will change a little bit to better detect a cat; on another training image the weights in a different region (green) will change instead.
• We learn the same "cat features" in different areas and don't fully utilize the training set!
• What if cats in the test set appear in different places?
Convolutions will help!
• Convolution is a dot product of a kernel (or filter) and a patch of an image (local receptive field) of the same size.
• Sliding a 2x2 kernel over a 4x4 input produces a 3x3 output, computed patch by patch (see the sketch below):

Input:      Kernel:    Output:
1 0 1 0                5 9 4
0 1 1 0  *  1 2    =   5 7 4
1 0 1 0     3 4        4 6 8
1 0 1 1
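A minimal NumPy sketch of this sliding-window dot product, reproducing the example above. Note that, like the slides, it does not flip the kernel, so strictly speaking it computes cross-correlation:

```python
import numpy as np

x = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 0, 1, 1]])
k = np.array([[1, 2],
              [3, 4]])

out = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        # Dot product of the kernel with the local receptive field.
        out[i, j] = np.sum(x[i:i+2, j:j+2] * k)
print(out)  # [[5 9 4] [5 7 4] [4 6 8]]
```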
Convolutions have been used for a while
Classic image-processing kernels applied to the original image:

• Edge detection (sums up to 0, i.e. black, when the patch is a solid fill):
-1 -1 -1
-1  8 -1
-1 -1 -1

• Sharpening (doesn't change the image for solid fills, adds a little intensity on the edges):
 0 -1  0
-1  5 -1
 0 -1  0

• Blurring (box filter):
      1 1 1
1/9 * 1 1 1
      1 1 1
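A short sketch, assuming SciPy, that applies these three kernels; the random array is a hypothetical stand-in for a real grayscale image:

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(64, 64)  # stand-in for a real grayscale image

edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
blur = np.ones((3, 3)) / 9.0

# 'same' keeps the output the size of the input; 'symm' mirrors the borders.
edges = convolve2d(img, edge, mode='same', boundary='symm')
sharpened = convolve2d(img, sharpen, mode='same', boundary='symm')
blurred = convolve2d(img, blur, mode='same', boundary='symm')
```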
Convolution is similar to correlation
The output is largest where the image patch matches the kernel pattern, so the maximum over the output can act as a simple classifier.

Diagonal pattern matches the diagonal kernel: Max = 2

Input:      Kernel:   Output:
0 0 0 0               0 0 0
0 0 0 0  *  1 0   =   0 1 0
0 0 1 0     0 1       0 0 2
0 0 0 1

Anti-diagonal pattern matches it only weakly: Max = 1

Input:      Kernel:   Output:
0 0 0 0               0 0 0
0 0 0 0  *  1 0   =   0 0 1
0 0 0 1     0 1       0 1 0
0 0 1 0
Convolution is translation equivariant
Shifting the input shifts the output by the same amount, and the maximum over the output doesn't change.

Pattern in the bottom-right corner: Max = 2

Input:      Kernel:   Output:
0 0 0 0               0 0 0
0 0 0 0  *  1 0   =   0 1 0
0 0 1 0     0 1       0 0 2
0 0 0 1

Same pattern shifted to the top-left corner: Max = 2 (didn't change)

Input:      Kernel:   Output:
1 0 0 0               2 0 0
0 1 0 0  *  1 0   =   0 1 0
0 0 0 0     0 1       0 0 0
0 0 0 0
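A small NumPy sketch of this property, reproducing the two examples above: shifting the input shifts the feature map, while the max response is unchanged.

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid cross-correlation, as in the slides (no kernel flip)."""
    H, W = x.shape
    h, w = k.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+h, j:j+w] * k)
    return out

kernel = np.array([[1, 0], [0, 1]])

x = np.zeros((4, 4))
x[2, 2] = x[3, 3] = 1                                    # pattern in the bottom-right
x_shifted = np.roll(np.roll(x, -2, axis=0), -2, axis=1)  # same pattern, top-left

print(conv2d_valid(x, kernel).max())          # 2.0
print(conv2d_valid(x_shifted, kernel).max())  # 2.0 -- the max is translation invariant
```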
Convolutional layer in neural network
• Take a 3x3 input image, zero-padded to 5x5 (the grey area on the slide), a shared 3x3 kernel with weights $w_1, \dots, w_9$, a shared bias $b$, and stride 1:

Input (zero-padded):   Shared kernel:
0 0 0 0 0              w1 w2 w3
0 0 1 0 0              w4 w5 w6
0 1 1 0 0              w7 w8 w9
0 1 0 1 0
0 0 0 0 0

• The first output neuron is $\sigma(w_6 + w_8 + w_9 + b)$; the next one, one stride to the right, is $\sigma(w_5 + w_7 + w_8 + b)$; and so on.
• The result is a 3x3 feature map: 9 output neurons computed with only 10 parameters (9 kernel weights + 1 shared bias).
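A from-scratch NumPy sketch of this layer: one shared 3x3 kernel and bias, zero padding, stride 1, sigmoid activation. The random kernel stands in for learned weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(image, kernel, bias):
    """3x3 kernel, zero padding of 1 (the grey border), stride 1."""
    padded = np.pad(image, 1)  # pads with zeros by default
    H, W = image.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i+3, j:j+3] * kernel) + bias
    return sigmoid(out)

image = np.array([[0, 1, 0],
                  [1, 1, 0],
                  [1, 0, 1]])
kernel = np.random.randn(3, 3)  # w1..w9, shared by all output neurons
bias = 0.1                      # shared bias b

feature_map = conv_layer(image, kernel, bias)
print(feature_map.shape)  # (3, 3): 9 output neurons from only 10 parameters
```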
Backpropagation for CNN
• Consider a 2x2 kernel $(w_1, w_2, w_3, w_4)$ and let $a, b, c, d$ denote the per-position copies of the shared weight $w_4$ at the four locations where it is applied.
• Gradients are first calculated as if the kernel weights were not shared:

$a \leftarrow a - \gamma \frac{\partial L}{\partial a}$,  $b \leftarrow b - \gamma \frac{\partial L}{\partial b}$,  $c \leftarrow c - \gamma \frac{\partial L}{\partial c}$,  $d \leftarrow d - \gamma \frac{\partial L}{\partial d}$

• Since $a, b, c, d$ are really the same weight, their updates are combined:

$w_4 \leftarrow w_4 - \gamma \left( \frac{\partial L}{\partial a} + \frac{\partial L}{\partial b} + \frac{\partial L}{\partial c} + \frac{\partial L}{\partial d} \right)$

• Gradients of the same shared weight are summed up!
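A tiny NumPy sketch of this rule under illustrative values: the gradient of a shared kernel weight is the sum of the gradients at every position where the kernel is applied.

```python
import numpy as np

x = np.array([[1., 2., 0.],
              [0., 1., 3.],
              [2., 0., 1.]])
k = np.random.randn(2, 2)

# Forward pass: 2x2 valid convolution -> 2x2 output.
out = np.array([[np.sum(x[i:i+2, j:j+2] * k) for j in range(2)]
                for i in range(2)])

dout = np.ones((2, 2))  # pretend dL/d(out) arrived from the next layer

# Backward pass: accumulate over every patch the kernel touched.
dk = np.zeros_like(k)
for i in range(2):
    for j in range(2):
        dk += dout[i, j] * x[i:i+2, j:j+2]  # shared-weight gradients are summed
```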
Convolutional vs fully connected layer
• In a convolutional layer the same kernel is used for every output neuron; this way we share the parameters of the network and train a better model.
• For a 300x300 input and a 300x300 output, a 5x5 kernel needs 26 parameters (25 weights + 1 bias) in a convolutional layer, versus $8.1 \times 10^9$ parameters in a fully connected layer (each output is a perceptron over all 90,000 inputs).
• A convolutional layer can be viewed as a special case of a fully connected layer in which all the weights outside the local receptive field of each neuron equal 0 and the kernel parameters are shared between neurons.
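The arithmetic behind these two counts, as a quick sanity check:

```python
conv_params = 5 * 5 + 1                # 26: one shared 5x5 kernel + bias
fc_params = (300 * 300) * (300 * 300)  # every output sees every input
print(conv_params, fc_params)          # 26 8100000000, i.e. 8.1e9
```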
Intermediate summary
• We've introduced the convolutional layer, which works better than a fully connected layer for images: it has fewer parameters and acts the same way on every patch of the input.
• This layer will be used as a building block for larger neural networks!
Building convolutional neural networks for vision
A color image input
• Let's say we have a color image as an input, which is a $W \times H \times C_{in}$ tensor (multidimensional array), where
  – $W$ is the image width,
  – $H$ is the image height,
  – $C_{in}$ is the number of input channels (e.g. 3 RGB channels).
• A kernel now has size $W_k \times H_k \times C_{in}$: it spans all input channels and slides over the spatial dimensions, producing a single-channel feature map.
One kernel is not enough!
• We want to train $C_{out}$ kernels of size $W_k \times H_k \times C_{in}$.
• With a stride of 1 and enough zero padding we can have $W \times H \times C_{out}$ output neurons: each kernel produces its own $W \times H$ feature map.
• This uses $(W_k \cdot H_k \cdot C_{in} + 1) \cdot C_{out}$ parameters, where the +1 is the per-kernel bias (see the check below).
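A quick check of this parameter count, assuming PyTorch, for illustrative sizes $W_k = H_k = 5$, $C_{in} = 3$, $C_{out} = 64$:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=5, padding=2)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)              # 4864
print((5 * 5 * 3 + 1) * 64)  # 4864 -- matches the formula
```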
One convolutional layer is not enough!
• Let's say neurons of the 1st convolutional layer look at 3x3 patches of the image.
• What if an object of interest is bigger than that?
• We need a 2nd convolutional layer on top of the 1st! A neuron of the 2nd 3x3 conv layer then has a 5x5 receptive field in the input, while a neuron of the 1st sees only 3x3.
Receptive field after N convolutional layers
• Stacking 1x3 convolutions over 1-dimensional inputs, the receptive field grows by 2 per layer: 1x3 after the 1st layer, 1x5 after the 2nd, 1x7 after the 3rd, 1x9 after the 4th.
• If we stack $N$ convolutional layers with the same kernel size 3x3, the receptive field on the $N$-th layer will be $(2N + 1) \times (2N + 1)$.
• It looks like we need to stack a lot of convolutional layers! To be able to identify objects as big as the input image 300x300 we would need 150 convolutional layers!
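A one-liner sanity check of this growth rule for stride-1 stacks:

```python
def receptive_field(n_layers: int, kernel: int = 3) -> int:
    """Receptive field per spatial dimension after n stride-1 conv layers."""
    return 1 + n_layers * (kernel - 1)

print([receptive_field(n) for n in (1, 2, 3, 4)])  # [3, 5, 7, 9]
print(receptive_field(150))                        # 301: covers a 300x300 image
```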
We need to grow receptive field faster!
• We can increase the stride in our convolutional layer to reduce the output dimensions. For example, a 2x2 conv with stride 2 over a 4x4 input yields a 2x2 output (output values are illustrative):

Input:                          Output:
1 1 1 4                         7 9
2 6 5 8   2x2 conv, stride 2    4 6
3 2 1 0
1 1 3 5

• Further convolutions on top will effectively double their receptive field!
How do we maintain translation invariance?
• Recall the earlier example: shifting the input pattern shifted the feature map, but the maximum over the output stayed at 2. How do we build this max-like behaviour into the network itself?
Pooling layer will help!
• This layer works like a convolutional layer but doesn't have a kernel; instead it takes the maximum or the average of the input patch values.
• Example: 2x2 max pooling with stride 2 downsamples a 200x200x64 input to 100x100x64. On a single depth slice:

Input:        Output:
1 1 1 4       6 8
2 6 5 8   →   3 5
3 2 1 0
1 1 3 5
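A minimal NumPy sketch of 2x2 max pooling with stride 2, reproducing the single-depth-slice example above:

```python
import numpy as np

x = np.array([[1, 1, 1, 4],
              [2, 6, 5, 8],
              [3, 2, 1, 0],
              [1, 1, 3, 5]])

# Split into non-overlapping 2x2 blocks, then take the max of each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 8] [3 5]]
```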
Backpropagation for max pooling layer
• Strictly speaking, maximum is not a differentiable function!
• Perturbing a non-maximum entry of the patch [6 8; 3 5] (say, 6 → 7) leaves the maximum at 8: there is no gradient with respect to non-maximum patch neurons, since changing them slightly does not affect the output.
• Perturbing the maximum entry (8 → 9) changes the output (8 → 9): the gradient flows only through the maximum.
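A NumPy sketch of the resulting backward pass: the upstream gradient is routed entirely to the position of the maximum, and all other entries get zero.

```python
import numpy as np

patch = np.array([[6., 8.],
                  [3., 5.]])
dout = 1.0  # upstream gradient dL/d(max)

dpatch = np.zeros_like(patch)
dpatch[np.unravel_index(np.argmax(patch), patch.shape)] = dout
print(dpatch)  # [[0. 1.] [0. 0.]]: only the max entry receives gradient
```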
Putting it all together into a simple CNN
• LeNet-5 architecture (1998) for handwritten digit recognition on the MNIST dataset:

input 32x32x1 → conv1 5x5 (6 filters) → 28x28x6 → pool1 2x2 → 14x14x6 → conv2 5x5 (16 filters) → 10x10x16 → pool2 2x2 → 5x5x16 → fc1 (120) → fc2 (84) → fc3 + softmax (10 classes)

http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf
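A hedged PyTorch sketch of this layout. It uses modern conveniences (ReLU, max pooling); the 1998 paper used different details such as tanh activations and subsampling layers:

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6
    nn.ReLU(),
    nn.MaxPool2d(2),                  # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),  # -> 10x10x16
    nn.ReLU(),
    nn.MaxPool2d(2),                  # -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # fc1
    nn.ReLU(),
    nn.Linear(120, 84),               # fc2
    nn.ReLU(),
    nn.Linear(84, 10),                # fc3; softmax is applied inside the loss
)

logits = lenet5(torch.randn(1, 1, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```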
Learning deep representations
• Neurons of deep convolutional layers learn complex representations that can be used as features for classification with an MLP: the convolutional stack performs automatic feature extraction and supplies good features to the MLP.
• (Figure: inputs that provide the highest activations for neurons of conv1, conv2 and conv3.)

http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf
Summary
• Using convolutional, pooling and fully connected layers we've built our first network for handwritten digit recognition!
Neural architectures for computer vision
ImageNet classification dataset
• 1000 classes, 1.2 million labeled photos
• Human top-5 error: ~5%

Olga Russakovsky, https://arxiv.org/pdf/1409.0575.pdf
AlexNet (2012)
• First deep convolutional neural net for ImageNet
• Significantly reduced the top-5 error from 26% to 15%
• 11x11, 5x5, 3x3 convolutions, max pooling, dropout, data augmentation, ReLU activations, SGD with momentum
• 60 million parameters
• Trains on 2 GPUs for 6 days

Alex Krizhevsky, https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
VGG (2015)
• Similar to AlexNet: only 3x3 convolutions, but lots of filters!
• ImageNet top-5 error: 8.0% (single model)
• Training similar to AlexNet, with additional multi-scale cropping
• 138 million parameters
• Trains on 4 GPUs for 2-3 weeks

Vanessa He, http://www.datalearner.com/paper_note/content/300035
Inception V3 (2015)
• Similar to AlexNet? Not quite: it uses the Inception block introduced in GoogLeNet (a.k.a. Inception V1)
• ImageNet top-5 error: 5.6% (single model), 3.6% (ensemble)
• Batch normalization, image distortions, RMSProp
• 25 million parameters!
• Trains on 8 GPUs for 2 weeks

Jon Shlens, https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html
1x1 convolutions
• Such convolutions capture interactions of input channels in one "pixel" of the feature map.
• They can reduce the number of channels ($C_{in} \to C_{out}$) without hurting the quality of the model, because different channels can correlate.
• Dimensionality reduction with an added ReLU activation.
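A PyTorch sketch of a 1x1 convolution used this way, with illustrative channel counts: each output "pixel" mixes the input channels at the same spatial location.

```python
import torch
import torch.nn as nn

reduce = nn.Sequential(
    nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1),  # C_in=256 -> C_out=64
    nn.ReLU(),
)

x = torch.randn(1, 256, 28, 28)
print(reduce(x).shape)  # torch.Size([1, 64, 28, 28]): same W x H, fewer channels
```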
Basic Inception block
• All operations inside a block use stride 1 and enough padding to output the same spatial dimensions ($W \times H$) of the feature map.
• 4 different feature maps are concatenated along the depth at the end.

Christian Szegedy, https://arxiv.org/pdf/1512.00567.pdf
Replace 5x5 convolutions
• 5x5 convolutions are expensive! Let's replace them with two layers of 3x3 convolutions, which have an effective receptive field of 5x5 but need fewer weights (2 x 3 x 3 = 18 vs 5 x 5 = 25 per channel pair; see the arithmetic below).

Christian Szegedy, https://arxiv.org/pdf/1512.00567.pdf
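The same back-of-the-envelope comparison for an illustrative layer with C input and C output channels (biases ignored):

```python
C = 64                              # channels in and out (illustrative)
cost_5x5 = 5 * 5 * C * C            # 102400 weights for one 5x5 conv
cost_two_3x3 = 2 * (3 * 3 * C * C)  # 73728 weights for two stacked 3x3 convs
print(cost_5x5, cost_two_3x3)       # same 5x5 receptive field, ~28% fewer weights
```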
Filter decomposition
• It's known that a Gaussian blur filter can be decomposed into two 1-dimensional filters, one horizontal and one vertical.

Ati's presentation, http://www.florian-oeser.de/rtr/ue5/
Filter decomposition in Inception block
• 3x3 convolutions are now the most expensive parts!
• Let's replace each 3x3 layer with a 1x3 layer followed by a 3x1 layer.

Christian Szegedy, https://arxiv.org/pdf/1512.00567.pdf
ResNet (2015)
• Introduces residual connections
• ImageNet top-5 error: 4.5% (single model), 3.5% (ensemble)
• 152 layers: a few 7x7 convolutional layers, the rest are 3x3; batch normalization, max and average pooling
• 60 million parameters
• Trains on 8 GPUs for 2-3 weeks

Kaiming He, https://arxiv.org/pdf/1512.03385.pdf
Residual connections
• We create output channels by adding a small delta $F(x)$ to the original input channels $x$: the block outputs $F(x) + x$.
• This way we can stack thousands of layers, and gradients do not vanish thanks to the residual connections.

Kaiming He, https://arxiv.org/pdf/1512.03385.pdf
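A hedged PyTorch sketch of a basic residual block in this spirit (layer details are a simplification of the paper): the stacked convolutions learn only the delta F(x), which is added back to the input x.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(  # F(x): two 3x3 convs with batch norm
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)  # residual connection: F(x) + x

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```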
Summary
• By stacking more convolutional and pooling layers you can reduce the error, as in AlexNet or VGG.
• But you cannot do that forever: you need to utilize new kinds of layers, like the Inception block or residual connections.
• You've probably noticed that one needs a lot of time to train a neural network!