CNN Slides Part2
Sudeshna Sarkar
Convolutional Layer: 2D example

A 2D image:        A filter:
1 0 1 0 0          -1 -1 -1
1 0 1 0 1          -1  1 -1
1 1 1 0 0          -1 -1 -1
1 0 1 0 1
1 0 1 0 1

For the top-left 3x3 window:
   -1 +  0 + -1
 + -1 +  0 + -1
 + -1 + -1 + -1  = -7

After convolution:
-7 -2 -4
-5 -2 -5
-7 -2 -5
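The sliding-window computation above can be sketched in a few lines of plain Python (a minimal illustration; `conv2d` is a name chosen here, not a library function):

```python
# "Valid" 2D convolution (cross-correlation), stride 1, on plain lists.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # dot product of the kernel with the window at (i, j)
            s = sum(kernel[u][v] * image[i + u][j + v]
                    for u in range(kh) for v in range(kw))
            row.append(s)
        out.append(row)
    return out

image = [[1, 0, 1, 0, 0],
         [1, 0, 1, 0, 1],
         [1, 1, 1, 0, 0],
         [1, 0, 1, 0, 1],
         [1, 0, 1, 0, 1]]

kernel = [[-1, -1, -1],
          [-1,  1, -1],
          [-1, -1, -1]]

print(conv2d(image, kernel))
# [[-7, -2, -4], [-5, -2, -5], [-7, -2, -5]], as on the slide
```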
Convolutional Layer: 2D example (zero padding)

Zero-pad the image with a border of 0's, so the output covers every pixel (5x5 instead of 3x3):

0 0 0 0 0 0 0        A filter:
0 1 0 1 0 0 0        -1 -1 -1
0 1 0 1 0 1 0        -1  1 -1
0 1 1 1 0 0 0        -1 -1 -1
0 1 0 1 0 1 0
0 1 0 1 0 1 0
0 0 0 0 0 0 0

After convolution:
 0 -4  0 -3 -1
-2 -7 -2 -4  1
-2 -5 -2 -5 -2
-2 -7 -2 -5  0
 0 -4  0 -4  0
Convolutional Layer: 2D example (with ReLU)

Same padded image and filter. After convolution & ReLU:
0 0 0 0 0
0 0 0 0 1
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0

This filter detects isolated 1's: a 1 surrounded by 0's.
How do we detect all the 1's?
How do we detect edges? -- Exercise.
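The padding-plus-ReLU pipeline above can be reproduced with a short sketch (pure Python; the helper names `pad_zero`, `conv2d`, `relu` are illustrative):

```python
# Zero-pad the 5x5 image to 7x7, convolve with the 3x3 filter,
# then apply ReLU (clamp negatives to zero).

def pad_zero(image, p=1):
    w = len(image[0]) + 2 * p
    padded = [[0] * w for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded += [[0] * w for _ in range(p)]
    return padded

def conv2d(image, kernel):
    k = len(kernel)
    return [[sum(kernel[u][v] * image[i + u][j + v]
                 for u in range(k) for v in range(k))
             for j in range(len(image[0]) - k + 1)]
            for i in range(len(image) - k + 1)]

def relu(mat):
    return [[max(0, x) for x in row] for row in mat]

image = [[1, 0, 1, 0, 0],
         [1, 0, 1, 0, 1],
         [1, 1, 1, 0, 0],
         [1, 0, 1, 0, 1],
         [1, 0, 1, 0, 1]]
kernel = [[-1, -1, -1], [-1, 1, -1], [-1, -1, -1]]

out = relu(conv2d(pad_zero(image), kernel))
# only the isolated 1 at image position (1, 4) survives ReLU
```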
Convolutional Layer: 2D example (with bias)

Same padded image; the filter now adds a bias of 2 to each output:
-1 -1 -1
-1  1 -1    with bias 2
-1 -1 -1

After convolution (remaining entries left as an exercise):
2 -2  ?  ?  ?
?  ?  ?  ?  ?
?  ?  ?  ?  ?
?  ?  ?  ?  ?
?  ?  ?  ?  2

What does it detect?
■ Corner points and
■ isolated ones
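A sketch of the biased variant (pure Python; helper names are illustrative). The bias simply shifts every output before ReLU, so corner 1's now survive alongside the isolated 1:

```python
# Convolution with a bias term, then ReLU, on the zero-padded image.

def pad_zero(image, p=1):
    w = len(image[0]) + 2 * p
    padded = [[0] * w for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded += [[0] * w for _ in range(p)]
    return padded

def conv2d(image, kernel, bias=0):
    k = len(kernel)
    return [[bias + sum(kernel[u][v] * image[i + u][j + v]
                        for u in range(k) for v in range(k))
             for j in range(len(image[0]) - k + 1)]
            for i in range(len(image) - k + 1)]

def relu(mat):
    return [[max(0, x) for x in row] for row in mat]

image = [[1, 0, 1, 0, 0],
         [1, 0, 1, 0, 1],
         [1, 1, 1, 0, 0],
         [1, 0, 1, 0, 1],
         [1, 0, 1, 0, 1]]
kernel = [[-1, -1, -1], [-1, 1, -1], [-1, -1, -1]]

out = relu(conv2d(pad_zero(image), kernel, bias=2))
# corner 1's score 2; the isolated 1 at (1, 4) scores highest (3)
```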
Convolutional Layer: general form

The same picture with learned weights. A filter

w11 w12 w13
w21 w22 w23    with bias b
w31 w32 w33

slides over the (optionally zero-padded) 2D image; each output is the dot product of the filter with the window under it, plus the bias:

output(i, j) = b + Σu Σv w(u, v) · image(i + u, j + v)

After convolution: a feature map whose entries are these weighted sums.
Convolutional Layer

Learn multiple filters.
Convolution Layer

A 32x32x3 image: 32 (height) x 32 (width) x 3 (depth: color channels).

Slide based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer

Convolve a 5x5x3 filter with the 32x32x3 image.
Filters always extend to the full depth of the input volume.
Convolution Layer

At each location, take the dot product between the filter and a small 5x5x3 chunk of the image. The result is 1 number (i.e. a 5*5*3 = 75-dimensional dot product, plus a bias).
Convolution Layer

Sliding the 5x5x3 filter over all spatial locations of the 32x32x3 image gives a 28x28x1 activation map.
Consider a second (green) filter: it produces a second 28x28x1 activation map.
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps: a 28x28x6 output volume.
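The shape arithmetic can be checked with a tiny helper (a sketch; `conv_output_shape` is a name invented here): the spatial output size is (N + 2·pad − F) / stride + 1 per side, and the depth equals the number of filters.

```python
# Output-volume shape of a conv layer applied to an HxW input.

def conv_output_shape(h, w, num_filters, f, stride=1, pad=0):
    out_h = (h + 2 * pad - f) // stride + 1
    out_w = (w + 2 * pad - f) // stride + 1
    return (out_h, out_w, num_filters)

print(conv_output_shape(32, 32, num_filters=6, f=5))  # (28, 28, 6)
```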
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with non-linear activation functions.

32x32x3 image → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → …
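The shape chain above can be verified by bookkeeping (stride 1, no padding, as in the slides): each 5x5 conv shrinks the spatial size by 4, and the depth becomes the number of filters; ReLU leaves shapes unchanged.

```python
# Follow the volume shapes through two CONV/ReLU stages.
shapes = [(32, 32, 3)]
for f, n_filters in [(5, 6), (5, 10)]:
    h, w, _ = shapes[-1]
    shapes.append((h - f + 1, w - f + 1, n_filters))

print(shapes)  # [(32, 32, 3), (28, 28, 6), (24, 24, 10)]
```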
Recall: Convolution

[Figure: a filter sliding over an image to produce the output feature map.]
02 Feb 2022. Image courtesy: Vincent Dumoulin
Padding

• Add a border of 0's around the input so the output keeps the same spatial size, as in the 2D example above.
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently
Why Pooling?

[Figure: a subsampled image of a bird is still recognizably a bird.]
• Backpropagation: the backward pass for a max(x, y) operation routes the gradient to the input that had the highest value in the forward pass.
• Hence, during the forward pass of a pooling layer you may keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
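The switch mechanism can be sketched in plain Python (function names are illustrative, not from any library): the forward pass records the argmax of each window, and the backward pass sends the gradient only to those positions.

```python
# 2x2 max pooling (stride 2) with "switches", plus gradient routing.

def maxpool2x2(x):
    out, switches = [], []
    for i in range(0, len(x), 2):
        out_row, sw_row = [], []
        for j in range(0, len(x[0]), 2):
            # (value, position) pairs for the 2x2 window
            window = [(x[i + u][j + v], (i + u, j + v))
                      for u in range(2) for v in range(2)]
            val, pos = max(window)
            out_row.append(val)
            sw_row.append(pos)
        out.append(out_row)
        switches.append(sw_row)
    return out, switches

def maxpool_backward(dout, switches, in_shape):
    dx = [[0] * in_shape[1] for _ in range(in_shape[0])]
    for i, sw_row in enumerate(switches):
        for j, (r, c) in enumerate(sw_row):
            dx[r][c] += dout[i][j]  # gradient flows only to the max input
    return dx

x = [[1, 3, 2, 1],
     [4, 2, 0, 1],
     [0, 1, 5, 2],
     [2, 2, 1, 3]]
out, sw = maxpool2x2(x)
print(out)  # [[4, 2], [2, 5]]
```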
Receptive field

3x3 convolutions, stride 1.
The receptive field of a unit is the region of the input feature map whose values contribute to the response of that unit (either in the previous layer or in the initial image).
Receptive field size: 3.
Receptive field

3x3 convolutions, stride 1: stacking a second 3x3 convolution grows the receptive field on the input to 5.
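The growth of the receptive field through stacked layers follows a standard recurrence, sketched here: starting from r = 1 at a single output unit, each layer with kernel size k and stride s adds (k − 1) · jump, where jump is the product of the earlier strides.

```python
# Receptive-field bookkeeping for a stack of conv layers.

def receptive_field(layers):
    """layers: list of (kernel_size, stride), input-side first."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

print(receptive_field([(3, 1)]))          # 3: one 3x3 conv, stride 1
print(receptive_field([(3, 1), (3, 1)]))  # 5: two stacked 3x3 convs
```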
CNNs: typical architecture

Recall: we wanted to encode
• Spatial locality
• Translation invariance

[Architecture diagram: a 32x32 input image → (CONV → ReLU → CONV → ReLU → POOL) repeated three times → fully-connected layer → output.]
[ https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html ]
LeNet

The history of deep CNNs began with the appearance of LeNet (handwritten character recognition), trained on the MNIST digit dataset with 60K training examples.

LeCun Y, Jackel LD, Bottou L, Cortes C, Denker JS, Drucker H, Guyon I, Muller UA, Sackinger E, Simard P, et al. Learning algorithms for classification: a comparison on handwritten digit recognition. Neural Netw Stat Mech Perspect. 1995;261:276.
ImageNet

• ~14 million labeled images, 20k classes
• Images gathered from the Internet
• Human labels via Amazon MTurk
• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC):
  • 1.2 million training images, 1000 classes
  • 100k test images
Source: J. Johnson
GoogLeNet: Inception module
• Design a good network topology (network within network) and stack these
modules
• Parallel paths with different receptive field sizes and operations are meant to
capture sparse patterns of correlations in the stack of feature maps
• Use 1x1 convolutions for dimensionality reduction before expensive convolutions
Source: J. Johnson
• The first three paths use convolutional layers with window sizes of 1×1, 3×3, and 5×5 to extract information from different spatial sizes.
• The middle two paths perform a 1×1 convolution on the input to reduce the number of channels, reducing the model's complexity.
• The fourth path uses a 3×3 maximum pooling layer, followed by a 1×1 convolutional layer to change the number of channels.
• All four paths use appropriate padding to give the input and output the same height and width.
• Finally, the outputs along each path are concatenated along the channel dimension and comprise the block's output.
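Because the four paths are concatenated along channels, the block's output depth is just the sum of the per-path channel counts. A sketch of the bookkeeping, using the channel counts of GoogLeNet's first inception block as the example:

```python
# Output channels of an inception block = sum over its parallel paths.

def inception_out_channels(c1, c3, c5, pool_proj):
    # path 1: 1x1 conv               -> c1 channels
    # path 2: 1x1 reduce, 3x3 conv   -> c3 channels
    # path 3: 1x1 reduce, 5x5 conv   -> c5 channels
    # path 4: 3x3 max pool, 1x1 conv -> pool_proj channels
    return c1 + c3 + c5 + pool_proj

print(inception_out_channels(64, 128, 32, 32))  # 256
```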
1x1 convolution
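A tiny sketch of why a 1×1 convolution does dimensionality reduction: each output pixel is a linear combination of that pixel's input channels, so it mixes channels without looking at spatial neighbors (toy numbers; `conv1x1` is a name invented here):

```python
# 1x1 convolution: a per-pixel linear map from C_in to C_out channels.

def conv1x1(image, weights):
    # image: H x W x C_in (nested lists); weights: C_out x C_in
    return [[[sum(w[c] * px[c] for c in range(len(px))) for w in weights]
             for px in row] for row in image]

image = [[[1, 0, 2], [0, 1, 0]],
         [[3, 1, 0], [1, 1, 1]]]
weights = [[1, 1, 1],   # filter 1: sums the channels
           [0, 0, 1]]   # filter 2: picks out channel 2

out = conv1x1(image, weights)  # 2x2x3 input -> 2x2x2 output
print(out)  # [[[3, 2], [1, 0]], [[4, 0], [3, 1]]]
```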
GoogLeNet Model
• Uses a stack of 9 inception blocks
• 22 total layers with weights
• Maximum pooling between inception blocks
reduces the dimensionality.
• After the last conv layer, a global average
pooling layer is used that spatially averages
across each feature map before final FC layer.
GAP
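Global average pooling can be sketched in a couple of lines (pure Python; `global_avg_pool` is an illustrative name): each H×W feature map collapses to its spatial mean, so a volume of C maps becomes a C-vector for the final FC layer.

```python
# Global average pooling over a list of C feature maps (each H x W).

def global_avg_pool(maps):
    return [sum(map(sum, m)) / (len(m) * len(m[0])) for m in maps]

print(global_avg_pool([[[1, 3], [5, 7]],
                       [[0, 0], [0, 4]]]))  # [4.0, 1.0]
```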
ResNet
• Deep models have more representation power (more parameters)
than shallower models.
• But deeper models are harder to optimize
• What should the deeper model learn to be at least as good as the
shallower model?
• A solution by construction is copying the learned layers from the
shallower model and setting additional layers to identity mapping.
ResNet

• Naïve solution: if the extra layers are an identity mapping, then training errors do not increase.
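The construction above is the residual idea: a block computes y = F(x) + x, so if the extra layers learn F = 0 the block reduces to the identity and the deeper model is no worse than the shallower one. A minimal sketch (vectors as plain lists; names invented here):

```python
# A residual connection: output = F(x) + x (elementwise).

def residual_block(x, f):
    return [xi + fi for xi, fi in zip(x, f(x))]

def zero_fn(x):
    # "extra layers" that have learned the zero function
    return [0.0] * len(x)

print(residual_block([1.0, 2.0, 3.0], zero_fn))  # [1.0, 2.0, 3.0]
```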