Introduction to Convolutional Neural Networks
Lecture 5
Convolutional Neural Network (CNN)
• A class of neural networks
• Takes an image as input (mostly)
• Makes predictions about the input image
History
• The LeNet architecture (1990s)
First Strong Results
• AlexNet 2012
• Winner of ImageNet Large-Scale Visual Recognition Challenge (ILSVRC 2012)
• Error rate: 15.4% (the next best entry was at 26.2%)
Today: CNNs are everywhere
Classification
Today: CNNs are everywhere
Object detection: Faster R-CNN (Ren, He, Girshick, Sun, 2015)
Semantic segmentation: Semantic Segmentation Using GAN (Nasim, Concetto, and Mubarak, 2017)
Today: CNNs are everywhere
Image captioning: "Show and Tell: A Neural Image Caption Generator" (Vinyals, Oriol, et al., CVPR 2015)
Style transfer: "A Neural Algorithm of Artistic Style" (L. Gatys et al., 2015)
CNN – Not just images
• Natural Language Processing (NLP)
• Text classification
• Word to vector
• Audio Research
• Speech recognition
• Can be represented as spectrograms
• Converting data to a matrix (2-D) format
• 1D convolution – Audio, EEG, etc.
• 3D convolution - Videos
Background
What we already know!
General CNN architecture
What is a (digital) Image? - recap
• Definition: A digital image is defined by integrating and sampling continuous (analog) data in a spatial domain [Klette, 2014].
Filtering - recap
• Image filtering: compute a function of the local neighborhood at each position
Filtering - recap
• Output is a linear combination of the neighborhood pixels
Correlation (linear relationship) - recap
For a 3x3 image patch f and a 3x3 kernel h:

f: f1 f2 f3    h: h1 h2 h3
   f4 f5 f6       h4 h5 h6
   f7 f8 f9       h7 h8 h9

f ∘ h = f1h1 + f2h2 + f3h3 + f4h4 + f5h5 + f6h6 + f7h7 + f8h8 + f9h9
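The sum of elementwise products above can be sketched in a few lines of NumPy (the patch and kernel values here are made up for illustration):

```python
import numpy as np

# Hypothetical 3x3 image patch f and kernel h
f = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
h = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

# Correlation: elementwise product of patch and kernel, summed
corr = np.sum(f * h)  # f1*h1 + f2*h2 + ... + f9*h9
print(corr)  # 1 + 3 + 5 + 7 + 9 = 25
```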
Convolution – recap
Convolution is correlation with a kernel that has been flipped in both X and Y:

h (kernel):   after X flip:   after Y flip:
h1 h2 h3      h7 h8 h9        h9 h8 h7
h4 h5 h6      h4 h5 h6        h6 h5 h4
h7 h8 h9      h1 h2 h3        h3 h2 h1

f * h = f1h9 + f2h8 + f3h7 + f4h6 + f5h5 + f6h4 + f7h3 + f8h2 + f9h1
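A minimal sketch of the flip: convolution is correlation with the kernel reversed along both axes (the values are illustrative):

```python
import numpy as np

f = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
h = np.array([[0, 1, 0],
              [0, 0, 0],
              [0, 0, 0]])  # a single 1 at position h2

# Convolution = correlation with the kernel flipped in both X and Y
conv = np.sum(f * np.flip(h))             # f1*h9 + f2*h8 + ... + f9*h1
corr_flipped = np.sum(f * h[::-1, ::-1])  # same flip, written with slicing

# Plain correlation picks out f2 = 2; convolution picks out f8 = 8
print(np.sum(f * h), conv)  # 2 8
```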
Sobel Edge Detector
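The Sobel detector correlates the image with fixed gradient kernels; a minimal sketch on a toy patch containing a vertical edge:

```python
import numpy as np

# Standard Sobel kernels for horizontal and vertical gradients
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

# Toy 3x3 patch: dark on the left, bright on the right (vertical edge)
patch = np.array([[0, 0, 10],
                  [0, 0, 10],
                  [0, 0, 10]])

gx = np.sum(patch * sobel_x)  # strong response across the vertical edge
gy = np.sum(patch * sobel_y)  # no response along it
print(gx, gy)  # 40 0
```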
General CNN architecture
Multi-layer perceptron (MLP) – recap
• …is a ‘fully connected’ neural network with non-linear activation functions.
Learning phases
• Training: training images + labels → image features → trained classifier
• Testing: a test image (not in the training set) → image features → apply classifier → prediction
Slide credit: D. Hoiem and L. Lazebnik
General CNN architecture
Fundamental operation
Convolution
• Core building block of a CNN
• Spatial structure of the image is preserved
• A 3x3x3 filter slides over a 32x32x3 image; at each position the elementwise products are summed to produce one number in the result of the convolution
Convolution
• Convolution over the whole image
• A convolution layer with 6 filters maps the 32x32x3 input to a 30x30x6 output
Convolutional Network
• A convolutional network is a sequence of these layers
• 6 5x5x3 filters: 32x32x3 → 28x28x6
Convolutional Network
• A convolutional network is a sequence of these layers
• With 5x5 filters: 32x32x3 → 28x28x6 → 24x24x16
Parameters
• A 3x3x3 filter convolved with a 32x32x3 image produces a 30x30 activation map (feature map)
Parameters
• A convolution layer with 6 filters produces 6 activation maps: 32x32x3 → 30x30x6
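Each filter has F x F x C weights plus one bias, so the parameter count of a convolution layer is easy to check (here for a layer of 6 filters of size 3x3x3, as above):

```python
# Parameters of one convolution layer: each filter has F*F*C weights + 1 bias
def conv_layer_params(f, c_in, n_filters):
    return n_filters * (f * f * c_in + 1)

# 6 filters of size 3x3x3 on a 32x32x3 input
print(conv_layer_params(3, 3, 6))  # 6 * (27 + 1) = 168
```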
Convolution Operation
• Convolution of two functions f and g
• Example: convolving an image I with the derivative kernels d/dx and d/dy, then thresholding, yields edges
Demo
Filter (3x3):
1 0 1
0 1 0
1 0 1
Input (5x5):
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 0 0 0
Sliding the filter over the input and summing the elementwise products gives the top-left output value: 4
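The demo above can be reproduced with a plain sliding-window loop (stride 1, no padding):

```python
import numpy as np

img = np.array([[1, 1, 1, 0, 0],
                [0, 1, 1, 1, 0],
                [0, 0, 1, 1, 1],
                [0, 0, 1, 1, 0],
                [0, 1, 0, 0, 0]])
filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])

# Slide the 3x3 filter over the 5x5 image (stride 1, no padding) -> 3x3 output
out = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(img[i:i+3, j:j+3] * filt)

print(out[0, 0])  # 4, matching the top-left value on the slide
```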
Convolution - Intuition
• The image contains a "^"-shaped pattern and the kernel encodes the same pattern, so the response is strong:
Image:       Kernel:
0 0 0 0 0    0 0 0 0 0
0 0 1 0 0    0 0 1 0 0
0 1 0 1 0    0 1 0 1 0
1 0 0 0 1    1 0 0 0 1
0 0 0 0 0    0 0 0 0 0
Convolution - Intuition
• A different pattern against the same kernel overlaps in only one position, giving a weak response (1x1 = 1):
Image:       Kernel:
0 0 0 0 1    0 0 0 0 0
0 0 0 1 0    0 0 1 0 0
0 0 1 0 0    0 1 0 1 0
0 1 1 1 1    1 0 0 0 1
0 0 0 0 0    0 0 0 0 0
Convolution
• Multiple filters
• Two 3x3x3 filters applied to a 32x32x3 image produce two 30x30 activation maps (feature maps)
2D Convolution - dimensions
• 7x7 map, 3x3 filter
• N – input size
• F – filter size
• Sliding the filter one position at a time gives an output of size (N - F) + 1 = 5, i.e. a 5x5 map
Stride
• 7x7 map, 3x3 filter
• Output size: (N-F)/S + 1
• N = 7, F = 3
• Stride 1: (7-3)/1 + 1 = 5
• Stride 2: (7-3)/2 + 1 = 3
• Stride 3: (7-3)/3 + 1 = 2.33 – the filter does not fit evenly
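The output-size formula can be sketched as a small helper, checking the three strides above:

```python
# Output size of a convolution: (N - F) / S + 1
def conv_output_size(n, f, s):
    """Return the output size, or None when the filter does not fit evenly."""
    if (n - f) % s != 0:
        return None  # e.g. stride 3 on a 7x7 map with a 3x3 filter
    return (n - f) // s + 1

assert conv_output_size(7, 3, 1) == 5
assert conv_output_size(7, 3, 2) == 3
assert conv_output_size(7, 3, 3) is None  # (7-3)/3 + 1 = 2.33 does not fit
```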
Padding
• Zero padding in the input
• For a 7x7 input and a 3x3 filter, with padding of one pixel the output is 7x7
• Size (recall (N-F)/S + 1): (N - F + 2P)/S + 1
Padding
• Zero padding in the input
• Common to see (F-1)/2 padding with stride 1 to preserve the map size:
  N = (N - F + 2P)/S + 1
  (N - 1)S = N - F + 2P
  With S = 1: P = (F - 1)/2
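A quick check that P = (F-1)/2 with stride 1 preserves the map size (odd filter sizes assumed):

```python
# With padding P the output size becomes (N - F + 2P) / S + 1;
# P = (F - 1) / 2 with stride 1 preserves the map size for odd F
def padded_output_size(n, f, s, p):
    return (n - f + 2 * p) // s + 1

for f in (3, 5, 7):       # odd filter sizes
    p = (f - 1) // 2      # "same" padding
    assert padded_output_size(7, f, 1, p) == 7  # the 7x7 map is preserved
```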
Pooling
• Invariance to small translations of the input
Pooling
• Makes the representations
smaller
• Operates over each activation
map independently
Pooling
• Kernel size
• Stride
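A minimal sketch of max pooling over a single activation map (kernel size 2, stride 2; the input values are made up):

```python
import numpy as np

def max_pool(x, k=2, s=2):
    """Max pooling over a single activation map (a sketch, no padding)."""
    h = (x.shape[0] - k) // s + 1
    w = (x.shape[1] - k) // s + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*s:i*s+k, j*s:j*s+k].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [0, 2, 9, 8],
              [1, 1, 3, 4]])
print(max_pool(x))  # each 2x2 block is reduced to its maximum
```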
Visualizing CNN
Source : http://cs231n.github.io
AlexNet : Network Size
CONV1 → MAX POOL1 → NORM1 → CONV2 → MAX POOL2 → NORM2 → CONV3 → CONV4 → CONV5 → MAX POOL3 → FC6 → FC7 → FC8
• Input 227x227x3
• 5 convolution layers
• 3 dense layers
• Output 1000-D vector
AlexNet : Network Size
• Input: 227x227x3 images
• First layer (CONV1): 96 11x11 filters applied at stride 4
• What is the output volume size? (227-11)/4 + 1 = 55, i.e. 55x55x96
• What is the number of parameters? 11x11x3x96 ≈ 35K
AlexNet : Network Size
• After CONV1: 55x55x96
• Second layer (POOL1): 3x3 filters applied at stride 2
• What is the output volume size? (55-3)/2 + 1 = 27, i.e. 27x27x96
• What is the number of parameters in this layer? 0 (pooling has no learnable parameters)
AlexNet : Network Size
• After POOL1: 27x27x96
• Third layer (NORM1): Normalization
• What is the output volume size? 27x27x96 (unchanged)
AlexNet : Network Size
1. [227x227x3] INPUT
2. [55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0 – 35K params
3. [27x27x96] MAX POOL1: 3x3 filters at stride 2
4. [27x27x96] NORM1: Normalization layer
5. [27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2 – 307K params
6. [13x13x256] MAX POOL2: 3x3 filters at stride 2
7. [13x13x256] NORM2: Normalization layer
8. [13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1 – 884K params
9. [13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1 – 1.3M params
10. [13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1 – 442K params
11. [6x6x256] MAX POOL3: 3x3 filters at stride 2
12. [4096] FC6: 4096 neurons – 37M params
13. [4096] FC7: 4096 neurons – 16M params
14. [1000] FC8: 1000 neurons (class scores) – 4M params
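The spatial sizes in this table all follow from (N - F + 2P)/S + 1; a quick trace through the conv and pool layers:

```python
# Tracing AlexNet's spatial sizes with (N - F + 2P)/S + 1
def out_size(n, f, s, p=0):
    return (n - f + 2 * p) // s + 1

n = 227
n = out_size(n, 11, 4)      # CONV1 -> 55
assert n == 55
n = out_size(n, 3, 2)       # POOL1 -> 27 (NORM1 keeps 27)
assert n == 27
n = out_size(n, 5, 1, p=2)  # CONV2 -> 27
n = out_size(n, 3, 2)       # POOL2 -> 13
n = out_size(n, 3, 1, p=1)  # CONV3 -> 13 (CONV4, CONV5 likewise)
n = out_size(n, 3, 2)       # POOL3 -> 6, giving the 6x6x256 input to FC6
assert n == 6
```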
AlexNet Parameters
conv1: (11*11)*3*96 + 96 = 34944
conv2: (5*5)*96*256 + 256 = 614656
conv3: (3*3)*256*384 + 384 = 885120
conv4: (3*3)*384*384 + 384 = 1327488
conv5: (3*3)*384*256 + 256 = 884992
fc1: (6*6)*256*4096 + 4096 = 37752832
fc2: 4096*4096 + 4096 = 16781312
fc3: 4096*1000 + 1000 = 4097000
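These counts (weights plus biases, without AlexNet's two-GPU grouping) can be recomputed with a short script:

```python
# Recomputing the parameter counts above (weights + biases)
def conv_params(f, c_in, c_out):
    return f * f * c_in * c_out + c_out

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

params = [
    conv_params(11, 3, 96),        # conv1: 34944
    conv_params(5, 96, 256),       # conv2: 614656
    conv_params(3, 256, 384),      # conv3: 885120
    conv_params(3, 384, 384),      # conv4: 1327488
    conv_params(3, 384, 256),      # conv5: 884992
    fc_params(6 * 6 * 256, 4096),  # fc1: 37752832
    fc_params(4096, 4096),         # fc2: 16781312
    fc_params(4096, 1000),         # fc3: 4097000
]
print(sum(params))  # 62378344, about 62.4M parameters in total
```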
Visualizing Convolution
Why not a correlation neural network?
• It could be – convolution and correlation differ only by a flip of the kernel
• Deep learning libraries actually implement (cross-)correlation and call it convolution; since the filters are learned, the flip makes no difference
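Using SciPy for illustration: true convolution equals correlation with a doubly-flipped kernel, so a "convolution" layer that learns the flipped kernel computes exactly the same output:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

f = np.arange(25, dtype=float).reshape(5, 5)
h = np.array([[1., 2., 0.],
              [0., 1., 0.],
              [0., 0., 3.]])  # deliberately asymmetric kernel

# True convolution equals correlation with the kernel flipped in X and Y
conv = convolve2d(f, h, mode='valid')
corr = correlate2d(f, np.flip(h), mode='valid')
assert np.allclose(conv, corr)
```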