Week2_lecture1_2
• Perceptron
• Regularization
• CNN layer
Image Classification: Traditional Data-Driven Approach
Testing Stage
(Diagram: x -> h -> s)
Convolution
A 3x3x3 filter is convolved with a 32x32x3 image: at each spatial location, the dot product between the filter and the image patch gives 1 number. Sliding the filter over all locations produces a 30x30 result of convolution.
• Multiple filters
Applying two 3x3x3 filters to the 32x32x3 image gives two 30x30 activation maps (feature maps).
Convolution layer: with 6 filters, the 32x32x3 input is mapped to a 30x30x6 stack of activation maps.
Convolution layers can be stacked: a 32x32x3 input is mapped to a 28x28x6 volume by a first convolution layer (6 filters), and then to a 24x24x16 volume by a second convolution layer (16 filters).
Convolution - Stride
Output size = (N - F)/S + 1
N - input size, F - filter size, S - stride
Convolution - Zero padding
• Zero padding adds a border of zeros around the input.
• For a 7x7 input and a 3x3 filter, padding of one pixel gives a 7x7 output.
• Pad (F-1)/2 zeros on each side to preserve the spatial size when S=1.
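To make the arithmetic concrete, here is a minimal sketch (plain Python; the helper name is my own) that evaluates the output-size formula, with P zeros of padding per side added to the stride formula above:

```python
def conv_output_size(n, f, s=1, p=0):
    """Spatial output size of a convolution:
    n = input size, f = filter size, s = stride, p = zero padding per side."""
    return (n - f + 2 * p) // s + 1

# 32x32 input, 3x3 filter, stride 1, no padding -> 30
print(conv_output_size(32, 3))            # 30
# 7x7 input, 3x3 filter, stride 1, padding (F-1)/2 = 1 -> 7 (size preserved)
print(conv_output_size(7, 3, s=1, p=1))   # 7
```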
Components of a Convolutional Network
Convolution Layers Pooling Layers Fully-Connected Layers
Fully-Connected Layer
32x32x3 image -> stretch to 3072 x 1
Input: 3072 x 1, weights W: 10 x 3072, output: 10 x 1.
Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
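A small NumPy sketch of this layer (array names are illustrative): each of the 10 outputs is the dot product of one row of the 10 x 3072 weight matrix with the stretched 3072-dimensional input.

```python
import numpy as np

x = np.random.rand(32, 32, 3).reshape(3072)   # stretch 32x32x3 image to a 3072-vector
W = np.random.randn(10, 3072)                 # one row of weights per class
s = W @ x                                     # 10 class scores
# s[k] is the 3072-dimensional dot product between row k of W and the input
print(s.shape)                                # (10,)
```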
Components of a Convolutional Network
Convolution Layers Pooling Layers Fully-Connected Layers
(Figure: a network of stacked FC -> BN -> tanh blocks, with an ImageNet accuracy comparison.)
Receptive fields in CNNs
• The area in the input image "seen" by a unit in a CNN
• Units in deeper layers will have wider receptive fields
How to increase receptive field in CNN?
Ø Use large convolution kernels (e.g. 7x7 conv) to increase the receptive field?
Limitation: increases the number of parameters
The receptive field of three successive 3x3 convolutions equals the receptive field of one 7x7 convolution,
but needs only (9+9+9)x parameters instead of 49x parameters.
https://arxiv.org/abs/1603.07285
[Dumoulin and Visin, 2018]
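A quick check of the parameter counts, assuming C input and output channels per layer and ignoring biases (C = 64 here is my own choice):

```python
C = 64  # assumed number of channels in and out of each layer

params_7x7 = 7 * 7 * C * C              # one 7x7 conv: 49 * C^2 weights
params_3x3_stack = 3 * (3 * 3 * C * C)  # three 3x3 convs: (9+9+9) * C^2 weights

print(params_7x7, params_3x3_stack, params_3x3_stack / params_7x7)
# 200704 110592 ~0.55  -> roughly 27/49 of the parameters, same 7x7 receptive field
```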
How to increase receptive field in CNN?
Ø Add pooling layers to increase the receptive field?
Issue: local details are lost due to the pooling operation
• Aggregate multiple values into a single value
• The next layer observes a larger receptive field
• Hierarchically extract more abstract features
Ø More layers?
Issue: vanishing gradients: the magnitude of backpropagated gradients decreases rapidly in the initial layers
Solution: use skip connections (see the sketch below) or intermediate supervision to ensure greater variance in gradients
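As a minimal sketch of the skip-connection idea (PyTorch; the block is generic, assumes the number of channels is preserved, and is not a specific architecture from the slides):

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection (channel count preserved)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # skip connection: gradients also flow through the identity path

x = torch.randn(1, 64, 32, 32)
print(SkipBlock(64)(x).shape)       # torch.Size([1, 64, 32, 32])
```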
Cross-Entropy Loss
Loss Function (recap): a loss function tells how good our current classifier is.
Example scores: cat 3.2, car 5.1, frog -1.7
Cross-Entropy Loss
Want to interpret raw classifier scores as probabilities.

Scores: s = f(x_i; W)
Softmax function: P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j)
Loss: L_i = -log P(Y = y_i | X = x_i)

The scores are unnormalized log-probabilities (logits). Applying exp gives unnormalized probabilities (must be >= 0); normalizing gives probabilities (must sum to 1), which are compared against the correct (one-hot) probabilities:

       logits   exp     normalize   correct probs
cat    3.2      24.5    0.13        1.00
car    5.1      164.0   0.87        0.00
frog   -1.7     0.18    0.00        0.00

For the correct class (cat): L_i = -log(0.13) = 2.04
Cross entropy: H(P, Q) = H(P) + D_KL(P || Q)
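A minimal NumPy sketch that reproduces the numbers in this example (variable names are my own):

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])      # logits for cat, car, frog
correct_class = 0                        # ground-truth class: cat

unnorm = np.exp(scores)                  # [24.5, 164.0, 0.18]  (must be >= 0)
probs = unnorm / unnorm.sum()            # [0.13, 0.87, 0.00]   (must sum to 1)
loss = -np.log(probs[correct_class])     # -log(0.13) = 2.04

print(np.round(probs, 2), round(loss, 2))
```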
Cross-Entropy Loss
Scores: cat 3.2, car 5.1, frog -1.7
Q: What is the min / max possible loss Li?
A: Min 0, max +infinity
Q: If all scores are small random values, what is the loss?
A: -log(1/C); for C = 10 classes, log(10) ≈ 2.3
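A quick numerical check of the last answer, as a small sketch of my own: with near-zero random scores, softmax is roughly uniform, so the loss is about -log(1/C) = log(C).

```python
import numpy as np

C = 10
scores = 0.001 * np.random.randn(C)            # all scores are small random values
probs = np.exp(scores) / np.exp(scores).sum()  # roughly uniform: ~1/C each
print(-np.log(probs[0]), np.log(C))            # both are approximately 2.3
```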
Summary
• Components of CNN
• Convolution
• Pooling
• Activation functions
• Fully connected layers
• Normalization: BN, LN
• Cross Entropy loss
• Receptive field
CNN Architectures
CNN Architectures: Research Impact
• AlexNet: Publication year 2012, Citations 160k+
• VGG: Publication year 2014, Citations 129k+
• ResNet: Publication year 2016, Citations 232k+
Ø Darwin, “On the origin of species”, Publication year 1859, Citations 64k+
Ø Shannon, “A mathematical theory of communication”, Publication year 1948, Citations: 155k+
Computational requirements for a CNN architecture

Convolve each filter over all spatial locations of the C x H x W input (here 3 x 227 x 227).
Example: output size = (227 - 11 + 2*2)/4 + 1 = 56 for the first convolution layer (11x11 filter, stride 4, pad 2).

Layer    C_in   Size   Filters  Kernel  Stride  Pad   C_out  Size_out  Mem(KB)  Params(K)  FLOPs(M)
conv4    384    13     256      3       1       1     256    13        169      885        145
conv5    256    13     256      3       1       1     256    13        169      590        100
pool5    256    13     -        3       2       0     256    6         36       -          -
flatten  256    6      -        -       -       -     9216   -         36       -          -
fc6      9216   -      4096     -       -       -     4096   -         16       37,749     38
fc7      4096   -      4096     -       -       -     4096   -         16       16,777     17
fc8      4096   -      1000     -       -       -     1000   -         4        4,096      4

Note
• Most papers use "1 FLOP" = "1 multiply and 1 addition", so the dot product of two N-dim vectors takes N FLOPs; some papers say MADD or MACC instead of FLOP.
• Other sources (e.g. NVIDIA marketing material) count "1 multiply and 1 addition" = 2 FLOPs, so the dot product of two N-dim vectors takes 2N FLOPs (N MACC).
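As a sketch, the table entries for a convolution layer can be reproduced with a small helper of my own, using the "1 FLOP = 1 multiply + 1 add" convention from the note and assuming 4 bytes per output element for memory:

```python
def conv_layer_cost(c_in, size_in, c_out, k, stride, pad):
    """Output size, output memory (KB), parameters (K) and FLOPs (M) of a conv layer."""
    size_out = (size_in - k + 2 * pad) // stride + 1
    memory_kb = c_out * size_out * size_out * 4 / 1024     # 4 bytes per output element
    params = c_out * c_in * k * k + c_out                  # weights + biases
    flops = c_out * size_out * size_out * c_in * k * k     # one multiply-add per weight per output
    return size_out, memory_kb, params / 1e3, flops / 1e6

# conv5 row: 256 -> 256 channels, 13x13 input, 3x3 kernel, stride 1, pad 1
print(conv_layer_cost(256, 13, 256, 3, 1, 1))
# (13, 169.0, 590.08, ~99.7) -> matches the 169 KB / 590 K params / ~100 MFLOPs in the table
```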
Computational requirements for a CNN architecture
(Bar charts: per-layer memory usage, parameter count, and FLOPs across the conv1-conv5 and fc6-fc8 layers.)
Computationally Efficient Convolution Operators
Depthwise Separable Convolutions
Standard 2D Convolutions
Standard 2D convolution to create an output with 1 layer, using 1 filter.
Standard 2D convolution to create an output with 128 layers, using 128 filters.
Standard 2D convolution: mapping one layer with depth Din to another layer with depth Dout, using Dout filters.
Depthwise Separable Convolutions
Standard 2D convolution
(i) Depthwise convolution: performs lightweight filtering by applying a single convolutional filter per input channel.
(ii) Pointwise convolution: a 1x1 convolution layer that is responsible for building new features by computing linear combinations of the input channels.
Depthwise Separable Convolutions
Number of FLOPs = (number of output elements) * (ops per output element)
• FLOPs of standard convolution = (Cout x H' x W') * (Cin x K x K)
• FLOPs of depthwise separable convolution = FLOPs of depthwise convolution + FLOPs of pointwise convolution
  = (Cin x H' x W') * (1 x K x K) + (Cout x H' x W') * (Cin x 1 x 1)
  = (Cin x H' x W') * (K x K + Cout)
• Depthwise separable convolution reduces FLOPs by almost a factor of K^2 compared to standard convolution (since Cout >> K x K).
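A minimal PyTorch sketch of a depthwise separable convolution (module name and sizes are my own choices): the depthwise step uses groups = Cin so there is one K x K filter per input channel, and the pointwise step is a 1x1 convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # depthwise: one KxK filter per input channel (groups = c_in)
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=k, padding=k // 2, groups=c_in)
        # pointwise: 1x1 conv builds c_out linear combinations of the input channels
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)    # torch.Size([1, 64, 56, 56])

# FLOP ratio vs. standard conv: (K*K + Cout) / (Cout * K*K), roughly 1/K^2 when Cout >> K*K
c_in, c_out, k = 32, 64, 3
print((k * k + c_out) / (c_out * k * k))          # ~0.13, close to 1/9
```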
Grouped Convolutions
Standard Convolutions
Convolution with groups=1: Standard convolution
Grouped Convolutions (groups = 2)
Split the Cin x H x W input into two groups of (Cin/2) x H x W.
Group 1: Conv(K x K, Cin/2 -> Cout/2), output (Cout/2) x H' x W'
Group 2: Conv(K x K, Cin/2 -> Cout/2), output (Cout/2) x H' x W'
Concat the two outputs: Cout x H' x W'
Grouped Convolution
Convolution with groups = G: G parallel conv layers; each "sees" Cin/G input channels and produces Cout/G output channels.
Input: Cin x H x W, split to G x [(Cin/G) x H x W]
Weight: G x [(Cout/G) x (Cin/G) x K x K] parallel convolutions
Output: G x [(Cout/G) x H' x W'], concatenated to Cout x H' x W'
FLOPs: (Cout x Cin x K^2 x H' x W') / G
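A minimal PyTorch sketch of a grouped convolution using the groups argument of nn.Conv2d (channel counts are illustrative); with G groups, the weight tensor and the FLOPs both shrink by a factor of G.

```python
import torch
import torch.nn as nn

c_in, c_out, k, G = 64, 128, 3, 4

standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1, bias=False)
grouped = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1, groups=G, bias=False)

# weight shapes: (Cout, Cin, K, K) vs (Cout, Cin/G, K, K)
print(standard.weight.shape, grouped.weight.shape)
# torch.Size([128, 64, 3, 3]) torch.Size([128, 16, 3, 3])

x = torch.randn(1, c_in, 32, 32)
print(grouped(x).shape)                            # torch.Size([1, 128, 32, 32])
```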
Summary
• Components of CNN
• Convolution
• Pooling
• Activation functions
• Fully connected layers
• Normalization: BN
• Cross Entropy loss
• Computation of FLOPs, Parameters and Memory Requirements
• Variants of Convolution Operation