
CV801: Week 2 Lectures 1 and 2

CNN for Image Classification


Expected Deep Learning and CNN Background

• Perceptron
• Multi-layer perceptron
• Backpropagation
• Stochastic gradient descent
• Cross-entropy loss
• CNN layer
• Regularization
• Dropout
• Data augmentation
• Batch normalization
Image Classification: Traditional Data-Driven Approach

Image pixels -> hand-crafted feature extraction -> trainable classifier -> object class

• Features are not learned.
• The trainable classifier is often generic (e.g. SVM).
Image Classification: Data-Driven Approach

Training stage: training images + training labels -> image features -> classifier training -> trained classifier

Testing stage: test image -> image features -> trained classifier -> prediction (e.g. "Outdoor")
Outline
• Components of a CNN
• Convolution
• Pooling
• Activation functions
• Fully connected layers
(CV701 deep learning slides shared)
• Normalization: BN
• Cross-entropy loss
• Training: backpropagation, SGD
• Transfer learning
• Computation of FLOPs, parameters and memory requirements
• Important CNN architectures and design choices
Convolutional Neural Networks

Convolutional Neural Networks (CNNs)

• Convolution: apply filters to generate feature maps.
• Non-linearity (activation function).
• Pooling: downsampling operation on each feature map.
• Fully connected layers.
Components of a Convolutional Network
Convolution Layers • Pooling Layers • Fully-Connected Layers • Activation Function • Normalization
Convolution

• Core building block of a CNN.
• The spatial structure of the image is preserved.
• A filter/kernel (e.g. 3x3x3) is convolved with the image (e.g. a 32x32x3 image).
Convolution

• Convolution at one spatial location: applying the 3x3x3 filter at one position of the 32x32x3 image produces a single number.
Convolution

• Convolution over the whole image: convolving the 3x3x3 filter over all spatial locations of the 32x32x3 image produces a 30x30x1 activation map (feature map).
Convolution

• Multiple filters: with 2 filters of size 3x3x3, convolving over all spatial locations produces 2 activation maps (feature maps), i.e. a 30x30x2 output.
Convolution

• One convolution layer with 6 kernels of size 3x3x3 maps the 32x32x3 image to 30x30x6 activation maps.
Convolution

• A convolutional network is a sequence of these layers: a 32x32x3 input passed through 6 5x5x3 filters gives a 28x28x6 output; passing that through 16 5x5x6 filters gives a 24x24x16 output.
Convolution: Output Size

• Sliding a 3x3 filter over a 7x7 map produces a 5x5 output activation map.

Output size = N - F + 1 = (7 - 3 + 1) = 5

N - input size
F - filter size
Convolution: Stride

• 7x7 map, 3x3 filter, filter applied with stride 2: the activation map size is 3x3.

Output size = (N - F)/S + 1 = (7 - 3)/2 + 1 = 3
Convolution: Zero Padding

• Zero padding adds a border of zeros around the input.
• For a 7x7 input and a 3x3 filter with padding of one pixel, the output is 7x7.
• Pad (F - 1)/2 zeros on each side to preserve the spatial size when S = 1.
• In general, output size = (N - F + 2P)/S + 1, where P is the padding (see the sketch below).
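A minimal Python sketch of this output-size formula (the function name and the example values are illustrative, not from the slides):

```python
def conv_output_size(n, f, s=1, p=0):
    """Spatial output size of a convolution: floor((N - F + 2P) / S) + 1."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(7, 3))               # 5  -> stride 1, no padding
print(conv_output_size(7, 3, s=2))          # 3  -> stride 2
print(conv_output_size(7, 3, p=1))          # 7  -> padding of one pixel preserves the size
print(conv_output_size(227, 11, s=4, p=2))  # 56 -> conv1 of the example network used later
```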
Components of a Convolutional Network
Convolution Layers • Pooling Layers • Fully-Connected Layers • Activation Function • Normalization
Pooling Layers

• Make the representations smaller.
• Operate over each activation map independently.
• Hyperparameters: kernel size and stride.
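To make the operation concrete, here is a naive NumPy sketch of max pooling over a single activation map (for exposition, not an efficient implementation):

```python
import numpy as np

def max_pool2d(x, k=2, s=2):
    """Max pooling over one activation map x of shape (H, W)."""
    h_out = (x.shape[0] - k) // s + 1
    w_out = (x.shape[1] - k) // s + 1
    out = np.empty((h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            # take the maximum inside each k x k window
            out[i, j] = x[i * s:i * s + k, j * s:j * s + k].max()
    return out

x = np.arange(16).reshape(4, 4)
print(max_pool2d(x))  # [[ 5  7]
                      #  [13 15]]
```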
Components of a Convolutional Network
Convolution Layers • Pooling Layers • Fully-Connected Layers • Activation Function • Normalization
Activation Functions

2-layer neural network

The function max(0, x) is called the "Rectified Linear Unit" (ReLU); it is the activation function of the neural network.

Q: What happens if we build a neural network with no activation function?
A: We end up with a linear classifier!
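A small NumPy check of this answer, with illustrative layer sizes: without an activation function, two stacked linear layers collapse into a single linear map, while inserting a ReLU between them breaks the collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3072)            # a flattened 32x32x3 image (illustrative)
W1 = rng.standard_normal((100, 3072))    # first layer weights
W2 = rng.standard_normal((10, 100))      # second layer weights

# No activation: W2 (W1 x) is the same as the single linear map (W2 W1) x
no_act = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(no_act, collapsed))    # True -> still just a linear classifier

# With ReLU between the layers, the composition is no longer linear
relu = lambda z: np.maximum(0, z)
with_act = W2 @ relu(W1 @ x)
print(np.allclose(with_act, collapsed))  # False (in general)
```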
Components of a Convolutional Network
Convolution Layers • Pooling Layers • Fully-Connected Layers • Activation Function • Normalization
Fully-Connected Layer

A 32x32x3 image is stretched into a 3072 x 1 vector. With a 10 x 3072 weight matrix W, the 3072-dimensional input maps to a 10-dimensional output. Each output value is one number: the result of taking the dot product between a row of W and the input (a 3072-dimensional dot product).
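A minimal NumPy sketch of this layer (random weights for illustration; the bias term is an assumption, it is not shown on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))  # input image
x = image.reshape(-1)                     # stretch to a 3072-dimensional vector
W = rng.standard_normal((10, 3072))       # 10 x 3072 weight matrix
b = np.zeros(10)                          # bias (assumed)

scores = W @ x + b                        # each score = dot(row of W, x)
print(x.shape, scores.shape)              # (3072,) (10,)
```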
Components of a Convolutional Network
Convolution Layers • Pooling Layers • Fully-Connected Layers • Activation Function • Normalization
Batch Normalization

Idea: "normalize" the outputs of a layer so that they have zero mean and unit variance.

Why? It helps reduce "internal covariate shift" and improves optimization.

We can normalize a batch of activations this way, use the normalization as an operator in our networks, and backprop through it!

Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015
Batch Normalization for ConvNets

Batch normalization for fully-connected networks:
  x : N x D
  μ, σ : 1 x D
  γ, β : 1 x D
  y = γ (x - μ) / σ + β

Batch normalization for convolutional networks (spatial batchnorm, BatchNorm2D):
  x : N x C x H x W
  μ, σ : 1 x C x 1 x 1
  γ, β : 1 x C x 1 x 1
  y = γ (x - μ) / σ + β
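A minimal NumPy sketch of the fully-connected case in training mode (a real BatchNorm layer also keeps running statistics of μ and σ for use at test time):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization for an (N, D) batch; gamma, beta have shape (D,)."""
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.default_rng(0).standard_normal((8, 4)) * 3 + 5
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```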
Batch Normalization

Usually inserted after fully-connected or convolutional layers, and before the nonlinearity:

... -> FC -> BN -> tanh -> FC -> BN -> tanh -> ...

Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015
Batch Normalization: Properties

- Makes deep networks much easier to train!
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test time: can be fused with the preceding conv!
- Not well understood theoretically (yet)
- Behaves differently during training and testing: this is a very common source of bugs!

[Figure: ImageNet accuracy vs. training iterations, with and without batch normalization]

Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015
Layer Normalization

Batch normalization for fully-connected networks (statistics over the batch dimension):
  x : N x D
  μ, σ : 1 x D
  γ, β : 1 x D
  y = γ (x - μ) / σ + β

Layer normalization for fully-connected networks (statistics over the feature dimension):
  x : N x D
  μ, σ : N x 1
  γ, β : 1 x D
  y = γ (x - μ) / σ + β

Layer normalization has the same behavior at train and test time. Used in RNNs, Transformers and ConvNeXt.
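The only difference is the axis over which the statistics are computed; a small NumPy illustration for an N x D input:

```python
import numpy as np

x = np.random.default_rng(0).standard_normal((8, 4))  # (N, D) batch of activations
eps = 1e-5

# Batch norm: statistics per feature, computed over the batch axis -> shape (1, D)
bn = (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)

# Layer norm: statistics per example, computed over the feature axis -> shape (N, 1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0).round(6))  # ~0 for every feature (column)
print(ln.mean(axis=1).round(6))  # ~0 for every example (row)
```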
Comparison of Normalization Layers

Wu and He, “Group Normalization”, ECCV 2018


Summary: Components of a Convolutional Network
Convolution Layers • Pooling Layers • Fully-Connected Layers • Activation Function • Normalization
Receptive Fields in CNNs

• The receptive field of a unit is the area in the input image "seen" by that unit.
• Units in deeper layers have wider receptive fields.
How to increase the receptive field in a CNN?

Convolution builds hierarchical features.

Ø Use large convolution kernels (e.g. 7x7 conv) to increase the receptive field?
Limitation: this increases the number of parameters.
The receptive field of three successive 3x3 convolutions equals the receptive field of one 7x7 convolution, yet needs only (9+9+9)x parameters instead of 49x parameters.
[Dumoulin and Visin, 2018] https://arxiv.org/abs/1603.07285

Ø Add more pooling layers to increase the receptive field?
Issue: local details are lost due to the pooling operation.
• Aggregates multiple values into a single value
• The next layer observes a larger receptive field
• More abstract features are extracted hierarchically

Ø Add more layers?
Issue: vanishing gradients. The magnitude of backpropagated gradients decreases rapidly in the initial layers.
Solution: use skip connections or intermediate supervision to ensure greater variance in gradients.

(A small sketch of receptive-field growth follows below.)
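A small sketch of how the receptive field grows with stacked layers, using the standard recurrence r <- r + (k - 1) * j and j <- j * s (the layer choices are illustrative, not from the slides):

```python
def receptive_field(layers):
    """Receptive field of one unit after a stack of conv/pool layers.
    layers: list of (kernel_size, stride) tuples."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump   # each layer widens the field by (k-1) input-space steps
        jump *= s             # striding makes later layers take bigger steps
    return r

# Three successive 3x3 convolutions see as much as one 7x7 convolution:
print(receptive_field([(3, 1)] * 3))              # 7
print(receptive_field([(7, 1)]))                  # 7
# A stride-2 pooling layer makes the following 3x3 convolution grow faster:
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```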
Cross-Entropy Loss

Loss Function (recap)

A loss function tells how good our current classifier is:
• Low loss = good classifier
• High loss = bad classifier
(Also called: objective function, cost function. The negative of a loss function is sometimes called a reward function, profit function, utility function, fitness function, etc.)

Given a dataset of examples {(x_i, y_i)}, where x_i is an image and y_i is its (integer) label:
• The loss for a single example is L_i(f(x_i, W), y_i).
• The loss for the batch is the average of the per-example losses: L = (1/N) Σ_i L_i(f(x_i, W), y_i).
Cross-Entropy Loss

We want to interpret raw classifier scores as probabilities.

Given scores s = f(x_i; W), the softmax function converts them to probabilities:

P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j)

• The raw scores are unnormalized log-probabilities (logits).
• Applying exp makes them unnormalized probabilities (must be >= 0).
• Normalizing makes them probabilities that sum to 1.

The loss is the negative log-probability of the correct class:

L_i = -log P(Y = y_i | X = x_i)

Example (correct class: cat):

Class  | Score (logit) | exp(score) | Probability | Correct probs
cat    |      3.2      |    24.5    |    0.13     |     1.00
car    |      5.1      |   164.0    |    0.87     |     0.00
frog   |     -1.7      |    0.18    |    0.00     |     0.00

L_i = -log(0.13) = 2.04

Comparing the predicted distribution Q with the correct distribution P gives the cross entropy:
H(P, Q) = H(P) + D_KL(P || Q)
Cross-Entropy Loss: Questions

Q: What is the min / max possible loss L_i?
A: Min 0, max +infinity.

Q: If all scores are small random values, what is the loss?
A: -log(1/C); for C = 10 classes, log(10) ≈ 2.3.
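A minimal NumPy sketch of softmax plus cross-entropy that reproduces the numbers above (the max-subtraction is a standard numerical-stability trick, not something stated on the slides):

```python
import numpy as np

def softmax(s):
    s = s - s.max()              # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum()

def cross_entropy(scores, correct_class):
    return -np.log(softmax(scores)[correct_class])

scores = np.array([3.2, 5.1, -1.7])        # cat, car, frog logits from the example
print(softmax(scores).round(2))            # [0.13 0.87 0.  ]
print(cross_entropy(scores, 0).round(2))   # 2.04  (correct class: cat)

# Near-zero scores over C = 10 classes give loss -log(1/C)
print(cross_entropy(np.zeros(10), 0).round(2))  # 2.3
```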
Summary
• Components of CNN
• Convolution
• Pooling
• Activation functions
• Fully connected layers
• Normalization: BN, LN
• Cross-entropy loss
• Receptive field
CNN Architectures

CNN Architectures: Research Impact
• AlexNet: publication year 2012, citations 160k+
• VGG: publication year 2014, citations 129k+
• ResNet: publication year 2016, citations 232k+

Ø Darwin, "On the Origin of Species", publication year 1859, citations 64k+
Ø Shannon, "A Mathematical Theory of Communication", publication year 1948, citations 155k+

All citation counts are as per Google Scholar on 26 Aug 2024.
Objectives
• Learn good design practices from the literature.
• Learn how to compute FLOPs, parameters and memory requirements.

How to compare different CNN architectures?
• Accuracy on challenging datasets (e.g. ImageNet)
• Number of parameters
• Floating point operations, FLOPs (multiply + add)
• Memory requirements
Computational requirements for a CNN architecture

Running example: an AlexNet-style network with a 227x227x3 input image and 11x11x3 filters in the first layer. For every layer we track the input size, the layer configuration, the output size, the memory needed to store the output, the number of parameters, and the FLOPs. The following paragraphs show how each entry of this table is computed.

Layer   | Cin  | H/W in | filters | kernel | stride | pad | Cout | H/W out | memory (KB) | params (k) | flop (M)
conv1   |    3 |    227 |      64 |     11 |      4 |   2 |   64 |      56 |         784 |         23 |       73
pool1   |   64 |     56 |         |      3 |      2 |   0 |   64 |      27 |         182 |          0 |        0
conv2   |   64 |     27 |     192 |      5 |      1 |   2 |  192 |      27 |         547 |        307 |      224
pool2   |  192 |     27 |         |      3 |      2 |   0 |  192 |      13 |         127 |          0 |        0
conv3   |  192 |     13 |     384 |      3 |      1 |   1 |  384 |      13 |         254 |        664 |      112
conv4   |  384 |     13 |     256 |      3 |      1 |   1 |  256 |      13 |         169 |        885 |      145
conv5   |  256 |     13 |     256 |      3 |      1 |   1 |  256 |      13 |         169 |        590 |      100
pool5   |  256 |     13 |         |      3 |      2 |   0 |  256 |       6 |          36 |          0 |        0
flatten |  256 |      6 |         |        |        |     | 9216 |         |          36 |          0 |        0
fc6     | 9216 |        |    4096 |        |        |     | 4096 |         |          16 |     37,749 |       38
fc7     | 4096 |        |    4096 |        |        |     | 4096 |         |          16 |     16,777 |       17
fc8     | 4096 |        |    1000 |        |        |     | 1000 |         |           4 |      4,096 |        4
Output channels of conv1: recall that the number of output channels equals the number of filters, so conv1 (64 filters) produces 64 output channels.
Output spatial size of conv1: recall W' = (W - K + 2P) / S + 1 = (227 - 11 + 2*2) / 4 + 1 = 220/4 + 1 = 56.
Memory to store the conv1 output:
Number of output elements = Cout * H' * W' = 64 * 56 * 56 = 200,704
Bytes per element = 4 (for 32-bit floating point)
KB = (number of elements) * (bytes per element) / 1024 = 200,704 * 4 / 1024 = 784
Number of learnable parameters in conv1:
Weight shape = Cout x Cin x K x K = 64 x 3 x 11 x 11
Bias shape = Cout = 64
Number of parameters = 64*3*11*11 + 64 = 23,296 ≈ 23k
Number of floating point operations (FLOPs) for conv1:

Note: most papers use "1 FLOP" = "1 multiply and 1 addition", so the dot product of two N-dimensional vectors takes N FLOPs; some papers say MADD or MACC instead of FLOP. Other sources (e.g. NVIDIA marketing material) count one multiply and one addition as 2 FLOPs, so the dot product of two N-dimensional vectors takes 2N FLOPs (N MACs).

FLOPs = (number of output elements) * (operations per output element)
      = (Cout x H' x W') * (Cin x K x K)
      = (64 * 56 * 56) * (3 * 11 * 11)
      = 200,704 * 363
      = 72,855,552 MACs ≈ 73 MFLOPs (assuming 1 MAC = 1 FLOP)
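Putting the conv-layer formulas together, a small Python helper that reproduces the conv1 row of the table above (the function name and interface are my own, not from the slides):

```python
def conv_layer_stats(c_in, hw_in, c_out, k, stride, pad, bytes_per_elem=4):
    """Output size, activation memory, parameters, and FLOPs (1 MAC = 1 FLOP)
    for a square convolution layer, following the slide formulas."""
    hw_out = (hw_in - k + 2 * pad) // stride + 1
    n_out = c_out * hw_out * hw_out                # number of output elements
    memory_kb = n_out * bytes_per_elem / 1024      # memory for the output activations
    params = c_out * c_in * k * k + c_out          # weights + biases
    flops = n_out * (c_in * k * k)                 # one MAC per weight per output element
    return hw_out, memory_kb, params, flops

# conv1 of the table: 3 x 227 x 227 input, 64 filters of 11x11, stride 4, pad 2
print(conv_layer_stats(3, 227, 64, 11, 4, 2))
# (56, 784.0, 23296, 72855552)  -> 56, 784 KB, ~23k params, ~73 MFLOPs
```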
For the pooling layer pool1:
Number of output channels = number of input channels = 64
W' = floor((W - K) / S + 1) = floor((56 - 3)/2 + 1) = floor(27.5) = 27
Memory to store the pool1 output:
Number of output elements = Cout x H' x W', with 4 bytes per element
Memory KB = Cout * H' * W' * 4 / 1024 = 64 * 27 * 27 * 4 / 1024 = 182.25
Learnable parameters? Pooling layers have no learnable parameters!
Floating-point operations for the pooling layer:
≈ (number of output positions) * (FLOPs per output position)
= (Cout * H' * W') * (K * K)
= (64 * 27 * 27) * (3 * 3)
= 419,904 ≈ 0.4 MFLOP (rounded to 0 in the table)
Flatten output size = Cin x H x W = 256 * 6 * 6 = 9216. Flattening uses no parameters and no FLOPs.
How about the fully-connected layers? For fc6 (9216 -> 4096):
Number of MACs = 9216 * 4096 = 37,748,736; assuming 1 MAC = 1 FLOP, this is ≈ 38 MFLOPs, and the weight matrix alone contributes ≈ 37,749k parameters.
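The analogous helper for a fully-connected layer (the bias is omitted here so that the parameter count matches the table's 37,749k for fc6, which appears not to include biases; that reading is an assumption):

```python
def fc_layer_stats(d_in, d_out, bytes_per_elem=4, include_bias=False):
    """Output memory, parameters, and FLOPs (1 MAC = 1 FLOP) for a fully-connected layer."""
    memory_kb = d_out * bytes_per_elem / 1024
    params = d_in * d_out + (d_out if include_bias else 0)
    flops = d_in * d_out                           # one MAC per weight
    return memory_kb, params, flops

# fc6 of the table: 9216 -> 4096
print(fc_layer_stats(9216, 4096))
# (16.0, 37748736, 37748736)  -> 16 KB, ~37,749k params, ~38 MFLOPs
```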
Interesting trends emerge from the completed table!
• Most of the memory usage is in the early convolution layers.
• Nearly all parameters are in the fully-connected layers.
• Most floating-point operations occur in the convolution layers.

[Bar charts: per-layer memory (KB), parameters (K), and MFLOPs]
Computationally Efficient Convolution Operators

Depthwise Separable Convolutions

Standard 2D convolution: one filter creates an output with 1 channel; 128 filters create an output with 128 channels. In general, a standard 2D convolution maps one layer with depth Din to another layer with depth Dout by using Dout filters.
Depthwise Separable Convolutions

A standard 2D convolution is factored into two steps:
(i) Depthwise convolution: performs lightweight filtering by applying a single convolutional filter per input channel.
(ii) Pointwise convolution: a 1x1 convolution layer responsible for building new features by computing linear combinations of the input channels.
Depthwise Separable Convolutions: FLOPs

Number of FLOPs = (number of output elements) * (ops per output element)
• FLOPs of standard convolution = (Cout x H' x W') * (Cin x K x K)
• FLOPs of depthwise separable convolution = FLOPs of depthwise convolution + FLOPs of pointwise convolution
  = (Cin x H' x W') * (1 x K x K) + (Cout x H' x W') * (Cin x 1 x 1)
  = (Cin x H' x W') * (K x K + Cout)
• Depthwise separable convolution reduces FLOPs by almost a factor of K² compared to standard convolution (since Cout >> K x K).
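A small sketch comparing the two FLOP counts (the channel and spatial sizes below are illustrative, not from the slides):

```python
def standard_conv_flops(c_in, c_out, k, h_out, w_out):
    return c_out * h_out * w_out * c_in * k * k

def depthwise_separable_flops(c_in, c_out, k, h_out, w_out):
    depthwise = c_in * h_out * w_out * k * k   # one K x K filter per input channel
    pointwise = c_out * h_out * w_out * c_in   # 1x1 convolution mixing the channels
    return depthwise + pointwise

# Illustrative sizes: 3x3 kernel, 256 -> 256 channels, 56x56 output
std = standard_conv_flops(256, 256, 3, 56, 56)
sep = depthwise_separable_flops(256, 256, 3, 56, 56)
print(round(std / sep, 2))  # ~8.69x fewer FLOPs, close to K^2 = 9 since Cout >> K*K
```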
Grouped Convolutions

Convolution with groups = 1 (standard convolution):
Input: Cin x H x W; Weight: Cout x Cin x K x K; Output: Cout x H' x W'
FLOPs: Cout * Cin * K² * H' * W'
All convolutional kernels touch all Cin channels of the input.
Grouped Convolution

Convolution with groups = 2: two parallel convolution layers that each work on half the channels.
Input: Cin x H x W, split into Group 1: (Cin/2) x H x W and Group 2: (Cin/2) x H x W
Each group: Conv(K x K, Cin/2 -> Cout/2), producing an output of (Cout/2) x H' x W'
The two outputs are concatenated to give Cout x H' x W'.
Grouped Convolution

Convolution with groups = G: G parallel conv layers; each "sees" Cin/G input channels and produces Cout/G output channels.
Input: Cin x H x W, split into G x [(Cin/G) x H x W]
Weight: G x (Cout/G) x (Cin/G) x K x K (G parallel convolutions)
Output: G x [(Cout/G) x H' x W'], concatenated to Cout x H' x W'
FLOPs: Cout * Cin * K² * H' * W' / G

Convolution with groups = 1 is the normal convolution: Weight: Cout x Cin x K x K; Output: Cout x H' x W'; FLOPs: Cout * Cin * K² * H' * W'; all kernels touch all Cin channels of the input.
Grouped Convolution: Summary

• Standard convolution = grouped convolution with groups = 1. All kernels touch all Cin input channels. Weight: Cout x Cin x K x K; FLOPs: Cout * Cin * K² * H' * W'.
• Grouped convolution with groups = G: G parallel conv layers; each sees Cin/G input channels and produces Cout/G output channels. Weight: G x (Cout/G) x (Cin/G) x K x K; FLOPs: Cout * Cin * K² * H' * W' / G.
• Depthwise convolution is the special case G = Cin: FLOPs = (Cout x H' x W') * K².
• Depthwise separable convolution: FLOPs = (Cin x H' x W') * (Cout + K²).
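A small sketch of the grouped-convolution FLOP count, covering the standard (G = 1) and depthwise (G = Cin, one filter per channel) special cases (example sizes are illustrative):

```python
def grouped_conv_flops(c_in, c_out, k, h_out, w_out, groups=1):
    """FLOPs (1 MAC = 1 FLOP) of a grouped convolution.
    groups=1 is a standard convolution; groups=c_in is a depthwise convolution."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * c_in * k * k * h_out * w_out // groups

# Illustrative sizes: 256 -> 256 channels, 3x3 kernel, 56x56 output
for g in (1, 2, 256):
    print(g, grouped_conv_flops(256, 256, 3, 56, 56, groups=g))
# groups=256 matches the depthwise formula (Cout x H' x W') * K^2
```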
Additional Convolution Operators

Dilated Convolution

Deformable Convolution: Modeling Spatial Transformations
• Traditional approach to spatial transformations: build a training dataset with sufficient desired variations.
• Deformable convolution: learning to deform the sampling locations in the convolution.
[Figure: sampling locations of deformable convolution across the input, layer 1 and layer 2]
Summary
• Components of CNN
• Convolution
• Pooling
• Activation functions
• Fully connected layers
• Normalization: BN
• Cross-entropy loss
• Computation of FLOPs, parameters and memory requirements
• Variants of the convolution operation
