Week 8
Topic
Lecture 36: CNN Architecture
Concepts Covered:
CNN
CNN Architecture
Convolution Layer
Receptive Field
Nonlinearity
Pooling
Convolution
1D Convolution

Discrete: $y(n) = \sum_{p=0}^{\infty} x(p)\,h(n-p)$    Continuous: $y(t) = \int_{0}^{\infty} x(\tau)\,h(t-\tau)\,d\tau$
2D Convolution

$y(m,n) = \sum_{p=0}^{\infty}\sum_{q=0}^{\infty} x(p,q)\,h(m-p,\,n-q)$
Finite Convolution Kernel

A feature at a point is local in nature, so the convolution kernel can be finite.

1D, kernel of length $2A+1$:
$y(n) = \sum_{p=-A}^{A} w(p)\,x(n-p)$

2D, kernel of size $(2A+1)\times(2A+1)$:
$y(m,n) = \sum_{p=-A}^{A}\sum_{q=-A}^{A} w(p,q)\,x(m-p,\,n-q)$
Finite Convolution Kernel

[Figure: a 5-tap kernel W(2) W(1) W(0) W(-1) W(-2) slides over the zero-padded input sequence X(0), X(1), X(2), ..., producing one output value at a time: Y(0), Y(1), Y(2), ..., Y(n), Y(n+1), ...]
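To make the sliding operation concrete, here is a minimal NumPy sketch of the finite 1D convolution $y(n) = \sum_{p=-A}^{A} w(p)\,x(n-p)$ with zero padding at the borders; the function name and the example kernel are illustrative, not from the lecture.

```python
import numpy as np

def conv1d_finite(x, w):
    """1D convolution with a kernel of length 2A+1, zero-padded at the borders:
    y(n) = sum_{p=-A}^{A} w(p) x(n-p)."""
    A = len(w) // 2
    x_pad = np.pad(x, A)                  # zeros on both sides, as in the slides
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        for p in range(-A, A + 1):
            # index n - p in the original signal is (n + A) - p in the padded one
            y[n] += w[p + A] * x_pad[n + A - p]
    return y

x = np.array([3.0, 2.0, 5.0, 6.0, 8.0])
w = np.array([1.0, 0.0, -1.0])            # a simple 3-tap (A = 1) kernel
print(conv1d_finite(x, w))
# matches NumPy's own convolution for an odd-length kernel
print(np.convolve(x, w, mode='same'))
```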
2D Convolution

[Figure: a 3 x 3 kernel convolved with a 6 x 6 image; the kernel is flipped and the image is zero padded, and the kernel then slides over every position of the image.]
Stride

Stride is the number of steps the kernel is moved at each shift during convolution, e.g. stride = 1 vs. stride = 2 for a 7 x 7 input image and a 3 x 3 kernel.
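In general the output size is $\lfloor (N + 2P - K)/S \rfloor + 1$ for an $N \times N$ input, $K \times K$ kernel, padding $P$ and stride $S$; without padding, the 7 x 7 input and 3 x 3 kernel above give a 5 x 5 output at stride 1 and a 3 x 3 output at stride 2. A minimal NumPy sketch of 2D convolution with kernel flipping, zero padding and a stride parameter follows; the function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """2D convolution: flip the kernel (true convolution), zero pad the image,
    then slide with the given stride. Output size: (N + 2*pad - K)//stride + 1."""
    k = np.flipud(np.fliplr(kernel))       # flipping, as in the slides
    img = np.pad(image, pad)                # zero padding
    H, W = img.shape
    K = k.shape[0]                          # assume a square K x K kernel
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i * stride:i * stride + K, j * stride:j * stride + K]
            out[i, j] = np.sum(patch * k)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)   # the 7 x 7 input of the stride example
kernel = np.ones((3, 3)) / 9.0                     # an illustrative 3 x 3 averaging kernel
print(conv2d(image, kernel, stride=1).shape)       # (5, 5) without padding
print(conv2d(image, kernel, stride=2).shape)       # (3, 3)
```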
CNN Architecture

[Figure: Image → Convolution → Nonlinearity → Pooling → Convolution → Nonlinearity → Pooling → Fully Connected Layer → Class]
Convolution Layer: 3D Convolution
• A color image has 3 dimensions: height, width and depth (depth is the color channels, i.e. RGB).
• The filters or kernels that are convolved with the RGB image can also be 3D.
• For multiple kernels: all feature maps obtained from the distinct kernels are stacked to get the final output of that layer (a sketch follows the visualization below).
3D Convolution: Visualization
• The kernel strides over the input image.
• At each location $(m, n)$ compute $f(m,n) = \sum_{p}\sum_{q} w(p,q)\,I(m-p,\,n-q)$ and collect the values in the feature map.
• The animation shows the sliding operation at 4 locations, but in reality it is performed over the entire input.
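A minimal NumPy sketch of the multi-kernel case described above, assuming an H x W x C input and a bank of 3D kernels, each spanning all C input channels; the feature maps from the distinct kernels are stacked along the last axis. The kernel flip is omitted here, as deep-learning libraries typically implement cross-correlation; all names and sizes are illustrative.

```python
import numpy as np

def conv_layer(image, kernels):
    """image: (H, W, C); kernels: (num_kernels, K, K, C).
    Each 3D kernel spans all C channels; the resulting feature maps
    are stacked along the last axis (stride 1, no padding)."""
    H, W, C = image.shape
    num_k, K, _, _ = kernels.shape
    out_h, out_w = H - K + 1, W - K + 1
    feature_maps = np.zeros((out_h, out_w, num_k))
    for n in range(num_k):
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + K, j:j + K, :]      # K x K x C patch
                feature_maps[i, j, n] = np.sum(patch * kernels[n])
    return feature_maps

rgb = np.random.rand(6, 6, 3)            # a toy 6 x 6 RGB image
bank = np.random.rand(4, 3, 3, 3)        # 4 kernels, each 3 x 3 x 3
print(conv_layer(rgb, bank).shape)       # (4, 4, 4): 4 stacked feature maps
```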
Max pooling with a 2 x 2 window and stride = 2:

Input (4 x 4):
3 2 5 6
8 9 5 3
4 4 6 8
1 1 2 1

Output (2 x 2):
9 6
4 8
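A short sketch reproducing the max-pooling example above (2 x 2 window, stride 2):

```python
import numpy as np

def max_pool(x, window=2, stride=2):
    """Max pooling: take the maximum over each window x window block."""
    out_h = (x.shape[0] - window) // stride + 1
    out_w = (x.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + window,
                          j * stride:j * stride + window].max()
    return out

x = np.array([[3, 2, 5, 6],
              [8, 9, 5, 3],
              [4, 4, 6, 8],
              [1, 1, 2, 1]])
print(max_pool(x))   # [[9. 6.] [4. 8.]]
```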
Topic
Lecture 37: Popular CNN Models
Concepts Covered:
CNN
LeNet
AlexNet
VGG Net
GoogLeNet
etc.
CNN Architecture (recap)

[Figure: Image → Convolution → Nonlinearity → Pooling → Convolution → Nonlinearity → Pooling → Fully Connected Layer → Class]
MLP vs CNN

Sparse Connectivity: Every node in the convolution layer receives input from only a small number of nodes in the previous layer (its receptive field), requiring fewer parameters.

Parameter Sharing: Each member of the convolution kernel is used at every position of the input, dramatically reducing the number of parameters.

This makes a CNN much more efficient than an MLP, as the comparison below illustrates.
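A back-of-the-envelope comparison of the two points above; the 32 x 32 input, 28 x 28 output and 5 x 5 kernel are illustrative choices, not from the slides:

```python
# Fully connected (MLP): every output unit connects to every input pixel.
in_h, in_w = 32, 32                     # illustrative grayscale input
out_h, out_w = 28, 28                   # same output size as a 5 x 5 valid convolution
mlp_weights = (in_h * in_w) * (out_h * out_w)

# Convolution layer: one shared 5 x 5 kernel, reused at every position.
kernel = 5
cnn_weights = kernel * kernel

print(mlp_weights)   # 802816
print(cnn_weights)   # 25
```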
Some Popular CNN Models
LeNet-5

• Proposed by Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner for handwritten and machine-printed character recognition.
• Used by many banks for recognition of handwritten numbers on cheques.
• This architecture achieves an error rate as low as 0.95% on test data.
• Convolution: 6 kernels, size 5 x 5, stride 1
• Average pooling: window size 2 x 2, stride 2
• Convolution: 16 kernels, size 5 x 5, stride 1
• Average pooling: window size 2 x 2, stride 2
https://engmrk.com/lenet-5-a-classic-cnn-architecture/
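A hedged PyTorch sketch of a LeNet-5-style network using the layer parameters listed above; the 32 x 32 input, the tanh activations and the fully connected widths (120, 84, 10) are standard LeNet-5 choices added only to make the sketch runnable, not details taken from the slide.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """A LeNet-5-style network following the layer parameters on the slide."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1),    # 6 kernels, 5 x 5, stride 1
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),       # average pooling 2 x 2, stride 2
            nn.Conv2d(6, 16, kernel_size=5, stride=1),   # 16 kernels, 5 x 5, stride 1
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),       # average pooling 2 x 2, stride 2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```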
ILSVRC

• ImageNet Large Scale Visual Recognition Challenge.
• Evaluates algorithms for object detection and image classification on a large image database.
• Helps researchers review state-of-the-art machine learning techniques for object detection across a wide variety of objects.
• Monitors the progress of computer vision for large-scale image indexing for retrieval and annotation.
• The database contains a large number of images from 1000 categories.
• More than 1000 images in every category.
ILSVRC

• Every year of the challenge the forum also organizes a workshop at one of the premier computer vision conferences.
• The purpose of the workshop is to disseminate the new findings of the challenge.
• Contestants with the most successful and innovative techniques are invited to present their work.
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 38: Popular CNN Models II
Concepts Covered:
CNN
LeNet
ILSVRC
AlexNet
VGG Net
GoogLeNet
etc.
AlexNet

ILSVRC 2012 Winner

https://www.learnopencv.com/understanding-alexnet/
AlexNet

https://towardsdatascience.com/difference-between-local-response-normalization-and-batch-normalization-272308c034ac
Local Response Normalization (Intra-Channel)

$$b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{p=\max(0,\,x-n/2)}^{\min(W,\,x+n/2)} \;\sum_{q=\max(0,\,y-n/2)}^{\min(H,\,y+n/2)} \left(a^{i}_{p,q}\right)^{2}\right)^{\beta}}$$
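A minimal NumPy sketch of the intra-channel normalization above, where a is the activation tensor, W and H are the feature-map width and height, and k, alpha, beta, n are hyperparameters; the particular values below are illustrative:

```python
import numpy as np

def lrn_intra_channel(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Intra-channel LRN: each activation a[i, x, y] is divided by
    (k + alpha * sum of squared activations in an n x n spatial
    neighbourhood of the same channel) ** beta."""
    C, H, W = a.shape
    b = np.zeros_like(a)
    half = n // 2
    for i in range(C):
        for x in range(H):
            for y in range(W):
                x0, x1 = max(0, x - half), min(H, x + half + 1)
                y0, y1 = max(0, y - half), min(W, y + half + 1)
                s = np.sum(a[i, x0:x1, y0:y1] ** 2)
                b[i, x, y] = a[i, x, y] / (k + alpha * s) ** beta
    return b

a = np.random.rand(3, 8, 8)               # 3 channels, 8 x 8 feature maps
print(lrn_intra_channel(a).shape)          # (3, 8, 8)
```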
Reducing Overfitting

• Training the network with different variants of the same image helps avoid overfitting.
• Generate additional data from existing data (augmentation).
• Data augmentation by mirroring.
• Data augmentation by random crops (both sketched below).
• Dropout regularization.

Krizhevsky, Alex, Ilya Sutskever and Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems, 2012.
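A hedged torchvision sketch of the augmentation ideas above (mirroring and random crops); the crop size, padding and the stand-in dataset are illustrative, not taken from the AlexNet paper:

```python
import torchvision.transforms as T
from torchvision.datasets import FakeData   # stand-in dataset, only to make the sketch runnable

# Mirroring and random crops generate new variants of each training image.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),          # data augmentation by mirroring
    T.RandomCrop(224, padding=4),           # data augmentation by random crops
    T.ToTensor(),
])

train_set = FakeData(size=8, image_size=(3, 256, 256), transform=augment)
image, label = train_set[0]
print(image.shape)                           # torch.Size([3, 224, 224])
```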
Dropout

• Regularization technique proposed by Srivastava et al. in 2014.
• During training, randomly selected neurons are temporarily dropped from the network (with probability 0.5).
• Their activations are not passed to the downstream neurons in the forward pass (see the sketch below).
• In the backward pass, weight updates are not applied to these neurons.

Srivastava, Nitish, et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15 (2014), 1929-1958.
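A minimal NumPy sketch of the training-time behaviour described above: each neuron is dropped with probability 0.5 and its activation is not passed downstream. The rescaling by 1/(1 - p) is the common "inverted dropout" convention, assumed here so that no scaling is needed at test time:

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, training=True):
    """During training, randomly drop each neuron with probability p_drop.
    At test time the whole network is used and nothing is dropped."""
    if not training:
        return activations
    keep = (np.random.rand(*activations.shape) >= p_drop)   # mask of surviving neurons
    # Inverted dropout: rescale so the expected activation is unchanged.
    return activations * keep / (1.0 - p_drop)

h = np.random.rand(4, 6)                    # activations of one hidden layer
print(dropout_forward(h, training=True))
print(dropout_forward(h, training=False))   # identical to h at test time
```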
How does it help?

• During training, the weights of neurons are tuned for specific features, which provides some sort of specialization.
• Neighbouring neurons start relying on these specializations (co-adaptation).
• This leads to a neural network model that is too specialized to the training data.
• As neurons are randomly dropped, other neurons have to step in to compensate.
• Thus the network learns multiple independent representations.
Learned Features
How does it help?

• This makes the network less sensitive to specific weights.
• Enhances the generalization capability of the network.
• Less vulnerable to overfitting.
• The whole network is used during testing: there is no dropout.
• Dropout increases the number of iterations needed for the network to converge, but helps avoid overfitting.
Topic
Lecture 39: Popular CNN Models III
Concepts Covered:
CNN
AlexNet
VGG Net
Transfer Learning
GoogLeNet
ResNet
etc.
VGG-16

• ILSVRC 2014 1st Runner-Up.
• Developed by the Visual Geometry Group, Oxford University.
VGG-16
Transfer Learning
Source: Kevin McGuinness, https://www.slideshare.net/xavigiro/transfer-learning-d2l4-insightdcu-machine-learning-workshop-2017
Transfer Learning

CNN as a Fixed Feature Extractor:
• Take a pre-trained CNN architecture trained on a large dataset (like ImageNet).
• Remove the last fully connected layer of this pre-trained network.
• The remaining CNN acts as a fixed feature extractor for the new dataset, as in the sketch below.
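A hedged PyTorch sketch of the fixed-feature-extractor recipe above; the ResNet-18 backbone from torchvision is only an example, since the slide does not prescribe a particular architecture:

```python
import torch
import torch.nn as nn
from torchvision import models

# 1. Take a CNN pre-trained on a large dataset (ImageNet).
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# 2. Remove the last fully connected layer.
backbone.fc = nn.Identity()

# 3. Use the remaining CNN as a fixed feature extractor for the new dataset.
backbone.eval()
for param in backbone.parameters():
    param.requires_grad = False

with torch.no_grad():
    features = backbone(torch.randn(1, 3, 224, 224))
print(features.shape)   # torch.Size([1, 512]) for ResNet-18
```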
Transfer Learning

[Figures: visualizations of what a CNN "sees" at successive layers. Image source: https://becominghuman.ai/what-exactly-does-cnn-see-4d436d8e6e52]
Transfer Learning

• Lower layers generate more general features: this knowledge transfers very well to other tasks.
• Higher layers are more task-specific.
• Fine-tuning improves generalization when sufficient examples are available.
• Transfer learning and fine-tuning often lead to better performance than training from scratch on the target dataset.
• Even features transferred from distant tasks often perform better than random initial weights.
Fine-tuning

• The weights of the pre-trained CNN are fine-tuned for the new dataset by continuing the backpropagation.
• Fine-tuning can be done for all layers.
• Due to overfitting concerns, the earlier layers of the net may be kept fixed and fine-tuning done only on the higher layers (a sketch of this follows below).
• Earlier layers can be kept fixed because lower layers extract features that are more generic.
• Higher layers, on the other hand, are task-specific.
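A minimal PyTorch sketch of this fine-tuning strategy: earlier layers are frozen and only the higher layers and a new classifier head continue to be trained by backpropagation. The ResNet-18 backbone, the choice to unfreeze only layer4, and the 10 target classes are illustrative assumptions:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the earlier layers: they extract generic features.
for name, param in model.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

# Replace the classifier head for the new task (e.g. 10 target classes).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the unfrozen (task-specific) parameters are updated during fine-tuning.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=1e-3, momentum=0.9)
```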
Topic
Lecture 40: Popular CNN Models IV
Concepts Covered:
CNN
AlexNet
VGG Net
Transfer Learning
Challenges in Deep Learning
GoogLeNet
ResNet
etc.
Deep Learning Challenges
Challenges

• Deep learning is data hungry.
• Overfitting or lack of generalization.
• Vanishing/exploding gradient problem.
• Choosing an appropriate learning rate.
• Covariate shift.
• Effective training.
Vanishing Gradient
Vanishing Gradient Problem

https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
Vanishing Gradient Problem

[Figure: a chain of layers X → f1 → f2 → f3 → f4 → O, with weights W1, W2, W3, W4 and pre-activations θ1, θ2, θ3, θ4.]
Vanishing Gradient Problem

$O = f_4(\theta_4),\quad \theta_4 = W_4 f_3(\theta_3),\quad \theta_3 = W_3 f_2(\theta_2),\quad \theta_2 = W_2 f_1(\theta_1),\quad \theta_1 = W_1 X$

$\dfrac{\partial O}{\partial W_1} = \dfrac{\partial O}{\partial \theta_4}\cdot\dfrac{\partial \theta_4}{\partial f_3}\cdot\dfrac{\partial f_3}{\partial \theta_3}\cdot\dfrac{\partial \theta_3}{\partial f_2}\cdot\dfrac{\partial f_2}{\partial \theta_2}\cdot\dfrac{\partial \theta_2}{\partial f_1}\cdot\dfrac{\partial f_1}{\partial \theta_1}\cdot\dfrac{\partial \theta_1}{\partial W_1} = X\, f_1'\, W_2\, f_2'\, W_3\, f_3'\, W_4\, \dfrac{\partial O}{\partial \theta_4}$

$\dfrac{\partial O}{\partial W_2} = \dfrac{\partial O}{\partial \theta_4}\cdot\dfrac{\partial \theta_4}{\partial f_3}\cdot\dfrac{\partial f_3}{\partial \theta_3}\cdot\dfrac{\partial \theta_3}{\partial f_2}\cdot\dfrac{\partial f_2}{\partial \theta_2}\cdot\dfrac{\partial \theta_2}{\partial W_2} = f_1\, f_2'\, W_3\, f_3'\, W_4\, \dfrac{\partial O}{\partial \theta_4}$

With sigmoid activations each derivative $f_i'$ is at most 0.25, so the product of these factors shrinks rapidly as the number of layers grows, and the gradients at the early layers vanish.
Vanishing Gradient Problem

Remedies:
• Choice of activation function: ReLU instead of sigmoid (see the experiment below).
• Appropriate initialization of weights.
• An intelligent backpropagation learning algorithm.
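A small PyTorch experiment illustrating the product of factors derived above: with sigmoid activations each f' is at most 0.25, so the gradient at the first layer shrinks rapidly with depth, while ReLU keeps it from vanishing. The depth, width and initialization are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

def first_layer_grad(activation, depth=20, width=32):
    """Build a deep chain of linear layers + activations and return the
    gradient norm at the first layer's weights."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation()]
    layers.append(nn.Linear(width, 1))
    net = nn.Sequential(*layers)
    out = net(torch.randn(8, width)).sum()
    out.backward()
    return net[0].weight.grad.norm().item()

torch.manual_seed(0)
print("sigmoid:", first_layer_grad(nn.Sigmoid))   # typically tiny: the gradient has vanished
print("relu   :", first_layer_grad(nn.ReLU))      # typically orders of magnitude larger
```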