Unit 5 CNN
Convolutional Neural Network (CNN)
Contents
● Building blocks of CNNs,
● Architectures, convolution / pooling layers, Padding, Strided convolutions,
● Convolutions over volumes, SoftMax regression,
● Deep Learning frameworks, Training and testing on different distributions,
● Bias and Variance with mismatched data distributions,
● Transfer learning, multi-task learning, end-to-end deep learning,
● Introduction to CNN models: LeNet – 5, AlexNet, VGG – 16, Residual Networks
Course Outcome
https://colab.research.google.com/drive/1txIfzUJ_ehc_waLV67r-yEhTLIccbXIC#scrollTo=mL_dQ3K0M-ZL
Stride Demo
https://colab.research.google.com/drive/1txIfzUJ_ehc_waLV67r-yEhTLIccbXIC#scrollTo=mL_dQ3K0M-ZL
Max Pooling Demo
https://deeplizard.com/resource/pavq7noze3
Keras Demo
https://colab.research.google.com/drive/1F4F6Q9O-hPvCDeOWcqMUa5BuBOvuOBWc?usp=sharing
Disadvantage
Introduction to CNN models
➔ LeNet – 5
➔ AlexNet
➔ VGG – 16
➔ Residual Networks
LeNet – 5
AlexNet
VGGNET
Convolutional Neural Network
● A convolutional neural network, or CNN, is a network architecture for deep learning.
● It learns directly from images. A CNN is made up of several layers that process and transform an input to
produce an output.
● You can train a CNN to do image analysis tasks, including scene classification, object detection, and segmentation.
Shared weights and biases
However, in the case of CNNs, the weights and bias values are the same for all hidden neurons in a given layer.
This means that all hidden neurons are detecting the same feature, such as an edge or a blob, in different regions of the
image. This makes the network tolerant to translation of objects in an image. For example, a network trained to recognize
cats will be able to do so wherever the cat appears in the image.
Convolutional Neural Network
● Our third and final concept is activation and pooling. The activation step applies a transformation to the
output of each neuron by using activation functions. The rectified linear unit, or ReLU, is a commonly used
activation function: it takes the output of a neuron, keeps positive values unchanged, and maps negative values to zero.
● Pooling reduces the dimensionality of the feature map by condensing the output of small regions of
neurons into a single output. This helps simplify the following layers and reduces the number of parameters that the
model needs to learn (a small sketch of both steps follows below).
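A minimal NumPy sketch of these two ideas; the 4×4 input values are made up for illustration:

import numpy as np

feature_map = np.array([[ 1, -2,  3,  0],
                        [-1,  5, -6,  2],
                        [ 4, -3,  2, -1],
                        [ 0,  1, -2,  6]], dtype=float)

relu = np.maximum(feature_map, 0)          # ReLU: negatives -> 0, positives unchanged

# 2x2 max pooling with stride 2: condense each 2x2 region into its maximum value.
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[5. 3.]
                #  [4. 6.]]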
Convolutional Neural Network
A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that
specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary
representation of visual data. It contains a series of pixels arranged in a grid-like fashion, with pixel values
that denote how bright and what color each pixel should be.
Ref : https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939
Convolutional Neural Network Architecture
A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.
Ref : https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939
Convolution Layer
The convolution layer is the core building block of the CNN. It carries the main portion of the network’s computational
load.
This layer performs a dot product between two matrices, where one matrix is the set of learnable parameters otherwise
known as a kernel, and the other matrix is the restricted portion of the receptive field. The kernel is spatially smaller
than an image but is more in-depth. This means that, if the image is composed of three (RGB) channels, the kernel
height and width will be spatially small, but the depth extends up to all three channels.
Ref : https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939
Convolution Layer
During the forward pass, the kernel slides across the height and width of the image, producing a representation of each
receptive region. The result is a two-dimensional representation of the image known as an activation map, which
gives the response of the kernel at each spatial position of the image. The sliding step size of the kernel is called the
stride (illustrated in the sketch below).
Ref : https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939
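As a rough illustration of the sliding kernel and stride (not the article's code; the image and kernel values are arbitrary), a plain NumPy cross-correlation:

import numpy as np

def conv2d(image, kernel, stride=1):
    H, W = image.shape
    F = kernel.shape[0]
    out = (H - F) // stride + 1                 # output size without padding
    activation_map = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # dot product between the kernel and the current receptive region
            region = image[i*stride:i*stride+F, j*stride:j*stride+F]
            activation_map[i, j] = np.sum(region * kernel)
    return activation_map

image  = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1, 0, -1]] * 3, dtype=float)   # simple vertical-edge filter
print(conv2d(image, kernel, stride=1).shape)       # (3, 3)
print(conv2d(image, kernel, stride=2).shape)       # (2, 2) -> larger stride, smaller map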
Padding
When we apply convolutions, the output dimensions are not the same as the input, and we lose information at the
borders. So we append a border of zeros and recalculate the convolution so that all the input values are covered.
Ref : https://www.analyticsvidhya.com/blog/2020/10/what-is-the-convolutional-neural-network-architecture/
CNN
Padding - Padding has the following benefits:
1. It allows us to use a CONV layer without necessarily shrinking the height and width of the volumes. This
is important for building deeper networks, since otherwise the height/width would shrink as we go to
deeper layers. If we have an input of size W x W x D, a kernel of spatial size F, stride S, and padding P,
then the output spatial size is given by ((W − F + 2P) / S) + 1.
CNN
Some padding terminologies: valid padding means no padding is added (the output shrinks), while same padding adds enough zeros so that the output has the same height and width as the input.
Ref : https://www.analyticsvidhya.com/blog/2020/10/what-is-the-convolutional-neural-network-architecture/
● Convolution Operation with Multiple Filters
● Multiple filters can be used in a convolution layer to detect multiple features. The output of the layer will then have the
same number of channels as the number of filters in the layer.
● The total number of multiplications to calculate the result is (4 x 4 x 2) x (3 x 3 x 3) = 864, as checked in the sketch below.
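A hedged tf.keras check of this behaviour, using an assumed 6×6×3 input so the shapes match the 4×4×2 output and 3×3×3 kernels above:

import tensorflow as tf

x = tf.random.normal((1, 6, 6, 3))                     # batch of one 6x6, 3-channel input
conv = tf.keras.layers.Conv2D(filters=2, kernel_size=3, strides=1, padding="valid")
y = conv(x)
print(y.shape)                                         # (1, 4, 4, 2) -> 2 output channels

# Multiplication count from the slide: each of the 4*4*2 outputs needs a 3*3*3 dot product.
print((4 * 4 * 2) * (3 * 3 * 3))                       # 864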
CNN
● 1 x 1 Convolution
● This is a convolution with a 1 x 1 filter. The effect is to flatten or “merge” channels together, which can save computations later
in the network (sketched below):
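A small tf.keras sketch of the channel-merging effect; the 28×28×64 input and the 16 output channels are arbitrary choices:

import tensorflow as tf

x = tf.random.normal((1, 28, 28, 64))
y = tf.keras.layers.Conv2D(filters=16, kernel_size=1)(x)   # 1x1 convolution
print(y.shape)   # (1, 28, 28, 16) -> spatial size preserved, channels merged 64 -> 16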
CNN
Convolution parameters
Convolution Layer
If we have an input of size W x W x D and Dout kernels with a spatial size of F, stride S, and
amount of padding P, then the size of the output volume can be determined by the following formula:
Wout = ((W − F + 2P) / S) + 1, and the output volume has size Wout x Wout x Dout (see the helper below).
Ref : https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939
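A small helper, assuming the standard formula Wout = ((W − F + 2P) / S) + 1 stated above; the example numbers are illustrative:

def conv_output_shape(W, D, F, S, P, D_out):
    # Output spatial size for a W x W x D input with D_out kernels of size F, stride S, padding P.
    out = (W - F + 2 * P) // S + 1
    return (out, out, D_out)

# Example: 32x32x3 input, 16 kernels of size 5, stride 1, padding 2 -> 32x32x16
print(conv_output_shape(W=32, D=3, F=5, S=1, P=2, D_out=16))   # (32, 32, 16)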
CNN
Convolution Layer - Convolutions occur in the convolution layer, which is the building block of a CNN.
CNN
● This layer identifies and extracts the best features/patterns from the input image and preserves the generic information
in a matrix. The matrix representation of the input image is multiplied element-wise with filters and summed up to
produce a feature map, which is the same as a dot product between vectors.
● Convolution involves the following important features:
– Local connectivity
– Each neuron is connected only to a subset of the input image (unlike a fully connected neural network where all
neurons are connected to every input). In a CNN, a filter of a certain dimension is chosen, which slides over these
subsets of the input data. Multiple filters are present in a CNN, where each filter moves over the entire image and
learns different portions of the input image.
– Parameter Sharing
– The sharing of weights by all neurons in a particular feature map. All of them share the same weights, hence the
name parameter sharing (see the parameter-count sketch below).
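To make parameter sharing concrete, a hedged tf.keras comparison (layer sizes are arbitrary) of a convolution layer, which re-uses one small set of weights across the whole image, against a fully connected layer on the same input:

import tensorflow as tf

conv = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, kernel_size=3),   # 8 filters, each 3x3x3 weights + 1 bias, shared everywhere
])
dense = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(8),                   # every pixel connected to every unit
])
print(conv.count_params())    # 8 * (3*3*3 + 1) = 224
print(dense.count_params())   # 8 * (32*32*3 + 1) = 24584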
CNN
● Batch Normalization
● Batch normalization is generally done between the convolution and activation (ReLU) layers. It
normalizes the inputs at each layer, reduces internal covariate shift (change in the distribution of
network activations), and is a method to regularize a convolutional network (see the sketch after this list).
● Batch normalization allows higher learning rates, which can reduce training time, and gives better
performance. It allows each layer to learn by itself without being overly dependent on other layers.
● Padding and Stride
● Padding and stride influence how the convolution operation is performed. They can be used to alter
the dimensions (height and width) of the input/output vectors, either increasing or decreasing them.
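A minimal tf.keras sketch combining these points: a convolution with explicit padding and stride, followed by batch normalization placed between the convolution and the ReLU activation (filter count and input size are assumptions):

import tensorflow as tf

block = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    # "same" padding keeps height/width; stride 2 then halves them to 32x32
    tf.keras.layers.Conv2D(32, kernel_size=3, strides=2, padding="same", use_bias=False),
    tf.keras.layers.BatchNormalization(),   # normalization between convolution and ReLU
    tf.keras.layers.Activation("relu"),
])
print(block.output_shape)   # (None, 32, 32, 32)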
ReLu Layer
ReLU Layer (Rectified Linear Unit)
ReLU is computed after convolution. It is the most commonly deployed activation function and allows the neural network to
account for non-linear relationships. For a given input x, ReLU sets all negative values to zero and leaves all other values
unchanged. It is mathematically represented as: f(x) = max(0, x).
Pooling Layer
The pooling layer replaces the output of the network at certain locations by deriving a
summary statistic of the nearby outputs. This helps in reducing the spatial size of the
representation, which decreases the required amount of computation and weights. The pooling
operation is processed on every slice of the representation individually.
Pooling functions
● Average of the rectangular neighborhood,
● Max of the rectangular neighborhood,
● and a weighted average based on the distance from the central pixel.
Ref : https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939
Pooling Layer
If we have an activation map of size W x W x D, a pooling kernel of spatial size F, and stride S,
then the size of the output volume can be determined by the following formula:
Wout = ((W − F) / S) + 1, giving an output of size Wout x Wout x D.
Ref : https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939
CNN
Why is pooling important?
● It progressively reduces the spatial size of the representation to reduce the amount of
parameters and computation in the network, and it also controls overfitting. Without pooling,
the output would have the same resolution as the input.
● There can be any number of convolution, ReLU, and pooling layers. The initial convolution
layers learn generic information and the last layers learn more specific/complex
features. After the final convolution, ReLU, and pooling layers, the output feature
map (matrix) is converted into a vector (one-dimensional array). This is called the
flatten layer (sketched below).
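A hedged tf.keras sketch of this conv → ReLU → pool → flatten → fully-connected flow; the layer sizes and the 10-class output are illustrative:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),                       # feature map -> one-dimensional vector
    tf.keras.layers.Dense(10, activation="softmax"), # classifier on the flattened features
])
model.summary()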
Fully Connected Layer
Neurons in this layer have full connectivity with all neurons in the preceding and succeeding
layers, as in a regular fully connected neural network. This is why it can be computed as usual by a matrix
multiplication followed by a bias offset.
Ref : https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939
Soft-Max Layer
Soft-max is an activation layer normally applied to the last layer of the network, which acts as a classifier. Classification of the given input
into distinct classes takes place at this layer. The softmax function is used to map the non-normalized output of a network to a
probability distribution.
● The output from the last fully connected layer is directed to the softmax layer, which converts it into probabilities.
● Softmax assigns decimal probabilities to each class in a multi-class problem, and these probabilities sum to 1.0.
● This allows the output to be interpreted directly as a probability (see the sketch below).
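A minimal NumPy softmax, showing raw scores mapped to probabilities that sum to 1.0 (the scores are made up):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))            # roughly [0.659 0.242 0.099]
print(softmax(scores).sum())      # 1.0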
CNN Architectures
You’ve learned the following layers:
● Convolution Layer
● Pooling Layer
● Normalization Layer
● Fully Connected Layer
CNN architectures:
1. LeNet-5
2. AlexNet
3. VGG-16
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
CNN Architectures - LeNet-5
Excluding pooling, LeNet-5 consists of 5 layers: 2 convolution layers with 5×5 kernels, followed by 3 fully-connected layers.
Each convolution layer is followed by a 2×2 average-pooling, and every layer has a tanh activation function except
the last (which has softmax).
LeNet-5 has about 60,000 parameters. The network is trained on greyscale 32×32 digit images and tries to recognize them as
one of the ten digits (0 to 9). A Keras sketch follows below.
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
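A hedged tf.keras sketch of a LeNet-5-style network following the description above (the 6/16/120/84/10 layer widths are the commonly cited layout and are an assumption of this sketch):

import tensorflow as tf

lenet5 = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),                     # 32x32 greyscale digits
    tf.keras.layers.Conv2D(6, 5, activation="tanh"),       # 5x5 convolution, tanh
    tf.keras.layers.AveragePooling2D(2),                   # 2x2 average pooling
    tf.keras.layers.Conv2D(16, 5, activation="tanh"),
    tf.keras.layers.AveragePooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="tanh"),
    tf.keras.layers.Dense(84, activation="tanh"),
    tf.keras.layers.Dense(10, activation="softmax"),       # last layer: softmax over 10 digits
])
lenet5.summary()    # roughly 60k trainable parameters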
CNN Architectures - AlexNet
AlexNet introduces the ReLU activation function and LRN into the mix. ReLU became so popular that almost all CNN
architectures developed after AlexNet use ReLU in their hidden layers, abandoning the tanh activation function
used in LeNet-5.
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
CNN Architectures - AlexNet
● The last layer uses the softmax activation function, and all others use ReLU. LRN is applied on the first and
second convolution layers after applying ReLU. The first, second, and fifth convolution layers are followed
by a 3×3 max-pooling.
● With the advancement of modern hardware, AlexNet could be trained with a whopping 60 million
parameters and became the winner of the ImageNet competition in 2012. ImageNet has become a
benchmark dataset for developing CNN architectures, and a subset of it (ILSVRC) consists of various images
with 1000 classes. The default AlexNet accepts colored images with dimensions 224×224.
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
CNN Architectures - VGG16
● Researchers investigated the effect of CNN depth on its accuracy in the large-scale image recognition setting. By pushing the
depth to 11–19 layers, the VGG family was born: VGG-11, VGG-13, VGG-16, and VGG-19. A version of VGG-11 with LRN was also
investigated, but LRN did not improve the performance. Hence, all other VGGs are implemented without LRN.
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
CNN Architectures - VGG16
● VGG-16 is one of the biggest networks that has 138 million parameters. Just like AlexNet, the last layer is
equipped with a softmax activation function and all others are equipped with ReLU.
● The 2nd, 4th, 7th, 10th, and 13th convolution layers are followed by a 2×2 max-pooling. Default VGG-16 accepts
colored images with dimensions 224×224 and outputs one of the 1000 classes.
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
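For reference, the stock VGG-16 shipped with tf.keras.applications can be loaded to confirm the 224×224×3 input, the 1000-class output, and the roughly 138 million parameters (weights=None avoids downloading pretrained weights):

import tensorflow as tf

vgg16 = tf.keras.applications.VGG16(weights=None)   # untrained copy of the architecture
print(vgg16.input_shape)     # (None, 224, 224, 3)
print(vgg16.output_shape)    # (None, 1000)
print(vgg16.count_params())  # ~138 million parameters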
CNN Architectures - Inception-v1
Going deeper has a caveat: exploding/vanishing gradients:
1. The exploding gradient is a problem when large error gradients accumulate and result in unstable weight updates
during training.
2. The vanishing gradient is a problem when the partial derivative of the loss function approaches a value close to
zero and the network couldn’t train.
Inception-v1 tackles this issue by adding two auxiliary classifiers connected to intermediate layers, with the hope of increasing
the gradient signal that gets propagated back. During training, their loss is added to the total loss of the network with a
weight of 0.3.
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
CNN Architectures - Inception-v1
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
CNN Architectures - Inception-v1
Inception-v1 introduces the inception module: four series of one or two convolution and max-pool layers stacked in parallel
and concatenated at the end. The inception module aims to approximate an optimal local sparse structure in a CNN by
allowing the use of multiple kernel sizes, instead of being restricted to a single kernel size (a sketch follows below).
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
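A hedged sketch of one inception-style module with the tf.keras functional API; the branch filter counts and the 28×28×192 input are illustrative, not taken from the slides:

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(28, 28, 192))

b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(inputs)            # 1x1 branch

b2 = layers.Conv2D(96, 1, padding="same", activation="relu")(inputs)            # 1x1 then 3x3
b2 = layers.Conv2D(128, 3, padding="same", activation="relu")(b2)

b3 = layers.Conv2D(16, 1, padding="same", activation="relu")(inputs)            # 1x1 then 5x5
b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(b3)

b4 = layers.MaxPooling2D(3, strides=1, padding="same")(inputs)                  # 3x3 max-pool then 1x1
b4 = layers.Conv2D(32, 1, padding="same", activation="relu")(b4)

outputs = layers.Concatenate()([b1, b2, b3, b4])   # concatenate along the channel axis
module = tf.keras.Model(inputs, outputs)
print(module.output_shape)                         # (None, 28, 28, 256) = 64+128+32+32 channels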
CNN Architectures - Inception-v1
Inception-v1 has fewer parameters than AlexNet and VGG-16, a mere 7 million, even though it consists of 22
layers:
● 3 convolution layers with 7×7, 1×1, and 3×3 kernel sizes, followed by
● 18 layers that consist of 9 inception modules, where each has 2 layers of convolution/max-pooling,
followed by
● 1 fully-connected layer.
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
CNN Architectures - ResNet50
● When deeper networks start converging, a degradation problem is exposed:
as the network depth increases, accuracy gets saturated and then degrades rapidly.
● Such degradation is not caused by overfitting (which would show a lower training error and higher
testing error), since adding more layers to a suitably deep network leads to a higher training error.
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
CNN Architectures - ResNet50
The degradation problem is addressed by introducing bottleneck residual blocks. There are 2 types:
1. Identity block: consists of 3 convolution layers with 1×1, 3×3, and 1×1 kernel sizes, all of
which are equipped with BN. The ReLU activation function is applied to the first two
layers, while the input of the identity block is added to the last layer before applying ReLU.
2. Convolution block: same as the identity block, but the input of the convolution block is first
passed through a convolution layer with a 1×1 kernel size and BN before being added to the
last layer. (A Keras sketch of the identity block follows below.)
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
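A hedged tf.keras sketch of the identity block described above; the filter counts and the 56×56×256 input shape are assumptions:

import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters):
    f1, f2, f3 = filters
    shortcut = x                                   # keep the block input for the skip connection
    x = layers.Conv2D(f1, 1)(x)                    # 1x1 convolution
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(f2, 3, padding="same")(x)    # 3x3 convolution
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(f3, 1)(x)                    # 1x1 convolution
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, shortcut])                # add the input before the final ReLU
    return layers.Activation("relu")(x)

inputs = tf.keras.Input(shape=(56, 56, 256))
outputs = identity_block(inputs, (64, 64, 256))
print(tf.keras.Model(inputs, outputs).output_shape)   # (None, 56, 56, 256)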
CNN Architectures - ResNet50
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
CNN Architectures - ResNet50
Notice that both residual blocks have 3 layers. In total, ResNet-50 has 26 million parameters and
50 layers: 1 convolution layer with a 7×7 kernel, followed by 48 convolution layers inside 16 residual blocks,
followed by 1 fully-connected layer.
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
Summary of all Architectures
Ref: https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
Batch Normalization
What is “Normalization”?
● Normalization is a data pre-processing tool used to bring the numerical data to a common scale
without distorting its shape.
● Generally, when we input the data to a machine or deep learning algorithm we tend to change the
values to a balanced scale. The reason we normalize is partly to ensure that our model can generalize
appropriately.
Ref: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
Batch Normalization
● Batch normalization is a technique that adds an extra normalization layer in a deep neural network. The new layer
performs the standardizing and normalizing operations on the input of a layer coming from a previous layer.
● But what is the reason behind the term “Batch” in batch normalization? A typical neural network is trained using a
collected set of input data called a batch. Similarly, the normalizing process in batch normalization takes place in
batches, not on a single input.
Ref: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
Batch Normalization
● Let’s understand this through an example, we have a deep neural network as shown in the following image.
Ref: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
Batch Normalization
● Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the pre-processing stage. When the
input passes through the first layer, it is transformed, as a sigmoid function is applied over the dot product of the input X
and the weight matrix W.
Ref: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
Batch Normalization
● Similarly, this transformation will take place for the second layer and go till the last layer L as shown in the
following image.
Ref: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
Batch Normalization
● Although our input X was normalized, with time the output will no longer be on the
same scale. As the data passes through multiple layers of the neural network and L
activation functions are applied, this leads to an internal covariate shift in the data
(a small numeric sketch follows).
Ref: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
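A minimal NumPy sketch of what a batch normalization layer computes for one neuron over a batch (the batch values, gamma, and beta are made up):

import numpy as np

batch = np.array([2.0, 4.0, 6.0, 8.0])       # activations of one neuron over a batch
gamma, beta, eps = 1.0, 0.0, 1e-5            # learnable scale/shift (initial values) and a small epsilon

mean = batch.mean()
var = batch.var()
normalized = (batch - mean) / np.sqrt(var + eps)   # zero mean, unit variance within the batch
out = gamma * normalized + beta
print(out.round(3))                          # roughly [-1.342 -0.447  0.447  1.342]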
Local Response Normalization
● Local Response Normalization (LRN) was first introduced in AlexNet architecture where the
activation function used was ReLU as opposed to the more common tanh and sigmoid at that
time.
● Apart from the reason mentioned above, the reason for using LRN was to encourage lateral
inhibition.
● It is a concept in Neurobiology that refers to the capacity of a neuron to reduce the activity of
its neighbors.
● In DNNs, the purpose of this lateral inhibition is to carry out local contrast enhancement so
that locally maximum pixel values are used as excitation for the next layers.
Ref: https://towardsdatascience.com/difference-between-local-response-normalization-and-batch-normalization-272308c034ac
Local Response Normalization
LRN is a non-trainable layer that square-normalizes the pixel values in a feature map within a local
neighborhood. There are two types of LRN based on the neighborhood defined and can be seen in the
figure below.
Ref: http://surl.li/fduoi
Local Response Normalization
Inter-Channel LRN: This is originally what the AlexNet paper used. The neighborhood defined is across
the channels. For each (x, y) position, the normalization is carried out in the depth dimension and is given
by the following formula:
b(i, x, y) = a(i, x, y) / ( k + α · Σ over j from max(0, i−n/2) to min(N−1, i+n/2) of a(j, x, y)² )^β
Ref: http://surl.li/fduoi
Local Response Normalization
Inter-Channel LRN: where i indicates the output of filter i, a(x, y) and b(x, y) are the pixel values at the (x, y)
position before and after normalization respectively, and N is the total number of channels. The
constants (k, α, β, n) are hyper-parameters: k is used to avoid any singularities (division by zero), α is
used as a normalization constant, and β is a contrasting constant. The constant n defines
the neighborhood length, i.e. how many consecutive pixel values need to be considered while
carrying out the normalization. The case of (k, α, β, n) = (0, 1, 1, N) is standard normalization.
Ref: http://surl.li/fduoi
Local Response Normalization
Let’s have a look at an
example of Inter-channel
LRN.
Ref: http://surl.li/fduoi
Local Response Normalization
Different colors denote different channels, and hence N=4. Let's take the hyper-parameters to be
(k, α, β, n) = (0, 1, 1, 2). The value n=2 means that while calculating the normalized value at
position (i, x, y), we consider the values at the same position for the previous and next filter, i.e.
(i−1, x, y) and (i+1, x, y). For (i, x, y) = (0, 0, 0) we have value(i, x, y) = 1, value(i−1, x, y) doesn't exist, and
value(i+1, x, y) = 1. Hence normalized_value(i, x, y) = 1/(1² + 1²) = 0.5, as can be seen in the lower
part of the figure above. The rest of the normalized values are calculated in a similar way (see the sketch below).
Ref: http://surl.li/fduoi
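A hedged NumPy sketch of inter-channel LRN with (k, α, β, n) = (0, 1, 1, 2); the first channel value reproduces the 0.5 from the example above, while the remaining channel values are made up:

import numpy as np

def inter_channel_lrn(a, k=0.0, alpha=1.0, beta=1.0, n=2):
    # a holds the pixel values at one (x, y) position across all N channels
    N = len(a)
    b = np.zeros_like(a, dtype=float)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)   # neighboring channels
        denom = k + alpha * np.sum(a[lo:hi + 1] ** 2)
        b[i] = a[i] / denom ** beta
    return b

a = np.array([1.0, 1.0, 2.0, 4.0])     # values at one position across 4 channels
print(inter_channel_lrn(a))            # first entry: 1 / (1^2 + 1^2) = 0.5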
Local Response Normalization
Intra-Channel LRN: In Intra-channel LRN, the neighborhood is extended within the same
channel only as can be seen in the figure above. The formula is given by
Ref: http://surl.li/fduoi
Local Response Normalization
where (W, H) are the width and height of the feature map (for example, in the figure above (W, H) =
(8, 8)). The only difference between Inter- and Intra-Channel LRN is the neighborhood for
normalization. In Intra-Channel LRN, a 2D neighborhood is defined (as opposed to the 1D
neighborhood in Inter-Channel LRN) around the pixel under consideration.
Ref: http://surl.li/fduoi
Local Response Normalization
As an example, the figure below shows the Intra-Channel normalization on a 5x5 feature map with
n=2 (i.e. 2D neighborhood of size (n+1)x(n+1) centered at (x,y)).
Ref: http://surl.li/fduoi
Comparison of BN & LRN
LRN has multiple directions to perform normalization across (Inter or Intra Channel), on the other hand,
BN has only one way of being carried out (for each pixel position across all the activations). The table
below compares the two normalization techniques.
Training a Convolutional Network
CNN Architectures
https://towardsdatascience.com/5-most-well-known-cnn-architectures-visualized-af76f1f0065e#7ebd
LRN: https://towardsdatascience.com/difference-between-local-response-normalization-and-batch-normalization-272308c034ac