Module 3
Syllabus
Convolutional Neural Networks – Architecture, Convolution operation, Motivation, Pooling.
Variants of convolution functions, Structured outputs, Data types, Efficient convolution
algorithms, Applications of Convolutional Networks, Pre-trained convolutional architectures:
AlexNet, ZFNet, VGGNet-19, ResNet-50.
Overview: https://www.youtube.com/watch?v=QzY57FaENXg
https://www.youtube.com/watch?v=K_BHmztRTpA&sttick=0
Reference:https://www.ibm.com/topics/convolutional-neural-networks
• Grayscale images have pixel values in the range 0 to 255, i.e. 8-bit pixel values.
• If the size of the image is NxM, the size of the flattened input vector is N*M. For RGB
images it is N*M*3, where 3 is the number of channels.
• Consider an RGB image of size 30x30. Flattening it gives 2700 input values
[30*30*3 = 2700]. An RGB image of size 256x256 gives 196608 input values.
• A single fully connected neuron taking a 224x224x3 input would have 224*224*3 = 150528
weights coming into it. This requires far more computation, memory, and data.
• In a CNN, each layer performs convolution; for an RGB image the network takes the
image volume (width x height x channels) as input.
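As a quick check of the arithmetic above, a minimal Python sketch (the helper name `input_size` is ours, not from the notes):

```python
# Flattened input sizes for images fed to a fully connected layer.
# Grayscale pixels are assumed 8-bit (0-255); RGB has 3 channels.

def input_size(n, m, channels=1):
    """Length of the flattened input vector for an N x M image."""
    return n * m * channels

print(input_size(30, 30, 3))    # 2700 inputs for a 30x30 RGB image
print(input_size(256, 256, 3))  # 196608 inputs for a 256x256 RGB image
print(input_size(224, 224, 3))  # 150528 weights into ONE fully connected neuron
```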
● CNNs can be challenging to train and require large amounts of data. Additionally, they
can be computationally expensive, especially for large and complex models.
● Vulnerable to adversarial attacks.
● Limited ability to generalize to inputs that differ from the training distribution.
Convolutions
• We can define multiple kernels for every convolution layer, each giving rise to an output.
• The outputs corresponding to each kernel are stacked, giving rise to an output volume.
Padding
• Padded convolution is used when preserving the dimensions of the input matrix is
important to us; it also helps us keep more of the information at the border of an
image.
• We have seen that convolution reduces the size of the feature map.
• To retain the dimensions of the feature map as those of the input map, we pad the
input by appending rows and columns of zeros around its border.
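A minimal NumPy sketch of 'same' zero padding, assuming a square kernel of odd size k (the helper `same_padding` is our own name):

```python
import numpy as np

# For a kernel of odd width k, padding p = (k - 1) // 2 keeps the output
# the same size as the input when the stride is 1.

def same_padding(k):
    return (k - 1) // 2

x = np.arange(16, dtype=float).reshape(4, 4)
p = same_padding(3)       # p = 1 for a 3x3 kernel
x_padded = np.pad(x, p)   # append p rows/columns of zeros on every side
print(x_padded.shape)     # (6, 6): a 3x3 kernel now yields a 4x4 output
```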
Stride
• Stride refers to the number of pixels the kernel moves at each step.
• A stride of 2 means the kernel moves 2 pixels between convolution operations,
skipping one pixel each time.
• In the figure above, the kernel filter slides over the input matrix one pixel at a
time (stride 1).
• With a stride of 2, the kernel moves two pixels at a time before performing the
convolution, as in the image below.
• The output feature map shrinks by roughly a factor of 4 (half the size along each
dimension) when the stride is increased from 1 to 2.
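The shrinkage can be checked with the standard output-size formula floor((N - K + 2P)/S) + 1; the 8x8 input below is an assumed illustrative value, not from the notes:

```python
# Output size of a convolution along one dimension:
# floor((N - K + 2P) / S) + 1

def conv_output_size(n, k, p, s):
    return (n - k + 2 * p) // s + 1

n, k, p = 8, 3, 0
print(conv_output_size(n, k, p, s=1))  # 6 -> a 6x6 map, 36 outputs
print(conv_output_size(n, k, p, s=2))  # 3 -> a 3x3 map, 9 outputs (~4x fewer)
```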
• The pooling operation involves sliding a two-dimensional filter over each channel
of feature map and summarising the features lying within the region covered by the
filter.
• Pooling layers are used to reduce the dimensions of the feature maps. Thus, it
reduces the number of parameters to learn and the amount of computation
performed in the network.
• The pooling layer summarizes the features present in a region of the feature map
generated by a convolution layer.
• For example, with a 2x2 filter and stride 2 the pooling operation divides a 4x4 matrix
into four 2x2 regions and picks the greatest value in each region (for max pooling) or
the average of the four values (for average pooling).
• This reduces the size of the feature maps, and therefore the number of
parameters, without losing the important information.
• Note that the pooling operation reduces the Nx and Ny dimensions of the input
feature map but does not reduce Nc (the number of channels).
• The hyperparameters involved in the pooling operation are the filter dimension,
stride, and type of pooling (max or average).
Max Pooling
• Max pooling is a pooling operation that selects the maximum element from the region
of the feature map covered by the filter. Thus, the output after a max pooling layer is
a feature map containing the most prominent features of the previous feature map.
Average Pooling
• Average pooling computes the average of the elements present in the region of the
feature map covered by the filter. Thus, while max pooling gives the most prominent
feature in a particular patch of the feature map, average pooling gives the average of
the features present in a patch.
Global Pooling
• Global pooling reduces each channel in the feature map to a single value. Thus, an
nh x nw x nc feature map is reduced to 1 x 1 x nc feature map. This is equivalent to
using a filter of dimensions nh x nw i.e. the dimensions of the feature map.
• In convolutional neural networks (CNNs), the pooling layer is a common type of layer
that is typically added after convolutional layers. The pooling layer is used to reduce
the spatial dimensions (i.e., the width and height) of the feature maps, while
preserving the depth (i.e., the number of channels).
• The pooling layer works by dividing the input feature map into a set of
non-overlapping regions, called pooling regions. Each pooling region is then
transformed into a single output value, which represents the presence of a particular
feature in that region. The most common types of pooling operations are max pooling
and average pooling.
• In max pooling, the output value for each pooling region is simply the maximum
value of the input values within that region. This has the effect of preserving the most
salient features in each pooling region, while discarding less relevant information.
Max pooling is often used in CNNs for object recognition tasks, as it helps to identify
the most distinctive features of an object, such as its edges and corners.
• In average pooling, the output value for each pooling region is the average of the
input values within that region. This has the effect of preserving more information
than max pooling, but may also dilute the most salient features. Average pooling is
often used in CNNs for tasks such as image segmentation and object detection,
where a more fine-grained representation of the input is required.
• Dimensionality reduction: The main advantage of pooling layers is that they help in
reducing the spatial dimensions of the feature maps. This reduces the computational
cost and also helps in avoiding overfitting by reducing the number of parameters in
the model.
• Feature selection: Pooling layers can also help in selecting the most important
features from the input, as max pooling selects the most salient features and
average pooling preserves more information.
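The pooling variants above can be sketched in NumPy; the 4x4 input values and the helper `pool2x2` are our own illustrative choices, assuming non-overlapping 2x2 regions (stride equal to the pool size):

```python
import numpy as np

# 2x2 max, average, and global average pooling on one channel.

def pool2x2(x, op):
    h, w = x.shape
    out = np.empty((h // 2, w // 2))
    for i in range(h // 2):
        for j in range(w // 2):
            region = x[2*i:2*i+2, 2*j:2*j+2]  # non-overlapping 2x2 region
            out[i, j] = op(region)
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 7., 6., 8.],
              [9., 2., 1., 0.],
              [3., 4., 5., 6.]])

print(pool2x2(x, np.max))   # [[7. 8.] [9. 6.]]  -- most salient values
print(pool2x2(x, np.mean))  # [[4. 5.] [4.5 3.]] -- regional averages
print(x.mean())             # global average pooling: one value per channel
```

Note how both variants halve Nx and Ny but would leave the channel count Nc unchanged.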
The hyperparameters of a convolutional layer are:
• Kernel size (Kw, Kh)
• Zero padding
• Stride (Sw, Sh)
Motivation
● Sparse interactions
● Parameter sharing
● Equivariant representations
Sparse Interactions
• By making the kernel smaller than the input, each output unit interacts with only a
small number of input units (sparse connectivity). This greatly reduces the memory and
computation required compared to a fully connected layer.
• The choice of stride is also important, but it affects the tensor shape after the
convolution, hence the whole network. The general rule is to use stride=1 in usual
convolutions and preserve the spatial size with padding, and use stride=2 when you
want to downsample the image.
• When the stride of a convolutional layer increases, it means that the filter or kernel
moves a larger distance with each step during the convolution operation. This results
in a reduction in the spatial dimensions of the output feature map. The maximum
stride you can use depends on the dimensions of the input data and the size of the
filter.
• Increasing the stride value reduces the output size. The maximum stride you can
use is limited by the filter size and the input size. If the stride is too large, the filter
might not effectively cover the input data, causing information loss and potentially
making the network unable to learn meaningful features.
• Typically, a common choice for stride is 1, which means the filter moves one pixel
at a time. Larger strides like 2 or 3 are often used in specific situations to
downsample the feature map and reduce computational complexity in deeper layers
of convolutional neural networks. However, the choice of stride should be made
carefully based on the specific task and network architecture to balance information
preservation and computational efficiency.
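A small sketch of this "maximum stride" check, reusing the 13x13 activation and 3x3 filter from question 3 in the Previous Year Questions (the helper `stride_fits` is our own name): a stride s fits cleanly only when (N - K) is divisible by s.

```python
# A stride s is valid when (N - K) % s == 0, so the filter lands exactly
# on the last input position without hanging over the edge.

def stride_fits(n, k, s):
    return (n - k) % s == 0

n, k = 13, 3
for s in (2, 3, 5):
    print(s, stride_fits(n, k, s), (n - k) // s + 1)
# stride 2 -> fits, 6x6 output; stride 3 -> 10 % 3 != 0, does not fit
# cleanly; stride 5 -> fits, 3x3 output
```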
Can we apply multiple filters to the same image? In practice, instead of applying one
kernel, we can apply multiple kernels with different values to the same image, one after
another, so that we obtain multiple outputs.
All of these outputs can be stacked on top of each other to form a volume.
If we apply three filters to the input, we get an output of depth 3.
The depth of the output of the convolution operation equals the number of filters
applied to the input.
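A sketch of stacking several filter outputs into a volume; the random input and kernel values are assumptions, and `conv2d_valid` is a naive 'valid' cross-correlation written only for illustration:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' cross-correlation of a 2-D input with a 2-D kernel."""
    kh, kw = k.shape
    h = x.shape[0] - kh + 1
    w = x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

x = np.random.rand(6, 6)
filters = [np.random.rand(3, 3) for _ in range(3)]  # three 3x3 kernels

# Stack the three output maps along a new channel axis.
volume = np.stack([conv2d_valid(x, f) for f in filters], axis=-1)
print(volume.shape)  # (4, 4, 3): depth equals the number of filters
```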
Full Convolution
In a full convolution, enough zeros are added so that every input pixel is visited k times
in each direction, giving an output of width m + k - 1 for an input of width m and a
kernel of width k. Usually the optimal amount of zero padding lies somewhere between
'valid' and 'same' convolution.
Unshared Convolution
Useful when we know that each feature should be a function of a small part of space,
but there is no reason to think that the same feature should occur across all of space,
e.g. looking for a mouth only in the bottom half of an image.
It can also be useful to make versions of convolution or locally connected layers in which
the connectivity is further restricted, e.g. constraining each output channel i to be a
function of only a subset of the input channels.
Tiled Convolution
❖ Learn a set of kernels that we rotate through as we move through space.
Immediately neighboring locations will have different filters, but the memory
requirement for storing the parameters increases only by a factor of the size of this
set of kernels.
Structured outputs
Strategy for the size-reduction issue:
One strategy for pixel-wise labeling of images is to produce an initial guess of the
image labels, then refine this initial guess using the interactions between neighboring
pixels. Repeating this refinement step several times corresponds to using the same
convolution at each stage, sharing weights between the last layers of the deep net.
Data types
The data used with a convolutional network can be 1-D (e.g. audio waveforms), 2-D
(e.g. images), or 3-D (e.g. volumetric scans or video), and in each case either
single-channel or multi-channel (e.g. stereo audio, RGB color images).
Efficient convolution algorithms
Modern convolutional network applications often involve networks containing more than one
million units. Powerful implementations exploiting parallel computation resources are essential.
However, in many cases it is also possible to speed up convolution by selecting an appropriate
convolution algorithm. Convolution is equivalent to converting both the input and the kernel to
the frequency domain using a Fourier transform, performing point-wise multiplication of the two
signals, and converting back to the time domain using an inverse Fourier transform. For some
problem sizes, this can be faster than the naive implementation of discrete convolution.
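This frequency-domain equivalence can be demonstrated in a few lines of NumPy for the 1-D case (the concrete signal and kernel are assumed illustrative values):

```python
import numpy as np

# Convolution via the frequency domain: transform input and kernel,
# multiply point-wise, transform back; compare against np.convolve.

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])

n = len(x) + len(k) - 1                  # length of the full convolution
X = np.fft.rfft(x, n)                    # zero-pads both signals to length n
K = np.fft.rfft(k, n)
via_fft = np.fft.irfft(X * K, n)

direct = np.convolve(x, k)               # naive discrete convolution
print(np.allclose(via_fft, direct))      # True: the two paths agree
```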
When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector
per dimension, the kernel is called separable.
● When the kernel is separable, naive convolution is inefficient.
● It is equivalent to composing d one-dimensional convolutions with each of these vectors.
● The composed approach is significantly faster than performing one d-dimensional convolution
with their outer product.
● The kernel also takes fewer parameters to represent as vectors.
● If the kernel is w elements wide in each dimension, then naive multidimensional convolution
requires O(w^d) runtime and parameter storage space, while separable convolution requires
O(w × d) runtime and parameter storage space.
● Of course, not every convolution can be represented in this way.
A spatially separable convolution simply divides a kernel into two smaller kernels. The most
common case is to divide a 3x3 kernel into a 3x1 and a 1x3 kernel, like so.
Now, instead of doing one convolution with 9 multiplications, we do two convolutions with 3
multiplications each (6 in total) to achieve the same effect. With fewer multiplications,
computational complexity goes down, and the network is able to run faster.
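A classic concrete instance: the 3x3 Sobel edge kernel is separable into a 3x1 smoothing column and a 1x3 differencing row, so two 1-D passes (6 multiplications per output) replace one 2-D pass (9 multiplications):

```python
import numpy as np

# The Sobel kernel as an outer product of two 1-D vectors.
col = np.array([1., 2., 1.])     # smoothing along one axis
row = np.array([1., 0., -1.])    # differencing along the other

sobel = np.outer(col, row)       # reconstructs the full 3x3 kernel
print(sobel)
# [[ 1.  0. -1.]
#  [ 2.  0. -2.]
#  [ 1.  0. -1.]]
```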
Case Studies of Convolutional Architectures:
LeNet-5
AlexNet
ZFNet
VGGNet-19
ResNet-50
https://iq.opengenus.org/evolution-of-cnn-architectures/#google_vignette
Applications of CNN
Previous Year Questions:
1. Illustrate the strengths and weaknesses of convolutional neural networks.
2. What happens if the stride of the convolutional layer increases? What can be the
maximum stride? Justify your answer.
3. Consider an activation volume of size 13×13×64 and a filter of size 3×3×64. Discuss
whether it is possible to perform convolutions with strides 2, 3 and 5. Justify your answer
in each case. (6 marks)
4. Suppose that a CNN was trained to classify images into different categories. It
performed well on a validation set that was taken from the same source as the training
set but not on a testing set. What could be the problem with the training of such a CNN?
How will you ascertain the problem? How can those problems be solved?
5. a. Explain the following convolution functions a)tensors b) kernel flipping c) down
sampling d) strides e) zero padding. (10 marks)
6. What is the motivation behind convolution neural networks? (4 marks)
7. Design a Convolutional Neural Network (CNN) for gender classification using face
images of size 256 x 256. Determine suitable filter sizes, activation functions, and the
width of each layer within the network.
8. Consider an input image with dimensions of 28 x 28 pixels. You apply a convolutional
operation with a kernel (filter) size of 3x3, a padding of 0, and a stride of 2. Calculate the
dimensions of the output feature map. Also, calculate the padding value if we need the
output to have the same size as the input with a stride of 1.
9. What are the key differences between AlexNet, ZFNet, VGGnet-19, and ResNet50 in
terms of their architectures, performance?
10. Why do we use pooling in convolutional neural networks? Illustrate with an example how
pooling works?
11. Define the concept of the receptive field. Mention two strategies for expanding the
receptive field without increasing the filter size.
12. Explain the architecture of a Convolutional Neural Network (CNN) and its fundamental
components. (8)
13. Discuss different formats of data that can be used with CNN.
14. Provide examples of diverse applications where Convolutional Neural Networks excel
and explain their effectiveness in those domains.
15. Consider an input image with dimensions of 64 x 64 pixels. You apply a convolutional
operation with a kernel (filter) size of 5x5, a padding of 2, and a stride of 1. Calculate the
dimensions of the output feature map. Additionally, determine the padding value if we
need the output to have the same size as the input with a stride of 2.