
Foundations of Deep Learning

Module 3

Syllabus
Convolutional Neural Networks – Architecture, Convolution operation, Motivation, Pooling.
Variants of convolution functions, Structured outputs, Data types, Efficient convolution
algorithms, Applications of Convolutional Networks, Pre-trained convolutional architectures:
AlexNet, ZFNet, VGGNet-19, ResNet-50.

Overview (watch these videos):
https://www.youtube.com/watch?v=QzY57FaENXg
https://www.youtube.com/watch?v=K_BHmztRTpA&sttick=0

Reference: https://www.ibm.com/topics/convolutional-neural-networks

• A Convolutional Neural Network (CNN) is a type of Deep Learning neural network
architecture commonly used in Computer Vision, the field of AI that deals with
understanding and interpreting image/visual data.

• CNNs are a special type of ANN that accepts images as inputs.

Why do we have to use CNN?

• Grayscale images have pixel values ranging from 0 to 255, i.e. 8-bit pixel
values.

• If the size of the image is N×M, then the size of the input vector will be N*M. For RGB
images, it would be N*M*3, where 3 represents the number of channels.

• Consider an RGB image of size 30x30. Feeding it to a fully connected network would
require 2700 input neurons [30*30*3 = 2700]. An RGB image of size 256x256 would require
196,608 neurons (well over 100,000).

• The number of weights and parameters for a 224x224x3 input is very high.

• A single neuron in the output layer would have 224x224x3 = 150,528 weights coming into
it. This would require more computation, memory, and data.

• In a CNN, each layer performs convolution. The CNN takes as input an image volume
(width × height × channels) for an RGB image.

• Basically, an image is taken as the input and we apply a kernel/filter to the image to
get the output.

• CNNs enable parameter sharing between the output neurons, which means that a
feature detector (for example, a horizontal edge detector) that is useful in one part of the
image is probably useful in another part of the image.

What are the issues with CNN?

● CNNs can be challenging to train and require large amounts of data. Additionally, they
can be computationally expensive, especially for large and complex models.
● Vulnerable to adversarial attacks
● Limited ability to generalize

They have three main types of layers, which are:

● Convolutional layer
● Pooling layer
● Fully-connected layer

Convolutions

• Every output neuron is connected to a small neighborhood in the input through a
weight matrix, also referred to as a kernel.

• We can define multiple kernels for every convolution layer, each giving rise to an output.

• Each filter is moved around the input image.

• The outputs corresponding to each filter are stacked, giving rise to an output volume.

Padding
• Padded convolution is used when preserving the dimensions of the input matrix is
important to us; it helps us keep more of the information at the border of an
image.

• We have seen that convolution reduces the size of the feature map.

• To retain the dimensions of the output feature map as those of the input map, we pad
(append) the rows and columns with zeros.

• Padding P = (F − 1)/2, where F is the size of the kernel matrix.
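As a quick worked check (not part of the original notes): a 3×3 kernel (F = 3) needs
P = (3 − 1)/2 = 1, i.e. one ring of zeros around the image, while a 5×5 kernel (F = 5)
needs P = 2 to keep the output the same size as the input.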

Stride

• Stride refers to the number of pixels the kernel filter moves at each step, i.e. pixels per step.

• A stride of 2 means the kernel moves 2 pixels between convolution operations, skipping every
other position.

• With a stride of 1, the kernel filter slides over the input matrix one pixel at a time.

• With a stride of 2, the kernel slides over the input two pixels at a time.

• The area of the output feature map is reduced roughly 4 times when the stride is increased
from 1 to 2.

• The dimension of the output feature map is (N − F)/S + 1, for an N×N input, an F×F kernel,
and stride S (with no padding).
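A minimal sketch of this formula in Python (an illustrative helper, not from the original
notes; it also includes the zero-padding term used later in these notes):

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output width of a convolution on an n x n input
    with an f x f kernel, given stride and zero padding."""
    return (n - f + 2 * padding) // stride + 1

print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3 (about half the width, ~1/4 the area)
```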


Pooling

• The pooling operation involves sliding a two-dimensional filter over each channel
of the feature map and summarising the features lying within the region covered by the
filter.

• A common CNN model architecture is to have a number of convolution and pooling
layers stacked one after the other.

• Pooling layers are used to reduce the dimensions of the feature maps. Thus, they
reduce the number of parameters to learn and the amount of computation
performed in the network.

• The pooling layer summarizes the features present in a region of the feature map
generated by a convolution layer.

• Pooling provides translational invariance by subsampling: it reduces the size of the
feature maps. The two commonly used pooling techniques are max pooling and
average pooling.

• A 2x2 pooling operation divides a 4x4 matrix into four 2x2 regions and picks the value
which is the greatest among the four (for max pooling) or the average of the four
(for average pooling).

• This reduces the size of the feature maps, which therefore reduces the number of
parameters without losing important information.

• One thing to note here is that the pooling operation reduces the Nx and Ny values
of the input feature map but does not reduce the value of Nc (the number of channels).

• Also, the hyperparameters involved in the pooling operation are the filter dimension,
stride, and type of pooling (max or avg).
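A minimal NumPy sketch (illustrative, not from the original notes) of 2x2 max and average
pooling with stride 2 on a 4x4 matrix:

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [4, 8, 3, 5]])

# Split the 4x4 matrix into four non-overlapping 2x2 regions.
regions = x.reshape(2, 2, 2, 2).swapaxes(1, 2)

max_pooled = regions.max(axis=(2, 3))   # [[6, 4], [8, 9]]
avg_pooled = regions.mean(axis=(2, 3))  # [[3.75, 2.25], [5.25, 4.25]]
```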

Max Pooling

• Max pooling is a pooling operation that selects the maximum element from the region
of the feature map covered by the filter. Thus, the output after the max-pooling layer
would be a feature map containing the most prominent features of the previous
feature map.

Average Pooling

• Average pooling computes the average of the elements present in the region of the
feature map covered by the filter. Thus, while max pooling gives the most prominent
feature in a particular patch of the feature map, average pooling gives the average of the
features present in a patch.

Global Pooling

• Global pooling reduces each channel in the feature map to a single value. Thus, an
nh x nw x nc feature map is reduced to 1 x 1 x nc feature map. This is equivalent to
using a filter of dimensions nh x nw i.e. the dimensions of the feature map.

• Further, it can be either global max pooling or global average pooling.
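A one-line NumPy sketch (the layout here is assumed to be height × width × channels;
illustrative, not from the notes):

```python
import numpy as np

feature_map = np.random.rand(7, 7, 64)   # nh x nw x nc
gap = feature_map.mean(axis=(0, 1))      # global average pooling -> shape (64,)
gmp = feature_map.max(axis=(0, 1))       # global max pooling variant
```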

• In convolutional neural networks (CNNs), the pooling layer is a common type of layer
that is typically added after convolutional layers. The pooling layer is used to reduce
the spatial dimensions (i.e., the width and height) of the feature maps, while
preserving the depth (i.e., the number of channels).

• The pooling layer works by dividing the input feature map into a set of
non-overlapping regions, called pooling regions. Each pooling region is then
transformed into a single output value, which represents the presence of a particular
feature in that region. The most common types of pooling operations are max pooling
and average pooling.

• In max pooling, the output value for each pooling region is simply the maximum
value of the input values within that region. This has the effect of preserving the most
salient features in each pooling region, while discarding less relevant information.
Max pooling is often used in CNNs for object recognition tasks, as it helps to identify
the most distinctive features of an object, such as its edges and corners.

• In average pooling, the output value for each pooling region is the average of the
input values within that region. This has the effect of preserving more information
than max pooling, but may also dilute the most salient features. Average pooling is
often used in CNNs for tasks such as image segmentation and object detection,
where a more fine-grained representation of the input is required.

Advantages of Pooling Layer:

• Dimensionality reduction: The main advantage of pooling layers is that they help in
reducing the spatial dimensions of the feature maps. This reduces the computational
cost and also helps in avoiding overfitting by reducing the number of parameters in
the model.

• Translation invariance: Pooling layers are also useful in achieving translation
invariance in the feature maps. This means that the position of an object in the image
does not affect the classification result, as the same features are detected regardless
of the position of the object.

• Feature selection: Pooling layers can also help in selecting the most important
features from the input, as max pooling selects the most salient features and
average pooling preserves more information.

Output Feature Map

• The size of the output feature map or volume depends on:

• Size of the input feature map

• Kernel size (Kw, Kh)

• Zero padding

• Stride (Sw, Sh)
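Combining these (a standard formula, stated here for completeness with the symbols from
the list above): for input width W, kernel width Kw, zero padding P, and stride Sw,

Wout = (W − Kw + 2P)/Sw + 1

and analogously for the height. For example, W = 32, Kw = 5, P = 0, Sw = 1 gives Wout = 28.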

Motivation

● Sparse interactions,
● Parameter sharing
● Equivariant representations.

Sparse Interactions

➔ Convolutional networks have sparse interactions (also referred to as sparse connectivity
or sparse weights). This is accomplished by making the kernel smaller than the input.
➔ For example, when processing an image, the input image might have thousands or
millions of pixels, but we can detect small, meaningful features such as edges with
kernels that occupy only tens or hundreds of pixels. This means that we need to store
fewer parameters, which both reduces the memory requirements of the model and
improves its statistical efficiency.
➔ It also means that computing the output requires fewer operations. These improvements
in efficiency are usually quite large. If there are m inputs and n outputs, then matrix
multiplication requires m×n parameters and the algorithms used in practice have O(m ×
n) runtime (per example). If we limit the number of connections each output may have to
k, then the sparsely connected approach requires only k × n parameters and O(k × n)
runtime.
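To make these savings concrete (illustrative numbers, not from the text): with m = n = 10^6
units, dense matrix multiplication needs m × n = 10^12 parameters, while limiting each output
to k = 9 connections (a 3×3 kernel) needs only k × n = 9 × 10^6 parameters, a reduction of
over five orders of magnitude.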
Parameter Sharing
➔ Parameter sharing refers to using the same parameter for more than one function in a
model. In a traditional neural net, each element of the weight matrix is used exactly once
when computing the output of a layer: it is multiplied by one element of the input and
then never revisited.
➔ As a synonym for parameter sharing, one can say that a network has tied weights,
because the value of the weight applied to one input is tied to the value of a weight
applied elsewhere.
➔ In a convolutional neural net, each member of the kernel is used at every position of the
input (except perhaps some of the boundary pixels, depending on the design decisions
regarding the boundary).
➔ The parameter sharing used by the convolution operation means that rather than
learning a separate set of parameters for every location, we learn only one set.
➔ This does not affect the runtime of forward propagation (it is still O(k × n)), but it does
further reduce the storage requirements of the model to k parameters.
➔ Recall that k is usually several orders of magnitude smaller than m. Since m and n are
usually roughly the same size, k is practically insignificant compared to m × n.
➔ Convolution is thus dramatically more efficient than dense matrix multiplication in terms
of memory requirements and statistical efficiency.
➔ A feed-forward network connects every pixel to each node in the following layer,
ignoring any spatial information present in the image. A convolutional architecture
instead looks at local regions of the image: for example, a 2x2 filter with a stride of 2
scanned across a 4x4 image outputs 4 nodes, each containing localized information
about the image.
Equivariant Representations
➔ Due to parameter sharing, the layers of a convolutional neural network have the property
of equivariance to translation.
➔ It says that if we change the input in a way, the output will also change in the same way.
➔ Specifically, a function f (x) is equivariant to a function g if f(g(x)) = g(f (x)).
➔ In the case of convolution, if we let g be any function that translates the input, i.e., shifts
it, then the convolution function is equivariant to g.
➔ For example, let I be a function giving image brightness at integer coordinates.
➔ Let g be a function mapping one image function to another image function, such that I’=
g(I) is the image function with I’ (x, y) = I(x − 1, y).
➔ This shifts every pixel of I one unit to the right.
➔ If we apply this transformation to I, then apply convolution, the result will be the same as
if we applied convolution to I , then applied the transformation g to the output.
➔ In a traditional 2D CNN designed for grayscale images, the input is a 2D grid of pixel
values. The convolutional layers employ 2D filters (kernels) that slide across the image
in both the vertical and horizontal directions.
➔ These filters capture local patterns in the image, making the network equivariant to
translation in the spatial domain.
➔ For instance, if an object moves to a different position in the image, the network will still
detect it as long as the local pattern (e.g., edges, textures) remains the same.
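A small numerical check of equivariance (a sketch using SciPy with circular boundary
handling, so the shift and the convolution commute exactly; the names are illustrative):

```python
import numpy as np
from scipy.ndimage import correlate

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.random((3, 3))

# g: shift the image one pixel to the right (circularly, to avoid border effects)
shift = lambda img: np.roll(img, 1, axis=1)
# f: cross-correlation with circular ('wrap') boundary handling
conv = lambda img: correlate(img, kernel, mode='wrap')

# Equivariance: f(g(x)) == g(f(x))
assert np.allclose(conv(shift(image)), shift(conv(image)))
```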
IMP: Q. What happens if the stride of the convolution layer increases? What can be
the maximum stride? Justify

• Stride is a component of convolutional neural networks, networks tuned for
processing image and video data. Stride is a parameter of the neural
network's filter that modifies the amount of movement over the image or video. For
example, if a neural network's stride is set to 1, the filter will move one pixel, or unit,
at a time. The size of the filter affects the encoded output volume, so stride is usually
set to a whole integer rather than a fraction or decimal.

• Naturally, as the stride, or movement, is increased, the resulting output will be
smaller.

• The choice of stride is also important, but it affects the tensor shape after the
convolution, hence the whole network. The general rule is to use stride=1 in usual
convolutions and preserve the spatial size with padding, and use stride=2 when you
want to downsample the image.

• When the stride of a convolutional layer increases, it means that the filter or kernel
moves a larger distance with each step during the convolution operation. This results
in a reduction in the spatial dimensions of the output feature map. The maximum
stride you can use depends on the dimensions of the input data and the size of the
filter.

• The formula to calculate the output size of a convolutional layer is:
Output size = (N − F + 2P)/S + 1, where N is the input size, F the filter size,
P the padding, and S the stride.

• Increasing the stride value reduces the output size. The maximum stride you can
use is limited by the filter size and the input size. If the stride is too large, the filter
might not effectively cover the input data, causing information loss and potentially
making the network unable to learn meaningful features.

• Typically, a common choice for stride is 1, which means the filter moves one pixel
at a time. Larger strides like 2 or 3 are often used in specific situations to
downsample the feature map and reduce computational complexity in deeper layers
of convolutional neural networks. However, the choice of stride should be made
carefully based on the specific task and network architecture to balance information
preservation and computational efficiency.
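A quick way to sanity-check a stride choice (an illustrative Python sketch; n, f, and p
stand for the input size, filter size, and padding):

```python
def stride_is_valid(n, f, s, p=0):
    """A stride fits cleanly if the filter positions tile the (padded)
    input exactly, i.e. (n - f + 2p) is divisible by s."""
    return (n - f + 2 * p) % s == 0

# 13x13 input, 3x3 filter, no padding:
for s in (2, 3, 5):
    print(s, stride_is_valid(13, 3, s))  # 2 -> True, 3 -> False, 5 -> True
```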
Can we apply multiple filters to the same image? In practice, instead of applying one
kernel, we can apply multiple kernels with different values to the same image, one after
another, so that we get multiple outputs.

All of these outputs can be stacked on top of each other to form a volume.

If we apply three filters to the input, we will get an output of depth equal to 3.

The depth of the output from the convolution operation is equal to the number of filters
being applied to the input, as the sketch below illustrates.
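A minimal NumPy sketch (illustrative; a naive 'valid' cross-correlation rather than an
optimized implementation) showing that output depth equals the number of filters:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' cross-correlation of a 2D image with a 2D kernel."""
    H, W = image.shape
    f = kernel.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.random.rand(6, 6)
filters = [np.random.rand(3, 3) for _ in range(3)]

# Stack the per-filter outputs along a new channel axis.
volume = np.stack([conv2d_valid(image, k) for k in filters], axis=-1)
print(volume.shape)  # (4, 4, 3): depth equals the number of filters
```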

Variants of convolution functions

Full Convolution

● Zero padding, unit stride
● Zero padding, stride s: convolution with a stride greater than 1 pixel is equivalent to
convolution with unit stride followed by downsampling.
● Some padding, unit stride


Special cases of 0 padding:

● Valid: no 0 padding is used. The output shrinks at every layer, which limits the
number of layers.
● Same: enough zero padding is added to keep the size of the output equal to the size
of the input, allowing an unlimited number of layers. Pixels near the border influence
fewer output pixels than pixels near the center.
● Full: enough zeros are added for every pixel to be visited k (kernel width) times
in each direction, resulting in an output of width m + k − 1. It is difficult to learn a
single kernel that performs well at all positions in the convolutional feature map.

Usually the optimal amount of 0 padding lies somewhere between 'Valid' and 'Same'. The
output widths for the three cases are worked out below.
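As a quick worked summary (standard results, added for clarity): for an input of width m
and a kernel of width k, valid convolution outputs width m − k + 1, same convolution
outputs width m, and full convolution outputs width m + k − 1. For m = 32 and k = 5 these
are 28, 32, and 36 respectively.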

Unshared Convolution
Useful when we know that each feature should be a function of a small part of space,
but there is no reason to think that the same feature should occur across all of space,
e.g. looking for a mouth only in the bottom half of an image.

It can also be useful to make versions of convolution or locally connected layers in which
the connectivity is further restricted, e.g. constraining each output channel i to be a
function of only a subset of the input channels.
Tiled Convolution
❖ Learn a set of kernels that we rotate through as we move through space. Immediately
neighboring locations will have different filters, but the memory requirement for storing
the parameters will increase only by a factor of the size of this set of kernels.
Structured outputs
Strategies for the size-reduction issue:

● Avoid pooling altogether
● Emit a lower-resolution grid of labels
● Use a pooling operator with unit stride

One strategy for pixel-wise labeling of images is to produce an initial guess of the image
labels and then refine it:

1. Produce an initial guess of the image labels.
2. Refine this initial guess using the interactions between neighboring pixels.

Repeating this refinement step several times corresponds to using the same convolution
at each stage, sharing weights between the last layers of the deep net; a rough sketch of
this loop is given below.
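A minimal Python sketch of the refinement loop (illustrative only; conv_in and conv_lab
stand for convolutions with fixed, shared kernels, and refine_steps is an assumed
hyperparameter):

```python
import numpy as np

def pixelwise_labels(image, conv_in, conv_lab, refine_steps=3):
    """Sketch of iterative label refinement with shared (tied) weights.

    conv_in:  a convolution applied to the input image
    conv_lab: a convolution applied to the current label map
    The SAME two convolutions are reused at every refinement step.
    """
    labels = conv_in(image)                      # 1. initial guess
    for _ in range(refine_steps):                # 2. refine several times
        labels = conv_in(image) + conv_lab(labels)
    return labels

# Toy usage with stand-in "convolutions" (identity-like maps):
img = np.random.rand(5, 5)
out = pixelwise_labels(img, conv_in=lambda x: x, conv_lab=lambda y: 0.5 * y)
```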
Data types
The data used with a convolutional network usually consists of several channels, each
channel being the observation of a different quantity at some point in space or time:
1-D data (e.g. audio waveforms), 2-D data (e.g. images), or 3-D data (e.g. volumetric
scans or video), each with single or multiple channels.
Efficient convolution algorithms

Modern convolutional network applications often involve networks containing more than one
million units. Powerful implementations exploiting parallel computation resources are essential.
However, in many cases it is also possible to speed up convolution by selecting an appropriate
convolution algorithm. Convolution is equivalent to converting both the input and the kernel to
the frequency domain using a Fourier transform, performing point-wise multiplication of the two
signals, and converting back to the time domain using an inverse Fourier transform. For some
problem sizes, this can be faster than the naive implementation of discrete convolution.
When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector
per dimension, the kernel is called separable.
● When the kernel is separable, naive convolution is inefficient.
● It is equivalent to composing d one-dimensional convolutions with each of these vectors.
● The composed approach is significantly faster than performing one d-dimensional convolution
with their outer product.
● The kernel also takes fewer parameters to represent as vectors.
● If the kernel is w elements wide in each dimension, then naive multidimensional convolution
requires O(w^d) runtime and parameter storage space, while separable convolution requires
O(w × d) runtime and parameter storage space.
● Of course, not every convolution can be represented in this way
A spatially separable convolution simply divides a kernel into two smaller kernels. The most
common case is to divide a 3x3 kernel into a 3x1 and a 1x3 kernel.

Now, instead of doing one convolution with 9 multiplications per position, we do two
convolutions with 3 multiplications each (6 in total) to achieve the same effect. With fewer
multiplications, computational complexity goes down, and the network is able to run faster.
A numerical check of this equivalence follows below.
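A small numerical check (a sketch using SciPy; the kernel is deliberately constructed as an
outer product so that it is separable):

```python
import numpy as np
from scipy.signal import convolve2d

col = np.array([[1.0], [2.0], [1.0]])   # 3x1
row = np.array([[1.0, 0.0, -1.0]])      # 1x3
kernel = col @ row                       # separable 3x3 kernel (a Sobel filter)

image = np.random.rand(10, 10)

full_2d = convolve2d(image, kernel, mode='valid')
two_1d = convolve2d(convolve2d(image, col, mode='valid'), row, mode='valid')

assert np.allclose(full_2d, two_1d)      # same result, fewer multiplications
```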
Case Studies of Convolutional Architectures:

LeNet-5
AlexNet
ZFNet
VGGNet-19
ResNet-50

https://iq.opengenus.org/evolution-of-cnn-architectures/#google_vignette
Applications of CNN
Previous Year Questions:
1. Illustrate the strengths and weaknesses of convolutional neural networks.
2. What happens if the stride of the convolutional layer increases? What can be the
maximum stride? Justify your answer
3. Consider an activation volume of size 13×13×64 and a filter of size 3×3×64. Discuss
whether it is possible to perform convolutions with strides 2, 3 and 5. Justify your answer
in each case. (6 marks)
4. Suppose that a CNN was trained to classify images into different categories. It
performed well on a validation set that was taken from the same source as the training
set but not on a testing set. What could be the problem with the training of such a CNN?
How will you ascertain the problem? How can those problems be solved?
5. Explain the following convolution functions: a) tensors b) kernel flipping c) downsampling
d) strides e) zero padding. (10 marks)
6. What is the motivation behind convolution neural networks? (4 marks)
7. Design a Convolutional Neural Network (CNN) for gender classification using face
images of size 256 x 256. Determine suitable filter sizes, activation functions, and the
width of each layer within the network
8. Consider an input image with dimensions of 28 x 28 pixels. You apply a convolutional
operation with a kernel (filter) size of 3x3, a padding of 0, and a stride of 2. Calculate the
dimensions of the output feature map. Also, calculate the padding value if we need the
output to have the same size as the input with a stride of 1.
9. What are the key differences between AlexNet, ZFNet, VGGNet-19, and ResNet-50 in
terms of their architectures and performance?
10. Why do we use pooling in convolutional neural networks? Illustrate with an example how
pooling works.
11. Define the concept of the receptive field. Mention two strategies for expanding the
receptive field without increasing the filter size.
12. Explain the architecture of a Convolutional Neural Network (CNN) and its fundamental
components. (8)
13. Discuss different formats of data that can be used with CNN.
14. Provide examples of diverse applications where Convolutional Neural Networks excel
and explain their effectiveness in those domains.
15. Consider an input image with dimensions of 64 x 64 pixels. You apply a convolutional
operation with a kernel (filter) size of 5x5, a padding of 2, and a stride of 1. Calculate the
dimensions of the output feature map. Additionally, determine the padding value if we
need the output to have the same size as the input with a stride of 2.
