Deep Learning - AD3501 - Notes - Unit 2 - Convolutional Neural Networks
Input Layers: It’s the layer in which we give input to our model. The number of neurons in this layer
is equal to the total number of features in our data (number of pixels in the case of an image).
Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be many
hidden layers depending upon our model and data size. Each hidden layer can have a different number
of neurons, which is generally greater than the number of features. The output from each layer is
computed by matrix multiplication of the output of the previous layer with the learnable weights of that
layer, followed by the addition of learnable biases and an activation function, which makes the network
nonlinear.
Output Layer: The output from the hidden layer is then fed into a logistic function like sigmoid or
softmax, which converts the output for each class into its probability score.
Feeding the data through the model and obtaining the output of each layer as above is called feedforward.
We then calculate the error using an error function; common choices include cross-entropy and squared
error. The error function measures how well the network is performing. After that, we backpropagate
through the model by calculating the derivatives. This step, called backpropagation, is what
is used to minimize the loss.
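To make the feedforward and backpropagation steps concrete, here is a minimal PyTorch sketch of one training step for such a network; the layer sizes, batch, and labels are illustrative, not taken from the text above.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 784 input features (e.g., a 28x28 image), 10 classes.
model = nn.Sequential(
    nn.Linear(784, 128),   # hidden layer: learnable weights and biases
    nn.ReLU(),             # activation function makes the network nonlinear
    nn.Linear(128, 10),    # output layer producing one score per class
)
loss_fn = nn.CrossEntropyLoss()   # error function (applies softmax internally)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 784)           # dummy batch of 32 samples
y = torch.randint(0, 10, (32,))    # dummy integer class labels

logits = model(x)          # feedforward: outputs of each layer in sequence
loss = loss_fn(logits, y)  # measure how well the network is performing
loss.backward()            # backpropagation: compute derivatives of the loss
optimizer.step()           # update weights to minimize the loss
optimizer.zero_grad()
```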
CNNs were first developed and deployed around the 1980s. At the time, a CNN could only recognize
handwritten digits, and it was primarily used to read ZIP codes and PIN codes. The most common aspect
of any AI model is that it requires a massive amount of data to train. This was one of the biggest problems
that CNNs faced at the time, and because of it they were only used in the postal industry. Yann LeCun was
the first to introduce convolutional neural networks.
Convolutional Neural Networks, commonly referred to as CNNs, are a specialized kind of neural network
architecture that is designed to process data with a grid-like topology. This makes them particularly well-
suited for dealing with spatial and temporal data, like images and videos that maintain a high degree of
correlation between adjacent elements.
CNNs are similar to other neural networks, but they have an added layer of complexity due to the fact that
they use a series of convolutional layers. Convolutional layers perform a mathematical operation called
convolution, a sort of specialized matrix multiplication, on the input data. The convolution operation helps
to preserve the spatial relationship between pixels by learning image features using small squares of input
data. The picture below represents a typical CNN architecture.
Convolutional layers
Convolutional layers operate by sliding a set of ‘filters’ or ‘kernels’ across the input data. Each filter is
designed to detect a specific feature or pattern, such as edges, corners, or more complex shapes in the case
of deeper layers. As these filters move across the image, they generate a map that signifies the areas where
those features were found. The output of the convolutional layer is a feature map, which is a
representation of the input image with the filters applied. Convolutional layers can be stacked to create
more complex models, which can learn more intricate features from images. Simply speaking,
convolutional layers are responsible for extracting features from the input images. These features might
include edges, corners, textures, or more complex patterns.
Pooling layers
Pooling layers follow the convolutional layers and are used to reduce the spatial dimension of the input,
making it easier to process and requiring less memory. In the context of images, “spatial dimensions” refer
to the width and height of the image. An image is made up of pixels, and you can think of it like a grid,
with rows and columns of tiny squares (pixels). By reducing the spatial dimensions, pooling layers help
reduce the number of parameters or weights in the network. This helps to combat over-fitting and speeds
up training. Max pooling reduces computational complexity, owing to the reduction in size of the
feature map, and makes the model invariant to small translations. Without max pooling, the network
would not gain the ability to recognize features irrespective of small shifts or
rotations. This would make the model less robust to variations in object positioning within the image,
possibly affecting accuracy.
There are two main types of pooling: max pooling and average pooling. Max pooling takes the maximum
value from each region of the feature map. For example, if the pooling window size is 2×2, it will pick the
pixel with the highest value in that 2×2 region. Max pooling effectively captures the most prominent feature or
characteristic within the pooling window. Average pooling calculates the average of all values within the
pooling window. It provides a smooth, average feature representation.
Stacking sets of convolution layers followed by max-pooling layers creates a hierarchy of features. The
first layer detects simple patterns, and subsequent layers build on those to detect more complex patterns.
CNNs are often used for image recognition and classification tasks. For example, CNNs can be used to
identify objects in an image or to classify an image as being a cat or a dog. CNNs can also be used for more
complex tasks, such as generating descriptions of an image or identifying the points of interest in an image.
Beyond image data, CNNs can also handle time-series data, such as audio data or even text data, although
other types of networks like Recurrent Neural Networks (RNNs) or transformers are often preferred for
these scenarios. CNNs are a powerful tool for deep learning, and they have been used to achieve state-of-
the-art results in many different applications.
The Convolutional layer applies filters to the input image to extract features, the Pooling layer down
samples the image to reduce computation, and the fully connected layer makes the final prediction. The
network learns the optimal filters through back propagation and gradient descent, as detailed in Fig. 3.
Fig. 3 Functions of CNN Layers
LeNet: LeNet is the first CNN architecture. It was developed in 1998 by Yann LeCun and his collaborators
(Leon Bottou, Yoshua Bengio, and Patrick Haffner) for the handwritten digit recognition problem. LeNet was
one of the first successful CNNs and is often considered the "Hello World" of deep learning. It is one of the
earliest and most widely used CNN architectures and has been successfully applied to tasks such as
handwritten digit recognition. The LeNet-5 architecture consists of alternating convolutional and pooling
(subsampling) layers followed by fully connected layers: three convolution layers, two pooling layers, and
two fully connected layers. LeNet was the beginning of CNNs in deep learning for computer vision problems.
Deep networks of this era were hard to train because of the vanishing gradient problem; LeNet's pooling
(subsampling) layers, inserted between convolutional layers, reduce the spatial size of the feature maps,
which helps prevent overfitting and allows CNNs to train more effectively. The diagram below represents the
LeNet-5 architecture.
The LeNet CNN is a simple yet powerful model that has been used for various tasks such as handwritten
digit recognition, traffic sign recognition, and face detection. Although LeNet was developed more than 20
years ago, its architecture is still relevant today and continues to be used.
AlexNet: AlexNet is the deep learning architecture that popularized CNNs. It was developed by Alex
Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. AlexNet has a very similar architecture to LeNet,
but it is deeper and bigger, with convolutional layers stacked on top of each other. AlexNet was the
first large-scale CNN and was used to win the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) in 2012. The AlexNet architecture was designed for large-scale image datasets, and it
achieved state-of-the-art results at the time of its publication. AlexNet is composed of 5 convolutional layers
combined with max-pooling layers, 3 fully connected layers, and 2 dropout layers. The activation
function used in all hidden layers is ReLU; the output layer uses Softmax. The total
number of parameters in this architecture is around 60 million.
ZFNet: ZFNet is a CNN architecture that uses a combination of convolutional and fully connected layers.
It was developed by Matthew Zeiler and Rob Fergus and was the ILSVRC 2013 winner. The network has
relatively fewer parameters than AlexNet but still outperforms it on the ILSVRC 2012 classification task.
It improved on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size
of the middle convolutional layers and making the stride and filter size of the first layer smaller. It is based
on the Zeiler and Fergus model, which was trained on the ImageNet dataset. The ZFNet architecture
consists of a total of seven layers: convolutional layers, max-pooling layers (downscaling), a concatenation
layer, a convolutional layer with linear activation function and stride one, and dropout for regularization
applied before the fully connected output. This model is computationally more efficient than AlexNet,
introducing an approximate inference stage through deconvolutional layers in the middle of the network.
GoogLeNet: GoogLeNet is the CNN architecture that Google used to win the ILSVRC 2014 classification
task. It was developed by Christian Szegedy and colleagues at Google. It showed a notably reduced error
rate in comparison with the previous winners AlexNet (ILSVRC 2012) and ZFNet (ILSVRC 2013), and its
error is also significantly lower than that of VGG (the 2014 runner-up). It achieves a deeper architecture
by employing a number of distinct techniques, including 1x1 convolution and
global average pooling. The GoogLeNet architecture is nevertheless computationally expensive. To reduce
the number of parameters that must be learned, it uses heavy unpooling layers on top of CNNs to remove
spatial redundancy during training, and it also features shortcut connections between the first two
convolutional layers before adding new filters in later CNN layers. Real-world applications/examples of the
GoogLeNet architecture include the Street View House Numbers (SVHN) digit recognition task, which is
often used as a proxy for roadside object detection. Below is a simplified block diagram representing the
GoogLeNet CNN architecture:
VGGNet: VGGNet is the CNN architecture developed by Karen Simonyan and Andrew Zisserman at the
University of Oxford. VGG16 is a 16-layer CNN with roughly 138 million parameters, trained on the
ImageNet dataset of about 1.3 million images (1000 classes). It takes large input images of 224 x 224
pixels, for which its fully connected layers produce 4096-dimensional feature vectors. Networks this large
are expensive to train and require a lot of data, which is the main reason why CNN architectures like
GoogLeNet work better than VGGNet for most image classification tasks where input images have a size
between 100 x 100 and 350 x 350 pixels. Real-world applications/examples of the VGGNet architecture
include the ILSVRC 2014 classification task, which was won by GoogLeNet with VGGNet as runner-up.
The VGG model serves as a strong baseline for many applications in computer vision due to its
applicability to numerous tasks, including object detection. Its deep feature representations are used across
multiple neural network architectures like YOLO, SSD, etc. The diagram below represents the standard
VGG16 network architecture diagram:
ResNet: ResNet is the CNN architecture that was developed by Kaiming He et al. to win the ILSVRC 2015
classification task, which it did with a top-five error of only 3.57%. The deepest variant has 152 layers and
roughly 60 million parameters, which is considered very deep even for CNNs, and training it on the
ImageNet dataset is extremely compute-intensive. CNNs are mostly used for image classification
tasks with 1000 classes, but ResNet showed that the same residual design can also be used successfully for
natural language processing problems like sentence completion or machine comprehension, where it was
used by the Microsoft Research Asia team in 2016 and 2017. Real-life applications/examples of the ResNet
architecture include Microsoft's machine comprehension system, which has used CNNs to generate
the answers for more than 100k questions in over 20 categories. ResNet is computationally efficient
relative to its depth and can be scaled up or down to match the computational power of GPUs.
MobileNets: MobileNets are CNNs that can fit on a mobile device to classify images or detect objects
with low latency. MobileNets were developed by Andrew G. Howard et al. at Google. They are usually very
small CNN architectures, which makes them easy to run in real time on embedded devices like smartphones
and drones. The architecture is also flexible: it can be scaled in depth and width, and for on-device use it
still works better than heavyweight architectures like VGGNet. Real-life examples of the MobileNet
architecture include the models built into Android phones to run Google's Mobile Vision API, which can
automatically identify labels of popular objects in images.
GoogLeNet_DeepDream: DeepDream is a CNN-based image-generation technique developed by
Alexander Mordvintsev, Christopher Olah, et al. at Google. It uses the Inception (GoogLeNet) network to
generate images based on the features a CNN has learned. The technique is often used with
ImageNet-trained models to generate psychedelic images or create abstract artwork.
To summarize the different types of CNN architectures described above in an easy-to-remember form, you
can use the following table:
Table 1. Different Types of CNN Architectures
Architecture | Year | Key Features | Use Case
LeNet-5 | 1998 | First successful CNN; alternating convolution and pooling layers followed by fully connected layers | Handwritten digit recognition
AlexNet | 2012 | Deeper and bigger than LeNet; ReLU activations; dropout; ~60 million parameters | ILSVRC 2012 winner; large-scale image classification
ZFNet | 2013 | AlexNet with tuned hyperparameters (smaller first-layer filters and stride) | ILSVRC 2013 winner
GoogLeNet | 2014 | Inception modules; 1x1 convolutions; global average pooling | ILSVRC 2014 winner
VGGNet | 2014 | Deep stacks of small convolutions; ~138 million parameters | ILSVRC 2014 runner-up; widely used feature extractor (YOLO, SSD)
ResNet | 2015 | Residual (shortcut) connections; up to 152 layers | ILSVRC 2015 winner
MobileNet | 2017 | Very small, low-latency architecture for embedded devices | On-device vision (e.g., Google's Mobile Vision API)
Now imagine taking a small patch of this image and running a small neural network, called a filter or
kernel, on it, with say K outputs represented vertically. Now slide that neural network across the
whole image; as a result, we get another image with a different width, height, and depth. Instead of
just the R, G, and B channels, we now have more channels but a smaller width and height. This operation is
called convolution. If the patch size were the same as that of the image, it would be a regular neural network.
Because of this small patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights and the
same depth as that of input volume (3 if the input layer is image input).
For example, suppose we have to run convolution on an image with dimensions 34x34x3. The possible size of
filters is a x a x 3, where 'a' can be 3, 5, or 7, but small compared to the image dimensions.
During the forward pass, we slide each filter across the whole input volume step by step; the step size is
called the stride (which can have a value of 2, 3, or even 4 for high-dimensional images), and at each
position we compute the dot product between the kernel weights and the corresponding patch of the input volume.
As we slide our filters, we get a 2-D output for each filter; stacking these together as a result, we
get an output volume having a depth equal to the number of filters. The network will learn all the filters.
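As a sketch of this forward pass, the following NumPy snippet convolves one channel with one filter at a chosen stride; the matrix sizes and random values are illustrative.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the feature map (no padding)."""
    h, w = image.shape
    f = kernel.shape[0]                  # assume a square f x f kernel
    out = (h - f) // stride + 1          # positions that fit along one side
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            fmap[i, j] = np.sum(patch * kernel)  # dot product of weights and patch
    return fmap

image = np.random.rand(34, 34)                # one channel of a 34x34 input
kernel = np.random.rand(3, 3)                 # one 3x3 learnable filter
print(conv2d(image, kernel).shape)            # (32, 32) with stride 1
print(conv2d(image, kernel, stride=2).shape)  # (16, 16) with stride 2
```

Running several filters this way and stacking the resulting 2-D maps gives the output volume described above.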
1.3.1 Layers used to build ConvNets
A complete convolutional neural network architecture is also known as a ConvNet. A ConvNet is a
sequence of layers, and every layer transforms one volume to another through a differentiable function.
Let's take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
Input Layers: It's the layer in which we give input to our model. In a CNN, the input will generally
be an image or a sequence of images. This layer holds the raw input of the image with width 32,
height 32, and depth 3.
Convolutional Layers: This is the layer used to extract features from the input dataset.
It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are
small matrices, usually of shape 2×2, 3×3, or 5×5. Each kernel slides over the input image data and
computes the dot product between the kernel weights and the corresponding input image patch. The output
of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we'll get an
output volume of dimension 32 x 32 x 12.
Activation Layer: By adding an activation function to the output of the preceding layer, activation
layers add nonlinearity to the network. An element-wise activation function is applied to the
output of the convolution layer. Some common activation functions are ReLU, Tanh, and Leaky
ReLU. The volume remains unchanged, hence the output volume will have dimensions 32 x 32 x 12.
Pooling Layer: This layer is periodically inserted in the ConvNet. Its main function is to reduce
the size of the volume, which makes computation fast, reduces memory, and also prevents over-fitting.
Two common types of pooling layers are max pooling and average pooling. If we use a
max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16 x 16 x 12.
Flattening: The resulting feature maps are flattened into a one-dimensional vector after the
convolution and pooling layers so they can be passed into a fully connected layer for
classification or regression.
Fully Connected Layers: It takes the input from the previous layer and computes the final
classification or regression task.
Output Layer: The output from the fully connected layers is then fed into a logistic function for
classification tasks, such as sigmoid or softmax, which converts the output for each class into its
probability score.
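The walkthrough above maps directly onto standard layers. Here is a minimal PyTorch sketch for the 32 x 32 x 3 example, assuming 3x3 kernels with padding 1 (so the 32 x 32 size is preserved, matching the 32 x 32 x 12 figure) and an illustrative 10 output classes:

```python
import torch
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),  # 12 filters -> 32 x 32 x 12
    nn.ReLU(),                                   # activation layer, shape unchanged
    nn.MaxPool2d(kernel_size=2, stride=2),       # 2 x 2 max pool -> 16 x 16 x 12
    nn.Flatten(),                                # -> vector of 16 * 16 * 12 = 3072
    nn.Linear(16 * 16 * 12, 10),                 # fully connected layer
    nn.Softmax(dim=1),                           # probability score per class
)

x = torch.randn(1, 3, 32, 32)   # one dummy RGB image (channels-first layout)
print(convnet(x).shape)         # torch.Size([1, 10])
```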
2. Convolution Operation:
A convolutional neural network, or ConvNet, is just a neural network that uses convolution. To understand
the principle, we are going to work with a 2-dimensional convolution first.
Convolution is a mathematical operation that allows the merging of two sets of information. In mathematics,
convolution between two functions produces a third function expressing how the shape of one function is
modified by the other. In the case of a CNN, convolution is applied to the input data to filter the information
and produce a feature map.
This filter is also called a kernel, or feature detector, and its dimensions can be, for example, 3x3. A kernel is
a small 2D matrix whose contents are based upon the operations to be performed. The kernel is applied to the
input image by element-wise multiplication and summation; the output obtained is of lower dimensions and
therefore easier to work with.
Above is an example of kernels for applying Gaussian blur (to smooth the image before processing),
sharpening (to enhance the depth of edges), and edge detection. To perform convolution, the kernel passes
over the input image, performing element-wise multiplication and summation. The result for each receptive
field (the area where convolution takes place) is written down in the feature map.
The shape of a kernel is heavily dependent on the input shape of the image and the architecture of the entire
network; mostly, kernels are of size MxM, i.e., square matrices. The movement of a kernel is always
from left to right and top to bottom.
Stride defines the step by which the kernel moves: for example, a stride of 1 makes the kernel slide by one
row/column at a time, while a stride of 2 moves the kernel by 2 rows/columns. We continue sliding the filter
until the feature map is complete.
For input images with 3 or more channels, such as RGB, a filter is applied instead of a single kernel. Filters
are one dimension higher than kernels and can be seen as multiple kernels stacked on each other, where every
kernel is for a particular channel. Therefore, for an RGB image of (32x32) we have a filter of shape, say, (5x5x3).
Here the input matrix has shape 4x4x1 and the kernel is of size 3x3. Since the input is larger than the
kernel, we can implement a sliding window protocol and apply the kernel over the entire input. The first
entry in the convolved result is calculated as:
45*0 + 12*(-1) + 5*0 + 22*(-1) + 10*5 + 35*(-1) + 88*0 + 26*(-1) + 51*0 = -45
We continue sliding the filter until the feature map is complete.
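That first entry can be checked directly. The 3x3 patch is given above, and the coefficients in the sum correspond to the sharpen kernel [[0, -1, 0], [-1, 5, -1], [0, -1, 0]] (inferred from the arithmetic, not stated explicitly):

```python
import numpy as np

patch = np.array([[45, 12,  5],
                  [22, 10, 35],
                  [88, 26, 51]])    # top-left 3x3 receptive field
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]])   # sharpen kernel (inferred from the sum)

print(np.sum(patch * kernel))       # -45, the first feature-map entry
```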
2.1 Sliding window protocol:
1. The kernel gets into position at the top-left corner of the input matrix.
2. Then it starts moving left to right, calculating the dot product and saving it to a new matrix until it
has reached the last column.
3. Next, the kernel resets its position at the first column, but slides down by one row, thus
following the fashion left-to-right and top-to-bottom.
4. Steps 2 and 3 are repeated till the entire input has been processed.
For a 3D input matrix the movement of the kernel will be from front to back, left to right and top to bottom.
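For multi-channel inputs, each position of the filter produces a single number: the element-wise products are summed across all channels. A minimal NumPy sketch, with illustrative shapes:

```python
import numpy as np

def conv2d_multichannel(image, filt):
    """Apply one (f, f, C) filter to an (H, W, C) image, summing over channels."""
    h, w, c = image.shape
    f = filt.shape[0]
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise product over all channels, reduced to one value
            out[i, j] = np.sum(image[i:i+f, j:j+f, :] * filt)
    return out

rgb = np.random.rand(32, 32, 3)   # RGB input
filt = np.random.rand(5, 5, 3)    # one 5x5x3 filter
print(conv2d_multichannel(rgb, filt).shape)   # (28, 28)
```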
In practice, we don't explicitly define the filters that our convolutional layer will use; we instead
parameterize the filters and let the network learn the best filters to use during training. We do, however,
define how many filters we'll use at each layer, a hyperparameter called the depth of the output
volume.
Another hyperparameter is the stride, which defines how much we slide the filter over the data. For example,
if the stride is 1, we move the window by 1 pixel at a time over the image. When we use larger stride values
of 2 or 3, we jump 2 or 3 pixels at a time, which significantly reduces the output size.
The last hyperparameter is the size of the zero-padding, since it is sometimes convenient to pad the input
volume with zeros around the border.
So now we can compute the spatial size of the output volume as a function of the input volume size (W),
the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the
amount of zero padding used (P) on the border. The formula for calculating how many neurons "fit" is
given by

(W - F + 2P)/S + 1

In our previous example, for the 5x5 input (W=5) and the 2x2 filter (F=2) with stride 1 (S=1) and pad 0
(P=0), we would get a 4x4x(number of filters) output for each network node.
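The formula is easy to wrap in a helper; this sketch checks it against the worked examples in this unit:

```python
def conv_output_size(W, F, S=1, P=0):
    """Number of neurons that 'fit' along one spatial dimension."""
    return (W - F + 2 * P) // S + 1   # floor division rounds down

print(conv_output_size(5, 2, S=1, P=0))   # 4 (5x5 input, 2x2 filter, stride 1)
print(conv_output_size(7, 3, S=2, P=0))   # 3 (7x7 input, 3x3 filter, stride 2)
```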
Traditional neural network layers use matrix multiplication by a matrix of parameters describing the
interaction between each input and output unit. This means that every output unit interacts with every input
unit. Convolutional neural networks, however, have sparse interactions. This is achieved by making the
kernel smaller than the input: an image can have millions of pixels, but while processing it with a kernel we
can detect meaningful information spanning only tens or hundreds of pixels. This means that we need to
store fewer parameters, which not only reduces the memory requirement of the model but also
improves the statistical efficiency of the model.
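The saving is easy to quantify. A quick sketch comparing a fully connected mapping with a single shared kernel, for an illustrative 320 x 280 image:

```python
# Fully connected: every output pixel interacts with every input pixel.
inputs = 320 * 280             # 89,600 input units
outputs = 320 * 280            # a same-sized output
fc_params = inputs * outputs   # about 8 billion weights

# Convolution: each output depends only on a small kernel neighborhood,
# and the same kernel weights are reused at every position.
conv_params = 3 * 3            # a single 3x3 kernel

print(f"{fc_params:,} weights vs {conv_params}")  # 8,028,160,000 vs 9
```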
The same filter weights (1, 0, -1) are used across that layer, an example of parameter sharing.
In some cases, we may not wish to share parameters across the entire image. If the image is cropped to be
centered on a face, we may want different features from different parts of the face: the part of the network
processing the top of the face looks for eyebrows, while the part processing the bottom of the face looks
for the chin. Certain image operations, such as scaling and rotation, are not equivariant to convolution;
other mechanisms are needed for such transformations.
2.5 Pooling
The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and
summarizing the features lying within the region covered by the filter.
For a feature map having dimensions nh x nw x nc, the dimensions of the output obtained after a pooling
layer are

⌊(nh - f)/s + 1⌋ x ⌊(nw - f)/s + 1⌋ x nc

where,
nh - height of the feature map
nw - width of the feature map
nc - number of channels in the feature map
f - size of the filter
s - stride length
A common CNN model architecture is to have a number of convolution and pooling layers stacked one after the
other.
Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters
to learn and the amount of computation performed in the network.
The pooling layer summarizes the features present in a region of the feature map generated by a convolution
layer. So, further operations are performed on summarized features instead of precisely positioned features
generated by the convolution layer. This makes the model more robust to variations in the position of the
features in the input image.
Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map
covered by the filter. Thus, the output after max-pooling layer would be a feature map containing the most
prominent features of the previous feature map.
Average Pooling
Average pooling computes the average of the elements present in the region of feature map covered by the
filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map,
average pooling gives the average of features present in a patch.
Global Pooling
Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is
reduced to 1 x 1 x nc feature map. This is equivalent to using a filter of dimensions nh x nw i.e. the dimensions
of the feature map. Further, it can be either global max pooling or global average pooling.
Global Average Pooling
Considering a tensor of shape h*w*n, the output of the Global Average Pooling layer is a single value across
h*w that summarizes the presence of the feature. Instead of downsizing the patches of the input feature map,
the Global Average Pooling layer downsizes the whole h*w into 1 value by taking the average.
Global Max Pooling
With the tensor of shape h*w*n, the output of the Global Max Pooling layer is a single value across h*w that
summarizes the presence of a feature. Instead of downsizing the patches of the input feature map, the Global
Max Pooling layer downsizes the whole h*w into 1 value by taking the maximum.
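A minimal NumPy sketch of 2x2 max, average, and global pooling on a single channel; the feature-map values are illustrative:

```python
import numpy as np

fmap = np.array([[1., 3., 2., 0.],
                 [5., 6., 1., 2.],
                 [7., 2., 4., 8.],
                 [0., 1., 3., 9.]])   # a 4x4 feature map (one channel)

def pool(fmap, f=2, s=2, op=np.max):
    out = (fmap.shape[0] - f) // s + 1   # floor((n - f)/s) + 1 per side
    res = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            res[i, j] = op(fmap[i*s:i*s+f, j*s:j*s+f])
    return res

print(pool(fmap, op=np.max))    # max pooling:     [[6. 2.] [7. 9.]]
print(pool(fmap, op=np.mean))   # average pooling: [[3.75 1.25] [2.5 6.]]
print(fmap.max(), fmap.mean())  # global max and global average pooling
```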
In convolutional neural networks (CNNs), the pooling layer is a common type of layer that is typically added
after convolutional layers. The pooling layer is used to reduce the spatial dimensions (i.e., the width and height)
of the feature maps, while preserving the depth (i.e., the number of channels).
The pooling layer works by dividing the input feature map into a set of non-overlapping regions, called
pooling regions. Each pooling region is then transformed into a single output value, which represents the
presence of a particular feature in that region. The most common types of pooling operations are max
pooling and average pooling.
In max pooling, the output value for each pooling region is simply the maximum value of the input
values within that region. This has the effect of preserving the most salient features in each pooling
region, while discarding less relevant information. Max pooling is often used in CNNs for object
recognition tasks, as it helps to identify the most distinctive features of an object, such as its edges and
corners.
In average pooling, the output value for each pooling region is the average of the input values within
that region. This has the effect of preserving more information than max pooling, but may also dilute the
most salient features. Average pooling is often used in CNNs for tasks such as image segmentation and
object detection, where a more fine-grained representation of the input is required.
Pooling layers are typically used in conjunction with convolutional layers in a CNN, with each pooling layer
reducing the spatial dimensions of the feature maps, while the convolutional layers extract increasingly
complex features from the input. The resulting feature maps are then passed to a fully connected layer, which
performs the final classification or regression task.
2.5.2 Advantages of Pooling Layer
Dimensionality reduction: The main advantage of pooling layers is that they help in reducing the spatial
dimensions of the feature maps. This reduces the computational cost and also helps in avoiding over-fitting by
reducing the number of parameters in the model.
Translation invariance: Pooling layers are also useful in achieving translation invariance in the feature maps.
This means that the position of an object in the image does not affect the classification result, as the same
features are detected regardless of the position of the object.
Feature selection: Pooling layers can also help in selecting the most important features from the input, as max
pooling selects the most salient features and average pooling preserves more information.
3. Convolution Variants
The goal of a CNN is to transform the input image into concise, abstract representations of the original input.
The individual convolutional layers try to find more complex patterns in the previous layer's observations;
the logic is that 10 curved lines would form two ellipses, which would make an eye.
To do this, each layer uses a kernel, usually a 2x2 or 3x3 matrix, that slides through the previous layer's
output to generate a new output. The word convolve, from convolution, means to roll or slide.
The variants of convolution operations are as follows:
Strided convolution: we take the element-wise product as usual in this upper-left (3 x 3) region, then
multiply and sum the elements. That gives us (91). But then, instead of stepping the blue box over by one
step, we step it over by two steps, so the upper-left corner jumps over one position. We do the usual
element-wise product and summation again, and that gives us (100). Next, we make the blue box jump over
by two steps once more and obtain the value (83). Then, when we go to the next row, we again take two
steps instead of one. Moving the filter by (2) steps, we obtain (69).
In this example we convolve a (7 x 7) matrix with a (3 x 3) matrix and get a (3 x 3) output. The
input and output dimensions turn out to be governed by the following formula:

⌊(n - f + 2p)/s⌋ + 1

If we have an (n x n) image convolved with an (f x f) filter, using padding (p) and stride (s) (here s=2),
then the numerator is (n - f + 2p). Because we're stepping (s) steps at a time instead of just one step,
we divide by (s) and add (1). In our example, we have ((7 - 3 + 0)/2 + 1 = 4/2 + 1 = 3), which is why we
end up with this (3 x 3) output. Notice that in this formula we round the value of the fraction, which in
general might not be an integer, down to the nearest integer.
Moreover, each convolution operation is effectively learning an additional feature map, a learned
representation of the training data, and tiled layers, like convolutional layers, still have a relatively
small number of learned parameters. In essence, it is the pooling operation over these multiple "tiled" maps
that allows the network to learn invariances to scaling and rotation.
In the figure above, units with the same color belong to the same map; within each map, units with the same
fill texture have tied weights. We call this local untying of weights "tiling." Tiled CNNs are parametrized by
a tile size k: we constrain only units that are k steps away from each other to be tied. By varying k, we obtain
a spectrum of models which trade off between being able to learn complex invariances and having few
learnable parameters. At one end of the spectrum we have traditional CNNs (k = 1), and at the other, we have
fully untied simple units.

Next, we will allow our model to use multiple "maps," so as to learn highly overcomplete representations. A
map is a set of pooling units and simple units that collectively cover the entire image (see Figure 8 - Right).
When varying the tiling size, we change the degree of weight tying within each map; for example, if k = 1,
the simple units within each map will have the same weights. In our model, simple units in different maps are
never tied. By having units in different maps learn different features, our model can learn a rich and diverse
set of features. Tiled CNNs with multiple maps enjoy the twin benefits of (i) being able to represent complex
invariances, by pooling over (partially) untied weights, and (ii) having a relatively small number of learnable
parameters.
When downsampling and upsampling techniques are applied to transposed convolutional layers, their effects
are reversed. The reason is that a network can use convolutional layers to compress the image, and then
transposed convolutional layers, with the exact same downsampling and upsampling settings, to reconstruct
the image.
When padding is 'added' to a transposed convolutional layer, it acts as if padding had been removed from the
input, and the resulting output becomes smaller.
Without padding, the output is 7x7, but with padding on both sides, it is 5x5. When strides are used in a
transposed convolution, they affect the input rather than the output.
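These sizes can be verified with PyTorch's ConvTranspose2d; a minimal sketch, assuming a 5x5 single-channel input and a 3x3 kernel:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)   # one 5x5 single-channel input

# Transposed convolution output size: (in - 1) * stride - 2 * padding + kernel
up = nn.ConvTranspose2d(1, 1, kernel_size=3, padding=0)
print(up(x).shape)            # torch.Size([1, 1, 7, 7]): no padding

up_padded = nn.ConvTranspose2d(1, 1, kernel_size=3, padding=1)
print(up_padded(x).shape)     # torch.Size([1, 1, 5, 5]): padding shrinks the output
```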
The dilation rate effectively increases the receptive field of the filter without increasing the number of
parameters, because the filter is still the same size, but with gaps between the values. This can be useful in
situations where a larger receptive field is needed, but increasing the size of the filter would lead to an increase
in the number of parameters and computational complexity.
Dilated convolutions have been used successfully in various applications, such as semantic segmentation, where
a larger context is needed to classify each pixel, and audio processing, where the network needs to learn
patterns with longer time dependencies.
An additional parameter l (the dilation factor) tells how much the input is expanded: (l - 1) positions are
skipped between kernel elements. Figure 9 depicts the difference between normal and dilated convolution;
in essence, normal convolution is just a 1-dilated convolution. The l-dilated convolution is defined as

(F *l k)(p) = Σ F(s) k(t), summed over all s, t with s + l·t = p

where,
F(s) = input
k(t) = applied filter
*l = l-dilated convolution
(F *l k)(p) = output
Advantages of Dilated Convolution:
Using this method rather than normal convolution is better as:
1. Larger receptive field (i.e. no loss of coverage)
2. Computationally efficient (as it provides a larger coverage on the same computation cost)
3. Lower memory consumption (as it skips the pooling step)
4. No loss of resolution of the output image (as we dilate instead of performing pooling)
5. The structure of this convolution helps in maintaining the order of the data.
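The receptive-field gain shows up directly in the output sizes; a minimal PyTorch sketch, assuming a 7x7 input and a 3x3 kernel:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)   # a 7x7 single-channel input

# With dilation 2, the 3x3 kernel covers a 5x5 region using the same 9 weights.
normal = nn.Conv2d(1, 1, kernel_size=3, dilation=1)
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)

print(normal(x).shape)    # torch.Size([1, 1, 5, 5]): effective field 3x3
print(dilated(x).shape)   # torch.Size([1, 1, 3, 3]): effective field 5x5
```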
4. CNN Learning
A neural network without an activation function is essentially just a linear regression model. The activation
function performs the non-linear transformation of the input, making the network capable of learning and
performing more complex tasks.
Here we will look into the ReLU activation function, specifically its non-linear behaviour. ReLU is an
acronym for Rectified Linear Unit; it is the most commonly used activation function. The function returns
0 if it receives any negative input, but for any positive value x it returns that value back. Mathematically it
can be expressed as f(x) = max(0, x). Basically, it sets anything less than or equal to 0 (negative numbers)
to 0 and keeps all values > 0 unchanged. The graphical representation of the ReLU function is:
Now we will look into the derivative of ReLU using the graph above. Let us see the derivative at different
values of x, for both positive and negative values.
As we know, the derivative of a function is defined as the slope of the function at a given point, so you can
see that the function is differentiable almost everywhere. If x is greater than 0 the derivative is 1, and if x is
less than 0 the derivative is 0. But at x = 0, the derivative does not exist. There are two ways to deal with
this. First, you can simply assign an arbitrary value to the derivative of y = f(x) at x = 0. Alternatively,
instead of the actual y = f(x), you can use a smooth approximation to ReLU that is differentiable for all
values of x.
So is ReLU linear or non-linear? A function is linear if its slope is constant over its complete domain. The
ReLU function is non-differentiable at 0, and its slope is 0 for negative values and 1 for positive values, so
the slope is not constant; that is why the ReLU function is non-linear. Intuitively, ReLU is an
activation function, and the purpose of an activation function is to introduce non-linearity into the neural network.
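A minimal NumPy sketch of ReLU and its derivative, adopting the common convention of assigning the derivative at x = 0 the value 0:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)        # f(x) = max(0, x)

def relu_grad(x):
    return (x > 0).astype(float)   # slope 1 for x > 0, else 0 (incl. x == 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]
```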
Cost Function vs Loss Function
Loss Function | Cost Function
Measures the error between predicted and actual values in a machine learning model | Quantifies the overall cost or error of the model on the entire training set
Used to optimize the model during training | Used to guide the optimization process by minimizing the cost or error
Can be specific to individual samples | Aggregates the loss values over the entire training set
Examples include mean squared error (MSE), mean absolute error (MAE), and binary cross-entropy | Often the average or sum of individual loss values in the training set
Used to evaluate model performance | Used to determine the direction and magnitude of parameter updates during optimization
Different loss functions can be used for different tasks or problem domains | Typically derived from the loss function, but can include additional regularization terms or other considerations
Loss Functions in Deep Learning
Regression:
MSE (Mean Squared Error)
MAE (Mean Absolute Error)
Huber loss
Classification:
Binary cross-entropy
Categorical cross-entropy
A. Regression Loss
1. Mean Squared Error (MSE)
MSE is the average of the squared differences between the predicted and actual values:
MSE = (1/n) Σ (yi - ŷi)².
Advantage
o Easy to interpret
o Always differentiable because of the square
o Only one local minimum
Disadvantage
o The error unit is squared, which makes it harder to interpret
o Not robust to outliers
2. Mean Absolute Error (MAE)
MAE is the average of the absolute differences between the predicted and actual values:
MAE = (1/n) Σ |yi - ŷi|.
Advantage
o Intuitive and easy
o Error unit is the same as that of the output column
o Robust to outliers
Disadvantage
o Not differentiable at zero, so we cannot use gradient descent directly; a subgradient
calculation is needed instead.
Note - In regression, use a linear activation function at the last neuron.
3. Huber Loss
In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers
in data. It behaves like MSE for small errors and like MAE for large ones:
Lδ(e) = e²/2 for |e| ≤ δ, and δ(|e| - δ/2) otherwise.
Advantage
o Robust to outliers
o It lies between MAE and MSE
Disadvantage
o Its main disadvantage is the associated complexity: in order to maximize model
accuracy, the hyperparameter δ also needs to be optimized, which increases the
training requirements.
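A minimal NumPy sketch of the three regression losses above; the data and the choice δ = 1.0 are illustrative:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    e = np.abs(y - y_hat)
    quadratic = 0.5 * e ** 2             # MSE-like for small errors
    linear = delta * (e - 0.5 * delta)   # MAE-like for large errors
    return np.mean(np.where(e <= delta, quadratic, linear))

y = np.array([3.0, 5.0, 2.5])
y_hat = np.array([2.5, 5.0, 4.0])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))
```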
B. Classification Loss
1. Binary Cross-Entropy / Log Loss
It is used in binary classification problems with two classes, for example whether a person has COVID
or not, or whether an article becomes popular or not.
Binary cross-entropy compares each of the predicted probabilities to the actual class output, which
can be either 0 or 1. It then calculates a score that penalizes the probabilities based on their distance
from the expected value:

BCE = -(1/N) Σ [yi log(ŷi) + (1 - yi) log(1 - ŷi)]

where,
yi - actual values
ŷi - neural network prediction
Advantage
o The cost function is differentiable
Disadvantage
o Multiple local minima
o Not intuitive
2. Categorical Cross-Entropy
It is used in multi-class classification and compares a one-hot encoded target with the predicted class
probabilities:

CCE = -Σk yk log(ŷk)

where,
k is the class index,
y - actual value
ŷ - neural network prediction
Note – In multi-class classification at the last neuron use the softmax activation function.
If the problem statement has 3 classes, the softmax activation is
f(z1) = e^z1 / (e^z1 + e^z2 + e^z3).
If the target column is one-hot encoded into classes like 0 0 1, 0 1 0, 1 0 0, then use categorical
cross-entropy. And if the target column has numerical encoding of classes like 1, 2, 3, 4, ..., n, then use
sparse categorical cross-entropy. Sparse categorical cross-entropy is faster than categorical cross-entropy.
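A minimal NumPy sketch of softmax followed by the two cross-entropy variants, showing that one-hot and integer labels give the same value; the logits are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # raw scores (logits) for 3 classes
p = softmax(z)                  # predicted probabilities

one_hot = np.array([1.0, 0.0, 0.0])   # categorical cross-entropy target
print(-np.sum(one_hot * np.log(p)))   # categorical cross-entropy

label = 0                       # sparse variant: just the integer class index
print(-np.log(p[label]))        # same value, no one-hot encoding needed
```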
2. Gradient Descent: Gradient descent is an optimization algorithm that uses the gradients of the loss
function to iteratively update the model's parameters in a way that reduces the loss. The basic idea is to
take steps in the opposite direction of the gradient to reach a local minimum of the loss function. This
process is repeated until the algorithm converges to a set of parameter values that, hopefully, result in a
well-trained model.
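A minimal NumPy sketch of gradient descent minimizing MSE for a one-parameter model y = w * x; the data and learning rate are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                    # ground truth generated with w = 2

w, lr = 0.0, 0.05              # initial parameter and learning rate
for step in range(100):
    y_hat = w * x                         # forward pass
    grad = np.mean(2 * (y_hat - y) * x)   # derivative of MSE w.r.t. w
    w -= lr * grad                        # step opposite the gradient

print(round(w, 4))             # converges toward 2.0
```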