Module 4 Notes
What Is Padding
Padding is a technique used to preserve the spatial dimensions of a feature map after a convolution operation. It involves adding extra pixels around the border of the input feature map before the convolution is applied.
• Valid Padding: With valid padding, no pixels are added to the input feature map, so the output feature map is smaller than the input. This is useful when we want to reduce the spatial dimensions of the feature maps.
• Same Padding: With same padding, enough pixels are added to the input feature map that the output feature map has the same size as the input. This is useful when we want to preserve the spatial dimensions of the feature maps.
The number of pixels to add can be calculated from the kernel size and the desired output feature map size. The most common choice is zero-padding, which adds zeros around the borders of the input feature map.
Padding can help in reducing the loss of information at the borders of the input feature map and can
improve the performance of the model. However, it also increases the computational cost of the
convolution operation. Overall, padding is an important technique in CNNs that helps in preserving the
spatial dimensions of the feature maps and can improve the performance of the model.
• For a grayscale (n x n) image and an (f x f) filter/kernel, the dimensions of the image resulting from a convolution operation are (n – f + 1) x (n – f + 1).
For example, for an (8 x 8) image and a (3 x 3) filter, the output resulting from the convolution operation would be of size (6 x 6). Thus, the image shrinks every time a convolution operation is performed. This places an upper limit on the number of times such an operation can be performed before the image shrinks to nothing, which prevents us from building deeper networks.
Padding is simply the process of adding layers of zeros around our input images so as to avoid the problem described above. For example, by adding one layer of padding to an (8 x 8) image and using a (3 x 3) filter, we get an (8 x 8) output after performing the convolution operation.
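To make the arithmetic concrete, here is a small Python sketch that reproduces both numbers above: the (6 x 6) output without padding and the (8 x 8) output with one layer of padding. The helper name and the stride argument are illustrative assumptions, not something defined in these notes.

```python
# A small helper illustrating the (n - f + 1) rule and its padded generalization;
# the function name and the stride parameter are illustrative assumptions.
def conv_output_size(n, f, p=0, s=1):
    """Output width of a convolution on an (n x n) input with an (f x f) kernel,
    p pixels of zero padding on each side, and stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(8, 3))        # 6 -> an 8x8 image shrinks to 6x6
print(conv_output_size(8, 3, p=1))   # 8 -> one layer of padding preserves the size
```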
Types of Padding
Valid Padding: It implies no padding at all; the input image is left in its unaltered shape, so the output shrinks to (n – f + 1) x (n – f + 1).
Same Padding: In this case, we add ‘p’ padding layers such that the output image has the same
dimensions as the input image.
So, for an (f x f) filter (with f odd), p = (f – 1) / 2 layers of zeros give an output with the same dimensions as the input. If we use a (3 x 3) filter, 1 layer of zeros must be added to the borders; similarly, if a (5 x 5) filter is used, 2 layers of zeros must be appended to the borders of the image.
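As a quick illustration of zero-padding itself, the NumPy sketch below pads an assumed 8 x 8 array with p = (f – 1) / 2 zeros on each side; the array contents are placeholders.

```python
import numpy as np

# Zero-padding sketch: the 8x8 input values are placeholders.
image = np.arange(64, dtype=float).reshape(8, 8)
f = 3
p = (f - 1) // 2                                  # same padding for a 3x3 filter -> p = 1
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)
print(padded.shape)                               # (10, 10): a 3x3 convolution now gives 8x8 back
```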
• Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the
number of parameters to learn and the amount of computation performed in the network.
• The pooling layer summarises the features present in a region of the feature map generated
by a convolution layer. So, further operations are performed on summarised features instead
of precisely positioned features generated by the convolution layer. This makes the model
more robust to variations in the position of the features in the input image.
Max Pooling
1. Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output of the max-pooling layer is a feature map containing the most prominent features of the previous feature map.
Example output (2 x 2 max pooling, stride 2):
[[9. 7.]
 [8. 6.]]
Average Pooling
1. Average pooling computes the average of the elements present in the region of feature map
covered by the filter. Thus, while max pooling gives the most prominent feature in a particular
patch of the feature map, average pooling gives the average of features present in a patch.
Example output (2 x 2 average pooling, stride 2):
[[4.25 4.25]
 [4.25 3.5 ]]
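The two outputs above can be reproduced with the short NumPy sketch below; the 4 x 4 input matrix is an assumption chosen to be consistent with both results, since the original input is not shown in these notes.

```python
import numpy as np

# Assumed 4x4 feature map consistent with the max- and average-pooling outputs above.
fmap = np.array([[2, 2, 7, 3],
                 [9, 4, 6, 1],
                 [8, 5, 2, 4],
                 [3, 1, 2, 6]], dtype=float)

# Split into non-overlapping 2x2 pooling regions (i.e. pool size 2, stride 2).
blocks = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)
print(blocks.max(axis=(2, 3)))    # max pooling     -> [[9. 7.] [8. 6.]]
print(blocks.mean(axis=(2, 3)))   # average pooling -> [[4.25 4.25] [4.25 3.5]]
```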
Global Pooling
Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is reduced to a 1 x 1 x nc feature map. This is equivalent to using a filter of dimensions nh x nw, i.e. the dimensions of the feature map.
Further, it can be either global max pooling or global average pooling.
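A minimal NumPy sketch of global pooling, assuming an arbitrary 7 x 7 x 64 feature map: each channel collapses to a single value.

```python
import numpy as np

# Global pooling sketch; the 7x7x64 feature map size is an assumption.
fmap = np.random.rand(7, 7, 64)        # nh x nw x nc
gap = fmap.mean(axis=(0, 1))           # global average pooling -> one value per channel
gmp = fmap.max(axis=(0, 1))            # global max pooling variant
print(gap.shape, gmp.shape)            # (64,) (64,), i.e. a 1 x 1 x 64 map
```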
1. The pooling layer works by dividing the input feature map into a set of non-overlapping
regions, called pooling regions. Each pooling region is then transformed into a single output
value, which represents the presence of a particular feature in that region. The most common
types of pooling operations are max pooling and average pooling.
2. In max pooling, the output value for each pooling region is simply the maximum value of the
input values within that region. This has the effect of preserving the most salient features in
each pooling region, while discarding less relevant information. Max pooling is often used in
CNNs for object recognition tasks, as it helps to identify the most distinctive features of an
object, such as its edges and corners.
3. In average pooling, the output value for each pooling region is the average of the input values
within that region. This has the effect of preserving more information than max pooling, but
may also dilute the most salient features. Average pooling is often used in CNNs for tasks such
as image segmentation and object detection, where a more fine-grained representation of the
input is required.
Pooling layers are typically used in conjunction with convolutional layers in a CNN, with each pooling
layer reducing the spatial dimensions of the feature maps, while the convolutional layers extract
increasingly complex features from the input. The resulting feature maps are then passed to a fully
connected layer, which performs the final classification or regression task.
Advantages of Pooling Layers
1. Dimensionality reduction: The main advantage of pooling layers is that they help in reducing
the spatial dimensions of the feature maps. This reduces the computational cost and also helps
in avoiding overfitting by reducing the number of parameters in the model.
2. Translation invariance: Pooling layers are also useful in achieving translation invariance in the
feature maps. This means that the position of an object in the image does not affect the
classification result, as the same features are detected regardless of the position of the object.
3. Feature selection: Pooling layers can also help in selecting the most important features from
the input, as max pooling selects the most salient features and average pooling preserves more
information.
Disadvantages of Pooling Layers
1. Information loss: One of the main disadvantages of pooling layers is that they discard some
information from the input feature maps, which can be important for the final classification or
regression task.
2. Over-smoothing: Pooling layers can also cause over-smoothing of the feature maps, which can
result in the loss of some fine-grained details that are important for the final classification or
regression task.
3. Hyperparameter tuning: Pooling layers also introduce hyperparameters such as the size of the
pooling regions and the stride, which need to be tuned in order to achieve optimal
performance. This can be time-consuming and requires some expertise in model building.
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
Convolutional Neural Networks, or convnets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having a length and width (the spatial dimensions of the image) and a height (i.e. the channels, as images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, and stacking them vertically. Now slide that neural network across the whole image; as a result, we get another image with a different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but a smaller width and height. This operation is called convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights.
• For example, if we have to run a convolution on an image with dimensions 34 x 34 x 3, the possible filter sizes are a x a x 3, where ‘a’ can be 3, 5, or 7, but smaller than the image dimensions.
• During the forward pass, we slide each filter across the whole input volume step by step where
each step is called stride (which can have a value of 2, 3, or even 4 for high-dimensional
images) and compute the dot product between the kernel weights and patch from input
volume.
• As we slide our filters, we get a 2-D output for each filter; stacking them together, we get an output volume with a depth equal to the number of filters. The network learns all the filters.
• Input Layers: It’s the layer in which we give input to our model. In CNN, Generally, the input
will be an image or a sequence of images. This layer holds the raw input of the image with
width 32, height 32, and depth 3.
• Convolutional Layers: This is the layer used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are small matrices, usually of 2×2, 3×3, or 5×5 shape. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we then get an output volume of dimension 32 x 32 x 12.
• Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. An element-wise activation function is applied to the output of the convolution layer. A common choice is ReLU, which applies max(0, x) element-wise.
• Pooling layer: This layer is periodically inserted in convnets, and its main function is to reduce the size of the volume, which makes computation faster, reduces memory usage, and also helps prevent overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16 x 16 x 12.
• Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so they can be passed into a fully connected layer for classification or regression.
• Fully Connected Layers: These take the input from the previous layer and compute the final classification or regression output (a sketch of this layer stack follows below).
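A minimal Keras-style sketch of the layer stack described above, following the running 32 x 32 x 3 example; the use of ‘same’ padding (needed to keep the 32 x 32 x 12 output), the kernel size, and the 10-way output are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the example CNN stack: input -> conv -> activation -> pool -> flatten -> dense.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),                  # input: 32x32 RGB image
    layers.Conv2D(12, kernel_size=3, padding="same",
                  activation="relu"),                # 12 filters -> 32x32x12 feature maps
    layers.MaxPooling2D(pool_size=2, strides=2),     # 2x2 max pool, stride 2 -> 16x16x12
    layers.Flatten(),                                # flatten for the fully connected head
    layers.Dense(10, activation="softmax"),          # final classification layer (assumed 10 classes)
])
model.summary()
```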
CNNs are good at detecting patterns and features in images, videos, and audio signals. However, their interpretability is limited: it is hard to understand exactly what the network has learned.
The weight-sharing property of convolutional neural networks (CNNs) has been a revolutionary
concept in the field of deep learning and computer vision.
In neural networks, each neuron in one layer is connected to every neuron in the next layer, and each
of these connections has its own weight. This results in a massive number of parameters, especially
for large input sizes, making the network prone to overfitting and computationally expensive.
CNNs address this issue through the weight-sharing mechanism. In this approach, the same weights
are used across different parts of the input, significantly reducing the number of parameters. This is
achieved using convolutional filters (or kernels) that slide across the input image, extracting features
such as edges, textures, and shapes.
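A quick way to see the effect of weight sharing is to compare parameter counts for the same 32 x 32 x 3 input; the layer widths below are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Fully connected layer: every input value connects to every unit.
dense_model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Flatten(),
    layers.Dense(64),
])

# Convolutional layer: one shared 3x3x3 kernel per output channel.
conv_model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(64, kernel_size=3, padding="same"),
])

print(dense_model.count_params())   # 32*32*3*64 + 64 = 196,672 parameters
print(conv_model.count_params())    # 3*3*3*64 + 64  =   1,792 parameters
```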
Advantages
1. Reduced Complexity: By sharing weights, CNNs drastically reduce the number of parameters,
making the network less complex and easier to train.
2. Translation Equivariance: Because the same filter is applied at every spatial position, a feature can be detected wherever it appears in the input.
3. Efficiency: With fewer parameters, CNNs are more computationally efficient and require less memory.
Disadvantages
1. Limited Perception: Due to their local receptive field, individual neurons in a CNN might have
a limited understanding of the overall context.
2. Spatial Invariance Limitation: While good at handling translation, CNNs are less effective with
other transformations like rotation and scaling without additional augmentation.
The weight-sharing property of CNNs has enabled advancements in numerous applications, including
image and video recognition, medical image analysis, and autonomous driving.
• A fully connected neural network consists of a series of fully connected layers that connect
every neuron in one layer to every neuron in the other layer.
• The major advantage of fully connected networks is that they are “structure agnostic” i.e.
there are no special assumptions needed to be made about the input.
• While being structure agnostic makes fully connected networks very broadly applicable, such
networks do tend to have weaker performance than special-purpose networks tuned to the
structure of a problem space.
• CNN architectures make the explicit assumption that the inputs are images, which allows
encoding certain properties into the model architecture.
• A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of
activations to another through a differentiable function. Three main types of layers are used
to build CNN architecture: Convolutional Layer, Pooling Layer, and Fully-Connected Layer.
Dataset Used
• MNIST (Modified National Institute of Standards and Technology database) dataset of 60,000
28x28 grayscale images of the 10 digits, along with a test set of 10,000 images.
• It is a subset of a larger set available from NIST. The digits have been size-normalized and
centered in a fixed-size image.
• It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
Model Implementation
• Model Architecture
For the fully connected architecture, I used a total of three hidden layers with the ‘relu’ activation function, apart from the input and output layers.
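A minimal Keras sketch of such a fully connected model; the hidden-layer widths, optimizer, and loss are assumptions, since the notes only specify three hidden layers with ‘relu’.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Fully connected baseline for 28x28 MNIST images flattened to 784 values.
fc_model = keras.Sequential([
    keras.Input(shape=(28 * 28,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
fc_model.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
# Training setup from the notes (x_train / y_train assumed to be loaded MNIST data):
# fc_model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.3)
```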
• Model Summary
• Model Accuracy
On training the fully connected model for five epochs with a batch size of 128 and a validation split of 0.3, we got a training accuracy of 98.6% and a validation accuracy of 96.07%. Moreover, after the 2nd epoch, we can see the training and validation accuracy curves moving wide apart.
• Model Architecture
For the convolutional neural network architecture, we added 3 convolutional layers with ‘relu’ activation and a max pooling layer after the first convolutional layer.
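A minimal Keras sketch of the described CNN (three convolutional layers with ‘relu’ and a max pool after the first); the filter counts and kernel sizes are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# CNN for 28x28x1 MNIST images.
cnn_model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),                      # max pool after the first conv layer
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
cnn_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# cnn_model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.3)
```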
• Model Summary
• Model Accuracy
On training the CNN for five epochs with a batch size of 128 and a validation split of 0.3, we got a training accuracy of 99.19% and a validation accuracy of 99.63%. Moreover, unlike the fully connected model, the training and validation accuracies do not move as wide apart. On test data with 10,000 images, the accuracy of the fully connected neural network is 98.9%.
• Kernel K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit.
Full Convolution
Zero padding, stride 1:
Z_{i,j,k} = \sum_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n}
Zero padding, stride s:
Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} [V_{l, (j-1)s+m, (k-1)s+n} K_{i,l,m,n}]
Convolution with a stride greater than 1 pixel is equivalent to convolution with stride 1 followed by downsampling.
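The NumPy sketch below illustrates this equivalence for a single channel: a stride-s convolution equals a stride-1 convolution whose output is then subsampled by s. Array shapes are arbitrary assumptions.

```python
import numpy as np

def conv2d_valid(V, K):
    """Stride-1 'valid' cross-correlation of a 2-D input V with a 2-D kernel K."""
    h = V.shape[0] - K.shape[0] + 1
    w = V.shape[1] - K.shape[1] + 1
    out = np.zeros((h, w))
    for j in range(h):
        for k in range(w):
            out[j, k] = np.sum(V[j:j + K.shape[0], k:k + K.shape[1]] * K)
    return out

V = np.random.rand(8, 8)                 # assumed single-channel input
K = np.random.rand(3, 3)                 # assumed kernel
s = 2
strided = conv2d_valid(V, K)[::s, ::s]   # stride-1 convolution, then downsample by s
print(strided.shape)                     # (3, 3)
```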
Without zero padding, the width of the representation shrinks by one pixel less than the kernel width at each layer. We are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels. Zero padding allows us to control the kernel width and the size of the output independently.
• Same: enough zeros are added to keep the size of the output equal to the size of the input, so there is no limit on the number of layers. However, pixels near the border influence fewer output pixels than pixels near the center.
• Full: enough zeros are added for every pixel to be visited k (kernel width) times in each direction, resulting in an output of width m + k – 1. It is difficult to learn a single kernel that performs well at all positions in the convolutional feature map.
Usually the optimal amount of zero padding lies somewhere between ‘valid’ and ‘same’.
Unshared Convolution
In some cases we do not want to use convolution but rather a locally connected layer; this is called unshared convolution. Indexing into a weight tensor W, the operation is
Z_{i,j,k} = \sum_{l,m,n} [V_{l, j+m-1, k+n-1} W_{i,j,k,l,m,n}]
This is useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space, e.g. looking for a mouth only in the bottom half of the image.
It can also be useful to make versions of convolutional or locally connected layers in which the connectivity is further restricted, e.g. constraining each output channel i to be a function of only a subset of the input channels.
Advantages of restricted connectivity:
• reduced memory consumption
• increased statistical efficiency
• reduced computation for both forward and backward propagation
Tiled Convolution
Tiled convolution learns a set of t kernels that we rotate through as we move through space. Immediately neighbouring locations will have different filters, but the memory requirement for storing the parameters increases only by a factor of the size of this set of kernels. Comparing locally connected layers, tiled convolution, and standard convolution:
Z_{i,j,k} = \sum_{l,m,n} [V_{l, j+m-1, k+n-1} K_{i,l,m,n, j\%t+1, k\%t+1}]
Locally connected layers and tiled convolutional layers with max pooling: the detector units of these layers are driven by different filters. If the filters learn to detect different transformed versions of the same underlying feature, then the max-pooled units become invariant to the learned transformation.
Review: LeNet Architecture
The LeNet architecture, developed by Yann LeCun and his colleagues in the late 1980s and early 1990s, is one of the earliest convolutional neural networks and has substantially influenced the field of deep learning, particularly image recognition. Designed originally to recognize handwritten and machine-printed characters, LeNet was a groundbreaking model at the time of its inception.
Its architecture, known as LeNet-5, consists of convolutional layers followed by subsampling and fully
connected layers, culminating in a softmax output layer.
LeNet’s significance in deep learning cannot be overstated. It was one of the first demonstrations that
convolutional neural networks (CNNs) could be successfully applied to visual pattern recognition.
LeNet introduced several key concepts that are now standard in CNN architectures, including the use
of multiple convolutional and pooling layers, local receptive fields, shared weights, and the
backpropagation algorithm for training the network.
1. Late 1980s: Yann LeCun begins foundational work on convolutional neural networks at AT&T
Bell Labs, leading to the development of the initial LeNet models.
2. 1989: The first iteration, LeNet-1, is introduced, employing backpropagation for training
convolutional layers.
3. 1998: LeNet-5, the most notable version, is detailed in the seminal paper “Gradient-Based
Learning Applied to Document Recognition.” This iteration is optimized for digit recognition
and demonstrates practical applications.
4. 2000s: LeNet’s success inspires further research and adaptations in various fields beyond digit
recognition, such as medical imaging and object recognition.
5. 2010s and Beyond: LeNet’s principles influence the development of more advanced CNN
architectures like AlexNet and ResNet, solidifying its legacy in the field of deep learning.
LeNet’s Architecture
1. Input Layer: Accepts 32×32 pixel images, often zero-padded if original images are smaller.
2. First Convolutional Layer (C1): Consists of six 5×5 filters, producing six feature maps of 28×28
each.
3. First Pooling Layer (S2): Applies 2×2 average pooling, reducing feature maps’ size to 14×14.
4. Second Convolutional Layer (C3): Uses sixteen 5×5 filters, but with sparse connections,
outputting sixteen 10×10 feature maps.
5. Second Pooling Layer (S4): Further reduces feature maps to 5×5 using 2×2 average pooling.
6. First Fully Connected Layer (C5): Fully connected with 120 nodes, followed by a second fully connected layer (F6) with 84 nodes (a Keras-style sketch of the full stack follows this list).
7. Output Layer: Softmax or Gaussian activation that outputs probabilities across 10 classes
(digits 0-9).
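A minimal Keras-style sketch of the LeNet-5 stack listed above; the activation functions are simplified assumptions (the original model used sigmoid/tanh-like nonlinearities and a Gaussian/RBF-style output rather than softmax).

```python
from tensorflow import keras
from tensorflow.keras import layers

# LeNet-5 layer sizes follow the list above; activations are assumptions.
lenet5 = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, kernel_size=5, activation="tanh"),    # C1: six 28x28 feature maps
    layers.AveragePooling2D(pool_size=2),                  # S2: 14x14
    layers.Conv2D(16, kernel_size=5, activation="tanh"),   # C3: sixteen 10x10 feature maps
    layers.AveragePooling2D(pool_size=2),                  # S4: 5x5
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),                  # C5
    layers.Dense(84, activation="tanh"),                   # F6
    layers.Dense(10, activation="softmax"),                # output over the 10 digit classes
])
lenet5.summary()
```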
Applications of LeNet
LeNet’s architecture, originally developed for digit recognition, has proven versatile and foundational,
influencing a variety of applications beyond its initial scope. Here are some notable applications and
adaptations:
1. Handwritten Character Recognition: Beyond recognizing digits, LeNet has been adapted to
recognize a broad range of handwritten characters, including alphabets from various
languages. This adaptation has been crucial for applications such as automated form
processing and handwriting-based authentication systems.
2. Object Recognition in Images: The principles of LeNet have been extended to more complex
object recognition tasks. Modified versions of LeNet are used in systems that need to recognize
objects in photos and videos, such as identifying products in a retail setting or vehicles in traffic
management systems.
3. Document Classification: LeNet can be adapted for document classification by recognizing and
learning from the textual and layout features of different document types. This application is
particularly useful in digital document management systems where automatic categorization
of documents based on their content and layout can significantly enhance searchability and
retrieval.
4. Medical Image Analysis: Adaptations of LeNet have been applied in the field of medical image
analysis, such as identifying abnormalities in radiographic images, segmenting biological
features in microscopic images, and diagnosing diseases from patterns in medical imagery.
These applications demonstrate the potential of convolutional neural networks in supporting
diagnostic processes and enhancing the accuracy of medical evaluations.
AlexNet:
AlexNet Architecture:
AlexNet consists of 8 layers, including 5 convolutional layers and 3 fully connected layers. It uses
traditional stacked convolutional layers with max-pooling in between. Its deep network structure
allows for the extraction of complex features from images.
• The architecture employs overlapping pooling layers to reduce spatial dimensions while
retaining the spatial relationships among neighbouring features.
• Activation function: AlexNet uses ReLU activations in its hidden layers with a softmax output, together with dropout regularization, which enhance the model’s ability to capture non-linear relationships within the data and reduce overfitting.
• AlexNet was created to be more computationally efficient than earlier CNN topologies. It
introduced parallel computing by utilising two GPUs during training.
• AlexNet is a relatively shallow network compared to GoogLeNet. It has eight layers, which makes it simpler to train and less prone to overfitting on smaller datasets.
• In 2012, AlexNet produced ground-breaking results in the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC). It outperformed prior CNN architectures greatly and set the
path for the rebirth of deep learning in computer vision.
• Several architectural improvements were introduced by AlexNet, including the use of rectified linear units (ReLU) as activation functions, overlapping pooling, and dropout regularisation. These strategies aided in improving performance and generalisation.
Consider an image classification task over various dog breeds: AlexNet’s convolutional layers learn features such as edges, textures, and shapes to distinguish between the different breeds, and the fully connected layers then analyze these learned features and make predictions.
Deep Architecture: Utilized a deep network with eight layers, much deeper than previous models,
contributing to advancements in CNN architectures.
Use of GPUs: Leveraged GPUs to speed up training, significantly enhancing performance and efficiency
in processing large datasets.
Innovative Techniques:
• ReLU Activation: Employed Rectified Linear Units for faster training, an essential component
in the optimization of gradient-based learning.
Large-Scale Data: Trained on the large ImageNet dataset, which contains millions of images,
demonstrating the importance of extensive and diverse datasets in machine learning.
Inspiration for Research: This work paved the way for more advanced neural network architectures
and deep learning research, influencing subsequent innovations in the field.
AlexNet: Introduced in 2012 and developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, AlexNet has a relatively shallow architecture of stacked convolutional and pooling layers. Despite its groundbreaking nature at the time, this limited depth constrains its ability to learn very complex features. It utilizes techniques such as local response normalization, ReLU activations, and a softmax output for classification tasks.
ResNet: Introduced in 2015, ResNet builds upon earlier CNNs such as AlexNet by using a much deeper architecture with “skip connections.” These connections allow gradients to flow directly to earlier layers, alleviating the vanishing gradient problem that hinders training in very deep networks. This enables ResNet to achieve significantly higher accuracy. ResNet also excels in tasks such as image segmentation and classification due to its robust architecture.
The ResNet authors observed that a 56-layer plain CNN gives a higher error rate on both the training and test datasets than a 20-layer CNN. After analysing the error rates further, they concluded that this degradation is related to vanishing/exploding gradients.
ResNet, proposed in 2015 by researchers at Microsoft Research, introduced a new architecture called the Residual Network.
Residual Network: To solve the problem of the vanishing/exploding gradient, this architecture introduced the concept of residual blocks. In this network, we use a technique called skip connections. A skip connection connects the activations of a layer to later layers by skipping some layers in between; this forms a residual block. ResNets are made by stacking these residual blocks together.
The approach behind this network is that instead of the layers learning the underlying mapping H(x) directly, we allow the network to fit the residual mapping F(x) = H(x) – x, so that the original mapping becomes H(x) = F(x) + x. The advantage of adding this type of skip connection is that if any layer hurts the performance of the architecture, it can effectively be skipped, with regularization driving its residual toward zero. This makes it possible to train very deep neural networks.
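A minimal Keras sketch of a residual block with an identity skip connection; the filter count, kernel size, and use of batch normalization are assumptions rather than a specific ResNet variant.

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """F(x) + x with two 3x3 convolutions; the identity shortcut carries x forward."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])          # H(x) = F(x) + x
    return layers.Activation("relu")(y)

inputs = keras.Input(shape=(32, 32, 64))     # assumed feature-map shape
outputs = residual_block(inputs)
block = keras.Model(inputs, outputs)
```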
ResNet (short for Residual Network) is a type of neural network architecture introduced in 2015 by
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun from Microsoft Research. It was designed to
solve the problem of vanishing gradients in deep neural networks, which hindered their performance
on large-scale image recognition tasks.
The ResNet architecture is usually divided into four parts, each containing multiple residual blocks of different depths. The first part of the network comprises a single convolutional layer followed by max pooling to reduce the spatial dimensions of the input. The second part of the network contains 64 filters, while the third and fourth parts contain 128 and 256 filters, respectively. The final part of the network consists of global average pooling and a fully connected layer that produces the output.
Background
Deep neural networks have revolutionized the field of computer vision by achieving state-of-the-art
results on various tasks such as image classification, object detection, and semantic segmentation.
However, training deep neural networks can be challenging due to the problem of vanishing gradients.
Residual Learning
Residual learning is a concept that was introduced in the ResNet architecture to tackle the vanishing
gradient problem. In traditional deep neural networks, each layer applies a set of transformations to
the input to obtain the output. ResNet introduces residual connections that enable the network to learn residual mappings, which are the differences between the input and output of a layer.
The residual connections are formed by adding the input to the output of a layer, which allows the gradients to flow directly through the network without being attenuated. This enables the network to learn the residual mapping using a shortcut connection that bypasses the layer's transformation.
ResNet Architecture
The ResNet architecture consists of several layers, each containing residual blocks. A residual block is
a set of layers that perform a set of transformations on the input to obtain the output and includes a
shortcut connection that adds the input to the output.
The ResNet architecture has several variants, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The number in each variant corresponds to the number of layers in the network. For example, ResNet-50 has 50 layers, while ResNet-152 has 152 layers.
The ResNet-50 architecture is one of the most popular variants, and it consists of five stages, each
containing several residual blocks. The first stage consists of a convolutional layer followed by a max-
pooling layer, which reduces the spatial dimensions of the input.
ResNet has achieved state-of-the-art results on various computer vision tasks, including image classification, object detection, and semantic segmentation. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015, the ResNet-152 architecture achieved a top-5 error rate of 3.57%, significantly better than previous architectures.
Benefits of ResNet
ResNet has several benefits that make it a popular choice for deep learning applications:
o Deeper networks
ResNet enables the construction of deeper neural networks, with more than a hundred layers, which was previously impractical due to the vanishing gradient problem. The residual connections allow the network to learn better representations and optimize the gradient flow, making it easier to train deeper networks.
o Improved accuracy
ResNet has achieved state-of-the-art performance on several benchmark datasets, such as ImageNet,
CIFAR-10, and CIFAR-100, demonstrating its superior accuracy compared to other deep neural network
architectures.
o Faster convergence
ResNet enables faster convergence during training, thanks to the residual connections that allow for
better gradient flow and optimization. This results in faster training and better convergence to the
optimal solution.
o Transfer learning
ResNet is suitable for transfer learning, allowing the network to reuse previously learned features for new tasks. This is especially useful in scenarios where the amount of labeled data is limited, as a pre-trained ResNet can be fine-tuned on the new dataset to achieve good performance.
Drawbacks of ResNet
Despite its numerous benefits, ResNet has a few drawbacks that should be considered:
o Complexity
ResNet is a complex architecture that requires more memory and computational resources than
shallower networks. This can be a limitation in scenarios with limited resources, such as mobile devices
or embedded systems.
o Overfitting
ResNet can be prone to overfitting, especially when the network is very deep or when the dataset is small. This can be mitigated by regularization techniques, such as dropout, or by using smaller networks with fewer layers.
o Interpretability
Like other deep neural networks, ResNet offers limited interpretability: it is hard to explain which learned features drive a particular prediction, which can be a concern in applications that require explainability.
Conclusion
ResNet is a powerful deep neural network architecture that has revolutionized the field of computer
vision by enabling the construction of deeper and more accurate networks. Its residual connections
enable better gradient flow and optimization, making training deeper networks easier and achieving
better performance on benchmark datasets.
ResNet has limitations, such as complexity, susceptibility to overfitting, and limited interpretability.
When choosing ResNet or any other deep neural network architecture for a specific task, these
drawbacks should be considered.
Overall, ResNet has significantly impacted deep learning and computer vision, and its principles have
been extended to other domains, such as natural language processing and speech recognition. As
research in deep learning continues to evolve, new architectures and techniques will likely be developed
to address the current limitations of ResNet and other existing architectures.