Module 4 Notes

UNIT 4

Convolutional Neural Networks (CNN): Supervised Learning

What Is Padding

Padding is a technique used to preserve the spatial dimensions of a feature map after a convolution operation. It involves adding extra pixels around the border of the input feature map before the convolution is applied.

This can be done in two ways:

• Valid Padding: With valid padding, no pixels are added to the input feature map, so the output feature map is smaller than the input. This is useful when we want to reduce the spatial dimensions of the feature maps.

• Same Padding: With same padding, enough pixels are added around the input feature map that the output feature map has the same size as the input. This is useful when we want to preserve the spatial dimensions of the feature maps.

The number of pixels to add is calculated from the kernel size and the desired output feature map size. The most common choice is zero-padding, which adds zeros around the borders of the input feature map.

Padding helps reduce the loss of information at the borders of the input feature map and can improve the performance of the model. However, it also increases the computational cost of the convolution operation. Overall, padding is an important technique in CNNs for preserving the spatial dimensions of feature maps.

Problem With Convolution Layers Without Padding

• For a grayscale (n x n) image and an (f x f) filter/kernel, the dimensions of the image resulting from a convolution operation are (n – f + 1) x (n – f + 1).
For example, for an (8 x 8) image and a (3 x 3) filter, the output resulting from the convolution operation would be of size (6 x 6). Thus, the image shrinks every time a convolution operation is performed. This places an upper limit on the number of times such an operation can be performed before the image reduces to nothing, precluding us from building deeper networks.

Padding in convolutional neural network


• Clearly, pixel A is touched in just one convolution operation and pixel B is touched in 3
convolution operations, while pixel C is touched in 9 convolution operations. In general, pixels
in the middle are used more often than pixels on corners and edges. Consequently, the
information on the borders of images is not preserved as well as the information in the middle.



Effect Of Padding On Input Images

Padding is simply the process of adding layers of zeros to our input images so as to avoid the problems mentioned above.

Padding prevents the shrinking of the input image.

If p = the number of layers of zeros added to the border of the image,

then an (n x n) image becomes an (n + 2p) x (n + 2p) image after padding.

[(n + 2p) x (n + 2p) image] * [(f x f) filter] → [(n + 2p – f + 1) x (n + 2p – f + 1) image]

For example, by adding one layer of padding to an (8 x 8) image and using a (3 x 3) filter, we get an (8 x 8) output after performing the convolution operation.



This increases the contribution of the pixels at the border of the original image by bringing them into
the middle of the padded image. Thus, information on the borders is preserved as well as the
information in the middle of the image.

Types of Padding

Valid Padding: It implies no padding at all; the input image is left in its unaltered shape.

[(n x n) image] * [(f x f) filter] → [(n – f + 1) x (n – f + 1) image]

where (n x n) is the dimension of the input image, (f x f) is the kernel size, (n – f + 1) is the output image size, and * represents the convolution operation.

Same Padding: In this case, we add ‘p’ padding layers such that the output image has the same
dimensions as the input image.
So,

[(n + 2p) x (n + 2p) image] * [(f x f) filter] —> [(n x n) image]

which gives p = (f – 1) / 2 (because n + 2p – f + 1 = n).

So, if we use a (3 x 3) filter on an input image and want the output to have the same dimensions, 1 layer of zeros must be added to the borders. Similarly, if a (5 x 5) filter is used, 2 layers of zeros must be appended to the border of the image.
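
To make the arithmetic concrete, here is a minimal sketch (assuming Keras/TensorFlow; the random input is only illustrative) that checks both cases on an 8 x 8 input:

    import numpy as np
    import tensorflow as tf

    x = np.random.rand(1, 8, 8, 1).astype("float32")  # one 8x8 single-channel image

    valid = tf.keras.layers.Conv2D(filters=1, kernel_size=3, padding="valid")(x)
    same = tf.keras.layers.Conv2D(filters=1, kernel_size=3, padding="same")(x)

    print(valid.shape)  # (1, 6, 6, 1): n - f + 1 = 8 - 3 + 1 = 6
    print(same.shape)   # (1, 8, 8, 1): p = (f - 1) / 2 = 1 layer of zeros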

Why Use Pooling Layers?

• Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the
number of parameters to learn and the amount of computation performed in the network.

• The pooling layer summarises the features present in a region of the feature map generated
by a convolution layer. So, further operations are performed on summarised features instead
of precisely positioned features generated by the convolution layer. This makes the model
more robust to variations in the position of the features in the input image.

Types of Pooling Layers:

Max Pooling

1. Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output of the max-pooling layer is a feature map containing the most prominent features of the previous feature map.
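
The example below is a hedged sketch (assuming Keras/TensorFlow); the 4 x 4 input feature map is a reconstruction chosen to match the printed output that follows.

    import numpy as np
    import tensorflow as tf

    # 4x4 input feature map (values chosen to match the output shown below)
    fmap = np.array([[2, 2, 7, 3],
                     [9, 4, 6, 1],
                     [8, 5, 2, 4],
                     [3, 1, 2, 6]], dtype="float32").reshape(1, 4, 4, 1)

    # 2x2 max pooling with stride 2 keeps the largest value in each patch
    pooled = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(fmap)
    print(pooled.numpy().reshape(2, 2))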



Output:

[[9. 7.]
 [8. 6.]]

Average Pooling

1. Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in that patch.
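
The same hedged setup, with average pooling applied to the same reconstructed 4 x 4 feature map:

    import numpy as np
    import tensorflow as tf

    fmap = np.array([[2, 2, 7, 3],
                     [9, 4, 6, 1],
                     [8, 5, 2, 4],
                     [3, 1, 2, 6]], dtype="float32").reshape(1, 4, 4, 1)

    # 2x2 average pooling with stride 2 averages the values in each patch
    pooled = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2)(fmap)
    print(pooled.numpy().reshape(2, 2))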

Output:

[[4.25 4.25]
 [4.25 3.5 ]]

Global Pooling

Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is reduced to a 1 x 1 x nc feature map. This is equivalent to using a filter of dimensions nh x nw, i.e., the dimensions of the feature map.
Further, it can be either global max pooling or global average pooling.
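
A short hedged sketch (Keras assumed) showing the shape reduction from nh x nw x nc to 1 x 1 x nc:

    import numpy as np
    import tensorflow as tf

    x = np.random.rand(1, 4, 4, 12).astype("float32")  # nh x nw x nc = 4 x 4 x 12

    gap = tf.keras.layers.GlobalAveragePooling2D()(x)  # average over all 4x4 positions
    gmp = tf.keras.layers.GlobalMaxPooling2D()(x)      # max over all 4x4 positions

    print(gap.shape)  # (1, 12): each of the 12 channels reduced to one value
    print(gmp.shape)  # (1, 12)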



In convolutional neural networks (CNNs), the pooling layer is a common type of layer that is typically
added after convolutional layers. The pooling layer is used to reduce the spatial dimensions (i.e., the
width and height) of the feature maps, while preserving the depth (i.e., the number of channels).

1. The pooling layer works by dividing the input feature map into a set of non-overlapping
regions, called pooling regions. Each pooling region is then transformed into a single output
value, which represents the presence of a particular feature in that region. The most common
types of pooling operations are max pooling and average pooling.

2. In max pooling, the output value for each pooling region is simply the maximum value of the
input values within that region. This has the effect of preserving the most salient features in
each pooling region, while discarding less relevant information. Max pooling is often used in
CNNs for object recognition tasks, as it helps to identify the most distinctive features of an
object, such as its edges and corners.

3. In average pooling, the output value for each pooling region is the average of the input values
within that region. This has the effect of preserving more information than max pooling, but
may also dilute the most salient features. Average pooling is often used in CNNs for tasks such
as image segmentation and object detection, where a more fine-grained representation of the
input is required.

Pooling layers are typically used in conjunction with convolutional layers in a CNN, with each pooling
layer reducing the spatial dimensions of the feature maps, while the convolutional layers extract
increasingly complex features from the input. The resulting feature maps are then passed to a fully
connected layer, which performs the final classification or regression task.

Advantages of Pooling Layer:

1. Dimensionality reduction: The main advantage of pooling layers is that they help in reducing
the spatial dimensions of the feature maps. This reduces the computational cost and also helps
in avoiding overfitting by reducing the number of parameters in the model.

2. Translation invariance: Pooling layers are also useful in achieving translation invariance in the
feature maps. This means that the position of an object in the image does not affect the
classification result, as the same features are detected regardless of the position of the object.

3. Feature selection: Pooling layers can also help in selecting the most important features from
the input, as max pooling selects the most salient features and average pooling preserves more
information.

Disadvantages of Pooling Layer:

1. Information loss: One of the main disadvantages of pooling layers is that they discard some
information from the input feature maps, which can be important for the final classification or
regression task.

2. Over-smoothing: Pooling layers can also cause over-smoothing of the feature maps, which can
result in the loss of some fine-grained details that are important for the final classification or
regression task.

3. Hyperparameter tuning: Pooling layers also introduce hyperparameters such as the size of the
pooling regions and the stride, which need to be tuned in order to achieve optimal
performance. This can be time-consuming and requires some expertise in model building.



CNN architecture

A Convolutional Neural Network consists of multiple layers: the input layer, convolutional layers, pooling layers, and fully connected layers.

Simple CNN architecture


The convolutional layer applies filters to the input image to extract features, the pooling layer downsamples the image to reduce computation, and the fully connected layer makes the final prediction. The network learns the optimal filters through backpropagation and gradient descent.

How Convolutional Layers Work

Convolutional Neural Networks, or convnets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having a length and width (the spatial dimensions of the image) and a height (the channels, as images generally have red, green, and blue channels).

Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, represented vertically. Now slide that neural network across the whole image; as a result, we get another image with a different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels, but smaller width and height. This operation is called convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights.



The convolution process:
• Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights and the same depth as the input volume (3 if the input layer is an image).

• For example, to run a convolution on an image with dimensions 34x34x3, the filters can be of size a x a x 3, where 'a' can be 3, 5, or 7, but small compared to the image dimension.

• During the forward pass, we slide each filter across the whole input volume step by step, where each step is called a stride (which can have a value of 2, 3, or even 4 for high-dimensional images), and compute the dot product between the kernel weights and the corresponding patch of the input volume.

• As we slide our filters, we get a 2-D output for each filter; stacking them together, we get an output volume with a depth equal to the number of filters. The network learns all the filters.

Layers used to build ConvNets

A complete Convolutional Neural Network architecture is also known as a convnet. A convnet is a sequence of layers, and every layer transforms one volume into another through a differentiable function.
Types of layers:
Let's take an example by running a convnet on an image of dimension 32 x 32 x 3.

• Input Layers: This is the layer in which we give input to our model. In a CNN, the input will generally be an image or a sequence of images. This layer holds the raw input image with width 32, height 32, and depth 3.

• Convolutional Layers: This is the layer used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are small matrices, usually of shape 2×2, 3×3, or 5×5. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as a feature map. Suppose we use a total of 12 filters for this layer; we then get an output volume of dimension 32 x 32 x 12.

• Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. An activation layer applies an element-wise activation function to the output of the convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume will have dimensions 32 x 32 x 12.

• Pooling layer: This layer is periodically inserted in the convnet, and its main function is to reduce the size of the volume, which makes computation fast, reduces memory usage, and also helps prevent overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16x16x12.

• Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so they can be passed into a fully connected layer for classification or regression.

• Fully Connected Layers: It takes the input from the previous layer and computes the final
classification or regression task.



• Output Layer: The output from the fully connected layers is fed into a classification function such as sigmoid or softmax, which converts the output for each class into a probability score (the full stack is sketched in code after this list).
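
A minimal Keras sketch of the layer stack described above (hedged: the 32 x 32 x 3 input, the 12 filters, the 2 x 2 max pool, and the softmax output come from the running example; the 3 x 3 kernel and the 10-class head are assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.Input(shape=(32, 32, 3)),              # input layer: 32 x 32 x 3
        layers.Conv2D(12, 3, padding="same"),         # 12 filters -> 32 x 32 x 12
        layers.Activation("relu"),                    # element-wise nonlinearity
        layers.MaxPooling2D(pool_size=2, strides=2),  # downsample -> 16 x 16 x 12
        layers.Flatten(),                             # one-dimensional vector
        layers.Dense(10, activation="softmax"),       # class probability scores
    ])
    model.summary()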

Advantages of Convolutional Neural Networks (CNNs):

1. Good at detecting patterns and features in images, videos, and audio signals.

2. Robust to translation, and reasonably robust to rotation and scaling when trained with data augmentation.

3. End-to-end training, no need for manual feature extraction.

4. Can handle large amounts of data and achieve high accuracy.

Disadvantages of Convolutional Neural Networks (CNNs):

1. Computationally expensive to train and require a lot of memory.

2. Can be prone to overfitting if not enough data or proper regularization is used.

3. Requires large amounts of labelled data.

4. Interpretability is limited; it's hard to understand what the network has learned.

The weight-sharing property of convolutional neural networks (CNNs) has been a revolutionary
concept in the field of deep learning and computer vision.

The Weight-Sharing in CNN:-

In neural networks, each neuron in one layer is connected to every neuron in the next layer, and each
of these connections has its own weight. This results in a massive number of parameters, especially
for large input sizes, making the network prone to overfitting and computationally expensive.

CNNs address this issue through the weight-sharing mechanism. In this approach, the same weights
are used across different parts of the input, significantly reducing the number of parameters. This is
achieved using convolutional filters (or kernels) that slide across the input image, extracting features
such as edges, textures, and shapes.
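
A small back-of-the-envelope sketch of the saving (the layer sizes below are illustrative assumptions, not taken from the notes):

    # Parameter count: fully connected layer vs. a shared-weight convolution
    h, w, c = 32, 32, 3                      # input volume
    m = 100                                  # fully connected output neurons
    dense_params = (h * w * c + 1) * m       # every input wired to every output
    
    filters, k = 12, 3                       # twelve shared 3x3x3 kernels
    conv_params = filters * (k * k * c + 1)  # each kernel reused at every position
    
    print(dense_params)  # 307300
    print(conv_params)   # 336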

Advantages

1. Reduced Complexity: By sharing weights, CNNs drastically reduce the number of parameters,
making the network less complex and easier to train.

2. Translation Invariance: Weight-sharing helps in detecting features regardless of their position in the input space, contributing to the translation invariance property of CNNs.

3. Efficiency: With fewer parameters, CNNs are more computationally efficient and require less
memory.

Disadvantages

1. Limited Perception: Due to their local receptive field, individual neurons in a CNN might have
a limited understanding of the overall context.

2. Spatial Invariance Limitation: While good at handling translation, CNNs are less effective with
other transformations like rotation and scaling without additional augmentation.



Conclusion

The weight-sharing property of CNNs has enabled advancements in numerous applications, including
image and video recognition, medical image analysis, and autonomous driving.

Fully connected neural network

• A fully connected neural network consists of a series of fully connected layers that connect every neuron in one layer to every neuron in the next layer.

• The major advantage of fully connected networks is that they are “structure agnostic” i.e.
there are no special assumptions needed to be made about the input.

• While being structure agnostic makes fully connected networks very broadly applicable, such
networks do tend to have weaker performance than special-purpose networks tuned to the
structure of a problem space.

Multilayer Deep Fully Connected Network

Convolutional Neural Network

• CNN architectures make the explicit assumption that the inputs are images, which allows
encoding certain properties into the model architecture.

• A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of
activations to another through a differentiable function. Three main types of layers are used
to build CNN architecture: Convolutional Layer, Pooling Layer, and Fully-Connected Layer.




Simple Convolutional architecture

Dataset Used

• MNIST (Modified National Institute of Standards and Technology database) dataset of 60,000
28x28 grayscale images of the 10 digits, along with a test set of 10,000 images.

• It is a subset of a larger set available from NIST. The digits have been size-normalized and
centered in a fixed-size image.

• It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

Model Implementation

A) Using Fully Connected Neural Network Architecture

• Model Architecture

For the fully connected architecture, I have used a total of three hidden layers with the 'relu' activation function, apart from the input and output layers.
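
A hedged reconstruction of that model (the notes state only three hidden ReLU layers and roughly 0.3 million parameters, so the exact layer widths below are assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers

    fc_model = tf.keras.Sequential([
        layers.Input(shape=(28 * 28,)),          # flattened 28x28 MNIST image
        layers.Dense(256, activation="relu"),    # hidden layer 1 (width assumed)
        layers.Dense(128, activation="relu"),    # hidden layer 2 (width assumed)
        layers.Dense(64, activation="relu"),     # hidden layer 3 (width assumed)
        layers.Dense(10, activation="softmax"),  # one output per digit class
    ])
    fc_model.summary()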

• Model Summary



The total number of trainable parameters is around 0.3 million. In a fully connected layer with n inputs and m outputs, the number of weights is n*m. Additionally, there is a bias for each output node, giving a total of (n + 1)*m parameters.

• Model Accuracy

On training the fully connected model for five epochs with a batch size of 128 and a validation split of 0.3, we got a training accuracy of 98.6% and a validation accuracy of 96.07%. Moreover, after the 2nd epoch, we can see the training and validation accuracies move wide apart.

• Accuracy on Test data



On test data with 10,000 images, the accuracy of the fully connected neural network is 96%.

B) Using Convolutional Neural Network Architecture

• Model Architecture

For the Convolutional Neural Network architecture, we added 3 convolutional layers with 'relu' activation and a max-pool layer after the first convolutional layer.
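
A hedged reconstruction of this model as well (the filter counts and kernel sizes are assumptions; the notes specify only three ReLU convolutions with a max pool after the first):

    import tensorflow as tf
    from tensorflow.keras import layers

    cnn_model = tf.keras.Sequential([
        layers.Input(shape=(28, 28, 1)),           # MNIST image with one channel
        layers.Conv2D(32, 3, activation="relu"),   # convolutional layer 1
        layers.MaxPooling2D(pool_size=2),          # max pool after the first conv
        layers.Conv2D(32, 3, activation="relu"),   # convolutional layer 2
        layers.Conv2D(32, 3, activation="relu"),   # convolutional layer 3
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),    # digit class probabilities
    ])
    cnn_model.summary()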

• Model Summary



With the CNN, the differences you can notice in the summary are the output shapes and the number of parameters. Compared to the fully connected model, the total number of parameters is far smaller, roughly 0.1 million.

• Model Accuracy

On training the CNN for five epochs with a batch size of 128 and a validation split of 0.3, we got a training accuracy of 99.19% and a validation accuracy of 99.63%. Moreover, unlike the fully connected model, the training and validation accuracies do not move as wide apart.



• Accuracy on the Test dataset

On test data with 10,000 images, the accuracy of the convolutional neural network is 98.9%.

Variants of the Basic Convolution Function:-


Convolution in the context of NN means an operation that consists of many applications of convolution
in parallel.

• Kernel: K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit.

• Input: V_{i,j,k} with channel i, row j, and column k.

• Output: Z, with the same format as V.

• Indexing starts at 1 (use 1 as the first entry).

Full Convolution

Zero padding, unit stride:

Z_{i,j,k} = \sum_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n}

Zero padding, stride s:

Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} [ V_{l, s(j-1)+m, s(k-1)+n} K_{i,l,m,n} ]

Convolution with a stride greater than 1 pixel is equivalent to convolution with unit stride followed by downsampling:
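
A small NumPy sketch of that equivalence (single channel for brevity; the multichannel sum over l works the same way):

    import numpy as np

    def conv2d(v, k, stride=1):
        # Valid cross-correlation of a 2-D input v with kernel k
        kh, kw = k.shape
        out_h = (v.shape[0] - kh) // stride + 1
        out_w = (v.shape[1] - kw) // stride + 1
        out = np.empty((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(v[i*stride:i*stride+kh, j*stride:j*stride+kw] * k)
        return out

    v = np.random.rand(8, 8)
    k = np.random.rand(3, 3)
    strided = conv2d(v, k, stride=2)                # stride-2 convolution
    downsampled = conv2d(v, k, stride=1)[::2, ::2]  # unit stride, keep every 2nd output
    print(np.allclose(strided, downsampled))        # True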



Some zero padding and unit stride

Without zero padding, the width of the representation shrinks by one pixel less than the kernel width at each layer. We are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels. Zero padding allows us to control the kernel width and the size of the output independently.



Special cases of zero padding:

• Valid: no zero padding is used. The number of layers is limited.

• Same: enough zeros are added to keep the size of the output equal to the size of the input. The number of layers is unlimited, but pixels near the border influence fewer output pixels than pixels near the center.

• Full: enough zeros are added for every pixel to be visited k (kernel width) times in each direction, resulting in an output of width m + k - 1. It is difficult to learn a single kernel that performs well at all positions in the convolutional feature map.

Usually the optimal amount of zero padding lies somewhere between 'valid' and 'same'.

Unshared Convolution

In some cases we do not want to use convolution but rather a locally connected layer. In that case we use unshared convolution. Indices into the weight tensor W:

• i: the output channel

• j: the output row;



• k: the output column

• l: the input channel

• m: row offset within input

• n: column offset within input

Z_{i,j,k} = \sum_{l,m,n} [ V_{l, j+m-1, k+n-1} W_{i,j,k,l,m,n} ]

Comparison of local connections, convolution, and full connection

This is useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space, e.g., looking for a mouth only in the bottom half of an image.

It can also be useful to make versions of convolutional or locally connected layers in which the connectivity is further restricted, e.g., constraining each output channel i to be a function of only a subset of the input channels.

Advantages: reduced memory consumption, increased statistical efficiency, and reduced computation for both forward and backward propagation.



Tiled Convolution

Tiled convolution learns a set of kernels that we rotate through as we move through space. Immediately neighboring locations will have different filters, and the memory requirement for storing the parameters increases only by a factor of the size of this set of kernels. Comparison of locally connected layers, tiled convolution, and standard convolution:



K is a 6-D tensor, with t different choices of kernel stack:

Z_{i,j,k} = \sum_{l,m,n} [ V_{l, j+m-1, k+n-1} K_{i,l,m,n, j%t+1, k%t+1} ]

Locally connected layers and tiled convolutional layers with max pooling: the detector units of these layers are driven by different filters. If the filters learn to detect different transformed versions of the same underlying feature, then the max-pooled units become invariant to the learned transformation.

Multichannel convolution operation, 2D convolution:-

• The example above showed a 2D image matrix applied to a 2D kernel filter. However, Convolutional Neural Networks (CNNs) generally operate on multi-channel images. When you have a "deep" network, you're stacking convolutional layers, and when this happens you deal with multi-channel inputs.
• The input image may be an RGB image. This means you have a 3D volume: Height x Width x Channels, say 100x100x3. A convolutional layer is intended to learn visual patterns relative to the kernel size. If we set our kernel size to 5x5, then we are looking for small patterns that fit into a 5x5 kernel; this could be a curve, a straight line, or something else. However, since our input image is 100x100x3, we are no longer dealing with 2D image matrices; we are dealing with 3D volumes. This means our convolutional kernel filter also has to be 3D.
• How does this work? Exactly the same way as before with a 2D matrix applied to a 2D kernel filter: element-wise multiplication, then reduction to a single number by summing. This is often called a dot product operation. Take a 3D slice of the 3D image and compute the dot product with the 3D kernel filter.
• For our 100x100x3 RGB image, we would have a 5x5x3 convolutional kernel filter. In a CNN this means we are learning 75 weights for a single pattern (here a "weight" is a single number in a cell of a convolutional kernel filter). However, we may want to learn more than just one pattern, so we increase the number of convolutional kernel filters. If we think there are 32 possible visual patterns in our RGB images, it is a simple matter of creating 32 convolutional kernel filters, each of size 5x5x3.
• 2D convolution animation: convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel. This is related to a form of mathematical convolution. The matrix operation being performed, convolution, is not traditional matrix multiplication, despite being similarly denoted by *.

• A 2D convolution is a widely used operation in computer vision and deep learning. It is a mathematical operation that applies a filter to an image, producing a filtered output (also called a feature map).
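
A NumPy sketch of one such multichannel step (hedged: the patch location and random values are purely illustrative):

    import numpy as np

    image = np.random.rand(100, 100, 3)  # Height x Width x Channels
    kernel = np.random.rand(5, 5, 3)     # one 3-D kernel: 75 weights for one pattern

    patch = image[10:15, 20:25, :]       # a 3-D slice, same shape as the kernel
    value = np.sum(patch * kernel)       # element-wise multiply, then sum (dot product)

    # Learning 32 patterns means 32 such kernels -> one output value per kernel
    kernels = np.random.rand(32, 5, 5, 3)
    values = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    print(value, values.shape)           # scalar, then (32,)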

LeNet: Architecture:-

The LeNet architecture, developed by Yann LeCun and his colleagues in the late 1980s and early 1990s, is one of the earliest convolutional neural networks and has substantially influenced the field of deep learning, particularly image recognition. Originally designed to recognize handwritten and machine-printed characters, LeNet was a groundbreaking model at the time of its inception.

Its architecture, known as LeNet-5, consists of convolutional layers followed by subsampling and fully
connected layers, culminating in a softmax output layer.

Significance of LeNet in Deep Learning

LeNet’s significance in deep learning cannot be overstated. It was one of the first demonstrations that
convolutional neural networks (CNNs) could be successfully applied to visual pattern recognition.
LeNet introduced several key concepts that are now standard in CNN architectures, including the use
of multiple convolutional and pooling layers, local receptive fields, shared weights, and the
backpropagation algorithm for training the network.



Chronology of LeNet Architecture

1. Late 1980s: Yann LeCun begins foundational work on convolutional neural networks at AT&T
Bell Labs, leading to the development of the initial LeNet models.

2. 1989: The first iteration, LeNet-1, is introduced, employing backpropagation for training
convolutional layers.

3. 1998: LeNet-5, the most notable version, is detailed in the seminal paper “Gradient-Based
Learning Applied to Document Recognition.” This iteration is optimized for digit recognition
and demonstrates practical applications.

4. 2000s: LeNet’s success inspires further research and adaptations in various fields beyond digit
recognition, such as medical imaging and object recognition.

5. 2010s and Beyond: LeNet’s principles influence the development of more advanced CNN
architectures like AlexNet and ResNet, solidifying its legacy in the field of deep learning.

LeNet’s Architecture

LeNet Architecture



The LeNet architecture consists of several layers that progressively extract and condense information from input images. Here is a description of each layer of the LeNet architecture (a code sketch follows the list):

1. Input Layer: Accepts 32×32 pixel images, often zero-padded if original images are smaller.

2. First Convolutional Layer (C1): Consists of six 5×5 filters, producing six feature maps of 28×28
each.

3. First Pooling Layer (S2): Applies 2×2 average pooling, reducing feature maps’ size to 14×14.

4. Second Convolutional Layer (C3): Uses sixteen 5×5 filters, but with sparse connections,
outputting sixteen 10×10 feature maps.

5. Second Pooling Layer (S4): Further reduces feature maps to 5×5 using 2×2 average pooling.

6. Fully Connected Layers:

• First Fully Connected Layer (C5): Fully connected with 120 nodes.

• Second Fully Connected Layer (F6): Comprises 84 nodes.

7. Output Layer: Softmax or Gaussian activation that outputs probabilities across 10 classes
(digits 0-9).
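
A hedged Keras sketch of that layer stack (the tanh activations follow common reimplementations; the sparse C3 connectivity of the original is replaced here by a dense convolution, and the Gaussian output by a softmax):

    import tensorflow as tf
    from tensorflow.keras import layers

    lenet5 = tf.keras.Sequential([
        layers.Input(shape=(32, 32, 1)),          # 32x32 input images
        layers.Conv2D(6, 5, activation="tanh"),   # C1: six 5x5 filters -> 28x28x6
        layers.AveragePooling2D(pool_size=2),     # S2: 2x2 average pool -> 14x14x6
        layers.Conv2D(16, 5, activation="tanh"),  # C3: sixteen 5x5 filters -> 10x10x16
        layers.AveragePooling2D(pool_size=2),     # S4: 2x2 average pool -> 5x5x16
        layers.Flatten(),
        layers.Dense(120, activation="tanh"),     # C5: 120 nodes
        layers.Dense(84, activation="tanh"),      # F6: 84 nodes
        layers.Dense(10, activation="softmax"),   # output: probabilities for digits 0-9
    ])
    lenet5.summary()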

Applications of LeNet

LeNet’s architecture, originally developed for digit recognition, has proven versatile and foundational,
influencing a variety of applications beyond its initial scope. Here are some notable applications and
adaptations:

1. Handwritten Character Recognition: Beyond recognizing digits, LeNet has been adapted to
recognize a broad range of handwritten characters, including alphabets from various
languages. This adaptation has been crucial for applications such as automated form
processing and handwriting-based authentication systems.

2. Object Recognition in Images: The principles of LeNet have been extended to more complex
object recognition tasks. Modified versions of LeNet are used in systems that need to recognize
objects in photos and videos, such as identifying products in a retail setting or vehicles in traffic
management systems.

3. Document Classification: LeNet can be adapted for document classification by recognizing and
learning from the textual and layout features of different document types. This application is
particularly useful in digital document management systems where automatic categorization
of documents based on their content and layout can significantly enhance searchability and
retrieval.

4. Medical Image Analysis: Adaptations of LeNet have been applied in the field of medical image
analysis, such as identifying abnormalities in radiographic images, segmenting biological
features in microscopic images, and diagnosing diseases from patterns in medical imagery.
These applications demonstrate the potential of convolutional neural networks in supporting
diagnostic processes and enhancing the accuracy of medical evaluations.

AlexNet:



AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is a landmark model that
won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. It introduced several
innovative ideas that shaped the future of CNNs.

AlexNet Architecture:
AlexNet consists of 8 layers, including 5 convolutional layers and 3 fully connected layers. It uses
traditional stacked convolutional layers with max-pooling in between. Its deep network structure
allows for the extraction of complex features from images.

• The architecture employs overlapping pooling layers to reduce spatial dimensions while
retaining the spatial relationships among neighbouring features.

• Activation function: AlexNet uses ReLU activations in its hidden layers with a softmax output, together with dropout regularization, which enhances the model's ability to capture non-linear relationships within the data.

The key features of AlexNet are as follows:-

• AlexNet was created to be more computationally efficient than earlier CNN topologies. It
introduced parallel computing by utilising two GPUs during training.

• AlexNet is a relatively shallow network compared to GoogleNet. It has eight layers, which
makes it simpler to train and less prone to overfitting on smaller datasets.

• In 2012, AlexNet produced ground-breaking results in the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC). It outperformed prior CNN architectures greatly and set the
path for the rebirth of deep learning in computer vision.

• Several architectural improvements were introduced by AlexNet, including the use of rectified linear units (ReLU) as activation functions, overlapping pooling, and dropout regularisation. These strategies aided in the improvement of performance and generalisation.

Consider an image classification task of various dog breeds. AlexNet's convolutional layers learn features such as edges, textures, and shapes to distinguish between different breeds, and the fully connected layers then analyze these learned features and make predictions.

Differences between AlexNet and GoogleNet:

Features               AlexNet               GoogleNet
Architecture           Deep (8 layers)       Deep (22 layers)
Activation Function    ReLU                  ReLU
Pooling                Overlapping           Non-overlapping
Convolution            Consecutive           Parallel (inception)
Dimensionality         No reduction          1×1 convolution
Regularization         Dropout               Auxiliary classifiers

Why AlexNet is important:


Breakthrough Performance: Achieved a significant improvement in image classification accuracy in
2012, showcasing the power of machine learning algorithms.

Deep Architecture: Utilized a deep network with eight layers, much deeper than previous models,
contributing to advancements in CNN architectures.

Use of GPUs: Leveraged GPUs to speed up training, significantly enhancing performance and efficiency
in processing large datasets.

Innovative Techniques:

• ReLU Activation: Employed Rectified Linear Units for faster training, an essential component in the optimization of gradient-based learning.

• Dropout: Prevented overfitting by randomly dropping neurons during training, improving model robustness.

• Data Augmentation: Enhanced model generalization through techniques like image translations and reflections, crucial for effective data preprocessing.

Large-Scale Data: Trained on the large ImageNet dataset, which contains millions of images,
demonstrating the importance of extensive and diverse datasets in machine learning.

Inspiration for Research: This work paved the way for more advanced neural network architectures
and deep learning research, influencing subsequent innovations in the field.

What is the difference between AlexNet and ResNet?



AlexNet and ResNet are both convolutional neural networks (CNNs) that played a major role in the
advancement of computer vision. Here’s the key differences of these pretrained models:

AlexNet: Introduced in 2012 and developed by Geoffrey Hinton's team, AlexNet has a relatively shallow architecture of stacked convolutional and pooling layers. Despite its groundbreaking nature at the time, this depth limitation affects its ability to learn very complex features. It utilizes techniques such as local response normalization and ReLU activations, with a softmax output for classification tasks.

ResNet: Introduced in 2015, ResNet builds upon ideas from AlexNet but uses a much deeper architecture with "skip connections." These connections let gradients flow directly to earlier layers, alleviating the vanishing gradient problem that hinders training in very deep networks. This enables ResNet to achieve significantly higher accuracy. ResNet also excels in tasks such as image segmentation and classification due to its robust architecture.

To summarize the AlexNet architecture described above:

• It has 8 layers with learnable parameters.

• The input to the Model is RGB images.

• It has 5 convolution layers with a combination of max-pooling layers.

• Then it has 3 fully connected layers.

• The activation function used in all hidden layers is ReLU.

• It used two Dropout layers.

• The activation function used in the output layer is Softmax.

• The total number of parameters in this architecture is 62.3 million.

Residual Networks (ResNet) Architecture:-


AlexNet was the first CNN-based architecture to win the ImageNet 2012 competition, and every subsequent winning architecture used more layers in its deep neural network to reduce the error rate. This works for a small number of layers, but when we increase the number of layers, a common deep learning problem called the vanishing/exploding gradient appears. It causes the gradient to become 0 or too large. Thus, as we increase the number of layers, the training and test error rates also increase.



Comparison of 20-layer vs 56-layer architecture

In the plot above, we can observe that a 56-layer CNN gives a higher error rate on both the training and testing datasets than a 20-layer CNN architecture. After analyzing the error rate further, the authors concluded that it is caused by vanishing/exploding gradients.

ResNet, which was proposed in 2015 by researchers at Microsoft Research, introduced a new architecture called the Residual Network.

Residual Network: To solve the problem of the vanishing/exploding gradient, this architecture introduced the concept of residual blocks. In this network, we use a technique called skip connections. A skip connection connects the activations of a layer to a later layer by skipping some layers in between, forming a residual block. ResNets are made by stacking these residual blocks together.
The approach behind this network is that instead of the layers learning the underlying mapping, we allow the network to fit the residual mapping. So, instead of the initial mapping H(x), let the network fit

F(x) := H(x) - x, which gives H(x) := F(x) + x.

Skip (Shortcut) connection
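
A minimal sketch of such a block (hedged Keras code; it assumes the input already has 'filters' channels so that the identity shortcut lines up, and uses two 3 x 3 convolutions as F(x)):

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, filters):
        # out = F(x) + x: two convolutions plus an identity skip connection
        shortcut = x
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = layers.Conv2D(filters, 3, padding="same")(y)  # F(x)
        y = layers.Add()([y, shortcut])                   # F(x) + x
        return layers.Activation("relu")(y)

    inputs = tf.keras.Input(shape=(32, 32, 64))
    outputs = residual_block(inputs, filters=64)          # channel counts must match
    model = tf.keras.Model(inputs, outputs)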

The advantage of adding this type of skip connection is that if any layer hurts the performance of the architecture, it can effectively be skipped. This makes it possible to train very deep neural networks without the problems caused by vanishing/exploding gradients. The authors of the paper experimented with 100-1000 layers on the CIFAR-10 dataset.
There is a similar approach called "highway networks"; these networks also use skip connections. Similar to LSTMs, these skip connections use parametric gates, which determine how much information passes through the skip connection. This architecture, however, has not provided accuracy better than the ResNet architecture.

ResNet: Residual Network

ResNet (short for Residual Network) is a type of neural network architecture introduced in 2015 by
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun from Microsoft Research. It was designed to
solve the problem of vanishing gradients in deep neural networks, which hindered their performance
on large-scale image recognition tasks.

The ResNet architecture is usually divided into four parts, each containing multiple residual blocks of different depths. The first part of the network comprises a single convolutional layer, followed by max pooling, to reduce the spatial dimensions of the input. The second part of the network contains 64 filters, while the third and fourth parts contain 128 and 256 filters, respectively. The final part of the network consists of global average pooling and a fully connected layer that produces the output.

Background

Deep neural networks have revolutionized the field of computer vision by achieving state-of-the-art
results on various tasks such as image classification, object detection, and semantic segmentation.
However, training deep neural networks can be challenging due to the problem of vanishing gradients.

Residual Learning

Residual learning is a concept that was introduced in the ResNet architecture to tackle the vanishing
gradient problem. In traditional deep neural networks, each layer applies a set of transformations to
the input to obtain the output. ResNet introduces residual connections that enable the network to learn residual mappings, which are the differences between the input and output of a layer.

The residual connections are formed by adding the input to the output of a layer, which allows gradients to flow directly through the network without being attenuated. This enables the network to learn the residual mapping using a shortcut connection that bypasses the layer's transformation.

ResNet Architecture

The ResNet architecture consists of several layers, each containing residual blocks. A residual block is
a set of layers that perform a set of transformations on the input to obtain the output and includes a
shortcut connection that adds the input to the output.

The ResNet architecture has several variants, including ResNet-18, ResNet-34, ResNet-50, ResNet-101,
and ResNet-152. The number in each variant corresponds to the number of layers in the Network. For
example, ResNet-50 has 50 layers, while ResNet-152 has 152 layers.

The ResNet-50 architecture is one of the most popular variants, and it consists of five stages, each
containing several residual blocks. The first stage consists of a convolutional layer followed by a max-
pooling layer, which reduces the spatial dimensions of the input.



Applications

ResNet has achieved state-of-the-art results on various computer vision tasks, including image classification, object detection, and semantic segmentation. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015, the ResNet-152 architecture achieved a top-5 error rate of 3.57%, significantly better than the previous year's winning error rate of around 6.7%.

Benefits of ResNet

ResNet has several benefits that make it a popular choice for deep learning applications:

o Deeper networks

ResNet enables the construction of deeper neural networks, with more than a hundred layers, which was previously impractical due to the vanishing gradient problem. The residual connections allow the network to learn better representations and improve gradient flow, making it easier to train deeper networks.

o Improved accuracy

ResNet has achieved state-of-the-art performance on several benchmark datasets, such as ImageNet,
CIFAR-10, and CIFAR-100, demonstrating its superior accuracy compared to other deep neural network
architectures.

o Faster convergence

ResNet enables faster convergence during training, thanks to the residual connections that allow for
better gradient flow and optimization. This results in faster training and better convergence to the
optimal solution.


o Transfer learning

ResNet is suitable for transfer learning, allowing the network to reuse previously learned features for new tasks. This is especially useful in scenarios where the amount of labeled data is limited, as a pre-trained ResNet can be fine-tuned on the new dataset to achieve good performance.

Drawbacks of ResNet

Despite its numerous benefits, ResNet has a few drawbacks that should be considered:

o Complexity

ResNet is a complex architecture that requires more memory and computational resources than
shallower networks. This can be a limitation in scenarios with limited resources, such as mobile devices
or embedded systems.

o Overfitting

ResNet can be prone to overfitting, especially when the network is too deep or the dataset is small. This can be mitigated by regularization techniques, such as dropout, or by using smaller networks with fewer layers.

o Interpretability



ResNet's interpretability can be challenging, as the network learns complex and abstract representations that are difficult to understand. This can be a limitation in scenarios where interpretability is crucial, such as medical diagnosis or fraud detection.

Conclusion

ResNet is a powerful deep neural network architecture that has revolutionized the field of computer
vision by enabling the construction of deeper and more accurate networks. Its residual connections
enable better gradient flow and optimization, making training deeper networks easier and achieving
better performance on benchmark datasets.

ResNet has limitations, such as complexity, susceptibility to overfitting, and limited interpretability.
When choosing ResNet or any other deep neural network architecture for a specific task, these
drawbacks should be considered.

Overall, ResNet has significantly impacted deep learning and computer vision, and its principles have
been extended to other domains, such as natural language processing and speech recognition. As
research in deep learning continues to evolve, new architectures and techniques will likely be developed
to address the current limitations of ResNet and other existing architectures.

