Machine Learning (CSO851) - Lecture 10

The document provides an overview of deep learning for image processing, focusing on convolutional neural networks (CNNs), their architecture, activation functions, and the concepts of overfitting and underfitting. It discusses the importance of machine vision in industrial applications, the challenges of image classification, and the foundational elements of CNNs, including their layers and properties. Additionally, it explains various activation functions and their roles in neural networks, along with strategies to detect and address overfitting and underfitting in model training.


CNN and Its Variants, Activation Functions, and Overfitting and Underfitting
Deep Learning for Image Processing (CS9046)
Topics
• Introduction to machine vision
• Introduction to image classification
• Classification with neural networks
• Learning, cost function
• Image classification with CNN
• Foundations of CNN
• Building blocks of CNN
• Properties of CNN
• Layers in CNN
• Activation functions
• CNN variants
Introduction to Machine Vision
Machine vision uses the latest AI
technologies to give industrial equipment
the ability to see, analyze, and act, which
can increase product quality, reduce costs,
and optimize operations.

Applications of machine vision:


• Visual inspection and defect detection
• Positioning and measuring parts
• Identification
• Sorting
• Classification
• Tracking products.
Introduction to Machine Vision
When a camera captures an image of an object, what it's really doing is capturing the
light that the object has reflected. The degree to which light is absorbed or reflected
depends on the object's surface, i.e., whether it is transparent, translucent or opaque.
Introduction to Image Classification
Image classification is the process of categorizing and labeling groups of pixels or
vectors within an image based on specific rules.
Why is image recognition difficult?
Handcrafted Features
"Handcrafted" features refer to properties derived from the information present in the image itself using various algorithms.

• Handcrafted features are usually not robust.
• They are computationally intensive due to high dimensions.
• Their discriminative power is usually low.
Classification with Neural Networks
Training and Learning in Neural Networks
• The problem of training is equivalent to the problem of minimizing the loss
function.
• Training is the process of making a neural network perform a task.
• Neural networks learn by initially processing several large sets of labeled or
unlabelled data.
• By using these examples, they can then process unknown inputs more accurately.
• For example, a deep learning network being trained for facial recognition initially
processes hundreds of thousands of images of human faces, with various terms
related to ethnic origin, country, or emotion describing each image.
• The neural network slowly builds knowledge from these datasets, which provide
the right answer in advance.
• After the network has been trained, it starts making guesses about the ethnic origin
or emotion of a new image of a human face that it has never processed before.
Cost/Loss Function in Neural Networks
• The loss function tells us how well our neural network performs a certain task.
• The intuitive way to compute it is: take each training example, pass it through the network to get a number, subtract that number from the actual value we wanted, and square the result (a small sketch follows below).
• To compute the loss function we would go over each training example in our dataset, compute y for that example, and then compute the function defined above.
• If the loss is large, our network does not perform very well; we want as small a number as possible.
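As a minimal sketch of this idea (assuming a mean-squared-error style loss and NumPy, neither of which the slide specifies):

```python
import numpy as np

def mse_loss(model, inputs, targets):
    """Squared-error loss averaged over every training example."""
    total = 0.0
    for x, y_true in zip(inputs, targets):
        y_pred = model(x)                 # pass the example through the network
        total += (y_pred - y_true) ** 2   # subtract the desired value and square it
    return total / len(inputs)

# Toy usage with a stand-in "network": a simple linear function
toy_model = lambda x: 2.0 * x
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.1, 3.9, 6.2])
print(mse_loss(toy_model, x_train, y_train))  # small value -> the model fits well
```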
Low-level features include edges and blobs, and high-level features include objects and events. The
low-level feature extraction is based on signal/image processing techniques, while the high-level
feature extraction is based on machine learning techniques.
Image Classification with CNN
Image Classification with CNN
• Reduce the number of input nodes.
• Tolerate small shifts in where the pixels are in the image.
• Take advantage of the correlations that we observe in complex images.
• Its built-in convolutional layer reduces the high dimensionality of
images without losing their information.
• Spatial representation is used for input data.

Handcrafted features are manually engineered whereas learned features


are automatically obtained from the deep learning algorithm.
Deep Learning vs. Machine Learning

Machine Learning (ML): Algorithms learn from structured data to predict outputs and discover
patterns in that data.

Deep Learning (DL): Algorithms based on highly complex neural networks that mimic the way
a human brain works to detect patterns in large unstructured data sets.
https://levity.ai/blog/difference-machine-learning-deep-learning
Foundations of Convolutional Neural Networks

• Objectives are:
• To understand the convolution operation
• To understand the pooling operation
• Understanding the vocabulary used in convolutional neural
networks (padding, stride, filter, etc.)
• Building convolutional neural network models for
• Image enhancement
• Image denoising
• Object detection
• Classification of images
Building Blocks of CNN

https://www.analyticsvidhya.com/blog/2022/01/convolutional-neural-network-an-overview/
Properties of CNN
• Convolutional Neural Networks (CNN) have characteristics that enable
invariance to the affine transformations of images that are fed through the
network. This provides the ability to recognize patterns that are shifted,
tilted or slightly warped within images. CNN has the following main
properties.
• Local Receptive Fields: Receptive fields are a defined portion of sensory space or
spatial construct containing units that provide input to a set of units within a
corresponding layer.
• Shared Weights
• Spatial/Temporal Sub-sampling

Each neuron within a CNN is responsible for a defined region of the input data, and this
enables neurons to learn patterns such as lines, edges and small details that make up
the image.
https://www.cybercontrols.org/
Overview
• Convolutional neural networks, also known as CNNs, are a specific type of neural
network that is generally composed of the following layers:

The convolution layer and the pooling layer can be fine-tuned with respect to
hyperparameters.
Types of layer
• Convolution layer (CONV) - The convolution layer (CONV) uses filters that perform
convolution operations as it is scanning the input I with respect to its dimensions. Its
hyperparameters include the filter size F and stride S. The resulting output O is called
feature map or activation map.
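A minimal NumPy sketch of the CONV operation described above, assuming a single-channel input, a square F×F filter, stride S and no padding (illustrative names, not from the slides):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide an F x F filter over a single-channel image and return the feature map."""
    F = kernel.shape[0]
    I = image.shape[0]
    O = (I - F) // stride + 1                    # output size with no zero-padding
    out = np.zeros((O, O))
    for i in range(O):
        for j in range(O):
            patch = image[i*stride:i*stride+F, j*stride:j*stride+F]
            out[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)  # simple vertical-edge detector
print(conv2d(image, edge_filter).shape)            # (4, 4): (6 - 3)/1 + 1 = 4
```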
Filtering Dynamics
Contd..
• Pooling (POOL) – The pooling layer (POOL) is a downsampling operation, typically
applied after a convolution layer, which provides some spatial invariance. In particular, max
and average pooling are special kinds of pooling where the maximum and average value
is taken, respectively (a sketch follows below).
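A matching NumPy sketch for max and average pooling with a 2x2 window and stride 2 (again an illustration under those assumptions, not code from the slides):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a single-channel feature map with max or average pooling."""
    I = feature_map.shape[0]
    O = (I - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((O, O))
    for i in range(O):
        for j in range(O):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = reduce(window)
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 9., 0.],
                 [3., 4., 1., 8.]])
print(pool2d(fmap, mode="max"))   # [[6. 4.] [7. 9.]]
print(pool2d(fmap, mode="avg"))   # [[3.75 2.25] [4.   4.5 ]]
```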
Contd..
• Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input
where each input is connected to all neurons. If present, FC layers are usually found
towards the end of CNN architectures and can be used to optimize objectives such as
class scores.
Filter hyperparameters
• The convolution layer contains filters, and it is important to know the meaning
behind their hyperparameters.
• Dimensions of a filter - A filter of size F×F applied to an input containing C channels is a
F×F×C volume that performs convolutions on an input of size I×I×C and produces an output
feature map (also called activation map) of size O×O×1.
Contd..
• Stride - For a convolutional or a pooling operation, the stride S denotes the number of
pixels by which the window moves after each operation.
• Zero-padding – Zero-padding denotes the process of adding P zeroes to each side of the
boundaries of the input. This value can either be manually specified or automatically set
through one of three standard modes (commonly called valid, same, and full padding).
Tuning hyperparameters
• Parameter compatibility in the convolution layer – If I is the length of the input volume, F
the length of the filter, P the amount of zero padding, and S the stride, then the output size
O of the feature map along that dimension is given by O = (I − F + 2P)/S + 1.
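For example, with the 32x32 input and 5x5 filters used later in the LeNet-5 slides, no zero-padding (P = 0) and stride S = 1, the formula gives O = (32 − 5 + 2·0)/1 + 1 = 28, which matches the 28x28 feature maps of LeNet-5's C1 layer.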
Contd..
• Understanding the complexity of the model – In order to assess the complexity of a
model, it is often useful to determine the number of parameters that its architecture
will have. In a given layer of a convolutional neural network, it is done as follows:
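The slide's table is not reproduced here; as a standard rule of thumb (stated as an assumption, not the slide's exact table), a convolution layer with K filters of size F*F over C input channels has (F*F*C + 1)*K trainable parameters, and a fully connected layer from N_in to N_out units has (N_in + 1)*N_out, where the "+1" accounts for each bias. This is consistent with the LeNet-5 counts later in the deck, e.g. C1: (5*5*1 + 1)*6 = 156 parameters.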
Contd..
Activation Functions
• An activation function is applied to the weighted sum of a neuron's inputs and biases, and the result is used to decide whether the neuron should be activated or not.
• It transforms the data presented to it and produces the output of the neural network for the parameters in the data.
• Activation functions are also referred to as transfer functions in some literature.
• They can be either linear or nonlinear depending on the function they represent, and they are used to control the outputs of neural networks across different domains.

The need for these activation functions includes converting the linear input
signals and models into non-linear output signals, which aids the learning of
high order polynomials for deeper networks.
Activation Functions
• The sigmoid function curve looks like an S-shape.
• We prefer to use the sigmoid function because its output lies between 0 and 1.
• It is used for models where we have to predict a probability as the output.
• The function is differentiable, which means we can find the slope of the sigmoid curve at any point.
• The function is monotonic, but the function's derivative is not.
• The logistic sigmoid function can cause a neural network to get stuck during training.
• The sigmoid function is computationally expensive, causes the vanishing gradient problem, and is not zero-centred. It is generally used for binary classification problems.
Contd..
• Softmax function is described as a combination of multiple sigmoids.
• The softmax function can be used for multiclass classification problems.
• This function returns the probability for a datapoint belonging to each individual class.
• The mathematical expression: softmax(z_i) = e^{z_i} / Σ_j e^{z_j}, computed over all classes j.

• While building a network for a multiclass problem, the output layer would have as
many neurons as the number of classes in the target.
• For instance if we have three classes, there would be three neurons in the output
layer. Suppose you got the output from the neurons as [1.2 , 0.9 , 0.75].
• Applying the softmax function over these values, we will get the following result –
[0.42 , 0.31, 0.27].
• These represent the probability for the data point belonging to each class.
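A quick NumPy check of the example above (the numerically stable form subtracts the maximum before exponentiating; this snippet is illustrative, not part of the original slides):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([1.2, 0.9, 0.75])
print(np.round(softmax(scores), 2))  # [0.42 0.31 0.27] -- probabilities sum to 1
```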
Contd..
• The tanh function is very similar to the sigmoid
function.
• The only difference is that it is symmetric around the
origin.
• The range of values in this case is from -1 to 1.
• The advantage is that the negative inputs will be
mapped strongly negative and the zero inputs will be
mapped near zero in the tanh graph.
• The function is differentiable.
• The function is monotonic while its derivative is not
monotonic.
• Both tanh and logistic sigmoid activation functions are
used in feed-forward nets.
Contd..
• Rectified Linear Unit: The rectified linear unit layer (ReLU) is an activation function g
that is used on all elements of the volume. It aims at introducing non-linearity to the
network. Its variants are summarized below:
Contd..
ReLU:
• The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
• Range: [0, infinity).
• The function and its derivative are both monotonic.
• But the issue is that all negative values become zero immediately, which decreases the ability of the model to fit or train from the data properly. That means any negative input given to the ReLU activation function turns into zero immediately, which in turn affects the resulting graph by not mapping the negative values appropriately.

Leaky ReLU, Randomized ReLU and ELU:
• In Leaky ReLU, the leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so.
• When a is not 0.01, it is called Randomized ReLU.
• Therefore the range of the Leaky ReLU is (-infinity, infinity).
• Both Leaky and Randomized ReLU functions are monotonic in nature, and their derivatives are also monotonic.
• Exponential Linear Unit (ELU) is another variant of the Rectified Linear Unit (ReLU) that modifies the slope of the negative part of the function.
• Unlike the Leaky ReLU and Parametric ReLU functions, instead of a straight line, ELU uses a log curve for defining the negative values (see the sketch below).
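A compact NumPy sketch of the ReLU family discussed above (the slope a = 0.01 and the ELU parameter alpha = 1.0 are conventional defaults, assumed here):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                      # zero for z < 0, identity otherwise

def leaky_relu(z, a=0.01):
    return np.where(z >= 0, z, a * z)              # small slope "a" keeps negatives alive

def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1.0))  # smooth curve for negatives

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # [0.  0.  0.  1.5]
print(leaky_relu(z))  # [-0.02  -0.005  0.     1.5  ]
print(elu(z))         # [-0.8647 -0.3935  0.      1.5   ] (approximately)
```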
Contd..
• Swish: It is defined as f(x) = x*sigmoid(x).

Swish is slightly better in performance compared to ReLU since its graph is quite similar to ReLU's. However, because it does not change abruptly at a point as ReLU does at x = 0, it is easier to converge while training.

But the drawback of Swish is that it is computationally expensive. To solve that, we come to the next version of Swish.
Contd..
• Hard-Swish or H-Swish: The best part is that it is almost similar to swish but it is less expensive
computationally since it replaces sigmoid (exponential function) with a ReLU (linear type).
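A sketch of Swish and Hard-Swish; the ReLU6-based form of Hard-Swish shown here is the one popularized by MobileNetV3 and is assumed, since the slide does not give the exact formula:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)                          # f(x) = x * sigmoid(x)

def hard_swish(z):
    # Replaces the sigmoid with the piecewise-linear ReLU6((x + 3)) / 6 -- no exponential needed.
    return z * np.clip(z + 3.0, 0.0, 6.0) / 6.0

z = np.linspace(-4, 4, 9)
print(np.round(swish(z), 3))
print(np.round(hard_swish(z), 3))                  # closely tracks swish, but is much cheaper
```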
How to use activation functions in deep learning?

• Tanh and sigmoid cause huge vanishing gradient problems. Hence, they should not be used.
• Start with ReLU in your network. The activation layer is added after the weight layer (something like a CNN, RNN, LSTM or linear dense layer). If you think the model has stopped learning, you can replace it with a Leaky ReLU to avoid the dying ReLU problem. However, the Leaky ReLU will increase the computation time a little bit.
• If you also have Batch-Norm layers in your network, they are added before the activation function, making the order Conv-BatchNorm-Activation (see the sketch after this list).
• Activation functions work best with the default hyperparameters used in popular frameworks such as TensorFlow and PyTorch. However, one can fiddle with the negative slope in Leaky ReLU and set it to 0.02 to expedite learning.
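A minimal PyTorch sketch of the Conv-BatchNorm-Activation ordering recommended above (the layer sizes are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

# One "weight layer -> batch norm -> activation" unit, in the order recommended above.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # weight layer
    nn.BatchNorm2d(16),                     # batch norm comes before the activation
    nn.LeakyReLU(negative_slope=0.02),      # negative slope tweaked as suggested above
)

x = torch.randn(1, 3, 32, 32)               # dummy batch: one 32x32 RGB image
print(block(x).shape)                       # torch.Size([1, 16, 32, 32])
```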
Underfitting vs. Overfitting
• Underfitting is a situation when your model
is too simple for your data. More formally,
your hypothesis about data distribution is
wrong and too simple.
• When a model is unable to learn the
patterns in the training data well and is
unable to generalize well on the new data.
• An underfit model has poor performance
on the training data and will result in
unreliable predictions.
• For example, your data is quadratic and
your model is linear. This situation is also
called high bias. This means that your
algorithm makes consistent predictions, but
the initial assumption about the data is
incorrect.
Underfitting vs. Overfitting
• Overfitting is a situation when your
model is too complex for your data.
More formally, your hypothesis about
data distribution is wrong and too
complex.
• For example, your data is linear and your
model is high-degree polynomial. This
situation is also called high variance. This
means that your algorithm can't make accurate predictions: changing the input
data only a little changes the model output very much.
Underfitting vs. Overfitting: Possible Options

• Low bias, low variance — is a good result, just right.


• Low bias, high variance — overfitting — the algorithm outputs very
different predictions for similar data.
• High bias, low variance — underfitting — the algorithm outputs
similar predictions for similar data, but predictions are wrong
(algorithm “miss”).
• High bias, high variance — very bad algorithm. You will most likely
never see this.

https://towardsdatascience.com/overfitting-and-underfitting-principles-ea8964d9c45c
Detect Underfitting and Overfitting
• Underfitting means that your model makes consistent but inaccurate predictions. In this case, the train error is large and the val/test error is large too.
• Overfitting means that your model makes inaccurate predictions on new data. In this case, the train error is very small and the val/test error is large.
• When you find a good model, the train error is small (but larger than in the case of overfitting), and the val/test error is small too.
LeNet-5 Architecture

• Every convolutional layer includes three parts: convolution, pooling, and nonlinear activation functions
• Using convolution to extract spatial features
• Subsampling uses average pooling
• tanh activation function
• Using MLP as the last classifier
• Sparse connection between layers to reduce the complexity of computation
LeNet-5 Architecture
LeNet-5 Architecture: Features
LeNet-5 Architecture: Layers
C1 layer (convolutional layer):
• Input picture: 32*32
• Convolution kernel size: 5*5
• Convolution kernel types: 6
• Output feature map size: 28*28 (32-5 + 1 = 28)
• Number of neurons: 28*28*6
• Trainable parameters: (5*5 + 1)*6 (5*5 = 25 unit parameters and one bias parameter per filter, a total of 6 filters)
• Number of connections: (5*5 + 1)*6*28*28 = 122304

S2 layer (pooling/downsampling layer):
• Input: 28*28
• Sampling area: 2*2
• Sampling method: 4 inputs are added, multiplied by a trainable parameter, plus a trainable offset; result passed through a sigmoid
• Sampling type: 6
• Output feature map size: 14*14 (28/2)
• Number of neurons: 14*14*6
• Trainable parameters: 2*6 (the weight of the sum + the offset)
• Number of connections: (2*2 + 1)*6*14*14
• The size of each feature map in S2 is 1/4 of the size of the feature map in C1.
LeNet-5 Architecture: Layers
C3 layer-convolutional layer:
• Input: all 6 or several feature map combinations in S2
• Convolution kernel size: 5*5
• Convolution kernel type: 16
• Output feature Map size: 10*10 (14-5 + 1) = 10
• Each feature map in C3 is connected to all 6 or several feature maps in S2, indicating
that the feature map of this layer is a different combination of the feature maps
extracted from the previous layer.
• One way is that the first 6 feature maps of C3 take 3 adjacent feature map subsets in
S2 as input. The next 6 feature maps take 4 subsets of neighbouring feature maps in S2
as input. The next three take the non-adjacent 4 feature map subsets as input. The last
one takes all the feature maps in S2 as input.
• The trainable parameters are: 6*(3*5*5 + 1) + 6*(4*5*5 + 1) + 3*(4*5*5 + 1) + 1*(6*5*5 + 1) = 1516
• Number of connections: 10*10*1516 = 151600
LeNet-5 Architecture: Layers
S4 layer (pooling/downsampling layer):
• Input: 10*10
• Sampling area: 2*2
• Sampling method: 4 inputs are added, multiplied by a trainable parameter, plus a trainable offset; result passed through a sigmoid
• Sampling type: 16
• Output feature map size: 5*5 (10/2)
• Number of neurons: 5*5*16 = 400
• Trainable parameters: 2*16 = 32 (the weight of the sum + the offset)
• Number of connections: 16*(2*2 + 1)*5*5 = 2000
• The size of each feature map in S4 is 1/4 of the size of the feature map in C3

C5 layer (convolutional layer):
• Input: all 16 unit feature maps of the S4 layer (fully connected to S4)
• Convolution kernel size: 5*5
• Convolution kernel type: 120
• Output feature map size: 1*1 (5-5 + 1)
• Trainable parameters / connections: 120*(16*5*5 + 1) = 48120
LeNet-5 Architecture: Network Structure of C5 Layer
LeNet-5 Architecture: Layers
F6 layer-fully connected layer
• Input: the 120-dimensional
vector from C5
• Calculation method:
calculate the dot product
between the input vector
and the weight vector, plus
an offset, and the result is
output through the sigmoid
function.
• Trainable parameters:
84*(120 + 1) = 10164
LeNet-5 Architecture: Layers
• Layer 6 is a fully connected layer.
• The F6 layer has 84 nodes,
corresponding to a 7x12 bitmap,
-1 means white, 1 means black,
so the black and white of the
bitmap of each symbol
corresponds to a code.
• The training parameters and
number of connections for this
layer are (120+1)x84 = 10164.
• The ASCII encoding diagram is as
follows:
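Pulling the layer descriptions above together, here is a hedged PyTorch sketch of LeNet-5. It simplifies a few details: S2/S4 use plain average pooling instead of the trainable sub-sampling described above, C3 is fully connected to all six S2 maps rather than using the partial connection scheme, and a plain linear output layer replaces the original bitmap-based output.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Simplified LeNet-5: C1-S2-C3-S4-C5-F6-output, tanh activations, average pooling."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),     # C1: 32x32 -> 6 maps of 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S2: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),    # C3: 14x14 -> 16 maps of 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S4: 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5),  # C5: 5x5 -> 120 maps of 1x1
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                 # F6: 84 units
            nn.Tanh(),
            nn.Linear(84, num_classes),         # output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
x = torch.randn(1, 1, 32, 32)                   # one 32x32 grayscale image
print(model(x).shape)                           # torch.Size([1, 10])
```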
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Alexnet Architecture
Salient Features of Alexnet:
• 1.3 million high-resolution images in the ImageNet LSVRC-2010 contest.
• 1000 different classes
• On the test data, the network achieved top-1 and top-5 error rates of 37.5% and
17.0%
• 60 million parameters
• 650,000 neurons
• Consists of five convolutional layers, some of which are followed by max-pooling
layers, and three fully-connected layers with a final 1000-way softmax.
• To make training faster, the authors used non-saturating neurons and a very efficient
GPU implementation of the convolution operation. To reduce overfitting in the
fully-connected layers they employed a regularization method called "dropout" that
proved to be very effective.
• Top-1 and Top-5 error rates of 39.7% and 18.9%.
Alexnet Architecture
Alexnet Summary
VGG-16 Architecture: Salient Features
• VGG16 is a convolutional neural network (CNN) architecture that achieved top results in the ILSVRC (ImageNet) competition in 2014.
• It is considered to be one of the best vision model architectures to date.
• VGG16 has convolution layers with 3x3 filters and a stride of 1.
• It uses "same" padding and max-pooling layers with a 2x2 filter and a stride of 2.
• It follows this arrangement of convolution and max-pool layers consistently throughout the whole architecture.
• At the end it has 2 FC (fully connected) layers followed by a softmax for output.
• The 16 in VGG16 refers to its 16 layers that have weights.
• This network has about 138 million (approx.) parameters.
VGG-16 Architecture: Salient Features
While earlier improvements over AlexNet focused on smaller window sizes and strides in the first convolutional layer, VGG addresses another very
important aspect of CNNs: depth.

• Input. VGG considers a 224x224 pixel RGB image. For the ImageNet competition, the authors cropped out
the center 224x224 patch in each image to keep the input image size consistent.

• Convolutional Layers. The convolutional layers in VGG use a very small receptive field (3x3, the smallest
possible size that still captures left/right and up/down). There are also 1x1 convolution filters which act as
a linear transformation of the input, which is followed by a ReLU unit. The convolution stride is fixed to 1
pixel so that the spatial resolution is preserved after convolution.

• Fully-Connected Layers. VGG has three fully-connected layers: the first two have 4096 channels each and
the third has 1000 channels, one for each class.

• Hidden Layers. All of VGG’s hidden layers use ReLU (a huge innovation from AlexNet that cut training
time). VGG does not generally use Local Response Normalization (LRN), as LRN increases memory
consumption and training time with no particular increase in accuracy.
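To experiment with the architecture without building it by hand, torchvision provides VGG-16; the snippet below (assuming torchvision 0.13 or newer, where the weights argument replaces the older pretrained flag) also checks the roughly 138 million parameter count mentioned earlier:

```python
import torch
from torchvision import models

vgg16 = models.vgg16(weights=None)   # architecture only; pass weights="IMAGENET1K_V1" for pre-trained
n_params = sum(p.numel() for p in vgg16.parameters())
print(f"{n_params:,}")               # roughly 138 million parameters, as stated above

x = torch.randn(1, 3, 224, 224)      # a 224x224 RGB image, the VGG input size
print(vgg16(x).shape)                # torch.Size([1, 1000]) -- one score per ImageNet class
```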
Alexnet Vs. VGG-16 Architecture
• AlexNet uses a large receptive field in its first layer (11x11 with a stride of 4), whereas VGG uses very
small receptive fields (3x3 with a stride of 1).
• VGG 16 uses three ReLU units instead of just one in Alexnet. The decision function
is more discriminative.
• There are also fewer parameters in VGG 16 (27 times the number of channels
instead of AlexNet’s 49 times the number of channels).
• VGG incorporates 1x1 convolutional layers to make the decision function more
non-linear without changing the receptive fields.
• The small-size convolution filters allow VGG to have a large number of weight
layers; of course, more layers lead to improved performance. This isn't an
uncommon feature, though: GoogLeNet, another model that uses deep CNNs and
small convolution filters, also showed up in the 2014 ImageNet competition.
VGG-16 Architecture

https://towardsdatascience.com/extract-features-visualize-filters-and-feature-maps-in-vgg16-and-vgg19-cnn-models-d2da6333edd0
VGG-16 Architecture

https://pub.towardsai.net/the-architecture-and-implementation-of-vgg-16-b050e5a5920b
GoogLeNet/Inception Architecture
The performance of deep neural networks can be increased by
• increasing the depth of the network (the number of levels), and
• increasing its width (the number of units at each level).

Major drawbacks
• A larger number of parameters makes the network prone to overfitting (especially if the number of labeled examples in the training set is limited).
• Uniformly increasing the network size dramatically increases the use of computational resources.
Solution:
Moving from fully connected to sparsely connected architectures, even inside the
convolutions.
The main idea of the Inception architecture is based on finding out how an optimal
local sparse structure in a convolutional vision network can be approximated and
covered by readily available dense components.
GoogLeNet/Inception Architecture
Inception Modules are used in Convolutional Neural Networks to allow
• more efficient computation and
• deeper Networks through a dimensionality reduction with stacked 1×1 convolutions.

The modules were designed to solve the problem of computational expense, as well as
overfitting, among other issues.
GoogLeNet/Inception Architecture
• Inception network is a network consisting of modules stacked upon each other
• occasional max-pooling layers with stride 2 to halve the resolution of the grid
• Inception network starts using Inception modules only at higher layers while
keeping the lower layers in traditional convolutional fashion
• The improved use of computational resources allows for increasing both the width
of each stage as well as the number of stages without getting into computational
difficulties.
• Another way to utilize the inception architecture is to create slightly inferior, but
computationally cheaper versions of it.

GoogLeNet
To make the process even less computationally expensive, the neural network can be designed to add an extra 1x1
convolution before the 3x3 and 5x5 layers. By doing so, the number of input channels is limited, and 1x1 convolutions
are far cheaper than 5x5 convolutions. It is important to note, however, that the 1x1 convolution is added after the
max-pooling layer, rather than before (see the sketch below).
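A hedged PyTorch sketch of an Inception module with the 1x1 bottleneck convolutions described above; the channel counts follow the commonly cited inception (3a) configuration but should be treated as illustrative, and ReLU activations are omitted for brevity:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)            # plain 1x1
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1),                  # 1x1 reduction
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),          # then 3x3
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1),                  # 1x1 reduction
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),          # then 5x5
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),               # 1x1 after max-pool
        )

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

block = InceptionModule(192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
x = torch.randn(1, 192, 28, 28)
print(block(x).shape)   # torch.Size([1, 256, 28, 28]) -- 64 + 128 + 32 + 32 channels
```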
GoogLeNet/Inception Architecture
• One interesting insight is that the strong performance of relatively shallower networks suggests that
the features produced by the layers in the middle of the network should be very discriminative.
• By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage
discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated
back, and provide additional regularization.
• These classifiers take the form of smaller convolutional networks put on top of the output of the
Inception (4a) and (4d) modules.
• During training, their loss gets added to the total loss of the network with a discount weight (the
losses of the auxiliary classifiers were weighted by 0.3).
• At inference time, these auxiliary networks are discarded.
• An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a) stage
and 4×4×528 for the (4d) stage.
• A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
• A fully connected layer with 1024 units and rectified linear activation.
• A dropout layer with 70% ratio of dropped outputs.
• A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier,
but removed at inference time).
Residual Network: Motivation
• After the first CNN-based architecture (AlexNet) won the ImageNet 2012
competition, every subsequent winning architecture used more layers in a deep
neural network to reduce the error rate.

• This works for a smaller number of layers, but when we increase the number of layers,
we run into a common problem in deep learning called the
vanishing/exploding gradient.

• This causes the gradient to become 0 or too large. Thus, when we increase the
number of layers, the training and test error rates also increase.
Motivation
ResNet Architecture
Residual Network (ResNet) architecture is a type of artificial neural network that allows
the model to skip layers without affecting performance.
• ResNet was introduced as a deep neural network (DNN) model for computer vision tasks.
• It won the ImageNet competition in 2015.
• DNNs perform better than Multilayer Perceptrons (MLPs).
• However, training a hugely stacked DNN has been infamous for its vanishing gradient problem
that causes performance deterioration in models.
• ResNet models, having up to 150+ layers, have solved this issue using identity shortcut
connections – they’re connections that skip one or more layers. This provides a detour for
gradients to pass through without diminishing.
• ResNet-50, pre-trained on ImageNet allows us to use its knowledge on a smaller database as it
has already learned patterns from many images. This concept is known as transfer learning.
• The architecture of the proposed model for the task involves a ResNet-50 model followed by 4
additional task-specific layers. The weights of this model, pre-trained on ImageNet, had been
loaded. The input size of the ResNet-50 is 100×100×3 and it uses average pooling.
ResNet Architecture
ResNet Architecture

https://open-instruction.com/dl-algorithms/overview-of-residual-neural-network-resnet/
Benefits of Residual Connections
• Instead of layers learning the underlying
mapping, it allows the network to fit the
residual mapping.
• So, instead of say H(x), initial mapping, let the
network fit,
F(x) := H(x) - x which gives H(x) := F(x) + x.
• Residual connections support efficient training
in very deep convolutional models (Image
detection).
• The use of residual connections seems to
improve the training speed greatly.
• The advantage of adding this type of skip
connection is that if any layer hurts the
performance of the architecture, it can be
skipped by regularization.
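A minimal PyTorch sketch of a residual block implementing H(x) = F(x) + x; it assumes the input and output shapes match so the identity shortcut can be added directly (a basic two-convolution block, not the exact ResNet-50 bottleneck):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x, where F is two 3x3 conv layers; the shortcut carries x unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + x)   # F(x) + x: gradients flow through the shortcut

block = ResidualBlock(64)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)   # torch.Size([1, 64, 56, 56]) -- same shape, so blocks stack easily
```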
Type of Residual Connections
Block types of the Inception-ResNet family (figure): STEM, Inception-ResNet-A, Inception-ResNet-B, Inception-ResNet-C, Reduction-A, Reduction-B
