DL-Unit-3 Final
Building blocks of CNN, Local receptive fields, Shared weights and bias, stride, Pooling
layers, Max-pooling, Average pooling, CNN for image classification - Alex Net, VGG,
GoogleNet, ResNet architectures. CNN for segmentation – Unet.
The convolutional layer is the first layer while the FC layer is the last. From the
convolutional layer to the FC layer, the complexity of the CNN increases. It is this increasing
complexity that allows the CNN to successively identify larger portions and more complex
features of an image until it finally identifies the object in its entirety.
Convolutional layer. The majority of computations happen in the convolutional
layer, which is the core building block of a CNN. A second convolutional layer can follow
the initial convolutional layer. The process of convolution involves a kernel or filter inside
this layer moving across the receptive fields of the image, checking if a feature is present in
the image.
Over multiple iterations, the kernel sweeps over the entire image. At each position, a dot
product is calculated between the input pixels and the filter. The final output from this series
of dot products is known as a feature map or convolved feature. Ultimately, the image is converted
into numerical values in this layer, which allows the CNN to interpret the image and extract
relevant patterns from it.
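To make the sliding dot-product concrete, here is a minimal NumPy sketch (not the implementation used by any particular framework) of a kernel sweeping over an image and collecting dot products into a feature map; the image and kernel values are only illustrative.

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image and collect the dot products into a feature map.
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # dot product of patch and kernel
    return feature_map

# Example: a 3x3 vertical-edge kernel applied to a 5x5 image
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(convolve2d(image, kernel))  # 3x3 feature map (convolved feature)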
Pooling layer. Like the convolutional layer, the pooling layer also sweeps a kernel or filter
across the input image. But unlike the convolutional layer, the pooling layer reduces the
number of parameters in the input and also results in some information loss. On the positive
side, this layer reduces complexity and improves the efficiency of the CNN.
Fully connected layer. The FC layer is where image classification happens in the CNN
based on the features extracted in the previous layers. Here, fully connected means that all the
inputs or nodes from one layer are connected to every activation unit or node of the next
layer.
How convolutional neural networks work
A CNN can have multiple layers, each of which learns to detect the different features of an
input image. A filter or kernel is applied to each image to produce an output that gets
progressively better and more detailed after each layer. In the lower layers, the filters start
by detecting simple features such as edges.
At each successive layer, the filters increase in complexity to check and identify features that
uniquely represent the input object. Thus, the output of each convolved image -- the partially
recognized image after each layer -- becomes the input for the next layer. In the last layer,
which is an FC layer, the CNN recognizes the image or the object it represents.
With convolution, the input image goes through a set of these filters. As each filter activates
certain features from the image, it does its work and passes on its output to the filter in the
next layer. Each layer learns to identify different features and the operations end up being
repeated for dozens, hundreds or even thousands of layers. Finally, all the image data
progressing through the CNN's multiple layers allows the CNN to identify the entire object.
A typical illustration of this process shows a 3×3 filter/kernel being applied to the input image to
produce the convolved feature. This convolved feature is passed on to the next layer.
Convolutional neural networks are composed of multiple layers of artificial neurons. When
you input an image into a ConvNet, each layer generates several activation maps that are
passed on to the next layer.
The first layer usually extracts basic features such as horizontal or diagonal edges. This
output is passed on to the next layer which detects more complex features such as corners or
combinational edges. As we move deeper into the network it can identify even more complex
features such as objects, faces, etc.
Based on the activation map of the final convolution layer, the classification layer outputs a
set of confidence scores (values between 0 and 1) that specify how likely the image is to
belong to a “class.” For instance, if you have a ConvNet that detects cats, dogs, and horses,
the output of the final layer is the probability that the input image contains each of those
animals.
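As a small illustration of these confidence scores, the sketch below applies a softmax to hypothetical raw scores (logits) for the cat/dog/horse example; the class names and numbers are assumptions for demonstration only.

import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 0.5, -1.0])         # hypothetical scores for cat, dog, horse
for name, p in zip(["cat", "dog", "horse"], softmax(logits)):
    print(f"{name}: {p:.2f}")               # values between 0 and 1 that sum to 1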
Pooling layer is responsible for reducing the spatial size of the Convolved Feature. This is to
decrease the computational power required to process the data by reducing the dimensions.
There are two types of pooling: average pooling and max pooling.
Max pooling returns the maximum pixel value from the portion of the image covered by the
kernel. Max pooling also acts as a noise suppressant: it discards the noisy activations
altogether, performing de-noising along with dimensionality reduction.
Average pooling returns the average of all the values from the portion of the image covered
by the kernel. Average pooling simply performs dimensionality reduction and does not
suppress noise. Hence, we can say that max pooling generally performs better than
average pooling.
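The following sketch shows 2x2 max pooling and average pooling with a stride of 2 on a small feature map; the values are illustrative, and real frameworks provide their own pooling layers.

import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [7., 2., 9., 0.],
               [3., 4., 1., 8.]])
print(pool2d(fm, mode="max"))      # [[6. 4.] [7. 9.]]
print(pool2d(fm, mode="average"))  # [[3.75 2.25] [4.   4.5 ]]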
Applications of convolutional neural networks
Computer vision (CV) and CNNs are most commonly applied in fields such as the following:
Healthcare. CNNs can examine thousands of visual reports to detect any
anomalous conditions in patients, such as the presence of malignant cancer cells.
Automotive. CNN technology is powering research into autonomous vehicles and
self-driving cars.
Social media. Social media platforms use CNNs to identify people in a user's
photograph and help the user tag their friends.
Retail. E-commerce platforms that incorporate visual search allow brands to
recommend items that are likely to appeal to a shopper.
Advantages and Disadvantages of CNN
Advantages
CNN automatically detects the important features without any human supervision.
CNN is also computationally efficient.
Higher accuracy.
Weight sharing is another major advantage of CNNs.
Convolutional neural networks also minimize computation in comparison with a
regular neural network.
CNNs make use of the same knowledge across all image locations.
Disadvantages
CNNs are vulnerable to adversarial attacks, where carefully crafted 'bad' examples are fed to
the network to cause misclassification.
CNNs require a lot of training data.
CNNs tend to be slower to train because of operations like max pooling.
Receptive field for CNN:
In the context of Convolutional Neural Networks (CNNs), a local receptive field refers to the
portion of the input image that a neuron (or a filter) “looks at” or is connected to.
For example, if we are using a 5x5 filter on an image, each neuron in the first hidden layer
will be connected to a 5x5 region of pixels in the input image, and that region is the neuron’s
local receptive field.
The concept of local receptive fields in CNNs is inspired by the field of neurobiology, where
it was observed that many neurons in the visual cortex are only responsive to stimuli located
in a limited region of the visual field, which is known as the neuron’s receptive field.
We can picture the layers of the network as stacked rows of neurons. The first
row, i.e., the first layer, does not see the whole screen (or, in the case of a neural network, the
whole input image). On the other hand, the last row, i.e., the last layer, effectively sees the
whole image.
The first layer has a 3x3 kernel moving over the input image. Considering just this one layer,
the kernel can see only a 3x3 patch of pixels at a time, so its local receptive field is 3x3.
As we move to the next layer, each value in the output feature map already summarizes a
3x3 patch of the image. Another 3x3 kernel then moves over this feature map and therefore
indirectly sees a larger patch of the original image.
The local receptive field of both layers is 3x3, since each kernel can only see 3x3 values at a
time. So we can conclude that the local receptive field directly depends on the kernel size.
At each layer the local receptive field does not change, since we keep using the same kernel
size, but the global (effective) receptive field keeps increasing.
At the beginning, the kernel tries to extract the small features, like edges and gradients and
only has an idea about those. However, as soon as it moves to the next layer, it starts getting a
larger perspective of the image and notices patterns and textures. Furthermore, in the final
layer, our network is able to build the complete object.
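A small sketch of this growth, computing the effective (global) receptive field of a stack of 3x3, stride-1 convolution layers; the formula is standard, but the layer counts are just examples.

def effective_receptive_field(num_layers, kernel_size=3, stride=1):
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride                  # with stride 1, the jump stays constant
    return rf

for n in range(1, 5):
    print(n, "layer(s): receptive field =", effective_receptive_field(n))
# 1 layer -> 3, 2 layers -> 5, 3 layers -> 7, 4 layers -> 9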
Weight Sharing and Bias:
Weight sharing is a key feature of CNNs that differentiates them from other types of neural
networks. In CNNs, each filter has a set of weights (parameters) associated with it, and these
weights are shared across all the neurons that use that filter. This is also known as parameter
sharing.
The idea behind weight sharing is that if one feature (like an edge or a texture) is useful to
compute at some location in the image, then it should be useful to compute at other locations
as well. This drastically reduces the number of parameters in the model and thus reduces
computational cost and controls overfitting.
In contrast, in a regular feedforward neural network, each neuron in the hidden layer would
have its own set of weights, and these weights would not be shared with other neurons. This
increases the number of parameters in the model, making the network more prone to
overfitting and less able to handle larger images.
By combining local receptive fields and weight sharing, CNNs are able to detect local
features in the image (like edges and textures), regardless of where in the image these
features are located. This gives CNNs their characteristic ability to handle translation
invariance, i.e., the ability to recognize objects regardless of where they are located in the
image.
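To make the parameter savings concrete, the sketch below (assuming PyTorch is available, with an assumed 32x32x3 input and 16 output maps) compares a shared-weight convolutional layer with a fully connected layer producing an output of the same size.

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc = nn.Linear(in_features=32 * 32 * 3, out_features=16 * 32 * 32)

count = lambda m: sum(p.numel() for p in m.parameters())
print("Conv2d parameters:", count(conv))  # 3*3*3*16 + 16 = 448 (weights shared across locations)
print("Linear parameters:", count(fc))    # 3072*16384 + 16384, roughly 50 million (no sharing)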
LeNet 5 Architecture
In the 1990s, Yann LeCun, Leon Bottou, Yosuha Bengio, and Patrick Haffner
proposed the LeNet-5 neural network design for character recognition in both
handwriting and machine printing. Since the design is clear-cut and easy to comprehend, it
is frequently used as the first step in teaching convolutional neural networks.
LeNet is a common term for LeNet-5, a simple convolutional neural network. The
LeNet-5 signifies CNN’s emergence and outlines its core components.
However, it was not popular at the time due to a lack of hardware, especially GPUs (Graphics
Processing Units, specialized electronic circuits designed to rapidly manipulate memory to
accelerate the creation of images in a frame buffer intended for output to a display device),
and because alternative algorithms, like SVMs, could achieve results similar to or even better
than those of LeNet.
Features of LeNet-5
Every convolutional layer includes three parts: convolution, pooling, and nonlinear
activation functions.
Using convolution to extract spatial features (Convolution was called receptive fields
originally)
The average pooling layer is used for subsampling.
‘tanh’ is used as the activation function
Using Multi-Layered Perceptron or Fully Connected Layers as the last classifier
The sparse connection between layers reduces the complexity of computation.
Architecture
The LeNet-5 CNN architecture has seven layers. Three convolutional layers, two
subsampling layers, and two fully connected layers make up the layer composition.
First Layer
A 32x32 grayscale image serves as the input for LeNet-5 and is processed by the first
convolutional layer, comprising six 5x5 filters (feature maps) with a stride of one. The image's
dimensions change from 32x32x1 to 28x28x6.
Second Layer
Then, using a filter size of 2x2 and a stride of 2, the LeNet-5 adds an average pooling layer or
sub-sampling layer. 14x14x6 will be the final image’s reduced size.
Third Layer
A second convolutional layer with 16 feature maps of size 5x5 and a stride of 1 follows. Only
10 of the 16 feature maps in this layer are connected to the six feature maps of the previous
layer, following the sparse connection scheme used in the original design.
The primary goal is to break the network's symmetry while keeping the number of
connections manageable. Because of this, there are 1,516 trainable parameters instead of
2,400 in this layer, and similarly, 151,600 connections instead of 240,000.
Fourth Layer
With a filter size of 2x2 and a stride of 2, the fourth layer (S4) is once more an average
pooling layer. The output will be decreased to 5x5x16 because this layer is identical to the
second layer (S2) but has 16 feature maps.
Fifth Layer
With 120 feature maps, each measuring 1 x 1, the fifth layer (C5) is a fully connected
convolutional layer. Each of the 120 units in C5 is connected to all 400 nodes (5x5x16) of
the fourth layer, S4.
Sixth Layer
A fully connected layer (F6) with 84 units makes up the sixth layer.
Output Layer
The SoftMax output layer, which has 10 potential values and corresponds to the digits 0 to 9,
is the last layer.
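A minimal PyTorch sketch of the LeNet-5 layer sequence described above (tanh activations, average pooling, 32x32 grayscale input). The sparse connection table of the original C3 layer is simplified here to a full connection, so this is an approximation rather than the exact original network.

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # C1: 32x32x1 -> 28x28x6
            nn.AvgPool2d(kernel_size=2, stride=2),         # S2: -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # C3: -> 10x10x16
            nn.AvgPool2d(kernel_size=2, stride=2),         # S4: -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # C5: -> 1x1x120
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),                 # F6
            nn.Linear(84, num_classes),                    # output layer (softmax applied in the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])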
Alex Net Architecture
The convolutional neural network (CNN) architecture known as AlexNet was created
by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, who served as Krizhevsky’s PhD
advisor.
The first model we'll discuss is the winner of the 2012 ImageNet Large Scale Visual
Recognition Challenge (ILSVRC, or simply ImageNet).
This was the first architecture that used GPU to boost the training performance. AlexNet
consists of 5 convolution layers, 3 max-pooling layers, 2 Normalized layers, 2 fully
connected layers and 1 SoftMax layer. Each convolution layer consists of a convolution filter
and a non-linear activation function called “ReLU”. The pooling layers are used to perform
the max-pooling function, and the input size is fixed due
to the presence of fully connected layers. The input size is mentioned in most places as
224x224x3, but due to the padding involved it works out to be 227x227x3. In total,
AlexNet has over 60 million parameters.
Key Features:
‘ReLU’ is used as an activation function rather than ‘tanh’. Compared to a network
using tanh, this is six times faster.
SGD Momentum is used as a learning algorithm.
Data augmentation is carried out, such as flipping, jittering, cropping, colour
normalization, etc.
• It is worth mentioning that there were two parallel pipelines of processing in the
original architecture.
• These two pipelines correspond to two GPUs working together to train the
model faster and to share memory.
• The network was originally trained on a GTX 580 GPU with 3 GB of memory, and it
was impossible to fit the intermediate computations in this amount of space.
Therefore, the network was partitioned across two GPUs.
• The authors of AlexNet used pooling windows of size 3×3 with a stride of 2 between
adjacent windows. Due to this overlapping nature of max pooling, the top-1 error rate
was reduced by 0.4% and the top-5 error rate by 0.3%, compared with using
non-overlapping pooling windows of size 2×2 with a stride of 2, which would give the
same output dimensions.
First Conv Layer
• AlexNet starts with 224 × 224 × 3 images and uses 96 filters of size 11 × 11 × 3 in the
first layer and stride of 4. This results in a first layer of size 55 × 55 × 96.
• After the first layer has been computed, a max-pooling layer is used.
• The ReLU activation function was applied after each convolutional layer, which was
followed by response normalization and max-pooling.
Second Conv Layer
• The second convolutional layer uses the response-normalized and pooled output of
the first convolutional layer and filters it with 256 filters of size 5 × 5 × 96.
Third, Fourth, and Fifth Conv Layers
• The sizes of the filters of the third, fourth, and fifth convolutional layers are 3 × 3 ×
256 (with 384 filters), 3 × 3 × 384 (with 384 filters), and 3 × 3 × 384 (with 256
filters).
• No intervening pooling or normalization layers are present in the third, fourth, or fifth
convolutional layers.
Pooling
• All max-pooling layers used 3 × 3 filters at stride 2. Therefore, there was some
overlap among the pools.
Summary
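As a summary, here is a hedged single-stream PyTorch sketch of the layer configuration described above (227x227x3 input, 5 convolution layers, 3 overlapping max-pooling layers, 3 fully connected layers); the original two-GPU split and local response normalization are omitted for brevity.

import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),    # -> 55x55x96
            nn.MaxPool2d(kernel_size=3, stride=2),                    # -> 27x27x96
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),  # -> 27x27x256
            nn.MaxPool2d(kernel_size=3, stride=2),                    # -> 13x13x256
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # -> 6x6x256
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(AlexNetSketch()(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])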
VGG-16 Architecture
In recent years, deep learning has gained immense popularity due to its ability to
perform complex tasks such as image recognition, natural language processing, and voice
recognition. One of the most popular deep learning models for image recognition is the
VGG16 (Visual Geometry Group) model.
VGG16 is a convolutional neural network (CNN) architecture that was developed by
the Visual Geometry Group at the University of Oxford. It was first introduced in 2014, and
since then, it has become one of the most popular deep learning models for image
recognition. This architecture was the first runner-up of the Visual Recognition Challenge of
2014, i.e., ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2014, and was
developed by Simonyan and Zisserman.
ILSVRC is an annual computer vision competition. Each year, teams compete on two
tasks. The first is to detect objects within an image coming from 200 classes, which is called
object localization. The second is to classify images, each labeled with one of 1000
categories, which is called image classification. This model won 1st and 2nd place in the above
categories in the 2014 ILSVRC challenge. It achieves 92.7% top-5 test accuracy on
the ImageNet dataset, which contains 14 million images belonging to 1000 classes.
The architecture of VGG-16 is typically illustrated as a stack of colored blocks: the blue
rectangles represent the convolution layers together with the non-linear activation function, a
rectified linear unit (ReLU). There are 13 blue and 5 red rectangles, i.e., 13 convolution layers
and 5 max-pooling layers. Along with these, there are 3 green rectangles representing 3 fully
connected layers. So the total number of layers having tunable parameters is 16, of which 13
are convolution layers and 3 are fully connected layers; hence the name VGG-16. At the
output, there is a softmax layer with 1000 outputs, one per image category in the ImageNet
dataset.
In this architecture, the network starts with a small channel size of 64, which is then gradually
increased by a factor of 2 after each max-pooling layer until it reaches 512. The flattened
layer-by-layer structure of VGG-16 is described below.
The architecture is very simple. It has 2 contiguous blocks of 2 convolution layers
followed by max-pooling, then 3 contiguous blocks of 3 convolution layers followed
by max-pooling, and at last 3 dense layers. The last 3 convolution layers have
different depths in the different VGG variants. The important thing to notice here is that after
every max-pooling the spatial size is halved.
Features of VGG-16 network:
1. Input Layer: It accepts color images as an input with the size 224 x 224 and 3 channels
i.e. Red, Green, and Blue.
2. Convolution Layer: The images pass through a stack of convolution layers where every
convolution filter has a very small receptive field of 3 x 3 and stride of 1. Every
convolution kernel uses row and column padding so that the size of input as well as the
output feature maps remains the same.
3. Max pooling: It is performed over a max-pool window of size 2 x 2 with stride equals to
2, which means here max pool windows are non-overlapping windows.
4. Not every convolution layer is followed by a max pool layer as at some places a
convolution layer is following another convolution layer without the max-pool layer in
between.
5. The first two fully connected layers have 4096 channels each, and the third fully
connected layer, which is also the output layer, has 1000 channels, one for each category
of images in the ImageNet database.
6. The hidden layers have ReLU as their activation function.
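A compact PyTorch sketch of the VGG-16 convolutional stack following the block structure and features listed above; 'M' marks a 2x2 max-pooling layer, and the classifier head is omitted for brevity.

import torch
import torch.nn as nn

cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M']

def make_vgg16_features(in_channels=3):
    layers, c = [], in_channels
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves height and width
        else:
            layers += [nn.Conv2d(c, v, kernel_size=3, padding=1), nn.ReLU()]
            c = v
    return nn.Sequential(*layers)

features = make_vgg16_features()
print(features(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])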
Difference between VGG-16 and AlexNet:
• As compared to VGG-16, where all the convolution kernels are of the uniform size 3 x
3 with a stride of 1, AlexNet has convolution kernels of variable size such as 11 x 11, 5 x 5 and
3 x 3. Although AlexNet uses kernels of different sizes, the effect of a larger
convolution kernel can be realized by stacking multiple 3 x 3 kernels.
Advantages of having 3 x 3 kernel size :
1. Stacking more convolution layers, with non-linearities between them, extracts features
more sharply than using fewer layers. So a stack of 3 x 3 kernels leads to better feature
extraction than a single 7 x 7 kernel covering the same receptive field.
2. A stack of three 3 x 3 convolution layers has 27C² trainable parameters (where C is the
number of channels), whereas a single 7 x 7 layer covering the same receptive field has
49C² trainable parameters, which is about 81% more (a quick numeric check is sketched below).
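A quick numeric check of this comparison, assuming C = 64 channels purely for illustration and ignoring biases, as in the argument above.

C = 64
three_stacked_3x3 = 3 * (3 * 3 * C * C)  # 27 * C^2
single_7x7 = 7 * 7 * C * C               # 49 * C^2
print(three_stacked_3x3, single_7x7)     # 110592 200704
print(f"the 7x7 kernel uses {single_7x7 / three_stacked_3x3 - 1:.0%} more parameters")  # ~81% more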
Calculations involved in getting output size from each layer
The complete architecture of the VGG-16 has been summed up in the table shown
below:
Input Layer:
• The size of the input image is 224 x 224.
Convolution Layer - 1:
• Input size = N = 224
• Filter size = f = 3 x 3
• No. of filters = 64
• Strides = S = 1
• Padding = P = 1
• Output feature map size = [(224–3+2)/1] + 1 = 224
• Output with channels = 224 x 224 x 64
Convolution Layer - 2:
• Input size = N = 224
• Filter size = f = 3 x 3
• No. of filters = 64
• Strides = S = 1
• Padding = P = 1
• Output feature map size = [(224–3+2)/1] + 1 = 224
• Output with channels = 224 x 224 x 64
Max-Pooling Layer - 1:
• Input size = N = 224
• Filter size = f = 2 x 2
• Strides = S = 2
• Padding = P = 0
• Output feature map size = [(224–2+0)/2] + 1 = 112
• Output with channels = 112 x 112 x 64
Convolution Layer - 3:
• Input size = N = 112
• Filter size = f = 3 x 3
• No. of filters = 128
• Strides = S = 1
• Padding = P = 1
• Output feature map size = [(112–3+2)/1] + 1 = 112
• Output with channels = 112 x 112 x 128
Convolution Layer - 4:
• Input size = N = 112
• Filter size = f = 3 x 3
• No. of filters = 128
• Strides = S = 1
• Padding = P = 1
• Output feature map size = [(112–3+2)/1] + 1 = 112
• Output with channels = 112 x 112 x 128
Max-Pooling Layer - 2:
• Input size = N = 112
• Filter size = f = 2 x 2
• Strides = S = 2
• Padding = P = 0
• Output feature map size = [(112–2+0)/2] + 1 = 56
• Output with channels = 56 x 56 x 128
Convolution Layer - 5:
• Input size = N = 56
• Filter size = f = 3 x 3
• No. of filters = 256
• Strides = S = 1
• Padding = P = 1
• Output feature map size = [(56–3+2)/1] + 1 = 56
• Output with channels = 56 x 56 x 256
Convolution Layer - 6:
• Input size = N = 56
• Filter size = f = 3 x 3
• No. of filters = 256
• Strides = S = 1
• Padding = P = 1
• Output feature map size = [(56–3+2)/1] + 1 = 56
• Output with channels = 56 x 56 x 256
Convolution Layer - 7:
• Input size = N = 56
• Filter size = f = 3 x 3
• No. of filters = 256
• Strides = S = 1
• Padding = P = 1
• Output feature map size = [(56–3+2)/1] + 1 = 56
• Output with channels = 56 x 56 x 256
Max-Pooling Layer - 3:
• Input size = N = 56
• Filter size = f = 2 x 2
• Strides = S = 2
• Padding = P = 0
• Output feature map size = [(56–2+0)/2] + 1 = 28
• Output with channels = 28 x 28 x 256
Similar calculations will be performed for the rest of the network.
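A small helper that applies the same formula used above, output = (N - f + 2P)/S + 1, to the remaining two VGG-16 blocks (three 3x3 convolutions with 512 filters followed by 2x2 max pooling in each).

def out_size(n, f, s, p):
    return (n - f + 2 * p) // s + 1

size = 28                                # after Max-Pooling Layer - 3
for block in (4, 5):
    for _ in range(3):                   # three 3x3 convolutions, stride 1, padding 1
        size = out_size(size, f=3, s=1, p=1)
    size = out_size(size, f=2, s=2, p=0) # 2x2 max pooling, stride 2
    print(f"after block {block}: {size} x {size} x 512")
# after block 4: 14 x 14 x 512
# after block 5: 7 x 7 x 512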
Applications of VGG16
• VGG16 has been used in various applications such as object detection, facial
recognition, and image classification.
Limitations of VGG 16:
• It is very slow to train (the original VGG model was trained on Nvidia Titan GPU for
2-3 weeks).
• The size of VGG-16 trained imageNet weights is 528 MB. So, it takes quite a lot of
disk space and bandwidth which makes it inefficient.
• Its 138 million parameters also make it prone to the exploding gradient problem.
Further advancements:
• ResNets were introduced to prevent the exploding gradient problem that occurred in
VGG-16.
ResNet Architecture
A novel architecture called Residual Network was launched by Microsoft Research
experts in 2015 with the proposal of ResNet. The Residual Blocks idea was created by this
design to address the issue of the vanishing/exploding gradient. We apply a method known as
skip connections in this network. A skip connection bypasses some layers in between and
links a layer's activations to later layers, creating a residual block. These residual
blocks are stacked to create ResNets. The strategy behind this network is to let the layers fit
a residual mapping rather than learn the underlying mapping directly. Thus, instead of
learning the original mapping H(x), the network fits
F(x) := H(x) - x, which gives H(x) := F(x) + x.
The benefit of including this kind of skip connection is that regularisation will effectively
skip any layer that degrades architecture performance. As a result, training an extremely deep
neural network is possible without encountering issues with vanishing or exploding gradients.
Similar techniques exist under the name “highway networks,” which also employ
skip connections. These skip connections also make use of parametric gates, just as in LSTMs.
The amount of data that flows across the skip connection is controlled by these gates.
However, this design has not offered accuracy that is superior to ResNet architecture.
Salient features of ResNet Architecture:
With a top-5 error rate of 3.57 per cent, ResNet (as a model ensemble) won first place in the
ILSVRC classification competition in 2015.
Won first place in the categories of ImageNet detection, ImageNet localization, COCO
detection, and COCO segmentation at the 2015 ILSVRC and COCO competitions.
ResNet-101 is used in Faster R-CNN to replace the VGG-16 layers, giving a 28
per cent relative improvement.
Networks of 100 layers and even 1000 layers can be trained effectively.
ResNet Architecture: ResNet starts from a VGG-19-inspired 34-layer plain network, to which
shortcut connections are then added. These shortcut connections transform the plain
architecture into the residual network.
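A minimal PyTorch sketch of one residual block implementing H(x) = F(x) + x; the 1x1 projection on the shortcut is only needed when the shape of x changes, and exact layer choices vary between ResNet variants.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.residual = nn.Sequential(  # F(x): the residual mapping
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = nn.Identity()   # the skip connection carries x forward unchanged
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)

    def forward(self, x):
        return torch.relu(self.residual(x) + self.shortcut(x))  # H(x) = F(x) + x

print(ResidualBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56)).shape)  # [1, 128, 28, 28]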
Google Net Architecture
GoogLeNet, the first version of the Inception models (Inception V1), was proposed
by researchers at Google (in collaboration with various universities) in 2014 in the research
paper titled “Going Deeper with Convolutions”. This architecture was the winner of the
ILSVRC 2014 image classification challenge. It provided a significant decrease in error
rate compared with the previous winner AlexNet (winner of ILSVRC 2012) and a
significantly lower error rate than VGG (the 2014 runner-up).
Features of GoogleNet:
This architecture uses techniques such as 1×1 convolutions in the middle of the
architecture and global average pooling.
1×1 convolution: The Inception architecture uses 1×1 convolutions. These
convolutions are used to decrease the number of parameters (weights and biases) of the
architecture. By reducing the parameters we can also increase the depth of the architecture.
For example, performing a 5×5 convolution with 48 filters directly on a deep feature map
requires far more multiplications than first reducing the channel depth with an intermediate
1×1 convolution, as sketched below.
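A rough sketch of the multiplication counts behind this comparison; the 14x14x480 input size and the 16 intermediate 1x1 filters are assumed values used only for illustration.

def conv_mults(out_h, out_w, out_c, k, in_c):
    return out_h * out_w * out_c * k * k * in_c  # multiplications for one convolution layer

direct = conv_mults(14, 14, 48, k=5, in_c=480)        # 5x5 convolution with 48 filters directly
reduced = (conv_mults(14, 14, 16, k=1, in_c=480)      # 1x1 convolution down to 16 channels first
           + conv_mults(14, 14, 48, k=5, in_c=16))    # then the 5x5 convolution with 48 filters
print(f"direct: {direct / 1e6:.1f}M  with 1x1 bottleneck: {reduced / 1e6:.1f}M multiplications")
# direct: 112.9M   with 1x1 bottleneck: 5.3M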
Global Average Pooling: In the GoogLeNet architecture, a method called global
average pooling is used at the end of the network. This layer takes a 7×7 feature map and
averages it down to 1×1. It adds no trainable parameters and improves
the top-1 accuracy by 0.6%.
Inception Module: The inception module differs from previous architectures such as
AlexNet and ZF-Net, where there is a fixed convolution size for each layer. In the
Inception module, 1×1, 3×3 and 5×5 convolutions and 3×3 max pooling are performed in
parallel on the input, and their outputs are stacked together to generate the final output. The
idea is that convolution filters of different sizes handle objects at multiple scales
better.
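A hedged PyTorch sketch of one Inception module: 1x1, 3x3 and 5x5 convolutions and 3x3 max pooling run in parallel and their outputs are concatenated along the channel dimension. The branch widths below follow the Inception (3a) block of the paper and are just one possible configuration.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_c, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_c, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_c, c3_reduce, 1), nn.ReLU(),      # 1x1 reduction
                                nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_c, c5_reduce, 1), nn.ReLU(),      # 1x1 reduction
                                nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_c, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        # run the four branches in parallel and stack their feature maps channel-wise
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = InceptionModule(192, c1=64, c3_reduce=96, c3=128, c5_reduce=16, c5=32, pool_proj=32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # [1, 256, 28, 28] (64 + 128 + 32 + 32 channels)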
Auxiliary Classifier for Training: The Inception architecture uses some intermediate classifier
branches in the middle of the architecture; these branches are used during training only.
Each branch consists of a 5×5 average pooling layer with a stride of 3, a 1×1 convolution
with 128 filters, two fully connected layers with 1024 and 1000 outputs, and a softmax
classification layer. The loss generated by these layers is added to the total loss with a weight
of 0.3. These layers help in combating the vanishing gradient problem and also provide regularization.
Model Architecture:
The overall architecture is 22 layers deep. The architecture was designed with
computational efficiency in mind, the idea being that it could be run on
individual devices even with low computational resources. The architecture also contains two
auxiliary classifier layers connected to the outputs of the Inception (4a) and Inception (4d) layers.
The architectural details of the auxiliary classifiers are as follows (a sketch is given after this list):
An average pooling layer of filter size 5×5 and stride 3.
A 1×1 convolution with 128 filters for dimension reduction and ReLU activation.
A fully connected layer with 1024 outputs and ReLU activation.
Dropout regularization with a dropout ratio of 0.7.
A softmax classifier with 1000 class outputs, similar to the main softmax classifier.
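A short PyTorch sketch of one such auxiliary classifier branch, assumed here to sit on a 14x14x512 intermediate feature map (as for Inception (4a)); during training its loss would be added to the main loss with a weight of 0.3.

import torch
import torch.nn as nn

aux_classifier = nn.Sequential(
    nn.AvgPool2d(kernel_size=5, stride=3),          # 14x14 -> 4x4 average pooling
    nn.Conv2d(512, 128, kernel_size=1), nn.ReLU(),  # 1x1 convolution for dimension reduction
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 1024), nn.ReLU(),        # fully connected layer
    nn.Dropout(p=0.7),                              # dropout ratio 0.7
    nn.Linear(1024, 1000),                          # 1000-class output (softmax applied in the loss)
)
print(aux_classifier(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 1000])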
This architecture takes an image of size 224 x 224 with RGB color channels. All the
convolutions inside this architecture use Rectified Linear Units (ReLU) as their activation
functions.
1x1 convolutions are less computationally expensive due to fewer parameters, fewer memory
requirements, and less computation required for convolution, making them more efficient and
suitable for dimensionality reduction.
Results: GoogLeNet was the winner at ILSVRC 2014, taking 1st place in both the classification
and detection tasks. It has a top-5 error rate of 6.67% in the classification task. An ensemble of 6
GoogLeNets gives 43.9% mAP on the ImageNet detection test set.
UNet Architecture
UNET is an architecture developed by Olaf Ronneberger et al. for Biomedical Image
Segmentation in 2015 at the University of Freiburg, Germany. It is one of the most popularly
used approaches in any semantic segmentation task today. It is a fully convolutional neural
network that is designed to learn from fewer training samples. It is an improvement over the
existing FCN, “Fully convolutional networks for semantic segmentation”, developed by
Jonathan Long et al. in 2014.
UNET — Network Architecture:
UNET is a U-shaped encoder-decoder network architecture, which consists of four
encoder blocks and four decoder blocks that are connected via a bridge. The encoder network
(contracting path) halves the spatial dimensions and doubles the number of filters (feature
channels) at each encoder block. Likewise, the decoder network doubles the spatial
dimensions and halves the number of feature channels.
Encoder Network:
The encoder network acts as the feature extractor and learns an abstract representation
of the input image through a sequence of the encoder blocks. Each encoder block consists of
two 3x3 convolutions, where each convolution is followed by a ReLU (Rectified Linear
Unit) activation function. The ReLU activation function introduces non-linearity into the
network, which helps in the better generalization of the training data. The output of the ReLU
acts as a skip connection for the corresponding decoder block. Next, follows a 2x2 max-
pooling, where the spatial dimensions (height and width) of the feature maps are reduced by
half. This reduces the computational cost by decreasing the number of trainable parameters.
Skip Connections:
These skip connections provide additional information that helps the decoder to
generate better semantic features. They also act as a shortcut connection that helps the
indirect flow of gradients to the earlier layers without any degradation. In simple terms, we
can say that skip connection helps in better flow of gradient while backpropagation, which in
turn helps the network to learn better representation.
Bridge:
The bridge connects the encoder and the decoder network and completes the flow of
information. It consists of two 3x3 convolutions, where each convolution is followed by a
ReLU activation function.
Decoder Network:
The decoder network is used to take the abstract representation and generate a
semantic segmentation mask. The decoder block starts with a 2x2 transpose convolution.
Next, it is concatenated with the corresponding skip connection feature map from the encoder
block. These skip connections provide features from earlier layers that are sometimes lost
due to the depth of the network. After that, two 3x3 convolutions are used, where each
convolution is followed by a ReLU activation function. The output of the last decoder passes
through a 1x1 convolution with sigmoid activation. The sigmoid activation function gives the
segmentation mask representing the pixel-wise classification.
NOTE:
Some researchers prefer to use a batch normalization layer in between the
convolution layer and the ReLU activation function. The batch normalization reduces
internal covariance shift and makes the network more stable while training.
Dropout is also sometimes used after the ReLU activation function. It forces the
network to learn a different representation by dropping out (ignoring) some randomly
selected neurons. This helps the network become less dependent upon certain neurons,
which in turn helps it generalize better and prevents overfitting.
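A hedged PyTorch sketch of the U-Net building blocks described above: an encoder block (two 3x3 convolutions with ReLU, followed by 2x2 max pooling) and a decoder block (2x2 transpose convolution, concatenation with the skip connection, then two 3x3 convolutions). Batch normalization and dropout are omitted, and the channel sizes are only examples.

import torch
import torch.nn as nn

def double_conv(in_c, out_c):
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_c, out_c, 3, padding=1), nn.ReLU(),
    )

class EncoderBlock(nn.Module):
    def __init__(self, in_c, out_c):
        super().__init__()
        self.conv = double_conv(in_c, out_c)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.conv(x)            # saved as the skip connection for the decoder
        return self.pool(skip), skip   # pooling halves the height and width

class DecoderBlock(nn.Module):
    def __init__(self, in_c, out_c):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_c, out_c, 2, stride=2)  # doubles height and width
        self.conv = double_conv(out_c * 2, out_c)

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)  # concatenate with the encoder's skip-connection features
        return self.conv(x)

enc, dec = EncoderBlock(3, 64), DecoderBlock(128, 64)
pooled, skip = enc(torch.randn(1, 3, 128, 128))       # pooled: [1, 64, 64, 64]
print(dec(torch.randn(1, 128, 64, 64), skip).shape)   # torch.Size([1, 64, 128, 128])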
****