Exploratory Project Report
Comparative Analysis of Semantic Segmentation Architectures: U-Net, SegNet and FCN
by
MUKHRAM YADAV
22075049
I, the undersigned student, hereby declare that the project entitled “Comparative
Analysis of Semantic Segmentation Architectures: U-Net, SegNet and
FCN” submitted by me to the Indian Institute of Technology (BHU) Varanasi
during the academic year 2023-24 in fulfillment of the requirements of the Ex-
ploratory Project for the award of Degree of Bachelor of Technology in Computer
Science and Engineering is a record of bonafide project work carried out by me
under the guidance and supervision of Dr. Rajeev Srivastava.
I have worked on the project and followed the ethical guidelines for conducting
research and ensured that my methods and results were accurate and reliable. I
have also maintained a detailed record of my research methodology, data collection,
and analysis procedures, and given due credit to external sources through citations.
I further declare that the work reported in this project has not been submitted
and will not be submitted, either in part or in full, for the award of any other
degree or diploma in this institute or any other University.
Mukhram Yadav
(22075049)
CONTENTS
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
Chapter 1: Introduction
1.1 Overview
1.2 Motivation
Chapter 4: Training
4.0.1 Cityscapes dataset
REFERENCES
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to Dr. Rajeev Srivastava and Dr. SK
Singh, Head of the Department of Computer Science and Engineering, for their
guidance, and to Mr. Arun Shahi and Adarsh Kumar for their support.
Mukhram Yadav
ABSTRACT
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.1 Overview
This overview delves into the realm of semantic segmentation architectures,
focusing on three prominent CNN-based models: U-Net, SegNet and Fully Convolutional
Networks (FCNs). Each architecture offers unique design choices and innovations
tailored to address the challenges of semantic segmentation.
U-Net: U-Net pairs a contracting encoder path with a symmetric expansive decoder
path, using skip connections to recover spatial detail during upsampling.
SegNet: SegNet reuses the max-pooling indices from the encoder to perform
efficient upsampling in the decoder. It is known for its simplicity and
computational efficiency[15].
Fully Convolutional Networks (FCNs): FCNs were among the pioneering archi-
tectures for semantic segmentation, featuring a fully convolutional end-to-end de-
sign. They introduced transposed convolutions for upsampling, enabling dense
pixel-wise predictions[3].
1.2 Motivation
Over the past two decades, machine learning has advanced significantly, from a
laboratory curiosity to a practical tool with widespread commercial application.
Machine learning has become the approach of choice in artificial intelligence (AI)
for creating useful software for computer vision, speech recognition, natural
language processing, robot control, and other applications [9]. The recent surge
of interest in deep learning methods lies in the fact that they have been shown
to outperform previously existing technologies in various tasks, and in their
ability to learn from the abundance of complex data from different sources
(e.g., visual, audio, medical, social, and sensor) [17].
The primary motivation for taking on this project was our desire to venture into
an exciting field of research with immense future capacity. The chance to learn
more about Computer Vision and Image Segmentation, areas we had not been
exposed to earlier, pushed us to work harder[13]. Studying the effects of
implementing various machine learning and deep learning algorithms to solve
real-life problems across domains has helped us delve deeper into the practical
impact of our work and motivated us to improve it.
CHAPTER 2
LITERATURE REVIEW
Ronneberger et al. (2015) introduced U-Net, an encoder-decoder architecture
augmented with skip connections[1]. This design facilitated the precise
localization of objects in medical imaging tasks while mitigating the vanishing
gradient problem. SegNet, introduced by Badrinarayanan et al. (2017), focused
on computational efficiency by leveraging max-pooling indices for upsampling
in the decoder[1].
Comparative studies have played a pivotal role in evaluating the performance and
characteristics of different semantic segmentation architectures[14]. Zhao et al.
(2017) conducted a comprehensive comparative analysis of FCNs, SegNet, and
U-Net on benchmark datasets, highlighting the trade-offs between accuracy and
computational efficiency. Similarly, Ronneberger et al. (2015) compared U-Net
with traditional segmentation methods, showcasing its superior performance in
medical image segmentation tasks.
CHAPTER 3
ARCHITECTURES USED
3.1 U-NET
U-Net is a widely used deep learning architecture that was first introduced in
the “U-Net: Convolutional Networks for Biomedical Image Segmentation” paper.
The primary purpose of this architecture was to address the challenge of limited
annotated data in the medical field. This network was designed to effectively
leverage a smaller amount of data while maintaining speed and accuracy.
The contracting path in U-Net is responsible for identifying the relevant features in
the input image. The encoder layers perform convolutional operations that reduce
the spatial resolution of the feature maps while increasing their depth, thereby
capturing increasingly abstract representations of the input. This contracting path
is similar to the feedforward layers in other convolutional neural networks. On the
other hand, the expansive path works on decoding the encoded data and locating
the features while maintaining the spatial resolution of the input. The decoder
layers in the expansive path upsample the feature maps, while also performing
convolutional operations. The skip connections from the contracting path help
to preserve the spatial information lost in the contracting path, which helps the
decoder layers to locate the features more accurately.
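To make the interplay between upsampling and skip connections concrete, the
following is a minimal sketch of a single expansive-path step in PyTorch
(PyTorch and all layer sizes here are illustrative assumptions, not the
configuration of the original paper):

    import torch
    import torch.nn as nn

    class UpBlock(nn.Module):
        """One U-Net expansive-path step: upsample, concatenate skip, convolve."""
        def __init__(self, in_ch, skip_ch, out_ch):
            super().__init__()
            # Transposed convolution doubles the spatial resolution.
            self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
            self.conv = nn.Sequential(
                nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, x, skip):
            x = self.up(x)
            # Skip connection: concatenating the copied encoder feature map
            # restores spatial detail lost in the contracting path.
            x = torch.cat([skip, x], dim=1)
            return self.conv(x)

    x = torch.randn(1, 256, 32, 32)        # bottleneck features
    skip = torch.randn(1, 128, 64, 64)     # copied encoder features
    out = UpBlock(256, 128, 128)(x, skip)  # shape: (1, 128, 64, 64)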
Figure 3.1: U-Net architecture (example for 32x32 pixels in the lowest
resolution). Each blue box corresponds to a multi-channel feature map; the
number of channels is denoted on top of the box and the x-y size at the lower
left edge. White boxes represent copied feature maps, and the arrows denote
the different operations.
3.2 SegNet
Figure 3.2: SegNet encoder-decoder architecture
Figure 3.2 illustrates the SegNet architecture. There are no fully connected
layers, so the network is fully convolutional. A decoder upsamples its input
using the transferred pool indices
from its encoder to produce a sparse feature map(s). It then performs convolution
with a trainable filter bank to densify the feature map. The final decoder output
feature maps are fed to a soft-max classifier for pixel-wise classification. The
encoder network consists of 13 convolutional layers which correspond to the first 13
convolutional layers in the VGG16 network [1] designed for object classification.
We can therefore initialize the training process from weights trained for classifica-
tion on large datasets [41]. We can also discard the fully connected layers in favour
of retaining higher resolution feature maps at the deepest encoder output. This
also reduces the number of parameters in the SegNet encoder network significantly
(from 134M to 14.7M) as compared to other recent architectures [2], [4]. Each
encoder layer has a corresponding decoder layer and hence the decoder network
has 13 layers. The final decoder output is fed to a multi-class soft-max classifier
to produce class probabilities for each pixel independently.
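Because the encoder matches the first 13 convolutional layers of VGG16,
initializing it from classification weights is straightforward. The following is
a hedged sketch using torchvision (an illustrative assumption; the report's
experiments used a Caffe implementation, and segnet_encoder_convs below is a
hypothetical list of the SegNet encoder's convolutional layers):

    import torch.nn as nn
    from torchvision.models import vgg16

    # vgg16().features holds the 13 convolutional layers (interleaved with
    # ReLU and max-pooling) whose weights can seed the SegNet encoder.
    vgg = vgg16(weights="IMAGENET1K_V1")
    encoder_convs = [m for m in vgg.features if isinstance(m, nn.Conv2d)]
    assert len(encoder_convs) == 13

    # Copying classification weights into a SegNet encoder would look like:
    # for src, dst in zip(encoder_convs, segnet_encoder_convs):
    #     dst.weight.data.copy_(src.weight.data)
    #     dst.bias.data.copy_(src.bias.data)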
3.2.1 Encoder
In the encoder network, each convolutional layer produces feature maps that un-
dergo batch normalization and rectified linear unit (ReLU) activation. Subse-
quently, max-pooling with a 2x2 window and stride 2 is applied for down-sampling,
aiming for translation invariance. However, multiple layers of max-pooling lead
to a loss of spatial resolution, which is detrimental for segmentation tasks requir-
ing precise boundary delineation. To address this, the encoder feature maps are
stored efficiently by retaining only the max-pooling indices, representing the lo-
cations of maximum feature values in each pooling window. This storage method
significantly reduces memory usage compared to storing entire feature maps. Al-
though this approach incurs a slight accuracy loss, it remains suitable for practical
applications with memory constraints.
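In PyTorch terms (an illustrative sketch, not the report's Caffe code), storing
only the pooling indices looks like this:

    import torch
    import torch.nn as nn

    pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)

    feat = torch.randn(1, 64, 360, 480)   # an encoder feature map
    pooled, indices = pool(feat)          # both (1, 64, 180, 240)

    # Only `indices` (the location of the maximum in each 2x2 window) must
    # be kept for the decoder; the full feature map itself can be discarded,
    # which is what makes this storage scheme memory efficient.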
3.2.2 Decoder
The appropriate decoder in the decoder network upsamples its input feature
map(s) using the memorized max-pooling indices from the corresponding encoder
feature map(s). This step produces sparse feature map(s). These feature maps
are then convolved with a trainable decoder filter bank to produce dense feature
maps. A batch normalization step is then applied to each of these maps. Note
that the decoder corresponding to the first encoder (closest to the input image)
produces a multi-channel feature map, although its encoder input has 3 channels
(RGB). This is unlike the other decoders in the network which produce feature
maps with the same size and number of channels as their encoder inputs. The
high dimensional feature representation at the output of the final decoder is fed to
a trainable soft-max classifier. This soft-max classifies each pixel independently.
The output of the soft-max classifier is a K channel image of probabilities where
K is the number of classes. The predicted segmentation corresponds to the class
with maximum probability at each pixel.
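A matching decoder step, sketched under the same PyTorch assumption, places the
pooled values back at their memorized locations, densifies the sparse map with
a trainable filter bank, and classifies each pixel with a soft-max:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
    unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
    densify = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # trainable filter bank

    feat = torch.randn(1, 64, 180, 240)
    pooled, indices = pool(feat)

    sparse = unpool(pooled, indices)  # zeros except at the max locations
    dense = densify(sparse)           # convolution densifies the feature map

    K = 20                                       # number of classes (illustrative)
    classifier = nn.Conv2d(64, K, kernel_size=1)
    probs = F.softmax(classifier(dense), dim=1)  # (1, K, 180, 240) probabilities
    pred = probs.argmax(dim=1)                   # class with maximum probability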
3.3 FCN
Long et al. [7] showed that a fully convolutional network (FCN), trained
end-to-end, pixels-to-pixels on semantic segmentation, exceeds the state of the
art without further machinery. Theirs was the first work to train FCNs
end-to-end (1) for pixelwise prediction and (2) from supervised pre-training.
Fully convolutional
versions of existing networks predict dense outputs from arbitrary-sized inputs.
Both learning and inference are performed whole-image-at-a-time by dense feed-
forward computation and backpropagation. In-network upsampling layers enable
pixelwise prediction and learning in nets with subsampled pooling.
This method is efficient, both asymptotically and absolutely, and precludes the
need for the complications in other works. Patchwise training is common [27, 2,
8, 28, 11], but lacks the efficiency of fully convolutional training. Their approach
does not make use of pre- and post-processing complications, including superpix-
els [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local
classifiers [8, 16]. Their model transfers recent success in classification [19, 31, 32]
to dense prediction by reinterpreting classification nets as fully convolutional and
fine-tuning from their learned representations. In contrast, previous works have
applied small convnets without supervised pre-training [8, 28, 27].
Figure 3.4: FCN structure
Convolution is a process that makes the output size smaller. The name
"deconvolution" is therefore used when we want to upsample and make the output
size larger. (The name is sometimes misinterpreted as the reverse process of
convolution, but it is not.) The operation is also called up-convolution or,
more precisely, transposed convolution.
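A stride-2 transposed convolution, for instance, doubles the spatial size with
learnable weights (a PyTorch sketch; the kernel size and channel counts are
illustrative):

    import torch
    import torch.nn as nn

    # An ordinary stride-2 convolution halves the spatial size...
    down = nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1)
    # ...while the transposed convolution with the same settings doubles it,
    # using learned weights rather than fixed interpolation.
    up = nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1)

    x = torch.randn(1, 16, 64, 64)
    y = down(x)   # (1, 32, 32, 32)
    z = up(y)     # (1, 16, 64, 64)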
Figure 3.5: Upsampling via deconvolution
CHAPTER 4
TRAINING
4.0.1 Cityscapes dataset
The Cityscapes dataset contains annotations for 30 different classes of objects and elements
commonly found in urban environments.
Images are captured across 50 different cities, spanning several months
(spring, summer, fall), primarily during daytime with good to medium weather
conditions. The dataset includes manually selected frames
with varying scene layouts, backgrounds, and a large number of dynamic objects,
contributing to its complexity.
Since the test set does not have annotations, I split the training data into
train and test subsets.
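A minimal sketch of such a split (assuming the 2975 finely annotated Cityscapes
training images; the 80/20 ratio is illustrative but reproduces the 2380/595
split sizes used below):

    import random

    random.seed(0)  # make the split reproducible

    # `samples` stands in for the (image, annotation) file pairs of the
    # annotated Cityscapes training set.
    samples = [(f"img_{i}.png", f"label_{i}.png") for i in range(2975)]
    random.shuffle(samples)

    cut = int(0.8 * len(samples))
    train_samples, test_samples = samples[:cut], samples[cut:]
    print(len(train_samples), len(test_samples))  # 2380 595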
Figure 4.1: Label visualisation
We use the Cityscapes road scenes dataset to benchmark the performance of
the decoder variants. This dataset is small, consisting of 2380 training and 595
testing RGB images (day and dusk scenes) at 360 x 480 resolution. The challenge
is to segment 20 classes such as road, building, cars, pedestrians, signs, poles,
side-walk etc. We perform local contrast normalization [4] to the RGB input. The
encoder and decoder weights were all initialized using the technique described
in He et al. [5]. To train all the variants we use stochastic gradient descent
(SGD) with a fixed learning rate of 0.1 and momentum of 0.9 [12] using our Caffe
implementation of SegNet-Basic [6]. We train the variants until the training loss
converges. Before each epoch, the training set is shuffled and each mini-batch (12
images) is then picked in order thus ensuring that each image is used only once
in an epoch. We select the model which performs highest on a validation dataset.
We use the cross-entropy loss [2] as the objective function for training the network.
The loss is summed up over all the pixels in a mini-batch. When there is large
variation in the number of pixels in each class in the training set (e.g. road, sky
and building pixels dominate the CamVid dataset) then there is a need to weight
the loss differently based on the true class. This is termed class balancing. We
use median frequency balancing [13] where the weight assigned to a class in the
loss function is the ratio of the median of class frequencies computed on the entire
training set divided by the class frequency. This implies that larger classes in the
training set have a weight smaller than 1 and the weights of the smallest classes
are the highest. We also experimented with training the different variants without
class balancing or equivalently using natural frequency balancing.
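A sketch of median frequency balancing feeding a weighted cross-entropy loss
(written in PyTorch for illustration; the report's training used a Caffe
implementation):

    import torch
    import torch.nn as nn

    def median_frequency_weights(label_maps, num_classes):
        """Per-class weight = median class frequency / class frequency."""
        counts = torch.zeros(num_classes)
        for labels in label_maps:  # each an (H, W) tensor of class ids
            counts += torch.bincount(labels.flatten(),
                                     minlength=num_classes).float()
        freq = counts / counts.sum()
        # Dominant classes (road, sky, ...) get weights below 1; the rarest
        # classes get the largest weights.
        return freq.median() / freq

    num_classes = 20
    label_maps = [torch.randint(0, num_classes, (360, 480)) for _ in range(8)]
    weights = median_frequency_weights(label_maps, num_classes)

    # Cross-entropy summed over all pixels in the mini-batch.
    criterion = nn.CrossEntropyLoss(weight=weights, reduction="sum")
    logits = torch.randn(4, num_classes, 360, 480)   # network output
    target = torch.randint(0, num_classes, (4, 360, 480))
    loss = criterion(logits, target)

    # The optimizer described above would be:
    # torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)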
CHAPTER 5
RESULTS
The first table below reports the mIoU for training, validation, and test for
all three architectures. The second table reports the corresponding loss for
training, validation, and test.
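For reference, mIoU can be computed per class from intersections and unions of
the predicted and true masks (a generic sketch, not the exact evaluation code
used for these tables):

    import torch

    def mean_iou(pred, target, num_classes):
        """Mean intersection-over-union over the classes that occur."""
        ious = []
        for c in range(num_classes):
            pred_c, target_c = pred == c, target == c
            union = (pred_c | target_c).sum()
            if union == 0:
                continue  # class absent from prediction and ground truth
            inter = (pred_c & target_c).sum()
            ious.append(inter.float() / union.float())
        return torch.stack(ious).mean()

    pred = torch.randint(0, 20, (360, 480))    # predicted class per pixel
    target = torch.randint(0, 20, (360, 480))  # ground-truth class per pixel
    print(mean_iou(pred, target, num_classes=20))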
Below is a comparison of the computational time and hardware resources required
by the three architectures. The caffe time command was used to compute the time
requirement, averaged over 10 iterations, with mini-batch size 1 and an image
of 360 x 480 resolution. We used the nvidia-smi unix command to measure memory
consumption. For the training memory figures we used a mini-batch of size 4,
and for inference memory the batch size was 1. Model size was the size of the
caffe models on disk. SegNet is the most memory efficient model during inference.
Network   Forward pass (ms)   Backward pass (ms)   GPU training memory (MB)
SegNet    422.76              488.21               6803
U-Net     317.34              394.71               9731
FCN       484.00              470.68               9735
From Table 5.3, we see that bilinear interpolation based upsampling without
any learning performs the worst based on all the measures of accuracy. All the
other methods which either use learning for upsampling (FCN-Basic and vari-
ants) or learning decoder filters after upsampling (SegNet-Basic and its variants)
perform significantly better. This emphasizes the need to learn decoders for seg-
mentation. This is also supported by experimental evidence gathered by other
authors when comparing FCN with SegNet-type decoding techniques [4].
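The contrast is between fixed interpolation, which has no parameters to learn,
and learned upsampling, as in FCN-Basic and its variants (a PyTorch sketch with
illustrative sizes):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.randn(1, 64, 45, 60)  # a coarse decoder input

    # Fixed bilinear upsampling: no learnable parameters at all.
    fixed = F.interpolate(x, scale_factor=8, mode="bilinear",
                          align_corners=False)

    # Learned upsampling: a transposed convolution with trainable weights.
    learned_up = nn.ConvTranspose2d(64, 64, kernel_size=16, stride=8, padding=4)
    learned = learned_up(x)

    print(fixed.shape, learned.shape)  # both (1, 64, 360, 480)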
Figure 5.1: Quantitative results
CHAPTER 6
CONCLUSION
Deep learning models have often achieved increasing success due to the
availability of massive datasets and expanding model depth and parameterisation.
However, in practice, memory and computational time during training and testing
are important factors to consider when choosing a model from a large bank of
models. Training time becomes an important consideration particularly when
the performance gain is not commensurate with increased training time as shown
in our experiments[2]. Test time memory and computational load are important
to deploy models on specialised embedded devices, for example, in AR applica-
tions. From an overall efficiency viewpoint, I feel less attention has been paid to
smaller, more memory- and time-efficient models for real-time applications such as
road scene understanding and AR. This was the primary motivation behind the
proposal of SegNet, which is significantly smaller and faster than other competing
architectures, but which we have shown to be efficient for tasks such as road scene
understanding.
There are practical trade-offs involved in designing architectures for
segmentation, particularly training time and memory versus accuracy. Those
architectures which store the encoder network feature maps in
full perform best but consume more memory during inference time[8]. SegNet on
the other hand is more efficient since it only stores the max-pooling indices of the
feature maps and uses them in its decoder network to achieve good performance.
On large and well known datasets SegNet performs competitively, achieving high
scores for road scene understanding. End-to-end learning of deep segmentation
architectures is a harder challenge and we hope to see more attention paid to this
important problem.
For the future, I would like to exploit the understanding of segmentation
architectures gathered from this analysis to design more efficient architectures
for real-time applications. I am also interested in estimating the model
uncertainty for predictions from deep segmentation architectures.
REFERENCES
[2] Meriem Amrane, Saliha Oukid, Ikram Gagaoua, and Tolga Ensari. Breast
cancer classification using machine learning. In 2018 electric electronics, com-
puter science, biomedical engineerings’ meeting (EBBT), pages 1–4. IEEE,
2018.
[3] Prakhar Bansal, Rahul Kumar, and Somesh Kumar. Disease detection in
apple leaves using deep convolutional neural network. Agriculture, 11(7):617,
2021.
[6] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic
segmentation. In ICCV, pages 1520–1528, 2015.
[7] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for
semantic segmentation. In CVPR, pages 3431–3440, 2015.
[8] Peng Jiang, Yuehan Chen, Bin Liu, Dongjian He, and Chunquan Liang. Real-
time detection of apple leaf diseases using deep learning approach based on
improved convolutional neural networks. IEEE Access, 7:59069–59080, 2019.
[9] Michael I Jordan and Tom M Mitchell. Machine learning: Trends, perspec-
tives, and prospects. Science, 349(6245):255–260, 2015.
[10] K. Simonyan and A. Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[12] Muhammad Ramzan, Adnan Abid, Hikmat Ullah Khan, Shahid Mahmood
Awan, Amina Ismail, Muzamil Ahmed, Mahwish Ilyas, and Ahsan Mahmood.
A review on state-of-the-art violence detection techniques. IEEE Access,
7:107560–107575, 2019.
[13] Mubarak Shah, Omar Javed, and Khurram Shafique. Automated visual
surveillance in realistic scenarios. IEEE MultiMedia, 14(1):30–39, 2007.
[14] Wei Song, Dongliang Zhang, Xiaobing Zhao, Jing Yu, Rui Zheng, and Antai
Wang. A novel violent video detection scheme based on modified 3d convo-
lutional neural networks. IEEE Access, 7:39172–39179, 2019.
[15] Ranjita Thapa, Noah Snavely, Serge Belongie, and Awais Khan. The plant
pathology 2020 challenge dataset to classify foliar disease of apples. arXiv
preprint arXiv:2004.11958, 2020.
[16] Fath U Min Ullah, Mohammad S Obaidat, Amin Ullah, Khan Muhammad,
Mohammad Hijji, and Sung Wook Baik. A comprehensive review on vision-
based violence detection in surveillance videos. ACM Computing Surveys,
55(10):1–44, 2023.
[17] Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, Eftychios
Protopapadakis, et al. Deep learning for computer vision: A brief review.
Computational intelligence and neuroscience, 2018, 2018.
[18] Anju Yadav, Udit Thakur, Rahul Saxena, Vipin Pal, Vikrant Bhateja, and
Jerry Chun-Wei Lin. Afd-net: Apple foliar disease multi classification using
deep learning on plant pathology dataset. Plant and Soil, 477(1-2):595–611,
2022.