
Image Segmentation Keras: Implementation of SegNet, FCN, UNet, PSPNet and other models in Keras

Divam Gupta
Carnegie Mellon University
[email protected]

Abstract

Semantic segmentation plays a vital role in computer vision tasks, enabling precise pixel-level understanding of images. In this paper, we present a comprehensive library for semantic segmentation, which contains implementations of popular segmentation models like SegNet, FCN, UNet, and PSPNet. We also evaluate and compare these models on several datasets, offering researchers and practitioners a powerful toolset for tackling diverse segmentation challenges.

1. Introduction

Semantic segmentation, an essential task in the field of computer vision, aims to assign precise labels to every pixel in an image. It has wide-ranging applications in autonomous driving, image and video analysis, medical imaging, and scene understanding. Over the years, deep learning approaches have achieved remarkable success in semantic segmentation.

In this paper, we present a comprehensive library for semantic segmentation¹, aimed at the machine learning community. We also compare various semantic segmentation models on multiple datasets. The library provides an easy-to-use interface and is built on the TensorFlow and Keras frameworks. It offers an extensive collection of easy-to-use models, including SegNet [1], FCN [6], UNet [8], and PSPNet [10], which are widely used networks for semantic segmentation.

The primary objective behind the development of this library is to provide a user-friendly and accessible platform for machine learning researchers and practitioners interested in semantic segmentation. Our library promotes modular and extensible design principles, allowing users to easily integrate, customize, and extend existing segmentation models to meet their specific needs.

To ensure ease of use, we have extensively documented the library, providing clear instructions and code examples. This documentation includes architectural details and training procedures. Additionally, we provide pre-trained weights, enabling users to quickly fine-tune the models on their own datasets.

¹ The code is available at https://github.com/divamgupta/image-segmentation-keras
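As a concrete illustration, the following is a minimal usage sketch following the interface documented in the library's README; the dataset paths, class count, and training settings are placeholders rather than recommended values:

```python
# Minimal sketch of the library's documented train/predict/evaluate interface.
# Paths, n_classes, and epochs are placeholders for your own data.
from keras_segmentation.models.unet import vgg_unet

model = vgg_unet(n_classes=51, input_height=416, input_width=608)

model.train(
    train_images="dataset1/images_prepped_train/",
    train_annotations="dataset1/annotations_prepped_train/",
    checkpoints_path="/tmp/vgg_unet_1",
    epochs=5,
)

# Predict a segmentation map for a single image.
out = model.predict_segmentation(
    inp="dataset1/images_prepped_test/0016E5_07965.png",
    out_fname="/tmp/out.png",
)

# Report IoU metrics on a held-out set.
print(model.evaluate_segmentation(
    inp_images_dir="dataset1/images_prepped_test/",
    annotations_dir="dataset1/annotations_prepped_test/",
))
```

The other models in the library are exposed through the same interface, so swapping architectures only changes the import and the constructor call.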
2. Networks for Semantic Segmentation

As in most other computer vision applications, using a CNN for semantic segmentation is the obvious choice. When using a CNN for semantic segmentation, the output is also an image rather than a fixed-length vector.

2.1. Fully Convolutional Network

Usually, the architecture of the model contains several convolutional layers, non-linear activations, batch normalization, and pooling layers. The initial layers learn low-level concepts such as edges and colors, and the later layers learn higher-level concepts such as different objects.²
Figure 1. Qualitative results on the CamVid dataset

Method mIoU fIoU Sky Building Pole Road Pavement Tree SignSymbol Fence Car Pedestrian Bicyclist Misc
MobileNet UNet 64.51 86.08 94.3 84.03 12.9 96.59 84.98 88.28 32.69 52.25 85.25 42.43 72.68 27.77
VGG UNet 59.59 84.13 93.22 81.28 9.98 95.76 81.4 89.59 33.99 57.0 69.12 30.69 50.89 22.18
ResNet50 UNet 60.38 85.89 94.01 85.42 8.54 96.71 86.91 89.63 44.21 65.93 55.98 38.0 33.76 25.45
SegNet 40.89 73.4 81.1 74.31 3.14 92.13 68.46 75.05 9.19 8.61 32.06 7.32 25.79 13.53
PSP Pretrained 65.88 86.5 93.15 86.29 12.03 94.87 81.89 90.04 51.53 71.65 81.72 37.39 59.52 30.43
ResNet50 PSPNet 51.42 80.08 90.15 79.15 5.17 92.96 75.24 86.68 25.14 51.06 47.33 20.08 23.24 20.82

Table 1. Quantitative results of semantic segmentation on the CamVid dataset.
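The metrics reported in the tables follow the standard intersection-over-union definitions (stated here for completeness; the formulation below is ours, not taken from the original text): per-class IoU, its unweighted mean over classes (mIoU), and its pixel-frequency-weighted mean (fIoU):

```latex
% Standard definitions: TP_c, FP_c, FN_c are the per-class true positives,
% false positives, and false negatives; n_c is the pixel count of class c.
\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}, \qquad
\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c, \qquad
\mathrm{fIoU} = \sum_{c=1}^{C}\frac{n_c}{\sum_{k=1}^{C} n_k}\,\mathrm{IoU}_c
```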

At a lower level, the neurons contain information for a small region of the image, whereas at a higher level the neurons contain information for a large region of the image. Thus, as we add more layers, the spatial size keeps decreasing and the number of channels keeps increasing. The downsampling is done by the pooling layers.

For the case of image classification, we need to map the spatial tensor from the convolutional layers to a fixed-length vector. To do that, fully connected layers are used, which destroy all the spatial information. For the task of semantic segmentation, we need to retain the spatial information, hence no fully connected layers are used; that is why these models are called fully convolutional networks. The convolutional layers coupled with downsampling layers produce a low-resolution tensor containing the high-level information.

Taking the low-resolution spatial tensor, which contains high-level information, we have to produce high-resolution segmentation outputs. To do that, we add more convolutional layers coupled with upsampling layers, which increase the size of the spatial tensor. As we increase the resolution, we decrease the number of channels, as we are getting back to the low-level information.

This is called an encoder-decoder structure, where the layers that downsample the input are part of the encoder and the layers that upsample it are part of the decoder. When the model is trained for the task of semantic segmentation, the encoder outputs a tensor containing information about the objects and their shapes and sizes. The decoder takes this information and produces the segmentation maps.

² Source: https://divamgupta.com/image-segmentation/2019/06/06/deep-learning-semantic-segmentation-keras.html
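To make this concrete, here is a minimal encoder-decoder sketch in Keras; it is our illustration of the structure described above, with layer widths chosen arbitrarily rather than taken from any model in the library:

```python
# Illustrative encoder-decoder FCN: pooling halves the spatial size while
# channels grow; upsampling restores the resolution while channels shrink.
from tensorflow.keras import layers, models

def build_simple_fcn(n_classes, height=224, width=224):
    inp = layers.Input(shape=(height, width, 3))

    # Encoder: convolutions + pooling produce a low-resolution,
    # high-level feature tensor.
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)   # H/2 x W/2
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)   # H/4 x W/4

    # Decoder: upsampling + convolutions recover the input resolution.
    x = layers.UpSampling2D(2)(x)   # H/2 x W/2
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D(2)(x)   # H x W
    out = layers.Conv2D(n_classes, 1, activation="softmax")(x)  # per-pixel classes

    return models.Model(inp, out)

model = build_simple_fcn(n_classes=11)  # e.g. the 11 CamVid classes
```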
Figure 2. Qualitative results on the sitting people dataset

Method mIoU fIoU BG Head Torso R-LwArm R-UpArm R-Hand L-LwArm L-UpArm L-Hand R-LwLeg R-UpLeg R-Foot L-LwLeg L-UpLeg L-Foot


VGG UNet 48.2 91.59 96.03 77.7 62.08 31.41 28.63 34.22 31.63 16.59 35.42 66.36 65.38 25.88 64.31 65.2 22.23
PSP Pretrained 62.19 93.95 97.02 83.18 76.42 43.5 63.2 49.05 47.1 57.39 51.15 79.15 73.54 27.82 77.05 80.94 26.31
ResNet50 UNet 62.16 94.2 97.25 84.28 77.04 37.36 55.47 40.14 45.42 50.7 51.5 79.36 68.84 40.6 81.58 76.37 46.46
ResNet50 PSPNet 44.83 91.15 95.51 77.11 74.71 12.91 43.1 18.38 18.66 42.64 22.04 52.06 64.59 24.04 45.72 64.6 16.39
SegNet 49.26 92.48 96.62 78.24 68.49 27.44 43.07 24.39 24.62 44.17 18.93 72.71 63.22 24.93 72.21 64.11 15.81
MobileNet UNet 58.6 93.8 97.06 78.91 80.23 34.41 43.31 44.2 38.39 52.19 49.76 72.7 68.8 30.8 77.81 76.59 33.78

Table 2. Quantitative results of semantic segmentation on the sitting people dataset.

2.2. Skip Connections

If we simply stack the encoder and decoder layers, there could be a loss of low-level information. Hence, the boundaries in the segmentation maps produced by the decoder could be inaccurate.

To make up for the information lost, we let the decoder access the low-level features produced by the encoder layers. That is accomplished by skip connections: intermediate outputs of the encoder are added or concatenated with the inputs to the intermediate layers of the decoder at the appropriate positions. The skip connections from the earlier layers provide the decoder layers with the information required for creating accurate boundaries.
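The sketch below extends the plain encoder-decoder example above with skip connections in the Keras functional API; as before, this is our illustrative example and the layer sizes are arbitrary:

```python
# Illustrative U-Net-style skip connections: encoder outputs are concatenated
# with decoder inputs at matching spatial resolutions.
from tensorflow.keras import layers, models

inp = layers.Input(shape=(224, 224, 3))

# Encoder, keeping intermediate outputs for the skip connections.
e1 = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)   # 224x224
p1 = layers.MaxPooling2D(2)(e1)                                     # 112x112
e2 = layers.Conv2D(128, 3, padding="same", activation="relu")(p1)   # 112x112
p2 = layers.MaxPooling2D(2)(e2)                                     # 56x56

# Decoder: each upsampled tensor is concatenated with the encoder
# features of the same resolution to recover low-level detail.
d1 = layers.UpSampling2D(2)(p2)                                     # 112x112
d1 = layers.concatenate([d1, e2])
d1 = layers.Conv2D(64, 3, padding="same", activation="relu")(d1)
d2 = layers.UpSampling2D(2)(d1)                                     # 224x224
d2 = layers.concatenate([d2, e1])
out = layers.Conv2D(11, 1, activation="softmax")(d2)

model = models.Model(inp, out)
```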
3. Experiments

In this section we compare various implementations of segmentation models on several datasets.

CamVid dataset: CamVid [2] is a unique dataset that provides pixel-level semantic labels for driving-scene videos, with annotations for 11 semantic classes. It offers over 10 minutes of high-quality footage, along with corresponding semantically labeled images and calibration sequences.

Sitting people dataset: The Human Part Segmentation dataset [7] by the University of Freiburg is specifically designed for semantic segmentation of sitting people. It comprises various human parts, such as hands, legs, and arms, and contains approximately 15 distinct classes.

SUIM dataset: The SUIM (Segmentation of Underwater IMagery) [5] dataset is a comprehensive collection of underwater imagery. It consists of more than 1500 images with pixel-level annotations for eight object categories, including fish (vertebrates), reefs (invertebrates), aquatic plants, wrecks/ruins, human divers, robots, and sea-floor. These images were gathered during oceanic explorations and human-robot collaborative experiments, and annotated by human participants.

We benchmark the following models:

SegNet: A standard encoder-decoder network, where the encoder produces a low-resolution feature map and the decoder predicts the segmentation classes at the input resolution. This configuration has no skip connections and no pretraining.

MobileNet UNet: Efficient and accurate semantic segmentation, combining MobileNet's [4] lightweight feature extraction with U-Net's precise pixel-wise predictions. It is well suited to real-time applications and resource-constrained environments.
Method mIoU fIoU Background Human Plants Wrecks Robots Reefs Fish Floor
SegNet 17.03 36.7 68.13 11.94 0.0 0.0 0.0 31.04 13.8 11.3
ResNet50 PSPNet 15.71 33.8 63.55 10.77 0.14 11.71 0.0 25.98 4.42 9.12
MobileNet UNet 31.38 55.96 85.0 24.65 0.0 6.4 0.02 45.3 39.44 50.21
VGG UNet 24.51 46.6 76.5 9.34 16.21 5.08 0.0 36.86 16.76 35.36
ResNet50 UNet 29.38 52.11 80.74 24.37 6.83 12.99 0.0 43.18 24.41 42.52
PSP Pretrained 24.03 49.21 76.41 0.22 0.0 11.99 0.0 43.9 14.11 45.64

Table 3. Quantitative results of semantic segmentation on the SUIM dataset.

Figure 3. Qualitative results on the SUIM dataset

VGG UNet: A powerful semantic segmentation network, leveraging VGG's [9] deep and expressive features for robust segmentation. It offers high-quality segmentation results at the expense of increased computational complexity.

ResNet50 UNet: Uses ResNet50's [3] deep residual blocks for highly accurate and detailed segmentation results. It balances computational efficiency and performance, making it suitable for various segmentation tasks.

PSP Pretrained: In this configuration we use a PSPNet model pre-trained on the ADE20K dataset. Here the model is pre-trained specifically on the semantic segmentation task.
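For reference, a sketch of loading and fine-tuning this pretrained model, following the pretrained-model interface documented in the library; the input path below is a placeholder:

```python
# Sketch of the library's documented pretrained-model interface;
# "input_image.jpg" is a placeholder path.
from keras_segmentation.pretrained import pspnet_50_ADE_20K
from keras_segmentation.models.pspnet import pspnet_50
from keras_segmentation.models.model_utils import transfer_weights

pretrained_model = pspnet_50_ADE_20K()  # fetches ADE20K-pretrained weights

# Direct inference with the pretrained model.
pretrained_model.predict_segmentation(inp="input_image.jpg", out_fname="out.png")

# Fine-tuning on a new label set: copy compatible weights into a fresh model.
new_model = pspnet_50(n_classes=12)
transfer_weights(pretrained_model, new_model)
```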
4. Results

The results for the CamVid, sitting people, and SUIM datasets are given in Tables 1, 2, and 3 respectively. For the CamVid and sitting people datasets, the pretrained PSPNet yields the best mIoU scores. For the SUIM dataset, MobileNet UNet gives the best results. Qualitative visualizations of the best models are shown in Figures 1, 2, and 3.

5. Conclusion

Our paper introduces a comprehensive library for semantic segmentation with implementations of well-known models such as SegNet, FCN, UNet, and PSPNet. The library empowers researchers and practitioners in the field of computer vision with a toolset to achieve pixel-level understanding of images. We have demonstrated the efficacy and robustness of these models, underscoring their applicability in addressing diverse segmentation applications.

6. Acknowledgements

We would like to thank all the contributors from the open-source community. We would also like to thank ChatGPT, which helped in the writing of this paper.
References
[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[2] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV (1), pages 44–57, 2008.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[4] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[5] Md Jahidul Islam, Chelsey Edge, Yuyang Xiao, Peigen Luo, Muntaqim Mehtaz, Christopher Morse, Sadman Sakib Enan, and Junaed Sattar. Semantic segmentation of underwater imagery: Dataset and benchmark. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1769–1776. IEEE, 2020.
[6] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[7] Gabriel L. Oliveira, Abhinav Valada, Claas Bollen, Wolfram Burgard, and Thomas Brox. Deep learning for human part discovery in images. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1634–1641. IEEE, 2016.
[8] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer, 2015.
[9] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[10] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
