Image Segmentation Keras: Implementation of Segnet, FCN, Unet, Pspnet and Other Models in Keras
Divam Gupta
Carnegie Mellon University
[email protected]
Abstract
1. Introduction
Semantic segmentation, an essential task in the field of computer vision, aims to assign a precise label to every pixel in an image. It has wide-ranging applications in autonomous driving, image and video analysis, medical imaging, and scene understanding. Over the years, deep learning approaches have achieved remarkable success in semantic segmentation.

In this paper, we present a comprehensive library for semantic segmentation 1, aimed at the machine learning community. We also compare various semantic segmentation models on multiple datasets. The library provides an easy-to-use interface and is built on the TensorFlow and Keras frameworks. It offers an extensive collection of easy-to-use models, including SegNet [1], FCN [6], UNet [8], and PSPNet [10], which are widely used networks for semantic segmentation.

The primary objective behind the development of this library is to provide a user-friendly and accessible platform for machine learning researchers and practitioners interested in semantic segmentation. Our library promotes modular and extensible design principles, allowing users to easily integrate, customize, and extend existing segmentation models to meet their specific needs.

To ensure ease of use, we have extensively documented the library, providing clear instructions and code examples. The documentation includes architectural details and training procedures. Additionally, we provide pre-trained weights, enabling users to quickly fine-tune the models on their own datasets.

1 The code is available at https://github.com/divamgupta/image-segmentation-keras

2. Networks for Semantic Segmentation

As in most other computer vision applications, using a CNN is the obvious choice for semantic segmentation. When using a CNN for semantic segmentation, however, the output is an image rather than a fixed-length vector.

2.1. Fully Convolutional Network

Usually, the architecture of the model contains several convolutional layers, non-linear activations, batch normalization, and pooling layers. The initial layers learn low-level concepts such as edges and colors, and the later layers learn higher-level concepts such as different objects.
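The defining property noted above, namely that a fully convolutional network outputs an image of per-pixel labels rather than a fixed-length vector, can be illustrated with a toy NumPy sketch. A 1x1 convolution is simply a per-pixel linear map, so spatial dimensions are preserved; all shapes and weights below are arbitrary illustration values, not part of the library:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))   # toy RGB image: height x width x channels

# A 1x1 "convolution" mapping 3 input channels to 4 class scores per pixel.
# Matrix-multiplying the last axis applies the same linear map at every pixel.
w = rng.random((3, 4))
scores = img @ w              # shape (8, 8, 4): class scores at every pixel

# Per-pixel argmax gives a label map with the same spatial layout as the input.
pred = scores.argmax(axis=-1)
print(pred.shape)             # → (8, 8): an image of labels, not a vector
```

Real networks interleave larger convolutions, pooling, and upsampling, but the output retains this per-pixel structure.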
Figure 1. Qualitative results on the CamVid dataset
Table 1. Per-class IoU, mean IoU (mIoU), and frequency-weighted IoU (fIoU) on the CamVid dataset.

| Method | mIoU | fIoU | Sky | Building | Pole | Road | Pavement | Tree | SignSymbol | Fence | Car | Pedestrian | Bicyclist | Misc |
| MobileNet UNet | 64.51 | 86.08 | 94.3 | 84.03 | 12.9 | 96.59 | 84.98 | 88.28 | 32.69 | 52.25 | 85.25 | 42.43 | 72.68 | 27.77 |
| VGG UNet | 59.59 | 84.13 | 93.22 | 81.28 | 9.98 | 95.76 | 81.4 | 89.59 | 33.99 | 57.0 | 69.12 | 30.69 | 50.89 | 22.18 |
| ResNet50 UNet | 60.38 | 85.89 | 94.01 | 85.42 | 8.54 | 96.71 | 86.91 | 89.63 | 44.21 | 65.93 | 55.98 | 38.0 | 33.76 | 25.45 |
| SegNet | 40.89 | 73.4 | 81.1 | 74.31 | 3.14 | 92.13 | 68.46 | 75.05 | 9.19 | 8.61 | 32.06 | 7.32 | 25.79 | 13.53 |
| PSP Pretrained | 65.88 | 86.5 | 93.15 | 86.29 | 12.03 | 94.87 | 81.89 | 90.04 | 51.53 | 71.65 | 81.72 | 37.39 | 59.52 | 30.43 |
| ResNet50 PSPNet | 51.42 | 80.08 | 90.15 | 79.15 | 5.17 | 92.96 | 75.24 | 86.68 | 25.14 | 51.06 | 47.33 | 20.08 | 23.24 | 20.82 |
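The mIoU and fIoU columns are derived from a per-class confusion matrix: each class's IoU is TP / (TP + FP + FN), mIoU is the unweighted mean over classes, and fIoU weights each class's IoU by its pixel frequency. A minimal sketch in plain Python (the 3-class confusion matrix here is made up purely for illustration):

```python
# Hypothetical confusion matrix: conf[i][j] = number of pixels with
# ground-truth class i that were predicted as class j.
conf = [
    [50, 5, 5],   # class 0: 60 ground-truth pixels
    [2, 30, 8],   # class 1: 40 ground-truth pixels
    [3, 7, 90],   # class 2: 100 ground-truth pixels
]
n = len(conf)
total = sum(sum(row) for row in conf)

ious, freqs = [], []
for c in range(n):
    tp = conf[c][c]
    fn = sum(conf[c]) - tp                        # class c missed
    fp = sum(conf[r][c] for r in range(n)) - tp   # others predicted as c
    ious.append(tp / (tp + fp + fn))
    freqs.append(sum(conf[c]) / total)            # pixel frequency of class c

miou = sum(ious) / n                              # mean IoU (unweighted)
fiou = sum(f * i for f, i in zip(freqs, ious))    # frequency-weighted IoU
print(round(miou, 4), round(fiou, 4))             # → 0.7142 0.7444
```

Because fIoU weights classes by pixel count, it is dominated by large classes such as Sky and Road, which is why it sits well above mIoU in the tables.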
The outputs of the encoder are added/concatenated with the inputs to the intermediate layers of the decoder at appropriate positions. These skip connections from the earlier layers provide the decoder layers with the information required for producing accurate boundaries.

3. Experiments

In this section we compare various implementations of segmentation models on several datasets.

CamVid dataset: CamVid [2] is a unique dataset that provides pixel-level semantic labels for driving-scenario videos, with annotations for 11 semantic classes. It offers over 10 minutes of high-quality footage, along with corresponding semantically labeled images and calibration sequences.

Sitting people dataset: The Human Part Segmentation dataset [7] by the University of Freiburg is specifically designed for semantic segmentation of sitting people. It covers various human parts, such as hands, legs, and arms, and contains approximately 15 distinct classes.

SUIM dataset: The SUIM [5] dataset is a comprehensive collection of underwater imagery. It consists of more than 1500 images with pixel-level annotations for eight object categories, including fish (vertebrates), reefs (invertebrates), aquatic plants, wrecks/ruins, human divers, robots, and the sea floor. These images were gathered during oceanic explorations and human-robot collaborative experiments, and annotated by human participants.

We benchmark the following models:

SegNet: A standard encoder-decoder network, where the encoder produces a feature map and the decoder predicts the segmentation classes at the input resolution. This model has no skip connections and no pretraining.

MobileNet UNet: Efficient and accurate semantic segmentation, combining MobileNet's [4] lightweight feature extraction with U-Net's precise pixel-wise predictions. Ideal for real-time applications and resource-constrained environments.
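As an illustration of how such models are instantiated and trained with the library, the sketch below mirrors the usage pattern in the project README. It is a sketch rather than a verified listing: the dataset paths are placeholders, the class count is an assumption (11 CamVid classes plus a background class), and exact argument names should be checked against the library's documentation. It requires the library and a prepared dataset to run.

```
# Sketch based on the library's README; paths and n_classes are placeholders.
from keras_segmentation.models.unet import vgg_unet

model = vgg_unet(n_classes=12, input_height=416, input_width=608)

model.train(
    train_images="dataset/images_train/",        # hypothetical dataset layout
    train_annotations="dataset/annotations_train/",
    checkpoints_path="/tmp/vgg_unet_camvid",
    epochs=5,
)

# Predict a segmentation mask for a single image
out = model.predict_segmentation(
    inp="dataset/images_test/img_001.png",       # placeholder filename
    out_fname="out.png",
)
```

Swapping `vgg_unet` for another model constructor (e.g. a MobileNet- or ResNet50-based UNet) changes the encoder while keeping the same training interface.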
Table 3. Per-class IoU, mean IoU (mIoU), and frequency-weighted IoU (fIoU) on the SUIM dataset.

| Method | mIoU | fIoU | Background | Human | Plants | Wrecks | Robots | Reefs | Fish | Floor |
| SegNet | 17.03 | 36.7 | 68.13 | 11.94 | 0.0 | 0.0 | 0.0 | 31.04 | 13.8 | 11.3 |
| ResNet50 PSPNet | 15.71 | 33.8 | 63.55 | 10.77 | 0.14 | 11.71 | 0.0 | 25.98 | 4.42 | 9.12 |
| MobileNet UNet | 31.38 | 55.96 | 85.0 | 24.65 | 0.0 | 6.4 | 0.02 | 45.3 | 39.44 | 50.21 |
| VGG UNet | 24.51 | 46.6 | 76.5 | 9.34 | 16.21 | 5.08 | 0.0 | 36.86 | 16.76 | 35.36 |
| ResNet50 UNet | 29.38 | 52.11 | 80.74 | 24.37 | 6.83 | 12.99 | 0.0 | 43.18 | 24.41 | 42.52 |
| PSP Pretrained | 24.03 | 49.21 | 76.41 | 0.22 | 0.0 | 11.99 | 0.0 | 43.9 | 14.11 | 45.64 |
VGG UNet: A powerful semantic segmentation network, leveraging VGG's [9] deep and expressive features for robust segmentation. It offers high-quality segmentation results at the expense of increased computational complexity.

ResNet50 UNet: Uses ResNet50's [3] deep residual blocks for highly accurate and detailed segmentation results. It balances computational efficiency and performance, making it suitable for various segmentation tasks.

PSP Pretrained: Here we use a PSPNet model pre-trained on the ADE20K dataset; that is, the model is pre-trained specifically on the semantic segmentation task.

4. Results

The results for the CamVid, sitting people, and SUIM datasets are in Tables 1, 2, and 3 respectively. For the CamVid and sitting people datasets, the pretrained PSPNet yields the best mIoU scores. For the SUIM dataset, MobileNet UNet gives the best results. Qualitative visualizations of the best models are shown in Figures 1, 2, and 3.

5. Conclusion

Our paper introduces a comprehensive library for semantic segmentation covering well-known models such as SegNet, FCN, UNet, and PSPNet. The library empowers researchers and practitioners in the field of computer vision with a toolset for achieving pixel-level understanding of images. We have demonstrated the efficacy and robustness of these models, underscoring their applicability to diverse segmentation applications.

6. Acknowledgements

We would like to thank all the contributors from the open-source community. We would also like to thank ChatGPT, which helped in the writing of this paper.
References
[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[2] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV (1), pages 44–57, 2008.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[4] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[5] Md Jahidul Islam, Chelsey Edge, Yuyang Xiao, Peigen Luo, Muntaqim Mehtaz, Christopher Morse, Sadman Sakib Enan, and Junaed Sattar. Semantic segmentation of underwater imagery: Dataset and benchmark. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1769–1776. IEEE, 2020.

[6] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[7] Gabriel L. Oliveira, Abhinav Valada, Claas Bollen, Wolfram Burgard, and Thomas Brox. Deep learning for human part discovery in images. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1634–1641. IEEE, 2016.

[8] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Part III, pages 234–241. Springer, 2015.

[9] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[10] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.