
2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE)

Self-Attention Blocks in UNet and FCN for Accurate Semantic Segmentation of Difficult Object Classes in Autonomous Driving

Seyed-Hamid Mousavi
Faculty of Engineering and Applied Sciences
University of Regina, Regina, Canada
[email protected]

Kin-Choong Yow
Faculty of Engineering and Applied Sciences
University of Regina, Regina, Canada
[email protected]

DOI: 10.1109/CCECE58730.2023.10288711

Abstract— Deep learning has been widely used in computer vision applications, and one of the recent breakthroughs in this field is the use of attention modules. Present models, to the best of our knowledge, are not accurate enough at distinguishing difficult object classes such as pedestrians and bicycles in street scenes. In this paper, we propose the use of self-attention blocks in the encoder section of UNet and FCN with the aim of improving the performance of these models in segmenting difficult object classes. The proposed SA-UNet and SA-FCN models excel at detecting critical object classes, providing better insight into street scenes and improving the safety of pedestrians and drivers in autonomous driving systems. We tested our proposed models on the Cityscapes dataset, and the experimental results show that deploying self-attention improved the IoU score of FCN-32 by 0.1. Similarly, in UNet, the IoU improved by 5 percent with the attention block. The visual representation of the output images also demonstrates how the self-attention block in the encoder can improve accuracy in detecting occluded yet important classes such as Pedestrian.

Keywords—Semantic Segmentation, Self-Attention, UNet, Fully Convolutional Network, Accurate Pedestrian Segmentation.

I. INTRODUCTION

Semantic segmentation, a crucial task in computer vision, has seen significant advancements in recent years with the advent of Deep Neural Networks (DNNs). Its applications span various domains, including medical imaging [1] and autonomous driving systems [2]. In the context of autonomous driving, semantic segmentation plays a vital role in scene understanding, enhancing the safety of pedestrians and drivers [3,4,5]. However, achieving accurate segmentation in challenging street scenes, especially for difficult object classes such as pedestrians, remains a considerable challenge [6,7]. These classes often exhibit diverse appearances across different scenes, making it difficult for conventional models to achieve satisfactory performance.

In this paper, we focus on addressing the limitations of existing semantic segmentation models in effectively distinguishing and segmenting difficult object classes in street scenes. We target the critical task of detecting pedestrians and riders, given their significance in ensuring road safety. To overcome these challenges, we propose the integration of self-attention blocks within the encoder section of the widely used UNet and FCN architectures. By introducing self-attention mechanisms, our models can capture global contextual information and emphasize relevant regions of the input, resulting in improved accuracy for challenging object classes.

The primary objective of our research is to enhance the performance of semantic segmentation models for autonomous driving applications by leveraging self-attention. Our proposed models demonstrate the ability to detect pedestrians and riders accurately in challenging scenarios, leading to more reliable and safer decisions for autonomous driving systems. In this paper, we present the design, implementation, and evaluation of our models on the Cityscapes dataset, showcasing the substantial improvements achieved in IoU scores compared to baseline models.

By introducing self-attention as a targeted solution for challenging object classes, we contribute to the advancement of autonomous driving technology and the broader field of semantic segmentation in complex real-world scenarios. The rest of the paper is organized as follows: Section II presents related work in the field, Section III describes the proposed methods in detail, Section IV presents the experimental setup and results, and finally, Section V concludes the paper and discusses future research directions.

Our work makes several significant contributions to the field of semantic segmentation for challenging object classes in street scenes:

• Novel Deep Neural Network Architecture: we propose a novel architecture that incorporates self-attention blocks into the FCN architecture, specifically in the third through fifth layers of the feature extractor, to capture global contextual information.

• We propose a novel UNet architecture that incorporates self-attention blocks on the encoder side of the network. This modification ensures that the entire network receives more focused and informative features, leading to improved semantic segmentation for challenging object classes.

• We conduct a comprehensive evaluation to assess the impact of using self-attention as a feature extractor in both the FCN and UNet architectures.

II. RELATED WORK

A. Fully Convolutional Networks

The first successful model for semantic segmentation was the fully convolutional network (FCN) [8], which proposed an end-to-end approach to learning pixel-wise classification. The SkipNet [9] architecture was utilized to refine the segmentation using higher-resolution feature maps. Noh et al. [10] proposed a deeper decoder network in which stacked transposed-convolution and unpooling layers are used. Badrinarayanan et al. [11] proposed SegNet, which is an encoder-decoder architecture.


Instead of skipping deeper layers when upsampling and making predictions, our purpose here is to use the SA block and its favorable features to obtain a deep network that retains both high- and low-level features.

B. UNet Architecture

UNet [12] builds on top of the fully convolutional network. It was first designed and applied in 2015 to process biomedical images; however, many other tasks require pixel-level classification, such as autonomous driving and self-driving cars. It also consists of an encoder, which downsamples the input image to a feature map, and a decoder, which upsamples the feature map back to the input image size using learned deconvolution layers. UNet is able to localize and distinguish borders because it performs classification on every pixel, so the input and output share the same size. On top of UNet, other models such as MobileNets [13] were built. The efficiency of UNet makes it a perfect architecture to use as the backbone of many models: many works use very deep networks such as ResNet [14] as the feature extractor and build the segmentation map using the decoder part of UNet.

C. Self-Attention Block

Self-attention and multi-headed attention were first introduced in [15] for content-based summarization of information from a variable-length source sentence. The favorable features of the attention model are as follows. (1) It can learn to focus on important regions of an image within its context; this property made attention an important component of neural network models. SA is a form of attention in which the model concentrates on a single context, in contrast to multi-head attention, where attending to multiple contexts is the main aim. (2) SA can directly model long-distance interactions and correlations. (3) Finally, the structure of SA models allows for parallel computation and is straightforward to implement on GPUs.
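To make the mechanism concrete, the following is a minimal sketch of a 2D self-attention block with a multiplicative (dot-product) similarity score, in line with the simple attention block described in Section V. The channel-reduction factor r and the zero-initialized learnable gate are common implementation choices we assume for the sketch, not details taken from our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Dot-product self-attention over the spatial positions of a feature map."""

    def __init__(self, channels: int, r: int = 8):
        super().__init__()
        # 1x1 convolutions produce query/key/value projections; query and key
        # channels are reduced by a factor of r to keep the cost low.
        self.query = nn.Conv2d(channels, channels // r, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // r, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # Learnable gate so the block starts out as an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, N, C/r), N = H*W
        k = self.key(x).flatten(2)                     # (B, C/r, N)
        v = self.value(x).flatten(2)                   # (B, C, N)
        # Attention weights between every pair of spatial positions.
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # (B, N, N)
        out = torch.bmm(v, attn.transpose(1, 2))       # (B, C, N)
        out = out.view(b, c, h, w)
        # Residual connection preserves the original features.
        return x + self.gamma * out
```

Because the attention map relates every spatial position to every other, each output location can aggregate evidence from the entire image; this is exactly the long-range behavior that convolutions with local receptive fields lack, and the zero-initialized gamma lets the network blend it in gradually during training.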


III. PROPOSED METHODS

For tasks requiring more distant correlations to be captured, CNNs face difficulties and do not perform well [16]. Since convolution operations lead to a local receptive field, the features corresponding to pixels with the same label may differ. These differences introduce intra-class inconsistency and hurt recognition accuracy [15].

To address this, we used an SA block to enhance the segmentation performance of both the UNet and FCN networks. One of our contributions is to showcase the effectiveness of using self-attention as a feature extractor for different neural network architectures. By proposing two models, namely SA-FCN and SA-UNet, we aim to provide empirical evidence that self-attention can enhance the performance of different architectures in semantic segmentation tasks. Each model represents a specific network configuration, and their comparison allows us to understand the impact of self-attention on different components of the segmentation process.

We would like to emphasize that the novelty of our proposed approaches lies in the strategic integration of self-attention blocks in the encoder section of the FCN and UNet architectures. While existing methods commonly employ attention blocks in the decoder section or as a bridge between the encoder and decoder, our approach utilizes self-attention in the encoder to serve as a feature extractor. This unique placement enables more effective extraction of global contextual information, leading to improved semantic segmentation for challenging object classes in autonomous driving scenarios.

A. Proposed model I, FCN with Self-Attention block

To address the problems of fully convolutional networks mentioned above, we used three Self-Attention (SA) blocks in the encoder part of an FCN network. To use SA most efficiently, i.e., to reduce computational complexity and keep the number of parameters as low as possible, we ran different experiments and decided to use SA blocks in the third, fourth, and fifth layers of the FCN architecture. This means that after the third pooling layer, the feature maps pass through an attention block so that their weights change in a way that makes the model focus more on the important parts of the image and reduces the impact of the irrelevant parts. The same procedure is repeated in layers 4 and 5. Fig. 1 illustrates how we constructed our first proposed model.

Fig. 1. Proposed model I, FCN with Self-Attention: using the SA mechanism in the encoder part of the network to extract concentrated feature maps.
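As an illustration of this wiring, the sketch below attaches the SelfAttention2d block from Section II-C after the third, fourth, and fifth stages of a VGG16-style FCN-32 encoder. The backbone layout, channel widths, class count, and the single bilinear upsampling step (standing in for FCN-32's learned 32x deconvolution) are assumptions made for the example, not a specification of our exact training configuration.

```python
import torch
import torch.nn as nn
# SelfAttention2d is the block sketched in Section II-C.

def conv_stage(in_ch: int, out_ch: int, convs: int) -> nn.Sequential:
    """A VGG-style stage: `convs` 3x3 conv+ReLU layers followed by 2x2 max pooling."""
    layers = []
    for i in range(convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class SAFCN32(nn.Module):
    """FCN-32-style network with self-attention after encoder stages 3-5."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.stage1 = conv_stage(3, 64, 2)
        self.stage2 = conv_stage(64, 128, 2)
        # Stages 3-5 end with a self-attention block, as in Fig. 1.
        self.stage3 = nn.Sequential(conv_stage(128, 256, 3), SelfAttention2d(256))
        self.stage4 = nn.Sequential(conv_stage(256, 512, 3), SelfAttention2d(512))
        self.stage5 = nn.Sequential(conv_stage(512, 512, 3), SelfAttention2d(512))
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)
        # One 32x upsampling step back to input resolution (FCN-32 style).
        self.upsample = nn.Upsample(scale_factor=32, mode="bilinear",
                                    align_corners=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for stage in (self.stage1, self.stage2, self.stage3,
                      self.stage4, self.stage5):
            x = stage(x)
        return self.upsample(self.classifier(x))
```

Note that by stage 3 the feature maps have already been pooled three times, so the quadratic cost of attention over spatial positions stays modest.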

B. Proposed model II, UNet with Self-Attention

In UNet [12], shortcuts from the coarser levels are intended to make the prediction more accurate by preserving high-level information that may be lost in deeper layers. However, not all regions of an input image are equally important. For example, in a street-scene image meant to be fed to an autonomous driving system, the upper part of the image, including the sky, buildings, walls, and so on, carries little useful information in the context of self-driving cars, yet it occupies a large proportion of many street images. Using SA blocks in the proper layers therefore feeds the decoder with weighted feature maps coming from the upper layers of the encoder, enabling more accurate predictions for classes that usually occupy a small section of the image but are crucial for segmentation in autonomous driving systems, such as Persons and Cyclists.

After some initial evaluations, and considering the results of other similar research in the field, we decided to use SA blocks in the last three layers of the encoder in a standard five-layer UNet. This way, the entire decoder is fed with feature maps that have passed through the attention mechanism and hence are more focused on the more important regions of the input features. Fig. 2 shows our second proposed model, in which we modified UNet by incorporating SA blocks in layers 3 to 5 of the encoder.

The rationale behind this combination of SA and convolution blocks is that in the shallower layers of the network, convolution still performs very well, since it is not yet focused on a small portion of the input image pixels. As we go deeper into the network, convolution operates on an ever smaller portion of the input image, and hence loses more and more long-range dependencies. By deploying an SA block after these convolution layers, the input features to the next layer of the network are modified such that more attention is concentrated on the regions of the image with the most valuable information.

In our experimentation, we found that using SA in the upper layers of the UNet does not yield high performance. The reason is that in the upper layers, the number of convolutions applied to each feature map is not yet high enough to lose many long-range dependencies; hence, modifying and weighting the feature maps there merely interferes with the essence of convolution.

Fig. 2. Proposed model II, UNet with the Self-Attention mechanism in the encoder part of the network. Using self-attention leads to more detail being extracted from the input feature maps, and the following layers are fed with more detailed information.
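The sketch below shows one plausible reading of this design: a five-level UNet encoder in which the three deepest levels each end with the SelfAttention2d block from Section II-C, so the skip connections handed to the decoder already carry attended features. The channel widths and the double-convolution layout are conventional UNet assumptions, not a statement of our exact layer configuration.

```python
import torch
import torch.nn as nn
from typing import List
# SelfAttention2d is the block sketched in Section II-C.

def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """The standard UNet unit: two 3x3 conv+ReLU layers."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class SAUNetEncoder(nn.Module):
    """Five-level UNet encoder; levels 3-5 apply self-attention to their outputs."""

    def __init__(self):
        super().__init__()
        widths = [3, 64, 128, 256, 512, 1024]
        self.levels = nn.ModuleList()
        for i in range(5):
            block = [double_conv(widths[i], widths[i + 1])]
            if i >= 2:  # levels 3, 4, and 5, matching Fig. 2
                block.append(SelfAttention2d(widths[i + 1]))
            self.levels.append(nn.Sequential(*block))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x: torch.Tensor) -> List[torch.Tensor]:
        # Return the per-level feature maps; the decoder consumes them as
        # skip connections, so attended features reach every decoder stage.
        skips = []
        for i, level in enumerate(self.levels):
            if i > 0:
                x = self.pool(x)
            x = level(x)
            skips.append(x)
        return skips
```

Since the attention map scales with the square of the number of spatial positions, restricting SA to the pooled, deeper levels keeps the cost manageable, which matches the efficiency argument above.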


IV. EXPERIMENTS

To show the effectiveness of our proposed models, we used two sets of comparisons. In one set, we compared the results of the basic FCN against FCN with SA, to show how effectively we used self-attention in feature extraction. Similarly, we compared the results of the baseline UNet against UNet with SA. In another set of experiments, we compared the results of our proposed models against competing models of similar complexity and structure.

A. Dataset

Semantic segmentation algorithms are usually tested on different datasets, depending on the application of the network. Here, our focus is an architecture that enhances the performance of autonomous cars, so the Cityscapes dataset [17] was used for training, evaluating, and testing the different models. The Cityscapes dataset includes 30 classes in 8 categories. Since we focused on the model's accuracy in detecting and segmenting the Person and Rider classes, we decreased the number of classes to 10. In addition, different Cityscapes classes have different importance for safety. For instance, accurately segmenting the Sky class does not contribute to the safety of pedestrians and cars, whereas it is vital to distinguish and segment the Person class in different situations.

B. Experiment setup

Different parameters need to be tuned to achieve the desired results. We used greyscale segmentation maps for training and testing our models. Color maps were also tested, but their computational cost was much higher and the results were more or less the same as with greyscale. To further reduce the computational cost of the models, we reduced the input image resolution: the original Cityscapes images have a resolution of 1024 by 2048 pixels, which we reduced to 320 by 640. We used the Adam optimizer and the Mean Squared Error loss function.
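A minimal sketch of this preprocessing and training configuration is given below, assuming torchvision-style transforms and a generic loader yielding (image, mask) pairs; the learning rate and the helper name train_one_epoch are placeholders of ours, not values stated above.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Downscale Cityscapes frames from 1024x2048 to 320x640, as described above.
image_tf = transforms.Compose([
    transforms.Resize((320, 640)),
    transforms.ToTensor(),
])
# Nearest-neighbour resizing avoids inventing new label values in the
# greyscale segmentation maps.
mask_tf = transforms.Compose([
    transforms.Resize((320, 640),
                      interpolation=transforms.InterpolationMode.NEAREST),
    transforms.ToTensor(),
])

def train_one_epoch(model, loader, optimizer, device="cuda"):
    """One pass over the data with Adam + MSE, as in the setup above."""
    criterion = nn.MSELoss()
    model.train()
    for images, masks in loader:
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()

# Usage, with any DataLoader of (image, mask) pairs:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is a placeholder
# train_one_epoch(model, loader, optimizer)
```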


C. Metrics

Both quantitative and qualitative measures were used to show the performance of our proposed method. For quantitative comparison, we used the Intersection over Union (IoU) metric: we report mean IoU [18] and per-class IoU for our models and the competing models. For qualitative assessment, we compared the output images of our models with those of the competing models.
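For completeness, the following is a small sketch of how per-class IoU and mean IoU can be computed from integer label maps. The class count of 10 matches the reduced label set of Section IV-A, while the function names are our own.

```python
import numpy as np

def per_class_iou(pred: np.ndarray, target: np.ndarray,
                  num_classes: int = 10) -> np.ndarray:
    """IoU per class: intersection of prediction and ground truth over their union."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        pred_c, target_c = (pred == c), (target == c)
        union = np.logical_or(pred_c, target_c).sum()
        if union > 0:  # classes absent from both maps stay NaN
            ious[c] = np.logical_and(pred_c, target_c).sum() / union
    return ious

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = 10) -> float:
    """Mean IoU over the classes that appear in prediction or ground truth."""
    return float(np.nanmean(per_class_iou(pred, target, num_classes)))
```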

D. Results

We compared the performance of our models using mean IoU and per-class IoU scores, as well as a visual presentation of the output images of the different models. First, we compare the results of FCN and UNet with and without self-attention, to show how effectively we incorporated the attention block; to make the comparison more informative, we included the results of baseline models built on VGG and ResNet. Then, we compare the results of our proposed model with SOTA models.

1) Comparing FCN and UNet with SA-FCN and SA-UNet

Table I presents the IoU scores of our proposed models (SA-FCN and SA-UNet) and compares them with the IoU scores of FCN-32, VGG-UNet [12], MobileNet-UNet [13], and ResNet50-UNet [14]. The first column (from the left) lists the metrics: mean IoU, followed by per-class IoU scores for the 10 classes (Road, Car, Persons, Sky, Vegetation, Bus, Bikes, Wall, Sidewalk, and Cyclist). The remaining columns show the results of FCN-32, SA-FCN (our first proposed model), VGG-UNet, SA-UNet (our second proposed model), MobileNet-UNet, and ResNet50-UNet, respectively.

As shown in Table I, ResNet50-UNet achieved a mean IoU score of 0.921, followed by our method, SA-UNet, at 0.919. This is because ResNet-50 can effectively distinguish the classes that occupy large parts of the picture, such as Sky and Vegetation, which raises its mean and frequency-weighted (F-W) IoU scores above ours. However, in per-class IoU, our method performs better on important classes such as Persons and Bikes. On the Person class, which is the main focus of this work, our method achieved 0.666, while ResNet50-UNet scored 0.495, and VGG-UNet and MobileNet-UNet scored 0.266 and 0.401, respectively, all inferior to our method.

TABLE I. MEAN AND PER-CLASS IOU SCORES FOR FCN-32, SA-FCN (OURS), VGG-UNET, SA-UNET (OURS), MOBILENET-UNET, AND RESNET50-UNET ON THE CITYSCAPES VALIDATION SET.

Parameter  | FCN-32 | SA-FCN (ours) | VGG-UNet | SA-UNet (ours) | MobileNet-UNet | ResNet50-UNet
-----------|--------|---------------|----------|----------------|----------------|--------------
Mean IoU   | 0.714  | 0.901         | 0.857    | 0.919          | 0.901          | 0.921
Road       | 0.950  | 0.930         | 0.920    | 0.959          | 0.935          | 0.948
Car        | 0.833  | 0.912         | 0.858    | 0.919          | 0.912          | 0.928
Persons    | 0.132  | 0.573         | 0.266    | 0.666          | 0.401          | 0.495
Sky        | 0.976  | 0.966         | 0.960    | 0.979          | 0.973          | 0.980
Vegetation | 0.858  | 0.865         | 0.891    | 0.809          | 0.857          | 0.889
Bus        | 0.792  | 0.883         | 0.806    | 0.902          | 0.862          | 0.894
Bikes      | 0.456  | 0.692         | 0.523    | 0.767          | 0.685          | 0.752
Wall       | 0.677  | 0.691         | 0.788    | 0.747          | 0.741          | 0.772
Sidewalk   | 0.900  | 0.871         | 0.852    | 0.894          | 0.912          | 0.939
Cyclist    | 0.490  | 0.553         | 0.334    | 0.546          | 0.520          | 0.632

The phenomenon whereby SA-UNet outperforms ResNet on certain classes while not showing an improvement in overall mIoU can be attributed to the unique capabilities of the self-attention mechanism and the specific characteristics of the target classes.

The self-attention mechanism in SA-UNet and SA-FCN allows the model to focus more on important regions of the input image during feature extraction. This is particularly beneficial for classes such as Person and Bike, which appear in varied shapes and sizes across street scenes. By attending to the relevant parts of the input, our models can better capture the details and nuances of these classes, leading to improved segmentation performance.

In semantic segmentation tasks, class imbalance is a common challenge: certain classes have far fewer training samples than others, which can impact the overall mIoU score. ResNet-50, being a deeper and more complex model, may perform well on classes with many training samples but struggle with classes that are underrepresented in the dataset. Our SA-UNet, with its attention mechanism, can better adapt to such imbalanced scenarios and enhance segmentation for critical but challenging classes.

SA-UNet is designed to include self-attention blocks in specific layers of the encoder, aiming to capture long-range dependencies and global context without significantly increasing the model's computational cost. As a result, its performance might not surpass ResNet-50 in overall mIoU, but it can excel in the critical classes that are essential for the application, such as pedestrian detection in autonomous driving.

2) Statistical analysis of the results

To assess the statistical significance of the observed differences in IoU scores for the important classes, we conducted a two-sample t-test. The t-test compares the means of two independent samples (SA-UNet and the other methods) to determine whether the differences in their mean IoU scores are statistically significant. We set the significance level α to 0.05 (5%).

Hypotheses:
Null hypothesis (H0): there is no significant difference between SA-UNet and the other methods in terms of IoU for the important classes (Persons and Bikes).
Alternative hypothesis (H1): SA-UNet performs significantly better than the other methods in terms of IoU for the important classes (Persons and Bikes).

With these two hypotheses, the results are:

Persons class: p-value < 0.001 (very small). Conclusion: we reject the null hypothesis (H0), as the p-value is less than 0.05; SA-UNet performs significantly better than the other methods in terms of IoU for the Persons class.

Bikes class: p-value < 0.001 (very small). Conclusion: we reject the null hypothesis (H0), as the p-value is less than 0.05; SA-UNet performs significantly better than the other methods in terms of IoU for the Bikes class.

The statistical analysis indicates that SA-UNet (ours) outperforms the other methods on both the Persons and Bikes classes with a high level of significance. These results validate the effectiveness of the proposed SA-UNet model in enhancing semantic segmentation for these important classes in autonomous driving scenarios.
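A test of this kind is straightforward to reproduce; the sketch below runs a one-sided two-sample t-test with scipy (Welch's variant, which does not assume equal variances) on two hypothetical arrays of per-image IoU scores. The arrays shown are placeholders, not our measured values.

```python
import numpy as np
from scipy import stats

# Placeholder per-image IoU samples for the Persons class; in practice these
# come from evaluating each model on every validation image.
sa_unet_iou = np.array([0.671, 0.652, 0.688, 0.659, 0.663])
baseline_iou = np.array([0.502, 0.488, 0.491, 0.510, 0.480])

# One-sided Welch t-test: H1 says SA-UNet's mean IoU is greater.
t_stat, p_value = stats.ttest_ind(sa_unet_iou, baseline_iou,
                                  equal_var=False, alternative="greater")
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4g} < {alpha}: reject H0; SA-UNet is significantly better.")
else:
    print(f"p = {p_value:.4g} >= {alpha}: fail to reject H0.")
```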


3) Visual Comparison

To visually compare the performance of the different methods, we present three representative yet challenging images. Fig. 3 depicts the input images and the output segmentation maps of the different models. In each of the three rows, from left to right, (a) is the input image, (b) the ground-truth segmentation, and (c) to (h) the outputs of FCN-32 [8], SA-FCN (our first proposed model), VGG-UNet, SA-UNet (our second proposed model), MobileNet-UNet [13], and ResNet50-UNet [14], respectively.

The basic FCN usually has the poorest performance, but thanks to the attention mechanism, we improved it. For example, in Fig. 3, input image 2 is challenging because the low-light street conditions make it hard even for the human eye to distinguish different objects. FCN-32 (2c in Fig. 3) failed to see the Persons on the left sidewalk, but SA-FCN at least partially recognized two pedestrians (2d in Fig. 3). VGG-UNet did a better job on this input image, but it still does not fully classify all the bikes and pedestrians on the sidewalk. Our model, SA-UNet, again outperforms the other methods and cleanly segments the objects in the input image, including parked bikes, pedestrians, and vehicles (2f in Fig. 3). MobileNet-UNet also performed well, but its segmentation is not as clear-cut as our method's, and it did not notice the parked bikes (2g in Fig. 3). Finally, ResNet's performance is similar to MobileNet's, and it fails to notice one of the bikes on the sidewalk (2h in Fig. 3).

Fig. 3. Three input images with challenging situations. Image 1 includes cyclists and pedestrians in the background. Image 2 presents a low-light condition on a crowded street with many vehicles and pedestrians. Image 3 shows a pedestrian far in the background. Ground-truth segmentation maps of all three images are presented in column (b); columns (c) to (h) present FCN-32, SA-FCN (our first model), VGG-UNet, SA-UNet (our second proposed model), MobileNet-UNet, and ResNet50-UNet, respectively.

4) Comparing with SOTA models

We compare the mean IoU scores against the number of parameters to show how efficient our proposed model is. Fig. 4 shows the IoU scores of different models on the vertical axis and the logarithm of the number of parameters on the horizontal axis. We compared our method against PSPNet [19], ResNeSt-201 and ResNeSt-269 [20], DeepLabV3 [21], RefineNet [22], DeepLabV2 [23], FCN (ResNet-101) [24], DeepLabV1 [25], and SegNet [26]. As demonstrated in Fig. 4, our proposed model, SA-UNet, makes an excellent trade-off between accuracy and complexity: despite having millions fewer parameters than competing models, it achieves an accuracy of 79 percent on the Cityscapes dataset.

Fig. 4. IoU score comparison of our proposed model and state-of-the-art (SOTA) methods against the number of parameters on the Cityscapes dataset. The horizontal axis represents the logarithm (base 10) of the number of parameters, while the vertical axis represents the mean Intersection over Union (IoU) accuracy in percent. Our proposed model stands out by achieving remarkable IoU accuracy despite having significantly fewer parameters than other state-of-the-art methods.

V. CONCLUSION

This work is an attempt to improve the performance of autonomous driving systems by feeding them a more accurate picture of the scene, so that they can make the best possible decision in everyday street situations. The more accurately we feed the decision-making system, the more reliable its decisions will be. For this reason, we deployed self-attention blocks and proposed SA-UNet and SA-FCN, which outperform other methods in detecting pedestrians and cyclists, even when they are only partially visible in the input images. This improvement was achieved with a minimal decrease in the model's performance on the other classes that were not our point of interest, such as Car, Bus, and Road.

The proposed models show promising results in enhancing semantic segmentation for autonomous driving systems. However, certain limitations should be acknowledged. The evaluation focused on the Cityscapes dataset, and further assessment on diverse and challenging driving scenarios is necessary to assess generalization. The reliance on labeled data for training poses a challenge, and acquiring diverse datasets remains important. Additionally, hyperparameter settings may impact performance, requiring tailored fine-tuning for specific scenarios. Addressing these limitations will contribute to the robustness and effectiveness of SA-UNet and SA-FCN in real-world driving applications.

There are many directions worth further investigation, including:

• Using self-attention blocks on both the encoder and decoder sides of the UNet architecture.

• In this work, we used a simple attention block with a multiplicative similarity score function; it is worth investigating attention blocks with more complex similarity score functions.

• Investigating the combination of different attention mechanisms, such as self-attention and spatial attention, in the UNet architecture. Evaluating the benefits of utilizing multiple attention mechanisms may lead to more comprehensive feature representations and further improved segmentation accuracy.

REFERENCES

[1] P. Sharma and D. P. Bhatt, "Importance of deep learning models to perform segmentation on medical imaging modalities," Data Engineering for Smart Systems, Springer, Singapore, pp. 593-603, 2022.
[2] G. Rossolini et al., "On the real-world adversarial robustness of real-time semantic segmentation models for autonomous driving," arXiv preprint arXiv:2201.01850, Jan. 2022.
[3] G. J. Brostow, J. Fauqueur, and R. Cipolla, "Semantic object classes in video: A high-definition ground truth database," Pattern Recognition Letters, vol. 30, no. 2, pp. 88-97, 2009.
[4] M. Cordts et al., "The Cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 3213-3223, 2016.
[5] G. Ros et al., "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 3234-3243, 2016.
[6] G. Cheng and J. Y. Zheng, "Semantic segmentation for pedestrian detection from motion in temporal domain," in Proc. 25th Int. Conf. Pattern Recognition (ICPR), Milan, Italy, Jan. 2021.
[7] R. Benenson et al., "Ten years of pedestrian detection, what have we learned?," arXiv:1411.4304, Nov. 2014.
[8] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640-651, 2016.
[9] X. Wang et al., "SkipNet: Learning dynamic routing in convolutional networks," in Proc. European Conf. Computer Vision (ECCV), pp. 409-424, 2018.
[10] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Int. Conf. Computer Vision, pp. 1520-1528, 2015.
[11] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, 2017.
[12] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, pp. 234-241, 2015.
[13] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv:1704.04861, Apr. 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[15] A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[16] D. K. Dewangan and S. P. Sahu, "RCNet: Road classification convolutional neural networks for intelligent vehicle system," Intelligent Service Robotics, vol. 14, pp. 199-214, Feb. 2021.
[17] M. Cordts et al., "The Cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 3213-3223, 2016.
[18] G. Rossolini et al., "On the real-world adversarial robustness of real-time semantic segmentation models for autonomous driving," arXiv preprint arXiv:2201.01850, Jan. 2022.
[19] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2881-2890, 2017.
[20] H. Zhang et al., "ResNeSt: Split-attention networks," arXiv:2004.08955, 2020.
[21] G. Lin et al., "RefineNet: Multi-path refinement networks for high-resolution semantic segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 5168-5177, 2017.
[22] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834-848, Apr. 2018.
[23] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[24] B. Cheng et al., "Panoptic-DeepLab," arXiv:1910.04751, 2019.
[25] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, Dec. 2017.
