
2018 IEEE Intelligent Vehicles Symposium (IV)

Changshu, Suzhou, China, June 26-30, 2018

Real-time Traffic Scene Segmentation Based on Multi-Feature Map and Deep Learning

Linhui Li, Weina Zheng, Lingchao Kong, Ümit Özgüner, Wenbin Hou, Jing Lian*

*Research supported by the National Natural Science Foundation of China (Grant Nos. 51775082, 61473057 and 61203171) and the China Fundamental Research Funds for the Central Universities (Grant Nos. DUT17LAB11 and DUT15LK13). L. Li, W. Zheng, L. Kong, W. Hou and J. Lian are with the School of Automotive Engineering, Faculty of Vehicle Engineering and Mechanics, Dalian University of Technology, Dalian 116024, China; J. Lian is the corresponding author (e-mail: [email protected]; zhengweina_1993@mail.dlut.edu.cn; [email protected]; [email protected]; [email protected]). Ümit Özgüner is with the Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected]).

Abstract—Visual-based semantic segmentation of traffic scenes plays an important role in intelligent vehicles. In this paper, we present a new real-time deep fully convolutional neural network (FCNN) for pixel-wise segmentation with six-channel inputs. The six channels comprise the RGB three-channel color image, the Disparity (D) image generated by a stereo vision sensor, an image describing the Height (H) of each pixel above the road ground, and an image describing the Angle (A) between each pixel's normal direction and the predicted direction of gravity; together these are defined as an RGB-DHA multi-feature map. The FCNN is simplified and modified based on AlexNet to meet the real-time requirements of intelligent vehicles for environmental perception. The proposed algorithm is tested and compared on the Cityscapes dataset, yielding a global accuracy of 73.4% at 22 ms per 400×200 resolution image with one Titan X GPU.

Index Terms—Intelligent vehicle, traffic scene segmentation, multi-feature map, deep learning
I. INTRODUCTION

Traffic scene segmentation is a fundamental task for intelligent vehicles in detecting obstacles, planning paths, and navigating autonomously. Semantic segmentation, also known as image parsing or image comprehension [1], aims to divide an image into predefined non-overlapping regions and translate them into abstract semantic information. In recent years, the rapid development of computer hardware, especially the Graphics Processing Unit (GPU), the emergence of large-scale annotated data, and the application of deep Convolutional Neural Networks (CNNs) to image classification and object detection have made deep networks the current mainstream approach to image segmentation. Recently, most studies have focused on improving the accuracy of semantic segmentation by making networks deeper and larger. However, increasing the parameters often comes at the expense of computer memory and leads to slower networks. Thus, how to improve accuracy while ensuring real-time operation is one of the most important tasks in deep learning.

The advent of depth sensors makes it possible to obtain depth information, which contains more positional information than the RGB image. There are two ways to apply the depth map to image semantic segmentation: one is to combine the raw depth image and the RGB image into a four-channel RGB-D image as the CNN input [2]-[4]; the other is to feed images containing richer depth information and the RGB images into two separate CNNs [5]-[7]. With the help of the rich information about object relationships provided by depth images, both methods achieve better performance than using the RGB image alone. However, feeding data into two CNNs increases the number of parameters and slows the network down. Therefore, in this paper, to improve accuracy, the Disparity, Height, and Angle maps (DHA) are fused with the RGB image into a 6-channel RGB-DHA map and used directly as the input data.

This paper focuses on building a fast semantic segmentation network with good performance, especially for the road targets that drivers are most concerned about. To this end, a new network architecture is proposed, and the depth map and its derived height and norm-angle maps are added to train the network for higher accuracy. The main contributions can be stated as follows:

• A fully convolutional neural network called D-AlexNet is developed based on AlexNet [8]; it has a simple structure containing only a few convolutional layers to increase the forward speed of the network.

• The proposed D-AlexNet achieves more than a 2.2× speedup over the reference networks and reduces the number of parameters by more than 39 times.

• The 6-channel RGB-DHA map achieves better semantic segmentation results than RGB images alone as input, especially for identifying road targets in a traffic scene, such as pedestrians and cars.
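For concreteness, the channel-level fusion described above can be sketched as follows; the function name and array shapes are our own illustration, not the authors' implementation.

```python
import numpy as np

def fuse_rgb_dha(rgb, disparity, height, angle):
    """Stack an RGB image (H, W, 3) with single-channel D, H, and A maps
    (each H, W) into one 6-channel RGB-DHA input of shape (H, W, 6)."""
    dha = np.stack([disparity, height, angle], axis=-1).astype(rgb.dtype)
    return np.concatenate([rgb, dha], axis=-1)

# Example with the 200x400 resolution used in the paper's experiments.
rgb = np.zeros((200, 400, 3), dtype=np.float32)
d = h = a = np.zeros((200, 400), dtype=np.float32)
assert fuse_rgb_dha(rgb, d, h, a).shape == (200, 400, 6)
```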
II. RELATED WORK

A. RGB Semantic Segmentation
A Fully Convolutional Network (FCN) [9] replaces the last fully-connected layers of the traditional neural network with convolution layers, which lays the foundation for FCNs to be applied to semantic segmentation. Deeplab [10], proposed by L. C. Chen et al., obtained better results by reducing the stride, using the hole (atrous) algorithm, and applying a conditional random field to fine-tune the network. SegNet [11], [12] achieves pixel-level semantic segmentation by using encoder-decoder structures to restore the feature maps from the higher layers with spatial information from the lower layers. In [13], [14], multi-scale feature ensembles are used to increase performance. PSPNet [15] completes the prediction by aggregating context information.



To perform segmentation in real time on existing hardware, several methods have been used to speed up networks. SegNet [12] improved forward speed by reducing the number of layers in the network. A. Chaurasia et al. [16] linked the encoder blocks directly to the corresponding decoders to decrease the processing time. Z. Hengshuang et al. [17] proposed a compressed-PSPNet-based image cascade network that incorporates multi-resolution branches under proper label guidance to yield real-time inference.

B. Semantic Segmentation with Depth Information

Compared to single RGB images, depth maps contain more location information, which benefits semantic segmentation. In [18], the raw depth image was simply treated as a one-channel image, and CNNs were then applied to extract features for indoor semantic segmentation. In [5], depth information was encoded as three channels: horizontal disparity, height above ground, and norm angle. Qi et al. [19] proposed a 3D Graph Neural Network (3DGNN) that builds a k-nearest-neighbor graph and further boosts the prediction. The above works prove that using more characteristic information as input to train the network helps to improve the accuracy of semantic segmentation.
III. NETWORK ARCHITECTURE

In general, a deeper network structure will result in better semantic segmentation, although it often comes at the expense of many training parameters and a longer running time, which cannot meet the real-time requirements of intelligent driving. To tackle this problem, we believe, intuitively, that reducing network parameters and simplifying the network model can speed up the network, and, moreover, that adding depth information can improve network performance. Motivated by AlexNet [8] and N. Hyeonwoo [20], who proposed an encoder-decoder network architecture based on the VGG16 network, the proposed deep fully convolutional neural network architecture is shown in Figure 1; it includes 11 convolutional layers, 3 pooling layers, 3 upsampling layers, and 1 softmax layer.

In the new network structure, AlexNet is modified in the following ways to make it suitable for pixel-level semantic segmentation tasks:

• In order to adapt the network to images of different sizes, the fully connected layers of AlexNet are removed. Then, the stride of the first convolutional layer is changed from 4 to 1, and the kernel size of the max pooling layers is changed from 3×3 to 2×2.

• Experimental results showed that the group structure in the convolutional layers cannot improve the accuracy of the final semantic segmentation. Therefore, we removed the second, fourth, and fifth convolutional groups and deleted the two LRN layers.

• Internal covariate shift increases the difficulty of deep network training. This paper adds a batch normalization layer between each convolution layer and ReLU layer to address this problem.

• The convolution kernels of all the convolutional layers are unified to 3×3 in size, and the number of convolution kernel outputs is 96.

With reference to the upsampling method used by Z. D. Matthew et al. [21], we record the position of the maximum value in each pooling window during pooling and place the value back at the corresponding position during upsampling. The decoder is the mirror structure of the encoder, except for its sixth convolutional layer, where the kernel size is 1×1. The output of the decoder network is K feature maps, which are fed to the softmax layer to produce a K-channel class probability map, where K is the number of classes. Each pixel of the image is then assigned to the class with the largest predicted probability.
Figure 1. The structure of the D-AlexNet network: an encoder of Conv + BN + ReLU and pooling blocks, a mirrored decoder of upsampling and Conv + BN + ReLU blocks, and a final softmax layer. Input: RGB-DHA images; output: predicted results.
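The following PyTorch-style sketch illustrates this encoder-decoder design. The paper's implementation is in Caffe and does not specify the exact placement of the pooling layers, so the 5 + 6 split of the 11 convolutional layers and the pooling positions below are assumptions; the 3×3 kernels, 96 channels, batch normalization between convolution and ReLU, max-location unpooling, and final 1×1 classifier follow the text.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, k=3):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2),  # stride 1 everywhere
        nn.BatchNorm2d(cout),                     # BN between conv and ReLU
        nn.ReLU(inplace=True),
    )

class DAlexNet(nn.Module):
    def __init__(self, in_ch=6, num_classes=11, width=96):
        super().__init__()
        self.enc = nn.ModuleList([conv_bn_relu(in_ch if i == 0 else width, width)
                                  for i in range(5)])
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = nn.ModuleList([conv_bn_relu(width, width) for _ in range(5)])
        self.classifier = nn.Conv2d(width, num_classes, kernel_size=1)  # 1x1 conv

    def forward(self, x):
        indices = []
        # Encoder: 3 poolings, remembering the max locations for unpooling.
        for i, block in enumerate(self.enc):
            x = block(x)
            if i in (0, 2, 4):            # assumed pooling positions
                x, idx = self.pool(x)
                indices.append(idx)
        # Decoder: mirror structure, restoring resolution with stored indices.
        for i, block in enumerate(self.dec):
            if i in (0, 2, 4):            # assumed upsampling positions
                x = self.unpool(x, indices.pop())
            x = block(x)
        return self.classifier(x).softmax(dim=1)  # K-channel probability map

net = DAlexNet()
probs = net(torch.randn(1, 6, 200, 400))          # 6-channel RGB-DHA input
assert probs.shape == (1, 11, 200, 400)
```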

IV. MULTI-FEATURE MAP

Compared with the raw depth map alone, DHA images provide richer feature information for learning the deep network. They are produced by the steps described below.

A. Horizontal Disparity Map

The left and right images from the Cityscapes dataset can be used to generate the disparity map with a stereo matching algorithm. According to the matching strategy, stereo vision matching algorithms can be divided into three categories: local, semi-global, and global matching. The global matching algorithm attains the highest matching accuracy but the worst real-time performance; the local matching algorithm is the fastest, but its matching accuracy is very low. The semi-global matching algorithm better balances matching accuracy against real-time computing needs, so this method was chosen for obtaining the disparity map.

An edge-preserving smoothing method proposed by M. Dongbo [22] is then used to improve the segmentation accuracy by optimizing the coarse disparity map and making the disparity values more continuous.
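As an illustration, this disparity stage might look like the following OpenCV sketch; the matcher parameters are illustrative assumptions, and the smoothing step assumes the opencv-contrib ximgproc module, whose fast global smoother implements the weighted-least-squares method of [22].

```python
import cv2
import numpy as np

def disparity_map(left_bgr, right_bgr):
    left = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
    right = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)
    # Semi-global matching: a compromise between the accurate but slow
    # global methods and the fast but noisy local methods.
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                 blockSize=5, P1=8 * 5 * 5, P2=32 * 5 * 5)
    disp = sgbm.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> px
    # Edge-preserving smoothing of the coarse disparity, guided by the image.
    fgs = cv2.ximgproc.createFastGlobalSmootherFilter(left, 8000.0, 1.5)
    return fgs.filter(disp)
```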
B. Height Above Ground
Based on the obtained parallax maps, the point P(x, y, z) in the world coordinate system corresponding to a pixel P'(u, v) in the image coordinate system can be obtained from equations (1) and (2):

$$z = \frac{f b}{d} \quad (1)$$

$$y = \frac{z \,(v - c_y)}{f_y} \quad (2)$$
where x and y are the coordinates of point P in the world coordinate system, z is the distance between point P and the camera, f and b are the focal length of the camera and the baseline length of the two cameras, respectively, f_y and c_y are internal parameters of the camera, and y is the height of the pixel. A correction is required since the camera's installation does not guarantee complete parallelism with the ground plane. A part of the ground area in the parallax map is selected, and the least squares method is used to fit the ground. Assuming the fitted ground plane equation is Y = aX + bZ + c, the values of a, b, and c can be obtained from equation (3). After correcting the ground, the actual pixel height can be obtained from equation (4):

$$\begin{bmatrix} \sum X_i^2 & \sum X_i Z_i & \sum X_i \\ \sum X_i Z_i & \sum Z_i^2 & \sum Z_i \\ \sum X_i & \sum Z_i & n \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} \sum X_i Y_i \\ \sum Z_i Y_i \\ \sum Y_i \end{bmatrix} \quad (3)$$

$$h = y - (aX + bZ + c) \quad (4)$$
In the height map, the sky, buildings, and trees correspond to large height values, while the more important objects such as vehicles and pedestrians correspond to relatively small height values. To highlight important targets, equation (5) is used to transform the height value of each pixel, generating a height image whose values lie between 0 and 255:

$$h' = \begin{cases} 255 \cdot \log_2\!\left(\dfrac{h}{0.1}\right) \Big/ \log_2\!\left(\dfrac{15}{0.1}\right), & h > 0.1 \\[4pt] 0, & h \le 0.1 \end{cases} \quad (5)$$
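Equations (1)-(5) can be condensed into the following numpy sketch. The camera intrinsics, the ground-pixel mask, and the sign convention of the image y-axis are assumptions supplied by the caller; np.linalg.lstsq solves the same normal equations written out in equation (3).

```python
import numpy as np

def height_image(disparity, f, b, fy, cy, cx, ground_mask,
                 h_max=15.0, h_min=0.1):
    """disparity: (H, W) map; returns an 8-bit height image per eqs. (1)-(5)."""
    H, W = disparity.shape
    v = np.arange(H, dtype=np.float32)[:, None]
    u = np.arange(W, dtype=np.float32)[None, :]
    z = f * b / np.maximum(disparity, 1e-6)   # eq. (1): depth from disparity
    y = z * (v - cy) / fy                     # eq. (2): vertical coordinate
    x = z * (u - cx) / f                      # lateral coordinate (assumed model)
    # Least-squares fit of the ground plane Y = aX + bZ + c over known ground
    # pixels; equivalent to solving the normal equations of eq. (3).
    A = np.stack([x[ground_mask], z[ground_mask],
                  np.ones(ground_mask.sum(), dtype=np.float32)], axis=1)
    (a, b2, c), *_ = np.linalg.lstsq(A, y[ground_mask], rcond=None)
    h = y - (a * x + b2 * z + c)              # eq. (4): height above ground
    h = np.abs(h)   # sign depends on the camera frame; magnitude is used here
    # eq. (5): log rescaling to [0, 255] that preserves contrast for low objects.
    scaled = 255.0 * np.log2(np.maximum(h, h_min) / h_min) / np.log2(h_max / h_min)
    return np.clip(np.where(h > h_min, scaled, 0.0), 0, 255).astype(np.uint8)
```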
C. Surface Normal

For urban traffic scenes, the road surface is generally horizontal, while the surfaces of objects such as buildings, traffic signs, and vehicles are generally vertical. According to these characteristics, an algorithm can be used to find the direction that is most aligned with, or most orthogonal to, the locally estimated surface normal directions at as many points as possible. Hence, to leverage this structure, the algorithm proposed by G. Saurabh et al. [5] is used to determine the direction of gravity. Finally, by calculating the angle between each pixel's normal direction and the predicted direction of gravity, the required angle information is obtained.
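A simplified numpy sketch of this step is given below. The normal estimation by local cross products and the iterative gravity refinement follow the spirit of [5] rather than its exact algorithm; the initial gravity guess and the 45° threshold are assumptions.

```python
import numpy as np

def normals_from_points(P):
    """P: (H, W, 3) array of 3D points. Returns unit surface normals (H, W, 3)."""
    du = np.gradient(P, axis=1)            # tangent along image rows
    dv = np.gradient(P, axis=0)            # tangent along image columns
    n = np.cross(du, dv)
    return n / (np.linalg.norm(n, axis=2, keepdims=True) + 1e-9)

def estimate_gravity(normals, iters=5, thresh_deg=45.0):
    g = np.array([0.0, 1.0, 0.0])          # initial guess: camera y-axis
    N = normals.reshape(-1, 3)
    for _ in range(iters):
        cos = N @ g
        aligned = N[np.abs(cos) > np.cos(np.radians(thresh_deg))]
        if len(aligned) == 0:
            break
        # Re-estimate gravity as the dominant direction of the aligned normals.
        g = np.linalg.svd(aligned, full_matrices=False)[2][0]
        g /= np.linalg.norm(g)
    return g

def angle_map(normals, g):
    """Angle (degrees) between each pixel normal and the gravity direction."""
    cos = np.clip(normals @ g, -1.0, 1.0)
    return np.degrees(np.arccos(np.abs(cos)))
```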

V. EXPERIMENTS AND ANALYSIS

The experiments were conducted on the deep learning platform Caffe, with the software and hardware configuration shown in Table I.

TABLE I. TRAINING SOFTWARE AND HARDWARE CONDITIONS

Project                   Content
CPU                       Intel Xeon E5-2620
RAM                       32 GB
GPU                       GeForce GTX TITAN X
Operating System          Ubuntu 14.04 LTS
CUDA                      CUDA 7.5 with cuDNN v5
Deep Learning Framework   Caffe

A. Dataset and Evaluation Metrics

We applied our system to the recent urban scene understanding dataset Cityscapes, which contains 5,000 finely and 20,000 coarsely annotated images. In addition, the dataset provides left and right views captured by a stereo camera, making it possible to obtain parallax and depth maps. In this paper, the 5,000 finely annotated images were selected and split into a training, a validation, and a test set containing 2,975, 500, and 1,525 images, respectively. The images were resized to 200×400 to shorten training time and reduce memory consumption. To mark the significant traffic information, traffic scenes were classified into 11 categories: roads, road borders (sidewalks), buildings, poles, traffic signs, trees, lawns, sky, pedestrians, cars, and two-wheelers (bicycles or motorcycles). Both the global accuracy rate and the network forward time were used for evaluation.

B. Training Process

In the training process, the weights of the convolution layers were initialized in the same way as AlexNet, and the method of H. Kaiming et al. [23] was applied to initialize the weights of the batch normalization layers. Cross-entropy was employed as the loss function for training the network and calculating the loss value. In the back-propagation phase, stochastic gradient descent was adopted to optimize the network weights. The initial learning rate and momentum were set to 0.01 and 0.9, respectively, and the weight decay was set to 0.0005 to prevent overfitting. It is noteworthy that, to maintain the purity of the data and simplify the training process, we trained our network without data augmentation, and no model pre-trained on other datasets was used.

Every 300 training iterations, we conducted an accuracy assessment on the validation set and saved a snapshot. The validation accuracy and training loss curves based on RGB-DHA images are shown in Figure 2. More iterations may mean higher accuracy. However, when the accuracy and
the loss start to converge, it is feasible to stop the training. Therefore, the network was iteratively trained for 10,000 iterations, and the Caffe model with the highest validation accuracy was selected as the model finally used for scene segmentation.

Figure 2. Training loss and accuracy curves of the different networks.
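The paper trains with Caffe; the following PyTorch-style sketch merely restates the reported hyperparameters (SGD, learning rate 0.01, momentum 0.9, weight decay 0.0005, cross-entropy loss) in code form.

```python
import torch
import torch.nn.functional as F

model = DAlexNet()  # the architecture sketch from Section III
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)

def train_step(images, labels):
    """images: (B, 6, H, W) RGB-DHA tensors; labels: (B, H, W) class indices."""
    optimizer.zero_grad()
    probs = model(images)
    # The model outputs softmax probabilities, so NLL on their log equals
    # cross-entropy on the underlying logits.
    loss = F.nll_loss(torch.log(probs.clamp_min(1e-9)), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```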
C. Comparison and Analysis

We first evaluated how well our proposed network speeds up semantic segmentation, taking SegNet [11] and SegNet-basic [12] as baselines. With RGB images and RGB-DHA images as the input data, the performance of the networks is shown in Table II. Our proposed network was 2.2 times faster than SegNet and 1.8 times faster than SegNet-basic. From Figure 2 and Table II, we find that our proposed architecture achieves better real-time performance with competitive segmentation results. Furthermore, for each network, the validation accuracy obtained using RGB-DHA images is higher than that obtained using RGB images, which also indicates that richer characteristic information helps to improve network performance.

TABLE II. PERFORMANCE OF BASELINE AND D-ALEXNET ON VALIDATION SET OF CITYSCAPES

Project                    SegNet   SegNet-basic   D-AlexNet
Accuracy (%) (RGB)          72.6        69.9          69.4
Accuracy (%) (RGB-DHA)      73.2        71.5          73.4
Training time (ms)         140          77            63
Testing time (ms)           48          28            22
Memory (MB)                117.8         5.7           3.0
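For reference, the two accuracy figures reported in Tables II and III can be computed as in the sketch below, assuming integer label maps; "mean accuracy" averages the per-class accuracies.

```python
import numpy as np

def global_accuracy(pred, gt):
    """Fraction of all pixels whose predicted class matches the ground truth."""
    return float((pred == gt).mean())

def per_class_accuracy(pred, gt, num_classes=11):
    """Per-class pixel accuracy; its nanmean gives the mean accuracy."""
    acc = np.full(num_classes, np.nan)
    for k in range(num_classes):
        mask = gt == k
        if mask.any():
            acc[k] = float((pred[mask] == k).mean())
    return acc
```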


TABLE III. ACCURACY COMPARISON OF DIFFERENT INPUTS ON VALIDATION SET OF CITYSCAPES

Input        Road  Sidewalks  Building  Poles  Traffic signs  Trees  Lawns  Sky   Pedestrian  Cars  Two-wheelers  Mean accuracy  Global accuracy
RGB (%)      90.5    64         61       72.3      74          80.6   67.3  96.7     79        85.8     62.6          75.8            69.4
RGB-D (%)    91.6    67.7       75.1     74.2      78.6        80.2   72.5  98.1     81.2      88.7     59.3          78.8            73.1
RGB-H (%)    90.3    65         72.9     68.4      76.5        78.7   67.5  98.4     83.8      84.9     61.6          77.1            71.5
RGB-A (%)    91.1    68.5       71.6     75.3      73.6        80.8   77.8  98       77.1      89.9     65.8          79.0            72.4
RGB-DHA (%)  91.2    70.2       76.3     72.8      78          80.7   68.9  99       84        91       52            78.5            73.4

Figure 3. Examples of semantic segmentation results in the test set: (a) original color images of the test samples; (b) segmentation results based on RGB images; (c) results based on RGB-D images; (d) results based on RGB-H images; (e) results based on RGB-A images; (f) results based on RGB-DHA images.

To further understand the gain from each feature map, we first merged each of the three feature maps obtained in Section IV with the RGB images into 4-channel images, and then merged all three feature maps with the RGB images into 6-channel images. Both the 4-channel and 6-channel images were then used as input data for training the network. The testing results are shown in Table III, from which we can conclude that the segmentation accuracy based on 4-channel and 6-channel images was clearly improved compared with that based on 3-channel images. Under the same training parameters, the global accuracies obtained from RGB-D, RGB-H, RGB-A, and RGB-DHA images are 3.7%, 2.1%, 3%, and 4% higher than those obtained from raw RGB images, respectively. With RGB-DHA 6-channel images as input, our proposed system finally achieves a segmentation accuracy of 73.4%.

Figure 3 shows the semantic segmentation results on the test set of our network model with 3-channel, 4-channel, and 6-channel inputs, respectively. As shown, the segmentation results obtained from RGB images were sometimes rough, with many wrongly classified pixels on the road or around the boundary contours of different categories. For example, many pixels on the road surface were misclassified as sidewalks in the left image of Figure 3(b). The results based on the 4-channel images were generally better than those based on the RGB 3-channel images, and the RGB-DHA images further improve the segmentation accuracy, showing fewer misclassified points.

In addition, when using RGB-DHA images as the network input, road targets such as pedestrians and cars obtained higher segmentation accuracy than when using RGB images: the pedestrian segmentation accuracy rose from 79% to 84%, and the car segmentation accuracy rose from 85.8% to 91%. Some detailed comparisons are shown in Figure 4. The pedestrians and cars in Figure 4(c) and Figure 4(f) have clearer contours than in Figure 4(b) and Figure 4(e), which will be helpful for behavior analysis of different road targets.

11
Figure 4. Examples of detailed comparisons of pedestrians and cars: (a) pedestrian image; (b) RGB as input; (c) RGB-DHA as input; (d) car image; (e) RGB as input; (f) RGB-DHA as input.

VI. CONCLUSION

This paper presents a traffic scene semantic segmentation method based on a novel deep fully convolutional network (D-AlexNet) and a multi-feature map (RGB-DHA). The network achieves good real-time performance, taking 22 ms per 400×200 resolution image on a Titan X GPU. Disparity maps, height maps, and angle maps are derived from the original stereo RGB images and fused into 6-channel images to train the network. Experiments show that using the multi-feature map as the network input achieves 4% higher segmentation accuracy than using RGB images as input. In the future, we will focus on a more efficient deep network that jointly addresses semantic segmentation, target tracking, and parameter identification.

ACKNOWLEDGMENT

The authors would like to thank Dr. Rencheng Zheng for his contribution to the fruitful discussions.

REFERENCES

[1] W. Fan, A. Samia, L. Chunfeng and B. Abdelaziz, "Multimodality semantic segmentation based on polarization and color images," Neurocomputing, vol. 253, pp. 193-200, Aug. 2017.
[2] L. Linhui, Q. Bo, L. Jing, Z. Weina and Z. Yafu, "Traffic scene segmentation based on RGB-D image and deep learning," IEEE Transactions on Intelligent Transportation Systems, submitted for publication.
[3] F. David, B. Emmanuel, B. Stéphane, D. Guillaume, G. Alexander et al., "RGBD object recognition and visual texture classification for indoor semantic mapping," in IEEE International Conference on Technologies for Practical Robot Applications, Woburn, 2012, pp. 127-132.
[4] H. Farzad, S. Hannes, D. Babette, T. Carme and B. Sven, "Combining semantic and geometric features for object class segmentation of indoor scenes," IEEE Robotics & Automation Letters, vol. 2, no. 1, pp. 49-55, Jan. 2017.
[5] G. Saurabh, G. Ross, A. Pablo and M. Jitendra, "Learning rich features from RGB-D images for object detection and segmentation," Lecture Notes in Computer Science, vol. 8695 LNCS, no. PART 7, pp. 345-360, 2014.
[6] G. Yangrong and C. Tao, "Semantic segmentation of RGBD images based on deep depth regression," Pattern Recognition Letters, submitted for publication.
[7] E. David and F. Rob, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015, pp. 2650-2658.
[8] K. Alex, S. Ilya and H. E. Geoffrey, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, June 2017.
[9] S. Evan, L. Jonathan and D. Trevor, "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640-651, Apr. 2017.
[10] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille, "Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, submitted for publication.
[11] V. Badrinarayanan, A. Handa and R. Cipolla, "SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling," Computer Science, May 2015.
[12] V. Badrinarayanan, A. Kendall and R. Cipolla, "SegNet: a deep convolutional encoder-decoder architecture for scene segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, Dec. 2017.
[13] F. Xia, P. Wang, L. C. Chen and A. L. Yuille, "Zoom better to see clearer: human and object parsing with hierarchical auto-zoom net," in European Conference on Computer Vision, Switzerland, 2016, pp. 648-663.
[14] C. Liang-Chieh, Y. Yi, W. Jiang, X. Wei and Y. L. Alan, "Attention to scale: scale-aware semantic image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, July 2016, pp. 3640-3649.
[15] Z. Hengshuang, S. Jianping, Q. Xiaojuan, W. Xiaogang and J. Jiaya, "Pyramid scene parsing network," in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017, pp. 2881-2890.
[16] A. Chaurasia and E. Culurciello, "LinkNet: exploiting encoder representations for efficient semantic segmentation," arXiv preprint arXiv:1707.03718, 2017.
[17] Z. Hengshuang, Q. Xiaojuan, S. Xiaoyong, S. Jianping and J. Jiaya, "ICNet for real-time semantic segmentation on high-resolution images," arXiv preprint arXiv:1704.08545, 2017.
[18] H. Caner, M. Lingni, D. Csaba and C. Daniel, "FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture," in 13th Asian Conference on Computer Vision, Taipei, Nov. 2016, vol. 10111 LNCS, pp. 213-228.
[19] Q. Xiaojuan, L. Renjie, J. Jiaya, F. Sanja and U. Raquel, "3D graph neural networks for RGBD semantic segmentation," in IEEE International Conference on Computer Vision, Venice, Oct. 2017, pp. 5209-5218.
[20] N. Hyeonwoo, H. Seunghoon and H. Bohyung, "Learning deconvolution network for semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015, pp. 1520-1528.
[21] Z. D. Matthew and F. Rob, "Visualizing and understanding convolutional networks," in 13th European Conference on Computer Vision, Sep. 2014, vol. 8689 LNCS, no. PART 1, pp. 818-833.
[22] M. Dongbo, C. Sunghwan, L. Jiangbo, H. Bumsub, S. Kwanghoon and D. N. Minh, "Fast global image smoothing based on weighted least squares," IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5638-5653, Dec. 2014.
[23] H. Kaiming, Z. Xiangyu, R. Shaoqing and S. Jian, "Delving deep into rectifiers: surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Dec. 2015, pp. 1026-1034.
