than the RGB image. There are two ways to apply the depth

B. Semantic Segmentation with Depth Information

Compared to single RGB images, depth maps contain more location information, which benefits semantic segmentation. In [18], the raw depth image was simply treated as a one-channel image, and CNNs were then applied to extract features for indoor semantic segmentation. In [5], depth information was used as three channels: horizontal disparity, height above ground, and norm angle. Qi et al. [19] proposed a 3D Graph Neural Network (3DGNN) that builds a k-nearest neighbor graph to boost the prediction. The above works prove that using more characteristic information as input to train the network helps to improve the accuracy of semantic segmentation.

III. NETWORK ARCHITECTURE

In general, a deeper network structure will result in better semantic segmentation, although it often comes at the expense of many training parameters and a longer running time, which cannot meet the real-time requirements of intelligent driving. To tackle this problem, intuitively, we believe that reducing network parameters and simplifying the network model can speed up the network and, moreover, that adding depth information can improve network performance.

Motivated by AlexNet [8] and N. Hyeonwoo [20], who proposed an encoder-decoder network architecture based on the VGG16 network, the proposed deep fully convolutional neural network architecture is shown in Figure 1; it includes 11 convolutional layers, 3 pooling layers, 3 upsampling layers, and 1 softmax layer.

Figure 1. Encoder-decoder structure of the proposed network.

In the new network structure, AlexNet is modified in the following ways to make it suitable for pixel-level semantic segmentation tasks:

- In order to adapt the network to images of different sizes, the fully connected layers of AlexNet are removed. Then, the stride of the first convolutional
- The existence of internal covariate shift will increase the difficulty of deep network training. This paper adds a batch normalization layer between each convolutional layer and ReLU layer to solve this problem.
- The convolution kernels of all the convolutional layers are unified to be 3×3 in size, and the number of convolution kernel outputs is 96.

With reference to the upsampling method used by Z. D. Matthew et al. [21], we record the position of the maximum value in each pooling window during pooling and place the activation back at the corresponding position during upsampling. The decoder is the mirror structure of the encoder, except for its sixth convolutional layer, where the kernel size is 1×1. The output of the decoder network is K feature maps, which are then fed to the softmax layer to produce a K-channel class probability map, where K is the number of classes. The result of the segmentation is that each pixel of the image is assigned to the class with the largest predicted probability.
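The overall pattern can be illustrated with a minimal PyTorch-style sketch (the paper itself uses Caffe). The 3×3 kernels, 96 output channels, batch normalization between each convolution and ReLU, pooling indices reused for upsampling, the 1×1 classification convolution, and the final softmax follow the description above; the number of blocks and the padding are simplified relative to the 11-convolution network, and the 6 input channels and 11 classes are taken from the later sections.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_bn_relu(in_ch, out_ch=96):
        # 3x3 convolution -> batch normalization -> ReLU, as described above
        return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                             nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    class EncoderDecoderSketch(nn.Module):
        def __init__(self, in_channels=6, num_classes=11):
            super().__init__()
            self.enc = nn.ModuleList([conv_bn_relu(in_channels)] + [conv_bn_relu(96) for _ in range(2)])
            self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # remember max positions
            self.unpool = nn.MaxUnpool2d(2, stride=2)                   # reuse them when upsampling
            self.dec = nn.ModuleList([conv_bn_relu(96) for _ in range(2)])
            self.classifier = nn.Conv2d(96, num_classes, kernel_size=1)  # 1x1 conv -> K feature maps

        def forward(self, x):
            indices = []
            for block in self.enc:
                x = block(x)
                x, idx = self.pool(x)
                indices.append(idx)
            for block in self.dec:
                x = self.unpool(x, indices.pop())   # put values back at the recorded positions
                x = block(x)
            x = self.unpool(x, indices.pop())
            x = self.classifier(x)
            return F.softmax(x, dim=1)              # K-channel class probability map

    # example: probabilities and per-pixel labels for a 400x200 RGB-DHA input
    probs = EncoderDecoderSketch()(torch.randn(1, 6, 200, 400))
    labels = probs.argmax(dim=1)  # each pixel takes the class with the largest probability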
IV. MULTI-FEATURE MAP

Compared to using the raw depth information alone, DHA images contain richer image feature information for learning the deep network. The DHA maps are generated through the steps described below.

A. Horizontal Disparity Map

The left and right images obtained from the Cityscapes dataset can be used to generate the disparity map with a stereo matching algorithm. According to the degree of matching, stereo vision matching algorithms can be divided into three categories: the local matching algorithm, the semi-global matching algorithm, and the global matching algorithm. The global matching algorithm achieves the highest matching accuracy but the worst real-time performance. The local matching algorithm is the fastest, but its matching accuracy is very low.
The semi-global matching algorithm offers a good balance between accuracy and real-time computing needs, so for this paper we chose this method for obtaining the disparity map.

An edge-preserving smoothing method proposed by M. Dongbo [22] is used to improve the segmentation accuracy by optimizing the coarse disparity map and making the disparity values more continuous.
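As a minimal OpenCV sketch of this step (assuming rectified Cityscapes stereo pairs; the matcher and smoothing parameters below are illustrative rather than the values used in the paper, and the ximgproc module comes from the opencv-contrib package):

    import cv2
    import numpy as np

    def compute_disparity(left_bgr, right_bgr):
        """Coarse disparity via semi-global matching, then edge-preserving smoothing."""
        left = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
        right = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)
        sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5,
                                     P1=8 * 5 * 5, P2=32 * 5 * 5, uniquenessRatio=10)
        disp = sgbm.compute(left, right).astype(np.float32) / 16.0  # SGBM returns fixed-point values
        # Edge-preserving smoothing in the spirit of the fast global smoother of [22]
        # (arguments: guide image, source, lambda, sigma_color).
        disp = cv2.ximgproc.fastGlobalSmootherFilter(left_bgr, disp, 8000.0, 1.5)
        return disp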
B. Height Above Ground

Based on the obtained disparity maps, the points P(x, y, z) in the world coordinate system corresponding to the pixels P'(u, v) in the image coordinate system can be obtained by equations (1) and (2).
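Assuming a rectified stereo pinhole model with focal length f, baseline B, principal point (c_u, c_v), and disparity d(u, v), a back-projection of this kind typically takes the form

    Z = \frac{f\,B}{d(u,v)}, \qquad
    X = \frac{(u - c_u)\,Z}{f}, \qquad
    Y = \frac{(v - c_v)\,Z}{f},

so the height above ground can then be read from the vertical component of P(x, y, z), for example relative to an estimated ground plane.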
C. Norm Angle

In a traffic scene, the direction of gravity tends to be aligned with or orthogonal to the locally estimated surface normal directions at as many points as possible. Hence, to leverage this structure, the algorithm proposed by G. Saurabh et al. [5] is used to determine the direction of gravity.

Finally, by calculating the angle between the pixel normal direction and the predicted direction of gravity, the required angle information can be obtained.
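As a minimal NumPy sketch of this last step (assuming an H×W×3 array of back-projected points from the previous subsection and a unit gravity vector estimated as in [5]; the normal estimation below uses simple cross products of local point differences, which is only a stand-in for the method actually used):

    import numpy as np

    def angle_map(points, gravity):
        """Angle (degrees) between local surface normals and the gravity direction."""
        du = np.gradient(points, axis=1)            # point differences along image columns
        dv = np.gradient(points, axis=0)            # point differences along image rows
        normals = np.cross(du, dv)                  # per-pixel surface normal estimate
        norm = np.linalg.norm(normals, axis=2, keepdims=True)
        normals = normals / np.maximum(norm, 1e-6)
        cosang = np.clip(normals @ gravity, -1.0, 1.0)
        return np.degrees(np.arccos(cosang))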
V. EXPERIMENTS AND ANALYSIS

The experiments were conducted on the Caffe deep learning platform. In addition, all our experiments were performed on the software and hardware shown in Table I.

Once the loss starts to converge, it is feasible to stop the training. Therefore, the network was trained for 10,000 iterations, and the Caffe model with the highest accuracy was selected as the model finally used for scene segmentation.

C. Comparison and Analysis

We first evaluated how useful our proposed network is in speeding up semantic segmentation, taking SegNet [11] and SegNet-basic [12] as the baselines. When taking RGB images
and RGB-DHA images as the input data, the performance
results of the networks are shown in Table II. Our proposed network structure was 2.2 times faster than SegNet and 1.8 times faster than SegNet-basic. From Figure 2 and Table II we can see that our proposed architecture achieves better real-time performance with competitive segmentation accuracy. Furthermore, for each network framework, the validation accuracy obtained using RGB-DHA images is higher than that obtained using RGB images, which also indicates that more characteristic information is useful for improving the performance of the networks.
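For context, a per-image forward-pass time of the kind compared here (and the 22 ms per 400×200 image quoted in the conclusion) can be estimated roughly as follows; this is an illustrative PyTorch-style sketch assuming a DataLoader that yields (image, label) batches, not the measurement protocol used in the paper:

    import time
    import torch

    @torch.no_grad()
    def average_inference_ms(model, loader, device="cuda", warmup=10):
        """Rough average forward-pass time per image, in milliseconds."""
        model.eval().to(device)
        total, n = 0.0, 0
        for i, (images, _) in enumerate(loader):
            images = images.to(device)
            if device == "cuda":
                torch.cuda.synchronize()
            start = time.perf_counter()
            model(images)
            if device == "cuda":
                torch.cuda.synchronize()
            if i >= warmup:                  # skip the first iterations (cold cache, autotuning)
                total += time.perf_counter() - start
                n += images.shape[0]
        return 1000.0 * total / max(n, 1)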
TABLE III
Segmentation accuracy (%) with different input feature combinations. The eleven per-class columns cover the classes Sky, Building, Road, Sidewalks, Trees, Lawns, Poles, Traffic signs, Pedestrian, Two-wheelers, and cars; the last two columns give the mean class accuracy and the global accuracy.

Input    |                  Per-class accuracy                    | Mean | Global
RGB      | 90.5 64.0 61.0 72.3 74.0 80.6 67.3 96.7 79.0 85.8 62.6 | 75.8 | 69.4
RGB-D    | 91.6 67.7 75.1 74.2 78.6 80.2 72.5 98.1 81.2 88.7 59.3 | 78.8 | 73.1
RGB-H    | 90.3 65.0 72.9 68.4 76.5 78.7 67.5 98.4 83.8 84.9 61.6 | 77.1 | 71.5
RGB-A    | 91.1 68.5 71.6 75.3 73.6 80.8 77.8 98.0 77.1 89.9 65.8 | 79.0 | 72.4
RGB-DHA  | 91.2 70.2 76.3 72.8 78.0 80.7 68.9 99.0 84.0 91.0 52.0 | 78.5 | 73.4
Figure 3(c). Semantic segmentation results based on RGB-D images.
To further understand the gain contributed by each feature map, we first merged each of the three feature maps obtained in Section IV with the RGB images into 4-channel images, and then merged all three feature maps with the RGB images into 6-channel images. After that, both the 4-channel and the 6-channel images were used as input data for training the network. The testing results are shown in Table III, from which we can conclude that the segmentation accuracy based on 4-channel and 6-channel images was obviously improved compared with that based on 3-channel images. Under the same training parameters, the global accuracies obtained from RGB-D, RGB-H, RGB-A, and RGB-DHA images are 3.7%, 2.1%, 3%, and 4% higher than those obtained from raw RGB images, respectively. With RGB-DHA 6-channel images as input, our proposed system finally achieves a segmentation accuracy of 73.4%.
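As a minimal NumPy sketch of how such 4-channel and 6-channel inputs can be assembled (assuming the disparity, height, and angle maps have already been computed and rescaled to a range comparable to the RGB channels):

    import numpy as np

    def stack_channels(rgb, disparity, height, angle):
        """Build the 4-channel (RGB-D) and 6-channel (RGB-DHA) inputs discussed above.
        rgb is H x W x 3; the other maps are H x W single-channel arrays."""
        d = disparity[..., np.newaxis]
        h = height[..., np.newaxis]
        a = angle[..., np.newaxis]
        rgb_d = np.concatenate([rgb, d], axis=2)          # 4 channels
        rgb_dha = np.concatenate([rgb, d, h, a], axis=2)  # 6 channels
        return rgb_d, rgb_dha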
Figure 3 shows the results of semantic segmentation on the test set of our network model with 3-channel, 4-channel, and 6-channel images, respectively, as inputs. As shown, the segmentation results obtained based on RGB images were sometimes rough, and there were many wrongly classified pixels on the road or around the boundary contours of different categories. For example, many pixels on the road surface were misclassified as sidewalks in the left image of Figure 3(b). The results based on the four-channel images were generally better than those based on the RGB three-channel images, and the RGB-DHA images further improve the segmentation accuracy, showing fewer misclassified points.

In addition, when using RGB-DHA images as the network input, road targets such as pedestrians and cars achieved higher segmentation accuracy than when using RGB images as the input. For example, the pedestrian segmentation accuracy rises from 79% to 84%, and the car segmentation accuracy rises from 85.8% to 91%. Some detailed comparisons are shown in Figure 4. It can be seen that the pedestrian and car in Figure 4(c) and Figure 4(f) have clearer contours than in Figure 4(b) and Figure 4(e), which will be helpful for behavior analysis of different road targets.
Figure 4. Examples of detail comparison of pedestrians and cars: (a) pedestrian image, (b) RGB as input, (c) RGB-DHA as input; (d) car image, (e) RGB as input, (f) RGB-DHA as input.

VI. CONCLUSION

This paper presents a traffic scene semantic segmentation method based on a novel deep fully convolutional network (D-AlexNet) and a multi-feature map (RGB-DHA). The network achieves good real-time performance, taking 22 ms per 400×200 resolution image on a Titan X GPU. Disparity maps, height maps, and angle maps are obtained from the original RGB images and fused into 6-channel images to train the network. Experiments show that using the multi-feature map as the input to the network achieves 4% higher segmentation accuracy compared with using RGB images as input. In the future, we will focus on a more efficient deep network that jointly performs semantic segmentation, target tracking, and parameter identification.

ACKNOWLEDGMENT

The authors would like to thank Dr. Rencheng Zheng for his contribution to the fruitful discussions.

REFERENCES

[1] W. Fan, A. Samia, L. Chunfeng and B. Abdelaziz, "Multimodality semantic segmentation based on polarization and color images," Neurocomputing, vol. 253, pp. 193-200, Aug. 2017.
[2] L. Linhui, Q. Bo, L. Jing, Z. Weina and Z. Yafu, "Traffic scene segmentation based on RGB-D image and deep learning," IEEE Transactions on Intelligent Transportation Systems, submitted for publication.
[3] F. David, B. Emmanuel, B. Stéphane, D. Guillaume, G. Alexander et al., "RGBD object recognition and visual texture classification for indoor semantic mapping," in IEEE International Conference on Technologies for Practical Robot Applications, Woburn, 2012, pp. 127-132.
[4] H. Farzad, S. Hannes, D. Babette, T. Carme and B. Sven, "Combining semantic and geometric features for object class segmentation of indoor scenes," IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 49-55, Jan. 2017.
[5] G. Saurabh, G. Ross, A. Pablo and M. Jitendra, "Learning rich features from RGB-D images for object detection and segmentation," Lecture Notes in Computer Science, vol. 8695 LNCS, no. PART 7, pp. 345-360, 2014.
[6] G. Yangrong and C. Tao, "Semantic segmentation of RGBD images based on deep depth regression," Pattern Recognition Letters, submitted for publication.
[7] E. David and F. Rob, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Dec. 2015, pp. 2650-2658.
[8] K. Alex, S. Ilya and H. E. Geoffrey, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, June 2017.
[9] S. Evan, L. Jonathan and D. Trevor, "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640-651, Apr. 2017.
[10] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille, "DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, submitted for publication.
[11] V. Badrinarayanan, A. Handa and R. Cipolla, "SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling," Computer Science, May 2015.
[12] V. Badrinarayanan, A. Kendall and R. Cipolla, "SegNet: a deep convolutional encoder-decoder architecture for scene segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, Dec. 2017.
[13] F. Xia, P. Wang, L. C. Chen and A. L. Yuille, "Zoom better to see clearer: human and object parsing with hierarchical auto-zoom net," in European Conference on Computer Vision, Switzerland, 2016, pp. 648-663.
[14] C. Liang-Chieh, Y. Yi, W. Jiang, X. Wei and Y. L. Alan, "Attention to scale: scale-aware semantic image segmentation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, July 2016, pp. 3640-3649.
[15] Z. Hengshuang, S. Jianping, Q. Xiaojuan, W. Xiaogang and J. Jiaya, "Pyramid scene parsing network," in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017, pp. 2881-2890.
[16] A. Chaurasia and E. Culurciello, "LinkNet: exploiting encoder representations for efficient semantic segmentation," arXiv preprint arXiv:1707.03718, 2017.
[17] Z. Hengshuang, Q. Xiaojuan, S. Xiaoyong, S. Jianping and J. Jiaya, "ICNet for real-time semantic segmentation on high-resolution images," arXiv preprint arXiv:1704.08545, 2017.
[18] H. Caner, M. Lingni, D. Csaba and C. Daniel, "FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture," in 13th Asian Conference on Computer Vision, Taipei, Nov. 2016, vol. 10111 LNCS, pp. 213-228.
[19] Q. Xiaojuan, L. Renjie, J. Jiaya, F. Sanja and U. Raquel, "3D graph neural networks for RGBD semantic segmentation," in IEEE International Conference on Computer Vision, Venice, Oct. 2017, pp. 5209-5218.
[20] N. Hyeonwoo, H. Seunghoon and H. Bohyung, "Learning deconvolution network for semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Dec. 2015, pp. 1520-1528.
[21] Z. D. Matthew and F. Rob, "Visualizing and understanding convolutional networks," in 13th European Conference on Computer Vision, Sep. 2014, vol. 8689 LNCS, no. PART 1, pp. 818-833.
[22] M. Dongbo, C. Sunghwan, L. Jiangbo, H. Bumsub, S. Kwanghoon and D. N. Minh, "Fast global image smoothing based on weighted least squares," IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5638-5653, Dec. 2014.
[23] H. Kaiming, Z. Xiangyu, R. Shaoqing and S. Jian, "Delving deep into rectifiers: surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Dec. 2015, pp. 1026-1034.