

Key Points Estimation and Point Instance Segmentation Approach for
Lane Detection*
Yeongmin Ko1, Jiwon Jun2, Donghwuy Ko3, Moongu Jeon4

arXiv:2002.06604v2 [cs.CV] 18 Feb 2020

*This work was not supported by any organization
1 Yeongmin Ko is with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea [email protected]
2 Jiwon Jun is with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea [email protected]
3 Donghwuy Ko is with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea [email protected]
4 Moongu Jeon is with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea [email protected]

Abstract— State-of-the-art lane detection methods achieve successful performance. Despite their advantages, these methods have critical deficiencies such as a limited number of detectable lanes and a high false positive rate. In particular, a high false positive rate can cause erroneous and dangerous control. In this paper, we propose a novel deep-learning-based lane detection method for an arbitrary number of lanes that produces fewer false positives than other recent lane detection methods. The architecture of the proposed method has shared feature extraction layers and several branches for detection and embedding to cluster lanes. The proposed method generates exact points on the lanes, and we cast the clustering of the generated points as a point cloud instance segmentation problem. The proposed method is more compact because it generates fewer points than the original image pixel size. Our proposed post processing method eliminates outliers successfully and increases the performance notably. The whole proposed framework achieves competitive results on the tuSimple dataset. Our code is available at https://github.com/koyeongmin/PINet

Fig. 1. The proposed framework. Given an input image, PINet predicts three values: confidence, offset, and feature. From the confidence and offset outputs, exact points on the lanes can be predicted, and the feature output distinguishes the predicted points into each instance. Finally, the post processing module is applied, and it generates smooth lanes.

I. INTRODUCTION

To achieve fully autonomous driving, it is necessary to understand the environment around the car, and various perception modules are fused for this understanding. The lane detection module is one of the main modules. It is included not only in fully autonomous driving systems but also in partial autonomous driving systems that are already deployed, such as Advanced Driver Assistance Systems (ADAS) and cruise control systems. Many modules are derived from the lane detection module, such as Simultaneous Localization And Mapping (SLAM) and the lane centering function in ADAS. Although various sensors can be used for the perception modules, an RGB camera is considered one of the most important sensors because of its low price. However, because an RGB image carries only pixel color information, some feature extraction method must be applied to it to infer other useful information such as lane and object locations.

Most traditional methods for lane detection first extract low-level lane features using various hand-crafted cues such as color (e.g. [1], [2]) and edges (e.g. [3], [4]). These low-level features can be combined by the Hough transform [5] and the Kalman filter [6], and the combined features generate lane segment information. These methods are simple and can be adapted to many environments without major modification, but their performance depends on the test environment, such as lighting conditions and occlusion.

Deep learning methods show outstanding performance in complex scenes. Among them, Convolutional Neural Network (CNN) methods are especially common in the field of computer vision. Many recent methods apply a CNN for feature extraction [7, 8]. Semantic segmentation methods are frequently applied to the lane detection problem to infer the shape and location of lanes [9, 10]. These methods predict the instance and label of the pixels over the whole image. Although they achieve outstanding performance, they can only be applied to scenes with a fixed number of lanes because of their multi-class approach to distinguishing each lane. Neven et al. [11] cast this problem as instance segmentation. LaneNet, which they propose, has a shared encoder for feature extraction and two decoders: one performs binary lane segmentation, and the other is an embedding branch for instance segmentation. Because LaneNet applies an instance segmentation method, it can detect an arbitrary number of lanes. Chen et al. [12] propose a network that directly predicts x-axis values for fixed y values on each lane, but the method only works for vertical lane detection.

Our proposed method predicts fewer exact points on the lanes than the input pixel size and distinguishes each point into its instance.
The hourglass network [13] is usually applied to the field of key point estimation, such as pose estimation [14] and object detection [15, 16]. The hourglass network can extract information at various scales by sequences of down-sampling and up-sampling. If several hourglass networks are stacked, the loss function can be applied to each stacked network, which helps more stable training. The instance segmentation methods in the field of computer vision can generate clusters of the pixels that belong to each instance [17, 18].

Camera-based lane detection has been actively developed, and many state-of-the-art methods work almost perfectly on some public datasets. However, these methods have weaknesses such as the limited number of lanes that the module can detect and a high false positive rate. False negatives, the lanes that the module misses, do not change the control value suddenly, and correct control values can still be calculated from the other detected lanes. However, false positives can lead to serious risks: the wrong lanes that the module reports can cause a rapid change of the control values.

Fig. 1 shows the proposed framework for lane detection. It has three output branches and predicts the exact location and the instance features of the points on the lanes. More details are introduced in Section II. In summary, the primary contributions of this study are: (1) We propose a novel lane detection method whose output is more compact than that of semantic segmentation based methods, and the compact size can save memory of the module. (2) The proposed post processing method can eliminate outliers successfully and increases the performance. (3) The proposed method can be applied to various scenes that include lanes of any orientation, such as vertical or horizontal lanes, and an arbitrary number of lanes. (4) The proposed method achieves a lower false positive ratio than other methods and state-of-the-art accuracy, which supports the stability of the autonomous driving car.

II. METHOD

We train a neural network for lane detection. The network, which we will refer to as PINet (Point Instance Network), generates points on the whole lane and distinguishes the points into each instance. The loss function of this network is inspired by SGPN, the Similarity Group Proposal Network [19], which is an instance segmentation framework for 3D point clouds. Unlike other instance segmentation methods in the computer vision area, in our proposed method the embedding is needed only for the predicted points, not all pixels. In this respect, the 3D point cloud instance segmentation method is appropriate for our task. In addition, a simple post processing method is applied to the raw outputs of the network. This post processing eliminates outliers among the predicted points and makes each lane smoother. Section II-A introduces details of the main architecture and the loss function, and Section II-B introduces the proposed post processing method.

Fig. 2. The detailed network training procedure. It has three main parts. The 512x256 input data is compressed by the resizing layer, and the compressed input is passed to the feature extraction layers. Three output branches are applied at the end of each hourglass block, and they predict confidence, offset, and instance feature for each grid. The loss function is calculated from the outputs of each hourglass block.

A. Lane Instance Point Network

PINet generates points and distinguishes them into each instance. Fig. 2 shows the detailed architecture of the proposed network. The input size is 512x256, and the input is passed to a resizing layer and a feature extraction layer. The 512x256 input data is compressed into a smaller size by a sequence of convolution layers and max pooling layers. In this study, we experiment with two resized input sizes, 64x32 and 32x16. The feature extraction layer is inspired by the stacked hourglass network, which achieves outstanding performance on key point prediction. PINet includes two hourglass blocks for the feature extraction. Each block has three output branches, and the output grid size is the same as the resized input size. Fig. 3 shows the detailed architecture of the hourglass block. In Fig. 3, blue boxes denote down-sampling bottleneck layers, green boxes denote same bottleneck layers, and orange boxes denote up-sampling bottleneck layers. The details of the resizing layer and each bottleneck layer can be seen in Table I. Batch normalization and ReLU layers are applied after every convolutional layer except at the end of each output branch. The number of filters in each output branch is determined by the output values: the confidence branch has 1, the offset branch has 2, and the feature branch has 4. The following detailed explanations include the role and the loss function of each output branch.

Fig. 3. The hourglass block and bottleneck layer architecture. The hourglass block consists of three types of bottleneck layers: same, up-sampling, and down-sampling. Output branches are applied at the end of the hourglass layer, and the confidence output is forwarded to the next block.
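As a concrete illustration of the three output branches, the following is a minimal PyTorch sketch. The shared 128-channel input, the 3x3-then-1x1 head composition, and the sigmoid activations for confidence and offset are our own assumptions for illustration, not details taken from the paper or its repository.

```python
import torch
import torch.nn as nn

class OutputBranches(nn.Module):
    """Three heads per hourglass block: confidence (1 ch), offset (2 ch), feature (4 ch)."""
    def __init__(self, in_channels=128, feature_size=4):
        super().__init__()
        def head(out_channels):
            # BatchNorm/ReLU after every conv except the last one of the branch,
            # following the description in the text.
            return nn.Sequential(
                nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, out_channels, kernel_size=1),
            )
        self.confidence = head(1)          # is there a lane point in this grid cell?
        self.offset = head(2)              # (x, y) position of the point inside the cell
        self.feature = head(feature_size)  # embedding used to cluster points into lanes

    def forward(self, x):
        return (torch.sigmoid(self.confidence(x)),
                torch.sigmoid(self.offset(x)),   # offsets lie in [0, 1] as in the paper
                self.feature(x))

heads = OutputBranches()
conf, off, feat = heads(torch.randn(1, 128, 32, 64))  # 64x32 output grid
print(conf.shape, off.shape, feat.shape)  # (1,1,32,64) (1,2,32,64) (1,4,32,64)
```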

TABLE I
THE DETAIL OF THE RESIZING LAYER AND EACH BOTTLENECK (THE CASE OF THE OUTPUT SIZE 64x32)

Type                     Layer                     Filter  Size/Stride
Resizing layer           Conv                      64      7/2
                         same bottleneck           64      -
                         max pooling               64      2/2
                         same bottleneck           64      -
                         max pooling               64      2/2
                         same bottleneck           128     -
Same bottleneck          Conv                      32      1/1
                         Conv                      32      3/1
                         Conv                      128     1/1
                         (Residual) Conv           128     1/1
Up-sampling bottleneck   Conv                      32      1/1
                         ConvTranspose             32      3/2
                         Conv                      128     1/1
                         (Residual) ConvTranspose  128     3/2
Down-sampling bottleneck Conv                      32      1/1
                         Conv                      32      3/2
                         Conv                      128     1/1
                         (Residual) Conv           128     3/2
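For reference, here is a rough PyTorch sketch of the "same bottleneck" row of Table I (1x1, 3x3, 1x1 convolutions plus a 1x1 residual projection). The exact placement of batch normalization, ReLU, and the residual addition is an assumption on our part, not the authors' code.

```python
import torch
import torch.nn as nn

class SameBottleneck(nn.Module):
    """Bottleneck with a residual path, following the 'Same bottleneck' row of Table I."""
    def __init__(self, in_channels=128, mid_channels=32, out_channels=128):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, 1), nn.BatchNorm2d(out_channels),
        )
        # Residual path: Conv 128, 1/1 in Table I.
        self.residual = nn.Sequential(nn.Conv2d(in_channels, out_channels, 1), nn.BatchNorm2d(out_channels))

    def forward(self, x):
        return torch.relu(self.main(x) + self.residual(x))

block = SameBottleneck()
print(block(torch.randn(1, 128, 32, 64)).shape)  # torch.Size([1, 128, 32, 64])
```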
Confidence branch: The confidence branch predicts the confidence value of each grid. The output of the confidence branch has 1 channel, and it is passed to the next hourglass block, which helps stable training. Equation (1) shows the loss function of the confidence branch:

L_{confidence} = \frac{1}{N_e} \sum_{g \in G_e} \gamma_e (g_c^* - g_c)^2 + \frac{1}{N_n} \sum_{g \in G_n} \gamma_n (g_c^* - g_c)^2    (1)

where N_e and N_n denote the numbers of grids in which a point does or does not exist, G_e and G_n denote the corresponding sets of grids, g_c denotes the confidence output of a grid, g_c^* denotes the ground truth, and each \gamma denotes a coefficient.

Offset branch: From the offset branch, we can find the exact location of each point. Outputs of the offset branch have values between 0 and 1, and each value means a position relative to its grid. In this paper, a grid is matched to 8 or 16 pixels according to the ratio between the input size and the output size. The offset branch has two channels for predicting the x-axis offset and the y-axis offset. Equation (2) shows the loss function of the offset branch:

L_{offset} = \frac{1}{N_e} \sum_{g \in G_e} \gamma_x (g_x^* - g_x)^2 + \frac{1}{N_e} \sum_{g \in G_e} \gamma_y (g_y^* - g_y)^2    (2)
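To make the grid-to-pixel mapping concrete, the following NumPy sketch decodes a confidence grid and an offset grid into image coordinates for a 512x256 input and a 64x32 output grid (8 pixels per cell, as stated above). The 0.5 threshold and the channel order of the offsets are illustrative assumptions, not values from the paper.

```python
import numpy as np

def decode_points(confidence, offset, input_size=(512, 256), threshold=0.5):
    """confidence: (grid_h, grid_w); offset: (2, grid_h, grid_w) with values in [0, 1]."""
    grid_h, grid_w = confidence.shape
    cell_w = input_size[0] / grid_w   # e.g. 512 / 64 = 8 pixels per grid cell
    cell_h = input_size[1] / grid_h   # e.g. 256 / 32 = 8 pixels per grid cell
    points = []
    for row, col in zip(*np.where(confidence > threshold)):
        x = (col + offset[0, row, col]) * cell_w   # offset shifts the point inside its cell
        y = (row + offset[1, row, col]) * cell_h
        points.append((x, y))
    return points

# Example on random outputs with a 64x32 grid.
conf = np.random.rand(32, 64)
off = np.random.rand(2, 32, 64)
print(len(decode_points(conf, off)), "points decoded")
```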
Feature branch: This branch is inspired by SGPN, a 3D point cloud instance segmentation method. The feature is an embedding that encodes instance information, and the branch is trained to make the features of grids in the same instance closer to each other. Equations (3) and (4) show the loss function of the feature branch:

L_{feature} = \frac{1}{N_e^2} \sum_{i}^{N_e} \sum_{j}^{N_e} l(i, j)    (3)

l(i, j) = \begin{cases} \|F_i - F_j\|_2 & \text{if } C_{ij} = 1 \\ \max(0, K - \|F_i - F_j\|_2) & \text{if } C_{ij} = 2 \end{cases}    (4)

where C_{ij} indicates whether point i and point j belong to the same instance, F_i denotes the feature of point i predicted by the proposed network, and K is a constant such that K > 0. If C_{ij} = 1, the two points are in the same instance, and if C_{ij} = 2, the points are in different lanes. When the network is trained, the loss function pulls the features closer together when two points belong to the same instance and pushes them apart when two points belong to different instances. We can then distinguish the points into each instance by a simple distance-based clustering technique. The feature size is set to 4, and this size is observed to have no major effect on the performance.

The total loss L_{total} is the summation of the above three loss terms, and the whole network is trained end-to-end using the following total loss:

L_{total} = a L_{confidence} + b L_{offset} + c L_{feature}    (5)

In the training step, we set all coefficients to 1.0 initially, and add 0.5 to a and \gamma_n in the last few epochs. The proposed loss function is applied at the end of each hourglass block, and it helps stable training of the whole network.
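Below is a hedged PyTorch sketch of the pairwise feature loss of Eqs. (3)-(4) and the total loss of Eq. (5). The default value of K, the way lane ids are compared, and the helper names are our own choices for illustration; the paper does not specify them.

```python
import torch

def feature_loss(features, lane_ids, K=1.0):
    """features: (Ne, feature_size) embeddings of grid cells that contain a lane point.
    lane_ids: (Ne,) integer instance id of each of those cells."""
    n = features.shape[0]
    dist = torch.cdist(features, features)               # (Ne, Ne) pairwise L2 distances
    same = (lane_ids.unsqueeze(0) == lane_ids.unsqueeze(1)).float()
    pull = same * dist                                    # Cij = 1: pull features together
    push = (1.0 - same) * torch.clamp(K - dist, min=0.0)  # Cij = 2: push different lanes apart
    return (pull + push).sum() / (n * n)

def total_loss(l_confidence, l_offset, l_feature, a=1.0, b=1.0, c=1.0):
    # Eq. (5); the paper sets all coefficients to 1.0 initially and raises a late in training.
    return a * l_confidence + b * l_offset + c * l_feature

# Example with 6 predicted points belonging to two lanes.
feats = torch.randn(6, 4)
ids = torch.tensor([0, 0, 0, 1, 1, 1])
print(feature_loss(feats, ids).item())
```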
Fig. 4. The result of the post processing. (a) is the input image, and (b) is the raw output of PINet. In (b), the blue lane consists of some outliers, and another lane can be distinguished visually. In (c), the result of the proposed post processing method, outliers are eliminated and only the smooth longest lanes remain.

Fig. 5. The explanation of the post processing. There is no other point in the margin made by the straight line connecting points S and A, but the margin of points S and B contains 2 other points. As a result, point B is selected.

B. Post processing method

The raw outputs of the network contain some errors. For example, an instance should in principle consist of only one smooth lane, but there can be some outliers, or another lane that can be distinguished visually. Fig. 4 shows this problem and the effect of the proposed post processing method. The detailed procedure is as follows (a minimal sketch follows the list):
• Step 1: Find six starting points. Starting points are defined as the three lowest points and the three leftmost or rightmost points. If the predicted lane is on the left relative to the center of the image, the leftmost points are selected.
• Step 2: Select the three closest points to the starting point among the points that are higher than each starting point.
• Step 3: Consider a straight line connecting the two points that are selected at steps 1 and 2.
• Step 4: Calculate the distance between the straight line and the other points.
• Step 5: Count the number of points that are within the margin. The margin, γ, is set to 12 in this paper.
• Step 6: Select the point with the maximum count, if that count is larger than a threshold, as the new starting point, and consider that the point belongs to the same cluster as the starting point. We set the threshold to twenty percent of the remaining points.
• Step 7: Repeat from step 2 to step 6 until no points are found at step 2. A graphical explanation can be seen in Fig. 5.
• Step 8: Repeat from step 1 to step 7 for all starting points, and consider the maximum length cluster as a resulting lane.
• Step 9: Repeat from step 1 to step 8 for every predicted lane.
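The following NumPy sketch condenses Steps 2-6 for a single starting point. It is a simplified illustration rather than the authors' implementation; in particular, interpreting "higher" as a smaller image y coordinate is our assumption.

```python
import numpy as np

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through points a and b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    num = abs((bx - ax) * (py - ay) - (by - ay) * (px - ax))
    return num / (np.hypot(bx - ax, by - ay) + 1e-9)

def grow_lane(points, start, margin=12.0, ratio=0.2):
    """points: list of (x, y) image coordinates; start: index of the starting point.
    Returns the indices kept for this lane cluster."""
    remaining = [i for i in range(len(points)) if i != start]
    cluster = [start]
    while remaining:
        cur = points[cluster[-1]]
        # Step 2: the three closest remaining points that are higher (smaller y) than cur.
        higher = [i for i in remaining if points[i][1] < cur[1]]
        higher.sort(key=lambda i: np.hypot(points[i][0] - cur[0], points[i][1] - cur[1]))
        candidates = higher[:3]
        if not candidates:
            break
        # Steps 3-5: for each candidate line, count remaining points inside the margin.
        counts = [sum(point_line_distance(points[j], cur, points[i]) < margin
                      for j in remaining) for i in candidates]
        best = int(np.argmax(counts))
        # Step 6: accept the best candidate only if enough points support it.
        if counts[best] < ratio * len(remaining):
            break
        cluster.append(candidates[best])
        remaining.remove(candidates[best])
    return cluster

pts = [(10, 250), (12, 200), (14, 150), (300, 100), (16, 100), (18, 50)]
print(grow_lane(pts, start=0))  # the outlier at (300, 100) is left out
```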
III. RESULT

The network is trained on the training set of the tuSimple dataset [20] with the Adam optimizer and an initial learning rate of 2e-4. We train the network with this initial setup during the first 1000 epochs, and then set the learning rate to 1e-4, a to 1.5, and γn to 1.5 during the last 200 epochs. Other hyper-parameters, such as the margin size of the post processing, are determined experimentally, and their optimized values may need to be modified slightly according to the training results. No additional dataset or pre-trained weights are used, and two cases of the output size, 64x32 and 32x16, are evaluated. The test hardware is an NVIDIA RTX 2080 Ti.

A. Evaluation metrics

Accuracy is the main evaluation metric of the tuSimple dataset. In the tuSimple dataset, accuracy is defined by the following equation; it measures the average ratio of correct points:

accuracy = \sum_{clip} \frac{C_{clip}}{S_{clip}}    (6)

where C_{clip} denotes the number of points correctly predicted by the trained module on the given image clip, and S_{clip} denotes the number of ground-truth points in the clip. The false positive and false negative rates are also provided by the following equations:

FP = \frac{F_{pred}}{N_{pred}}    (7)

FN = \frac{M_{pred}}{N_{gt}}    (8)

where F_{pred} denotes the number of wrongly predicted lanes, N_{pred} denotes the number of predicted lanes, M_{pred} denotes the number of missed lanes, and N_{gt} denotes the number of ground-truth lanes.
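A small Python sketch of these metrics is given below. It takes already-matched counts as input; the benchmark's matching rule for deciding when a point or lane counts as correct is not reproduced here.

```python
def accuracy(correct_per_clip, gt_per_clip):
    """Eq. (6): sum over clips of (correct points / ground-truth points), as printed above."""
    return sum(c / s for c, s in zip(correct_per_clip, gt_per_clip))

def false_positive_rate(wrong_lanes, predicted_lanes):
    """Eq. (7): FP = F_pred / N_pred."""
    return wrong_lanes / predicted_lanes

def false_negative_rate(missed_lanes, gt_lanes):
    """Eq. (8): FN = M_pred / N_gt."""
    return missed_lanes / gt_lanes

print(false_positive_rate(3, 100), false_negative_rate(2, 100))  # 0.03 0.02
```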
B. Experiments

Our target training and test domain is the tuSimple dataset. The tuSimple dataset consists of 3626 annotated images for training and 2782 images for testing. We apply simple data augmentation methods such as flipping, translation, rotation, adding Gaussian noise, changing intensity, and adding shadows to train a more robust model (a small sketch of such augmentations follows Table II). The tuSimple dataset has a different distribution of scenes according to the number of lanes shown in the scene, and Table II shows more detail. The number of scenes with five lanes in the test set is more than two times larger than in the training set. To balance the two distributions, in the data augmentation step we generate the data containing five lanes at a higher ratio than the others.

TABLE II
DISTRIBUTION OF THE SCENES ACCORDING TO THE NUMBER OF LANES ON TRAINING AND TEST SET

Num     Training set   Test set
Five    239            569
Four    2982           468
Three   404            1740
Two     1              5
Total   3626           2782
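Below is a hedged sketch of the kinds of augmentation listed above (flip, translation, Gaussian noise, intensity change) using OpenCV and NumPy. The probabilities and parameter ranges are illustrative assumptions, not the values used to train PINet.

```python
import numpy as np
import cv2

def augment(image, lanes):
    """image: HxWx3 uint8; lanes: list of arrays of (x, y) points. Returns augmented copies."""
    h, w = image.shape[:2]
    if np.random.rand() < 0.5:                              # horizontal flip
        image = cv2.flip(image, 1)
        lanes = [np.column_stack([w - 1 - l[:, 0], l[:, 1]]) for l in lanes]
    tx = np.random.randint(-30, 31)                         # horizontal translation
    image = cv2.warpAffine(image, np.float32([[1, 0, tx], [0, 1, 0]]), (w, h))
    lanes = [l + np.array([tx, 0]) for l in lanes]
    gain = np.random.uniform(0.7, 1.3)                      # intensity change
    noise = np.random.normal(0, 5, image.shape)             # additive Gaussian noise
    image = np.clip(image.astype(np.float32) * gain + noise, 0, 255).astype(np.uint8)
    return image, lanes

img = np.zeros((256, 512, 3), dtype=np.uint8)
aug_img, aug_lanes = augment(img, [np.array([[100.0, 200.0], [120.0, 150.0]])])
print(aug_img.shape, aug_lanes[0][:1])
```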
The evaluation of the tuSimple dataset requires exact x-axis values for certain fixed y-axis values, and we apply simple linear interpolation to find the corresponding points for the given y values. Because the distance between predicted points is small, we can estimate accurate results without any complex curve fitting method.
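As an illustration, the interpolation step can be done with np.interp. The lane points and sampling positions below are made up for the example; np.interp expects increasing coordinates, hence the sort by y.

```python
import numpy as np

lane_points = np.array([(260.0, 710.0), (300.0, 590.0), (335.0, 470.0), (370.0, 350.0)])  # (x, y)
xs, ys = lane_points[:, 0], lane_points[:, 1]
order = np.argsort(ys)                       # sort points by their y value
h_samples = np.arange(360, 711, 10)          # fixed y values requested for evaluation (illustrative)
x_samples = np.interp(h_samples, ys[order], xs[order])
print(x_samples[:5])
```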
The detailed evaluation results can be seen in Table III, and Fig. 6 shows some results on the tuSimple dataset. All three configurations of our proposed method show a particularly low FP rate. This means that lanes wrongly predicted by our PINet are much rarer than with other methods, which can guarantee distinguished safety performance. Although pre-trained weights and extra datasets are not used, our proposed method also outperforms other state-of-the-art methods on the accuracy metric.

Fig. 6. The results on the tuSimple dataset. The first row is the ground-truth data; the second and third rows are the raw outputs of the proposed network and the final outputs after post processing.

TABLE III
EVALUATION RESULT ON THE TUSIMPLE DATASET

Method                        Acc      FP       FN
SCNN [10]                     96.53%   0.0617   0.0180
LaneNet(+H-net) [11]          96.38%   0.0780   0.0244
PointLaneNet(MobileNet) [12]  96.34%   0.0467   0.0518
ENet-SAD [21]                 96.64%   0.0602   0.0205
Ours (32x16)                  95.75%   0.0266   0.0362
Ours (64x32)                  96.62%   0.0308   0.0272
Ours (64x32 + post)           96.70%   0.0294   0.0263

Table IV shows the number of parameters in each method, and it shows that PINet is one of the lightest methods among them. Almost all components of PINet are built from bottleneck layers, and this architecture can save a lot of memory. The proposed method can run at about 30 frames per second without the post processing, and if the post processing is applied, the whole module works at about 10 frames per second.

TABLE IV
MODEL ANALYSIS

Model                         Parameters (M)
SCNN                          20.72
LaneNet(+H-net)               15.98
PointLaneNet(MobileNet)       5.33
ENet-SAD                      0.98
R-18-SAD                      12.41
Ours (32x16)                  4.40
Ours (64x32)                  4.39

IV. CONCLUSIONS

In this study, we have proposed a novel lane detection method that combines key point estimation and point instance segmentation, and it can work in real time. The method achieves the lowest false positive rate and guarantees the safety performance of the autonomous driving car because wrongly predicted lanes rarely occur. The post processing method increases the performance of the lane detection module notably, but the currently implemented version requires a lot of computing cost. We expect that this problem can be solved by parallel computation or other optimization techniques.
REFERENCES

[1] He, Yinghua, Hong Wang, and Bo Zhang. "Color-based road detection in urban traffic scenes." IEEE Transactions on Intelligent Transportation Systems 5.4 (2004): 309-318.
[2] Chiu, Kuo-Yu, and Sheng-Fuu Lin. "Lane detection using color-based segmentation." IEEE Intelligent Vehicles Symposium, 2005. IEEE, 2005.
[3] Wang, Yue, Dinggang Shen, and Eam Khwang Teoh. "Lane detection using Catmull-Rom spline." IEEE International Conference on Intelligent Vehicles. Vol. 1. 1998.
[4] Lee, Chanho, and Ji-Hyun Moon. "Robust lane detection and tracking for real-time applications." IEEE Transactions on Intelligent Transportation Systems 19.12 (2018): 4043-4048.
[5] Duda, R. O., and P. E. Hart. "Use of the Hough transform to detect lines and curves in pictures." Commun. ACM, vol. 15, no. 1, pp. 11-15, 1972.
[6] Borkar, Amol, Monson Hayes, and Mark T. Smith. "Robust lane detection and tracking with RANSAC and Kalman filter." 2009 16th IEEE International Conference on Image Processing (ICIP). IEEE, 2009.
[7] Van Gansbeke, Wouter, et al. "End-to-end lane detection through differentiable least-squares fitting." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2019.
[8] Zou, Qin, et al. "Robust lane detection from continuous driving scenes using deep neural networks." IEEE Transactions on Vehicular Technology (2019).
[9] Yang, Wei-Jong, Yoa-Teng Cheng, and Pau-Choo Chung. "Improved lane detection with multilevel features in branch convolutional neural networks." IEEE Access 7 (2019): 173148-173156.
[10] Pan, Xingang, et al. "Spatial as deep: Spatial CNN for traffic scene understanding." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[11] Neven, Davy, et al. "Towards end-to-end lane detection: an instance segmentation approach." 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018.
[12] Chen, Zhenpeng, Qianfei Liu, and Chenfan Lian. "PointLaneNet: Efficient end-to-end CNNs for accurate real-time lane detection." 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019.
[13] Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." European Conference on Computer Vision. Springer, Cham, 2016.
[14] Yang, Wei, et al. "Learning feature pyramids for human pose estimation." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[15] Duan, Kaiwen, et al. "CenterNet: Object detection with keypoint triplets." arXiv preprint arXiv:1904.08189 (2019).
[16] Zhou, Xingyi, Jiacheng Zhuo, and Philipp Krahenbuhl. "Bottom-up object detection by grouping extreme and center points." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
[17] He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[18] De Brabandere, Bert, Davy Neven, and Luc Van Gool. "Semantic instance segmentation with a discriminative loss function." arXiv preprint arXiv:1708.02551 (2017).
[19] Wang, Weiyue, et al. "SGPN: Similarity group proposal network for 3D point cloud instance segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[20] The tuSimple lane challenge, http://benchmark.tusimple.ai/
[21] Hou, Yuenan, et al. "Learning lightweight lane detection CNNs by self attention distillation." Proceedings of the IEEE International Conference on Computer Vision. 2019.
