
Key Points Estimation and Point Instance Segmentation Approach for Lane Detection
Yeongmin Ko, Student Member, IEEE, Younkwan Lee, Student Member, IEEE, Shoaib Azam, Student
Member, IEEE, Farzeen Munir, Senior Member, IEEE, Moongu Jeon*, Senior Member, IEEE, and Witold
Pedrycz, Fellow, IEEE

arXiv:2002.06604v4 [cs.CV] 14 Sep 2020

Abstract—Perception techniques for autonomous driving should be adaptive to various environments. In the case of traffic line detection, an essential perception module, many conditions should be considered, such as the number of traffic lines and the computing power of the target system. To address these problems, in this paper we propose a traffic line detection method called Point Instance Network (PINet); the method is based on the key points estimation and instance segmentation approach. PINet includes several stacked hourglass networks that are trained simultaneously, so the size of the trained model can be chosen according to the computing power of the target environment. We cast the clustering problem of the predicted key points as an instance segmentation problem, so PINet can be trained regardless of the number of traffic lines. PINet achieves competitive accuracy and false positive rates on the TuSimple and CULane datasets, popular public datasets for lane detection. Our code is available at https://github.com/koyeongmin/PINet

Index Terms—Lane detection, autonomous driving, deep learning.

Fig. 1. System overview. The proposed framework predicts key points on traffic lines and distinguishes individual instances regardless of the number of traffic lines. In addition, if a user wants to run the trained model on a system with weak computing power, like an embedded board, the network can be clipped and transferred without additional training.
This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (MSIT) (No. 2014-3-00077, Development of Global Multi-target Tracking and Event Prediction Techniques Based on Real-time Large-Scale Video Analysis), the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT) (No. 2019R1A2C2087489), and the GIST Research Institute (GRI) grant funded by the GIST in 2019.
Y. Ko, Y. Lee, S. Azam, F. Munir and M. Jeon are with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005, South Korea (e-mail: {koyeongmin, brightyoun, shoaibazam, farzeen.munir, mgjeon}@gist.ac.kr).
W. Pedrycz is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6R 2V4, Canada, with the Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia, and also with the Systems Research Institute, Polish Academy of Sciences, Warsaw 01-447, Poland (email: [email protected]).
Moongu Jeon is the corresponding author (e-mail: [email protected]).

I. INTRODUCTION

Fully autonomous driving requires understanding the environment around vehicles. Various perception modules are fused for this understanding, and many pattern recognition and computer vision techniques are applied for these perception modules [1], [2]. Lane detection, which can localize the drivable area on a road, is a major perception technique. There are many ways to recognize lanes, but most techniques utilize traffic line detection [3], [4] or road region segmentation [5], [6]. In this paper, we focus on traffic line detection for recognizing lanes. Fig. 1 shows the purpose of our proposed method, which predicts exact key points of lanes from input RGB images and, using embedding features extracted by the proposed network, distinguishes key points into individual instances. In addition, the proposed network is trained end-to-end, and the network size can be modified according to the computing power of the target system without any change of the network architecture or additional training.

Most traditional methods of traffic line detection extract low-level traffic line features using various hand-crafted features like color [7], [8] or edges [9], [10]. These low-level features can be combined using a Hough transform [11], [12] or a Kalman filter [13]; the combined features generate traffic line segment information. These methods are simple and can be adapted to various environments without significant modification. Still, their performance depends on the conditions of the testing environment, such as lighting and occlusion.
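As a rough illustration of this classical pipeline, the sketch below combines Canny edges with a probabilistic Hough transform to obtain candidate line segments. It assumes OpenCV is available; all thresholds are arbitrary placeholders and are not taken from the cited works.

```python
import cv2
import numpy as np

def detect_line_segments(bgr_image):
    """Classical lane-marking candidate extraction: low-level edges + Hough transform.
    All thresholds below are illustrative placeholders."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, 50, 150)                    # low-level edge features
    # The probabilistic Hough transform groups edge pixels into line segments.
    segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                               threshold=30, minLineLength=20, maxLineGap=10)
    return [] if segments is None else [s[0] for s in segments]   # (x1, y1, x2, y2)
```

As the paragraph above notes, such a pipeline is simple but sensitive to lighting and occlusion, which motivates the learning-based approach of this paper.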

Fig. 2. Proposed framework with three main parts. 512 × 256 size input data is compressed by the resizing network; the compressed input is fed to the
predicting network, which includes four hourglass modules. Three output branches are applied at the ends of each hourglass block; they predict confidence,
offset, and embedding feature. The loss function can be calculated from the outputs of each hourglass block. By clipping several hourglass modules, required
computing resources can be adjusted.

Deep learning methods show outstanding performance for complex scenes. Among deep learning methods, Convolutional Neural Network (CNN) methods are primarily applied for feature extraction in computer vision [14], [15]. Semantic segmentation methods [16], [17], [18], a major research area in computer vision, are frequently applied to traffic line detection problems to make inferences about shapes and locations [19], [20], [21], [22]. Some methods use multi-class approaches to distinguish individual traffic line instances. Therefore, even though these methods can achieve outstanding performance, they can only be applied to scenes that consist of fixed numbers of traffic lines. As a solution to this problem, instance segmentation methods are applied to distinguish individual instances. These semantic segmentation based traffic line detection methods require some post-processing to estimate the exact location values of the predicted traffic lines. To avoid the post-processing of the semantic segmentation approach, several other methods directly predict traffic line locations [23], [24].

The existing methods have certain limitations. The semantic segmentation methods require labeling or pre-processing at the pixel level for training, which is cumbersome. These methods also predict many unnecessary points because semantic segmentation generates classified pixel images with sizes identical to the given input image, even though only a few points are required to recognize traffic lines. In addition, existing methods are not adaptive to various environments according to the available computing power. To apply them to light systems like embedded boards, the entire architecture should be modified and trained again.

To overcome these limitations, our proposed method uses a deep learning model inspired by a stacked hourglass network to predict a few key points on traffic lines. The stacked hourglass network [25] is usually applied in key points estimation fields such as pose estimation [26] and object detection [27], [28]. Using a sequence of down-sampling and up-sampling, the stacked hourglass network can extract information at various scales. Because the stacked hourglass network includes several hourglass modules that are trained by the same loss function, we can simultaneously obtain various models that have different parameter sizes by clipping some bays from the whole structure. Using a simple method inspired by point cloud instance segmentation [29], each key point is assigned to an individual instance.

Camera-based traffic line detection has been actively developed, and many state-of-the-art methods [30], [24] are almost completely effective on public datasets. However, some methods have higher rates of false positives. False negatives, traffic lines that the module fails to detect, do not suddenly change the control values, and correct control values can be predicted from other detected traffic lines or previous results. However, false positives can lead to severe risks; incorrect identification of traffic lines by the module can cause rapid changes of the control values.

In summary, Fig. 2 shows our proposed framework for traffic line detection. It has three output branches and predicts the exact location and instance features of points on traffic lines. More details are introduced in Section III. These are the primary contributions of this study:
• Using the key points estimation approach, we propose a novel method for traffic line detection. It produces a more compact prediction output than those of other semantic segmentation-based methods.
• The framework consists of several hourglass modules, so we can obtain various models that have different sizes by simple clipping, because each hourglass module is trained simultaneously using the same loss function.
• The proposed method can be applied to various scenes that include any orientation of traffic lines, such as vertical or horizontal traffic lines, and arbitrary numbers of traffic lines.
• The proposed method has a lower false positive rate and noteworthy accuracy, which supports the stability of autonomous driving cars.

II. RELATED WORK

A. Traffic Line Detection

Lane detection is an important research area in autonomous driving. Lane detection modules recognize drivable areas on roads from input data. Traffic line detection is considered a main method for lane detection; it usually localizes line markings that distinguish drivable areas on roads. Especially for RGB images as input data, various hand-crafted features have been proposed to detect traffic lines [31], [32], [33], [34], [35]. However, these methods show limitations in complex scenarios.
Fig. 3. Details of the hourglass block, consisting of three types of bottle-neck layers: same bottle-necks, down bottle-necks, and up bottle-necks. Output branches are applied at the ends of the hourglass layers; the confidence output is forwarded to the next block.

Recently, deep learning has become a dominant method in computer vision research. Semantic segmentation [16], [17], [18], [36] is a major topic in perception research; it can classify the pixels of the input image into individual classes. Generative methods [37], [38] can also perform a similar function. Therefore, semantic segmentation methods and generative methods are suitable for expressing complex shapes of lines. [20], [30], [39], and [40] show applications of semantic segmentation and the generative model for traffic line detection. Some methods use multi-class approaches to distinguish each instance; however, multi-class approaches can classify only fixed numbers of instances. Instance segmentation approaches are proposed as solutions to this limitation. Neven et al. [41] attempted to solve this problem of multi-class approaches with instance segmentation. Their proposed LaneNet has a shared encoder and two decoders. One of these decoders performs binary lane segmentation; the other predicts embedding features for instance segmentation.

Although semantic segmentation methods can predict lines that have complex shapes, during training and testing they require pixel-level labeled data and post-processing to extract exact points on lines. Some direct methods [23], [24] directly generate exact points on lines. [23] predicts exact starting and terminal points, and x-axis values at fixed y-axis values for each traffic line. [24] presents the Line Proposal Unit (LPU) inspired by the Region Proposal Network (RPN) of Faster R-CNN [42]. The LPU predicts horizontal offsets at fixed y-axis values along certain pre-defined line proposals.

These approaches, the semantic segmentation method, the generative method, and the direct method, produce many unnecessary output values. In the semantic segmentation and generative methods, not all pixels are required to recognize traffic lines; an exact line can be predicted from a few key points. Direct methods also make certain unnecessary predictions, like the length, starting points, and terminal points of the given target traffic lines, which are unknown.

B. Key Points Estimation

Key points estimation techniques predict certain important points, called key points, from input images. Human pose estimation [26] is a major research topic in the key points estimation area. The stacked hourglass network [25] consists of several hourglass modules that are trained simultaneously. The hourglass module can transfer information of various scales to deeper layers, helping the whole network obtain both global and local features. Because of this property, an hourglass network is frequently utilized to detect centers or corners of objects in the object detection area. Not only network architectures and loss functions but also refinement methods adapted to existing networks have been developed for key point estimation. [43] suggests a feature aggregation and coarse-to-fine supervision method that can be applied to other multi-stage methods. [44] proposes a refinement network that improves the results of other existing models. In this paper, these refinement methods are not applied, in order to show the performance of our proposed framework alone; however, they could be applied to improve the performance.

III. METHOD

For lane detection, we train a neural network that consists of several hourglass modules. The network, which we will refer to as the Point Instance Network (PINet), generates points on lanes and distinguishes the predicted points into individual instances. To achieve these tasks, our proposed neural network includes three output branches: a confidence branch, an offset branch, and an embedding branch. The confidence and offset branches predict exact points of traffic lines; loss functions inspired by YOLO [45] are applied. The embedding branch generates the embedding feature of each predicted point; the embedding feature is fed to a clustering process to distinguish each instance. The loss function of the embedding branch is inspired by an instance segmentation method: the Similarity Group Proposal Network (SGPN) [29], an instance segmentation framework for 3D point clouds, introduces a simple technique and a loss function for instance segmentation. Based on the contents proposed by SGPN, we design a loss function fitted to discriminate each instance of the predicted traffic lines. Section III-A introduces details of the main architecture; Section III-B provides details about the loss function; and Section III-C describes the implementation in detail.
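To make the roles of the three branches concrete, the following minimal sketch shows how their outputs could be decoded at inference time. The model handle `pinet`, the returned tensor layout (channels, height, width), and the threshold value are illustrative assumptions rather than the released implementation; the 8-pixel cell size and the confidence thresholding follow Section III-B.

```python
import torch

def decode_key_points(pinet, image, conf_thr=0.5, cell_size=8):
    """Decode key-point pixel coordinates and embedding vectors from PINet's three
    output branches (assumed shapes: confidence (1,32,64), offset (2,32,64),
    embedding (4,32,64) for a 512x256 input)."""
    confidence, offset, embedding = pinet(image.unsqueeze(0))
    confidence, offset, embedding = confidence[0, 0], offset[0], embedding[0]

    keep = confidence > conf_thr                      # cells considered to contain a key point
    ys, xs = torch.nonzero(keep, as_tuple=True)
    # Each grid cell covers cell_size pixels; the offset gives the sub-cell position.
    px = (xs.float() + offset[0, ys, xs]) * cell_size
    py = (ys.float() + offset[1, ys, xs]) * cell_size
    points = torch.stack([px, py], dim=1)             # (N, 2) pixel coordinates
    features = embedding[:, ys, xs].T                 # (N, 4) embedding vectors
    # The points are then grouped into lane instances by the distance-based
    # clustering of the embedding features described in Section III-B3.
    return points, features
```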
A. Architecture

Fig. 2 shows the proposed framework of the network. The input RGB image size is 512 × 256; it is fed to the resizing network. This image is compressed to a smaller size (64 × 32) by the sequence of convolution layers in the resizing network; the output of the resizing network is fed to the predicting network. An arbitrary number of hourglass modules can be included in the predicting network; four hourglass modules are used in this study. All hourglass modules are trained simultaneously by the same loss function. After the training step, the user can choose how many hourglass modules to use according to the computing power, without any additional training. The following sections provide details about each network.

1) Resizing Network: The resizing network reduces the input image's size to save memory and inference time. The input RGB image size is 512 × 256. This network consists of three convolution layers, all applied with filter size 3 × 3, stride 2, and padding size 1. PReLU [46] and batch normalization [47] are utilized after each convolution layer. Finally, this network generates resized output of size 64 × 32. Table I shows details of the constituent layers.

TABLE I
DETAILS OF RESIZING NETWORK

Layer           Size/Stride   Output size
Input data      -             3*512*256
Conv+Prelu+bn   3/2           32*256*128
Conv+Prelu+bn   3/2           64*128*64
Conv+Prelu+bn   3/2           128*64*32

2) Predicting Network: The resizing network output is fed to the prediction part, which is described in this section. This part predicts the exact points on the traffic lines and the embedding features for instance segmentation. The network consists of several hourglass modules, each including an encoder, a decoder, and three output branches, as shown in Fig. 3. Some skip-connections transfer the information of the various scales to deeper layers. Each colored block in Fig. 3 is a bottle-neck module; these bottle-neck modules are described in Fig. 4. There are three kinds of bottle-neck: same, down, and up bottle-necks. The same bottle-neck generates output that has the same size as the input. The down bottle-neck is applied for down-sampling in the encoder; the first layer of the down bottle-neck is replaced by a convolution layer with filter size 3, stride 2, and padding 1. A transposed convolution layer with filter size 3, stride 2, and padding 1 is applied for the up bottle-neck in the up-sampling layers. Each output branch has three convolution layers and generates a 64 × 32 grid. Confidence values about key point existence, offsets, and embedding features of each cell in the output grid are predicted by the output branches. Table II shows details of the predicting network. Because a deeper network has better performance [25], it can act as a teacher network. Therefore, using knowledge distillation techniques, we can expect better performance for clipped short networks. The channel of each output branch is different (confidence: 1, offset: 2, embedding: 4), and the corresponding loss function is applied according to the goal of each output branch.

Fig. 4. Details of the bottle-neck. The three kinds of bottle-neck have different first layers according to their purposes.

TABLE II
DETAILS OF PREDICTING NETWORK

Part                   Layer               Size/Stride   Output size
                       Input data          -             128*64*32
Encoder                Bottle-neck(down)   -             128*32*16
                       Bottle-neck(down)   -             128*16*8
                       Bottle-neck(down)   -             128*8*4
                       Bottle-neck(down)   -             128*4*2
                       Bottle-neck         -             128*4*2
(Distillation layer)   Bottle-neck         -             128*4*2
                       Bottle-neck         -             128*4*2
                       Bottle-neck         -             128*4*2
Decoder                Bottle-neck(up)     -             128*8*4
                       Bottle-neck(up)     -             128*16*8
                       Bottle-neck(up)     -             128*32*16
                       Bottle-neck(up)     -             128*64*32
Output branch          Conv+Prelu+bn       3/1           64*64*32
                       Conv+Prelu+bn       3/1           32*64*32
                       Conv                1/1           C*64*32

B. Loss Function

For training, four loss functions are applied to each output branch of the hourglass networks. The following sections provide details of each loss function. As shown in Table II, the output branch generates a 64 × 32 grid, and each cell in the output grid consists of predicted values of 7 channels: the confidence value (1 channel), the offset value (2 channels), and the embedding feature (4 channels). The confidence value determines whether or not a key point of a traffic line exists; the offset value localizes the exact position of the key point predicted by the confidence value; and the embedding feature is utilized to distinguish key points into individual instances. Therefore, three loss functions, all except the distillation loss function, are applied to each cell of the output grid. The distillation loss function, which distills the knowledge of the teacher network, is applied to the distillation layer of each encoder, as shown in Table II. Details of each predicted value and feature are given in the following sections.
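As a concrete illustration of this 7-channel target layout, the sketch below builds ground-truth confidence and offset grids, plus an instance-id map, from annotated key points. It assumes the 64 × 32 grid and the 8-pixel cell size described in this section and in Section III-B2; the (channels, height, width) tensor layout and the helper name are our own assumptions.

```python
import torch

GRID_W, GRID_H, CELL = 64, 32, 8   # 512x256 input -> 64x32 grid, 8 px per cell

def build_targets(lanes):
    """lanes: list of lanes, each a list of (x, y) key points in input-image pixels.
    Returns confidence (1,H,W), offset (2,H,W), and an instance-id map (H,W)."""
    confidence = torch.zeros(1, GRID_H, GRID_W)
    offset = torch.zeros(2, GRID_H, GRID_W)
    instance = torch.zeros(GRID_H, GRID_W, dtype=torch.long)   # 0 = background
    for lane_id, lane in enumerate(lanes, start=1):
        for x, y in lane:
            cx, cy = int(x // CELL), int(y // CELL)            # cell that owns the point
            if 0 <= cx < GRID_W and 0 <= cy < GRID_H:
                confidence[0, cy, cx] = 1.0                    # a key point exists here
                offset[0, cy, cx] = (x % CELL) / CELL          # sub-cell x position in [0, 1)
                offset[1, cy, cx] = (y % CELL) / CELL          # sub-cell y position in [0, 1)
                instance[cy, cx] = lane_id                     # used by the embedding loss
    return confidence, offset, instance
```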
1) Confidence Loss: The confidence output branch predicts the confidence value of each cell. If a key point is present in the cell, the confidence value should be close to 1; if not, it should be 0. The output of the confidence branch has 1 channel, and it is fed to the next hourglass module. The confidence loss consists of two parts, an existence loss and a non-existence loss. The existence loss is applied to cells that include key points; the non-existence loss is utilized to reduce the confidence value of each background cell. The non-existence loss is computed only at cells that predict confidence values higher than 0.01. Because cells away from key points converge rapidly, this technique helps the training concentrate on cells closer to the key points. The following shows the loss function of the confidence branch:

L_{exist} = \frac{1}{N_e} \sum_{c_c \in G_e} (c_c^* - c_c)^2,   (1)

L_{non\_exist} = \frac{1}{N_n} \sum_{\substack{c_c \in G_n \\ c_c > 0.01}} (c_c^* - c_c)^2 + 0.00001 \cdot \sum_{c_c \in G_n} c_c^2,   (2)

where N_e denotes the number of cells that include key points, N_n denotes the number of cells that do not include any key points, G_e denotes the set of cells that contain key points, G_n denotes the set of cells that contain no key points, c_c denotes the predicted value of each cell in the confidence output branch, and c_c^* denotes the ground-truth value. The ground-truth value of a cell that has a key point is 1; otherwise it is 0. At inference time, if the confidence value is bigger than a pre-defined threshold, we consider that a key point exists at that cell. The second term of L_{non_exist} is a regularization term.

2) Offset Loss: From the offset branch, PINet predicts the exact location of the key point for each output cell. The output of each cell has a value between 0 and 1; this value indicates the position within the corresponding cell. In this paper, a cell is matched to 8 pixels of the input image. For example, if the predicted offset value is 0.5, the real position of the key point is 4 pixels away from the edge of the cell. The offset branch has two channels for predicting the x-axis and y-axis offsets. Equation 3 shows the loss function:

L_{offset} = \frac{1}{N_e} \sum_{c_x \in G_e} (c_x^* - c_x)^2 + \frac{1}{N_e} \sum_{c_y \in G_e} (c_y^* - c_y)^2.   (3)

Because the ground truth does not exist at cells that include no key points, these cells are ignored when the offset loss is calculated.

3) Embedding Feature Loss: The loss function of this branch is inspired by SGPN, a 3D point cloud instance segmentation method [29]. The branch is trained to make the embedding features of cells closer if they belong to the same instance. Equation 4 shows the loss function of the feature branch:

L_{feature} = \frac{1}{N_e^2} \sum_i^{N_e} \sum_j^{N_e} l(i, j),   (4)

l(i, j) = \begin{cases} \|F_i - F_j\|_2 & \text{if } I_{ij} = 1 \\ \max(0, K - \|F_i - F_j\|_2) & \text{if } I_{ij} = 0 \end{cases}

where F_i denotes the predicted embedding feature of cell i, I_{ij} indicates whether cell i and cell j belong to the same instance, and K is a constant such that K > 0. If I_{ij} = 1, the cells belong to the same instance, and if I_{ij} = 0, the cells belong to different instances. When the network is trained, the loss function pulls features closer when cells belong to the same instance; it pushes features apart when cells belong to different instances. We can then distinguish key points into individual instances using a simple distance-based clustering technique: in this study, if the embedding features of certain predicted key points are within a certain distance, we consider that they belong to the same instance. The feature size is set to 4 in this study, but this size is observed to have no major effect on the performance.

4) Distillation Loss: According to Newell et al. [25], better performance is observed when more hourglass modules are stacked. Therefore, the deepest hourglass module can act as a teacher network, and we expect that clipped short networks that are lighter than the teacher network will show better performance if a knowledge distillation method is applied. Zagoruyko & Komodakis [48] proposed a simple knowledge distillation method that can be applied to CNN models. This method allows a student network to imitate a teacher network; Hou et al. [30] show that the method can improve the performance of the whole framework. Equation 5 shows the loss function for distillation:

L_{distillation} = \sum_{m}^{M} D(F(A_M) - F(A_m)),
F(A_m) = S(G(A_m)), \quad S: \text{spatial softmax},
G(A_m) = \sum_{i=1}^{C} |A_{mi}|^2, \quad G: \mathbb{R}^{C \times H \times W} \rightarrow \mathbb{R}^{H \times W},   (5)

where D denotes the sum of squares, A_m denotes the distillation layer output of the m-th hourglass module, as shown in Table II, M denotes the number of hourglass modules, A_{mi} denotes the i-th channel of A_m, and all operators like sum, power, and absolute value (|·|) are element-wise.

The total loss L_{total} is the weighted sum of the above four loss terms, and the whole network is trained end-to-end with the following total loss:

L_{total} = \gamma_e L_{exist} + \gamma_n L_{non\_exist} + \gamma_o L_{offset} + \gamma_f L_{feature} + \gamma_d L_{distillation}.   (6)

In the training step, we set γ_o to 0.2, γ_f to 0.5, and γ_d to 0.1. γ_e and γ_n are described in Section IV. The proposed loss function is applied to the output branches of each hourglass module; this helps the whole network to be trained stably.
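The sketch below restates Equations 1–4 and 6 in code form. It is a simplified illustration under our own assumptions about tensor shapes (batch dimension omitted, K set to an arbitrary 1.0) and uses the instance-id map from the earlier target-building sketch; it is not the released training code.

```python
import torch
import torch.nn.functional as F

def attention_map(a):
    """Eq. 5 building blocks: channel-wise sum of squares, then spatial softmax."""
    g = (a ** 2).sum(dim=0)                        # (C, H, W) -> (H, W)
    return F.softmax(g.flatten(), dim=0).view_as(g)

def pinet_losses(conf, offset, embed, conf_gt, offset_gt, instance_gt, K=1.0):
    """Eqs. 1-4 and 6 for a single image; shapes (1,H,W), (2,H,W), (4,H,W) assumed.
    Assumes at least one ground-truth key point is present."""
    exist = conf_gt[0] == 1                        # cells that contain a key point
    background = ~exist
    n_e = exist.sum().float()

    # Eq. 1: existence loss on key-point cells.
    l_exist = ((1.0 - conf[0][exist]) ** 2).sum() / n_e
    # Eq. 2: non-existence loss only where the background prediction exceeds 0.01,
    # plus a small regularization term over all background cells.
    hard_bg = background & (conf[0] > 0.01)
    n_n = hard_bg.sum().clamp(min=1).float()
    l_non_exist = (conf[0][hard_bg] ** 2).sum() / n_n \
                  + 1e-5 * (conf[0][background] ** 2).sum()
    # Eq. 3: offset loss, evaluated on key-point cells only.
    l_offset = (((offset - offset_gt)[:, exist]) ** 2).sum() / n_e
    # Eq. 4: pairwise embedding loss (pull same-instance features, push the rest apart).
    feats = embed[:, exist].T                      # (N_e, 4)
    ids = instance_gt[exist]
    dist = torch.cdist(feats, feats)               # pairwise L2 distances
    same = (ids[:, None] == ids[None, :]).float()
    l_feature = (same * dist + (1.0 - same) * F.relu(K - dist)).mean()

    # Eq. 6 with the weights reported in the paper (gamma_e = gamma_n = 1.0 here);
    # the distillation term of Eq. 5, built from attention_map() of teacher and
    # student distillation layers, would be added with weight 0.1.
    return 1.0 * l_exist + 1.0 * l_non_exist + 0.2 * l_offset + 0.5 * l_feature
```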
C. Implementation Detail

All input images are resized to 512 × 256 and normalized from RGB values of 0–255 to values of 0–1 before the data are fed to the proposed network in both training and testing. The two public datasets used for the evaluation of the proposed method, TuSimple [49] and CULane [20], provide x-axis values of traffic lines at fixed y-axis values. Due to this annotation method, some traffic lines close to the horizontal direction are annotated sparsely. To solve this problem, we make additional annotations every 10 pixels of the x-axis by linear regression from the original data. Various data augmentation methods like shadowing, adding noise, flipping, translation, rotation, and intensity changes are also applied; these methods are shown in Fig. 5.

Fig. 5. Data augmentation methods. (a) is the original image, and (b), (c), (d), (e), (f), and (g) show examples of the applied data augmentation methods.

Additionally, the two public datasets include a large number of image frames; however, the data are imbalanced. For example, the testing set of the CULane dataset consists of various categories such as normal, night, and crossroad, and the numbers of frames per category vary widely. The exact ratios of the CULane categories can be found in Section IV-B, the results section. To resolve this issue, we sample hard data that show poor loss values in the training step, and increase the selection ratio of the hard data. The concept is similar to the hard negative mining technique.

We use one GPU (GTX 2080ti, 11GB) for training and testing; the source code is written in PyTorch. In the training step, each batch contains six images; hyper-parameters like thresholds and coefficients are determined experimentally. The exact values of the hyper-parameters are shown in the following section. PINet predicts the exact positions of key points on traffic lines, and a spline curve fitting method is applied to obtain a smoother curve.

IV. EXPERIMENTS

In this section, we evaluate PINet on two public datasets, TuSimple [49] and CULane [20]. Section IV-A introduces the overview and evaluation metric used for each dataset in the official evaluation methods. Section IV-B shows the evaluation results of PINet; Section IV-C includes an ablation study on the effect of the knowledge distillation method.

A. Dataset

Our proposed network, PINet, is trained on both TuSimple and CULane. Table III summarizes information about the two datasets. TuSimple is relatively simpler than CULane because the TuSimple dataset consists of only highway environments with fewer obstacles. We use the official evaluation source codes to evaluate PINet; the details of the datasets and evaluation metrics are described in the following sections.

TABLE III
DATASET SUMMARY

Dataset    Train    Test     Resolution   Type
TuSimple   3,626    2,782    1280 × 720   highway
CULane     88,880   34,680   1640 × 590   urban, rural, highway; various light conditions and weather

1) TuSimple: The TuSimple dataset consists of 3,626 training sets and 2,782 testing sets. Accuracy is the main evaluation metric of the TuSimple dataset, defined by the following equation according to the average number of correct points:

accuracy = \sum_{clip} \frac{C_{clip}}{S_{clip}},   (7)

where C_clip denotes the number of points correctly predicted by the trained module for the given image clip, and S_clip denotes the number of ground-truth points in the clip. The false positive (FP) and false negative (FN) rates are also provided by the following equations:

FP = \frac{F_{pred}}{N_{pred}},   (8)

FN = \frac{M_{pred}}{N_{gt}},   (9)

where F_pred denotes the number of wrongly predicted lanes, N_pred denotes the number of predicted lanes, M_pred denotes the number of missed lanes, and N_gt denotes the number of ground-truth lanes.
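A minimal sketch of these three metrics (Equations 7–9) for a single clip is shown below. The point-matching rule (a simple pixel-distance test with a placeholder threshold) and the lane-matching rule are our own simplifications, not the official TuSimple benchmark script.

```python
def tusimple_metrics(pred_lanes, gt_lanes, pixel_thr=20):
    """pred_lanes / gt_lanes: lists of lanes, each lane a list of (x, y) points.
    Returns (accuracy, FP, FN) per Eqs. 7-9 for one clip; matching is simplified."""
    def is_correct(p, points):
        # A point counts as correct if some point at the same y is within pixel_thr in x.
        return any(abs(p[0] - q[0]) <= pixel_thr and p[1] == q[1] for q in points)

    all_pred_points = sum(pred_lanes, [])
    correct = sum(is_correct(p, all_pred_points) for lane in gt_lanes for p in lane)
    total_gt_points = sum(len(lane) for lane in gt_lanes)
    accuracy = correct / max(total_gt_points, 1)          # Eq. 7 (single-clip term)

    def lane_matched(lane, other_lanes):
        other_points = sum(other_lanes, [])
        hits = sum(is_correct(p, other_points) for p in lane)
        return hits >= 0.5 * max(len(lane), 1)            # placeholder matching rule

    wrong = sum(not lane_matched(lane, gt_lanes) for lane in pred_lanes)    # F_pred
    missed = sum(not lane_matched(lane, pred_lanes) for lane in gt_lanes)   # M_pred
    fp = wrong / max(len(pred_lanes), 1)                  # Eq. 8
    fn = missed / max(len(gt_lanes), 1)                   # Eq. 9
    return accuracy, fp, fn
```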
2) CULane: The CULane dataset includes 88,880 training images and 34,680 testing images. Unlike the TuSimple dataset, various road types such as urban and night scenes are included in the CULane dataset. We follow the official evaluation metric [20] for the CULane dataset. According to [20], each traffic line is assumed to have a 30 pixel width, and we calculate the intersection-over-union (IoU) between the prediction of the evaluated model and the ground truth. In the CULane dataset, the F1-measure is the major evaluation metric; it is defined by the following equation:

F1\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},   (10)

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN). TP is the number of true positives, i.e., predictions that have an IoU larger than the threshold of 0.5; FP is the number of false positives; and FN is the number of false negatives.

B. Result

1) TuSimple: Evaluation on the TuSimple dataset requires exact x-axis values at certain fixed y-axis values. The detailed evaluation results can be seen in Table V; Fig. 6 shows some results for the TuSimple dataset. The value nH in Tables IV–VI means that the network consists of n hourglass modules. Though pre-trained weights and extra datasets are not used, PINet shows high performance in terms of accuracy and false positive rate. The false negative rate also shows a reasonable value.

TABLE V
EVALUATION RESULTS FOR TUSIMPLE DATASET
(First and second best results are highlighted in red and blue.)

Method                          Acc       FP       FN
SCNN [20]                       96.53%    0.0617   0.0180
LaneNet(+H-net) [41]            96.38%    0.0780   0.0244
PointLaneNet(MobileNet) [23]    96.34%    0.0467   0.0518
ENet-SAD [30]                   96.64%    0.0602   0.0205
ERFNet-E2E [50]                 96.02%    0.0321   0.0428
Line-CNN [24]                   96.82%    0.0442   0.0197
PINet(1H)                       95.81%    0.0585   0.0330
PINet(2H)                       96.51%    0.0467   0.0254
PINet(3H)                       96.72%    0.0365   0.0243
PINet(4H)                       96.75%    0.0310   0.0250

Table VI shows the number of parameters and the frames per second (fps) on a GTX 2080ti GPU according to the number of hourglass modules. Most components of PINet are built of bottle-neck layers; this architecture saves a lot of memory. PINet can run at 25 fps when all hourglass modules are used, and if only one hourglass module is applied, the network runs at about 40 fps. When a short network is evaluated, it is simply clipped from the whole trained network, without any additional training. The deepest network has the highest performance, but the performances of the clipped short networks show only subtle differences from that of the deepest network. The distance threshold for distinguishing each instance is 0.08; the confidence thresholds are 0.35 (4H), 0.32 (3H), 0.30 (2H), and 0.52 (1H); γ_e and γ_n are both 1.0.

TABLE VI
PARAMETER SIZE AND FPS (ON GTX 2080TI) OF PINET

            parameter (M)   fps
PINet(1H)   1.08            40
PINet(2H)   2.08            35
PINet(3H)   3.07            30
PINet(4H)   4.06            25

TABLE IV
EVALUATION RESULTS FOR CULANE DATASET
(F1-measure per category; for the Crossroad category, the number of false positives is reported. First and second best results are highlighted in red and blue.)

Category       Proportion   PINet(1H)   PINet(2H)   PINet(3H)   PINet(4H)   SCNN [20]   R-101-SAD [30]   ERFNet-E2E [50]
Normal         27.7%        85.8        89.6        90.2        90.3        90.6        90.7             91.0
Crowded        23.4%        67.1        71.9        72.4        72.3        69.7        70.0             73.1
Night          20.3%        61.7        67.0        67.7        67.7        66.1        66.3             67.9
No Line        11.7%        44.8        49.3        49.6        49.8        43.4        43.5             46.6
Shadow         2.7%         63.1        67.0        68.4        68.4        66.9        67.0             74.1
Arrow          2.6%         79.6        84.2        83.6        83.7        84.1        84.4             85.8
Dazzle Light   1.4%         59.4        65.2        66.4        66.3        58.5        59.9             64.5
Curve          1.2%         63.3        66.2        65.4        65.6        64.4        65.7             71.9
Crossroad      9.0%         1534        1505        1486        1427        1990        2052             2022
Total          -            69.4        73.8        74.3        74.4        71.6        71.8             74.0

2) CULane: Table IV and Fig. 7 show detailed results of PINet on the CULane dataset. We observe three features in the results. The first is that PINet shows a particularly low false positive rate on the CULane dataset. This means that wrong predictions of lanes by PINet are rarer than in other methods, which supports safety. Second, the clipped networks 2H and 3H show performance similar to that of the whole network; only 1H has noticeably poorer performance. It looks as if the effect of distillation is optimal when the depth is three hourglass modules in our proposed architecture. Finally, PINet works better than other methods under hard light conditions: the night and dazzle light categories in the CULane dataset include hard light conditions, and PINet shows higher performance in these categories. However, because PINet is based on the key points estimation method, local occlusions or unclear traffic lines can negatively influence the performance; the crowded, arrow, and curve categories are examples where PINet shows slightly lower performance. PINet shows the highest performance for the overall F1-measure on the CULane dataset. The distance threshold for distinguishing each instance is 0.08; the confidence thresholds are 0.94 (4H), 0.95 (3H), 0.96 (2H), and 0.97 (1H); γ_e and γ_n are initially set to 1.0 and 1.0, and γ_e is changed from 1.0 to 2.5 for the last 40 epochs.
Fig. 6. Results for the TuSimple dataset. The first row is the ground truth; the second row shows the predicted results of PINet.

Fig. 7. Results for the CULane dataset. The first row is the ground truth; the second row shows the predicted results of PINet.

C. Ablation Study

We investigate the effects of the knowledge distillation method, whose purpose is to reduce the gap between the clipped short networks and the deepest network that acts as a teacher network. Table VII shows the results of the ablation study. The average performance gap is calculated using the following equation:

AG_n = \frac{1}{N} \sum_{i}^{N} (P_i^{4H} - P_i^{nH}),   (11)

where AG_n denotes the average performance gap between 4H and nH, N denotes the total number of training epochs for this ablation study, and P_i^{nH} denotes the performance of nH at the i-th epoch. The performance is evaluated on the TuSimple test set; we collect data for the first 30 epochs. When the distillation method is applied, the average performance gap between the whole network and the clipped short networks is lower than when the distillation method is not applied. This means that the distillation method helps the clipped short networks mimic the teacher network well.

TABLE VII
AVERAGE PERFORMANCE GAP BETWEEN WHOLE NETWORK AND CLIPPED SHORT NETWORK ON TUSIMPLE DATASET (LOWER IS BETTER).

                        4H-3H     4H-2H     4H-1H
Distillation (a)        0.0096    0.0234    0.0739
No distillation (b)     0.0130    0.0327    0.1092
a/b (%)                 73.85     71.56     67.67
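Because every hourglass module is supervised by the same loss, a deployed model can simply stop after the first n modules and read that module's output branches. A minimal sketch of this idea is shown below; it assumes the trained model exposes its hourglass blocks and per-block output heads as ordered lists (the attribute names are illustrative, not the released API).

```python
import torch

def clipped_forward(pinet, image, n_hourglass=2):
    """Run only the first n hourglass modules of a trained PINet-style model.
    Assumes pinet.resizing, pinet.hourglass_blocks, and pinet.heads exist (illustrative)."""
    x = pinet.resizing(image)                       # 512x256 RGB -> 64x32 feature map
    outputs = None
    for block, head in zip(pinet.hourglass_blocks[:n_hourglass],
                           pinet.heads[:n_hourglass]):
        x = block(x)                                # encoder-decoder hourglass module
        outputs = head(x)                           # (confidence, offset, embedding) grids
    return outputs                                  # predictions of the last module used
```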
V. CONCLUSION

In this study, we have proposed a novel lane detection method, PINet, which combines the key point estimation and point instance segmentation approaches. The method can work in real time. In addition, PINet can be clipped according to the computing power of the target system, and the clipped network can be applied directly without any additional training. PINet achieves high performance and a low rate of false positives; the low false positive rate supports the safety of autonomous driving cars because wrongly predicted lanes rarely occur. In particular, PINet shows better performance than other methods in difficult light conditions such as night, shadow, and dazzling light; however, PINet has limitations when local occlusions or unclear traffic lines exist. We have shown by an ablation study that the knowledge distillation method improves the performance of the clipped short networks. As a result, we have observed that the clipped short networks' performance is close to that of the whole network.

REFERENCES

[1] Y. Lee, J. Lee, H. Ahn, and M. Jeon, "SNIDER: Single noisy image denoising and rectification for improving license plate recognition," in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0, 2019.
[2] F. Munir, S. Azam, A. M. Sheri, Y. Ko, and M. Jeon, "Where am I: Localization and 3D maps for autonomous vehicles," in VEHITS, pp. 452–457, 2019.
[3] Y. Wang, E. K. Teoh, and D. Shen, "Lane detection and tracking using B-snake," Image and Vision Computing, vol. 22, no. 4, pp. 269–280, 2004.
[4] Z. Kim, "Robust lane detection and tracking in challenging scenarios," IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 16–26, 2008.
[5] Y. Li, W. Ding, X. Zhang, and Z. Ju, "Road detection algorithm for autonomous navigation systems based on dark channel prior and vanishing point in complex road scenes," Robotics and Autonomous Systems, vol. 85, pp. 1–11, 2016.
[6] H. Wang, Y. Sun, and M. Liu, "Self-supervised drivable area and road anomaly segmentation using RGB-D data for robotic wheelchairs," IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4386–4393, 2019.
[7] Y. He, H. Wang, and B. Zhang, "Color-based road detection in urban traffic scenes," IEEE Transactions on Intelligent Transportation Systems, vol. 5, no. 4, pp. 309–318, 2004.
[8] K.-Y. Chiu and S.-F. Lin, "Lane detection using color-based segmentation," in IEEE Intelligent Vehicles Symposium, 2005, pp. 706–711, IEEE, 2005.
[9] Y. Wang, D. Shen, and E. K. Teoh, "Lane detection using Catmull-Rom spline," in IEEE International Conference on Intelligent Vehicles, vol. 1, pp. 51–57, 1998.
[10] C. Lee and J.-H. Moon, "Robust lane detection and tracking for real-time applications," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 12, pp. 4043–4048, 2018.
[11] R. O. Duda and P. E. Hart, "Use of the Hough transformation to detect lines and curves in pictures," Communications of the ACM, vol. 15, no. 1, pp. 11–15, 1972.
[12] S. Luo, X. Zhang, J. Hu, and J. Xu, "Multiple lane detection via combining complementary structural constraints," IEEE Transactions on Intelligent Transportation Systems, pp. 1–10, 2020.
[13] A. Borkar, M. Hayes, and M. T. Smith, "Robust lane detection and tracking with RANSAC and Kalman filter," in 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 3261–3264, IEEE, 2009.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, 2017.
[16] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[17] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016.
[18] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Springer, 2015.
[19] W.-J. Yang, Y.-T. Cheng, and P.-C. Chung, "Improved lane detection with multilevel features in branch convolutional neural networks," IEEE Access, vol. 7, pp. 173148–173156, 2019.
[20] X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang, "Spatial as deep: Spatial CNN for traffic scene understanding," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[21] W. Van Gansbeke, B. De Brabandere, D. Neven, M. Proesmans, and L. Van Gool, "End-to-end lane detection through differentiable least-squares fitting," in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0, 2019.
[22] Q. Zou, H. Jiang, Q. Dai, Y. Yue, L. Chen, and Q. Wang, "Robust lane detection from continuous driving scenes using deep neural networks," IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 41–54, 2019.
[23] Z. Chen, Q. Liu, and C. Lian, "PointLaneNet: Efficient end-to-end CNNs for accurate real-time lane detection," in 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 2563–2568, IEEE, 2019.
[24] X. Li, J. Li, X. Hu, and J. Yang, "Line-CNN: End-to-end traffic line detection with line proposal unit," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 1, pp. 248–258, 2020.
[25] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in European Conference on Computer Vision, pp. 483–499, Springer, 2016.
[26] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, "Learning feature pyramids for human pose estimation," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1281–1290, 2017.
[27] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint triplets for object detection," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6568–6577, 2019.
[28] X. Zhou, J. Zhuo, and P. Krahenbuhl, "Bottom-up object detection by grouping extreme and center points," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 850–859, 2019.
[29] W. Wang, R. Yu, Q. Huang, and U. Neumann, "SGPN: Similarity group proposal network for 3D point cloud instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2569–2578, 2018.
[30] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, "Learning lightweight lane detection CNNs by self attention distillation," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1013–1021, 2019.
[31] H. Deusch, J. Wiest, S. Reuter, M. Szczot, M. Konrad, and K. Dietmayer, "A random finite set approach to multiple lane detection," in 2012 15th International IEEE Conference on Intelligent Transportation Systems, pp. 270–275, IEEE, 2012.
[32] Y. U. Yim and S.-Y. Oh, "Three-feature based automatic lane detection algorithm (TFALDA) for autonomous driving," IEEE Transactions on Intelligent Transportation Systems, vol. 4, no. 4, pp. 219–225, 2003.
[33] A. Borkar, M. Hayes, and M. T. Smith, "A novel lane detection system with efficient ground truth generation," IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 1, pp. 365–374, 2011.
[34] D. C. Andrade, F. Bueno, F. R. Franco, R. A. Silva, J. H. Z. Neme, E. Margraf, W. T. Omoto, F. A. Farinelli, A. M. Tusset, S. Okida, M. M. D. Santos, A. Ventura, S. Carvalho, and R. d. S. Amaral, "A novel strategy for road lane detection and tracking based on a vehicle's forward monocular camera," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 4, pp. 1497–1507, 2019.
[35] J. M. Álvarez, A. M. López, T. Gevers, and F. Lumbreras, "Combining priors, appearance, and context for road detection," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 3, pp. 1168–1178, 2014.
[36] H. Choi, H. Ahn, K. Joonmo, and M. Jeon, "ADFNet: Accumulated decoder features for real-time semantic segmentation," IET Computer Vision, 2020.
[37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[38] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
[39] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin, "Multi-class lane semantic segmentation using efficient convolutional networks," in 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6, IEEE, 2019.
[40] M. Ghafoorian, C. Nugteren, N. Baka, O. Booij, and M. Hofmann, "EL-GAN: Embedding loss driven generative adversarial networks for lane detection," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0, 2018.
[41] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, "Towards end-to-end lane detection: An instance segmentation approach," in 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 286–291, IEEE, 2018.
[42] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91–99, 2015.
[43] W. Li, Z. Wang, B. Yin, Q. Peng, Y. Du, T. Xiao, G. Yu, H. Lu, Y. Wei, and J. Sun, "Rethinking on multi-stage networks for human pose estimation," arXiv preprint arXiv:1901.00148, 2019.
[44] G. Moon, J. Y. Chang, and K. M. Lee, "PoseFix: Model-agnostic general human pose refinement network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7773–7781, 2019.
[45] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
[47] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pp. 448–456, JMLR.org, 2015.
[48] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928, 2016.
[49] "The TuSimple lane challenge," http://benchmark.tusimple.ai/.
[50] S. Yoo, H. Seok Lee, H. Myeong, S. Yun, H. Park, J. Cho, and D. Hoon Kim, "End-to-end lane marker detection via row-wise classification," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1006–1007, 2020.

Yeongmin Ko received the B.S. degree in the School of Electrical Engineering from Gwangju Institute of Science and Technology (GIST), Gwangju, South Korea, in 2017. He is currently pursuing the Ph.D. degree with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology. His current research interests include computer vision, self-driving, and deep learning.

Younkwan Lee received the B.S. degree in computer science from Korea Aerospace University, Gyeonggi, South Korea, in 2016. He is currently pursuing the Ph.D. degree with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, South Korea. His current research interests include computer vision, machine learning, and deep learning.

Shoaib Azam received the B.S. degree in Engineering Sciences from the Ghulam Ishaq Khan Institute of Science and Technology, Pakistan, in 2010, and the M.S. degree in Robotics and Intelligent Machine Engineering from the National University of Science and Technology, Pakistan, in 2015. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea. His current research interests include artificial intelligence and machine learning, computer vision, robotics, and autonomous driving.

Farzeen Munir received the B.S. degree in Electrical Engineering from the Pakistan Institute of Engineering and Applied Sciences, Pakistan, in 2013, and the M.S. degree in Systems Engineering from the Pakistan Institute of Engineering and Applied Sciences, Pakistan, in 2015. She is now pursuing her Ph.D. degree at the Gwangju Institute of Science and Technology, Korea, in Electrical Engineering and Computer Science. Her current research interests include machine learning, deep neural networks, autonomous driving, and computer vision.

Moongu Jeon received the B.S. degree in architectural engineering from Korea University, Seoul, South Korea, in 1988, and the M.S. and Ph.D. degrees in computer science and scientific computation from the University of Minnesota, Minneapolis, MN, USA, in 1999 and 2001, respectively. As a master's degree researcher, he was involved in optimal control problems with the University of California at Santa Barbara, Santa Barbara, CA, USA, from 2001 to 2003, and then moved to the National Research Council of Canada, where he was involved in the sparse representation of high-dimensional data and image processing until July 2005. In 2005, he joined the Gwangju Institute of Science and Technology, Gwangju, South Korea, where he is currently a Full Professor with the School of Electrical Engineering and Computer Science. His current research interests include machine learning, computer vision, and artificial intelligence.

Witold Pedrycz received the M.Sc., Ph.D., and D.Sc. degrees from the Silesian University of Technology, Gliwice, Poland. He is a Professor and the Canada Research Chair of Computational Intelligence with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. He is also with the Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland. Dr. Pedrycz is a Foreign Member of the Polish Academy of Sciences and a Fellow of the Royal Society of Canada. He has authored 17 research monographs and edited volumes covering various aspects of computational intelligence, data mining, and software engineering. His current research interests include computational intelligence, fuzzy modeling and granular computing, knowledge discovery and data science, fuzzy control, pattern recognition, knowledge-based neural networks, relational computing, and software engineering. Dr. Pedrycz was a recipient of the prestigious Norbert Wiener Award from the IEEE Systems, Man, and Cybernetics Society in 2007; the IEEE Canada Computer Engineering Medal; the Cajastur Prize for Soft Computing from the European Centre for Soft Computing; the Killam Prize; and the Fuzzy Pioneer Award from the IEEE Computational Intelligence Society. He is vigorously involved in editorial activities. He is Editor-in-Chief of Information Sciences, Editor-in-Chief of WIREs Data Mining and Knowledge Discovery (Wiley), and of the International Journal of Granular Computing (Springer). He currently serves on the Advisory Board of IEEE Transactions on Fuzzy Systems and is a member of a number of editorial boards of other international journals.