
# CondLaneNet: a Top-to-down Lane Detection Framework Based on Conditional Convolution

Lizhe Liu1 Xiaohao Chen1 Siyu Zhu1 Ping Tan1,2

1Alibaba Group 2Simon Fraser University

arXiv:2105.05003v2 [cs.CV] 10 Jun 2021

Abstract

Modern deep-learning-based lane detection methods are successful in most scenarios but struggle with lane lines of complex topologies. In this work, we propose CondLaneNet, a novel top-to-down lane detection framework that detects the lane instances first and then dynamically predicts the line shape for each instance. Aiming to resolve the lane instance-level discrimination problem, we introduce a conditional lane detection strategy based on conditional convolution and row-wise formulation. Further, we design the Recurrent Instance Module (RIM) to overcome the problem of detecting lane lines with complex topologies such as dense lines and fork lines. Benefiting from the end-to-end pipeline, which requires little post-processing, our method achieves real-time efficiency. We extensively evaluate our method on three lane detection benchmarks. Results show that our method achieves state-of-the-art performance on all three benchmark datasets while combining accuracy and efficiency, e.g. a 78.14 F1 score and 220 FPS on CULane. Our code is available at https://github.com/aliyun/conditional-lane-detection.

Figure 1. Scenes of lane lines with complex topologies. It is challenging to cope with scenes such as dense lines (the first row) and fork lines (the second row). Different instances are represented by different colors in this figure.

1. Introduction

Artificial intelligence technology is increasingly being used in the driving field, which is conducive to autonomous driving and the Advanced Driver Assistance System (ADAS). As a basic problem in autonomous driving, lane detection plays a vital role in applications such as vehicle real-time positioning, driving route planning, lane-keeping assist, and adaptive cruise control.

Traditional lane detection methods usually rely on hand-crafted operators to extract features [24, 43, 13, 17, 15, 1, 16, 33] and then fit the line shape through post-processing such as the Hough transform [24, 43] and Random Sample Consensus (RANSAC) [17, 15]. However, traditional methods fail to maintain robustness in real scenes, since hand-crafted models cannot cope with the diversity of lane lines in different scenarios [27]. Recently, most studies on lane detection have focused on deep learning [34]. Early deep-learning-based methods detect lane lines through segmentation [28, 27]. More recently, various approaches such as anchor-based methods [32, 2, 39], row-wise detection methods [30, 29, 41], and parametric prediction methods [31, 25] have been proposed and continue to improve accuracy and efficiency.

Although deep-learning-based lane detection methods have made great progress [42], there are still many challenges.

A common problem for lane detection is instance-level discrimination. Most lane detection methods [27, 28, 19, 32, 12, 21, 29, 2, 30, 41, 39] predict lane points first and then aggregate the points into lines, but it remains a common challenge to assign different points to different lane instances [34]. A simple solution is to label the lane lines with a fixed number of classes (e.g. labeled as 0, 1, 2, 3 if the maximum lane number is 4) and perform a multi-class classification [28, 30, 41, 3]. The limitation is that only a predefined, fixed number of lanes can be detected [27]. To overcome this limitation, the post-clustering strategy has been investigated [27, 19]; however, this strategy struggles in some cases such as dense lines. Another approach is anchor-based methods [25, 22, 39], but predicting the line shape is inflexible due to the fixed shape of the anchor [39].

Another challenge is the detection of lane lines with complex topologies, such as fork lines and dense lines, as shown in Figure 1. Such cases are common in driving scenarios; e.g. fork lines usually appear when the number of lanes changes. Homayounfar et al. [10] proposed an offline lane detection method for HDMap (High Definition Map) which can deal with fork lines. However, there are few studies on the perception of lane lines with complex topologies for real-time driving scenarios.

The lane detection task is similar to instance segmentation, which requires assigning different pixels to different instances. Recently, some studies [35, 38] have investigated the conditional instance segmentation strategy, which is also promising for lane detection tasks. However, it is inefficient to directly apply this strategy to lane detection, since the constraint on the mask is not completely consistent with specifying the line shape [3, 34, 30].

In this work, we propose a novel lane detection framework called CondLaneNet. Aiming to resolve the lane instance-level discrimination problem, we propose the conditional lane detection strategy inspired by CondInst [35] and SOLOv2 [38]. Different from instance segmentation tasks, we focus the optimization on specifying the lane line shape based on the row-wise formulation [30, 41]. Moreover, we design the Recurrent Instance Module (RIM) to deal with the detection of lane lines with complex topologies such as dense lines and fork lines. Besides, benefiting from the end-to-end pipeline that requires little post-processing, our method achieves real-time efficiency. The contributions of this work are summarised as follows:

• We have greatly improved the ability of lane instance-level discrimination by the proposed conditional lane detection strategy and row-wise formulation.

• We solve the problem of detecting lane lines with complex topologies such as dense lines and fork lines by the proposed RIM.

• Our CondLaneNet framework achieves state-of-the-art performance on multiple datasets, e.g. an 86.10 F1 score (4.6% higher than SOTA) on CurveLanes and a 79.48 F1 score (3.2% higher than SOTA) on CULane. Moreover, the small version of our CondLaneNet has high efficiency while ensuring high accuracy, e.g. a 78.14 F1 score and 220 FPS on CULane.

2. Related Work

This section introduces recent deep-learning-based lane detection methods. According to the strategy of line shape description, current methods can be divided into four categories: segmentation-based methods, anchor-based methods, row-wise detection methods, and parametric prediction methods.

2.1. Segmentation-based Methods

Segmentation-based methods [28, 12, 27, 19, 21, 6] are the most common and have achieved impressive performance. Different from general semantic segmentation tasks, lane detection requires instance-level discrimination. Early methods [28, 6] used a multi-class classification strategy for lane instance discrimination. As explained in the previous section, this strategy is inflexible. For higher instance accuracy, the post-clustering strategy [4] was widely applied [27, 19]. Considering that segmentation-based methods generally predict a down-scaled mask, some methods [19] predict an offset map for refinement. Recently, some studies [3, 30] indicated that it is inefficient to describe the lane line as a mask, because the emphasis of segmentation is obtaining an accurate per-pixel classification rather than specifying the line shape. To overcome this problem, anchor-based methods and row-wise detection methods were proposed.

2.2. Anchor-based Methods

Anchor-based methods [32, 2, 39] take a top-to-down pipeline and focus the optimization on the line shape by regressing coordinates relative to predefined anchors. The predefined anchors can reduce the impact of the no-visual-clue problem [32] and improve the ability of instance discrimination. Due to the slender shape of lane lines, the box anchors widely used in object detection [7] cannot be applied directly. PointLaneNet [2] and CurveLane [39] used vertical lines as anchors. LaneATT [32] designed anchors with a slender shape and achieves state-of-the-art performance on multiple datasets. However, the fixed anchor shape results in a low degree of freedom in describing the line shape [39].

2.3. Row-wise Detection Methods

Row-wise detection methods [30, 29, 41] make good use of the shape prior and predict the line location for each row. In the training phase, the constraint on the overall line shape is realized through the location constraint on each row. Based on the continuity and consistency of the predicted locations from row to row, shape constraints can be added to the model [29, 30]. Besides, in terms of efficiency, some recent row-wise detection methods [30, 41, 11] have achieved clear advantages. However, instance-level discrimination is still the main problem for the row-wise formulation. As the post-clustering module [4] widely used in segmentation-based methods [27, 19] cannot be directly integrated into the row-wise formulation, row-wise detection methods still take the multi-class classification strategy for lane instance discrimination. Considering the impressive performance in accuracy and efficiency, we also adopt the row-wise formulation and propose some novel strategies to overcome the instance-level discrimination problem.

2.4. Parametric Prediction Methods

Different from the above methods, which predict points, parametric prediction methods directly output parametric lines expressed by curve equations. PolyLaneNet [31] first proposed to use a deep network to regress the lane curve equation. LSTR [25] introduced the transformer [37] to the lane detection task and reached a detection speed of 420 FPS. However, parametric prediction methods have not yet surpassed other methods in terms of accuracy.
Figure 2. The structure of our CondLaneNet framework. The backbone adopts a standard ResNet [8] and FPN [23] for multi-scale feature extraction. The transformer encoder module [37] is added for more efficient context feature extraction. The proposal head is responsible for detecting the proposal points, which are located at the start points of the lines; meanwhile, a parameter map that contains the dynamic convolution kernels is predicted. The conditional shape head predicts the row-wise location, the vertical range, and the offset map to describe the shape of each line. To address the cases of dense lines and fork lines, the RIM is designed.

3. Methods

Given an input image $I \in \mathbb{R}^{C \times H \times W}$, the goal of our CondLaneNet is to predict a collection of lanes $L = \{l_1, l_2, \ldots, l_N\}$, where $N$ is the total number of lanes. Generally, each lane $l_k$ is represented by an ordered set of coordinates as follows:

$$l_k = [(x_{k1}, y_{k1}), (x_{k2}, y_{k2}), \ldots, (x_{kN_k}, y_{kN_k})] \quad (1)$$

where $k$ is the index of the lane and $N_k$ is the max number of sample points of the $k$th lane.

The overall structure of our CondLaneNet is shown in Figure 2. This section first presents the conditional lane detection strategy, then introduces the RIM (Recurrent Instance Module), and finally details the framework design.

3.1. Conditional Lane Detection

Focusing on instance-level discrimination ability, we propose the conditional lane detection strategy based on conditional convolution, a convolution operation with dynamic kernel parameters [14, 40]. The conditional detection process [35, 38] has two steps: instance detection and shape prediction, as shown in Figure 3. The instance detection step predicts the object instance and regresses a set of dynamic kernel parameters for each instance. In the shape prediction step, conditional convolutions are applied to specify the instance shape. This process is conditioned on the dynamic kernel parameters. Since each instance corresponds to a set of dynamic kernel parameters, the shapes can be predicted instance-wisely.

Figure 3. The difference between conditional instance segmentation and the proposed conditional lane detection strategy. Our CondLaneNet detects the start point of each lane line to detect the instance and uses the row-wise formulation instead of the mask to describe the line shape. The overlapping lines can be distinguished based on the proposed RIM, which will be detailed in Section 3.2.

This strategy has achieved impressive performance on instance segmentation tasks [35, 38]. However, directly applying the conditional instance segmentation strategy to lane detection is blunt and inappropriate. On the one hand, the segmentation-based shape description is inefficient for lane lines due to its excessively high degree of freedom [30]. On the other hand, the instance detection strategy for general objects is not suitable for slender and curved objects, due to the inconspicuous visual characteristics of their border and center. Our conditional lane detection strategy improves both shape prediction and instance detection to address the above problems.
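To make the mechanism concrete, below is a minimal PyTorch sketch (not the authors' implementation) of how a predicted parameter vector can serve as the weights of an instance-specific convolution over shared features; the feature size, the single 1×1 layer, and the output channel count are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def conditional_conv(features, kernel_params, out_channels=1):
    """Apply an instance-specific 1x1 convolution whose weights come
    from a dynamically predicted parameter vector (a sketch of the
    conditional-convolution idea; shapes are illustrative)."""
    # features: (C, H, W) shared feature map from the shape-head branch.
    c, h, w = features.shape
    # Split the flat parameter vector into conv weight and bias.
    weight = kernel_params[: out_channels * c].view(out_channels, c, 1, 1)
    bias = kernel_params[out_channels * c : out_channels * c + out_channels]
    # Each detected instance runs the same operation with its own weights.
    return F.conv2d(features.unsqueeze(0), weight, bias).squeeze(0)

# Toy usage: two instances, each with its own kernel vector, produce
# two different location maps from the same shared features.
feats = torch.randn(64, 40, 100)
params_per_instance = torch.randn(2, 64 * 1 + 1)  # weight + bias each
location_maps = [conditional_conv(feats, p) for p in params_per_instance]
print(location_maps[0].shape)  # torch.Size([1, 40, 100])
```

The design point is that the convolution itself stays cheap and shared; only the tiny weight vector differs per instance.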
3.1.1 Shape Prediction

We improve the row-wise formulation [30] to predict the line shape based on our conditional shape head, as shown in Figure 2. In the row-wise formulation, we predict the lane location on each row and then aggregate the locations to get the lane line in the order from bottom to top, based on the prior of the line shape. Our row-wise formulation has three components: the row-wise location, the vertical range, and the offset map. The first two outputs are basic elements of most row-wise detection methods [30, 41]. Besides, we predict an offset map as the third output for further refinement.

Figure 4. The process of parsing the row-wise location and the vertical range from the location map.

Row-wise Location: As shown in Figure 4, we divide the input image into grids of shape $Y \times X$ and predict a corresponding location map, which is a feature map of shape $1 \times Y \times X$ output by the proposed conditional shape head. On the location map, each row has an abscissa indicating the location of the lane line.

To get the row-wise location, a basic approach is to perform an $X$-class classification in each row; at inference time, the row-wise location would then be determined by picking the most responsive abscissa in each row. However, a common situation is that the line location lies between two grids, and both grids should then have a high response. To overcome this problem, we introduce the following formulation. For each row, we predict the probability that the lane line appears in each grid:

$$p^i = \mathrm{softmax}(f^i_{loc}) \quad (2)$$

where $i$ indexes the row, $f^i_{loc}$ is the feature vector of the $i$th row of the location map $f_{loc}$, and $p^i$ is the probability vector for the $i$th row.

The final row-wise location is defined as the expected abscissa:

$$E(\hat{x}^i) = \sum_j j \cdot p^i_j \quad (3)$$

where $E(\hat{x}^i)$ is the expected abscissa and $p^i_j$ is the probability of the lane line passing through the coordinate $(j, i)$.

In the training phase, an L1 loss is applied:

$$\ell_{row} = \frac{1}{N_v}\sum_{i \in V} \left| E(\hat{x}^i) - x^i \right| \quad (4)$$

where $V$ represents the vertical range of the labeled line and $N_v$ is the number of valid rows.

Vertical Range: The vertical lane range is determined by row-wisely predicting whether the lane line passes through the current row, as shown in Figure 4. We add a linear layer and perform a binary classification row by row, using the feature vector of each row in the location map as the input. The softmax cross-entropy loss is adopted to guide the training process:

$$\ell_{range} = \sum_i \left( -y^i_{gt}\log(v^i) - (1 - y^i_{gt})\log(1 - v^i) \right) \quad (5)$$

where $v^i$ is the predicted positive probability for the $i$th row and $y^i_{gt}$ is the ground truth of the $i$th row.

Offset Map: The row-wise location defined in Equation 3 points to the abscissa of the vertex on the left side of the grid, rather than the precise location. Thus, we add the offset map to predict the horizontal offset near the row-wise location for each row. We use an L1 loss to constrain the offset map:

$$\ell_{offset} = \frac{1}{N_\Omega}\sum_{(j,i)\in\Omega} \left| \hat{\delta}_{ij} - \delta_{ij} \right| \quad (6)$$

where $\hat{\delta}_{ij}$ and $\delta_{ij}$ are the predicted offset and the label offset at coordinate $(j, i)$. We define $\Omega$ as the area near the lane line with a fixed width, and $N_\Omega$ is the number of pixels in $\Omega$.

Shape Description: Each output lane line is represented as an ordered set of coordinates. For the $k$th line, the coordinate $(x^i_k, y^i_k)$ of the $i$th row is computed as follows:

$$y^i_k = \frac{H}{Y}\cdot i, \qquad x^i_k = \frac{W}{X}\cdot\left(loc^i_k + \delta(loc^i_k, i)\right) \quad (7)$$

where $i \in [v^k_{min}, v^k_{max}]$, $v^k_{min}$ and $v^k_{max}$ are respectively the minimum and maximum values of the predicted vertical range, $loc^i_k$ is rounded down from $E^i_k$, and $\delta(\cdot)$ is the predicted offset.
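The following hedged sketch shows how Equations 2, 3 and 7 could be combined at inference time to decode a location map into lane coordinates. The pre-thresholded vertical-range mask, the map shapes, and the per-row offset lookup are simplifying assumptions rather than the paper's exact post-processing.

```python
import torch

def decode_lane(location_map, offset_map, row_valid, img_w, img_h):
    """Decode one lane from a 1 x Y x X location map (Eq. 2-3) plus an
    offset map (Eq. 7). `row_valid` is assumed to be a boolean mask
    already thresholded from the vertical-range head."""
    _, Y, X = location_map.shape
    probs = location_map[0].softmax(dim=1)               # Eq. 2, per row
    expected = (probs * torch.arange(X).float()).sum(1)  # Eq. 3
    coords = []
    for i in range(Y):
        if not row_valid[i]:
            continue
        loc = int(expected[i])                 # floor of E(x^i)
        delta = offset_map[0, i, loc].item()   # horizontal refinement
        x = img_w / X * (loc + delta)          # Eq. 7
        y = img_h / Y * i
        coords.append((x, y))
    return coords

# Toy usage with random maps; a real model would supply these tensors.
loc_map, off_map = torch.randn(1, 40, 100), torch.rand(1, 40, 100)
valid = torch.ones(40, dtype=torch.bool)
print(len(decode_lane(loc_map, off_map, valid, img_w=800, img_h=320)))
```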
3.1.2 Instance Detection

We design the proposal head for instance detection, as shown in Figure 2. In general conditional instance segmentation methods [35, 38], the instance is detected in an end-to-end pipeline by predicting the center of each object. However, it is hard to predict the center of a slender and curved line, because the visual characteristics of the line center are not obvious.

We instead detect a lane instance by detecting the proposal point located at the start point of the line. The start point has a clearer definition and more obvious visual characteristics than the center. We follow CenterNet [5] and predict a proposal heatmap to detect the proposal points. To constrain the proposal heatmap, we adopt the focal loss following CornerNet [20] and CenterNet [5]:

$$\ell_{point} = \frac{-1}{N_p}\sum_{xy}\begin{cases}(1 - \hat{P}_{xy})^{\alpha}\log(\hat{P}_{xy}) & P_{xy} = 1 \\ (1 - P_{xy})^{\beta}(\hat{P}_{xy})^{\alpha}\log(1 - \hat{P}_{xy}) & \text{otherwise}\end{cases} \quad (8)$$

where $P_{xy}$ is the label and $\hat{P}_{xy}$ the predicted value at coordinate $(x, y)$ of the proposal heatmap, and $N_p$ is the number of proposal points in the input image.

Besides, we regress the dynamic kernel parameters by predicting a parameter map, following CondInst [35] and SOLOv2 [38]. The constraints on the parameter map are constructed through the constraints on the line shape.
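As an illustration of the proposal-head outputs, here is a small sketch of picking proposal points from the heatmap and gathering each point's kernel vector from the parameter map. The 3×3 max-pool peak selection is a common CenterNet-style heuristic and an assumption here, not a detail confirmed by the paper.

```python
import torch

def gather_instance_kernels(heatmap, param_map, score_thr=0.3):
    """Sketch: pick proposal points from a 1 x Hp x Wp heatmap and
    gather the Cp-dim kernel vector at each point from the parameter
    map (Cp x Hp x Wp)."""
    pooled = torch.nn.functional.max_pool2d(
        heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    # A location is a proposal point if it is a local maximum above threshold.
    is_peak = (heatmap == pooled) & (heatmap > score_thr)
    ys, xs = torch.nonzero(is_peak[0], as_tuple=True)
    # One Cp-dimensional kernel feature vector per proposal point.
    kernels = param_map[:, ys, xs].T  # (num_points, Cp)
    return list(zip(xs.tolist(), ys.tolist())), kernels

heat = torch.rand(1, 20, 50)        # toy proposal heatmap
params = torch.randn(128, 20, 50)   # toy parameter map, Cp = 128
points, kernels = gather_instance_kernels(heat, params)
print(len(points), kernels.shape)
```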
3.2. Recurrent Instance Module

In the proposal head described above, each proposal point is bound to one lane instance. However, in practice, multiple lane lines can fall on the same proposal point, as with fork lanes. To deal with such cases, we propose the Recurrent Instance Module (RIM).

Figure 5. The Recurrent Instance Module. In this figure, h and c are the short-term memory and long-term memory respectively, f is the input feature vector, s is the output state logit, and k is the output kernel parameter vector.

The structure of the proposed RIM is shown in Figure 5. Based on an LSTM (Long Short-term Memory) [9], the RIM recurrently predicts a state vector $s_i$ and a kernel parameter vector $k_i$. We define $s_i$ as two-dimensional logits that indicate two states: "continue" or "stop". The vector $k_i$ contains the kernel parameters for the subsequent instance-wise dynamic convolution. In the inference phase, the RIM recurrently predicts the lane-wise kernel parameters bound to the same proposal point until the state is "stop". As shown in Figure 2, RIM is added for each proposal point. Therefore, each proposal point can guide the shape prediction of multiple lane instances.

We adopt a cross-entropy loss to constrain the state output:

$$\ell_{state} = -\frac{1}{N_s}\sum_i \left[ y_i \cdot \log(s_i) + (1 - y_i) \cdot \log(1 - s_i) \right] \quad (9)$$

where $s_i$ is the softmax output for the $i$th state, $y_i$ is the ground truth for the $i$th state, and $N_s$ is the total number of state outputs in a batch.

In the training phase, the total loss is defined as follows:

$$\ell_{total} = \ell_{point} + \alpha\ell_{row} + \beta\ell_{range} + \gamma\ell_{offset} + \eta\ell_{state} \quad (10)$$

The hyperparameters $\alpha$, $\beta$, $\gamma$ and $\eta$ are set to 1.0, 1.0, 0.4 and 1.0 respectively.
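Before describing the overall architecture, here is a minimal sketch of the RIM recurrence, assuming an LSTM cell with two linear heads and a hard cap on the number of steps; the layer sizes and the stop-criterion implementation are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class RecurrentInstanceModule(nn.Module):
    """Sketch of the RIM idea: given the kernel feature vector of one
    proposal point, an LSTM cell emits a kernel-parameter vector plus
    a continue/stop state until it predicts "stop"."""
    def __init__(self, feat_dim=128, kernel_dim=67, max_steps=4):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, feat_dim)
        self.state_head = nn.Linear(feat_dim, 2)     # logits: continue/stop
        self.kernel_head = nn.Linear(feat_dim, kernel_dim)
        self.max_steps = max_steps

    def forward(self, feat):                         # feat: (1, feat_dim)
        h = torch.zeros_like(feat)
        c = torch.zeros_like(feat)
        kernels = []
        for _ in range(self.max_steps):              # hard cap for safety
            h, c = self.cell(feat, (h, c))
            kernels.append(self.kernel_head(h))
            if self.state_head(h).argmax(dim=1)[0] == 1:  # "stop"
                break
        return kernels  # one kernel vector per lane sharing this point

rim = RecurrentInstanceModule()
feat = torch.randn(1, 128)  # kernel feature vector of one proposal point
print(len(rim(feat)))       # >= 1 kernel-parameter vectors
```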
3.3. Architecture

The overall architecture is shown in Figure 2. We adopt ResNet [8] as the backbone and add a standard FPN [23] module to provide integrated multi-scale features. The proposal head detects the lane instances by predicting the proposal heatmap of shape $1 \times H_p \times W_p$. Meanwhile, a parameter map of shape $C_p \times H_p \times W_p$ that contains the dynamic kernel parameters is predicted. For the instance whose proposal point is located at $(x_p, y_p)$, the corresponding dynamic kernel parameters are contained in the $C_p$-dimensional kernel feature vector at $(x_p, y_p)$ on the parameter map. Further, given the kernel feature vector, the RIM recurrently predicts the dynamic kernel parameters. Finally, the conditional shape head predicts the line shape instance-wisely, conditioned on the dynamic kernel parameters.

Our framework requires a strong capability of context feature fusion. For example, the prediction of a proposal point is based on the features of the entire lane line, which generally has an elongated shape and long range. Therefore, we add a transformer encoder structure to the last layer of the backbone for the fusion of contextual information. We retain the two-dimensional spatial features in the encoder layer and use convolutions for feature extraction. The structure of the transformer encoder used in our framework is shown in Figure 6.

Figure 6. The structure of the transformer encoder. The ⊕, ⊙ and ⊗ respectively represent matrix addition, dot-product operation and element-wise product operation.
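For intuition, the sketch below shows one way a convolution-plus-self-attention encoder layer over a 2-D feature map could look. It is a simplified stand-in under assumed shapes, not the exact structure of Figure 6 (residual connections and the element-wise product branch are omitted).

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Sketch of a transformer-style encoder over a 2-D feature map:
    a 3x3 conv for local features, then single-head self-attention with
    learned position encodings over the flattened spatial locations."""
    def __init__(self, channels=256, h=10, w=25):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.pos = nn.Parameter(torch.zeros(1, h * w, channels))
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = self.conv(x)
        tokens = x.flatten(2).transpose(1, 2) + self.pos  # (B, HW, C)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        attn = (q @ k.transpose(1, 2) / c ** 0.5).softmax(dim=-1)
        out = attn @ v                                    # global context
        return out.transpose(1, 2).view(b, c, h, w)

enc = SpatialSelfAttention()
print(enc(torch.randn(2, 256, 10, 25)).shape)  # torch.Size([2, 256, 10, 25])
```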
4. Experiments

4.1. Experimental Setting

4.1.1 Datasets

To extensively evaluate the proposed method, we conduct experiments on three benchmarks: CurveLanes [39], CULane [28], and TuSimple [36]. CurveLanes is a recently proposed benchmark with cases of complex topologies such as fork lines and dense lines. CULane is a widely used large lane detection dataset with 9 different scenarios. TuSimple is another widely used dataset of highway driving scenes. The details of the three datasets are shown in Table 1.

| Dataset | Train | Val. | Test | Road type | Fork |
|---|---|---|---|---|---|
| CurveLanes | 100K | 20K | 30K | Urban & Highway | ✓ |
| CULane | 88.9K | 9.7K | 34.7K | Urban & Highway | × |
| TuSimple | 3.3K | 0.4K | 2.8K | Highway | × |

Table 1. Details of the three datasets.

4.1.2 Evaluation Metrics

For CurveLanes and CULane, we adopt the evaluation metric of SCNN [28], which utilizes the F1 measure. The IoU between a predicted lane line and the GT label is used to judge whether a sample is a true positive (TP), false positive (FP) or false negative (FN). The IoU of two lines is defined as the IoU of their masks drawn with a fixed line width. The F1 measure is then calculated as follows:

$$Precision = \frac{TP}{TP + FP} \quad (11)$$

$$Recall = \frac{TP}{TP + FN} \quad (12)$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (13)$$

For the TuSimple dataset [36], there are three official indicators: false-positive rate (FPR), false-negative rate (FNR), and accuracy:

$$accuracy = \frac{\sum_{clip} C_{clip}}{\sum_{clip} S_{clip}} \quad (14)$$

where $C_{clip}$ is the number of correctly predicted lane points and $S_{clip}$ is the total number of lane points of a clip. A lane with accuracy greater than 85% is considered a true positive, otherwise a false positive or false negative. Besides, the F1 score is also reported.
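As a small worked example of Equations 11–14 (the IoU matching itself is assumed to be done by each benchmark's official tools, so only matched counts appear here):

```python
def lane_detection_f1(num_tp, num_fp, num_fn):
    """F1 from IoU-matched lane counts (Eq. 11-13). A prediction counts
    as TP when its mask IoU with a ground-truth lane exceeds the
    benchmark threshold; that matching step is assumed already done."""
    precision = num_tp / (num_tp + num_fp)
    recall = num_tp / (num_tp + num_fn)
    return 2 * precision * recall / (precision + recall)

def tusimple_accuracy(correct_pts_per_clip, total_pts_per_clip):
    """TuSimple accuracy (Eq. 14): correctly predicted lane points over
    all labeled lane points, summed over clips."""
    return sum(correct_pts_per_clip) / sum(total_pts_per_clip)

# Toy numbers only; real evaluation uses the benchmarks' official scripts.
print(round(lane_detection_f1(850, 120, 150), 4))
print(round(tusimple_accuracy([96, 88], [100, 100]), 4))
```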
4.1.3 Implementation Details

We fix the large, medium, and small versions of our CondLaneNet for all three datasets; the differences between the three models are shown in Table 2. For all three datasets, input images are resized to 800×320 pixels during training and testing. Since there are no cases of fork lines in CULane and TuSimple, RIM is only applied for the CurveLanes dataset. For optimization, we use the Adam optimizer [18] and step learning rate decay [26] with an initial learning rate of 3e-4. For each dataset, we train on the training set without any extra data. We train 14, 16 and 70 epochs for CurveLanes, CULane and TuSimple respectively, with a batch size of 32. The results are reported on the test set for CULane and TuSimple. For CurveLanes, we report the results on the validation set following CurveLane [39]. All the experiments were run on a machine with an RTX 2080 GPU.

| Model name | Backbone | Proposal head input | Shape head input |
|---|---|---|---|
| Large | ResNet-101 | downscale 16 | downscale 4 |
| Medium | ResNet-34 | downscale 16 | downscale 8 |
| Small | ResNet-18 | downscale 16 | downscale 8 |

Table 2. Differences between the versions of our CondLaneNet.

4.2. Results

The visualization results on the CurveLanes, CULane, and TuSimple datasets are shown in Figure 7. They show that our method can cope with complex line topologies. Even in the cases of dense lines and fork lines, our method can successfully discriminate the instances.

Figure 7. Visualization results on CurveLanes (the first row), CULane (the middle row) and TuSimple (the last row) datasets; each pair shows the prediction beside the ground truth (GT). Different lane instances are represented by different colors.

| Method | F1 | Precision | Recall | FPS | GFlops(G) |
|---|---|---|---|---|---|
| SCNN [28] | 65.02 | 76.13 | 56.74 | | 328.4 |
| ENet-SAD [12] | 50.31 | 63.60 | 41.60 | | 3.9 |
| PointLaneNet [2] | 78.47 | 86.33 | 72.91 | | 14.8 |
| CurveLane-S [39] | 81.12 | 93.58 | 71.59 | | 7.4 |
| CurveLane-M [39] | 81.80 | 93.49 | 72.71 | | 11.6 |
| CurveLane-L [39] | 82.29 | 91.11 | 75.03 | | 20.7 |
| CondLaneNet-S | 85.09 | 87.75 | 82.58 | 154 | 10.3 |
| CondLaneNet-M | 85.92 | 88.29 | 83.68 | 109 | 19.7 |
| CondLaneNet-L | 86.10 | 88.98 | 83.41 | 48 | 44.9 |

Table 3. Comparison of different methods on CurveLanes.

CurveLanes: The comparison results on CurveLanes are shown in Table 3. CurveLanes contains cases of lane lines with complex topologies such as curved, fork, and dense lanes. Our large version of CondLaneNet achieves a new state-of-the-art F1 score of 86.10, 4.63% higher than CurveLane-L. Our small version of CondLaneNet still reaches an 85.09 F1 score (3.40% higher than SOTA). Since our model can deal with cases of fork and dense lane lines, there is a significant improvement in the recall indicator. Correspondingly, false-positive results increase, resulting in a decrease in the precision indicator.

CULane: The results of our CondLaneNet and other state-of-the-art methods on CULane are shown in Table 4. Our method achieves a new state-of-the-art result of a 79.48 F1 score, an increase of 3.19%. Moreover, our method achieves the best performance in eight of nine scenarios, showing robustness across different scenarios. For some hard cases such as curve and night, our method has obvious advantages. Besides, the small version of our CondLaneNet gets a 78.14 F1 score at 220 FPS, 1.12 higher and 8.5× faster than LaneATT-L. Compared with LaneATT-S, CondLaneNet-S achieves a 4.01% F1 score improvement with similar efficiency. In most scenarios of CULane, the small version of our CondLaneNet exceeds all previous methods in the F1 measure.

| Category | Total | Normal | Crowded | Dazzle | Shadow | No line | Arrow | Curve | Cross | Night | FPS | GFlops(G) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SCNN [28] | 71.60 | 90.60 | 69.70 | 58.50 | 66.90 | 43.40 | 84.10 | 64.40 | 1990 | 66.10 | 7.5 | 328.4 |
| ERFNet-E2E [41] | 74.00 | 91.00 | 73.10 | 64.50 | 74.10 | 46.60 | 85.80 | 71.90 | 2022 | 67.90 | | |
| FastDraw [29] | | 85.90 | 63.60 | 57.00 | 69.90 | 40.60 | 79.40 | 65.20 | 7013 | 57.80 | 90.3 | |
| ENet-SAD [12] | 70.80 | 90.10 | 68.80 | 60.20 | 65.90 | 41.60 | 84.00 | 65.70 | 1998 | 66.00 | 75 | 3.9 |
| UFAST-ResNet34 [30] | 72.30 | 90.70 | 70.20 | 59.50 | 69.30 | 44.40 | 85.70 | 69.50 | 2037 | 66.70 | 175.0 | |
| UFAST-ResNet18 [30] | 68.40 | 87.70 | 66.00 | 58.40 | 62.80 | 40.20 | 81.00 | 57.90 | 1743 | 62.10 | 322.5 | |
| ERFNet-IntRA-KD [11] | 72.40 | | | | | | | | | | 100.0 | |
| CurveLanes-NAS-S [39] | 71.40 | 88.30 | 68.60 | 63.20 | 68.00 | 47.90 | 82.50 | 66.00 | 2817 | 66.20 | | 9.0 |
| CurveLanes-NAS-M [39] | 73.50 | 90.20 | 70.50 | 65.90 | 69.30 | 48.80 | 85.70 | 67.50 | 2359 | 68.20 | | 35.7 |
| CurveLanes-NAS-L [39] | 74.80 | 90.70 | 72.30 | 67.70 | 70.10 | 49.40 | 85.80 | 68.40 | 1746 | 68.90 | | 86.5 |
| LaneATT-Small [32] | 75.13 | 91.17 | 72.71 | 65.82 | 68.03 | 49.13 | 87.82 | 63.75 | 1020 | 68.58 | 250 | 9.3 |
| LaneATT-Medium [32] | 76.68 | 92.14 | 75.03 | 66.47 | 78.15 | 49.39 | 88.38 | 67.72 | 1330 | 70.72 | 171 | 18.0 |
| LaneATT-Large [32] | 77.02 | 91.74 | 76.16 | 69.47 | 76.31 | 50.46 | 86.29 | 64.05 | 1264 | 70.81 | 26 | 70.5 |
| CondLaneNet-Small | 78.14 | 92.87 | 75.79 | 70.72 | 80.01 | 52.39 | 89.37 | 72.40 | 1364 | 73.23 | 220 | 10.2 |
| CondLaneNet-Medium | 78.74 | 93.38 | 77.14 | 71.17 | 79.93 | 51.85 | 89.89 | 73.88 | 1387 | 73.92 | 152 | 19.6 |
| CondLaneNet-Large | 79.48 | 93.47 | 77.44 | 70.93 | 80.91 | 54.13 | 90.16 | 75.21 | 1201 | 74.80 | 58 | 44.8 |

Table 4. Comparison of different methods on CULane. For the Cross category, the number of false positives is reported.

TuSimple: The results on TuSimple are shown in Table 5. Relatively, the gap between different methods on this dataset is smaller, due to the smaller amount of data and the more uniform scenes. Our method achieves a new state-of-the-art F1 score of 97.24. Besides, the small version of our method gets a 97.01 F1 score at 220 FPS.
| Method | F1 | Acc | FP | FN | FPS | GFLOPS |
|---|---|---|---|---|---|---|
| SCNN [28] | 95.97 | 96.53 | 6.17 | 1.80 | 7.5 | |
| EL-GAN [6] | 96.26 | 94.90 | 4.12 | 3.36 | 10.0 | |
| PINet [19] | 97.21 | 96.70 | 2.94 | 2.63 | | |
| LineCNN [22] | 96.79 | 96.87 | 4.42 | 1.97 | 30.0 | |
| PointLaneNet [2] | 95.07 | 96.34 | 4.67 | 5.18 | 71.0 | |
| ENet-SAD [12] | 95.92 | 96.64 | 6.02 | 2.05 | 75.0 | |
| ERF-E2E [41] | 96.25 | 96.02 | 3.21 | 4.28 | | |
| FastDraw [29] | 93.92 | 95.20 | 7.60 | 4.50 | 90.3 | |
| UFAST-ResNet34 [30] | 88.02 | 95.86 | 18.91 | 3.75 | 169.5 | |
| UFAST-ResNet18 [30] | 87.87 | 95.82 | 19.05 | 3.92 | 312.5 | |
| PolyLaneNet [31] | 90.62 | 93.36 | 9.42 | 9.33 | 115.0 | 0.9 |
| LSTR [25] | 96.86 | 96.18 | 2.91 | 3.38 | 420 | 0.3 |
| LaneATT-ResNet18 [32] | 96.71 | 95.57 | 3.56 | 3.01 | 250 | 9.3 |
| LaneATT-ResNet34 [32] | 96.77 | 95.63 | 3.53 | 2.92 | 171 | 18.0 |
| LaneATT-ResNet122 [32] | 96.06 | 96.10 | 5.64 | 2.17 | 26 | 70.5 |
| CondLaneNet-S | 97.01 | 95.48 | 2.18 | 3.80 | 220 | 10.2 |
| CondLaneNet-M | 96.98 | 95.37 | 2.20 | 3.82 | 154 | 19.6 |
| CondLaneNet-L | 97.24 | 96.54 | 2.01 | 3.50 | 58 | 44.8 |

Table 5. Comparison of different methods on TuSimple.

4.3. Ablation Study of Improvement Strategies

We performed ablation experiments on the CurveLanes dataset based on the small version of our CondLaneNet. The results are shown in Table 6. We take the lane detection model based on the original conditional instance segmentation strategy [35, 38] (as shown in Figure 3a) as the baseline. The first row shows the results of the baseline. In the second row, the proposed conditional lane detection strategy is applied and the lane mask expression is replaced by the row-wise formulation (as shown in Figure 3b). In the third row, the offset map for post-refinement is added. In the fourth row, the transformer encoder is added and the offset map is removed. The fifth row presents the result of the model with the row-wise formulation, the offset map, and the transformer encoder. In the last row, RIM is added.

| Baseline | Row-wise | Offset | Encoder | RIM | F1 score |
|---|---|---|---|---|---|
| ✓ | | | | | 72.19 |
| ✓ | ✓ | | | | 80.09 (+7.90) |
| ✓ | ✓ | ✓ | | | 81.24 (+9.05) |
| ✓ | ✓ | | ✓ | | 81.85 (+9.66) |
| ✓ | ✓ | ✓ | ✓ | | 83.41 (+11.22) |
| ✓ | ✓ | ✓ | ✓ | ✓ | 85.09 (+12.90) |

Table 6. Ablation study of the improvement strategies on CurveLanes based on the small version of our CondLaneNet.

Comparing the first two rows, we can see that the proposed conditional lane detection strategy significantly improves the performance. Comparing the results of the 2nd and 3rd rows, and of the 4th and 5th rows, we can see the positive effect of the offset map. Moreover, the transformer encoder plays a vital role in our framework, as indicated by comparing the 2nd and 4th rows, and the 3rd and 5th rows. Besides, RIM, designed for the fork lines and dense lines, also improves the accuracy.

4.4. Ablation Study of the Transformer Encoder

This section further analyzes the function of the transformer encoder, which played a vital role in the previous experiments. Our method first detects instances by detecting the proposal points and then predicts the shape for each instance, so the accuracy of the proposal points greatly affects the final accuracy of the lane lines. We design different control groups to compare the accuracy of the proposal points and lane lines on CurveLanes. We define the proposal points located in the eight-neighborhood of the ground-truth points as true-positive samples. Considering the function of RIM, a proposal point corresponding to multiple lines is regarded as multiple different proposal points. We report the F1 score of the proposal points and the lane lines, as shown in Table 7.

| Model | Small P. point | Small Line | Medium P. point | Medium Line | Large P. point | Large Line |
|---|---|---|---|---|---|---|
| Standard | 88.35 | 85.09 | 88.99 | 85.92 | 89.54 | 86.10 |
| S. w/o encoder | 85.51 | 82.97 | 88.68 | 85.91 | 89.33 | 85.98 |
| Hacked | 88.05 | 84.39 | 88.90 | 85.93 | 89.37 | 85.99 |

Table 7. Ablation study of the transformer encoder module on CurveLanes.

The first row shows the results of the small, medium and large versions of the standard CondLaneNet. In the second row, the transformer encoder is removed. In the third row, we hack the inference process of the second row by replacing its proposal heatmap with the proposal heatmap output by the standard model (the first row). For the small version, removing the encoder leads to a significant drop for both proposal points and lanes. However, when using the proposal heatmap of the standard model, the results in the third row are close to the first row.

The above results show that the function of the encoder is mainly to improve the detection of the proposal points, which relies on contextual features and global information. Besides, the contextual features can be more fully refined in deeper networks. Therefore, for the medium and large versions, the improvement from the encoder is far smaller than for the small version.

5. Conclusion

In this work, we proposed CondLaneNet, a novel top-to-down lane detection framework that detects the lane instances first and then instance-wisely predicts their shapes. Aiming to resolve the instance-level discrimination problem, we proposed the conditional lane detection strategy based on conditional convolution and row-wise formulation. Moreover, we designed RIM to cope with complex lane line topologies such as dense lines and fork lines. Our CondLaneNet framework refreshed the state-of-the-art performance on CULane, CurveLanes, and TuSimple. Moreover, on CULane and CurveLanes, the small version of our CondLaneNet not only surpassed other methods in accuracy, but also delivered real-time efficiency.
References

[1] Amol Borkar, Monson Hayes, and Mark T Smith. Robust lane detection and tracking with ransac and kalman filter. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 3261–3264, 2009.

[2] Zhenpeng Chen, Qianfei Liu, and Chenfan Lian. Pointlanenet: Efficient end-to-end cnns for accurate real-time lane detection. In IEEE Intelligent Vehicles Symposium (IV), pages 2563–2568, 2019.

[3] Shriyash Chougule, Nora Koznek, Asad Ismail, Ganesh Adam, Vikram Narayan, and Matthias Schulze. Reliable multilane detection and classification by utilizing cnn as a regression network. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

[4] Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551, 2017.

[5] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6569–6578, 2019.

[6] Mohsen Ghafoorian, Cedric Nugteren, Nóra Baka, Olaf Booij, and Michael Hofmann. El-gan: Embedding loss driven generative adversarial networks for lane detection. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[9] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[10] Namdar Homayounfar, Wei-Chiu Ma, Justin Liang, Xinyu Wu, Jack Fan, and Raquel Urtasun. Dagmapper: Learning to map by discovering lane topology. In Proceedings of the IEEE International Conference on Computer Vision, pages 2911–2920, 2019.

[11] Yuenan Hou, Zheng Ma, Chunxiao Liu, Tak-Wai Hui, and Chen Change Loy. Inter-region affinity distillation for road marking segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12486–12495, 2020.

[12] Yuenan Hou, Zheng Ma, Chunxiao Liu, and Chen Change Loy. Learning lightweight lane detection cnns by self attention distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1013–1021, 2019.

[13] Junhwa Hur, Seung-Nam Kang, and Seung-Woo Seo. Multi-lane detection in urban driving environments using conditional random fields. In IEEE Intelligent Vehicles Symposium (IV), pages 1297–1302, 2013.

[14] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, pages 667–675, 2016.

[15] Ruyi Jiang, Reinhard Klette, Tobi Vaudrey, and Shigang Wang. New lane model and distance transform for lane detection and tracking. In Proceedings of the International Conference on Computer Analysis of Images and Patterns, pages 1044–1052, 2009.

[16] Yan Jiang, Feng Gao, and Guoyan Xu. Computer vision-based multiple-lane detection on straight road and in a curve. In Proceedings of the International Conference on Image Analysis and Signal Processing, pages 114–117, 2010.

[17] ZuWhan Kim. Robust lane detection and tracking in challenging scenarios. IEEE Transactions on Intelligent Transportation Systems, 9(1):16–26, 2008.

[18] Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[19] Yeongmin Ko, Jiwon Jun, Donghwuy Ko, and Moongu Jeon. Key points estimation and point instance segmentation approach for lane detection. arXiv preprint arXiv:2002.06604, 2020.

[20] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.

[21] Seokju Lee, Junsik Kim, Jae Shin Yoon, Seunghak Shin, Oleksandr Bailo, Namil Kim, Tae-Hee Lee, Hyun Seok Hong, Seung-Hoon Han, and In So Kweon. Vpgnet: Vanishing point guided network for lane and road marking detection and recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1947–1955, 2017.

[22] Xiang Li, Jun Li, Xiaolin Hu, and Jian Yang. Line-cnn: End-to-end traffic line detection with line proposal unit. IEEE Transactions on Intelligent Transportation Systems, 21(1):248–258, 2019.

[23] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[24] Guoliang Liu, Florentin Wörgötter, and Irene Markelić. Combining statistical hough transform and particle filter for robust lane detection and tracking. In IEEE Intelligent Vehicles Symposium (IV), pages 993–997, 2010.

[25] Ruijin Liu, Zejian Yuan, Tie Liu, and Zhiliang Xiong. End-to-end lane shape prediction with transformers. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pages 3694–3702, 2021.

[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[27] Davy Neven, Bert De Brabandere, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Towards end-to-end lane detection: an instance segmentation approach. In IEEE Intelligent Vehicles Symposium (IV), pages 286–291, 2018.

[28] Xingang Pan, Jianping Shi, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Spatial as deep: Spatial cnn for traffic scene understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

[29] Jonah Philion. Fastdraw: Addressing the long tail of lane detection by adapting a sequential prediction network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11582–11591, 2019.

[30] Zequn Qin, Huanyu Wang, and Xi Li. Ultra fast structure-aware deep lane detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 276–291, 2020.

[31] Lucas Tabelini, Rodrigo Berriel, Thiago M Paixao, Claudine Badue, Alberto F De Souza, and Thiago Oliveira-Santos. Polylanenet: Lane estimation via deep polynomial regression. In Proceedings of the International Conference on Pattern Recognition, 2020.

[32] Lucas Tabelini, Rodrigo Berriel, Thiago M Paixão, Claudine Badue, Alberto F De Souza, and Thiago Olivera-Santos. Keep your eyes on the lane: Attention-guided lane detection. arXiv preprint arXiv:2010.12035, 2020.

[33] Huachun Tan, Yang Zhou, Yong Zhu, Danya Yao, and Keqiang Li. A novel curve lane detection based on improved river flow and ransa. In Proceedings of the International IEEE Conference on Intelligent Transportation Systems, pages 133–138, 2014.

[34] Jigang Tang, Songbin Li, and Peng Liu. A review of lane detection methods based on deep learning. Pattern Recognition, 2020.

[35] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.

[36] TuSimple. Tusimple lane detection benchmark, 2017. https://github.com/TuSimple/tusimple-benchmark.

[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[38] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic and fast instance segmentation. In Advances in Neural Information Processing Systems, pages 17721–17732, 2020.

[39] Hang Xu, Shaoju Wang, Xinyue Cai, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Curvelane-nas: Unifying lane-sensitive architecture search and adaptive point blending. In Proceedings of the European Conference on Computer Vision (ECCV), pages 689–704, 2020.

[40] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. Condconv: Conditionally parameterized convolutions for efficient inference. In Advances in Neural Information Processing Systems, 2019.

[41] Seungwoo Yoo, Hee Seok Lee, Heesoo Myeong, Sungrack Yun, Hyoungwoo Park, Janghoon Cho, and Duck Hoon Kim. End-to-end lane marker detection via row-wise classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1006–1007, 2020.

[42] Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access, 8:58443–58469, 2020.

[43] Shengyan Zhou, Yanhua Jiang, Junqiang Xi, Jianwei Gong, Guangming Xiong, and Huiyan Chen. A novel lane detection based on geometrical model and gabor filter. In IEEE Intelligent Vehicles Symposium (IV), pages 59–64, 2010.
