Keep Your Eyes On The Lane: Real-Time Attention-Guided Lane Detection
• A model that enables faster training and inference times than most other models (reaching 250 FPS and almost an order of magnitude fewer multiply-accumulate operations (MACs) than the previous state-of-the-art);

• A novel anchor-based attention mechanism for lane detection which is potentially useful in other domains where the objects being detected are correlated.

2. Related work

Although the first lane detection approaches relied on classical computer vision, substantial progress in accuracy and efficiency has been achieved with recent deep learning methods. Thus, this literature review focuses on deep lane detectors. This section first discusses the dominant approaches, which are based on segmentation [17, 11, 29, 15] or row-wise classification [10, 20, 27], and subsequently reviews solutions in other directions. Finally, the lack of reproducibility (a common issue in lane detection works) is discussed.

Segmentation-based methods. In this approach, predictions are made on a per-pixel basis, classifying each pixel as either lane or background. With the segmentation map generated, a post-processing step is necessary to decode it into a set of lanes. In SCNN [17], the authors propose a scheme specifically designed for long thin structures and show its effectiveness in lane detection. However, the method is slow (7.5 FPS), which hinders its applicability in real-world cases. Since larger backbones are one of the main culprits for slower speeds, the authors of [11] propose a self-attention distillation (SAD) module to aggregate contextual information. The module allows the use of a more lightweight backbone, achieving high performance while maintaining real-time efficiency. In CurveLanes-NAS [26], the authors propose the use of neural architecture search (NAS) to find a better backbone. Although they achieved state-of-the-art results, their NAS is extremely expensive computationally, requiring 5,000 GPU hours per dataset.

Row-wise classification methods. The row-wise classification approach is a simple way to detect lanes based on a grid division of the input image. For each row, the model predicts the most probable cell to contain a part of a lane marking. Since only one cell is selected on each row, this process is repeated for each possible lane in an image. Similar to segmentation methods, it also requires a post-processing step to construct the set of lanes. The method was first introduced in E2E-LMD [27], achieving state-of-the-art results on two datasets. In [20], the authors show that it is capable of reaching high speed, although some accuracy is lost. This approach is also used in IntRA-KD [10].

Other approaches. In FastDraw [18], the author proposes a novel learning-based approach to decode the lane structures, which avoids the need for the clustering post-processing steps required in segmentation and row-wise classification methods. Although the proposed method is shown to achieve high speeds, it does not perform better than existing state-of-the-art methods in terms of accuracy. The same effect is seen in PolyLaneNet [23], where an even faster model, based on deep polynomial regression, is proposed. In that approach, the model learns to output a polynomial for each lane. Despite its speed, the model struggles with the imbalanced nature of lane detection datasets, as evidenced by the high bias towards straight lanes in its predictions. In Line-CNN [13], an anchor-based method for lane detection is presented. This model achieves state-of-the-art results on a public dataset and promising results on another that is not publicly available. Despite its real-time efficiency, the model is considerably slower than other approaches. Moreover, the code is not public, which makes the results difficult to reproduce. There are also works addressing other parts of the pipeline of a lane detector. In [12], a post-processing method with a focus on occlusion cases is proposed, achieving results considerably higher than other works, but at the cost of notably low speeds (around 4 FPS).

Reproducibility. As noted in [23], many of the cited works do not publish the code to reproduce the reported results [13, 18, 27], or, in some cases, the code is only partially public [11, 10]. This hinders deeper qualitative and quantitative comparisons. For instance, the two most common metrics to measure a model's efficiency are multiply-accumulate operations (MACs) and frames-per-second (FPS). While the first does not depend on the benchmark platform, it is not always a good proxy for the second, which is the true goal. Therefore, FPS comparisons are also hindered by the lack of source code.

Unlike most of the previously proposed methods, which achieve high speeds at the cost of accuracy, we propose a method that is both faster and more accurate than the existing state-of-the-art ones. In addition, the full code to reproduce the reported results is published for the community.

3. Proposed method

LaneATT is an anchor-based single-stage model (like YOLOv3 [21] or SSD [16]) for lane detection. An overview of the method is shown in Figure 1. It receives as input RGB images I ∈ R^{3×H_I×W_I} taken from a front-facing camera mounted in a vehicle. The outputs are lane boundary lines (hereafter called lanes, following the usual terminology in the literature). To generate those outputs, a convolutional neural network (CNN), referred to as the backbone, generates a feature map that is then pooled to extract each anchor's features. Those features are combined with a set of global features produced by an attention module. By combining local and global features, the model can use information from other lanes more easily, which might be necessary in cases with conditions such as occlusion or no visible lane markings. Finally, the combined features are passed to fully-connected layers to predict the final output lanes.
[Figure 1 diagram: the backbone produces feature maps; each anchor i (origin O, direction θ) is projected onto the image plane for feature pooling; a fully-connected layer L_att followed by a softmax yields weights w_{i,0}, ..., w_{i,N_anc−1} over the other anchors' local features a^{loc}; the pooled and attention features are concatenated and fed to fully-connected heads producing p_i = p_0 ... p_K (L_cls) and r_i = l, x_0, ..., x_{N_pts−1} (L_reg), for each anchor i. Legend: + addition, × multiplication, ⊕ concatenation.]
Figure 1. Overview of the proposed method. A backbone generates feature maps from an input image. Subsequently, each anchor is projected
onto the feature maps. This projection is used to pool features that are concatenated with another set of features created in the attention
module. Finally, using this resulting feature set, two layers, one for classification and another for regression, make the final predictions.
3.1. Lane and anchor representation

A lane is represented by 2D points with equally-spaced y-coordinates Y = {y_i}, i = 0, ..., N_pts − 1, where y_i = i · H_I / (N_pts − 1). Since Y is fixed, a lane can then be defined only by its x-coordinates X = {x_i}, i = 0, ..., N_pts − 1, each x_i associated with the respective y_i ∈ Y. Since most lanes do not cross the whole image vertically, a start-index s and an end-index e are used to define the valid contiguous sequence of X.

Like Line-CNN [13], our method performs anchor-based detection using lines instead of boxes, which means that lane proposals are made having these lines as references. An anchor is a "virtual" line in the image plane defined by (i) an origin point O = (x_orig, y_orig) (with y_orig ∈ Y) located on one of the borders of the image (except the top border) and (ii) a direction θ. The proposed method uses the same set of anchors as [13]. This lane and anchor representation covers the vast majority of real-world lanes.

3.2. Backbone

The first stage of the proposed method is feature extraction, which can be performed by any generic CNN, such as a ResNet [9]. The output of this stage is a feature map F_back ∈ R^{C′_F×H_F×W_F} from which the features for each anchor will be extracted through a pooling process, as described in the next section. For dimensionality reduction, a 1 × 1 convolution is applied onto F_back, generating a channel-wise reduced feature map F ∈ R^{C_F×H_F×W_F}. This reduction is performed to lower the computational cost.

3.3. Anchor-based feature pooling

An anchor defines the points of F that will be used for the respective proposal. Since the anchors are modeled as lines, the points of interest for a given anchor are those that intercept the anchor's virtual line (considering the rasterized line reduced to the feature map's dimensions). For every y_j = 0, 1, 2, ..., H_F − 1, there will be a single corresponding x-coordinate,

    x_j = (1 / tan θ) · (y_j − y_orig / δ_back) + x_orig / δ_back,    (1)

where (x_orig, y_orig) and θ are, respectively, the origin point and slope of the anchor's line, and δ_back is the backbone's global stride. Thus, every anchor i will have its corresponding feature vector a^{loc}_i ∈ R^{C_F·H_F} (column-vector notation) pooled from F that carries local feature information (local features). In cases where a part of the anchor is outside the boundaries of F, a^{loc}_i is zero-padded.

Notice that the pooling operation is similar to Fast R-CNN's [8] region-of-interest projection (RoI projection); however, instead of using the proposal for pooling, a single-stage detector is achieved by using the anchor itself. Additionally, the RoI pooling layer (used to generate fixed-size features) is not necessary in our method. Compared to Line-CNN [13], which leverages only the feature maps' borders, our method can potentially explore the whole feature map, which enables the use of more lightweight backbones with smaller receptive fields.
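To make this pooling step concrete, below is a minimal NumPy sketch of Equation (1) and of the gathering it drives. It is only an illustration under assumed names and conventions (the function name, argument layout, and rounding of x_j are not taken from the official implementation):

```python
import numpy as np

def pool_anchor_features(F, x_orig, y_orig, theta, stride):
    """Gather one feature column of F per row along an anchor line (Eq. 1).

    F       : reduced feature map, shape (C_F, H_F, W_F)
    x_orig  : anchor origin x (image coordinates)
    y_orig  : anchor origin y (image coordinates)
    theta   : anchor direction (radians)
    stride  : backbone's global stride (delta_back)
    Returns the flattened local feature vector a_i^loc of length C_F * H_F.
    """
    C_F, H_F, W_F = F.shape
    a_loc = np.zeros((C_F, H_F), dtype=F.dtype)
    for y_j in range(H_F):
        # Equation (1): x-index of the (rasterized) anchor line at row y_j of F.
        x_j = (y_j - y_orig / stride) / np.tan(theta) + x_orig / stride
        x_j = int(round(x_j))
        if 0 <= x_j < W_F:          # points falling outside F keep their zero padding
            a_loc[:, y_j] = F[:, y_j, x_j]
    return a_loc.reshape(-1)
```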
3.4. Attention mechanism

Depending on the model architecture, the information carried by the pooled feature vector ends up being mostly local. This is particularly the case for shallower and faster models, which tend to exploit backbones with smaller receptive fields. However, in some cases (such as the ones with occlusion), the local information may not be enough to predict a lane's existence and its position. To address this problem, we propose an attention mechanism that acts on the local features (a^{loc}_•) to produce additional features (a^{glob}_•) that aggregate global information.

The attention mechanism is composed of a fully-connected layer L_att which processes a local feature vector a^{loc}_i and outputs a probability (weight) w_{i,j} for every anchor j, j ≠ i. Formally,

    w_{i,j} = softmax(L_att(a^{loc}_i))_j,        if j < i,
    w_{i,j} = 0,                                  if j = i,        (2)
    w_{i,j} = softmax(L_att(a^{loc}_i))_{j−1},    if j > i.

Afterwards, those weights are combined with the local features to produce a global feature vector of the same dimension:

    a^{glob}_i = Σ_j w_{i,j} · a^{loc}_j.    (3)

Naturally, the whole process can be implemented efficiently with matrix multiplication, since the same procedure is executed for all anchors. Let N_anc be the number of anchors, let A^{loc} = [a^{loc}_0, ..., a^{loc}_{N_anc−1}]^T be the matrix containing the local feature vectors (as rows), and let W = [w_{i,j}]_{N_anc×N_anc} be the weight matrix, with w_{i,j} defined in Equation (2). Thus, the global features can be computed as:

    A^{glob} = W · A^{loc}.    (4)

Notice that A^{glob} and A^{loc} have the same dimensions, i.e., A^{glob} ∈ R^{N_anc × C_F·H_F}.
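As a concrete illustration of Equations (2)-(4), the following PyTorch-style sketch builds the weight matrix W with a zero diagonal and multiplies it by A^{loc}. Module and variable names are illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorAttention(nn.Module):
    """Sketch of the anchor-based attention of Eqs. (2)-(4) (illustrative names)."""

    def __init__(self, feat_dim, n_anchors):
        super().__init__()
        self.n_anchors = n_anchors
        # L_att: one logit per *other* anchor, computed from a local feature vector.
        self.l_att = nn.Linear(feat_dim, n_anchors - 1)

    def forward(self, a_loc):
        # a_loc: (n_anchors, feat_dim) matrix A^loc, one local feature vector per row.
        n = self.n_anchors
        scores = F.softmax(self.l_att(a_loc), dim=1)         # (n, n - 1), Eq. (2)
        # Re-insert a zero at column j = i so W is (n, n) with a zero diagonal.
        W = a_loc.new_zeros(n, n)
        idx = torch.arange(n, device=a_loc.device)
        off_diag = idx.unsqueeze(0) != idx.unsqueeze(1)      # True where j != i
        W[off_diag] = scores.reshape(-1)
        # Eq. (4): global features as weighted combinations of the local ones.
        return W @ a_loc                                     # A^glob, (n, feat_dim)
```

Re-inserting the zero at position j = i reproduces the j = i case of Equation (2), so no anchor attends to itself.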
3.5. Proposal prediction

A lane proposal is predicted for each anchor and consists of three main components: (i) K + 1 probabilities (K lane types and one class for "background", i.e., an invalid proposal), (ii) N_pts offsets (the horizontal distances between the prediction and the anchor's line), and (iii) the length l of the proposal (the number of valid offsets). The start-index s of the proposal is directly determined by the y-coordinate of the anchor's origin (y_orig, see Section 3.1). Thus, the end-index can be determined as e = s + ⌊l⌉ − 1.

To generate the final proposals, local and global information are aggregated by concatenating a^{loc}_i and a^{glob}_i, producing an augmented feature vector a^{aug}_i ∈ R^{2·C_F·H_F}. This augmented vector is fed to two parallel fully-connected layers, one for classification (L_cls) and one for regression (L_reg), which produce the final proposals. L_cls predicts p_i = {p_0, ..., p_K} (item i) and L_reg predicts r_i = {l, x_0, ..., x_{N_pts−1}} (items ii and iii).
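To make the proposal format concrete, here is a small sketch of how one regression output could be decoded into 2D lane points, assuming the representation of Section 3.1 (all names and the rounding of l are illustrative assumptions):

```python
import numpy as np

def decode_proposal(anchor_xs, offsets, length, s, img_h, n_pts):
    """Turn one regression output r_i into (x, y) lane points.

    anchor_xs : x of the anchor line at each of the n_pts fixed y-coordinates
    offsets   : predicted horizontal offsets (item ii), length n_pts
    length    : predicted number of valid offsets l (item iii)
    s         : start-index, given by the y-coordinate of the anchor origin
    """
    ys = np.arange(n_pts) * img_h / (n_pts - 1)       # fixed, equally-spaced Y
    xs = np.asarray(anchor_xs) + np.asarray(offsets)  # prediction = anchor + offset
    e = s + int(round(length)) - 1                    # end-index
    e = min(e, n_pts - 1)                             # clamp to the valid range
    return np.stack([xs[s:e + 1], ys[s:e + 1]], axis=1)
```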
3.6. Non-maximum Suppression (NMS)

As usual in anchor-based deep detection, NMS is paramount to reduce the number of false positives. In the proposed method, this procedure is applied both in the training and in the testing phases, based on the lane distance metric proposed in [13]. The distance between two lanes X_a = {x^a_i} and X_b = {x^b_i} is computed over their common valid indices (or y-coordinates). Let s′ = max(s_a, s_b) and e′ = min(e_a, e_b) define the range of those common indices. Thus, the lane distance metric is defined as

    D(X_a, X_b) = (1 / (e′ − s′ + 1)) · Σ_{i=s′..e′} |x^a_i − x^b_i|,   if e′ ≥ s′,
    D(X_a, X_b) = +∞,                                                   otherwise.    (5)
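A NumPy sketch of this distance and of a greedy NMS built on it follows. It is a simplified illustration (data layout, names, and the greedy loop are assumptions; the official implementation may differ, e.g., by running the NMS on the GPU):

```python
import numpy as np

def lane_distance(xs_a, s_a, e_a, xs_b, s_b, e_b):
    """Lane distance of Eq. (5): mean |x_a - x_b| over the common valid indices."""
    s, e = max(s_a, s_b), min(e_a, e_b)
    if e < s:                                   # no common y-range
        return float('inf')
    return float(np.mean(np.abs(np.asarray(xs_a[s:e + 1]) - np.asarray(xs_b[s:e + 1]))))

def lane_nms(proposals, scores, threshold):
    """Greedy NMS: keep higher-scored lanes, drop those closer than `threshold`.

    proposals: list of (xs, s, e) tuples; scores: one confidence per proposal.
    """
    order = np.argsort(scores)[::-1]            # best-scored proposals first
    keep = []
    for i in order:
        xs_i, s_i, e_i = proposals[i]
        if all(lane_distance(xs_i, s_i, e_i, *proposals[j]) >= threshold for j in keep):
            keep.append(i)
    return keep
```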
3.7. Model training

During training, the distance metric in Equation (5) is also used to define the positive and the negative anchors. First, the metric is used to measure the distance between every anchor (among those not filtered by the NMS) and the ground-truth lanes. Subsequently, the anchors with a distance (Eq. 5) lower than a threshold τ_p are considered positives, while those with a distance greater than τ_n are considered negatives. Anchors (and their associated proposals) with a distance between those thresholds are disregarded. The remaining N_p&n anchors are used in a multi-task loss defined as

    L({p_i, r_i}_{i=0..N_p&n−1}) = λ · Σ_i L_cls(p_i, p*_i) + Σ_i L_reg(r_i, r*_i),    (6)

where p_i and r_i are the classification and regression outputs for anchor i, whereas p*_i and r*_i are the classification and regression targets for anchor i. The regression loss is computed only with the length l and the x-coordinate values corresponding to indices common to both the proposal and the ground-truth. The common indices (between s′ and e′) of the x-coordinates are selected similarly to the lane distance (Equation (5)), but with e′ = e_gt instead of e′ = min(e_prop, e_gt), where e_prop and e_gt are the end-indexes of the proposal and of its associated ground-truth, respectively. If the end-index predicted in the proposal, e_prop, is used, the training may become unstable by converging to degenerate solutions (e.g., e_prop might converge to zero). The functions L_cls and L_reg are the Focal Loss [14] and the Smooth L1, respectively. If anchor i is considered negative, its corresponding L_reg is equal to 0. The factor λ is used to balance the loss components.
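The following PyTorch-style sketch condenses Equation (6). The λ value, the tensor shapes, and the focal-loss formulation are assumptions, and the per-index selection of regression targets described above is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """A standard multi-class focal loss (gamma is an illustrative choice)."""
    ce = F.cross_entropy(logits, targets, reduction='none')  # -log p_t
    p_t = torch.exp(-ce)
    return ((1.0 - p_t) ** gamma * ce).mean()

def multitask_loss(cls_logits, cls_targets, reg_preds, reg_targets,
                   positive_mask, lam=1.0):
    """Sketch of Eq. (6): lambda * L_cls + L_reg over the kept anchors.

    cls_logits   : (N, K + 1) classification outputs
    cls_targets  : (N,) target class indices
    reg_preds    : (N, 1 + N_pts) predicted length l and x-coordinates
    reg_targets  : (N, 1 + N_pts) regression targets (meaningful for positives only)
    positive_mask: (N,) boolean, True for positive anchors
    """
    cls_loss = focal_loss(cls_logits, cls_targets)
    if positive_mask.any():
        # Smooth L1 only on positive anchors; negatives contribute 0 to L_reg.
        reg_loss = F.smooth_l1_loss(reg_preds[positive_mask], reg_targets[positive_mask])
    else:
        reg_loss = cls_logits.sum() * 0.0
    return lam * cls_loss + reg_loss
```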
Efficiency metrics. Two efficiency-related metrics are reported: frames-per-second (FPS) and multiply-accumulate operations (MACs). One MAC corresponds to approximately two floating-point operations (FLOPs). The FPS is computed using a single image per batch and constant inputs, so the metric does not depend on I/O operations but only on the model's efficiency.
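A minimal sketch of how such an FPS measurement could be implemented, assuming a PyTorch model on a GPU (the input shape, warm-up, and iteration counts are illustrative):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 360, 640), n_warmup=50, n_runs=300):
    """Estimate FPS with a constant input and batch size 1, excluding I/O."""
    device = next(model.parameters()).device
    x = torch.randn(*input_shape, device=device)  # constant, pre-loaded input
    for _ in range(n_warmup):                     # warm-up runs are discarded
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()                      # wait for all queued kernels
    return n_runs / (time.perf_counter() - start)
```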
[Figure 3 plots: F1 versus latency (ms, log scale) on TuSimple (left) and CULane (right), with the LaneATT (ResNet-122) point highlighted. Compared methods: LaneATT, [28], [20], EL-GAN [7], SCNN [17], ENet-SAD [11], FastDraw [18], PolyLaneNet [23], Line-CNN [13], Cascaded-CNN [19], PointLaneNet [5], and ERFNet-IntRA-KD [10].]
Method                      F1 (%)  Acc (%)  FDR (%)  FNR (%)  FPS    MACs (G)
Source-code unavailable
EL-GAN [7]                  96.26   94.90    4.12     3.36     10.0   –
Line-CNN [13]               96.79   96.87    4.42     1.97     30.0   –
FastDraw (ResNet-18) [18]   94.59   94.90    6.10     4.70     –      –
PointLaneNet [5]            95.07   96.34    4.67     5.18     71.0   –
[25]                        –       95.80    –        –        71.5   –
R-18-E2E [27]               96.40   96.04    3.11     4.09     –      –
R-34-E2E [27]               96.58   96.22    3.08     3.76     –      –
Source-code available
SCNN [17]                   95.97   96.53    6.17     1.80     7.5    –
Cascaded-CNN [19]           90.82   95.24    11.97    6.20     60.0   –
ENet-SAD [11]               95.92   96.64    6.02     2.05     75.0   –
[20] (ResNet-18)            87.87   95.82    19.05    3.92     312.5  –
[20] (ResNet-34)            88.02   95.86    18.91    3.75     169.5  –
PolyLaneNet [23]            90.62   93.36    9.42     9.33     115.0  1.7
LaneATT (ResNet-18)         96.71   95.57    3.56     3.01     250    9.3
LaneATT (ResNet-34)         96.77   95.63    3.53     2.92     171    18.0
LaneATT (ResNet-122)        96.06   96.10    5.64     2.17     26     70.5
Table 2. State-of-the-art results on TuSimple. For a fairer comparison, the FPS of the fastest method ([20]) was measured on the same machine
and conditions as our method. Additionally, all metrics for this method were computed using the official source code, since only the accuracy
was available in the paper. The best and second-best results across methods with source-code available are in bold and underlined, respectively.
Results. On TuSimple (Table 2), LaneATT's performance is on par with other state-of-the-art methods. However, it is also clear that the results on this dataset are already saturated (high values), probably because its scenes are not complex and the metric is permissive [23]. This is evidenced by the small difference in performance across methods, in contrast to the results on more complex datasets and less permissive metrics (as shown in Section 4.2). Nonetheless, our method is much faster than the others. The method proposed in [20] is the only one with a speed comparable to ours. Since the FDR and FNR metrics were not reported in their work, we computed them using the published code. Although they achieved high accuracy, their FDR is notably high. For instance, our highest FDR is 5.64%, using the ResNet-122, whereas their lowest is 18.91%, almost four times higher.

4.2. CULane

Dataset description. CULane [17] is one of the largest publicly available lane detection datasets, and also one of the most complex. All the images have 1640 × 590 pixels, and the test images are divided into nine categories, such as crowded, night, absence of visible lines, etc.
Method Total Normal Crowded Dazzle Shadow No line Arrow Curve Cross Night FPS MACs (G)
Source-code unavailable
[28] 73.10 89.70 76.50 67.40 65.50 35.10 82.20 63.20 68.70 24.0
FastDraw (ResNet-50) [18] 85.90 63.60 57.00 59.90 40.60 79.40 65.20 7013 57.80 90.3
PointLaneNet [5] 90.10 71.0
SpinNet [6] 74.20 90.50 71.70 62.00 72.90 43.20 85.00 50.70 68.10
R-18-E2E [27] 70.80 90.00 69.70 60.20 62.50 43.20 83.20 70.30 2296 63.30
R-34-E2E [27] 71.50 90.40 69.90 61.50 68.10 45.00 83.70 69.80 2077 63.20
R-101-E2E [27] 71.90 90.10 71.20 60.90 68.10 44.90 84.30 70.20 2333 65.20
ERFNet-E2E [27] 74.00 91.00 73.10 64.50 74.10 46.60 85.80 71.90 2022 67.90
Source-code available
SCNN [17] 71.60 90.60 69.70 58.50 66.90 43.40 84.10 64.40 1990 66.10 7.5
ENet-SAD [11] 70.80 90.10 68.80 60.20 65.90 41.60 84.00 65.70 1998 66.00 75
[20] (ResNet-18) 68.40 87.70 66.00 58.40 62.80 40.20 81.00 57.90 1743 62.10 322.5
[20] (ResNet-34) 72.30 90.70 70.20 59.50 69.30 44.40 85.70 69.50 2037 66.70 175.0
ERFNet-IntRA-KD [10] 72.40 100.0
SIM-CycleGAN [15] 73.90 91.80 71.80 66.40 76.20 46.10 87.80 67.10 2346 69.40
CurveLanes-NAS-S [26] 71.40 88.30 68.60 63.20 68.00 47.90 82.50 66.00 2817 66.20 9.0
CurveLanes-NAS-M [26] 73.50 90.20 70.50 65.90 69.30 48.80 85.70 67.50 2359 68.20 33.7
CurveLanes-NAS-L [26] 74.80 90.70 72.30 67.70 70.10 49.40 85.80 68.40 1746 68.90 86.5
LaneATT (ResNet-18) 75.09 91.11 72.96 65.72 70.91 48.35 85.49 63.37 1170 68.95 250 9.3
LaneATT (ResNet-34) 76.68 92.14 75.03 66.47 78.15 49.39 88.38 67.72 1330 70.72 171 18.0
LaneATT (ResNet-122) 77.02 91.74 76.16 69.47 76.31 50.46 86.29 64.05 1264 70.81 26 70.5
Table 3. State-of-the-art results on CULane. Since the images in the “Cross” category have no lanes, the reported number is the amount of
false-positives. For a fairer comparison, we measured the FPS of the fastest method ([20]) under the same machine and conditions as ours.
The best and second-best results across methods with source-code available are in bold and underlined, respectively.
Evaluation metrics. The only metric is the F1, which is based on the intersection over union (IoU). Since the IoU relies on areas instead of points, a lane is represented as a thick line connecting the respective lane's points. In particular, the dataset's official metric considers the lanes as 30-pixel-thick lines. If a prediction has an IoU greater than 0.5 with a ground-truth lane, it is considered a true positive.
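As an illustration of this metric (not the official evaluation code), the sketch below rasterizes two lanes as thick polylines and computes their IoU; the image size and thickness follow the description above:

```python
import numpy as np
import cv2

def lane_iou(lane_a, lane_b, img_h=590, img_w=1640, thickness=30):
    """IoU of two lanes drawn as 30-pixel-thick polylines (CULane-style).

    lane_a, lane_b: arrays of (x, y) points; this is only an illustrative sketch.
    """
    mask_a = np.zeros((img_h, img_w), dtype=np.uint8)
    mask_b = np.zeros((img_h, img_w), dtype=np.uint8)
    cv2.polylines(mask_a, [np.int32(lane_a)], False, 1, thickness)
    cv2.polylines(mask_b, [np.int32(lane_b)], False, 1, thickness)
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    # A prediction is a true positive when this IoU is greater than 0.5.
    return inter / union if union > 0 else 0.0
```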
Results. The results of LaneATT on CULane, along with other state-of-the-art methods, are shown in Table 3 and in Figure 3 (right side). Qualitative results are shown in Figure 2 (middle row). We do not compare to the results shown in [12], as its main contribution is a post-processing method that could easily be incorporated into our method, but its source code is not public. Moreover, it is remarkably slow, which makes the model impractical in real-world applications (the full pipeline runs at less than 10 FPS, as reported in their work). In this context, LaneATT achieves the highest F1 among the compared methods while maintaining a high efficiency (+170 FPS) on CULane, a dataset with highly complex scenes. Compared to [20], our most lightweight model (with ResNet-18) surpasses their largest (with ResNet-34) by almost 3% of F1 while being much faster (250 vs. 175 FPS on the same machine). Additionally, in the "Night" and "Shadow" scenes, our method outperforms all the others, including SIM-CycleGAN [15], which was specifically designed for such conditions, further evidencing the effectiveness and the efficiency of LaneATT.

4.3. LLAMAS

Dataset description. LLAMAS [3] is a large lane detection dataset with over 100k images. The annotations were not made manually; instead, they were generated from high-definition maps. All images are from highway scenarios. The evaluation is based on CULane's F1, which was computed by the author of the LLAMAS benchmark, since the testing set's annotations are not public.

Results. The results of LaneATT on LLAMAS, along with PolyLaneNet's [23] results, are shown in Table 4. Qualitative results are shown in Figure 2 (bottom row). Since the benchmark is recent and only PolyLaneNet provided the necessary source code to be evaluated on LLAMAS, it is the only comparable method. As evidenced, LaneATT achieves an F1 greater than 90% with all three backbones. The results can also be seen on the benchmark's website (https://unsupervised-llamas.com/llamas/).

Method                  F1 (%)  Prec. (%)  Rec. (%)
PolyLaneNet [23]        88.40   88.87      87.93
LaneATT (ResNet-18)     93.46   96.92      90.24
LaneATT (ResNet-34)     93.74   96.79      90.88
LaneATT (ResNet-122)    93.54   96.82      90.47

Table 4. State-of-the-art results on LLAMAS.
4.4. Efficiency trade-offs

Being efficient is crucial for a lane detection model. In some cases, it might even be necessary to trade some accuracy to meet an application's requirements. In this experiment, some of the possible trade-offs are shown; in particular, different settings of the image input size (H_I × W_I) and of the number of anchors (N_anc, as described in Section 3.8) are evaluated. The results are shown in Table 5. They show that the number of anchors can be reduced for a slight improvement in efficiency without a large F1 drop. However, if the reduction is too large, the F1 starts to drop considerably. Moreover, if too many anchors are used, the efficacy can also degrade, which might be a consequence of unnecessary anchors disturbing the training. The results are similar for the input size, although the MACs drops are larger. The largest impact of both the number of anchors and the input size is on the training time. During inference, the proposals are filtered (using a confidence threshold) before the NMS procedure; during training, there is no such filtering. Since the NMS is one of the main bottlenecks of the model, and its running time depends directly on the number of objects, the number of anchors has a much higher impact on the training phase than on the testing phase.

Modification             F1 (%)  FPS  MACs (G)  TT (h)
N_anc = 250              68.68   196  17.3      5.7
N_anc = 500              75.45   190  17.4      6.4
N_anc = 750              75.80   181  17.7      7.8
N_anc = 1000             76.66   171  18.0      11.1
N_anc = 1250             75.91   156  18.4      11.5
H_I × W_I = 180 × 320    66.74   195  4.8       4.3
H_I × W_I = 288 × 512    75.02   186  11.5      7.3
H_I × W_I = 360 × 640    76.66   171  18.0      11.1

Table 5. Efficiency trade-offs on CULane using the ResNet-34 backbone. "TT" stands for training time in hours.

4.5. Ablation study

This experiment evaluates the impact of each major part of the proposed method (one at a time): anchor-based pooling, shared layers, focal loss, and the attention mechanism. The results are shown in Table 6. The first row comprises the results for the standard LaneATT, while the following rows show the results for slightly modified versions of the standard model. In the second row, the anchor-based pooling was removed and the feature-selection procedure of Line-CNN [13] was used instead (i.e., only features from a single point in the feature map were used for each anchor). In the third one, instead of a single pair of fully-connected layers (L_reg and L_cls) for the final prediction, three pairs (six layers) were used, one pair per boundary (left, bottom, or right). That is, all anchors starting in the left boundary of the image had their proposals generated by the same pair of layers (L^L_reg and L^L_cls), and similarly for the bottom (L^B_reg and L^B_cls) and the right (L^R_reg and L^R_cls) boundaries. In the fourth one, the Focal Loss was replaced with the Cross Entropy, and in the last one, the attention mechanism was removed.

Model                      F1 (%)  FPS  Params. (M)
LaneATT (ResNet-34)        76.68   171  22.13
− anchor-based pooling     64.89   188  21.39
− shared layers            75.45   142  22.34
− focal loss               75.54   171  22.13
− attention mechanism      75.78   196  21.37

Table 6. Ablation study results on CULane.

The massive drop in performance when the anchor-based pooling procedure is removed shows its importance. This procedure enabled the use of a more lightweight backbone, which was not possible in Line-CNN [13] without a large performance drop. The results also show that a layer for each boundary of the image is not only unnecessary but also detrimental to the model's efficiency. Furthermore, using the Focal Loss instead of the Cross Entropy was also shown to be beneficial; besides, it eliminates the need for one hyperparameter (the number of negative samples to be used in the loss computation). Finally, the proposed attention mechanism is another modification that significantly increases the model's performance.

5. Conclusion

We proposed a real-time single-stage deep lane detection model that outperforms state-of-the-art models, as shown by an extensive comparison with the literature. The model is not only effective but also efficient. On TuSimple, the method achieves the second-highest reported F1 (a difference of only 0.02%) while being much faster than the top-F1 method (171 vs. 30 FPS). On CULane, one of the largest and most complex lane detection datasets, the method establishes a new state-of-the-art among real-time methods in terms of both speed and accuracy (+4.38% of F1 compared to the state-of-the-art method with a similar speed of around 170 FPS). Additionally, the method achieved a high F1 (above 93%) on the LLAMAS benchmark with all three backbones evaluated. To achieve those results, along with other modifications, a novel anchor-based attention mechanism was proposed. The ablation study showed that this addition significantly increases the model's performance (F1 score), especially when considering the gains obtained by recent advances in the literature. Additionally, some efficiency trade-offs that are useful in practice were also shown.
References

[1] A. A. Assidiq, O. O. Khalifa, M. R. Islam, and S. Khan. Real time lane detection for autonomous vehicles. In International Conference on Computer and Communication Engineering, 2008.
[2] Claudine Badue, Rânik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo, Vinicius B. Cardoso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago M. Paixão, Filipe Mutz, Lucas de Paula Veronese, Thiago Oliveira-Santos, and Alberto F. De Souza. Self-driving cars: A survey. Expert Systems with Applications, 165:113816, 2021.
[3] Karsten Behrendt and Ryan Soussan. Unsupervised labeled lane marker dataset generation using maps. In International Conference on Computer Vision (ICCV), 2019.
[4] Rodrigo F. Berriel, Edilson de Aguiar, Alberto F. De Souza, and Thiago Oliveira-Santos. Ego-Lane Analysis System (ELAS): Dataset and Algorithms. Image and Vision Computing, 68:64–75, 2017.
[5] Zhenpeng Chen, Qianfei Liu, and Chenfan Lian. PointLaneNet: Efficient End-to-End CNNs for Accurate Real-Time Lane Detection. In Intelligent Vehicles Symposium (IV), 2019.
[6] Ruochen Fan, Xuanrun Wang, Qibin Hou, Hanchao Liu, and Tai-Jiang Mu. SpinNet: Spinning Convolutional Network for Lane Boundary Detection. Computational Visual Media, 5(4):417–428, 2019.
[7] Mohsen Ghafoorian, Cedric Nugteren, Nóra Baka, Olaf Booij, and Michael Hofmann. EL-GAN: Embedding Loss Driven Generative Adversarial Networks for Lane Detection. In European Conference on Computer Vision (ECCV), 2018.
[8] Ross Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] Yuenan Hou, Zheng Ma, Chunxiao Liu, Tak-Wai Hui, and Chen Change Loy. Inter-Region Affinity Distillation for Road Marking Segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[11] Yuenan Hou, Zheng Ma, Chunxiao Liu, and Chen Change Loy. Learning Lightweight Lane Detection CNNs by Self Attention Distillation. In International Conference on Computer Vision (ICCV), 2019.
[12] Hussam Ullah Khan, Afsheen Rafaqat Ali, Ali Hassan, Ahmed Ali, Wajahat Kazmi, and Aamer Zaheer. Lane Detection using Lane Boundary Marker Network with Road Geometry Constraints. In Winter Conference on Applications of Computer Vision (WACV), 2020.
[13] Xiang Li, Jun Li, Xiaolin Hu, and Jian Yang. Line-CNN: End-to-End Traffic Line Detection with Line Proposal Unit. Transactions on Intelligent Transportation Systems, 21:248–258, 2019.
[14] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] Tong Liu, Zhaowei Chen, Yi Yang, Zehao Wu, and Haowei Li. Lane Detection in Low-light Conditions Using an Efficient Data Enhancement: Light Conditions Style Transfer. In Intelligent Vehicles Symposium (IV), 2020.
[16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single Shot Multibox Detector. In European Conference on Computer Vision (ECCV), 2016.
[17] Xingang Pan, Jianping Shi, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Spatial As Deep: Spatial CNN for Traffic Scene Understanding. In AAAI, 2018.
[18] Jonah Philion. FastDraw: Addressing the Long Tail of Lane Detection by Adapting a Sequential Prediction Network. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[19] Fabio Pizzati, Marco Allodi, Alejandro Barrera, and Fernando García. Lane Detection and Classification using Cascaded CNNs. In International Conference on Computer Aided Systems Theory, 2019.
[20] Zequn Qin, Huanyu Wang, and Xi Li. Ultra Fast Structure-aware Deep Lane Detection. In European Conference on Computer Vision (ECCV), 2020.
[21] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[22] Eduardo Romera, José M. Alvarez, Luis M. Bergasa, and Roberto Arroyo. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. Transactions on Intelligent Transportation Systems, 19(1):263–272, 2017.
[23] Lucas Tabelini, Rodrigo Berriel, Thiago M. Paixão, Claudine Badue, Alberto F. De Souza, and Thiago Oliveira-Santos. PolyLaneNet: Lane Estimation via Deep Polynomial Regression. In International Conference on Pattern Recognition (ICPR), 2020.
[24] TuSimple. TuSimple benchmark. https://github.com/TuSimple/tusimple-benchmark. Accessed September, 2020.
[25] Wouter Van Gansbeke, Bert De Brabandere, Davy Neven, Marc Proesmans, and Luc Van Gool. End-to-end Lane Detection through Differentiable Least-Squares Fitting. In ICCV Workshop, 2019.
[26] Hang Xu, Shaoju Wang, Xinyue Cai, Wei Zhang, Xiaodan Liang, and Zhenguo Li. CurveLane-NAS: Unifying Lane-Sensitive Architecture Search and Adaptive Point Blending. In European Conference on Computer Vision (ECCV), 2020.
[27] Seungwoo Yoo, Hee Seok Lee, Heesoo Myeong, Sungrack Yun, Hyoungwoo Park, Janghoon Cho, and Duck Hoon Kim. End-to-End Lane Marker Detection via Row-wise Classification. In CVPR Workshop, 2020.
[28] Jie Zhang, Yi Xu, Bingbing Ni, and Zhenyu Duan. Geometric Constrained Joint Lane Segmentation and Lane Boundary Detection. In European Conference on Computer Vision (ECCV), 2018.
[29] Qin Zou, Hanwen Jiang, Qiyu Dai, Yuanhao Yue, Long Chen, and Qian Wang. Robust Lane Detection from Continuous Driving Scenes using Deep Neural Networks. Transactions on Vehicular Technology, 69(1):41–54, 2019.