TP-LSD: Tri-Points Based Line Segment Detector
Siyu Huang1 , Fangbo Qin2 , Pengfei Xiong1 , Ning Ding1 , Yijia He1(B) ,
and Xiao Liu1
1 Megvii Technology, Beijing, China
[email protected], [email protected], [email protected], [email protected], [email protected]
2 Institute of Automation, Chinese Academy of Sciences, Beijing, China
[email protected]
1 Introduction
A compact environment description is an important issue in visual perception.
For man-made environments with various flat surfaces, line segments can encode
the environment structure, providing fundamental information for downstream
vision tasks such as vanishing point estimation [17,19], 3D structure reconstruction [16], distortion correction [24], and pose estimation [4,14].
S. Huang's and N. Ding's contributions were made while they were interns at Megvii
Research Beijing, Megvii Technology, China.
Fig. 1. Overview. (a) Compared to the existing two-step methods, TP-LSD detects
multiple line segments simultaneously in one step, providing better efficiency and com-
pactness. (b) Inference speed and F-score on Wireframe test set.
With the rapid advance of deep learning, deep neural networks have been applied
to line segment detection. As shown in Fig. 1a, the existing methods take two
steps. With the top-down strategy, a method first detects the region of a line and then
squeezes the region into a line segment [22], which might be affected by regional
textures and lacks an explicit definition of endpoints. With the bottom-up
strategy, a method first detects junctions and then organizes them into line segments
using a grouping algorithm [7,8] or an extra classifier [23,25,28], which might be
prone to inaccurate junction predictions caused by local ambiguity. The
two-step strategy might also limit the inference speed in real-time applications.
Considering the above problems, we propose the tri-points (TP) representation,
which uses a root-point as the unique identity to localize a line segment,
while the two corresponding end-points are represented by their displacements
w.r.t. the root-point. Thus a TP encodes the length, orientation, and location of
a line segment. Moreover, inspired by the fact that humans perceive line segments
with reference to straight lines, we leverage the straight-line segmentation map as a structural
prior to guide the inference of TPs, by embedding feature aggregation modules
that fuse the line-map with TP-related features. Accordingly, the Tri-Points Based
Line Segment Detector (TP-LSD) is designed, which has three parts: a feature
extraction backbone, a TP extraction branch, and a line segmentation branch.
As to the evaluation of line segment detection, the current metrics either treat
a line segment as a set of pixels or use the squared Euclidean distance to judge the
matching degree, neither of which can reflect the various relationships between line segments, such as intersection and overlapping. Therefore we propose a new metric,
named line matching average precision, designed from a camera model perspective.
In summary, the main contributions of this paper are as follows:
- The tri-points (TP) representation, which compactly encodes a line segment with a root-point and two displacement vectors.
- TP-LSD, a one-step line segment detector that leverages the straight-line segmentation map as a structural prior via feature aggregation modules.
- A new evaluation metric, line matching average precision (LAP), designed from a camera model perspective.
- Real-time inference of up to 78 FPS with competitive accuracy.
2 Related Work
2.1 Hand-Crafted Feature Based Methods
Traditional detectors extract line segments from hand-crafted low-level cues such as image gradients and edge maps, including LSD [6], EDLines [1], CannyLines [12], Hough-transform based extraction [21], and the linelet-based representation [2]. Relying only on local low-level cues, these methods are less robust than the learning-based methods discussed below.
2.2 Learning-Based Methods
In the past few years, CNN-based methods have been introduced to solve the
edge detection problem. HED [20] treats the edge detection problem as pixel-wise
binary classification and achieves a significant performance improvement over
traditional methods. Following this breakthrough, numerous methods
for edge detection have been proposed [11,15]. However, edge maps lack the explicit
geometric information needed for a compact environment representation.
More recently, CNN-based methods have been developed for line segment detection. Huang et al. [8] proposed DWP, which includes two parallel branches to
predict a junction map and a line heatmap for an image, then merges them into line
segments. Zhang et al. [25] and Zhou et al. [28] utilize a point-pair graph representation for line segments. Their methods (PPGNet and L-CNN) first detect
junctions, then use an extra classifier to create an adjacency matrix identifying
whether a point-pair belongs to the same line segment. Xue et al. [22] creatively
presented the regional attraction of line segment maps, and proposed AFM to predict
attraction field maps from raw images, followed by a squeeze module to produce
line segments. Furthermore, Xue et al. [23] proposed a 4-D holistic attraction
field map (H-AFM) to better parameterize line segments, and proposed HAWP
following the L-CNN pipeline. Although learning-based methods have significant advantages over hand-crafted ones, their two-step strategy might limit
real-time performance and relies on an extra classifier or heuristic post-processing.
Moreover, the relationship between the line-map and line segments is under-utilized.
3 Tri-Points Representation
The tri-points (TP) representation is inspired by how people model a long, narrow object: intuitively, we first find a root-point on a line, then extend from
the root-point in two opposite directions and determine the endpoints. A TP contains three key-points and their spatial relationship to encode a line segment.
The root-point localizes the center of a line segment, and the two end-points are
represented by two displacement vectors w.r.t. the root-point, as illustrated in
Fig. 2c, 2d. This is similar to SPM [13] used in human pose estimation. The conversion from a TP to a vectorized line segment, denoted as the TP generation
operation, is expressed by

$$(x_s, y_s) = (x_r, y_r) + \mathbf{d}_s(x_r, y_r), \tag{1}$$
$$(x_e, y_e) = (x_r, y_r) + \mathbf{d}_e(x_r, y_r), \tag{2}$$

where (x_r, y_r) denotes the root-point of a line segment, and (x_s, y_s) and (x_e, y_e)
represent its start-point and end-point, respectively. Generally, the leftmost
point is the start-point; if the line segment is vertical, the upper point is
the start-point. d_s(x_r, y_r) and d_e(x_r, y_r) denote the predicted 2D displacements
from the root-point to its corresponding start-point and end-point, respectively.
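To make the TP generation operation concrete, the following minimal Python sketch (our illustration, not the authors' released code; the array names `root_points`, `disp_start`, and `disp_end` are assumptions) converts a batch of tri-points into vectorized line segments per Eqs. (1)-(2):

```python
import numpy as np

def tp_generation(root_points, disp_start, disp_end):
    """Convert tri-points into vectorized line segments.

    root_points: (N, 2) root-point coordinates (x_r, y_r).
    disp_start:  (N, 2) displacements d_s from root-point to start-point.
    disp_end:    (N, 2) displacements d_e from root-point to end-point.
    Returns an (N, 4) array of segments (x_s, y_s, x_e, y_e).
    """
    starts = root_points + disp_start   # Eq. (1)
    ends = root_points + disp_end       # Eq. (2)
    return np.concatenate([starts, ends], axis=1)

# One root-point at (50, 40) with symmetric displacements.
roots = np.array([[50.0, 40.0]])
d_s = np.array([[-30.0, -10.0]])
d_e = np.array([[30.0, 10.0]])
print(tp_generation(roots, d_s, d_e))  # [[20. 30. 80. 50.]]
```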
4 Methods
Based on the proposed tri-points, the one-step model TP-LSD is proposed for
line segment detection, whose architecture is shown in Fig. 3. A U-shaped network is used to generate shared features, which are then fed to two branches:
1) the TP extraction branch, which contains a root-point detection task and a displacement regression task; 2) the line segmentation branch, which generates a pixel-wise line-map. These two branches are bridged by feature aggregation modules.
Finally, after being processed by the point filter module, the filtered TPs are transformed
into vectorized line segment instances with the TP generation operation.
The three tasks are supervised by ground truths generated from the raw line segment labels, and their losses are combined as

$$L = \lambda_{root} L_{root} + \lambda_{disp} L_{disp} + \lambda_{line} L_{line}, \tag{3}$$

where {λ_root, λ_disp, λ_line} = {50, 1, 20}.
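As a sketch, the weighted combination of Eq. (3) can be written as below; the individual task losses are placeholders, since their exact forms (e.g., the classification loss for the root-point map) are not restated here:

```python
def combined_loss(loss_root, loss_disp, loss_line,
                  lam_root=50.0, lam_disp=1.0, lam_line=20.0):
    # Eq. (3): weighted sum of the three task losses.
    return lam_root * loss_root + lam_disp * loss_disp + lam_line * loss_line
```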
5 Evaluation Metrics
In this section, we briefly introduce two existing evaluation metrics, the pixel based
metric and structural average precision, and then design a novel metric, line
matching average precision.
Pixel Based Metric: For a pixel on a detected line segment, if its minimum
distance to all the ground truth pixels is within a 1-pixel threshold, it is counted
as correctly detected. After evaluating all the pixels on the detected line segments,
the F-score F^H can be calculated [8,22,28]. The limitation is that it cannot reveal
the continuity of a line segment. For example, if a long line segment is broken into
several short ones, the F-score is still high, but the split line segments are not
suitable for 3D reconstruction or wireframe parsing.
Structural Average Precision: The structural average precision (sAP) [28]
uses the sum of squared errors (SSE) between the predicted end-points and their
ground truths as the evaluation metric. A predicted line segment is counted
as a true positive detection when its SSE is less than a threshold ϑ, e.g.,
ϑ = 5, 10, 15. However, line segment matching can be more complicated than point-pair correspondence. For example, Fig. 4b, 4c show that sAP is not
discriminative enough for some different matching situations.
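For reference, here is a minimal sketch of the sAP matching test described above; we assume the segments are given at the evaluation resolution, and trying both endpoint orderings is an assumption on our part:

```python
import numpy as np

def sap_match(pred, gts, threshold=10.0):
    """True if pred (x_s, y_s, x_e, y_e) matches any ground-truth segment
    in gts (M, 4) with endpoint SSE below the threshold."""
    p = np.asarray(pred, dtype=float).reshape(2, 2)
    for gt in np.asarray(gts, dtype=float).reshape(-1, 2, 2):
        sse = min(np.sum((p - gt) ** 2),        # same endpoint order
                  np.sum((p - gt[::-1]) ** 2))  # swapped endpoints
        if sse < threshold:
            return True
    return False
```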
Fig. 4. Evaluation metrics for line segment detection. (a) The geometric explanation
of the proposed line matching score (LMS). The blue and red line segments on the
normalized image plane correspond to the detection and ground truth, respectively,
which determine two planes together with the optical center C. In (b) and (c),
different matching situations can have the same SSE score of 8 under the sAP metric. In
contrast, LMS gives discriminative scores. (Color figure online)
Line Matching Average Precision: To better reflect the various line segment
matching situations in terms of direction and position as well as length, the Line
Matching Score (LMS) is proposed. LMS contains two parts: Score_θ accounts
for the differences in angle and position, and Score_l for the matching degree in
length. The LMS is calculated by

$$\mathrm{Score}_\theta = \begin{cases} 1 - \dfrac{\theta(\mathbf{n}_{pred}, \mathbf{n}_{gt})}{\eta_\theta}, & \text{if } \theta(\mathbf{n}_{pred}, \mathbf{n}_{gt}) < \eta_\theta,\\[4pt] 0, & \text{otherwise,} \end{cases} \tag{4}$$

$$\mathrm{LMS} = \mathrm{Score}_\theta \cdot \mathrm{Score}_l, \tag{5}$$

where θ(·,·) calculates the angle between two vectors in degrees, n_pred and n_gt are
the normals of the two planes shown in Fig. 4a, and η_θ is a minimum threshold.
Fig. 5. Comparison of line matching evaluation results using different metrics. (a) The
ground truth line segments, marked in red. (b) Line matching result using the sAP^10 metric.
(c) Line matching result using the proposed LAP metric. In (b) and (c), the mismatched
and matched line segments are marked in blue and red, respectively. The endpoints
are marked in cyan. (Color figure online)
Score_l measures the overlap degree of two line segments. The ratio of the
overlap length to the ground truth length is η_1, and the ratio of the overlap length
to the projection length is η_2:

$$\eta_1 = \frac{L_{pred} \cap L_{gt}}{L_{gt}}, \qquad \eta_2 = \frac{L_{pred} \cap L_{gt}}{L_{pred}\,|\cos\alpha|}, \tag{6}$$
where L is the length of a line segment and L_pred ∩ L_gt is the overlap length of
the predicted line segment projected onto the ground truth line segment. α is the
angle between the two line segments in the 2D image. Then Score_l is calculated by
$$\mathrm{Score}_l = \begin{cases} \dfrac{\eta_1 + \eta_2}{2}, & \text{if } \eta_1 \ge \eta_l \text{ and } \eta_2 \ge \eta_l,\\[4pt] 0, & \text{otherwise,} \end{cases} \tag{7}$$
where η_l denotes a minimum threshold. Since the focal length of the camera might
be unknown for public datasets, to make a fair comparison we first rescale
the detected line segments with the same ratio used to resize the original image to
the resolution 128 × 128, and set a virtual focal length f = 24. Besides, we set
η_θ = 10° and η_l = 0.5 in this work.
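The sketch below gathers the whole LMS computation under our reading of the definitions above (Eqs. (4)-(7)); it is an illustration, not the authors' code, and it assumes the principal point lies at the center of the rescaled 128 × 128 image and that the segments are non-degenerate:

```python
import numpy as np

F_VIRTUAL = 24.0   # virtual focal length f (Sec. 5)
ETA_THETA = 10.0   # angle threshold eta_theta, in degrees
ETA_L = 0.5        # length threshold eta_l

def plane_normal(seg, img_size=128.0):
    """Unit normal of the plane spanned by the optical center C and a
    segment (x_s, y_s, x_e, y_e) on the rescaled 128x128 image. The
    principal point is assumed to sit at the image center."""
    c = img_size / 2.0
    p1 = np.array([seg[0] - c, seg[1] - c, F_VIRTUAL])
    p2 = np.array([seg[2] - c, seg[3] - c, F_VIRTUAL])
    n = np.cross(p1, p2)
    return n / np.linalg.norm(n)

def lms(pred, gt):
    """Line Matching Score between one predicted and one ground-truth
    segment, both given as (x_s, y_s, x_e, y_e) numpy arrays."""
    # Score_theta (Eq. 4): angle between the two plane normals, in degrees.
    cos_t = min(abs(float(np.dot(plane_normal(pred), plane_normal(gt)))), 1.0)
    theta = np.degrees(np.arccos(cos_t))
    score_theta = 1.0 - theta / ETA_THETA if theta < ETA_THETA else 0.0

    # Score_l (Eqs. 6-7): overlap of pred projected onto the gt line.
    g0, g1 = gt[:2], gt[2:]
    len_gt = np.linalg.norm(g1 - g0)
    d = (g1 - g0) / len_gt
    t0, t1 = sorted([np.dot(pred[:2] - g0, d), np.dot(pred[2:] - g0, d)])
    overlap = max(0.0, min(t1, len_gt) - max(t0, 0.0))
    proj_len = t1 - t0                      # equals L_pred * |cos(alpha)|
    eta1 = overlap / len_gt
    eta2 = overlap / proj_len if proj_len > 1e-9 else 0.0
    score_l = 0.5 * (eta1 + eta2) if (eta1 >= ETA_L and eta2 >= ETA_L) else 0.0

    return score_theta * score_l            # our reading of Eq. (5)

# Example: a slightly rotated prediction against a horizontal ground truth.
gt = np.array([20.0, 64.0, 100.0, 64.0])
pred = np.array([22.0, 63.0, 98.0, 66.0])
print(lms(pred, gt))
```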
Using LMS to determine true positives, i.e., counting a detected line segment as a
true positive if LMS > 0.5, we can calculate the Line Matching Average Precision
(LAP) over the entire test set. LAP is defined as the area under the precision-recall curve.
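A generic sketch of the area-under-PR computation for LAP follows; the per-detection true-positive flags are assumed to have been obtained with the LMS > 0.5 criterion above:

```python
import numpy as np

def line_matching_ap(scores, is_tp, num_gt):
    """LAP-style average precision: area under the precision-recall curve.
    scores: (N,) detection confidences over the whole test set.
    is_tp:  (N,) bool, True if the detection matched a ground-truth
            segment with LMS > 0.5 (each ground truth matched at most once).
    num_gt: total number of ground-truth segments."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / num_gt
    precision = cum_tp / np.arange(1, len(tp) + 1)
    # Riemann sum of precision over recall increments.
    return float(np.sum(precision * np.diff(np.concatenate([[0.0], recall]))))
```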
Analysis of Metric on Real Image. We compare the line matching evaluation
results between SSE used in sAP and LMS used in LAP on a real image, as
shown in Fig. 5. Comparing the areas labeled by yellow boxes in Fig. 5b and
Fig. 5a, the detected line segments have obvious direction errors relative to the
ground truth. However, SSE gives the same tolerance to line segments of
different lengths and accepts them as true positive matches. In contrast, as
shown in Fig. 5c, LMS better captures the direction errors and gives the
correct judgement. As shown by the green boxes in Fig. 5a and Fig. 5c, for a
line segment with the correct direction but a slightly shorter length than
the ground truth, namely one whose Score_l is lower than 1 but greater than
η_l, LMS will accept it while SSE will not. Considering that the direction of
line segments is more important in higher-level applications such as SLAM, this
deviation is acceptable.

Fig. 6. Gradient based interpretation of root-point detection. (a) Raw image and the
root-points (white dots) of three line segments. (b–e) The gradient saliency maps of the
input layer backpropagated from the three root-points detected by the four different
models, based on the Guided Backpropagation method [18].
6 Experiments
Experiments are conducted on the Wireframe dataset [8] and the YorkUrban dataset [3].
Wireframe contains 5462 images of indoor and outdoor man-made environments,
among which 5000 images are used for training. To validate generalization
ability, we also evaluate on the YorkUrban dataset, which has 102 test images.
We use standard data augmentation to expand the diversity of training samples,
including horizontal and vertical flips, rotation, and scaling.
The hardware configuration includes four NVIDIA RTX 2080Ti GPUs and an
Intel Xeon Gold 6130 2.10 GHz CPU. We use the Adam optimizer with an
initial learning rate of 1 × 10^-3, which is divided by 10 at the 150th, 250th, and
350th epochs. Training lasts 400 epochs in total.
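For clarity, the schedule above corresponds to a standard PyTorch setup like the sketch below; the network is replaced by a stand-in module, since the actual TP-LSD definition is not reproduced here:

```python
import torch
import torch.nn as nn

# Stand-in network; the real TP-LSD backbone and branches are not shown.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Divide the learning rate by 10 at the 150th, 250th, and 350th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 250, 350], gamma=0.1)

for epoch in range(400):
    # ... one training pass (forward, combined loss, optimizer.step()) ...
    scheduler.step()
```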
FAM. FAM combines the cross-branch guidance with the line-map segmentation
branch. Although the F^H metric increases, indicating better pixel localization
accuracy, sAP^10 and LAP decrease slightly, because a larger number
of line segments are detected.
MCM. The Mixture Convolution Module is applied in the root-point detection sub-branch. Compared to standard convolution layers, MCM improves the LAP
scores significantly, showing a better matching degree.
PFM. With PFM and a contribution ratio of α = 0.5, the precision
increases while the recall slightly decreases, which leads to a better overall accuracy. The decrease in sAP^10 and LAP is due to the reduced confidence of the
root-points after PFM.
Augmentation. The 7th row in Table 1 shows data augmentation with
only horizontal and vertical flips. Compared to the result in the 6th row, the lower
performance shows that rotation and scaling based data augmentation can
further improve the performance.
Interpretability. To explore what the network has learned from the line segment
detection task, we use Guided Backpropagation [18] to visualize which pixels are
important for root-point detection. Guided Backpropagation interprets each
pixel's importance by calculating the gradient flow from the output layer back
to the input image. The gradients flowing to the input image from three specific
detected root-points are visualized in Fig. 6. We find that the network automatically
learns to localize the saliency region w.r.t. a root-point, which lies along the complete
line segment. This shows that the root-point detection task mainly relies on line features.
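The following minimal PyTorch sketch illustrates the Guided Backpropagation mechanism [18] used for this visualization (our own toy example with a stand-in network, not the TP-LSD model): ReLU backward passes are modified so that only positive gradients flow through positively activated positions.

```python
import torch
import torch.nn as nn

class GuidedReLU(torch.autograd.Function):
    """ReLU whose backward pass keeps only positive gradients at
    positions with positive forward activations (guided backprop)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out.clamp(min=0) * (x > 0).float()

class GRelu(nn.Module):
    def forward(self, x):
        return GuidedReLU.apply(x)

# Toy stand-in for the root-point sub-branch.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), GRelu(),
                    nn.Conv2d(8, 1, 3, padding=1))

img = torch.rand(1, 3, 64, 64, requires_grad=True)
score = net(img)[0, 0, 32, 32]          # one detected root-point location
score.backward()
saliency = img.grad[0].abs().sum(0)     # per-pixel importance map
```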
Furthermore, the integration of LSB leads to a higher influence of on-line
pixels on root-point prediction. Comparing Fig. 6c to Fig. 6b, the former presents
higher gradient values along the line. The saliency maps obtained by model No.
3 and model No. 4 are cleaner, and the saliency regions are more concentrated
on specific line segments. With the introduction of MCM in model No. 4, the
response to long line segments is improved by a larger receptive field,
as shown by the comparison between Fig. 6d and Fig. 6e.

Table 2. Evaluation results of different line segment detection methods. "/" means
that the score is too low to be meaningful. The best two scores are shown in bold and
italic.
Fig. 7. Precision-recall curves of line segment detection. The models are trained on the
Wireframe dataset and tested on both the Wireframe and YorkUrban datasets. Scores
below 0.1 are not plotted. The PR curves for LAP of DWP are not plotted due to its
low score.

With an input size of 320 × 320, the proposed TP-LSD achieves real-time speed of up to 78
FPS, offering the potential to be used in real-time applications like SLAM.
7 Conclusion
This paper proposes TP-LSD, a faster and more compact model for line segment
detection with a one-step strategy. The tri-points representation encodes
a line segment with three key-points, based on which line segment
detection is realized by end-to-end inference. Furthermore, a straight line-map
is produced by a segmentation task and used as a structural prior cue
to guide the extraction of TPs. Both quantitatively and qualitatively, TP-LSD
shows improved performance compared to existing models. Besides, our
method achieves a speed of 78 FPS, showing its potential to be integrated into real-time applications such as vanishing point estimation, 3D reconstruction, and
pose estimation.
References
1. Akinlar, C., Topal, C.: EDLines: real-time line segment detection by edge drawing
(ED). In: 2011 18th IEEE International Conference on Image Processing, pp. 2837–
2840, September 2011. https://doi.org/10.1109/ICIP.2011.6116138
2. Cho, N., Yuille, A., Lee, S.: A novel linelet-based representation for line seg-
ment detection. IEEE Trans. Pattern Anal. Mach. Intell. 40(5), 1195–1208 (2018).
https://doi.org/10.1109/TPAMI.2017.2703841
3. Denis, P., Elder, J.H., Estrada, F.J.: Efficient edge-based methods for estimating
Manhattan frames in urban imagery. In: Forsyth, D., Torr, P., Zisserman, A. (eds.)
ECCV 2008. LNCS, vol. 5303, pp. 197–210. Springer, Heidelberg (2008).
https://doi.org/10.1007/978-3-540-88688-4_15
4. Elqursh, A., Elgammal, A.: Line-based relative pose estimation. In: Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recog-
nition, pp. 3049–3056 (2011). https://doi.org/10.1109/CVPR.2011.5995512
5. Girshick, R.B.: Fast R-CNN. 2015 IEEE International Conference on Computer
Vision (ICCV), pp. 1440–1448 (2015)
6. Grompone von Gioi, R., Jakubowicz, J., Morel, J., Randall, G.: LSD: a fast line
segment detector with a false detection control. IEEE Trans. Pattern Anal. Mach.
Intell. 32(4), 722–732 (2010). https://doi.org/10.1109/TPAMI.2008.300
7. Huang, K., Gao, S.: Wireframe parsing with guidance of distance map. IEEE Access
7, 141036–141044 (2019). https://doi.org/10.1109/ACCESS.2019.2943885
8. Huang, K., Wang, Y., Zhou, Z., Ding, T., Gao, S., Ma, Y.: Learning to parse
wireframes in images of man-made environments. In: 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 626–635, June 2018.
https://doi.org/10.1109/CVPR.2018.00072
9. Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. In: ECCV
(2018)
10. Liu, W., et al.: SSD: single shot MultiBox detector. In: Leibe, B., Matas, J., Sebe,
N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-46448-0_2
11. Liu, Y., Cheng, M.M., Hu, X., Wang, K., Bai, X.: Richer convolutional features
for edge detection. In: 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 5872–5881 (2017)
12. Lu, X., Yao, J., Li, K., Li, L.: Cannylines: a parameter-free line segment detector.
In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 507–511,
September 2015. https://doi.org/10.1109/ICIP.2015.7350850
13. Nie, X., Zhang, J., Yan, S., Feng, J.: Single-stage multi-person pose machines.
In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp.
6950–6959 (2019)
14. Qin, F., Shen, F., Zhang, D., Liu, X., Xu, D.: Contour primitives of interest extrac-
tion method for microscopic images and its application on pose measurement. IEEE
Trans. Syst. Man Cybern. Syst. 48(8), 1348–1359 (2018). https://doi.org/10.1109/
TSMC.2017.2669219
15. Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jägersand, M.: BASNet:
boundary-aware salient object detection. In: 2019 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pp. 7471–7481 (2019)
16. Ramalingam, S., Brand, M.: Lifting 3D Manhattan lines from a single image. In:
The IEEE International Conference on Computer Vision (ICCV), December 2013
17. Rother, C.: A new approach to vanishing point detection in architectural environ-
ments. Image Vis. Comput. 20(9–10), 647–655 (2002)
18. Springenberg, J., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity:
the all convolutional net. In: ICLR (Workshop Track) (2015)
19. Wan, F., Deng, F.: Using line segment clustering to detect vanishing point. In:
Advanced Materials Research. vol. 268, pp. 1553–1558. Trans Tech Publ (2011)
20. Xie, S., Tu, Z.: Holistically-nested edge detection. In: 2015 IEEE International
Conference on Computer Vision (ICCV). pp. 1395–1403, December 2015. https://
doi.org/10.1109/ICCV.2015.164
21. Xu, Z., Shin, B., Klette, R.: Accurate and robust line segment extraction using
minimum entropy with Hough transform. IEEE Trans. Image Process. 24(3), 813–
822 (2015). https://doi.org/10.1109/TIP.2014.2387020
22. Xue, N., Bai, S., Wang, F., Xia, G.S., Wu, T., Zhang, L.: Learning attraction field
representation for robust line segment detection. In: CVPR (2018)
23. Xue, N., Wu, T., Bai, S., Wang, F.D., Xia, G.S., Zhang, L., Torr, P.H.S.:
Holistically-attracted wireframe parsing. In: CVPR (2020)
24. Xue, Z., Xue, N., Xia, G.S., Shen, W.: Learning to calibrate straight lines for
fisheye image rectification. In: The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2019
25. Zhang, Z., et al.: PPGNet: learning point-pair graph for line segment detection.
In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 7098–7107 (2019)
26. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. ArXiv abs/1904.07850
(2019)
27. Zhou, X., Zhuo, J., Krähenbühl, P.: Bottom-up object detection by grouping
extreme and center points. In: CVPR (2019)
28. Zhou, Y., Qi, H., Ma, Y.: End-to-end wireframe parsing. In: 2019 IEEE/CVF
International Conference on Computer Vision (ICCV), October 2019.
https://doi.org/10.1109/ICCV.2019.00105