Three-Dimensional Object Detection Network Based on Multi-Layer and Multi-Modal Fusion
Abstract
1. Introduction
- We design and implement a 3D object detection network based on a multi-layer, multi-modal fusion method that paints and encodes the point cloud inside the frustums proposed by the 2D object detection network (Frustum RGB PointPainting), thereby increasing the amount of information fed to the 3D detection network.
- To address the concern that doubling the number of channels may distort the spatial shape characteristics of the point cloud, a context-aware self-attention module is introduced into the 3D object detector, allowing it to capture global context while extracting spatial features.
- CLOCs is introduced to fuse the 2D and 3D detection results without NMS, further improving detection accuracy. Experiments on the KITTI dataset show that this fusion method yields a significant improvement over the LiDAR-only baseline, with an average mAP gain of 6.3%.
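The context-aware self-attention module mentioned above can be illustrated with a minimal sketch of scaled dot-product self-attention over per-point features. The function name, shapes, and random projection matrices here are illustrative assumptions, not the paper's exact module:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a set of point features.

    x:  (n, d) per-point feature matrix
    wq, wk, wv: (d, d) projection matrices (random here, learned in practice)
    Returns the attended (n, d) features; every output row is a weighted
    mix of all input rows, i.e. the module has a global receptive field.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])          # (n, n) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over all points
    return weights @ v

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (6, 4)
```

Because each output feature attends to every input point, spatial feature extraction is augmented with global context, which is the motivation for inserting such a module into the 3D detector.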
2. Materials and Methods
2.1. Three-Dimensional Object Detection Using Objectness
2.2. Three-Dimensional Object Detection Using Point Clouds
2.3. Three-Dimensional Object Detection Using Multi-Modal Fusion
2.4. Problem Statement
3. Three-Dimensional Object Detection Network Based on Multi-Layer and Multi-Modal Fusion
3.1. Early Fusion: Frustum RGB PointPainting (FRP)
Algorithm 1 Frustum RGB PointPainting

Input:
- LiDAR point cloud P ∈ R^(N×D) (N is the number of points, D is the point dimension, typically 4 for the KITTI dataset)
- Recommended channel: confidence scores from the 2D detection network
- Color channel: RGB channels of the camera image
- Extrinsic transformation matrix (LiDAR to camera)
- Camera intrinsic matrix

Method:
for each point in P do
    Point cloud projection: project the point onto the image plane using the extrinsic and intrinsic matrices
    Recommended channel acquisition:
        if point ∈ boxes generated by the 2D detection then take the corresponding detection score
        else set the recommended channel to zero
        end if
    Color channel acquisition: sample the RGB values at the projected pixel
    Point cloud encoding: append the recommended and color channels to the point
end for

Output: Encoded point cloud
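The steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `frustum_rgb_paint`, the max-over-boxes score rule, and the simple per-point loop are illustrative choices, not the authors' implementation.

```python
import numpy as np

def frustum_rgb_paint(points, tr, cam, image, boxes, scores):
    """Paint LiDAR points with a 2D-detection score channel and RGB color.

    points: (N, 4) LiDAR points (x, y, z, intensity)
    tr:     (4, 4) extrinsic LiDAR-to-camera transform
    cam:    (3, 4) camera projection matrix (intrinsics)
    image:  (H, W, 3) RGB image
    boxes:  list of (x1, y1, x2, y2) 2D detection boxes
    scores: per-box confidence scores
    Returns (N, 8) encoded points: original 4 channels + score + R, G, B.
    """
    n = points.shape[0]
    homo = np.hstack([points[:, :3], np.ones((n, 1))])   # homogeneous coords
    uvw = (cam @ tr @ homo.T).T                          # project to image plane
    uv = uvw[:, :2] / uvw[:, 2:3]                        # perspective divide

    painted = np.zeros((n, 8), dtype=np.float32)
    painted[:, :4] = points
    h, w = image.shape[:2]
    for i, (u, v) in enumerate(uv):
        if not (0 <= u < w and 0 <= v < h):
            continue                                     # point falls outside the image
        for (x1, y1, x2, y2), s in zip(boxes, scores):
            if x1 <= u <= x2 and y1 <= v <= y2:          # inside a 2D detection box
                painted[i, 4] = max(painted[i, 4], s)    # recommended-score channel
        painted[i, 5:8] = image[int(v), int(u)] / 255.0  # normalized RGB channels
    return painted
```

Points whose projection misses the image, or lands outside every 2D box, keep a zero recommended channel, matching the else branch of Algorithm 1.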
3.2. Three-Dimensional Object Detection Network Based on Self-Attention Mechanism
3.3. Late Fusion: CLOCs
4. Experimental Results
4.1. Experimental Environment
4.2. Detection Results and Analysis
4.3. Real-Time Comparison
4.4. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705.
- Shi, S.; Wang, X.; Li, H. Pointrcnn: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779.
- Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499.
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3D object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927.
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8.
- Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612.
- Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12460–12467.
- Alexe, B.; Deselaers, T.; Ferrari, V. Measuring the Objectness of Image Windows. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2189–2202.
- Kuo, W.; Hariharan, B.; Malik, J. DeepBox: Learning Objectness with Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2479–2487.
- Kong, T.; Sun, F.; Yao, A.; Liu, H.; Lu, M.; Chen, Y. RON: Reverse Connection with Objectness Prior Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5244–5252.
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30.
- Simon, M.; Milz, S.; Amende, K.; Gross, H.M. Complex-yolo: Real-time 3D object detection on point clouds. arXiv 2018, arXiv:1803.06199.
- Ali, W.; Abdelkarim, S.; Zidan, M.; Zahran, M.; El Sallab, A. Yolo3d: End-to-end real-time 3D oriented object bounding box detection from lidar point cloud. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 716–728.
- Shi, W.; Rajkumar, R. Point-gnn: Graph neural network for 3D object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1711–1719.
- Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3Dssd: Point-based 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048.
- Beltrán, J.; Guindel, C.; Moreno, F.M.; Cruzado, D.; Garcia, F.; De La Escalera, A. Birdnet: A 3D object detection framework from lidar information. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3517–3523.
- Barrera, A.; Guindel, C.; Beltrán, J.; García, F. Birdnet+: End-to-end 3D object detection in lidar bird’s eye view. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–6.
- Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538.
- Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253.
- Wang, Z.; Jia, K. Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1742–1749.
- Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3D-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3D object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 720–736.
- Paigwar, A.; Sierra-Gonzalez, D.; Erkent, Ö.; Laugier, C. Frustum-pointpillars: A multi-stage approach for 3D object detection using rgb camera and lidar. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2926–2933.
- Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10386–10393.
- Lin, C.; Tian, D.; Duan, X.; Zhou, J.; Zhao, D.; Cao, D. CL3D: Camera-LiDAR 3D object detection with point feature enhancement and point-guided fusion. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18040–18050.
- Samal, K.; Kumawat, H.; Saha, P.; Wolf, M.; Mukhopadhyay, S. Task-Driven RGB-Lidar Fusion for Object Tracking in Resource-Efficient Autonomous System. IEEE Trans. Intell. Veh. 2022, 7, 102–112.
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
Model | mAP | Car (Easy) | Car (Medium) | Car (Difficult)
---|---|---|---|---
MV3D | 62.85 | 71.09 | 62.35 | 55.12
MV3D (LiDAR) | 56.94 | 66.77 | 52.73 | 51.31
F-PointNet | 71.26 | 81.20 | 70.39 | 62.19
AVOD | 65.92 | 73.59 | 65.78 | 58.38
PI-RCNN | 76.41 | 84.37 | 74.82 | 70.03
SECOND | 74.33 | 83.13 | 73.66 | 66.20
VoxelNet | 66.77 | 77.47 | 65.11 | 57.73
PointRCNN | 76.25 | 85.29 | 75.08 | 68.38
PointPillars | 73.59 | 80.36 | 73.64 | 66.79
3DMMF (ours) | 79.89 | 87.45 | 77.48 | 74.74
Model | Pre-Transmission Time (s) | 3D Detection Time (s) | Total Processing Time (s)
---|---|---|---
PointPillars | N/A | 0.0202 | 0.0202
SECOND | N/A | 0.0626 | 0.0626
Frustum PointNet | 0.0204 | 0.1304 | 0.1508
MV3D | N/A | 0.4473 | 0.4473
AVOD | N/A | 0.1347 | 0.1347
PointPainting | 0.4385 | 0.0563 | 0.4948
3DMMF (ours) | 0.0204 | 0.0577 | 0.0781
Model | FRP | FSA | CLOCs | mAP | Car (Easy) | Car (Medium) | Car (Difficult)
---|---|---|---|---|---|---|---
PointPillars | | | | 73.59 | 80.36 | 73.64 | 66.79
PointPainting | | | | 73.46 | 79.42 | 73.67 | 67.28
Ours | ✓ | | | 74.83 | 80.78 | 74.12 | 69.59
Ours | | ✓ | | 77.49 | 85.09 | 75.18 | 72.21
Ours | | | ✓ | 75.62 | 82.26 | 76.12 | 68.49
Ours | ✓ | ✓ | | 79.62 | 87.34 | 77.16 | 74.35
Ours | ✓ | ✓ | ✓ | 79.89 | 87.45 | 77.48 | 74.74
Share and Cite

Zhu, W.; Zhou, J.; Wang, Z.; Zhou, X.; Zhou, F.; Sun, J.; Song, M.; Zhou, Z. Three-Dimensional Object Detection Network Based on Multi-Layer and Multi-Modal Fusion. Electronics 2024, 13, 3512. https://doi.org/10.3390/electronics13173512