Co-Fix3D: Enhancing 3D Object Detection with Collaborative Refinement

Wenxuan Li, Qin Zou, Chi Chen, Bo Du, Long Chen

Abstract

In the realm of autonomous driving, accurately detecting occluded or distant objects—referred to as weak positive samples—presents significant challenges. These challenges predominantly arise during query initialization, where an over-reliance on heatmap confidence often results in a high rate of false positives, consequently masking weaker detections and impairing system performance. To alleviate this issue, we propose a novel approach, Co-Fix3D, which employs a collaborative hybrid multi-stage parallel query generation mechanism for BEV representations. Our method incorporates the Local-Global Feature Enhancement (LGE) module, which refines BEV features to more effectively highlight weak positive samples. It uniquely leverages the Discrete Wavelet Transform (DWT) for accurate noise reduction and features refinement in localized areas, and incorporates an attention mechanism to more comprehensively optimize global BEV features. Moreover, our method increases the volume of BEV queries through a multi-stage parallel processing of the LGE, significantly enhancing the probability of selecting weak positive samples. This enhancement not only improves training efficiency within the decoder framework but also boosts overall system performance. Notably, Co-Fix3D achieves superior results on the stringent nuScenes benchmark, outperforming all previous models with a 69.1% mAP and 72.9% NDS on the LiDAR-based benchmark, and 72.3% mAP and 74.1% NDS on the multi-modality benchmark, without relying on test-time augmentation or additional datasets. The source code will be made publicly available upon acceptance.

INTRODUCTION

3D object detection(He et al. 2023; Meng et al. 2021; Qi et al. 2018; Yin et al. 2021) is crucial for autonomous driving vehicles and robotic systems, enabling precise identification and localization of objects within their environments. This field has seen significant advancements through sophisticated 3D neural network models, including Convolutional Neural Networks (CNNs)(Feng et al. 2023; Gilmer et al. 2017; Graham, Engelcke, and Van Der Maaten 2018) and transformer technologies(Zhao et al. 2021; Liu et al. 2022, 2023a; Wang et al. 2023).These models often utilize point clouds from depth-aware sensors, such as LiDAR, to capture the crucial geometric details of 3D spaces. However, the natural sparsity of these point clouds, combined with the difficulty of merging data from different sensors—including radar, cameras, and LiDAR—requires a unified perspective and reference framework. These complexities significantly challenge effective 3D object detection.

Refer to caption — Figure 1: Comparison of One-Stage and Multi-Stage 3D Detectors: (a) The one-stage 3D detector utilizes scores from the BEV feature heatmap to select the top $K$ units as queries. The complete BEV feature set serves as both keys and values, enabling precise predictions through a Transformer model. (b) The multi-stage 3D detector employs a multi-stage strategy, repeatedly selecting the top $K\times N$ units from the BEV heatmap as queries. The use of masks ensures that these queries are as non-overlapping as possible, enhancing the coverage of the queries.

To address the challenges in 3D object detection, modern systems increasingly utilize BEV representations. This approach offers a clear visualization of the spatial layout of objects, significantly enhancing system efficiency and decision-making capabilities. Currently, mainstream BEV 3D detection algorithms are divided into single-stage and multi-stage approaches (see Figure 1). The single-stage approach (Bai et al. 2022) (see Figure 1.(a)) integrates BEV with Transformer technology, dividing the detection process into two phases: an initial rough prediction using heatmaps, followed by refinement with Transformer technology to enhance accuracy. However, the effectiveness of this strategy heavily relies on the query initialization method. In road scenes, objects with low reflectivity, small sizes, or severe occlusion often result in inadequate performance of BEV features in detecting these weak targets. Furthermore, the absence of depth information in image data can lead to distortion and anomalies in BEV features. These factors lead to weak positive samples being overwhelmed by false positives, thus affecting detection capabilities. The multi-stage approach (Bai et al. 2022) (see Figure 1.(b)) aims to increase the number of queries to enhance the likelihood of detecting weak positive samples, thereby partially mitigating the challenges associated with detecting weak features. However, this method may increase the incidence of false negatives and does not fundamentally resolve the issue of poor BEV query initialization.

In this paper, we highlight significant limitations in existing methods that utilize heatmap-based query initialization. This method constrains the exploration of valuable weak positive samples and results in suboptimal performance when detecting potential targets within the 3D environment. To address this challenge, we have refined the identification of weak positive samples during the encoding phase by incorporating advanced image restoration techniques that enable precise correction of these samples within the BEV. We leverage the principles of DWT (Chen et al. 2024; Li et al. 2020), renowned for its efficacy in restoration, to facilitate meticulous feature reconstruction. Recognizing DWT’s inherent limitations in handling extensive global features, we have integrated attention mechanisms (Vaswani et al. 2017) to overcome its shortcomings in global context perception. As a result, We developed a LGE module that performs adaptive cross-stage denoising and feature enhancement, significantly boosting detection performance by improving the identification and scoring of weak positive samples.

Furthermore, we have augmented the overall query volume during the encoding stage of 3D object detection by implementing a multi-level filtering mechanism inspired by multi-stage 3D detectors (Chen et al. 2023b). Our experiments demonstrate that, with a fixed number of final output queries, the parallel LGE architecture markedly enhances perceptual capabilities by increasing the number of queries during testing—a contrast to the cascaded LGE, which did not demonstrate significant improvements. By integrating these innovative approaches, Co-Fix3D substantially improves both the quality of queries and the overall performance of 3D detection systems. Additionally, our technology notably enhances the accuracy of detecting small and partially occluded objects within complex environments. This advancement offers new perspectives on addressing the persistent challenges in 3D object detection.

In summary, our contribution can be summarized as follow: (1) We propose Co-Fix3D, a multi-stage, parallel architecture 3D detection network designed to repair BEV features end-to-end, enabling precise identification of challenging instances. (2) We proposed the LEG module that optimizes BEV features, significantly enhancing the detection of weak positive samples. (3) Our model has established new benchmarks on the nuScenes 3D detection leaderboard, outperforming all prior research in this domain.

Related Work

LiDAR-based 3D Object Detection.

LIDAR-based 3D object detection technologies are primarily categorized into three types: Point-based, Voxel-based, and Hybrid approaches. Point-based methods, such as PointNet(Qi et al. 2017a) and PointNet++(Qi et al. 2017b), directly process raw LiDAR data to extract critical features, enabling precise segmentation and refinement in models like PointRCNN (Shi, Wang, and Li 2019)and VoteNet(Qi et al. 2019). Due to their high computational demands the application of these methods is somewhat restricted in specific BEV scenarios. Voxel-based methods, including VoxelNet(Zhou and Tuzel 2018) and SECOND(Yan, Mao, and Li 2018), transform point clouds into structured grids, facilitating efficient feature extraction while preserving accuracy. CenterPoint(Yin, Zhou, and Krahenbuhl 2021) refines voxel-based detection for streamlined operations, while SST(Fan et al. 2022) targets the detection of smaller objects. Subsequent BEV-formatted 3D detectors are all based on voxel-based technologies. Hybrid methods like PV-RCNN(Bhattacharyya and Czarnecki 2020) combine the strengths of point-based and voxel-based techniques to enhance precision and efficiency.

Currently, dense BEV detection technologies such as TransFusion typically outperform sparse detectors in point cloud processing. Their successor, FocalFormer3D (Chen et al. 2023b), significantly enhances the detection performance by increasing the likelihood of selecting positive samples through an increased number of queries. However, when addressing small objects, low reflectance, or distant targets in real-world scenarios, these 3D detectors often overlook potential defects in BEV features that could impair overall detection outcomes. To address this issue, we have introduced the LGE module. This module repairs weak features in the BEV and improves their scores during the encoding stage, thereby increasing the number of positive sample queries and effectively ensuring high-quality and precise 3D object detection.

LiDAR-camera Fusion for 3D Object Detection.

LiDAR-camera fusion for 3D object detection(Chen et al. 2017; Liang et al. 2019) has become increasingly significant, with multimodal approaches often outperforming unimodal learning in capturing accurate latent space representations. These fusion methods can be categorized into early, middle, and late stages based on the timing of data integration. Early fusion methods(Chen et al. 2022b; Vora et al. 2020; Xu et al. 2021; Yin, Zhou, and Krähenbühl 2021), exemplified by pioneering works like enhance input points with corresponding image pixel features. However, they are sensitive to calibration errors. Late fusion approaches (Bai et al. 2022; Li et al. 2023; Liang et al. 2019; Yang et al. 2022a), such as those in , fuse multimodal information at the region proposal level, often resulting in limited interactions between modalities and suboptimal detection performance. In contrast, middle fusion(Li et al. 2022a, b, c; Liang et al. 2022), increasingly popular, promotes multimodal feature interaction at various stages, making it more robust to calibration errors.

Building on this understanding, our proposed method, Co-Fix3D, utilizes an intermediate fusion strategy by integrating image data into BEV features via the Lift-Splat-Shoot (LSS)(Philion and Fidler 2020) method. However, these BEV representations often exhibit flaws, leading to suboptimal detection. Co-Fix3D enhances these features with the LGE module, significantly boosting their effectiveness and ensuring robust performance in challenging conditions.

Method

The architecture of Co-Fix3D is illustrated in Figure 2. This paper first introduces Overview, followed by the LGE module and multiple parallel heatmap components.

Overview

Co-Fix3D integrates both point cloud and multimodal data modalities. For the point cloud mode, after processing through a 3D backbone network and associated flattening operations, we obtain the point cloud’s BEV features, denoted as $F_{LiDAR}\in\mathbb{R}^{H\times W\times 4C}$ , where $W$ , $H$ , and $C$ represent the width, height, and number of channels of the BEV feature map, respectively. Similarly, for the multimodal mode, after processing through a 2D backbone network and applying the original LSS method (without depth loss computation) (Philion and Fidler 2020), we obtain the image’s BEV features, denoted as $F_{Camera}\in\mathbb{R}^{H\times W\times C}$ . In the mode using only point cloud data, this module reduces the number of channels from $4C$ to $C$ ; in the data fusion mode, it reduces from $4C+C$ to $C$ , ultimately forming the initial BEV feature $F_{0}$ . The BEV features $F_{0}$ are optimized within the LGE module and used to generate corresponding BEV heatmaps $H\in\mathbb{R}^{H\times W\times c}$ ,where $c$ represent the category.

We use a multi-stage approach to generate queries, employing a mask mechanism to filter each stage progressively, allowing these parallel LGE modules to supervise different ground truths. We fistly initialized a mask $M\in{0,1}^{H\times W\times 1}$ , set entirely to 1. For the $(w,h)$ position and category $c$ of the heatmap at stage $i$ , we used Top-k selection on the heatmap to set $k$ instances of $M_{i}(w,h,c)$ to 0. This indicates that once a region is selected, subsequent stages will not re-explore that region. We then applied box-level pooling methods to handle these 0-marked masks, ensuring that the generated query locations are as evenly distributed within the BEV as possible. To ensure diversity in the samples processed by each module after introducing the LGE module, we multiply the mask by the GT heatmap. This guarantees that different LGE modules repair different targets, enhancing the perception of targets with varying degrees of damage and maximizing the potential for target recognition. Specifically, if early-stage LGE modules fail to detect certain samples, subsequent stages will continue to monitor and learn from these samples until the targets are detected or the maximum number of stages is reached.

To train these modules effectively, we used Gaussian focal loss (Bai et al. 2022) as the training loss function, ensuring that the GT counts of the heatmaps in the last two stages match those of the first stage. This method ensures consistency and effectiveness in the training process.

Local and Global Enhancement Module

The LGE module is designed to eliminate noise and correct distorted features in BEV features. This module effectively integrates local and global denoising methods to enhance the accuracy and efficiency of data processing. It consists of three main components: the Wavelet Encode module for local optimization, the Hybrid Encode module for global optimization, and the Wavelet Decode module for post-processing. Next, we will first explain these three modules in detail one by one, and then introduce the various attempts made during the design of the LGE.

Wavelet Encode. Following the significant success of DWT in image restoration and super-resolution(Chen et al. 2024; Ji et al. 2023), we also leverage wavelet encoding using DWT to effectively restore the features of BEV grids, as shown in Figure 3. DWT compresses data by focusing on significant wavelet components and removing redundant information, effectively isolating and mitigating noise and anomalies in BEV features. This capability is particularly useful for restoring BEV features as it efficiently handles large-scale point cloud datasets. DWT decomposes BEV features into four distinct channels: HH, HL, LH, and LL, each capturing specific information aspect. The specific calculation process is as follows:

		$\displaystyle{F}_{1}={\rm Reduce}({F_{0})},$		(1)
		$\displaystyle{F}_{LL},{F}_{LH},{F}_{HL},{F}_{HH}=DWT({F}_{1}),$		(2)
		$\displaystyle{F}_{2}=Concat({F}_{LL},{F}_{LH},{F}_{HL},{F}_{HH}),$		(3)

where Reduce( $\cdot$ ) refers to reducing the number of channels of $F_{0}$ from $C$ to $\frac{C}{4}$ , resulting in $F_{1}\in\mathbb{R}^{H\times W\times\frac{C}{4}}$ . After applying DWT( $\cdot$ ), $F_{LL},F_{LH},F_{HL},F_{HH}\in\mathbb{R}^{\frac{H}{2}\times\frac{W}{2}\times% \frac{C}{4}}$ . Finally, Concat concatenates these DWT( $\cdot$ ) results along the channel dimension, leading to $F_{2}\in\mathbb{R}^{\frac{H}{2}\times\frac{W}{2}\times C}$ .

Hybrid Encode. Hybrid Encode (see Fig. 3) employs a global attention mechanism to capture comprehensive global contextual information from BEV features, effectively minimizing noise and artifacts. This enhancement allows for a clearer and more precise distinction of complex sample features. Additionally, it integrates Flash Attention V2 (Dao 2023), greatly improving the efficiency of attention computations. In this module, BEV features $F_{0}$ are processed through a down-sampling layer and then flattened for the self-attention phase to assess feature importance. The process is described as follows:

	$\displaystyle S_{1}=DWConv(F_{0}),$		(4)
	$\displaystyle Q=K=V=Flatten(S_{1}),$		(5)
	$\displaystyle Q=Attn(Q,K,V),$		(6)
	$\displaystyle F_{3}=Reshape(FFN(Q)),$		(7)

where DWConv( $\cdot$ ) denotes down-sampling, resulting in $S_{1}\in\mathbb{R}^{\frac{H}{2}\times\frac{W}{2}\times C}$ . Here, Attn( $\cdot$ ) represents the multi-head self-attention mechanism. Finally, after processing through the feed-forward network (FFN( $\cdot$ )), the output is reshaped using Reshape( $\cdot$ ) to match the feature dimensions of $F_{2}$ .

Wavelet Decode. The Wavelet Decode module(see Fig.3 ) conducts post-processing. It primarily functions as a feedforward neural network and restores the resolution of BEV features. The process can be outlined as follows:

	$\displaystyle S_{2}=FW(Concat(F_{2},F_{3})),$		(8)
	$\displaystyle F_{4}=Depth(F_{p}+S_{2})),$		(9)
	$\displaystyle F_{5}=Decode(Concat(F_{4},F_{0})),$		(10)

where FW( $\cdot$ ) denotes a feedforward wavelet network which performs up-sampling to restore the original shape, resulting in $S_{2}\in\mathbb{R}^{H\times W\times C}$ . Depth( $\cdot$ ) refers to an intermediate neural network that expands the depth of the network. Finally, Decode( $\cdot$ ) simplifies the channel count, compressing the data for subsequent processing.

Design choices of LGE. To design an optimized LGE structure, we conducted multiple attempts as shown in Figure 4. Next, different forms of encoder are inserted to produce a series of variants based on baseline A, elaborated as follows:

Method		Mod.	mAP $\uparrow$	NDS $\uparrow$	Car	Truck	Bus	Trailer	C.V.	Ped.	Mot.	Byc.	T.C.	Bar.
Focals Conv (Chen et al. 2022a)	[NeurIPS 22]	L	63.8	70.0	86.7	56.3	67.7	59.5	23.8	87.5	64.5	36.3	81.4	74.1
TransFusion-L (Bai et al. 2022)	[CVPR 22]	L	65.5	70.2	86.2	56.7	66.3	58.8	28.2	86.1	68.3	44.2	82.0	78.2
LargeKernel3D (Chen et al. 2023a)	[CVPR 22]	L	65.4	70.5	85.5	53.8	64.4	59.5	29.7	85.9	72.7	46.8	79.9	75.5
Link (Lu et al. 2023)	[CVPR 23]	L	66.3	71.0	86.1	55.7	65.7	62.1	30.9	85.8	73.5	47.5	80.4	75.5
LiDARMultiNet (Ye et al. 2023)	[AAAI 23]	L	67.0	71.6	86.9	57.4	64.7	61.0	31.5	87.2	75.3	47.6	85.1	73.5
FSTR-L (Zhang et al. 2023)	[TGRS 23]	L	67.2	71.5	86.5	54.1	66.4	58.4	33.4	88.6	73.7	48.1	81.4	78.1
HEDNet (Zhang et al. 2024b)	[NeurIPS 23]	L	67.7	72.0	87.1	56.5	70.4	63.5	33.6	87.9	70.4	44.8	85.1	78.1
FocalFormer3D (Chen et al. 2023b)	[ICCV 23]	L	68.7	72.6	87.2	57.1	69.6	64.9	34.4	88.2	76.2	49.6	82.3	77.8
SAFDNet (Zhang et al. 2024a)	[CVPR 24]	L	68.3	72.3	87.3	57.3	68.0	63.7	37.3	89.0	71.1	44.8	84.9	79.5
Co-Fix3D (Ours)		L	69.1	72.9	88.5	59.4	69.0	65.5	37.6	88.0	75.2	47.9	81.9	77.9
TransFusion (Bai et al. 2022)	[CVPR 22]	LC	68.9	71.7	87.1	60.0	68.3	60.8	33.1	88.4	73.6	52.9	86.7	78.1
BEVFusion (Liang et al. 2022)	[NerIPS 22]	LC	69.2	71.8	88.1	60.9	69.3	62.1	34.4	89.2	72.2	52.2	85.2	78.2
BEVFusion-MIT (Liu et al. 2023b)	[ICRA 23]	LC	70.2	72.9	88.6	60.1	69.8	63.8	39.3	89.2	74.1	51.0	86.5	80.0
DeepInteraction (Yang et al. 2022b)	[NerIPS 22]	LC	70.8	73.4	87.9	60.2	70.8	63.8	37.5	91.7	75.4	54.5	87.2	80.4
ObjectFusion (Cai et al. 2023)	[ICCV 23]	LC	71.0	73.3	89.4	59.0	71.8	63.1	40.5	90.7	78.1	53.2	87.7	76.6
FocalFormer3D (Chen et al. 2023b)	[ICCV 23]	LC	71.6	73.9	88.5	61.4	71.7	66.4	35.9	89.7	80.3	57.1	85.3	79.3
GraphBEV (Yan et al. 2023)	[ECCV 24]	LC	71.7	73.6	89.2	60.0	72.1	64.5	40.8	90.9	76.8	53.3	88.9	80.1
Co-Fix3D (Ours)		LC	72.3	74.1	89.7	62.4	70.3	66.2	41.0	89.9	79.4	58.9	86.3	79.1

Table 1: Comparison with SOTA detectors on nuScenes TEST set. We do not use test-time augmentation or model ensemble. Mod.: Modality. C.V.: construction vehicle. Ped.: pedestrian. Mot.: motorcyclist. Byc.: bicyclist. T.C.: traffic cone. Bar.: barrier.

1.

A → B: Variant B incorporates a global attention module (Hybrid Encode) before the input of A, with the aim of first optimizing global features through attention and then optimizing local, pixel-level features.
2.

A → C: This variant reverses the optimization order of B, starting with local, pixellevel feature optimization before applying global feature attention optimization.
3.

A → D: Based on A, this variant undergoes internal optimization by expanding channels through DWT, followed by noise feature optimization through a global attention mechanism.
4.

A → E: This variant performs global attention and local optimization in parallel, concatenating the results. This parallel processing strategy leverages the strengths of both global and local optimizations, aiming to achieve a more comprehensive and refined features effect through the combined outcomes.

Each variant explores the best way to integrate global and local optimizations through different sequences and methods, aiming to achieve an optimal balance between detail restoration, noise suppression, and computational efficiency.

Experiments

Datasets

The nuScenes Dataset (Caesar et al. 2020) is a comprehensive outdoor dataset featuring 1,000 multi-modal scenes, each lasting 20 seconds and annotated at 2Hz. This dataset includes data from a 32-beam LiDAR at a 20FPS rate and images from a 6-view camera setup. We evaluated our method under both LiDAR-only and LiDAR-Camera fusion settings, using the official nuScenes metrics: mean average precision (mAP) and nuScenes detection score (NDS). Our training and evaluation adhered to the nuScenes standard protocol, analyzing data from the preceding nine frames for current frame assessment, in line with the official evaluation criteria.

Implementation Details

We developed our model using the PyTorch framework (Paszke et al. 2017) and the open-source MMDetection3D (Contributors 2020). The detection region spans $[-54.0m,54.0m]$ on the X and Y axes, and $[-5.0m,3.0m]$ on the Z axis. On the nuScenes dataset, we set the voxel size to 0.075m $\times$ 0.075m $\times$ 0.2m. In LiDAR mode, the backbone was initially trained for 20 epochs using CBGS (Zhu et al. 2019). Subsequently, we froze the pre-trained LiDAR backbone and continued training the detection head with multi-stage heatmaps for an additional six epochs, employing GT sample augmentation except in the final five epochs. In multi-modality mode, The image backbone network utilizes ResNet-50 and the image size set to 448 $\times$ 800, following the FocalFormer3D and BEVFusion approach, to project multi-view camera features onto a predefined voxel grid in 3D space. The BEV size of this voxel grid is set to 180 $\times$ 180, matching the 8 $\times$ downsampled BEV features generated by VoxelNet(Zhou and Tuzel 2018), with a channel dimension of 128. The camera backbone was trained for 20 epochs without CBGS. Then both the image and point cloud branches were frozen, only the fusion module and head module gradients were enabled, and the training continued for 10 epochs without CBGS. Our model is trained with the total batch size of 16 on 4 Nvidia 4090 GPUs. We utilize the AdamW(Loshchilov and Hutter 2017) optimizer for the optimization process. The initial learning rate is set to $1.0\times 10^{-4}$ , and we apply the one-cycle policy for learning rate adjustment.

State-of-the-Art Comparison

LiDAR-Based 3D object detection on test set.

In Tab. 1, we benchmarked the performance of our model on the nuScenes test set and compared it with the current leading LiDAR-based (’L’) and multimodal (’LC’) 3D object detectors. The results demonstrate that Co-Fix3D outperforms all existing state-of-the-art (SOTA) 3D detection algorithms. As a baseline for TransFusion-L, Co-Fix3D’s LiDAR mode achieved a 3.6% improvement in mAP and a 2.7% increase in NDS. Additionally, compared to recent single-modal detection methods such as HEDNet, SAFDNet, and FocalFormer3D, Co-Fix3D exhibited superior performance, with mAP gains of 1.4%, 0.8%, and 0.4%, respectively. Notably, Co-Fix3D achieved the highest detection results in certain categories, such as cars, trucks, trailers, and construction vehicles. This suggests that Co-Fix3D effectively enhances BEV features through parallel LGE, enabling more accurate identification of weak positive queries.

Multi-modal 3D object detection on test set.

We extended Co-Fix3D as a multimodal model and used it as a baseline for TransFusion-LC. In its multimodal mode, Co-Fix3D improved mAP by 3.4% and NDS by 2.4%. Furthermore, compared to the latest single-modal detection methods such as ObjectFusion, GraphBEV, and FocalFormer3D, Co-Fix3D exhibited superior performance, with mAP gains of 1.3%, 0.6%, and 0.7%, respectively. Notably, Co-Fix3D achieved the highest detection results in certain categories, such as cars and construction vehicles. This further demonstrates that Co-Fix3D enhances BEV features through parallel LGE, effectively identifying weak positive queries and thereby improving overall detection capability.

Method

Mod.

Image

Encoder

mAP

NDS

TransFusion-L (Bai et al. 2022)

64.9

69.9

HEDNet (Zhang et al. 2024b)

66.7

71.4

SAFDNet (Zhang et al. 2024a)

66.3

71.0

FocalFormer3D (Chen et al. 2023b)

66.5

71.1

Co-Fix3D (Ours)

66.8

71.3

TransFusion (Bai et al. 2022)

ResNet-50

67.5

71.3

BEVFusion (Liu et al. 2023b)

Swin-T

68.5

71.4

SparseFusion(Xie et al. 2023)

ResNet-50

70.4

72.8

FocalFormer3D (Chen et al. 2023b)

ResNet-50

70.5

73.0

Co-Fix3D (Ours)

ResNet-50

70.8

73.1

Table 2: Comparison with detectors on the nuScenes VALIDATION set. Mod.: Modality.

3D object detection on val set.

We present results on the nuScenes validation set, as detailed in Table 2. As a baseline for TransFusion-L, Co-Fix3D’s LiDAR mode achieved a 1.9% improvement in mAP and a 1.4% increase in NDS. Additionally, Co-Fix3D enhances the LiDAR-only baseline, FocalFormer3D, with an increase of 0.3% in mAP and 0.3% in NDS. For multi-mode scenarios, the improvement is 0.3% in mAP and 0.1% in NDS.

Ablation Study

We conducted several experiments on the validation set. We conducted 20 epochs of training tests, implementing a degradation strategy in the last five epochs.

LGE	C	P	Stage	Q	mAP $\uparrow$	NDS $\uparrow$
			1	200	64.8	70.1
$\checkmark$			1	200	65.9^↑1.1	70.9^↑0.8
	$\checkmark$		2	400	65.9	70.8
$\checkmark$	$\checkmark$		2	400	66.1^↑0.2	70.9 ^↑0.1
$\checkmark$		$\checkmark$	2	400	66.3^↑0.4	71.1^↑0.3
	$\checkmark$		3	600	66.2	70.9
$\checkmark$	$\checkmark$		3	600	66.4^↑0.2	71.0^↑0.1
$\checkmark$		$\checkmark$	3	600	66.6^↑0.4	71.2^↑0.3

Table 3: The impact of different stages. ”C” stands for cascade structure, while ”P” represents parallel structure. And ”Q” for query count. Red text indicates the improvement relative to the first line of that section.

Advantages of LGE.

Tab.3 primarily illustrates the advantages of the LGE module. For instance, in the first stage, incorporating the LGE structure improved mAP by 1.1% and NDS by 0.6%, significantly enhancing detection performance. In subsequent stages, configurations with the LGE module consistently outperformed those without it. Even within the cascade structure, setups with LGE outperformed those lacking the module. Furthermore, when comparing cascade and parallel structures, the parallel structure with LGE demonstrated superior performance. At stage 3, using the parallel LGE structure resulted in mAP and NDS improvements of 0.4% and 0.3%, respectively. These findings suggest that the LGE module effectively refines BEV features and filters out weak positive queries, thereby improving the recognition rate of hard-to-detect samples in later stages and enhancing overall mAP.

Feature Visualization To investigate the impact of the LGE module on BEV feature maps, we conducted a visualization analysis of the BEV features, as shown in Fig.5. The images reveal that the BEV features processed with the LGE module are significantly better than those without it, indicating that the LGE module can repair some defective features, thereby effectively enhancing the quality of queries in TransFusion-type models.

Cascaded Structure vs. Parallel Structure We compared parallel and cascaded structures—both incorporating LGE modules but differently connected—by evaluating their performance with the final two weight sets from Tab.3 . Maintaining a constant output of 300 bounding boxes, we increased query numbers to 1200 as shown in Tab.4. The parallel structure exhibited a 0.3% improvement in both mean Average Precision (mAP) and NuScenes Detection Score (NDS) over the cascaded structure. In the cascaded setup, raising the query count from 300 to 1200 yielded only a 0.1% boost in both metrics, possibly due to initial BEV feature alterations by the first-stage LGE, hindering further refinements by subsequent modules. Conversely, the parallel structure’s enhancements with increased queries underscore its superior efficacy in boosting detection performance.

LGE	C	P	Q	mAP $\uparrow$	NDS $\uparrow$
$\usym{1F5F8}$	$\usym{1F5F8}$		300	66.4	71.0
$\usym{1F5F8}$		$\usym{1F5F8}$	300	66.5^↑0.1	71.1^↑0.1
$\usym{1F5F8}$	$\usym{1F5F8}$		600	66.5	71.0
$\usym{1F5F8}$		$\usym{1F5F8}$	600	66.6^↑0.1	71.2^↑0.2
$\usym{1F5F8}$	$\usym{1F5F8}$		900	66.5	71.0
$\usym{1F5F8}$		$\usym{1F5F8}$	900	66.7^↑0.2	71.2^↑0.2
$\usym{1F5F8}$	$\usym{1F5F8}$		1200	66.5	71.1
$\usym{1F5F8}$		$\usym{1F5F8}$	1200	66.8^↑0.3	71.3^↑0.2

Table 4: The impact of different quantities of Q.

Design choices of LGE. To more effectively assess the LGE module’s effectiveness, we conducted experiments on each component within the LGE design choices, as detailed in Table 5. (a₀) indicates the performance without the enhancement module. Variant (a) serves as the baseline, utilizing only local optimization. Variant (b) integrates global optimization followed by local optimization, which leads to a 0.4% decrease in mAP, suggesting that global feature enhancement might negatively impact local optimization and thus reduce detection performance. Variant (c) introduces global optimization subsequent to local optimization, resulting in training anomalies like gradient issues and non-convergence. Variant (d) shows a 0.2% decrease in mAP, indicating that when local features are already well-optimized, additional global optimization is not beneficial and may even be detrimental. Variant (e) illustrates our proposed approach, where local and global optimizations are conducted simultaneously, resulting in a 0.2% increase in mAP. Therefore, we have chosen Variant (e) as our LGE module.

Variant	mAP $\uparrow$	NDS $\uparrow$
(a₀)	64.8	70.1
(a)	65.6	70.8
(b)	65.3	70.4
(c)	N/A	N/A
(d)	65.5	70.7
(e)	65.9	70.9

Table 5: Degin LGE

Visualization Fig. 6 displays our qualitative results on the nuScenes validation set. It can be seen that the performance of 3D object detection is quite good.

Conclusion

We introduced Co-Fix3D, an end-to-end 3D object detection network designed to enhance BEV features and leverage a collaborative network approach for comprehensive mining of potential samples. This method significantly boosted the performance of 3D object detection in autonomous driving scenarios. Extensive experiments confirmed that Co-Fix3D not only excels in single-modality point cloud detection but also in hybrid point cloud-image fusion modalities, achieving state-of-the-art performance on the nuScenes benchmark. We believe that Co-Fix3D will serve as a robust and efficient baseline for future research in this field.

References

Bai et al. (2022) Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; and Tai, C.-L. 2022. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1090–1099.
Bhattacharyya and Czarnecki (2020) Bhattacharyya, P.; and Czarnecki, K. 2020. Deformable PV-RCNN: Improving 3D object detection with learned deformations. arXiv preprint arXiv:2008.08766.
Caesar et al. (2020) Caesar, H.; Bankiti, V.; Lang, A. H.; Vora, S.; Liong, V. E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; and Beijbom, O. 2020. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11621–11631.
Cai et al. (2023) Cai, Q.; Pan, Y.; Yao, T.; Ngo, C.-W.; and Mei, T. 2023. Objectfusion: Multi-modal 3d object detection with object-centric fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 18067–18076.
Chen et al. (2024) Chen, H.; Li, W.; Gu, J.; Ren, J.; Sun, H.; Zou, X.; Zhang, Z.; Yan, Y.; and Zhu, L. 2024. Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 25857–25867.
Chen et al. (2017) Chen, X.; Ma, H.; Wan, J.; Li, B.; and Xia, T. 2017. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 1907–1915.
Chen et al. (2022a) Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; and Jia, J. 2022a. Focal sparse convolutional networks for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5428–5437.
Chen et al. (2023a) Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; and Jia, J. 2023a. LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13488–13498.
Chen et al. (2023b) Chen, Y.; Yu, Z.; Chen, Y.; Lan, S.; Anandkumar, A.; Jia, J.; and Alvarez, J. M. 2023b. Focalformer3d: focusing on hard instance for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8394–8405.
Chen et al. (2022b) Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; and Zhao, F. 2022b. Deformable feature aggregation for dynamic multi-modal 3D object detection. In European conference on computer vision, 628–644. Springer.
Contributors (2020) Contributors, M. 2020. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection.
Dao (2023) Dao, T. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
Fan et al. (2022) Fan, L.; Pang, Z.; Zhang, T.; Wang, Y.-X.; Zhao, H.; Wang, F.; Wang, N.; and Zhang, Z. 2022. Embracing single stride 3d object detector with sparse transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 8458–8468.
Feng et al. (2023) Feng, T.; Wang, W.; Wang, X.; Yang, Y.; and Zheng, Q. 2023. Clustering based point cloud representation learning for 3d analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8283–8294.
Gilmer et al. (2017) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In International conference on machine learning, 1263–1272. PMLR.
Graham, Engelcke, and Van Der Maaten (2018) Graham, B.; Engelcke, M.; and Van Der Maaten, L. 2018. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 9224–9232.
He et al. (2023) He, C.; Li, R.; Zhang, Y.; Li, S.; and Zhang, L. 2023. Msf: Motion-guided sequential fusion for efficient 3d object detection from point cloud sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5196–5205.
Ji et al. (2023) Ji, D.; Zhao, F.; Lu, H.; Tao, M.; and Ye, J. 2023. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 23621–23630.
Li et al. (2020) Li, Q.; Shen, L.; Guo, S.; and Lai, Z. 2020. Wavelet integrated CNNs for noise-robust image classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7245–7254.
Li et al. (2023) Li, X.; Ma, T.; Hou, Y.; Shi, B.; Yang, Y.; Liu, Y.; Wu, X.; Chen, Q.; Li, Y.; Qiao, Y.; et al. 2023. Logonet: Towards accurate 3d object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17524–17534.
Li et al. (2022a) Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; and Jia, J. 2022a. Unifying voxel-based representation with transformer for 3d object detection. Advances in Neural Information Processing Systems, 35: 18442–18455.
Li et al. (2022b) Li, Y.; Qi, X.; Chen, Y.; Wang, L.; Li, Z.; Sun, J.; and Jia, J. 2022b. Voxel field fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1120–1129.
Li et al. (2022c) Li, Y.; Yu, A. W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q. V.; et al. 2022c. Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17182–17191.
Liang et al. (2019) Liang, M.; Yang, B.; Chen, Y.; Hu, R.; and Urtasun, R. 2019. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7345–7353.
Liang et al. (2022) Liang, T.; Xie, H.; Yu, K.; Xia, Z.; Lin, Z.; Wang, Y.; Tang, T.; Wang, B.; and Tang, Z. 2022. Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems, 35: 10421–10434.
Liu et al. (2022) Liu, Y.; Wang, T.; Zhang, X.; and Sun, J. 2022. Petr: Position embedding transformation for multi-view 3d object detection. In European Conference on Computer Vision, 531–548. Springer.
Liu et al. (2023a) Liu, Y.; Yan, J.; Jia, F.; Li, S.; Gao, A.; Wang, T.; and Zhang, X. 2023a. Petrv2: A unified framework for 3d perception from multi-camera images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3262–3272.
Liu et al. (2023b) Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D. L.; and Han, S. 2023b. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2774–2781. IEEE.
Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Lu et al. (2023) Lu, T.; Ding, X.; Liu, H.; Wu, G.; and Wang, L. 2023. LinK: Linear Kernel for LiDAR-based 3D Perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1105–1115.
Meng et al. (2021) Meng, Q.; Wang, W.; Zhou, T.; Shen, J.; Jia, Y.; and Van Gool, L. 2021. Towards a weakly supervised framework for 3d point cloud object detection and annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8): 4454–4468.
Paszke et al. (2017) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in pytorch.
Philion and Fidler (2020) Philion, J.; and Fidler, S. 2020. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, 194–210. Springer.
Qi et al. (2019) Qi, C. R.; Litany, O.; He, K.; and Guibas, L. J. 2019. Deep hough voting for 3d object detection in point clouds. In proceedings of the IEEE/CVF International Conference on Computer Vision, 9277–9286.
Qi et al. (2018) Qi, C. R.; Liu, W.; Wu, C.; Su, H.; and Guibas, L. J. 2018. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, 918–927.
Qi et al. (2017a) Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 652–660.
Qi et al. (2017b) Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30.
Shi, Wang, and Li (2019) Shi, S.; Wang, X.; and Li, H. 2019. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 770–779.
Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Vora et al. (2020) Vora, S.; Lang, A. H.; Helou, B.; and Beijbom, O. 2020. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 4604–4612.
Wang et al. (2023) Wang, H.; Shi, C.; Shi, S.; Lei, M.; Wang, S.; He, D.; Schiele, B.; and Wang, L. 2023. Dsvt: Dynamic sparse voxel transformer with rotated sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13520–13529.
Xie et al. (2023) Xie, Y.; Xu, C.; Rakotosaona, M.-J.; Rim, P.; Tombari, F.; Keutzer, K.; Tomizuka, M.; and Zhan, W. 2023. Sparsefusion: Fusing multi-modal sparse representations for multi-sensor 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 17591–17602.
Xu et al. (2021) Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; and Zhang, L. 2021. Fusionpainting: Multimodal fusion with adaptive attention for 3d object detection. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 3047–3054. IEEE.
Yan et al. (2023) Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; and Zhang, X. 2023. Cross modal transformer: Towards fast and robust 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 18268–18278.
Yan, Mao, and Li (2018) Yan, Y.; Mao, Y.; and Li, B. 2018. Second: Sparsely embedded convolutional detection. Sensors, 18(10): 3337.
Yang et al. (2022a) Yang, H.; Liu, Z.; Wu, X.; Wang, W.; Qian, W.; He, X.; and Cai, D. 2022a. Graph r-cnn: Towards accurate 3d object detection with semantic-decorated local graph. In European Conference on Computer Vision, 662–679. Springer.
Yang et al. (2022b) Yang, Z.; Chen, J.; Miao, Z.; Li, W.; Zhu, X.; and Zhang, L. 2022b. Deepinteraction: 3d object detection via modality interaction. Advances in Neural Information Processing Systems, 35: 1992–2005.
Ye et al. (2023) Ye, D.; Zhou, Z.; Chen, W.; Xie, Y.; Wang, Y.; Wang, P.; and Foroosh, H. 2023. Lidarmultinet: Towards a unified multi-task network for lidar perception. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 3231–3240.
Yin et al. (2021) Yin, J.; Shen, J.; Gao, X.; Crandall, D. J.; and Yang, R. 2021. Graph neural network and spatiotemporal transformer attention for 3D video object detection from point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8): 9822–9835.
Yin, Zhou, and Krahenbuhl (2021) Yin, T.; Zhou, X.; and Krahenbuhl, P. 2021. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11784–11793.
Yin, Zhou, and Krähenbühl (2021) Yin, T.; Zhou, X.; and Krähenbühl, P. 2021. Multimodal virtual point 3d detection. Advances in Neural Information Processing Systems, 34: 16494–16507.
Zhang et al. (2023) Zhang, D.; Zheng, Z.; Niu, H.; Wang, X.; and Liu, X. 2023. Fully Sparse Transformer 3D Detector for LiDAR Point Cloud. IEEE Transactions on Geoscience and Remote Sensing.
Zhang et al. (2024a) Zhang, G.; Chen, J.; Gao, G.; Li, J.; Liu, S.; and Hu, X. 2024a. SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection. arXiv preprint arXiv:2403.05817.
Zhang et al. (2024b) Zhang, G.; Junnan, C.; Gao, G.; Li, J.; and Hu, X. 2024b. Hednet: A hierarchical encoder-decoder network for 3d object detection in point clouds. Advances in Neural Information Processing Systems, 36.
Zhao et al. (2021) Zhao, H.; Jiang, L.; Jia, J.; Torr, P. H.; and Koltun, V. 2021. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, 16259–16268.
Zhou and Tuzel (2018) Zhou, Y.; and Tuzel, O. 2018. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4490–4499.
Zhu et al. (2019) Zhu, B.; Jiang, Z.; Zhou, X.; Li, Z.; and Yu, G. 2019. Class-balanced grouping and sampling for point cloud 3d object detection. arXiv preprint arXiv:1908.09492.