
Article

Mixed Label Assignment Realizes End-to-End Object Detection


Jiaquan Chen, Changbin Shao and Zhen Su *

School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China;
[email protected] (J.C.); [email protected] (C.S.)
* Correspondence: [email protected]

Abstract: Currently, detectors have made significant progress in inference speed and accuracy.
However, these detectors require Non-Maximum Suppression (NMS) during the post-processing
stage to eliminate redundant boxes, which limits the optimization of model inference speed. We first
analyzed the reason for the dependence on NMS in the post-processing stage and found that the
score loss in one-to-many label assignment leads to high-quality redundant boxes that are difficult
to remove. To realize end-to-end object detection and simplify the detection pipeline, we propose
herein a mixed label assignment (MLA) training method, which uses one-to-many label assignment
to provide rich supervision signals, alleviating performance degradation, and eliminates the need for
NMS in the post-processing stage by using one-to-one label assignment. Additionally, a window
feature propagation block (WFPB) is introduced, utilizing the inductive bias of images to enable
feature sharing in local regions. With these methods, we conducted experiments on the DUO and
VOC datasets; our end-to-end detector MA-YOLOX achieved 66.0 mAP and 52.6 mAP, respectively,
outperforming YOLOX by 1.7 and 1.6 mAP. Moreover, without NMS, our model ran faster than
other real-time detectors.

Keywords: non-maximum suppression; end-to-end object detection; label assignment; YOLOX

1. Introduction
Object detection is a fundamental task in the field of computer vision, requiring the
identification and localization of objects within images. Early object detection methods
evolved from two-stage [1,2] to one-stage [3,4]. The two-stage methods utilize Region
Proposal Networks (RPNs) to generate a set of candidate regions containing objects, while
one-stage methods directly produce dense prediction boxes to achieve object localization,
simplifying the object detection process and speeding up inference. In recent years, from
Anchor-Based [5,6] to Anchor-Free [7], lightweight and high-performance network structures
have continuously simplified models, achieving significant progress and superior
performance. However, the series of prediction boxes generated by detectors often contains
a large number of redundant results, necessitating the filtering of these redundant
boxes during the post-processing stage, which relies on the manually designed component
known as Non-Maximum Suppression (NMS). NMS effectively removes redundant boxes
with high overlap by calculating the Intersection over Union (IoU) between prediction
boxes for the same object, ensuring that only the optimal detection result for each object is
retained. Conventional detectors rely on NMS during the post-processing stage, making
the detection process cumbersome and not truly end-to-end.

Recently, Transformer-based object detectors (DETR, DEtection TRansformer) [8] have
shown that models can predict directly without NMS, removing various manually designed
components, greatly simplifying the object detection pipeline, and achieving end-to-end
object detection. DETR uses a bipartite graph matching algorithm to find one positive
sample for each ground truth box, achieving end-to-end object detection. However, the high
computational cost of DETR limits its effectiveness and prevents it from being fully utilized,
while CNN-based detectors have achieved a reasonable trade-off between detection speed
and accuracy, which can be further enhanced if NMS is not required. DeFCN [9] and
OneNet [10] each achieve end-to-end object detection using fully convolutional networks,
demonstrating that one-to-one label assignment is crucial for implementing end-to-end
detection. However, training with one-to-one label assignment leads to a decline in the
detector's performance. Ref. [11] introduces a PSS module into the detection head to replace
NMS in the FCOS [12] detector by selecting a single positive sample for each instance, but
this approach increases the complexity of the detection head structure. To address these
issues, we propose a new end-to-end training method that maintains both the superior
performance and the unchanged structure of the detection head.
CNN-based object detectors generate multiple nearly redundant predictions for each
object, and the post-processing stage uses Non-Maximum Suppression to select the optimal
prediction boxes as detection results. We delved into the reasons for the detector’s reliance
on NMS by examining the post-processing algorithm flow and discovered that the score
loss under one-to-many label assignment is a key factor causing numerous redundant boxes
that cannot be easily eliminated. Previous works have demonstrated that employing one-
to-one label assignment to eliminate NMS supports this finding. However, one-to-one label
assignment can lead to a significant decline in detector performance. To address this issue,
we propose an end-to-end training method called mixed label assignment (MLA). This ap-
proach uses one-to-one score loss to prevent the generation of high-quality redundant boxes,
eliminating the need for NMS and realizing end-to-end object detection. It also retains the
one-to-many bounding box regression loss, which provides rich supervisory information to
optimize the model and alleviate performance degradation. Additionally, DETR achieves
strongly competitive results through the use of attention mechanisms. However, due to the
high computational cost and memory explosion caused by attention [13] operating on
large-scale features, it is challenging to embed attention into every layer of the feature extraction
stage. To leverage the inductive bias of convolutions and images for feature propagation
in local areas, we propose a window feature propagation block (WFPB) that enhances the
feature sharing capability, making it more suitable for the feature extraction stage.
The main contributions of this paper are as follows:
• Propose a novel end-to-end training method, mixed label assignment, which elimi-
nates the need for NMS and simplifies the detection pipeline;
• Introduce a window feature propagation block that is better suited for the feature
extraction stage, enhancing local feature sharing;
• Conduct extensive comparative and ablation experiments on the PASCAL VOC and
DUO datasets, demonstrating the superiority and effectiveness of the proposed method.

2. Related Work
2.1. End-to-End Object Detection
Carion et al. [8] first proposed the Transformer-based object detection model DETR,
using Hungarian matching to achieve one-to-one label assignment and thereby realize
end-to-end object detection. DETR eliminates the manually designed components needed in
traditional detectors, such as Non-Maximum Suppression algorithms, thus simplifying
the object detection pipeline. However, DETR still has two issues: heavy computational
burden and slow training convergence. Although Deformable DETR [14] reduces the
computational cost of Transformers by using deformable attention, and DAB-DETR [15]
accelerates training convergence by replacing queries with dynamic anchor box repre-
sentations, DETR’s training cost and inference speed are still significantly higher than
CNN-based object detectors.

Convolutional networks can also achieve end-to-end detection. RelationNet [16]
introduces a relationship module to learn the relationships between ground truth boxes,
and uses relationship scores instead of IoU for redundant box filtering. Learnable NMS [17]
turns NMS into a learnable module, making the entire network learnable and achieving
end-to-end training. Sparse R-CNN [18] replaces dense anchor boxes with a set of sparse
proposal boxes, directly outputting the final detection results without the need for NMS
post-processing. DeFCN and OneNet adjust the one-to-many assignment to a one-to-one
label assignment, successfully eliminating the need for NMS in the post-processing stage
and proposing dynamic matching costs. Zhou et al. [11] added a PSS head to FCOS, which
automatically selects a single positive sample for each instance, achieving end-to-end
object detection without altering the original training method.

2.2. Label Assignment


Object detection based on deep learning trains models by computing the loss between
predicted samples and ground truth labels, enabling detectors to acquire detection ca-
pabilities. For each image, object detectors generate a series of prediction boxes. How
to select appropriate samples to optimize the model has been a key research focus. In
early works, researchers selected positive and negative samples based on the position
of prediction boxes [19] or their IoU with ground truth boxes [20], and multiple positive
samples were assigned to each target, providing rich supervisory signals and accelerating
model convergence, known as one-to-many label assignment. However, relying solely on
the position information of prediction boxes as an assignment strategy is often not optimal.
ATSS [21] proposes a high-performance method for defining positive and negative samples
by calculating thresholds to select them. OTA [22] treats label assignment as an optimal
transport problem and dynamically estimates the number of positive samples. YOLOX
simplifies it to SimOTA. TOOD [23] combines regression and classification information
from predicted boxes to form a cost matrix for selecting positive samples. DETR replaces
one-to-many label assignment with one-to-one label assignment to eliminate redundancy.
Although one-to-one label assignment avoids redundant high-scoring boxes, it also
brings some drawbacks, such as less supervision for each instance. Subsequent research
addresses this issue [24–27]. For example, DN-DETR uses denoising training, adding
random noise to ground truth labels and learning to reconstruct them. MS-DETR employs
parallel decoders for one-to-many supervision. DeFCN uses an auxiliary loss. These methods provide
richer supervisory signals by dual-label assignment without altering model inference, but
they introduce additional computational overhead during the training phase due to the
use of auxiliary branches.

3. Methods
Figure 1 presents MA-YOLOX’s structure and training methodology. While the object
detection network employs one-to-many label assignment (Baseline) training to provide
robust feature representations, this approach generates redundant boxes that typically
require NMS filtering for final detection results. Through the analysis of the post-processing
stage, we propose mixed label assignment (MLA), a novel end-to-end training method that
eliminates NMS requirements. Additionally, we integrate a window feature propagation
block (WFPB) to enhance local feature sharing and boost performance. These innovations
enable the detector to achieve superior detection results without NMS post-processing.

Figure 1. MA-YOLOX network architecture.

3.1. Mixed Label Assignment


3.1.1. Task Assignment
For the series of prediction boxes generated by object detection models, the post-processing
stage first selects predictions with confidence scores above a threshold and then
applies NMS to filter the remaining predictions. Traditional detectors eliminate
numerous low-quality prediction boxes after the first step. However, some high-quality
redundant boxes for the same object remain, which can only be removed through NMS.
If the algorithm can remove all redundant boxes in the first step and obtain the desired
prediction boxes, NMS would be unnecessary.
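To make the two-step pipeline concrete, the following is a minimal sketch of the conventional confidence-then-NMS post-processing described above (our own illustration, not YOLOX's actual implementation); the 0.001 confidence and 0.65 IoU thresholds follow the evaluation settings used later in Table 3:

```python
import torch
from torchvision.ops import nms  # standard IoU-based NMS

def postprocess(boxes, scores, conf_thr=0.001, iou_thr=0.65):
    """Conventional two-step post-processing (single class, for brevity).

    boxes:  (N, 4) predicted boxes in (x1, y1, x2, y2) format
    scores: (N,)   confidence scores (class score * objectness)
    """
    # Step 1: confidence filtering removes the many low-quality boxes,
    # but high-quality *redundant* boxes on the same object survive it.
    keep = scores > conf_thr
    boxes, scores = boxes[keep], scores[keep]

    # Step 2: NMS removes the surviving redundant boxes by IoU overlap.
    # An end-to-end detector must make this step unnecessary.
    kept = nms(boxes, scores, iou_thr)
    return boxes[kept], scores[kept]
```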
Why are there redundant high-quality prediction boxes? We speculate that this is
caused by training under one-to-many label assignment. To verify this hypothesis, we
visualized the confidence heatmaps of the model under one-to-many and one-to-one
matching training methods. The results are shown in Figure 2. It can be observed that
under the one-to-many matching training method, multiple prediction locations for an
object in the image have high confidence. With only confidence-based filtering, selecting the
optimal prediction box becomes challenging. In contrast, with one-to-one matching, each
object corresponds to a single prediction sample, and the confidence of the surrounding
prediction boxes is suppressed. This approach eliminates the need for NMS to filter
redundant boxes, achieving end-to-end detection. DeFCN and DETR demonstrated that
one-to-one label assignment is crucial for end-to-end object detection. This paper explores
which element of one-to-one label assignment plays a decisive role.
The post-processing stage filters the predicted boxes solely based on the confidence
score, regardless of the quality of the predicted boxes. We hypothesize that the scoring
loss in one-to-one label assignment plays a decisive role. To verify whether bounding box
regression under different label assignments affects end-to-end results, we also visualize
confidence heatmaps under mixed label assignment, as shown in Figure 2. Mixed label
assignment maintains a single prediction sample per target, similar to one-to-one assignment.
Therefore, whether one-to-one or one-to-many label assignment is used for bounding box
regression does not affect the elimination of post-processing. Previous work achieved end-to-end
detection results using complete one-to-one label assignment, but detector performance
decreased significantly. Conventional detectors use one-to-many label assignment to provide
sufficient foreground samples during training, leading to powerful feature representations
and rich supervision signals, but also causing redundant prediction boxes. Without NMS,
detector performance drops dramatically. Therefore, we propose a mixed label assignment
(MLA) training approach, as shown in Figure 3. By simultaneously optimizing the model
with both one-to-many and one-to-one label assignments, we could leverage the advan-
tages of one-to-many assignment while avoiding redundant box generation, maintaining
superior detection performance and achieving end-to-end object detection.

Figure 2. Visualization of confidence heatmaps predicted by various methods. The image is sourced
from the VOC2007 test set and contains the examples 'man' and 'horse'. The methods are, in order,
one-to-many matching (Baseline), one-to-one matching (O2O), and the mixed label assignment
(MLA) proposed in this paper. The heatmaps show the confidence scores for predictions at the
"P4, P5" scales. The O2O and MLA methods significantly reduce redundant predictions of the same
object compared to the Baseline.

Figure 3. The end-to-end training method of mixed label assignment (MLA). For input images,
the matching cost is computed from the model's regression, classification, and confidence predictions
to choose positive samples; the best positive sample '1' then optimizes the confidence prediction
head through one-to-one label assignment, while the positive samples '1, 2, 3, 4' optimize the
regression and classification heads through one-to-many matching.
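As a rough sketch of how Figure 3 translates into a training loss (our own simplification, not the released code), the confidence head is supervised only by the single one-to-one positive sample per instance, while classification and regression keep the K one-to-many positives:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def mla_loss(obj_logits, cls_logits, pred_boxes, gt_labels, gt_boxes,
             o2o_idx, o2m_idx):
    """obj_logits: (N,); cls_logits: (N, C); pred_boxes: (N, 4);
    o2o_idx: (G,) one positive index per instance (Equation (4));
    o2m_idx: (G, K) K positive indices per instance (SimOTA)."""
    # One-to-one branch: only the best sample per instance is a confidence
    # positive, so neighboring redundant boxes are suppressed at inference.
    obj_target = torch.zeros_like(obj_logits)
    obj_target[o2o_idx] = 1.0
    loss_conf = F.binary_cross_entropy_with_logits(obj_logits, obj_target)

    # One-to-many branch: all K positives per instance keep the rich
    # supervision for classification and bounding box regression.
    pos = o2m_idx.flatten()
    k = o2m_idx.shape[1]
    tgt_cls = gt_labels.repeat_interleave(k)
    tgt_box = gt_boxes.repeat_interleave(k, dim=0)
    loss_cls = F.cross_entropy(cls_logits[pos], tgt_cls)
    iou = box_iou(pred_boxes[pos], tgt_box).diagonal().clamp(min=1e-7)
    loss_box = -torch.log(iou).mean()  # IoU loss, as in Equation (3)
    return loss_conf + loss_cls + loss_box
```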

3.1.2. Matching Cost


In the model training stage, positive and negative samples need to be selected in order
to optimize the model. They are usually chosen by designing a matching cost that selects
appropriate samples. For instance, Faster R-CNN uses the Intersection over Union (IoU)
between ground truth (GT) boxes and predicted boxes as a matching cost, whereas YOLOv5
compares the distances between GT and predicted box centers. These matching methods
may not be optimal, as they consider only location information and ignore classification
information. If suboptimal predictions are assigned as positive samples, model convergence
may be complicated; OneNet points out that a matching cost without classification
information is one of the factors hindering end-to-end detection. To select the best positive
and negative samples, for the sample i and target j, the matching cost is defined as follows:

C_{i,j} = C_{\text{score}}(i, j) + \lambda \cdot C_{\text{loc}}(i, j)    (1)


The final matching cost is obtained by weighting the location cost and the score cost.
As shown in Equation (2), the score cost is the cross-entropy loss between the product of
the category score and confidence of predicted sample i and the target j. Equation (3) shows
the location cost C_{loc}(i, j), the IoU loss between the predicted box b̂ of sample i and the
ground truth box b. By combining both, we select prediction boxes with both high score
and high IoU as positive samples, avoiding the special case where boxes with high IoU but
low score are taken as positive samples, which would lead to incorrect optimization
objectives. The weight λ balances the roles of regression and classification.

C_{\text{score}}(i, j) = \mathcal{L}(\hat{c}_i \cdot \text{conf}_i, c_j)    (2)

C_{\text{loc}}(i, j) = -\log(\text{IoU}(\hat{b}, b))    (3)


Mixed label assignment includes one-to-one and one-to-many dual-label assignments.
One-to-one label assignment requires selecting only one positive sample for each instance.
As shown in Equation (4), it calculates the cost between predicted sample σ and instance j,
where N and G represent the number of predictions and the number of instances in the
image. For each instance j, the cost is minimized to match the best positive sample σ̂.
One-to-many label assignment uses the same assignment method as the Baseline [28]:
K suboptimal samples are selected for each instance from the cost matrix, where K is
obtained by SimOTA.
\hat{\sigma} = \arg\min_{\sigma \in N} \sum_{j}^{G} \text{Cost}(j, \sigma)    (4)
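The sketch below illustrates Equations (1)-(4) (a hedged reimplementation; the exact weighting and tie-breaking of the official code may differ). It builds the N × G cost matrix from the score and location costs and then selects the one-to-one and one-to-many positives, with λ = 5 following the best setting found in Table 4:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def matching_cost(cls_scores, conf, pred_boxes, gt_labels, gt_boxes, lam=5.0):
    """cls_scores: (N, C) sigmoid class scores; conf: (N,) confidence;
    pred_boxes: (N, 4); gt_labels: (G,); gt_boxes: (G, 4)."""
    n, g = conf.shape[0], gt_labels.shape[0]
    # Score cost (Equation (2)): cross-entropy between the joint score
    # (class score * confidence) of sample i and the target class of j.
    joint = cls_scores * conf[:, None]                        # (N, C)
    tgt = F.one_hot(gt_labels, cls_scores.shape[1]).float()   # (G, C)
    c_score = F.binary_cross_entropy(
        joint[:, None, :].expand(n, g, -1),
        tgt[None, :, :].expand(n, g, -1), reduction="none").sum(-1)
    # Location cost (Equation (3)): -log(IoU) of every (pred, GT) pair.
    c_loc = -torch.log(box_iou(pred_boxes, gt_boxes).clamp(min=1e-7))
    return c_score + lam * c_loc                              # Equation (1)

# One-to-one assignment (Equation (4)): the minimum-cost sample per
# instance (a bipartite matcher would additionally resolve collisions).
# cost = matching_cost(...)
# o2o_idx = cost.argmin(dim=0)                                # (G,)
# One-to-many: the K lowest-cost samples per instance, K from SimOTA.
# o2m_idx = cost.topk(k, dim=0, largest=False).indices.T      # (G, K)
```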

3.2. Window Feature Propagation Block


In addition to mixed label assignment, this paper designs an efficient and embeddable
architecture to achieve more competitive end-to-end detection. DETR achieves strong
competitive performance with Transformers; however, self-attention leads to an exponential
increase in computation as the input size grows. Using self-attention in the backbone
feature extraction network often faces limitations in computational resources. This paper
introduces a new module, the window feature propagation block (WFPB), to replace
the Attention.
For images, convolution operations have locality and translation invariance, in contrast
to Transformers. Convolutional neural networks, due to their inherent inductive biases [29],
do not rely on global information; when they capture partial features of an object, they can
infer the characteristics of adjacent areas using prior knowledge, aiding in recognizing or
locating targets through local features. MAE [30] reconstructs heavily masked images using
only a small amount of positional information, inferring global information from local cues.
The Swin Transformer [31] limits self-attention computation to non-overlapping local windows
to alleviate the computational pressure on large-scale feature maps, achieving outstanding
results and demonstrating the effectiveness of local features.
This article proposes feature propagation in local areas as a substitute for attention
operations. The structure of the window feature propagation block (WFPB) is shown in
Figure 4. First, for the feature map f, a convolution with a kernel size of 1 remaps the
channels and reduces the computation of the subsequent feature propagation. Feature
propagation then combines convolution and pooling: when a K × K convolution with a
stride of 1 extracts features from the feature map in a sliding window, adjacent window
areas partially overlap during
the sampling process. Shared features are thus extracted, achieving feature sharing: after
convolution sampling, each window implicitly contains features from the surrounding
windows. At the same time, a large-kernel pooling operation samples important features
from the local area. By summing the two branches, each window captures the important
features within its local neighborhood, realizing feature propagation. The final step restores
the channels using a 1 × 1 convolution and employs residual connections [32] to enhance
stability. The K × K convolution kernel size serves as the window size.

Figure 4. (a) The basic composition of the YOLOX backbone’s dark module; (b) the window feature
propagation block (WFPB).

The window feature propagation block is incorporated in the feature extraction phase of
the network. The backbone of YOLOX is composed of the basic dark module [33]. The
WFPB is applied after the convolutional downsampling of the dark module, performing
feature propagation and cropping on the sampled feature map to achieve the best results.
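A minimal PyTorch sketch of our reading of Figure 4b is given below; the channel reduction ratio and the use of max pooling are assumptions on our part, as the text only specifies the 1 × 1 convolutions, the K × K stride-1 convolution and pooling branches, their summation, and the residual connection:

```python
import torch.nn as nn

class WFPB(nn.Module):
    """Window feature propagation block (sketch of Figure 4b)."""

    def __init__(self, channels, k=3, reduction=2):
        super().__init__()
        hidden = channels // reduction
        self.reduce = nn.Conv2d(channels, hidden, 1)            # cut channels
        self.conv = nn.Conv2d(hidden, hidden, k, 1, k // 2)     # K x K, stride 1
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)   # local key features
        self.restore = nn.Conv2d(hidden, channels, 1)           # restore channels

    def forward(self, x):
        f = self.reduce(x)
        # Overlapping K x K windows let each position implicitly share
        # features with its neighbors; pooling adds the strongest local
        # responses, and their sum propagates features within the window.
        f = self.conv(f) + self.pool(f)
        return x + self.restore(f)  # residual connection [32]
```

With K = 3, as chosen in Section 4.5.2, such a block adds only a small parameter overhead while enabling the local feature sharing described above.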

4. Experiments
4.1. Applied Datasets
The public datasets PASCAL VOC [34] and DUO dataset [35] were chosen for model
evaluation. The PASCAL Visual Object Classes (VOC) is a well-known computer vision
challenge, first held in 2005, which includes an object detection task. It features a total
of 20 detection categories: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle,
boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, and moni-
tor. These categories encompass commonly seen objects in daily life, including people,
animals, household items, and transportation vehicles. It provides a wealth of resources
for the development of object detection. This paper uses the 2012 training and validation
set, consisting of 11,540 images for training, and tests the results on the 2007 test set of
4952 images.
In contrast to VOC, the DUO dataset is an underwater object detection dataset. The
DUO dataset was proposed by the Underwater Robot Professional Contest in 2021, aimed
at robot picking based on underwater images. It contains 6671 images in the training
set and 1111 images in the testing set. The dataset includes four categories of underwater
targets: holothurians, echinoderms, scallops, and starfish. These two datasets cover normal
terrestrial scenes and underwater scenes, respectively. Achieving excellent results on both
diverse datasets can better reflect the superiority of the method.

4.2. Experimental Settings


We implemented our model based on the YOLOX framework, with Python 3.9, Pytorch
2.0.1, and CUDA Toolkit 12.4. The model was trained and tested on two NVIDIA RTX3090
GPUs, and Latency was calculated on an RTX4090 GPU with TensorRT. We used SGD as
our optimizer with an initial learning rate of 0.01. All our models were trained from scratch for
500 epochs with a batch size of 16, with other details following those in [28]. In particular, to verify
the effectiveness of the end-to-end training method, we evaluated our model without using
NMS (W/O NMS).

4.3. Evaluation Metrics


This paper uses mAP and Latency to measure model accuracy and speed. mAP
represents the average accuracy of all classes, as described in Equation (7). Latency refers
to the time taken by the detector from receiving the image to producing the detection result.
The calculation formula is as follows:
P = \frac{TP}{TP + FP}    (5)

R = \frac{TP}{TP + FN}    (6)

\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P(R) \, dR    (7)

\text{Latency} = \text{Latency}_f + T_{\text{postprocess}}    (8)


TP (True Positive) denotes correctly predicted positive samples, FP (False Positive)
denotes negative samples incorrectly predicted as positive, and FN (False Negative) denotes
positive samples incorrectly predicted as negative. Latency_f is the time required for the
model's forward propagation; adding the post-processing time gives the total detection
Latency. The combination of accuracy and speed well reflects the overall performance of
a detector.
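As a small worked sketch of Equations (5)-(7) for a single class (our own illustration), detections are sorted by confidence, precision and recall are accumulated, and precision is integrated over recall; mAP then averages this value over the N classes:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class: sort by confidence, accumulate P and R,
    and integrate P over R (one term of the sum in Equation (7))."""
    order = np.argsort(-scores)              # highest confidence first
    tp = np.cumsum(is_tp[order])             # running true positives
    fp = np.cumsum(~is_tp[order])            # running false positives
    precision = tp / (tp + fp)               # Equation (5)
    recall = tp / num_gt                     # Equation (6): TP + FN = num_gt
    return np.trapz(precision, recall)       # integrate P over R

# Example: 4 detections, 3 ground truth objects.
ap = average_precision(np.array([0.9, 0.8, 0.7, 0.6]),
                       np.array([True, True, False, True]), num_gt=3)
# mAP is the mean of this value over all N classes.
```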

4.4. Comparisons to State-of-the-Art


To prove the effectiveness and efficiency of the proposed method, several represen-
tative detectors were adopted for comparison on benchmark datasets, including Faster
R-CNN [20], Cascade R-CNN [36], and RepPoints [37] for two-stage detection; FCOS [12],
ATSS [21], and GFL [38] for one-stage detection; and YOLOv5 [19] and YOLOv7 [39] for
real-time detection. All methods used the original settings and were trained from scratch
with a resolution of 640 × 640.
Table 1 shows the experimental results of different methods on the DUO dataset.
Using the COCO evaluation metrics [40], in addition to the conventional Average Precision
(AP) and AP at IoU threshold 0.5 (AP50 ), the evaluation metrics for the DUO dataset also
included detection results for small, medium, and large objects (APS , AP M , AP L ), providing
a more comprehensive comparison of the model’s performance in various aspects. It can be
observed that our method significantly outperforms other benchmark networks. Compared
to two-stage and one-stage networks, MA-YOLOX achieved better results with 66.0 mAP
using fewer parameters; this is also attributed to YOLOX's excellent design. Compared
to YOLOX, our detector improved by 1.7 mAP, further enhancing detection performance
and significantly outperforming other real-time detectors. Among the detection results for
large, medium, and small objects, MA-YOLOX showed greater improvement on the more
challenging medium and small targets, with a 3.3 mAP increase for small objects and a
1.9 mAP increase for medium objects, significantly enhancing the model's ability to detect
small objects.

Table 1. Comparison of different methods on the number of parameters and the accuracy on the
DUO dataset; w/o NMS means without NMS during the validation (val) stage.

Method            Param. (M)   mAP    mAP50   mAP75   mAP_S   mAP_M   mAP_L


Faster R-CNN 41.14 54.8 75.9 63.1 53.0 56.2 53.8
Cascade R-CNN 68.94 55.6 75.5 63.8 44.9 57.4 54.4
RepPoints 36.60 56.0 80.2 63.1 40.8 58.5 53.7
RetinaNet 36.17 49.3 70.3 55.4 36.5 51.9 47.6
FCOS 31.84 53.0 77.1 59.9 39.7 55.6 50.5
ATSS 31.89 58.2 80.1 66.5 43.9 60.6 55.9
GFL 32.04 58.6 79.3 66.7 46.5 61.6 55.6
YOLOV5 7.07 63.9 84.5 71.8 43.7 65.4 63.0
YOLOX 8.94 64.3 83.7 72.2 49.7 66.3 62.6
YOLOV7 6.02 62.3 83.5 70.5 46.6 63.7 61.7
Ours (w/o NMS) 10.07 66.0 85.1 73.1 53.0 68.2 63.9
The best results are in bold.

We also compared the results of various real-time detectors YOLOv5, YOLOv7, and
YOLOV8 [41] on Pascal VOC, as shown in Table 2. VOC was derived from various images
in natural scenes and included many common categories from daily life, making the results
more reflective of the detector’s performance in real-world scenarios. The best result is
displayed in bold in the table; the detection results still outperformed other detectors.
During the evaluation of the detector, NMS was not used, demonstrating the effectiveness
of the mixed label assignment. Additionally, Table 2 also shows the inference speeds of
various real-time detectors. To present the results more intuitively, Figure 5 visualizes
the detection results in terms of speed and accuracy. Although MA-YOLOX's parameter
count and computational complexity were not the lowest, our method demonstrated a
significant advantage in inference speed, achieving a detection speed of 2.5 ms per image.
This speed improvement is attributed to the removal of NMS in the post-processing stage,
which eliminates the additional computational overhead introduced by NMS. As a result,
MA-YOLOX is faster than other real-time detectors. Specifically, compared to YOLOX,
MA-YOLOX removes NMS and thereby reduces post-processing time by 0.5 ms, nearly
halving it. MA-YOLOX is also faster than the YOLOV5, YOLOV7, and
YOLOV8 real-time detectors. This result further proves our core idea: removing NMS
is not only theoretically feasible, but it also significantly optimizes inference speed in
practical applications.

Figure 5. Visualization of model detection results for the VOC dataset.



Table 2. Comparisons of different methods on the VOC dataset. Latency_f denotes the latency of the
model's forward pass without post-processing.

Method     Param. (M)   GFLOPS   mAP50   Latency (ms)   Latency_f (ms)


YOLOV5 7.11 16.3 73.4 3.4 2.3
YOLOX 8.95 26.8 74.6 2.9 1.8
YOLOV7 6.06 13.2 70.9 2.6 1.5
YOLOV8 11.14 28.7 75.6 2.8 1.7
Ours (w/o NMS) 10.08 29.5 75.9 2.5 1.9
The best results are in bold.

To illustrate the importance of NMS in the post-processing stage, we visualize the
detection results of YOLOX with and without NMS, together with ours, as shown in
Figure 6. YOLOX
and our detector can effectively detect objects in images; however, without NMS, YOLOX
generates multiple predicted boxes for the same object, which affects detection performance.
Especially in dense scenes, a large number of redundant boxes can obscure the original
information in the image. For conventional detectors, NMS is therefore particularly important.
MA-YOLOX achieves the same results as the NMS-dependent Baseline, even without using NMS.
Extensive experiments have demonstrated the effectiveness and feasibility of the method.

Figure 6. Visualization of detection results with or without NMS by YOLOX and MA-YOLOX.

4.5. Ablation Study on VOC


Table 3 shows the ablation results based on YOLOX-S, and Figure 7 provides a more
intuitive demonstration of the detector’s performance with and without NMS. It can be
observed that the performance of conventional detectors significantly declined without
NMS. Implementing end-to-end detection with either one-to-one label assignment or the
mixed label assignment allowed the model to operate without NMS in the post-processing
stage, making the inference speed 0.5 ms faster than the Baseline. However, the one-to-one
label assignment led to a significant decline in the model's detection results, from 51.0
to 44.9 mAP. In contrast, the mixed label assignment improved performance by 5.7 mAP
over the one-to-one label assignment, reaching 50.6 mAP. Additionally, the window feature
propagation block improved performance by about 2.0 mAP on top of the mixed
label assignment, exceeding the Baseline. The increase in parameters and computational
cost was only around 10%. Our method achieved 52.6 mAP and a Latency of 2.5 ms
without NMS, improving the detection results by 1.6 mAP and reducing the inference time
by 0.4 ms, outperforming the Baseline in both performance and speed. Moreover, without
NMS, there was a slight improvement in detection results, indicating that NMS removed
some accurate predicted boxes; this effectively proves the feasibility and effectiveness of
the approach.

Table 3. Ablation studies on VOC; models are evaluated with conf = 0.001 and IoU threshold = 0.65.

Model      MLA   WFPB   End-to-End   Param. (M)   GFLOPS   mAP    mAP (w/o NMS)   Latency (ms)
YOLOX-S                               8.95         26.80    51.0   18.6            2.9
YOLOX-S                 ✓             8.95         26.80    44.9   44.9            2.4
YOLOX-S    ✓            ✓             8.95         26.80    50.6   50.7            2.4
YOLOX-S    ✓     ✓      ✓             10.08        29.57    52.6   52.7            2.5
The best results are in bold.

Figure 7. Visualization of ablation experiment results.

4.5.1. Analyses for Mixed Label Assignment


The training strategy for mixed label assignment involves both one-to-one and one-
to-many dual-label assignment, where one key aspect is how to allocate positive and
negative samples. Choosing the right positive and negative samples is crucial for model
training. As shown in Equation (1), the hyperparameter λ in the matching cost controls the
importance ratio between classification and regression. Different values of λ lead to the
model optimizing different positive samples during the training, which impacts the model’s
performance. Table 4 presents the model training results under various hyperparameters,
with λ = 5 achieving the best detection results. When λ is too high or too low, model
training is affected, with λ = 1 giving the worst results. This indicates that the assignment
process should give more weight to the location of the predicted boxes. However, if λ is
too high, the importance of classification is ignored, which negatively affects the model's
performance.

Table 4. The results of matching cost training under different hyperparameters λ.

λ mAP mAP50
1 50.1 73.1
3 50.2 73.7
4 50.4 73.5
5 50.6 73.7
7 50.4 73.3
The best results are in bold.

4.5.2. Analyses for Window Feature Propagation Block


The structure of the window feature propagation block is shown in Figure 4. The
parameter K determines the window size for feature propagation, and we tested the
model's performance for K = 1, 3, 5, 7, as shown in Table 5. The performance for
K = 3, 5, 7 was significantly better than for K = 1, demonstrating the feasibility of the
window feature propagation block. However, as the window size K increased, the
parameters and computational cost also grew significantly while model performance
improved only slowly, indicating that redundant features are captured as the window
expands. The best result was obtained at K = 7, but the parameter increase was too large,
leading to the final decision of setting the window size K to 3.

Table 5. The results of the window feature propagation block at different K sizes on VOC.

Model      K Size   mAP
YOLOX-S    1×1      52.56
YOLOX-S    3×3      53.70
YOLOX-S    5×5      54.06
YOLOX-S    7×7      54.31
The best results are in bold.

Additionally, to demonstrate the superiority of the window feature propagation block,
we compared the detection results of WFPB with those of other attention modules,
CBAM [42], CA [43], and PSA [44], under the same conditions, as shown in Table 6.
CBAM (Convolutional Block Attention Module) and CA (Coordinate Attention) both implement attention
mechanisms through convolution in spatial and channel dimensions. In contrast, PSA
(partial self-attention) is based on self-attention applied to image features. WFPB achieved
53.7 mAP, clearly outperforming the other modules.

Table 6. Comparison of different Attention methods.

Method mAP mAP50


Baseline 51.0 74.6
WFPB 53.7 77.4
CBAM 50.1 75.2
CA 52.0 76.2
PSA 52.7 76.9
The best results are in bold.

4.6. Generalization Experiments


Mixed label assignment is a completely new end-to-end training method that can
be effectively transferred to other advanced models. We chose to conduct generalization
experiments on the YOLOV8 model, applying mixed label assignment. The results shown
in Table 7 demonstrate the successful removal of the reliance on NMS in the post-processing
phase of traditional detectors. The original end-to-end training method (O2O) suffered
from significant performance degradation due to the lack of sufficient foreground samples.

However, mixed label assignment effectively alleviated this issue. The generalization
experiments further highlight the effectiveness of this method.

Table 7. Generalization experiments.

Model     Assignment   mAP (w/o NMS)   mAP50 (w/o NMS)
YOLOV8    Baseline     22.8            28.7
YOLOV8    O2O          44.4            66.3
YOLOV8    MLA          50.8            71.6

5. Conclusions
This paper proposes a novel end-to-end training method called mixed label assign-
ment, which avoids the performance degradation caused by one-to-one label assignment
in traditional methods while preserving the advantages of one-to-many label assignment.
This approach does not require additional branches or training overhead, significantly im-
proving the performance of end-to-end object detection. Furthermore, the window feature
propagation module effectively shares features in local regions by leveraging inductive
bias, and has achieved remarkable results. Our experiments demonstrated the importance
of local region features in image-based detection tasks. The detector based on our method
outperformed the Baseline in both detection results and inference speed. We hope that the
design introduced in this work will contribute to the development of better end-to-end
training methods for object detection.

Author Contributions: Conceptualization, J.C. and C.S.; Methodology, J.C.; Software, J.C.; Validation,
J.C.; Formal Analysis, J.C.; Investigation, J.C.; Resources, J.C.; Data Curation, J.C.; Writing—Original
Draft Preparation, J.C.; Writing—Review and Editing, J.C., C.S. and Z.S.; Visualization, J.C.; Supervi-
sion, Z.S.; Project Administration, Z.S.; Funding Acquisition, Z.S. All authors have read and agreed
to the published version of the manuscript.
Funding: This research was funded by Jiangsu Provincial Key Research and Development Pro-
gram (No. BE2022136), High-tech Ship Research Projects (No. CBG4N21-4-3), Key Research and
Development Program of Zhenjiang City (No. GY2023019).
Data Availability Statement: The PASCAL VOC Datasets were from http://host.robots.ox.ac.uk/
pascal/VOC/voc2012/index.html (accessed on 25 May 2024) and the DUO dataset can be found at
https://github.com/chongweiliu/DUO (accessed on 25 May 2024).
Conflicts of Interest: The authors declare that they have no known competing financial interests or
personal relationships that could have appeared to influence the work reported in this paper.

References
1. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587.
2. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083.
3. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
4. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of
the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings,
Part I 14; pp. 21–37.
5. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
6. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
7. Lin, T. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002.
8. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In
Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
9. Wang, J.; Song, L.; Li, Z.; Sun, H.; Sun, J.; Zheng, N. End-to-end object detection with fully convolutional network. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15849–15858.
10. Sun, P.; Jiang, Y.; Xie, E.; Shao, W.; Yuan, Z.; Wang, C.; Luo, P. What makes for end-to-end object detection? In Proceedings of the
International Conference on Machine Learning, Online, 18–24 July 2021; pp. 9934–9944.
11. Zhou, Q.; Yu, C.; Shen, C.; Wang, Z.; Li, H. Object Detection Made Simpler by Eliminating Heuristic NMS. arXiv 2021,
arXiv:2101.11782. [CrossRef]
12. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355.
13. Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach,
CA, USA, 4–9 December 2017.
14. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv
2020, arXiv:2010.04159.
15. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR.
arXiv 2022, arXiv:2201.12329.
16. Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3588–3597.
17. Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4507–4515.
18. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C. Sparse R-CNN: End-to-end object
detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463.
19. YOLOv5. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 25 May 2024).
20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv
2015, arXiv:1506.01497. [CrossRef] [PubMed]
21. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training
sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
13–19 June 2020; pp. 9759–9768.
22. Ge, Z.; Liu, S.; Li, Z.; Yoshie, O.; Sun, J. OTA: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 303–312.
23. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned one-stage object detection. In Proceedings of the 2021
IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499.
24. Jia, D.; Yuan, Y.; He, H.; Wu, X.; Yu, H.; Lin, W.; Sun, L.; Zhang, C.; Hu, H. DETRs with hybrid matching. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19702–19712.
25. Chen, Q.; Chen, X.; Wang, J.; Zhang, S.; Yao, K.; Feng, H.; Han, J.; Ding, E.; Zeng, G.; Wang, J. Group DETR: Fast DETR training with
group-wise one-to-many assignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris,
France, 4–6 October 2023; pp. 6633–6642.
26. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022;
pp. 13619–13627.
27. Zhao, C.; Sun, Y.; Wang, W.; Chen, Q.; Ding, E.; Yang, Y.; Wang, J. MS-DETR: Efficient DETR Training with Mixed Supervision.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024;
pp. 17027–17036.
28. Ge, Z. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
29. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
30. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
31. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted
windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October
2021; pp. 10012–10022.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Redmon, J. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
34. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J.
Comput. Vis. 2010, 88, 303–338. [CrossRef]
35. Liu, C.; Li, H.; Wang, S.; Zhu, M.; Wang, D.; Fan, X.; Wang, Z. A dataset and benchmark of underwater object detection for robot
picking. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China,
5–9 July 2021; pp. 1–6.
36. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
37. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. RepPoints: Point set representation for object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666.
38. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed
bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012.
39. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada,
17–24 June 2023; pp. 7464–7475.
40. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in
context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September
2014; Proceedings, Part V 13; pp. 740–755.
41. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed
on 25 May 2024).
42. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
43. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
44. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024,
arXiv:2405.14458.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
